写在前面

很多文本处理的问题都可以变成QA问题：

机器翻译machine translation： (What is the translation into French?)
命名实体识别named entity recognition (NER) ：(What are the named entity tags in this sentence?)
词性识别part-of-speech tagging (POS) ：(What are the part-of-speech tags?)
文本分类classification problems like sentiment analysis： (What is the sentiment?)
指代问题coreference resolution： (Who does ”their” refer to?)

这篇主要介绍QA问答系统中的动态记忆网络模型(Dynamic Memory Network)，它是由4部分构成的，包括输入模块、问题模块、情景记忆模块、输出模块。

数据集

1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary? bathroom 1
4 Daniel went back to the hallway.
5 Sandra moved to the garden.
6 Where is Daniel? hallway 4
7 John moved to the office.
8 Sandra journeyed to the bathroom.
9 Where is Daniel? hallway 4
10 Mary moved to the hallway.
11 Daniel travelled to the office.
12 Where is Daniel? office 11
13 John went back to the garden.
14 John moved to the bedroom.
15 Where is Sandra? bathroom 8
1 Sandra travelled to the office.
2 Sandra went to the bathroom.
3 Where is Sandra? bathroom 2

四个模块

输入模块

对输入的句子进行GRU编码，将结果给到情景记忆模块。

如果输入为多个句子的话，那就在每个句子记录下结束时刻的位置，并在GRU编码后输出该位置对应的$ c_t$作为句子的向量表达：

$c_t = GRU(w_t,c_{t-1})$

encoded_facts = []
for facts,facts_mask in zip(allfacts,allfacts_mask):       
	facts_embeds = self.embedding(facts)
	facts_embeds = self.dropout(facts_embeds)
	hidden = self.init_hidden(nfacts)
	output,hidden = self.input_gru(facts_embeds,hidden)
	real_hiddens = []
	for i in range(len(output)):
		real_len = facts_mask[i].data.tolist().count(0)
		final_facts = output[i][real_len-1]
		real_hiddens.append(final_facts)

		hiddens = torch.cat(real_hiddens).view(nfacts, -1).unsqueeze(0)
		encoded_facts.append(hiddens)

encoded_facts = torch.cat(encoded_facts) #bsize,nfacts,hidden

问题模块

对输入的问题进行GRU编码，将结果给到情景记忆模块和回答模块。

同输入模块，将GRU编码后的结束位置的$q_t$作为句子的向量表达：

$q_t = GRU(w_q,q_{t-1})$

encoded_questions = []
ques_embeds = self.embedding(questions)
ques_embeds = self.dropout(ques_embeds)
hidden = self.init_hidden(bsize)
output,hidden = self.question_gru(ques_embeds,hidden) #output:(bsize,qlen,hidden)
for i in range(len(output)):
	real_len = questions_mask[i].data.tolist().count(0)
	final_ques = output[i][real_len-1]
	encoded_questions.append(final_ques)
        
encoded_questions = torch.cat(encoded_questions).view(bsize,-1) #bsize,hidden

情景记忆模块

情景记忆模块输入为$h_t$和$q_t$，模块会生成一个记忆memory，初始时$m = q_t$，然后根据每一次的迭代更新$m^i = GRU(e,m^{i-1})$

注意力机制

保留比例门g充当着attention的作用。特征函数z(c,m,q)提取了9个特征：

将特征函数输入到两层前向网络（分别用tanh和sigmoid激活）：

记忆更新机制

e先随机初始化，在每个句块的遍历时，e会结合句子和旧e去生成新的e的信息：

Need for Multiple Episodes: The iterative nature of this
module allows it to attend to different inputs during each
pass. It also allows for a type of transitive inference, since
the first pass may uncover the need to retrieve additional
facts. For instance, in the example in Figure, we are asked
Where is the football? In the first iteration, the model ought
attend to sentence 7 (John put down the football.), as the
question asks about the football. Only once the model sees
that John is relevant can it reason that the second iteration
should retrieve where John was. Similarly, a second pass
may help for sentiment analysis as we show in the experiments section below.

memory = encoded_questions
for i in range(n_episode):
	e = self.init_hidden(bsize).squeeze(0) 
	encoded_facts_t = encoded_facts.transpose(0, 1) 
	for t in range(nfacts):
		bfact = encoded_facts_t[t] 
		f1 = bfact * encoded_questions 
		f2 = bfact * memory
		f3 = torch.abs(bfact - encoded_questions)
		f4 = torch.abs(bfact - memory)
		#这里只选四个特征
		z = torch.cat([f1, f2, f3, f4], dim=1)
		g = self.gate(z)
		e = g*self.attention_grucell(bfact,e) +(1-g)*e 
	memory =  self.memory_grucell(e,memory) #bsize，hidden

回答模块

回答模块结合memory和question，来生成对问题的答案。也是通过GRU来生成答案的。

$a_0 = m^{T_M}（最后一个memory） \\\\ y_t = softmax(W^a a_t) \\\\ a_t = GRU([y_{t-1},q],a_{t-1})$

有点类似解码器吧。最后注意，这里必须加上一个seqbegin作为回答起始的标志。

answer_hidden = memory
seqbegin = get_variable(torch.LongTensor([self.seqbegin_id]*bsize)) #(bsize,1)
last_word = self.embedding(seqbegin) #(bsize,embed)
preds = []
for i in range(alen):
    inputs = torch.cat([last_word, encoded_questions], dim=1) #(bsize,embed+hidden)
    a = self.answer_grucell(inputs,answer_hidden)
    probs = self.answer_fc(a)

    probs = F.log_softmax(probs.float()) #bsize,vocab_size
    _, indics = torch.max(probs, 1)
    last_word = self.embedding(indics)
    preds.append(probs.view(bsize, 1, -1)) 

 preds = torch.cat(preds, dim=1)

其他

代码参考：https://github.com/plmsmile/NLP-Demos/tree/master/question-answer-DMN

不过最后没怎么用到spt(support_sentence_id)，如果可以换种方式加进去训练应该会有其他效果。

def pad_batch_data(batch_data):
    """
    batch_data = [facts,question,answer,spt]
    return [all_facts, all_facts_mask, questions, questions_mask, answers]
    """
    batch_data = [list(i) for i in zip(*batch_data)]
    
    all_facts, questions, answers,spt = batch_data[0],batch_data[1],batch_data[2],batch_data[3]

    n_facts = max([len(facts) for facts in all_facts])
    flen = max([len(fact) for facts in all_facts for fact in facts])
    qlen = max([len(q) for q in questions])
    alen = max([len(a) for a in answers])
    
    all_facts = [[vectorize(j) for j in i] for i in all_facts ]
    questions = [vectorize(i) for i in questions]
    answers = [vectorize(i) for i in answers]
    
    all_facts_mask = []
    
    for i in range(batch_size):
        #fact pad mask
        facts = all_facts[i]
        for j in range(len(facts)):
            t = flen - len(facts[j])
            if t > 0:
                all_facts[i][j] = facts[j] + [pad_id] * t #pad

        while (len(facts) < n_facts):
            all_facts[i].append([pad_id] * flen)
            
        mask = [tuple(map(lambda v: v == pad_id, fact)) for fact in all_facts[i]]
        all_facts_mask.append(mask)
    
        #question pad mask
        q = questions[i]
        if len(q)<qlen:
            questions[i] =q + (qlen-len(q))*[pad_id]

        #answers
        a = answers[i]
        if len(a)<alen:
            answers[i] = a+(alen-len(a))*[pad_id]

    questions_mask = [(tuple(map(lambda v: v == pad_id, q))) for q in questions]
    all_facts = get_variable(torch.LongTensor(all_facts))
    all_facts_mask = get_variable(torch.ByteTensor(all_facts_mask))
    answers = get_variable(torch.LongTensor(answers))
    questions = torch.LongTensor(questions)
    questions_mask = torch.ByteTensor(questions_mask)
    questions, questions_mask = get_variable(questions), get_variable(questions_mask)
    return all_facts, all_facts_mask, questions, questions_mask, answers

参考链接：

[1]. 使用Dynamic Memory Network实现一个简单QA

[2]. CS224n自然语言处理与深度学习 Lecture Notes Eight