A Neural Probabilistic Language Model

Preface

This post covers Yoshua Bengio's 2003 paper, the first work to model language with a neural network. The model also learns vector representations of words, and it played an important role in the development of NLP.

The model architecture is shown in the figure below:

[Figure: NNLM model architecture]

As usual, we first go through the abstract in detail, then analyze the paper, and finally implement it in code.

Abstract

A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.

Keywords

Statistical language model, artificial neural network, distributed representation, curse of dimensionality

Analysis

The model's main task is to train a language model that replaces the n-gram model.

The main architecture is a three-layer neural network, shown in the figure below:

where:

  • $C(i)$: the word vector of word $w$, with $i$ the index of $w$ in the vocabulary
  • $C$: the word-embedding matrix

[Figure: NNLM model architecture (input, hidden, and output layers)]

The task is: given the preceding $n-1$ words $w_{t-n+1}, \dots, w_{t-1}$ as input, predict the next word $w_t$.

Computation flow:

  1. Convert the indices of the $n-1$ input words (indices taken directly from the vocabulary) into word vectors, then concatenate these $n-1$ vectors into a single $(n-1)\times m$-dimensional vector ($m$ being the embedding dimension), denoted $x$.
  2. Feed $x$ into the hidden layer, with tanh as the activation function.
  3. Feed the result into the output layer. Since this is a language model that must predict the next word from the previous $n-1$ words, the output layer is a multi-way classifier, and softmax turns its scores into a distribution over the vocabulary; the full forward computation is written out below.
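For reference, the full forward pass from the paper, where $C(w)$ is the embedding of word $w$ and $W$ is the optional direct connection from the input to the output layer:

$$x = \left(C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n+1})\right)$$

$$y = b + Wx + U \tanh(d + Hx)$$

$$P(w_t = i \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{e^{y_i}}{\sum_j e^{y_j}}$$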

Most of the model's computation is concentrated in the last layer: the vocabulary is usually large and the conditional probability of every word has to be computed, so this layer is the computational bottleneck of the whole model.

Softmax is an expensive way to do this: it computes a score for every word in the vocabulary, each requiring an exponential (exponentials are approximated by series expansions on hardware and are relatively costly), and then normalizes over all of them; a rough sketch of this cost follows below. Much later work targets exactly this bottleneck, for example hierarchical softmax and tree-structured softmax variants.
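As a rough illustration (not from the paper), a minimal sketch of what a full softmax over the vocabulary costs: every prediction touches all $|V|$ output scores. The vocabulary size here is an assumed value for illustration only.

import torch

V = 50_000     # assumed vocabulary size, for illustration only
hidden = 128

logits = torch.randn(1, V)            # one output score per vocabulary word
probs = torch.softmax(logits, dim=1)  # exponentiates and normalizes over all V entries

# The output layer alone costs O(hidden * V) multiplications per predicted word;
# hierarchical softmax reduces this to roughly O(hidden * log2(V)).
print(probs.shape, probs.sum())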

Code Implementation

Preprocess the 《天龙八部》 (Demi-Gods and Semi-Devils) dataset

import re

import jieba

vocab = "vocab.txt"

def preprocess(file_in, file_out):
    vocab_temp = set()
    punc = "[\u3002|\uff1f|\uff01|\uff0c|\u3001|\uff1b|\uff1a|\u201c|\u201d|\u2018|\u2019|\uff08|\uff09|\u300a|\u300b|\u3008|\u3009|\u3010|\u3011|\u300e|\u300f|\u300c|\u300d|\ufe43|\ufe44|\u3014|\u3015|\u2026|\u2014|\uff5e|\ufe4f|\uffe5]"
    with open(file_out, 'w', encoding='utf-8') as fout:
        with open(file_in, 'r', encoding='utf-8') as fin:
            for i in fin:
                line = re.sub("[0-9]", "#", i.strip("\n"))  # replace digits with '#'
                line = re.sub(punc, " ", line)  # replace punctuation with spaces; spaces act as segmentation boundaries for jieba
                temp_line = []
                for j in jieba.cut(line):
                    if j and j != " ":
                        vocab_temp.add(j)
                        temp_line.append(j)
                fout.write(" ".join(temp_line) + "\n")

    with open(vocab, 'w', encoding='utf-8') as vocab_file:
        for word in vocab_temp:
            vocab_file.write(word + "\n")
    print("vocab size is %d" % len(vocab_temp))
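A hypothetical invocation (the raw-text filename is an assumption; only the output name "tlbb_post.txt" is referenced later in this post):

preprocess("tlbb_raw.txt", "tlbb_post.txt")  # assumed input filename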

Training

Import the required packages

import torch as t
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

from collections import Counter
from collections import OrderedDict

Set the hyperparameters

USE_CUDA = t.cuda.is_available()

NUM_EPOCHS = 1000
BATCH_SIZE = 2          # the batch size
LEARNING_RATE = 0.001   # the initial learning rate
EMBEDDING_SIZE = 300
N_GRAM = 2
HIDDEN_UNIT = 128
UNK = "<unk>"           # avoid out-of-vocabulary index errors
H = N_GRAM * EMBEDDING_SIZE   # size of the concatenated context vector
U = HIDDEN_UNIT

Convert the data into n-gram form (here a 2-word context, i.e. trigrams of (context, target))

with open("tlbb_post.txt", "r", encoding='utf-8') as fin:
test_sentence = [j for i in fin.readlines() for j in i.strip("\n").split(" ")][:10000]
tokens = test_sentence #['段誉', '忽道', '这么', '高', '跳下来', '可不', '摔坏', '了', '么', '你']
trigram = [((test_sentence[i],test_sentence[i+1]),test_sentence[i+2]) for i in range(len(tokens)-N_GRAM)]
#[(('段誉', '忽道'), '这么'), (('忽道', '这么'), '高'), (('这么', '高'), '跳下来'), (('高', '跳下来'), '可不'), (('跳下来', '可不'), '摔坏'), (('可不', '摔坏'), '了'), (('摔坏', '了'), '么'), (('了', '么'), '你')]
words = dict(Counter(tokens).most_common())
#{'的': 16, '了': 15, '龚光杰': 13, '你': 11, '道': 11, '我': 10, '左子穆': 10}
words = sorted(iter(words.keys()), key=words.get, reverse=False)
words += UNK
word2id = { k:i for i,k in enumerate(words) }
id2word = { i:k for i,k in enumerate(words) }
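A quick sanity check (illustrative only) of how one (context, target) trigram maps to indices and back:

(context, target) = trigram[0]                             # e.g. (('段誉', '忽道'), '这么')
ids = (word2id[context[0]], word2id[context[1]], word2id[target])
print(ids)                                                 # three integer indices into the vocabulary
print(id2word[ids[0]], id2word[ids[1]], id2word[ids[2]])   # round-trips back to the original words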

Define MyDataset (see this site's post on Dataset and DataLoader in PyTorch)

class MyDataset(Dataset):
    def __init__(self, word2id, id2word, tokens):
        self.word2id = word2id
        self.id2word = id2word
        self.tokens = tokens

    def __len__(self):
        return len(self.tokens) - N_GRAM  # number of training examples

    def __getitem__(self, index):
        ((word_0, word_1), word_2) = trigram[index]  # (context, target) pair from the module-level trigram list
        word_0 = self.word2id[word_0]
        word_1 = self.word2id[word_1]
        word_2 = self.word2id[word_2]  # target
        return word_0, word_1, word_2
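A small usage sketch (batch size taken from the constants above); the default collate function turns the returned integers into LongTensors:

dataset = MyDataset(word2id, id2word, tokens)
loader = DataLoader(dataset, batch_size=BATCH_SIZE)
word_0, word_1, word_2 = next(iter(loader))
print(word_0.shape, word_1.shape, word_2.shape)  # each is a LongTensor of shape [BATCH_SIZE]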

Define the model

A quick introduction to the methods used in the code:

torch.nn.Parameter()

Wraps a fixed, otherwise non-trainable tensor into a trainable Parameter and registers it on the module.

In other words, the parameter is added to the model so that it can be found, managed, and updated through model.parameters().

torch.nn.Embedding(num_embeddings, embedding_dim)

A simple lookup table that stores a fixed-size dictionary of embeddings. This module is commonly used to store word embeddings and retrieve them by index: the input is a list of indices, and the output is the corresponding word vectors.

  1. num_embeddings: size of the lookup table (the vocabulary size)
  2. embedding_dim: dimension of each embedding vector

In the code below, self.embed(word_0) looks up each index in word_0 in the embedding table and replaces it with the corresponding word vector.

torch.mm(a, b): matrix multiplication of two 2-D tensors, used below for the hidden and output layers.
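A minimal sketch (shapes chosen purely for illustration) of the three pieces above:

import torch as t
import torch.nn as nn

embed = nn.Embedding(10, 4)          # lookup table: 10 words, 4-dimensional vectors
ids = t.tensor([1, 3])               # a batch of two word indices
vectors = embed(ids)                 # shape [2, 4]: each index replaced by its row of the table

W = nn.Parameter(t.randn(4, 5))      # a trainable weight matrix, visible to model.parameters()
out = t.mm(vectors, W)               # matrix multiply: [2, 4] x [4, 5] -> [2, 5]
print(out.shape)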
class NNLM(nn.Module):
    def __init__(self, vocab, dim):
        super(NNLM, self).__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.H = nn.Parameter(t.randn(EMBEDDING_SIZE * N_GRAM, HIDDEN_UNIT))  # hidden-layer weight matrix
        self.d = nn.Parameter(t.randn(HIDDEN_UNIT))                           # hidden-layer bias
        self.U = nn.Parameter(t.randn(HIDDEN_UNIT, vocab))                    # output-layer weight matrix
        self.b = nn.Parameter(t.randn(vocab))                                 # output-layer bias
        self.W = nn.Parameter(t.randn(EMBEDDING_SIZE * N_GRAM, vocab))        # direct input-to-output weight matrix

    def forward(self, word_0, word_1):
        # word_0, word_1: [batch] word indices for the two context positions
        batch = word_0.shape[0]
        word_0 = self.embed(word_0)                     # [batch, EMBEDDING_SIZE]
        word_1 = self.embed(word_1)                     # [batch, EMBEDDING_SIZE]
        words = t.cat((word_0, word_1), dim=1)          # [batch, N_GRAM * EMBEDDING_SIZE]
        words = words.view(batch, -1)
        tanh = t.tanh(t.mm(words, self.H) + self.d)     # [batch, HIDDEN_UNIT]
        hidden_output = t.mm(tanh, self.U) + self.b     # [batch, vocab]
        y = hidden_output + t.mm(words, self.W)         # y = b + Wx + U*tanh(d + Hx)
        y = F.log_softmax(y, 1)
        # Note: this returns the *negated* log-probabilities, which the training loop
        # then feeds into CrossEntropyLoss (see the analysis at the end of the post).
        return -y
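A quick shape check (illustrative, reusing word_0 and word_1 from the DataLoader sketch above):

model = NNLM(len(word2id), EMBEDDING_SIZE)
out = model(word_0.long(), word_1.long())
print(out.shape)   # [BATCH_SIZE, vocabulary size]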

Write the evaluation function

def evaluate(model, word_0, word_1):
    model.eval()
    word_0 = word_0.long()
    word_1 = word_1.long()

    softmax = model(word_0, word_1)
    # note: the model returns -log_softmax, so argmax here actually selects the
    # lowest-probability word; argmin would give the model's top prediction
    predict = t.argmax(softmax, 1)

    word_0 = word_0.cpu().detach().numpy()
    word_1 = word_1.cpu().detach().numpy()
    predict = predict.cpu().detach().numpy()
    word_sequence = [((id2word[word_0[i]], id2word[word_1[i]]), id2word[predict[i]]) for i in range(len(word_0))]
    print(word_sequence)
    model.train()

The training loop

def train(model, dataloader, optimizer, criterion):
    model.train()
    for e in range(NUM_EPOCHS):
        for i, (word_0, word_1, word_2) in enumerate(dataloader):

            word_0 = word_0.long()
            word_1 = word_1.long()
            word_2 = word_2.long()
            if USE_CUDA:
                word_0 = word_0.cuda()
                word_1 = word_1.cuda()
                word_2 = word_2.cuda()

            optimizer.zero_grad()

            softmax = model(word_0, word_1)
            loss = criterion(softmax, word_2)
            loss.backward()
            optimizer.step()

            if i % 100 == 0:
                print("epoch: {}, iter: {}, loss: {}".format(e, i, loss.item()))
                evaluate(model, word_0, word_1)

    # embedding_weights = model.input_embeddings()
    # np.save("embedding-{}".format(EMBEDDING_SIZE), embedding_weights)
    t.save(model.state_dict(), "embedding-{}.pth".format(EMBEDDING_SIZE))
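If you later want the learned word vectors, a minimal sketch of recovering them from the saved checkpoint (the filename matches the save call above; model.embed.weight is the embedding table):

model = NNLM(len(word2id), EMBEDDING_SIZE)
model.load_state_dict(t.load("embedding-{}.pth".format(EMBEDDING_SIZE), map_location="cpu"))
embedding_weights = model.embed.weight.detach().cpu().numpy()   # shape [vocab, EMBEDDING_SIZE]
print(embedding_weights.shape)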

Running it

if __name__ == '__main__':
    word2idx, idx2word = word2id, id2word
    dim = EMBEDDING_SIZE
    hidden = HIDDEN_UNIT

    model = NNLM(len(word2id.keys()), dim)

    for name, parameters in model.named_parameters():
        print(name, ':', parameters.size())

    model.to(t.device("cuda" if USE_CUDA else 'cpu'))

    lr = 1e-4  # note: overrides the LEARNING_RATE constant defined above
    optimizer = t.optim.SGD(model.parameters(), lr=lr)

    dataloader = DataLoader(dataset=MyDataset(word2id, id2word, tokens), batch_size=BATCH_SIZE)
    criterion = nn.CrossEntropyLoss()
    train(model, dataloader, optimizer, criterion)

Results

epoch: 6, iter: 4400, loss: 80.10641479492188
[(('脸上', '微微'), '正'), (('微微', '一红'), '起下')]
epoch: 6, iter: 4500, loss: 72.47634887695312
[(('道', '出手'), '阿胜'), (('出手', '不'), '下场')]
epoch: 6, iter: 4600, loss: 76.37494659423828
[(('有', '刀剑'), '吹气如兰'), (('刀剑', '齐'), '疾刺')]

The training loss did not go down in this run, and I have not pinned down the cause yet; I will come back to it after writing up word2vec. One likely culprit is that forward returns the negated log-softmax while the training loop then applies nn.CrossEntropyLoss on top of it (which internally applies log-softmax again), so the objective being minimized is not the intended negative log-likelihood.
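A minimal sketch of one possible fix, assuming we keep nn.CrossEntropyLoss: have forward return the raw scores $y$ instead of the negated log-softmax, and take the argmax of those scores when predicting.

    # replacement for NNLM.forward (same attributes as the class above)
    def forward(self, word_0, word_1):
        batch = word_0.shape[0]
        words = t.cat((self.embed(word_0), self.embed(word_1)), dim=1).view(batch, -1)
        tanh = t.tanh(t.mm(words, self.H) + self.d)
        y = t.mm(tanh, self.U) + self.b + t.mm(words, self.W)
        return y  # raw scores; CrossEntropyLoss applies log-softmax + NLL itself

With this change, evaluate() can keep t.argmax on the model output to read off the most likely next word.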
