A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language.
This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training.
Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set.
We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring
sentences.
The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations.
Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to
words forming an already seen sentence.
Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge.
We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.
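To make the idea concrete, here is a minimal PyTorch sketch of the architecture the abstract describes. It is not the code trained below, and the class and argument names are my own: each context word is looked up in a shared embedding table, the embeddings are concatenated, passed through a tanh hidden layer, and projected to vocabulary-sized logits. The paper's optional direct connection from the embeddings to the output layer is omitted for brevity.

# Minimal sketch (illustrative names, not the blog's final model):
# shared embedding table -> concatenate context vectors -> tanh hidden layer -> vocab logits
import torch
import torch.nn as nn

class NNLMSketch(nn.Module):
    def __init__(self, vocab_size, embedding_size, context_size, hidden_units):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_size)                  # word -> feature vector
        self.hidden = nn.Linear(context_size * embedding_size, hidden_units)   # tanh hidden layer
        self.out = nn.Linear(hidden_units, vocab_size)                         # projection to vocabulary logits

    def forward(self, context_ids):                  # context_ids: (batch, context_size)
        x = self.embed(context_ids).flatten(1)       # concatenate the context embeddings
        return self.out(torch.tanh(self.hidden(x)))  # unnormalized scores over the vocabulary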
vocab = "vocab.txt" defpreprocess(file_in, file_out): vocab_temp = set() punc = "[\u3002|\uff1f|\uff01|\uff0c|\u3001|\uff1b|\uff1a|\u201c|\u201d|\u2018|\u2019|\uff08|\uff09|\u300a|\u300b|\u3008|\u3009|\u3010|\u3011|\u300e|\u300f|\u300c|\u300d|\ufe43|\ufe44|\u3014|\u3015|\u2026|\u2014|\uff5e|\ufe4f|\uffe5]" with open(file_out, 'w', encoding='utf-8') as fout: with open(file_in, 'r', encoding='utf-8') as fin: for i in fin: line = re.sub("[0-9]","#", i.strip("\n")) #将数字替换# line = re.sub(punc, " ", line) #将标点替换空格,空格在jieba中起到分词间隔作用 temp_line = [] for j in jieba.cut(line): if j and j!= " ": vocab_temp.add(j) temp_line.append(j) fout.write(" ".join(temp_line) + "\n")
with open(vocab, 'w', encoding='utf-8') as vocab_file: for word in vocab_temp: vocab_file.write(word + "\n") print("vocab size is %d"%len(vocab_temp))
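The snippet defines preprocess but never calls it. A usage example follows; the raw input filename "tlbb.txt" is an assumption, while the output name matches the file the training code reads below.

# Hypothetical call: "tlbb.txt" is an assumed raw input file; the segmented
# output "tlbb_post.txt" is what the training code below reads.
preprocess("tlbb.txt", "tlbb_post.txt")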
Start training
Import the required packages
import torch as t
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from collections import Counter
from collections import OrderedDict
Set the parameters we will use
USE_CUDA = t.cuda.is_available()

NUM_EPOCHS = 1000
BATCH_SIZE = 2               # the batch size
LEARNING_RATE = 0.001        # the initial learning rate
EMBEDDING_SIZE = 300         # dimensionality of each word vector
N_GRAM = 2                   # number of context words used to predict the next word
HIDDEN_UNIT = 128            # number of hidden units
UNK = "<unk>"                # unknown-word token, to avoid out-of-vocabulary index errors
H = N_GRAM * EMBEDDING_SIZE  # size of the concatenated context vector
U = HIDDEN_UNIT
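USE_CUDA is only a flag at this point. A common pattern, and an assumption about how the rest of the training code uses it, is to build a single device handle from it and move the model and batches onto that device later.

# Assumed usage of the flag: pick the device once, then call .to(device)
# on the model and on each batch inside the training loop.
device = t.device("cuda" if USE_CUDA else "cpu")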
Process the data into N-gram form (here a two-word context, N_GRAM = 2)
with open("tlbb_post.txt", "r", encoding='utf-8') as fin: test_sentence = [j for i in fin.readlines() for j in i.strip("\n").split(" ")][:10000] tokens = test_sentence #['段誉', '忽道', '这么', '高', '跳下来', '可不', '摔坏', '了', '么', '你'] trigram = [((test_sentence[i],test_sentence[i+1]),test_sentence[i+2]) for i in range(len(tokens)-N_GRAM)] #[(('段誉', '忽道'), '这么'), (('忽道', '这么'), '高'), (('这么', '高'), '跳下来'), (('高', '跳下来'), '可不'), (('跳下来', '可不'), '摔坏'), (('可不', '摔坏'), '了'), (('摔坏', '了'), '么'), (('了', '么'), '你')] words = dict(Counter(tokens).most_common()) #{'的': 16, '了': 15, '龚光杰': 13, '你': 11, '道': 11, '我': 10, '左子穆': 10} words = sorted(iter(words.keys()), key=words.get, reverse=False) words += UNK word2id = { k:i for i,k in enumerate(words) } id2word = { i:k for i,k in enumerate(words) }