A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language.
This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training.
Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set.
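To make the n-gram idea concrete, here is a minimal sketch of an unsmoothed bigram model in plain Python. The toy corpus and the function names (`train_bigram`, `bigram_prob`, `sequence_prob`) are made up for illustration and do not come from the paper. It shows how a sentence that never occurs in training still gets a non-zero probability when every overlapping pair in it was seen, and how a single unseen pair drives the probability to zero.

```python
from collections import Counter


def train_bigram(tokens):
    """Count context words and adjacent word pairs in a token list."""
    contexts = Counter(tokens[:-1])              # every token that has a successor
    bigrams = Counter(zip(tokens, tokens[1:]))   # adjacent (prev, word) pairs
    return contexts, bigrams


def bigram_prob(contexts, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev), with no smoothing."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]


def sequence_prob(contexts, bigrams, tokens):
    """Multiply the conditional probabilities of each overlapping pair."""
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(contexts, bigrams, prev, word)
    return prob


# Toy corpus (an assumption for this example only).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
contexts, bigrams = train_bigram(corpus)

# Seen verbatim in the corpus.
seen = sequence_prob(contexts, bigrams, "the cat sat".split())
# Never seen as a whole, but stitched together from seen bigrams.
stitched = sequence_prob(contexts, bigrams, "the dog sat on the mat".split())
# Contains the unseen pair ("cat", "ran").
unseen = sequence_prob(contexts, bigrams, "the cat ran".split())
```

Real n-gram models add smoothing or back-off so that unseen pairs do not get exactly zero probability, but they still have no notion that "cat" and "dog" are similar words.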
We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.
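The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations.
(In other words, the model learns two things at once: 1. a distributed representation for each word; 2. the probability function of word sequences.)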
Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to
words forming an already seen sentence.
Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge.
We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models and makes it possible to take advantage of longer contexts.
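The two things the model learns together can be sketched in a few lines of PyTorch. This is only a minimal sketch under my own assumptions, not the paper's exact architecture: the class name `NNLMSketch`, the layer names, and the tiny sizes are all hypothetical. An embedding table holds the distributed representation of each word, and a small feed-forward network turns the concatenated context embeddings into a log-probability for every word in the vocabulary.

```python
import torch as t
import torch.nn as nn


class NNLMSketch(nn.Module):
    """Feed-forward language model: embeddings -> tanh hidden layer -> log-probabilities."""

    def __init__(self, vocab_size, embedding_size, context_size, hidden_units):
        super().__init__()
        # (1) the distributed representation: one learned vector per word
        self.embed = nn.Embedding(vocab_size, embedding_size)
        # (2) the probability function, expressed in terms of those vectors
        self.hidden = nn.Linear(context_size * embedding_size, hidden_units)
        self.out = nn.Linear(hidden_units, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) tensor of integer word ids
        x = self.embed(context_ids).flatten(start_dim=1)  # concatenate context embeddings
        logits = self.out(t.tanh(self.hidden(x)))
        return t.log_softmax(logits, dim=1)               # log P(next word | context)


# Tiny hypothetical sizes, just to check shapes.
model = NNLMSketch(vocab_size=10, embedding_size=4, context_size=2, hidden_units=8)
context = t.tensor([[1, 2], [3, 4]])
log_probs = model(context)
```

Because both the embedding table and the network weights receive gradients from the same loss, words that appear in similar contexts are pushed toward nearby vectors, which is where the generalization described above comes from.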
import re

import jieba

vocab = "vocab.txt"


def preprocess(file_in, file_out):
    """Segment file_in with jieba, write space-separated tokens to file_out, and save the vocabulary."""
    vocab_temp = set()
    punc = "[\u3002|\uff1f|\uff01|\uff0c|\u3001|\uff1b|\uff1a|\u201c|\u201d|\u2018|\u2019|\uff08|\uff09|\u300a|\u300b|\u3008|\u3009|\u3010|\u3011|\u300e|\u300f|\u300c|\u300d|\ufe43|\ufe44|\u3014|\u3015|\u2026|\u2014|\uff5e|\ufe4f|\uffe5]"
    with open(file_out, 'w', encoding='utf-8') as fout:
        with open(file_in, 'r', encoding='utf-8') as fin:
            for i in fin:
                # Replace every digit with "#".
                line = re.sub("[0-9]", "#", i.strip("\n"))
                # Replace punctuation with spaces; spaces act as token separators for jieba.
                line = re.sub(punc, " ", line)
                temp_line = []
                for j in jieba.cut(line):
                    # Skip empty tokens and the space separators themselves.
                    if j and j != " ":
                        vocab_temp.add(j)
                        temp_line.append(j)
                fout.write(" ".join(temp_line) + "\n")
    # Write the vocabulary, one word per line.
    with open(vocab, 'w', encoding='utf-8') as vocab_file:
        for word in vocab_temp:
            vocab_file.write(word + "\n")
    print("vocab size is %d" % len(vocab_temp))
import torch as t
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from collections import Counter
from collections import OrderedDict
USE_CUDA = t.cuda.is_available()
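NUM_EPOCHS = 1000
BATCH_SIZE = 2  # the batch size
LEARNING_RATE = 0.001  # the initial learning rate
EMBEDDING_SIZE = 300
N_GRAM = 2  # number of context words used to predict the next word
HIDDEN_UNIT = 128
UNK = "<unk>"  # fallback token for unknown words, to avoid out-of-range indices
H = N_GRAM * EMBEDDING_SIZE  # size of the concatenated context embeddings
U = HIDDEN_UNIT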
with open("tlbb_post.txt", "r", encoding='utf-8') as fin:
    # Flatten the file into one token list and keep the first 10000 tokens.
    test_sentence = [j for i in fin.readlines() for j in i.strip("\n").split(" ")][:10000]
tokens = test_sentence
# e.g. ['段誉', '忽道', '这么', '高', '跳下来', '可不', '摔坏', '了', '么', '你']

# Pair every N_GRAM consecutive words with the word that follows them.
trigram = [(tuple(tokens[i:i + N_GRAM]), tokens[i + N_GRAM]) for i in range(len(tokens) - N_GRAM)]
# e.g. [(('段誉', '忽道'), '这么'), (('忽道', '这么'), '高'), (('这么', '高'), '跳下来'), (('高', '跳下来'), '可不'), (('跳下来', '可不'), '摔坏'), (('可不', '摔坏'), '了'), (('摔坏', '了'), '么'), (('了', '么'), '你')]

words = dict(Counter(tokens).most_common())
# e.g. {'的': 16, '了': 15, '龚光杰': 13, '你': 11, '道': 11, '我': 10, '左子穆': 10}

# Sort words by ascending frequency, then add UNK as a single entry.
words = sorted(words, key=words.get)
words.append(UNK)  # `words += UNK` would add each character of "<unk>" separately
word2id = {k: i for i, k in enumerate(words)}
id2word = {i: k for i, k in enumerate(words)}
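Before the n-gram pairs can be fed to an embedding layer, the words have to be turned into integer ids, and this is where the `UNK` entry earns its place. The following is a minimal sketch with made-up helper names (`build_vocab`, `encode`) and a three-word toy vocabulary; it is not part of the original script. Any word missing from the vocabulary falls back to the id of `UNK` instead of raising a `KeyError` or producing an index outside the embedding table.

```python
UNK = "<unk>"


def build_vocab(tokens):
    """Map each distinct token to an id, reserving the last id for UNK."""
    words = sorted(set(tokens))
    words.append(UNK)  # append (not +=) so the token is added whole
    return {w: i for i, w in enumerate(words)}


def encode(ngrams, word2id):
    """Turn ((w1, ..., wn), target) pairs into (context ids, target id) pairs."""
    unk_id = word2id[UNK]
    return [
        ([word2id.get(w, unk_id) for w in context], word2id.get(target, unk_id))
        for context, target in ngrams
    ]


# Toy data (hypothetical): "zzz" is not in the vocabulary.
word2id = build_vocab(["a", "b", "c"])
pairs = encode([(("a", "b"), "c"), (("b", "zzz"), "a")], word2id)
```

The resulting lists of ids can then be wrapped in tensors, for example inside a `Dataset`, and batched with a `DataLoader`.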