TextCNN-PyTorch
Dataset
We use a sentiment analysis dataset (see the earlier post for the dataset itself) for the text classification task. First, let's take a rough look at the data.
Data overview

import pandas as pd

train_data = pd.read_csv("E:/blog_long/data/data.csv", header=None)
print(train_data.head())
0 1
0 1 距离川沙公路较近,但是公交指示不对,如果是"蔡陆线"的话,会非常麻烦.建议用别的路线.房间较...
1 1 商务大床房,房间很大,床有2M宽,整体感觉经济实惠不错!
2 1 早餐太差,无论去多少人,那边也不加食品的。酒店应该重视一下这个问题了。房间本身很好。
3 1 宾馆在小街道上,不大好找,但还好北京热心同胞很多~宾馆设施跟介绍的差不多,房间很小,确实挺小...
4 1 CBD中心,周围没什么店铺,说5星有点勉强.不知道为什么卫生间没有电吹风
We can see that the first column is the sentiment label and the second column is the review text.
Judging from the first five rows, the reviews carry clear sentiment cues and the labels look reasonably accurate.
Next, let's compute some statistics on the data distribution.
First, count the labels and plot their distribution as a bar chart:

import matplotlib.pyplot as plt

train_data["label"] = train_data.loc[:, 0]
print(train_data["label"].value_counts())
train_data["label"].value_counts().plot(kind="bar")
plt.title("class count")
plt.xlabel("label")
1    5322
0    2443
Name: label, dtype: int64

As we can see, this is a binary classification task, and the label distribution is imbalanced: roughly 2:1 positive to negative (5,322 vs. 2,443).
Next, compute some simple statistics on the text column, then split the dataset and preprocess the text.

train_data["sentence"] = train_data.loc[:, 1]
train_data["sentence_len"] = train_data.loc[:, 1].apply(lambda x: len(str(x)))
print(train_data["sentence_len"].head())
print(train_data["sentence_len"].describe(percentiles=[.5, .95]))
0 50
1 28
2 42
3 175
4 36
Name: sentence_len, dtype: int64
count 7765.000000
mean 128.518609
std 143.629370
min 2.000000
50% 84.000000
95% 383.000000
max 2924.000000
Name: sentence_len, dtype: float64
The shortest sentence has 2 characters and the longest has 2924; a length of 383 characters covers 95% of the data. We therefore use 385 as the fixed sentence length: shorter sentences are padded, longer ones are truncated.
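The cutoff can also be read directly from the length column; a minimal sanity-check sketch using the sentence_len column defined above:

# 95th percentile of review lengths (matches the describe() output above, ~383)
print(train_data["sentence_len"].quantile(0.95))
# Fraction of reviews that fit entirely within 385 characters
print((train_data["sentence_len"] <= 385).mean())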
Building the vocabulary
We use single characters as tokens and write them to the vocab.txt file.
Two things to note:
Use <unk> for unknown characters, and treat characters that appear only once as unknown (this way, unseen characters in the test set can also be handled).
Use <pad> for the padding character (used later to pad all sentences to the same length).
from collections import Counter

with open("vocab.txt", 'w', encoding='utf-8') as fout:
    fout.write("<unk>\n")
    fout.write("<pad>\n")
    # Keep only characters that appear more than once; the rest fall back to <unk>
    vocab = [word for word, freq in Counter(j for i in train_data["sentence"] for j in i).most_common() if freq > 1]
    for i in vocab:
        fout.write(i + "\n")
with open("vocab.txt", encoding='utf-8') as fin:
    vocab = [i.strip() for i in fin]

char2idx = {i: index for index, i in enumerate(vocab)}
idx2char = {index: i for index, i in enumerate(vocab)}
vocab_size = len(vocab)
pad_id = char2idx["<pad>"]
unk_id = char2idx["<unk>"]
Preprocessing
The main steps are to represent each sentence as a sequence of indices and to truncate or pad it to a fixed length.
sequence_length = 385

def tokenizer():
    # Map each character to its vocabulary index, then pad/truncate to sequence_length
    inputs = []
    sentence_char = [[j for j in i] for i in train_data["sentence"]]
    for index, i in enumerate(sentence_char):
        temp = [char2idx.get(j, unk_id) for j in i]  # unseen characters fall back to <unk>
        if len(temp) < sequence_length:
            for _ in range(sequence_length - len(temp)):
                temp.append(pad_id)
        else:
            temp = temp[:sequence_length]
        inputs.append(temp)
    return inputs

data_input = tokenizer()
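As a quick check that the encoding round-trips, the first encoded sentence can be decoded back with idx2char; a small sketch using the variables defined above:

first = data_input[0]
print(len(first))  # 385: every encoded sentence has the same length
# Decode the first 20 indices back to characters (should match the start of review 0)
print("".join(idx2char[idx] for idx in first[:20]))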
Building the model
Import the torch packages and initialize the hyperparameters.

import torch
import torch.nn as nn
import torch.utils.data as Data
import torch.optim as optim
import torch.nn.functional as F
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Embedding_size = 100
Batch_Size = 36
Kernel = 3
Filter_num = 10
Epoch = 60
Dropout = 0.5
Learning_rate = 1e-3
Building the Dataset and DataLoader
Here we use the random_split function from torch.utils.data to split the data into training and test sets. Since this is just for learning, no validation set is held out.
class TextCNNDataSet(Data.Dataset):
    def __init__(self, data_inputs, data_targets):
        self.inputs = torch.LongTensor(data_inputs)
        self.label = torch.LongTensor(data_targets)

    def __getitem__(self, index):
        return self.inputs[index], self.label[index]

    def __len__(self):
        return len(self.inputs)


TextCNNDataSet = TextCNNDataSet(data_input, list(train_data["label"]))
train_size = int(len(data_input) * 0.8)
test_size = len(data_input) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(TextCNNDataSet, [train_size, test_size])

TrainDataLoader = Data.DataLoader(train_dataset, batch_size=Batch_Size, shuffle=True)
TestDataLoader = Data.DataLoader(test_dataset, batch_size=Batch_Size, shuffle=True)
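Before building the model, it helps to pull one batch and confirm the tensor shapes; a quick check:

batch_x, batch_y = next(iter(TrainDataLoader))
print(batch_x.shape)  # torch.Size([36, 385]): [Batch_Size, sequence_length]
print(batch_y.shape)  # torch.Size([36]): one label per sentence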
Initializing the model
A few things to note here:
In images, the (R, G, B) planes can serve as different channels; for text, the input channels are usually different kinds of embeddings (e.g., word2vec or GloVe).
Another option is to use static word vectors and fine-tuned word vectors as separate channels. Here we simply use randomly initialized vectors, so the CNN has a single input channel.
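For reference, here is a hedged sketch of how a two-channel input (static plus fine-tuned embeddings) could be built; the pretrained_weights matrix below is only a random stand-in, since no pretrained vectors are used in this post:

# Stand-in "pretrained" matrix; in practice this would be loaded from word2vec/GloVe
pretrained_weights = torch.randn(vocab_size, Embedding_size)
static_embed = nn.Embedding.from_pretrained(pretrained_weights, freeze=True)           # static channel
tuned_embed = nn.Embedding.from_pretrained(pretrained_weights.clone(), freeze=False)   # fine-tuned channel

X = torch.LongTensor(data_input[:2])                        # two example sentences
x = torch.stack([static_embed(X), tuned_embed(X)], dim=1)   # [2, 2, sequence_length, Embedding_size]
print(x.shape)
# With such a 2-channel input, the first argument of nn.Conv2d below would be 2 instead of 1.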
num_classes = 2

class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.W = nn.Embedding(vocab_size, embedding_dim=Embedding_size)
        out_channel = Filter_num
        self.conv = nn.Sequential(
            nn.Conv2d(1, out_channel, (2, Embedding_size)),   # kernel spans 2 characters at a time
            nn.ReLU(),
            nn.MaxPool2d((sequence_length - 1, 1)),           # max-pool over all 384 positions
        )
        self.dropout = nn.Dropout(Dropout)
        self.fc = nn.Linear(out_channel, num_classes)

    def forward(self, X):
        batch_size = X.shape[0]
        embedding_X = self.W(X)                  # [batch_size, sequence_length, Embedding_size]
        embedding_X = embedding_X.unsqueeze(1)   # [batch_size, 1, sequence_length, Embedding_size]
        conved = self.conv(embedding_X)          # [batch_size, out_channel, 1, 1]
        conved = self.dropout(conved)
        flatten = conved.view(batch_size, -1)    # [batch_size, out_channel]
        output = self.fc(flatten)                # [batch_size, num_classes]
        return output
Instantiate the model, loss function, and optimizer.

model = TextCNN().to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=Learning_rate)
def binary_acc(pred, y):
    """
    Compute the accuracy of the model.
    :param pred: predicted labels
    :param y: ground-truth labels
    :return: accuracy
    """
    correct = torch.eq(pred, y).float()
    acc = correct.sum() / len(correct)
    return acc.item()
Training and evaluating the model

def train():
    avg_acc = []
    model.train()
    for index, (batch_x, batch_y) in enumerate(TrainDataLoader):
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        pred = model(batch_x)
        loss = criterion(pred, batch_y)
        acc = binary_acc(torch.max(pred, dim=1)[1], batch_y)
        avg_acc.append(acc)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    avg_acc = np.array(avg_acc).mean()
    return avg_acc
def evaluate():
    """
    Evaluate the model.
    :return: accuracy of the current model on the test set
    """
    avg_acc = []
    model.eval()
    with torch.no_grad():
        for x_batch, y_batch in TestDataLoader:
            x_batch, y_batch = x_batch.to(device), y_batch.to(device)
            pred = model(x_batch)
            acc = binary_acc(torch.max(pred, dim=1)[1], y_batch)
            avg_acc.append(acc)
    return np.array(avg_acc).mean()
model_train_acc, model_test_acc = [], []
for epoch in range(Epoch):
    train_acc = train()
    test_acc = evaluate()
    if epoch % 10 == 9:
        print("epoch = {}, train accuracy = {}".format(epoch + 1, train_acc))
        print("epoch = {}, test accuracy = {}".format(epoch + 1, test_acc))
    model_train_acc.append(train_acc)
    model_test_acc.append(test_acc)

plt.plot(model_train_acc)
plt.plot(model_test_acc)
plt.ylim(ymin=0.5, ymax=1.01)
plt.title("The accuracy of textCNN model")
plt.legend(['train', 'test'])
plt.show()
epoch = 10, train accuracy = 0.8111111153067881
epoch = 10, test accuracy = 0.8551767712289636
epoch = 20, train accuracy = 0.85777136149434
epoch = 20, test accuracy = 0.8571969758380543
epoch = 30, train accuracy = 0.8702954456985341
epoch = 30, test accuracy = 0.8529040488329801
epoch = 40, train accuracy = 0.8877970565950251
epoch = 40, test accuracy = 0.8236111172220923
epoch = 50, train accuracy = 0.8967565897572247
epoch = 50, test accuracy = 0.8236111172220923
epoch = 60, train accuracy = 0.9006101575200958
epoch = 60, test accuracy = 0.8200757625428113
We can see that the model starts to overfit the training set: the test accuracy stabilizes around epoch 15 and then trends downward. I will optimize further when I have time; for now, a simple TextCNN with a single kernel size reaches roughly 85% accuracy on this text classification task. One possible next step is sketched below.
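A natural follow-up, in the spirit of the original TextCNN paper, is to run several kernel sizes in parallel and concatenate their pooled features. The following is only a rough sketch of such a variant, not something trained in this post; the class name and kernel sizes (2, 3, 4) are my own assumptions:

class MultiKernelTextCNN(nn.Module):
    """Sketch: TextCNN with several parallel kernel sizes instead of one."""
    def __init__(self, kernel_sizes=(2, 3, 4)):
        super(MultiKernelTextCNN, self).__init__()
        self.W = nn.Embedding(vocab_size, embedding_dim=Embedding_size)
        self.convs = nn.ModuleList([
            nn.Conv2d(1, Filter_num, (k, Embedding_size)) for k in kernel_sizes
        ])
        self.dropout = nn.Dropout(Dropout)
        self.fc = nn.Linear(Filter_num * len(kernel_sizes), num_classes)

    def forward(self, X):
        x = self.W(X).unsqueeze(1)                        # [batch, 1, seq_len, emb]
        feats = []
        for conv in self.convs:
            c = F.relu(conv(x)).squeeze(3)                # [batch, Filter_num, seq_len - k + 1]
            p = F.max_pool1d(c, c.shape[2]).squeeze(2)    # [batch, Filter_num]
            feats.append(p)
        out = self.dropout(torch.cat(feats, dim=1))       # [batch, Filter_num * len(kernel_sizes)]
        return self.fc(out)

Training it would only require replacing TextCNN() with MultiKernelTextCNN() when instantiating the model.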
References:
textCNN原理一览与基于Pytorch的文本分类案例
TextCNN 的 PyTorch 实现