TextCNN-PyTorch

Dataset

We use a sentiment-analysis dataset (see above for the dataset) for the text classification task. Let's start with a rough look at the data.

Data overview

import pandas as pd
train_data = pd.read_csv("E:/blog_long/data/data.csv",header=None)
print(train_data.head())
   0                                                  1
0  1  距离川沙公路较近,但是公交指示不对,如果是"蔡陆线"的话,会非常麻烦.建议用别的路线.房间较...
1  1                       商务大床房,房间很大,床有2M宽,整体感觉经济实惠不错!
2  1         早餐太差,无论去多少人,那边也不加食品的。酒店应该重视一下这个问题了。房间本身很好。
3  1  宾馆在小街道上,不大好找,但还好北京热心同胞很多~宾馆设施跟介绍的差不多,房间很小,确实挺小...
4  1               CBD中心,周围没什么店铺,说5星有点勉强.不知道为什么卫生间没有电吹风

As we can see, the first column is the sentiment label and the second column is the text.

A rough look at the first five rows shows that the texts carry clear sentiment cues, and the labels look fairly accurate.

Next, let's gather some statistics on the data distribution.

First, look at the label counts and their distribution, and plot them as a bar chart, as shown below:

import matplotlib.pyplot as plt
train_data["label"] = train_data.loc[:,0]
print(train_data["label"].value_counts())
train_data["label"].value_counts().plot(kind="bar")
plt.title("class count")
plt.xlabel("label")
plt.show()

As we can see, this is a binary classification task, and the positive:negative label ratio is roughly 2:1 (5322 vs. 2443).

1    5322
0    2443
Name: label, dtype: int64

(Figure: label distribution bar chart)

Next, compute a few simple statistics on the text column, then split the dataset and preprocess the text.

train_data["sentence"] = train_data.loc[:,1]
train_data["sentence_len"] = train_data.loc[:,1].apply(lambda x:len(str(x)))
print(train_data["sentence_len"].head())
print(train_data["sentence_len"].describe(percentiles =[.5,.95]))
0     50
1     28
2     42
3    175
4     36
Name: sentence_len, dtype: int64
count    7765.000000
mean      128.518609
std       143.629370
min         2.000000
50%        84.000000
95%       383.000000
max      2924.000000
Name: sentence_len, dtype: float64

We can see that the shortest sentence is 2 characters and the longest is 2924; a length of 383 characters covers 95% of the data, so we use 385 as the fixed sentence length, padding shorter sentences and truncating longer ones.
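If you prefer to derive this cutoff programmatically rather than reading it off describe(), a minimal sketch using pandas (illustrative only):

# Illustrative: compute the 95th-percentile sentence length directly
print(train_data["sentence_len"].quantile(0.95))  # ~383, which we round up to 385 below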

Building the vocabulary

We use individual characters as the tokenization unit and write them to vocab.txt.

Two points to note:

  1. Use <unk> to represent unknown characters, and treat characters that appear only once as unknown (this lets us handle unseen characters in the test set).
  2. Use <pad> to represent padding characters (so that all sentences can later be unified to the same length).
from collections import Counter
with open("vocab.txt", 'w', encoding='utf-8') as fout:
    fout.write("<unk>\n")
    fout.write("<pad>\n")
    # keep only characters that occur more than once; everything else maps to <unk>
    vocab = [word for word, freq in Counter(j for i in train_data["sentence"] for j in i).most_common() if freq > 1]
    for i in vocab:
        fout.write(i + "\n")
# Load the vocabulary and build the char <-> index mappings
with open("vocab.txt", encoding='utf-8') as fin:
    vocab = [i.strip() for i in fin]
char2idx = {i: index for index, i in enumerate(vocab)}
idx2char = {index: i for index, i in enumerate(vocab)}
vocab_size = len(vocab)
pad_id = char2idx["<pad>"]
unk_id = char2idx["<unk>"]

Preprocessing

The main steps are to represent each sentence as a sequence of character indices and to truncate or pad it to a fixed length.

sequence_length = 385

# Preprocess the input: map characters to indices, then pad / truncate each sentence
def tokenizer():
    inputs = []
    sentence_char = [[j for j in i] for i in train_data["sentence"]]
    # pad or truncate every sentence to sequence_length
    for index, i in enumerate(sentence_char):
        temp = [char2idx.get(j, unk_id) for j in i]
        if len(temp) < sequence_length:
            for _ in range(sequence_length - len(temp)):
                temp.append(pad_id)
        else:
            temp = temp[:sequence_length]
        inputs.append(temp)
    return inputs

data_input = tokenizer()
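As a quick sanity check (not part of the original post), we can confirm that every preprocessed sentence now has exactly sequence_length indices:

# Illustrative check: every sentence should now consist of exactly 385 ids
assert all(len(sentence) == sequence_length for sentence in data_input)
print(len(data_input), "sentences, each of length", sequence_length)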

Building the model

Import the torch packages and initialize the hyperparameters.

import torch
import torch.nn as nn
import torch.utils.data as Data
import torch.optim as optim
import torch.nn.functional as F
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Embedding_size = 100
Batch_Size = 36
Kernel = 3  # note: not used by the model below, which uses a fixed kernel height of 2
Filter_num = 10
Epoch = 60
Dropout = 0.5
Learning_rate = 1e-3

Building the Dataset and DataLoader

Here we use the random_split function from torch.utils.data to split the data into training and test sets. Since this is only for learning purposes, no separate validation set is held out.

class TextCNNDataSet(Data.Dataset):
    def __init__(self, data_inputs, data_targets):
        self.inputs = torch.LongTensor(data_inputs)
        self.label = torch.LongTensor(data_targets)

    def __getitem__(self, index):
        return self.inputs[index], self.label[index]

    def __len__(self):
        return len(self.inputs)

# instantiate under a different name so the class itself is not shadowed
dataset = TextCNNDataSet(data_input, list(train_data["label"]))
train_size = int(len(data_input) * 0.8)
test_size = len(data_input) - train_size
train_dataset, test_dataset = Data.random_split(dataset, [train_size, test_size])

TrainDataLoader = Data.DataLoader(train_dataset, batch_size=Batch_Size, shuffle=True)
TestDataLoader = Data.DataLoader(test_dataset, batch_size=Batch_Size, shuffle=True)
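If you want the 80/20 split to be reproducible across runs, random_split also accepts a seeded generator (a minimal sketch, assuming a reasonably recent PyTorch version):

# Optional, illustrative: make the train/test split deterministic with a fixed seed
train_dataset, test_dataset = Data.random_split(
    dataset, [train_size, test_size],
    generator=torch.Generator().manual_seed(42))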

Defining the model

A few things to note here:

In images, the (R, G, B) planes can serve as different channels; for text, the input channels are usually embeddings produced in different ways (e.g. word2vec or GloVe).

There is also the approach of using a static word embedding and a fine-tuned word embedding as two separate channels. Here we use randomly initialized vectors as the embedding, and the CNN has a single input channel.
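For reference, here is a minimal sketch of that two-channel idea (this is not the model used in this post; the class name TwoChannelEmbedding and its attributes are made up for illustration): a frozen "static" embedding and a trainable "fine-tuned" embedding are stacked as two input channels, so the convolution that follows would use in_channels=2.

# Illustrative sketch only; torch / nn are already imported in the hyperparameter block above
class TwoChannelEmbedding(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.static = nn.Embedding(vocab_size, embedding_dim)
        self.static.weight.requires_grad = False               # frozen channel (e.g. pre-trained vectors)
        self.tuned = nn.Embedding(vocab_size, embedding_dim)   # trainable channel

    def forward(self, x):                          # x: [batch, seq_len]
        a = self.static(x).unsqueeze(1)            # [batch, 1, seq_len, emb]
        b = self.tuned(x).unsqueeze(1)             # [batch, 1, seq_len, emb]
        return torch.cat([a, b], dim=1)            # [batch, 2, seq_len, emb]

The model below sticks to the single-channel, randomly initialized embedding described above.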

# nn.Conv2d(in_channels,   # number of input channels
#           out_channels,  # number of output channels
#           kernel_size)   # size of the convolution kernel
num_classes = 2

class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.W = nn.Embedding(vocab_size, embedding_dim=Embedding_size)
        out_channel = Filter_num
        self.conv = nn.Sequential(
            nn.Conv2d(1, out_channel, (2, Embedding_size)),  # kernel size is 2 x Embedding_size
            nn.ReLU(),
            nn.MaxPool2d((sequence_length - 1, 1)),          # conv output height is sequence_length - 1
        )
        self.dropout = nn.Dropout(Dropout)
        self.fc = nn.Linear(out_channel, num_classes)

    def forward(self, X):
        batch_size = X.shape[0]
        embedding_X = self.W(X)                 # [batch_size, sequence_length, embedding_size]
        embedding_X = embedding_X.unsqueeze(1)  # add channel(=1): [batch, 1, sequence_length, embedding_size]
        conved = self.conv(embedding_X)         # [batch_size, output_channel, 1, 1]
        conved = self.dropout(conved)
        flatten = conved.view(batch_size, -1)   # [batch_size, output_channel*1*1]
        output = self.fc(flatten)
        return output
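As a quick sanity check (illustrative, not in the original post), we can push a dummy batch through an untrained TextCNN on the CPU and confirm the output shape:

# Illustrative: 4 fake sentences of random character ids -> logits of shape [4, num_classes]
dummy_batch = torch.randint(0, vocab_size, (4, sequence_length))
print(TextCNN()(dummy_batch).shape)  # expected: torch.Size([4, 2])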

Instantiating the model, the loss function, and the optimizer

model = TextCNN().to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(),lr=Learning_rate)
def binary_acc(pred, y):
    """
    Compute the model's accuracy.
    :param pred: predicted labels
    :param y: ground-truth labels
    :return: accuracy
    """
    correct = torch.eq(pred, y).float()
    acc = correct.sum() / len(correct)
    return acc.item()
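A tiny illustrative example of binary_acc (not in the original post):

# Two of the three predictions match the labels, so the accuracy is about 0.667
print(binary_acc(torch.tensor([1, 0, 1]), torch.tensor([1, 1, 1])))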

Training and evaluating the model

def train():
    avg_acc = []
    model.train()
    for index, (batch_x, batch_y) in enumerate(TrainDataLoader):
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        pred = model(batch_x)
        loss = criterion(pred, batch_y)
        acc = binary_acc(torch.max(pred, dim=1)[1], batch_y)
        avg_acc.append(acc)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    avg_acc = np.array(avg_acc).mean()
    return avg_acc
def evaluate():
    """
    Evaluate the current model on the test set.
    :return: mean accuracy over the test batches
    """
    avg_acc = []
    model.eval()  # switch to evaluation mode
    with torch.no_grad():
        for x_batch, y_batch in TestDataLoader:
            x_batch, y_batch = x_batch.to(device), y_batch.to(device)
            pred = model(x_batch)
            acc = binary_acc(torch.max(pred, dim=1)[1], y_batch)
            avg_acc.append(acc)
    return np.array(avg_acc).mean()
# Training cycle
model_train_acc, model_test_acc = [], []
for epoch in range(Epoch):
    train_acc = train()
    test_acc = evaluate()
    if epoch % 10 == 9:
        print("epoch = {}, train accuracy = {}".format(epoch + 1, train_acc))
        print("epoch = {}, test accuracy = {}".format(epoch + 1, test_acc))
    model_train_acc.append(train_acc)
    model_test_acc.append(test_acc)

plt.plot(model_train_acc)
plt.plot(model_test_acc)
plt.ylim(0.5, 1.01)
plt.title("The accuracy of textCNN model")
plt.legend(['train', 'test'])
plt.show()
epoch = 10, train accuracy = 0.8111111153067881
epoch = 10, test accuracy = 0.8551767712289636
epoch = 20, train accuracy = 0.85777136149434
epoch = 20, test accuracy = 0.8571969758380543
epoch = 30, train accuracy = 0.8702954456985341
epoch = 30, test accuracy = 0.8529040488329801
epoch = 40, train accuracy = 0.8877970565950251
epoch = 40, test accuracy = 0.8330808146433397
epoch = 50, train accuracy = 0.8967565897572247
epoch = 50, test accuracy = 0.8236111172220923
epoch = 60, train accuracy = 0.9006101575200958
epoch = 60, test accuracy = 0.8200757625428113

(Figure: training and test accuracy over epochs)

We can see that the model is starting to overfit the training set, while the test accuracy stabilizes around epoch 15 and then trends downward. Further optimization is left for another time; even so, this simple TextCNN with a single convolution kernel reaches roughly 85% accuracy on this text classification task.

References:

textCNN原理一览与基于Pytorch的文本分类案例

TextCNN 的 PyTorch 实现