Comparing Recent Tweet Sentiment-Analysis Models: DistilBERT vs. XLNet


I. Introduction

Natural language processing (NLP) is a widely used text-mining technique, applied to analyzing conversations and user reviews, understanding public behavior, and turning shared opinions into business strategy. As data science has advanced, more and more models have appeared since BERT (Bidirectional Encoder Representations from Transformers) that can be applied to sentiment analysis on Twitter. The latest models, pre-trained on massive corpora, achieve higher scores on text classification and make more accurate predictions than traditional models.

This article presents how two newer models, DistilBERT and XLNet, perform on three datasets (hate, irony, and offensive). The results can then be compared against the standard evaluation scores published by Cardiff University for the TweetEval benchmark, shown below:

Model              Emoji  Emotion  Hate  Irony  Offensive  Sentiment  Stance  ALL (TE)  Reference
BERTweet           33.4   79.3     56.4  82.1   79.5       73.4       71.2    67.9      BERTweet
RoBERTa-Retrained  31.4   78.5     52.3  61.7   80.5       72.6       69.3    65.2      TweetEval
RoBERTa-Base       30.9   76.1     46.6  59.7   79.5       71.3       68.0    61.3      TweetEval
RoBERTa-Twitter    29.3   72.0     49.9  65.4   77.1       69.1       66.7    61.0      TweetEval
FastText           25.8   65.2     50.6  63.1   73.4       62.9       65.4    58.1      TweetEval
LSTM               24.7   66.0     52.6  62.8   71.7       58.3       59.4    56.5      TweetEval
SVM                29.3   64.7     36.7  61.7   52.3       62.9       67.3    53.5      TweetEval

We build classification models with DistilBERT and XLNet, apply each model in turn to the three datasets (hate, irony, and offensive), and report new benchmark results. The steps are as follows:

1. Normalization:

After downloading the TweetEval project from GitHub, we convert the datasets into a standard form. Text normalization is the key step of transforming raw text into a canonical form. It includes:

  • Converting all uppercase letters to lowercase
  • Removing special and unwanted characters, such as HTML tags and punctuation
  • Stripping extra whitespace and emojis

Example: " @user @user pick the GOOD Mood up 😄 " is converted to " pick the GOOD Mood up ".
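As a hedged sketch (the actual cleaning code lives in the notebook; normalize_tweet and its exact rules are assumptions for illustration), the steps above could be implemented roughly like this:

import re

def normalize_tweet(text):
    """Hypothetical sketch of the normalization steps described above."""
    text = re.sub(r'@\w+', '', text)            # drop @user mentions
    text = re.sub(r'<[^>]+>', '', text)         # strip HTML tags
    text = re.sub(r'[^\w\s]', '', text)         # remove punctuation and emojis
    text = re.sub(r'\s+', ' ', text).strip()    # collapse extra whitespace
    return text.lower()                         # lowercase everything

The datasets for the three tasks are then loaded and split as follows: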

#Define the root folder where we put the datasets
base_dir = '/content/tweeteval/datasets/'

#Create hate dataframe
hate = DataPrep(base_dir, 'hate') 
hate_dict_train, hate_dict_val, hate_dict_test = hate.dataframe()
df_hate = hate.dataframe_merge()
not_hate, hate = hate.binary_split()  # note: rebinds `hate` to the hate-labelled dataframe

#Create irony dataframe
irony = DataPrep(base_dir, 'irony') 
irony_dict_train, irony_dict_val, irony_dict_test = irony.dataframe()
df_irony = irony.dataframe_merge()
not_irony, irony = irony.binary_split()

#Create offensive dataframe
offensive = DataPrep(base_dir, 'offensive')
offensive_dict_train, offensive_dict_val, offensive_dict_test = offensive.dataframe()
df_offensive = offensive.dataframe_merge()
not_offensive, offensive = offensive.binary_split()

print("The sample of pre-precessed dataset is shown below:")
hate.head()
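DataPrep is a helper class defined in the notebook rather than in this article. A minimal reconstruction, assuming the standard TweetEval file layout of <split>_text.txt and <split>_labels.txt inside each task folder, might look like this:

import os
import pandas as pd

class DataPrep:
    """Hypothetical reconstruction of the notebook's DataPrep helper."""
    def __init__(self, base_dir, task):
        self.task_dir = os.path.join(base_dir, task)

    def _load(self, split):
        # Each TweetEval split is a pair of parallel text/label files
        with open(os.path.join(self.task_dir, f'{split}_text.txt')) as f:
            texts = [line.strip() for line in f]
        with open(os.path.join(self.task_dir, f'{split}_labels.txt')) as f:
            labels = [int(line.strip()) for line in f]
        return pd.DataFrame({'text': texts, 'label': labels})

    def dataframe(self):
        # Train / validation / test dataframes
        return self._load('train'), self._load('val'), self._load('test')

    def dataframe_merge(self):
        # All three splits stacked into one dataframe
        return pd.concat(self.dataframe(), ignore_index=True)

    def binary_split(self):
        # Negative-class rows first, positive-class rows second
        df = self.dataframe_merge()
        return df[df['label'] == 0], df[df['label'] == 1]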


2. Vectorization

Before building the models and vectorizing the data, we first download the pretrained models from Hugging Face. We use two tokenizers from the Hugging Face Transformers library (https://huggingface.co/transformers/):

  • DistilBertTokenizer: 'distilbert-base-uncased'
  • XLNetTokenizer: 'xlnet-base-cased'

We feed the normalized text into each tokenizer to obtain its features (vectorization), extracting the token ids and attention masks that are passed into the classification model.
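get_tokenization is another notebook helper not shown in the article. A minimal sketch, assuming it pads and truncates to the 120-token length expected by the model inputs below and returns the ids and masks as a pair of arrays:

import numpy as np

def get_tokenization(texts, tokenizer, max_len=120):
    """Hypothetical sketch: encode texts into fixed-length id/mask arrays."""
    enc = tokenizer(
        list(texts),
        max_length=max_len,       # matches Input(shape=(120,)) in the model
        padding='max_length',
        truncation=True,
    )
    return [np.array(enc['input_ids']), np.array(enc['attention_mask'])]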

from transformers import TFXLNetModel, XLNetTokenizer, TFDistilBertModel, DistilBertTokenizer

xlnet_tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
xlnet_model = TFXLNetModel.from_pretrained('xlnet-base-cased')

dbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
dbert_model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')
# DistilBERT tokenization
dl_hate_train_input = get_tokenization(hate_dict_train['text'], dbert_tokenizer)
dl_hate_val_input = get_tokenization(hate_dict_val['text'], dbert_tokenizer)
dl_hate_test_input = get_tokenization(hate_dict_test['text'], dbert_tokenizer)

dl_irony_train_input = get_tokenization(irony_dict_train['text'], dbert_tokenizer)
dl_irony_val_input = get_tokenization(irony_dict_val['text'], dbert_tokenizer)
dl_irony_test_input = get_tokenization(irony_dict_test['text'], dbert_tokenizer)

dl_offensive_train_input = get_tokenization(offensive_dict_train['text'], dbert_tokenizer)
dl_offensive_val_input = get_tokenization(offensive_dict_val['text'], dbert_tokenizer)
dl_offensive_test_input = get_tokenization(offensive_dict_test['text'], dbert_tokenizer)

# XLNet tokenization
xl_hate_train_input = get_tokenization(hate_dict_train['text'], xlnet_tokenizer)
xl_hate_val_input = get_tokenization(hate_dict_val['text'], xlnet_tokenizer)
xl_hate_test_input = get_tokenization(hate_dict_test['text'], xlnet_tokenizer)

xl_irony_train_input = get_tokenization(irony_dict_train['text'], xlnet_tokenizer)
xl_irony_val_input = get_tokenization(irony_dict_val['text'], xlnet_tokenizer)
xl_irony_test_input = get_tokenization(irony_dict_test['text'], xlnet_tokenizer)

xl_offensive_train_input = get_tokenization(offensive_dict_train['text'], xlnet_tokenizer)
xl_offensive_val_input = get_tokenization(offensive_dict_val['text'], xlnet_tokenizer)
xl_offensive_test_input = get_tokenization(offensive_dict_test['text'], xlnet_tokenizer)

3. Model construction

We build the models with TensorFlow Keras. Between the encoder and the output we add a dense layer of 32 units to give the classification head extra capacity, followed by a dropout layer for regularization.

import tensorflow as tf

def create_model(model_name):
    """Creates the model: a pretrained encoder (XLNet or DistilBERT) with a
    classification head added on top. `model_name` is the loaded encoder
    object, e.g. xlnet_model or dbert_model from above.
    """
    # Define token ids and attention masks as inputs
    word_ids = tf.keras.Input(shape=(120,), name='word_ids', dtype='int32')
    word_attention = tf.keras.Input(shape=(120,), name='word_attention', dtype='int32')

    # Run the pretrained encoder; index [0] is the last hidden state
    xlnet_encodings = model_name([word_ids, word_attention])[0]

    # CLASSIFICATION HEAD
    # Collect the last step of the last hidden state (the CLS position for XLNet)
    doc_encoding = tf.squeeze(xlnet_encodings[:, -1:, :], axis=1)
    # Small dense layer to add capacity to the head
    dense = tf.keras.layers.Dense(32, activation='relu', name='encoding')(doc_encoding)
    # Dropout for regularization
    drop = tf.keras.layers.Dropout(0.1)(dense)
    # Final sigmoid output for binary classification
    outputs = tf.keras.layers.Dense(1, activation='sigmoid', name='outputs')(drop)

    # Compile model
    model = tf.keras.Model(inputs=[word_ids, word_attention], outputs=[outputs])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
        loss='binary_crossentropy',
        metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
    )

    return model
The same head is used with both encoders:

  • DistilBertModel: 'distilbert-base-uncased'
  • XLNetModel: 'xlnet-base-cased'
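As a hedged usage sketch (DistilBert_hate matches the name used in the training step below; XLNet_hate is an assumed counterpart):

# One classifier per encoder and per task; only the hate pair is shown
DistilBert_hate = create_model(dbert_model)
XLNet_hate = create_model(xlnet_model)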

4. Training

The hate dataset has the most samples (20970), followed by offensive (14100) and irony (4601). Our first preparation step is to balance each dataset. We wrote a function, imbalance_under_sampling, that balances a dataset by under-sampling: it splits the data into two groups by label, then drops rows from the majority group until both labels have the same number of samples.

import pandas as pd

def imbalance_under_sampling(dfname):
  # Split by label, then down-sample the majority class to match the minority
  df_label_a = dfname[dfname['label'] == 1]
  df_label_b = dfname[dfname['label'] == 0]
  if df_label_a.shape[0] > df_label_b.shape[0]:
    df_label_a = df_label_a.sample(df_label_b.shape[0], random_state=1)
  else:
    df_label_b = df_label_b.sample(df_label_a.shape[0], random_state=1)
  return pd.concat([df_label_b, df_label_a], axis=0)
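As a hedged usage example (the *_balanced names are hypothetical; df_hate, df_irony, and df_offensive come from the data-preparation step above):

df_hate_balanced = imbalance_under_sampling(df_hate)
df_irony_balanced = imbalance_under_sampling(df_irony)
df_offensive_balanced = imbalance_under_sampling(df_offensive)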

Callbacks:


callbacks = [
    # Stop early once the training loss plateaus, keeping the best weights
    tf.keras.callbacks.EarlyStopping(monitor='loss', patience=4, min_delta=0.001, restore_best_weights=True),
    # warmup is a learning-rate schedule function (sketched below)
    tf.keras.callbacks.LearningRateScheduler(warmup, verbose=0),
    # Cut the learning rate when validation accuracy stalls
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_accuracy', factor=1e-6, patience=2, verbose=0, mode='auto', min_delta=0.001, cooldown=0, min_lr=1e-6)
]
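The warmup schedule passed to LearningRateScheduler is defined in the notebook; a minimal sketch, assuming a simple linear warm-up to the 2e-5 base rate used at compile time:

def warmup(epoch, lr, warmup_epochs=3, base_lr=2e-5):
    """Hypothetical linear warm-up: ramp the rate up, then leave it alone."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return lr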

Using the hate task as an example, the training process runs as follows:

1. Feed the vectorized data into the model for training:

hist_dl_hate = DistilBert_hate.fit(
    x=dl_hate_train_input,
    y=hate_dict_train.label,
    epochs=25,
    batch_size=16,
    validation_data=(dl_hate_val_input, hate_dict_val.label),
    callbacks=callbacks,
)

2. Predict on the test set:

preds_dl_hate = DistilBert_hate.predict(dl_hate_test_input, verbose=True)

3. Round the predicted probabilities into binary labels, then display the confusion matrix:

from sklearn.metrics import confusion_matrix

# Round the sigmoid outputs to 0/1 labels
hate_pred_dl_label = [i[0] for i in preds_dl_hate.round().astype(int)]
cm = confusion_matrix(hate_dict_test.label, hate_pred_dl_label)
plot_confusion_matrix(cm, normalize=False, target_names=['Not_hate', 'Hate'], title="DistilBert Confusion Matrix for Hate")
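plot_confusion_matrix is a plotting helper from the notebook; a minimal matplotlib version compatible with the call above (the exact styling is an assumption) could be:

import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, target_names, title='Confusion matrix', normalize=False):
    """Hypothetical minimal version of the notebook's plotting helper."""
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title(title)
    plt.colorbar()
    ticks = np.arange(len(target_names))
    plt.xticks(ticks, target_names, rotation=45)
    plt.yticks(ticks, target_names)
    # Write each count into its cell
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, format(cm[i, j]), ha='center')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()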

The complete notebook with all code and results is available at https://github.com/gibsonx/CE888/blob/master/Assignment/Assignment_2.ipynb and can be opened directly in Colab.

II. Conclusion

Performance comparison between the two models

DistilBERT has fewer parameters and therefore runs faster, making it a good fit for scenarios where results are needed quickly. XLNet, with its larger and more complex parameterization, is better suited to applications that demand higher accuracy.