From 0 to 1: Using Sklearn, a Machine Learning Workhorse, for an Enterprise-Grade Text Classification Case_阿杰筆記


Preface

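Imagine being able to gauge the mood of people on the internet. You may not care about all of it, only whether people on your favorite social media platform are feeling happy today. After this tutorial you will have the tools to find out. Along the way, you will see how a bag-of-words representation and a classic classifier can be applied to text.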

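Using machine learning to read emotion from text is called sentiment analysis, and it is one of the most prominent use cases of text classification. It belongs to natural language processing (NLP), a very active research field. Other common use cases of text classification include spam detection, automatic tagging of customer queries, and sorting text into predefined topics. So how can you do this? Here we use the bag-of-words model. A bag-of-words model represents each text by the words it contains and how often each one occurs, ignoring the order in which they appear.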
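To make the idea concrete, here is a minimal sketch of a bag of words using only the Python standard library. The helper name `bag_of_words` and the sample sentences are illustrative assumptions, not part of the dataset. The point is that two sentences with the same words in a different order yield the same bag.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(text.lower().split())


# Hypothetical sample sentences, chosen only to illustrate the idea.
a = bag_of_words("the food was good not bad")
b = bag_of_words("the food was bad not good")

print(a["food"])  # 1
print(a == b)     # True: word order is discarded, only counts remain
```

This also exposes a limitation: the two sentences mean opposite things, yet their bags are identical.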

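Main Text

Choosing a Dataset

Before we start, let's look at the data we have. Go ahead and download the Sentiment Labelled Sentences dataset from the UCI Machine Learning Repository.

By the way, this repository is an excellent source of machine learning datasets when you want to try out some algorithms. The dataset contains labelled reviews from IMDb, Amazon, and Yelp. Each review is labelled 0 for negative sentiment or 1 for positive sentiment.

Loading the Dataset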

We read the files with pandas and then merge them with `pd.concat(df_list)`. Note that this concatenates along the row axis by default, so the frames are stacked on top of each other and the number of rows grows.

```python
import pandas as pd

filepath_dict = {
    'yelp': 'data/sentiment_analysis/yelp_labelled.txt',
    'amazon': 'data/sentiment_analysis/amazon_cells_labelled.txt',
    'imdb': 'data/sentiment_analysis/imdb_labelled.txt',
}

df_list = []
for source, filepath in filepath_dict.items():
    # Each line holds a sentence and its label, separated by a tab
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0])
```
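One detail worth checking after the merge: by default `pd.concat` keeps each frame's original row index, so index values repeat across sources. The sketch below reproduces this with two tiny in-memory files (the sample rows are made up for illustration) and shows `ignore_index=True` as one way to get a fresh index.

```python
import io

import pandas as pd

# Hypothetical tab-separated samples standing in for the real files.
samples = {
    'yelp': "Great place.\t1\nSlow service.\t0\n",
    'imdb': "A wonderful film.\t1\nBoring plot.\t0\n",
}

frames = []
for source, text in samples.items():
    frame = pd.read_csv(io.StringIO(text), names=['sentence', 'label'], sep='\t')
    frame['source'] = source
    frames.append(frame)

stacked = pd.concat(frames)                         # rows stacked, index repeats
renumbered = pd.concat(frames, ignore_index=True)   # rows stacked, index 0..n-1

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

The tutorial's code filters by the `source` column rather than by index, so the repeated index is harmless there. It only matters if you later select rows by label with `df.loc`.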

Turning Text into Features with CountVectorizer

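For feature extraction we convert each sentence into a vector of word counts. scikit-learn ships `CountVectorizer` for exactly this, so it is convenient to use.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['John likes ice cream', 'John hates chocolate.']

# min_df is left at its default of 1, because newer scikit-learn releases
# reject the integer 0; lowercase=False keeps 'John' capitalized
vectorizer = CountVectorizer(lowercase=False)
vectorizer.fit(sentences)
print(vectorizer.vocabulary_)                      # token -> column index
print(vectorizer.transform(sentences).toarray())   # one count row per sentence
```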
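To see what the vectorizer is doing, the same result can be approximated in plain Python. This is a simplified sketch, not scikit-learn's implementation: it assumes the default token pattern (words of two or more characters), keeps case as with `lowercase=False`, and assigns column indices in sorted order.

```python
import re

sentences = ['John likes ice cream', 'John hates chocolate.']

# Tokens of two or more word characters, as in CountVectorizer's default pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text)


# Vocabulary: every distinct token, mapped to a column index in sorted order.
tokens = sorted({tok for s in sentences for tok in tokenize(s)})
vocabulary = {tok: idx for idx, tok in enumerate(tokens)}

# One row per sentence, one column per vocabulary entry, cells hold counts.
vectors = [[tokenize(s).count(tok) for tok in tokens] for s in sentences]

print(vocabulary)
# {'John': 0, 'chocolate': 1, 'cream': 2, 'hates': 3, 'ice': 4, 'likes': 5}
print(vectors)
# [[1, 0, 1, 0, 1, 1], [1, 1, 0, 1, 0, 0]]
```

`John` sorts first because uppercase letters precede lowercase ones. The trailing period in `chocolate.` is dropped by the token pattern.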

Splitting the Dataset with train_test_split

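Splitting works the same way as for any other dataset: call the `train_test_split` API directly. Here we take the Yelp subset, hold out 25% of it for testing, and fix `random_state` so the split is reproducible.

```python
from sklearn.model_selection import train_test_split

# Start with the Yelp reviews only
df_yelp = df[df['source'] == 'yelp']

sentences = df_yelp['sentence'].values
y = df_yelp['label'].values

sentences_train, sentences_test, y_train, y_test = train_test_split(
    sentences, y, test_size=0.25, random_state=1000)
```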
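For intuition about `test_size` and `random_state`, the following is a rough standard-library sketch of a shuffled split. The function `simple_split` is hypothetical and does not reproduce scikit-learn's exact shuffling, so it will not select the same rows as `train_test_split`. It only illustrates that a fixed seed gives a repeatable 75/25 partition in which sentences and labels stay paired.

```python
import math
import random


def simple_split(items, labels, test_size=0.25, seed=1000):
    """Shuffle indices with a fixed seed, then cut off a test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [items[i] for i in train_idx],
        [items[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


# Hypothetical stand-in data: 1000 numbered sentences with alternating labels.
items = [f"sentence {i}" for i in range(1000)]
labels = [i % 2 for i in range(1000)]

train, test, y_tr, y_te = simple_split(items, labels)
print(len(train), len(test))  # 750 250
```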

Classifying with the Bag-of-Words Model

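For classification we use logistic regression. Despite its name, logistic regression is a classification algorithm, and a classic one, which makes it a sensible baseline here. The first part of the code trains on the Yelp split from the previous step; the loop then repeats the same procedure for each source.

```python
from sklearn.linear_model import LogisticRegression

# Build the vocabulary from the training sentences only, then vectorize both splits
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)  # mean accuracy on the test split

print("Accuracy:", score)

# Repeat the same steps for each source (yelp, amazon, imdb)
for source in df['source'].unique():
    df_source = df[df['source'] == source]
    sentences = df_source['sentence'].values
    y = df_source['label'].values

    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.25, random_state=1000)

    vectorizer = CountVectorizer()
    vectorizer.fit(sentences_train)
    X_train = vectorizer.transform(sentences_train)
    X_test = vectorizer.transform(sentences_test)

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    score = classifier.score(X_test, y_test)
    print('Accuracy for {} data: {:.4f}'.format(source, score))
```

That's all there is to it. Go ahead and try it out!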
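As a possible refinement, the vectorizer and the classifier can be wrapped in a single scikit-learn pipeline, so that fitting and prediction take raw sentences directly. This is a sketch on four made-up sentences, not the tutorial's dataset, so the printed prediction only shows the mechanics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = positive, 0 = negative.
train_sentences = ['great food', 'loved it', 'terrible service', 'awful taste']
train_labels = [1, 1, 0, 0]

# Fitting the pipeline learns the vocabulary from the training text,
# then trains the classifier on the resulting count vectors.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_sentences, train_labels)

print(model.predict(['great food loved it']))
```

With a pipeline, `model.score(sentences_test, y_test)` accepts the raw test sentences, which makes it harder to fit the vocabulary on test data by accident.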

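Summary

The bag-of-words model is a classic text representation in machine learning. It is called a "bag" because it records only which words occur and how often, and ignores their order and the structure of the text. As a result it works well for telling texts apart by their vocabulary. Logistic regression, in turn, is a mature and well-understood classification algorithm, so the combination is a practical choice for classification tasks in production settings as well.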