
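WeChat official account: 尤而小屋
Author: Peter
Editor: Peter

Hi everyone, I'm Peter.

Today's post covers another Kaggle notebook: analyzing an extremely imbalanced credit card dataset. The main topics are:

Understanding the data: histograms and box plots to see how the features are distributed
Preprocessing: scaling and distributions; splitting the data
Random sampling: oversampling and undersampling, with the focus on undersampling
Anomaly detection: how to find outliers in the data and remove them
Modeling: logistic regression and a neural network
Evaluation: several metrics for classification models

Original notebook: http://www.kaggle.com/code/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets/notebook

Imbalance: in credit card data the ratio of fraudulent to non-fraudulent transactions is highly uneven, and non-fraudulent transactions make up the overwhelming majority. This article shows one way to handle such extremely imbalanced data.

Importing libraries

Import the libraries and packages used for plotting, feature engineering, dimensionality reduction, classification models and evaluation metrics.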

```python
import numpy as np
import pandas as pd

import tensorflow as tf

import plotly.express as px
import plotly.graph_objects as go

# Subplots
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns

# Dimensionality reduction
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD

import time

plt.rcParams["font.sans-serif"] = ["SimHei"]  # font used for plot labels
plt.rcParams["axes.unicode_minus"] = False  # render minus signs correctly

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Data splitting, cross-validation and pipelines
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold

from sklearn.pipeline import make_pipeline
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline

# Oversampling
from imblearn.over_sampling import SMOTE

# Undersampling
from imblearn.under_sampling import NearMiss
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, accuracy_score, classification_report)

# Counting
from collections import Counter
import collections
import warnings
warnings.filterwarnings("ignore")
```

Basic information

Read the data and look at its basic properties.
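The loading step itself is not shown in the article. A minimal sketch, assuming the Kaggle CSV has been downloaded locally; the file name `creditcard.csv` and the helper `load_transactions` are assumptions made for illustration and do not come from the source:

```python
import pandas as pd

# Assumed local path to the downloaded Kaggle file (hypothetical; adjust as needed).
CSV_PATH = "creditcard.csv"


def load_transactions(path=CSV_PATH):
    """Read the transactions table into a DataFrame."""
    return pd.read_csv(path)


# Usage (requires the file to exist locally):
# df = load_transactions()
```

Wrapping the call in a small function keeps the path in one place, so the rest of the notebook only depends on the resulting `df`.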

The shape of the data:

In [3]:

df.shape

Out[3]:

(284807, 31)

In [4]:

```
# Largest per-column count of missing values
df.isnull().sum().max()
```

Out[4]:

0

So the data has no missing values.

Next, the column dtypes: there are 30 float64 columns and 1 int64 column.

In [5]:

df.dtypes.value_counts()

Out[5]:

float64    30
int64       1
dtype: int64

In [6]:

columns = df.columns
columns

Out[6]:

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class'], dtype='object')

Summary statistics of the data:

df.describe()

Class imbalance

In [8]:

df["Class"].value_counts(normalize=True)

Out[8]:

0    0.998273  # not fraud
1    0.001727  # fraud
Name: Class, dtype: float64

Class 0 has far more samples than class 1, so the data is extremely imbalanced. That is the problem this article focuses on.
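To see why this imbalance matters, here is a short arithmetic sketch using the counts reported in this article (284807 rows, 492 of them fraud). The "always predict class 0" baseline is an illustration added here, not part of the original notebook:

```python
# Class counts reported in the article: 284807 rows in total, 492 of them fraud.
n_total = 284807
n_fraud = 492
n_normal = n_total - n_fraud

fraud_share = n_fraud / n_total

# A "model" that always predicts class 0 is right on every non-fraud row...
baseline_accuracy = n_normal / n_total
# ...and catches none of the fraud rows.
baseline_recall = 0 / n_fraud

print(f"fraud share:       {fraud_share:.6f}")
print(f"baseline accuracy: {baseline_accuracy:.6f}")
print(f"baseline recall:   {baseline_recall:.1f}")
```

Accuracy alone is therefore a misleading metric on this data, which is why recall, precision and F1 are used later on.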

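In [9]:

```
# Plot
colors = ["red", "blue"]

sns.countplot(x="Class", data=df, palette=colors)
plt.title("Class Distributions \n (0-No Fraud & 1-Fraud)")
plt.show()
```

The bar chart also makes it obvious that the ratio of non-fraud (0) to fraud (1) is extremely imbalanced.

Feature distributions

Some of the features are visibly skewed:

Histograms

In [10]:

```python
fig, ax = plt.subplots(1, 2, figsize=(18, 6))

amount_val = df["Amount"].values
time_val = df["Time"].values

# Note: sns.distplot is deprecated in recent seaborn releases;
# sns.histplot(..., kde=True) is the suggested replacement.
sns.distplot(amount_val, ax=ax[0], color="r")
ax[0].set_title("Amount", fontsize=14)
ax[0].set_xlim([min(amount_val), max(amount_val)])  # x-axis range

sns.distplot(time_val, ax=ax[1], color="b")
ax[1].set_title("Time", fontsize=14)
ax[1].set_xlim([min(time_val), max(time_val)])  # x-axis range

plt.show()
```

Looking at how Amount and Time are distributed over their value ranges, we find:

Amount is heavily skewed: the vast majority of values are concentrated on the left
Time: the values are concentrated mainly in two periods

Box plots of feature distributions

Box plot of the values of each feature:

Data preprocessing

Scaling and distribution

Amount and Time are the only columns that still need scaling. The other columns (V1 to V28) have already been scaled.

StandardScaler: subtracts the mean and divides by the standard deviation
RobustScaler: centers and scales with statistics that are robust to outliers (the median and the interquartile range), so it is the better choice when the data contains outliers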
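The difference between the two scalers can be sketched directly in NumPy. The toy values below are made up for illustration, and the formulas assume the scalers' default settings (population standard deviation for standard scaling, the 25th to 75th percentile range for robust scaling):

```python
import numpy as np

# Toy amounts with one extreme value (made-up numbers for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standard scaling: subtract the mean, divide by the standard deviation.
standard = (x - x.mean()) / x.std()

# Robust scaling: subtract the median, divide by the interquartile range.
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print(standard.round(2))
print(robust)
```

With standard scaling the single extreme value inflates both the mean and the standard deviation, squeezing the typical values together. With robust scaling the typical values keep a sensible spread around 0 and only the outlier ends up far away.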

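In [13]:

```python
from sklearn.preprocessing import StandardScaler, RobustScaler

ss = StandardScaler()
rs = RobustScaler()

# Handy pattern: reshape the column to 2-D, scale it, store it as a new column
df['scaled_amount'] = rs.fit_transform(df['Amount'].values.reshape(-1, 1))
df['scaled_time'] = rs.fit_transform(df['Time'].values.reshape(-1, 1))
```

In [14]:

Drop the original columns and use the scaled columns and data instead

df['Amount'].values.reshape(-1,1) # added by the author: the 2-D array that is passed to the scaler

Tip 1: position of the new columns

Move the newly created columns to the front

```python
# Put the two scaled columns at the front

# 1. Pull them out
scaled_amount = df['scaled_amount']
scaled_time = df['scaled_time']

# 2. Drop them from their current position
df.drop(['scaled_amount', 'scaled_time'], axis=1, inplace=True)

# 3. Insert them at the front
df.insert(0, 'scaled_amount', scaled_amount)
df.insert(1, 'scaled_time', scaled_time)
```

Splitting the data (based on the original DataFrame)

Before starting random undersampling, we need to split the original data.

Although we will undersample and oversample the data, the models should still be tested on the original dataset, with its original class distribution.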

In [18]:

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit

Check the ratio of 0 (no fraud) to 1 (fraud) in Class:

In [19]:

df["Class"].value_counts(normalize=True)

Out[19]:

0    0.998273
1    0.001727
Name: Class, dtype: float64

Build the feature set X and the labels y:

In [20]:

X = df.drop("Class", axis=1)
y = df["Class"]

In [21]:

Tip 2: generating train/test indices

```python
sfk = StratifiedKFold(
    n_splits=5,  # five folds
    random_state=None,
    shuffle=False)  # no shuffling: folds follow row order, stratified by class

for train_index, test_index in sfk.split(X, y):
    # Indices generated for this fold
    print(train_index)
    print("------------")
    print(test_index)

    # Build the data from the generated indices
    # (each pass overwrites the previous one, so the last fold is kept)
    original_X_train = X.iloc[train_index]
    original_X_test = X.iloc[test_index]

    original_y_train = y.iloc[train_index]
    original_y_test = y.iloc[test_index]

# Output:
# [ 30473  30496  31002 ... 284804 284805 284806]
# ------------
# [    0     1     2 ... 57017 57018 57019]
# [     0      1      2 ... 284804 284805 284806]
# ------------
# [ 30473  30496  31002 ... 113964 113965 113966]
# [     0      1      2 ... 284804 284805 284806]
# ------------
# [ 81609  82400  83053 ... 170946 170947 170948]
# [     0      1      2 ... 284804 284805 284806]
# ------------
# [150654 150660 150661 ... 227866 227867 227868]
# [     0      1      2 ... 227866 227867 227868]
# ------------
# [212516 212644 213092 ... 284804 284805 284806]
```

Convert the resulting data to NumPy arrays:

In [22]:

```python
original_Xtrain = original_X_train.values
original_Xtest = original_X_test.values

original_ytrain = original_y_train.values
original_ytest = original_y_test.values
```

Check the unique values in the training labels original_ytrain and the test labels original_ytest, and the share of each:

In [23]:

Tip 3: unique values and their proportions

```python
# Training set
# np.unique works on NumPy arrays
train_unique_label, train_counts_label = np.unique(original_ytrain, return_counts=True)

# Test set
test_unique_label, test_counts_label = np.unique(original_ytest, return_counts=True)
```

In [24]:

```python
print(train_counts_label / len(original_ytrain))
print(test_counts_label / len(original_ytest))

# Output:
# [0.99827076 0.00172924]
# [0.99827952 0.00172048]
```

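Undersampling

How it works

Undersampling (also called downsampling) removes samples from the majority class until it is balanced with the minority class, so that the model does not simply overfit to the majority class.

Steps

Determine how imbalanced the data is: use value_counts() to get the count and share of each class
In this example, once we know the number of fraud samples, we sample the same number of non-fraud samples to get a 50:50 ratio
After sampling, shuffle the resulting subsample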
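The steps above can be sketched as a small helper. This is an illustrative version, not the article's code: the function name `random_undersample` and the toy data are assumptions, and unlike the slicing used in the article it draws the majority rows at random.

```python
import pandas as pd


def random_undersample(frame, label_col, random_state=0):
    """Downsample every class to the size of the smallest class, then shuffle."""
    n_min = frame[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in frame.groupby(label_col)
    ]
    balanced = pd.concat(parts)
    # Shuffle so the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1, random_state=random_state)


# Toy data: 10 rows of class 0 and 3 rows of class 1 (made-up values).
toy = pd.DataFrame({"feature": range(13), "Class": [0] * 10 + [1] * 3})
balanced_toy = random_undersample(toy, "Class")
print(balanced_toy["Class"].value_counts())
```

Fixing `random_state` makes the subsample reproducible, which matters here because every downstream result depends on which majority rows happen to be kept.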

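Drawback

Undersampling discards information. The original data has 284315 non-fraud rows, but only 492 remain after undersampling, so most of the data is thrown away.

Performing the sampling

Take the fraud rows, and take the same number of rows from the non-fraud data:

```python
# Fraud rows
fraud_df = df[df["Class"] == 1]

# Take len(fraud_df) rows from the non-fraud data.
# Note: this slice keeps the first rows in file order; it is only a
# random sample if df was shuffled beforehand.
no_fraud_df = df[df["Class"] == 0][:len(fraud_df)]

# 492 + 492 rows
normal_distributed_df = pd.concat([fraud_df, no_fraud_df])
normal_distributed_df.shape

# Shuffle the rows
new_df = normal_distributed_df.sample(frac=1, random_state=123)
```

Balanced distribution

The classes are now balanced:

In [28]:

```
# Counts
new_df["Class"].value_counts()
```

Out[28]:

1    492
0    492
Name: Class, dtype: int64

In [29]:

```
# Proportions
new_df["Class"].value_counts(normalize=True)
```

Out[29]:

1    0.5
0    0.5
Name: Class, dtype: float64

In [30]:

Plotting the class distribution again confirms that it is now balanced

```
sns.countplot(x="Class", data=new_df, palette=colors)

plt.title("Equally Distributed Classes", fontsize=12)
plt.show()
```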

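Correlation analysis

Correlation matrix heatmaps

```python
# Two stacked axes, one per heatmap. This line is missing from the source;
# the layout and figure size are assumptions.
f, (ax1, ax2) = plt.subplots(2, 1, figsize=(24, 20))

# Original data df
corr = df.corr()
sns.heatmap(corr, cmap="coolwarm_r", annot_kws={"size": 20}, ax=ax1)
ax1.set_title("Imbalanced Correlation Matrix", fontsize=14)

# Undersampled data new_df
new_corr = new_df.corr()
sns.heatmap(new_corr, cmap="coolwarm_r", annot_kws={"size": 20}, ax=ax2)
ax2.set_title("SubSample Correlation Matrix", fontsize=14)

plt.show()
```

Summary:

Positive correlation: V2, V4, V11 and V19 are positively correlated with Class. The larger the value, the more likely the transaction is fraud
Negative correlation: V17, V14, V12 and V10 are negatively correlated with Class. The smaller the value, the more likely the transaction is fraud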
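What "positively" or "negatively correlated with Class" means can be shown on a toy example. The arrays below are made-up values for illustration, not columns of the real dataset:

```python
import numpy as np

# Made-up toy values: a binary label plus two features.
labels = np.array([0, 0, 0, 1, 1, 1])
feature_low_for_fraud = np.array([1.0, 2.0, 1.5, -5.0, -6.0, -4.0])
feature_high_for_fraud = np.array([0.5, 1.0, 0.0, 4.0, 5.0, 6.0])

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is the pairwise coefficient.
corr_neg = np.corrcoef(feature_low_for_fraud, labels)[0, 1]
corr_pos = np.corrcoef(feature_high_for_fraud, labels)[0, 1]

print(round(corr_neg, 3), round(corr_pos, 3))
```

A feature whose values drop for the fraud rows gets a negative coefficient, and one whose values rise gets a positive coefficient. This is the pattern the heatmap row for Class summarizes for every feature at once.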

Box plots

In [32]:

Box plots of the negatively correlated features

```python
# Negative correlation
f, axes = plt.subplots(ncols=4, figsize=(20, 4))

sns.boxplot(x="Class", y="V17", data=new_df, palette=colors, ax=axes[0])
axes[0].set_title('V17')

sns.boxplot(x="Class", y="V14", data=new_df, palette=colors, ax=axes[1])
axes[1].set_title('V14')

sns.boxplot(x="Class", y="V12", data=new_df, palette=colors, ax=axes[2])
axes[2].set_title('V12')

sns.boxplot(x="Class", y="V10", data=new_df, palette=colors, ax=axes[3])
axes[3].set_title('V10')

plt.show()
```

Box plots of the positively correlated features:

```python
# Positive correlation
f, axes = plt.subplots(ncols=4, figsize=(20, 4))

sns.boxplot(x="Class", y="V2", data=new_df, palette=colors, ax=axes[0])
axes[0].set_title('V2')

sns.boxplot(x="Class", y="V4", data=new_df, palette=colors, ax=axes[1])
axes[1].set_title('V4')

sns.boxplot(x="Class", y="V11", data=new_df, palette=colors, ax=axes[2])
axes[2].set_title('V11')

sns.boxplot(x="Class", y="V19", data=new_df, palette=colors, ax=axes[3])
axes[3].set_title('V19')

plt.show()
```

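Anomaly detection

Goal

The main goal of anomaly detection here is to find the outliers in the data and remove them.

Methods

IQR: the interquartile range is the difference between the 75th and the 25th percentile. We set thresholds beyond the 75th and the 25th percentile, and any sample that falls outside them is removed.
Box plot: besides making the 25th and 75th percentiles easy to see (the two ends of the box), it also makes extreme outliers easy to spot (points beyond the lower and upper limits)

Outlier removal trade-off

When removing outliers with the quartile method, the threshold is a multiplier (for example 1.5) times the interquartile range. The larger the multiplier, the fewer outliers are detected; the smaller the multiplier, the more outliers are detected.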
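The trade-off can be sketched in a few lines of NumPy. The helper name `iqr_outliers` and the sample values are made up for illustration:

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return the values outside [q1 - k*IQR, q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up sample with a few extreme points.
sample = np.array([-20, 1, 2, 3, 4, 5, 6, 7, 8, 17, 30], dtype=float)

print(len(iqr_outliers(sample, k=1.5)))  # smaller multiplier: more points flagged
print(len(iqr_outliers(sample, k=3.0)))  # larger multiplier: fewer points flagged
```

Here the quartiles are 2.5 and 7.5, so the IQR is 5. With k=1.5 the limits are -5 and 15 and three points fall outside. With k=3.0 the limits widen to -12.5 and 22.5 and only two points remain flagged.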

Histograms (with a normal fit)

In [34]:

```python
# Distributions of three features (fraud rows only)
from scipy.stats import norm

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 6))

v14_fraud = new_df["V14"].loc[new_df["Class"] == 1].values
sns.distplot(v14_fraud, ax=ax1, fit=norm, color="#FB8861")
ax1.set_title("V14", fontsize=14)

v12_fraud = new_df["V12"].loc[new_df["Class"] == 1].values
sns.distplot(v12_fraud, ax=ax2, fit=norm, color="#56F9BB")
ax2.set_title("V12", fontsize=14)

v10_fraud = new_df["V10"].loc[new_df["Class"] == 1].values
sns.distplot(v10_fraud, ax=ax3, fit=norm, color="#C5B3F9")
ax3.set_title("V10", fontsize=14)  # fixed: the source set this title on ax2

plt.show()
```

Tip: removing outliers

Remove the outliers of the three features, taking V12 as the example:

In [35]:

```
# Fraud values of V12 (a pandas Series)
v12_fraud = new_df["V12"].loc[new_df["Class"] == 1]

# 25th and 75th percentiles
q1, q3 = v12_fraud.quantile(0.25), v12_fraud.quantile(0.75)
iqr = q3 - q1
```

In [36]:

```python
# Lower and upper limits
v12_cut_off = iqr * 1.5

v12_lower = q1 - v12_cut_off
v12_upper = q3 + v12_cut_off

print(v12_lower)
print(v12_upper)

# Output:
# -17.25930926645337
# 5.597044719256134
```

In [37]:

```python
# Identify the outliers
outliers = [x for x in v12_fraud if x < v12_lower or x > v12_upper]
print(outliers)
print("------------")
print("Number of outliers:", len(outliers))

# Output:
# [-17.6316063138707, -17.7691434633638, -18.6837146333443, -18.5536970096458, -18.0475965708216, -18.4311310279993]
# ------------
# Number of outliers: 6
```

Now remove the outliers:

In [38]:

```python
# Tip: how to drop the outlier rows
new_df = new_df.drop(new_df[(new_df["V12"] > v12_upper) | (new_df["V12"] < v12_lower)].index)
new_df
```

The undersampled data had 984 rows and now has 978: the 6 outlier rows were removed.

Apply the same procedure to the other features:

In [39]:

```python
# Same procedure for V14

# Fraud values of V14
v14_fraud = new_df["V14"].loc[new_df["Class"] == 1]
q1, q3 = v14_fraud.quantile(0.25), v14_fraud.quantile(0.75)
iqr = q3 - q1

v14_cut_off = iqr * 1.5
v14_lower = q1 - v14_cut_off
v14_upper = q3 + v14_cut_off

outliers = [x for x in v14_fraud if x < v14_lower or x > v14_upper]

new_df = new_df.drop(new_df[(new_df["V14"] > v14_upper) | (new_df["V14"] < v14_lower)].index)
```

In [40]:

```python
# Same procedure for V10

# Fraud values of V10
v10_fraud = new_df["V10"].loc[new_df["Class"] == 1]
q1, q3 = v10_fraud.quantile(0.25), v10_fraud.quantile(0.75)
iqr = q3 - q1

v10_cut_off = iqr * 1.5
v10_lower = q1 - v10_cut_off
v10_upper = q3 + v10_cut_off

outliers = [x for x in v10_fraud if x < v10_lower or x > v10_upper]

new_df = new_df.drop(new_df[(new_df["V10"] > v10_upper) | (new_df["V10"] < v10_lower)].index)
```

Look at the data after the outliers have been removed:

In [42]:

```python
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 10))

colors = ['#B3F9C5', '#f9c5b3']

sns.boxplot(x="Class", y="V14", data=new_df, ax=ax1, palette=colors)
ax1.set_title("V14", fontsize=14)
ax1.annotate("Fewer extreme", xy=(0.98, -17.5), xytext=(0, -12),
             arrowprops=dict(facecolor="black"), fontsize=14)

sns.boxplot(x="Class", y="V12", data=new_df, ax=ax2, palette=colors)
ax2.set_title("V12", fontsize=14)
ax2.annotate("Fewer extreme", xy=(0.98, -17), xytext=(0, -12),
             arrowprops=dict(facecolor="black"), fontsize=14)

sns.boxplot(x="Class", y="V10", data=new_df, ax=ax3, palette=colors)
ax3.set_title("V10", fontsize=14)
ax3.annotate("Fewer extreme",  # annotation text
             xy=(0.98, -16.5),  # point being annotated
             xytext=(0, -12),  # position of the text, a 2-tuple; defaults to xy
             arrowprops=dict(facecolor="black"),  # arrow color
             fontsize=14)

plt.show()
```

Dimensionality reduction and clustering

Understanding t-SNE

Detailed explanation: http://www.youtube.com/watch?v=NEaUSP4YerM

Dimensionality reduction on the undersampled data

Apply three different dimensionality reduction methods to the undersampled data:

In [43]:

```
X = new_df.drop("Class", axis=1)
y = new_df["Class"]

# t-SNE
t0 = time.time()
X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("T-SNE: ", (t1 - t0))

# Output:
# T-SNE: 5.750015020370483
```

In [44]:

```
# PCA
t0 = time.time()
X_reduced_pca = PCA(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("PCA: ", (t1 - t0))

# Output:
# PCA: 0.02214193344116211
```

In [45]:

```
# TruncatedSVD
t0 = time.time()
X_reduced_svd = TruncatedSVD(n_components=2, algorithm="randomized", random_state=42).fit_transform(X.values)
t1 = time.time()
print("TruncatedSVD: ", (t1 - t0))

# Output:
# TruncatedSVD: 0.01066279411315918
```

Plotting

In [46]:

```python
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24, 6))

# Figure title
f.suptitle("Clusters using Dimensionality Reduction", fontsize=14)

blue_patch = mpatches.Patch(color="#0A0AFF", label="No Fraud")
red_patch = mpatches.Patch(color="#AF0000", label="Fraud")

# One panel per method: t-SNE, PCA, TruncatedSVD
panels = [(ax1, X_reduced_tsne, "t-SNE"),
          (ax2, X_reduced_pca, "PCA"),
          (ax3, X_reduced_svd, "TruncatedSVD")]

for ax, X_reduced, title in panels:
    ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=(y == 0),
               cmap="coolwarm", label="No Fraud", linewidths=2)
    # fixed: the source colored this second layer with (y == 0) as well
    ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=(y == 1),
               cmap="coolwarm", label="Fraud", linewidths=2)
    ax.set_title(title, fontsize=14)  # panel title
    ax.grid(True)  # show the grid
    ax.legend(handles=[blue_patch, red_patch])

plt.show()
```

Classification on the undersampled data

Four classifiers

We train four different classifiers to see which one performs best on the fraud data. First, split the data into a training set and a test set.

In [47]:

```
# 1. Features and labels
X = new_df.drop("Class", axis=1)
y = new_df["Class"]
```

In [48]:

```
# 2. The data is already scaled, so split it directly
from sklearn.model_selection import train_test_split

# 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=44)
```

In [49]:

```
# 3. Convert the data to arrays before passing it to the models
X_train = X_train.values
X_test = X_test.values

y_train = y_train.values
y_test = y_test.values
```

In [50]:

```
# 4. Build the four models
classifiers = {
    "LogisticRegression": LogisticRegression(),
    "KNearest": KNeighborsClassifier(),
    "Support Vector Classifier": SVC(),
    "DecisionTreeClassifier": DecisionTreeClassifier()
}

for key, classifier in classifiers.items():
    classifier.fit(X_train, y_train)  # train the model
    training_score = cross_val_score(classifier,  # model
                                     X_train,  # training data
                                     y_train,
                                     cv=5)  # 5-fold cross-validation

    print("Model:", key,
          "mean score over 5 folds:", round(training_score.mean(), 2) * 100)

# Output:
# Model: LogisticRegression mean score over 5 folds: 93.0
# Model: KNearest mean score over 5 folds: 93.0
# Model: Support Vector Classifier mean score over 5 folds: 93.0
# Model: DecisionTreeClassifier mean score over 5 folds: 91.0
```

Grid search

Run a grid search for each model to find the best hyperparameters.
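A grid search tries every combination of the listed parameter values and keeps the one with the best cross-validation score. A minimal sketch of the enumeration step only, using the standard library; the grid below is an illustrative penalty/C grid, and no model is trained here:

```python
from itertools import product

# Illustrative parameter grid: 2 penalties x 7 values of C.
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
}

names = list(param_grid)
# Cartesian product of all value lists, one dict per candidate setting.
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

print(len(candidates))
print(candidates[0])
```

With 14 candidates and 5-fold cross-validation, the search fits 70 models, which is why the grids are kept small.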

In [51]:

```
from sklearn.model_selection import GridSearchCV

# Logistic regression
# Note: the default lbfgs solver does not support the l1 penalty, so those
# candidates fail to fit and only the l2 candidates are actually compared.
lr_params = {"penalty": ["l1", "l2"],
             "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_lr = GridSearchCV(LogisticRegression(), lr_params)
grid_lr.fit(X_train, y_train)

# Best estimator found by the search
best_para_lr = grid_lr.best_estimator_
best_para_lr
```

Out[51]:

LogisticRegression(C=0.1)

In [52]:

```
# K-nearest neighbors
knn_params = {"n_neighbors": list(range(2, 5, 1)),
              "algorithm": ["auto", "ball_tree", "kd_tree", "brute"]}

grid_knn = GridSearchCV(KNeighborsClassifier(), knn_params)
grid_knn.fit(X_train, y_train)

# Best estimator found by the search
best_para_knn = grid_knn.best_estimator_
best_para_knn
```

Out[52]:

KNeighborsClassifier(n_neighbors=2)

In [53]:

```
# Support vector classifier
svc_params = {"C": [0.5, 0.7, 0.9, 1],
              "kernel": ["rbf", "poly", "sigmoid", "linear"]}

grid_svc = GridSearchCV(SVC(), svc_params)
grid_svc.fit(X_train, y_train)

best_para_svc = grid_svc.best_estimator_
best_para_svc
```

Out[53]:

SVC(C=0.9, kernel='linear')

In [54]:

```
# Decision tree
dt_params = {"criterion": ["gini", "entropy"],
             "max_depth": list(range(2, 5, 1)),
             "min_samples_leaf": list(range(5, 7, 1))}

grid_dt = GridSearchCV(DecisionTreeClassifier(), dt_params)
grid_dt.fit(X_train, y_train)

best_para_dt = grid_dt.best_estimator_
best_para_dt
```

Out[54]:

DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

Retraining and scoring

Recompute the cross-validation scores with the best parameters:

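In [55]:

```
lr_score = cross_val_score(best_para_lr, X_train, y_train, cv=5)

print("Logistic regression cross-validation score:", round(lr_score.mean() * 100, 2).astype(str) + "%")

# Output:
# Logistic regression cross-validation score: 93.63%
```

In [56]:

```
knn_score = cross_val_score(best_para_knn, X_train, y_train, cv=5)

print("KNN cross-validation score:", round(knn_score.mean() * 100, 2).astype(str) + "%")

# Output:
# KNN cross-validation score: 93.37%
```

In [57]:

```
svc_score = cross_val_score(best_para_svc, X_train, y_train, cv=5)

print("SVC cross-validation score:", round(svc_score.mean() * 100, 2).astype(str) + "%")

# Output:
# SVC cross-validation score: 93.5%
```

In [58]:

```
dt_score = cross_val_score(best_para_dt, X_train, y_train, cv=5)

print("Decision tree cross-validation score:", round(dt_score.mean() * 100, 2).astype(str) + "%")

# Output:
# Decision tree cross-validation score: 93.24%
```

Summary: comparing the cross-validation scores of the four models, logistic regression scores highest.

Cross-validation with undersampling

Undersampling here is done with the NearMiss algorithm:

NearMiss-1: keeps the majority-class samples whose average distance to their three nearest minority-class samples is smallest
NearMiss-2: keeps the majority-class samples whose average distance to their three farthest minority-class samples is smallest
NearMiss-3: for each minority-class sample, keeps a given number of its nearest majority-class samples
Most distant: keeps the majority-class samples whose average distance to their three nearest minority-class samples is largest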
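The NearMiss-1 rule can be sketched in plain NumPy. This is an illustration of the selection rule only, under the assumption of Euclidean distance; the function name `nearmiss1_indices` and the toy points are made up, and it is not the imbalanced-learn implementation:

```python
import numpy as np


def nearmiss1_indices(majority, minority, n_keep, n_neighbors=3):
    """Indices of the n_keep majority rows with the smallest average
    distance to their n_neighbors nearest minority rows."""
    # Pairwise Euclidean distances, shape (n_majority, n_minority).
    diffs = majority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Average distance to the n_neighbors nearest minority rows.
    nearest = np.sort(dists, axis=1)[:, :n_neighbors]
    avg = nearest.mean(axis=1)
    return np.sort(np.argsort(avg)[:n_keep])


# Made-up 2-D points.
minority = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
majority = np.array([[0.5, 0.5], [5.0, 5.0], [10.0, 10.0], [1.0, 1.0]])

print(nearmiss1_indices(majority, minority, n_keep=2))
```

The two majority points closest to the minority cluster (rows 0 and 3) are kept and the distant ones are dropped. NearMiss therefore keeps the majority samples near the class boundary rather than a random subset.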

In [59]:

```
undersample_X = df.drop("Class", axis=1)
undersample_y = df["Class"]

sfk = StratifiedKFold(
    n_splits=5,  # five folds
    random_state=None,
    shuffle=False)

for train_index, test_index in sfk.split(undersample_X, undersample_y):
    # print("Train: ", train_index)
    # print("Test: ", test_index)
    undersample_Xtrain = undersample_X.iloc[train_index]
    undersample_Xtest = undersample_X.iloc[test_index]

    undersample_ytrain = undersample_y.iloc[train_index]
    undersample_ytest = undersample_y.iloc[test_index]

undersample_Xtrain = undersample_Xtrain.values
undersample_Xtest = undersample_Xtest.values
undersample_ytrain = undersample_ytrain.values
undersample_ytest = undersample_ytest.values

# Five evaluation metrics
undersample_accuracy = []
undersample_precision = []
undersample_recall = []
undersample_f1 = []
undersample_auc = []
```

Use the NearMiss algorithm and look at the resulting class distribution:

In [60]:

```
X_nearmiss, y_nearmiss = NearMiss().fit_resample(undersample_X.values, undersample_y.values)

# fixed: the source called format(...) as a second print argument instead of str.format
print("NearMiss Label Distributions: {}".format(Counter(y_nearmiss)))

# Output:
# NearMiss Label Distributions: Counter({0: 492, 1: 492})
```

Run the cross-validation:

In [61]:

```
for train, test in sfk.split(undersample_Xtrain, undersample_ytrain):
    undersample_pipeline = imbalanced_make_pipeline(NearMiss(sampling_strategy="majority"), best_para_lr)

    # Train the model
    undersample_model = undersample_pipeline.fit(undersample_Xtrain[train], undersample_ytrain[train])

    # Predict on the validation fold
    undersample_prediction = undersample_model.predict(undersample_Xtrain[test])

    # Score the predictions against the true labels
    undersample_accuracy.append(undersample_pipeline.score(original_Xtrain[test], original_ytrain[test]))
    undersample_precision.append(precision_score(original_ytrain[test], undersample_prediction))
    undersample_recall.append(recall_score(original_ytrain[test], undersample_prediction))
    undersample_f1.append(f1_score(original_ytrain[test], undersample_prediction))
    undersample_auc.append(roc_auc_score(original_ytrain[test], undersample_prediction))
```

Plotting learning curves

In [62]:

from sklearn.model_selection import ShuffleSplit, learning_curve

In [63]:

```python
def plot_learning_curve(est1, est2, est3, est4, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(0.1, 1, 5)):
    f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20, 14), sharey=True)

    if ylim is not None:
        plt.ylim(*ylim)

    panels = [
        (est1, ax1, "Logistic Regression Learning Curve"),
        (est2, ax2, "K-Nearest Neighbors Learning Curve"),
        (est3, ax3, "Support Vector Classifier Learning Curve"),
        (est4, ax4, "Decision Tree Learning Curve"),
    ]

    # The same plotting steps are applied to each of the four models
    for est, ax, title in panels:
        sizes, train_scores, test_scores = learning_curve(
            est, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)

        # Shaded bands: mean score plus/minus one standard deviation
        ax.fill_between(sizes, train_scores_mean - train_scores_std,
                        train_scores_mean + train_scores_std, alpha=0.1,
                        color="#ff9124")
        ax.fill_between(sizes, test_scores_mean - test_scores_std,
                        test_scores_mean + test_scores_std, alpha=0.1,
                        color="#2492ff")
        ax.plot(sizes, train_scores_mean, 'o-', color="#ff9124",
                label="Training score")
        ax.plot(sizes, test_scores_mean, 'o-', color="#2492ff",
                label="Cross-validation score")
        ax.set_title(title, fontsize=14)
        ax.set_xlabel('Training size (m)')
        ax.set_ylabel('Score')
        ax.grid(True)
        ax.legend(loc="best")

    return plt
```

In [64]:

```python
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)

plot_learning_curve(best_para_lr, best_para_knn, best_para_svc, best_para_dt,
                    X_train, y_train, (0.87, 1.01), cv=cv, n_jobs=4)

plt.show()
```

ROC curves

In [65]:

from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import cross_val_predict

In [66]:

```
# The method="decision_function" argument appears to be disabled in the source
# (KNN and decision trees do not provide decision_function), so each call
# returns cross-validated class predictions.
lr_pred = cross_val_predict(best_para_lr, X_train, y_train, cv=5)
                            # method="decision_function"

knn_pred = cross_val_predict(best_para_knn, X_train, y_train, cv=5)
                             # method="decision_function"

svc_pred = cross_val_predict(best_para_svc, X_train, y_train, cv=5)
                             # method="decision_function"

dt_pred = cross_val_predict(best_para_dt, X_train, y_train, cv=5)
                            # method="decision_function"
```

In [67]:

print('Logistic Regression: ', roc_auc_score(y_train, lr_pred))
print('KNears Neighbors: ', roc_auc_score(y_train, knn_pred))
print('Support Vector Classifier: ', roc_auc_score(y_train, svc_pred))
print('Decision Tree Classifier: ', roc_auc_score(y_train, dt_pred))

Logistic Regression: 0.934970120644943
KNears Neighbors: 0.9314677528469951
Support Vector Classifier: 0.9339060209719247
Decision Tree Classifier: 0.930932179501635

In [68]:

```python
log_fpr, log_tpr, log_threshold = roc_curve(y_train, lr_pred)
knear_fpr, knear_tpr, knear_threshold = roc_curve(y_train, knn_pred)
svc_fpr, svc_tpr, svc_threshold = roc_curve(y_train, svc_pred)
tree_fpr, tree_tpr, tree_threshold = roc_curve(y_train, dt_pred)


def graph_roc_curve_multiple(log_fpr, log_tpr, knear_fpr, knear_tpr,
                             svc_fpr, svc_tpr, tree_fpr, tree_tpr):
    plt.figure(figsize=(16, 8))
    plt.title('ROC Curve \n Top 4 Classifiers', fontsize=18)
    plt.plot(log_fpr, log_tpr, label='Logistic Regression Classifier Score: {:.4f}'.format(roc_auc_score(y_train, lr_pred)))
    plt.plot(knear_fpr, knear_tpr, label='KNears Neighbors Classifier Score: {:.4f}'.format(roc_auc_score(y_train, knn_pred)))
    plt.plot(svc_fpr, svc_tpr, label='Support Vector Classifier Score: {:.4f}'.format(roc_auc_score(y_train, svc_pred)))
    plt.plot(tree_fpr, tree_tpr, label='Decision Tree Classifier Score: {:.4f}'.format(roc_auc_score(y_train, dt_pred)))

    # Diagonal: the ROC curve of a random classifier
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([-0.01, 1, 0, 1])

    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.annotate('Minimum ROC Score of 50% \n (This is the minimum score to get)',
                 xy=(0.5, 0.5),
                 xytext=(0.6, 0.3),
                 arrowprops=dict(facecolor='#6E726D', shrink=0.05))
    plt.legend()


graph_roc_curve_multiple(log_fpr, log_tpr, knear_fpr, knear_tpr,
                         svc_fpr, svc_tpr, tree_fpr, tree_tpr)
plt.show()
```

Exploring the logistic regression metrics

A closer look at the classification metrics of the logistic regression model:

In [69]:

```python
def logistic_roc_curve(log_fpr, log_tpr):
    plt.figure(figsize=(12, 8))
    plt.title('Logistic Regression ROC Curve', fontsize=16)
    plt.plot(log_fpr, log_tpr, 'b-', linewidth=2)
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.axis([-0.01, 1, 0, 1])


logistic_roc_curve(log_fpr, log_tpr)
plt.show()
```

```
from sklearn.metrics import precision_recall_curve

precision, recall, threshold = precision_recall_curve(y_train, lr_pred)
```

In [71]:

```python
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
y_pred = best_para_lr.predict(X_train)

# Overfitting case: predictions on the undersampled training split X_train
print('---' * 20)
print('Recall Score: {:.2f}'.format(recall_score(y_train, y_pred)))
print('Precision Score: {:.2f}'.format(precision_score(y_train, y_pred)))
print('F1 Score: {:.2f}'.format(f1_score(y_train, y_pred)))
print('Accuracy Score: {:.2f}'.format(accuracy_score(y_train, y_pred)))

# Mean cross-validation scores collected from the NearMiss pipeline
print('---' * 20)
print("Accuracy Score: {:.2f}".format(np.mean(undersample_accuracy)))
print("Precision Score: {:.2f}".format(np.mean(undersample_precision)))
print("Recall Score: {:.2f}".format(np.mean(undersample_recall)))
print("F1 Score: {:.2f}".format(np.mean(undersample_f1)))
print('---' * 20)

# Output (separator lines omitted):
# Scores on X_train (undersampled split)
# Recall Score: 0.92
# Precision Score: 0.79
# F1 Score: 0.85
# Accuracy Score: 0.84

# Mean cross-validation scores of the NearMiss pipeline
# Accuracy Score: 0.75
# Precision Score: 0.00
# Recall Score: 0.24
# F1 Score: 0.00
```
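A precision and F1 of 0.00 next to a non-zero recall can look contradictory. The sketch below shows how it can happen when a model evaluated on heavily imbalanced data raises many false alarms. The counts are hypothetical and chosen only for illustration; they are not taken from the article's run:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 20 frauds caught, 60 missed, 9000 false alarms.
p, r, f1 = precision_recall_f1(tp=20, fp=9000, fn=60)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because non-fraud rows outnumber fraud rows by several hundred to one, even a modest false-positive rate produces far more false alarms than true detections. Precision, and with it F1, then rounds to 0.00 at two decimals.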

Kaggle in practice: analyzing extremely imbalanced credit card data
