Kaggle in practice: analyzing extremely imbalanced credit card data
theme: smartblue
Official account: 尤而小屋
Author: Peter
Editor: Peter
Hi everyone, I'm Peter~
Today I'm bringing you a new Kaggle article: an analysis of extremely imbalanced credit card data. The main topics are:
- Understanding the data: use histograms, box plots, etc. to understand the distributions
- Preprocessing: scaling and distribution checks; data splitting
- Random sampling: oversampling and undersampling, mainly undersampling
- Outlier detection: how to find outliers in the data and remove them
- Modeling: modeling and analysis with logistic regression and neural networks
- Model evaluation: multiple evaluation metrics for classification models
The original notebook is here: http://www.kaggle.com/code/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets/notebook
Imbalance: in credit card data the ratio of fraudulent to non-fraudulent transactions is skewed — non-fraudulent transactions make up the overwhelming majority. This article walks through one way to handle such extremely imbalanced data.
Importing libraries
Import the various libraries and packages: plotting, feature engineering, dimensionality reduction, classification models, evaluation metrics, etc.
```python
import numpy as np
import pandas as pd

import tensorflow as tf
import plotly_express as px
import plotly.graph_objects as go

# Subplots
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns

# Dimensionality reduction
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD

import time

plt.rcParams["font.sans-serif"] = ["SimHei"]  # set the font
plt.rcParams["axes.unicode_minus"] = False    # display minus signs correctly

# Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Feature-engineering / model-selection utilities
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold

from sklearn.pipeline import make_pipeline
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline

# Oversampling
from imblearn.over_sampling import SMOTE

# Undersampling
from imblearn.under_sampling import NearMiss
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report

# Counting
from collections import Counter
import collections
import warnings
warnings.filterwarnings("ignore")
```
Basic information
Read in the data and look at its basic information.
The shape of the data:
In [3]:
df.shape
Out[3]:
(284807, 31)
In [4]:
```
# Maximum number of missing values in any column
df.isnull().sum().max()
```
Out[4]:
0
The result shows there are no missing values.
Next, check the column dtypes: there are 30 float64 columns and 1 int64 column.
In [5]:
pd.value_counts(df.dtypes)
Out[5]:
float64 30
int64 1
dtype: int64
In [6]:
columns = df.columns
columns
Out[6]:
Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
'Class'],
dtype='object')
Look at the summary statistics of the data:
df.describe()
Imbalance between positive and negative samples
In [8]:
df["Class"].value_counts(normalize=True)
Out[8]:
0    0.998273    # no fraud
1    0.001727    # fraud
Name: Class, dtype: float64
Samples of class 0 far outnumber samples of class 1; the data are extremely imbalanced. This is exactly the problem this article focuses on.
In [9]:
```
# Plot the class distribution
colors = ["red", "blue"]

sns.countplot(x="Class", data=df, palette=colors)
plt.title("Class Distributions \n (0-No Fraud & 1-Fraud)")
plt.show()
```
The bar chart also makes it obvious that the ratio of no-fraud (0) to fraud (1) is extremely imbalanced.
Inspecting feature distributions
The distributions of some features show clear skewness:
Histograms
In [10]:
```python
fig, ax = plt.subplots(1, 2, figsize=(18, 6))

amount_val = df["Amount"].values
time_val = df["Time"].values

sns.distplot(amount_val, ax=ax[0], color="r")
ax[0].set_title("Amount", fontsize=14)
ax[0].set_xlim([min(amount_val), max(amount_val)])  # set the axis range

sns.distplot(time_val, ax=ax[1], color="b")
ax[1].set_title("Time", fontsize=14)
ax[1].set_xlim([min(time_val), max(time_val)])  # set the axis range

plt.show()
```
Looking at how Amount and Time are distributed:
- Amount is heavily skewed; the vast majority of values are concentrated on the left
- Time is concentrated mainly in two periods
Box plots of the feature distributions
Box plots of each feature's values:
Data preprocessing
Scaling and distribution
Scale the Amount and Time columns; the other columns have already been scaled.
- StandardScaler: subtract the mean and divide by the standard deviation
- RobustScaler: centers and scales with statistics that are robust to outliers (median and IQR), so it suits data with extreme values (a quick comparison sketch follows)
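To make the difference concrete, here is a minimal sketch (not part of the original notebook) contrasting the two scalers on a toy column containing one outlier; the array name `toy` is purely illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

toy = np.array([[1.0], [2.0], [3.0], [1000.0]])  # 1000 plays the role of an outlier

# StandardScaler uses mean/std, so the outlier squeezes the normal points together
print(StandardScaler().fit_transform(toy).ravel())
# RobustScaler uses median/IQR, so the normal points keep a sensible scale
print(RobustScaler().fit_transform(toy).ravel())
```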
In [13]:
```python
from sklearn.preprocessing import StandardScaler, RobustScaler

ss = StandardScaler()
rs = RobustScaler()

# RobustScaler is the better choice here because of the outliers
df['scaled_amount'] = rs.fit_transform(df['Amount'].values.reshape(-1, 1))
df['scaled_time'] = rs.fit_transform(df['Time'].values.reshape(-1, 1))
```
In [14]:
Drop the original columns and use the scaled columns/data instead:
df['Amount'].values.reshape(-1,1)  # added by the author, inspection only
df.drop(['Time', 'Amount'], axis=1, inplace=True)  # assumption: the raw columns are removed here, as the text above states
Trick 1: position of the new columns
Move the newly created columns to the front of the DataFrame.
```python
# Put the two scaled columns at the front
# 1. Pull them out
scaled_amount = df['scaled_amount']
scaled_time = df['scaled_time']

# 2. Drop the appended columns
df.drop(['scaled_amount', 'scaled_time'], axis=1, inplace=True)

# 3. Insert them at positions 0 and 1
df.insert(0, 'scaled_amount', scaled_amount)
df.insert(1, 'scaled_time', scaled_time)
```
Splitting the data (based on the original DataFrame)
Before performing random undersampling, we need to split the original data.
Even though we will undersample and oversample the training data, we still want to test on the original (untouched) data.
In [18]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
Check the proportions of 0 (no fraud) and 1 (fraud) in Class:
In [19]:
df["Class"].value_counts(normalize=True)
Out[19]:
0 0.998273
1 0.001727
Name: Class, dtype: float64
Create the feature matrix X and the label vector y:
In [20]:
X = df.drop("Class", axis=1)
y = df["Class"]
In [21]:
Trick 2: generating stratified train/test indices
```python
sfk = StratifiedKFold(
    n_splits=5,        # 5 folds
    random_state=None,
    shuffle=False)

for train_index, test_index in sfk.split(X, y):   # stratified indices for each fold
    print(train_index)
    print("------------")
    print(test_index)

    # Rebuild the datasets from the generated indices
    original_X_train = X.iloc[train_index]
    original_X_test = X.iloc[test_index]
    original_y_train = y.iloc[train_index]
    original_y_test = y.iloc[test_index]

# Output (abridged): the printed train/test index arrays for each fold, e.g.
# [ 30473  30496  31002 ... 284804 284805 284806]
# ------------
# [     0      1      2 ...  57017  57018  57019]
```
Convert the resulting data into NumPy arrays:
In [22]:
```python
original_Xtrain = original_X_train.values
original_Xtest = original_X_test.values

original_ytrain = original_y_train.values
original_ytest = original_y_test.values
```
Check the unique values in original_ytrain and original_ytest and the proportion of each:
In [23]:
Trick 3: unique values and their proportions
```python
# Training set (np.unique works on NumPy arrays)
train_unique_label, train_counts_label = np.unique(original_ytrain, return_counts=True)

# Test set
test_unique_label, test_counts_label = np.unique(original_ytest, return_counts=True)
```
In [24]:
```python
print(train_counts_label / len(original_ytrain))
print(test_counts_label / len(original_ytest))

[0.99827076 0.00172924]
[0.99827952 0.00172048]
```
Undersampling
Principle
Undersampling (also called downsampling) removes samples from the majority class until it is balanced with the minority class, which keeps the model from overfitting to the majority class. (A library-based shortcut is sketched right below.)
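Besides slicing the DataFrame by hand (as done later in this section), imblearn ships a ready-made random undersampler. A minimal sketch, assuming the full X and y defined earlier — an alternative, not what this article itself runs:

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_rus))  # both classes reduced to the minority count (492)
```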
Steps
- Determine how imbalanced the data are: use value_counts() to get the count and proportion of each class
- In this case, once we know the number of fraud samples, we sample the same number of no-fraud samples to reach a 50%/50% split
- After sampling, shuffle the resulting subsample
Drawback
Undersampling discards information. The original data contain 284,315 no-fraud records, but after undersampling only 492 remain — the vast majority of the data is thrown away.
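This information loss is the usual argument for the complementary technique, oversampling, which is why SMOTE was imported at the top. A minimal sketch (not executed in this article), applied only to the training arrays so the test data stay untouched:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# Synthesize new minority samples instead of discarding majority ones
X_sm, y_sm = SMOTE(random_state=42).fit_resample(original_Xtrain, original_ytrain)
print(Counter(y_sm))  # both classes now have the majority-class count
```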
Performing the undersampling
Take all fraud records and the same number of no-fraud records:
```python
# Fraud records
fraud_df = df[df["Class"] == 1]

# Take the same number of no-fraud records, len(fraud_df)
no_fraud_df = df[df["Class"] == 0][:len(fraud_df)]

# 492 + 492
normal_distributed_df = pd.concat([fraud_df, no_fraud_df])
normal_distributed_df.shape

# Shuffle the combined data
new_df = normal_distributed_df.sample(frac=1, random_state=123)
```
Balanced distribution
Now the classes are balanced:
In [28]:
```
# Counts per class
new_df["Class"].value_counts()
```
Out[28]:
1 492
0 492
Name: Class, dtype: int64
In [29]:
```
# Proportions per class
new_df["Class"].value_counts(normalize=True)
```
Out[29]:
1 0.5
0 0.5
Name: Class, dtype: float64
In [30]:
Plotting the class distribution again shows that it is now balanced:
```
sns.countplot(x="Class", data=new_df, palette=colors)
plt.title("Equally Distributed Classes", fontsize=12)
plt.show()
```
Correlation analysis
Correlation analysis is based on the correlation matrix. Below we plot the correlation heatmaps for the original data and for the undersampled data:
Correlation heatmaps
In [31]:
```python
f, (ax1, ax2) = plt.subplots(2, 1, figsize=(24, 20))

# Original data df
corr = df.corr()
sns.heatmap(corr, cmap="coolwarm_r", annot_kws={"size": 20}, ax=ax1)
ax1.set_title("Imbalanced Correlation Matrix", fontsize=14)

# Undersampled data new_df
new_corr = new_df.corr()
sns.heatmap(new_corr, cmap="coolwarm_r", annot_kws={"size": 20}, ax=ax2)
ax2.set_title("SubSample Correlation Matrix", fontsize=14)

plt.show()
```
Summary:
- Positive correlation: V2, V4, V11 and V19 are positively correlated with Class; the larger their values, the more likely the transaction is fraud
- Negative correlation: V17, V14, V12 and V10 are negatively correlated with Class; the smaller their values, the more likely the transaction is fraud
Box plots
In [32]:
Box plots of the negatively correlated features
```python
# Negatively correlated features
f, axes = plt.subplots(ncols=4, figsize=(20, 4))

sns.boxplot(x="Class", y="V17", data=new_df, palette=colors, ax=axes[0])
axes[0].set_title('V17')

sns.boxplot(x="Class", y="V14", data=new_df, palette=colors, ax=axes[1])
axes[1].set_title('V14')

sns.boxplot(x="Class", y="V12", data=new_df, palette=colors, ax=axes[2])
axes[2].set_title('V12')

sns.boxplot(x="Class", y="V10", data=new_df, palette=colors, ax=axes[3])
axes[3].set_title('V10')

plt.show()
```
Box plots of the positively correlated features:
```python
# Positively correlated features
f, axes = plt.subplots(ncols=4, figsize=(20, 4))

sns.boxplot(x="Class", y="V2", data=new_df, palette=colors, ax=axes[0])
axes[0].set_title('V2')

sns.boxplot(x="Class", y="V4", data=new_df, palette=colors, ax=axes[1])
axes[1].set_title('V4')

sns.boxplot(x="Class", y="V11", data=new_df, palette=colors, ax=axes[2])
axes[2].set_title('V11')

sns.boxplot(x="Class", y="V19", data=new_df, palette=colors, ax=axes[3])
axes[3].set_title('V19')

plt.show()
```
Outlier detection
Goal
The goal of outlier detection here is to find the outliers in the data and remove them.
Methods
- IQR (interquartile range): the difference between the 75th and 25th percentiles. The aim is to build thresholds beyond the 75th and 25th percentiles; any instance falling outside those thresholds is removed
- Box plot: besides making the 25th and 75th percentiles (the two ends of the box) easy to read, it makes extreme outliers (points beyond the lower and upper whiskers) easy to spot
Trade-off in removing outliers
When removing outliers with the IQR rule, the threshold is obtained by multiplying the IQR by a factor (e.g. 1.5). The larger that factor, the fewer outliers are detected; the smaller it is, the more are detected. Because the same rule is applied to several features below, a reusable helper is sketched next.
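A small helper wrapping this rule — just a sketch, the function name `remove_iqr_outliers` is illustrative, and the per-feature cells below spell the same steps out explicitly:

```python
def remove_iqr_outliers(frame, col, k=1.5):
    """Drop rows whose `col` value falls outside [q1 - k*IQR, q3 + k*IQR],
    with the quartiles computed on the fraud subset (Class == 1)."""
    vals = frame[col].loc[frame["Class"] == 1]
    q1, q3 = vals.quantile(0.25), vals.quantile(0.75)
    cut = (q3 - q1) * k
    lower, upper = q1 - cut, q3 + cut
    return frame.drop(frame[(frame[col] > upper) | (frame[col] < lower)].index)

# new_df = remove_iqr_outliers(new_df, "V12")  # equivalent to the cells below
```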
Histograms (with a normal fit)
In [34]:
```python
# Distributions of the three features on the fraud class
from scipy.stats import norm

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 6))

v14_fraud = new_df["V14"].loc[new_df["Class"] == 1].values
sns.distplot(v14_fraud, ax=ax1, fit=norm, color="#FB8861")
ax1.set_title("V14", fontsize=14)

v12_fraud = new_df["V12"].loc[new_df["Class"] == 1].values
sns.distplot(v12_fraud, ax=ax2, fit=norm, color="#56F9BB")
ax2.set_title("V12", fontsize=14)

v10_fraud = new_df["V10"].loc[new_df["Class"] == 1].values
sns.distplot(v10_fraud, ax=ax3, fit=norm, color="#C5B3F9")
ax3.set_title("V10", fontsize=14)

plt.show()
```
Trick: removing outliers
Remove the outliers of the three features; take V12 as an example:
In [35]:
```
# V12 values for the fraud class
v12_fraud = new_df["V12"].loc[new_df["Class"] == 1]

# 25th and 75th percentiles
q1, q3 = v12_fraud.quantile(0.25), v12_fraud.quantile(0.75)
iqr = q3 - q1
```
In [36]:
```python
# Lower and upper bounds
v12_cut_off = iqr * 1.5

v12_lower = q1 - v12_cut_off
v12_upper = q3 + v12_cut_off

print(v12_lower)
print(v12_upper)

-17.25930926645337
5.597044719256134
```
In [37]:
```python
# Find the outliers
outliers = [x for x in v12_fraud if x < v12_lower or x > v12_upper]
print(outliers)
print("------------")
print("Number of outliers:", len(outliers))

[-17.6316063138707, -17.7691434633638, -18.6837146333443, -18.5536970096458, -18.0475965708216, -18.4311310279993]
Number of outliers: 6
```
Now remove the outliers:
In [38]:
```python
# Trick: how to drop the outlier rows
new_df = new_df.drop(new_df[(new_df["V12"] > v12_upper) | (new_df["V12"] < v12_lower)].index)
new_df
```
Apply the same procedure to the other features:
The undersampled data originally had 984 rows; after removing the 6 V12 outliers it has 978 rows.
In [39]:
```python
# Same procedure for V14
v14_fraud = new_df["V14"].loc[new_df["Class"] == 1]
q1, q3 = v14_fraud.quantile(0.25), v14_fraud.quantile(0.75)
iqr = q3 - q1

v14_cut_off = iqr * 1.5
v14_lower = q1 - v14_cut_off
v14_upper = q3 + v14_cut_off

outliers = [x for x in v14_fraud if x < v14_lower or x > v14_upper]
new_df = new_df.drop(new_df[(new_df["V14"] > v14_upper) | (new_df["V14"] < v14_lower)].index)
```
In [40]:
```python
# Same procedure for V10
v10_fraud = new_df["V10"].loc[new_df["Class"] == 1]
q1, q3 = v10_fraud.quantile(0.25), v10_fraud.quantile(0.75)
iqr = q3 - q1

v10_cut_off = iqr * 1.5
v10_lower = q1 - v10_cut_off
v10_upper = q3 + v10_cut_off

outliers = [x for x in v10_fraud if x < v10_lower or x > v10_upper]
new_df = new_df.drop(new_df[(new_df["V10"] > v10_upper) | (new_df["V10"] < v10_lower)].index)
```
Look at the data after the outliers have been removed:
In [42]:
```python
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 10))

colors = ['#B3F9C5', '#f9c5b3']

sns.boxplot(x="Class", y="V14", data=new_df, ax=ax1, palette=colors)
ax1.set_title("V14", fontsize=14)
ax1.annotate("Fewer extreme", xy=(0.98, -17.5), xytext=(0, -12),
             arrowprops=dict(facecolor="black"), fontsize=14)

sns.boxplot(x="Class", y="V12", data=new_df, ax=ax2, palette=colors)
ax2.set_title("V12", fontsize=14)
ax2.annotate("Fewer extreme", xy=(0.98, -17), xytext=(0, -12),
             arrowprops=dict(facecolor="black"), fontsize=14)

sns.boxplot(x="Class", y="V10", data=new_df, ax=ax3, palette=colors)
ax3.set_title("V10", fontsize=14)
ax3.annotate("Fewer extreme",                     # annotation text
             xy=(0.98, -16.5),                    # point being annotated
             xytext=(0, -12),                     # text position (defaults to xy)
             arrowprops=dict(facecolor="black"),  # arrow color
             fontsize=14)

plt.show()
```
Dimensionality reduction and clustering
Understanding t-SNE
A detailed explanation: http://www.youtube.com/watch?v=NEaUSP4YerM
Dimensionality reduction on the undersampled data
Apply three different dimensionality-reduction methods to the undersampled data:
In [43]:
```
X = new_df.drop("Class", axis=1)
y = new_df["Class"]

# t-SNE
t0 = time.time()
X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("T-SNE: ", (t1 - t0))

T-SNE:  5.750015020370483
```
In [44]:
```
# PCA
t0 = time.time()
X_reduced_pca = PCA(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("PCA: ", (t1 - t0))

PCA:  0.02214193344116211
```
In [45]:
```
# TruncatedSVD
t0 = time.time()
X_reduced_svd = TruncatedSVD(n_components=2, algorithm="randomized", random_state=42).fit_transform(X.values)
t1 = time.time()
print("TruncatedSVD: ", (t1 - t0))

TruncatedSVD:  0.01066279411315918
```
Plotting
In [46]:
```python
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24, 6))

# Figure title
f.suptitle("Clusters using Dimensionality Reduction", fontsize=14)

blue_patch = mpatches.Patch(color="#0A0AFF", label="No Fraud")
red_patch = mpatches.Patch(color="#AF0000", label="Fraud")

# t-SNE
ax1.scatter(X_reduced_tsne[:, 0], X_reduced_tsne[:, 1], c=(y == 0), cmap="coolwarm", label="No Fraud", linewidths=2)
ax1.scatter(X_reduced_tsne[:, 0], X_reduced_tsne[:, 1], c=(y == 1), cmap="coolwarm", label="Fraud", linewidths=2)
ax1.set_title("t-SNE", fontsize=14)  # subplot title
ax1.grid(True)                       # grid
ax1.legend(handles=[blue_patch, red_patch])

# PCA
ax2.scatter(X_reduced_pca[:, 0], X_reduced_pca[:, 1], c=(y == 0), cmap="coolwarm", label="No Fraud", linewidths=2)
ax2.scatter(X_reduced_pca[:, 0], X_reduced_pca[:, 1], c=(y == 1), cmap="coolwarm", label="Fraud", linewidths=2)
ax2.set_title("PCA", fontsize=14)
ax2.grid(True)
ax2.legend(handles=[blue_patch, red_patch])

# TruncatedSVD
ax3.scatter(X_reduced_svd[:, 0], X_reduced_svd[:, 1], c=(y == 0), cmap="coolwarm", label="No Fraud", linewidths=2)
ax3.scatter(X_reduced_svd[:, 0], X_reduced_svd[:, 1], c=(y == 1), cmap="coolwarm", label="Fraud", linewidths=2)
ax3.set_title("TruncatedSVD", fontsize=14)
ax3.grid(True)
ax3.legend(handles=[blue_patch, red_patch])

plt.show()
```
Classification modeling on the undersampled data
Four classification models
Train four different classifiers and see which performs best on the fraud data. First split the data into a training set and a test set.
In [47]:
```
# 1. Features and labels
X = new_df.drop("Class", axis=1)
y = new_df["Class"]
```
In [48]:
```
# 2. The data are already scaled, so split directly
from sklearn.model_selection import train_test_split

# 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=44)
```
In [49]:
```
# 3. Convert to arrays before passing to the models
X_train = X_train.values
X_test = X_test.values

y_train = y_train.values
y_test = y_test.values
```
In [50]:
```
# 4. Build four models
classifiers = {
    "LogisticRegression": LogisticRegression(),
    "KNearest": KNeighborsClassifier(),
    "Support Vector Classifier": SVC(),
    "DecisionTreeClassifier": DecisionTreeClassifier()
}

for key, classifier in classifiers.items():
    classifier.fit(X_train, y_train)              # fit the model
    training_score = cross_val_score(classifier,  # model
                                     X_train,     # training features
                                     y_train,
                                     cv=5)        # 5-fold cross-validation
    print("Model:", key,
          "mean CV score:", round(training_score.mean(), 2) * 100)

Model: LogisticRegression mean CV score: 93.0
Model: KNearest mean CV score: 93.0
Model: Support Vector Classifier mean CV score: 93.0
Model: DecisionTreeClassifier mean CV score: 91.0
```
Grid search
Run a grid search for each model to find the best hyperparameters.
In [51]:
```
from sklearn.model_selection import GridSearchCV

# Logistic regression
lr_params = {"penalty": ["l1", "l2"],
             "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_lr = GridSearchCV(LogisticRegression(), lr_params)
grid_lr.fit(X_train, y_train)

# Best estimator
best_para_lr = grid_lr.best_estimator_
best_para_lr
```
Out[51]:
LogisticRegression(C=0.1)
In [52]:
```
# K nearest neighbours
knn_params = {"n_neighbors": list(range(2, 5, 1)),
              "algorithm": ["auto", "ball_tree", "kd_tree", "brute"]}

grid_knn = GridSearchCV(KNeighborsClassifier(), knn_params)
grid_knn.fit(X_train, y_train)

# Best estimator
best_para_knn = grid_knn.best_estimator_
best_para_knn
```
Out[52]:
KNeighborsClassifier(n_neighbors=2)
In [53]:
```
# Support vector classifier
svc_params = {"C": [0.5, 0.7, 0.9, 1],
              "kernel": ["rbf", "poly", "sigmoid", "linear"]}

grid_svc = GridSearchCV(SVC(), svc_params)
grid_svc.fit(X_train, y_train)

best_para_svc = grid_svc.best_estimator_
best_para_svc
```
Out[53]:
SVC(C=0.9, kernel='linear')
In [54]:
```
# Decision tree
dt_params = {"criterion": ["gini", "entropy"],
             "max_depth": list(range(2, 5, 1)),
             "min_samples_leaf": list(range(5, 7, 1))}

grid_dt = GridSearchCV(DecisionTreeClassifier(), dt_params)
grid_dt.fit(X_train, y_train)

best_para_dt = grid_dt.best_estimator_
best_para_dt
```
Out[54]:
DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
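Besides best_estimator_, GridSearchCV also exposes the winning parameter dictionary and its mean cross-validated score, which is handy for a quick sanity check (illustrative; the printed values depend on the run):

```python
print(grid_lr.best_params_)           # e.g. {'C': 0.1, 'penalty': 'l2'}
print(round(grid_lr.best_score_, 4))  # mean CV accuracy of the best candidate
```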
Re-train and score
Recompute the cross-validation scores with the best parameters:
In [55]:
```
lr_score = cross_val_score(best_para_lr, X_train, y_train, cv=5)

print("Logistic regression CV score:", round(lr_score.mean() * 100, 2).astype(str) + "%")

Logistic regression CV score: 93.63%
```
In [56]:
```
knn_score = cross_val_score(best_para_knn, X_train, y_train, cv=5)

print("KNN CV score:", round(knn_score.mean() * 100, 2).astype(str) + "%")

KNN CV score: 93.37%
```
In [57]:
```
svc_score = cross_val_score(best_para_svc, X_train, y_train, cv=5)

print("SVC CV score:", round(svc_score.mean() * 100, 2).astype(str) + "%")

SVC CV score: 93.5%
```
In [58]:
```
dt_score = cross_val_score(best_para_dt, X_train, y_train, cv=5)

print("Decision tree CV score:", round(dt_score.mean() * 100, 2).astype(str) + "%")

Decision tree CV score: 93.24%
```
Summary: comparing the cross-validation scores of the different models, logistic regression scores highest.
Cross-validation on the undersampled data
The undersampling here relies on the NearMiss algorithm (a usage sketch follows this list):
- NearMiss-1: keep the majority-class samples whose average distance to their three closest minority-class samples is smallest
- NearMiss-2: keep the majority-class samples whose average distance to the three farthest minority-class samples is smallest
- NearMiss-3: for each minority-class sample, keep a given number of the closest majority-class samples
- Farthest distance: keep the majority-class samples whose average distance to their three closest minority-class samples is largest
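In imblearn the variant is selected through the version parameter of NearMiss. A minimal sketch (illustrative, not from the original notebook):

```python
from collections import Counter
from imblearn.under_sampling import NearMiss

# version=1/2/3 selects NearMiss-1/2/3; n_neighbors is the "three samples" above
X_nm, y_nm = NearMiss(version=1, n_neighbors=3).fit_resample(
    df.drop("Class", axis=1).values, df["Class"].values)
print(Counter(y_nm))
```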
In [59]:
```
undersample_X = df.drop("Class", axis=1)
undersample_y = df["Class"]

sfk = StratifiedKFold(
    n_splits=5,        # 5 folds
    random_state=None,
    shuffle=False)

for train_index, test_index in sfk.split(undersample_X, undersample_y):
    # print("Train: ", train_index)
    # print("Test: ", test_index)
    undersample_Xtrain = undersample_X.iloc[train_index]
    undersample_Xtest = undersample_X.iloc[test_index]
    undersample_ytrain = undersample_y.iloc[train_index]
    undersample_ytest = undersample_y.iloc[test_index]

undersample_Xtrain = undersample_Xtrain.values
undersample_Xtest = undersample_Xtest.values
undersample_ytrain = undersample_ytrain.values
undersample_ytest = undersample_ytest.values

# Five evaluation metrics
undersample_accuracy = []
undersample_precision = []
undersample_recall = []
undersample_f1 = []
undersample_auc = []
```
Use the NearMiss algorithm and check the resulting class distribution:
In [60]:
```
X_nearmiss, y_nearmiss = NearMiss().fit_resample(undersample_X.values, undersample_y.values)

print("NearMiss Label Distributions: {}".format(Counter(y_nearmiss)))

NearMiss Label Distributions: Counter({0: 492, 1: 492})
```
Run the cross-validation:
In [61]:
```
for train, test in sfk.split(undersample_Xtrain, undersample_ytrain):
    undersample_pipeline = imbalanced_make_pipeline(NearMiss(sampling_strategy="majority"),
                                                    best_para_lr)
    # Fit the pipeline on the training fold
    undersample_model = undersample_pipeline.fit(undersample_Xtrain[train], undersample_ytrain[train])
    # Predict on the validation fold
    undersample_prediction = undersample_model.predict(undersample_Xtrain[test])

    # Scores of the true labels vs. the predictions
    undersample_accuracy.append(undersample_pipeline.score(original_Xtrain[test], original_ytrain[test]))
    undersample_precision.append(precision_score(original_ytrain[test], undersample_prediction))
    undersample_recall.append(recall_score(original_ytrain[test], undersample_prediction))
    undersample_f1.append(f1_score(original_ytrain[test], undersample_prediction))
    undersample_auc.append(roc_auc_score(original_ytrain[test], undersample_prediction))
```
Plotting learning curves
In [62]:
from sklearn.model_selection import ShuffleSplit, learning_curve
In [63]:
```python
def plot_learning_curve(est1, est2, est3, est4, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(0.1, 1, 5)):
    f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20, 14), sharey=True)
    if ylim is not None:
        plt.ylim(*ylim)

    # Model 1: logistic regression
    train_sizes, train_scores, test_scores = learning_curve(
        est1, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    ax1.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="#ff9124")
    ax1.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="#2492ff")
    ax1.plot(train_sizes, train_scores_mean, 'o-', color="#ff9124",
             label="Training score")
    ax1.plot(train_sizes, test_scores_mean, 'o-', color="#2492ff",
             label="Cross-validation score")
    ax1.set_title("Logistic Regression Learning Curve", fontsize=14)
    ax1.set_xlabel('Training size (m)')
    ax1.set_ylabel('Score')
    ax1.grid(True)
    ax1.legend(loc="best")

    # Model 2: k-nearest neighbours
    train_sizes, train_scores, test_scores = learning_curve(
        est2, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    ax2.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="#ff9124")
    ax2.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="#2492ff")
    ax2.plot(train_sizes, train_scores_mean, 'o-', color="#ff9124",
             label="Training score")
    ax2.plot(train_sizes, test_scores_mean, 'o-', color="#2492ff",
             label="Cross-validation score")
    ax2.set_title("KNN Learning Curve", fontsize=14)
    ax2.set_xlabel('Training size (m)')
    ax2.set_ylabel('Score')
    ax2.grid(True)
    ax2.legend(loc="best")

    # Model 3: support vector classifier
    train_sizes, train_scores, test_scores = learning_curve(
        est3, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    ax3.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="#ff9124")
    ax3.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="#2492ff")
    ax3.plot(train_sizes, train_scores_mean, 'o-', color="#ff9124",
             label="Training score")
    ax3.plot(train_sizes, test_scores_mean, 'o-', color="#2492ff",
             label="Cross-validation score")
    ax3.set_title("SVC Learning Curve", fontsize=14)
    ax3.set_xlabel('Training size (m)')
    ax3.set_ylabel('Score')
    ax3.grid(True)
    ax3.legend(loc="best")

    # Model 4: decision tree
    train_sizes, train_scores, test_scores = learning_curve(
        est4, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    ax4.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="#ff9124")
    ax4.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="#2492ff")
    ax4.plot(train_sizes, train_scores_mean, 'o-', color="#ff9124",
             label="Training score")
    ax4.plot(train_sizes, test_scores_mean, 'o-', color="#2492ff",
             label="Cross-validation score")
    ax4.set_title("Decision Tree Learning Curve", fontsize=14)
    ax4.set_xlabel('Training size (m)')
    ax4.set_ylabel('Score')
    ax4.grid(True)
    ax4.legend(loc="best")

    return plt
```
In [64]:
```python
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)

plot_learning_curve(best_para_lr, best_para_knn, best_para_svc, best_para_dt,
                    X_train, y_train, (0.87, 1.01), cv=cv, n_jobs=4)
plt.show()
```
ROC curves
In [65]:
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import cross_val_predict
In [66]:
```
# Logistic regression and SVC expose decision_function;
# KNeighborsClassifier and DecisionTreeClassifier do not, so use their default predictions
lr_pred = cross_val_predict(best_para_lr, X_train, y_train, cv=5,
                            method="decision_function")

knn_pred = cross_val_predict(best_para_knn, X_train, y_train, cv=5)

svc_pred = cross_val_predict(best_para_svc, X_train, y_train, cv=5,
                             method="decision_function")

dt_pred = cross_val_predict(best_para_dt, X_train, y_train, cv=5)
```
In [67]:
print('Logistic Regression: ', roc_auc_score(y_train, lr_pred))
print('KNears Neighbors: ', roc_auc_score(y_train, knn_pred))
print('Support Vector Classifier: ', roc_auc_score(y_train, svc_pred))
print('Decision Tree Classifier: ', roc_auc_score(y_train, dt_pred))
Logistic Regression: 0.934970120644943
KNears Neighbors: 0.9314677528469951
Support Vector Classifier: 0.9339060209719247
Decision Tree Classifier: 0.930932179501635
In [68]:
```python
log_fpr, log_tpr, log_threshold = roc_curve(y_train, lr_pred)
knear_fpr, knear_tpr, knear_threshold = roc_curve(y_train, knn_pred)
svc_fpr, svc_tpr, svc_threshold = roc_curve(y_train, svc_pred)
tree_fpr, tree_tpr, tree_threshold = roc_curve(y_train, dt_pred)

def graph_roc_curve_multiple(log_fpr, log_tpr, knear_fpr, knear_tpr, svc_fpr, svc_tpr, tree_fpr, tree_tpr):
    plt.figure(figsize=(16, 8))
    plt.title('ROC Curve \n Top 4 Classifiers', fontsize=18)
    plt.plot(log_fpr, log_tpr, label='Logistic Regression Classifier Score: {:.4f}'.format(roc_auc_score(y_train, lr_pred)))
    plt.plot(knear_fpr, knear_tpr, label='KNears Neighbors Classifier Score: {:.4f}'.format(roc_auc_score(y_train, knn_pred)))
    plt.plot(svc_fpr, svc_tpr, label='Support Vector Classifier Score: {:.4f}'.format(roc_auc_score(y_train, svc_pred)))
    plt.plot(tree_fpr, tree_tpr, label='Decision Tree Classifier Score: {:.4f}'.format(roc_auc_score(y_train, dt_pred)))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([-0.01, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.annotate('Minimum ROC Score of 50% \n (This is the minimum score to get)',
                 xy=(0.5, 0.5),
                 xytext=(0.6, 0.3),
                 arrowprops=dict(facecolor='#6E726D', shrink=0.05),
                 )
    plt.legend()

graph_roc_curve_multiple(log_fpr, log_tpr, knear_fpr, knear_tpr, svc_fpr, svc_tpr, tree_fpr, tree_tpr)
plt.show()
```
Exploring logistic regression evaluation metrics
Explore the classification metrics of the logistic regression model:
In [69]:
```python
def logistic_roc_curve(log_fpr, log_tpr):
    plt.figure(figsize=(12, 8))
    plt.title('Logistic Regression ROC Curve', fontsize=16)
    plt.plot(log_fpr, log_tpr, 'b-', linewidth=2)
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.axis([-0.01, 1, 0, 1])

logistic_roc_curve(log_fpr, log_tpr)
plt.show()
```
```
from sklearn.metrics import precision_recall_curve

precision, recall, threshold = precision_recall_curve(y_train, lr_pred)
```
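The curve itself is not plotted in that cell; a minimal sketch of how the precision-recall trade-off computed above could be visualized:

```python
plt.figure(figsize=(12, 6))
plt.plot(recall, precision, 'b-', linewidth=2)
plt.xlabel('Recall', fontsize=14)
plt.ylabel('Precision', fontsize=14)
plt.title('Logistic Regression Precision-Recall Curve', fontsize=16)
plt.show()
```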
In [71]:
```python
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score

y_pred = best_para_lr.predict(X_train)

# Overfitting case: scores on the training data
print('---' * 20)
print('Recall Score: {:.2f}'.format(recall_score(y_train, y_pred)))
print('Precision Score: {:.2f}'.format(precision_score(y_train, y_pred)))
print('F1 Score: {:.2f}'.format(f1_score(y_train, y_pred)))
print('Accuracy Score: {:.2f}'.format(accuracy_score(y_train, y_pred)))

# Scores collected in the NearMiss cross-validation loop
print('---' * 20)
print("Accuracy Score: {:.2f}".format(np.mean(undersample_accuracy)))
print("Precision Score: {:.2f}".format(np.mean(undersample_precision)))
print("Recall Score: {:.2f}".format(np.mean(undersample_recall)))
print("F1 Score: {:.2f}".format(np.mean(undersample_f1)))
print('---' * 20)

# Based on the original data:
# Recall Score: 0.92   Precision Score: 0.79   F1 Score: 0.85   Accuracy Score: 0.84

# Based on the undersampled data:
# Accuracy Score: 0.75   Precision Score: 0.00   Recall Score: 0.24   F1 Score: 0.00
```
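As a final check — and since we kept the original split aside precisely for this purpose — the logistic regression trained on the undersampled data can also be scored on the untouched original test set. A minimal sketch, assuming best_para_lr and original_Xtest/original_ytest as defined above:

```python
final_pred = best_para_lr.predict(original_Xtest)

print(classification_report(original_ytest, final_pred))
print("ROC-AUC:", roc_auc_score(original_ytest, final_pred))
```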