German Credit Data Modeling: A Baseline


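WeChat public account: 尤而小屋
Author: Peter
Editor: Peter

Hi everyone, I'm Peter.

This article builds simple models on a German credit dataset with three tree-based models (decision tree, random forest, and XGBoost). It can serve as a baseline, and the closing section suggests directions for optimization.

Importing libraries

The imported libraries cover data processing, visualization, and modeling.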

```python
import pandas as pd
import numpy as np

# 1. plotly
import plotly as py
import plotly.express as px
import plotly.graph_objects as go
py.offline.init_notebook_mode(connected=True)
from plotly.subplots import make_subplots  # multiple subplots

# 2. matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline

# display Chinese characters correctly
# set the font
plt.rcParams["font.sans-serif"] = ["SimHei"]
# display the minus sign correctly
plt.rcParams["axes.unicode_minus"] = False

# 3. seaborn
import seaborn as sns
# plt.style.use("fivethirtyeight")
plt.style.use('ggplot')

# standardization, splitting, cross-validation
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score

# models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# model evaluation
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score, precision_score, f1_score

# suppress warnings in the notebook
import warnings
warnings.filterwarnings("ignore")
```

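Data overview

The data comes from the UCI repository: http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

Basic facts: 1,000 records, 20 feature variables plus one target variable, and no missing values.

Meaning of the feature variables, with the corresponding English names:

  • Features: 1. status of checking account; 2. loan duration; 3. credit history; 4. loan purpose; 5. credit amount; 6. status of savings account; 7. present employment status; 8. installment rate as a percentage of disposable income; 9. sex and marital status; 10. other debtors or guarantors; 11. present residence; 12. property; 13. age; 14. other installment plans; 15. housing; 16. number of existing credits; 17. job; 18. number of dependents; 19. telephone registration; 20. foreign worker

  • Corresponding English names: 1. status_account, 2. duration, 3. credit_history, 4. purpose, 5. amount, 6. saving_account, 7. present_emp, 8. income_rate, 9. personal_status, 10. other_debtors, 11. residence_info, 12. property, 13. age, 14. inst_plans, 15. housing, 16. num_credits, 17. job, 18. dependents, 19. telephone, 20. foreign_worker (the DataFrame below uses slightly different column names, listed in Out[8])

Reading the data

The downloaded file has no header row, so English column names found online are attached when building the DataFrame.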
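The loading cell itself is not shown in the article. A minimal sketch of how it could look, assuming the raw file is space-separated with no header: the column names are the ones printed in Out[8], while `df_sample` and the two inline rows are illustrative stand-ins for the real download.

```python
import io
import pandas as pd

# Column names as printed in the article's Out[8]
columns = [
    "checking_account_status", "duration", "credit_history", "purpose",
    "credit_amount", "savings", "present_employment", "installment_rate",
    "personal", "other_debtors", "present_residence", "property", "age",
    "other_installment_plans", "housing", "existing_credits", "job",
    "dependents", "telephone", "foreign_worker", "customer_type",
]

# Two illustrative space-separated rows in the assumed layout of the raw file;
# with the real download, pass the file path instead of the StringIO object.
sample = io.StringIO(
    "A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1 A192 A201 1\n"
    "A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1 A191 A201 2\n"
)

# header=None tells pandas the file has no header row; names supplies one
df_sample = pd.read_csv(sample, sep=" ", header=None, names=columns)
print(df_sample.shape)
```

On the full file the same call should produce the (1000, 21) shape reported below.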

In [4]:

df.shape

Out[4]:

(1000, 21)

In [5]:

df.dtypes  # column data types

Out[5]:

checking_account_status    object
duration                    int64
credit_history             object
purpose                    object
credit_amount               int64
savings                    object
present_employment         object
installment_rate            int64
personal                   object
other_debtors              object
present_residence           int64
property                   object
age                         int64
other_installment_plans    object
housing                    object
existing_credits            int64
job                        object
dependents                  int64
telephone                  object
foreign_worker             object
customer_type               int64
dtype: object

In [6]:

```python
# count the columns of each data type
df.dtypes.value_counts()
```

Out[6]:

object    13
int64      8
dtype: int64

In [7]:

df.isnull().sum()

Out[7]:

checking_account_status    0
duration                   0
credit_history             0
purpose                    0
credit_amount              0
savings                    0
present_employment         0
installment_rate           0
personal                   0
other_debtors              0
present_residence          0
property                   0
age                        0
other_installment_plans    0
housing                    0
existing_credits           0
job                        0
dependents                 0
telephone                  0
foreign_worker             0
customer_type              0
dtype: int64

Value counts per column

In [8]:

columns = df.columns  # column names
columns

Out[8]:

Index(['checking_account_status', 'duration', 'credit_history', 'purpose', 'credit_amount', 'savings', 'present_employment', 'installment_rate', 'personal', 'other_debtors', 'present_residence', 'property', 'age', 'other_installment_plans', 'housing', 'existing_credits', 'job', 'dependents', 'telephone', 'foreign_worker', 'customer_type'], dtype='object')

1. Value counts for the string-typed columns:

```python
string_columns = df.select_dtypes(include="object").columns

# two basic parameters: the number of rows and columns of the grid
fig = make_subplots(rows=3, cols=5)

for i, v in enumerate(string_columns):
    r = i // 5 + 1  # 1-based row
    c = i % 5 + 1   # 1-based column

    # use the index/values of value_counts() directly, so the code does not
    # depend on the column names that reset_index() produces
    counts = df[v].value_counts()
    fig.add_trace(
        go.Bar(x=counts.index, y=counts.values, text=counts.values, name=v),
        row=r, col=c,
    )

fig.update_layout(width=1000, height=900)

fig.show()
```
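The loop above has to turn a running index into a 1-based (row, column) position in the subplot grid. A small sketch of that arithmetic in isolation, using a hypothetical helper `grid_position` that is not part of the article:

```python
def grid_position(i, ncols):
    """Return the 1-based (row, col) of the i-th subplot in a grid with ncols columns."""
    row, col = divmod(i, ncols)  # 0-based row and column
    return row + 1, col + 1


# positions of the 13 string columns in a grid that is 5 columns wide
positions = [grid_position(i, 5) for i in range(13)]
print(positions[:6])
```

The fifth chart lands in the last column of the first row and the sixth wraps to the start of the second row, which is what the plotly `row=` and `col=` arguments expect.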

2. Distribution of the numeric columns:

```python
number_columns = df.select_dtypes(exclude="object").columns.tolist()
number_columns

# two basic parameters: the number of rows and columns of the grid
fig = make_subplots(rows=2, cols=4)  # 2 rows, 4 columns

for i, v in enumerate(number_columns):  # number_columns has 8 entries
    r = i // 4 + 1  # 1-based row
    c = i % 4 + 1   # 1-based column
    fig.add_trace(go.Box(y=df[v].tolist(), name=v), row=r, col=c)

fig.update_layout(width=1000, height=900)

fig.show()
```

Field processing

Checking account status - checking_account_status

Meaning: status of the existing checking account

  • A11: < 0 DM
  • A12: 0 <= x < 200 DM
  • A13: >= 200 DM / salary assignments for at least one year
  • A14: no checking account

In [11]:

df["checking_account_status"].value_counts()

Out[11]:

A14    394
A11    274
A12    269
A13     63
Name: checking_account_status, dtype: int64

In [12]:

```python
fig, ax = plt.subplots(figsize=(12, 8), dpi=80)

sns.countplot(x="checking_account_status", data=df)

plt.title("number of checking_account_status")

# annotate each bar with its count
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x(), p.get_height() + 5), color='black', size=20)
plt.show()
```

Here the categories are encoded by hand (a hard-coded ordinal mapping) according to the size of the checking account balance.

In [13]:

```python
# A11: < 0 DM, A12: 0 <= x < 200 DM, A13: >= 200 DM / salary assignments for at least one year, A14: no checking account

# encoding 1
cas = {"A11": 1, "A12": 2, "A13": 3, "A14": 0}
df["checking_account_status"] = df["checking_account_status"].map(cas)
```
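One thing to watch with a hard-coded mapping is that `Series.map` silently turns any category missing from the dictionary into NaN. A minimal sketch of a sanity check, using a toy series in which the code "A99" is made up for illustration:

```python
import pandas as pd

# toy column; "A99" is a made-up code that the mapping does not cover
status = pd.Series(["A11", "A14", "A12", "A13", "A99"])
cas = {"A11": 1, "A12": 2, "A13": 3, "A14": 0}

encoded = status.map(cas)

# categories that were not in the dictionary end up as NaN after map()
unmapped = status[encoded.isna()].unique().tolist()
print(unmapped)
```

On the real column the list should be empty, since the four codes A11 to A14 cover every value shown in Out[11].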

Loan duration - duration

Meaning: duration in months

In [14]:

duration = df["duration"].value_counts()
duration.head()

Out[14]:

24    184
12    179
18    113
36     83
6      75
Name: duration, dtype: int64

In [15]:

```python
fig = px.violin(df, y="duration")

fig.show()
```

Credit history - credit_history

Meaning:

  • A30: no credits taken / all credits paid back duly
  • A31: all credits at this bank paid back duly
  • A32: existing credits paid back duly until now
  • A33: delay in paying off in the past
  • A34: critical account / other credits existing (not at this bank)

In [17]:

ch = df["credit_history"].value_counts().reset_index()
ch

Out[17]:

|      | index | credit_history |
| ---: | ----: | -------------: |
|    0 |   A32 |            530 |
|    1 |   A34 |            293 |
|    2 |   A33 |             88 |
|    3 |   A31 |             49 |
|    4 |   A30 |             40 |

In [18]:

```python
fig = px.pie(ch, names="index", values="credit_history")

fig.update_traces(
    textposition='inside',
    textinfo='percent+label'
)

fig.show()
```

```python
# encoding 2: one-hot encoding
df_credit_history = pd.get_dummies(df["credit_history"])
df = df.join(df_credit_history)
df.drop("credit_history", inplace=True, axis=1)
```
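The get_dummies, join, and drop pattern is repeated for every one-hot column below. As an alternative sketch (not the author's code), `pd.get_dummies` can encode several columns in one call through its `columns` argument, which also prefixes each new column with the source column name; the toy frame here is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "credit_history": ["A32", "A34", "A30"],
    "housing": ["A152", "A151", "A152"],
    "duration": [6, 48, 12],
})

# encode both categorical columns at once; the originals are dropped automatically
encoded = pd.get_dummies(toy, columns=["credit_history", "housing"])
print(encoded.columns.tolist())
```

The prefixed names (for example `credit_history_A30`) make it easier to tell later which original field a dummy column came from, at the cost of differing from the bare `A30` style names used in the rest of this article.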

Loan purpose - purpose

Meaning: the purpose of the loan

In [20]:

```python
# count the applicants for each purpose and hard-code the encoding by frequency rank
purpose = df["purpose"].value_counts().sort_values(ascending=True).reset_index()

purpose.columns = ["purpose", "number"]

purpose
```

```python
# encoding 3
df["purpose"] = df["purpose"].map(dict(zip(purpose.purpose, purpose.index)))
```
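To make the frequency-rank encoding concrete, here is a small worked sketch on a toy series (the purpose labels are invented): the rarest category gets code 0 and the most frequent one gets the largest code.

```python
import pandas as pd

s = pd.Series(["car", "car", "car", "tv", "tv", "repairs"], name="purpose")

# categories sorted from rarest to most frequent
counts = s.value_counts().sort_values(ascending=True)

# rank 0 for the rarest category, increasing with frequency
rank_map = {category: rank for rank, category in enumerate(counts.index)}
encoded = s.map(rank_map)
print(rank_map)
```

The codes depend entirely on the counts in the data at hand, and categories with equal counts receive an arbitrary relative order, so this is a convenience encoding rather than one with a fixed meaning.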

Credit amount - credit_amount

Meaning: the credit amount of the loan

In [22]:

px.violin(df["credit_amount"])

Savings account - savings

Savings account / bonds (A61: < 100 DM, A62: 100 <= x < 500 DM, A63: 500 <= x < 1000 DM, A64: >= 1000 DM, A65: unknown / no savings account)

In [24]:

string_columns

Out[24]:

Index(['checking_account_status', 'credit_history', 'purpose', 'savings', 'present_employment', 'personal', 'other_debtors', 'property', 'other_installment_plans', 'housing', 'job', 'telephone', 'foreign_worker'], dtype='object')

In [25]:

df["savings"].value_counts()

Out[25]:

A61    603
A65    183
A62    103
A63     63
A64     48
Name: savings, dtype: int64

In [26]:

```python
# encoding 6: hard-coded ordinal mapping
savings = {"A61": 1, "A62": 2, "A63": 3, "A64": 4, "A65": 0}

df["savings"] = df["savings"].map(savings)
```

Present employment - present_employment

  • A71: unemployed
  • A72: < 1 year
  • A73: 1 <= x < 4 years
  • A74: 4 <= x < 7 years
  • A75: >= 7 years

In [28]:

df["present_employment"].value_counts()

Out[28]:

A73    339
A75    253
A74    174
A72    172
A71     62
Name: present_employment, dtype: int64

In [29]:

```python
# encoding 7: one-hot encoding
df_present_employment = pd.get_dummies(df["present_employment"])
```

In [30]:

```python
df = df.join(df_present_employment)

df.drop("present_employment", inplace=True, axis=1)
```

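Marital status and sex - personal

Marital status and sex (A91: male, divorced/separated; A92: female, divorced/separated/married; A93: male, single; A94: male, married/widowed; A95: female, single)

In [31]:

```python
# encoding 8: one-hot encoding
df_personal = pd.get_dummies(df["personal"])
df = df.join(df_personal)

df.drop("personal", inplace=True, axis=1)
```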

Other debtors or guarantors - other_debtors

A101: none, A102: co-applicant, A103: guarantor

In [32]:

```python
# encoding 9: one-hot encoding
df_other_debtors = pd.get_dummies(df["other_debtors"])
df = df.join(df_other_debtors)

df.drop("other_debtors", inplace=True, axis=1)
```

Property - property

In [33]:

```python
# encoding 10: one-hot encoding
df_property = pd.get_dummies(df["property"])
df = df.join(df_property)

df.drop("property", inplace=True, axis=1)
```

Housing - housing

A151: rent, A152: own, A153: for free

In [34]:

```python
# encoding 11: one-hot encoding
df_housing = pd.get_dummies(df["housing"])
df = df.join(df_housing)

df.drop("housing", inplace=True, axis=1)
```

Other installment plans - other_installment_plans

A141: bank, A142: stores, A143: none

In [35]:

```python
fig, ax = plt.subplots(figsize=(12, 8), dpi=80)

sns.countplot(x="other_installment_plans", data=df)

plt.title("number of other_installment_plans")

# annotate each bar with its count
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x(), p.get_height() + 5), color='black', size=20)
plt.show()
```

```python
# encoding 12: one-hot encoding
df_other_installment_plans = pd.get_dummies(df["other_installment_plans"])
df = df.join(df_other_installment_plans)

df.drop("other_installment_plans", inplace=True, axis=1)
```

Job - job

  • A171: unskilled, non-resident
  • A172: unskilled, resident
  • A173: skilled employee / official
  • A174: management / self-employed / highly qualified employee / officer

In [37]:

```python
fig, ax = plt.subplots(figsize=(12, 8), dpi=80)

sns.countplot(x="job", data=df)

plt.title("number of job")

# annotate each bar with its count
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x(), p.get_height() + 5), color='black', size=20)
plt.show()
```

```python
# encoding 13: one-hot encoding
df_job = pd.get_dummies(df["job"])
df = df.join(df_job)

df.drop("job", inplace=True, axis=1)
```

Telephone - telephone

A191: none, A192: yes, registered under the customer's name

In [39]:

```python
# encoding 14: one-hot encoding
df_telephone = pd.get_dummies(df["telephone"])
df = df.join(df_telephone)

df.drop("telephone", inplace=True, axis=1)
```

Foreign worker - foreign_worker

A201: yes, A202: no

In [40]:

```python
# encoding 15: one-hot encoding
df_foreign_worker = pd.get_dummies(df["foreign_worker"])
df = df.join(df_foreign_worker)

df.drop("foreign_worker", inplace=True, axis=1)
```

Counts of the two customer types - customer_type

Target class: 1 = good, 2 = bad

In [41]:

```python
fig, ax = plt.subplots(figsize=(12, 8), dpi=80)

sns.countplot(x="customer_type", data=df)

plt.title("number of customer_type")

# annotate each bar with its count
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x(), p.get_height() + 5), color='black', size=20)
plt.show()
```

Shuffling the data

In [42]:

```python
from sklearn.utils import shuffle

# randomly shuffle the rows
df = shuffle(df).reset_index(drop=True)
```

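Modeling

Splitting the data

In [44]:

```python
# select the features
X = df.drop("customer_type", axis=1)

# target variable
y = df['customer_type']
from sklearn.model_selection import train_test_split
```

In [45]:

```python
# 20/80 split: train_size=0.2 keeps 200 rows for training and 800 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)
```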
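Because `train_size=0.2` sends only 20% of the rows to the training set, it is worth seeing how it differs from the more common `test_size=0.2`. A sketch on synthetic arrays (`X_demo` and `y_demo` are stand-ins, and the 700/300 class split is an assumption chosen to resemble this dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-ins: 1000 rows, with an assumed 700/300 split of labels 1 and 2
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.array([1] * 700 + [2] * 300)

# train_size=0.2: 200 rows for training, 800 for testing (as in the article)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))

# test_size=0.2 with stratification: 800 rows for training, 200 for testing,
# and the class ratio is preserved in both parts
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
print(len(X_tr2), len(X_te2))
```

Training on 800 rows instead of 200 is one inexpensive change that may lift the baseline scores reported below, although the article's numbers all come from the 200-row training set.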

Standardization

In [46]:

```python
ss = StandardScaler()

X_train = ss.fit_transform(X_train)
```

In [47]:

y_train

Out[47]:

556    1
957    1
577    2
795    2
85     1
      ..
106    1
270    2
860    1
435    1
102    2
Name: customer_type, Length: 200, dtype: int64

In [48]:

```python
# mean and standard deviation of each feature in the training set
mean_ = ss.mean_          # mean
std_ = np.sqrt(ss.var_)   # standard deviation
```

Apply the mean and standard deviation obtained from the training set to the test set:

In [50]:

```python
# test-set features after standardization
X_test = (X_test - mean_) / std_
```
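The manual formula above is what `StandardScaler.transform` does with the statistics learned from the training set. A short sketch on random data (the arrays `train` and `test` are synthetic) that checks the two routes agree:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler().fit(train)

# manual standardization with the training-set mean and standard deviation
manual = (test - scaler.mean_) / np.sqrt(scaler.var_)

# the same operation through the fitted scaler
via_transform = scaler.transform(test)

same = np.allclose(manual, via_transform)
print(same)
```

Calling `transform` has the added benefit of handling constant columns, where the variance is zero and the manual division would fail. It returns a NumPy array, so a later `.values` call on the result would not be needed.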

Model 1: decision tree

In [51]:

```python
dt = DecisionTreeClassifier(max_depth=5)

dt.fit(X_train, y_train)
```

Out[51]:

DecisionTreeClassifier(max_depth=5)

In [52]:

```python
# predict
y_pred = dt.predict(X_test)
y_pred[:5]
```

Out[52]:

array([2, 1, 1, 2, 1])

In [53]:

```python
# confusion matrix
confusion_mat = metrics.confusion_matrix(y_test, y_pred)
confusion_mat
```

Out[53]:

array([[450, 118],
       [137,  95]])

In [54]:

```python
# visualize the confusion matrix
classes = ["good", "bad"]

disp = ConfusionMatrixDisplay(confusion_matrix=confusion_mat, display_labels=classes)
disp.plot(
    include_values=True,  # show the value in each cell of the matrix
    cmap="GnBu",          # a colormap recognized by matplotlib
    ax=None,
    xticks_rotation="horizontal",
    values_format="d"
)

plt.show()
```

```python
# auc-roc
auc_roc = metrics.roc_auc_score(y_test, y_pred)  # true labels and predicted labels
auc_roc

# output (recomputed from the confusion matrix above): 0.6008681398737251
```

Model 2: random forest

In [56]:

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

Out[56]:

RandomForestClassifier()

In [57]:

```python
# predict
y_pred = rf.predict(X_test)
y_pred[:5]
```

Out[57]:

array([1, 1, 1, 2, 1])

In [58]:

```python
# confusion matrix
confusion_mat = metrics.confusion_matrix(y_test, y_pred)
confusion_mat
```

Out[58]:

array([[476,  92],
       [142,  90]])

In [59]:

```python
# visualize the confusion matrix
classes = ["good", "bad"]

disp = ConfusionMatrixDisplay(confusion_matrix=confusion_mat, display_labels=classes)
disp.plot(
    include_values=True,  # show the value in each cell of the matrix
    cmap="GnBu",          # a colormap recognized by matplotlib
    ax=None,
    xticks_rotation="horizontal",
    values_format="d"
)

plt.show()
```

```python
# auc-roc
auc_roc = metrics.roc_auc_score(y_test, y_pred)  # true labels and predicted labels
auc_roc

# output: 0.6129796017484215
```
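The AUC values in this article are computed from hard class predictions. ROC AUC is normally computed from scores or probabilities, which carry ranking information that the 0/1 labels throw away. A sketch of the difference on synthetic data (everything here, including `X_demo`, `y_demo`, and the 70/30 class weights, is an illustrative assumption rather than the article's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic, imbalanced binary problem standing in for the credit data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# AUC from hard labels: only one operating point of the ROC curve is used
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# AUC from predicted probabilities of the positive class: uses the full ranking
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(round(auc_from_labels, 3), round(auc_from_scores, 3))
```

With the article's labels 1 and 2, the positive class for `roc_auc_score` is the larger label, so the matching probability column would be the one for class 2.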

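Model 3: XGBoost

In [62]:

```python
from xgboost.sklearn import XGBClassifier

# define the XGBoost model
clf = XGBClassifier()

# X_train is already a NumPy array (the output of StandardScaler.fit_transform),
# so it needs no .values call; X_test is converted in the prediction cell below.
# Note: recent XGBoost releases expect class labels that start at 0,
# so y may need remapping (for example y - 1) before fitting.
```

In [63]:

clf.fit(X_train, y_train)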

Out[63]:

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.300000012, max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact', validate_parameters=1, verbosity=None)

In [65]:

```python
# convert the DataFrame to an array before passing it in
X_test = X_test.values

y_pred = clf.predict(X_test)
y_pred[:5]
```

Out[65]:

array([1, 1, 1, 2, 1])

In [66]:

```python
# confusion matrix
confusion_mat = metrics.confusion_matrix(y_test, y_pred)
confusion_mat
```

Out[66]:

array([[445, 123],
       [115, 117]])

In [67]:

```python
# visualize the confusion matrix
classes = ["good", "bad"]

disp = ConfusionMatrixDisplay(confusion_matrix=confusion_mat, display_labels=classes)
disp.plot(
    include_values=True,  # show the value in each cell of the matrix
    cmap="GnBu",          # a colormap recognized by matplotlib
    ax=None,
    xticks_rotation="horizontal",
    values_format="d"
)

plt.show()
```

```python
# auc-roc
auc_roc = metrics.roc_auc_score(y_test, y_pred)  # true labels and predicted labels
auc_roc

# output: 0.6438805245264692
```
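When `roc_auc_score` receives hard labels, the ROC curve has a single interior point, and the area under it reduces to the average of the two per-class recalls (the balanced accuracy). A small pure-Python sketch that recomputes this from the three confusion matrices printed above; the helper `auc_from_confusion` is a name introduced here for illustration:

```python
def auc_from_confusion(matrix):
    """AUC of hard binary predictions: the mean of the two per-class recalls."""
    (tn, fp), (fn, tp) = matrix
    recall_first = tn / (tn + fp)    # recall of the class in row 0 ("good")
    recall_second = tp / (fn + tp)   # recall of the class in row 1 ("bad")
    return (recall_first + recall_second) / 2


# confusion matrices as printed in Out[53], Out[58] and Out[66]
matrices = {
    "decision_tree": [[450, 118], [137, 95]],
    "random_forest": [[476, 92], [142, 90]],
    "xgboost": [[445, 123], [115, 117]],
}

for name, m in matrices.items():
    print(name, auc_from_confusion(m))
```

Read this way, the scores say that the models find roughly 40% to 50% of the bad customers while keeping about 80% of the good ones, which is a more tangible summary than the single AUC figure.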

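Model optimization

Feature selection based on correlation coefficients

```python
# y (customer_type) is the target variable
# 1. compute the correlation between every feature and the target
data = pd.concat([X, y], axis=1)

corr = data.corr()
corr[:5]
```

Descriptive statistics of the correlation coefficients: overall, their absolute values are fairly small.

Heat map

```python
fig, ax = plt.subplots(figsize=(20, 16))

sns.heatmap(corr,
            vmax=0.8,
            square=True,
            annot=True,  # show the values
            cmap="YlGnBu",
            ax=ax)
```

Select the 20 columns with the largest correlation coefficient with customer_type. The ranking uses the signed values, and the target itself comes first, so this yields 19 features:

```python
k = 20

cols = corr.nlargest(k, "customer_type")["customer_type"].index
cols
```

Index(['customer_type', 'duration', 'checking_account_status', 'credit_amount', 'A30', 'A31', 'A124', 'A72', 'A141', 'A151', 'A201', 'A153', 'A92', 'installment_rate', 'A102', 'A142', 'A91', 'A32', 'A174', 'A71'], dtype='object')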
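Since `nlargest` ranks by signed value, a strongly negative correlation is ranked below a weak positive one. If the goal is to keep the features most associated with the target in either direction, ranking by absolute value is one option. A toy sketch (the frame `toy` and its column names are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "f_pos": [1, 2, 3, 4, 5, 6],    # rises with the target
    "f_neg": [6, 5, 4, 3, 2, 1],    # falls as the target rises
    "f_weak": [1, 3, 2, 3, 1, 2],   # unrelated to the target
    "target": [1, 1, 1, 2, 2, 2],
})

# correlation of each feature with the target, excluding the target itself
target_corr = toy.corr()["target"].drop("target")

# signed ranking: the strongly negative feature is left out
top_signed = target_corr.nlargest(2).index.tolist()

# absolute ranking: both strongly related features are kept
top_abs = target_corr.abs().nlargest(2).index.tolist()

print(top_signed, top_abs)
```

The threshold-based selection later in this section already uses the absolute value, so this mainly matters for the top-k ranking.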

```python
fig, ax = plt.subplots(figsize=(10, 10))  # adjust the canvas size
sns.heatmap(data[cols].corr(),  # correlation matrix of the selected columns
            annot=True,
            square=True,
            ax=ax)
plt.show()
```

Select the variables whose correlation coefficient with the target has an absolute value greater than 0.1

```python
threshold = 0.1

corrmat = data.corr()
top_corr_features = corrmat.index[abs(corrmat["customer_type"]) > threshold]

plt.figure(figsize=(10, 10))

g = sns.heatmap(data[top_corr_features].corr(),  # correlation matrix of the features above the threshold
                annot=True,
                square=True,
                cmap="nipy_spectral_r"
               )
```

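Modeling on the new data

```python
# keep the columns for which the condition is True
useful_col = corrmat.index[abs(corrmat["customer_type"]) > threshold].tolist()
```

new_df = df[useful_col]
new_df.head()

Splitting the data

```python
# select the features
X = new_df.drop("customer_type", axis=1)

# target variable
y = new_df['customer_type']
```

```python
# 30/70 split: train_size=0.3 keeps 300 rows for training and 700 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3, random_state=42)
```

Standardization

ss = StandardScaler()
X_train = ss.fit_transform(X_train)

```python
# mean and standard deviation of each feature in the training set
mean_ = ss.mean_          # mean
std_ = np.sqrt(ss.var_)   # standard deviation

# test-set features after standardization
X_test = (X_test - mean_) / std_
```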
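`Pipeline` and `cross_val_score` are imported at the top of the article but never used. A minimal sketch of how they could bundle the scaling and the model so that the scaler is refitted inside every fold; the data is synthetic and the step names `scaler` and `model` are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the credit data
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),  # fitted on the training part of each fold only
    ("model", DecisionTreeClassifier(max_depth=5, random_state=0)),
])

# 5-fold cross-validated AUC, computed from predicted probabilities
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
print(scores.mean())
```

Cross-validation gives a more stable estimate than a single split of 200 or 300 training rows. Tree-based models are insensitive to feature scaling, so the scaler step matters mainly if a linear model or SVM is swapped in.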

Modeling

```python
from xgboost.sklearn import XGBClassifier

# define the XGBoost model
clf = XGBClassifier()
```

clf.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.300000012, max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact', validate_parameters=1, verbosity=None)

In [80]:

```python
# convert the DataFrame to an array before passing it in
X_test = X_test.values

y_pred = clf.predict(X_test)
y_pred[:5]
```

Out[80]:

array([2, 1, 2, 2, 1])

In [81]:

```python
# confusion matrix
confusion_mat = metrics.confusion_matrix(y_test, y_pred)
confusion_mat
```

Out[81]:

array([[406,  94],
       [ 96, 104]])

In [82]:

```python
# auc-roc
auc_roc = metrics.roc_auc_score(y_test, y_pred)  # true labels and predicted labels
auc_roc
```

Out[82]:

0.666

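Directions for optimization

After modeling with three different tree-based models, the AUC values turn out to be fairly low; the best result here is 0.666. AUC can be read as the probability that a randomly chosen positive sample is ranked above a randomly chosen negative one, so the larger the AUC, the better the classifier. Directions worth considering:

  1. Feature engineering: this is the main area to improve. The original features are currently encoded in three different ways, including one-hot encoding and hard-coded ordinal mapping, and the encoding of some fields needs to be reconsidered.
  2. Variable selection: the correlation coefficient measures the degree of linear association between two continuous variables, while the relationship between the features and the target is not necessarily linear. The low correlation coefficients observed in this article seem to support this. Other variable selection methods can be tried in follow-up modeling.
  3. Model tuning: optimize the parameters of individual models with grid search and similar methods, or improve overall performance through model ensembling.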
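As a starting point for the tuning direction, here is a minimal grid search sketch with scikit-learn's `GridSearchCV`. The data is synthetic and the parameter grid is a deliberately tiny, arbitrary example, not a recommendation for this dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the credit data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# a small illustrative grid; real searches usually cover more values
param_grid = {"n_estimators": [20, 50], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # select the combination with the best cross-validated AUC
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

After fitting, `search.best_estimator_` holds the model refitted on all the data with the best parameters, ready to be evaluated on a held-out test set.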

Getting the dataset

Follow the WeChat public account 尤而小屋 and reply 德國 to receive the dataset used in this article.