
WeChat official account: 尤而小屋
Author: Peter
Editor: Peter

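Hi everyone, I'm Peter~

Today's post is a hands-on machine learning project on industrial data: detecting and classifying steel plate defects with machine learning classification algorithms.

The dataset comes from UCI, a site that hosts datasets for machine learning: http://archive.ics.uci.edu/ml/index.php

The dataset covers 7 types of steel plate faults (Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, Other_Faults), each sample described by 27 features.

Main topics covered in this article:

Dataset information

For details, see the official page: http://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults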

Data preprocessing

Importing the data

In [1]:

```python
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
# subplots
from plotly.subplots import make_subplots

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")
%matplotlib inline

# ignore warnings
import warnings
warnings.filterwarnings('ignore')
```

In [2]:

    df = pd.read_excel("faults.xlsx")
    df.head()

Out[2]:

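Splitting the data

Separate the 7 fault-type columns from the feature columns that precede them:

```python
df1 = df.loc[:, "Pastry":].copy()           # the 7 fault-type columns
df2 = df.loc[:, :"SigmoidOfAreas"].copy()   # feature columns only

# fault-type columns
df1.head()
```

Below are the 27 feature columns: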

Generating the class label

Combine the 7 fault-type columns into a single label column:
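The code that builds the `Label` column does not appear in this section. Here is a minimal sketch of one way to do it, assuming each row has exactly one of the fault-type columns set to 1, so that `idxmax(axis=1)` returns the name of that column. The frame `toy` is made up for illustration and uses only three of the seven columns.

```python
import pandas as pd

# Toy indicator frame (made-up rows, three of the seven fault columns)
toy = pd.DataFrame({
    "Pastry":    [1, 0, 0, 0],
    "Z_Scratch": [0, 1, 0, 0],
    "K_Scatch":  [0, 0, 1, 1],
})

# Names of the fault-type columns, captured before "Label" is added
columns = toy.columns.tolist()

# For each row, take the name of the column that holds the 1
toy["Label"] = toy[columns].idxmax(axis=1)

print(toy["Label"].tolist())
```

Capturing `columns` before adding `Label` keeps the list limited to the fault-type names, which is what the encoding step needs.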

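Encoding the types

In [7]:

```python
# columns: list of the 7 fault-type column names (from the label-generation step)
dic = {}
for i, v in enumerate(columns):
    dic[v] = i  # class indices start at 0

dic
```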

Out[7]:

    {'Pastry': 0,
     'Z_Scratch': 1,
     'K_Scatch': 2,
     'Stains': 3,
     'Dirtiness': 4,
     'Bumps': 5,
     'Other_Faults': 6}

In [8]:

```
df1["Label"] = df1["Label"].map(dic)

df1.head()
```

Out[8]:

Merging the data

In [9]:

    df2["Label"] = df1["Label"]
    df2.head()

EDA

Basic statistics of the data

In [10]:

```
# missing values
df2.isnull().sum()
```

The result shows that there are no missing values:

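Distribution of a single feature

```python
parameters = df2.columns[:-1].tolist()

sns.boxplot(data=df2, y="Steel_Plate_Thickness")
plt.show()
```

The box plot shows how the values of a single feature are distributed. Next, draw box plots for all of the features: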

```python
# two basic parameters: the number of rows and columns
fig = make_subplots(rows=7, cols=4)  # 7 rows x 4 columns, enough for 27 features

# add one box trace per feature
for i, v in enumerate(parameters):
    r = i // 4 + 1  # 1-based row index
    c = i % 4 + 1   # 1-based column index
    fig.add_trace(go.Box(y=df2[v].tolist(), name=v), row=r, col=c)

fig.update_layout(width=1000, height=900)

fig.show()
```

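A few conclusions:

The features have very different value ranges: from negative values up to 10M
Some features contain outliers
Some features only take the values 0 and 1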
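The outliers visible in the box plots can also be counted numerically. This is a minimal sketch, assuming the usual box-plot convention that a value is an outlier when it lies more than 1.5 × IQR beyond the quartiles. The helper name `count_iqr_outliers` and the sample values are made up for illustration.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up values: seven similar readings and one extreme one
sample = pd.Series([10, 12, 11, 13, 12, 11, 10, 300])
print(count_iqr_outliers(sample))  # 1
```

On the real data the same helper could be applied column by column, for example `df2[parameters].apply(count_iqr_outliers)`.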

Class imbalance

Number of samples per class

In [15]:

```
# number of samples of each type
df2["Label"].value_counts()
```

Out[15]:

    6    673
    5    402
    2    391
    1    190
    0    158
    3     72
    4     55
    Name: Label, dtype: int64

Class 6 has 673 samples while class 4 has only 55, about a 12:1 ratio, so the classes are clearly imbalanced.

Fixing it with SMOTE

In [16]:

    X = df2.drop("Label", axis=1)
    y = df2[["Label"]]

In [17]:

```
# use the SMOTE interface from the over-sampling methods in the imblearn library
from imblearn.over_sampling import SMOTE

# set the random seed
smo = SMOTE(random_state=42)
X_smo, y_smo = smo.fit_resample(X, y)
y_smo
```
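SMOTE builds synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates that idea with NumPy only. It is a simplified illustration, not imblearn's implementation, and the function name `smote_like_samples` and the toy points are assumptions made for the example.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority
    samples and their nearest minority neighbours (simplified SMOTE idea)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbours per point

    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X_min))     # pick a minority sample
        nn = rng.choice(neighbours[i])   # pick one of its neighbours
        lam = rng.random()               # interpolation weight in [0, 1)
        synthetic[j] = X_min[i] + lam * (X_min[nn] - X_min[i])
    return synthetic

# Toy minority class: the four corners of the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(X_minority, n_new=5)
print(new_points.shape)  # (5, 2)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.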

Count the number of samples in each class:

Data normalization

Normalizing the feature matrix

In [19]:

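```python
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# StandardScaler rescales each feature to zero mean and unit variance
ss = StandardScaler()
data_ss = ss.fit_transform(X_smo)

# map the scaled values back to the original data
origin_data = ss.inverse_transform(data_ss)

```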
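`StandardScaler` subtracts each column's mean and divides by its standard deviation, and `inverse_transform` undoes that. Here is a small sketch on made-up numbers (`X_demo` is not from the steel data) that checks both statements by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up matrix with two columns on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 600.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(X_demo)

# Same result by hand: subtract the column mean, divide by the column std
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# inverse_transform recovers the original values
restored = scaler.inverse_transform(scaled)

print(np.allclose(scaled, manual), np.allclose(restored, X_demo))
```

After scaling, both columns have mean 0 and standard deviation 1, so features measured in very different units contribute on a comparable scale.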

The normalized feature matrix

In [21]:

    df3 = pd.DataFrame(data_ss, columns=X_smo.columns)
    df3.head()

Out[21]:

Adding y_smo

In [22]:

    df3["Label"] = y_smo
    df3.head()

Modeling

Shuffling the data

In [23]:

    from sklearn.utils import shuffle
    df3 = shuffle(df3)

Splitting the dataset

In [24]:

    X = df3.drop("Label", axis=1)
    y = df3[["Label"]]

In [25]:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
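With `test_size=0.2`, `train_test_split` rounds the test share up to a whole number of rows. As a sketch, assume SMOTE has raised every class to the majority count of 673, so the resampled data has 7 × 673 = 4711 rows. The test set then gets 943 rows, the length used when `y_test` is reshaped later on. The array `dummy` is only a stand-in for the real feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 7 * 673  # assumed size after SMOTE balances all 7 classes
dummy = np.arange(n_rows).reshape(-1, 1)  # stand-in for the feature matrix

train_part, test_part = train_test_split(dummy, test_size=0.2, random_state=4)
print(len(train_part), len(test_part))
```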

Modeling and evaluation

Wrap the steps in a function:

In [26]:

```python
from sklearn.model_selection import cross_val_score  # cross-validation score
from sklearn import metrics  # model evaluation

def build_model(model, X_test, y_test):
    # X_train and y_train come from the train/test split above
    model.fit(X_train, np.ravel(y_train))
    # predicted class probabilities
    y_proba = model.predict_proba(X_test)
    # the index of the highest probability is the predicted class
    y_pred = np.argmax(y_proba, axis=1)
    y_test = np.array(y_test).reshape(-1)

    print(f"{model} model scores:")
    print("Recall: ", metrics.recall_score(y_test, y_pred, average="macro"))
    print("Precision: ", metrics.precision_score(y_test, y_pred, average="macro"))

```

```python
# logistic regression (classification)
from sklearn.linear_model import LogisticRegression

# build the model
model_LR = LogisticRegression()

# call the function
build_model(model_LR, X_test, y_test)

# Output:
# LogisticRegression() model scores:
# Recall:  0.8247385525937151
# Precision:  0.8126617210922679
```
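The same pattern extends to several classifiers at once. The sketch below rests on stated assumptions: synthetic data from `make_classification` stands in for the steel data, the helper `evaluate` is a hypothetical variant of `build_model` that returns the scores instead of printing them, and the models use mostly library defaults rather than the settings behind the numbers reported in this article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the real feature matrix
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=6,
    n_classes=3, random_state=0,
)
Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, test_size=0.2, random_state=4)

def evaluate(model, Xtr, ytr, Xte, yte):
    """Fit the model and return (macro recall, macro precision) on the test set."""
    model.fit(Xtr, ytr)
    proba = model.predict_proba(Xte)
    # Map the argmax position back to the class label via model.classes_
    pred = model.classes_[np.argmax(proba, axis=1)]
    return (recall_score(yte, pred, average="macro"),
            precision_score(yte, pred, average="macro"))

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}

scores = {name: evaluate(m, Xtr, ytr, Xte, yte) for name, m in models.items()}
for name, (rec, prec) in scores.items():
    print(f"{name}: recall={rec:.4f}, precision={prec:.4f}")
```

Passing the training data as arguments, instead of reading it from global variables, keeps the helper reusable across datasets.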

Below, each model is built on its own:

Logistic regression

Building the model

In [28]:

```
from sklearn.linear_model import LogisticRegression  # logistic regression (classification)
from sklearn.model_selection import cross_val_score  # cross-validation score
from sklearn import metrics  # model evaluation

# build the model
model_LR = LogisticRegression()
model_LR.fit(X_train, y_train)
```

Out[28]:

LogisticRegression()

Prediction

In [29]:

```
# predicted probabilities
y_proba = model_LR.predict_proba(X_test)
y_proba[:3]
```

Out[29]:

    array([[4.83469692e-01, 4.23685363e-07, 1.08028560e-10, 3.19294899e-07,
            8.92035714e-02, 1.33695855e-02, 4.13956408e-01],
           [3.49120137e-03, 6.25018002e-03, 9.36037717e-03, 3.64702993e-01,
            1.96814910e-01, 1.35722642e-01, 2.83657697e-01],
           [1.82751269e-05, 5.55981861e-01, 3.16768568e-05, 4.90023258e-03,
            2.84504970e-03, 3.67190965e-01, 6.90319398e-02]])

In [30]:

```
# the index of the highest probability is the predicted class
y_pred = np.argmax(y_proba, axis=1)
y_pred[:3]
```

Out[30]:

array([0, 3, 1])
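This argmax step can be checked by hand on the three probability rows printed in Out[29]. The position of the largest value equals the class label here because the labels are the integers 0 to 6, which is the order scikit-learn stores in `model.classes_`.

```python
import numpy as np

# The three probability rows shown in Out[29]
proba = np.array([
    [4.83469692e-01, 4.23685363e-07, 1.08028560e-10, 3.19294899e-07,
     8.92035714e-02, 1.33695855e-02, 4.13956408e-01],
    [3.49120137e-03, 6.25018002e-03, 9.36037717e-03, 3.64702993e-01,
     1.96814910e-01, 1.35722642e-01, 2.83657697e-01],
    [1.82751269e-05, 5.55981861e-01, 3.16768568e-05, 4.90023258e-03,
     2.84504970e-03, 3.67190965e-01, 6.90319398e-02],
])

pred = np.argmax(proba, axis=1)  # column index of the row maximum
print(pred)                      # [0 3 1]
print(proba.sum(axis=1))         # each row sums to about 1
```

The first row is a close call: class 0 wins with about 0.48 against about 0.41 for class 6, which fits the confusion between those two classes in the matrix below.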

Evaluation

In [31]:

```
# confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
confusion_matrix
```

Out[31]:

    array([[114,   6,   0,   0,   7,  11,  10],
           [  0, 114,   1,   0,   2,   4,   4],
           [  0,   1, 130,   0,   0,   0,   2],
           [  0,   0,   0, 140,   0,   1,   0],
           [  1,   0,   0,   0, 120,   3,   6],
           [ 13,   3,   2,   0,   3,  84,  11],
           [ 21,  13,   9,   2,   9,  25,  71]])

In [32]:

y_pred.shape

Out[32]:

(943,)

In [33]:

y_test = np.array(y_test).reshape(943)

In [34]:

    print("Recall: ", metrics.recall_score(y_test, y_pred, average="macro"))
    print("Precision: ", metrics.precision_score(y_test, y_pred, average="macro"))

    Recall:  0.8247385525937151
    Precision:  0.8126617210922679
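With `average="macro"`, the metric is computed per class and then averaged with equal weight for each class. As a worked check, the same two numbers can be recovered from the confusion matrix in Out[31], where rows are true classes and columns are predicted classes:

```python
import numpy as np

# Confusion matrix from Out[31]: rows = true class, columns = predicted class
cm = np.array([
    [114,   6,   0,   0,   7,  11,  10],
    [  0, 114,   1,   0,   2,   4,   4],
    [  0,   1, 130,   0,   0,   0,   2],
    [  0,   0,   0, 140,   0,   1,   0],
    [  1,   0,   0,   0, 120,   3,   6],
    [ 13,   3,   2,   0,   3,  84,  11],
    [ 21,  13,   9,   2,   9,  25,  71],
])

correct = np.diag(cm)
recall_per_class = correct / cm.sum(axis=1)     # correct / all samples of that true class
precision_per_class = correct / cm.sum(axis=0)  # correct / all predictions of that class

macro_recall = recall_per_class.mean()
macro_precision = precision_per_class.mean()
print(f"{macro_recall:.4f} {macro_precision:.4f}")  # 0.8247 0.8127
```

The per-class values also show where the model struggles: class 6 (Other_Faults) has a recall of only 71/150, about 0.47, which pulls the macro average down.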

Random forest

SVR

Decision tree

Neural network

GBDT

```python
from sklearn.ensemble import GradientBoostingClassifier

gbdt = GradientBoostingClassifier(
    loss='log_loss',  # named 'deviance' in scikit-learn versions before 1.1
    learning_rate=1,
    n_estimators=5,
    subsample=1.0,
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=2,
    init=None,
    random_state=None,
    max_features=None,
    verbose=0,
    max_leaf_nodes=None,
    warm_start=False
)

gbdt.fit(X_train, y_train)

# predicted probabilities
y_proba = gbdt.predict_proba(X_test)

# index of the highest probability
y_pred = np.argmax(y_proba, axis=1)

print("Recall: ", metrics.recall_score(y_test, y_pred, average="macro"))
print("Precision: ", metrics.precision_score(y_test, y_pred, average="macro"))

# Output:
# Recall:  0.9034547294196564
# Precision:  0.9000750791353891
```

LightGBM

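Results

| Model | Recall | Precision |
| ------------ | ------- | --------- |
| Logistic regression | 0.82473 | 0.8126 |
| Random forest | 0.9176 | 0.9149 |
| SVR | 0.8897 | 0.8856 |
| Decision tree | 0.8698 | 0.8646 |
| Neural network | 0.8908 | 0.8863 |
| GBDT | 0.9034 | 0.9 |
| LightGBM | 0.9363 | 0.9331 |

The results above are clear:

The ensemble methods (LightGBM, GBDT, random forest) outperform the other models on both recall and precision
LightGBM performs best, with a recall of 0.9363 and a precision of 0.9331!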