Machine Learning - Feature Selection - Random Forest

Tags: machine learning, python

Section I: Code Bundle
  • Part 1: Feature Importance Sorted via Random Forest

Code

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

#Section 1: Configure matplotlib (figure resolution and font)
plt.rcParams['figure.dpi']=200
plt.rcParams['savefig.dpi']=200
font = {'family': 'Times New Roman',
        'weight': 'light'}
plt.rc("font", **font)

#Section 2: Load data and split it into train/test dataset
wine=datasets.load_wine()
X,y=wine.data,wine.target
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1,stratify=y)

#Section 3: Select features via Random Forest
feat_labels=wine.feature_names
forest=RandomForestClassifier(n_estimators=500,random_state=1)
forest.fit(X_train,y_train)

importances=forest.feature_importances_
indices=np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f+1,30,feat_labels[indices[f]],importances[indices[f]]))

plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]),importances[indices],align='center')
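#Label each bar with the feature it represents (same descending order as the bars)
plt.xticks(range(X_train.shape[1]),[feat_labels[i] for i in indices],rotation=90)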
plt.xlim([-1,X_train.shape[1]])
plt.tight_layout()
plt.savefig('./fig1.png')
plt.show()

Results
[Figure: bar chart of the feature importances in descending order, saved as fig1.png]

 1) proline                        0.179927
 2) color_intensity                0.153158
 3) flavanoids                     0.146123
 4) alcohol                        0.138224
 5) od280/od315_of_diluted_wines   0.114818
 6) hue                            0.077525
 7) total_phenols                  0.058236
 8) malic_acid                     0.030856
 9) alcalinity_of_ash              0.030000
10) proanthocyanins                0.025713
11) magnesium                      0.025135
12) nonflavanoid_phenols           0.011548
13) ash                            0.008738
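
The ranking above comes from `forest.feature_importances_`, the impurity-based importances that scikit-learn averages over the individual trees and normalizes to sum to 1. The following is a minimal sketch, not part of the original post, that reproduces this average from `forest.estimators_` and also reports the spread across trees. The names `per_tree`, `mean_imp`, and `std_imp` are illustrative, and a smaller forest of 100 trees is assumed only to keep the run short.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same data and split as in the listing above
wine = datasets.load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Assumption: 100 trees instead of 500, purely to shorten the run
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

# One row of importances per fitted tree, one column per feature
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mean_imp = per_tree.mean(axis=0)   # average importance per feature
std_imp = per_tree.std(axis=0)     # how much the trees disagree

# Check that the forest-level attribute is this per-tree average
total = forest.feature_importances_.sum()
matches = np.allclose(mean_imp, forest.feature_importances_)
print("Sum of importances:", total, "| matches per-tree mean:", matches)

# Print features from most to least important, with the spread across trees
for i in np.argsort(mean_imp)[::-1]:
    print("%-30s %.4f +/- %.4f" % (wine.feature_names[i], mean_imp[i], std_imp[i]))
```

A large spread relative to the mean suggests that a feature's rank is not stable from tree to tree, which is worth knowing before a hard threshold is applied to the importances.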
  • Part 2: The Application of SelectFromModel

Code

#Section 1: Feature selection with SelectFromModel
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.ensemble import RandomForestClassifier

#Section 2: Configure matplotlib (figure resolution and font)
plt.rcParams['figure.dpi']=200
plt.rcParams['savefig.dpi']=200
font = {'family': 'Times New Roman',
        'weight': 'light'}
plt.rc("font", **font)

#Section 3: Load data and split it into train/test dataset
wine=datasets.load_wine()
X,y=wine.data,wine.target
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1,stratify=y)

#Section 4: Select features via Random Forest
feat_labels=wine.feature_names
forest=RandomForestClassifier(n_estimators=500,random_state=1)

from sklearn.feature_selection import SelectFromModel

sfm=SelectFromModel(forest,threshold=0.1)
X_selected=sfm.fit_transform(X_train,y_train)
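#shape[1] is the number of retained features (shape[0] would be the number of training samples)
print("Number of features that meet the threshold criterion:",X_selected.shape[1])

#SelectFromModel fits a clone of forest, so fit the original estimator before reading its importances
forest.fit(X_train,y_train)
importances=forest.feature_importances_
indices=np.argsort(importances)[::-1]
for f in range(X_selected.shape[1]):
    print("%2d) %-*s %f" % (f+1,30,feat_labels[indices[f]],importances[indices[f]]))

Output

Number of features that meet the threshold criterion: 5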
 1) proline                        0.179927
 2) color_intensity                0.153158
 3) flavanoids                     0.146123
 4) alcohol                        0.138224
 5) od280/od315_of_diluted_wines   0.114818
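
To see which columns `SelectFromModel` kept without re-sorting the importances by hand, the fitted selector exposes a boolean mask through `get_support()`. The sketch below is an illustrative addition that uses the same data split as the listing above. It also compares test accuracy with all features against the reduced set. The names `selected_names`, `acc_full`, and `acc_reduced` are assumptions introduced here, and 100 trees are used only to keep the run short.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Same data and split as in the listing above
wine = datasets.load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Fit the selector; features with importance >= 0.1 are kept
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=1), threshold=0.1)
sfm.fit(X_train, y_train)

# Boolean mask over the original columns: True means the feature was kept
mask = sfm.get_support()
selected_names = [name for name, keep in zip(wine.feature_names, mask) if keep]
print("Selected features:", selected_names)

# Apply the same column selection to both splits
X_train_sel = sfm.transform(X_train)
X_test_sel = sfm.transform(X_test)

# Compare a forest trained on all features with one trained on the subset
full = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
reduced = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train_sel, y_train)
acc_full = full.score(X_test, y_test)
acc_reduced = reduced.score(X_test_sel, y_test)
print("Test accuracy, all features: %.3f" % acc_full)
print("Test accuracy, selected features: %.3f" % acc_reduced)
```

Fitting the selector on the training split only, then calling `transform` on the test split, keeps information from the test data out of the feature selection step.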

References
Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd Edition. Nanjing: Southeast University Press, 2018.

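Copyright notice: This is an original article by Santorinisu, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Article link: https://blog.csdn.net/Santorinisu/article/details/104423381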