Training_model(2)

Two data files have already been cleaned and processed.

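Below, the features from these two datasets are merged and a model is trained with Light Gradient Boosting Machine (LightGBM). The previous prediction score, using only the customer data, was 0.734; this time the customers' credit report information is added.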

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import gc
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

load data

train_data = pd.read_csv('data/no_select_train.csv')
test_data = pd.read_csv('data/no_select_test.csv')
bureau_data = pd.read_csv('data/bureau_features.csv')
bureau_data.shape
(305811, 94)
train_data.shape
(307511, 268)
test_data.shape
(48744, 267)

The customers' historical credit records (bureau_data) are newly added.

Build Model

def model(features, test_features, n_folds = 10):
    # Extract the ID columns
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    # TARGET
    labels = features[['TARGET']].astype(int)
    # Drop the ID and TARGET columns
    features = features.drop(['SK_ID_CURR', 'TARGET'], axis = 1)
    test_features = test_features.drop(['SK_ID_CURR'], axis = 1)
    # Feature names
    feature_names = list(features.columns)
    # K-fold cross-validation (10 folds by default)
    k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)
    # test predictions
    test_predictions = np.zeros(test_features.shape[0])
    # validation predictions
    out_of_fold = np.zeros(features.shape[0])
    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))
    # Record the scores of each fold
    valid_scores = []
    train_scores = []

    # Iterate through each fold
    count = 0
    for train_indices, valid_indices in k_fold.split(features):

        # KFold yields positional indices, so select rows with .iloc
        # Training data for the fold
        train_features = features.iloc[train_indices, :]
        train_labels = labels.iloc[train_indices, :]
        # Validation data for the fold
        valid_features = features.iloc[valid_indices, :]
        valid_labels = labels.iloc[valid_indices, :]
        # Create the model
        model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary', 
                                   class_weight = 'balanced', learning_rate = 0.05, 
                                   reg_alpha = 0.1, reg_lambda = 0.1, 
                                   subsample = 0.8, n_jobs = -1, random_state = 50)

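        # Train the model with early stopping on the validation AUC.
        # Note: these fit() arguments follow the LightGBM 3.x scikit-learn API; LightGBM >= 4.0
        # drops early_stopping_rounds/verbose in favor of
        # callbacks = [lgb.early_stopping(100), lgb.log_evaluation(200)].
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = 'auto',
                  early_stopping_rounds = 100, verbose = 200)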

        # Record the best iteration
        best_iteration = model.best_iteration_

        # Test-set predictions, averaged over the folds
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1]/n_folds

        # Out-of-fold predictions for the validation rows
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        # feature importance
        feature_importance_values += model.feature_importances_ / n_folds
        # Record the best score
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']

        valid_scores.append(valid_score)
        train_scores.append(train_score)

        # Clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        count += 1
        print("%d_fold is over"%count)
    # Make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    # feature importance
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
    # Overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)

    # Add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))

    # dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    # Dataframe of validation scores
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores}) 

    return submission, metrics, feature_importances
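The function above fills `out_of_fold` one fold at a time and then scores the pooled predictions, which is why the 'overall' validation AUC is not simply the mean of the per-fold values. A minimal sketch of that idea, assuming synthetic data from `make_classification` and a `LogisticRegression` stand-in rather than the LightGBM model used in this post:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic binary-classification data standing in for the real features (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=50)

k_fold = KFold(n_splits=5, shuffle=True, random_state=50)
out_of_fold = np.zeros(len(y))
fold_scores = []

for train_idx, valid_idx in k_fold.split(X):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    # Each row is predicted exactly once, by the model that did not train on it.
    out_of_fold[valid_idx] = clf.predict_proba(X[valid_idx])[:, 1]
    fold_scores.append(roc_auc_score(y[valid_idx], out_of_fold[valid_idx]))

# AUC over all pooled out-of-fold predictions (the 'overall' row in metrics).
overall_auc = roc_auc_score(y, out_of_fold)
print(fold_scores, overall_auc)
```

The pooled AUC ranks predictions from different fold models against each other, so it can differ slightly from the average of the fold scores.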

Feature aggregation

Left join

train_data = train_data.merge(bureau_data, on = 'SK_ID_CURR', how = 'left')
train_data.shape
(307511, 361)
train_data.TARGET.value_counts()
0    282686
1     24825
Name: TARGET, dtype: int64

With a left join, the number of samples stays unchanged.
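A small sketch of why the row count is preserved, using two made-up frames (the column `bureau_loan_count` and all values are hypothetical, not taken from the real files). It assumes, as with the aggregated bureau features here, that the right-hand table has at most one row per `SK_ID_CURR`:

```python
import pandas as pd

# Toy stand-ins for the application table and the bureau feature table (made-up values).
app = pd.DataFrame({'SK_ID_CURR': [1, 2, 3], 'TARGET': [0, 1, 0]})
bureau = pd.DataFrame({'SK_ID_CURR': [1, 3], 'bureau_loan_count': [4, 2]})

# how='left' keeps every row of `app`; IDs missing from `bureau` get NaN.
# validate='one_to_one' raises MergeError if a key is duplicated on either side.
merged = app.merge(bureau, on='SK_ID_CURR', how='left', validate='one_to_one')

print(merged.shape)                              # same number of rows as `app`
print(merged['bureau_loan_count'].isna().sum())  # customers without a bureau record
```

If the right-hand table had several rows per customer, a left join would duplicate the matching rows on the left, so the uniqueness check is worth keeping.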

test_data = test_data.merge(bureau_data, on = 'SK_ID_CURR', how = 'left')
test_data.shape
(48744, 360)
submit5,metrics,feature_importance = model(train_data, test_data, n_folds= 10)
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.811087   valid's auc: 0.776033
[400]   train's auc: 0.84353    valid's auc: 0.776933
Early stopping, best iteration is:
[405]   train's auc: 0.844392   valid's auc: 0.77704
1_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.811909   valid's auc: 0.763234
[400]   train's auc: 0.844175   valid's auc: 0.763789
Early stopping, best iteration is:
[310]   train's auc: 0.830712   valid's auc: 0.763865
2_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.811204   valid's auc: 0.771043
[400]   train's auc: 0.844348   valid's auc: 0.772213
Early stopping, best iteration is:
[375]   train's auc: 0.840635   valid's auc: 0.772462
3_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.811886   valid's auc: 0.772251
[400]   train's auc: 0.84428    valid's auc: 0.773033
Early stopping, best iteration is:
[411]   train's auc: 0.845925   valid's auc: 0.773179
4_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.811444   valid's auc: 0.772315
Early stopping, best iteration is:
[277]   train's auc: 0.825443   valid's auc: 0.773191
5_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.810696   valid's auc: 0.779534
[400]   train's auc: 0.844023   valid's auc: 0.780508
Early stopping, best iteration is:
[321]   train's auc: 0.832286   valid's auc: 0.781193
6_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.810864   valid's auc: 0.776371
[400]   train's auc: 0.84338    valid's auc: 0.777393
Early stopping, best iteration is:
[447]   train's auc: 0.850437   valid's auc: 0.777718
7_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.811775   valid's auc: 0.76751
[400]   train's auc: 0.844592   valid's auc: 0.76855
Early stopping, best iteration is:
[385]   train's auc: 0.842334   valid's auc: 0.768839
8_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.811555   valid's auc: 0.773908
[400]   train's auc: 0.844837   valid's auc: 0.776798
Early stopping, best iteration is:
[438]   train's auc: 0.850265   valid's auc: 0.776938
9_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.812462   valid's auc: 0.76771
[400]   train's auc: 0.845181   valid's auc: 0.768348
Early stopping, best iteration is:
[393]   train's auc: 0.844196   valid's auc: 0.768489
10_fold is over
submit5.head()
 
   SK_ID_CURR    TARGET
0      100001  0.264254
1      100005  0.554692
2      100013  0.220868
3      100028  0.252361
4      100038  0.705160
metrics
 
       fold     train     valid
0         0  0.844392  0.777040
1         1  0.830712  0.763865
2         2  0.840635  0.772462
3         3  0.845925  0.773179
4         4  0.825443  0.773191
5         5  0.832286  0.781193
6         6  0.850437  0.777718
7         7  0.842334  0.768839
8         8  0.850265  0.776938
9         9  0.844196  0.768489
10  overall  0.840663  0.773266
submit5.to_csv('submit5.csv', index = False)


Feature importance

feature_importance = feature_importance.sort_values(by = 'importance')
feature_importance = feature_importance.set_index(['feature'])
feature_importance.plot(kind = 'barh', figsize = (10, 100))
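With several hundred features, the full bar chart is very tall. One possible alternative is to normalize the importances and keep only the largest few before plotting. A minimal sketch, where the feature names, values, and the cut-off `top_n` are illustrative assumptions rather than results from this run:

```python
import pandas as pd

# Hypothetical importances; the real frame is the one returned by model().
feature_importance = pd.DataFrame({
    'feature': ['EXT_SOURCE_1', 'AMT_CREDIT', 'FLAG_MOBIL', 'DAYS_BIRTH'],
    'importance': [120.0, 80.0, 0.0, 200.0],
})

top_n = 3  # illustrative cut-off
total = feature_importance['importance'].sum()
top = (feature_importance
       .assign(importance_normalized=lambda d: d['importance'] / total)
       .nlargest(top_n, 'importance')
       .set_index('feature'))
print(top)
# top['importance_normalized'].sort_values().plot(kind='barh') then gives a chart of readable size.
```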


Features with importance == 0

feature_importance = feature_importance.reset_index()
# Features whose importance is exactly 0
weak_importance_features = list(feature_importance[feature_importance['importance'] == 0].feature)
weak_importance_features
['ORGANIZATION_TYPE_Industry: type 10',
 'FLAG_DOCUMENT_21',
 'FLAG_DOCUMENT_20',
 'FLAG_DOCUMENT_19',
 'child_to_non_child_ratio',
 'FLAG_DOCUMENT_17',
 'EMERGENCYSTATE_MODE_No',
 'ORGANIZATION_TYPE_Industry: type 13',
 'FLAG_DOCUMENT_12',
 'FLAG_DOCUMENT_10',
 'ORGANIZATION_TYPE_XNA',
 'FLAG_DOCUMENT_7',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_4',
 'FLAG_DOCUMENT_2',
 'ORGANIZATION_TYPE_Trade: type 1',
 'ORGANIZATION_TYPE_Industry: type 4',
 'ORGANIZATION_TYPE_Industry: type 6',
 'ORGANIZATION_TYPE_Religion',
 'ORGANIZATION_TYPE_Industry: type 8',
 'NAME_TYPE_SUITE_Group of people',
 'CREDIT_DAY_OVERDUE_min',
 'ORGANIZATION_TYPE_Mobile',
 'CNT_CREDIT_PROLONG_min',
 'OCCUPATION_TYPE_IT staff',
 'OCCUPATION_TYPE_HR staff',
 'ORGANIZATION_TYPE_Advertising',
 'ORGANIZATION_TYPE_Cleaning',
 'FLAG_MOBIL',
 'FLAG_EMP_PHONE',
 'NAME_INCOME_TYPE_Pensioner',
 'NAME_INCOME_TYPE_Student',
 'FLAG_CONT_MOBILE',
 'NAME_INCOME_TYPE_Businessman',
 'AMT_CREDIT_SUM_OVERDUE_min']

Drop weak features

train_data = train_data.drop(weak_importance_features, axis = 1)
test_data = test_data.drop(weak_importance_features, axis = 1)
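The same selection-and-drop step on a toy example. The frames and the feature names `f_a`, `f_b`, `f_c` are hypothetical; `errors='ignore'` is an optional safeguard, not something the original code uses, for the case where the train and test columns differ:

```python
import pandas as pd

# Made-up importance table and data frames for illustration only.
feature_importance = pd.DataFrame({
    'feature': ['f_a', 'f_b', 'f_c'],
    'importance': [12.5, 0.0, 3.0],
})
train = pd.DataFrame({'SK_ID_CURR': [1, 2], 'TARGET': [0, 1],
                      'f_a': [0.1, 0.2], 'f_b': [1, 1], 'f_c': [5, 6]})
test = pd.DataFrame({'SK_ID_CURR': [3], 'f_a': [0.3], 'f_c': [7]})  # f_b absent on purpose

# Features that no tree ever split on.
weak = feature_importance.loc[feature_importance['importance'] == 0, 'feature'].tolist()

# errors='ignore' skips columns that are not present instead of raising KeyError.
train = train.drop(columns=weak, errors='ignore')
test = test.drop(columns=weak, errors='ignore')
print(weak, list(train.columns), list(test.columns))
```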

Training model

submit5_1,metrics, feature_importance = model(train_data, test_data, n_folds= 10)
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.811087   valid's auc: 0.776033
[400]   train's auc: 0.84353    valid's auc: 0.776933
Early stopping, best iteration is:
[405]   train's auc: 0.844392   valid's auc: 0.77704
1_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.811909   valid's auc: 0.763234
[400]   train's auc: 0.844175   valid's auc: 0.763789
Early stopping, best iteration is:
[310]   train's auc: 0.830712   valid's auc: 0.763865
2_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.811204   valid's auc: 0.771043
[400]   train's auc: 0.844348   valid's auc: 0.772213
Early stopping, best iteration is:
[375]   train's auc: 0.840635   valid's auc: 0.772462
3_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.811886   valid's auc: 0.772251
[400]   train's auc: 0.84428    valid's auc: 0.773033
Early stopping, best iteration is:
[411]   train's auc: 0.845925   valid's auc: 0.773179
4_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.811444   valid's auc: 0.772315
Early stopping, best iteration is:
[277]   train's auc: 0.825443   valid's auc: 0.773191
5_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.810696   valid's auc: 0.779534
[400]   train's auc: 0.844023   valid's auc: 0.780508
Early stopping, best iteration is:
[321]   train's auc: 0.832286   valid's auc: 0.781193
6_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.810864   valid's auc: 0.776371
[400]   train's auc: 0.84338    valid's auc: 0.777393
Early stopping, best iteration is:
[447]   train's auc: 0.850437   valid's auc: 0.777718
7_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.811775   valid's auc: 0.76751
[400]   train's auc: 0.844592   valid's auc: 0.76855
Early stopping, best iteration is:
[385]   train's auc: 0.842334   valid's auc: 0.768839
8_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.811555   valid's auc: 0.773908
[400]   train's auc: 0.844837   valid's auc: 0.776798
Early stopping, best iteration is:
[438]   train's auc: 0.850265   valid's auc: 0.776938
9_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.812462   valid's auc: 0.76771
[400]   train's auc: 0.845181   valid's auc: 0.768348
Early stopping, best iteration is:
[475]   train's auc: 0.855547   valid's auc: 0.768638
10_fold is over
submit5_1.head()
   SK_ID_CURR    TARGET
0      100001  0.264469
1      100005  0.554822
2      100013  0.220767
3      100028  0.252001
4      100038  0.705017
submit5_1.to_csv('submit5_1.csv',index = False)


metrics
       fold     train     valid
0         0  0.844392  0.777040
1         1  0.830712  0.763865
2         2  0.840635  0.772462
3         3  0.845925  0.773179
4         4  0.825443  0.773191
5         5  0.832286  0.781193
6         6  0.850437  0.777718
7         7  0.842334  0.768839
8         8  0.850265  0.776938
9         9  0.855547  0.768638
10  overall  0.841798  0.773265
feature_importance.sort_values(by = 'importance').set_index('feature').plot(kind = 'barh', figsize = (10, 100))

feature importance

Feature aggregation (inner join)

train_data = pd.read_csv('data/no_select_train.csv')
test_data = pd.read_csv('data/no_select_test.csv')
bureau_data = pd.read_csv('data/bureau_features.csv')
train_data = train_data.merge(bureau_data, on = 'SK_ID_CURR', how = 'inner')
test_data = test_data.merge(bureau_data, on = 'SK_ID_CURR', how = 'left')
train_data.shape
(263491, 361)
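The inner join keeps 263491 of the 307511 training rows, so 44020 customers have no match in the bureau features. One way to count such rows without discarding them is the `indicator` option of `merge`. A minimal sketch on made-up frames (the column `bureau_loan_count` and all values are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the application table and the bureau feature table.
app = pd.DataFrame({'SK_ID_CURR': [1, 2, 3, 4], 'TARGET': [0, 1, 0, 0]})
bureau = pd.DataFrame({'SK_ID_CURR': [1, 3, 9], 'bureau_loan_count': [4, 2, 7]})

# indicator=True adds a `_merge` column: 'both' or 'left_only' for a left join.
flagged = app.merge(bureau, on='SK_ID_CURR', how='left', indicator=True)
n_unmatched = int((flagged['_merge'] == 'left_only').sum())

# An inner join drops exactly those unmatched rows (keys are unique on both sides here).
inner = app.merge(bureau, on='SK_ID_CURR', how='inner')
print(len(app) - len(inner), n_unmatched)
```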
submit6, metrics, feature_importance = model(train_data, test_data, n_folds=10)
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.822727   valid's auc: 0.770619
[400]   train's auc: 0.859328   valid's auc: 0.77186
Early stopping, best iteration is:
[323]   train's auc: 0.846435   valid's auc: 0.77214
1_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.822  valid's auc: 0.776964
Early stopping, best iteration is:
[247]   train's auc: 0.83182    valid's auc: 0.777768
2_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.822354   valid's auc: 0.779729
[400]   train's auc: 0.858779   valid's auc: 0.780386
Early stopping, best iteration is:
[311]   train's auc: 0.843865   valid's auc: 0.78096
3_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.822429   valid's auc: 0.77889
[400]   train's auc: 0.859017   valid's auc: 0.779426
Early stopping, best iteration is:
[368]   train's auc: 0.853733   valid's auc: 0.779905
4_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.822714   valid's auc: 0.776522
Early stopping, best iteration is:
[185]   train's auc: 0.819432   valid's auc: 0.776777
5_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.822052   valid's auc: 0.781852
[400]   train's auc: 0.858622   valid's auc: 0.782497
Early stopping, best iteration is:
[394]   train's auc: 0.857795   valid's auc: 0.782713
6_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.822871   valid's auc: 0.766198
Early stopping, best iteration is:
[288]   train's auc: 0.840463   valid's auc: 0.76693
7_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.82312    valid's auc: 0.764778
Early stopping, best iteration is:
[297]   train's auc: 0.841792   valid's auc: 0.765418
8_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.821681   valid's auc: 0.783672
[400]   train's auc: 0.859008   valid's auc: 0.785079
Early stopping, best iteration is:
[406]   train's auc: 0.859982   valid's auc: 0.785162
9_fold is over
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.823228   valid's auc: 0.771831
[400]   train's auc: 0.859472   valid's auc: 0.772901
Early stopping, best iteration is:
[385]   train's auc: 0.857328   valid's auc: 0.773237
10_fold is over
submit6.to_csv('submit6.csv',index = False)


metrics
       fold     train     valid
0         0  0.846435  0.772140
1         1  0.831820  0.777768
2         2  0.843865  0.780960
3         3  0.853733  0.779905
4         4  0.819432  0.776777
5         5  0.857795  0.782713
6         6  0.840463  0.766930
7         7  0.841792  0.765418
8         8  0.859982  0.785162
9         9  0.857328  0.773237
10  overall  0.845265  0.776006

Feature importance

feature_importance = feature_importance.sort_values(by = 'importance')

feature_importance = feature_importance.set_index('feature')

feature_importance.plot(kind = 'barh', figsize = (10, 100))


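Summary:

  A left join keeps the number of training samples unchanged, although samples with no record in bureau.csv end up with many missing values. Even so, after adding the customers' historical credit reports the score rose from 0.723 to 0.773. Features with zero importance can be dropped without hurting the model's performance, but dropping them brings no improvement either. With an inner join, the training set loses more than 40,000 samples (307511 rows down to 263491), which does affect the score: it dropped by 0.002. Next, I will bring in more data: the customers' cash and POS loan records in POS_CASH_balance.csv.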

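Copyright notice: this is an original article by u014281392, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Article link: https://blog.csdn.net/u014281392/article/details/81229260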
