对于想从事数据分析行业的小伙伴来说,参加数据竞赛可以有效提高职场竞争力,下面我将讲解一下我前几天参加的阿里天池的数据竞赛:O2O优惠券使用预测,该比赛作为长期赛,可以随时提交结果,查看自己的排名。
本赛题的比赛背景:随着移动设备的完善和普及,移动互联网+各行各业进入了高速发展阶段,这其中以O2O消费最为吸引眼球。O2O行业天然关联数亿消费者,各类APP每天记录了超过百亿条用户行为和位置记录,因而成为大数据科研和商业化运营的最佳结合点之一。 以优惠券盘活老用户或吸引新客户进店消费是O2O的一种重要营销方式。然而随机投放的优惠券对多数用户造成无意义的干扰。对商家而言,滥发的优惠券可能降低品牌声誉,同时难以估算营销成本。 个性化投放是提高优惠券核销率的重要技术,它可以让具有一定偏好的消费者得到真正的实惠,同时赋予商家更强的营销能力。本次大赛为参赛选手提供了O2O场景相关的丰富数据,希望参赛选手通过分析建模,精准预测用户是否会在规定时间内使用相应优惠券。
本赛题提供用户在2016年1月1日至2016年6月30日之间真实线上线下消费行为,预测用户在2016年7月领取优惠券后15天以内的使用情况。数据集有三个,分别是ccf_offline_stage1_train.csv(用户线下消费和优惠券领取行为)、ccf_online_stage1_train.csv(用户线上点击/消费和优惠券领取行为)、ccf_offline_stage1_test_revised.csv(用户O2O线下优惠券使用预测样本)
首先导入各种库:
import os,sys,pickle
import numpy as np
import pandas as pd
from datetime import date
from sklearn.model_selection import KFold,train_test_split,StratifiedKFold,cross_val_score,GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier,LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss,roc_auc_score,auc,roc_curve
from sklearn.preprocessing import MinMaxScaler
读取文件:
# 导入数据
dfoff = pd.read_csv('ccf_offline_stage1_train.csv')
dfon = pd.read_csv('ccf_online_stage1_train.csv')
dftest = pd.read_csv('ccf_offline_stage1_test_revised.csv')
pd.set_option('display.max_columns',None)
观察一下用户线下消费和优惠券使用情况:
dfoff.head()
可以看到User_id表示用户的ID,Merchant_id表示商户的ID,Coupon_id表示优惠券ID,Discount_rate表示折扣率,Distance表示用户经常活动的地点离该商户的最近门店距离,Date_received表示用户领取优惠券的时间,Date表示用户消费的日期。
简单统计一下用户使用优惠券的情况:
dfoff = dfoff.fillna('null')
dftest = dftest.fillna('null')
print('有优惠券,购买商品:%d' % dfoff[(dfoff['Date_received']!='null')&(dfoff['Date']!='null')].shape[0])
print('有优惠券,未购商品:%d' % dfoff[(dfoff['Date_received']!='null')&(dfoff['Date']=='null')].shape[0])
print('无优惠券,购买商品:%d' % dfoff[(dfoff['Date_received']=='null')&(dfoff['Date']!='null')].shape[0])
print('无优惠券,未购商品:%d' % dfoff[(dfoff['Date_received']=='null')&(dfoff['Date']=='null')].shape[0])
可以发现在数据集中,通过使用优惠券来购买商品的只有75382条数据(感觉优惠券没什么吸引力...):
查看一下折扣率的类型:
# 打折率Discount_rate
print('Discount_rate类型:\n',dfoff['Discount_rate'].unique())
发现打折率分3种情况:
'null'表示没有打折
[0,1]表示折扣率
x:y表示满x减y
构建函数对折扣率提取特征:
'''
处理方式:
打折类型:getDiscountType()
折扣率:convertRate()
满多少:getDiscountMan()
减多少:getDiscountJian()
'''
def getDiscountType(row):
if row == 'null':
return 'null'
elif ':' in row:
return 1
else:
return 0
def convertRate(row):
if row == 'null':
return 1.0
elif ':' in row:
rows = row.split(':')
return 1.0 - float(rows[1])/float(rows[0])
else:
return float(row)
def getDiscountMan(row):
if ':' in row:
rows = row.split(':')
return int(rows[0])
else:
return 0
def getDiscountJian(row):
if ':' in row:
rows = row.split(':')
return int(rows[1])
else:
return 0
def processData(df):
# convert discount_rate
df['discount_type'] = df['Discount_rate'].apply(getDiscountType)
df['discount_rate'] = df['Discount_rate'].apply(convertRate)
df['discount_man'] = df['Discount_rate'].apply(getDiscountMan)
df['discount_jian'] = df['Discount_rate'].apply(getDiscountJian)
print(df['discount_rate'].unique())
return df
执行函数,效果如下:
dfoff =processData(dfoff)
dftest = processData(dftest)
dfoff.head()
观察Distance数据:
# 距离Distance
print('Distance类型:',dfoff['Distance'].unique())
将空值'null'用-1代替,并将距离由字符串转换为整形:
# 将距离str转为int
dfoff['distance'] = dfoff['Distance'].replace('null',-1).astype(int)
print(dfoff['distance'].unique())
dftest['distance'] = dftest['Distance'].replace('null',-1).astype(int)
print(dftest['distance'].unique())
筛选出领券日期,并排序:
[20160101.0, 20160102.0, 20160103.0, 20160104.0, 20160105.0, 20160106.0, 20160107.0, 20160108.0, 20160109.0, 20160110.0, 20160111.0, 20160112.0, 20160113.0, 20160114.0, 20160115.0, 20160116.0, 20160117.0, 20160118.0, 20160119.0, 20160120.0, 20160121.0, 20160122.0, 20160123.0, 20160124.0, 20160125.0, 20160126.0, 20160127.0, 20160128.0, 20160129.0, 20160130.0, 20160131.0, 20160201.0, 20160202.0, 20160203.0, 20160204.0, 20160205.0, 20160206.0, 20160207.0, 20160208.0, 20160209.0, 20160210.0, 20160211.0, 20160212.0, 20160213.0, 20160214.0, 20160215.0, 20160216.0, 20160217.0, 20160218.0, 20160219.0, 20160220.0, 20160221.0, 20160222.0, 20160223.0, 20160224.0, 20160225.0, 20160226.0, 20160227.0, 20160228.0, 20160229.0, 20160301.0, 20160302.0, 20160303.0, 20160304.0, 20160305.0, 20160306.0, 20160307.0, 20160308.0, 20160309.0, 20160310.0, 20160311.0, 20160312.0, 20160313.0, 20160314.0, 20160315.0, 20160316.0, 20160317.0, 20160318.0, 20160319.0, 20160320.0, 20160321.0, 20160322.0, 20160323.0, 20160324.0, 20160325.0, 20160326.0, 20160327.0, 20160328.0, 20160329.0, 20160330.0, 20160331.0, 20160401.0, 20160402.0, 20160403.0, 20160404.0, 20160405.0, 20160406.0, 20160407.0, 20160408.0, 20160409.0, 20160410.0, 20160411.0, 20160412.0, 20160413.0, 20160414.0, 20160415.0, 20160416.0, 20160417.0, 20160418.0, 20160419.0, 20160420.0, 20160421.0, 20160422.0, 20160423.0, 20160424.0, 20160425.0, 20160426.0, 20160427.0, 20160428.0, 20160429.0, 20160430.0, 20160501.0, 20160502.0, 20160503.0, 20160504.0, 20160505.0, 20160506.0, 20160507.0, 20160508.0, 20160509.0, 20160510.0, 20160511.0, 20160512.0, 20160513.0, 20160514.0, 20160515.0, 20160516.0, 20160517.0, 20160518.0, 20160519.0, 20160520.0, 20160521.0, 20160522.0, 20160523.0, 20160524.0, 20160525.0, 20160526.0, 20160527.0, 20160528.0, 20160529.0, 20160530.0, 20160531.0, 20160601.0, 20160602.0, 20160603.0, 20160604.0, 20160605.0, 20160606.0, 20160607.0, 20160608.0, 20160609.0, 20160610.0, 20160611.0, 20160612.0, 20160613.0, 20160614.0, 20160615.0]
筛选出消费日期,并排序:
date_buy = dfoff['Date'].unique()
date_buy = sorted(date_buy[date_buy != 'null'])
领取优惠券与消费的时间范围:
print('优惠券收到日期从',date_received[0],'到',date_received[-1])
print('消费日期从',date_buy[0],'到',date_buy[-1])
领券日期可以分解出不同的特征,比如周末的时候领取优惠券,可能会印象深刻,有消费欲望,这里用weekday_type来区分,同时用one-hot编码来对日期提取特征(one hot编码是将类别变量转换为机器学习算法易于利用的一种形式的过程):
'''
关于领券日期的特征:
weekday : {null,1,2,3,4,5,6,7}
weekday_type : {1,0}{周六和周日为1,其他为0}
Weekday_1 : {1,0,0,0,0,0,0}
Weekday_2 : {0,1,0,0,0,0,0}
Weekday_3 : {0,0,1,0,0,0,0}
Weekday_4 : {0,0,0,1,0,0,0}
Weekday_5 : {0,0,0,0,1,0,0}
Weekday_6 : {0,0,0,0,0,1,0}
Weekday_7 : {0,0,0,0,0,0,1}
'''
构造可以识别某日是星期几的函数,并执行:
def getWeekday(row):
if row == 'null':
return row
else:
# row[0:4]表示年份,row[4:6]表示月份,row[6:8]表示日期
return date(int(row[0:4]),int(row[4:6]),int(row[6:8])).weekday()+1
dfoff['weekday'] = dfoff['Date_received'].astype(str).apply(getWeekday)
dftest['weekday'] = dftest['Date_received'].astype(str).apply(getWeekday)
构造新的列weekday_type,区分是不是周末:
dfoff['weekday'] = dfoff['Date_received'].astype(str).apply(getWeekday)
dftest['weekday'] = dftest['Date_received'].astype(str).apply(getWeekday)
对dfoff和dftest进行one-hot编码:
# change weekday to one-hot encoding
weekdaycols = ['weekday_'+str(i) for i in range(1,8)]
# print(weekdaycols)
tmpdf = pd.get_dummies(dfoff['weekday'].replace('null',np.nan))
tmpdf.columns = weekdaycols
dfoff[weekdaycols] = tmpdf
tmpdf = pd.get_dummies(dftest['weekday'].replace('null',np.nan))
tmpdf.columns = weekdaycols
dftest[weekdaycols] = tmpdf
到此为止,所有特征已经构造完毕,如下所示:
'''
所有特征:
discount_rate
discount_type
discount_man
discount_jian
distance
weekday
weekday_type
weekday_1
weekday_2
weekday_3
weekday_4
weekday_5
weekday_6
weekday_7
'''
因为训练集并没有告诉我们用户到底有没有使用优惠券进行消费,所以需要我们根据领券时间与消费时间去判断,并打上标签:
#标签标注:
#Date_received=='null':表示没有领到优惠券,无需考虑,y=-1
#(Date_received!='null')&(Date-Date_received<=15):表示领取优惠券且在15天内使用,即正样本,y=1
#(Date_received!='null')&(Date=='null')|(Date-Date_received>15):表示领取优惠券未在15天内使用,即负样本,y=-1
def label(row):
if row['Date_received']=='null':
return -1
if row['Date']!='null':
td = pd.to_datetime(row['Date'],format='%Y%m%d')-pd.to_datetime(row['Date_received'],format='%Y%m%d')
if td<=pd.Timedelta(15,'D'):
return 1
return 0
dfoff['label'] = dfoff.apply(label,axis=1)
print(dfoff['label'].value_counts())
dfoff.head()
下面进入建模阶段,首先划分训练集与测试集,我们将训练集时间范围设为20160101-20160515 ,验证集时间范围设为20160516-20160615:
# 划分训练集/验证集
df = dfoff[dfoff['label']!=-1].copy()
df['Date_received'] = pd.to_datetime(df['Date_received'],format='%Y%m%d')
train = df[(df['Date_received']<'20160516')].copy()
valid = df[(df['Date_received']>='20160516')&(df['Date_received']<='20160615')].copy()
print('Train Set:\n',train['label'].value_counts())
print('Valid Set:\n',valid['label'].value_counts())
# 特征数量
original_feature = ['discount_rate','discount_type','discount_man','discount_jian',\
'distance','weekday','weekday_type']+weekdaycols
print('共有特征:',len(original_feature),'个')
print(original_feature)
接下来建立模型,我们将使用线性模型SGDClassifier,它是sklearn自带的一个库,具体参数可以在sklearn的官方文档中查询,之后使用管道机制来加快模型的运算,然后使用网格搜索来寻找最优参数:
# 建立模型
def check_model(data,predictors):
classifier = lambda:SGDClassifier(
loss='log' ,# loss function:logistic regression
penalty='elasticnet', # L1&L2
fit_intercept=True, #是否存在截距,默认存在
max_iter=100,
shuffle=True, # Whether or not the training data should be shuffled after each epoch
n_jobs=1, # The number of processors to use
class_weight=None # Weights associated with classes.If not given ,all classes are supposed to have weight one.
)
# 管道机制使得参数集在新数据集(比如测试集)上的重复使用,管道机制实现了对全部步骤的流式化封装和管理
model = Pipeline(steps=[
('ss',StandardScaler()), # transformer
('en',classifier()) # estimator
])
parameters = {
'en__alpha':[0.001,0.01,0.1], #注意en后面是两个下划线__
'en__l1_ratio':[0.001,0.01,0.1] #注意en后面是两个下划线__,l1后面是一个下划线_
}
# StratifiedKFold用法类似Kfold,但是他是分层采样,确保训练集,测试集中各类别样本的比例与原始数据集中相同
folder = StratifiedKFold(n_splits=3,shuffle=True)
# Exhaustive search over specified parameter values for an estimator
grid_search = GridSearchCV(
model,
parameters,
cv=folder,
n_jobs=-1, # -1 means using all processors
verbose=1
)
grid_search = grid_search.fit(data[predictors],data['label'])
return grid_search
进行模型训练:
# 训练
predictors = original_feature
model = check_model(train,predictors)
进行模型验证,并计算AUC值(所有优惠券的AUC的平均值):
# 对验证集中每个优惠券预测的结果计算AUC,再对所有优惠券的AUC求平均。计算AUC的时候,如果label只有一类,就直接跳过,因为AUC无法计算
y_valid_pred = model.predict_proba(valid[predictors])
valid1 = valid.copy()
valid1['pred_prob'] = y_valid_pred[:,1]
valid1.head()
# 计算AUC
vg = valid1.groupby(['Coupon_id'])
aucs = []
for i in vg:
tmpdf = i[1]
if len(tmpdf['label'].unique())!=2:
continue
fpr,tpr,thresholds = roc_curve(tmpdf['label'],tmpdf['pred_prob'],pos_label=1)
aucs.append(auc(fpr,tpr))
print(np.average(aucs))
最终验证集的AUC值是0.5309425926481126,之后对测试集进行预测,将结果上传到天池官网就可以查看成绩啦!
想要进一步提高成绩和排名,可以去数据集里提取更多的特征,多去论坛看看大神们的思路。就分享到这了,我会继续更新数据分析领域的文章,欢迎关注。