I participated in the [1st_Beginner Limited Competition] Bank Customer Targeting on SIGNATE, a data science competition site.
Since I was successfully promoted to Intermediate, I wrote this article based on my notes and reflections, hoping it makes the competition easier for more people to enter.
SIGNATE is a data science competition platform run by a Japanese company of the same name.
The Ministry of Economy, Trade and Industry, Mynavi, NTT, Sansan, and others host competitions there.
You can think of it as the Japanese version of Kaggle.
Since the forum is not very active, there does not seem to be much exchange of opinions and information among participants "yet". The forum did pick up a little toward the end of this Beginner competition, so that may change from now on.
(By the way, according to the official video, the correct intonation seems to be "Sig (↑) Ne (→) To (↓)". The same pitch pattern as the trigonometric tangent?)
This was the memorable first Beginner competition, which is scheduled to be held every month. If you exceed a fixed score threshold (AUC 0.85 this time) by absolute evaluation, rather than by competing against others, you are promoted to the next grade, Intermediate.
Please refer to the official site for details.
The organizers publish a benchmark score script.
If you run it 100 times in a for loop with a different random seed each time and average the predictions, the AUC reaches 0.854, which is enough for promotion.
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

#X_train_origin, y_train_origin, X_test_origin, categorical_features,
#and submit_df are assumed to have been prepared beforehand
output_df = pd.DataFrame()
for i in range(100):
    #Split validation data off the training data (the seed changes every loop)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_train_origin, y_train_origin,
        test_size=0.3, random_state=i, stratify=y_train_origin)
    #The model used is LightGBM (without parameter tuning)
    lgb_train = lgb.Dataset(X_train, y_train,
                            categorical_feature=categorical_features)
    lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train,
                           categorical_feature=categorical_features)
    params = {"objective": "binary"}
    model = lgb.train(
        params, lgb_train,
        valid_sets=[lgb_train, lgb_eval],
        verbose_eval=10,
        num_boost_round=1000,
        early_stopping_rounds=10,
    )
    y_pred = model.predict(X_test_origin, num_iteration=model.best_iteration)
    output_df[i] = y_pred

#Average the 100 prediction columns into the submission file
submit_df["1"] = output_df.mean(axis=1)
submit_df.to_csv('submit.csv', index=False, header=None)
__It turns out that the strategy of "averaging multiple prediction results" is very powerful.__
Just submitting the benchmark would not be interesting, so I decided to do my best to exceed an AUC of 0.860.
I roughly repeated trial and error with the following procedure.
- Group by the objective variable and calculate statistics such as the mean and median
- Visualize with histograms and bar graphs
- Calculate correlation coefficients, etc.
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_rows' ,500)
pd.set_option('display.max_columns', 100)
train_df = pd.read_csv("../../data/train.csv")
#month will be combined with day later, so convert it to a number
#December does not exist in this data
month_dict = {"jan":1,"feb":2,"mar":3,"apr":4,"may":5,"jun":6,"jul":7,"aug":8,"sep":9,"oct":10,"nov":11,"dec":12}
month_int = [month_dict[train_df["month"][i]] for i in range(len(train_df))]
train_df["month"] = month_int
#age
#I wasn't sure how to handle ages over 60, so I rolled them up:
#ages up to 59 are binned to their decade, 60 stays 60, and anything over 60 becomes 70
age_list = list(train_df["age"])
new_age_list = []
for i in range(len(age_list)):
    if age_list[i] == 60:
        new_age_list.append(60)
    elif age_list[i] > 60:
        new_age_list.append(70)
    else:
        new_age_list.append(int(age_list[i]/10)*10)
train_df["age_round"] = new_age_list
train_df.describe()
numeric_col_list = ["age","balance","duration","campaign","pdays","previous"]
categorical_col_list = [categorical_feature for categorical_feature in train_df.columns if categorical_feature not in numeric_col_list]
categorical_col_list.remove("id")
#For numeric columns, group by the objective variable and compute statistics
for target_col in numeric_col_list:
    print("\ntarget_col:", target_col)
    display(train_df.groupby("y")[target_col].describe())
#For categorical columns, group by the objective variable and count
for target_col in categorical_col_list:
    print("\ntarget_col:", target_col)
    display(train_df.groupby(["y",target_col])[target_col].count())
#Visualize and check the means and medians
#Means and medians for each value of y (0/1)
y0_df = train_df.query('y == 0')
y1_df = train_df.query('y == 1')
for target_col in numeric_col_list:
    #Means for y=0 and y=1
    y0_target_col_mean = y0_df[target_col].mean()
    y1_target_col_mean = y1_df[target_col].mean()
    #Medians for y=0 and y=1
    y0_target_col_median = y0_df[target_col].median()
    y1_target_col_median = y1_df[target_col].median()
    #Vertical axis setting:
    #use 1.1x the largest of the means and medians as the top of the axis
    #(the magnification can be anything as long as it is easy to see)
    graph_y_length = 1.1*max(y0_target_col_mean, y1_target_col_mean,
                             y0_target_col_median, y1_target_col_median)
    plt.title(target_col + ": mean")
    plt.ylabel(target_col)
    plt.ylim([0, graph_y_length])
    plt.bar(["y0_mean","y1_mean"],[y0_target_col_mean,y1_target_col_mean])
    plt.show()
    plt.title(target_col + ": median")
    plt.ylabel(target_col)
    plt.ylim([0, graph_y_length])
    plt.bar(["y0_median","y1_median"],[y0_target_col_median,y1_target_col_median],color="green")
    plt.show()
#Grasp the whole picture with histograms as well
for target_col in numeric_col_list:
    #Axis settings
    graph_x_length = 1.1*max(train_df[target_col])
    graph_y_length = len(y0_df)
    print("y0", target_col)
    plt.xlim([0, graph_x_length])
    plt.ylim([0, graph_y_length])
    plt.hist(y0_df[target_col], bins=20)
    plt.show()
    graph_y_length = graph_y_length/10
    print("y1", target_col)
    plt.xlim([0, graph_x_length])
    plt.ylim([0, graph_y_length])
    plt.hist(y1_df[target_col], bins=20)
    plt.show()
#Correlation coefficient calculation
train_df.corr()
#Pick up only the correlation coefficient of the objective variable
train_df.corr()["y"]
I also tried drawing line graphs in chronological order, but I will omit them here. In this way, I thought about which features were likely to contribute significantly to the objective variable and whether I could create any new features.
I also confirmed that the data contains implausible or impossible values, such as customers around 90 years old and nonexistent dates like February 30th. I considered cleaning or deleting them, but ended up using them as-is.
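For example, one way to find such rows is to let pandas try to assemble a real date and flag the failures (a sketch; the year 2017 is arbitrary, and a non-leap year will also flag February 29th):

```python
import pandas as pd

train_df = pd.read_csv("../../data/train.csv")
month_dict = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
              "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Impossible month/day combinations become NaT under errors="coerce"
dates = pd.to_datetime(
    {"year": 2017, "month": train_df["month"].map(month_dict), "day": train_df["day"]},
    errors="coerce")
print(train_df.loc[dates.isna(), ["month", "day"]].drop_duplicates())
```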
- Delete pdays and balance
- Create an end-of-month flag
- Create a date column that combines month and day
Based on the basic tabulation and inspection of the raw data, I engineered new features from the columns that seemed likely to strongly influence the objective variable, and deleted the ones I judged would have no effect. I also referred to the feature importance described below.
Optuna is a library that tunes hyperparameters using Bayesian optimization. For me it scored higher than grid search or random search.
I referred to the following articles:
Hyperparameter automatic optimization by LightGBM Tuner extension
The point is to use Optuna's LightGBM integration. The Optuna 2.0 already installed on my machine threw an error, so I ran this with Optuna 1.3.
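As an aside, here is what plain Optuna tuning looks like without the LightGBM Tuner integration (a minimal sketch with a hypothetical search space, just to illustrate the Bayesian-optimization loop; it is not the code used in this article):

```python
import optuna
import lightgbm
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

def objective(trial):
    # Hypothetical search space; Optuna's TPE sampler explores it
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 8, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = lightgbm.LGBMClassifier(**params)
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```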
import pandas as pd
import optuna
import functools
import warnings
warnings.simplefilter('ignore')
from sklearn.model_selection import train_test_split
#This guy is really capable
import optuna.integration.lightgbm as lgb
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
%matplotlib
#Read data
train_df = pd.read_csv("../../data/train.csv")
test_df = pd.read_csv("../../data/test.csv")
submit_df = pd.read_csv("../../data/submit_sample.csv",header = None)
#Assign a dummy objective variable so training and test rows can be told apart
test_df["y"] = -999
#Combine training data and test data
all_df = pd.concat([train_df,test_df])
del train_df , test_df
all_df.reset_index(inplace=True)
#Month-end flag
month_list = list(all_df["month"])
day_list = list(all_df["day"])
end_of_month_flag = []
for i in range(len(all_df)):
    if day_list[i] in [30,31]:
        end_of_month_flag.append(1)
    elif day_list[i] == 29 and month_list[i] in ["feb","apr","jun","sep","nov"]:
        end_of_month_flag.append(1)
    elif day_list[i] == 28 and month_list[i] == "feb":
        end_of_month_flag.append(1)
    else:
        end_of_month_flag.append(0)
all_df["end_of_month_flag"] = end_of_month_flag
#Convert the month column to a number
month_dict = {"jan":1,"feb":2,"mar":3,"apr":4,"may":5,"jun":6,"jul":7,"aug":8,"sep":9,"oct":10,"nov":11,"dec":12}
month_int = [month_dict[all_df["month"][i]] for i in range(len(all_df))]
all_df["month"] = month_int
#Create the month_day column
#For a decision tree, treating this as an int should be fine since the ordering is preserved
month_day = all_df["month"]*100 + all_df["day"]
all_df["month_day"] = month_day
#Delete unnecessary columns
del all_df["index"]
del all_df["id"]
del all_df["pdays"]
del all_df["balance"]
#Specify the column name of the category
categorical_features = ["job","marital","education","default","housing","loan","contact","month","poutcome","end_of_month_flag"]
#Label encoding
#I also tried target encoding, but the score didn't change
for col in categorical_features:
    lbl = preprocessing.LabelEncoder()
    lbl.fit(all_df[col])
    all_df[col] = lbl.transform(all_df[col])
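#As an aside, a naive sketch of the target encoding mentioned above
#(illustrative, not the exact code used; a proper version should use
#out-of-fold means to avoid target leakage):
#for col in categorical_features:
#    target_mean = all_df.loc[all_df["y"] != -999].groupby(col)["y"].mean()
#    all_df[col + "_target_enc"] = all_df[col].map(target_mean)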
#Division of training data and test data
train_df = all_df[all_df["y"] != -999]
test_df = all_df[all_df["y"] == -999]
#Divide into explanatory variables and objective variables
origin_y_train = train_df["y"]
origin_X_train = train_df.drop(["y"],axis = 1)
origin_X_test = test_df.drop(["y"],axis = 1)
output_df = pd.DataFrame()
##Running from here took more than 10 hours on my PC (10th-gen Core i5, 8GB RAM)
for i in range(100):
    #Split validation data off the training data
    X_train, X_valid, y_train, y_valid = train_test_split(
        origin_X_train, origin_y_train,
        test_size=0.3, random_state=i, stratify=origin_y_train)
    #Dataset creation
    lgb_train = lgb.Dataset(X_train, y_train,
                            categorical_feature=categorical_features, free_raw_data=False)
    lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train,
                           categorical_feature=categorical_features, free_raw_data=False)
    params = {"objective": "binary",
              "metric": "auc"}
    best_params, tuning_history = dict(), list()
    booster = lgb.train(params, lgb_train, valid_sets=[lgb_train, lgb_eval],
                        verbose_eval=0,
                        best_params=best_params,
                        tuning_history=tuning_history)
    print("Best Params:", best_params)
    print("Tuning history:", tuning_history)
    #Adopt the hyperparameters with the best AUC
    params.update(best_params)
    #Training
    model = lgb.train(
        params, lgb_train,
        valid_sets=[lgb_train, lgb_eval],
        verbose_eval=10,
        num_boost_round=1000,
        early_stopping_rounds=10
    )
    #Predict on the competition test data
    y_pred = model.predict(origin_X_test, num_iteration=model.best_iteration)
    output_df[i] = y_pred
submit_df = pd.read_csv("../../data/submit_sample.csv",header = None)
del submit_df[1]
submit_df["1"] = output_df.mean(axis = 1)
submit_df
submit_df.to_csv("../../output/result.csv",header=None,index=False)
#During trial and error, I checked the feature importance and used it to guide feature engineering
lgb.plot_importance(model, height=0.5, figsize=(8,12))
__When I submitted this, the AUC was 0.859.__
Furthermore, predicting 100 times with the day column and the end-of-month flag removed also gave an AUC of 0.859.
__When I averaged these two sets of predictions and submitted the result, I safely exceeded 0.860.__ Congratulations!
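For reference, the final blend is just an average of the two prediction files (a minimal sketch; the file names here are hypothetical):

```python
import pandas as pd

# Hypothetical names for the two 100-run averaged prediction files
a = pd.read_csv("result_with_day.csv", header=None)
b = pd.read_csv("result_without_day.csv", header=None)

# Column 0 is the id, column 1 is the predicted probability
blend = a.copy()
blend[1] = (a[1] + b[1]) / 2
blend.to_csv("submit_blend.csv", header=None, index=False)
```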
I also tried to improve the score by optimizing the hyperparameters of a single CatBoost model, but the gain was slight and not worth the computation time; running LightGBM over random seeds in a for loop looked more promising, so I gave up on CatBoost.
In hindsight, I wish I had also stacked XGBoost and a neural network. From what I have read, combining multiple methods like this is one of the royal-road strategies in competitions.
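For the record, here is a minimal sketch of what stacking looks like (illustrative only, with synthetic data and scikit-learn base models; not something I ran in this competition):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Level 0: out-of-fold predictions from two base models
base_models = [GradientBoostingClassifier(random_state=0),
               RandomForestClassifier(random_state=0)]
oof = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models])

# Level 1: a meta-model learns how to weight the base predictions
meta = LogisticRegression().fit(oof, y_tr)

# At test time, refit the base models on all training data
test_preds = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models])
print("stacked AUC:", roc_auc_score(y_te, meta.predict_proba(test_preds)[:, 1]))
```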
I was satisfied with exceeding the target of 0.860, so I shelved the project at that point.
One reflection: it would have been better to wrap the feature processing in a function, as sketched below. I want to be able to write clean code even while iterating through trial and error.
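As an illustration, a minimal sketch that bundles the transformations from this article into one reusable function (not code I actually ran):

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the feature processing from this article in one step."""
    df = df.copy()
    # End-of-month flag (same rule as above, computed while month is still a string)
    df["end_of_month_flag"] = (
        df["day"].isin([30, 31])
        | ((df["day"] == 29) & df["month"].isin(["feb", "apr", "jun", "sep", "nov"]))
        | ((df["day"] == 28) & (df["month"] == "feb"))
    ).astype(int)
    month_dict = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
                  "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}
    df["month"] = df["month"].map(month_dict)
    df["month_day"] = df["month"] * 100 + df["day"]
    return df.drop(columns=["pdays", "balance"])
```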
Thank you for reading to the end. If you have any questions, suggestions, or anything that was hard to understand, please feel free to comment.