This article summarizes the implementation and parameters of Random Forest.
Random Forest is a model that combines multiple decision trees to improve prediction performance.
The learning flow is as follows:
① Prepare multiple decision tree models. ② For each decision tree, randomly draw the same number of samples as the original training data, with replacement (varying the training data slightly for each tree increases the diversity of the learners). ③ Produce the final answer from the predictions of the individual trees: classification model → majority vote; regression model → mean.
Random forest is an ensemble learning method classified as bagging: different bootstrap samples are drawn from the training data to build multiple different models (weak learners), and the predictions of those models are then combined (averaged or voted on) to form the final model.
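As a minimal sketch of this flow (toy data and plain decision trees, assumed here purely for illustration; the actual competition data comes later), bootstrap sampling plus majority voting looks roughly like this:
python.py
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: 100 samples, 4 features, a binary label (illustration only)
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# ① Prepare multiple decision trees
# ② Train each tree on a bootstrap sample (same size, drawn with replacement)
trees = []
for _ in range(10):
    idx = rng.randint(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# ③ Final answer: majority vote over the individual trees' predictions
preds = np.stack([tree.predict(X) for tree in trees])  # shape (10, 100)
vote = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, preds)
Note that a real random forest also samples a random subset of features at each split; the RandomForestClassifier used below handles both kinds of randomness internally.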
This time, we will use the [SIGNATE] automobile evaluation competition data. Link below: https://signate.jp/competitions/122
Read the data and convert the string values to numeric codes.
python.py
import pandas as pd
import numpy as np
# Read the data
df = pd.read_csv('train.tsv', delimiter = '\t')
df = df.drop('id', axis = 1)
# Explanatory variables
df = df.replace({'buying': {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}})
df = df.replace({'maint': {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}})
df = df.replace({'doors': {'2': 2, '3': 3, '4': 4, '5': 5, '5more': 6}})
df = df.replace({'persons': {'2': 2, '4': 4, 'more': 6}})
df = df.replace({'lug_boot': {'small': 1, 'med': 2, 'big': 3}})
df = df.replace({'safety': {'low': 1, 'med': 2, 'high': 3}})
# Objective variable
df = df.replace({'class': {'unacc': 1, 'acc': 2, 'good': 3, 'vgood': 4}})
Split the data into training data and evaluation data.
python.py
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state = 0)
# Split the training data into explanatory variables (X_train) and the objective variable (y_train)
X_train = train_set.drop('class', axis=1)
y_train = train_set['class']
# Split the evaluation data into explanatory variables (X_test) and the objective variable (y_test)
X_test = test_set.drop('class', axis=1)
y_test = test_set['class']
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(691, 6)
(173, 6)
(691,)
(173,)
python.py
# Random forest classifier
from sklearn.ensemble import RandomForestClassifier
# Evaluation metrics
from sklearn import metrics
model = RandomForestClassifier()
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(metrics.classification_report(y_test, pred))
              precision    recall  f1-score   support

           1       0.97      0.96      0.97       114
           2       0.84      0.88      0.86        42
           3       0.71      0.56      0.63         9
           4       0.89      1.00      0.94         8

    accuracy                           0.92       173
   macro avg       0.85      0.85      0.85       173
weighted avg       0.92      0.92      0.92       173
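The `accuracy` row of this report is the overall accuracy; it can also be computed directly from the predictions:
python.py
# Overall accuracy of the baseline model on the evaluation data
print(metrics.accuracy_score(y_test, pred))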
The accuracy is 92%. Next, let's tune the parameters.
Here are some of the most important parameters (a usage sketch follows the list).
①n_estimators
The number of decision trees in the forest. Specify an integer (default: 100).
②criterion
The impurity measure used to split nodes. 'gini': Gini impurity (default); 'entropy': information gain based on entropy.
③max_depth
The maximum depth of each decision tree. Specify an integer or None (default: None). An important parameter for suppressing overfitting: a small value tends to give low accuracy, while a large value gives high accuracy but is prone to overfitting.
④min_samples_split
The minimum number of samples required to split a node (a node with fewer samples than this value is not split further). Specify an integer or a fraction (default: 2). In general, a value that is too small makes the model easy to overfit.
⑤max_leaf_nodes
The maximum number of leaf nodes in each decision tree. Specify an integer or None (default: None).
⑥min_samples_leaf
The minimum number of samples required at a leaf node after a split. Specify an integer or a fraction (default: 1).
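As a usage sketch, here is how these parameters would be passed explicitly (the values shown are the defaults described above, not tuned values):
python.py
# All six parameters set explicitly to their defaults (illustration only)
model = RandomForestClassifier(n_estimators=100,
                               criterion='gini',
                               max_depth=None,
                               min_samples_split=2,
                               max_leaf_nodes=None,
                               min_samples_leaf=1)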
python.py
# Grid search
from sklearn.model_selection import GridSearchCV
# Specify the parameters to search over
search_gs = {
"max_depth": [None, 5, 25],
"n_estimators":[150, 180],
"min_samples_split": [4, 8, 12],
"max_leaf_nodes": [None, 10, 30],
}
model_gs = RandomForestClassifier()
# Grid search settings
gs = GridSearchCV(model_gs,
                  search_gs,
                  cv = 5)
#Learning
gs.fit(X_train, y_train)
#Display of optimal parameters
print(gs.best_params_)
{'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 4, 'n_estimators': 180}
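Since GridSearchCV refits the best model on the full training data by default, the search object itself can report the best cross-validation score and make predictions directly:
python.py
# Mean cross-validation score of the best parameter combination
print(gs.best_score_)
# Predict with the refit best estimator
pred_gs = gs.predict(X_test)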
Let's check the result using the optimal parameters.
python.py
clf_rand = RandomForestClassifier(max_depth = None,
                                  max_leaf_nodes = None,
                                  min_samples_split = 4,
                                  n_estimators = 180)
model_rand = clf_rand.fit(X_train, y_train)
pred_rand = model_rand.predict(X_test)
print(metrics.classification_report(y_test, pred_rand))
              precision    recall  f1-score   support

           1       1.00      0.97      0.99       114
           2       0.87      0.95      0.91        42
           3       0.71      0.56      0.63         9
           4       0.89      1.00      0.94         8

    accuracy                           0.95       173
   macro avg       0.87      0.87      0.87       173
weighted avg       0.95      0.95      0.95       173
The accuracy improved from 92% to 95%!
Tuning the parameters is important, but to push the accuracy further, I think data preprocessing (feature extraction) becomes the key.
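As one starting point for that, the fitted forest exposes `feature_importances_`, which shows how much each explanatory variable contributed to the splits:
python.py
# Feature importances of the tuned model (higher = more influential)
importances = pd.Series(model_rand.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))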