This article summarizes the implementation and parameters of Random Forest.
Random Forest is a model that combines multiple decision trees to improve prediction performance.
The learning flow is as follows:
① Prepare multiple decision tree models. ② For each decision tree, randomly draw the same number of samples as the original training data, with replacement (varying the training data slightly for each tree increases the diversity of the learners). ③ Produce the final answer from the predictions of the individual trees: classification model → majority vote; regression model → mean.
Random forest is an ensemble learning method classified as bagging: different bootstrap samples are drawn from the training data to build multiple different models (weak learners), and the predictions of those models are then combined (averaged or voted on) to form the final model.
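As a minimal sketch of this flow (toy data and plain decision trees, assumed here purely for illustration; the actual competition data comes later), bootstrap sampling plus majority voting looks roughly like this:
python.py
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: 100 samples, 4 features, a binary label (illustration only)
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# ① Prepare multiple decision trees
# ② Train each tree on a bootstrap sample (same size, drawn with replacement)
trees = []
for _ in range(10):
    idx = rng.randint(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# ③ Final answer: majority vote over the individual trees' predictions
preds = np.stack([tree.predict(X) for tree in trees])  # shape (10, 100)
vote = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, preds)
Note that a real random forest also samples a random subset of features at each split; the RandomForestClassifier used below handles both kinds of randomness internally.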
This time, we will use the [SIGNATE] automobile evaluation competition data. Link below: https://signate.jp/competitions/122
Read the data and convert the string values to numeric codes.
python.py
import pandas as pd
import numpy as np
# Read the data
df = pd.read_csv('train.tsv', delimiter = '\t')
df = df.drop('id', axis = 1)
# Explanatory variables
df = df.replace({'buying': {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}})
df = df.replace({'maint': {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}})
df = df.replace({'doors': {'2': 2, '3': 3, '4': 4, '5': 5, '5more': 6}})
df = df.replace({'persons': {'2': 2, '4': 4, 'more': 6}})
df = df.replace({'lug_boot': {'small': 1, 'med': 2, 'big': 3}})
df = df.replace({'safety': {'low': 1, 'med': 2, 'high': 3}})
# Objective variable
df = df.replace({'class': {'unacc': 1, 'acc': 2, 'good': 3, 'vgood': 4}})
Split the data into training data and evaluation data.
python.py
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state = 0)
# Split the training data into explanatory variables (X_train) and the objective variable (y_train)
X_train = train_set.drop('class', axis=1)
y_train = train_set['class']
# Split the evaluation data into explanatory variables (X_test) and the objective variable (y_test)
X_test = test_set.drop('class', axis=1)
y_test = test_set['class']
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(691, 6)
(173, 6)
(691,)
(173,)
python.py
# Random forest classifier
from sklearn.ensemble import RandomForestClassifier
# Evaluation metrics
from sklearn import metrics
model = RandomForestClassifier()
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(metrics.classification_report(y_test, pred))
              precision    recall  f1-score   support

           1       0.97      0.96      0.97       114
           2       0.84      0.88      0.86        42
           3       0.71      0.56      0.63         9
           4       0.89      1.00      0.94         8

    accuracy                           0.92       173
   macro avg       0.85      0.85      0.85       173
weighted avg       0.92      0.92      0.92       173
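The `accuracy` row of this report is the overall accuracy; it can also be computed directly from the predictions:
python.py
# Overall accuracy of the baseline model on the evaluation data
print(metrics.accuracy_score(y_test, pred))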
The accuracy is 92%. Next, let's tune the parameters.
Here are some of the most important parameters (a usage sketch follows the list).
①n_estimators
The number of decision trees in the forest. Specify an integer (default: 100).
②criterion
The impurity measure used to split nodes. 'gini': Gini impurity (default); 'entropy': information gain based on entropy.
③max_depth
The maximum depth of each decision tree. Specify an integer or None (default: None). An important parameter for suppressing overfitting: a small value tends to give low accuracy, while a large value gives high accuracy but is prone to overfitting.
④min_samples_split
The minimum number of samples required to split a node (a node with fewer samples than this value is not split further). Specify an integer or a fraction (default: 2). In general, a value that is too small makes the model easy to overfit.
⑤max_leaf_nodes
The maximum number of leaf nodes in each decision tree. Specify an integer or None (default: None).
⑥min_samples_leaf
The minimum number of samples required at a leaf node after a split. Specify an integer or a fraction (default: 1).
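As a usage sketch, here is how these parameters would be passed explicitly (the values shown are the defaults described above, not tuned values):
python.py
# All six parameters set explicitly to their defaults (illustration only)
model = RandomForestClassifier(n_estimators=100,
                               criterion='gini',
                               max_depth=None,
                               min_samples_split=2,
                               max_leaf_nodes=None,
                               min_samples_leaf=1)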
python.py
# Grid search
from sklearn.model_selection import GridSearchCV
# Specify the parameters to search over
search_gs = {
"max_depth": [None, 5, 25],
"n_estimators":[150, 180],
"min_samples_split": [4, 8, 12],
"max_leaf_nodes": [None, 10, 30],
}
model_gs = RandomForestClassifier()
# Grid search settings
gs = GridSearchCV(model_gs,
                  search_gs,
                  cv = 5)
#Learning
gs.fit(X_train, y_train)
#Display of optimal parameters
print(gs.best_params_)
{'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 4, 'n_estimators': 180}
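Since GridSearchCV refits the best model on the full training data by default, the search object itself can report the best cross-validation score and make predictions directly:
python.py
# Mean cross-validation score of the best parameter combination
print(gs.best_score_)
# Predict with the refit best estimator
pred_gs = gs.predict(X_test)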
Let's check the result using the optimal parameters.
python.py
clf_rand = RandomForestClassifier(max_depth = None,
                                  max_leaf_nodes = None,
                                  min_samples_split = 4,
                                  n_estimators = 180)
model_rand = clf_rand.fit(X_train, y_train)
pred_rand = model_rand.predict(X_test)
print(metrics.classification_report(y_test, pred_rand))
              precision    recall  f1-score   support

           1       1.00      0.97      0.99       114
           2       0.87      0.95      0.91        42
           3       0.71      0.56      0.63         9
           4       0.89      1.00      0.94         8

    accuracy                           0.95       173
   macro avg       0.87      0.87      0.87       173
weighted avg       0.95      0.95      0.95       173
The accuracy improved from 92% to 95%!
Tuning the parameters is important, but to push the accuracy further, I think data preprocessing (feature extraction) becomes the key.
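As one starting point for that, the fitted forest exposes `feature_importances_`, which shows how much each explanatory variable contributed to the splits:
python.py
# Feature importances of the tuned model (higher = more influential)
importances = pd.Series(model_rand.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))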