Introduction

This article builds a classifier for the Breast Cancer Wisconsin dataset that predicts whether a tumor is benign or malignant, using a random forest with hyperparameter tuning. The dataset ships with scikit-learn and contains 569 samples with 30 features each; 212 samples are malignant and 357 are benign.
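
The sample, feature, and class counts can be checked directly from the dataset object. This is a minimal sketch, assuming scikit-learn and NumPy are installed; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()

# Feature matrix: one row per sample, one column per feature
n_samples, n_features = cancer_data.data.shape

# Count samples per class; target values 0 and 1 index into target_names
class_counts = {
    str(name): int(count)
    for name, count in zip(cancer_data.target_names, np.bincount(cancer_data.target))
}

print("samples:", n_samples, "features:", n_features)
print("class counts:", class_counts)
```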

Series

- Calculation of coefficient of determination by linear multiple regression and model selection
- Calculation of coefficient of determination by linear multiple regression and selection of model, part 2
- Calculation of contribution rate by simple regression analysis
- Linear regression and narrowing down features
- Logistic regression (classification) and hyperparameter tuning
- Linear SVC (classification) and hyperparameter tuning
- Nonlinear SVC (classification) and hyperparameter tuning
- Decision tree (classification) and hyperparameter tuning
- Decision tree (classification) and hyperparameter tuning 2
- Random forest (classification) and hyperparameter tuning

What is Random Forest?

Random forest is a machine learning algorithm proposed by Leo Breiman in 2001 [1] and used for classification, regression, and clustering. It is an ensemble learning algorithm that uses decision trees as weak learners; the name comes from its use of a large number of decision trees, each trained on randomly sampled training data. (From Wikipedia)
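
To make the idea concrete, the following is a hand-rolled sketch of the ensemble: each decision tree is trained on a bootstrap sample of the training data and considers a random subset of features at each split, and the trees then vote on the class. It is an illustration only, not how scikit-learn implements RandomForestClassifier (which averages predicted class probabilities rather than counting hard votes); n_trees and the seeds are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # arbitrary odd number so a binary majority vote cannot tie

trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw training rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split consider a random feature subset
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Majority vote over the trees' predictions (labels are 0 or 1)
votes = np.stack([tree.predict(X_test) for tree in trees])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
forest_accuracy = float((forest_pred == y_test).mean())

print("hand-rolled forest accuracy:", forest_accuracy)
```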

Random forest hyperparameters

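For details, see the scikit-learn documentation for RandomForestClassifier. The defaults listed here are those of the scikit-learn version used for this article; they have changed in later releases (for example, n_estimators now defaults to 100).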

Hyperparameter	Accepted values	Default
n_estimators	int type	10
criterion	gini, entropy	gini
max_depth	int type or None	None
min_samples_split	int, float type	2
min_samples_leaf	int, float type	1
min_weight_fraction_leaf	float type	0
max_features	int, float type, None, auto, sqrt, log2	auto
max_leaf_nodes	int type or None	None
min_impurity_decrease	float type	0
min_impurity_split	float type	1e-7
bootstrap	bool type	True
oob_score	bool type	False
n_jobs	int type or None	None
random_state	int type, RandomState instance or None	None
verbose	int type	0
warm_start	bool type	False
class_weight	Dictionary type, balanced, balanced_subsample or None	None
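
Because the defaults depend on the installed scikit-learn release, it can help to list them from the estimator itself rather than rely on a table. A small sketch using get_params; the output will vary with your version.

```python
import sklearn
from sklearn.ensemble import RandomForestClassifier

# get_params() returns the constructor parameters and their current values;
# on a freshly constructed estimator these are the defaults.
defaults = RandomForestClassifier().get_params()

print("scikit-learn version:", sklearn.__version__)
for name, value in sorted(defaults.items()):
    print(f"{name}: {value!r}")
```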

Procedure

- Reading breast cancer data
- Separation of training data and test data
- Condition setting
- Random forest execution (grid search)
- Comparison with no hyperparameter tuning

Implementation in Python

%%time
from tqdm import tqdm
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

#Reading breast cancer data
cancer_data = load_breast_cancer()

#Separation of training data and test data
train_X, test_X, train_y, test_y = train_test_split(cancer_data.data, cancer_data.target, random_state=0)

#Condition setting: a search grid keyed by the estimator to tune
max_score = 0
RFC_grid = {RandomForestClassifier(): {"n_estimators": list(range(1, 21)),
                                       "criterion": ["gini", "entropy"],
                                       "max_depth": list(range(1, 5)),
                                       "random_state": list(range(0, 101))
                                      }}

#Random forest execution
for model, param in tqdm(RFC_grid.items()):
    clf = GridSearchCV(model, param)
    clf.fit(train_X, train_y)
    pred_y = clf.predict(test_X)
    score = f1_score(test_y, pred_y, average="micro")

    if max_score < score:
        max_score = score
        best_param = clf.best_params_
        best_model = model.__class__.__name__

print("Best score:{}".format(max_score))
print("model:{}".format(best_model))
print("parameter:{}".format(best_param))

#Comparison with no hyperparameter adjustment
model = RandomForestClassifier()
model.fit(train_X, train_y)
score = model.score(test_X, test_y)
print("")
print("Default score:", score)
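
Note that the tuned model is scored with micro-averaged F1 while the default model is scored with score(), which returns accuracy. For single-label classification these two metrics give the same number, so the comparison is like for like. The toy labels below are made up purely to illustrate this.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for illustration: 5 of 6 predictions are correct
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

micro_f1 = f1_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

print("micro F1:", micro_f1)
print("accuracy:", accuracy)
```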

Result

100%|███████████████████████████████████████████| 1/1 [10:39<00:00, 639.64s/it]
Best score:0.965034965034965
model:RandomForestClassifier
parameter:{'criterion': 'entropy', 'max_depth': 4, 'n_estimators': 14, 'random_state': 62}

Default score: 0.951048951049
Wall time: 10min 39s
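
Most of the run time comes from the size of the grid: 20 × 2 × 4 × 101 = 16,160 parameter combinations, each fitted once per cross-validation fold. The sketch below counts the grid with ParameterGrid and then runs a much smaller search with random_state fixed rather than searched. The smaller grid values and cv=5 are assumptions chosen for this example, not the settings used above, so the scores will differ from the reported ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split

X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

# The grid from the article, only counted here (not fitted)
full_grid = {
    "n_estimators": list(range(1, 21)),
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(1, 5)),
    "random_state": list(range(0, 101)),
}
n_full = len(ParameterGrid(full_grid))

# Hypothetical smaller grid; random_state is fixed on the estimator instead
small_grid = {
    "n_estimators": [5, 10, 20],
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4],
}
n_small = len(ParameterGrid(small_grid))

search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=5)
search.fit(train_X, train_y)

print("combinations (full / small):", n_full, "/", n_small)
print("best params:", search.best_params_)
print("mean cross-validated accuracy:", search.best_score_)
print("test accuracy:", search.score(test_X, test_y))
```

GridSearchCV also accepts an n_jobs argument to spread the fits across CPU cores, which shortens the search without changing the grid.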

Conclusion

Tuning the hyperparameters with grid search raised the test-set score from 0.951 (default settings) to 0.965.
