Build a classifier for the Wisconsin breast cancer dataset that determines whether a tumor is benign or malignant, using a random forest with tuned hyperparameters. The dataset ships with sklearn and contains 569 samples, of which 212 are malignant and 357 are benign, each described by 30 features.
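As a quick sanity check of the dataset description, the sketch below loads the data and prints its shape and the number of samples per class name. The variable names `X`, `y`, and `counts` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
X, y = cancer_data.data, cancer_data.target

# Shape is (number of samples, number of features)
print(X.shape)

# target_names is ['malignant', 'benign'], so label 0 = malignant and 1 = benign
counts = {str(name): int(n) for name, n in zip(cancer_data.target_names, np.bincount(y))}
print(counts)
```

Checking the label order matters here because the positive label (1) is the benign class, which is easy to get backwards.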
- Calculation of coefficient of determination by linear multiple regression and model selection
- Calculation of coefficient of determination by linear multiple regression and model selection, part 2
- Calculation of contribution rate by simple regression analysis
- Linear regression and narrowing down features
- Logistic regression (classification) and hyperparameter tuning
- Linear SVC (classification) and hyperparameter tuning
- Nonlinear SVC (classification) and hyperparameter tuning
- Decision tree (classification) and hyperparameter tuning
- Decision tree (classification) and hyperparameter tuning, part 2
- Random forest (classification) and hyperparameter tuning
Random forest is a machine learning algorithm proposed by Leo Breiman in 2001 [1] and used for classification, regression, and clustering. It is an ensemble learning algorithm that uses decision trees as weak learners; its name comes from its use of a large number of decision trees trained on randomly sampled training data. (From Wikipedia)
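To make the idea of "many decision trees trained on randomly sampled data" concrete, here is a minimal hand-rolled sketch: each tree is fit on a bootstrap sample of the training rows, and the ensemble predicts by majority vote. This is an illustration under simplifying assumptions, not how sklearn implements it internally (RandomForestClassifier averages the trees' predicted probabilities instead of counting hard votes). The value `n_trees = 25` and the seed are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
n_trees = 25  # illustrative; odd so that a binary majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw training rows with replacement
    idx = rng.randint(0, len(train_X), size=len(train_X))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=rng.randint(0, 2**31 - 1))
    tree.fit(train_X[idx], train_y[idx])
    trees.append(tree)

# Majority vote over the trees (labels are 0 and 1)
votes = np.stack([t.predict(test_X) for t in trees])  # shape: (n_trees, n_test)
pred_y = (votes.mean(axis=0) > 0.5).astype(int)
manual_accuracy = (pred_y == test_y).mean()
print("Accuracy of the hand-rolled ensemble:", manual_accuracy)
```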
See the RandomForestClassifier documentation for details.
Hyperparameter | Choices | Default |
---|---|---|
n_estimators | int | 10 (100 since scikit-learn 0.22) |
criterion | gini, entropy | gini |
max_depth | int or None | None |
min_samples_split | int, float | 2 |
min_samples_leaf | int, float | 1 |
min_weight_fraction_leaf | float | 0 |
max_features | int, float, None, auto, sqrt, log2 | auto (sqrt in recent scikit-learn releases) |
max_leaf_nodes | int or None | None |
min_impurity_decrease | float | 0 |
min_impurity_split | float | 1e-7 (removed in scikit-learn 1.0) |
bootstrap | bool | True |
oob_score | bool | False |
n_jobs | int or None | None |
random_state | int, RandomState instance or None | None |
verbose | int | 0 |
warm_start | bool | False |
class_weight | dict, balanced, balanced_subsample or None | None |
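Defaults differ between sklearn versions, so it can be worth checking what your installed version actually uses. A small sketch using the estimator's `get_params()` method:

```python
import sklearn
from sklearn.ensemble import RandomForestClassifier

# get_params() returns the constructor arguments of the estimator as a dict
defaults = RandomForestClassifier().get_params()

print("scikit-learn version:", sklearn.__version__)
for name in sorted(defaults):
    print(f"{name}: {defaults[name]!r}")
```

The printed list may contain parameters that are not in the table above, or lack some that are, depending on the version.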
- Reading breast cancer data
- Separation of training data and test data
%%time
from tqdm import tqdm
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Read the breast cancer data
cancer_data = load_breast_cancer()

# Split into training data and test data
train_X, test_X, train_y, test_y = train_test_split(cancer_data.data, cancer_data.target, random_state=0)

# Search conditions: one model mapped to its parameter grid
max_score = 0
RFC_grid = {RandomForestClassifier(): {"n_estimators": list(range(1, 21)),
                                       "criterion": ["gini", "entropy"],
                                       "max_depth": list(range(1, 5)),
                                       "random_state": list(range(0, 101))
                                       }}

# Run the grid search over the random forest
for model, param in tqdm(RFC_grid.items()):
    clf = GridSearchCV(model, param)
    clf.fit(train_X, train_y)
    pred_y = clf.predict(test_X)
    # Micro-averaged F1 equals accuracy for single-label classification
    score = f1_score(test_y, pred_y, average="micro")
    if max_score < score:
        max_score = score
        best_param = clf.best_params_
        best_model = model.__class__.__name__

print("Best score:{}".format(max_score))
print("model:{}".format(best_model))
print("parameter:{}".format(best_param))

# Comparison with no hyperparameter adjustment
model = RandomForestClassifier()
model.fit(train_X, train_y)
score = model.score(test_X, test_y)
print("")
print("Default score:", score)
100%|███████████████████████████████████████████| 1/1 [10:39<00:00, 639.64s/it]
Best score:0.965034965034965
model:RandomForestClassifier
parameter:{'criterion': 'entropy', 'max_depth': 4, 'n_estimators': 14, 'random_state': 62}
Default score: 0.951048951049
Wall time: 10min 39s
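By tuning the hyperparameters, we obtained a higher test accuracy than with the default settings: 0.965 versus 0.951, a difference of two test samples out of 143. The grid search took about 10 minutes 39 seconds.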
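For a quicker experiment, one possible variation is to fix `random_state` on the estimator instead of searching over it, and to shrink the grid. The sketch below is an assumption-laden variant, not the search used above: the grid values, `cv=5`, and the seed are arbitrary choices. It also prints both accuracy and micro-averaged F1 on the test set, to show that the two metrics coincide for this kind of single-label problem.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

# Smaller illustrative grid; random_state is fixed rather than tuned
param_grid = {
    "n_estimators": [5, 10, 20],
    "criterion": ["gini", "entropy"],
    "max_depth": [1, 2, 3, 4],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(train_X, train_y)

# predict() uses the best estimator refit on the full training data
pred_y = search.predict(test_X)
test_accuracy = accuracy_score(test_y, pred_y)
test_micro_f1 = f1_score(test_y, pred_y, average="micro")

print("Best params:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
print("Test accuracy:", test_accuracy)
print("Test micro-F1:", test_micro_f1)
```

Here `best_score_` is computed on the training folds only, so it can be compared with the test accuracy as a rough check that the selected parameters generalize.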