[PYTHON] Data analysis for improving POG 3 ~ Regression analysis ~

Recap of the previous article

The previous article, "Data analysis for improving POG 2 ~ Analysis with jupyter notebook ~", examined the causal relationship between horse profiles and the prize money won during the POG period. That analysis revealed general trends such as "mares are at a disadvantage" and "horses born earlier have an advantage".

Purpose of this article

Use regression analysis to determine whether prize money during the POG period can be predicted from horse profiles.

Data analysis

Handling of qualitative data

Let's take a look at the contents of the data to be analyzed again.

AnalysePOG_160203.jpg

Since we want to predict the prize amount from each horse's profile, the objective variable is "POG period prize_year-round", and suitable explanatory variables are "sex", "birth month", "trainer", "owner", "producer", "origin", "auction price", "father", and "mother's father" (damsire). However, the previous analysis found no significant relationship for "auction price", so it is excluded this time.

Apart from "auction price", all of these explanatory variables are qualitative (categorical) data, so regression analysis cannot be applied to them as they are.

In such cases, a common approach appears to be to convert the qualitative data into dummy variables so that they can be treated as quantitative data, and then perform the regression analysis.

pandas has a function that converts qualitative data into dummy variables. An example is shown below.

python


import pandas as pd
horse_df = pd.read_csv('./horse_db/horse_prof_2010_2014_mod.csv', encoding='utf-8', header=0, index_col=0)
# Show the first three rows of the dummy-coded 'sex' column
pd.get_dummies(horse_df[u'sex'])[:3]

AnalysePOG_160203.jpg
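
To make the conversion concrete without the real database, here is a minimal sketch on a made-up table. The column values ("male", "female", "gelding") are assumptions for illustration only and are not taken from the actual horse data. It also shows the `drop_first` option: when a constant term is added to the model, the full set of dummy columns sums to that constant column, so one category is often dropped to act as the baseline.

```python
import pandas as pd

# Toy data standing in for the real horse profile table (values are made up)
toy_df = pd.DataFrame({"sex": ["male", "female", "male", "gelding"]})

# One indicator column per category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(toy_df["sex"], dtype=int)
print(dummies)

# drop_first=True removes the first category so it serves as the baseline
reduced = pd.get_dummies(toy_df["sex"], dtype=int, drop_first=True)
print(list(reduced.columns))
```

Each row of `dummies` has exactly one 1, marking the category that row belongs to.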

Simple regression analysis

This analysis uses OLS (ordinary least squares) from the statsmodels module. The code used for the analysis is shown below.

python


#Module import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font='Osaka')
import statsmodels.api as sm
import IPython.display as display
%matplotlib inline

#Read analysis source data
horse_df = pd.read_csv('./horse_db/horse_prof_2010_2014_mod.csv', encoding='utf-8', header=0, index_col=0)
#horse_df = horse_df[:50]

#Convert qualitative data to dummy variables
use_col = [
    u'sex',
    #u'Birth month',
    #u'Trainer',
    #u'Horse owner',
    #u'Producer',
    #u'Origin',
    #u'father',
    #u'Mother father',
    ]

if len(use_col) == 1:
    dum = pd.get_dummies(horse_df[use_col[0]])
else:
    dum = pd.get_dummies(horse_df[use_col])    

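# Define X and y
X_col = dum.columns
y_col = u'POG period prize_Year-round'
tmp_df = pd.concat([dum, horse_df[y_col]], axis=1)
tmp_df = tmp_df.dropna()
tmp_df = tmp_df.astype(int)  # np.int was removed from recent NumPy; astype(int) does the same cast
X = tmp_df[X_col]  # .ix was removed from pandas; plain column selection is equivalent here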
X = sm.add_constant(X)
y = tmp_df[y_col]

#Model generation
model = sm.OLS(y,X)

#result
results = model.fit()
y_predict = results.predict()
plt.plot(y_predict, y, marker='o', ls='None', label='_'.join(use_col))
plt.xlabel(u'Forecast')
plt.ylabel(u'Actual measurement')
plt.legend(loc=0)
plt.title('R^2: %.3f,  F: %.3f' % (results.rsquared, results.fvalue))
plt.savefig('./figure/fig_'+'_'.join(use_col)+'.png')
#display.display(results.summary())

Sex

fig_性別.png

Birth month

fig_生まれ月.png

Trainer

fig_調教師.png

Horse owner

fig_馬主.png

Producer

fig_生産者.png

Origin

fig_産地.png

Father

fig_父.png

Mother father

fig_母父.png

Multiple regression analysis

fig_性別_生まれ月_調教師_馬主_生産者_産地_父_母父.png

Summary

Regression analysis was performed with "POG period prize_year-round" as the objective variable and various horse profile attributes ("sex", "father", etc.) as explanatory variables. Both the simple and the multiple regression analyses gave a small R^2, which indicates that it is difficult to predict the prize amount from the horse profile.
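
As a reminder of what a small R^2 means, the following sketch computes the coefficient of determination from its definition, 1 - SS_res / SS_tot. The helper `r_squared` and the prize amounts are hypothetical, written for illustration only. For an OLS model that includes a constant term, this should correspond to the value statsmodels reports as `results.rsquared`.

```python
import numpy as np

def r_squared(y_actual, y_predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    ss_res = np.sum((y_actual - y_predicted) ** 2)      # unexplained variation
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y = np.array([0.0, 100.0, 400.0, 1500.0])  # made-up prize amounts
mean_only = np.full_like(y, y.mean())      # always predict the average
perfect = y.copy()                         # predict every value exactly

print(r_squared(y, mean_only))
print(r_squared(y, perfect))
```

A model whose R^2 is close to zero is therefore doing little better than predicting the average prize for every horse.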

Next steps

Discriminant analysis (classifying horses as winless, average open-class, or top-class)
Analysis focusing on pedigree
