[PYTHON] I tried to predict and submit Titanic survivors with Kaggle

The code, its output, and the explanations are in the following Jupyter notebook.

Jupyter notebook: https://github.com/spica831/kaggle_titanic/blob/master/titanic.ipynb


I took part in a Kaggle hackathon to estimate house prices, but I could not finish in time because I did not know enough about how to use Python or how to analyze the data. So, as a rematch, I predicted the survivors of the Titanic. https://www.kaggle.com/c/titanic

Predict home selling prices with Kaggle

House Prices: Advanced Regression Techniques https://www.kaggle.com/c/house-prices-advanced-regression-techniques

To state the conclusion first, the accuracy of the Titanic predictions was 0.7512.


# Import the required packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Jupyter magic: draw plots inline in the notebook
%matplotlib inline
# Load the training data
df = pd.read_csv("./input/train.csv")

Display the data.

(Screenshot: the loaded training data)


String replacement

Names, sex, and several other columns are stored as strings, which cannot be used for analysis as they are. Sex and the port of embarkation (Embarked) take only a few distinct values, so I replaced them with integers such as 0, 1, and 2.

In addition, Age has missing values (NaN), so I replaced all of them with 0.

# Encode the port of embarkation as an integer
df.Embarked = df.Embarked.replace(['C', 'S', 'Q'], [0, 1, 2])
# df.Cabin = df.Cabin.fillna(0)
# Encode sex as an integer
df.Sex = df.Sex.replace(['male', 'female'], [0, 1])
# Fill missing ages (NaN) with 0
df.Age = df.Age.fillna(0)
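
One detail worth checking at this step: pandas stores a missing value as a floating-point NaN, not as the string 'NaN', so a string match leaves it untouched while `fillna` replaces it. The block below is a minimal sketch on a made-up three-row DataFrame; the rows are illustrative and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not real Titanic records)
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
    "Age": [22.0, np.nan, 35.0],
})

# Map the few distinct string values to integer codes
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].map({"C": 0, "S": 1, "Q": 2})

# Matching the string 'NaN' leaves a real missing value untouched ...
age_replaced = toy["Age"].replace("NaN", 0)
# ... while fillna replaces it
age_filled = toy["Age"].fillna(0)

print(age_replaced.isna().sum())  # count of values still missing
print(age_filled.isna().sum())
```

Whether 0 is a sensible fill value for Age is a separate question; it is used here only because the text above uses it.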

Delete column

Columns that are hard to handle, such as Name, Ticket, and Cabin, were dropped entirely. (This was painful.)

df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

Result of preprocessing

All remaining columns are now numeric.

(Screenshot: the data after preprocessing)


Correlation coefficient

First, calculate the correlation coefficients.

See the following Wikipedia article (in Japanese) for details on the correlation coefficient: https://ja.wikipedia.org/wiki/%E7%9B%B8%E9%96%A2%E4%BF%82%E6%95%B0

Correlation coefficient value

#Calculate the correlation coefficient
corrmat = df.corr()

(Screenshot: the table of correlation coefficients)
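
For reference, `DataFrame.corr()` computes the Pearson correlation coefficient by default, that is, the covariance of two columns divided by the product of their standard deviations. The sketch below recomputes it by hand on made-up numbers and compares the result with pandas; the helper name `pearson`, the column names, and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 1.0, 4.0, 3.0, 5.0],
})

def pearson(a, b):
    """Pearson correlation: covariance over the product of the standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da = a - a.mean()  # deviations from the mean
    db = b - b.mean()
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

r_manual = pearson(toy["x"], toy["y"])
r_pandas = toy.corr().loc["x", "y"]
print(r_manual, r_pandas)
```

The coefficient always lies between -1 and 1, and it only measures linear association, which is one reason not to rely on it alone.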

Correlation coefficient heat map

f, ax = plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=.8, square=True)


The heat map confirms that some of the columns are correlated.


Preparation before training

Split the data into labels (train_labels, the Survived column) and features (train_features, every column except Survived).

# Labels: the Survived column
train_labels = df['Survived'].values
# Features: every other column, cast to an integer array
train_features = df.drop('Survived', axis=1).values.astype(np.int64)

Training a support vector machine

Finally, I built a binary classifier with a linear SVM in scikit-learn. (I did not tune the parameters; in hindsight it would have been better to set the L1 or L2 regularization explicitly.)

from sklearn.svm import LinearSVC
# Example with explicit parameters (current scikit-learn names):
# LinearSVC(C=1.0, intercept_scaling=1, multi_class="ovr", loss="hinge", penalty="l2", dual=True)
svm = LinearSVC()  # default parameters
svm.fit(train_features, train_labels)
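
As a rough sketch of what setting the regularization explicitly could look like, the block below fits `LinearSVC` once with an L2 penalty and once with an L1 penalty. The synthetic data from `make_classification`, the value `C=1.0`, and the variable names are assumptions for illustration, not the settings used for the submission.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic binary classification data standing in for the Titanic features
X, y = make_classification(n_samples=200, n_features=7, n_informative=3,
                           n_redundant=0, random_state=0)

# L2 penalty (the scikit-learn default); a smaller C means stronger regularization
clf_l2 = LinearSVC(penalty="l2", loss="squared_hinge", dual=False, C=1.0)
clf_l2.fit(X, y)

# L1 penalty; in LinearSVC it requires loss="squared_hinge" and dual=False
clf_l1 = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=1.0)
clf_l1.fit(X, y)

# Training accuracy of each model (not a substitute for held-out data)
print(clf_l2.score(X, y), clf_l1.score(X, y))
```

An L1 penalty tends to push some coefficients to exactly zero, which can act as a crude form of feature selection.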


Load the test data to be predicted

df_test = pd.read_csv("./input/test.csv")

Preprocessing the test data

# Drop the unused columns
df_test.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

# Replace strings with numbers
df_test.Embarked = df_test.Embarked.replace(['C', 'S', 'Q'], [0, 1, 2])
df_test.Sex = df_test.Sex.replace(['male', 'female'], [0, 1])
# Fill missing ages (NaN) with 0
df_test.Age = df_test.Age.fillna(0)

# Convert to an integer array
test_features = df_test.values.astype(np.int64)
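
Since the same steps are applied to both files, one option is to wrap them in a single function so that the training and test preprocessing cannot drift apart. `preprocess` is a hypothetical helper name and the rows are made up; this is a sketch under those assumptions, not the notebook's code.

```python
import numpy as np
import pandas as pd

def preprocess(df):
    """Return a numeric copy of a Titanic-style DataFrame (hypothetical helper)."""
    # drop() returns a new DataFrame, so the caller's data is left unchanged
    out = df.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)
    out["Embarked"] = out["Embarked"].map({"C": 0, "S": 1, "Q": 2})
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Age"] = out["Age"].fillna(0)
    return out

# Made-up rows with the same columns as test.csv
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Pclass": [3, 1],
    "Name": ["A", "B"],
    "Sex": ["male", "female"],
    "Age": [np.nan, 38.0],
    "SibSp": [1, 0],
    "Parch": [0, 0],
    "Ticket": ["T1", "T2"],
    "Fare": [7.25, 71.28],
    "Cabin": [np.nan, "C85"],
    "Embarked": ["S", "C"],
})

# Casting to int64 truncates fractional values such as Fare
features = preprocess(toy).values.astype(np.int64)
print(features)
```

For the training data, the Survived column would still need to be separated from the result before fitting.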

Classify with SVM.

y_test_pred = svm.predict(test_features)


Convert to a form that can be submitted to Kaggle

# Reload the test data and add the column predicted by the SVM
df_out = pd.read_csv("./input/test.csv")
df_out["Survived"] = y_test_pred

#Output to the output directory
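
A hedged sketch of this output step: a Titanic submission needs only the PassengerId and Survived columns, so something like the following would work. The helper name `write_submission`, the path `./output/submission.csv`, and the stand-in rows are assumptions for illustration.

```python
import os
import pandas as pd

def write_submission(df_out, path):
    """Write PassengerId and Survived to a CSV file (hypothetical helper)."""
    out_dir = os.path.dirname(path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    # index=False keeps the row index out of the file
    df_out[["PassengerId", "Survived"]].to_csv(path, index=False)

# Stand-in for df_out; in the article it is test.csv plus the predicted column
df_out = pd.DataFrame({
    "PassengerId": [892, 893],
    "Name": ["A", "B"],
    "Survived": [0, 1],
})

write_submission(df_out, "./output/submission.csv")
```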


As mentioned at the beginning, the accuracy of the Titanic predictions was 0.7512. Still, I was satisfied because I managed to build the model and submit within just a few hours.

Things to improve

While working on this, I noticed many points that should be improved.


  1. Age should have been split into two features: the age itself where it is known, and a flag marking whether the value is NaN.
  2. Looking at the histograms, a feature whose distribution is piled up on the left should be log-transformed to bring it closer to a Gaussian. (Andrew Ng also said this in his Coursera course.)
  3. The values were not whitened.
  4. I should have tried harder to convert the many discarded string columns into numbers. In particular, I did not want to throw away Cabin and Ticket.
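
The first three points can be sketched together: add a missing-age flag, log-transform a column whose values pile up on the left, and whiten the result. The rows are made up, filling Age with the median is an assumption, and `PCA(whiten=True)` is just one possible way to whiten; none of this is what was used for the submission.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 4.0, np.nan, 58.0],
    "Fare": [7.25, 71.28, 8.05, 16.7, 512.33, 26.55],
})

# 1. Record whether Age was missing, then fill the gaps (median is an assumption)
toy["AgeMissing"] = toy["Age"].isna().astype(int)
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# 2. Log-transform the skewed column; log1p(x) = log(1 + x) also handles zeros
toy["LogFare"] = np.log1p(toy["Fare"])

# 3. Whiten: rotate and rescale so the columns are uncorrelated with unit variance
X = toy[["Age", "AgeMissing", "LogFare"]].values
X_white = PCA(whiten=True).fit_transform(X)

# The sample covariance of whitened data should be close to the identity matrix
print(np.round(np.cov(X_white, rowvar=False), 6))
```

On real data the transformation would be fitted on the training set and then applied unchanged to the test set.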


  1. I looked only at the correlation coefficients.


  1. The values were not regularized.
  2. I did not try non-linear SVMs or other classifiers.
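
As a sketch of these last two points, the block below compares a linear SVM with an RBF-kernel SVM, each with scaled inputs and an explicit regularization parameter `C`, using 5-fold cross-validation. The synthetic data and all parameter values are assumptions for illustration; results on the Titanic data may well differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic binary classification data standing in for the Titanic features
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           n_redundant=0, random_state=0)

# Each pipeline scales the features first, then fits the classifier
models = {
    "linear": make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual=False)),
    "rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
}

# Mean 5-fold cross-validated accuracy for each model
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
print(scores)
```

Cross-validation gives an estimate of accuracy before submitting, which makes it easier to compare classifiers without relying on the leaderboard.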


I was able to produce output in a short time, so I achieved my goal. However, I keenly felt that I lacked the time and experience to apply what I had learned so far and arrive at the best approach quickly.

