Project author: girishkuniyal

Project description:
Titanic Survival Exploration is a classic classification problem; the real dataset is provided by Kaggle.
Language: Jupyter Notebook
Project URL: git://github.com/girishkuniyal/Titanic-Survival-Exploration.git


Titanic Survival Exploration

Python 2.7 · Problem type: Classification

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Problem Reference : https://www.kaggle.com/c/titanic

Dependencies

  • numpy
  • pandas
  • matplotlib
  • scikit-learn
  • jupyter notebook
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
```

```python
# Loading the training dataset (with labels)
df = pd.read_csv("data/train.csv")
df.head(n=5)
```

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
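Before dropping anything, it is worth checking how incomplete each column actually is; a minimal sketch of that check (not in the original notebook):

```python
# Count missing values per column; in train.csv, Cabin and Age dominate
print df.isnull().sum()
```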

```python
# Drop the most incomplete / least informative fields, then discard
# any rows that still contain missing values
df_reduced = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
df_clean = df_reduced.dropna()
df_clean.head(n=5)
```

|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |

```python
from sklearn.feature_extraction import DictVectorizer
```

```python
# Feature extraction: one-hot encode the categorical columns (Sex, Embarked)
vec = DictVectorizer(sparse=False)
df_dict = df_clean.to_dict(orient='records')
df_dict_feature = vec.fit_transform(df_dict)
df_feature = vec.get_feature_names()
print df_feature
df_dict_feature
```

```
['Age', 'Embarked=C', 'Embarked=Q', 'Embarked=S', 'Fare', 'Parch', 'Pclass', 'Sex=female', 'Sex=male', 'SibSp', 'Survived']
array([[ 22.,   0.,   0., ...,   1.,   1.,   0.],
       [ 38.,   1.,   0., ...,   0.,   1.,   1.],
       [ 26.,   0.,   0., ...,   0.,   0.,   1.],
       ...,
       [ 19.,   0.,   0., ...,   0.,   0.,   1.],
       [ 26.,   1.,   0., ...,   1.,   0.,   1.],
       [ 32.,   0.,   1., ...,   1.,   0.,   0.]])
```

```python
# Final prepared dataset with features X and labels y
df_final = pd.DataFrame(df_dict_feature, columns=df_feature)
X = df_final.drop('Survived', axis=1)
y = df_final['Survived']
df_final.head(n=5)
```

|   | Age | Embarked=C | Embarked=Q | Embarked=S | Fare | Parch | Pclass | Sex=female | Sex=male | SibSp | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | 0.0 | 0.0 | 1.0 | 7.2500 | 0.0 | 3.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1 | 38.0 | 1.0 | 0.0 | 0.0 | 71.2833 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 2 | 26.0 | 0.0 | 0.0 | 1.0 | 7.9250 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 3 | 35.0 | 0.0 | 0.0 | 1.0 | 53.1000 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 4 | 35.0 | 0.0 | 0.0 | 1.0 | 8.0500 | 0.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 |
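DictVectorizer expands each string value (`Sex=female`, `Embarked=C`, …) into its own 0/1 column while passing numeric columns through unchanged. The same encoding can be produced directly in pandas; a minimal equivalent sketch (not in the original notebook):

```python
# Equivalent one-hot encoding with pandas; column names differ slightly
# (e.g. 'Sex_female' instead of 'Sex=female')
df_dummies = pd.get_dummies(df_clean, columns=['Sex', 'Embarked'])
```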

```python
# Plot an insight from the data: survival counts, male vs. female
plt.figure(figsize=(12, 2), dpi=80)
plt.subplot(1, 2, 1)
df_male = df_final[df_final['Sex=male'] == 1]
df_female = df_final[df_final['Sex=female'] == 1]
df_male['Survived'].value_counts().sort_index().plot(kind='barh', color='#1abc9c')
plt.xlabel('Number of males')
plt.ylabel('Survived (0/1)')
plt.title('Male Survival')
plt.subplot(1, 2, 2)
df_female['Survived'].value_counts().sort_index().plot(kind='barh', color='#e74c3c')
plt.xlabel('Number of females')
plt.ylabel('Survived (0/1)')
plt.title('Female Survival')
plt.show()
```

*Figure: horizontal bar charts of survival counts for male and female passengers.*

```python
ratio_male = df_male['Survived'].sum() / len(df_male)
ratio_female = df_female['Survived'].sum() / len(df_female)
print 'Ratio of male survival', ratio_male
print 'Ratio of female survival', ratio_female
# Females clearly had a much better chance of survival than males
```

```
Ratio of male survival 0.205298013245
Ratio of female survival 0.752895752896
```
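The same ratios can be read off in one line by grouping on the encoded sex column, since the mean of a 0/1 label is exactly the survival rate; a minimal sketch (not in the original notebook):

```python
# Survival rate per group: 0 = male, 1 = female
print df_final.groupby('Sex=female')['Survived'].mean()
```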
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```
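Since only roughly 40% of passengers in the cleaned data survived, a stratified split would keep that class balance identical in both halves; an optional variant (an assumption, not in the original notebook):

```python
# Optional: stratify=y preserves the survived/perished ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)
```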
```python
# Scatter of Fare vs. Age on the training split, colored by survival
plt.figure(figsize=(6, 4), dpi=100)
colors = ['#e74c3c', '#2ecc71']
plt.scatter(X_train.Age, X_train.Fare, c=y_train, alpha=0.7,
            cmap=matplotlib.colors.ListedColormap(colors))
plt.xlim(-5, 90)
plt.ylim(-20, 300)
plt.xlabel('Age')
plt.ylabel('Fare')
plt.show()
```

*Figure: scatter plot of Fare vs. Age, colored by survival.*

```python
# Using a Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
clf1 = RandomForestClassifier(n_estimators=15)
clf1.fit(X_train, y_train)
y_predict = clf1.predict(X_train)
print 'Model accuracy_score: ', accuracy_score(y_train, y_predict)  # training accuracy
y_test_predict = clf1.predict(X_test)
print 'Test accuracy score', accuracy_score(y_test, y_test_predict)
```

```
Model accuracy_score:  0.985324947589
Test accuracy score 0.774468085106
```

The gap between the near-perfect training score and the much lower test score indicates the forest is overfitting.
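Cross-validation gives a less optimistic estimate than the training score; a minimal sketch (hypothetical, not in the original notebook):

```python
# 5-fold cross-validation on the training split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
scores = cross_val_score(RandomForestClassifier(n_estimators=15), X_train, y_train, cv=5)
print scores.mean()
```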
```python
# Support Vector Machine for classification
from sklearn import svm
from sklearn.metrics import accuracy_score
clf2 = svm.SVC(gamma=0.00016, C=1000)
clf2.fit(X_train, y_train)
y_predict = clf2.predict(X_train)
print 'Model accuracy_score: ', accuracy_score(y_train, y_predict)  # training accuracy
y_test_predict = clf2.predict(X_test)
print 'Test accuracy score', accuracy_score(y_test, y_test_predict)
```

```
Model accuracy_score:  0.834381551363
Test accuracy score 0.782978723404
```
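SVMs are sensitive to feature scale (Fare runs into the hundreds while the one-hot columns are 0/1), so standardizing the inputs is often worth trying; a hypothetical variant, not in the original notebook (the hyperparameters would need retuning after scaling):

```python
# Hypothetical variant: standardize features before the SVM
from sklearn import svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
clf_scaled = make_pipeline(StandardScaler(), svm.SVC())
clf_scaled.fit(X_train, y_train)
print 'Scaled SVM test accuracy', clf_scaled.score(X_test, y_test)
```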
```python
y_test_main_predict = clf2.predict(X_test)
```
```python
# Final classifier, retrained on all labeled data with the
# hyperparameters found by tuning
clf3 = svm.SVC(gamma=0.00016, C=1000)
clf3.fit(X, y)
y_trainCSV_predict = clf3.predict(X)
print 'Model accuracy_score: ', accuracy_score(y, y_trainCSV_predict)
```

```
Model accuracy_score:  0.824438202247
```
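The notebook does not show the tuning itself; the `gamma` and `C` values above could have been found with a grid search along these lines (a hypothetical sketch, the exact grid is an assumption):

```python
# Hypothetical hyperparameter search (not from the original notebook)
from sklearn import svm
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [1, 10, 100, 1000], 'gamma': [1e-5, 1e-4, 1e-3]}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print search.best_params_, search.best_score_
```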
```python
# Cleaning the test data the same way as the training data
df_test = pd.read_csv("data/test_corrected.csv")
df_reduced_test = df_test.drop(['Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
                                'PassengerId', 'Name', 'Ticket', 'Cabin', 'Unnamed: 0'], axis=1)
df_clean_test = df_reduced_test.dropna()
# Note: fitting a fresh DictVectorizer assumes the test set contains the same
# categories (and thus produces the same columns) as the training set
vec_test = DictVectorizer(sparse=False)
df_dict_test = df_clean_test.to_dict(orient='records')
df_dict_feature_test = vec_test.fit_transform(df_dict_test)
df_feature_test = vec_test.get_feature_names()
df_final_test = pd.DataFrame(df_dict_feature_test, columns=df_feature_test)
df_final_test.head(n=5)
```

|   | Age | Embarked=C | Embarked=Q | Embarked=S | Fare | Parch | Pclass | Sex=female | Sex=male | SibSp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34.5 | 0.0 | 1.0 | 0.0 | 7.8292 | 0.0 | 3.0 | 0.0 | 1.0 | 0.0 |
| 1 | 47.0 | 0.0 | 0.0 | 1.0 | 7.0000 | 0.0 | 3.0 | 1.0 | 0.0 | 1.0 |
| 2 | 62.0 | 0.0 | 1.0 | 0.0 | 9.6875 | 0.0 | 2.0 | 0.0 | 1.0 | 0.0 |
| 3 | 27.0 | 0.0 | 0.0 | 1.0 | 8.6625 | 0.0 | 3.0 | 0.0 | 1.0 | 0.0 |
| 4 | 22.0 | 0.0 | 0.0 | 1.0 | 12.2875 | 1.0 | 3.0 | 1.0 | 0.0 | 1.0 |

```python
# Note: the raw test data has many missing Age values; predicting them with a
# separate model could be a future improvement. Here we predict survival on the
# cleaned test rows and write the Kaggle submission file.
y_test_predict_final = clf3.predict(df_final_test)
df_test['Survived'] = y_test_predict_final  # assumes dropna removed no rows, so lengths align
df_result = df_test[['PassengerId', 'Survived']]
df_result.to_csv('data/result.csv', index=False)  # index=False keeps only the two submission columns
```
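A simpler alternative to dropping incomplete rows (or training a model for Age) is median imputation, which also keeps every PassengerId in the submission; a minimal hypothetical sketch:

```python
# Hypothetical: fill missing Age values with the column median before vectorizing
df_reduced_test['Age'] = df_reduced_test['Age'].fillna(df_reduced_test['Age'].median())
```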

Results:

This prediction model achieves an accuracy of 76.56% on the Kaggle test data.