Titanic Survival Analysis¶
Problem Statement¶
Analyze and visualize the Titanic passenger data to gain insight into the factors that affected survival. The notebook cleans the dataset, then calculates and visualizes key metrics such as overall survival rates, passenger demographics, and class-based survival patterns.
import pandas as pd
import plotly.express as px
from plotly.offline import plot
from IPython.display import HTML
df = pd.read_csv('Titanic-Dataset.csv')
df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Understanding the Data:¶
PassengerId: This unique ID for each row doesn't influence the target variable.
Survived: This is the outcome variable we aim to predict, where:
- 1 indicates survival
- 0 indicates the passenger did not survive
Pclass (Passenger Class): A proxy for the passenger's socio-economic status, this ordinal categorical feature has three levels:
- 1 for Upper Class
- 2 for Middle Class
- 3 for Lower Class
Name, Sex, and Age are self-explanatory.
SibSp represents the total count of a passenger's siblings and spouse aboard.
Parch counts the passenger's parents and children on board.
Ticket shows the passenger's ticket number.
Fare indicates how much the passenger paid for the journey.
Cabin denotes the passenger's cabin number.
Embarked indicates the port where the passenger boarded the Titanic, with three categorical options:
- C for Cherbourg
- Q for Queenstown
- S for Southampton
Number of Rows and Columns in the dataset¶
print(df.shape)
(891, 12)
missing_values = df.isnull().sum()
print(f"Column '{missing_values.idxmax()}' has the most missing values: {missing_values.max()}.")
Column 'Cabin' has the most missing values: 687.
missing_values
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
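Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a small made-up frame; the `toy` data and the `missing_pct` name are illustrative assumptions and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', 'S'],
})

# isnull() yields booleans, so the column mean is the fraction of missing values.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Applied to the counts above, 687 of 891 rows (about 77%) have no Cabin value and 177 of 891 (about 20%) have no Age.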
Handling Missing Values¶
Age: Impute the missing values using the median age of the dataset.
Cabin: Drop the entire column, as a large portion of the data is missing.
Embarked: Impute the missing values using the mode (most frequent value) of the column.
Replace NaN with median of 'Age' column¶
df['Age'] = df['Age'].fillna(df['Age'].median())
Drop the 'Cabin' column¶
df = df.drop('Cabin', axis=1)
Replace NaN with mode of 'Embarked' column¶
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB
Data Visualization¶
Change 'Pclass' variable from numerical datatype to categorical¶
def changeToStr(x):
    # Map the numeric class code to the label used in the data description
    if x == 1:
        return 'Upper Class'
    if x == 2:
        return 'Middle Class'
    if x == 3:
        return 'Lower Class'
df['Pclass'] = df['Pclass'].apply(changeToStr)
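The same relabelling can also be written with a dictionary and `Series.map`, which keeps the mapping in one place. This is only a sketch on made-up values: `CLASS_LABELS`, `toy_pclass`, and `labelled` are hypothetical names, and the labels follow the data description above.

```python
import pandas as pd

# Hypothetical lookup table from class code to label.
CLASS_LABELS = {1: 'Upper Class', 2: 'Middle Class', 3: 'Lower Class'}

# Made-up class codes for illustration.
toy_pclass = pd.Series([3, 1, 3, 2])

# map() replaces each code with its label; codes missing from the dict become NaN.
labelled = toy_pclass.map(CLASS_LABELS)
print(labelled.tolist())
```

One difference to keep in mind: a code that is absent from the dictionary becomes NaN, whereas the function version returns None for it.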
How many passengers were in each class¶
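# Name the index and the counts explicitly so the columns are 'Pclass' and 'count' in any pandas version
pclass_count = df['Pclass'].value_counts().rename_axis('Pclass').reset_index(name='count')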
pclass_count
fig = px.bar(
    pclass_count,
    x='Pclass',
    y='count',
    color='Pclass',
    text_auto=True)
fig.update_layout(
    title='Passenger Class Counts',
    xaxis_title='Pclass',
    yaxis_title='Counts',
)
HTML(plot(fig, include_plotlyjs='cdn', output_type='div'))
Age Distribution and Survival¶
The survival rate for each age is (number of survivors / number of passengers) * 100.
age_survival = df[['Survived', 'Age']].groupby('Age')['Survived'].apply(lambda x: (x.sum()/ len(x)) * 100)
Because Survived is coded as 0 or 1, the sum of the column within a group equals the number of survivors: the 0s add nothing to the total. The number of passengers is the length of the group, since groupby passes the Survived values for each age to the function.
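# Plot the mean of the 0/1 Survived flag (as a percentage) within each age bin,
# so the x-axis is age rather than the survival-rate values themselves
fig = px.histogram(df.assign(Survival_Pct=df['Survived'] * 100), x='Age', y='Survival_Pct',
                   histfunc='avg', nbins=20, text_auto='.1f', color_discrete_sequence=['indianred'])
fig.update_layout(
    title='Survival Rate by Age',
    xaxis_title='Age',
    yaxis_title='Survival Rate in %',
    template='plotly_white'
)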
HTML(plot(fig, include_plotlyjs='cdn', output_type='div'))
The histogram above suggests that younger passengers, particularly children, had higher survival rates.
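Grouping by exact age leaves very few passengers in some groups, so named age bands can make the pattern easier to read. The sketch below uses `pd.cut` on made-up data; the bin edges, the band labels, and the `toy` frame are assumptions chosen for illustration only.

```python
import pandas as pd

# Made-up passengers for illustration; not taken from the Titanic data.
toy = pd.DataFrame({
    'Age': [4, 9, 15, 22, 30, 41, 58, 70],
    'Survived': [1, 1, 0, 1, 0, 0, 1, 0],
})

# Hypothetical age bands; each interval is open on the left and closed on the right.
bins = [0, 12, 18, 40, 60, 100]
labels = ['Child', 'Teen', 'Adult', 'Middle-aged', 'Senior']
toy['Age_Group'] = pd.cut(toy['Age'], bins=bins, labels=labels)

# The mean of a 0/1 column is the share of 1s, so mean * 100 is the survival rate.
rate_by_group = toy.groupby('Age_Group', observed=True)['Survived'].mean().mul(100)
print(rate_by_group)
```

Taking the mean of Survived gives the same number as sum divided by length, because the column only contains 0s and 1s.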
Survival Rate Based on Passenger Class (Pclass) and Sex¶
pclass_sex = df.groupby(['Pclass', 'Sex'])['Survived'].mean().reset_index()
pclass_sex['Survival Rate %'] = pclass_sex['Survived'] * 100
fig = px.bar(
    pclass_sex,
    x='Pclass',
    y='Survival Rate %',
    color='Sex',
    barmode='group',
    title='Survival Rate by Pclass and Gender',
    text_auto='.4s'
)
fig.update_layout(
    height=500
)
HTML(plot(fig, include_plotlyjs='cdn', output_type='div'))
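The same two-way breakdown can be read as a small table, with one row per class and one column per sex. This is a sketch using `DataFrame.pivot_table` on made-up rows; the `toy` frame and the `rate_table` name are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the Titanic data.
toy = pd.DataFrame({
    'Pclass': ['Upper Class', 'Upper Class', 'Lower Class',
               'Lower Class', 'Lower Class', 'Upper Class'],
    'Sex': ['female', 'male', 'female', 'male', 'male', 'female'],
    'Survived': [1, 0, 1, 0, 0, 1],
})

# Mean of the 0/1 flag per (class, sex) cell, expressed as a percentage.
rate_table = toy.pivot_table(
    index='Pclass', columns='Sex', values='Survived', aggfunc='mean'
).mul(100)
print(rate_table)
```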
Survival Rate - Men vs Women¶
sex_survival = df[['Survived', 'Sex']].groupby('Sex')['Survived'].apply(lambda x : (x.sum() / len(x)) * 100).reset_index()
sex_survival
fig = px.bar(
    sex_survival,
    x='Sex',
    y='Survived',
    color='Sex',
    text_auto='.4s')
fig.update_layout(
    title='Survival Rate of Female vs Male',
    xaxis_title='Sex',
    yaxis_title='Survival Rate in %',
    height=600
)
HTML(plot(fig, include_plotlyjs='cdn', output_type='div'))
From the bar chart above, we can see that females had a higher survival rate than males.
Embarkation Point and Survival¶
def convertToLong(x):
    # Expand the one-letter port code to the full port name
    if x == 'C':
        return 'Cherbourg'
    if x == 'Q':
        return 'Queenstown'
    if x == 'S':
        return 'Southampton'
df['Embarked'] = df['Embarked'].apply(convertToLong)
# Select 'Survived' before aggregating so the text column 'Embarked' is never summed
embark_survival = df.groupby('Embarked')['Survived'].mean().mul(100).reset_index()
fig = px.bar(
    embark_survival,
    x='Embarked',
    y='Survived',
    color='Embarked',
    text_auto='.4s'
)
fig.update_layout(
    title='Survival Rate Based on Embarkation Point',
    xaxis_title='Embarked Point',
    yaxis_title='Survival Rate in %',
)
HTML(plot(fig, include_plotlyjs='cdn', output_type='div'))
From the above bar chart we can see that passengers from 'Cherbourg' had the highest survival rate.
Family Size and Survival¶
df['Family_Size'] = df['SibSp'] + df['Parch'] + 1
familysize_survival = df[['Family_Size', 'Survived']].groupby('Family_Size')['Survived'].apply(lambda x : (x.sum() / len(x)) * 100)
# The slice [:-2] leaves the two largest family sizes out of the chart
fig = px.bar(
    familysize_survival[:-2],
    text_auto='.4s'
)
fig.update_layout(
    title='Survival Rate Based on Family Size',
    xaxis_title='Family Size',
    yaxis_title='Survival Rate in %',
)
HTML(plot(fig, include_plotlyjs='cdn', output_type='div'))
As the chart above shows, larger family sizes are associated with lower survival rates.
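Rates computed from very few passengers can swing widely, so it can help to show the group size next to each rate. The following is a minimal sketch with made-up rows; the `toy` frame and the `passengers` and `survival_rate` column names are assumptions for illustration.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the Titanic data.
toy = pd.DataFrame({
    'Family_Size': [1, 1, 1, 1, 2, 2, 8],
    'Survived':    [0, 1, 0, 0, 1, 1, 0],
})

# Named aggregation: group size and mean of the 0/1 flag side by side.
summary = toy.groupby('Family_Size')['Survived'].agg(
    passengers='count', survival_rate='mean'
)
summary['survival_rate'] = summary['survival_rate'] * 100
print(summary)
```

A rate that rests on a single passenger, like the family of 8 in this toy data, says little on its own.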
Data Analysis Findings¶
Age and Survival: Younger passengers had a higher survival rate.
Gender and Survival: Females had a significantly higher survival rate than males.
Passenger Class: First-class passengers had a higher survival rate than the other classes, which suggests that socio-economic status played an important role.
Embarkation Port: Passengers who embarked at Cherbourg had a slightly higher survival rate than those from the other ports.
Family Size: Families of 1-4 members had better survival outcomes than larger families.