Titanic Survival Analysis¶

Problem Statement¶

Analyze and visualize the Titanic passenger data to gain insight into the factors that influenced survival during the disaster. The analysis cleans the dataset, then calculates and visualizes key survival metrics such as survival rates, passenger demographics, and class-based survival patterns.

In [41]:
import pandas as pd
import plotly.express as px
from plotly.offline import plot
from IPython.display import HTML
In [ ]:
df = pd.read_csv('Titanic-Dataset.csv')
df.head()
Out[ ]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Understanding the Data:¶

PassengerId: This unique ID for each row doesn't influence the target variable.

Survived: This is the outcome variable we aim to predict, where:

  • 1 indicates survival
  • 0 indicates the passenger did not survive

Pclass (Passenger Class): Reflecting the passenger's socio-economic status, this ordinal categorical feature has three levels:

  • 1 for Upper Class
  • 2 for Middle Class
  • 3 for Lower Class

Name, Sex, and Age are self-explanatory.

SibSp represents the total count of a passenger's siblings and spouse aboard.

Parch counts the passenger's parents and children on board.

Ticket shows the passenger's ticket number.

Fare indicates how much the passenger paid for the journey.

Cabin denotes the passenger's cabin number.

Embarked indicates the port where the passenger boarded the Titanic, with three categorical options:

  • C for Cherbourg
  • Q for Queenstown
  • S for Southampton

Number of Rows and Columns in the dataset¶

In [43]:
print(df.shape)
(891, 12)

Missing values¶

Which column has the highest number of missing values?¶

In [44]:
missing_values = df.isnull().sum()
print(f"Column '{missing_values.idxmax()}' has {missing_values.max()} missing values, the most of any column.")
Column 'Cabin' has 687 missing values, the most of any column.
In [45]:
missing_values
Out[45]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
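
To put these counts in proportion, the minimal sketch below converts them to percentages of the 891 rows reported by df.shape. The counts are copied from the output above; the names missing_counts, n_rows, and missing_pct are illustrative only.

```python
import pandas as pd

# Non-zero missing-value counts copied from the isnull().sum() output above.
missing_counts = pd.Series({'Age': 177, 'Cabin': 687, 'Embarked': 2})
n_rows = 891  # row count reported by df.shape

# Share of rows with a missing value, in percent, largest first.
missing_pct = (missing_counts / n_rows * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

On the full DataFrame, the same figures should be obtainable directly with df.isnull().mean() * 100.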

Handling Missing Values¶

  • Age: Impute the missing values using the median age of the dataset.
  • Cabin: Drop the entire column, as a large portion of the data is missing.
  • Embarked: Impute the missing values using the mode (most frequent value) of the column.
Replace NaN with median of 'Age' column¶
In [46]:
df['Age'] = df['Age'].fillna(df['Age'].median())
Drop the 'Cabin' column¶
In [47]:
df = df.drop('Cabin', axis=1)
Replace NaN with mode of 'Embarked' column¶
In [48]:
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
In [49]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB
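
The three imputation steps can be checked on a small frame. The sketch below uses toy values chosen for illustration (they are not a sample of the dataset), and the name toy is hypothetical; it only shows what each call does.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; not a sample of the Titanic data.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

toy['Age'] = toy['Age'].fillna(toy['Age'].median())                  # median of 22, 26, 35 is 26
toy = toy.drop('Cabin', axis=1)                                      # column is mostly empty
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])  # most frequent value is 'S'

print(toy)
print(toy.isnull().sum().sum())  # total number of missing values left
```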

Data Visualization¶

Change 'Pclass' variable from numerical datatype to categorical¶

In [50]:
def changeToStr(x):
    # Map the numeric class code to the label used in the data dictionary above.
    if x == 1:
        return 'Upper Class'
    if x == 2:
        return 'Middle Class'
    if x == 3:
        return 'Lower Class'

df['Pclass'] = df['Pclass'].apply(changeToStr)
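
A more compact way to do the same relabelling is Series.map with a dictionary. The sketch below is an alternative, not the notebook's own method. It runs on the Pclass values of the five rows shown by df.head() and takes its labels from the data dictionary above; the names sample and class_labels are illustrative.

```python
import pandas as pd

# Pclass values of the five rows displayed by df.head().
sample = pd.DataFrame({'Pclass': [3, 1, 3, 1, 3]})

# Hypothetical lookup table; the labels follow the data dictionary.
class_labels = {1: 'Upper Class', 2: 'Middle Class', 3: 'Lower Class'}

# map() replaces each code with its label; codes missing from the dict become NaN.
sample['Pclass'] = sample['Pclass'].map(class_labels)
print(sample['Pclass'].tolist())
```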

How many passengers were in each class¶

In [51]:
# value_counts() keeps the class labels in the index; reset_index() turns them into a 'Pclass' column next to 'count'.
pclass_count = df['Pclass'].value_counts().reset_index()
fig = px.bar(
    pclass_count,
    x='Pclass', 
    y='count', 
    color='Pclass', 
    text_auto=True)

fig.update_layout(
    title='Passenger Class Counts',
    xaxis_title='Pclass',
    yaxis_title='Counts',
)
HTML(plot(fig, include_plotlyjs='cdn', output_type='div'))
Out[51]:

Age Distribution and Survival¶

The survival rate for each age is calculated as (number of 1s in Survived) / (number of passengers) * 100.

In [52]:
age_survival = df[['Survived', 'Age']].groupby('Age')['Survived'].apply(lambda x: (x.sum()/ len(x)) * 100)

Because Survived holds only 0s and 1s, its sum within a group equals the number of survivors (the 0s add nothing), and the length of the group equals the number of passengers of that age. groupby('Age') passes the Survived values for each distinct age to the lambda, so the result is the survival rate in percent for every age in the dataset.
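
As a quick check of this reasoning, the sketch below applies the formula to the Survived values of the five rows shown by df.head() and compares it with the mean of the column, which is the same quantity for a 0/1 variable. The variable names are illustrative.

```python
import pandas as pd

# Survived values of the five rows displayed by df.head().
survived = pd.Series([0, 1, 1, 1, 0])

rate_from_sum = survived.sum() / len(survived) * 100  # 3 survivors out of 5 passengers
rate_from_mean = survived.mean() * 100                # same quantity via the mean

print(rate_from_sum, rate_from_mean)
```

This is why groupby(...)['Survived'].mean() * 100 can stand in for the lambda wherever a survival rate is needed.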

In [53]:
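# Average the per-age survival rates within each age bin; passing age_survival alone would bin the rate values, not the ages.
fig = px.histogram(age_survival.reset_index(), x='Age', y='Survived', histfunc='avg',
                   text_auto='.3s', color_discrete_sequence=['indianred'], nbins=20)

fig.update_layout(
    title='Average Survival Rate by Age',
    xaxis_title='Age',
    yaxis_title='Survival Rate in %',
    template='plotly_white'
)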

HTML(plot(fig, include_plotlyjs='cdn', output_type='div'))
Out[53]:

The histogram suggests that younger passengers, particularly children, had higher survival rates.
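
The claim about age can also be checked by grouping passengers into explicit age bands with pd.cut. The sketch below is an assumption-laden illustration: the bin edges and labels are hypothetical choices, not taken from the notebook, and it runs only on the Age and Survived values of the five rows shown by df.head().

```python
import pandas as pd

# Age and Survived values of the five rows displayed by df.head().
sample = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Survived': [0, 1, 1, 1, 0],
})

# Hypothetical age bands; each bin includes its right edge, e.g. (18, 30].
bins = [0, 12, 18, 30, 50, 80]
labels = ['0-12', '13-18', '19-30', '31-50', '51-80']
sample['AgeGroup'] = pd.cut(sample['Age'], bins=bins, labels=labels)

# Survival rate in percent per age band; observed=True skips bands with no passengers.
rate_by_group = sample.groupby('AgeGroup', observed=True)['Survived'].mean() * 100
print(rate_by_group)
```

Applied to the full df, the same two lines would give one survival rate per band, which is easier to read than one rate per distinct age.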

Survival Rate based on Passenger Class(Pclass) and Sex¶

In [54]:
pclass_sex = df.groupby(['Pclass', 'Sex'])['Survived'].mean().reset_index()
pclass_sex['Survival Rate %'] = pclass_sex['Survived'] * 100

fig = px.bar(pclass_sex, 
             x='Pclass', 
             y='Survival Rate %',
             color='Sex',
             barmode='group',
             title='Survival Rate by Pclass and Gender',
             text_auto='.4s'
             )

fig.update_layout(
    height=500
)

HTML(plot(fig, include_plotlyjs='cdn', output_type='div'))
Out[54]:
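
The grouped rates behind this chart can also be read as a table by unstacking the Sex level into columns. The sketch below runs on the Pclass, Sex, and Survived values of the five rows shown by df.head(); the names sample and rates are illustrative, and combinations absent from the sample appear as NaN.

```python
import pandas as pd

# Pclass, Sex, and Survived values of the five rows displayed by df.head().
sample = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Survived': [0, 1, 1, 1, 0],
})

# Mean of a 0/1 column times 100 is the survival rate in percent.
rates = (
    sample.groupby(['Pclass', 'Sex'])['Survived']
    .mean()
    .mul(100)
    .unstack('Sex')  # one row per class, one column per sex
)
print(rates)
```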

Survival Rate - Men vs Women¶

In [55]:
sex_survival = df[['Survived', 'Sex']].groupby('Sex')['Survived'].apply(lambda x : (x.sum() / len(x)) * 100).reset_index()
sex_survival
fig = px.bar(sex_survival,
             x='Sex',
             y='Survived', 
             color='Sex',
             text_auto='.4s')

fig.update_layout(
    title='Survival Rate of Female vs Male',
    xaxis_title='Sex',
    yaxis_title='Survival Rate in %',
    height=600
)

HTML(plot(fig, include_plotlyjs='cdn', output_type='div'))
Out[55]:

The bar chart shows that female passengers had a higher survival rate than male passengers.

Embarkation Point and Survival¶

In [56]:
def convertToLong(x):
    if x == 'C':
        return 'Cherbourg'
    if x == 'Q':
        return 'Queenstown'
    if x == 'S':
        return 'Southampton'

df['Embarked'] = df['Embarked'].apply(convertToLong)
In [57]:
# Survival rate in percent per embarkation port; selecting 'Survived' first keeps the string column out of the arithmetic.
embark_survival = df.groupby('Embarked')['Survived'].mean().mul(100).reset_index()

fig = px.bar(
    embark_survival,
    x='Embarked',
    y='Survived',
    color='Embarked',
    text_auto='.4s'
)

fig.update_layout(
    title='Survival Rate Based on Embarkation Point',
    xaxis_title='Embarked Point',
    yaxis_title='Survival Rate in %',
)
HTML(plot(fig, include_plotlyjs='cdn', output_type='div'))
Out[57]:

The bar chart shows that passengers who boarded at Cherbourg had the highest survival rate.

Family Size and Survival¶

In [58]:
df['Family_Size'] = df['SibSp'] + df['Parch'] + 1
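
As a worked example of this feature, the sketch below computes Family_Size for the five rows shown by df.head(), using their SibSp and Parch values. The + 1 counts the passenger, so someone travelling alone has a family size of 1. The name sample is illustrative.

```python
import pandas as pd

# SibSp and Parch values of the five rows displayed by df.head().
sample = pd.DataFrame({
    'SibSp': [1, 1, 0, 1, 0],
    'Parch': [0, 0, 0, 0, 0],
})

# Siblings/spouses + parents/children + the passenger themselves.
sample['Family_Size'] = sample['SibSp'] + sample['Parch'] + 1
print(sample)
```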
In [59]:
familysize_survival = df[['Family_Size', 'Survived']].groupby('Family_Size')['Survived'].apply(lambda x : (x.sum() / len(x)) * 100)

fig = px.bar(
    familysize_survival.iloc[:-2],  # positional slice: leaves the two largest family sizes out of the chart
    text_auto='.4s'
)

fig.update_layout(
    title='Survival Rate Based on Family Size',
    xaxis_title='Family Size',
    yaxis_title='Survival Rate in %',
)

HTML(plot(fig, include_plotlyjs='cdn', output_type='div'))
Out[59]:

The chart shows that larger family sizes are associated with lower survival rates.

Data Analysis Findings¶

  • Age and Survival: Younger passengers had a higher survival rate.

  • Gender and Survival: Female passengers had a significantly higher survival rate than male passengers.

  • Passenger Class: First-class passengers had a higher survival rate than the other classes, which suggests that socio-economic status played an important role.

  • Embarkation Port: Passengers who boarded at Cherbourg had a slightly higher survival rate than those from the other ports.

  • Family Size: Passengers with a family size of 1-4 members had better survival outcomes than those in larger families.
