Credit Risk Analysis


Table of Contents
  • Step 1: Import Libraries
  • Step 2: Read Dataset
  • Step 3: Dataset Overview
    • Step 3.1: Dataset Basic Information
    • Step 3.2: Summary Statistics for Numerical Variables
    • Step 3.3: Summary Statistics for Categorical Variables
  • Step 4: Missing Value Analysis
    • Step 4.1: Identify Missing Values
    • Step 4.2: Missing Values Treatment
  • Step 5: Value Standardization
  • Step 6: Univariate Analysis
    • Step 6.1 : Univariate Analysis of Numerical Variables
    • Step 6.2 : Univariate Analysis of Categorical Variables
  • Step 7: Correlation
  • Step 8: Outliers Treatment
  • Step 9: Merging Dataset
  • Step 10: Bivariate Analysis
    • Step 10.1 : Categorical columns vs TARGET
    • Step 10.2 : Numerical columns vs TARGET
  • Step 11: Business Oriented Conclusions

Business Understanding¶

The aim of this analysis is to study customer data using Exploratory Data Analysis to uncover patterns, trends, and key variables that strongly influence loan default behavior. By identifying these factors, the company can make informed lending decisions, reduce the risk of default, and improve the overall effectiveness of credit risk management.


Step 1 : Import Libraries¶

Description - In this step we import all the libraries required for the project.

In [1]:
import pandas as pd     # For data manipulation and analysis
import numpy as np      # For numerical computation
import matplotlib.pyplot as plt     # For plotting graphs and charts
import seaborn as sns       # For plotting graphs and charts in more advanced way
import math     # For mathematical operations

import warnings
warnings.filterwarnings('ignore')       # Suppress warning messages in the output

Step 2 : Read Dataset¶

In [2]:
# Load CSV file 'application_data.csv' into a DataFrame
application_data = pd.read_csv("C:/Users/ASUS/OneDrive/Desktop/Datasets/Credit Risk Analysis Dataset/application_data.csv") 
In [3]:
# Clients with payment difficulties
application_data[application_data['TARGET'] == 1]
Out[3]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
26 100031 1 Cash loans F N Y 0 112500.0 979992.0 27076.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 2.0 2.0
40 100047 1 Cash loans M N Y 0 202500.0 1193580.0 35028.0 ... 0 0 0 0 0.0 0.0 0.0 2.0 0.0 4.0
42 100049 1 Cash loans F N N 0 135000.0 288873.0 16258.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 2.0
81 100096 1 Cash loans F N Y 0 81000.0 252000.0 14593.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307448 456186 1 Cash loans M N N 1 207000.0 450000.0 32746.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 2.0
307475 456215 1 Cash loans F N N 1 144000.0 1303200.0 46809.0 ... 0 0 0 0 0.0 0.0 0.0 1.0 0.0 6.0
307481 456225 1 Cash loans M N Y 0 225000.0 297000.0 19975.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
307489 456233 1 Cash loans F N Y 0 225000.0 521280.0 23089.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 1.0 2.0
307509 456254 1 Cash loans F N Y 0 171000.0 370107.0 20205.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

24825 rows × 122 columns

In [4]:
# Clients without payment difficulties (all other cases)
application_data[application_data['TARGET'] == 0]
Out[4]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
5 100008 0 Cash loans M N Y 0 99000.0 490495.5 27517.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 1.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307505 456249 0 Cash loans F N Y 0 112500.0 225000.0 22050.0 ... 0 0 0 0 0.0 0.0 0.0 2.0 0.0 0.0
307506 456251 0 Cash loans M N N 0 157500.0 254700.0 27558.0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
307507 456252 0 Cash loans F N Y 0 72000.0 269550.0 12001.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
307508 456253 0 Cash loans F N Y 0 153000.0 677664.0 29979.0 ... 0 0 0 0 1.0 0.0 0.0 1.0 0.0 1.0
307510 456255 0 Cash loans F N N 0 157500.0 675000.0 49117.5 ... 0 0 0 0 0.0 0.0 0.0 2.0 0.0 1.0

282686 rows × 122 columns

Identify Target Variable¶

Description - The target variable in the dataset is 'TARGET'. It is a binary variable that indicates whether a client had difficulty repaying the loan.

  • 1 - The client had late payments (i.e., payment difficulties) on at least one of the first few installments
  • 0 - All other cases (the client had no such payment issues)
Check Distribution of the target variable¶

Description - We check the distribution of the target variable with a bar plot and assess whether it is imbalanced.

In [5]:
# Count and percentage of target variable
target_counts = application_data['TARGET'].value_counts()
target_percent = application_data['TARGET'].value_counts(normalize=True)*100

# Create DataFrame for plotting
target_df = pd.DataFrame({
    'Label': ['Other Cases (0)', 'Payment Difficulty (1)'],
    'Count': target_counts.values,
    'Percent': target_percent.values
})

# Plotting bar charts
plt.figure(figsize=(8, 5))      # (x-axis=8, y-axis=5)
barplot = sns.barplot(data=target_df, x='Label', y='Count', hue='Label')

# For adding percentage labels on top
for index, row in target_df.iterrows():
    barplot.text(index, row['Count'] + 1000, f"{row['Percent']:.2f}%", color='black', ha='center', fontweight='bold')

plt.title("Distribution of TARGET Variable", fontsize=14, fontweight='bold')        # For graph title
plt.ylabel("Number of Clients")     # For y axis labeling
plt.xlabel("Loan Repayment Status")     # For x axis labeling
plt.grid(axis='y', linestyle='--', alpha=0.5)
sns.despine()       # Remove the top/right spines from the graph

plt.tight_layout()
plt.show()
  • A large portion of clients belong to Other Cases (0).
  • A smaller portion of clients belong to Payment Difficulty (1).

This shows that the dataset is imbalanced, with far fewer defaulting clients (i.e., Payment Difficulty = 1).

What is data imbalance?
Data imbalance means that the classes in a variable (such as the target column) are not evenly distributed. One class (like non-defaulters) significantly outnumbers the other (like defaulters), which can lead to biased analysis or misleading results.
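The imbalance described above can be quantified directly as a majority-to-minority ratio. The sketch below uses a small toy Series in place of the real TARGET column (the values are illustrative); on the full dataset the same computation gives roughly 11:1 (282686 vs 24825).

```python
import pandas as pd

# Toy stand-in for application_data['TARGET'] (values are illustrative)
target = pd.Series([0] * 92 + [1] * 8)

counts = target.value_counts()           # class counts, majority first
imbalance_ratio = counts[0] / counts[1]  # majority-to-minority ratio

print(counts.to_dict())   # {0: 92, 1: 8}
print(imbalance_ratio)    # 11.5
```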


In [6]:
dataset_description = pd.read_csv("C:/Users/ASUS/OneDrive/Desktop/Datasets/Credit Risk Analysis Dataset/columns_description.csv")
dataset_description
Out[6]:
Unnamed: 0 Table Row Description Special
0 1 application_data SK_ID_CURR ID of loan in our sample NaN
1 2 application_data TARGET Target variable (1 - client with payment diffi... NaN
2 5 application_data NAME_CONTRACT_TYPE Identification if loan is cash or revolving NaN
3 6 application_data CODE_GENDER Gender of the client NaN
4 7 application_data FLAG_OWN_CAR Flag if the client owns a car NaN
... ... ... ... ... ...
155 209 previous_application.csv DAYS_FIRST_DUE Relative to application date of current applic... time only relative to the application
156 210 previous_application.csv DAYS_LAST_DUE_1ST_VERSION Relative to application date of current applic... time only relative to the application
157 211 previous_application.csv DAYS_LAST_DUE Relative to application date of current applic... time only relative to the application
158 212 previous_application.csv DAYS_TERMINATION Relative to application date of current applic... time only relative to the application
159 213 previous_application.csv NFLAG_INSURED_ON_APPROVAL Did the client requested insurance during the ... NaN

160 rows × 5 columns

Step 3 : Dataset Overview¶

Step 3.1 : Dataset Basic Information¶

In [7]:
# Displaying the summary of dataframe such as non-null count, data types and memory usage
application_data.iloc[:,1:50].info() # .iloc for selecting column by index. Here we only fetch info for 1 to 49 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 49 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   TARGET                       307511 non-null  int64  
 1   NAME_CONTRACT_TYPE           307511 non-null  object 
 2   CODE_GENDER                  307511 non-null  object 
 3   FLAG_OWN_CAR                 307511 non-null  object 
 4   FLAG_OWN_REALTY              307511 non-null  object 
 5   CNT_CHILDREN                 307511 non-null  int64  
 6   AMT_INCOME_TOTAL             307511 non-null  float64
 7   AMT_CREDIT                   307511 non-null  float64
 8   AMT_ANNUITY                  307499 non-null  float64
 9   AMT_GOODS_PRICE              307233 non-null  float64
 10  NAME_TYPE_SUITE              306219 non-null  object 
 11  NAME_INCOME_TYPE             307511 non-null  object 
 12  NAME_EDUCATION_TYPE          307511 non-null  object 
 13  NAME_FAMILY_STATUS           307511 non-null  object 
 14  NAME_HOUSING_TYPE            307511 non-null  object 
 15  REGION_POPULATION_RELATIVE   307511 non-null  float64
 16  DAYS_BIRTH                   307511 non-null  int64  
 17  DAYS_EMPLOYED                307511 non-null  int64  
 18  DAYS_REGISTRATION            307511 non-null  float64
 19  DAYS_ID_PUBLISH              307511 non-null  int64  
 20  OWN_CAR_AGE                  104582 non-null  float64
 21  FLAG_MOBIL                   307511 non-null  int64  
 22  FLAG_EMP_PHONE               307511 non-null  int64  
 23  FLAG_WORK_PHONE              307511 non-null  int64  
 24  FLAG_CONT_MOBILE             307511 non-null  int64  
 25  FLAG_PHONE                   307511 non-null  int64  
 26  FLAG_EMAIL                   307511 non-null  int64  
 27  OCCUPATION_TYPE              211120 non-null  object 
 28  CNT_FAM_MEMBERS              307509 non-null  float64
 29  REGION_RATING_CLIENT         307511 non-null  int64  
 30  REGION_RATING_CLIENT_W_CITY  307511 non-null  int64  
 31  WEEKDAY_APPR_PROCESS_START   307511 non-null  object 
 32  HOUR_APPR_PROCESS_START      307511 non-null  int64  
 33  REG_REGION_NOT_LIVE_REGION   307511 non-null  int64  
 34  REG_REGION_NOT_WORK_REGION   307511 non-null  int64  
 35  LIVE_REGION_NOT_WORK_REGION  307511 non-null  int64  
 36  REG_CITY_NOT_LIVE_CITY       307511 non-null  int64  
 37  REG_CITY_NOT_WORK_CITY       307511 non-null  int64  
 38  LIVE_CITY_NOT_WORK_CITY      307511 non-null  int64  
 39  ORGANIZATION_TYPE            307511 non-null  object 
 40  EXT_SOURCE_1                 134133 non-null  float64
 41  EXT_SOURCE_2                 306851 non-null  float64
 42  EXT_SOURCE_3                 246546 non-null  float64
 43  APARTMENTS_AVG               151450 non-null  float64
 44  BASEMENTAREA_AVG             127568 non-null  float64
 45  YEARS_BEGINEXPLUATATION_AVG  157504 non-null  float64
 46  YEARS_BUILD_AVG              103023 non-null  float64
 47  COMMONAREA_AVG               92646 non-null   float64
 48  ELEVATORS_AVG                143620 non-null  float64
dtypes: float64(17), int64(20), object(12)
memory usage: 115.0+ MB
In [8]:
# Displaying total number of rows and columns in dataset (rows,columns)
application_data.shape
Out[8]:
(307511, 122)
In [9]:
# Displaying columns
application_data.columns
Out[9]:
Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY',
       ...
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object', length=122)
In [10]:
# Displaying the first five rows of the dataset to view its structure and content
application_data.head()
Out[10]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 122 columns

In [11]:
numeric = application_data.select_dtypes('number').columns
object_categorical = application_data.select_dtypes(['object', 'category']).columns
print(f'Total numbers of Numeric column: {len(numeric)}')
print(f'Total numbers of Categorical columns: {len(object_categorical)}')
Total numbers of Numeric column: 106
Total numbers of Categorical columns: 16

Inferences:¶

  • Number of entries: The dataset consists of 307511 entries.
  • Columns: There are 122 columns.
  • Data Types: There are 106 numeric columns and 16 object (categorical) columns.

Step 3.2 : Summary Statistics for Numerical Variables¶

In [12]:
# Get the summary statistics for numerical variables
application_data.describe().T
Out[12]:
count mean std min 25% 50% 75% max
SK_ID_CURR 307511.0 278180.518577 102790.175348 100002.0 189145.5 278202.0 367142.5 456255.0
TARGET 307511.0 0.080729 0.272419 0.0 0.0 0.0 0.0 1.0
CNT_CHILDREN 307511.0 0.417052 0.722121 0.0 0.0 0.0 1.0 19.0
AMT_INCOME_TOTAL 307511.0 168797.919297 237123.146279 25650.0 112500.0 147150.0 202500.0 117000000.0
AMT_CREDIT 307511.0 599025.999706 402490.776996 45000.0 270000.0 513531.0 808650.0 4050000.0
... ... ... ... ... ... ... ... ...
AMT_REQ_CREDIT_BUREAU_DAY 265992.0 0.007000 0.110757 0.0 0.0 0.0 0.0 9.0
AMT_REQ_CREDIT_BUREAU_WEEK 265992.0 0.034362 0.204685 0.0 0.0 0.0 0.0 8.0
AMT_REQ_CREDIT_BUREAU_MON 265992.0 0.267395 0.916002 0.0 0.0 0.0 0.0 27.0
AMT_REQ_CREDIT_BUREAU_QRT 265992.0 0.265474 0.794056 0.0 0.0 0.0 0.0 261.0
AMT_REQ_CREDIT_BUREAU_YEAR 265992.0 1.899974 1.869295 0.0 0.0 1.0 3.0 25.0

106 rows × 8 columns

Step 3.3 : Summary Statistics for Categorical Variables¶

In [13]:
application_data.describe(include=['object']).T
Out[13]:
count unique top freq
NAME_CONTRACT_TYPE 307511 2 Cash loans 278232
CODE_GENDER 307511 3 F 202448
FLAG_OWN_CAR 307511 2 N 202924
FLAG_OWN_REALTY 307511 2 Y 213312
NAME_TYPE_SUITE 306219 7 Unaccompanied 248526
NAME_INCOME_TYPE 307511 8 Working 158774
NAME_EDUCATION_TYPE 307511 5 Secondary / secondary special 218391
NAME_FAMILY_STATUS 307511 6 Married 196432
NAME_HOUSING_TYPE 307511 6 House / apartment 272868
OCCUPATION_TYPE 211120 18 Laborers 55186
WEEKDAY_APPR_PROCESS_START 307511 7 TUESDAY 53901
ORGANIZATION_TYPE 307511 58 Business Entity Type 3 67992
FONDKAPREMONT_MODE 97216 4 reg oper account 73830
HOUSETYPE_MODE 153214 3 block of flats 150503
WALLSMATERIAL_MODE 151170 7 Panel 66040
EMERGENCYSTATE_MODE 161756 2 No 159428

Step 4 : Missing Values Analysis¶

Step 4.1: Identify Missing Values¶

In [14]:
# Calculate percentage of missing values
missing_values = application_data.isnull().sum() / len(application_data) *100 

# Sort missing-value percentages in descending order
missing_values = missing_values.sort_values(ascending=False)

missing_values.head(20)
Out[14]:
COMMONAREA_AVG              69.872297
COMMONAREA_MODE             69.872297
COMMONAREA_MEDI             69.872297
NONLIVINGAPARTMENTS_MEDI    69.432963
NONLIVINGAPARTMENTS_MODE    69.432963
NONLIVINGAPARTMENTS_AVG     69.432963
FONDKAPREMONT_MODE          68.386172
LIVINGAPARTMENTS_AVG        68.354953
LIVINGAPARTMENTS_MEDI       68.354953
LIVINGAPARTMENTS_MODE       68.354953
FLOORSMIN_MODE              67.848630
FLOORSMIN_AVG               67.848630
FLOORSMIN_MEDI              67.848630
YEARS_BUILD_AVG             66.497784
YEARS_BUILD_MODE            66.497784
YEARS_BUILD_MEDI            66.497784
OWN_CAR_AGE                 65.990810
LANDAREA_MEDI               59.376738
LANDAREA_AVG                59.376738
LANDAREA_MODE               59.376738
dtype: float64

As we can see, many columns have more than 40% missing values, while the remaining columns have comparatively few. Imputing such heavily missing columns would distort the analysis, so the better approach is to drop them. We will keep the remaining columns for analysis.

In [15]:
# Select only the columns with more than 40% missing values
remove_missing_column = missing_values.loc[missing_values > 40].index
In [16]:
# Drop the columns that has more than 40% missing values
application_data_update = application_data.drop(columns=remove_missing_column)

# Columns that have less than 40% missing values
missing_values_gt_0 = application_data_update.isnull().sum()  / len(application_data) * 100
missing_values_gt_0 = missing_values_gt_0[missing_values_gt_0 > 0].sort_values(ascending=False)
missing_values_gt_0
Out[16]:
OCCUPATION_TYPE               31.345545
EXT_SOURCE_3                  19.825307
AMT_REQ_CREDIT_BUREAU_WEEK    13.501631
AMT_REQ_CREDIT_BUREAU_MON     13.501631
AMT_REQ_CREDIT_BUREAU_YEAR    13.501631
AMT_REQ_CREDIT_BUREAU_QRT     13.501631
AMT_REQ_CREDIT_BUREAU_HOUR    13.501631
AMT_REQ_CREDIT_BUREAU_DAY     13.501631
NAME_TYPE_SUITE                0.420148
OBS_30_CNT_SOCIAL_CIRCLE       0.332021
DEF_60_CNT_SOCIAL_CIRCLE       0.332021
DEF_30_CNT_SOCIAL_CIRCLE       0.332021
OBS_60_CNT_SOCIAL_CIRCLE       0.332021
EXT_SOURCE_2                   0.214626
AMT_GOODS_PRICE                0.090403
AMT_ANNUITY                    0.003902
CNT_FAM_MEMBERS                0.000650
DAYS_LAST_PHONE_CHANGE         0.000325
dtype: float64

Step 4.2: Missing Values Treatment¶

Column 'OCCUPATION_TYPE'

In [17]:
# Plotting a bar chart for easier visualization of the data

occupation_counts = application_data_update['OCCUPATION_TYPE'].value_counts().reset_index()
plt.figure(figsize=(10, 5))
sns.barplot(data=occupation_counts, x='OCCUPATION_TYPE', y='count', hue='OCCUPATION_TYPE')
plt.xticks(rotation=90)
plt.title("Distribution of OCCUPATION_TYPE")
plt.xlabel("Occupation Type")
plt.ylabel("Number of Clients")
sns.despine()
plt.tight_layout()
plt.show()

This column has ~31% missing values, so imputing a single value such as the mode is not suitable here. For this reason we retain the column as it is.
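Retaining the NaNs is one valid choice; another common treatment, sketched below on a hypothetical toy frame, is to label missing occupations as an explicit 'Unknown' category so those rows still appear in later group-bys. This is an alternative illustration, not what the notebook applies.

```python
import pandas as pd

# Hypothetical stand-in for application_data_update (values are illustrative)
df = pd.DataFrame({'OCCUPATION_TYPE': ['Laborers', None, 'Core staff', None]})

# Label missing occupations explicitly so they form their own category
df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].fillna('Unknown')

print((df['OCCUPATION_TYPE'] == 'Unknown').sum())   # 2
```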

Column 'EXT_SOURCE_3'

In [18]:
# print top 10 values
application_data_update['EXT_SOURCE_3'].head(10)
Out[18]:
0    0.139376
1         NaN
2    0.729567
3         NaN
4         NaN
5    0.621226
6    0.492060
7    0.540654
8    0.751724
9         NaN
Name: EXT_SOURCE_3, dtype: float64
In [19]:
# Here we use histogram to see distribution of data and frequency of data points.

plt.figure(figsize=(8,5))
plt.title("Distribution of EXT_SOURCE_3")
sns.histplot(data=application_data_update, x='EXT_SOURCE_3', bins=10, color='Salmon')
plt.tight_layout() 
plt.show()
In [20]:
print(f'Mean: {application_data_update['EXT_SOURCE_3'].mean()}')
print(f'Median: {application_data_update['EXT_SOURCE_3'].median()}')
print(f'Mode: {application_data_update['EXT_SOURCE_3'].mode()[0]}')
Mean: 0.5108529061799657
Median: 0.5352762504724826
Mode: 0.746300213050371

The EXT_SOURCE_3 feature is a normalized external risk score ranging between 0 and 1; higher values indicate lower credit risk. Since it has ~19.8% missing data, we retained the column and imputed the missing values with the median. It plays a critical role in identifying risky applicants.

In [21]:
# Fill null values with the median (assignment instead of inplace=True avoids chained-assignment pitfalls)
application_data_update['EXT_SOURCE_3'] = application_data_update['EXT_SOURCE_3'].fillna(application_data_update['EXT_SOURCE_3'].median())

Column 'AMT_REQ_CREDIT_BUREAU_WEEK' Description - This feature tells us how many times a bank or other lenders checked the applicant’s credit in the week before they applied for the loan.

In [22]:
application_data_update['AMT_REQ_CREDIT_BUREAU_WEEK'].value_counts()
Out[22]:
AMT_REQ_CREDIT_BUREAU_WEEK
0.0    257456
1.0      8208
2.0       199
3.0        58
4.0        34
6.0        20
5.0        10
8.0         5
7.0         2
Name: count, dtype: int64
In [23]:
# Here we use histogram to see distribution of data and frequency of data points.

plt.figure(figsize=(8,5))
plt.title("Distribution of AMT_REQ_CREDIT_BUREAU_WEEK")
sns.histplot(data=application_data_update, x='AMT_REQ_CREDIT_BUREAU_WEEK', color='Salmon')
plt.show()

This shows that most clients had 0 inquiries in the last week. Because the column is nearly constant, analysing it would yield little insight, so we ignore this column.
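The "nearly constant" judgement made here by eye can also be checked programmatically from the share of the most frequent value. A minimal sketch on toy data (the column values and the 0.95 cut-off are assumptions, not from the dataset):

```python
import pandas as pd

# Toy columns mimicking the bureau-inquiry features (values are hypothetical)
df = pd.DataFrame({
    'AMT_REQ_CREDIT_BUREAU_WEEK': [0.0] * 97 + [1.0, 2.0, 1.0],
    'AMT_REQ_CREDIT_BUREAU_YEAR': [0.0, 1.0, 2.0, 3.0] * 25,
})

def near_constant(series, threshold=0.95):
    # True when a single value accounts for at least `threshold` of the column
    return bool(series.value_counts(normalize=True).iloc[0] >= threshold)

for col in df.columns:
    print(col, near_constant(df[col]))
# AMT_REQ_CREDIT_BUREAU_WEEK True
# AMT_REQ_CREDIT_BUREAU_YEAR False
```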

Column 'AMT_REQ_CREDIT_BUREAU_MON' Description - This feature tells us how many times a bank or other lenders checked the applicant’s credit in the last month before they applied for the loan.

In [24]:
application_data_update['AMT_REQ_CREDIT_BUREAU_MON'].value_counts().head(10)
Out[24]:
AMT_REQ_CREDIT_BUREAU_MON
0.0    222233
1.0     33147
2.0      5386
3.0      1991
4.0      1076
5.0       602
6.0       343
7.0       298
9.0       206
8.0       185
Name: count, dtype: int64
In [25]:
# Here we use histogram to see distribution of data and frequency of data points.

plt.figure(figsize=(8,5))
plt.title("Distribution of AMT_REQ_CREDIT_BUREAU_MON")
sns.histplot(data=application_data_update, x='AMT_REQ_CREDIT_BUREAU_MON', color='Salmon')
plt.show()

As before, most of the values are constant, so we ignore this column as well.

Column 'AMT_REQ_CREDIT_BUREAU_YEAR' Description - This feature tells us how many times a bank or other lenders checked the applicant’s credit in the year before they applied for the loan.

In [26]:
application_data_update['AMT_REQ_CREDIT_BUREAU_YEAR'].value_counts().head(10)
Out[26]:
AMT_REQ_CREDIT_BUREAU_YEAR
0.0    71801
1.0    63405
2.0    50192
3.0    33628
4.0    20714
5.0    12052
6.0     6967
7.0     3869
8.0     2127
9.0     1096
Name: count, dtype: int64
In [27]:
# Here we use histogram to see distribution of data and frequency of data points.

plt.figure(figsize=(8,5))
plt.title("Distribution of AMT_REQ_CREDIT_BUREAU_YEAR")
sns.histplot(data=application_data_update, x='AMT_REQ_CREDIT_BUREAU_YEAR', color='Salmon', bins=25)
plt.show()
In [28]:
# Calculate Average(mean) of the column
print(f'Mean: {application_data_update['AMT_REQ_CREDIT_BUREAU_YEAR'].mean()}') 

# Calculate center value(median) of the column
print(f'Median: {application_data_update['AMT_REQ_CREDIT_BUREAU_YEAR'].median()}')

# Calculate most frequent value(mode) of the column
print(f'Mode: {application_data_update['AMT_REQ_CREDIT_BUREAU_YEAR'].mode()[0]}')
Mean: 1.899974435321363
Median: 1.0
Mode: 0.0

As we can see from the histogram, most of the data lies between 0.0 and 3.0. The column had ~13.5% missing values. Since the most common value is 0.0, indicating no inquiries in the past year, we imputed missing values with the mode. This approach is justified as it reflects the behavior of the majority and avoids introducing bias.

In [29]:
# Fill the missing values with the mode, 0.0
application_data_update['AMT_REQ_CREDIT_BUREAU_YEAR'] = application_data_update['AMT_REQ_CREDIT_BUREAU_YEAR'].fillna(0.0)

Column 'AMT_REQ_CREDIT_BUREAU_QRT' Description - This feature tells us how many times a bank or other lenders checked the applicant’s credit in the three months before they applied for the loan.

In [30]:
application_data_update['AMT_REQ_CREDIT_BUREAU_QRT'].head(10)
Out[30]:
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
5    1.0
6    1.0
7    0.0
8    0.0
9    NaN
Name: AMT_REQ_CREDIT_BUREAU_QRT, dtype: float64
In [31]:
application_data_update['AMT_REQ_CREDIT_BUREAU_QRT'].value_counts()
Out[31]:
AMT_REQ_CREDIT_BUREAU_QRT
0.0      215417
1.0       33862
2.0       14412
3.0        1717
4.0         476
5.0          64
6.0          28
8.0           7
7.0           7
261.0         1
19.0          1
Name: count, dtype: int64
In [32]:
# Here we use histogram to see distribution of data and frequency of data points.

plt.figure(figsize=(8,5))
plt.title("Distribution of AMT_REQ_CREDIT_BUREAU_QRT")
sns.histplot(data=application_data_update, x='AMT_REQ_CREDIT_BUREAU_QRT', color='Salmon')
plt.show()

As we can see, most of the values are constant and fall into one category, so we ignore this column.

Column 'AMT_REQ_CREDIT_BUREAU_HOUR' Description - This feature tells us how many times a bank or other lenders checked the applicant’s credit one hour before they applied for the loan.

In [33]:
application_data_update['AMT_REQ_CREDIT_BUREAU_HOUR'].head()
Out[33]:
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
Name: AMT_REQ_CREDIT_BUREAU_HOUR, dtype: float64
In [34]:
application_data_update['AMT_REQ_CREDIT_BUREAU_HOUR'].value_counts()
Out[34]:
AMT_REQ_CREDIT_BUREAU_HOUR
0.0    264366
1.0      1560
2.0        56
3.0         9
4.0         1
Name: count, dtype: int64
In [35]:
# Here we use histogram to see distribution of data and frequency of data points.

plt.figure(figsize=(8,5))
plt.title("Distribution of AMT_REQ_CREDIT_BUREAU_HOUR")
sns.histplot(data=application_data_update, x='AMT_REQ_CREDIT_BUREAU_HOUR', color='Salmon')
plt.show()

Again, most of the values are constant and fall into one category, so we ignore this column.

Column 'AMT_REQ_CREDIT_BUREAU_DAY' Description - This feature tells us how many times a bank or other lenders checked the applicant’s credit one day before they applied for the loan.

In [36]:
application_data_update['AMT_REQ_CREDIT_BUREAU_DAY'].head()
Out[36]:
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
Name: AMT_REQ_CREDIT_BUREAU_DAY, dtype: float64
In [37]:
application_data_update['AMT_REQ_CREDIT_BUREAU_DAY'].value_counts()
Out[37]:
AMT_REQ_CREDIT_BUREAU_DAY
0.0    264503
1.0      1292
2.0       106
3.0        45
4.0        26
5.0         9
6.0         8
9.0         2
8.0         1
Name: count, dtype: int64
In [38]:
# Here we use histogram to see distribution of data and frequency of data points.

plt.figure(figsize=(8,5))
plt.title("Distribution of AMT_REQ_CREDIT_BUREAU_DAY")
sns.histplot(data=application_data_update, x='AMT_REQ_CREDIT_BUREAU_DAY', color='Salmon')
plt.show()

As we can see, most of the values are 0, so we ignore this column as well.

Column 'NAME_TYPE_SUITE' Description - The column describes who accompanied the client during the loan application

In [39]:
application_data_update['NAME_TYPE_SUITE'].head()
Out[39]:
0    Unaccompanied
1           Family
2    Unaccompanied
3    Unaccompanied
4    Unaccompanied
Name: NAME_TYPE_SUITE, dtype: object
In [40]:
application_data_update['NAME_TYPE_SUITE'].value_counts(normalize=True)*100
Out[40]:
NAME_TYPE_SUITE
Unaccompanied      81.159562
Family             13.111205
Spouse, partner     3.713029
Children            1.066884
Other_B             0.578018
Other_A             0.282804
Group of people     0.088499
Name: proportion, dtype: float64
In [41]:
# Here we leverage a countplot to see the count of values in each category

plt.figure(figsize=(8,5))
plt.title("Client Accompaniment at Loan Application")
sns.countplot(data=application_data_update, x='NAME_TYPE_SUITE', hue='NAME_TYPE_SUITE', legend=False)
plt.xlabel("Accompaniment Type")
plt.ylabel("Number of Clients")
plt.xticks(rotation=90)
plt.show()

Most clients (81%) came unaccompanied, with smaller portions in the remaining categories. Missing values were filled with the mode, 'Unaccompanied' (the most frequent category).

In [42]:
application_data_update['NAME_TYPE_SUITE'].mode()
Out[42]:
0    Unaccompanied
Name: NAME_TYPE_SUITE, dtype: object
In [43]:
# Fill null values with the mode value
application_data_update['NAME_TYPE_SUITE'] = application_data_update['NAME_TYPE_SUITE'].fillna('Unaccompanied')
In [44]:
application_data_update['OBS_30_CNT_SOCIAL_CIRCLE'].head()
Out[44]:
0    2.0
1    1.0
2    0.0
3    2.0
4    0.0
Name: OBS_30_CNT_SOCIAL_CIRCLE, dtype: float64
In [45]:
application_data_update['OBS_30_CNT_SOCIAL_CIRCLE'].value_counts().head(10)
Out[45]:
OBS_30_CNT_SOCIAL_CIRCLE
0.0    163910
1.0     48783
2.0     29808
3.0     20322
4.0     14143
5.0      9553
6.0      6453
7.0      4390
8.0      2967
9.0      2003
Name: count, dtype: int64

Columns 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE'

In [46]:
# We plot these four columns together because they are closely related

columns = ['OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE']

fig, axes = plt.subplots(2, 2, figsize=(14, 10)) 
axes = axes.flatten()

for idx, column in enumerate(columns):
    # hue=column (with legend=False) is required for palette= in recent seaborn versions
    sns.countplot(data=application_data_update, x=column, hue=column, palette='Paired', legend=False, ax=axes[idx])
    axes[idx].set_title(f'Countplot for {column}')
    axes[idx].tick_params(axis='x', rotation=90)

plt.tight_layout()
plt.show()
In [47]:
# Find out correlation of below columns with Target variable

result = application_data_update[['OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
    'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'TARGET']].corr()
In [48]:
# Plot correlation with heatmap

plt.figure(figsize=(8,5))
plt.title('Correlation with TARGET')
sns.heatmap(result, annot=True, cmap='Reds')
plt.tight_layout()
plt.show()

As the heatmap shows, these four columns do not have a strong correlation with the TARGET variable, so to avoid unnecessary complexity we do not analyse them further.
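Having decided to set these four columns aside, they can be dropped in a single step. A sketch on a toy frame (the column names match the notebook; the values are hypothetical, and errors='ignore' keeps the cell safe to re-run):

```python
import pandas as pd

# Toy frame with the four social-circle columns plus TARGET (values are hypothetical)
df = pd.DataFrame({
    'OBS_30_CNT_SOCIAL_CIRCLE': [2.0, 1.0, 0.0],
    'DEF_30_CNT_SOCIAL_CIRCLE': [0.0, 0.0, 1.0],
    'OBS_60_CNT_SOCIAL_CIRCLE': [2.0, 1.0, 0.0],
    'DEF_60_CNT_SOCIAL_CIRCLE': [0.0, 0.0, 0.0],
    'TARGET': [1, 0, 0],
})

social_circle_cols = ['OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
                      'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE']

# errors='ignore' means a re-run does not fail once the columns are already gone
df = df.drop(columns=social_circle_cols, errors='ignore')
print(list(df.columns))   # ['TARGET']
```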

Column 'EXT_SOURCE_2'

In [49]:
application_data_update['EXT_SOURCE_2'].head()
Out[49]:
0    0.262949
1    0.622246
2    0.555912
3    0.650442
4    0.322738
Name: EXT_SOURCE_2, dtype: float64
In [50]:
application_data_update['EXT_SOURCE_2'].value_counts()
Out[50]:
EXT_SOURCE_2
0.285898    721
0.262258    417
0.265256    343
0.159679    322
0.265312    306
           ... 
0.353855      1
0.200589      1
0.146779      1
0.288490      1
0.269471      1
Name: count, Length: 119831, dtype: int64
In [51]:
# Here we use histogram to see distribution of data and frequency of data points.

plt.figure(figsize=(8,5))
plt.title("Distribution of EXT_SOURCE_2")
sns.histplot(application_data_update['EXT_SOURCE_2'], bins=10, kde=True, color='Salmon')
plt.tight_layout()
plt.show()
In [52]:
print(f'Mean: {application_data_update['EXT_SOURCE_2'].mean()}')
print(f'Median: {application_data_update['EXT_SOURCE_2'].median()}')
print(f'Mode: {application_data_update['EXT_SOURCE_2'].mode()[0]}')
Mean: 0.5143926741308464
Median: 0.5659614260608526
Mode: 0.2858978721410488

The EXT_SOURCE_2 feature is a normalized external risk score ranging between 0 and 1; higher values indicate lower credit risk. Since only ~0.21% of its values are missing, we retain the column and impute the missing values with the median.

In [53]:
# Fill null values with the median of 'EXT_SOURCE_2' (not EXT_SOURCE_3)
application_data_update['EXT_SOURCE_2'].fillna(application_data_update['EXT_SOURCE_2'].median(), inplace=True)
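As a hedged aside (toy values, not the project dataset): assigning the result of `fillna` back to the column behaves the same as `inplace=True` but avoids chained-assignment pitfalls in newer pandas versions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real column; values are illustrative only
df = pd.DataFrame({'EXT_SOURCE_2': [0.26, np.nan, 0.56, 0.65, np.nan]})

# Impute missing scores with the column's own median via assignment
median_val = df['EXT_SOURCE_2'].median()
df['EXT_SOURCE_2'] = df['EXT_SOURCE_2'].fillna(median_val)

print(df['EXT_SOURCE_2'].isna().sum())  # 0
```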

Column 'AMT_GOODS_PRICE' Description - The column represents the price of goods for which the loan was issued.

In [54]:
application_data_update['AMT_GOODS_PRICE'].head()
Out[54]:
0     351000.0
1    1129500.0
2     135000.0
3     297000.0
4     513000.0
Name: AMT_GOODS_PRICE, dtype: float64
In [55]:
application_data_update['AMT_GOODS_PRICE'].value_counts()
Out[55]:
AMT_GOODS_PRICE
450000.0     26022
225000.0     25282
675000.0     24962
900000.0     15416
270000.0     11428
             ...  
1305706.5        1
2155500.0        1
113724.0         1
171468.0         1
559836.0         1
Name: count, Length: 1002, dtype: int64
In [56]:
# Here we use histogram to see distribution of data and frequency of data points.

plt.figure(figsize=(8, 5))
plt.title('Distribution of AMT_GOODS_PRICE')
sns.histplot(data=application_data_update, x='AMT_GOODS_PRICE', bins=35, kde=True, color='Salmon')
plt.xlabel('Goods Price')
plt.ylabel('Number of Clients')
plt.tight_layout()
plt.show()

Most values are concentrated in a few specific tiers, likely reflecting standard loan packages; for that reason we impute the missing values with the median. This variable may provide further insight when analyzed in relation to income and loan amount.
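Why the median rather than the mean for a skewed column? A minimal sketch with made-up prices shows how a single extreme value inflates the mean while leaving the median untouched:

```python
import pandas as pd

# Illustrative skewed sample; one extreme price dominates the mean
prices = pd.Series([225000, 450000, 450000, 675000, 4050000])

print(prices.mean())    # 1170000.0 -- pulled up by the outlier
print(prices.median())  # 450000.0  -- robust to the extreme value
```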

In [57]:
# Fill null values with median
application_data_update['AMT_GOODS_PRICE'].fillna(application_data_update['AMT_GOODS_PRICE'].median(), inplace=True)

Column 'AMT_ANNUITY' Description - The annuity (installment) amount required to repay the loan.

In [58]:
application_data_update['AMT_ANNUITY'].value_counts().head(5)
Out[58]:
AMT_ANNUITY
9000.0     6385
13500.0    5514
6750.0     2279
10125.0    2035
37800.0    1602
Name: count, dtype: int64
In [59]:
# Here we use histogram to see distribution of data and frequency of data points.

plt.figure(figsize=(8, 5))
plt.title('Distribution of AMT_ANNUITY')
sns.histplot(data=application_data_update, x='AMT_ANNUITY', bins=30, kde=True, color='Salmon')
plt.xlabel('Annuity Amount')
plt.tight_layout()
plt.show()

As the histogram shows, the values cluster toward the lower end with a long right tail, so we impute the missing values with the median.

In [60]:
# Fill null values with median of 'AMT_ANNUITY'
application_data_update['AMT_ANNUITY'].fillna(application_data_update['AMT_ANNUITY'].median(), inplace=True)

Column 'CNT_FAM_MEMBERS' Description - Number of family members the client has.

In [61]:
application_data_update['CNT_FAM_MEMBERS'].value_counts().head(5)
Out[61]:
CNT_FAM_MEMBERS
2.0    158357
1.0     67847
3.0     52601
4.0     24697
5.0      3478
Name: count, dtype: int64
In [62]:
# Use a countplot to see the count of each family-size category

plt.figure(figsize=(8,5))
plt.title('Distribution of CNT_FAM_MEMBERS')
sns.countplot(data=application_data_update, x='CNT_FAM_MEMBERS', palette='Paired')
plt.xlabel('Count of Family Members')
plt.ylabel('Number of Applicants')
plt.xticks(rotation=90)
plt.show()

Most values lie at 2.0, and the missing-value percentage of this column (0.000650) is very low, so we impute the missing values with the median.

In [63]:
# Fill null values with median
application_data_update['CNT_FAM_MEMBERS'].fillna(application_data_update['CNT_FAM_MEMBERS'].median(), inplace=True)

Column 'DAYS_LAST_PHONE_CHANGE' Description - How many days before the application the client changed their phone number.

In [64]:
application_data_update['DAYS_LAST_PHONE_CHANGE'].value_counts().head(10)
Out[64]:
DAYS_LAST_PHONE_CHANGE
 0.0      37672
-1.0       2812
-2.0       2318
-3.0       1763
-4.0       1285
-5.0        824
-6.0        537
-7.0        442
-8.0        278
-476.0      222
Name: count, dtype: int64
In [65]:
# Convert the values from days to years
application_data_update['DAYS_LAST_PHONE_CHANGE(In Year)'] = -round(application_data_update['DAYS_LAST_PHONE_CHANGE'] / 365, 0)
In [66]:
# Replace -0.0 to 0.0
application_data_update['DAYS_LAST_PHONE_CHANGE(In Year)'].replace(-0.0, 0.0, inplace=True)
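The days-to-years conversion above can be sketched on toy values (illustrative only); negating the rounded result can yield -0.0, which the replace step normalizes:

```python
import pandas as pd

days = pd.Series([0.0, -1.0, -476.0, -1825.0])   # made-up DAYS_* values

years = -round(days / 365, 0)        # negative day counts become positive years
years = years.replace(-0.0, 0.0)     # normalize negative zero

print(years.tolist())  # [0.0, 0.0, 1.0, 5.0]
```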
In [67]:
application_data_update['DAYS_LAST_PHONE_CHANGE(In Year)'].value_counts()
Out[67]:
DAYS_LAST_PHONE_CHANGE(In Year)
0.0     64176
1.0     58998
2.0     48141
4.0     34677
3.0     32321
5.0     31080
6.0     16620
7.0     12631
8.0      5300
9.0      2986
10.0      509
11.0       70
12.0        1
Name: count, dtype: int64
In [68]:
# Here we use histogram to see distribution of data and frequency of data points.

plt.figure(figsize=(8,5))
plt.title('Distribution of DAYS_LAST_PHONE_CHANGE(In Year)')
sns.histplot(data=application_data_update, x='DAYS_LAST_PHONE_CHANGE(In Year)', bins=13, legend=False, color='Salmon')
plt.xlabel('Years')
plt.ylabel('Number of Applicants')
plt.tight_layout()
plt.show()
In [69]:
print(f"Median: {application_data_update['DAYS_LAST_PHONE_CHANGE(In Year)'].median()}")
print(f"Mode: {application_data_update['DAYS_LAST_PHONE_CHANGE(In Year)'].mode()[0]}")
Median: 2.0
Mode: 0.0

As the histogram shows, most of the data lies in the first three values (0.0, 1.0, 2.0). The percentage of missing values is 0.000325, which is negligible and would not materially affect the analysis, but as good practice we drop the rows with missing values.

In [70]:
application_data_update = application_data_update.dropna(subset=['DAYS_LAST_PHONE_CHANGE(In Year)'])
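A small sketch (hypothetical rows) of `dropna(subset=...)`, which removes only the rows missing that one column while leaving NaNs in other columns alone:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'phone_change_yr': [0.0, np.nan, 2.0]})

# Drop only the rows where this specific column is missing
df = df.dropna(subset=['phone_change_yr'])

print(df['id'].tolist())  # [1, 3]
```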
In [71]:
# check missing values percentage of dataset
(application_data_update.isnull().sum() / len(application_data_update) * 100).sort_values(ascending=False).head(20)
Out[71]:
OCCUPATION_TYPE               31.345322
AMT_REQ_CREDIT_BUREAU_WEEK    13.501350
AMT_REQ_CREDIT_BUREAU_MON     13.501350
AMT_REQ_CREDIT_BUREAU_DAY     13.501350
AMT_REQ_CREDIT_BUREAU_QRT     13.501350
AMT_REQ_CREDIT_BUREAU_HOUR    13.501350
OBS_60_CNT_SOCIAL_CIRCLE       0.332022
OBS_30_CNT_SOCIAL_CIRCLE       0.332022
DEF_60_CNT_SOCIAL_CIRCLE       0.332022
DEF_30_CNT_SOCIAL_CIRCLE       0.332022
AMT_ANNUITY                    0.000000
AMT_CREDIT                     0.000000
AMT_INCOME_TOTAL               0.000000
CNT_CHILDREN                   0.000000
FLAG_OWN_REALTY                0.000000
FLAG_OWN_CAR                   0.000000
CODE_GENDER                    0.000000
NAME_CONTRACT_TYPE             0.000000
TARGET                         0.000000
SK_ID_CURR                     0.000000
dtype: float64
In [72]:
# Drop the unnecessary columns that we will not analyse
application_data_update = application_data_update.drop(columns=[
    'AMT_REQ_CREDIT_BUREAU_QRT',
    'AMT_REQ_CREDIT_BUREAU_HOUR',
    'AMT_REQ_CREDIT_BUREAU_DAY',
    'AMT_REQ_CREDIT_BUREAU_WEEK',
    'AMT_REQ_CREDIT_BUREAU_MON',
    'OBS_30_CNT_SOCIAL_CIRCLE',
    'DEF_60_CNT_SOCIAL_CIRCLE',
    'DEF_30_CNT_SOCIAL_CIRCLE',
    'OBS_60_CNT_SOCIAL_CIRCLE'
])

Step 5 : Value Standardization¶

Description - In this step we transformed columns from days to years for better readability and interpretation. We also renamed the corresponding column headers to reflect the updated units.

In [73]:
columns_to_change = ['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH']

for column in columns_to_change:
    application_data_update[column+'(In Year)'] = -round(application_data_update[column] / 365, 0)      # convert days to years by dividing by 365 and create new columns
In [74]:
# Drop the original day-based columns, which are no longer relevant to our analysis
application_data_update = application_data_update.drop(columns=[
    'DAYS_BIRTH',
    'DAYS_EMPLOYED',
    'DAYS_REGISTRATION',
    'DAYS_ID_PUBLISH',
    'DAYS_LAST_PHONE_CHANGE'
])

Step 6 : Univariate Analysis¶

We undertake univariate analysis on the dataset's features, based on their datatype:

  • For numerical/continuous data: We employ histograms to gain insight into the distribution of each feature. This allows us to understand the central tendency, spread, and shape of the dataset's distribution.

  • For categorical data: Bar plots are utilized to visualize the frequency of each category. This provides a clear representation of the prominence of each category within the respective feature.

By employing these visualization techniques, we're better positioned to understand the individual characteristics of each feature in the dataset.

Step 6.1 : Univariate Analysis of Numerical Variables¶

In [75]:
numeric_columns = application_data_update.select_dtypes(include=['number'])
columns = numeric_columns.columns
In [76]:
# Plot histograms in subplots
cols = 3
rows = int(np.ceil(len(columns) / cols))
fig, axes = plt.subplots(rows, cols, figsize=(16, rows * 3))
axes = axes.flatten()

for i, col in enumerate(columns[2:]):   # skip SK_ID_CURR and TARGET
    sns.histplot(numeric_columns[col], kde=True, bins=30, color='skyblue', ax=axes[i])
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('Frequency')

# Remove unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.suptitle("Histograms of Numeric Columns", fontsize=18, y=1.02)
plt.show()

Inferences:¶

  1. CNT_CHILDREN: Most clients have no children, and the count drops sharply with 1 or more children. A few records with very high numbers such as 19 are likely outliers.

  2. AMT_INCOME_TOTAL: Highly right-skewed distribution. Most applicants earn below 500000 but there are extreme values going up to 117 million, indicating possible outliers.

  3. AMT_CREDIT: Also right-skewed. Common loan amounts appear around 270,000 to 600,000, but it extends to over 4 million.

  4. AMT_ANNUITY: Most annuity amounts are in the range of 10,000 to 40,000, with a sharp peak. A long right tail hints at some high annuity obligations.

  5. EXT_SOURCE_2: Distribution is fairly normal-shaped, indicating this normalized score is well-behaved.

  6. EXT_SOURCE_3: Similar to EXT_SOURCE_2, this score shows a near-normal shape, though slightly more skewed.

  7. DAYS_BIRTH (In Year): The majority of loan applicants are between 30 and 60 years old. The distribution is slightly left-skewed. This can be a strong feature: younger applicants may be riskier.

  8. DAYS_EMPLOYED (In Year): Most valid employment durations are under 50 years, with the bulk between 0 and 10 years.

  9. REGION_POPULATION_RELATIVE: Heavily right-skewed. Most applicants are from less populated regions. A few regions have much higher population densities.

  10. CNT_FAM_MEMBERS: Most applicants belong to 2-3 member families. Very large family sizes (10+) are rare and may influence credit behavior.

Step 6.2 : Univariate Analysis of Categorical Variables¶

In [77]:
categorical_columns = application_data_update.select_dtypes(include=['object'])
cat_columns = categorical_columns.columns
In [78]:
# With the help of countplot we analyse the categorical variables


# Number of rows and columns for subplots
cols = 2
rows = math.ceil(len(cat_columns) / cols)

# Plot setup with subplots
fig, axes = plt.subplots(rows, cols, figsize=(20, rows * 7))
axes = axes.flatten()

# Plot countplot
for i, col in enumerate(cat_columns):
    ax = axes[i]
    total = categorical_columns[col].notna().sum()
    sns.countplot(data=categorical_columns, x=col, ax=ax, hue=col, legend=False)
    ax.set_title(col, fontsize=10)
    ax.tick_params(axis='x', rotation=90)
    ax.set_xlabel('')
    ax.set_ylabel('Count')
    
    # Add % labels on top of bar
    for p in ax.patches:
        count = int(p.get_height())
        percentage = 100 * count / total
        ax.text(p.get_x() + p.get_width()/2., p.get_height() + 500,
                f'{percentage:.1f}%', ha='center', fontsize=8)

# Remove unused axes
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.suptitle('Countplots of Categorical Variables', fontsize=18, y=1.02)
plt.show()

Inferences:¶

  1. NAME_CONTRACT_TYPE: 90.5% of applicants have Cash loans and only 9.5% are Revolving loans. Cash loans dominate the dataset and may represent larger or one-time financing.

  2. CODE_GENDER: 65.8% of applicants are female, and 34.2% are male. A rare category XNA (0.01%) is likely a placeholder or anomaly.

  3. FLAG_OWN_REALTY: 69.4% own real estate, while 30.6% do not. This may suggest better creditworthiness or asset security among the majority.

  4. FLAG_OWN_CAR: 66% do not own a car, while 34% do. Car ownership might relate to financial stability.

  5. NAME_TYPE_SUITE: 81.2% of clients were unaccompanied, followed by 13.1% with family and others.

  6. NAME_INCOME_TYPE: The majority are Working (51.6%), followed by Commercial associates (23.3%) and Pensioners (18%). Very few are unemployed or students, possibly filtered out due to low credit approval chances.

  7. NAME_EDUCATION_TYPE: 71% have secondary education, followed by 24.3% with higher education. Few applicants have lower secondary education and even fewer hold academic degrees; education may influence income and, indirectly, default risk.

  8. NAME_FAMILY_STATUS: Most applicants are married (~64%), followed by single (14.8%). Other statuses like widowed or civil marriage are rare. Marital status might correlate with loan behavior or stability.

  9. NAME_HOUSING_TYPE: Dominated by those living in a House / Apartment (88.7%). A few live with parents or in municipal apartments. Housing may indicate applicant stability or urban/rural status.

  10. OCCUPATION_TYPE: Laborers (26.1%), Sales staff (15.2%), and Core staff (13.1%) are the most common, whereas IT staff, HR staff, Secretaries, and Realty agents are the least frequent.

Step 7 : Correlation¶

In [79]:
numeric_columns = application_data_update.select_dtypes(include=['number'])
columns = numeric_columns.columns
In [80]:
# Compute correlation with 'TARGET'
target_corr = numeric_columns.corr()['TARGET'].drop('TARGET')  # Drop TARGET correlation with self

# Convert into DataFrame for heatmap
target_corr_df = pd.DataFrame(target_corr).sort_values(by='TARGET', ascending=False)

plt.figure(figsize=(6, len(target_corr_df) // 2))
sns.heatmap(target_corr_df, annot=True, cmap='coolwarm', linewidths=0.5, fmt=".2f")
plt.title("Correlation of Features with TARGET")
plt.tight_layout()
plt.show()
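The pattern used above (correlate every numeric column with TARGET, then drop the self-correlation) in a minimal sketch with fabricated data:

```python
import pandas as pd

# Fabricated numeric frame: 'score' is designed to move opposite to TARGET
df = pd.DataFrame({'TARGET': [0, 0, 1, 1, 0],
                   'score':  [0.9, 0.8, 0.2, 0.1, 0.7],
                   'noise':  [5, 3, 4, 5, 3]})

target_corr = df.corr()['TARGET'].drop('TARGET')  # drop self-correlation of 1.0
print(target_corr.sort_values(ascending=False))
```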

As the correlation heatmap shows, the columns FLAG_DOCUMENT_2, FLAG_DOCUMENT_3, ..., FLAG_DOCUMENT_21 have negligible correlation with the TARGET variable and contribute little to the analysis, so we drop them to simplify the dataset.

In [81]:
# Drop all FLAG_DOCUMENT_* columns to reduce complexity in the dataset

document_columns = [col for col in application_data_update.columns if col.startswith('FLAG_DOCUMENT')]
application_data_update.drop(columns=document_columns, inplace=True)

Step 8 : Outliers Treatment¶

In [82]:
numeric_columns = application_data_update.select_dtypes(include=['number'])
columns = numeric_columns.columns
In [83]:
desc = numeric_columns.describe().T
desc.head()
Out[83]:
count mean std min 25% 50% 75% max
SK_ID_CURR 307510.0 278181.038399 102789.938288 100002.0 189146.25 278202.5 367142.75 456255.0
TARGET 307510.0 0.080729 0.272419 0.0 0.00 0.0 0.00 1.0
CNT_CHILDREN 307510.0 0.417053 0.722122 0.0 0.00 0.0 1.00 19.0
AMT_INCOME_TOTAL 307510.0 168798.058473 237123.519274 25650.0 112500.00 147150.0 202500.00 117000000.0
AMT_CREDIT 307510.0 599027.041665 402491.016691 45000.0 270000.00 513531.0 808650.00 4050000.0
In [84]:
# Set the number of columns per row in the plot grid
cols_per_row = 3
total_plots = len(columns)
rows = total_plots // cols_per_row + int(total_plots % cols_per_row != 0)

# Create the subplots
plt.figure(figsize=(18, rows * 4))

for i, col in enumerate(numeric_columns):
    plt.subplot(rows, cols_per_row, i + 1)
    sns.boxplot(x=application_data_update[col], color='teal')
    plt.xlabel('')
    plt.title(col)
    plt.tight_layout()

plt.suptitle("Boxplots for Numerical Columns", fontsize=18, y=1.02)
plt.show()

Inference¶

  1. CNT_CHILDREN
    • The maximum count of children is 19, but 75% of applicants have at most 1 child
    • We cap these outliers
  2. AMT_INCOME_TOTAL
    • Max = 117000000 while the median = 147150, indicating extreme high-income outliers
  3. AMT_ANNUITY
    • Max = 258025, much larger than the 75th percentile (34596)
  4. FLAG_MOBIL
    • Min, max, and mean are all 1, meaning the column is constant.
In [85]:
# Handle outliers of CNT_CHILDREN column
application_data_update['CNT_CHILDREN'] = application_data_update['CNT_CHILDREN'].clip(upper=5)     # It will replace values above 5 with 5
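`Series.clip(upper=...)` caps values without dropping any rows; a toy sketch:

```python
import pandas as pd

children = pd.Series([0, 1, 2, 19])    # 19 is an implausible outlier
capped = children.clip(upper=5)        # values above 5 become exactly 5

print(capped.tolist())  # [0, 1, 2, 5]
```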
In [86]:
# Handle outliers of AMT_INCOME_TOTAL
cap = application_data_update['AMT_INCOME_TOTAL'].quantile(0.99)        # cap outliers at the 99th percentile
application_data_update['AMT_INCOME_TOTAL'] = application_data_update['AMT_INCOME_TOTAL'].clip(upper=cap)
application_data_update['AMT_INCOME_TOTAL'].value_counts().sort_index(ascending=False)
Out[86]:
AMT_INCOME_TOTAL
472500.0    3094
469800.0       1
468823.5       1
468000.0      10
466956.0       1
            ... 
27000.0       66
26550.0        2
26460.0        1
26100.0        3
25650.0        2
Name: count, Length: 2370, dtype: int64
In [87]:
# Handle outliers of 'AMT_ANNUITY'
application_data_update['AMT_ANNUITY'] = np.log1p(application_data_update['AMT_ANNUITY'])   # log1p applies log(1 + x), compressing the long right tail and reducing the influence of extreme values
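Note that `log1p` does not remove outliers; it is a monotonic transform that compresses the right tail, and `expm1` inverts it exactly. A small sketch with made-up annuity values:

```python
import numpy as np

values = np.array([9000.0, 34596.0, 258025.5])  # illustrative annuity amounts

logged = np.log1p(values)     # log(1 + x): large gaps shrink dramatically
restored = np.expm1(logged)   # exact inverse of log1p

print(np.allclose(restored, values))  # True
```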
In [88]:
# Drop the 'FLAG_MOBIL' column: it is constant and provides no insight for the analysis
application_data_update.drop(columns='FLAG_MOBIL', inplace=True)

Step 9 : Merging Dataset¶

In [89]:
previous_application = pd.read_csv("C:/Users/ASUS/OneDrive/Desktop/Datasets/Credit Risk Analysis Dataset/previous_application.csv")
In [90]:
previous_train = application_data_update.merge(previous_application, on='SK_ID_CURR', how='inner')
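The inner merge keeps one row per matching previous application, so clients with several prior loans appear multiple times, and overlapping column names get `_x` (current) and `_y` (previous) suffixes. A toy sketch:

```python
import pandas as pd

current = pd.DataFrame({'SK_ID_CURR': [1, 2], 'AMT_CREDIT': [100, 200]})
previous = pd.DataFrame({'SK_ID_CURR': [1, 1, 3], 'AMT_CREDIT': [50, 60, 70]})

merged = current.merge(previous, on='SK_ID_CURR', how='inner')

print(merged.columns.tolist())  # ['SK_ID_CURR', 'AMT_CREDIT_x', 'AMT_CREDIT_y']
print(len(merged))              # 2: client 1 matched twice, clients 2 and 3 dropped
```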
In [91]:
previous_train.shape
Out[91]:
(1413701, 79)
In [92]:
previous_train.head()
Out[92]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE_x CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT_x AMT_ANNUITY_x ... NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 10.114619 ... Auto technology 24.0 low_normal POS other with interest 365243.0 -565.0 125.0 -25.0 -17.0 0.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 10.482892 ... XNA 12.0 low_normal Cash X-Sell: low 365243.0 -716.0 -386.0 -536.0 -527.0 1.0
2 100003 0 Cash loans F N N 0 270000.0 1293502.5 10.482892 ... Furniture 6.0 middle POS industry with interest 365243.0 -797.0 -647.0 -647.0 -639.0 0.0
3 100003 0 Cash loans F N N 0 270000.0 1293502.5 10.482892 ... Consumer electronics 12.0 middle POS household with interest 365243.0 -2310.0 -1980.0 -1980.0 -1976.0 1.0
4 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 8.817446 ... Connectivity 4.0 middle POS mobile without interest 365243.0 -784.0 -694.0 -724.0 -714.0 0.0

5 rows × 79 columns

In [93]:
# see the list of columns in 'previous_train'
previous_train.columns
Out[93]:
Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE_x', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT_x', 'AMT_ANNUITY_x', 'AMT_GOODS_PRICE_x',
       'NAME_TYPE_SUITE_x', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE',
       'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'WEEKDAY_APPR_PROCESS_START_x', 'HOUR_APPR_PROCESS_START_x',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'AMT_REQ_CREDIT_BUREAU_YEAR', 'DAYS_LAST_PHONE_CHANGE(In Year)',
       'DAYS_BIRTH(In Year)', 'DAYS_EMPLOYED(In Year)',
       'DAYS_REGISTRATION(In Year)', 'DAYS_ID_PUBLISH(In Year)', 'SK_ID_PREV',
       'NAME_CONTRACT_TYPE_y', 'AMT_ANNUITY_y', 'AMT_APPLICATION',
       'AMT_CREDIT_y', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE_y',
       'WEEKDAY_APPR_PROCESS_START_y', 'HOUR_APPR_PROCESS_START_y',
       'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
       'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
       'RATE_INTEREST_PRIVILEGED', 'NAME_CASH_LOAN_PURPOSE',
       'NAME_CONTRACT_STATUS', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
       'CODE_REJECT_REASON', 'NAME_TYPE_SUITE_y', 'NAME_CLIENT_TYPE',
       'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
       'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
       'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
       'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
       'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'],
      dtype='object')

Step 10 : Bivariate Analysis¶

Step 10.1 : Categorical columns vs TARGET¶

In [94]:
# Categorical columns for analysis
categorical_cols = ['CODE_GENDER', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 
                    'NAME_HOUSING_TYPE', 'NAME_CONTRACT_STATUS', 
                    'NAME_FAMILY_STATUS', 'OCCUPATION_TYPE']

# Grid setup
cols = 2
rows = math.ceil(len(categorical_cols) / cols)

fig, axes = plt.subplots(rows, cols, figsize=(14, rows * 5))
axes = axes.flatten()

# Plotting each subplot
for i, col in enumerate(categorical_cols):
    sns.countplot(data=previous_train, x=col, hue='TARGET', ax=axes[i])
    axes[i].set_title(f'{col} vs Loan Default')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('Count')
    axes[i].tick_params(axis='x', rotation=90)
    axes[i].legend(title='TARGET')

# Hide extra axes if any
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.suptitle('Bivariate Analysis: Categorical Columns vs Loan Default', fontsize=16, y=1.02)
plt.show()

Inference:¶

  1. CODE_GENDER: Female applicants make up the majority and also contribute more defaults in absolute terms. This may reflect their higher application volume rather than higher risk.

  2. NAME_INCOME_TYPE: Applicants who are 'Working' contribute to the highest number of defaults, followed by 'Commercial associate' and 'Pensioner'. Risk assessment should consider not just income, but income stability. Commercial associates may include self-employed or irregular earners, which could explain higher risk. Pensioners, despite being retired, may represent a lower risk due to predictable income streams.

  3. NAME_EDUCATION_TYPE: Most applicants have 'Secondary' education, which also corresponds to most defaults. Higher education correlates with fewer defaults, suggesting a link between education level and financial responsibility. We can use education as a soft risk indicator and possibly offer better terms to highly educated clients.

  4. NAME_HOUSING_TYPE: 'House/apartment' is the dominant housing type among applicants. Housing type alone may not be a strong risk indicator, but those renting may indicate less asset security, which could slightly increase risk.

  5. NAME_CONTRACT_STATUS: Most defaults occur under the 'Approved' status, as expected, since that is where a loan was actually granted. Statuses like 'Refused' or 'Canceled' produce no repayment outcome to evaluate.

  6. NAME_FAMILY_STATUS: Married applicants are the largest group and also account for most defaults.

  7. OCCUPATION_TYPE: Laborers, sales staff, and drivers show higher counts of defaults. These occupations may correlate with lower or unstable income, higher physical job risks, or lower financial literacy.

In [95]:
# Create function that create 4 graphs for bivariate analysis

def plotting(column, hue, data):
    """
    Plots 4 visualizations for the relationship between two categorical columns:
    - Pie chart of `column`
    - Bar chart: % of hue categories within `column`
    - Countplot of hue split by column
    - Countplot of column split by hue
    """
    col = column

    # Create figure
    fig = plt.figure(figsize=(14, 14))

    # Subplot 1: Pie Chart of column
    ax1 = plt.subplot(221)
    data[col].value_counts().plot.pie(autopct="%1.0f%%", startangle=90, pctdistance=0.85, textprops={'fontsize': 8}, ax=ax1)
    plt.title(f'Distribution of values for: {column}')
    ax1.set_ylabel('')  # Remove ylabel

    # Subplot 2: % Distribution of hue within each column category
    ax2 = plt.subplot(222)
    dist = pd.crosstab(data[col], data[hue], normalize='index') * 100
    dist.plot(kind='bar', stacked=True, ax=ax2)
    plt.title(f'Percentage of {hue} within each {column}')
    plt.xticks(rotation=90)
    plt.xlabel('')

    # Subplot 3: Countplot of hue per column category
    ax3 = plt.subplot(223)
    sns.countplot(data=data, x=col, hue=hue, ax=ax3)
    plt.xticks(rotation=90)
    plt.xlabel('')
    plt.title(f'{hue} count per {col}')

    # Subplot 4: Countplot of column per hue category
    ax4 = plt.subplot(224)
    sns.countplot(data=data, x=hue, hue=col, ax=ax4)
    plt.xticks(rotation=90)
    plt.xlabel('')
    plt.title(f'{col} count per {hue}')

    plt.tight_layout()
    plt.show()
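The row-normalized crosstab inside `plotting` is what turns raw counts into comparable percentages; a minimal sketch on fabricated rows:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'F', 'F', 'M'],
                   'status': ['Approved', 'Approved', 'Refused', 'Approved']})

# normalize='index' makes each row sum to 1.0 before scaling to percent
dist = pd.crosstab(df['gender'], df['status'], normalize='index') * 100

print(dist.round(1))
```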
In [96]:
# Plot graph for 'CODE_GENDER' and 'NAME_CONTRACT_STATUS'

plotting('CODE_GENDER', 'NAME_CONTRACT_STATUS', data=previous_train)

Observation:

Female applicants account for the vast majority of loan applications, making up 68% of the total, compared to 32% for males. While females secure the highest number of approved loans in absolute terms, this is primarily due to their higher application volume. Interestingly, male applicants exhibit a slightly higher approval rate than females, who, in turn, experience a marginally higher rate of loan cancellations and refusals. The "Gender 'XNA'" category is negligible.

In [97]:
# Plot graph for 'NAME_INCOME_TYPE' and 'NAME_CONTRACT_STATUS'

plotting('NAME_INCOME_TYPE', 'NAME_CONTRACT_STATUS', data=previous_train)

Observation:

The majority of loan applicants (over 51%) are working professionals, followed by commercial associates (23%) and pensioners (19%). While working professionals account for the highest volume of approved loans, state servants demonstrate the most favorable approval rate relative to their application volume. Conversely, unemployed individuals and those on maternity leave face significantly higher rates of loan cancellation and rejection, despite being smaller applicant groups. Commercial associates and pensioners show a more balanced distribution of approvals and rejections, indicating varied risk profiles within these segments.

In [98]:
# Plot graph for 'NAME_EDUCATION_TYPE' and 'NAME_CONTRACT_STATUS'
plotting('NAME_EDUCATION_TYPE', 'NAME_CONTRACT_STATUS', data=previous_train)

Observation:

The majority of loan applicants (around 73%) have Secondary / secondary special education, making this group central to loan activity. While this segment accounts for most approved, canceled, and refused loans due to their sheer volume, applicants with Academic and Higher education demonstrate the highest approval rates, suggesting they are lower-risk. Conversely, individuals with Lower or Incomplete education experience more cancellations or rejections.

In [99]:
# Plot graph for 'NAME_HOUSING_TYPE' and 'NAME_CONTRACT_STATUS'

plotting('NAME_HOUSING_TYPE', 'NAME_CONTRACT_STATUS', data=previous_train)

Observation:

The overwhelming majority of loan applicants (nearly 89%) reside in a "House/apartment," indicating a strong prevalence of stable housing situations within the applicant pool. The next largest segments are individuals living with parents (4%) and renters (4%). Interestingly, loan approval rates remain relatively consistent across all housing types, suggesting that housing status alone does not significantly influence lending decisions. Due to their sheer volume, house/apartment residents naturally account for the highest absolute numbers of both loan approvals and refusals.

In [100]:
# Plot graph for 'NAME_FAMILY_STATUS' and 'NAME_CONTRACT_STATUS'

plotting('NAME_FAMILY_STATUS', 'NAME_CONTRACT_STATUS', data=previous_train)

Observation:

Married applicants constitute the largest demographic in loan applications, making up 64% of the dataset. Following them are single individuals (13%) and those in a civil marriage (10%). Loan approval rates are notably consistent across all family statuses, hovering around 60%. Due to their significant presence, married individuals account for the highest volume of both approved and refused loans.

In [101]:
# Plot graph for 'OCCUPATION_TYPE' and 'NAME_CONTRACT_STATUS'

plotting('OCCUPATION_TYPE', 'NAME_CONTRACT_STATUS', data=previous_train)

Observation:

Laborers form the largest occupational group among loan applicants at 26%, followed by sales staff (16%) and core staff (12%). Core staff, sales staff, and high-skill tech staff consistently exhibit the highest loan approval rates, all exceeding 60%. Conversely, laborers and drivers face slightly higher refusal rates compared to applicants in technical or managerial positions. Occupations with minimal representation, such as HR staff and secretaries, have a negligible impact on overall loan volume and approval trends.

Step 10.2 : Numerical columns vs TARGET¶

In [102]:
numeric_columns_prev = previous_train.select_dtypes(include=['number'])
num_columns = numeric_columns_prev.columns
num_columns
Out[102]:
Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT_x', 'AMT_ANNUITY_x', 'AMT_GOODS_PRICE_x',
       'REGION_POPULATION_RELATIVE', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
       'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'HOUR_APPR_PROCESS_START_x', 'REG_REGION_NOT_LIVE_REGION',
       'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
       'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
       'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'AMT_REQ_CREDIT_BUREAU_YEAR', 'DAYS_LAST_PHONE_CHANGE(In Year)',
       'DAYS_BIRTH(In Year)', 'DAYS_EMPLOYED(In Year)',
       'DAYS_REGISTRATION(In Year)', 'DAYS_ID_PUBLISH(In Year)', 'SK_ID_PREV',
       'AMT_ANNUITY_y', 'AMT_APPLICATION', 'AMT_CREDIT_y', 'AMT_DOWN_PAYMENT',
       'AMT_GOODS_PRICE_y', 'HOUR_APPR_PROCESS_START_y',
       'NFLAG_LAST_APPL_IN_DAY', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
       'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'SELLERPLACE_AREA',
       'CNT_PAYMENT', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
       'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION',
       'NFLAG_INSURED_ON_APPROVAL'],
      dtype='object')
In [103]:
# Define numerical columns and target column
numerical_cols = ['CNT_CHILDREN','AMT_INCOME_TOTAL','CNT_FAM_MEMBERS',
                  'AMT_CREDIT_x', 'DAYS_BIRTH(In Year)', 'EXT_SOURCE_3']
target_col = 'TARGET'

# Set up subplot dimensions
cols = 2
rows = -(-len(numerical_cols) // cols)  # Ceiling division
fig, axes = plt.subplots(rows, cols, figsize=(14, rows * 4))
axes = axes.flatten()

# Plot boxplots
for i, col in enumerate(numerical_cols):
    ax = axes[i]
    sns.boxplot(data=previous_train, x=target_col, y=col, ax=ax)
    ax.set_title(f'{col} vs {target_col}')
    # ax.set_xlabel('Loan Default (TARGET)')
    ax.set_xlabel('')
    ax.set_ylabel('')

# Remove any unused subplot spaces
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.suptitle('Bivariate Analysis: Numerical Variables vs TARGET', fontsize=16, y=1.02)
plt.show()
[Figure: grid of boxplots, one per numerical variable, split by TARGET (suptitle: "Bivariate Analysis: Numerical Variables vs TARGET")]

Observation:

  1. CNT_CHILDREN vs TARGET : Applicants with 0 children dominate both groups, but defaulters show slightly more variance. There is no strong relationship between number of children and loan default.

  2. AMT_INCOME_TOTAL vs TARGET : Both defaulters and non-defaulters show similar income ranges, with high outliers in both. This indicates that income alone may not strongly differentiate loan default behavior.

  3. CNT_FAM_MEMBERS vs TARGET : The median and IQR of family members are similar across both groups. This suggests that family size has little direct influence on default risk.

  4. AMT_CREDIT vs TARGET : Defaulters have slightly lower median credit amounts than non-defaulters. Higher credit values are common in both groups, making this a weak discriminator.

  5. DAYS_BIRTH (In Years) vs TARGET : Defaulters tend to be slightly younger on average than non-defaulters. This suggests that younger applicants may carry a higher risk of default.

  6. EXT_SOURCE_3 vs TARGET : Non-defaulters have noticeably higher EXT_SOURCE_3 scores compared to defaulters. This makes EXT_SOURCE_3 a strong indicator of creditworthiness and default prediction.
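The visual reading of the boxplots above can be backed by per-class medians. A minimal sketch, using a small made-up frame as a stand-in for `previous_train` (the column names match the notebook; the values are illustrative only):

```python
import pandas as pd

# Illustrative stand-in for `previous_train` with a few of the columns
# plotted above; values are invented purely for demonstration.
df = pd.DataFrame({
    'TARGET': [0, 0, 0, 1, 1],
    'AMT_CREDIT_x': [500000, 650000, 700000, 450000, 400000],
    'EXT_SOURCE_3': [0.72, 0.65, 0.80, 0.30, 0.25],
    'DAYS_BIRTH(In Year)': [45, 52, 38, 27, 24],
})

# Median of each numeric column per TARGET class quantifies what the
# boxplots show visually (e.g. lower EXT_SOURCE_3 among defaulters).
medians = df.groupby('TARGET')[['AMT_CREDIT_x', 'EXT_SOURCE_3',
                                'DAYS_BIRTH(In Year)']].median()
print(medians)
```

Run on the real `previous_train`, the same one-liner turns each "slightly lower/higher" observation into a concrete number that can be cited in the conclusions.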



Step 11 : Business Oriented Conclusions¶

After performing an in-depth exploratory data analysis of the loan applicants' profiles and repayment behavior, the following conclusions are drawn:

Age Is a Strong Predictor of Default

Younger applicants (ages 20–30) are significantly more likely to default than older applicants.

Possible Action: Implement stricter eligibility criteria (e.g., lower loan amounts or higher interest rates) for younger applicants to mitigate risk.
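The age-risk relationship behind this recommendation can be quantified by binning applicants into age bands and computing the default rate per band. A minimal sketch, using invented ages and outcomes as a stand-in for the real data (the band edges are assumptions, not values derived from the dataset):

```python
import pandas as pd

# Hypothetical applicant ages and outcomes; in the notebook this
# would be previous_train['DAYS_BIRTH(In Year)'] and ['TARGET'].
df = pd.DataFrame({
    'DAYS_BIRTH(In Year)': [22, 26, 29, 35, 41, 48, 55, 63],
    'TARGET':              [1,  1,  0,  0,  1,  0,  0,  0],
})

# Bucket applicants into age bands, then take the mean of TARGET
# (1 = payment difficulty) to get a default rate per band.
bands = pd.cut(df['DAYS_BIRTH(In Year)'], bins=[20, 30, 40, 50, 70],
               labels=['20-30', '30-40', '40-50', '50-70'])
default_rate = df.groupby(bands, observed=False)['TARGET'].mean()
print(default_rate)
```

The resulting per-band rates give concrete cutoffs for where stricter eligibility criteria should begin to apply.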

External Risk Source Scores Matter

EXT_SOURCE_3, a risk assessment feature, shows a clear inverse relationship with default. Lower values are highly indicative of higher risk.

Possible Action: Use this score as a primary factor in automated credit scoring models. Set threshold values below which applications are flagged for manual review.
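The manual-review flag suggested above is straightforward to implement. A minimal sketch with an invented applicant frame; the 0.3 cutoff is an assumed example threshold, not one derived from the actual data:

```python
import pandas as pd

# Illustrative applicant scores; IDs and values are made up.
apps = pd.DataFrame({
    'SK_ID_CURR': [1001, 1002, 1003, 1004],
    'EXT_SOURCE_3': [0.81, 0.28, 0.55, 0.12],
})

THRESHOLD = 0.3  # assumed cutoff; tune against observed default rates

# Flag low-score applications for manual review.
apps['manual_review'] = apps['EXT_SOURCE_3'] < THRESHOLD
flagged = apps.loc[apps['manual_review'], 'SK_ID_CURR'].tolist()
print(flagged)  # applicant IDs routed to manual review
```

In practice the threshold would be chosen by examining the default rate across EXT_SOURCE_3 quantiles rather than fixed a priori.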

Number of Children Correlates with Higher Default

Applicants with three or more children show increased default rates, potentially due to higher household financial burdens.

Possible Action: Adjust risk thresholds based on family size or combine this with income metrics to assess repayment capacity better.

Education Level Reflects Financial Discipline

Applicants with lower education levels (Secondary and below) are more prone to default.

Possible Action: Use education level as an auxiliary feature in scoring models. Applicants with higher education levels may be offered better loan terms.

Employment Type Affects Default Risk

Unemployed applicants, students, and working pensioners show a higher tendency to default.

Possible Action: Introduce tiered risk scoring or mandatory collateral/guarantor for these applicant categories.

Family Status Has an Impact

Single and separated applicants tend to have slightly higher default rates.

Possible Action: Include marital status in behavioral segmentation for personalized loan products and monitoring.

Income Type Should Be Evaluated Carefully

Applicants with irregular income types (e.g., maternity leave, commercial associates) show elevated risk.

Possible Action: Evaluate income stability alongside declared income. Consider requesting income proof for non-standard employment types.