Credit Risk Analysis
Business Understanding
The aim of this analysis is to study customer data using Exploratory Data Analysis to uncover patterns, trends, and key variables that strongly influence loan default behavior. By identifying these factors, the company can make informed lending decisions, reduce the risk of default, and improve the overall effectiveness of credit risk management.
Step 1: Import Libraries
Description - In this step we import all the libraries required for the project.
import pandas as pd # For data manipulation and analysis
import numpy as np # For numerical computation
import matplotlib.pyplot as plt # For plotting graphs and charts
import seaborn as sns # For plotting graphs and charts in more advanced way
import math # For mathematical operations
import warnings
warnings.filterwarnings('ignore') # Filter out warnings to ignore them
Step 2: Read Dataset
# Load CSV file 'application_data.csv' into a DataFrame
application_data = pd.read_csv("C:/Users/ASUS/OneDrive/Desktop/Datasets/Credit Risk Analysis Dataset/application_data.csv")
# Clients with payment difficulties
application_data[application_data['TARGET'] == 1]
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 26 | 100031 | 1 | Cash loans | F | N | Y | 0 | 112500.0 | 979992.0 | 27076.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 |
| 40 | 100047 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 1193580.0 | 35028.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 4.0 |
| 42 | 100049 | 1 | Cash loans | F | N | N | 0 | 135000.0 | 288873.0 | 16258.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 81 | 100096 | 1 | Cash loans | F | N | Y | 0 | 81000.0 | 252000.0 | 14593.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307448 | 456186 | 1 | Cash loans | M | N | N | 1 | 207000.0 | 450000.0 | 32746.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 307475 | 456215 | 1 | Cash loans | F | N | N | 1 | 144000.0 | 1303200.0 | 46809.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 6.0 |
| 307481 | 456225 | 1 | Cash loans | M | N | Y | 0 | 225000.0 | 297000.0 | 19975.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 307489 | 456233 | 1 | Cash loans | F | N | Y | 0 | 225000.0 | 521280.0 | 23089.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| 307509 | 456254 | 1 | Cash loans | F | N | Y | 0 | 171000.0 | 370107.0 | 20205.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
24825 rows × 122 columns
# Client with payment difficulties/other cases
application_data[application_data['TARGET'] == 0]
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 100008 | 0 | Cash loans | M | N | Y | 0 | 99000.0 | 490495.5 | 27517.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307505 | 456249 | 0 | Cash loans | F | N | Y | 0 | 112500.0 | 225000.0 | 22050.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 |
| 307506 | 456251 | 0 | Cash loans | M | N | N | 0 | 157500.0 | 254700.0 | 27558.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 307507 | 456252 | 0 | Cash loans | F | N | Y | 0 | 72000.0 | 269550.0 | 12001.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 307508 | 456253 | 0 | Cash loans | F | N | Y | 0 | 153000.0 | 677664.0 | 29979.0 | ... | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 307510 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 |
282686 rows × 122 columns
Identify Target Variable
Description - The target variable in the dataset is 'TARGET'. It is a binary variable that indicates whether a client had difficulty repaying a loan:
- 1 - The client had late payments (i.e., payment difficulties) on at least one of the first few installments
- 0 - All other cases (the client had no such payment issues)
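Since TARGET is a 0/1 indicator, its mean equals the share of clients with payment difficulties. A minimal sketch on a toy Series (hypothetical values, not the actual dataset):

```python
import pandas as pd

# Toy TARGET column (hypothetical): 1 = payment difficulty, 0 = other cases
target = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])

# For a binary 0/1 variable, the mean is exactly the default rate
default_rate = target.mean()
print(f"Default rate: {default_rate:.1%}")
```

The same one-liner applied to the real column, `application_data['TARGET'].mean()`, gives the overall default rate directly.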
Check Distribution of the Target Variable
Description - We check the distribution of the target variable using a bar plot and assess whether it is imbalanced.
# Count and percentage of target variable
target_counts = application_data['TARGET'].value_counts()
target_percent = application_data['TARGET'].value_counts(normalize=True)*100
# Create DataFrame for plotting
target_df = pd.DataFrame({
'Label': ['Other Cases (0)', 'Payment Difficulty (1)'],
'Count': target_counts.values,
'Percent': target_percent.values
})
# Plotting bar charts
plt.figure(figsize=(8, 5)) # (x-axis=8, y-axis=5)
barplot = sns.barplot(data=target_df, x='Label', y='Count', hue='Label')
# For adding percentage labels on top
for index, row in target_df.iterrows():
    barplot.text(index, row['Count'] + 1000, f"{row['Percent']:.2f}%", color='black', ha='center', fontweight='bold')
plt.title("Distribution of TARGET Variable", fontsize=14, fontweight='bold') # For graph title
plt.ylabel("Number of Clients") # For y axis labeling
plt.xlabel("Loan Repayment Status") # For x axis labeling
plt.grid(axis='y', linestyle='--', alpha=0.5)
sns.despine() # For removing boxline from graph
plt.tight_layout()
plt.show()
- A large portion of clients belong to Other Cases (0).
- A smaller portion of clients belong to Payment Difficulty (1).
This shows that the dataset is imbalanced, with far fewer defaulting clients (i.e., Payment Difficulty = 1).
What is data imbalance?
Data imbalance means that the classes in a variable (such as the target column) are not evenly distributed. One class (like non-defaulters) significantly outnumbers the other (like defaulters), which can lead to biased analysis or misleading results.
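One simple way to quantify this is the majority-to-minority ratio, together with the accuracy a naive majority-class rule would score. A sketch on toy data (90 non-defaulters vs 10 defaulters, not the real dataset):

```python
import pandas as pd

# Toy target: 90 non-defaulters (0) and 10 defaulters (1)
target = pd.Series([0] * 90 + [1] * 10)

counts = target.value_counts()
imbalance_ratio = counts.loc[0] / counts.loc[1]  # majority-to-minority ratio
print(f"Imbalance ratio (0 : 1) = {imbalance_ratio:.1f} : 1")

# Always predicting the majority class already scores 90% accuracy,
# which is why accuracy alone is misleading on imbalanced data
naive_accuracy = (target == 0).mean()
print(f"Naive majority-class accuracy: {naive_accuracy:.0%}")
```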
dataset_description = pd.read_csv("C:/Users/ASUS/OneDrive/Desktop/Datasets/Credit Risk Analysis Dataset/columns_description.csv")
dataset_description
| Unnamed: 0 | Table | Row | Description | Special | |
|---|---|---|---|---|---|
| 0 | 1 | application_data | SK_ID_CURR | ID of loan in our sample | NaN |
| 1 | 2 | application_data | TARGET | Target variable (1 - client with payment diffi... | NaN |
| 2 | 5 | application_data | NAME_CONTRACT_TYPE | Identification if loan is cash or revolving | NaN |
| 3 | 6 | application_data | CODE_GENDER | Gender of the client | NaN |
| 4 | 7 | application_data | FLAG_OWN_CAR | Flag if the client owns a car | NaN |
| ... | ... | ... | ... | ... | ... |
| 155 | 209 | previous_application.csv | DAYS_FIRST_DUE | Relative to application date of current applic... | time only relative to the application |
| 156 | 210 | previous_application.csv | DAYS_LAST_DUE_1ST_VERSION | Relative to application date of current applic... | time only relative to the application |
| 157 | 211 | previous_application.csv | DAYS_LAST_DUE | Relative to application date of current applic... | time only relative to the application |
| 158 | 212 | previous_application.csv | DAYS_TERMINATION | Relative to application date of current applic... | time only relative to the application |
| 159 | 213 | previous_application.csv | NFLAG_INSURED_ON_APPROVAL | Did the client requested insurance during the ... | NaN |
160 rows × 5 columns
# Displaying the summary of dataframe such as non-null count, data types and memory usage
application_data.iloc[:,1:50].info() # .iloc selects columns by position; here we fetch info for columns 1 through 49
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 49 columns):
 #   Column                       Non-Null Count   Dtype
---  ------                       --------------   -----
 0   TARGET                       307511 non-null  int64
 1   NAME_CONTRACT_TYPE           307511 non-null  object
 2   CODE_GENDER                  307511 non-null  object
 3   FLAG_OWN_CAR                 307511 non-null  object
 4   FLAG_OWN_REALTY              307511 non-null  object
 5   CNT_CHILDREN                 307511 non-null  int64
 6   AMT_INCOME_TOTAL             307511 non-null  float64
 7   AMT_CREDIT                   307511 non-null  float64
 8   AMT_ANNUITY                  307499 non-null  float64
 9   AMT_GOODS_PRICE              307233 non-null  float64
 10  NAME_TYPE_SUITE              306219 non-null  object
 11  NAME_INCOME_TYPE             307511 non-null  object
 12  NAME_EDUCATION_TYPE          307511 non-null  object
 13  NAME_FAMILY_STATUS           307511 non-null  object
 14  NAME_HOUSING_TYPE            307511 non-null  object
 15  REGION_POPULATION_RELATIVE   307511 non-null  float64
 16  DAYS_BIRTH                   307511 non-null  int64
 17  DAYS_EMPLOYED                307511 non-null  int64
 18  DAYS_REGISTRATION            307511 non-null  float64
 19  DAYS_ID_PUBLISH              307511 non-null  int64
 20  OWN_CAR_AGE                  104582 non-null  float64
 21  FLAG_MOBIL                   307511 non-null  int64
 22  FLAG_EMP_PHONE               307511 non-null  int64
 23  FLAG_WORK_PHONE              307511 non-null  int64
 24  FLAG_CONT_MOBILE             307511 non-null  int64
 25  FLAG_PHONE                   307511 non-null  int64
 26  FLAG_EMAIL                   307511 non-null  int64
 27  OCCUPATION_TYPE              211120 non-null  object
 28  CNT_FAM_MEMBERS              307509 non-null  float64
 29  REGION_RATING_CLIENT         307511 non-null  int64
 30  REGION_RATING_CLIENT_W_CITY  307511 non-null  int64
 31  WEEKDAY_APPR_PROCESS_START   307511 non-null  object
 32  HOUR_APPR_PROCESS_START      307511 non-null  int64
 33  REG_REGION_NOT_LIVE_REGION   307511 non-null  int64
 34  REG_REGION_NOT_WORK_REGION   307511 non-null  int64
 35  LIVE_REGION_NOT_WORK_REGION  307511 non-null  int64
 36  REG_CITY_NOT_LIVE_CITY       307511 non-null  int64
 37  REG_CITY_NOT_WORK_CITY       307511 non-null  int64
 38  LIVE_CITY_NOT_WORK_CITY      307511 non-null  int64
 39  ORGANIZATION_TYPE            307511 non-null  object
 40  EXT_SOURCE_1                 134133 non-null  float64
 41  EXT_SOURCE_2                 306851 non-null  float64
 42  EXT_SOURCE_3                 246546 non-null  float64
 43  APARTMENTS_AVG               151450 non-null  float64
 44  BASEMENTAREA_AVG             127568 non-null  float64
 45  YEARS_BEGINEXPLUATATION_AVG  157504 non-null  float64
 46  YEARS_BUILD_AVG              103023 non-null  float64
 47  COMMONAREA_AVG               92646 non-null   float64
 48  ELEVATORS_AVG                143620 non-null  float64
dtypes: float64(17), int64(20), object(12)
memory usage: 115.0+ MB
# Displaying total number of rows and columns in dataset (rows,columns)
application_data.shape
(307511, 122)
# Displaying the column names
application_data.columns
Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
'AMT_CREDIT', 'AMT_ANNUITY',
...
'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR'],
dtype='object', length=122)
# Displaying the first five rows of the dataset to view its structure and content
application_data.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
numeric = application_data.select_dtypes('number').columns
object_categorical = application_data.select_dtypes(['object', 'category']).columns
print(f'Total number of numeric columns: {len(numeric)}')
print(f'Total number of categorical columns: {len(object_categorical)}')
Total number of numeric columns: 106
Total number of categorical columns: 16
Inferences:
- Number of entries: The dataset consists of 307511 entries.
- Columns: There are 122 columns.
- Data types: There are 106 numeric columns and 16 object columns in total.
Step 3.2: Summary Statistics for Numerical Variables
# Get the summary statistics for numerical variables
application_data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 307511.0 | 278180.518577 | 102790.175348 | 100002.0 | 189145.5 | 278202.0 | 367142.5 | 456255.0 |
| TARGET | 307511.0 | 0.080729 | 0.272419 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CNT_CHILDREN | 307511.0 | 0.417052 | 0.722121 | 0.0 | 0.0 | 0.0 | 1.0 | 19.0 |
| AMT_INCOME_TOTAL | 307511.0 | 168797.919297 | 237123.146279 | 25650.0 | 112500.0 | 147150.0 | 202500.0 | 117000000.0 |
| AMT_CREDIT | 307511.0 | 599025.999706 | 402490.776996 | 45000.0 | 270000.0 | 513531.0 | 808650.0 | 4050000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AMT_REQ_CREDIT_BUREAU_DAY | 265992.0 | 0.007000 | 0.110757 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 265992.0 | 0.034362 | 0.204685 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 |
| AMT_REQ_CREDIT_BUREAU_MON | 265992.0 | 0.267395 | 0.916002 | 0.0 | 0.0 | 0.0 | 0.0 | 27.0 |
| AMT_REQ_CREDIT_BUREAU_QRT | 265992.0 | 0.265474 | 0.794056 | 0.0 | 0.0 | 0.0 | 0.0 | 261.0 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 265992.0 | 1.899974 | 1.869295 | 0.0 | 0.0 | 1.0 | 3.0 | 25.0 |
106 rows × 8 columns
Step 3.3: Summary Statistics for Categorical Variables
application_data.describe(include=['object']).T
| count | unique | top | freq | |
|---|---|---|---|---|
| NAME_CONTRACT_TYPE | 307511 | 2 | Cash loans | 278232 |
| CODE_GENDER | 307511 | 3 | F | 202448 |
| FLAG_OWN_CAR | 307511 | 2 | N | 202924 |
| FLAG_OWN_REALTY | 307511 | 2 | Y | 213312 |
| NAME_TYPE_SUITE | 306219 | 7 | Unaccompanied | 248526 |
| NAME_INCOME_TYPE | 307511 | 8 | Working | 158774 |
| NAME_EDUCATION_TYPE | 307511 | 5 | Secondary / secondary special | 218391 |
| NAME_FAMILY_STATUS | 307511 | 6 | Married | 196432 |
| NAME_HOUSING_TYPE | 307511 | 6 | House / apartment | 272868 |
| OCCUPATION_TYPE | 211120 | 18 | Laborers | 55186 |
| WEEKDAY_APPR_PROCESS_START | 307511 | 7 | TUESDAY | 53901 |
| ORGANIZATION_TYPE | 307511 | 58 | Business Entity Type 3 | 67992 |
| FONDKAPREMONT_MODE | 97216 | 4 | reg oper account | 73830 |
| HOUSETYPE_MODE | 153214 | 3 | block of flats | 150503 |
| WALLSMATERIAL_MODE | 151170 | 7 | Panel | 66040 |
| EMERGENCYSTATE_MODE | 161756 | 2 | No | 159428 |
# Calculate percentage of missing values
missing_values = application_data.isnull().sum() / len(application_data) *100
# Sort the missing-value percentages in descending order
missing_values = missing_values.sort_values(ascending=False)
missing_values.head(20)
COMMONAREA_AVG              69.872297
COMMONAREA_MODE             69.872297
COMMONAREA_MEDI             69.872297
NONLIVINGAPARTMENTS_MEDI    69.432963
NONLIVINGAPARTMENTS_MODE    69.432963
NONLIVINGAPARTMENTS_AVG     69.432963
FONDKAPREMONT_MODE          68.386172
LIVINGAPARTMENTS_AVG        68.354953
LIVINGAPARTMENTS_MEDI       68.354953
LIVINGAPARTMENTS_MODE       68.354953
FLOORSMIN_MODE              67.848630
FLOORSMIN_AVG               67.848630
FLOORSMIN_MEDI              67.848630
YEARS_BUILD_AVG             66.497784
YEARS_BUILD_MODE            66.497784
YEARS_BUILD_MEDI            66.497784
OWN_CAR_AGE                 65.990810
LANDAREA_MEDI               59.376738
LANDAREA_AVG                59.376738
LANDAREA_MODE               59.376738
dtype: float64
As we can see, many columns have more than 40% missing values, while the rest have comparatively few. Imputing columns that are mostly missing would distort the analysis, so the better way to handle them is to drop them. We will consider the remaining columns for analysis.
# Select columns with more than 40% missing values
remove_missing_column = missing_values.loc[missing_values > 40].index
# Drop the columns that has more than 40% missing values
application_data_update = application_data.drop(columns=remove_missing_column)
# Columns that have less than 40% missing values
missing_values_gt_0 = application_data_update.isnull().sum() / len(application_data) * 100
missing_values_gt_0 = missing_values_gt_0[missing_values_gt_0 > 0].sort_values(ascending=False)
missing_values_gt_0
OCCUPATION_TYPE               31.345545
EXT_SOURCE_3                  19.825307
AMT_REQ_CREDIT_BUREAU_WEEK    13.501631
AMT_REQ_CREDIT_BUREAU_MON     13.501631
AMT_REQ_CREDIT_BUREAU_YEAR    13.501631
AMT_REQ_CREDIT_BUREAU_QRT     13.501631
AMT_REQ_CREDIT_BUREAU_HOUR    13.501631
AMT_REQ_CREDIT_BUREAU_DAY     13.501631
NAME_TYPE_SUITE                0.420148
OBS_30_CNT_SOCIAL_CIRCLE       0.332021
DEF_60_CNT_SOCIAL_CIRCLE       0.332021
DEF_30_CNT_SOCIAL_CIRCLE       0.332021
OBS_60_CNT_SOCIAL_CIRCLE       0.332021
EXT_SOURCE_2                   0.214626
AMT_GOODS_PRICE                0.090403
AMT_ANNUITY                    0.003902
CNT_FAM_MEMBERS                0.000650
DAYS_LAST_PHONE_CHANGE         0.000325
dtype: float64
Step 4.2: Missing Values Treatment
Column 'OCCUPATION_TYPE'
# Plotting bar charts for visualization of data in easy manner
occupation_counts = application_data_update['OCCUPATION_TYPE'].value_counts().reset_index()
plt.figure(figsize=(10, 5))
sns.barplot(data=occupation_counts, x='OCCUPATION_TYPE', y='count', hue='OCCUPATION_TYPE')
plt.xticks(rotation=90)
plt.title("Distribution of OCCUPATION_TYPE")
plt.xlabel("Occupation Type")
plt.ylabel("Number of Clients")
sns.despine() # Remove the top and right spines from the graph
plt.tight_layout()
plt.show()
This column has ~31% missing values, so substituting a value such as the mode is not suitable here. For this reason we retain the column as it is.
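A common alternative (not applied here) is to make the missingness explicit by filling with an 'Unknown' label, which keeps every row usable in group-by analyses without guessing an occupation. A minimal sketch on toy values:

```python
import pandas as pd

# Toy occupation column with missing entries (hypothetical values)
occupation = pd.Series(['Laborers', None, 'Core staff', None, 'Drivers'])

# Turn missingness into its own category instead of dropping or imputing a guess
occupation_filled = occupation.fillna('Unknown')
print(occupation_filled.value_counts())
```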
Column 'EXT_SOURCE_3'
# print top 10 values
application_data_update['EXT_SOURCE_3'].head(10)
0    0.139376
1         NaN
2    0.729567
3         NaN
4         NaN
5    0.621226
6    0.492060
7    0.540654
8    0.751724
9         NaN
Name: EXT_SOURCE_3, dtype: float64
# Here we use histogram to see distribution of data and frequency of data points.
plt.figure(figsize=(8,5))
plt.title("Distribution of EXT_SOURCE_3")
sns.histplot(data=application_data_update, x='EXT_SOURCE_3', bins=10, color='Salmon')
plt.tight_layout()
plt.show()
print(f"Mean: {application_data_update['EXT_SOURCE_3'].mean()}")
print(f"Median: {application_data_update['EXT_SOURCE_3'].median()}")
print(f"Mode: {application_data_update['EXT_SOURCE_3'].mode()[0]}")
Mean: 0.5108529061799657
Median: 0.5352762504724826
Mode: 0.746300213050371
The EXT_SOURCE_3 feature is a normalized external risk score ranging between 0 and 1; higher values indicate lower credit risk. It has ~19.8% missing data, so we retain the column and impute the missing values with the median. It plays a critical role in identifying risky applicants.
# Fill null values with the column median
application_data_update['EXT_SOURCE_3'] = application_data_update['EXT_SOURCE_3'].fillna(application_data_update['EXT_SOURCE_3'].median())
Column 'AMT_REQ_CREDIT_BUREAU_WEEK' Description - This feature tells us how many times a bank or other lenders checked the applicant’s credit in the week before they applied for the loan.
application_data_update['AMT_REQ_CREDIT_BUREAU_WEEK'].value_counts()
AMT_REQ_CREDIT_BUREAU_WEEK
0.0    257456
1.0      8208
2.0       199
3.0        58
4.0        34
6.0        20
5.0        10
8.0         5
7.0         2
Name: count, dtype: int64
# Here we use histogram to see distribution of data and frequency of data points.
plt.figure(figsize=(8,5))
plt.title("Distribution of AMT_REQ_CREDIT_BUREAU_WEEK")
sns.histplot(data=application_data_update, x='AMT_REQ_CREDIT_BUREAU_WEEK', color='Salmon')
plt.show()
This shows that most clients had 0 inquiries in the last week. Because the values are nearly constant, analysing this column would yield little insight, so we ignore it.
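The same "mostly one value" check can be scripted instead of inspecting each inquiry column by hand. A sketch with a hypothetical 95% dominance threshold:

```python
import pandas as pd

def near_constant_columns(df, threshold=0.95):
    """Return columns where a single value accounts for more than
    `threshold` of the non-null observations."""
    flagged = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True).iloc[0]
        if top_share > threshold:
            flagged.append(col)
    return flagged

# Toy frame: 'week' is dominated by zeros, 'year' is spread out
toy = pd.DataFrame({
    'week': [0.0] * 98 + [1.0, 2.0],
    'year': list(range(100)),
})
print(near_constant_columns(toy))
```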
Column 'AMT_REQ_CREDIT_BUREAU_MON' Description - This feature tells us how many times a bank or other lenders checked the applicant’s credit in the last month before they applied for the loan.
application_data_update['AMT_REQ_CREDIT_BUREAU_MON'].value_counts().head(10)
AMT_REQ_CREDIT_BUREAU_MON
0.0    222233
1.0     33147
2.0      5386
3.0      1991
4.0      1076
5.0       602
6.0       343
7.0       298
9.0       206
8.0       185
Name: count, dtype: int64
# Here we use histogram to see distribution of data and frequency of data points.
plt.figure(figsize=(8,5))
plt.title("Distribution of AMT_REQ_CREDIT_BUREAU_MON")
sns.histplot(data=application_data_update, x='AMT_REQ_CREDIT_BUREAU_MON', color='Salmon')
plt.show()
As before, most of the values are constant, so we ignore this column as well.
Column 'AMT_REQ_CREDIT_BUREAU_YEAR' Description - This feature tells us how many times a bank or other lender checked the applicant’s credit in the year before they applied for the loan.
application_data_update['AMT_REQ_CREDIT_BUREAU_YEAR'].value_counts().head(10)
AMT_REQ_CREDIT_BUREAU_YEAR
0.0    71801
1.0    63405
2.0    50192
3.0    33628
4.0    20714
5.0    12052
6.0     6967
7.0     3869
8.0     2127
9.0     1096
Name: count, dtype: int64
# Here we use histogram to see distribution of data and frequency of data points.
plt.figure(figsize=(8,5))
plt.title("Distribution of AMT_REQ_CREDIT_BUREAU_YEAR")
sns.histplot(data=application_data_update, x='AMT_REQ_CREDIT_BUREAU_YEAR', color='Salmon', bins=25)
plt.show()
# Calculate average (mean) of the column
print(f"Mean: {application_data_update['AMT_REQ_CREDIT_BUREAU_YEAR'].mean()}")
# Calculate center value (median) of the column
print(f"Median: {application_data_update['AMT_REQ_CREDIT_BUREAU_YEAR'].median()}")
# Calculate most frequent value (mode) of the column
print(f"Mode: {application_data_update['AMT_REQ_CREDIT_BUREAU_YEAR'].mode()[0]}")
Mean: 1.899974435321363
Median: 1.0
Mode: 0.0
As we can see from the histogram, most of the data lies between 0.0 and 3.0. The column had ~13.5% missing values. Since the most common value is 0.0, indicating no inquiries in the past year, we imputed the missing values with the mode. This approach is justified as it reflects the behavior of the majority and avoids introducing bias.
# Fill the missing values with the mode, 0.0
application_data_update['AMT_REQ_CREDIT_BUREAU_YEAR'] = application_data_update['AMT_REQ_CREDIT_BUREAU_YEAR'].fillna(0.0)
Column 'AMT_REQ_CREDIT_BUREAU_QRT' Description - This feature tells us how many times a bank or other lender checked the applicant’s credit in the three months before they applied for the loan.
application_data_update['AMT_REQ_CREDIT_BUREAU_QRT'].head(10)
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
5    1.0
6    1.0
7    0.0
8    0.0
9    NaN
Name: AMT_REQ_CREDIT_BUREAU_QRT, dtype: float64
application_data_update['AMT_REQ_CREDIT_BUREAU_QRT'].value_counts()
AMT_REQ_CREDIT_BUREAU_QRT
0.0      215417
1.0       33862
2.0       14412
3.0        1717
4.0         476
5.0          64
6.0          28
8.0           7
7.0           7
261.0         1
19.0          1
Name: count, dtype: int64
# Here we use histogram to see distribution of data and frequency of data points.
plt.figure(figsize=(8,5))
plt.title("Distribution of AMT_REQ_CREDIT_BUREAU_QRT")
sns.histplot(data=application_data_update, x='AMT_REQ_CREDIT_BUREAU_QRT', color='Salmon')
plt.show()
As most of the values are constant and fall into one category, we ignore this column.
Column 'AMT_REQ_CREDIT_BUREAU_HOUR' Description - This feature tells us how many times a bank or other lenders checked the applicant’s credit one hour before they applied for the loan.
application_data_update['AMT_REQ_CREDIT_BUREAU_HOUR'].head()
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
Name: AMT_REQ_CREDIT_BUREAU_HOUR, dtype: float64
application_data_update['AMT_REQ_CREDIT_BUREAU_HOUR'].value_counts()
AMT_REQ_CREDIT_BUREAU_HOUR
0.0    264366
1.0      1560
2.0        56
3.0         9
4.0         1
Name: count, dtype: int64
# Here we use histogram to see distribution of data and frequency of data points.
plt.figure(figsize=(8,5))
plt.title("Distribution of AMT_REQ_CREDIT_BUREAU_HOUR")
sns.histplot(data=application_data_update, x='AMT_REQ_CREDIT_BUREAU_HOUR', color='Salmon')
plt.show()
Again, most of the values are constant and fall into one category, so we ignore this column.
Column 'AMT_REQ_CREDIT_BUREAU_DAY' Description - This feature tells us how many times a bank or other lenders checked the applicant’s credit one day before they applied for the loan.
application_data_update['AMT_REQ_CREDIT_BUREAU_DAY'].head()
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
Name: AMT_REQ_CREDIT_BUREAU_DAY, dtype: float64
application_data_update['AMT_REQ_CREDIT_BUREAU_DAY'].value_counts()
AMT_REQ_CREDIT_BUREAU_DAY
0.0    264503
1.0      1292
2.0       106
3.0        45
4.0        26
5.0         9
6.0         8
9.0         2
8.0         1
Name: count, dtype: int64
# Here we use histogram to see distribution of data and frequency of data points.
plt.figure(figsize=(8,5))
plt.title("Distribution of AMT_REQ_CREDIT_BUREAU_DAY")
sns.histplot(data=application_data_update, x='AMT_REQ_CREDIT_BUREAU_DAY', color='Salmon')
plt.show()
As most of the values are 0, we ignore this column as well.
Column 'NAME_TYPE_SUITE' Description - This column describes who accompanied the client during the loan application.
application_data_update['NAME_TYPE_SUITE'].head()
0    Unaccompanied
1           Family
2    Unaccompanied
3    Unaccompanied
4    Unaccompanied
Name: NAME_TYPE_SUITE, dtype: object
application_data_update['NAME_TYPE_SUITE'].value_counts(normalize=True)*100
NAME_TYPE_SUITE
Unaccompanied      81.159562
Family             13.111205
Spouse, partner     3.713029
Children            1.066884
Other_B             0.578018
Other_A             0.282804
Group of people     0.088499
Name: proportion, dtype: float64
# Here we use a countplot to see the count of values in each category
plt.figure(figsize=(8,5))
plt.title("Client Accompaniment at Loan Application")
sns.countplot(data=application_data_update, x='NAME_TYPE_SUITE', hue='NAME_TYPE_SUITE', legend=False)
plt.xlabel("Accompaniment Type")
plt.ylabel("Number of Clients")
plt.xticks(rotation=90)
plt.show()
Most clients (~81%) came unaccompanied, with smaller portions in the other categories. Missing values were filled with 'Unaccompanied', the mode (most frequent category).
application_data_update['NAME_TYPE_SUITE'].mode()
0    Unaccompanied
Name: NAME_TYPE_SUITE, dtype: object
# Fill null values with the mode value
application_data_update['NAME_TYPE_SUITE'] = application_data_update['NAME_TYPE_SUITE'].fillna('Unaccompanied')
application_data_update['OBS_30_CNT_SOCIAL_CIRCLE'].head()
0    2.0
1    1.0
2    0.0
3    2.0
4    0.0
Name: OBS_30_CNT_SOCIAL_CIRCLE, dtype: float64
application_data_update['OBS_30_CNT_SOCIAL_CIRCLE'].value_counts().head(10)
OBS_30_CNT_SOCIAL_CIRCLE
0.0    163910
1.0     48783
2.0     29808
3.0     20322
4.0     14143
5.0      9553
6.0      6453
7.0      4390
8.0      2967
9.0      2003
Name: count, dtype: int64
Columns 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE'
# We plot these four columns together because their values are closely related
columns = ['OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE']
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
for idx, column in enumerate(columns):
    sns.countplot(data=application_data_update, x=column, hue=column, palette='Paired', legend=False, ax=axes[idx])
    axes[idx].set_title(f'Countplot for {column}')
    axes[idx].tick_params(axis='x', rotation=90)
plt.tight_layout()
plt.show()
# Compute the correlation of the columns below with the TARGET variable
result = application_data_update[['OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'TARGET']].corr()
# Plot correlation with heatmap
plt.figure(figsize=(8,5))
plt.title('Correlation with TARGET')
sns.heatmap(result, annot=True, cmap='Reds')
plt.tight_layout()
plt.show()
As the heatmap shows, these four columns have no strong correlation with the TARGET variable, so to avoid unnecessary complexity we do not analyse them further.
Column 'EXT_SOURCE_2'
application_data_update['EXT_SOURCE_2'].head()
0    0.262949
1    0.622246
2    0.555912
3    0.650442
4    0.322738
Name: EXT_SOURCE_2, dtype: float64
application_data_update['EXT_SOURCE_2'].value_counts()
EXT_SOURCE_2
0.285898 721
0.262258 417
0.265256 343
0.159679 322
0.265312 306
...
0.353855 1
0.200589 1
0.146779 1
0.288490 1
0.269471 1
Name: count, Length: 119831, dtype: int64
# Here we use histogram to see distribution of data and frequency of data points.
plt.figure(figsize=(8,5))
plt.title("Distribution of EXT_SOURCE_2")
sns.histplot(application_data_update['EXT_SOURCE_2'], bins=10, kde=True, color='Salmon')
plt.tight_layout()
plt.show()
print(f"Mean: {application_data_update['EXT_SOURCE_2'].mean()}")
print(f"Median: {application_data_update['EXT_SOURCE_2'].median()}")
print(f"Mode: {application_data_update['EXT_SOURCE_2'].mode()[0]}")
Mean: 0.5143926741308464
Median: 0.5659614260608526
Mode: 0.2858978721410488
The EXT_SOURCE_2 feature is a normalized external risk score ranging between 0 and 1; higher values indicate lower credit risk. It has ~0.21% missing data, so we retain the column and impute the missing values with the median.
# Fill null values with the median of 'EXT_SOURCE_2'
application_data_update['EXT_SOURCE_2'] = application_data_update['EXT_SOURCE_2'].fillna(application_data_update['EXT_SOURCE_2'].median())
Column 'AMT_GOODS_PRICE' Description - The column represents the price of goods for which the loan was issued.
application_data_update['AMT_GOODS_PRICE'].head()
0     351000.0
1    1129500.0
2     135000.0
3     297000.0
4     513000.0
Name: AMT_GOODS_PRICE, dtype: float64
application_data_update['AMT_GOODS_PRICE'].value_counts()
AMT_GOODS_PRICE
450000.0 26022
225000.0 25282
675000.0 24962
900000.0 15416
270000.0 11428
...
1305706.5 1
2155500.0 1
113724.0 1
171468.0 1
559836.0 1
Name: count, Length: 1002, dtype: int64
# Here we use histogram to see distribution of data and frequency of data points.
plt.figure(figsize=(8, 5))
plt.title('Distribution of AMT_GOODS_PRICE')
sns.histplot(data=application_data_update, x='AMT_GOODS_PRICE', bins=35, kde=True, color='Salmon')
plt.xlabel('Goods Price')
plt.ylabel('Number of Clients')
plt.tight_layout()
plt.show()
Most values are concentrated in a few specific tiers, likely reflecting standard loan packages, so we impute the missing values with the median. This variable may provide insights when analyzed in relation to income and loan amount.
# Fill null values with the median
application_data_update['AMT_GOODS_PRICE'] = application_data_update['AMT_GOODS_PRICE'].fillna(application_data_update['AMT_GOODS_PRICE'].median())
Column 'AMT_ANNUITY' Description - The annuity (installment) amount required to repay the loan
application_data_update['AMT_ANNUITY'].value_counts().head(5)
AMT_ANNUITY
9000.0     6385
13500.0    5514
6750.0     2279
10125.0    2035
37800.0    1602
Name: count, dtype: int64
# Here we use histogram to see distribution of data and frequency of data points.
plt.figure(figsize=(8, 5))
plt.title('Distribution of AMT_ANNUITY')
sns.histplot(data=application_data_update, x='AMT_ANNUITY', bins=30, kde=True, color='Salmon')
plt.xlabel('Annuity Amount')
plt.tight_layout()
plt.show()
As the above histogram shows, the values cluster around a few common amounts, so we impute the missing values with the median.
# Fill null values with median of 'AMT_ANNUITY'
application_data_update['AMT_ANNUITY'].fillna(application_data_update['AMT_ANNUITY'].median(), inplace=True)
Column 'CNT_FAM_MEMBERS' Description - Number of family members the client has
application_data_update['CNT_FAM_MEMBERS'].value_counts().head(5)
CNT_FAM_MEMBERS 2.0 158357 1.0 67847 3.0 52601 4.0 24697 5.0 3478 Name: count, dtype: int64
# Use a countplot to see the count of each family-size category
plt.figure(figsize=(8,5))
plt.title('Distribution of CNT_FAM_MEMBERS')
sns.countplot(data=application_data_update, x='CNT_FAM_MEMBERS', palette='Paired')
plt.xlabel('Count of Family Members')
plt.ylabel('Number of Applicants')
plt.xticks(rotation=90)
plt.show()
Most applicants have 2.0 family members, and the column's missing-value percentage is only 0.000650, which is very low. So we impute the missing values with the median.
# Fill null values with median
application_data_update['CNT_FAM_MEMBERS'].fillna(application_data_update['CNT_FAM_MEMBERS'].median(), inplace=True)
Column 'DAYS_LAST_PHONE_CHANGE' Description - How many days before the application the client changed their phone number.
application_data_update['DAYS_LAST_PHONE_CHANGE'].value_counts().head(10)
DAYS_LAST_PHONE_CHANGE 0.0 37672 -1.0 2812 -2.0 2318 -3.0 1763 -4.0 1285 -5.0 824 -6.0 537 -7.0 442 -8.0 278 -476.0 222 Name: count, dtype: int64
# Convert the values from days to years
application_data_update['DAYS_LAST_PHONE_CHANGE(In Year)'] = -round(application_data_update['DAYS_LAST_PHONE_CHANGE'] / 365, 0)
# Replace -0.0 to 0.0
application_data_update['DAYS_LAST_PHONE_CHANGE(In Year)'].replace(-0.0, 0.0, inplace=True)
application_data_update['DAYS_LAST_PHONE_CHANGE(In Year)'].value_counts()
DAYS_LAST_PHONE_CHANGE(In Year) 0.0 64176 1.0 58998 2.0 48141 4.0 34677 3.0 32321 5.0 31080 6.0 16620 7.0 12631 8.0 5300 9.0 2986 10.0 509 11.0 70 12.0 1 Name: count, dtype: int64
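The conversion logic above can be sanity-checked on a few toy day counts (invented values, not from the dataset): the DAYS_* columns are zero or negative because they count days *before* the application date, so negating the rounded quotient yields a positive year count, and the replace step normalizes the `-0.0` artifact that negation produces.

```python
import pandas as pd

# Toy sketch of the days-to-years conversion used above
days = pd.Series([0.0, -476.0, -1.0, -730.0])
years = -round(days / 365, 0)          # negate to get positive year counts
years = years.replace(-0.0, 0.0)       # 0 days negates to -0.0; normalize it

print(years.tolist())  # [0.0, 1.0, 0.0, 2.0]
```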
# Here we use histogram to see distribution of data and frequency of data points.
plt.figure(figsize=(8,5))
plt.title('Distribution of DAYS_LAST_PHONE_CHANGE(In Year)')
sns.histplot(data=application_data_update, x='DAYS_LAST_PHONE_CHANGE(In Year)', bins=13, legend=False, color='Salmon')
plt.xlabel('Years')
plt.ylabel('Number of Applicants')
plt.tight_layout()
plt.show()
print(f"Median: {application_data_update['DAYS_LAST_PHONE_CHANGE(In Year)'].median()}")
print(f"Mode: {application_data_update['DAYS_LAST_PHONE_CHANGE(In Year)'].mode()[0]}")
Median: 2.0 Mode: 0.0
As the histogram shows, most of the data lies in the first three values (0.0, 1.0, 2.0). The percentage of missing values is 0.000325, which is negligible and will not noticeably affect our analysis, but as good practice we drop the rows with missing values.
application_data_update = application_data_update.dropna(subset=['DAYS_LAST_PHONE_CHANGE(In Year)'])
# check missing values percentage of dataset
(application_data_update.isnull().sum() / len(application_data_update) * 100).sort_values(ascending=False).head(20)
OCCUPATION_TYPE 31.345322 AMT_REQ_CREDIT_BUREAU_WEEK 13.501350 AMT_REQ_CREDIT_BUREAU_MON 13.501350 AMT_REQ_CREDIT_BUREAU_DAY 13.501350 AMT_REQ_CREDIT_BUREAU_QRT 13.501350 AMT_REQ_CREDIT_BUREAU_HOUR 13.501350 OBS_60_CNT_SOCIAL_CIRCLE 0.332022 OBS_30_CNT_SOCIAL_CIRCLE 0.332022 DEF_60_CNT_SOCIAL_CIRCLE 0.332022 DEF_30_CNT_SOCIAL_CIRCLE 0.332022 AMT_ANNUITY 0.000000 AMT_CREDIT 0.000000 AMT_INCOME_TOTAL 0.000000 CNT_CHILDREN 0.000000 FLAG_OWN_REALTY 0.000000 FLAG_OWN_CAR 0.000000 CODE_GENDER 0.000000 NAME_CONTRACT_TYPE 0.000000 TARGET 0.000000 SK_ID_CURR 0.000000 dtype: float64
# Drop the unnecessary columns that we will not analyse
application_data_update = application_data_update.drop(columns=[
'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON',
'OBS_30_CNT_SOCIAL_CIRCLE',
'DEF_60_CNT_SOCIAL_CIRCLE',
'DEF_30_CNT_SOCIAL_CIRCLE',
'OBS_60_CNT_SOCIAL_CIRCLE'
])
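As an alternative to the hand-written column list above, the same missing-percentage metric can drive an automated drop. This is a sketch on a hypothetical mini-frame (column names and the 50% threshold are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: drop any column whose missing share exceeds a threshold
df = pd.DataFrame({
    'keep_full': [1, 2, 3, 4],
    'keep_sparse': [1, np.nan, 3, 4],            # 25% missing -> keep
    'drop_sparse': [np.nan, np.nan, np.nan, 4],  # 75% missing -> drop
})
missing_pct = df.isnull().mean() * 100           # same metric as the check above
df = df.drop(columns=missing_pct[missing_pct > 50].index)

print(list(df.columns))  # ['keep_full', 'keep_sparse']
```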
Step 5 : Value Standardization¶
Description - In this step we transformed columns from days to years for better readability and interpretation. We also renamed the corresponding column headers to reflect the updated units.
columns_to_change = ['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH']
for column in columns_to_change:
application_data_update[column+'(In Year)'] = -round(application_data_update[column] / 365, 0) # convert days to years by dividing by 365 and store the result in a new column
# Drop the original day-based columns that are no longer relevant to our analysis
application_data_update = application_data_update.drop(columns=[
'DAYS_BIRTH',
'DAYS_EMPLOYED',
'DAYS_REGISTRATION',
'DAYS_ID_PUBLISH',
'DAYS_LAST_PHONE_CHANGE'
])
Step 6 : Univariate Analysis¶
We undertake univariate analysis on the dataset's features, based on their datatype:
For numerical/continuous data: We employ histograms to gain insight into the distribution of each feature. This allows us to understand the central tendency, spread, and shape of the dataset's distribution.
For categorical data: Bar plots are utilized to visualize the frequency of each category. This provides a clear representation of the prominence of each category within the respective feature.
By employing these visualization techniques, we're better positioned to understand the individual characteristics of each feature in the dataset.
Step 6.1 : Univariate Analysis of Numerical Variables¶
numeric_columns = application_data_update.select_dtypes(include=['number'])
columns = numeric_columns.columns
# Plot histograms in subplots
cols = 3
rows = int(np.ceil(len(columns) / cols))
fig, axes = plt.subplots(rows, cols, figsize=(16, rows * 3))
axes = axes.flatten()
for i, col in enumerate(columns[2:]): # skip SK_ID_CURR and TARGET
sns.histplot(numeric_columns[col], kde=True, bins=30, color='skyblue', ax=axes[i])
axes[i].set_title(f'Distribution of {col}')
axes[i].set_xlabel('')
axes[i].set_ylabel('Frequency')
# Remove unused subplots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.suptitle("Histograms of Numeric Columns", fontsize=18, y=1.02)
plt.show()
Inferences:¶
CNT_CHILDREN: Most clients have no children, and the count drops sharply with 1 or more children. A few records with very high numbers such as 19 are likely outliers.
AMT_INCOME_TOTAL: Highly right-skewed distribution. Most applicants earn below 500,000, but extreme values reach 117 million, indicating possible outliers.
AMT_CREDIT: Also right-skewed. Common loan amounts appear around 270,000 to 600,000, but it extends to over 4 million.
AMT_ANNUITY: Most annuity amounts are in the range of 10,000 to 40,000, with a sharp peak. A long right tail hints at some high annuity obligations.
EXT_SOURCE_2: The distribution is fairly normal-shaped, indicating this normalized score is well-behaved.
EXT_SOURCE_3: Similar to EXT_SOURCE_2, this score shows a near-normal shape, though slightly more skewed.
DAYS_BIRTH (In Year): The majority of applicants are between 30 and 60 years old, with a slightly left-skewed distribution. This can be a strong feature: younger applicants may be riskier.
DAYS_EMPLOYED (In Year): Most valid employment durations are under 50 years, with the bulk between 0 and 10 years.
REGION_POPULATION_RELATIVE: Heavily right-skewed. Most applicants are from less populated regions. A few regions have much higher population densities.
CNT_FAM_MEMBERS: Most applicants belong to 2-3 member families. Very large family sizes (10+) are rare and may influence credit behavior.
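The "right-skewed" readings above can be backed with a number: pandas' `.skew()` returns a positive value for a long right tail. A quick sketch on an invented income-like series (not the project data):

```python
import pandas as pd

# Toy series mimicking a skewed income distribution: many moderate
# values plus a couple of extreme earners pulling the tail right
incomes = pd.Series([100_000] * 50 + [150_000] * 30 + [2_000_000] * 2)

print(round(incomes.skew(), 2))  # positive => right-skewed
```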
Step 6.2 : Univariate Analysis of Categorical Variables¶
categorical_columns = application_data_update.select_dtypes(include=['object'])
cat_columns = categorical_columns.columns
# With the help of countplot we analyse the categorical variables
# Number of rows and columns for subplots
cols = 2
rows = math.ceil(len(cat_columns) / cols)
# Plot setup with subplots
fig, axes = plt.subplots(rows, cols, figsize=(20, rows * 7))
axes = axes.flatten()
# Plot countplot
for i, col in enumerate(cat_columns):
ax = axes[i]
total = categorical_columns[col].notna().sum()
sns.countplot(data=categorical_columns, x=col, ax=ax, hue=col, legend=False)
ax.set_title(col, fontsize=10)
ax.tick_params(axis='x', rotation=90)
ax.set_xlabel('')
ax.set_ylabel('Count')
# Add % labels on top of bar
for p in ax.patches:
count = int(p.get_height())
percentage = 100 * count / total
ax.text(p.get_x() + p.get_width()/2., p.get_height() + 500,
f'{percentage:.1f}%', ha='center', fontsize=8)
# Remove unused axes
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.suptitle('Countplots of Categorical Variable', fontsize=18, y=1.02)
plt.show()
Inferences:¶
NAME_CONTRACT_TYPE: 90.5% of applicants have Cash loans and only 9.5% are Revolving loans. Cash loans dominate the dataset and may represent larger or one-time financing.
CODE_GENDER: 65.8% of applicants are female, and 34.2% are male. A rare category XNA (0.01%) is likely a placeholder or anomaly.
FLAG_OWN_REALTY: 69.4% own real estate, while 30.6% do not. This may suggest better creditworthiness or asset security among a majority
FLAG_OWN_CAR: 66% do not own a car, while 34% do. Car ownership might relate to financial stability
NAME_TYPE_SUITE: 81.2% of clients were unaccompanied, followed by 13.1% with family and others.
NAME_INCOME_TYPE: The majority are Working (51.6%), followed by Commercial associates (23.3%) and Pensioners (18%). Very few are unemployed or students, possibly filtered out due to low credit approval chances.
NAME_EDUCATION_TYPE: 71% have secondary education, followed by 24.3% with higher education. Few have lower secondary education, and fewer still hold academic degrees; education may influence income and, indirectly, default risk.
NAME_FAMILY_STATUS: Most applicants are married (~64%), followed by single (14.8%). Other statuses like widowed or civil marriage are rare. Marital status might correlate with loan behavior or stability.
NAME_HOUSING_TYPE: Dominated by those living in House / Apartment (88.7%). A few use With parents, Municipal apartment. It might indicate applicant stability or urban/rural status.
OCCUPATION_TYPE: Laborers (26.1%), Sales staff (15.2%), and Core staff (13.1%) are the most common, whereas IT staff, HR staff, Secretaries, and Realty agents are the least frequent.
Step 7 : Correlation¶
numeric_columns = application_data_update.select_dtypes(include=['number'])
columns = numeric_columns.columns
# Compute correlation with 'TARGET'
target_corr = numeric_columns.corr()['TARGET'].drop('TARGET') # Drop TARGET correlation with self
# Convert into DataFrame for heatmap
target_corr_df = pd.DataFrame(target_corr).sort_values(by='TARGET', ascending=False)
plt.figure(figsize=(6, len(target_corr_df) // 2))
sns.heatmap(target_corr_df, annot=True, cmap='coolwarm', linewidths=0.5, fmt=".2f")
plt.title("Correlation of Features with TARGET")
plt.tight_layout()
plt.show()
As the correlation heatmap above shows, columns such as FLAG_DOCUMENT_2 through FLAG_DOCUMENT_21 contribute little to our analysis and have near-zero correlation with the TARGET variable, so to simplify the dataset we drop these columns.
# Drop all FLAG_DOCUMENT_* columns to reduce complexity in the dataset
document_columns = [col for col in application_data_update.columns if col.startswith('FLAG_DOCUMENT')]
application_data_update.drop(columns=document_columns, inplace=True)
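The same drop could be automated by correlation strength instead of a name prefix. A sketch on synthetic data (the 0.1 cutoff and both column names are illustrative, not from the notebook): discard numeric columns whose absolute correlation with TARGET falls below the cutoff.

```python
import numpy as np
import pandas as pd

# Synthetic frame: one genuinely correlated column, one noise column
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({'TARGET': rng.integers(0, 2, n)})
df['signal'] = df['TARGET'] + rng.normal(0, 0.5, n)  # correlated with TARGET
df['noise'] = rng.normal(0, 1, n)                    # ~zero correlation

corr = df.corr()['TARGET'].drop('TARGET').abs()
weak_cols = corr[corr < 0.1].index.tolist()          # candidates to drop

print(weak_cols)  # only the noise column falls below the cutoff
```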
Step 8 : Outliers Treatment¶
numeric_columns = application_data_update.select_dtypes(include=['number'])
columns = numeric_columns.columns
desc = numeric_columns.describe().T
desc.head()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 307510.0 | 278181.038399 | 102789.938288 | 100002.0 | 189146.25 | 278202.5 | 367142.75 | 456255.0 |
| TARGET | 307510.0 | 0.080729 | 0.272419 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CNT_CHILDREN | 307510.0 | 0.417053 | 0.722122 | 0.0 | 0.00 | 0.0 | 1.00 | 19.0 |
| AMT_INCOME_TOTAL | 307510.0 | 168798.058473 | 237123.519274 | 25650.0 | 112500.00 | 147150.0 | 202500.00 | 117000000.0 |
| AMT_CREDIT | 307510.0 | 599027.041665 | 402491.016691 | 45000.0 | 270000.00 | 513531.0 | 808650.00 | 4050000.0 |
# Set the number of columns per row in the plot grid
cols_per_row = 3
total_plots = len(columns)
rows = total_plots // cols_per_row + int(total_plots % cols_per_row != 0)
# Create the subplots
plt.figure(figsize=(18, rows * 4))
for i, col in enumerate(numeric_columns):
plt.subplot(rows, cols_per_row, i + 1)
sns.boxplot(x=application_data_update[col], color='teal')
plt.xlabel('')
plt.title(col)
plt.tight_layout()
plt.suptitle("Boxplots for Numerical Columns", fontsize=18, y=1.02)
plt.show()
Inference¶
- CNT_CHILDREN
- Maximum count of children is 19, but 75% of applicants have at most 1 child
- Capping the outliers
- AMT_INCOME_TOTAL
- Max = 117,000,000, far above the median of 147,150
- AMT_ANNUITY
- Max = 258,025, much larger than the 75th percentile of 34,596
- FLAG_MOBIL
- Min, Max, and Mean are all 1, which means the column is constant.
# Handle outliers of CNT_CHILDREN column
application_data_update['CNT_CHILDREN'] = application_data_update['CNT_CHILDREN'].clip(upper=5) # It will replace values above 5 with 5
# Handle outliers of AMT_INCOME_TOTAL
cap = application_data_update['AMT_INCOME_TOTAL'].quantile(0.99) # cap outliers at the 99th percentile
application_data_update['AMT_INCOME_TOTAL'] = application_data_update['AMT_INCOME_TOTAL'].clip(upper=cap)
application_data_update['AMT_INCOME_TOTAL'].value_counts().sort_index(ascending=False)
AMT_INCOME_TOTAL
472500.0 3094
469800.0 1
468823.5 1
468000.0 10
466956.0 1
...
27000.0 66
26550.0 2
26460.0 1
26100.0 3
25650.0 2
Name: count, Length: 2370, dtype: int64
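The two capping styles used here, a fixed domain cap for counts and a percentile cap for skewed amounts, can be sanity-checked on toy numbers (invented, not from the dataset):

```python
import pandas as pd

# Fixed cap: values above 5 are replaced with 5
children = pd.Series([0, 1, 2, 7, 19])
children = children.clip(upper=5)

# Percentile cap: one extreme outlier is pulled down to the 99th percentile
income = pd.Series(list(range(1, 100)) + [10_000])
cap = income.quantile(0.99)
income = income.clip(upper=cap)

print(children.max(), income.max())  # capped maxima
```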
# Handle outliers of 'AMT_ANNUITY'
application_data_update['AMT_ANNUITY'] = np.log1p(application_data_update['AMT_ANNUITY']) # log1p compresses the long right tail, reducing the impact of extreme annuity values
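Two properties make log1p suitable here, shown on toy annuity values (invented): it is defined at 0 (unlike plain log), it is exactly invertible via `np.expm1`, and it sharply shrinks the spread between small and large values.

```python
import numpy as np

# Toy annuity amounts spanning a wide range, including zero
annuity = np.array([0.0, 9_000.0, 37_800.0, 258_025.0])
logged = np.log1p(annuity)

print(np.allclose(np.expm1(logged), annuity))            # round-trips exactly
print(logged[-1] / logged[1] < annuity[-1] / annuity[1]) # tail compressed
```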
# Drop the 'FLAG_MOBIL' column: it is constant, so it provides no insight for analysis
application_data_update.drop(columns='FLAG_MOBIL', inplace=True)
Step 9 : Merging Dataset¶
previous_application = pd.read_csv("C:/Users/ASUS/OneDrive/Desktop/Datasets/Credit Risk Analysis Dataset/previous_application.csv")
previous_train = application_data_update.merge(previous_application, on='SK_ID_CURR', how='inner')
previous_train.shape
(1413701, 79)
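The row count grows from ~307k applications to ~1.4M rows because SK_ID_CURR is unique in the application table but repeats in previous_application, so the inner merge emits one row per previous loan. A minimal sketch with toy frames (invented IDs and amounts); `validate='one_to_many'` makes the expected relationship explicit:

```python
import pandas as pd

# Toy one-to-many merge: client 1 has three previous loans, client 2 has one
apps = pd.DataFrame({'SK_ID_CURR': [1, 2], 'TARGET': [0, 1]})
prev = pd.DataFrame({'SK_ID_CURR': [1, 1, 1, 2],
                     'AMT_APPLICATION': [10, 20, 30, 40]})
merged = apps.merge(prev, on='SK_ID_CURR', how='inner', validate='one_to_many')

print(merged.shape)  # (4, 3): one row per previous loan
```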
previous_train.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE_x | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT_x | AMT_ANNUITY_x | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 10.114619 | ... | Auto technology | 24.0 | low_normal | POS other with interest | 365243.0 | -565.0 | 125.0 | -25.0 | -17.0 | 0.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 10.482892 | ... | XNA | 12.0 | low_normal | Cash X-Sell: low | 365243.0 | -716.0 | -386.0 | -536.0 | -527.0 | 1.0 |
| 2 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 10.482892 | ... | Furniture | 6.0 | middle | POS industry with interest | 365243.0 | -797.0 | -647.0 | -647.0 | -639.0 | 0.0 |
| 3 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 10.482892 | ... | Consumer electronics | 12.0 | middle | POS household with interest | 365243.0 | -2310.0 | -1980.0 | -1980.0 | -1976.0 | 1.0 |
| 4 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 8.817446 | ... | Connectivity | 4.0 | middle | POS mobile without interest | 365243.0 | -784.0 | -694.0 | -724.0 | -714.0 | 0.0 |
5 rows × 79 columns
# see the list of columns in 'previous_train'
previous_train.columns
Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE_x', 'CODE_GENDER',
'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
'AMT_CREDIT_x', 'AMT_ANNUITY_x', 'AMT_GOODS_PRICE_x',
'NAME_TYPE_SUITE_x', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE',
'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
'WEEKDAY_APPR_PROCESS_START_x', 'HOUR_APPR_PROCESS_START_x',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
'AMT_REQ_CREDIT_BUREAU_YEAR', 'DAYS_LAST_PHONE_CHANGE(In Year)',
'DAYS_BIRTH(In Year)', 'DAYS_EMPLOYED(In Year)',
'DAYS_REGISTRATION(In Year)', 'DAYS_ID_PUBLISH(In Year)', 'SK_ID_PREV',
'NAME_CONTRACT_TYPE_y', 'AMT_ANNUITY_y', 'AMT_APPLICATION',
'AMT_CREDIT_y', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE_y',
'WEEKDAY_APPR_PROCESS_START_y', 'HOUR_APPR_PROCESS_START_y',
'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'NAME_CASH_LOAN_PURPOSE',
'NAME_CONTRACT_STATUS', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
'CODE_REJECT_REASON', 'NAME_TYPE_SUITE_y', 'NAME_CLIENT_TYPE',
'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'],
dtype='object')
Step 10 : Bivariate Analysis¶
Step 10.1 : Categorical columns vs TARGET¶
# Categorical columns for analysis
categorical_cols = ['CODE_GENDER', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_HOUSING_TYPE', 'NAME_CONTRACT_STATUS',
'NAME_FAMILY_STATUS', 'OCCUPATION_TYPE']
# Grid setup
cols = 2
rows = math.ceil(len(categorical_cols) / cols)
fig, axes = plt.subplots(rows, cols, figsize=(14, rows * 5))
axes = axes.flatten()
# Plotting each subplot
for i, col in enumerate(categorical_cols):
sns.countplot(data=previous_train, x=col, hue='TARGET', ax=axes[i])
axes[i].set_title(f'{col} vs Loan Default')
axes[i].set_xlabel('')
axes[i].set_ylabel('Count')
axes[i].tick_params(axis='x', rotation=90)
axes[i].legend(title='TARGET')
# Hide extra axes if any
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.suptitle('Bivariate Analysis: Categorical Columns vs Loan Default', fontsize=16, y=1.02)
plt.show()
Inference:¶
CODE_GENDER: Female applicants make up the majority and also contribute more defaults in absolute terms; this is likely due to higher application volume rather than higher risk.
NAME_INCOME_TYPE: Applicants who are 'Working' contribute to the highest number of defaults, followed by 'Commercial associate' and 'Pensioner'. Risk assessment should consider not just income, but income stability. Commercial associates may include self-employed or irregular earners, which could explain higher risk. Pensioners, despite being retired, may represent a lower risk due to predictable income streams.
NAME_EDUCATION_TYPE: Most applicants have 'Secondary education', which also corresponds to most defaults. Higher education correlates with lower defaults, suggesting a link between education level and financial responsibility. We can use education as a soft risk indicator and possibly offer better terms to highly educated clients.
NAME_HOUSING_TYPE: 'House/apartment' is the dominant housing type among applicants. Housing type alone may not be a strong risk indicator, but those renting may indicate less asset security, which could slightly increase risk.
NAME_CONTRACT_STATUS: Most defaults occur under the 'Approved' status, as expected, since only granted loans can default. 'Refused' or 'Canceled' applications do not contribute to default risk.
NAME_FAMILY_STATUS: Married applicants are the largest group and also account for most defaults.
OCCUPATION_TYPE: Laborers, sales staff, and drivers show higher counts of defaults. These occupations may correlate with lower or unstable income, higher physical job risks, or lower financial literacy.
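Count plots show volume, not risk: a group can top the default *count* simply because it applies more often. Normalizing within each category gives the default *rate*, which is the better risk signal. A sketch on invented toy data (occupations and numbers are illustrative only):

```python
import pandas as pd

# Toy frame: Laborers dominate by volume, but their default RATE is lower
df = pd.DataFrame({
    'OCCUPATION_TYPE': ['Laborers'] * 80 + ['IT staff'] * 20,
    'TARGET': [1] * 8 + [0] * 72 + [1] * 4 + [0] * 16,
})
rates = df.groupby('OCCUPATION_TYPE')['TARGET'].mean().sort_values(ascending=False)

print(rates)  # IT staff: 0.20, Laborers: 0.10 despite fewer total defaults
```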
# Create function that create 4 graphs for bivariate analysis
def plotting(column, hue, data):
"""
Plots 4 visualizations for the relationship between two categorical columns:
- Pie chart of `column`
- Bar chart: % of hue categories within `column`
- Countplot of hue split by column
- Countplot of column split by hue
"""
col = column
# Create figure
fig = plt.figure(figsize=(14, 14))
# Subplot 1: Pie Chart of column
ax1 = plt.subplot(221)
data[col].value_counts().plot.pie(autopct="%1.0f%%", startangle=90, pctdistance=0.85, textprops={'fontsize': 8}, ax=ax1)
plt.title(f'Distribution of values for: {column}')
ax1.set_ylabel('') # Remove ylabel
# Subplot 2: % Distribution of hue within each column category
ax2 = plt.subplot(222)
dist = pd.crosstab(data[col], data[hue], normalize='index') * 100
dist.plot(kind='bar', stacked=True, ax=ax2)
plt.title(f'Percentage of {hue} within each {column}')
plt.xticks(rotation=90)
plt.xlabel('')
# Subplot 3: Countplot of hue per column category
ax3 = plt.subplot(223)
sns.countplot(data=data, x=col, hue=hue, ax=ax3)
plt.xticks(rotation=90)
plt.xlabel('')
plt.title(f'{hue} count per {col}')
# Subplot 4: Countplot of column per hue category
ax4 = plt.subplot(224)
sns.countplot(data=data, x=hue, hue=col, ax=ax4)
plt.xticks(rotation=90)
plt.xlabel('')
plt.title(f'{col} count per {hue}')
plt.tight_layout()
plt.show()
# Plot graph for 'CODE_GENDER' and 'NAME_CONTRACT_STATUS'
plotting('CODE_GENDER', 'NAME_CONTRACT_STATUS', data=previous_train)
Observation:
Female applicants account for the vast majority of loan applications, making up 68% of the total, compared to 32% for males. While females secure the highest number of approved loans in absolute terms, this is primarily due to their higher application volume. Interestingly, male applicants exhibit a slightly higher approval rate than females, who, in turn, experience a marginally higher rate of loan cancellations and refusals. The "Gender 'XNA'" category is negligible.
# Plot graph for 'NAME_INCOME_TYPE' and 'NAME_CONTRACT_STATUS'
plotting('NAME_INCOME_TYPE', 'NAME_CONTRACT_STATUS', data=previous_train)
Observation:
The majority of loan applicants (over 51%) are working professionals, followed by commercial associates (23%) and pensioners (19%). While working professionals account for the highest volume of approved loans, state servants demonstrate the most favorable approval rate relative to their application volume. Conversely, unemployed individuals and those on maternity leave face significantly higher rates of loan cancellation and rejection, despite being smaller applicant groups. Commercial associates and pensioners show a more balanced distribution of approvals and rejections, indicating varied risk profiles within these segments.
# Plot graph for 'NAME_EDUCATION_TYPE' and 'NAME_CONTRACT_STATUS'
plotting('NAME_EDUCATION_TYPE', 'NAME_CONTRACT_STATUS', data=previous_train)
Observation:
The majority of loan applicants (around 73%) have Secondary / secondary special education, making this group central to loan activity. While this segment accounts for most approved, canceled, and refused loans due to their sheer volume, applicants with Academic and Higher education demonstrate the highest approval rates, suggesting they are lower-risk. Conversely, individuals with Lower or Incomplete education experience more cancellations or rejections.
# Plot graph for 'NAME_HOUSING_TYPE' and 'NAME_CONTRACT_STATUS'
plotting('NAME_HOUSING_TYPE', 'NAME_CONTRACT_STATUS', data=previous_train)
Observation:
The overwhelming majority of loan applicants (nearly 89%) reside in a "House/apartment," indicating a strong prevalence of stable housing situations within the applicant pool. The next largest segments are individuals living with parents (4%) and renters (4%). Interestingly, loan approval rates remain relatively consistent across all housing types, suggesting that housing status alone does not significantly influence lending decisions. Due to their sheer volume, house/apartment residents naturally account for the highest absolute numbers of both loan approvals and refusals.
# Plot graph for 'NAME_FAMILY_STATUS' and 'NAME_CONTRACT_STATUS'
plotting('NAME_FAMILY_STATUS', 'NAME_CONTRACT_STATUS', data=previous_train)
Observation:
Married applicants constitute the largest demographic in loan applications, making up 64% of the dataset. Following them are single individuals (13%) and those in a civil marriage (10%). Loan approval rates are notably consistent across all family statuses, hovering around 60%. Due to their significant presence, married individuals account for the highest volume of both approved and refused loans.
# Plot graph for 'OCCUPATION_TYPE' and 'NAME_CONTRACT_STATUS'
plotting('OCCUPATION_TYPE', 'NAME_CONTRACT_STATUS', data=previous_train)
Observation:
Laborers form the largest occupational group among loan applicants at 26%, followed by sales staff (16%) and core staff (12%). Core staff, sales staff, and high-skill tech staff consistently exhibit the highest loan approval rates, all exceeding 60%. Conversely, laborers and drivers face slightly higher refusal rates compared to applicants in technical or managerial positions. Occupations with minimal representation, such as HR staff and secretaries, have a negligible impact on overall loan volume and approval trends.
Step 10.2 : Numerical columns vs TARGET¶
numeric_columns_prev = previous_train.select_dtypes(include=['number'])
num_columns = numeric_columns_prev.columns
num_columns
Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
'AMT_CREDIT_x', 'AMT_ANNUITY_x', 'AMT_GOODS_PRICE_x',
'REGION_POPULATION_RELATIVE', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS',
'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
'HOUR_APPR_PROCESS_START_x', 'REG_REGION_NOT_LIVE_REGION',
'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
'AMT_REQ_CREDIT_BUREAU_YEAR', 'DAYS_LAST_PHONE_CHANGE(In Year)',
'DAYS_BIRTH(In Year)', 'DAYS_EMPLOYED(In Year)',
'DAYS_REGISTRATION(In Year)', 'DAYS_ID_PUBLISH(In Year)', 'SK_ID_PREV',
'AMT_ANNUITY_y', 'AMT_APPLICATION', 'AMT_CREDIT_y', 'AMT_DOWN_PAYMENT',
'AMT_GOODS_PRICE_y', 'HOUR_APPR_PROCESS_START_y',
'NFLAG_LAST_APPL_IN_DAY', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'SELLERPLACE_AREA',
'CNT_PAYMENT', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION',
'NFLAG_INSURED_ON_APPROVAL'],
dtype='object')
# Define numerical columns and target column
numerical_cols = ['CNT_CHILDREN','AMT_INCOME_TOTAL','CNT_FAM_MEMBERS',
'AMT_CREDIT_x', 'DAYS_BIRTH(In Year)', 'EXT_SOURCE_3']
target_col = 'TARGET'
# Set up subplot dimensions
cols = 2
rows = -(-len(numerical_cols) // cols) # Ceiling division
fig, axes = plt.subplots(rows, cols, figsize=(14, rows * 4))
axes = axes.flatten()
# Plot boxplots
for i, col in enumerate(numerical_cols):
ax = axes[i]
sns.boxplot(data=previous_train, x=target_col, y=col, ax=ax)
ax.set_title(f'{col} vs {target_col}')
# ax.set_xlabel('Loan Default (TARGET)')
ax.set_xlabel('')
ax.set_ylabel('')
# Remove any unused subplot spaces
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.suptitle('Bivariate Analysis: Numerical Variables vs TARGET', fontsize=16, y=1.02)
plt.show()
Observation:
CNT_CHILDREN vs TARGET : Applicants with 0 children dominate both groups, but defaulters show slightly more variance. There is no strong relationship between number of children and loan default.
AMT_INCOME_TOTAL vs TARGET : Both defaulters and non-defaulters show similar income ranges, with high outliers in both. This indicates that income alone may not strongly differentiate loan default behavior.
CNT_FAM_MEMBERS vs TARGET : The median and IQR of family members are similar across both groups. This suggests that family size has little direct influence on default risk.
AMT_CREDIT vs TARGET : Defaulters have slightly lower median credit amounts than non-defaulters. Higher credit values are common in both groups, making this a weak discriminator.
DAYS_BIRTH (In Years) vs TARGET : Defaulters tend to be slightly younger on average than non-defaulters. This suggests that younger applicants may carry a higher risk of default.
EXT_SOURCE_3 vs TARGET : Non-defaulters have noticeably higher EXT_SOURCE_3 scores compared to defaulters. This makes EXT_SOURCE_3 a strong indicator of creditworthiness and default prediction.
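The boxplot reading for EXT_SOURCE_3 can be backed numerically: a feature that separates the classes shows clearly different per-TARGET medians. A sketch with invented scores mimicking that behavior (not the project data):

```python
import pandas as pd

# Toy scores: non-defaulters (TARGET=0) cluster high, defaulters low
df = pd.DataFrame({
    'TARGET': [0, 0, 0, 0, 1, 1, 1, 1],
    'EXT_SOURCE_3': [0.7, 0.65, 0.8, 0.75, 0.3, 0.4, 0.35, 0.25],
})
medians = df.groupby('TARGET')['EXT_SOURCE_3'].median()

print(medians.loc[0] > medians.loc[1])  # non-defaulters score higher
```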
Business Oriented Conclusions¶
After performing an in-depth exploratory data analysis of the loan applicants' profiles and repayment behavior, the following conclusions are drawn:
Age Is a Strong Predictor of Default
Younger applicants (ages 20–30) are significantly more likely to default compared to older applicants.
Possible Action: Implement stricter eligibility criteria (e.g., lower loan amounts or higher interest rates) for younger applicants to mitigate risk.
External Risk Source Scores Matter
EXT_SOURCE_3, a risk assessment feature, shows a clear inverse relationship with default. Lower values are highly indicative of higher risk.
Possible Action: Use this score as a primary factor in automated credit scoring models. Set threshold values below which applications are flagged for manual review.
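A hedged sketch of the proposed rule: flag applications whose external score falls below a review threshold. The 0.3 cutoff and the scores below are illustrative placeholders, not values derived from this analysis.

```python
import pandas as pd

# Toy EXT_SOURCE_3 scores; the threshold is a hypothetical business choice
scores = pd.Series([0.05, 0.28, 0.45, 0.81], name='EXT_SOURCE_3')
REVIEW_THRESHOLD = 0.3

flagged = scores < REVIEW_THRESHOLD  # low score => manual review
print(flagged.sum(), 'of', len(scores), 'applications flagged')  # 2 of 4
```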
Number of Children Correlates with Higher Default
Applicants with three or more children show increased default rates, potentially due to higher household financial burdens.
Possible Action: Adjust risk thresholds based on family size or combine this with income metrics to assess repayment capacity better.
Education Level Reflects Financial Discipline
Applicants with lower education levels (Secondary and below) are more prone to default.
Possible Action: Use education level as an auxiliary feature in scoring models. Applicants with higher education levels may be offered better loan terms.
Employment Type Affects Default Risk
Unemployed applicants, students, and working pensioners show a higher tendency to default.
Possible Action: Introduce tiered risk scoring or mandatory collateral/guarantor for these applicant categories.
Family Status Has an Impact
Single and separated applicants tend to have slightly higher default rates.
Possible Action: Include marital status in behavioral segmentation for personalized loan products and monitoring.
Income Type Should Be Evaluated Carefully
Applicants with irregular income types (e.g., maternity leave, commercial associates) show elevated risk.
Possible Action: Evaluate income stability alongside declared income. Consider requesting income proof for non-standard employment types.