Loan Approval Prediction with Python: A Step-by-Step Guide for INT303 Big Data Analysis
Learn how to build a machine learning pipeline for loan approval prediction using Python. This guide covers EDA, feature engineering, model selection, and evaluation with practical code examples.
Introduction
In the world of finance, accurate and efficient loan approval decisions are paramount. Banks and financial institutions rely on robust data analysis and predictive models to assess applicant creditworthiness, mitigate risks, and optimize their lending portfolios. This guide walks you through the essential steps of building a loan approval prediction model using Python, similar to the INT303 Big Data Analysis project. By the end, you'll be able to perform exploratory data analysis, preprocess data, engineer features, and evaluate multiple machine learning models.
Understanding the Dataset
The dataset contains applicant information such as number of dependents, education, self-employment status, annual income, loan amount, loan term, CIBIL score, and asset values. The target variable is loan_status with values 'Approved' or 'Rejected'. We'll use this data to predict whether a loan application should be approved.
Exploratory Data Analysis (EDA)
Loading and Initial Inspection
First, load the dataset using pandas and inspect the first few rows, data types, and missing values. Use df.info() and df.describe() to get an overview.
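import pandas as pd
# Load the dataset and take a first look at its rows, dtypes, and summary statistics
df = pd.read_csv('loan_approval_dataset.csv')
print(df.head())
df.info()
print(df.describe())
Univariate Analysis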
Analyze each feature individually. For numerical features like income_annum and cibil_score, create histograms and box plots to understand distributions and detect outliers. For categorical features like education and self_employed, use bar plots to see frequency counts.
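import matplotlib.pyplot as plt
import seaborn as sns
# Histogram: distribution of a numerical feature
sns.histplot(df['income_annum'], bins=30)
plt.show()
# Box plot: spread and potential outliers
sns.boxplot(x=df['cibil_score'])
plt.show()
# Bar plot: frequency counts of a categorical feature
df['education'].value_counts().plot(kind='bar')
plt.show()
Bivariate Analysis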
Explore relationships between features and the target variable. Use stacked bar plots for categorical features vs. loan_status, and box plots or violin plots for numerical features vs. loan_status. A heatmap of correlations can also reveal important relationships.
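# Stacked bar plot: categorical feature vs. target
pd.crosstab(df['education'], df['loan_status']).plot(kind='bar', stacked=True)
plt.show()
# Box plot: numerical feature vs. target
sns.boxplot(x='loan_status', y='cibil_score', data=df)
plt.show()
# Correlation heatmap; numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True)
plt.show()
Data Preprocessing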
Handling Missing Values
Check for missing values using df.isnull().sum(). Decide on a strategy: for numerical features, you might use median imputation; for categorical, use mode. Document your choices.
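# Assign the result back; fillna(..., inplace=True) on a selected column may not update df in recent pandas
df['column'] = df['column'].fillna(df['column'].median())
Outlier Treatment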
Use box plots or Z-scores to identify outliers. Consider capping or transformation, but be careful not to lose important information. For loan approval prediction, outliers in income or asset values might be legitimate.
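As a rough illustration of the capping option, the sketch below flags values outside the usual 1.5 x IQR fences (the same rule a box plot uses for its whiskers) and clips them. The income figures and the helper name iqr_bounds are hypothetical, chosen only for the example; whether capping is appropriate for real applicant data remains a judgment call.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical incomes: one value sits far above the rest
incomes = pd.Series([30, 35, 40, 42, 45, 50, 55, 60, 500], name="income_annum")

lower, upper = iqr_bounds(incomes)
is_outlier = (incomes < lower) | (incomes > upper)

# Capping keeps the row but limits the influence of the extreme value
capped = incomes.clip(lower=lower, upper=upper)

print(f"fences: [{lower}, {upper}]")
print("flagged:", incomes[is_outlier].tolist())
print("max after capping:", capped.max())
```

Flagging first and capping second makes it easy to inspect the flagged rows before deciding whether they are errors or legitimate high-income applicants.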
Feature Engineering
Create at least two new features that could improve model performance. For example:
- Debt-to-Income Ratio: loan_amount / income_annum
- Total Assets: sum of residential, commercial, luxury, and bank assets.
df['debt_to_income'] = df['loan_amount'] / df['income_annum']
df['total_assets'] = df['residential_assets_value'] + df['commercial_assets_value'] + df['luxury_assets_value'] + df['bank_asset_value']
Categorical Encoding
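Convert categorical features to numbers using label encoding or one-hot encoding. For a binary category such as education, label encoding is fine; for categories with more than two levels, one-hot encoding avoids implying an order between them. On a binary column, one-hot encoding with drop_first=True yields a single 0/1 column, so the two approaches are equivalent there.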
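from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Label-encode the binary column, then drop the original text column so only numeric data reaches the models
df['education_encoded'] = le.fit_transform(df['education'])
df = df.drop(columns=['education'])
# One-hot encode; drop_first=True keeps a single indicator column
df = pd.get_dummies(df, columns=['self_employed'], drop_first=True)
Feature Scaling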
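Scale numerical features to zero mean and unit variance with StandardScaler, or rescale them to a fixed range with MinMaxScaler. Scaling matters for distance- and margin-based models such as KNN and SVM; tree-based models such as Random Forest are largely insensitive to it.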
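One caveat worth sketching: if the scaler is fit on all rows before the train/test split, statistics from the test rows leak into training. A minimal illustration of the leakage-free order (split first, fit on the training rows, then transform both) follows. The data here is synthetic and merely stands in for two numerical columns; it is not drawn from the loan dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for two numerical columns (e.g. income and loan amount)
X = rng.normal(loc=[5_000_000, 15_000_000], scale=[2_000_000, 8_000_000], size=(200, 2))

X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_scaled.mean(axis=0).round(6))  # approximately zero by construction
print(X_test_scaled.mean(axis=0).round(3))   # near zero, but not exactly
```

The snippet that follows in this guide scales the full frame in one step, which is simpler but lets test rows influence the scaling statistics.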
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numerical_cols = ['income_annum', 'loan_amount', 'cibil_score', 'debt_to_income']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
Model Development and Evaluation
Data Splitting
Split the processed data into training (70%) and testing (30%) sets. Use train_test_split from sklearn.
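from sklearn.model_selection import train_test_split
# Keep only numeric/boolean predictors so leftover text columns do not reach the models
X = df.drop('loan_status', axis=1).select_dtypes(include=['number', 'bool'])
# Encode the target as 1 = Approved, 0 = Rejected (recent XGBoost versions expect numeric labels)
y = (df['loan_status'].str.strip() == 'Approved').astype(int)
# stratify=y keeps the class balance the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
Model Selection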
Choose at least three classification algorithms. Good candidates include Logistic Regression, Random Forest, and Gradient Boosting (e.g., XGBoost).
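from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
# A fixed random_state makes runs reproducible; max_iter is raised so the solver has room to converge
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42)
}
Hyperparameter Tuning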
Use GridSearchCV to find optimal hyperparameters. For example, for Random Forest, tune n_estimators and max_depth.
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None]
}
# 5-fold cross-validation over every combination in param_grid
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
Model Evaluation
Evaluate each tuned model on the test set using accuracy, precision, recall, F1-score, ROC AUC, and confusion matrix. Provide a comparative analysis.
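One possible way to produce the comparative analysis is to loop over the candidate models and collect the metrics in a single table. The sketch below uses synthetic data from make_classification as a stand-in for the processed loan data, and it swaps in scikit-learn's GradientBoostingClassifier for XGBoost so the example has no extra dependency. The untuned default models are an assumption made for brevity.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed loan data (1 = approved, 0 = rejected)
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

rows = []
for name, model in candidates.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    })

# One row per model, best ROC AUC first
results = pd.DataFrame(rows).set_index("model").sort_values("roc_auc", ascending=False)
print(results.round(3))
```

For a single fitted estimator, such as the tuned Random Forest from the grid search, the same metrics can be printed directly: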
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, confusion_matrix
y_pred = grid.predict(X_test)
# ROC AUC needs scores rather than hard labels, so use the predicted probability of the second class
y_score = grid.predict_proba(X_test)[:, 1]
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print('ROC AUC:', roc_auc_score(y_test, y_score))
print(confusion_matrix(y_test, y_pred))
Conclusion
Building a loan approval prediction model involves a systematic pipeline from data exploration to model evaluation. By following this guide, you can develop a robust model that helps financial institutions make informed decisions. Remember to document your process and justify your choices, as this is key to a successful project submission.
Further Reading
Explore topics like feature importance, model interpretability with SHAP, and handling imbalanced datasets to enhance your model. The skills you gain here are applicable to many real-world classification problems in finance and beyond.
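As a starting point for the feature importance topic, the sketch below uses scikit-learn's permutation_importance, which shuffles one column at a time and measures how much the test score drops. The data is synthetic and the feature names are illustrative only: the label is constructed to depend on the first column alone, so that column should come out on top.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names = ["cibil_score", "noise_a", "noise_b"]  # illustrative names only

# Synthetic data: the label depends on the first column alone
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle each column in turn and measure the drop in test accuracy
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

ranked = sorted(zip(feature_names, result.importances_mean), key=lambda pair: -pair[1])
for name, mean_drop in ranked:
    print(f"{name}: {mean_drop:.3f}")
```

Because the importance is measured on held-out rows, it reflects what the model actually relies on for prediction rather than what it memorized during training.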