Feb 10, 2025

Predictive Analysis of Credit Default Risk

Introduction

Credit risk analysis is a cornerstone of the financial industry. Banks and lending institutions need robust models to determine whether a prospective borrower is likely to repay a loan. In this blog post, we’ll walk through an end-to-end data analysis and modeling workflow, covering everything from initial data ingestion and cleaning to building a Deep Neural Network (DNN) in Python. Let’s explore how we transform raw loan data into actionable insights.

1) Data Overview

Dataset Source

Data: Processed Lending Club loan data (processed_df.csv) with various borrower attributes, loan characteristics, and repayment statuses.
Goal: Classify each loan into two main categories:
0 (Low Risk): Fully paid/current loans
1 (High Risk): Late payment, default, or other negative outcomes

Initial Checks

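import pandas as pd

df_loan = pd.read_csv("/kaggle/input/processed-dataset-for-cra/processed_df.csv")
df_loan.shape            # dataset dimensions (rows, columns)
df_loan.columns          # feature names
df_loan.isnull().sum()   # missing values per column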

We confirm the dataset dimensions and check for missing values.
We see features like loan_amnt, int_rate, dti, emp_length, and many more.

Note: A careful look at distributions, data types, and missing percentages helps decide which features to keep or drop.

2) Data Preprocessing & Feature Engineering

Transforming Categorical Features

Several columns contain strings (e.g., term, home_ownership, grade), so we convert them to numeric codes or strip out extra characters:

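# Map letter grades to ordinal codes (A = 0 ... G = 6)
df_loan['grade'] = df_loan['grade'].replace(['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                                            [0, 1, 2, 3, 4, 5, 6])
# Strip the grade letter from sub_grade and keep the digit (e.g. 'B3' -> 3.0)
df_loan['sub_grade'] = df_loan['sub_grade'].str.strip('ABCDEFG').astype('float64')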

We also standardize emp_length (e.g., < 1 year becomes 0, 10+ years becomes 11) and factorize other categorical fields (purpose, verification_status, etc.) to numeric.
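The emp_length standardization could be sketched roughly as follows. The helper name encode_emp_length is hypothetical. Only the two endpoint mappings ("< 1 year" to 0 and "10+ years" to 11) come from the text; mapping "3 years" to 3 and returning None for missing or unrecognized values are assumptions.

```python
import re

def encode_emp_length(value):
    """Map an emp_length string to a number (hypothetical helper)."""
    if not isinstance(value, str):
        return None                      # assumption: missing values are handled later
    text = value.strip()
    if text == "< 1 year":
        return 0                         # stated in the post
    if text == "10+ years":
        return 11                        # stated in the post
    match = re.fullmatch(r"(\d+) years?", text)
    if match:
        return int(match.group(1))       # assumption: "3 years" -> 3
    return None                          # unrecognized format such as "n/a"

examples = ["< 1 year", "1 year", "3 years", "10+ years", "n/a", None]
encoded = [encode_emp_length(v) for v in examples]
```

On a DataFrame, the same function could be applied with df_loan['emp_length'].map(encode_emp_length).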

Handling Date Columns

Columns like issue_d, last_pymnt_d, and next_pymnt_d contain date strings. We parse these into proper datetime objects and then convert them into numeric timestamps (integers) for modeling:

df_loan['last_pymnt_d'] = pd.to_datetime(df_loan['last_pymnt_d'])
# Cast explicitly to int64: integer nanoseconds since the Unix epoch
df_loan['last_pymnt_d_numeric'] = df_loan['last_pymnt_d'].astype('int64')

(Note: Casting a pandas datetime column to an integer yields nanoseconds since the Unix epoch. This is just one way to encode dates.)
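To make the integer encoding concrete, here is a standard-library sketch that reproduces the nanoseconds-since-epoch value for a single date and also converts it to whole days, a coarser scale that is sometimes easier to read. The function name to_epoch_ns and the example date are illustrative, not taken from the original notebook.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
NS_PER_DAY = 24 * 60 * 60 * 10**9

def to_epoch_ns(dt):
    """Nanoseconds between the Unix epoch and dt (dt is treated as UTC)."""
    delta = dt.replace(tzinfo=timezone.utc) - EPOCH
    # Integer arithmetic avoids floating-point rounding on large values
    return (delta.days * 86_400 + delta.seconds) * 10**9 + delta.microseconds * 1_000

ns = to_epoch_ns(datetime(2016, 1, 1))   # example date, chosen arbitrarily
days = ns // NS_PER_DAY                  # same instant expressed in days
```

Values on this scale are around 10^18, which is one reason the features are rescaled before training.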

Month/Year Extraction

We split the issue_d column into month and year tokens, which gives us two new features:

df_loan[['Issue Month', 'Issue Year']] = df_loan.issue_d.str.split("-", expand=True)
df_loan['Issue Year'] = df_loan['Issue Year'].astype('int32')

Then we map each month string to a numeric code (Jan → 1, Feb → 2, etc.).
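The month mapping might look like the sketch below. It assumes issue_d strings follow a "Mon-YYYY" pattern such as "Dec-2015", which is consistent with the split on "-" above; the names MONTH_TO_NUM and split_issue_date are ours.

```python
MONTH_TO_NUM = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

def split_issue_date(issue_d):
    """Split a 'Mon-YYYY' string into (month_number, year)."""
    month, year = issue_d.split("-")
    return MONTH_TO_NUM[month], int(year)

parsed = split_issue_date("Dec-2015")
```

On the DataFrame itself, df_loan['Issue Month'].map(MONTH_TO_NUM) would apply the same lookup to the whole column.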

3) Target Variable and Class Distribution

Binarizing the Loan Status

We drop the “Issued” loans (since they have no repayment history) and categorize the rest as:

0 → Current loans
1 → Late (31–120 days), Late (16–30 days), In Grace Period, or Default

label_categories = [
    (0, ['Current']),
    (1, ['Late (31-120 days)', 'Late (16-30 days)', 'In Grace Period', 'Default']),
]
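One way to turn these category lists into labels is to flatten them into a lookup table. The sketch below is an assumption about how the mapping could be applied, not the notebook's actual code; to_label is a hypothetical helper, and statuses missing from the table (such as "Issued") come back as None so they can be dropped.

```python
label_categories = [
    (0, ['Current']),
    (1, ['Late (31-120 days)', 'Late (16-30 days)', 'In Grace Period', 'Default']),
]

# Flatten the (label, statuses) pairs into a status -> label lookup
STATUS_TO_LABEL = {
    status: label
    for label, statuses in label_categories
    for status in statuses
}

def to_label(loan_status):
    """Return 0 or 1 for a known status, None for statuses that are excluded."""
    return STATUS_TO_LABEL.get(loan_status)

statuses = ['Current', 'Issued', 'Default', 'In Grace Period']
labels = [to_label(s) for s in statuses]
```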

We observe a highly imbalanced dataset: only ~3.61% of loans are high-risk. We’ll use class_weight during model training to help address this imbalance.

4) Model Building

Train-Test Split

from sklearn.model_selection import train_test_split

train, test = train_test_split(df_loan_issue, test_size=0.2,
                               stratify=df_loan_issue['label'],
                               random_state=42)
y_train = train.pop('label')
y_test = test.pop('label')
X_train = train
X_test = test

The stratify parameter keeps the label ratio consistent across the training and test sets.
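To illustrate what stratification buys us, here is a small pure-Python sketch of the idea: split each class separately at the same rate, then combine. This is not how scikit-learn implements it internally, and the toy labels (4% positives) are made up to resemble the imbalance in this dataset.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split at the same rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy data: 40 high-risk loans out of 1000
labels = [1] * 40 + [0] * 960
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
```

Both subsets end up with the same share of high-risk loans, whereas a purely random split of so few positives could leave the test set with noticeably more or fewer of them.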

Scaling

We apply MinMaxScaler to normalize numeric features between 0 and 1:

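from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max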
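The arithmetic behind min-max scaling is x' = (x - min) / (max - min), with min and max taken from the training data. The sketch below applies that formula by hand to a single made-up column; the function names and the interest-rate values are illustrative only.

```python
def fit_min_max(column):
    """Learn the min and max of a training column."""
    return min(column), max(column)

def min_max_transform(column, lo, hi):
    """Apply x' = (x - lo) / (hi - lo) using previously fitted bounds."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]     # constant column: map everything to 0
    return [(x - lo) / span for x in column]

train_int_rate = [5.0, 10.0, 15.0, 25.0]   # made-up interest rates
test_int_rate = [7.5, 30.0]                # 30.0 exceeds the training max

lo, hi = fit_min_max(train_int_rate)
train_scaled = min_max_transform(train_int_rate, lo, hi)
test_scaled = min_max_transform(test_int_rate, lo, hi)
```

Note that a test value beyond the training range lands outside [0, 1]. That is expected when the scaler is fit on the training set alone, which is the right choice for avoiding leakage from the test set.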

Deep Neural Network (Keras)

We build a small feed-forward neural network with dropout layers to reduce overfitting:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def make_model(metrics, size):
    model = Sequential([
        Dense(16, activation='relu', input_shape=(size,)),
        Dropout(0.5),
        Dense(8, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=metrics)
    return model

Key hyperparameters:

Layers: two hidden layers (16 and 8 units) followed by 1 output node
Activation: ReLU for hidden layers, Sigmoid for output
Dropout: 0.5 after each hidden layer to reduce overfitting
Loss: Binary Crossentropy
Metrics: Accuracy, Precision, Recall, AUC (ROC), and PRC (area under the precision-recall curve)
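To get a feel for how small this network is, we can count its trainable parameters by hand: each dense layer has inputs × units weights plus one bias per unit, and dropout layers add none. The number of input features is not stated in this post, so the value 30 below is purely illustrative.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

def count_params(n_features, layer_sizes=(16, 8, 1)):
    """Total trainable parameters for the 16 -> 8 -> 1 stack; Dropout adds none."""
    total, n_inputs = 0, n_features
    for n_units in layer_sizes:
        total += dense_params(n_inputs, n_units)
        n_inputs = n_units
    return total

total_for_30_features = count_params(30)   # 30 is an assumed feature count
```

With 30 inputs the layers contribute 496, 136, and 9 parameters respectively, so the model stays well under a thousand parameters.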

Class Weights

Because only ~3.61% of loans are in the risky class, we compute weights to make the model pay more attention to minority examples:

import numpy as np

neg, pos = np.bincount(df_loan_issue['label'])  # count each class
weight_for_0 = (1 / neg) * (neg + pos) / 2.0
weight_for_1 = (1 / pos) * (neg + pos) / 2.0
class_weight = {0: weight_for_0, 1: weight_for_1}
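A quick worked example shows what this formula produces. The counts below are invented to give a 3.61% positive rate on a round total of 10,000 loans; they are not the real dataset sizes. The function name balanced_class_weights is ours.

```python
def balanced_class_weights(neg, pos):
    """Algebraically the same formula as above: total / (2 * class_count)."""
    total = neg + pos
    return {0: total / (2.0 * neg), 1: total / (2.0 * pos)}

# Illustrative counts with a 3.61% positive rate
neg, pos = 9639, 361
weights = balanced_class_weights(neg, pos)

# Total weight carried by each class after weighting
weighted_neg = neg * weights[0]
weighted_pos = pos * weights[1]
```

Here the low-risk weight comes out near 0.52 and the high-risk weight near 13.85, so each high-risk example counts roughly 27 times as much as a low-risk one. After weighting, both classes carry the same total weight (half the dataset each), which is the point of the scheme.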

Training

We train using the KerasClassifier wrapper, with early stopping based on the precision-recall AUC metric:

EPOCHS = 100
BATCH_SIZE = 2048

model_base = make_model(metrics=METRICS, size=X_train_scaled.shape[-1])
classifier_base = KerasClassifier(model=model_base,
                                  epochs=EPOCHS,
                                  batch_size=BATCH_SIZE,
                                  validation_split=0.1,
                                  callbacks=[early_stopping],
                                  verbose=0)
classifier_base.fit(X_train_scaled, y_train, class_weight=class_weight)
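The early_stopping callback is referenced above without its definition. In Keras this would typically be a tf.keras.callbacks.EarlyStopping instance monitoring the validation PR-AUC in "max" mode, but the exact settings are not shown here. As a framework-free illustration of the underlying patience rule, consider the sketch below; the patience value and the validation scores are made up.

```python
def early_stopping_epoch(val_scores, patience=10):
    """Return (stop_epoch, best_epoch) for a metric where higher is better.

    Training stops once `patience` consecutive epochs fail to beat the best
    score seen so far. Epochs are 0-indexed.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_scores) - 1, best_epoch

# Made-up validation PR-AUC values, one per epoch
val_prc = [0.10, 0.18, 0.22, 0.21, 0.23, 0.22, 0.22, 0.21]
stop_epoch, best_epoch = early_stopping_epoch(val_prc, patience=3)
```

In this toy run the best score occurs at epoch 4 and training halts at epoch 7, after three epochs without improvement. With EPOCHS set to 100, a rule like this usually ends training much earlier.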

5) Evaluation

Confusion Matrix

We generate predictions and plot a confusion matrix to visualize performance:

# Probability of the high-risk class from the fitted wrapper (same call as in the ROC section)
prediction_base = classifier_base.predict_proba(X_test_scaled)[:, 1]

Predicted vs. Actual    0 (Good Loans)     1 (Bad Loans)
0                       True Negatives     False Negatives
1                       False Positives    True Positives

We also track classic metrics like accuracy, precision, recall, and F1.
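The four confusion-matrix cells and the metrics derived from them can be computed by hand as in the sketch below. The 0.5 decision threshold is an assumption (the post does not state one), and the labels and probabilities are toy values rather than model output.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TN, FP, FN, TP after thresholding predicted probabilities."""
    tn = fp = fn = tp = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def precision_recall_f1(fp, fn, tp):
    """Precision, recall, and F1, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels and probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.2, 0.7, 0.3, 0.4, 0.1, 0.9, 0.6, 0.2, 0.8]

tn, fp, fn, tp = confusion_counts(y_true, y_prob)
precision, recall, f1 = precision_recall_f1(fp, fn, tp)
```

Lowering the threshold trades precision for recall, which is often the preferred direction in credit risk, where a missed bad loan tends to cost more than a declined good one.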

ROC Curve & AUC

Using roc_curve and auc, we see how well our model distinguishes between classes at various thresholds:

from sklearn.metrics import roc_curve, auc

prediction_base = classifier_base.predict_proba(X_test_scaled)[:, 1]
fpr_base, tpr_base, _ = roc_curve(y_test, prediction_base)
roc_auc_base = auc(fpr_base, tpr_base)

A higher AUC indicates stronger separability between positive and negative classes.
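One intuitive reading of ROC AUC is the probability that a randomly chosen high-risk loan receives a higher score than a randomly chosen low-risk loan. The brute-force sketch below computes it that way on four toy examples. It is meant only to illustrate the definition; its cost grows with the product of the class sizes, so it is not a replacement for roc_curve and auc on real data.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: three of the four positive/negative pairs are ordered correctly
auc_value = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.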

6) Key Insights & Next Steps

Data Imbalance

The risky loans represent only ~3–4% of the dataset. Techniques like class weighting, SMOTE, or ensemble methods may be used to improve minority class performance.

Feature Engineering

Date fields and employment length can yield valuable signals. More domain-specific transformations (e.g., capturing the time between the earliest credit line and loan issue date) might further enhance performance.
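As a sketch of the credit-history idea, the snippet below measures the number of months between two "Mon-YYYY" strings. It assumes the earliest credit line is stored in the same format as issue_d (in the raw Lending Club data this column is usually called earliest_cr_line, but we have not confirmed that it exists in the processed file). The helper name and the example dates are made up.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def months_between(earlier, later):
    """Whole months from one 'Mon-YYYY' string to another."""
    m1, y1 = earlier.split("-")
    m2, y2 = later.split("-")
    return (int(y2) - int(y1)) * 12 + (MONTHS.index(m2) - MONTHS.index(m1))

# Hypothetical borrower: first credit line in Aug 2003, loan issued in Dec 2015
credit_history_months = months_between("Aug-2003", "Dec-2015")
```

A feature like this is easier for a model to use than two raw timestamps, because it directly expresses how long the borrower has had credit.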

Model Selection

A simple feed-forward neural network performed reasonably well, but we could experiment with other models (RandomForest, XGBoost) or deeper architectures.

Metric Choice

Accuracy alone can be misleading in imbalanced scenarios. Evaluating precision, recall, and especially AUC-PR is critical in credit risk tasks.
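A short calculation shows why. Suppose a "model" labels every loan as low risk. The counts below are invented to match the ~3.61% high-risk share reported earlier.

```python
# Illustrative counts with a 3.61% high-risk share
n_neg, n_pos = 9639, 361

# A trivial classifier that predicts "low risk" for every loan
tn, fp, fn, tp = n_neg, 0, n_pos, 0

accuracy = (tn + tp) / (n_neg + n_pos)
recall = tp / (tp + fn)
```

Accuracy comes out above 96% even though recall is zero: not a single risky loan is caught. Precision, recall, and AUC-PR expose this failure immediately.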

Production Readiness

Before deployment, we’d consider more rigorous hyperparameter searches, interpretability (e.g., LIME or SHAP), and real-time data streaming constraints.

Conclusion

This end-to-end workflow demonstrates how to tackle credit risk analysis in Python, from data cleaning and feature engineering to building a DNN with class weighting. While our final model does reasonably well at identifying risky borrowers, further refinements (like advanced feature engineering or ensemble methods) can boost performance. Either way, this pipeline offers a strong foundation for future improvements and real-world credit risk deployments.

Thanks for reading! If you have any questions or suggestions, feel free to leave a comment.
