Survival Analysis Tutorial
Advanced / optional module. Start with Choose Your Path if you are not sure whether survival analysis is the right first workflow.
This tutorial demonstrates how to use the cranalytics.survival module to model loan survival and hazard rates.
Data Preparation
Survival analysis requires data in a specific "Time-to-Event" format, often represented as $(T, E)$:
- $T$ (Duration): The time elapsed until the event occurred or the observation ended.
- $E$ (Event): A binary indicator (1 if the event of interest occurred, 0 if the observation was censored).
Expected Input Structure
Your raw loan data typically starts as a snapshot or registry of loans with dates and statuses:
| loan_id | issue_date | last_payment_date | loan_status | fico | ltv |
|---|---|---|---|---|---|
| L101 | 2023-01-01 | 2023-06-15 | Charged Off | 720 | 0.8 |
| L102 | 2023-01-15 | 2024-01-10 | Fully Paid | 680 | 0.9 |
| L103 | 2023-02-01 | 2024-05-20 | Current | 750 | 0.7 |
Converting to Survival Format
Use calculate_duration_and_event to transform raw dates and statuses into the $(T, E)$ format required for modeling.
from cranalytics.survival import calculate_duration_and_event
import pandas as pd
# Load your raw data
df = pd.read_csv('your_loan_data.csv')
# Prepare data
# This adds 'duration' and 'event' columns to your dataframe
df_survival = calculate_duration_and_event(
    df,
    start_date_col='issue_date',
    end_date_col='last_payment_date',
    status_col='loan_status',
    default_statuses=['Charged Off'],
    time_unit='M',  # 'M' for months, 'D' for days
)
# The resulting df_survival will contain:
# - duration: Float representing months on book
# - event: 1 for 'Charged Off', 0 for others (Censored)
print(df_survival[['duration', 'event']].head())
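To make the transformation concrete, the sketch below reproduces the idea on the three sample loans using only the standard library. It is not the cranalytics implementation: the helper `to_duration_and_event` and the constant `DAYS_PER_MONTH` (an average month of 30.4375 days) are assumptions for illustration, and the library may convert days to months differently.

```python
from datetime import date

# Assumption: average month length used to convert days to months.
DAYS_PER_MONTH = 30.4375


def to_duration_and_event(issue, last_obs, status, default_statuses=("Charged Off",)):
    """Return (duration_in_months, event) for one loan (hypothetical helper)."""
    days = (last_obs - issue).days
    # event = 1 only for default statuses; everything else is censored (0)
    event = 1 if status in default_statuses else 0
    return days / DAYS_PER_MONTH, event


# The three rows from the sample table above
loans = [
    ("L101", date(2023, 1, 1), date(2023, 6, 15), "Charged Off"),
    ("L102", date(2023, 1, 15), date(2024, 1, 10), "Fully Paid"),
    ("L103", date(2023, 2, 1), date(2024, 5, 20), "Current"),
]

survival_rows = {
    loan_id: to_duration_and_event(issue, last_obs, status)
    for loan_id, issue, last_obs, status in loans
}

for loan_id, (duration, event) in survival_rows.items():
    print(f"{loan_id}: duration={duration:.2f} months, event={event}")
```

Only L101 is an event here. L102 and L103 are censored: their durations still count as time at risk, but no default was observed.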
Kaplan-Meier Analysis (Univariate)
The Kaplan-Meier estimator is used to estimate the survival function $S(t) = P(T > t)$, which represents the probability that a loan has "survived" (not defaulted) beyond time $t$.
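For intuition, the estimator is the product-limit formula $\hat{S}(t) = \prod_{t_i \le t} (1 - d_i / n_i)$, where $d_i$ is the number of events at time $t_i$ and $n_i$ the number of loans still at risk just before $t_i$. The following is a minimal standard-library sketch of that formula on made-up data. The function name `kaplan_meier` is hypothetical and independent of how cranalytics computes the curve.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator)."""
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)  # events and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Multiply by the conditional probability of surviving past t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Everyone whose duration equals t leaves the risk set after t
        at_risk -= exits[t]
    return curve


# Made-up toy data: durations in months, 1 = default, 0 = censored
durations = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]

km_curve = kaplan_meier(durations, events)
for t, s in km_curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Censored loans never trigger a drop in the curve, but they shrink the risk set, which makes later drops larger.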
from cranalytics.survival import (
    compare_kaplan_meier_cohorts,
    fit_kaplan_meier,
    plot_kaplan_meier_cohorts,
)
# Fit model
km = fit_kaplan_meier(df_survival, duration_col='duration', event_col='event')
# Plot the survival curve
km.plot_survival_function()
# Compare different cohorts (e.g., by risk grade).
# This assumes df_survival has a 'grade' column; the sample table above does not include one.
plot_kaplan_meier_cohorts(df_survival, group_col='grade')
# Statistical comparison:
# - 2 groups -> standard log-rank test
# - 3+ groups -> omnibus multivariate log-rank test
comparison = compare_kaplan_meier_cohorts(df_survival, group_col='grade')
print(f"KM cohort comparison p-value: {comparison.p_value:.4f}")
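To show what the two-group log-rank comparison measures, here is a hand-rolled sketch using only the standard library. At each event time it compares the observed events in one group against the number expected if both groups shared the same hazard. The function `logrank_two_groups` and the data are hypothetical. It covers the two-group case only and is not the routine cranalytics calls.

```python
import math


def logrank_two_groups(durations_a, events_a, durations_b, events_b):
    """Return (chi2, p_value) for the two-sample log-rank test (1 degree of freedom)."""
    sample = [(t, e, 0) for t, e in zip(durations_a, events_a)]
    sample += [(t, e, 1) for t, e in zip(durations_b, events_b)]
    event_times = sorted({t for t, e, _ in sample if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in sample if u >= t and g == 0)  # at risk, group A
        n = sum(1 for u, _, _ in sample if u >= t)               # at risk, both groups
        d_a = sum(1 for u, e, g in sample if u == t and e == 1 and g == 0)
        d = sum(1 for u, e, _ in sample if u == t and e == 1)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group A event count at time t
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up cohorts: group A defaults early, group B defaults late
chi2, p_value = logrank_two_groups(
    [1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
    [6, 7, 8, 9, 10], [1, 1, 1, 1, 1],
)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

A small p-value suggests the survival curves differ by more than chance would explain. It does not say by how much or in which direction, so read it alongside the plotted curves.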
Cox Proportional Hazards (Multivariate)
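The Cox PH model quantifies how multiple variables (covariates) influence the hazard rate. It assumes that each covariate scales a shared baseline hazard by a factor that stays constant over time: $h(t \mid x) = h_0(t) \exp(\beta^\top x)$, where $h_0(t)$ is the baseline hazard and $\beta$ the coefficient vector.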
Note: Categorical variables should be one-hot encoded before fitting.
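One common way to do this is `pandas.get_dummies`. The sketch below uses a hypothetical toy frame with a `grade` column. Whether `fit_cox_ph` expects the reference level to be dropped is an assumption here, so check the library's own guidance.

```python
import pandas as pd

# Hypothetical toy frame; 'grade' stands in for any categorical covariate.
toy = pd.DataFrame(
    {"fico": [720, 680, 750, 700], "grade": ["A", "B", "C", "A"]}
)

# drop_first=True drops the reference level ('A') so the dummy columns are not
# perfectly collinear, which regression-style models generally require.
encoded = pd.get_dummies(toy, columns=["grade"], drop_first=True, dtype=int)
print(encoded)
```

The resulting dummy columns (`grade_B` and `grade_C` in this toy case) would then be listed alongside the numeric covariates. Each coefficient is read relative to the dropped level.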
from cranalytics.survival import fit_cox_ph
# Select covariates and fit
cph = fit_cox_ph(
    df_survival,
    duration_col='duration',
    event_col='event',
    covariates=['fico', 'ltv'],
)
# View coefficients and Hazard Ratios
print(cph.summary)
# Plot Hazard Ratios
cph.plot()
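When reading the summary, a coefficient $\beta$ translates into a hazard ratio of $\exp(\beta \cdot \Delta)$ for a change of $\Delta$ in the covariate. The sketch below works through that arithmetic with hypothetical coefficients. They are placeholders chosen for illustration, not output from a fitted model.

```python
import math

# Hypothetical coefficients, NOT results from a fitted model.
coefficients = {"fico": -0.01, "ltv": 2.0}


def hazard_ratio(beta, delta=1.0):
    """Multiplicative change in hazard when the covariate increases by `delta`."""
    return math.exp(beta * delta)


hr_fico_50 = hazard_ratio(coefficients["fico"], delta=50)     # +50 FICO points
hr_ltv_10pts = hazard_ratio(coefficients["ltv"], delta=0.10)  # +0.10 in LTV

print(f"HR for +50 FICO: {hr_fico_50:.3f}")
print(f"HR for +0.10 LTV: {hr_ltv_10pts:.3f}")
```

A hazard ratio below 1 means the hazard falls as the covariate rises, and a ratio above 1 means it increases. Scaling by a meaningful $\Delta$ (50 FICO points rather than 1) usually makes the effect easier to communicate.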
Competing Risk Analysis
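In credit risk, censoring is not always neutral. A loan might exit the portfolio because it defaulted (bad event) or because it was prepaid or matured (good event). Treating a prepaid loan as censored implicitly assumes it could still default later, which tends to overstate default probabilities. Competing risk analysis instead treats these exits as distinct events rather than simple censoring.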
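Distinct events need distinct codes. A minimal sketch of deriving them from `loan_status` follows. The mapping `STATUS_TO_EVENT` and the helper `to_event_code` are assumptions for illustration. In particular, treating "Fully Paid" as the prepayment/maturity event is a modeling choice you should confirm against your own status definitions.

```python
# Assumed mapping; adjust to the statuses present in your data.
# 0 = censored (still on book), 1 = default, 2 = prepayment/maturity
STATUS_TO_EVENT = {"Charged Off": 1, "Fully Paid": 2}


def to_event_code(status):
    """Map a loan status to an event code; unknown statuses count as censored."""
    return STATUS_TO_EVENT.get(status, 0)


statuses = ["Charged Off", "Fully Paid", "Current"]
event_codes = [to_event_code(s) for s in statuses]
print(event_codes)
```

On a DataFrame, the same mapping can be applied with `df['loan_status'].map(STATUS_TO_EVENT).fillna(0).astype(int)`.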
from cranalytics.survival import fit_competing_risks
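# Assumes the 'event' column holds multi-state codes: 0=Censored, 1=Default, 2=Prepayment.
# The binary 'event' column created earlier has no code 2, so recode it before this step.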
ajf = fit_competing_risks(df_survival, duration_col='duration', event_col='event', event_of_interest=1)
# Plot Cumulative Incidence Function (CIF)
ajf.plot()
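The cumulative incidence function for one cause accumulates, at each event time, the probability of having survived all causes up to that point multiplied by the cause-specific event fraction. The sketch below implements that nonparametric (Aalen-Johansen style) estimate with the standard library on made-up data. The function `cumulative_incidence` is hypothetical and separate from whatever `fit_competing_risks` does internally.

```python
from collections import Counter


def cumulative_incidence(durations, event_codes, event_of_interest):
    """Return [(t, CIF(t))] for one cause; code 0 means censored."""
    exits = Counter(durations)
    any_event = Counter(t for t, e in zip(durations, event_codes) if e != 0)
    cause = Counter(
        t for t, e in zip(durations, event_codes) if e == event_of_interest
    )
    at_risk = len(durations)
    overall_survival = 1.0  # all-cause survival just before the current time
    cif = 0.0
    curve = []
    for t in sorted(exits):
        if cause.get(t, 0) > 0:
            # P(survive all causes until t) * P(this cause at t | at risk)
            cif += overall_survival * cause[t] / at_risk
            curve.append((t, cif))
        if any_event.get(t, 0) > 0:
            overall_survival *= 1.0 - any_event[t] / at_risk
        at_risk -= exits[t]
    return curve


# Made-up toy data: 0 = censored, 1 = default, 2 = prepayment
durations = [1, 2, 3, 4]
codes = [1, 2, 1, 0]

cif_default = cumulative_incidence(durations, codes, event_of_interest=1)
cif_prepay = cumulative_incidence(durations, codes, event_of_interest=2)
print("default:", cif_default)
print("prepayment:", cif_prepay)
```

On this toy data the default incidence ends at 0.5. Treating the prepayment as censoring and taking one minus the Kaplan-Meier curve would instead give 0.625, which illustrates how ignoring competing exits inflates the apparent default probability.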
Run the packaged survival demo end-to-end with:
python -m cranalytics.examples.core_survival