
Survival Analysis Tutorial

Advanced / optional module. Start with Choose Your Path if you are not sure whether survival analysis is the right first workflow.

This tutorial demonstrates how to use the cranalytics.survival module to model loan survival and hazard rates.

Data Preparation

Survival analysis requires data in a specific "Time-to-Event" format, often represented as $(T, E)$:

- $T$ (Duration): The time elapsed until the event occurred or the observation ended.
- $E$ (Event): A binary indicator (1 if the event of interest occurred, 0 if the observation was censored).

Expected Input Structure

Your raw loan data typically starts as a snapshot or registry of loans with dates and statuses:

| loan_id | issue_date | last_payment_date | loan_status | fico | ltv |
|---|---|---|---|---|---|
| L101 | 2023-01-01 | 2023-06-15 | Charged Off | 720 | 0.8 |
| L102 | 2023-01-15 | 2024-01-10 | Fully Paid | 680 | 0.9 |
| L103 | 2023-02-01 | 2024-05-20 | Current | 750 | 0.7 |

Converting to Survival Format

Use calculate_duration_and_event to transform raw dates and statuses into the $(T, E)$ format required for modeling.

from cranalytics.survival import calculate_duration_and_event
import pandas as pd

# Load your raw data
df = pd.read_csv('your_loan_data.csv')

# Prepare data
# This adds 'duration' and 'event' columns to your dataframe
df_survival = calculate_duration_and_event(
    df,
    start_date_col='issue_date',
    end_date_col='last_payment_date',
    status_col='loan_status',
    default_statuses=['Charged Off'],
    time_unit='M'  # 'M' for months, 'D' for days
)

# The resulting df_survival will contain:
# - duration: Float representing months on book
# - event: 1 for 'Charged Off', 0 for others (Censored)
print(df_survival[['duration', 'event']].head())
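
To make the $(T, E)$ encoding concrete, here is a pandas-only sketch that derives both columns for the three sample rows shown earlier. It is not the implementation of calculate_duration_and_event; the month conversion factor (365.25 / 12 days) is an assumption, and the library may round or count months differently.

```python
import pandas as pd

# The three sample rows from the raw loan table
raw = pd.DataFrame({
    "loan_id": ["L101", "L102", "L103"],
    "issue_date": pd.to_datetime(["2023-01-01", "2023-01-15", "2023-02-01"]),
    "last_payment_date": pd.to_datetime(["2023-06-15", "2024-01-10", "2024-05-20"]),
    "loan_status": ["Charged Off", "Fully Paid", "Current"],
})

DAYS_PER_MONTH = 365.25 / 12  # assumption: average month length

# T: time between origination and the last observed date, in months
elapsed_days = (raw["last_payment_date"] - raw["issue_date"]).dt.days
raw["duration"] = elapsed_days / DAYS_PER_MONTH

# E: 1 only for statuses treated as default; every other status is censored
raw["event"] = raw["loan_status"].isin(["Charged Off"]).astype(int)

print(raw[["loan_id", "duration", "event"]].round(2))
```

Only L101 receives event = 1. L102 and L103 are both censored under this encoding, even though one was paid off and the other is still active.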

Kaplan-Meier Analysis (Univariate)

The Kaplan-Meier estimator is a non-parametric estimate of the survival function $S(t) = P(T > t)$, the probability that a loan has "survived" (not defaulted) beyond time $t$. Censored loans count toward the at-risk population until the time they leave observation.

from cranalytics.survival import (
    compare_kaplan_meier_cohorts,
    fit_kaplan_meier,
    plot_kaplan_meier_cohorts,
)

# Fit model
km = fit_kaplan_meier(df_survival, duration_col='duration', event_col='event')

# Plot the survival curve
km.plot_survival_function()

# Compare different cohorts (e.g., by risk grade).
# This assumes df_survival contains a 'grade' column; the sample table above does not show one.
plot_kaplan_meier_cohorts(df_survival, group_col='grade')

# Statistical comparison:
# - 2 groups -> standard log-rank test
# - 3+ groups -> omnibus multivariate log-rank test
comparison = compare_kaplan_meier_cohorts(df_survival, group_col='grade')
print(f"KM cohort comparison p-value: {comparison.p_value:.4f}")
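
For intuition about what the fitted curve represents, the product-limit formula $\hat{S}(t) = \prod_{t_i \le t} (1 - d_i / n_i)$ can be sketched in plain Python, where $d_i$ is the number of events at time $t_i$ and $n_i$ the number of loans still at risk just before it. The function name and toy data below are hypothetical and independent of fit_kaplan_meier.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimate)."""
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)  # all loans leaving the risk set at t (event or censored)
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Loans censored at t still count as at risk at t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Hypothetical toy data: months on book and default flags
durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.2f}")
```

The curve only steps down at times where a default occurs; censored loans shrink the risk set without moving the estimate.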

Cox Proportional Hazards (Multivariate)

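The Cox PH model allows you to quantify how multiple variables (covariates) influence the hazard rate. It assumes that each covariate scales a shared baseline hazard by a factor that is constant over time: $h(t \mid x) = h_0(t) \exp(\beta_1 x_1 + \dots + \beta_p x_p)$. The quantity $\exp(\beta_j)$ is the hazard ratio for a one-unit increase in $x_j$.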

Note: Categorical variables should be one-hot encoded before fitting.
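
One way to do the encoding is pandas.get_dummies. The sketch below uses a hypothetical frame with an assumed 'grade' column. Dropping one level per category avoids perfectly collinear dummies, which a Cox model cannot separate from the baseline hazard; the dropped level becomes the reference group.

```python
import pandas as pd

# Hypothetical data: 'grade' is an assumed categorical column
df = pd.DataFrame({
    "duration": [5.4, 11.8, 15.6, 7.0],
    "event": [1, 0, 0, 1],
    "fico": [720, 680, 750, 640],
    "grade": ["A", "B", "A", "C"],
})

# drop_first=True removes the first level ('A'), which becomes the reference group
encoded = pd.get_dummies(
    df, columns=["grade"], prefix="grade", drop_first=True, dtype=int
)

# Build the covariate list from the numeric column plus the new dummy columns
covariates = ["fico"] + [c for c in encoded.columns if c.startswith("grade_")]
print(covariates)
```

The resulting list could then be passed as the covariates argument, with each grade coefficient read as a hazard ratio relative to grade A.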

from cranalytics.survival import fit_cox_ph

# Select covariates and fit
cph = fit_cox_ph(
    df_survival,
    duration_col='duration',
    event_col='event',
    covariates=['fico', 'ltv']
)

# View coefficients and Hazard Ratios
print(cph.summary)

# Plot Hazard Ratios
cph.plot()
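
To read a coefficient as a hazard ratio, exponentiate it. The arithmetic below uses a made-up coefficient for 'fico' purely for illustration; substitute the value reported in your own model summary.

```python
import math

# Hypothetical coefficient for 'fico' (not a fitted result)
beta_fico = -0.01

# Hazard ratio for a 1-point increase in FICO
hr_per_point = math.exp(beta_fico)

# Effects are multiplicative, so a 50-point increase scales the hazard by exp(50 * beta)
hr_per_50_points = math.exp(beta_fico * 50)

print(f"HR per point: {hr_per_point:.4f}")
print(f"HR per 50 points: {hr_per_50_points:.4f}")
```

A hazard ratio below 1 means the covariate lowers the default hazard; with this illustrative coefficient, a 50-point higher score corresponds to roughly a 39% lower hazard at any given time.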

Competing Risk Analysis

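In credit risk, censoring is not always neutral. A loan might exit the portfolio because it defaulted (bad event) or because it was prepaid/matured (good event). A loan that has been paid off can never default, so treating it as merely censored tends to overstate the cumulative probability of default. Competing risk analysis treats these exits as distinct events rather than simple censoring.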

from cranalytics.survival import fit_competing_risks

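# Assume event codes: 0=Censored, 1=Default, 2=Prepayment
# Note: this needs a multi-valued event column. The binary 'event' column built in
# Data Preparation only distinguishes default (1) from everything else (0).
ajf = fit_competing_risks(df_survival, duration_col='duration', event_col='event', event_of_interest=1)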

# Plot Cumulative Incidence Function (CIF)
ajf.plot()
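
The call above assumes the event column already holds codes 0, 1, and 2. A minimal sketch of deriving such codes from loan_status follows; the status-to-code mapping is an assumption for illustration (in particular, treating 'Fully Paid' as the prepayment event), not something the library defines.

```python
import pandas as pd

# Statuses from the sample loan table
statuses = pd.Series(["Charged Off", "Fully Paid", "Current"], name="loan_status")

# Assumed mapping: 1 = default, 2 = paid off (prepaid or matured)
EVENT_CODES = {"Charged Off": 1, "Fully Paid": 2}

# Any status not in the mapping (e.g. 'Current') is still under observation: 0 = censored
event_code = statuses.map(EVENT_CODES).fillna(0).astype(int)

print(event_code.tolist())
```

Writing these codes into the column passed as event_col keeps active loans censored while separating the two ways a loan can leave the book.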

Run the packaged survival demo end-to-end with:

python -m cranalytics.examples.core_survival