Survival Analysis Tutorial

Advanced / optional module. Start with Choose Your Path if you are not sure whether survival analysis is the right first workflow.

This tutorial demonstrates how to use the cranalytics.survival module to model loan survival and hazard rates.

Front door: the single deep entry point is survival.run() in cranalytics.survival._session. In one call it prepares the survival data, always fits Kaplan-Meier, and optionally runs cohort comparison, Cox PH with concordance, and Aalen-Johansen competing-risks incidence. It returns a result with .summary() and .plot(). Reach for the lower-level cranalytics.survival primitives (shown below) only when you need a narrower step.

Data Preparation

Survival analysis requires data in a specific "Time-to-Event" format, often represented as $(T, E)$:

- $T$ (Duration): The time elapsed until the event occurred or the observation ended.
- $E$ (Event): A binary indicator (1 if the event of interest occurred, 0 if the observation was censored).

Expected Input Structure

Your raw loan data typically starts as a snapshot or registry of loans with dates and statuses:

loan_id	issue_date	last_payment_date	loan_status	fico	ltv
L101	2023-01-01	2023-06-15	Charged Off	720	0.8
L102	2023-01-15	2024-01-10	Fully Paid	680	0.9
L103	2023-02-01	2024-05-20	Current	750	0.7

Converting to Survival Format

Use calculate_duration_and_event to transform raw dates and statuses into the $(T, E)$ format required for modeling.

from cranalytics.survival import calculate_duration_and_event
import pandas as pd

# Load your raw data
df = pd.read_csv('your_loan_data.csv')

# Prepare data
# This adds 'duration' and 'event' columns to your dataframe
df_survival = calculate_duration_and_event(
    df,
    start_date_col='issue_date',
    end_date_col='last_payment_date',
    status_col='loan_status',
    default_statuses=['Charged Off'],
    time_unit='M'  # 'M' for months, 'D' for days
)

# The resulting df_survival will contain:
# - duration: Float representing months on book
# - event: 1 for 'Charged Off', 0 for others (Censored)
print(df_survival[['duration', 'event']].head())
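To make the $(T, E)$ conversion concrete, here is a minimal standard-library sketch applied to the three sample loans from the table above. It is not the implementation of calculate_duration_and_event. The helper name to_duration_and_event and the month length of 365.25 / 12 days are assumptions made for illustration, and the library's 'M' conversion may differ.

```python
from datetime import date

# Assumption: one month = 365.25 / 12 days. The library may convert differently.
DAYS_PER_MONTH = 365.25 / 12

# Statuses counted as the event of interest (default).
DEFAULT_STATUSES = {"Charged Off"}

# The three sample rows: (loan_id, issue_date, last_payment_date, loan_status)
loans = [
    ("L101", date(2023, 1, 1), date(2023, 6, 15), "Charged Off"),
    ("L102", date(2023, 1, 15), date(2024, 1, 10), "Fully Paid"),
    ("L103", date(2023, 2, 1), date(2024, 5, 20), "Current"),
]


def to_duration_and_event(start, end, status):
    """Hypothetical helper: return (duration in months, event indicator)."""
    days = (end - start).days
    duration = days / DAYS_PER_MONTH
    event = 1 if status in DEFAULT_STATUSES else 0  # 0 means censored
    return duration, event


survival_rows = {
    loan_id: to_duration_and_event(start, end, status)
    for loan_id, start, end, status in loans
}

for loan_id, (duration, event) in survival_rows.items():
    print(f"{loan_id}: duration={duration:.2f} months, event={event}")
```

Only L101 is an event here. L102 and L103 are censored at their last observed date, which is why their durations still count toward the at-risk set.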

Kaplan-Meier Analysis (Univariate)

The Kaplan-Meier estimator is used to estimate the survival function $S(t) = P(T > t)$, which represents the probability that a loan has "survived" (not defaulted) beyond time $t$.
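The estimator itself is the product-limit formula $\hat{S}(t) = \prod_{t_i \le t} (1 - d_i / n_i)$, where $d_i$ is the number of defaults at time $t_i$ and $n_i$ is the number of loans still at risk just before $t_i$. The sketch below computes it by hand on a small made-up sample. The function name kaplan_meier and the data are illustrative assumptions, not part of cranalytics.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time with at least one event."""
    events_at = Counter(t for t, e in zip(durations, events) if e == 1)
    exits_at = Counter(durations)  # events and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits_at):
        d = events_at.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Everyone observed at t (event or censored) leaves after t.
        at_risk -= exits_at[t]
    return curve


# Made-up sample: months on book and event flags (1 = default, 0 = censored).
durations = [3, 5, 5, 8, 12, 12]
events = [1, 0, 1, 1, 0, 0]

km_curve = kaplan_meier(durations, events)
for t, s in km_curve:
    print(f"S({t}) = {s:.4f}")
```

Censored loans never reduce the survival estimate directly. They only shrink the at-risk count for later event times.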

from cranalytics.survival import (
    compare_kaplan_meier_cohorts,
    fit_kaplan_meier,
    plot_kaplan_meier_cohorts,
)

# Fit model
km = fit_kaplan_meier(df_survival, duration_col='duration', event_col='event')

# Plot the survival curve
km.plot_survival_function()

# Compare different cohorts (e.g., by risk grade).
# Assumes df_survival has a 'grade' column; the sample table above does not include one.
plot_kaplan_meier_cohorts(df_survival, group_col='grade')

# Statistical comparison:
# - 2 groups -> standard log-rank test
# - 3+ groups -> omnibus multivariate log-rank test
comparison = compare_kaplan_meier_cohorts(df_survival, group_col='grade')
print(f"KM cohort comparison p-value: {comparison.p_value:.4f}")
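For intuition about the two-group case, the following is a minimal sketch of the textbook log-rank test in pure Python. It compares observed and expected defaults in one group at each event time and refers the statistic to a chi-square distribution with one degree of freedom. The function logrank_two_groups and the sample data are assumptions for illustration, and compare_kaplan_meier_cohorts may compute its result differently.

```python
import math


def logrank_two_groups(durations, events, groups):
    """Textbook two-group log-rank test. Returns (chi-square statistic, p-value)."""
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two groups")
    first = labels[0]
    rows = list(zip(durations, events, groups))
    event_times = sorted({t for t, e, _ in rows if e == 1})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for d, _, _ in rows if d >= t)                    # at risk overall
        n_first = sum(1 for d, _, g in rows if d >= t and g == first)
        d_all = sum(1 for d, e, _ in rows if d == t and e == 1)     # events at t
        d_first = sum(1 for d, e, g in rows if d == t and e == 1 and g == first)
        observed_minus_expected += d_first - d_all * n_first / n
        if n > 1:
            # Hypergeometric variance, which also handles tied event times.
            share = n_first / n
            variance += d_all * share * (1 - share) * (n - d_all) / (n - 1)

    statistic = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up sample: grade "A" loans default early, grade "B" loans default late.
sample_durations = [1, 2, 3, 10, 11, 12]
sample_events = [1, 1, 1, 1, 1, 1]
sample_groups = ["A", "A", "A", "B", "B", "B"]

stat, p = logrank_two_groups(sample_durations, sample_events, sample_groups)
print(f"log-rank statistic={stat:.3f}, p-value={p:.4f}")
```

A small p-value indicates that the two survival curves differ by more than chance would explain. With three or more groups, the one-degree-of-freedom shortcut used here no longer applies.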

Cox Proportional Hazards (Multivariate)

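The Cox PH model quantifies how multiple variables (covariates) influence the hazard rate. It assumes that each covariate acts multiplicatively on a shared baseline hazard, $h(t \mid x) = h_0(t) \exp(\beta^\top x)$, and that this effect stays constant over time (the proportional hazards assumption).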

Note: Categorical variables should be one-hot encoded before fitting.

from cranalytics.survival import fit_cox_ph

# Select covariates and fit
cph = fit_cox_ph(
    df_survival,
    duration_col='duration',
    event_col='event',
    covariates=['fico', 'ltv']
)

# View coefficients and Hazard Ratios
print(cph.summary)

# Plot Hazard Ratios
cph.plot()
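When reading the fitted coefficients, recall that under proportional hazards a coefficient $\beta$ is a log hazard ratio, so a change of $\Delta$ in a covariate multiplies the hazard by $\exp(\beta \Delta)$. The sketch below illustrates the arithmetic. The coefficient values are hypothetical and chosen only for illustration. They are not fitted results from any dataset.

```python
import math

# Hypothetical log hazard ratios, NOT fitted values.
coefficients = {"fico": -0.01, "ltv": 2.0}


def hazard_ratio(coef, delta=1.0):
    """Multiplicative change in hazard for a `delta` change in one covariate."""
    return math.exp(coef * delta)


def relative_hazard(loan_a, loan_b, coefs):
    """Hazard of loan_a relative to loan_b; the baseline hazard cancels out."""
    log_ratio = sum(beta * (loan_a[name] - loan_b[name]) for name, beta in coefs.items())
    return math.exp(log_ratio)


# Effect of a 20-point FICO increase and of a 0.10 increase in LTV.
hr_fico_20 = hazard_ratio(coefficients["fico"], delta=20)
hr_ltv_10pts = hazard_ratio(coefficients["ltv"], delta=0.10)

# Compare two loans that differ in both covariates.
loan_a = {"fico": 680, "ltv": 0.9}
loan_b = {"fico": 720, "ltv": 0.8}
rel = relative_hazard(loan_a, loan_b, coefficients)

print(f"HR for +20 FICO: {hr_fico_20:.3f}")
print(f"HR for +0.10 LTV: {hr_ltv_10pts:.3f}")
print(f"Hazard of loan A relative to loan B: {rel:.3f}")
```

A hazard ratio below 1 means the covariate change lowers the default hazard, and a ratio above 1 means it raises it. The comparison holds at every time $t$ because the baseline hazard cancels.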

Competing Risk Analysis

In credit risk, censoring isn't always neutral. A loan can exit the portfolio because it defaulted (the adverse event) or because it was prepaid or matured (a benign exit). A loan that has prepaid can no longer default, so competing risk analysis treats these exits as distinct event types rather than as simple censoring.
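A competing-risks fit needs an event column with one code per exit type, whereas a binary default flag collapses every non-default exit into 0. One way to build such a column is sketched below. The mapping is an assumption: it treats 'Fully Paid' as the prepayment/maturity exit and every other non-default status as censored, which may not match your portfolio's status vocabulary. The helper encode_event is hypothetical.

```python
# Assumed status-to-code mapping: 0=Censored, 1=Default, 2=Prepayment/maturity.
EVENT_CODES = {"Charged Off": 1, "Fully Paid": 2}


def encode_event(loan_status):
    """Hypothetical helper: map a raw loan status to a competing-risks code."""
    return EVENT_CODES.get(loan_status, 0)  # unknown or active statuses are censored


# Statuses of the three sample loans L101, L102, L103.
statuses = ["Charged Off", "Fully Paid", "Current"]
event_codes = [encode_event(status) for status in statuses]
print(event_codes)
```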

from cranalytics.survival import fit_competing_risks

# Assume event codes: 0=Censored, 1=Default, 2=Prepayment
# (the binary 'event' column built earlier only holds 0/1, so recode it first)
ajf = fit_competing_risks(df_survival, duration_col='duration', event_col='event', event_of_interest=1)

# Plot Cumulative Incidence Function (CIF)
ajf.plot()
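The cumulative incidence being plotted can be illustrated with a hand-rolled Aalen-Johansen estimate: at each event time, the probability of exiting by a given cause is the all-cause survival just before that time multiplied by the cause-specific exit fraction, and these increments are summed. The sketch below is a pure-Python illustration on made-up data. The function cumulative_incidence is an assumption and is not how fit_competing_risks is necessarily implemented.

```python
def cumulative_incidence(durations, event_codes, cause):
    """Return [(t, CIF(t))] for `cause`; code 0 means censored."""
    at_risk = len(durations)
    survival_before = 1.0  # all-cause event-free probability just before t
    cif = 0.0
    curve = []
    for t in sorted(set(durations)):
        codes_at_t = [c for d, c in zip(durations, event_codes) if d == t]
        d_cause = sum(1 for c in codes_at_t if c == cause)
        d_any = sum(1 for c in codes_at_t if c != 0)
        if d_cause:
            cif += survival_before * d_cause / at_risk
            curve.append((t, cif))
        # Any event type lowers the all-cause survival used at later times.
        survival_before *= 1 - d_any / at_risk
        at_risk -= len(codes_at_t)
    return curve


# Made-up sample with codes 0=Censored, 1=Default, 2=Prepayment.
cr_durations = [2, 3, 4, 5, 6, 7]
cr_codes = [1, 2, 0, 1, 2, 0]

cif_default = cumulative_incidence(cr_durations, cr_codes, cause=1)
cif_prepay = cumulative_incidence(cr_durations, cr_codes, cause=2)
for t, value in cif_default:
    print(f"CIF_default({t}) = {value:.4f}")
```

Treating prepayments as ordinary censoring and reporting one minus the Kaplan-Meier estimate generally overstates default incidence, because it implicitly assumes prepaid loans could still default later.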

Run the packaged survival demo end-to-end with:

python -m cranalytics.examples.core_survival