ML Modeling Tutorial

This guide covers the advanced end-to-end predictive modeling path for binary loan-performance flags: target construction, temporal backtesting, training, and scoring.

Use this workflow when

you have a loan-month performance panel or a modeling frame with binary targets
you want temporal backtesting rather than random train/test splits
you want a backtest summary and a trained estimator as your first deliverable
you are ready to provide explicit feature, target, and temporal-split columns to cranalytics ml-modeling run (or use the equivalent Python API)

Do not start here if

you have not yet decided which variables are useful — start with Feature Analytics first
you only need a quick score-band or LGD diagnostic — use FICO Segmentation instead
you only have aggregated rollforward data — use the Rollforward workflow instead

Inputs

For target construction in build_targets(..., mode="panel"), the minimum panel shape is:

loan_id
mob
as_of_date
origination_date
original_principal
dpd
chargeoff_amount
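Before calling build_targets, it can help to confirm that the panel actually carries these columns. The following is a minimal sketch, not part of cranalytics: the helper name missing_panel_columns and the constant REQUIRED_PANEL_COLUMNS are hypothetical, and the check only looks at column names, not dtypes or values.

```python
from typing import Iterable, List

# Minimum panel columns listed above for build_targets(..., mode="panel").
REQUIRED_PANEL_COLUMNS = [
    "loan_id",
    "mob",
    "as_of_date",
    "origination_date",
    "original_principal",
    "dpd",
    "chargeoff_amount",
]


def missing_panel_columns(columns: Iterable[str]) -> List[str]:
    """Return the required panel columns absent from `columns`, in list order.

    Hypothetical helper for illustration; not a cranalytics function.
    """
    present = set(columns)
    return [name for name in REQUIRED_PANEL_COLUMNS if name not in present]


# Usage: pass df.columns for a pandas DataFrame, or any iterable of names.
example_columns = ["loan_id", "mob", "as_of_date", "dpd"]
missing = missing_panel_columns(example_columns)
print(missing)
```

An empty list means the frame meets the minimum shape; anything else names the columns to add before target construction.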

For backtesting and model training, you also need:

a feature matrix with feature_cols
a binary target such as fpf30_flag
a temporal split column such as origination_month (the example below uses origination_quarter)

See the full contract reference here: Input Data Contracts.

Tip

Use temporal splits for credit-performance data. Random train/test splits can leak portfolio and origination-period effects into evaluation.
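To make the idea concrete, here is a rough sketch of an expanding-window temporal split: each test fold is one origination period, and the training set is every period strictly before it. This is an illustration under assumptions, not a description of how cranalytics builds its folds; the function name expanding_window_folds is hypothetical, and period labels are assumed to sort chronologically as strings (for example "2024Q1" or "2024-03").

```python
from typing import List, Sequence, Tuple


def expanding_window_folds(
    periods: Sequence[str], n_splits: int
) -> List[Tuple[List[str], str]]:
    """Pair each of the last `n_splits` periods with all earlier periods.

    Hypothetical helper for illustration; not a cranalytics function.
    Period labels must sort chronologically as plain strings.
    """
    ordered = sorted(set(periods))
    if n_splits < 1 or n_splits >= len(ordered):
        raise ValueError("need at least n_splits + 1 distinct periods")
    folds = []
    for i in range(len(ordered) - n_splits, len(ordered)):
        # Train on everything strictly before the test period.
        folds.append((ordered[:i], ordered[i]))
    return folds


# Duplicate labels are fine: one label per loan is the typical input.
quarters = ["2024Q1", "2024Q2", "2024Q3", "2024Q4", "2025Q1", "2024Q1"]
folds = expanding_window_folds(quarters, n_splits=2)
for train_periods, test_period in folds:
    print(train_periods, "->", test_period)
```

Because no training period is ever later than its test period, origination-period effects cannot flow backward into the evaluation, which is the leak a random split allows.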

Code

import pandas as pd

from cranalytics import predictive
from cranalytics.datasets import make_mock_fpf_data
from cranalytics.feature_analytics import engineer_loan_features

# Build a mock loan-level dataset and keep only loans with an observed target.
raw = make_mock_fpf_data(n_loans=1500, mature_pct=0.9, seed=7)
mature = raw.dropna(subset=["fpf30_flag"]).reset_index(drop=True)

modeling_df = engineer_loan_features(
    mature,
    as_of_date=pd.Timestamp("2025-12-31"),
)
# Derive the temporal split column; pd.to_datetime guards against string dates.
modeling_df["origination_quarter"] = (
    pd.to_datetime(modeling_df["origination_date"]).dt.to_period("Q").astype(str)
)

feature_cols = [
    "fico_score",
    "annual_rate",
    "dti",
    "n_inquiries_6m",
    "loan_to_income",
]

# Run the backtest, train the final model, and score in one session call.
session = predictive.run(
    modeling_df,
    feature_cols=feature_cols,
    target_col="fpf30_flag",
    split_col="origination_quarter",
    model_family="logistic",
    n_splits=4,
    # Demo only: these five rows also appear in the training frame.
    scoring_df=modeling_df.head(5).copy(),
    score_output_col="fpf_prob",
)
print(session.backtest[["split", "auc", "gini", "ks"]])
print(session.training_metadata)
print(session.training_diagnostics)
print(session.scored_data[["loan_id", "fpf_prob"]])

Expected output / first win

Your first win is a predictive session result that already contains the temporal backtest table, training diagnostics, and scored probabilities.

You should expect:

one row per out-of-time fold with auc, gini, and ks
training metadata that records the target, model family, and feature list
scored probabilities you can bridge into downstream monitoring or forecasting
attribute access such as session.backtest and session.scored_data for the main workflow outputs
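For readers who want to sanity-check the backtest columns by hand, the sketch below computes the three metrics from labels and scores using their standard textbook definitions: AUC as the probability that a positive outscores a negative (ties count half), Gini as 2 * AUC - 1, and KS as the largest gap between the two classes' empirical score distributions. The function name auc_gini_ks is hypothetical, and cranalytics may use different conventions (for example, reporting KS on a 0 to 100 scale), so treat this as a reference calculation rather than the library's implementation.

```python
from typing import Sequence, Tuple


def auc_gini_ks(
    y_true: Sequence[int], y_score: Sequence[float]
) -> Tuple[float, float, float]:
    """Compute AUC, Gini, and KS for binary labels (0/1) and scores.

    Hypothetical helper for illustration; O(n^2), fine for small samples.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    # AUC: share of (positive, negative) pairs ranked correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))

    # Gini is a linear rescaling of AUC.
    gini = 2.0 * auc - 1.0

    # KS: maximum gap between the empirical score CDFs of the two classes.
    ks = max(
        abs(
            sum(p <= t for p in pos) / len(pos)
            - sum(n <= t for n in neg) / len(neg)
        )
        for t in set(y_score)
    )
    return auc, gini, ks


labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.20, 0.30, 0.40, 0.70, 0.90]
auc, gini, ks = auc_gini_ks(labels, scores)
print(round(auc, 4), round(gini, 4), round(ks, 4))
```

In this toy sample, 8 of the 9 positive-negative pairs are ranked correctly, so AUC is 8/9, Gini is 7/9, and KS is 2/3.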

Common mistakes

Using random splits instead of temporal splits for credit performance data.
Leaving NaN targets in the training set and expecting them to be modeled; drop them first, as the example does with dropna(subset=["fpf30_flag"]).
Comparing models before confirming target maturity and label definition.
Dropping to low-level train/score helpers before checking whether the session boundary already gives you the full workflow output you need.
Expecting calibrate=True in train_binary_model() to work today; calibration is still a separate follow-up step.

Next step

Run the packaged demo end-to-end with:

python -m cranalytics.examples.core_ml_modeling

Use summarize_predictive_backtest() if you want a roll-up view across multiple runs or model families.
Use train_binary_model() and score_model() directly when you only need a narrow training or scoring step instead of the full session boundary.
Use forecast_calendar_chargeoff_from_predictions() when you want to bridge loan-level scores into calendar-month forecasts.
For full API details, see the Predictive Modeling API Reference.