ML Modeling Tutorial

This guide covers the first end-to-end predictive modeling path for binary loan-performance flags: target construction, temporal backtesting, training, and scoring.

Use this workflow when

  • you have a loan-month performance panel or a modeling frame with binary targets
  • you want temporal backtesting rather than random train/test splits
  • you need a backtest summary and a trained estimator as your first success

Do not start here if

  • you have not yet decided which variables are useful — start with Feature Analytics first
  • you only need a quick score-band or LGD diagnostic — use FICO Segmentation instead
  • you only have aggregated rollforward data — use the Rollforward workflow instead

Inputs

For target construction in build_targets(..., mode="panel"), the minimum panel shape is:

  • loan_id
  • mob
  • as_of_date
  • origination_date
  • original_principal
  • dpd
  • chargeoff_amount
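
Before calling build_targets(), it can help to confirm that the panel carries every column listed above. The following is a minimal sketch using only pandas. The helper name missing_panel_columns, the sample values, and the column dtypes are assumptions made for illustration and are not part of the cranalytics API.

```python
import pandas as pd

# Column names copied from the minimum panel shape listed above.
REQUIRED_PANEL_COLS = [
    "loan_id",
    "mob",
    "as_of_date",
    "origination_date",
    "original_principal",
    "dpd",
    "chargeoff_amount",
]


def missing_panel_columns(df: pd.DataFrame) -> list[str]:
    """Return required panel columns absent from df (hypothetical helper)."""
    return [col for col in REQUIRED_PANEL_COLS if col not in df.columns]


# Two illustrative loan-month rows for a single loan; all values are made up,
# and the dtypes shown here are an assumption, not a documented contract.
panel = pd.DataFrame(
    {
        "loan_id": ["L001", "L001"],
        "mob": [1, 2],
        "as_of_date": pd.to_datetime(["2025-02-28", "2025-03-31"]),
        "origination_date": pd.to_datetime(["2025-01-15", "2025-01-15"]),
        "original_principal": [10000.0, 10000.0],
        "dpd": [0, 30],
        "chargeoff_amount": [0.0, 0.0],
    }
)

print(missing_panel_columns(panel))  # prints []
```

A non-empty result names the columns to add before target construction. The authoritative dtypes and value rules are in the Input Data Contracts reference.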

For backtesting and model training, you also need:

  • a feature matrix with feature_cols
  • a binary target such as fpf30_flag
  • a temporal split column such as origination_month or origination_quarter
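
If the modeling frame has an origination_date but no split column yet, one way to derive a monthly vintage label and sanity-check the target is sketched below. It uses only pandas; the sample rows are made up, and storing the label as a string is an assumption that mirrors the quarter-based example in the Code section.

```python
import pandas as pd

# Illustrative modeling frame; values are made up.
frame = pd.DataFrame(
    {
        "loan_id": ["A", "B", "C"],
        "origination_date": pd.to_datetime(
            ["2025-01-15", "2025-01-31", "2025-04-02"]
        ),
        "fpf30_flag": [0.0, 1.0, None],  # None marks a loan that is not yet mature
    }
)

# Monthly vintage label such as "2025-01"; these strings sort chronologically.
frame["origination_month"] = (
    frame["origination_date"].dt.to_period("M").astype(str)
)

# Binary target check: every non-missing label must be 0 or 1.
labels = frame["fpf30_flag"].dropna()
assert labels.isin([0, 1]).all()

print(frame["origination_month"].tolist())  # ['2025-01', '2025-01', '2025-04']
```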

See the full contract reference here: Input Data Contracts.

Code

import pandas as pd

from cranalytics import (
    engineer_loan_features,
    make_mock_fpf_data,
    run_predictive_modeling_session,
)

raw = make_mock_fpf_data(n_loans=1500, mature_pct=0.9, seed=7)
mature = raw.dropna(subset=["fpf30_flag"]).reset_index(drop=True)

modeling_df = engineer_loan_features(
    mature,
    as_of_date=pd.Timestamp("2025-12-31"),
)
modeling_df["origination_quarter"] = modeling_df["origination_date"].dt.to_period("Q").astype(str)

feature_cols = [
    "fico_score",
    "annual_rate",
    "dti",
    "n_inquiries_6m",
    "loan_to_income",
]

session = run_predictive_modeling_session(
    modeling_df,
    feature_cols=feature_cols,
    target_col="fpf30_flag",
    split_col="origination_quarter",
    model_family="logistic",
    n_splits=4,
    scoring_df=modeling_df.head(5).copy(),
    score_output_col="fpf_prob",
)
print(session.backtest[["split", "auc", "gini", "ks"]])
print(session.training_metadata)
print(session.training_diagnostics)
print(session.scored_data[["loan_id", "fpf_prob"]])

Expected output / first win

Your first win is a predictive session result that already contains the temporal backtest table, training diagnostics, and scored probabilities.

You should expect:

  • one row per out-of-time fold with auc, gini, and ks
  • training metadata that records the target, model family, and feature list
  • scored probabilities you can bridge into downstream monitoring or forecasting
  • attribute access such as session.backtest and session.scored_data for the main workflow outputs
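
To make the backtest columns easier to read, the sketch below computes AUC, Gini, and KS by hand on six made-up scores, using standard textbook definitions in pure Python. Whether cranalytics uses exactly these conventions (for example Gini = 2 * AUC - 1) is an assumption; treat it as a reading aid, not as the library's implementation.

```python
def split_scores(y_true, y_score):
    """Separate scores into positives (label 1) and negatives (label 0)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    return pos, neg


def auc_pairwise(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos, neg = split_scores(y_true, y_score)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(y_true, y_score):
    """KS as the largest gap between the score CDFs of negatives and positives."""
    pos, neg = split_scores(y_true, y_score)
    best = 0.0
    for threshold in sorted(set(y_score)):
        cdf_pos = sum(s <= threshold for s in pos) / len(pos)
        cdf_neg = sum(s <= threshold for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]

auc = auc_pairwise(y_true, y_score)
gini = 2 * auc - 1  # common convention for binary classifiers
ks = ks_statistic(y_true, y_score)

print(round(auc, 3), round(gini, 3), round(ks, 3))  # 0.889 0.778 0.667
```

The pairwise AUC is quadratic in the number of rows, so it is only suitable for small illustrations like this one.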

Common mistakes

  • using random splits instead of temporal splits for credit performance data
  • leaving NaN targets in the training set and expecting them to be modeled; drop immature loans first, as the example does with dropna(subset=["fpf30_flag"])
  • comparing models before confirming target maturity and label definition
  • dropping to the low-level train/score helpers before checking whether the session result from run_predictive_modeling_session() already contains the workflow output you need
  • expecting calibrate=True in train_binary_model() to work today — calibration is still a separate follow-up step
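
The first two mistakes can be illustrated with a small expanding-window split in pandas: rows with missing targets are dropped first, and each test period comes strictly after every training period. The helper expanding_window_folds is hypothetical and the data is made up; this is a sketch of the general idea, not necessarily how run_predictive_modeling_session() builds its folds.

```python
import pandas as pd


def expanding_window_folds(df, split_col, n_splits):
    """Yield (train, test) frames; each test period follows all training periods."""
    periods = sorted(df[split_col].unique())
    if n_splits >= len(periods):
        raise ValueError("need more distinct periods than folds")
    for period in periods[-n_splits:]:
        train = df[df[split_col] < period]   # everything strictly earlier
        test = df[df[split_col] == period]   # the held-out period itself
        yield train, test


# Made-up frame; labels like "2024Q1" sort chronologically as strings.
df = pd.DataFrame(
    {
        "origination_quarter": [
            "2024Q1", "2024Q2", "2024Q3", "2024Q4", "2025Q1", "2025Q1",
        ],
        "fpf30_flag": [0.0, 1.0, 0.0, 1.0, 0.0, None],
    }
)

# Drop loans whose target is not yet observable before splitting.
labeled = df.dropna(subset=["fpf30_flag"])

folds = list(expanding_window_folds(labeled, "origination_quarter", n_splits=2))
for train, test in folds:
    print(len(train), len(test))
# prints:
# 3 1
# 4 1
```

A random split would mix later vintages into training, so the reported metrics would not reflect how the model performs on loans originated after the training window.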

Next step

Run the packaged demo end-to-end with:

python -m cranalytics.examples.core_ml_modeling

  • Use summarize_predictive_backtest() if you want a roll-up view across multiple runs or model families.
  • Use train_binary_model() and score_model() directly when you only need a narrow training or scoring step instead of the full session boundary.
  • Use forecast_calendar_chargeoff_from_predictions() when you want to bridge loan-level scores into calendar-month forecasts.
  • For full API details, see the Predictive Modeling API Reference.