ML Modeling Tutorial
This guide covers the first end-to-end predictive modeling path for binary loan-performance flags: target construction, temporal backtesting, training, and scoring.
Use this workflow when
- you have a loan-month performance panel or a modeling frame with binary targets
- you want temporal backtesting rather than random train/test splits
- you need a backtest summary and a trained estimator as your first success
Do not start here if
- you have not yet decided which variables are useful — start with Feature Analytics first
- you only need a quick score-band or LGD diagnostic — use FICO Segmentation instead
- you only have aggregated rollforward data — use the Rollforward workflow instead
Inputs
For target construction in build_targets(..., mode="panel"), the minimum panel
shape is:
loan_id | mob | as_of_date | origination_date | original_principal | dpd | chargeoff_amount
For backtesting and model training, you also need:
- a feature matrix with feature_cols
- a binary target such as fpf30_flag
- a temporal split column such as origination_month
See the full contract reference here: Input Data Contracts.
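Before calling build_targets(..., mode="panel"), it can help to confirm that the panel carries the minimum columns listed above. The following is a minimal sketch, not part of cranalytics: the helper name missing_panel_columns and the constant REQUIRED_PANEL_COLUMNS are hypothetical, and the check covers column names only, not dtypes or value ranges.

```python
# Minimum columns named in the Inputs section for build_targets(..., mode="panel").
REQUIRED_PANEL_COLUMNS = (
    "loan_id",
    "mob",
    "as_of_date",
    "origination_date",
    "original_principal",
    "dpd",
    "chargeoff_amount",
)


def missing_panel_columns(columns):
    """Return the required panel columns absent from `columns`, in contract order.

    `columns` can be any iterable of names, for example a pandas `df.columns`.
    This helper is hypothetical and only checks names.
    """
    present = set(columns)
    return [name for name in REQUIRED_PANEL_COLUMNS if name not in present]


# Example: a frame that lacks the delinquency and charge-off columns.
example_columns = ["loan_id", "mob", "as_of_date", "origination_date", "original_principal"]
print(missing_panel_columns(example_columns))  # -> ['dpd', 'chargeoff_amount']
```

An empty list means the name-level contract is satisfied; the Input Data Contracts reference remains the authority on types and semantics.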
Code
import pandas as pd

from cranalytics import (
    engineer_loan_features,
    make_mock_fpf_data,
    run_predictive_modeling_session,
)

# Generate mock loans and keep only rows whose target has been observed.
raw = make_mock_fpf_data(n_loans=1500, mature_pct=0.9, seed=7)
mature = raw.dropna(subset=["fpf30_flag"]).reset_index(drop=True)

modeling_df = engineer_loan_features(
    mature,
    as_of_date=pd.Timestamp("2025-12-31"),
)

# Temporal split key: origination quarter as a string label.
modeling_df["origination_quarter"] = (
    modeling_df["origination_date"].dt.to_period("Q").astype(str)
)

feature_cols = [
    "fico_score",
    "annual_rate",
    "dti",
    "n_inquiries_6m",
    "loan_to_income",
]

# Backtest, train, and score in one session; the first five rows are scored for illustration.
session = run_predictive_modeling_session(
    modeling_df,
    feature_cols=feature_cols,
    target_col="fpf30_flag",
    split_col="origination_quarter",
    model_family="logistic",
    n_splits=4,
    scoring_df=modeling_df.head(5).copy(),
    score_output_col="fpf_prob",
)

print(session.backtest[["split", "auc", "gini", "ks"]])
print(session.training_metadata)
print(session.training_diagnostics)
print(session.scored_data[["loan_id", "fpf_prob"]])
Expected output / first win
Your first win is a predictive session result that already contains the temporal backtest table, training diagnostics, and scored probabilities.
You should expect:
- one row per out-of-time fold with auc, gini, and ks
- training metadata that records the target, model family, and feature list
- scored probabilities you can bridge into downstream monitoring or forecasting
- attribute access such as session.backtest and session.scored_data for the main workflow outputs
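To make the three backtest columns concrete, here is a small pure-Python sketch of the textbook definitions: AUC as the probability that a positive outscores a negative, Gini as 2 * AUC - 1, and KS as the largest gap between the score distributions of the two classes. It assumes cranalytics follows these standard definitions, which this guide does not state; the function name auc_gini_ks is hypothetical and the pairwise loop is meant for illustration, not for large data.

```python
def auc_gini_ks(y_true, y_score):
    """Compute AUC, Gini, and KS for binary labels (0/1) and scores.

    Illustrative only: quadratic in the number of rows.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    # AUC: share of (positive, negative) pairs ranked correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))

    # Gini is a linear rescaling of AUC onto [-1, 1].
    gini = 2.0 * auc - 1.0

    # KS: maximum absolute gap between the two empirical score CDFs.
    ks = 0.0
    for threshold in sorted(set(y_score)):
        cdf_pos = sum(s <= threshold for s in pos) / len(pos)
        cdf_neg = sum(s <= threshold for s in neg) / len(neg)
        ks = max(ks, abs(cdf_pos - cdf_neg))

    return auc, gini, ks


print(auc_gini_ks([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # -> (0.75, 0.5, 0.5)
```

Because Gini is derived from AUC, the two columns always rank folds identically; KS can disagree with them since it looks at a single best separation point.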
Common mistakes
- using random splits instead of temporal splits for credit performance data
- leaving NaN targets in the training set and expecting them to be modeled
- comparing models before confirming target maturity and label definition
- dropping to low-level train/score helpers before checking whether the session boundary already gives you the full workflow output you need
- expecting calibrate=True in train_binary_model() to work today; calibration is still a separate follow-up step
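The first two mistakes above can be illustrated without the library. The sketch below drops rows with a missing target and then builds expanding-window folds in which every training period precedes its test period. It is one common temporal scheme and an assumption here: this guide does not say how run_predictive_modeling_session forms its folds, and the helper expanding_window_folds and the toy rows are hypothetical.

```python
import math


def expanding_window_folds(periods, n_splits):
    """Return (train_periods, test_period) pairs in time order.

    Each fold trains on every period strictly before its test period,
    so no future information leaks into training.
    Periods must sort chronologically (for example "2024Q1" < "2024Q2").
    """
    ordered = sorted(set(periods))
    if n_splits < 1 or len(ordered) <= n_splits:
        raise ValueError("need more distinct periods than n_splits")
    first_test = len(ordered) - n_splits
    return [(ordered[:i], ordered[i]) for i in range(first_test, len(ordered))]


# Toy rows: the newest loan has no observed target yet.
rows = [
    {"loan_id": 1, "origination_quarter": "2024Q1", "fpf30_flag": 0.0},
    {"loan_id": 2, "origination_quarter": "2024Q2", "fpf30_flag": 1.0},
    {"loan_id": 3, "origination_quarter": "2024Q3", "fpf30_flag": 0.0},
    {"loan_id": 4, "origination_quarter": "2024Q4", "fpf30_flag": 0.0},
    {"loan_id": 5, "origination_quarter": "2025Q1", "fpf30_flag": float("nan")},
]

# Drop immature rows first so they neither train nor define a test fold.
mature_rows = [r for r in rows if not math.isnan(r["fpf30_flag"])]

folds = expanding_window_folds(
    (r["origination_quarter"] for r in mature_rows), n_splits=2
)
for train_periods, test_period in folds:
    print(train_periods, "->", test_period)
```

A random split would mix later vintages into training for earlier test rows, which tends to overstate out-of-time performance on credit data.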
Next step
Run the packaged demo end-to-end with:
python -m cranalytics.examples.core_ml_modeling
- Use summarize_predictive_backtest() if you want a roll-up view across multiple runs or model families.
- Use train_binary_model() and score_model() directly when you only need a narrow training or scoring step instead of the full session boundary.
- Use forecast_calendar_chargeoff_from_predictions() when you want to bridge loan-level scores into calendar-month forecasts.
- For full API details, see the Predictive Modeling API Reference.