Feature Analytics Tutorial

This guide is the best starting point when you want to understand which origination-time variables carry signal before you commit to a predictive model.

Use this workflow when

you have loan-level application or booking data
you want a ranked feature table before full model training
you need a quick first win when analyzing an early-performance flag or another binary flag built from actual performance

Do not start here if

you only have monthly aggregated rollforward data; use the Rollforward workflow instead
you already have a finished target and train/test design and want model metrics; go to the advanced ML Modeling Tutorial
you need reserve forecasting from a transition matrix; use Lifetime Loss Forecasting instead

Inputs

A meaningful first pass usually includes:

loan identifiers such as loan_id
booked-loan attributes such as principal, annual_rate, term_months, start_date, fico_score
a binary target or early-performance flag such as fpf30_flag
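
As a minimal sketch of the inputs listed above, the following builds a tiny loan-level frame and checks it for the expected columns before any analysis. The column names come from the list above; the helper missing_columns and the sample values are illustrative assumptions and are not part of cranalytics.

```python
import pandas as pd

# Column names taken from the inputs list; adjust to your own data contract.
REQUIRED_COLUMNS = [
    "loan_id", "principal", "annual_rate", "term_months",
    "start_date", "fico_score", "fpf30_flag",
]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required columns absent from the frame (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]


# Made-up sample rows, for illustration only.
loans = pd.DataFrame({
    "loan_id": ["A1", "A2", "A3"],
    "principal": [5000.0, 12000.0, 8000.0],
    "annual_rate": [0.18, 0.24, 0.21],
    "term_months": [24, 36, 24],
    "start_date": pd.to_datetime(["2025-01-15", "2025-02-01", "2025-11-20"]),
    "fico_score": [690, 640, 705],
    "fpf30_flag": [0.0, 1.0, None],  # None becomes NaN: outcome not yet observed
})

print(missing_columns(loans))                                # []
print(missing_columns(loans.drop(columns=["fico_score"])))   # ['fico_score']
```

Running a check like this first tends to surface naming mismatches before they turn into confusing errors further down the workflow.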

Optional external backend:

install optbinning if you want optimal WoE binning with fit_woe_binning()

Note

WoE binning is optional for a first pass. Start with ranked feature separation, then add a backend when you need production-style bins.
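
To show what WoE bins express before you install a backend, here is a hand-rolled sketch of weight of evidence (WoE) and information value (IV) for bins you have already chosen. It is not what fit_woe_binning() or optbinning do internally; the function woe_iv, the bin edges, and the counts are all made up for illustration, and the sign convention (non-event share over event share) is one common choice among several.

```python
import math


def woe_iv(bins):
    """Compute WoE per bin and the total IV.

    bins: list of (label, n_events, n_non_events) tuples.
    Assumes every bin holds at least one event and one non-event,
    otherwise the logarithm is undefined.
    """
    total_events = sum(events for _, events, _ in bins)
    total_non_events = sum(non_events for _, _, non_events in bins)
    rows, iv = [], 0.0
    for label, events, non_events in bins:
        event_share = events / total_events
        non_event_share = non_events / total_non_events
        woe = math.log(non_event_share / event_share)
        iv += (non_event_share - event_share) * woe
        rows.append((label, woe))
    return rows, iv


# Hypothetical FICO bins with made-up counts of flagged and unflagged loans.
fico_bins = [("<640", 30, 170), ("640-699", 20, 380), (">=700", 10, 390)]
rows, iv = woe_iv(fico_bins)
for label, woe in rows:
    print(f"{label:>8}  WoE={woe:+.3f}")
print(f"IV={iv:.3f}")
```

Under this convention a negative WoE marks a bin with more than its share of events. An optimal-binning backend adds the part this sketch skips: choosing the bin edges.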

See the full contract reference here: Input Data Contracts.

Code

import pandas as pd

from cranalytics.datasets import make_mock_fpf_data
from cranalytics.feature_analytics import (
    engineer_loan_features,
    lift_gain_table,
    rank_features_by_separation,
)

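# Mock loan data; loans that have not matured are expected to carry a NaN flag.
raw = make_mock_fpf_data(n_loans=1200, mature_pct=0.85, seed=42)
# Keep only loans whose target has been observed.
mature = raw.dropna(subset=["fpf30_flag"]).reset_index(drop=True)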

feature_frame = engineer_loan_features(
    mature,
    as_of_date=pd.Timestamp("2025-12-31"),
)

feature_cols = [
    "fico_score",
    "fico_normalized",
    "annual_rate",
    "dti",
    "loan_to_income",
]

# Rank the candidate features by how well they separate the flag.
ranking = rank_features_by_separation(
    feature_frame,
    feature_cols=feature_cols,
    flag_col="fpf30_flag",
)
print(ranking.head())

# Check whether an existing score (vendor_pd) rank-orders the same flag.
lift = lift_gain_table(
    y_true=feature_frame["fpf30_flag"],
    y_prob=feature_frame["vendor_pd"],
)
print(lift[["bin", "event_rate", "lift"]].head())
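
To make the event_rate and lift columns printed above concrete, the following sketch computes a basic lift table in plain Python. It is an assumption about the general technique, not the cranalytics implementation of lift_gain_table; the function simple_lift_table, its equal-count binning, and the sample scores are illustrative only.

```python
def simple_lift_table(y_true, y_prob, n_bins=5):
    """Bin observations by descending score and compare event rates.

    Assumes at least n_bins observations and at least one event overall.
    """
    # Sort (score, outcome) pairs so the riskiest-scored loans come first.
    pairs = sorted(zip(y_prob, y_true), key=lambda pair: pair[0], reverse=True)
    n = len(pairs)
    overall_rate = sum(outcome for _, outcome in pairs) / n
    rows = []
    for b in range(n_bins):
        # Integer arithmetic spreads any remainder across the bins.
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        event_rate = sum(outcome for _, outcome in chunk) / len(chunk)
        rows.append({
            "bin": b + 1,
            "count": len(chunk),
            "event_rate": event_rate,
            "lift": event_rate / overall_rate,  # 1.0 means no better than average
        })
    return rows


# Made-up scores and outcomes: higher score should mean higher risk.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
flags = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
table = simple_lift_table(flags, scores, n_bins=5)
for row in table:
    print(row)
```

A score that ranks risk sensibly shows lift well above 1 in the first bins and falling toward 0 in the last ones, as in this toy example.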

Expected output / first win

Your first win is a ranked feature table that tells you which variables are worth carrying forward into modeling.

You should expect:

a table ordered by separation strength (iv, gini, ks)
a quick lift/gain table that shows whether an existing score or proxy ranks risk sensibly
a short list of “promote”, “watch”, and “drop for now” candidates
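
One way to turn the ranked table into the short list described above is a simple threshold rule on IV. The sketch below assumes the ranking exposes a feature name and an iv value per row; the cut-offs follow a commonly quoted rule of thumb and are hypothetical defaults, not values prescribed by cranalytics.

```python
# Hypothetical cut-offs; tune them to your portfolio rather than treating them as fixed.
PROMOTE_IV = 0.10
WATCH_IV = 0.02


def triage(iv):
    """Map an information value to a shortlist label."""
    if iv >= PROMOTE_IV:
        return "promote"
    if iv >= WATCH_IV:
        return "watch"
    return "drop for now"


# Made-up ranking rows; in practice these would come from your ranked feature table.
ranking_rows = [
    {"feature": "fico_score", "iv": 0.31},
    {"feature": "annual_rate", "iv": 0.07},
    {"feature": "loan_to_income", "iv": 0.01},
]

shortlist = {row["feature"]: triage(row["iv"]) for row in ranking_rows}
print(shortlist)
# {'fico_score': 'promote', 'annual_rate': 'watch', 'loan_to_income': 'drop for now'}
```

Treat the labels as a starting point for review. A feature with an unusually high IV deserves a leakage check before it is promoted.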

Common mistakes

Jumping straight to model training before checking whether any features have useful separation.
Ranking features on immature targets without first dropping the rows where the flag is NaN.
Treating WoE binning as required on day one; it is optional, not the first step.
Mixing post-outcome fields into feature engineering, which creates leakage.
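
The leakage mistake above can be guarded against with an explicit deny-list applied before ranking. This is a minimal sketch: the field names in POST_OUTCOME_FIELDS and the helper split_leaky are hypothetical, and you would replace them with the post-outcome columns that actually exist in your data.

```python
# Hypothetical names for fields that are only known after the outcome window.
POST_OUTCOME_FIELDS = {"days_past_due", "chargeoff_date", "recovery_amount"}


def split_leaky(feature_cols, post_outcome=POST_OUTCOME_FIELDS):
    """Separate candidate features into safe and leaky lists, preserving order."""
    safe = [col for col in feature_cols if col not in post_outcome]
    leaky = [col for col in feature_cols if col in post_outcome]
    return safe, leaky


candidates = ["fico_score", "annual_rate", "days_past_due", "dti"]
safe, leaky = split_leaky(candidates)
print("keep:", safe)    # keep: ['fico_score', 'annual_rate', 'dti']
print("drop:", leaky)   # drop: ['days_past_due']
```

A deny-list only catches the fields you already know about, so it complements a review of when each column becomes available and does not replace it.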

Next step

Run the packaged demo end-to-end with:

python -m cranalytics.examples.core_feature_analytics

If you want production-style feature transformation, add fit_woe_binning() with optbinning.
If you are ready to train and backtest a classifier, continue to the ML Modeling Tutorial.
For full API details, see the Predictive Modeling API Reference.