Feature Analytics Tutorial

This guide is the best starting point when you want to understand which origination-time variables carry signal before you commit to a predictive model.

Use this workflow when

  • you have loan-level application or booking data
  • you want a ranked feature table before full model training
  • you need a quick first win for early-performance or FPF-style analysis

Do not start here if

  • you only have monthly aggregated rollforward data — use the Rollforward workflow instead
  • you already have a finished target, train/test design, and want model metrics — go to the ML Modeling tutorial
  • you need reserve forecasting from a transition matrix — use Lifetime Loss Forecasting instead

Inputs

A meaningful first pass usually includes:

  • loan identifiers such as loan_id
  • booked-loan attributes such as principal, annual_rate, term_months, start_date, fico_score
  • a binary target or early-performance flag such as fpf30_flag

Optional external backend:

  • install optbinning if you want optimal WoE binning with fit_woe_binning()
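
Because optbinning is optional, it can help to check for it before calling fit_woe_binning(). The sketch below uses only the standard library; the helper name backend_available is hypothetical and not part of cranalytics.

```python
from importlib.util import find_spec


def backend_available(module_name: str = "optbinning") -> bool:
    """Return True if the optional module can be imported in this environment."""
    # find_spec returns None for a top-level module that is not installed.
    return find_spec(module_name) is not None


if backend_available():
    print("optbinning found: WoE binning can be used")
else:
    print("optbinning not installed: skip WoE binning for the first pass")
```

Skipping the WoE step when the backend is missing keeps the first pass runnable on a minimal install.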

For the full list of required and optional columns, see Input Data Contracts.

Code

import pandas as pd

from cranalytics import (
    engineer_loan_features,
    lift_gain_table,
    make_mock_fpf_data,
    rank_features_by_separation,
)

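# Mock loan-level data. Immature loans have no observed outcome yet, so fpf30_flag is NaN.
raw = make_mock_fpf_data(n_loans=1200, mature_pct=0.85, seed=42)
# Keep only loans whose fpf30_flag has been observed.
mature = raw.dropna(subset=["fpf30_flag"]).reset_index(drop=True)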

feature_frame = engineer_loan_features(
    mature,
    as_of_date=pd.Timestamp("2025-12-31"),
)

feature_cols = [
    "fico_score",
    "fico_normalized",
    "annual_rate",
    "dti",
    "loan_to_income",
]

# Rank candidate features by how well they separate flagged from unflagged loans.
ranking = rank_features_by_separation(
    feature_frame,
    feature_cols=feature_cols,
    flag_col="fpf30_flag",
)
print(ranking.head())

# Check whether an existing score (vendor_pd here) rank-orders the same target.
lift = lift_gain_table(
    y_true=feature_frame["fpf30_flag"],
    y_prob=feature_frame["vendor_pd"],
)
print(lift[["bin", "event_rate", "lift"]].head())
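
To make the bin, event_rate, and lift columns concrete, here is a minimal plain-Python sketch of a lift table. It is not the implementation of lift_gain_table: the function name lift_table, the default of ten bins, and the convention that bin 1 holds the highest scores are all assumptions made for illustration.

```python
def lift_table(y_true, y_prob, n_bins=10):
    """Sort by score (highest first), split into near-equal bins,
    and report each bin's event rate and its lift over the overall rate."""
    pairs = sorted(zip(y_prob, y_true), key=lambda p: p[0], reverse=True)
    n = len(pairs)
    overall = sum(y for _, y in pairs) / n
    if overall == 0:
        raise ValueError("no events in y_true, lift is undefined")
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than observations
        rate = sum(y for _, y in chunk) / len(chunk)
        rows.append({
            "bin": b + 1,
            "n": len(chunk),
            "event_rate": rate,
            "lift": rate / overall,  # 1.0 means no better than average
        })
    return rows


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
flags = [1, 1, 0, 1, 0, 0, 0, 0]
for row in lift_table(flags, scores, n_bins=4):
    print(row)
```

A score that ranks risk sensibly shows lift well above 1 in the top bins and falling toward 0 in the bottom bins.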

Expected output / first win

Your first win is a ranked feature table that tells you which variables are worth carrying forward into modeling.

You should expect:

  • a table ordered by separation strength (iv, gini, ks)
  • a quick lift/gain table that shows whether an existing score or proxy ranks risk sensibly
  • a short list of “promote”, “watch”, and “drop for now” candidates
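
For readers new to the separation metrics, the sketch below computes two of them by hand in plain Python. It uses the textbook definitions of the KS statistic and information value, not necessarily the exact method inside rank_features_by_separation. The function names, the equal-frequency binning, and the 0.5 smoothing constant are assumptions.

```python
import math


def ks_statistic(values, flags):
    """Largest gap between the cumulative distributions of events and non-events."""
    n_event = sum(flags)
    n_non = len(flags) - n_event
    if n_event == 0 or n_non == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(values, flags))
    cum_event = cum_non = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        v = pairs[i][0]
        # Consume all ties at this value before comparing the two CDFs.
        while i < len(pairs) and pairs[i][0] == v:
            if pairs[i][1]:
                cum_event += 1
            else:
                cum_non += 1
            i += 1
        best = max(best, abs(cum_event / n_event - cum_non / n_non))
    return best


def information_value(values, flags, n_bins=5, smooth=0.5):
    """IV over equal-frequency bins: sum of (p_non - p_event) * ln(p_non / p_event)."""
    pairs = sorted(zip(values, flags))
    n = len(pairs)
    n_event = sum(flags)
    n_non = n - n_event
    iv = 0.0
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        ev = sum(f for _, f in chunk)
        non = len(chunk) - ev
        # Smoothing (an assumption) avoids log(0) when a bin holds one class only.
        p_ev = (ev + smooth) / (n_event + smooth * n_bins)
        p_non = (non + smooth) / (n_non + smooth * n_bins)
        iv += (p_non - p_ev) * math.log(p_non / p_ev)
    return iv


fico = [600, 620, 640, 660, 700, 720, 740, 760]
fpf = [1, 1, 1, 1, 0, 0, 0, 0]
print(ks_statistic(fico, fpf), information_value(fico, fpf, n_bins=4))
```

Both metrics grow as the feature separates the two classes more cleanly, which is why the ranked table can be read from the top down.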

Common mistakes

  • jumping straight to model training before checking whether any features have useful separation
  • ranking features on immature loans, whose target is still NaN, instead of dropping those rows first
  • treating WoE binning as required on day one — it is optional, not the first step
  • mixing post-outcome fields into feature engineering, which creates leakage
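
Two of these mistakes can be caught with a small pre-flight check before ranking. The sketch below works on plain dictionaries so it runs without the package. The function name preflight and the field names in POST_OUTCOME_FIELDS are hypothetical placeholders; substitute the post-outcome columns that exist in your own data.

```python
import math

# Hypothetical post-outcome fields, known only after the target window closes.
POST_OUTCOME_FIELDS = {"days_past_due", "chargeoff_date", "recovery_amount"}


def is_missing(value):
    """Treat None and float NaN as a missing (immature) target."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def preflight(rows, feature_cols, flag_col="fpf30_flag",
              post_outcome=POST_OUTCOME_FIELDS):
    """Reject leaky feature lists, then drop rows with an unobserved target.

    Returns the mature rows and the number of rows dropped.
    """
    leaky = sorted(set(feature_cols) & set(post_outcome))
    if leaky:
        raise ValueError(f"post-outcome fields in feature list: {leaky}")
    mature = [r for r in rows if not is_missing(r.get(flag_col))]
    return mature, len(rows) - len(mature)


loans = [
    {"fico_score": 700, "fpf30_flag": 0.0},
    {"fico_score": 650, "fpf30_flag": float("nan")},  # immature loan
    {"fico_score": 600, "fpf30_flag": 1.0},
]
mature, dropped = preflight(loans, ["fico_score"])
print(len(mature), "mature rows,", dropped, "dropped")
```

Raising on a leaky column rather than silently removing it forces the feature list to be fixed at its source.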

Next step

Run the packaged demo end-to-end with:

python -m cranalytics.examples.core_feature_analytics