Input Data Contracts

This page defines the column names, types, and value constraints for every input shape in cranalytics. All validation is enforced at runtime via Pandera schemas in src/cranalytics/validation.py.

There are six primary input shapes — pick the one that matches your workflow:

| Shape | Used by | Column count |
|---|---|---|
| Portfolio | Lifetime loss forecasting, FICO segmentation, LGD | 6 required + 1 optional |
| Loan History | History normalization, predictive target derivation from snapshots, lifetime loss forecasting from history | 3 required + 12 optional |
| Loan Snapshot | Loan-level normalization, snapshot-to-survival prep, snapshot-based maturity metadata | 3 required + 10 optional |
| Vintage (long format) | Vintage curve fitting, backtest, smoothing | 2 required + 2 optional |
| Rollforward | Rollforward workflow, hazard curve fitting | 5 required + 1 optional |
| Wide CGCO | Wide-format CGCO curve computation | 5 required + 1 optional |

Lifetime loss forecasting also requires transition input. You can pass a transition matrix, a transition ledger with loan_id, period, and status, or a loan history panel with loan_id, fund_date, and as_of_date. The square matrix remains the canonical shape and is still required for forecast_portfolio_states.


Portfolio

Used by: forecast_lifetime_loss, simulate_portfolio_cashflows, segment_fico, calculate_lgd

| Column | Type | Required | Constraints | Notes |
|---|---|---|---|---|
| loan_id | str | Yes | Unique, non-null | Any string identifier |
| principal | float | Yes | > 0 | Outstanding balance, not original |
| annual_rate | float | Yes | 0.0 – 1.0 | Decimal, not percent. 6% → 0.06 |
| term_months | int | Yes | 1 – 360 | Original loan term |
| start_date | datetime | Yes | Not in the future | Origination date; string is coerced |
| status | str | Yes | See below | Current loan status |
| fico_score | int | No | 300 – 850 | Optional; required for score-based segmentation |

Accepted status values:

Current, Delinquent, Charged Off, Paid Off, Late-30, Late-60, Late-90, Default (legacy)

Validate from the CLI:

cranalytics validate-data portfolio.csv
cranalytics validate-data portfolio.csv --show-schema   # prints column list + status values

Minimal example:

import pandas as pd

portfolio = pd.DataFrame({
    "loan_id":      ["L001",       "L002",       "L003"],
    "principal":    [10000.0,      25000.0,      5000.0],
    "annual_rate":  [0.065,        0.12,         0.089],   # decimals
    "term_months":  [60,           36,           48],
    "start_date":   ["2023-01-15", "2022-06-01", "2023-03-20"],
    "status":       ["Current",    "Delinquent", "Charged Off"],
})

Common mistakes:

  • annual_rate as a percentage (e.g. 6.5 instead of 0.065) — validation rejects values > 1.0
  • start_date as an unrecognized string format — pass ISO 8601 (YYYY-MM-DD) and let coercion handle it
  • status values with inconsistent casing ("current" vs "Current") — must match exactly
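The three mistakes above can be caught before the data ever reaches the validator. The following is a minimal pre-check sketch in plain Python. It does not call cranalytics; `precheck_portfolio_row` is a hypothetical helper, and the accepted-status set is copied from this page.

```python
from datetime import date

# Accepted status values copied from this page ("Default" is the legacy label).
ACCEPTED_STATUSES = {
    "Current", "Delinquent", "Charged Off", "Paid Off",
    "Late-30", "Late-60", "Late-90", "Default",
}


def precheck_portfolio_row(row: dict) -> list:
    """Hypothetical helper: return readable issues for one portfolio row."""
    issues = []
    rate = row["annual_rate"]
    if rate > 1.0:
        issues.append(f"annual_rate {rate} exceeds 1.0; as a decimal it would be {rate / 100:.4f}")
    try:
        date.fromisoformat(row["start_date"])  # expects YYYY-MM-DD
    except ValueError:
        issues.append(f"start_date {row['start_date']!r} is not ISO 8601")
    status = row["status"]
    if status not in ACCEPTED_STATUSES:
        # Suggest the canonical spelling when only the casing differs.
        by_lower = {s.lower(): s for s in ACCEPTED_STATUSES}
        hint = by_lower.get(status.lower())
        suffix = f"; did you mean {hint!r}?" if hint else ""
        issues.append(f"status {status!r} is not accepted{suffix}")
    return issues


bad_row = {"annual_rate": 6.5, "start_date": "01/15/2023", "status": "current"}
for issue in precheck_portfolio_row(bad_row):
    print(issue)
```

The helper only reports problems; it deliberately does not rewrite values, since a rate above 1.0 may be a percent or simply a data error.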

Loan History (one row per loan per as_of_date)

Used by: cranalytics.loan_history.normalize_loan_history, cranalytics.loan_history.loan_history_to_target_frame, cranalytics.loan_history.loan_history_to_transition_frame, cranalytics.loan_history.build_targets_from_loan_history, cranalytics.transition.estimate_cohort_matrix_from_loan_history, forecast_lifetime_loss

This is the canonical loan-history shape. Each row is one loan at one observation date. It is the skinny waist for historical workflows before branching into predictive target construction or transition estimation.

| Column | Type | Required | Constraints | Notes |
|---|---|---|---|---|
| loan_id | str | Yes | Non-null | Repeats across time |
| fund_date | datetime | Yes | ≤ as_of_date | Canonical origination name |
| as_of_date | datetime | Yes | Non-null | One row per loan per observation date |
| month_on_book | float | No | ≥ 0 if present | Derived from fund_date and as_of_date when missing |
| first_payment_due_date | datetime | No | ≥ fund_date if present | Used to distinguish not_yet_due from due/current |
| original_principal | float | No | > 0 if present | Needed for some chargeoff targets |
| current_balance | float | No | ≥ 0 if present | Optional history balance |
| annual_rate | float | No | 0.0 – 1.0 if present | Optional history feature |
| fico_score | float | No | 300 – 850 if present | Optional history feature |
| term_months | float | No | 1 – 360 if present | Optional contractual term |
| dpd | float | No | ≥ 0 if present | Nulled for not-yet-due rows during normalization |
| chargeoff_amount | float | No | ≥ 0 if present | Period chargeoff amount; required for chargeoff-amount targets |
| chargeoff_date | datetime | No | Between fund_date and as_of_date if present | Terminal loss date |
| payoff_date | datetime | No | Between fund_date and as_of_date if present | Terminal payoff date |
| source_status | str | No | Source-specific | Preserved raw source status label |

Derived columns added by the contract layer:

  • first_payment_due_date_is_imputed
  • terminal_state
  • payment_due_state
  • delinquency_state
  • status

The canonical status is intentionally coarse:

  • Current
  • Delinquent
  • Paid Off
  • Charged Off

Pre-due rows still normalize to status = Current; the pre-due distinction is carried by payment_due_state and, for transition estimation, by transition_state.

Column aliases

The loan history contract accepts the same common aliases as the snapshot contract, plus:

| Canonical name | Accepted aliases |
|---|---|
| month_on_book | mob, months_on_book, month |
| chargeoff_amount | charge_off_amount, co_amount, net_chargeoff_amount |
| as_of_date | snapshot_date, report_date, extract_date, period |

Important predictive-target rule

When first_payment_due_date is present, the predictive adapter uses a payment-based mob:

  • rows before first due date map to mob = 0
  • the first due observation maps to mob = 1

That prevents pre-due rows from being mistaken for due-and-current first-payment observations.
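The rule can be sketched in a few lines of plain Python. `payment_based_mob` is a hypothetical helper, not the adapter's actual code. The two documented cases (0 before the first due date, 1 at the first due observation) come from this page; counting one additional mob per whole month after the first due date is an assumption made for illustration.

```python
from datetime import date


def payment_based_mob(first_payment_due_date: date, as_of_date: date) -> int:
    """Hypothetical helper: 0 before the first due date, 1 at the first due observation.

    Assumption: every whole month elapsed after the first due date adds one.
    """
    if as_of_date < first_payment_due_date:
        return 0  # pre-due row
    months = (as_of_date.year - first_payment_due_date.year) * 12 + (
        as_of_date.month - first_payment_due_date.month
    )
    if as_of_date.day < first_payment_due_date.day:
        months -= 1  # this month's due day has not been reached yet
    return 1 + months


first_due = date(2024, 2, 10)
for as_of in (date(2024, 1, 31), date(2024, 2, 10), date(2024, 3, 10)):
    print(as_of, payment_based_mob(first_due, as_of))
```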

Important transition-estimation rule

The history contract also derives a transition_state column with roll-rate-friendly labels:

  • Not Yet Due
  • Current
  • 1-29 DPD
  • 30 DPD
  • 60 DPD
  • 90+ DPD
  • Paid Off
  • Charged Off

loan_history_to_transition_frame() adapts canonical loan history into the generic (loan_id, period, status) shape expected by estimate_cohort_matrix().
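To make the labels concrete, here is a hedged sketch of how a single row could be mapped onto them. `transition_state_label` is a hypothetical function, not the contract's implementation. The cut points (1-29, 30-59, 60-89, 90+) are assumed from the delinquency_state names used by the snapshot contract, and the terminal_state and payment_due_state values are the ones listed on this page.

```python
from typing import Optional


def transition_state_label(
    dpd: Optional[float],
    payment_due_state: str = "due",
    terminal_state: str = "active",
) -> str:
    """Hypothetical mapping onto the roll-rate labels listed above."""
    if terminal_state == "charged_off":
        return "Charged Off"
    if terminal_state == "paid_off":
        return "Paid Off"
    if payment_due_state == "not_yet_due":
        return "Not Yet Due"  # dpd is nulled for these rows
    if dpd is None:
        raise ValueError("dpd is required for active loans that are already due")
    # Assumed cut points: 1-29, 30-59, 60-89, 90+ days past due.
    if dpd < 1:
        return "Current"
    if dpd < 30:
        return "1-29 DPD"
    if dpd < 60:
        return "30 DPD"
    if dpd < 90:
        return "60 DPD"
    return "90+ DPD"


print(transition_state_label(None, payment_due_state="not_yet_due"))
print(transition_state_label(45.0))
```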

Contract metadata

validate_loan_history_input_contract() also returns summary metadata counts for:

  • rows with imputed first_payment_due_date
  • loans that normalize to Charged Off without chargeoff_date
  • loans that normalize to Paid Off without payoff_date

Programmatic usage

from cranalytics.loan_history import (
    build_targets_from_loan_history,
    loan_history_to_transition_frame,
    normalize_loan_history,
)
from cranalytics.loan_history_contract import validate_loan_history_input_contract
from cranalytics.transition import estimate_cohort_matrix_from_loan_history

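# Validate first to inspect issues and metadata alongside the cleaned frame.
result = validate_loan_history_input_contract(raw_history_df)
print(result.issue_table)
print(result.metadata)
history_df = result.df

# Normalize the raw frame; the steps below use this normalized history.
history_df = normalize_loan_history(raw_history_df)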
targets_df = build_targets_from_loan_history(
    history_df,
    targets=["fpd_flag", "fpf30_flag"],
)
transition_frame = loan_history_to_transition_frame(history_df)
matrix = estimate_cohort_matrix_from_loan_history(history_df)

Loan Snapshot (one row per loan)

Used by: cranalytics.loan_snapshot.normalize_loan_snapshot, cranalytics.loan_snapshot.loan_snapshot_to_portfolio, cranalytics.loan_snapshot.add_first_payment_maturity, cranalytics.loan_snapshot.loan_snapshot_to_survival_data

This is the new canonical one-row-per-loan shape. It is meant to be the skinny waist for loan-level workflows before branching into survival views, maturity-aware flags, or later history transforms.

| Column | Type | Required | Constraints | Notes |
|---|---|---|---|---|
| loan_id | str | Yes | Unique, non-null | One row per loan |
| fund_date | datetime | Yes | ≤ as_of_date | Use fund_date as the canonical origination name |
| as_of_date | datetime | Yes | Single file-level value | Every row in the snapshot should share the same extract date |
| first_payment_due_date | datetime | No | ≥ fund_date if present | Null while still unknown |
| original_principal | float | No | > 0 if present | Original funded balance |
| current_balance | float | No | ≥ 0 if present | Current outstanding balance |
| term_months | float | No | 1 – 360 if present | Stored permissively to tolerate nullable source data |
| annual_rate | float | No | 0.0 – 1.0 if present | Needed for loan_snapshot_to_portfolio and lifetime loss forecasting |
| fico_score | float | No | 300 – 850 if present | Passed through to portfolio / segmentation views |
| dpd | float | No | ≥ 0 if present | Days past due; nulled for not-yet-due loans during normalization |
| chargeoff_date | datetime | No | Between fund_date and as_of_date if present | Reliable terminal loss date |
| payoff_date | datetime | No | Between fund_date and as_of_date if present | Reliable terminal payoff date |
| source_status | str | No | Source-specific | Preserved raw source status label |

Derived state columns added by the contract layer:

  • first_payment_due_date_is_imputed
  • terminal_state: active, charged_off, paid_off
  • payment_due_state: not_yet_due, due, due_status_unknown, terminal
  • delinquency_state: current, late_1_29, late_30_59, late_60_89, late_90_plus, unknown, or the terminal state
  • status: canonical coarse state derived from the fields above

The canonical snapshot status is:

  • Current
  • Delinquent
  • Paid Off
  • Charged Off

Important semantic rule:

dpd = 0 is not the same thing as "due-and-current" for a loan that has not reached its first due date yet. During normalization, pre-due rows are tagged as payment_due_state = not_yet_due, kept as canonical status = Current, and their dpd is nulled so downstream workflows cannot mistake them for due-and-current loans. The transition workflow then re-expands this distinction through transition_state.

If first_payment_due_date is missing, the contract imputes it as fund_date + 30 days, marks first_payment_due_date_is_imputed = True, and returns a metadata count of affected rows.
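The imputation and the pre-due rule can be illustrated together. This is a sketch for one active loan in plain Python, not the contract layer's code. `apply_first_due_rule` is a hypothetical name, and treating a loan as pre-due only while as_of_date is strictly before the due date is an assumption.

```python
from datetime import date, timedelta
from typing import Optional


def apply_first_due_rule(
    fund_date: date,
    as_of_date: date,
    first_payment_due_date: Optional[date],
    dpd: Optional[float],
) -> dict:
    """Hypothetical sketch of the pre-due rule for one active loan."""
    imputed = first_payment_due_date is None
    if imputed:
        # Documented default: fund_date + 30 days.
        first_payment_due_date = fund_date + timedelta(days=30)
    # Assumption: pre-due means as_of_date is strictly before the due date.
    pre_due = as_of_date < first_payment_due_date
    return {
        "first_payment_due_date": first_payment_due_date,
        "first_payment_due_date_is_imputed": imputed,
        "payment_due_state": "not_yet_due" if pre_due else "due",
        # Null dpd on pre-due rows so a 0 is never read as due-and-current.
        "dpd": None if pre_due else dpd,
    }


row = apply_first_due_rule(date(2024, 3, 15), date(2024, 4, 1), None, 0.0)
print(row)
```

In this example the due date is imputed as 2024-04-14, which falls after the 2024-04-01 extract date, so the reported dpd of 0.0 is discarded rather than interpreted as a current payment.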

Column aliases

The loan snapshot contract accepts common aliases automatically:

| Canonical name | Accepted aliases |
|---|---|
| fund_date | origination_date, issue_date, issue_d, start_date |
| as_of_date | as_of, asof, snapshot_date, report_date, extract_date, max_report_date |
| first_payment_due_date | first_due_date, first_due_dt, first_payment_due, first_pmt_due_date |
| original_principal | original_balance, orig_bal, origination_amount, loan_amount, funded_amnt, amtloan |
| current_balance | principal_balance, outstanding_balance, balance, upb, current_principal |
| annual_rate | interest_rate, int_rate, rate |
| fico_score | fico, credit_score, fico_range_low, fico_range_high |
| chargeoff_date | charge_off_date, co_date |
| payoff_date | paid_off_date, paid_off_on, payoff_dt |
| source_status | loan_status, state, or a raw status column from the source extract |

Minimal example:

import pandas as pd

loan_snapshot = pd.DataFrame({
    "loan_id": ["L001", "L002", "L003"],
    "fund_date": ["2024-01-10", "2023-10-01", "2023-10-01"],
    "as_of_date": ["2024-04-01", "2024-04-01", "2024-04-01"],
    "first_payment_due_date": ["2024-02-10", "2023-11-01", "2023-11-01"],
    "current_balance": [9500.0, 7200.0, 0.0],
    "dpd": [0.0, 0.0, None],
    "chargeoff_date": [None, None, "2024-03-15"],
    "payoff_date": [None, None, None],
})

Programmatic contract usage:

from cranalytics.loan_snapshot import (
    add_first_payment_maturity,
    loan_snapshot_to_portfolio,
    normalize_loan_snapshot,
)
from cranalytics.loan_snapshot_contract import validate_loan_snapshot_input_contract

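# Validate first to inspect issues and metadata alongside the cleaned frame.
result = validate_loan_snapshot_input_contract(raw_df, as_of_date="2024-04-01")
print(result.issue_table)
print(result.metadata)
clean_df = result.df

# Normalize the raw frame; the steps below use this normalized snapshot.
clean_df = normalize_loan_snapshot(raw_df, as_of_date="2024-04-01")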
portfolio_df = loan_snapshot_to_portfolio(clean_df)
flag_df = add_first_payment_maturity(clean_df)

Vintage (long format)

Used by: CurveFitter, smooth_vintage, run_validation_suite, run_backtest_sweeps

Each row is one (vintage, mob) observation. If you have a pivot table (vintages as rows, MOBs as columns) use create_vintage_triangle to reshape it first.

| Column | Type | Required | Constraints | Notes |
|---|---|---|---|---|
| mob | int | Yes | 0 – 600 | Months on book |
| cumulative_loss_rate | float | Yes | 0.0 – 1.0 | Must be non-decreasing within a vintage |
| vintage_name | str | Conditional | Non-null if segmented | e.g. "2023-Q1"; required for segmented analysis |
| segment | str | Conditional | Non-null if segmented | e.g. FICO band; required for segmented analysis |

Note on strict mode: VintageDataSchema uses strict=True — extra columns will raise an error. Drop them before passing to curve fitting functions.

Minimal example (unsegmented):

import pandas as pd

vintage_df = pd.DataFrame({
    "mob":                  [1,     2,     3,     4,     5,     6],
    "cumulative_loss_rate": [0.005, 0.011, 0.018, 0.023, 0.027, 0.030],
})

Minimal example (segmented):

vintage_df = pd.DataFrame({
    "vintage_name":         ["2023-Q1"] * 6 + ["2023-Q2"] * 6,
    "mob":                  [1, 2, 3, 4, 5, 6] * 2,
    "cumulative_loss_rate": [0.005, 0.011, 0.018, 0.023, 0.027, 0.030,
                             0.007, 0.014, 0.022, 0.028, 0.033, 0.037],
    "segment":              ["prime"] * 6 + ["subprime"] * 6,
})

Common mistakes:

  • Passing a wide pivot table directly — rows must be individual (vintage, mob) observations
  • Cumulative loss rates that decrease at any MOB — validation warns when > 5% of vintages violate monotonicity
  • Including extra columns (e.g. origination_count) — the strict schema will reject them; drop first
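The last two mistakes above are easy to screen for before fitting. The sketch below works on plain dict rows and does not use cranalytics. Both function names are hypothetical, and the allowed-column set is simply the four columns in the vintage table on this page.

```python
from collections import defaultdict

# The four columns named in the vintage table on this page.
ALLOWED_COLUMNS = {"mob", "cumulative_loss_rate", "vintage_name", "segment"}


def drop_extra_columns(rows):
    """Keep only the allowed columns so a strict schema would not reject the rows."""
    return [{k: v for k, v in row.items() if k in ALLOWED_COLUMNS} for row in rows]


def non_monotonic_vintages(rows):
    """Return vintage names whose cumulative_loss_rate decreases at any MOB."""
    by_vintage = defaultdict(list)
    for row in rows:
        name = row.get("vintage_name", "all")  # unsegmented data forms one group
        by_vintage[name].append((row["mob"], row["cumulative_loss_rate"]))
    bad = []
    for name, points in by_vintage.items():
        rates = [rate for _, rate in sorted(points)]  # order by mob
        if any(later < earlier for earlier, later in zip(rates, rates[1:])):
            bad.append(name)
    return bad


rows = [
    {"vintage_name": "2023-Q1", "mob": 1, "cumulative_loss_rate": 0.005, "origination_count": 120},
    {"vintage_name": "2023-Q1", "mob": 2, "cumulative_loss_rate": 0.011, "origination_count": 120},
    {"vintage_name": "2023-Q2", "mob": 1, "cumulative_loss_rate": 0.007, "origination_count": 95},
    {"vintage_name": "2023-Q2", "mob": 2, "cumulative_loss_rate": 0.006, "origination_count": 95},
]
clean = drop_extra_columns(rows)
bad = non_monotonic_vintages(clean)
share = len(bad) / len({row["vintage_name"] for row in clean})
print(bad, share)
```

Here one of two vintages violates monotonicity, so the violating share is 0.5, well above the 5% level at which validation is documented to warn.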

Rollforward (monthly aggregated)

Used by: fit_flow_hazard_curves, forecast_balance_flows, run_rollforward_workflow, generate_rollforward_readiness_report

Each row is one (segment, month on book) observation. Columns are aggregate dollar flows for that cohort age.

| Column | Type | Required | Constraints | Notes |
|---|---|---|---|---|
| segment_id | str | Yes | Non-null, non-blank | Cohort or bucket label |
| month_on_book | int | Yes | ≥ 0 | MOB (0-indexed is fine) |
| payments | float | Yes | ≥ 0 | Dollar payments received |
| chargeoffs | float | Yes | ≥ 0 | Dollar charge-offs |
| outstanding_balance | float | Yes | > 0 | Beginning-of-period balance |
| amtloan | float | No | > 0 | Original origination balance per segment; used to compute CGCO%; falls back to first observed outstanding_balance if absent |

Cross-row constraints:

  • payments + chargeoffs must not exceed outstanding_balance in any row
  • No duplicate (segment_id, month_on_book) pairs
  • MOB series should be contiguous within each segment (gaps produce a warning, not an error)
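The three cross-row constraints above can be sketched as a small checker over plain dict rows. `check_rollforward_rows` is a hypothetical helper written for illustration; it is not the contract's implementation and uses only canonical column names.

```python
from collections import defaultdict


def check_rollforward_rows(rows):
    """Hypothetical checker: returns (errors, warnings) for the three rules above."""
    errors, warnings = [], []
    seen = set()
    mobs = defaultdict(list)
    for row in rows:
        key = (row["segment_id"], row["month_on_book"])
        if key in seen:
            errors.append(f"duplicate (segment_id, month_on_book): {key}")
        seen.add(key)
        mobs[row["segment_id"]].append(row["month_on_book"])
        if row["payments"] + row["chargeoffs"] > row["outstanding_balance"]:
            errors.append(f"payments + chargeoffs exceed outstanding_balance at {key}")
    for segment, values in mobs.items():
        present = set(values)
        missing = [m for m in range(min(present), max(present) + 1) if m not in present]
        if missing:
            # Gaps are a warning, not an error, per the rule above.
            warnings.append(f"{segment}: missing MOB {missing}")
    return errors, warnings


rows = [
    {"segment_id": "prime", "month_on_book": 1, "payments": 8200, "chargeoffs": 100, "outstanding_balance": 100000},
    {"segment_id": "prime", "month_on_book": 2, "payments": 7900, "chargeoffs": 130, "outstanding_balance": 91700},
    {"segment_id": "prime", "month_on_book": 4, "payments": 7300, "chargeoffs": 170, "outstanding_balance": 75920},
]
errors, warnings = check_rollforward_rows(rows)
print(errors, warnings)
```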

Column aliases

Rollforward data accepts common alternative column names automatically — no renaming needed:

| Canonical name | Accepted aliases |
|---|---|
| segment_id | segment, segmentid, segment_name, bucket |
| month_on_book | mob, month, months_on_book, period |
| payments | payment, principal_payment, pmt |
| chargeoffs | chargeoff, charge_off, co |
| outstanding_balance | balance, outstanding, upb, current_balance |
| amtloan | original_principal, origination_amount, loan_amount |

Alias resolution is case-insensitive and strips punctuation. Outstanding Balance and outstanding_balance both resolve correctly.
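One way such a lookup could work is to reduce every name to a comparison key before matching. The sketch below is an assumption about the general technique, not the logic in rollforward_contract.py. `resolve_column` and `_key` are hypothetical names, the alias table is abbreviated to two entries from this page, and stripping whitespace along with punctuation is assumed so that "Outstanding Balance" matches.

```python
import re

# Alias table copied from this page, abbreviated to two canonical names.
ALIASES = {
    "outstanding_balance": ["balance", "outstanding", "upb", "current_balance"],
    "month_on_book": ["mob", "month", "months_on_book", "period"],
}


def _key(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Map every normalized canonical name and alias to its canonical name.
LOOKUP = {
    _key(alias): canonical
    for canonical, aliases in ALIASES.items()
    for alias in [canonical, *aliases]
}


def resolve_column(name: str) -> str:
    """Return the canonical name, or the input unchanged when nothing matches."""
    return LOOKUP.get(_key(name), name)


for raw in ("Outstanding Balance", "outstanding_balance", "MOB", "notes"):
    print(raw, "->", resolve_column(raw))
```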

Minimal example:

import pandas as pd

rollforward_df = pd.DataFrame({
    "segment_id":          ["prime"] * 6,
    "month_on_book":       [1, 2, 3, 4, 5, 6],
    "payments":            [8200, 7900, 7600, 7300, 7000, 6700],
    "chargeoffs":          [100,  130,  150,  170,  190,  210],
    "outstanding_balance": [100000, 91700, 83670, 75920, 68450, 61260],
})

Common mistakes:

  • Balance as beginning-of-next-period rather than beginning-of-current-period — if payments + chargeoffs > outstanding_balance, the contract will reject the rows
  • Segment IDs that are floats or integers — coerced to string, but 1.0 and 1 become different strings; standardize first
  • MOB gaps (e.g. missing MOB 4) — produces a warning but not an error; the model will interpolate

Wide CGCO (loan-level)

Used by: compute_cgco_curve_wide, compute_final_cgco_wide, load_wide_vintage_data

Each row is one loan. The library computes cohort curves from the raw loan records — no pre-aggregation needed.

| Column | Type | Required | Constraints | Notes |
|---|---|---|---|---|
| loan_id | str | Yes | Unique, non-null | |
| vintage_date | datetime | Yes | ≤ max_report_date | Origination date |
| loan_amount | float | Yes | > 0 | Original disbursement amount |
| max_report_date | datetime | Yes | ≥ vintage_date | Last observation date for this loan |
| charge_off_date | datetime | No | ≥ vintage_date if present | Null for loans that have not charged off |
| charge_off_amount | float | Yes | ≥ 0, ≤ loan_amount | Set to 0.0 for loans that have not charged off |

Cross-row constraints:

  • charge_off_amount must be 0 for any row where charge_off_date is null
  • charge_off_date must be on or after vintage_date when present
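As a rough illustration, the charge-off rules above can be checked one row at a time. `check_wide_row` is a hypothetical helper in plain Python that assumes dates have already been parsed to `datetime.date`; it is a sketch, not the schema used by the library.

```python
from datetime import date


def check_wide_row(row: dict) -> list:
    """Hypothetical per-row check of the charge-off rules above."""
    issues = []
    co_date = row.get("charge_off_date")
    co_amount = row["charge_off_amount"]
    if co_date is None and co_amount != 0:
        issues.append("charge_off_amount must be 0 when charge_off_date is null")
    if co_date is not None and co_date < row["vintage_date"]:
        issues.append("charge_off_date is before vintage_date")
    if not 0 <= co_amount <= row["loan_amount"]:
        issues.append("charge_off_amount must be between 0 and loan_amount")
    return issues


ok_row = {
    "vintage_date": date(2022, 1, 1),
    "loan_amount": 15000.0,
    "charge_off_date": date(2023, 6, 15),
    "charge_off_amount": 14200.0,
}
bad_row = {**ok_row, "charge_off_date": None, "charge_off_amount": 500.0}
print(check_wide_row(ok_row))
print(check_wide_row(bad_row))
```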

Minimal example:

import pandas as pd

wide_df = pd.DataFrame({
    "loan_id":          ["L001",       "L002",       "L003"],
    "vintage_date":     ["2022-01-01", "2022-01-01", "2022-02-01"],
    "loan_amount":      [10000.0,      15000.0,      8000.0],
    "max_report_date":  ["2024-01-01", "2024-01-01", "2024-01-01"],
    "charge_off_date":  [None,         "2023-06-15", None],
    "charge_off_amount":[0.0,          14200.0,      0.0],
})

Common mistakes:

  • Forgetting to set charge_off_amount = 0.0 for non-charged-off loans — validation will reject rows with null charge-off date and non-zero amount
  • Using the same date for vintage_date and max_report_date on a current loan — valid, but produces a 0-MOB observation with no curve

Transition Matrix

Used by: forecast_lifetime_loss, forecast_portfolio_states, simulate_portfolio_cashflows

The transition matrix is a square pd.DataFrame where both the index and columns are state names. Each cell [i, j] is the probability of moving from state i to state j in one period.

forecast_lifetime_loss and summarize_lifetime_loss also accept:

  • a transition ledger with loan_id, period, and status
  • a loan history panel with loan_id, fund_date, and as_of_date

In those paths, cranalytics estimates this square matrix internally before applying the baseline loss forecast. forecast_portfolio_states still requires the explicit square matrix shown below.
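For intuition, a cohort-style estimate from a ledger counts observed one-period moves and normalizes each row. The sketch below is a generic version of that technique in plain Python and may differ from what cranalytics does internally. `estimate_matrix_from_ledger` is a hypothetical name, consecutive observations of a loan are assumed to be one period apart, and states with no observed exits are assumed to be absorbing.

```python
from collections import Counter, defaultdict


def estimate_matrix_from_ledger(ledger):
    """Row-normalized transition counts from (loan_id, period, status) tuples."""
    by_loan = defaultdict(list)
    for loan_id, period, status in ledger:
        by_loan[loan_id].append((period, status))
    counts = defaultdict(Counter)
    for observations in by_loan.values():
        observations.sort()  # order each loan's rows by period
        for (_, src), (_, dst) in zip(observations, observations[1:]):
            counts[src][dst] += 1
    states = sorted({status for _, _, status in ledger})
    matrix = {}
    for src in states:
        total = sum(counts[src].values())
        if total == 0:
            # Assumption: no observed exits means the state is absorbing.
            matrix[src] = {dst: float(dst == src) for dst in states}
        else:
            matrix[src] = {dst: counts[src][dst] / total for dst in states}
    return matrix


ledger = [
    ("A", 1, "Current"), ("A", 2, "Current"), ("A", 3, "Delinquent"),
    ("B", 1, "Current"), ("B", 2, "Delinquent"), ("B", 3, "Charged Off"),
]
matrix = estimate_matrix_from_ledger(ledger)
for state, row in matrix.items():
    print(state, row)
```

With six observations the estimates are only illustrative; the point is that the output has the same square, row-stochastic shape as an explicit matrix.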

Requirements:

  • Square: same states as index and columns
  • All values in [0, 1]
  • Each row sums to 1.0 (tolerance: ±0.001)
  • Must include a loss terminal state named Charged Off or legacy Default
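The requirements above translate directly into a short checker. This is a sketch over plain lists, not the library's validate_transition_matrix; `check_transition_matrix` is a hypothetical name, and the matrix is passed as a list of state names plus a list of rows.

```python
def check_transition_matrix(states, rows, tol=0.001):
    """Hypothetical checker mirroring the four requirements above."""
    issues = []
    if len(rows) != len(states) or any(len(row) != len(states) for row in rows):
        return ["matrix is not square over the given states"]
    for state, row in zip(states, rows):
        if any(p < 0 or p > 1 for p in row):
            issues.append(f"{state}: value outside [0, 1]")
        if abs(sum(row) - 1.0) > tol:
            issues.append(f"{state}: row sums to {sum(row):.4f}")
    if not {"Charged Off", "Default"} & set(states):
        issues.append("missing loss terminal state (Charged Off or Default)")
    return issues


states = ["Current", "Delinquent", "Charged Off"]
rows = [
    [0.92, 0.06, 0.02],
    [0.30, 0.55, 0.15],
    [0.00, 0.00, 1.00],
]
print(check_transition_matrix(states, rows))
```

For a pandas DataFrame, the same inputs can be obtained with `list(df.index)` and `df.values.tolist()`.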

Minimal example:

import pandas as pd

states = ["Current", "Delinquent", "Charged Off"]
matrix = pd.DataFrame(
    [
        [0.92, 0.06, 0.02],   # Current →
        [0.30, 0.55, 0.15],   # Delinquent →
        [0.00, 0.00, 1.00],   # Charged Off → (absorbing)
    ],
    index=states,
    columns=states,
)
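To see what the one-period probabilities imply, the example matrix can be applied to a starting distribution by hand. The sketch below uses plain lists and is only an illustration of the matrix arithmetic; it is not how forecast_portfolio_states is implemented, and `step` is a hypothetical helper.

```python
def step(distribution, matrix):
    """One period: new[j] = sum over i of distribution[i] * matrix[i][j]."""
    n = len(distribution)
    return [sum(distribution[i] * matrix[i][j] for i in range(n)) for j in range(n)]


matrix = [
    [0.92, 0.06, 0.02],  # Current ->
    [0.30, 0.55, 0.15],  # Delinquent ->
    [0.00, 0.00, 1.00],  # Charged Off -> (absorbing)
]
distribution = [1.0, 0.0, 0.0]  # every loan starts Current
history = []
for period in (1, 2):
    distribution = step(distribution, matrix)
    history.append([round(p, 4) for p in distribution])
    print(period, history[-1])
# 1 [0.92, 0.06, 0.02]
# 2 [0.8644, 0.0882, 0.0474]
```

After two periods, 4.74% of the starting balance share sits in Charged Off: 2% charged off directly in each period from Current, plus the Delinquent share that rolled to loss.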

Load the built-in sample matrix:

from cranalytics import load_sample_transition_matrix

matrix = load_sample_transition_matrix()
# States: Current, Delinquent, Charged Off

Common mistake — status mismatch with portfolio:

make_mock_portfolio() generates loans with four statuses including Paid Off. The sample transition matrix only models active, delinquent, and charged-off states. Filter before passing:

portfolio = portfolio[portfolio["status"].isin(["Current", "Delinquent", "Charged Off"])]

Validating your data programmatically

from cranalytics.validation import (
    validate_loan_history,
    validate_loan_snapshot,
    validate_portfolio,
    validate_vintage,
    validate_flow_input,
    validate_wide_vintage,
    validate_transition_matrix,
)

# Each raises pandera.errors.SchemaErrors on failure (lazy=True collects all errors)
validated_df = validate_portfolio(df)
validated_df = validate_loan_history(df)
validated_df = validate_loan_snapshot(df)
validated_df = validate_vintage(df)
validated_df = validate_flow_input(df)
validated_df = validate_wide_vintage(df)
validated_matrix = validate_transition_matrix(matrix_df)

For Rollforward data with alias resolution and a richer issue report:

from cranalytics import validate_rollforward_input_contract

result = validate_rollforward_input_contract(df)
print(result.issue_table)   # severity, issue_code, message per issue
clean_df = result.df        # renamed and coerced DataFrame

See also

  • Choose Your Path — which workflow uses which input shape
  • src/cranalytics/validation.py — authoritative schema definitions
  • src/cranalytics/loan_history_contract.py — loan history alias resolution and state derivation
  • src/cranalytics/loan_snapshot_contract.py — loan snapshot alias resolution and state derivation
  • src/cranalytics/rollforward_contract.py — Rollforward alias resolution logic
  • CLI: cranalytics validate-data <file.csv> --show-schema