Input Data Contracts

This page defines the column names, types, and value constraints for every input shape in cranalytics. All validation is enforced at runtime via Pandera schemas in src/cranalytics/validation.py.

There are six primary input shapes — pick the one that matches your workflow:

Shape	Used by	Column count
Portfolio	Lifetime loss forecasting, FICO segmentation, LGD	6 required + 1 optional
Loan History	History normalization, predictive target derivation from snapshots, lifetime loss forecasting from history	3 required + 12 optional
Loan Snapshot	Loan-level normalization, snapshot-to-survival prep, snapshot-based maturity metadata	3 required + 10 optional
Vintage (long format)	Vintage curve fitting, backtest, smoothing	2 required + 2 optional
Rollforward	Rollforward workflow, hazard curve fitting	5 required + 1 optional
Wide CGCO	Wide-format CGCO curve computation	5 required + 1 optional

Lifetime loss forecasting also requires transition input. You can pass either a transition matrix, a transition ledger with loan_id, period, and status, or a loan history panel with loan_id, fund_date, and as_of_date. The square matrix remains the canonical shape and is still required for forecast_portfolio_states.

Workflow template and validation coverage

Use this table to understand which workflows have a CLI-accessible starter template or validation command today. cranalytics check supports portfolio, vintage, rollforward, and feature-analytics CSVs. Pass a workflow name explicitly or omit it when the columns identify one workflow uniquely.

Workflow	Primary input shape	Starter template	CLI validation path
Vintage Curve Fitting	Vintage long format	`vintage`	`cranalytics vintage check your_data.csv`
Lifetime Loss Forecasting	Portfolio plus transition input	`portfolio`	`cranalytics check your_data.csv`
FICO Segmentation	Portfolio / `fico_score`	`portfolio`	`cranalytics check your_data.csv`
Feature Analytics	Loan-level modeling frame	`feature-analytics`	`cranalytics feature-analytics check your_data.csv`
Advanced ML Modeling	Loan-month panel or labeled modeling frame	None yet	`cranalytics ml-modeling check your_data.csv`; `run` requires explicit feature, target, and split columns
Survival Analysis	Time-to-event dataset	None yet	`cranalytics survival check your_data.csv`; `run` requires explicit event-date and status mappings
Portfolio Simulation	Portfolio plus transition assumptions	`portfolio`	`cranalytics check your_data.csv`
Rollforward Workflow	Monthly aggregated rollforward data	`rollforward`	`cranalytics rollforward check your_rollforward_data.csv --output-dir rollforward_readiness_out`

Portfolio

Used by: forecast_lifetime_loss, simulate_portfolio_cashflows, segment_fico, calculate_lgd

Column	Type	Required	Constraints	Notes
`loan_id`	str	Yes	Unique, non-null	Any string identifier
`principal`	float	Yes	> 0	Outstanding balance, not original
`annual_rate`	float	Yes	0.0 – 1.0	Decimal, not percent. 6% → `0.06`
`term_months`	int	Yes	1 – 360	Original loan term
`start_date`	datetime	Yes	Not in the future	Origination date; string is coerced
`status`	str	Yes	See below	Current loan status
`fico_score`	int	No	300 – 850	Optional; required for score-based segmentation

Accepted status values:

Current, Delinquent, Charged Off, Paid Off, Late-30, Late-60, Late-90, Default (legacy)

Validate from the CLI:

cranalytics check portfolio.csv
cranalytics check portfolio.csv --show-schema   # prints column list + status values

Minimal example:

import pandas as pd

portfolio = pd.DataFrame({
    "loan_id":      ["L001",       "L002",       "L003"],
    "principal":    [10000.0,      25000.0,      5000.0],
    "annual_rate":  [0.065,        0.12,         0.089],   # decimals
    "term_months":  [60,           36,           48],
    "start_date":   ["2023-01-15", "2022-06-01", "2023-03-20"],
    "status":       ["Current",    "Delinquent", "Charged Off"],
})

Common mistakes:

annual_rate as a percentage (e.g. 6.5 instead of 0.065) — validation rejects values > 1.0
start_date as an unrecognized string format — pass ISO 8601 (YYYY-MM-DD) and let coercion handle it
status values with inconsistent casing ("current" vs "Current") — must match exactly
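
The three mistakes above can be repaired before validation. The following is a minimal pandas-only sketch, not part of cranalytics: the `raw` frame is hypothetical, the "values above 1.0 are percentages" heuristic is an assumption, and `Default` is assumed to be the literal spelling of the legacy status.

```python
import pandas as pd

# Hypothetical raw extract that exhibits all three mistakes.
raw = pd.DataFrame({
    "loan_id":     ["L001", "L002"],
    "principal":   [10000.0, 25000.0],
    "annual_rate": [6.5, 0.12],                   # first row is a percentage
    "term_months": [60, 36],
    "start_date":  ["01/15/2023", "06/01/2022"],  # US-style dates, not ISO 8601
    "status":      ["current", "charged off"],    # wrong casing
})

ACCEPTED = ["Current", "Delinquent", "Charged Off", "Paid Off",
            "Late-30", "Late-60", "Late-90", "Default"]
canonical_status = {s.lower(): s for s in ACCEPTED}

clean = raw.copy()

# Assumption: any rate above 1.0 was entered as a percentage.
rate = clean["annual_rate"]
clean["annual_rate"] = rate.where(rate <= 1.0, rate / 100.0)

# Parse the known source format explicitly, then emit ISO 8601 strings.
clean["start_date"] = (
    pd.to_datetime(clean["start_date"], format="%m/%d/%Y").dt.strftime("%Y-%m-%d")
)

# Map case-insensitively onto the accepted spellings; unknown labels become NaN.
clean["status"] = clean["status"].str.strip().str.lower().map(canonical_status)

print(clean)
```

Unknown status labels are left as missing values on purpose, so they surface during validation instead of being silently guessed.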

Loan History (one row per loan per `as_of_date`)

Used by: cranalytics.loan_history.normalize_loan_history, cranalytics.loan_history.loan_history_to_target_frame, cranalytics.loan_history.loan_history_to_transition_frame, cranalytics.loan_history.build_targets_from_loan_history, forecast_lifetime_loss (which estimates a cohort transition matrix internally)

This is the canonical loan-history shape. Each row is one loan at one observation date. It serves as the common intermediate shape (the "skinny waist") for historical workflows: normalize into it first, then branch into predictive target construction or transition estimation.

Column	Type	Required	Constraints	Notes
`loan_id`	str	Yes	Non-null	Repeats across time
`fund_date`	datetime	Yes	≤ `as_of_date`	Canonical origination name
`as_of_date`	datetime	Yes	Non-null	One row per loan per observation date
`month_on_book`	float	No	≥ 0 if present	Derived from `fund_date` and `as_of_date` when missing
`first_payment_due_date`	datetime	No	≥ `fund_date` if present	Used to distinguish `not_yet_due` from due/current
`original_principal`	float	No	> 0 if present	Needed for some chargeoff targets
`current_balance`	float	No	≥ 0 if present	Optional history balance
`annual_rate`	float	No	0.0 – 1.0 if present	Optional history feature
`fico_score`	float	No	300 – 850 if present	Optional history feature
`term_months`	float	No	1 – 360 if present	Optional contractual term
`dpd`	float	No	≥ 0 if present	Nulled for not-yet-due rows during normalization
`chargeoff_amount`	float	No	≥ 0 if present	Period chargeoff amount; required for chargeoff-amount targets
`chargeoff_date`	datetime	No	Between `fund_date` and `as_of_date` if present	Terminal loss date
`payoff_date`	datetime	No	Between `fund_date` and `as_of_date` if present	Terminal payoff date
`source_status`	str	No	Source-specific	Preserved raw source status label

Derived columns added by the contract layer:

first_payment_due_date_is_imputed
terminal_state
payment_due_state
delinquency_state
status

The canonical status is intentionally coarse:

Current
Delinquent
Paid Off
Charged Off

Pre-due rows still normalize to status = Current; the pre-due distinction is carried by payment_due_state and, for transition estimation, by transition_state.

Column aliases

The loan history contract accepts the same common aliases as the snapshot contract, plus:

Canonical name	Accepted aliases
`month_on_book`	`mob`, `months_on_book`, `month`
`chargeoff_amount`	`charge_off_amount`, `co_amount`, `net_chargeoff_amount`
`as_of_date`	`snapshot_date`, `report_date`, `extract_date`, `period`

Important predictive-target rule

When first_payment_due_date is present, the predictive adapter uses a payment-based mob:

rows before first due date map to mob = 0
the first due observation maps to mob = 1

That prevents pre-due rows from being mistaken for due-and-current first-payment observations.
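
A minimal sketch of a payment-based mob, using pandas only. The two rules above are taken from this page. How observations after the first due date are counted is an assumption made for illustration (one step per whole month since the first due date), and `payment_based_mob` is a hypothetical helper, not a cranalytics function.

```python
import pandas as pd

history = pd.DataFrame({
    "loan_id": ["L001"] * 4,
    "fund_date": pd.to_datetime(["2024-01-10"] * 4),
    "first_payment_due_date": pd.to_datetime(["2024-02-10"] * 4),
    "as_of_date": pd.to_datetime(
        ["2024-01-31", "2024-02-29", "2024-03-31", "2024-04-30"]
    ),
})


def payment_based_mob(first_due: pd.Series, as_of: pd.Series) -> pd.Series:
    """Hypothetical helper: 0 before the first due date, 1 at the first due observation."""
    # Whole calendar months elapsed since the first due date.
    months = (as_of.dt.year - first_due.dt.year) * 12 + (as_of.dt.month - first_due.dt.month)
    months = months - (as_of.dt.day < first_due.dt.day).astype(int)
    # Assumption: each whole month after the first due date adds one to mob.
    mob = months + 1
    # Rows observed before the first due date map to mob = 0.
    return mob.where(as_of >= first_due, 0)


history["mob"] = payment_based_mob(
    history["first_payment_due_date"], history["as_of_date"]
)
print(history[["as_of_date", "mob"]])
```

The January row is observed before the first due date and maps to 0, while the February row is the first due observation and maps to 1.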

Important transition-estimation rule

The history contract also derives a transition_state column with roll-rate-friendly labels:

Not Yet Due
Current
1-29 DPD
30 DPD
60 DPD
90+ DPD
Paid Off
Charged Off

loan_history_to_transition_frame() adapts canonical loan history into the generic (loan_id, period, status) shape expected by estimate_cohort_matrix().
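
To make the labels concrete, here is a pandas/NumPy sketch that maps derived state columns and `dpd` onto the `transition_state` labels listed above. The bucket boundaries (30 to 59 for `30 DPD`, 60 to 89 for `60 DPD`) are an assumption borrowed from the snapshot `delinquency_state` names. The treatment of a missing `dpd` on a due loan is also an assumption. None of this is the library's implementation.

```python
import numpy as np
import pandas as pd

rows = pd.DataFrame({
    "payment_due_state": ["not_yet_due", "due", "due", "due", "due", "due",
                          "terminal", "terminal"],
    "terminal_state": ["active"] * 6 + ["paid_off", "charged_off"],
    "dpd": [np.nan, 0.0, 15.0, 45.0, 75.0, 120.0, np.nan, np.nan],
})

# Conditions are evaluated in order; the first match wins.
conditions = [
    rows["terminal_state"] == "charged_off",
    rows["terminal_state"] == "paid_off",
    rows["payment_due_state"] == "not_yet_due",
    rows["dpd"] >= 90,
    rows["dpd"] >= 60,
    rows["dpd"] >= 30,
    rows["dpd"] >= 1,
]
labels = ["Charged Off", "Paid Off", "Not Yet Due",
          "90+ DPD", "60 DPD", "30 DPD", "1-29 DPD"]

# Assumption: anything left over (dpd of 0 or missing on a due loan) is Current.
rows["transition_state"] = np.select(conditions, labels, default="Current")
print(rows)
```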

Contract metadata

validate_loan_history_input_contract() also returns summary metadata counts for:

rows with imputed first_payment_due_date
loans that normalize to Charged Off without chargeoff_date
loans that normalize to Paid Off without payoff_date

Programmatic usage

from cranalytics.loan_history import (
    build_targets_from_loan_history,
    loan_history_to_transition_frame,
    normalize_loan_history,
)
from cranalytics.loan_history_contract import validate_loan_history_input_contract

result = validate_loan_history_input_contract(raw_history_df)
print(result.issue_table)
print(result.metadata)
history_df = result.df

history_df = normalize_loan_history(raw_history_df)
targets_df = build_targets_from_loan_history(
    history_df,
    targets=["fpd_flag", "fpf30_flag"],
)
transition_frame = loan_history_to_transition_frame(history_df)
# forecast_lifetime_loss(portfolio_df, history_df) accepts the loan history
# directly and estimates the cohort transition matrix internally.

Loan Snapshot (one row per loan)

Used by: cranalytics.loan_snapshot.normalize_loan_snapshot, cranalytics.loan_snapshot.loan_snapshot_to_portfolio, cranalytics.loan_snapshot.add_first_payment_maturity, cranalytics.loan_snapshot.loan_snapshot_to_survival_data

This is the new canonical one-row-per-loan shape. It is meant to be the common intermediate shape (the "skinny waist") for loan-level workflows: normalize into it first, then branch into survival views, maturity-aware flags, or later history transforms.

Column	Type	Required	Constraints	Notes
`loan_id`	str	Yes	Unique, non-null	One row per loan
`fund_date`	datetime	Yes	≤ `as_of_date`	Use `fund_date` as the canonical origination name
`as_of_date`	datetime	Yes	Single file-level value	Every row in the snapshot should share the same extract date
`first_payment_due_date`	datetime	No	≥ `fund_date` if present	Null while still unknown
`original_principal`	float	No	> 0 if present	Original funded balance
`current_balance`	float	No	≥ 0 if present	Current outstanding balance
`term_months`	float	No	1 – 360 if present	Stored permissively to tolerate nullable source data
`annual_rate`	float	No	0.0 – 1.0 if present	Needed for `loan_snapshot_to_portfolio` and lifetime loss forecasting
`fico_score`	float	No	300 – 850 if present	Passed through to portfolio / segmentation views
`dpd`	float	No	≥ 0 if present	Days past due; nulled for not-yet-due loans during normalization
`chargeoff_date`	datetime	No	Between `fund_date` and `as_of_date` if present	Reliable terminal loss date
`payoff_date`	datetime	No	Between `fund_date` and `as_of_date` if present	Reliable terminal payoff date
`source_status`	str	No	Source-specific	Preserved raw source status label

Derived state columns added by the contract layer:

first_payment_due_date_is_imputed
terminal_state: active, charged_off, paid_off
payment_due_state: not_yet_due, due, due_status_unknown, terminal
delinquency_state: current, late_1_29, late_30_59, late_60_89, late_90_plus, unknown, or the terminal state
status: canonical coarse state derived from the fields above

The canonical snapshot status is:

Current
Delinquent
Paid Off
Charged Off

Important semantic rule:

dpd = 0 is not the same thing as "due-and-current" for a loan that has not reached its first due date yet. During normalization, pre-due rows are tagged as payment_due_state = not_yet_due, kept as canonical status = Current, and their dpd is nulled so downstream workflows cannot mistake them for due-and-current loans. The transition workflow then re-expands this distinction through transition_state.

If first_payment_due_date is missing, the contract imputes it as fund_date + 30 days, marks first_payment_due_date_is_imputed = True, and returns a metadata count of affected rows.
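
The imputation and pre-due rules can be illustrated with plain pandas. This is a sketch under stated assumptions, not the contract layer itself: it treats a row as pre-due when `as_of_date` is strictly before `first_payment_due_date`, and it only distinguishes `not_yet_due` from `due`.

```python
import pandas as pd

snap = pd.DataFrame({
    "loan_id": ["L001", "L002"],
    "fund_date": pd.to_datetime(["2024-03-20", "2023-10-01"]),
    "as_of_date": pd.to_datetime(["2024-04-01", "2024-04-01"]),
    "first_payment_due_date": pd.to_datetime([None, "2023-11-01"]),
    "dpd": [0.0, 0.0],
})

# Impute a missing first due date as fund_date + 30 days and flag the row.
missing = snap["first_payment_due_date"].isna()
snap["first_payment_due_date_is_imputed"] = missing
snap.loc[missing, "first_payment_due_date"] = (
    snap.loc[missing, "fund_date"] + pd.Timedelta(days=30)
)

# Assumption: "pre-due" means the extract date is before the first due date.
pre_due = snap["as_of_date"] < snap["first_payment_due_date"]
snap["payment_due_state"] = pre_due.map({True: "not_yet_due", False: "due"})

# Null dpd on pre-due rows so dpd = 0 cannot be read as due-and-current.
snap.loc[pre_due, "dpd"] = float("nan")

imputed_count = int(missing.sum())  # analogous to the metadata count
print(snap)
print("rows with imputed first_payment_due_date:", imputed_count)
```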

Column aliases

The loan snapshot contract accepts common aliases automatically:

Canonical name	Accepted aliases
`fund_date`	`origination_date`, `issue_date`, `issue_d`, `start_date`
`as_of_date`	`as_of`, `asof`, `snapshot_date`, `report_date`, `extract_date`, `max_report_date`
`first_payment_due_date`	`first_due_date`, `first_due_dt`, `first_payment_due`, `first_pmt_due_date`
`original_principal`	`original_balance`, `orig_bal`, `origination_amount`, `loan_amount`, `funded_amnt`, `amtloan`
`current_balance`	`principal_balance`, `outstanding_balance`, `balance`, `upb`, `current_principal`
`annual_rate`	`interest_rate`, `int_rate`, `rate`
`fico_score`	`fico`, `credit_score`, `fico_range_low`, `fico_range_high`
`chargeoff_date`	`charge_off_date`, `co_date`
`payoff_date`	`paid_off_date`, `paid_off_on`, `payoff_dt`
`source_status`	`loan_status`, `state`, or a raw `status` column from the source extract

Minimal example:

import pandas as pd

loan_snapshot = pd.DataFrame({
    "loan_id": ["L001", "L002", "L003"],
    "fund_date": ["2024-01-10", "2023-10-01", "2023-10-01"],
    "as_of_date": ["2024-04-01", "2024-04-01", "2024-04-01"],
    "first_payment_due_date": ["2024-02-10", "2023-11-01", "2023-11-01"],
    "current_balance": [9500.0, 7200.0, 0.0],
    "dpd": [0.0, 0.0, None],
    "chargeoff_date": [None, None, "2024-03-15"],
    "payoff_date": [None, None, None],
})

Programmatic contract usage:

from cranalytics.loan_snapshot import (
    add_first_payment_maturity,
    loan_snapshot_to_portfolio,
    normalize_loan_snapshot,
)
from cranalytics.loan_snapshot_contract import validate_loan_snapshot_input_contract

result = validate_loan_snapshot_input_contract(raw_df, as_of_date="2024-04-01")
print(result.issue_table)
print(result.metadata)
clean_df = result.df

clean_df = normalize_loan_snapshot(raw_df, as_of_date="2024-04-01")
portfolio_df = loan_snapshot_to_portfolio(clean_df)
flag_df = add_first_payment_maturity(clean_df)

Vintage (long format)

Used by: CurveFitter, smooth_vintage, run_validation_suite, run_backtest_sweeps

Each row is one (vintage, mob) observation. If you have a pivot table (vintages as rows, MOBs as columns) use create_vintage_triangle to reshape it first.
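
For readers who want to see what the long shape looks like relative to a pivot table, here is a pandas-only sketch using `melt`. It does not call `create_vintage_triangle`, whose signature is not documented on this page. The `wide` frame is hypothetical.

```python
import pandas as pd

# Hypothetical pivot: vintages as rows, MOBs as columns.
wide = pd.DataFrame(
    {1: [0.005, 0.007], 2: [0.011, 0.014], 3: [0.018, None]},
    index=pd.Index(["2023-Q1", "2023-Q2"], name="vintage_name"),
)

long_df = (
    wide.reset_index()
    .melt(id_vars="vintage_name", var_name="mob", value_name="cumulative_loss_rate")
    .dropna(subset=["cumulative_loss_rate"])  # drop MOBs not yet observed
    .astype({"mob": int})
    .sort_values(["vintage_name", "mob"])
    .reset_index(drop=True)
)
print(long_df)
```

Each surviving row is one (vintage, mob) observation, which is the shape the table below describes.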

Column	Type	Required	Constraints	Notes
`mob`	int	Yes	0 – 600	Months on book
`cumulative_loss_rate`	float	Yes	0.0 – 1.0	Must be non-decreasing within a vintage
`vintage_name`	str	Conditional	Non-null if segmented	e.g. `"2023-Q1"` — required for segmented analysis
`segment`	str	Conditional	Non-null if segmented	e.g. FICO band — required for segmented analysis

Note on strict mode: VintageDataSchema uses strict=True — extra columns will raise an error. Drop them before passing to curve fitting functions.

Minimal example (unsegmented):

import pandas as pd

vintage_df = pd.DataFrame({
    "mob":                  [1,     2,     3,     4,     5,     6],
    "cumulative_loss_rate": [0.005, 0.011, 0.018, 0.023, 0.027, 0.030],
})

Minimal example (segmented):

vintage_df = pd.DataFrame({
    "vintage_name":         ["2023-Q1"] * 6 + ["2023-Q2"] * 6,
    "mob":                  [1, 2, 3, 4, 5, 6] * 2,
    "cumulative_loss_rate": [0.005, 0.011, 0.018, 0.023, 0.027, 0.030,
                             0.007, 0.014, 0.022, 0.028, 0.033, 0.037],
    "segment":              ["prime"] * 6 + ["subprime"] * 6,
})

Common mistakes:

Passing a wide pivot table directly — rows must be individual (vintage, mob) observations
Cumulative loss rates that decrease at any MOB — validation warns when > 5% of vintages violate monotonicity
Including extra columns (e.g. origination_count) — the strict schema will reject them; drop first
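
A quick pre-check for the last two mistakes, sketched with pandas only. The column list is taken from the table above. The monotonicity test here flags any decrease, which is an illustrative choice and not the library's 5% warning rule.

```python
import pandas as pd

df = pd.DataFrame({
    "vintage_name": ["2023-Q1"] * 3 + ["2023-Q2"] * 3,
    "mob": [1, 2, 3] * 2,
    "cumulative_loss_rate": [0.005, 0.011, 0.018, 0.007, 0.006, 0.022],
    "origination_count": [100] * 6,  # extra column the strict schema would reject
})

# Keep only the columns named in the schema table.
SCHEMA_COLUMNS = ["vintage_name", "mob", "cumulative_loss_rate", "segment"]
trimmed = df[[c for c in SCHEMA_COLUMNS if c in df.columns]]

# Find vintages whose cumulative loss rate decreases at some MOB.
ordered = trimmed.sort_values(["vintage_name", "mob"])
step = ordered.groupby("vintage_name")["cumulative_loss_rate"].diff()
bad_vintages = sorted(ordered.loc[step < 0, "vintage_name"].unique())

print("columns kept:", list(trimmed.columns))
print("non-monotonic vintages:", bad_vintages)
```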

Rollforward (monthly aggregated)

Used by: fit_flow_hazard_curves, forecast_balance_flows, run_rollforward_workflow, generate_rollforward_readiness_report

Each row is one (segment, month on book) observation. Columns are aggregate dollar flows for that cohort age.

Column	Type	Required	Constraints	Notes
`segment_id`	str	Yes	Non-null, non-blank	Cohort or bucket label
`month_on_book`	int	Yes	≥ 0	MOB (0-indexed is fine)
`payments`	float	Yes	≥ 0	Dollar payments received
`chargeoffs`	float	Yes	≥ 0	Dollar charge-offs
`outstanding_balance`	float	Yes	> 0	Beginning-of-period balance
`amtloan`	float	No	> 0	Original origination balance per segment; used to compute CGCO%; falls back to first observed `outstanding_balance` if absent

Cross-row constraints:

payments + chargeoffs must not exceed outstanding_balance in any row
No duplicate (segment_id, month_on_book) pairs
MOB series should be contiguous within each segment (gaps produce a warning, not an error)
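
The three constraints can be checked ahead of time with plain pandas. This is an illustrative sketch, not the contract implementation, and the sample frame is deliberately broken in three places.

```python
import pandas as pd

rf = pd.DataFrame({
    "segment_id": ["prime"] * 4,
    "month_on_book": [1, 2, 2, 4],  # duplicate MOB 2, missing MOB 3
    "payments": [8200.0, 7900.0, 7900.0, 80000.0],
    "chargeoffs": [100.0, 130.0, 130.0, 170.0],
    "outstanding_balance": [100000.0, 91700.0, 91700.0, 75920.0],
})

# 1. Flows must not exceed the beginning-of-period balance.
overflow = rf["payments"] + rf["chargeoffs"] > rf["outstanding_balance"]

# 2. No duplicate (segment_id, month_on_book) pairs.
duplicates = rf.duplicated(["segment_id", "month_on_book"], keep=False)


# 3. MOB series should be contiguous within each segment.
def has_gap(mobs: pd.Series) -> bool:
    observed = sorted(set(int(m) for m in mobs))
    return observed != list(range(observed[0], observed[-1] + 1))


gaps = rf.groupby("segment_id")["month_on_book"].apply(has_gap)

print(rf.assign(overflow=overflow, duplicate=duplicates))
print(gaps)
```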

Column aliases

Rollforward data accepts common alternative column names automatically — no renaming needed:

Canonical name	Accepted aliases
`segment_id`	`segment`, `segmentid`, `segment_name`, `bucket`
`month_on_book`	`mob`, `month`, `months_on_book`, `period`
`payments`	`payment`, `principal_payment`, `pmt`
`chargeoffs`	`chargeoff`, `charge_off`, `co`
`outstanding_balance`	`balance`, `outstanding`, `upb`, `current_balance`
`amtloan`	`original_principal`, `origination_amount`, `loan_amount`

Alias resolution is case-insensitive and strips punctuation. Outstanding Balance and outstanding_balance both resolve correctly.

Minimal example:

import pandas as pd

rollforward_df = pd.DataFrame({
    "segment_id":          ["prime"] * 6,
    "month_on_book":       [1, 2, 3, 4, 5, 6],
    "payments":            [8200, 7900, 7600, 7300, 7000, 6700],
    "chargeoffs":          [100,  130,  150,  170,  190,  210],
    "outstanding_balance": [100000, 91700, 83670, 75920, 68450, 61260],
})

Common mistakes:

Balance as beginning-of-next-period rather than beginning-of-current-period — if payments + chargeoffs > outstanding_balance, the contract will reject the rows
Segment IDs that are floats or integers — coerced to string, but 1.0 and 1 become different strings; standardize first
MOB gaps (e.g. missing MOB 4) produce a warning, not an error; the model will interpolate

Wide CGCO (loan-level)

Used by: compute_cgco_curve_wide, compute_final_cgco_wide, load_wide_vintage_data

Each row is one loan. The library computes cohort curves from the raw loan records — no pre-aggregation needed.

Column	Type	Required	Constraints	Notes
`loan_id`	str	Yes	Unique, non-null
`vintage_date`	datetime	Yes	≤ `max_report_date`	Origination date
`loan_amount`	float	Yes	> 0	Original disbursement amount
`max_report_date`	datetime	Yes	≥ `vintage_date`	Last observation date for this loan
`charge_off_date`	datetime	No	≥ `vintage_date` if present	Null for loans that have not charged off
`charge_off_amount`	float	Yes	≥ 0, ≤ `loan_amount`	Set to `0.0` for loans that have not charged off

Cross-row constraints:

charge_off_amount must be 0 for any row where charge_off_date is null
charge_off_date must be on or after vintage_date when present
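
Both constraints can be checked with a pair of boolean masks. The sketch below uses pandas only and a hypothetical frame in which one loan violates the first rule; it is not the library's validator.

```python
import pandas as pd

wide = pd.DataFrame({
    "loan_id": ["L001", "L002", "L003"],
    "vintage_date": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-02-01"]),
    "loan_amount": [10000.0, 15000.0, 8000.0],
    "charge_off_date": pd.to_datetime([None, "2023-06-15", None]),
    "charge_off_amount": [0.0, 14200.0, 500.0],  # L003 has an amount but no date
})

# Rule 1: a null charge-off date requires a zero charge-off amount.
amount_without_date = wide["charge_off_date"].isna() & (wide["charge_off_amount"] != 0)

# Rule 2: a charge-off date, when present, must not precede the vintage date.
date_before_vintage = wide["charge_off_date"].notna() & (
    wide["charge_off_date"] < wide["vintage_date"]
)

violations = wide.loc[amount_without_date | date_before_vintage, "loan_id"].tolist()
print("loans violating the constraints:", violations)
```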

Minimal example:

import pandas as pd

wide_df = pd.DataFrame({
    "loan_id":          ["L001",       "L002",       "L003"],
    "vintage_date":     ["2022-01-01", "2022-01-01", "2022-02-01"],
    "loan_amount":      [10000.0,      15000.0,      8000.0],
    "max_report_date":  ["2024-01-01", "2024-01-01", "2024-01-01"],
    "charge_off_date":  [None,         "2023-06-15", None],
    "charge_off_amount":[0.0,          14200.0,      0.0],
})

Common mistakes:

Forgetting to set charge_off_amount = 0.0 for non-charged-off loans — validation will reject rows with null charge-off date and non-zero amount
Using the same date for vintage_date and max_report_date on a current loan — valid, but produces a 0-MOB observation with no curve

Transition Matrix

Used by: forecast_lifetime_loss, forecast_portfolio_states, simulate_portfolio_cashflows

The transition matrix is a square pd.DataFrame where both the index and columns are state names. Each cell [i, j] is the probability of moving from state i to state j in one period.

forecast_lifetime_loss and summarize_lifetime_loss also accept:

a transition ledger with loan_id, period, and status
a loan history panel with loan_id, fund_date, and as_of_date

In those paths, cranalytics estimates this square matrix internally before applying the baseline loss forecast. forecast_portfolio_states still requires the explicit square matrix shown below.
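
To show how a ledger relates to the square matrix, here is a generic counting estimator in pandas: pair each status with the next observed status for the same loan, tabulate, and row-normalize. It illustrates the idea only and is not the estimator cranalytics uses internally. Treating states with no observed exits as absorbing is an assumption.

```python
import pandas as pd

ledger = pd.DataFrame({
    "loan_id": ["A"] * 3 + ["B"] * 3,
    "period": [1, 2, 3] * 2,
    "status": ["Current", "Current", "Delinquent",
               "Current", "Delinquent", "Charged Off"],
})

# Pair each observation with the same loan's next-period status.
ledger = ledger.sort_values(["loan_id", "period"])
ledger["next_status"] = ledger.groupby("loan_id")["status"].shift(-1)
pairs = ledger.dropna(subset=["next_status"])

states = ["Current", "Delinquent", "Charged Off"]
counts = pd.crosstab(pairs["status"], pairs["next_status"]).reindex(
    index=states, columns=states, fill_value=0
)

# Assumption: a state with no observed outgoing transitions is absorbing.
for state in states:
    if counts.loc[state].sum() == 0:
        counts.loc[state, state] = 1

# Row-normalize counts into one-period transition probabilities.
matrix = counts.div(counts.sum(axis=1), axis=0)
print(matrix)
```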

Requirements:

Square: same states as index and columns
All values in [0, 1]
Each row sums to 1.0 (tolerance: ±0.001)
Must include a loss terminal state named Charged Off or legacy Default
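
These requirements are easy to pre-check by hand. The helper below is a hypothetical sketch (`check_transition_matrix` is not a cranalytics function) that returns a list of problems, empty when all four requirements hold. It compares index and columns as sets, which is an assumption about what "same states" means.

```python
import numpy as np
import pandas as pd


def check_transition_matrix(matrix: pd.DataFrame, tol: float = 1e-3) -> list:
    """Hypothetical pre-check mirroring the four requirements above."""
    problems = []
    if matrix.shape[0] != matrix.shape[1] or set(matrix.index) != set(matrix.columns):
        problems.append("matrix must be square with the same states on both axes")
    values = matrix.to_numpy(dtype=float)
    if ((values < 0) | (values > 1)).any():
        problems.append("all values must lie in [0, 1]")
    if not np.allclose(values.sum(axis=1), 1.0, rtol=0, atol=tol):
        problems.append("each row must sum to 1.0 within the tolerance")
    if not {"Charged Off", "Default"} & set(matrix.index):
        problems.append("a loss state named Charged Off or Default is required")
    return problems


states = ["Current", "Delinquent", "Charged Off"]
good = pd.DataFrame(
    [[0.92, 0.06, 0.02], [0.30, 0.55, 0.15], [0.00, 0.00, 1.00]],
    index=states, columns=states,
)
bad = good.copy()
bad.loc["Current", "Current"] = 0.82  # first row now sums to 0.90

print(check_transition_matrix(good))
print(check_transition_matrix(bad))
```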

Minimal example:

import pandas as pd

states = ["Current", "Delinquent", "Charged Off"]
matrix = pd.DataFrame(
    [
        [0.92, 0.06, 0.02],   # Current →
        [0.30, 0.55, 0.15],   # Delinquent →
        [0.00, 0.00, 1.00],   # Charged Off → (absorbing)
    ],
    index=states,
    columns=states,
)
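
As a worked illustration of what each cell means, the sketch below pushes a state distribution through the same matrix for two periods using ordinary matrix multiplication. This is generic Markov-chain arithmetic, not a call into `forecast_portfolio_states`.

```python
import pandas as pd

states = ["Current", "Delinquent", "Charged Off"]
matrix = pd.DataFrame(
    [
        [0.92, 0.06, 0.02],
        [0.30, 0.55, 0.15],
        [0.00, 0.00, 1.00],
    ],
    index=states,
    columns=states,
)

# Start with the whole balance in Current.
start = pd.Series([1.0, 0.0, 0.0], index=states)

# One period: each destination share is the sum over sources of share * P[i, j].
after_one = start.dot(matrix)
after_two = after_one.dot(matrix)

print(after_one.round(4).to_dict())
print(after_two.round(4).to_dict())
```

After two periods the Charged Off share is 0.92 × 0.02 + 0.06 × 0.15 + 0.02 × 1.00 = 0.0474, and the three shares still sum to 1.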

Load the built-in sample matrix:

from cranalytics import load_sample_transition_matrix

matrix = load_sample_transition_matrix()
# States: Current, Delinquent, Charged Off

Common mistake — status mismatch with portfolio:

make_mock_portfolio() generates loans with four statuses, including Paid Off. The sample transition matrix models only the Current, Delinquent, and Charged Off states, so Paid Off rows have no matching state. Filter them out before passing the portfolio:

portfolio = portfolio[portfolio["status"].isin(["Current", "Delinquent", "Charged Off"])]

Validating your data programmatically

from cranalytics.validation import (
    validate_loan_history,
    validate_loan_snapshot,
    validate_portfolio,
    validate_vintage,
    validate_flow_input,
    validate_wide_vintage,
    validate_transition_matrix,
)

# Each raises pandera.errors.SchemaErrors on failure (lazy=True collects all errors)
validated_df = validate_portfolio(df)
validated_df = validate_loan_history(df)
validated_df = validate_loan_snapshot(df)
validated_df = validate_vintage(df)
validated_df = validate_flow_input(df)
validated_df = validate_wide_vintage(df)
validated_matrix = validate_transition_matrix(matrix_df)

For Rollforward data with alias resolution and a richer issue report:

from cranalytics.rollforward._contract import validate_rollforward_input_contract

result = validate_rollforward_input_contract(df)
print(result.issue_table)   # severity, issue_code, message per issue
clean_df = result.df        # renamed and coerced DataFrame
