Input Data Contracts
This page defines the column names, types, and value constraints for every input shape in cranalytics. All validation is enforced at runtime via Pandera schemas in src/cranalytics/validation.py.
There are six primary input shapes — pick the one that matches your workflow:
| Shape | Used by | Column count |
|---|---|---|
| Portfolio | Lifetime loss forecasting, FICO segmentation, LGD | 6 required + 1 optional |
| Loan History | History normalization, predictive target derivation from snapshots, lifetime loss forecasting from history | 3 required + 12 optional |
| Loan Snapshot | Loan-level normalization, snapshot-to-survival prep, snapshot-based maturity metadata | 3 required + 10 optional |
| Vintage (long format) | Vintage curve fitting, backtest, smoothing | 2 required + 2 optional |
| Rollforward | Rollforward workflow, hazard curve fitting | 5 required + 1 optional |
| Wide CGCO | Wide-format CGCO curve computation | 5 required + 1 optional |
Lifetime loss forecasting also requires transition input. You can pass any one of
a transition matrix, a transition ledger with loan_id, period, and status, or a
loan history panel with loan_id, fund_date, and as_of_date. The square matrix
remains the canonical shape and is still required for forecast_portfolio_states.
Portfolio
Used by: forecast_lifetime_loss, simulate_portfolio_cashflows, segment_fico, calculate_lgd
| Column | Type | Required | Constraints | Notes |
|---|---|---|---|---|
| loan_id | str | Yes | Unique, non-null | Any string identifier |
| principal | float | Yes | > 0 | Outstanding balance, not original |
| annual_rate | float | Yes | 0.0 – 1.0 | Decimal, not percent. 6% → 0.06 |
| term_months | int | Yes | 1 – 360 | Original loan term |
| start_date | datetime | Yes | Not in the future | Origination date; string is coerced |
| status | str | Yes | See below | Current loan status |
| fico_score | int | No | 300 – 850 | Optional; required for score-based segmentation |
Accepted status values:
Current, Delinquent, Charged Off, Paid Off, Late-30, Late-60, Late-90, Default (legacy)
Validate from the CLI:
cranalytics validate-data portfolio.csv
cranalytics validate-data portfolio.csv --show-schema # prints column list + status values
Minimal example:
import pandas as pd
portfolio = pd.DataFrame({
"loan_id": ["L001", "L002", "L003"],
"principal": [10000.0, 25000.0, 5000.0],
"annual_rate": [0.065, 0.12, 0.089], # decimals
"term_months": [60, 36, 48],
"start_date": ["2023-01-15", "2022-06-01", "2023-03-20"],
"status": ["Current", "Delinquent", "Charged Off"],
})
Common mistakes:
- annual_rate as a percentage (e.g. 6.5 instead of 0.065): validation rejects values > 1.0
- start_date as an unrecognized string format: pass ISO 8601 (YYYY-MM-DD) and let coercion handle it
- status values with inconsistent casing ("current" vs "Current"): values must match exactly
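The three mistakes above can be caught before validation with a small pandas pre-cleaning pass. This is only a sketch: preclean_portfolio is a hypothetical helper, not part of cranalytics, and treating every rate above 1.0 as a percent is an assumption that you should confirm against your own data.

```python
import pandas as pd

# Accepted values copied from the list above; "Default" is the legacy label.
ACCEPTED_STATUSES = [
    "Current", "Delinquent", "Charged Off", "Paid Off",
    "Late-30", "Late-60", "Late-90", "Default",
]


def preclean_portfolio(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: repair the three common mistakes before validation."""
    out = df.copy()
    # Assumption: any rate above 1.0 was entered as a percent (6.5 means 6.5%).
    as_percent = out["annual_rate"] > 1.0
    out.loc[as_percent, "annual_rate"] = out.loc[as_percent, "annual_rate"] / 100.0
    # Parse strictly as ISO 8601 so malformed dates fail here, not downstream.
    out["start_date"] = pd.to_datetime(out["start_date"], format="%Y-%m-%d")
    # Map case-insensitively onto the exact accepted spelling; unknowns become NaN.
    lookup = {s.lower(): s for s in ACCEPTED_STATUSES}
    out["status"] = out["status"].str.strip().str.lower().map(lookup)
    return out


raw = pd.DataFrame({
    "annual_rate": [6.5, 0.12],
    "start_date": ["2023-01-15", "2022-06-01"],
    "status": ["current", " Charged off "],
})
clean = preclean_portfolio(raw)
```

Statuses that do not match any accepted value come back as NaN, so they can be reviewed rather than silently passed through.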
Loan History (one row per loan per as_of_date)
Used by: cranalytics.loan_history.normalize_loan_history, cranalytics.loan_history.loan_history_to_target_frame, cranalytics.loan_history.loan_history_to_transition_frame, cranalytics.loan_history.build_targets_from_loan_history, cranalytics.transition.estimate_cohort_matrix_from_loan_history, forecast_lifetime_loss
This is the canonical loan-history shape. Each row is one loan at one observation date. It is the skinny waist for historical workflows before branching into predictive target construction or transition estimation.
| Column | Type | Required | Constraints | Notes |
|---|---|---|---|---|
| loan_id | str | Yes | Non-null | Repeats across time |
| fund_date | datetime | Yes | ≤ as_of_date | Canonical origination name |
| as_of_date | datetime | Yes | Non-null | One row per loan per observation date |
| month_on_book | float | No | ≥ 0 if present | Derived from fund_date and as_of_date when missing |
| first_payment_due_date | datetime | No | ≥ fund_date if present | Used to distinguish not_yet_due from due/current |
| original_principal | float | No | > 0 if present | Needed for some chargeoff targets |
| current_balance | float | No | ≥ 0 if present | Optional history balance |
| annual_rate | float | No | 0.0 – 1.0 if present | Optional history feature |
| fico_score | float | No | 300 – 850 if present | Optional history feature |
| term_months | float | No | 1 – 360 if present | Optional contractual term |
| dpd | float | No | ≥ 0 if present | Nulled for not-yet-due rows during normalization |
| chargeoff_amount | float | No | ≥ 0 if present | Period chargeoff amount; required for chargeoff-amount targets |
| chargeoff_date | datetime | No | Between fund_date and as_of_date if present | Terminal loss date |
| payoff_date | datetime | No | Between fund_date and as_of_date if present | Terminal payoff date |
| source_status | str | No | Source-specific | Preserved raw source status label |
Derived columns added by the contract layer:
- first_payment_due_date_is_imputed
- terminal_state
- payment_due_state
- delinquency_state
- status
The canonical status is intentionally coarse:
- Current
- Delinquent
- Paid Off
- Charged Off
Pre-due rows still normalize to status = Current; the pre-due distinction is carried by payment_due_state and, for transition estimation, by transition_state.
Column aliases
The loan history contract accepts the same common aliases as the snapshot contract, plus:
| Canonical name | Accepted aliases |
|---|---|
| month_on_book | mob, months_on_book, month |
| chargeoff_amount | charge_off_amount, co_amount, net_chargeoff_amount |
| as_of_date | snapshot_date, report_date, extract_date, period |
Important predictive-target rule
When first_payment_due_date is present, the predictive adapter uses a payment-based mob:
- rows before first due date map to mob = 0
- the first due observation maps to mob = 1
That prevents pre-due rows from being mistaken for due-and-current first-payment observations.
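The rule above can be sketched in plain pandas. The function below is hypothetical and is not the predictive adapter itself; the two cases stated above (mob = 0 before the first due date, mob = 1 at the first due observation) come from this page, while counting later observations in whole months since the first due date is an assumption made for illustration.

```python
import pandas as pd


def payment_based_mob(as_of_date: pd.Series, first_due: pd.Series) -> pd.Series:
    """Hypothetical sketch of a payment-based mob."""
    as_of = pd.to_datetime(as_of_date)
    due = pd.to_datetime(first_due)
    # Whole calendar months elapsed since the first due date.
    months = (as_of.dt.year - due.dt.year) * 12 + (as_of.dt.month - due.dt.month)
    # A month only counts once the due day-of-month has been reached.
    months = months - (as_of.dt.day < due.dt.day).astype(int)
    # First due observation is mob 1; anything before the first due date is mob 0.
    return (months + 1).where(as_of >= due, 0)


obs = pd.DataFrame({
    "as_of_date": ["2024-01-31", "2024-02-10", "2024-03-09", "2024-03-10"],
    "first_payment_due_date": ["2024-02-10"] * 4,
})
obs["mob"] = payment_based_mob(obs["as_of_date"], obs["first_payment_due_date"])
```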
Important transition-estimation rule
The history contract also derives a transition_state column with roll-rate-friendly labels:
- Not Yet Due
- Current
- 1-29 DPD
- 30 DPD
- 60 DPD
- 90+ DPD
- Paid Off
- Charged Off
loan_history_to_transition_frame() adapts canonical loan history into the generic (loan_id, period, status) shape expected by estimate_cohort_matrix().
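To make the roll-rate labels concrete, here is a rough sketch of bucketing dpd into the active transition_state labels listed above. The helper name and the bucket edges (30 DPD covering 30 to 59 days, 60 DPD covering 60 to 89) are assumptions inferred from the delinquency_state names elsewhere on this page, and the terminal states (Paid Off, Charged Off) are deliberately left out.

```python
import numpy as np
import pandas as pd


def bucket_transition_state(dpd: pd.Series, not_yet_due: pd.Series) -> pd.Series:
    """Hypothetical sketch: map days past due to roll-rate labels."""
    # Right-closed bins: (-inf, 0], (0, 29], (29, 59], (59, 89], (89, inf).
    bins = [-np.inf, 0, 29, 59, 89, np.inf]
    labels = ["Current", "1-29 DPD", "30 DPD", "60 DPD", "90+ DPD"]
    state = pd.cut(dpd, bins=bins, labels=labels).astype(object)
    # Pre-due rows have a null dpd, so they are labelled from the flag instead.
    return state.where(~not_yet_due, "Not Yet Due")


dpd = pd.Series([np.nan, 0, 15, 30, 75, 120])
not_yet_due = pd.Series([True, False, False, False, False, False])
states = bucket_transition_state(dpd, not_yet_due)
```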
Contract metadata
validate_loan_history_input_contract() also returns summary metadata counts for:
- rows with imputed first_payment_due_date
- loans that normalize to Charged Off without chargeoff_date
- loans that normalize to Paid Off without payoff_date
Programmatic usage
from cranalytics.loan_history import (
build_targets_from_loan_history,
loan_history_to_transition_frame,
normalize_loan_history,
)
from cranalytics.loan_history_contract import validate_loan_history_input_contract
from cranalytics.transition import estimate_cohort_matrix_from_loan_history
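# Validate through the contract layer and inspect the issue report and metadata
result = validate_loan_history_input_contract(raw_history_df)
print(result.issue_table)
print(result.metadata)
history_df = result.df
# Or normalize directly from the raw frame (this reassigns history_df)
history_df = normalize_loan_history(raw_history_df)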
targets_df = build_targets_from_loan_history(
history_df,
targets=["fpd_flag", "fpf30_flag"],
)
transition_frame = loan_history_to_transition_frame(history_df)
matrix = estimate_cohort_matrix_from_loan_history(history_df)
Loan Snapshot (one row per loan)
Used by: cranalytics.loan_snapshot.normalize_loan_snapshot, cranalytics.loan_snapshot.loan_snapshot_to_portfolio, cranalytics.loan_snapshot.add_first_payment_maturity, cranalytics.loan_snapshot.loan_snapshot_to_survival_data
This is the new canonical one-row-per-loan shape. It is meant to be the skinny waist for loan-level workflows before branching into survival views, maturity-aware flags, or later history transforms.
| Column | Type | Required | Constraints | Notes |
|---|---|---|---|---|
| loan_id | str | Yes | Unique, non-null | One row per loan |
| fund_date | datetime | Yes | ≤ as_of_date | Use fund_date as the canonical origination name |
| as_of_date | datetime | Yes | Single file-level value | Every row in the snapshot should share the same extract date |
| first_payment_due_date | datetime | No | ≥ fund_date if present | Null while still unknown |
| original_principal | float | No | > 0 if present | Original funded balance |
| current_balance | float | No | ≥ 0 if present | Current outstanding balance |
| term_months | float | No | 1 – 360 if present | Stored permissively to tolerate nullable source data |
| annual_rate | float | No | 0.0 – 1.0 if present | Needed for loan_snapshot_to_portfolio and lifetime loss forecasting |
| fico_score | float | No | 300 – 850 if present | Passed through to portfolio / segmentation views |
| dpd | float | No | ≥ 0 if present | Days past due; nulled for not-yet-due loans during normalization |
| chargeoff_date | datetime | No | Between fund_date and as_of_date if present | Reliable terminal loss date |
| payoff_date | datetime | No | Between fund_date and as_of_date if present | Reliable terminal payoff date |
| source_status | str | No | Source-specific | Preserved raw source status label |
Derived state columns added by the contract layer:
- first_payment_due_date_is_imputed
- terminal_state: active, charged_off, paid_off
- payment_due_state: not_yet_due, due, due_status_unknown, terminal
- delinquency_state: current, late_1_29, late_30_59, late_60_89, late_90_plus, unknown, or the terminal state
- status: canonical coarse state derived from the fields above
The canonical snapshot status is:
- Current
- Delinquent
- Paid Off
- Charged Off
Important semantic rule:
dpd = 0 is not the same thing as "due-and-current" for a loan that has not reached its first due date yet. During normalization, pre-due rows are tagged as payment_due_state = not_yet_due, kept as canonical status = Current, and their dpd is nulled so downstream workflows cannot mistake them for due-and-current loans. The transition workflow then re-expands this distinction through transition_state.
If first_payment_due_date is missing, the contract imputes it as fund_date + 30 days, marks first_payment_due_date_is_imputed = True, and returns a metadata count of affected rows.
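The imputation and pre-due handling described above can be approximated in a few lines of pandas. This is a sketch under stated assumptions, not the contract layer itself: tag_pre_due is a hypothetical name, "pre-due" is taken to mean as_of_date strictly before first_payment_due_date, and only the not_yet_due and due states are modelled.

```python
import pandas as pd


def tag_pre_due(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the imputation and pre-due tagging rules."""
    out = df.copy()
    for col in ("fund_date", "as_of_date", "first_payment_due_date"):
        out[col] = pd.to_datetime(out[col])
    # Impute a missing first due date as fund_date + 30 days and flag the row.
    imputed = out["first_payment_due_date"].isna()
    out["first_payment_due_date"] = out["first_payment_due_date"].fillna(
        out["fund_date"] + pd.Timedelta(days=30)
    )
    out["first_payment_due_date_is_imputed"] = imputed
    # Assumption: a row is pre-due when it is observed before the first due date.
    pre_due = out["as_of_date"] < out["first_payment_due_date"]
    out["payment_due_state"] = pre_due.map({True: "not_yet_due", False: "due"})
    # Null dpd on pre-due rows so dpd = 0 cannot be read as due-and-current.
    out.loc[pre_due, "dpd"] = float("nan")
    return out


snapshot = pd.DataFrame({
    "loan_id": ["L001", "L002"],
    "fund_date": ["2024-03-20", "2023-10-01"],
    "as_of_date": ["2024-04-01", "2024-04-01"],
    "first_payment_due_date": [None, "2023-11-01"],
    "dpd": [0.0, 0.0],
})
tagged = tag_pre_due(snapshot)
imputed_rows = int(tagged["first_payment_due_date_is_imputed"].sum())
```

Here L001 has no first due date, so it is imputed, tagged not_yet_due, and loses its dpd value, while L002 keeps dpd = 0 as a genuine due-and-current observation.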
Column aliases
The loan snapshot contract accepts common aliases automatically:
| Canonical name | Accepted aliases |
|---|---|
| fund_date | origination_date, issue_date, issue_d, start_date |
| as_of_date | as_of, asof, snapshot_date, report_date, extract_date, max_report_date |
| first_payment_due_date | first_due_date, first_due_dt, first_payment_due, first_pmt_due_date |
| original_principal | original_balance, orig_bal, origination_amount, loan_amount, funded_amnt, amtloan |
| current_balance | principal_balance, outstanding_balance, balance, upb, current_principal |
| annual_rate | interest_rate, int_rate, rate |
| fico_score | fico, credit_score, fico_range_low, fico_range_high |
| chargeoff_date | charge_off_date, co_date |
| payoff_date | paid_off_date, paid_off_on, payoff_dt |
| source_status | loan_status, state, or a raw status column from the source extract |
Minimal example:
import pandas as pd
loan_snapshot = pd.DataFrame({
"loan_id": ["L001", "L002", "L003"],
"fund_date": ["2024-01-10", "2023-10-01", "2023-10-01"],
"as_of_date": ["2024-04-01", "2024-04-01", "2024-04-01"],
"first_payment_due_date": ["2024-02-10", "2023-11-01", "2023-11-01"],
"current_balance": [9500.0, 7200.0, 0.0],
"dpd": [0.0, 0.0, None],
"chargeoff_date": [None, None, "2024-03-15"],
"payoff_date": [None, None, None],
})
Programmatic contract usage:
from cranalytics.loan_snapshot import (
add_first_payment_maturity,
loan_snapshot_to_portfolio,
normalize_loan_snapshot,
)
from cranalytics.loan_snapshot_contract import validate_loan_snapshot_input_contract
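# Validate through the contract layer and inspect the issue report and metadata
result = validate_loan_snapshot_input_contract(raw_df, as_of_date="2024-04-01")
print(result.issue_table)
print(result.metadata)
clean_df = result.df
# Or normalize directly from the raw frame (this reassigns clean_df)
clean_df = normalize_loan_snapshot(raw_df, as_of_date="2024-04-01")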
portfolio_df = loan_snapshot_to_portfolio(clean_df)
flag_df = add_first_payment_maturity(clean_df)
Vintage (long format)
Used by: CurveFitter, smooth_vintage, run_validation_suite, run_backtest_sweeps
Each row is one (vintage, mob) observation. If you have a pivot table (vintages as rows, MOBs as columns) use create_vintage_triangle to reshape it first.
| Column | Type | Required | Constraints | Notes |
|---|---|---|---|---|
| mob | int | Yes | 0 – 600 | Months on book |
| cumulative_loss_rate | float | Yes | 0.0 – 1.0 | Must be non-decreasing within a vintage |
| vintage_name | str | Conditional | Non-null if segmented | e.g. "2023-Q1"; required for segmented analysis |
| segment | str | Conditional | Non-null if segmented | e.g. FICO band; required for segmented analysis |
Note on strict mode: VintageDataSchema uses strict=True — extra columns will raise an error. Drop them before passing to curve fitting functions.
Minimal example (unsegmented):
import pandas as pd
vintage_df = pd.DataFrame({
"mob": [1, 2, 3, 4, 5, 6],
"cumulative_loss_rate": [0.005, 0.011, 0.018, 0.023, 0.027, 0.030],
})
Minimal example (segmented):
vintage_df = pd.DataFrame({
"vintage_name": ["2023-Q1"] * 6 + ["2023-Q2"] * 6,
"mob": [1, 2, 3, 4, 5, 6] * 2,
"cumulative_loss_rate": [0.005, 0.011, 0.018, 0.023, 0.027, 0.030,
0.007, 0.014, 0.022, 0.028, 0.033, 0.037],
"segment": ["prime"] * 6 + ["subprime"] * 6,
})
Common mistakes:
- Passing a wide pivot table directly — rows must be individual (vintage, mob) observations
- Cumulative loss rates that decrease at any MOB — validation warns when > 5% of vintages violate monotonicity
- Including extra columns (e.g. origination_count): the strict schema will reject them; drop them first
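As an illustration of the three points above, the sketch below reshapes a small pivot table to long format, checks monotonicity per vintage, and keeps only schema columns. It uses pandas melt as a stand-in because the signature of create_vintage_triangle is not shown on this page; the variable names are illustrative.

```python
import pandas as pd

# A wide pivot: vintages as rows, MOBs as columns.
wide = pd.DataFrame(
    {1: [0.005, 0.007], 2: [0.011, 0.014], 3: [0.018, 0.022]},
    index=pd.Index(["2023-Q1", "2023-Q2"], name="vintage_name"),
)

# Reshape to one row per (vintage, mob) observation.
long_df = (
    wide.reset_index()
    .melt(id_vars="vintage_name", var_name="mob", value_name="cumulative_loss_rate")
    .dropna(subset=["cumulative_loss_rate"])  # unobserved cells in a triangle
    .astype({"mob": int})
    .sort_values(["vintage_name", "mob"])
    .reset_index(drop=True)
)

# Non-decreasing check within each vintage (ties are allowed).
is_monotone = long_df.groupby("vintage_name")["cumulative_loss_rate"].apply(
    lambda s: s.is_monotonic_increasing
)

# Keep only the schema columns so a strict schema does not reject extras.
long_df = long_df[["vintage_name", "mob", "cumulative_loss_rate"]]
```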
Rollforward (monthly aggregated)
Used by: fit_flow_hazard_curves, forecast_balance_flows, run_rollforward_workflow, generate_rollforward_readiness_report
Each row is one (segment, month on book) observation. Columns are aggregate dollar flows for that cohort age.
| Column | Type | Required | Constraints | Notes |
|---|---|---|---|---|
| segment_id | str | Yes | Non-null, non-blank | Cohort or bucket label |
| month_on_book | int | Yes | ≥ 0 | MOB (0-indexed is fine) |
| payments | float | Yes | ≥ 0 | Dollar payments received |
| chargeoffs | float | Yes | ≥ 0 | Dollar charge-offs |
| outstanding_balance | float | Yes | > 0 | Beginning-of-period balance |
| amtloan | float | No | > 0 | Original origination balance per segment; used to compute CGCO%; falls back to first observed outstanding_balance if absent |
Row-level and cross-row constraints:
- payments + chargeoffs must not exceed outstanding_balance in any row
- No duplicate (segment_id, month_on_book) pairs
- MOB series should be contiguous within each segment (gaps produce a warning, not an error)
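The three constraints above are easy to pre-check in pandas before running the contract. The function below is a hypothetical sketch that only counts offending rows; it does not reproduce the issue codes or severities the contract reports.

```python
import pandas as pd


def rollforward_prechecks(df: pd.DataFrame) -> dict:
    """Hypothetical sketch: count rows that break each constraint."""
    # Flows larger than the beginning-of-period balance.
    over = df["payments"] + df["chargeoffs"] > df["outstanding_balance"]
    # Repeated (segment_id, month_on_book) keys.
    dup = df.duplicated(["segment_id", "month_on_book"], keep=False)
    # A jump of more than one month between consecutive rows of a segment.
    ordered = df.sort_values(["segment_id", "month_on_book"])
    gaps = ordered.groupby("segment_id")["month_on_book"].diff() > 1
    return {
        "flows_exceed_balance": int(over.sum()),
        "duplicate_keys": int(dup.sum()),
        "mob_gaps": int(gaps.sum()),
    }


sample = pd.DataFrame({
    "segment_id": ["prime"] * 3,
    "month_on_book": [1, 2, 4],          # MOB 3 is missing
    "payments": [10.0, 10.0, 200.0],     # last row pays more than the balance
    "chargeoffs": [1.0, 1.0, 1.0],
    "outstanding_balance": [100.0, 89.0, 78.0],
})
report = rollforward_prechecks(sample)
```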
Column aliases
Rollforward data accepts common alternative column names automatically — no renaming needed:
| Canonical name | Accepted aliases |
|---|---|
| segment_id | segment, segmentid, segment_name, bucket |
| month_on_book | mob, month, months_on_book, period |
| payments | payment, principal_payment, pmt |
| chargeoffs | chargeoff, charge_off, co |
| outstanding_balance | balance, outstanding, upb, current_balance |
| amtloan | original_principal, origination_amount, loan_amount |
Alias resolution is case-insensitive and strips punctuation. Outstanding Balance and outstanding_balance both resolve correctly.
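One way such a resolver could work is to compare names after lowercasing and removing everything that is not a letter or digit. The sketch below is an illustration of that idea only: the function names are hypothetical, the alias table is abbreviated, and treating spaces and underscores as strippable is an assumption rather than a description of rollforward_contract.py.

```python
import re

# Abbreviated copy of the alias table above.
ALIASES = {
    "month_on_book": ["mob", "month", "months_on_book", "period"],
    "chargeoffs": ["chargeoff", "charge_off", "co"],
    "outstanding_balance": ["balance", "outstanding", "upb", "current_balance"],
}


def _key(name: str) -> str:
    """Lowercase and drop every character that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def resolve_aliases(columns: list[str]) -> dict[str, str]:
    """Hypothetical sketch: map each input column to its canonical name."""
    lookup = {}
    for canonical, aliases in ALIASES.items():
        for name in [canonical, *aliases]:
            lookup[_key(name)] = canonical
    # Unknown columns are passed through unchanged.
    return {col: lookup.get(_key(col), col) for col in columns}


mapping = resolve_aliases(["Outstanding Balance", "MOB", "Charge-Off", "notes"])
```

The resulting mapping can be passed to DataFrame.rename(columns=mapping) if you want to standardize names yourself.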
Minimal example:
import pandas as pd
rollforward_df = pd.DataFrame({
"segment_id": ["prime"] * 6,
"month_on_book": [1, 2, 3, 4, 5, 6],
"payments": [8200, 7900, 7600, 7300, 7000, 6700],
"chargeoffs": [100, 130, 150, 170, 190, 210],
"outstanding_balance": [100000, 91700, 83670, 75920, 68450, 61260],
})
Common mistakes:
- Balance as beginning-of-next-period rather than beginning-of-current-period: if payments + chargeoffs > outstanding_balance, the contract will reject the rows
- Segment IDs that are floats or integers: coerced to string, but 1.0 and 1 become different strings; standardize first
- MOB gaps (e.g. missing MOB 4): produce a warning but not an error; the model will interpolate
Wide CGCO (loan-level)
Used by: compute_cgco_curve_wide, compute_final_cgco_wide, load_wide_vintage_data
Each row is one loan. The library computes cohort curves from the raw loan records — no pre-aggregation needed.
| Column | Type | Required | Constraints | Notes |
|---|---|---|---|---|
| loan_id | str | Yes | Unique, non-null | |
| vintage_date | datetime | Yes | ≤ max_report_date | Origination date |
| loan_amount | float | Yes | > 0 | Original disbursement amount |
| max_report_date | datetime | Yes | ≥ vintage_date | Last observation date for this loan |
| charge_off_date | datetime | No | ≥ vintage_date if present | Null for loans that have not charged off |
| charge_off_amount | float | Yes | ≥ 0, ≤ loan_amount | Set to 0.0 for loans that have not charged off |
Cross-column constraints:
- charge_off_amount must be 0 for any row where charge_off_date is null
- charge_off_date must be on or after vintage_date when present
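Both rules can be checked up front with a short pandas filter. This is a sketch with a hypothetical function name; it returns the offending rows rather than raising, which is a design choice for illustration and not how the schema reports failures.

```python
import pandas as pd


def wide_cgco_prechecks(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: return rows that break either rule."""
    vintage = pd.to_datetime(df["vintage_date"])
    co_date = pd.to_datetime(df["charge_off_date"])
    # A non-zero charge-off amount with no charge-off date.
    amount_without_date = co_date.isna() & (df["charge_off_amount"] != 0)
    # A charge-off dated before origination.
    date_before_vintage = co_date.notna() & (co_date < vintage)
    return df[amount_without_date | date_before_vintage]


loans = pd.DataFrame({
    "loan_id": ["L001", "L002", "L003"],
    "vintage_date": ["2022-01-01", "2022-01-01", "2022-02-01"],
    "charge_off_date": [None, "2023-06-15", None],
    "charge_off_amount": [0.0, 14200.0, 500.0],  # L003 has an amount but no date
})
bad_rows = wide_cgco_prechecks(loans)
```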
Minimal example:
import pandas as pd
wide_df = pd.DataFrame({
"loan_id": ["L001", "L002", "L003"],
"vintage_date": ["2022-01-01", "2022-01-01", "2022-02-01"],
"loan_amount": [10000.0, 15000.0, 8000.0],
"max_report_date": ["2024-01-01", "2024-01-01", "2024-01-01"],
"charge_off_date": [None, "2023-06-15", None],
"charge_off_amount":[0.0, 14200.0, 0.0],
})
Common mistakes:
- Forgetting to set charge_off_amount = 0.0 for non-charged-off loans: validation will reject rows with a null charge-off date and non-zero amount
- Using the same date for vintage_date and max_report_date on a current loan: valid, but produces a 0-MOB observation with no curve
Transition Matrix
Used by: forecast_lifetime_loss, forecast_portfolio_states, simulate_portfolio_cashflows
The transition matrix is a square pd.DataFrame where both the index and columns are state names. Each cell [i, j] is the probability of moving from state i to state j in one period.
forecast_lifetime_loss and summarize_lifetime_loss also accept:
- a transition ledger with loan_id, period, and status
- a loan history panel with loan_id, fund_date, and as_of_date
In those paths, cranalytics estimates this square matrix internally before
applying the baseline loss forecast. forecast_portfolio_states still requires
the explicit square matrix shown below.
Requirements:
- Square — same states as index and columns
- All values in [0, 1]
- Each row sums to 1.0 (tolerance: ±0.001)
- Must include a loss terminal state named Charged Off or legacy Default
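The requirements above translate directly into a few NumPy checks. The function below is a hypothetical pre-check, not validate_transition_matrix; requiring the index and columns to appear in the same order is an assumption about what "same states" means.

```python
import numpy as np
import pandas as pd


def check_transition_matrix(matrix: pd.DataFrame, tol: float = 1e-3) -> list[str]:
    """Hypothetical sketch: return a list of violated requirements."""
    problems = []
    # Assumption: index and columns must list the same states in the same order.
    if list(matrix.index) != list(matrix.columns):
        problems.append("index and columns differ")
    values = matrix.to_numpy(dtype=float)
    if ((values < 0) | (values > 1)).any():
        problems.append("values outside [0, 1]")
    # Absolute tolerance of 0.001 on each row sum, as stated above.
    if not np.allclose(values.sum(axis=1), 1.0, atol=tol, rtol=0):
        problems.append("rows do not sum to 1.0")
    if not {"Charged Off", "Default"} & set(matrix.index):
        problems.append("no loss terminal state")
    return problems


states = ["Current", "Delinquent", "Charged Off"]
good = pd.DataFrame(
    [[0.92, 0.06, 0.02], [0.30, 0.55, 0.15], [0.00, 0.00, 1.00]],
    index=states, columns=states,
)
bad = pd.DataFrame(
    [[0.90, 0.05], [0.30, 0.70]],  # first row sums to 0.95, no loss state
    index=["Current", "Delinquent"], columns=["Current", "Delinquent"],
)
good_problems = check_transition_matrix(good)
bad_problems = check_transition_matrix(bad)
```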
Minimal example:
import pandas as pd
states = ["Current", "Delinquent", "Charged Off"]
matrix = pd.DataFrame(
[
[0.92, 0.06, 0.02], # Current →
[0.30, 0.55, 0.15], # Delinquent →
[0.00, 0.00, 1.00], # Charged Off → (absorbing)
],
index=states,
columns=states,
)
Load the built-in sample matrix:
from cranalytics import load_sample_transition_matrix
matrix = load_sample_transition_matrix()
# States: Current, Delinquent, Charged Off
Common mistake — status mismatch with portfolio:
make_mock_portfolio() generates loans with four statuses including Paid Off. The sample transition matrix only models active, delinquent, and charged-off states. Filter before passing:
portfolio = portfolio[portfolio["status"].isin(["Current", "Delinquent", "Charged Off"])]
Validating your data programmatically
from cranalytics.validation import (
validate_loan_history,
validate_loan_snapshot,
validate_portfolio,
validate_vintage,
validate_flow_input,
validate_wide_vintage,
validate_transition_matrix,
)
# Each raises pandera.errors.SchemaErrors on failure (lazy=True collects all errors)
validated_df = validate_portfolio(df)
validated_df = validate_loan_history(df)
validated_df = validate_loan_snapshot(df)
validated_df = validate_vintage(df)
validated_df = validate_flow_input(df)
validated_df = validate_wide_vintage(df)
validated_matrix = validate_transition_matrix(matrix_df)
For Rollforward data with alias resolution and a richer issue report:
from cranalytics import validate_rollforward_input_contract
result = validate_rollforward_input_contract(df)
print(result.issue_table) # severity, issue_code, message per issue
clean_df = result.df # renamed and coerced DataFrame
See also
- Choose Your Path — which workflow uses which input shape
- src/cranalytics/validation.py: authoritative schema definitions
- src/cranalytics/loan_history_contract.py: loan history alias resolution and state derivation
- src/cranalytics/loan_snapshot_contract.py: loan snapshot alias resolution and state derivation
- src/cranalytics/rollforward_contract.py: Rollforward alias resolution logic
- CLI: cranalytics validate-data <file.csv> --show-schema