Project Style Guide

"Programs must be written for people to read, and only incidentally for machines to execute." — Hal Abelson

This style guide documents our preferred design philosophy, inspired by the tidyverse ecosystem's approach to intuitive, human-centered API design. While we write Python, we embrace the principles that make tidyverse packages feel elegant and cohesive.

Core Design Philosophy

We follow four unifying principles in all API design decisions:

1. Human-Centered

Optimize for human thinking time over machine execution time. Code is read far more often than it is written. Prioritize clarity and discoverability over cleverness.

2. Consistent

Apply the smallest set of core ideas repeatedly. What users learn about one function should transfer to the next. Consistency compounds into intuition.

3. Composable

Build complex operations from simple, focused functions. Each function should do one thing well. Complexity emerges from composition, not from parameter bloat.

4. Inclusive

Write accessible documentation. Provide helpful error messages. Welcome users at all skill levels.

Data Structures

Use Standard Data Structures

Prefer pandas DataFrames (or polars) as the universal data currency. Avoid creating custom classes unless they provide substantial value.

# Good - returns a DataFrame
def calculate_metrics(data: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "metric": ["mean", "std", "count"],
        "value": [data["x"].mean(), data["x"].std(), len(data)]
    })

# Avoid - custom class adds learning burden
def calculate_metrics(data: pd.DataFrame) -> MetricsResult:
    return MetricsResult(
        mean=data["x"].mean(),
        std=data["x"].std()
    )

Tidy Data as Default

Design functions that accept and return tidy data where possible:

- Each variable forms a column
- Each observation forms a row
- Each value occupies a single cell

# Tidy - easy to filter, group, transform
#   country  year  cases  population
#   Brazil   2020  12345  212559417
#   Brazil   2021  23456  214326223

# Not tidy - harder to work with programmatically
#   country  cases_2020  cases_2021  pop_2020  pop_2021
#   Brazil   12345       23456       212559417 214326223
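
As a minimal sketch of moving from the wide layout to the tidy one, assuming pandas is the DataFrame library in use, pd.wide_to_long can split the cases_YYYY and pop_YYYY columns into a year column plus one column per variable. The frame reuses the Brazil figures from the comments above; the rename to population is only there to match the tidy layout shown.

```python
import pandas as pd

# Wide layout from the "Not tidy" example: one column per variable-year pair
wide = pd.DataFrame({
    "country": ["Brazil"],
    "cases_2020": [12345],
    "cases_2021": [23456],
    "pop_2020": [212559417],
    "pop_2021": [214326223],
})

# Split "<stub>_<year>" columns into a year column and one column per stub
tidy = (
    pd.wide_to_long(wide, stubnames=["cases", "pop"], i="country", j="year", sep="_")
    .reset_index()
    .rename(columns={"pop": "population"})
    .sort_values(["country", "year"])
    .reset_index(drop=True)
    [["country", "year", "cases", "population"]]
)
```

Once the data is in this shape, filtering by year or grouping by country needs no knowledge of how the column names were encoded.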

Function Design

Naming Conventions

Use snake_case for functions, variables, and modules; classes use PascalCase:

# Good
calculate_mean()
read_survey_data()
CustomerRecord  # Classes are PascalCase

# Bad
calculateMean()
ReadSurveyData()

Functions are verbs (imperative mood):

# Good
filter_rows()
transform_columns()
validate_input()

# Avoid
filtered_rows()
column_transformation()
input_validator()

Group related functions with common prefixes:

# Good - discoverable with autocomplete
str_detect()
str_replace()
str_extract()
str_split()

# Or as methods on a namespace
text.detect()
text.replace()
text.extract()

# Avoid - scattered, hard to discover
detect_in_string()
replace_string()
split_text()

Prefer descriptive names over abbreviations:

# Good
separate_by_delimiter()
parse_datetime()
calculate_rolling_average()

# Avoid
sep_delim()
prs_dt()
calc_roll_avg()

Boolean functions use is_, has_, can_ prefixes:

# Good
is_valid()
has_missing_values()
can_convert()

# Avoid
valid()
check_missing()
convertible()

Argument Order

Follow this consistent ordering:

1. Data (first) - the thing being operated on
2. Required descriptors - what to operate on
3. Optional details - with sensible defaults

# Good - data first, then required args, then optional
def filter_rows(
    data: pd.DataFrame,
    condition: str,
    *,
    keep_na: bool = False
) -> pd.DataFrame:
    ...

def summarize_by(
    data: pd.DataFrame,
    group_col: str,
    value_col: str,
    *,
    func: str = "mean"
) -> pd.DataFrame:
    ...

# Avoid - data buried in arguments
def process(method: str, options: dict, data: pd.DataFrame) -> pd.DataFrame:
    ...

Use keyword-only arguments for options:

# Good - force named arguments after *
def read_data(
    path: str,
    *,
    encoding: str = "utf-8",
    skip_rows: int = 0,
    na_values: list[str] | None = None
) -> pd.DataFrame:
    ...

# Usage is explicit
read_data("file.csv", encoding="latin-1", skip_rows=2)

Return Values

Be predictable: If a function accepts a DataFrame, it should return a DataFrame.

# Good - DataFrame in, DataFrame out
def add_ratio(
    data: pd.DataFrame,
    numerator: str,
    denominator: str
) -> pd.DataFrame:
    return data.assign(ratio=lambda df: df[numerator] / df[denominator])

# Avoid - surprising return type
def add_ratio(
    data: pd.DataFrame,
    numerator: str,
    denominator: str
) -> pd.Series:
    return data[numerator] / data[denominator]

Side-effect functions return input for chaining:

def write_output(data: pd.DataFrame, path: str) -> pd.DataFrame:
    """Write data to CSV and return input for chaining."""
    data.to_csv(path, index=False)
    return data

# Enables chaining
result = (
    data
    .pipe(transform_data)
    .pipe(write_output, "intermediate.csv")
    .pipe(create_summary)
)

Function Size

Keep functions focused. If a function needs extensive documentation to explain its parameters, it's probably doing too much.

# Good - single responsibility
def validate_columns(data: pd.DataFrame, required: list[str]) -> pd.DataFrame: ...
def impute_missing(data: pd.DataFrame, method: str = "mean") -> pd.DataFrame: ...
def calculate_scores(data: pd.DataFrame, weights: dict) -> pd.DataFrame: ...

# Avoid - monolithic function
def process_data(
    data: pd.DataFrame,
    validate: bool = True,
    required_cols: list[str] | None = None,
    impute: bool = True,
    impute_method: str = "mean",
    calculate: bool = True,
    weights: dict | None = None,
    **kwargs
) -> pd.DataFrame:
    ...
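
The three focused functions above are shown as stubs. The sketch below fills them in with assumed bodies (the imputation and scoring rules are illustrative guesses, not project code) to show how they compose with .pipe() to cover what the monolithic process_data would do through flags.

```python
import pandas as pd


def validate_columns(data: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    """Raise if any required column is missing; otherwise pass data through."""
    missing = set(required) - set(data.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return data


def impute_missing(data: pd.DataFrame, method: str = "mean") -> pd.DataFrame:
    """Fill missing numeric values with each column's mean or median."""
    numeric = data.select_dtypes("number")
    if method == "mean":
        fill_values = numeric.mean()
    elif method == "median":
        fill_values = numeric.median()
    else:
        raise ValueError(f"method must be 'mean' or 'median', got {method!r}")
    return data.fillna(fill_values)


def calculate_scores(data: pd.DataFrame, weights: dict[str, float]) -> pd.DataFrame:
    """Add a score column: the weighted sum of the columns named in weights."""
    score = sum(data[col] * weight for col, weight in weights.items())
    return data.assign(score=score)


raw = pd.DataFrame({"quiz": [80.0, None, 90.0], "exam": [70.0, 60.0, None]})
weights = {"quiz": 0.4, "exam": 0.6}

# Each step is optional simply by leaving it out of the chain
scored = (
    raw
    .pipe(validate_columns, required=list(weights))
    .pipe(impute_missing, method="mean")
    .pipe(calculate_scores, weights=weights)
)
```

Skipping imputation means deleting one line from the chain rather than passing impute=False, which keeps each function's signature short.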

Method Chaining

Design for Left-to-Right Composition

Use pandas' .pipe() method to enable readable chains:

# Good - reads like a recipe
result = (
    raw_data
    .pipe(clean_column_names)
    .pipe(filter_valid_records)
    .pipe(add_computed_columns)
    .pipe(summarize_by_group, group_col="region")
    .pipe(format_output)
)

# Avoid - nested calls (inside-out reading)
result = format_output(
    summarize_by_group(
        add_computed_columns(
            filter_valid_records(
                clean_column_names(raw_data)
            )
        ),
        group_col="region"
    )
)

Write Pipe-Friendly Functions

Functions work with .pipe() when data is the first argument:

def add_year_column(data: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Extract year from date column."""
    return data.assign(year=lambda df: pd.to_datetime(df[date_col]).dt.year)

# Works naturally in a chain
result = data.pipe(add_year_column, date_col="transaction_date")

Avoid Side Effects

Prefer transformation over mutation:

# Good - returns new DataFrame, original unchanged
def add_log_column(data: pd.DataFrame, col: str) -> pd.DataFrame:
    return data.assign(**{f"log_{col}": lambda df: np.log(df[col])})

# Avoid - modifies in place
def add_log_column(data: pd.DataFrame, col: str) -> pd.DataFrame:
    data[f"log_{col}"] = np.log(data[col])
    return data
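
A short sketch of the practical difference between the two versions above. The in-place variant is renamed add_log_column_in_place here only so both can live in one script; that name is not part of the guide.

```python
import numpy as np
import pandas as pd


def add_log_column(data: pd.DataFrame, col: str) -> pd.DataFrame:
    """Transformation: assign() builds a new DataFrame."""
    return data.assign(**{f"log_{col}": lambda df: np.log(df[col])})


def add_log_column_in_place(data: pd.DataFrame, col: str) -> pd.DataFrame:
    """Mutation: writes the new column into the caller's DataFrame."""
    data[f"log_{col}"] = np.log(data[col])
    return data


original = pd.DataFrame({"x": [1.0, 10.0]})

transformed = add_log_column(original, "x")
columns_after_transform = list(original.columns)  # caller's frame untouched

mutated = add_log_column_in_place(original, "x")
columns_after_mutation = list(original.columns)  # caller's frame now has log_x
```

With the mutating version, any other code holding a reference to original sees the extra column, which is the kind of action at a distance this rule is meant to prevent.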

Type Hints

Always Use Type Hints

Type hints serve as documentation and enable IDE support:

from typing import Literal
import pandas as pd

def summarize_column(
    data: pd.DataFrame,
    column: str,
    *,
    method: Literal["mean", "median", "sum"] = "mean",
    dropna: bool = True
) -> float:
    """Calculate summary statistic for a column."""
    series = data[column]
    if dropna:
        series = series.dropna()

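    match method:
        case "mean":
            return series.mean()
        case "median":
            return series.median()
        case "sum":
            return series.sum()
        case _:
            # Literal is not enforced at runtime, so reject unknown methods explicitly
            raise ValueError(
                f"method must be 'mean', 'median', or 'sum', got {method!r}"
            )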

Use Union Types for Flexibility

from collections.abc import Sequence

def filter_values(
    data: pd.DataFrame,
    column: str,
    values: str | Sequence[str]
) -> pd.DataFrame:
    """Filter rows where column matches any of the given values."""
    if isinstance(values, str):
        values = [values]
    return data[data[column].isin(values)]

Error Handling

Raise Informative Errors

Provide context about what went wrong and how to fix it:

# Good - informative error
def validate_columns(data: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    missing = set(required) - set(data.columns)
    if missing:
        raise ValueError(
            f"Missing required columns: {sorted(missing)}. "
            f"Available columns: {sorted(data.columns)}"
        )
    return data

# Avoid - cryptic error
def validate_columns(data: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    for col in required:
        assert col in data.columns  # AssertionError with no context
    return data

Use Specific Exception Types

class ValidationError(Exception):
    """Raised when data fails validation."""
    pass

class ColumnNotFoundError(KeyError):
    """Raised when a required column is missing."""
    pass

def get_column(data: pd.DataFrame, name: str) -> pd.Series:
    if name not in data.columns:
        raise ColumnNotFoundError(
            f"Column '{name}' not found. "
            # map(str, ...) keeps the join from failing on non-string column labels
            f"Did you mean one of: {', '.join(map(str, data.columns[:5]))}..."
        )
    return data[name]
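
One possible refinement of the "did you mean" hint, sketched under two assumptions that are not in the guide: suggestions come from the standard library's difflib.get_close_matches, and the exception overrides __str__. The override is there because KeyError renders a single-argument message through repr(), which wraps the whole message in quotes.

```python
import difflib

import pandas as pd


class ColumnNotFoundError(KeyError):
    """Raised when a required column is missing."""

    def __str__(self) -> str:
        # KeyError would otherwise show repr(message), quotes included
        return str(self.args[0])


def get_column(data: pd.DataFrame, name: str) -> pd.Series:
    if name not in data.columns:
        candidates = [str(col) for col in data.columns]
        # Up to three column names that are textually close to the request
        suggestions = difflib.get_close_matches(name, candidates, n=3, cutoff=0.6)
        hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
        raise ColumnNotFoundError(f"Column '{name}' not found.{hint}")
    return data[name]


sales = pd.DataFrame({"revenue": [10, 20], "region": ["A", "B"], "year": [2020, 2021]})

try:
    get_column(sales, "revenu")
except ColumnNotFoundError as error:
    message = str(error)
```

Because ColumnNotFoundError still subclasses KeyError, callers that already catch KeyError keep working.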

Error Message Conventions

Use "must" when the problem is unambiguous
Use "cannot" when expectations aren't clear
Include the actual value received when helpful

# Clear requirement
raise TypeError(f"data must be a DataFrame, got {type(data).__name__}")

# Ambiguous situation
raise KeyError(f"Cannot find column {name!r} in data")

# Include actual value
raise ValueError(f"threshold must be positive, got {threshold}")

Code Style

We follow PEP 8 with these clarifications:

Line length: 88 characters (Ruff default)

Imports: Group and sort with Ruff's isort rules (I)

# Standard library
import json
from pathlib import Path

# Third-party
import numpy as np
import pandas as pd

# Local
from mypackage.core import transform
from mypackage.utils import validate

String quotes: Double quotes for strings (Ruff format default)

# Good
message = "Hello, world"

# Avoid
message = 'Hello, world'

Trailing commas: Include in multi-line structures

# Good - easier diffs when adding items
config = {
    "input_path": "data/raw",
    "output_path": "data/processed",
    "verbose": True,
}

# Avoid
config = {
    "input_path": "data/raw",
    "output_path": "data/processed",
    "verbose": True
}

Use Ruff for Formatting

Don't debate style—automate it:

# Format all files
ruff format src/ tests/

# Check without modifying
ruff format --check src/ tests/

Use Ruff for Linting

Fast, comprehensive linting:

# Check for issues
ruff check src/ tests/

# Fix auto-fixable issues
ruff check --fix src/ tests/

Documentation

Every Public Function Needs a Docstring

Use NumPy style docstrings:

def calculate_weighted_mean(
    data: pd.DataFrame,
    value_col: str,
    weight_col: str,
    *,
    group_col: str | None = None
) -> pd.DataFrame | float:
    """
    Calculate weighted arithmetic mean.

    Computes the weighted mean of values, optionally grouped by a column.
    Weights are normalized to sum to 1 within each group.

    Parameters
    ----------
    data : pd.DataFrame
        Input data containing value and weight columns.
    value_col : str
        Name of column containing values to average.
    weight_col : str
        Name of column containing weights.
    group_col : str, optional
        If provided, calculate weighted mean within each group.

    Returns
    -------
    pd.DataFrame | float
        If group_col is provided, returns DataFrame with group column
        and weighted_mean column. Otherwise, returns a single float.

    Examples
    --------
    >>> df = pd.DataFrame({
    ...     "region": ["A", "A", "B", "B"],
    ...     "sales": [100, 200, 150, 250],
    ...     "weight": [0.3, 0.7, 0.5, 0.5]
    ... })
    >>> calculate_weighted_mean(df, "sales", "weight")
    175.0
    >>> calculate_weighted_mean(df, "sales", "weight", group_col="region")
       region  weighted_mean
    0       A          170.0
    1       B          200.0

    See Also
    --------
    calculate_mean : Unweighted mean calculation.
    """
    ...

Write Useful Examples

Examples should be runnable and demonstrate typical usage:

"""
Examples
--------
Basic usage with sample data:

>>> import pandas as pd
>>> data = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})
>>> add_ratio(data, "y", "x")
   x   y  ratio
0  1   2    2.0
1  2   4    2.0
2  3   6    2.0
3  4   8    2.0
4  5  10    2.0

With a zero denominator (division yields inf rather than raising):

>>> data_with_zero = pd.DataFrame({"x": [1, 0, 3], "y": [2, 4, 6]})
>>> add_ratio(data_with_zero, "y", "x")
   x  y     ratio
0  1  2  2.000000
1  0  4       inf
2  3  6  2.000000
"""

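"Runnable" can be checked mechanically. The sketch below uses the standard library's doctest module to execute the examples in a function's docstring and report failures. The helper name run_docstring_examples is an assumption for illustration, and the example compares a plain list rather than a printed DataFrame so the check does not depend on pandas display formatting.

```python
import doctest

import pandas as pd


def add_ratio(data: pd.DataFrame, numerator: str, denominator: str) -> pd.DataFrame:
    """
    Add a ratio column.

    Examples
    --------
    >>> data = pd.DataFrame({"x": [1, 2], "y": [2, 4]})
    >>> add_ratio(data, "y", "x")["ratio"].tolist()
    [2.0, 2.0]
    """
    return data.assign(ratio=lambda df: df[numerator] / df[denominator])


def run_docstring_examples(func) -> doctest.TestResults:
    """Hypothetical helper: run the >>> examples found in func's docstring."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # Names the examples rely on must be supplied explicitly
    namespace = {"pd": pd, func.__name__: func}
    for test in finder.find(func, name=func.__name__, globs=namespace):
        runner.run(test)
    return runner.summarize(verbose=False)


results = run_docstring_examples(add_ratio)
```

In a project that already uses pytest, the same check is usually run across all modules with pytest's --doctest-modules option instead of a hand-written helper.
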
Module-Level Docstrings

Each module should explain its purpose:

"""
Data transformation utilities.

This module provides functions for cleaning, reshaping, and transforming
DataFrames. All functions follow the pattern of accepting a DataFrame as
the first argument and returning a DataFrame, enabling method chaining
with `.pipe()`.

Functions
---------
clean_column_names
    Normalize column names to snake_case.
filter_valid_records
    Remove rows with missing required values.
pivot_longer
    Reshape from wide to long format.
pivot_wider
    Reshape from long to wide format.

Examples
--------
>>> import mypackage.transform as tf
>>> result = (
...     data
...     .pipe(tf.clean_column_names)
...     .pipe(tf.filter_valid_records, required=["id", "date"])
...     .pipe(tf.pivot_longer, cols=["jan", "feb", "mar"])
... )
"""

Testing

Mirror Source Structure

src/
  mypackage/
    __init__.py
    core.py
    transform.py
    io.py
tests/
  __init__.py
  test_core.py
  test_transform.py
  test_io.py

Test Behavior, Not Implementation

# Good - tests the contract
def test_add_ratio_creates_ratio_column():
    data = pd.DataFrame({"sales": [100, 200], "visitors": [10, 20]})
    result = add_ratio(data, "sales", "visitors")

    assert "ratio" in result.columns
    assert list(result["ratio"]) == [10.0, 10.0]

def test_add_ratio_preserves_original_columns():
    data = pd.DataFrame({"a": [1], "b": [2]})
    result = add_ratio(data, "a", "b")

    assert "a" in result.columns
    assert "b" in result.columns

# Avoid - tests implementation details
def test_add_ratio_uses_assign():
    # Fragile: breaks if we change implementation
    ...

Use Fixtures for Test Data

import pytest
import pandas as pd

@pytest.fixture
def sample_data():
    """Standard test dataset."""
    return pd.DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": ["A", "A", "B", "B", "B"],
        "value": [10, 20, 30, 40, 50],
        "weight": [0.1, 0.2, 0.3, 0.2, 0.2]
    })

@pytest.fixture
def data_with_missing():
    """Test dataset containing NA values."""
    return pd.DataFrame({
        "id": [1, 2, 3],
        "value": [10, None, 30]
    })

def test_summarize_handles_groups(sample_data):
    result = summarize_by_group(sample_data, "group", "value")
    assert len(result) == 2
    assert set(result["group"]) == {"A", "B"}

Test Edge Cases

def test_filter_with_empty_dataframe():
    empty = pd.DataFrame({"x": []})
    result = filter_positive(empty, "x")
    assert len(result) == 0
    assert list(result.columns) == ["x"]

def test_filter_with_all_na():
    data = pd.DataFrame({"x": [None, None, None]})
    result = filter_positive(data, "x")
    assert len(result) == 0

def test_summarize_single_row():
    data = pd.DataFrame({"x": [42]})
    result = summarize(data, "x")
    assert result["mean"] == 42
    assert result["std"] == 0  # or NaN, document which
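
The edge-case tests above call filter_positive without showing it. A minimal sketch of an implementation that would satisfy the first two tests follows; coercing with pd.to_numeric is an assumption made here so that an all-None column (object dtype) compares cleanly instead of raising on None > 0.

```python
import pandas as pd


def filter_positive(data: pd.DataFrame, column: str) -> pd.DataFrame:
    """Keep rows where column is strictly greater than zero."""
    # Coerce to numbers first: None and unparseable values become NaN
    values = pd.to_numeric(data[column], errors="coerce")
    # NaN > 0 evaluates to False, so missing values are dropped
    return data.loc[values > 0]


empty_result = filter_positive(pd.DataFrame({"x": []}), "x")
all_na_result = filter_positive(pd.DataFrame({"x": [None, None, None]}), "x")
mixed_result = filter_positive(pd.DataFrame({"x": [-1, 0, 2, None]}), "x")
```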

Quick Reference

Principle	Guidance
Naming	snake_case, verbs for functions, common prefixes
Arguments	Data first, required next, optional as keyword-only
Returns	Predictable types; DataFrame in → DataFrame out
Errors	Informative messages with context and guidance
Types	Always use type hints (basedpyright)
Docs	NumPy-style docstrings, runnable examples
Tests	Mirror source structure, test behavior
Format	Ruff (format + lint), no debates

Resources

Tidyverse Design Principles — Philosophy we adapt
PEP 8 — Python style foundation
NumPy Docstring Guide
Effective Pandas — Idiomatic pandas patterns
pytest Documentation

This style guide is a living document. Update it as the project evolves and new patterns emerge.