Project Style Guide
"Programs must be written for people to read, and only incidentally for machines to execute." — Hal Abelson
This style guide documents our preferred design philosophy, inspired by the tidyverse ecosystem's approach to intuitive, human-centered API design. While we write Python, we embrace the principles that make tidyverse packages feel elegant and cohesive.
Core Design Philosophy
We follow four unifying principles in all API design decisions:
1. Human-Centered
Optimize for human thinking time over machine execution time. Code is read far more often than it is written. Prioritize clarity and discoverability over cleverness.
2. Consistent
Apply the smallest set of core ideas repeatedly. What users learn about one function should transfer to the next. Consistency compounds into intuition.
3. Composable
Build complex operations from simple, focused functions. Each function should do one thing well. Complexity emerges from composition, not from parameter bloat.
4. Inclusive
Write accessible documentation. Provide helpful error messages. Welcome users at all skill levels.
Data Structures
Use Standard Data Structures
Prefer pandas DataFrames (or polars) as the universal data currency. Avoid creating custom classes unless they provide substantial value.
# Good - returns a DataFrame
def calculate_metrics(data: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "metric": ["mean", "std", "count"],
        "value": [data["x"].mean(), data["x"].std(), len(data)]
    })

# Avoid - custom class adds learning burden
def calculate_metrics(data: pd.DataFrame) -> MetricsResult:
    return MetricsResult(
        mean=data["x"].mean(),
        std=data["x"].std()
    )
Tidy Data as Default
Design functions that accept and return tidy data where possible:
- Each variable forms a column
- Each observation forms a row
- Each value occupies a single cell
# Tidy - easy to filter, group, transform
# country  year  cases  population
# Brazil   2020  12345  212559417
# Brazil   2021  23456  214326223

# Not tidy - harder to work with programmatically
# country  cases_2020  cases_2021  pop_2020   pop_2021
# Brazil   12345       23456       212559417  214326223
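As a rough illustration, the wide table above could be reshaped into the tidy layout with plain pandas. This is a sketch, not a prescribed recipe: the intermediate column names (name, variable, value) are made up for the example, and it assumes every wide column follows a measure_year naming pattern.

```python
import pandas as pd

# The "not tidy" table from above: one column per measure/year combination.
wide = pd.DataFrame({
    "country": ["Brazil"],
    "cases_2020": [12345],
    "cases_2021": [23456],
    "pop_2020": [212559417],
    "pop_2021": [214326223],
})

tidy = (
    wide
    # One row per (country, original column name).
    .melt(id_vars="country", var_name="name", value_name="value")
    # Split "cases_2020" into the measure ("cases") and the year (2020).
    .assign(
        variable=lambda df: df["name"].str.rsplit("_", n=1).str[0],
        year=lambda df: df["name"].str.rsplit("_", n=1).str[1].astype(int),
    )
    # Spread the measures back out so each variable is a column.
    .pivot(index=["country", "year"], columns="variable", values="value")
    .reset_index()
    .rename_axis(columns=None)
    .rename(columns={"pop": "population"})
)
```

The result has one row per country and year, with cases and population as ordinary columns.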
Function Design
Naming Conventions
Use snake_case for functions and variables; classes use PascalCase:
# Good
calculate_mean()
read_survey_data()
CustomerRecord # Classes are PascalCase
# Bad
calculateMean()
ReadSurveyData()
Functions are verbs (imperative mood):
# Good
filter_rows()
transform_columns()
validate_input()
# Avoid
filtered_rows()
column_transformation()
input_validator()
Group related functions with common prefixes:
# Good - discoverable with autocomplete
str_detect()
str_replace()
str_extract()
str_split()
# Or as methods on a namespace
text.detect()
text.replace()
text.extract()
# Avoid - scattered, hard to discover
detect_in_string()
replace_string()
split_text()
Prefer descriptive names over abbreviations:
# Good
separate_by_delimiter()
parse_datetime()
calculate_rolling_average()
# Avoid
sep_delim()
prs_dt()
calc_roll_avg()
Boolean functions use is_, has_, can_ prefixes:
# Good
is_valid()
has_missing_values()
can_convert()
# Avoid
valid()
check_missing()
convertible()
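A minimal sketch of one such predicate, assuming it should return a plain Python bool so that it reads naturally in an if statement. The body is one possible implementation, not a fixed part of this guide.

```python
import pandas as pd


def has_missing_values(data: pd.DataFrame) -> bool:
    """Return True if any cell in the DataFrame is missing."""
    # isna() marks each missing cell; any().any() reduces over columns, then rows.
    return bool(data.isna().any().any())


complete = pd.DataFrame({"x": [1, 2]})
incomplete = pd.DataFrame({"x": [1, None]})

if has_missing_values(incomplete):
    status = "needs cleaning"
else:
    status = "ready"
```

The prefix tells the reader at the call site that the result is a yes/no answer.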
Argument Order
Follow this consistent ordering:
- Data (first) - the thing being operated on
- Required descriptors - what to operate on
- Optional details - with sensible defaults
# Good - data first, then required args, then optional
def filter_rows(
    data: pd.DataFrame,
    condition: str,
    *,
    keep_na: bool = False
) -> pd.DataFrame:
    ...

def summarize_by(
    data: pd.DataFrame,
    group_col: str,
    value_col: str,
    *,
    func: str = "mean"
) -> pd.DataFrame:
    ...

# Avoid - data buried in arguments
def process(method: str, options: dict, data: pd.DataFrame) -> pd.DataFrame:
    ...
Use keyword-only arguments for options:
# Good - force named arguments after *
def read_data(
    path: str,
    *,
    encoding: str = "utf-8",
    skip_rows: int = 0,
    na_values: list[str] | None = None
) -> pd.DataFrame:
    ...
# Usage is explicit
read_data("file.csv", encoding="latin-1", skip_rows=2)
Return Values
Be predictable: If a function accepts a DataFrame, it should return a DataFrame.
# Good - DataFrame in, DataFrame out
def add_ratio(
    data: pd.DataFrame,
    numerator: str,
    denominator: str
) -> pd.DataFrame:
    return data.assign(ratio=lambda df: df[numerator] / df[denominator])

# Avoid - surprising return type
def add_ratio(
    data: pd.DataFrame,
    numerator: str,
    denominator: str
) -> pd.Series:
    return data[numerator] / data[denominator]
Side-effect functions return input for chaining:
def write_output(data: pd.DataFrame, path: str) -> pd.DataFrame:
    """Write data to CSV and return input for chaining."""
    data.to_csv(path, index=False)
    return data

# Enables chaining
result = (
    data
    .pipe(transform_data)
    .pipe(write_output, "intermediate.csv")
    .pipe(create_summary)
)
Function Size
Keep functions focused. If a function needs extensive documentation to explain its parameters, it's probably doing too much.
# Good - single responsibility
def validate_columns(data: pd.DataFrame, required: list[str]) -> pd.DataFrame: ...
def impute_missing(data: pd.DataFrame, method: str = "mean") -> pd.DataFrame: ...
def calculate_scores(data: pd.DataFrame, weights: dict) -> pd.DataFrame: ...
# Avoid - monolithic function
def process_data(
    data: pd.DataFrame,
    validate: bool = True,
    required_cols: list[str] | None = None,
    impute: bool = True,
    impute_method: str = "mean",
    calculate: bool = True,
    weights: dict | None = None,
    **kwargs
) -> pd.DataFrame:
    ...
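The three focused functions can be composed to cover what the monolithic version does. The bodies below are assumptions made for the sake of a runnable sketch (mean imputation only, scores as a weighted sum of columns); only the signatures come from this guide.

```python
import pandas as pd


def validate_columns(data: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    missing = set(required) - set(data.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return data


def impute_missing(data: pd.DataFrame, method: str = "mean") -> pd.DataFrame:
    # This sketch only supports mean imputation of numeric columns.
    if method != "mean":
        raise ValueError(f"method must be 'mean' in this sketch, got {method!r}")
    return data.fillna(data.mean(numeric_only=True))


def calculate_scores(data: pd.DataFrame, weights: dict) -> pd.DataFrame:
    # Assumed meaning of weights: column name -> weight in a weighted sum.
    score = sum(data[column] * weight for column, weight in weights.items())
    return data.assign(score=score)


raw = pd.DataFrame({"quality": [1.0, None, 3.0], "speed": [2.0, 4.0, 6.0]})

scored = (
    raw
    .pipe(validate_columns, required=["quality", "speed"])
    .pipe(impute_missing)
    .pipe(calculate_scores, weights={"quality": 0.5, "speed": 0.5})
)
```

Each step can be tested, reused, or dropped on its own, and no boolean flags are needed to switch steps off.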
Method Chaining
Design for Left-to-Right Composition
Use pandas' .pipe() method to enable readable chains:
# Good - reads like a recipe
result = (
    raw_data
    .pipe(clean_column_names)
    .pipe(filter_valid_records)
    .pipe(add_computed_columns)
    .pipe(summarize_by_group, group_col="region")
    .pipe(format_output)
)

# Avoid - nested calls (inside-out reading)
result = format_output(
    summarize_by_group(
        add_computed_columns(
            filter_valid_records(
                clean_column_names(raw_data)
            )
        ),
        group_col="region"
    )
)
Write Pipe-Friendly Functions
Functions work with .pipe() when data is the first argument:
def add_year_column(data: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Extract year from date column."""
    return data.assign(year=lambda df: pd.to_datetime(df[date_col]).dt.year)
# Works naturally in a chain
result = data.pipe(add_year_column, date_col="transaction_date")
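When a function you do not control takes the data somewhere other than first, pandas' .pipe() also accepts a (function, keyword) tuple naming the argument that should receive the DataFrame. A short sketch follows; tag_rows is a hypothetical helper invented for the example.

```python
import pandas as pd


def tag_rows(label: str, data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper whose data argument is not first."""
    return data.assign(label=label)


data = pd.DataFrame({"x": [1, 2]})

# The tuple tells .pipe() to pass the DataFrame as the "data" keyword.
tagged = data.pipe((tag_rows, "data"), "batch-1")
```

This is a workaround for existing code; new functions in this project should still take data first.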
Avoid Side Effects
Prefer transformation over mutation:
# Good - returns new DataFrame, original unchanged
def add_log_column(data: pd.DataFrame, col: str) -> pd.DataFrame:
    return data.assign(**{f"log_{col}": lambda df: np.log(df[col])})

# Avoid - modifies in place
def add_log_column(data: pd.DataFrame, col: str) -> pd.DataFrame:
    data[f"log_{col}"] = np.log(data[col])
    return data
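The difference is easiest to see from the caller's side. In this sketch the in-place variant is renamed add_log_column_in_place (a name made up here) so both versions can run next to each other.

```python
import numpy as np
import pandas as pd


def add_log_column(data: pd.DataFrame, col: str) -> pd.DataFrame:
    # assign() builds a new DataFrame and leaves the input alone.
    return data.assign(**{f"log_{col}": lambda df: np.log(df[col])})


def add_log_column_in_place(data: pd.DataFrame, col: str) -> pd.DataFrame:
    # Item assignment adds the column to the caller's object.
    data[f"log_{col}"] = np.log(data[col])
    return data


original = pd.DataFrame({"x": [1.0, 10.0]})

pure_result = add_log_column(original, "x")
columns_after_pure = list(original.columns)

mutated_result = add_log_column_in_place(original, "x")
columns_after_mutation = list(original.columns)
```

After the first call the caller's DataFrame still has only x; after the second it has silently gained log_x.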
Type Hints
Always Use Type Hints
Type hints serve as documentation and enable IDE support:
from typing import Literal
import pandas as pd

def summarize_column(
    data: pd.DataFrame,
    column: str,
    *,
    method: Literal["mean", "median", "sum"] = "mean",
    dropna: bool = True
) -> float:
    """Calculate summary statistic for a column."""
    series = data[column]
    if dropna:
        series = series.dropna()
    match method:
        case "mean":
            return series.mean()
        case "median":
            return series.median()
        case "sum":
            return series.sum()
Use Union Types for Flexibility
from collections.abc import Sequence

def filter_values(
    data: pd.DataFrame,
    column: str,
    values: str | Sequence[str]
) -> pd.DataFrame:
    """Filter rows where column matches any of the given values."""
    if isinstance(values, str):
        values = [values]
    return data[data[column].isin(values)]
Error Handling
Raise Informative Errors
Provide context about what went wrong and how to fix it:
# Good - informative error
def validate_columns(data: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    missing = set(required) - set(data.columns)
    if missing:
        raise ValueError(
            f"Missing required columns: {sorted(missing)}. "
            f"Available columns: {sorted(data.columns)}"
        )
    return data

# Avoid - cryptic error
def validate_columns(data: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    for col in required:
        assert col in data.columns  # AssertionError with no context
    return data
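For a sense of what the user actually sees, this sketch calls the informative version with a column that does not exist and captures the message. The sample data is made up for the example.

```python
import pandas as pd


def validate_columns(data: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    missing = set(required) - set(data.columns)
    if missing:
        raise ValueError(
            f"Missing required columns: {sorted(missing)}. "
            f"Available columns: {sorted(data.columns)}"
        )
    return data


data = pd.DataFrame({"a": [1], "b": [2]})

try:
    validate_columns(data, required=["a", "c"])
except ValueError as error:
    message = str(error)

# message is:
# Missing required columns: ['c']. Available columns: ['a', 'b']
```

The message names both the problem and the columns that are available, so the fix is usually obvious without opening the source.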
Use Specific Exception Types
class ValidationError(Exception):
    """Raised when data fails validation."""
    pass

class ColumnNotFoundError(KeyError):
    """Raised when a required column is missing."""
    pass

def get_column(data: pd.DataFrame, name: str) -> pd.Series:
    if name not in data.columns:
        raise ColumnNotFoundError(
            f"Column '{name}' not found. "
            f"Did you mean one of: {', '.join(data.columns[:5])}..."
        )
    return data[name]
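One benefit of subclassing a built-in exception is that existing handlers keep working. The sketch below, using made-up sample data, catches the custom error through its KeyError base class. It also reads the message from error.args[0], because str() on a KeyError returns the repr of its argument and so wraps the text in quotes.

```python
import pandas as pd


class ColumnNotFoundError(KeyError):
    """Raised when a required column is missing."""


def get_column(data: pd.DataFrame, name: str) -> pd.Series:
    if name not in data.columns:
        raise ColumnNotFoundError(
            f"Column '{name}' not found. "
            f"Did you mean one of: {', '.join(data.columns[:5])}..."
        )
    return data[name]


data = pd.DataFrame({"price": [9.5]})

try:
    get_column(data, "cost")
except KeyError as error:  # Callers that only know KeyError still catch it.
    caught = type(error).__name__
    raw_message = error.args[0]  # The unquoted message text.
```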
Error Message Conventions
- Use "must" when the problem is unambiguous
- Use "cannot" when expectations aren't clear
- Include the actual value received when helpful
# Clear requirement
raise TypeError(f"data must be a DataFrame, got {type(data).__name__}")
# Ambiguous situation
raise KeyError("Cannot find column 'foo' in data")
# Include actual value
raise ValueError(f"threshold must be positive, got {threshold}")
Code Style
Follow PEP 8 with Refinements
We follow PEP 8 with these clarifications:
Line length: 88 characters (Black default)
Imports: Group and sort in isort order (enforced by Ruff's I rules)
# Standard library
import json
from pathlib import Path
# Third-party
import numpy as np
import pandas as pd
# Local
from mypackage.core import transform
from mypackage.utils import validate
String quotes: Double quotes for strings (Black default)
# Good
message = "Hello, world"
# Avoid
message = 'Hello, world'
Trailing commas: Include in multi-line structures
# Good - easier diffs when adding items
config = {
    "input_path": "data/raw",
    "output_path": "data/processed",
    "verbose": True,
}

# Avoid
config = {
    "input_path": "data/raw",
    "output_path": "data/processed",
    "verbose": True
}
Use Black for Formatting
Don't debate style—automate it:
# Format all files
black src/ tests/
# Check without modifying
black --check src/ tests/
Use Ruff for Linting
Fast, comprehensive linting:
# Check for issues
ruff check src/ tests/
# Fix auto-fixable issues
ruff check --fix src/ tests/
Documentation
Every Public Function Needs a Docstring
Use NumPy style docstrings:
def calculate_weighted_mean(
    data: pd.DataFrame,
    value_col: str,
    weight_col: str,
    *,
    group_col: str | None = None
) -> pd.DataFrame | float:
    """
    Calculate weighted arithmetic mean.

    Computes the weighted mean of values, optionally grouped by a column.
    Weights are normalized to sum to 1 within each group.

    Parameters
    ----------
    data : pd.DataFrame
        Input data containing value and weight columns.
    value_col : str
        Name of column containing values to average.
    weight_col : str
        Name of column containing weights.
    group_col : str, optional
        If provided, calculate weighted mean within each group.

    Returns
    -------
    pd.DataFrame | float
        If group_col is provided, returns DataFrame with group column
        and weighted_mean column. Otherwise, returns a single float.

    Examples
    --------
    >>> df = pd.DataFrame({
    ...     "region": ["A", "A", "B", "B"],
    ...     "sales": [100, 200, 150, 250],
    ...     "weight": [0.3, 0.7, 0.5, 0.5]
    ... })
    >>> calculate_weighted_mean(df, "sales", "weight")
    185.0
    >>> calculate_weighted_mean(df, "sales", "weight", group_col="region")
      region  weighted_mean
    0      A          170.0
    1      B          200.0

    See Also
    --------
    calculate_mean : Unweighted mean calculation.
    """
    ...
Write Useful Examples
Examples should be runnable and demonstrate typical usage:
"""
Examples
--------
Basic usage with sample data:

>>> import pandas as pd
>>> data = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})
>>> add_ratio(data, "y", "x")
   x   y  ratio
0  1   2    2.0
1  2   4    2.0
2  3   6    2.0
3  4   8    2.0
4  5  10    2.0

With a zero denominator:

>>> data_with_zero = pd.DataFrame({"x": [1, 0, 3], "y": [2, 4, 6]})
>>> add_ratio(data_with_zero, "y", "x")
   x  y     ratio
0  1  2  2.000000
1  0  4       inf
2  3  6  2.000000
"""
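Docstring examples only stay runnable if something runs them. As a minimal sketch, the standard library's doctest module can execute the examples attached to a function and report failures. The function double is a stand-in invented for the example.

```python
import doctest


def double(value: int) -> int:
    """
    Return twice the input.

    Examples
    --------
    >>> double(21)
    42
    """
    return value * 2


# Collect the examples from the docstring and run them.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(double, name="double", globs={"double": double}):
    runner.run(test)

# summarize() returns a (failed, attempted) result tuple.
results = runner.summarize(verbose=False)
```

In a project using pytest, a common alternative is to collect the same examples with the --doctest-modules option.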
Module-Level Docstrings
Each module should explain its purpose:
"""
Data transformation utilities.

This module provides functions for cleaning, reshaping, and transforming
DataFrames. All functions follow the pattern of accepting a DataFrame as
the first argument and returning a DataFrame, enabling method chaining
with `.pipe()`.

Functions
---------
clean_column_names
    Normalize column names to snake_case.
filter_valid_records
    Remove rows with missing required values.
pivot_longer
    Reshape from wide to long format.
pivot_wider
    Reshape from long to wide format.

Examples
--------
>>> import mypackage.transform as tf
>>> result = (
...     data
...     .pipe(tf.clean_column_names)
...     .pipe(tf.filter_valid_records, required=["id", "date"])
...     .pipe(tf.pivot_longer, cols=["jan", "feb", "mar"])
... )
"""
Testing
Mirror Source Structure
src/
    mypackage/
        __init__.py
        core.py
        transform.py
        io.py
tests/
    __init__.py
    test_core.py
    test_transform.py
    test_io.py
Test Behavior, Not Implementation
# Good - tests the contract
def test_add_ratio_creates_ratio_column():
    data = pd.DataFrame({"sales": [100, 200], "visitors": [10, 20]})
    result = add_ratio(data, "sales", "visitors")
    assert "ratio" in result.columns
    assert list(result["ratio"]) == [10.0, 10.0]

def test_add_ratio_preserves_original_columns():
    data = pd.DataFrame({"a": [1], "b": [2]})
    result = add_ratio(data, "a", "b")
    assert "a" in result.columns
    assert "b" in result.columns

# Avoid - tests implementation details
def test_add_ratio_uses_assign():
    # Fragile: breaks if we change implementation
    ...
Use Fixtures for Test Data
import pytest
import pandas as pd

@pytest.fixture
def sample_data():
    """Standard test dataset."""
    return pd.DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": ["A", "A", "B", "B", "B"],
        "value": [10, 20, 30, 40, 50],
        "weight": [0.1, 0.2, 0.3, 0.2, 0.2]
    })

@pytest.fixture
def data_with_missing():
    """Test dataset containing NA values."""
    return pd.DataFrame({
        "id": [1, 2, 3],
        "value": [10, None, 30]
    })

def test_summarize_handles_groups(sample_data):
    result = summarize_by_group(sample_data, "group", "value")
    assert len(result) == 2
    assert set(result["group"]) == {"A", "B"}
Test Edge Cases
def test_filter_with_empty_dataframe():
    empty = pd.DataFrame({"x": []})
    result = filter_positive(empty, "x")
    assert len(result) == 0
    assert list(result.columns) == ["x"]

def test_filter_with_all_na():
    data = pd.DataFrame({"x": [None, None, None]})
    result = filter_positive(data, "x")
    assert len(result) == 0

def test_summarize_single_row():
    data = pd.DataFrame({"x": [42]})
    result = summarize(data, "x")
    assert result["mean"] == 42
    # pandas' default sample std (ddof=1) is NaN for a single value;
    # document whether summarize() returns 0 or NaN and assert that
    assert result["std"] == 0
Project Structure
mypackage/
├── pyproject.toml
├── README.md
├── LICENSE
├── src/
│   └── mypackage/
│       ├── __init__.py
│       ├── core.py
│       ├── transform.py
│       ├── io.py
│       └── _utils.py        # Private utilities
├── tests/
│   ├── __init__.py
│   ├── conftest.py          # Shared fixtures
│   ├── test_core.py
│   ├── test_transform.py
│   └── test_io.py
├── docs/
│   ├── index.md
│   ├── getting-started.md
│   └── api/
└── .github/
    └── workflows/
        └── ci.yml
pyproject.toml Configuration
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "mypackage"
version = "0.1.0"
description = "A human-centered data transformation library"
readme = "README.md"
requires-python = ">=3.10"
license = "MIT"
authors = [{ name = "Your Name", email = "you@example.com" }]
dependencies = [
"pandas>=2.0",
"numpy>=1.24",
]
[project.optional-dependencies]
dev = [
"pytest>=7.0",
"pytest-cov>=4.0",
"black>=23.0",
"ruff>=0.1",
"mypy>=1.0",
"pandas-stubs",
]
docs = [
"mkdocs>=1.5",
"mkdocs-material>=9.0",
"mkdocstrings[python]>=0.24",
]
[tool.black]
line-length = 88
target-version = ["py310"]
[tool.ruff]
line-length = 88
select = ["E", "F", "I", "N", "W", "UP"]
[tool.ruff.isort]
known-first-party = ["mypackage"]
[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-v --cov=mypackage --cov-report=term-missing"
[tool.mypy]
python_version = "3.10"
strict = true
Development Workflow
Setting Up
# Clone and create virtual environment
git clone https://github.com/you/mypackage.git
cd mypackage
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
# Install in development mode with all extras
pip install -e ".[dev,docs]"
Daily Development
# Format code
black src/ tests/
# Lint
ruff check src/ tests/
# Type check
mypy src/
# Run tests
pytest
# Run tests with coverage
pytest --cov=mypackage --cov-report=html
Pre-Commit Hooks
Automate quality checks:
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/psf/black
    rev: 23.12.1
    hooks:
      - id: black
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.9
    hooks:
      - id: ruff
        args: [--fix]
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.8.0
    hooks:
      - id: mypy
        additional_dependencies: [pandas-stubs]
# Install hooks
pip install pre-commit
pre-commit install
Building Documentation
# Serve locally with live reload
mkdocs serve
# Build static site
mkdocs build
Quick Reference
| Principle | Guidance |
|---|---|
| Naming | snake_case, verbs for functions, common prefixes |
| Arguments | Data first, required next, optional as keyword-only |
| Returns | Predictable types; DataFrame in → DataFrame out |
| Errors | Informative messages with context and guidance |
| Types | Always use type hints |
| Docs | NumPy-style docstrings, runnable examples |
| Tests | Mirror source structure, test behavior |
| Format | Black + Ruff, no debates |
Resources
- Tidyverse Design Principles — Philosophy we adapt
- PEP 8 — Python style foundation
- NumPy Docstring Guide
- Effective Pandas — Idiomatic pandas patterns
- Python Packaging User Guide
- pytest Documentation
This style guide is a living document. Update it as the project evolves and new patterns emerge.