§ · Worked example

How We Validate Synthetic Patient Data Against Real US Statistics

"Calibrated" means every field is fitted to a published US reference so its distribution matches the real population — and validation means generating a fresh sample and checking that each marginal lands where the federal source says it should.

We generated 20,000 records and compared 17 headline marginals and 6 numeric means against a version-controlled benchmark library sourced from NHANES, the Census, CDC, BLS, and KFF. Fourteen of the seventeen land within two percentage points of the federal figure. Here is the method, the full table, and the two places it drifts.

Every number below is measured from generated data, seed 42.

§ · What "calibrated" means

A generator can produce values that are the right type — a number for age, a category for insurance — without producing the right distribution. Calibration is the step that fixes the distribution to a real one. For each field, we take a published reference — the share of adults who are obese, the split of household income brackets, the mean systolic blood pressure — and draw values so the generated marginal matches that target.

When a field is conditioned on others, the per-stratum rates are rescaled so the implied overall marginal still hits the published figure. Diabetes prevalence rises with age, for instance, but the age-weighted average across the whole population is held to the national rate. Calibration is explicit and auditable — every rate traces to a source — and it is deterministic: the same seed reproduces the same distribution every run. It is not a model learned from real records, so there is nothing to memorize or leak.
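The rescaling step can be pictured with a minimal sketch. The age strata, population shares, and raw rates below are hypothetical placeholders, and simple proportional scaling is an assumption rather than the generator's documented scheme. Only the 11.3% target is taken from the diagnosed-diabetes benchmark in the table further down.

```python
import numpy as np

def rescale_to_marginal(stratum_rates, stratum_shares, target):
    """Scale per-stratum rates so their share-weighted mean equals target."""
    rates = np.asarray(stratum_rates, dtype=float)
    shares = np.asarray(stratum_shares, dtype=float)
    shares = shares / shares.sum()           # normalise population shares
    implied = float(shares @ rates)          # marginal implied by the raw rates
    scaled = rates * (target / implied)      # proportional rescale (assumed scheme)
    if scaled.max() > 1.0:
        raise ValueError("proportional scaling pushed a rate above 1")
    return scaled

# Hypothetical strata: ages 18-44, 45-64, 65+ (illustrative numbers only).
shares = [0.45, 0.33, 0.22]
raw_rates = [0.03, 0.14, 0.25]
rates = rescale_to_marginal(raw_rates, shares, target=0.113)

# Deterministic draw: the same seed yields the same records on every run.
rng = np.random.default_rng(42)
stratum = rng.choice(len(shares), size=20_000, p=shares)
has_diabetes = rng.random(20_000) < rates[stratum]

print(round(float(np.dot(shares, rates)), 3))   # share-weighted rate after rescaling
print(round(float(has_diabetes.mean()), 3))     # sampled rate, close to the target
```

The ordering across strata is preserved (older strata keep higher rates) while the weighted average is pinned to the target. Proportional scaling can push a rate above 1 when the target is far from the raw marginal, which is why the sketch raises instead of clipping silently.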

§ · The benchmark library

Validation is only as good as the numbers you validate against. The reference values live in a version-controlled benchmark file, each row carrying its source dataset, year, population scope, and a confidence grade. Seventy-plus figures were adversarially reconciled against primary federal tables. A few honest scope notes come with the territory: some federal figures are all-ages while the generator is adults 18+, and the Census classifies areas urban/rural only, with no "suburban" tier — those are flagged rather than scored as misses.
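As a rough illustration of what one benchmark row and its scope flag might look like, here is a small sketch. The class name, field names, and the grade value "A" are hypothetical; the uninsured figures (9.2% generated, 8.0% all-ages reference) come from the results table in this article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Benchmark:
    metric: str
    value: float    # reference value, in percent
    source: str
    year: str
    scope: str      # population the source covers
    grade: str      # confidence grade ("A" below is a placeholder)

GENERATOR_SCOPE = "adults 18+"

def score(benchmark: Benchmark, generated: float) -> dict:
    """Return the gap in percentage points and flag scope mismatches."""
    delta = round(generated - benchmark.value, 1)
    if benchmark.scope == GENERATOR_SCOPE:
        status = "scored"
    else:
        status = "flagged: scope differs"
    return {"metric": benchmark.metric, "delta_pp": delta, "status": status}

# Year as listed for CPS ASEC in the sources table; grade is illustrative.
uninsured = Benchmark("Uninsured", 8.0, "Census CPS ASEC", "2023", "all ages", "A")
print(score(uninsured, generated=9.2))
```

Keeping scope as a field on the row lets a mismatch be reported alongside the gap instead of being counted as a miss.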

Domain	Reference sources
Demographics, geography, income, insurance	US Census / ACS 2022, CPS ASEC 2023
Body measures, vitals, labs	NHANES 2017–2023
Disease prevalence	CDC NDSS, NHANES, AHA
Behavioral (smoking, alcohol, sleep, activity)	CDC NHIS / BRFSS 2022
Employment, medications, utilization	BLS CPS 2022, NHANES, KFF, NHIS

§ · The check: 20,000 records against the references

Plot each generated marginal against its federal benchmark and the picture is close to a single line. A point on the dashed diagonal is a perfect match; the seventeen metrics, spanning 6% (Asian race) to 90% (has a primary-care provider), all sit on or near it, with the largest gap at 3.2 points.

Generated vs US benchmark: 17 marginals, 20,000 records. Every point sits within 3.2 points of the line.

The full table, with the source behind each figure:

Marginal	Generated	US benchmark	Δ (pp)	Source
Female	50.7%	50.9%	−0.2	Census 2022
Hispanic ethnicity	18.8%	19.1%	−0.3	Census 2022
Asian race	6.1%	6.0%	+0.1	Census 2022
Household income <$25k	17.8%	18.5%	−0.7	CPS ASEC 2023
Household income $50–100k	27.6%	28.6%	−1.0	CPS ASEC 2023
Household income $150k+	19.4%	21.1%	−1.7	CPS ASEC 2023
Homeowner	67.6%	65.8%	+1.8	Census CPS/HVS
Uninsured (all-ages ref)	9.2%	8.0%	+1.2	Census CPS ASEC
Current smoker	11.4%	11.6%	−0.2	CDC NHIS 2022
Obesity (BMI ≥ 30)	42.5%	42.4%	+0.1	NHANES 2017–18
Diagnosed diabetes (all-ages ref)	12.2%	11.3%	+0.9	CDC NDSS
Prediabetes	36.0%	38.0%	−2.0	CDC NDSS
Hypertension (any)	49.6%	47.7%	+1.9	NHANES 2021–23
CKD (stages 1–5)	15.0%	13.9%	+1.1	NHANES 2017–20
Has primary-care provider	89.3%	90.3%	−1.0	CDC NHIS
Statin use (adults 40+)	25.7%	23.2%	+2.5	NHANES DB177
Any prescription (12-mo)	68.0%	64.8%	+3.2	CDC NHIS
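The Δ column is plain subtraction, so the headline counts can be re-derived from the table. This sketch uses the rounded values as printed, and the variable names are illustrative. On rounded values prediabetes sits exactly on the 2.0 boundary, so it is reported separately rather than forced to one side.

```python
# (marginal, generated %, benchmark %) copied from the table above
ROWS = [
    ("Female", 50.7, 50.9),
    ("Hispanic ethnicity", 18.8, 19.1),
    ("Asian race", 6.1, 6.0),
    ("Household income <$25k", 17.8, 18.5),
    ("Household income $50-100k", 27.6, 28.6),
    ("Household income $150k+", 19.4, 21.1),
    ("Homeowner", 67.6, 65.8),
    ("Uninsured", 9.2, 8.0),
    ("Current smoker", 11.4, 11.6),
    ("Obesity", 42.5, 42.4),
    ("Diagnosed diabetes", 12.2, 11.3),
    ("Prediabetes", 36.0, 38.0),
    ("Hypertension", 49.6, 47.7),
    ("CKD", 15.0, 13.9),
    ("Has primary-care provider", 89.3, 90.3),
    ("Statin use (40+)", 25.7, 23.2),
    ("Any prescription", 68.0, 64.8),
]

# Gap in percentage points, rounded to the table's one-decimal precision.
deltas = {name: round(gen - ref, 1) for name, gen, ref in ROWS}

inside = [n for n, d in deltas.items() if abs(d) < 2.0]
boundary = [n for n, d in deltas.items() if abs(d) == 2.0]
outside = [n for n, d in deltas.items() if abs(d) > 2.0]
worst = max(deltas, key=lambda n: abs(deltas[n]))

print(len(inside))            # 14
print(boundary)               # ['Prediabetes']
print(outside)                # ['Statin use (40+)', 'Any prescription']
print(worst, deltas[worst])   # Any prescription 3.2
```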

The numeric means line up just as closely:

Mean	Generated	US benchmark	Source
BMI	29.5	29.6 (M 29.4 / F 29.8)	NHANES 2015–18
Height, male	175.4 cm	175.0 cm	NHANES 2021–23
Height, female	161.7 cm	161.3 cm	NHANES 2021–23
Systolic BP	122.0 mmHg	123 mmHg	NHANES 2017–20
HbA1c	5.8%	5.8%	NHANES 2011–20
Sleep	7.0 h	7.6 h (workday)	NHANES 2017–20

§ · Beyond the average: distribution shape

Matching a mean is a low bar — two very different distributions can share one. The stronger test is whether the whole shape lines up. The Kolmogorov–Smirnov distance (KS-D) is the largest gap between the generated and reference cumulative distributions, from 0 (identical) to 1. We measure it against NHANES micro-data, survey-weighted, at population scale — one million records, where the synthetic-side sampling noise collapses and the number is the true distance.

Distribution distance from NHANES — KS-D per variable (0 = identical distributions)
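For readers who want to compute the same kind of distance, here is a minimal NumPy sketch of a two-sample KS distance with survey weights on the reference side. The function names are hypothetical, and the demo arrays are random stand-ins (not NHANES micro-data, with an arbitrary spread); it illustrates the definition above and is not the pipeline behind the chart.

```python
import numpy as np

def weighted_ecdf(values, weights, grid):
    """Weighted empirical CDF of `values`, evaluated at each point in `grid`."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    sorted_vals = values[order]
    cum = np.cumsum(weights[order]) / weights.sum()
    # Number of sorted values <= each grid point indexes the cumulative weight.
    idx = np.searchsorted(sorted_vals, grid, side="right")
    return np.concatenate(([0.0], cum))[idx]

def ks_distance(synthetic, reference, reference_weights=None):
    """Largest absolute gap between the synthetic and reference CDFs."""
    synthetic = np.asarray(synthetic, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if reference_weights is None:
        reference_weights = np.ones_like(reference)
    grid = np.union1d(synthetic, reference)   # every point where a CDF can jump
    f_syn = weighted_ecdf(synthetic, np.ones_like(synthetic), grid)
    f_ref = weighted_ecdf(reference, reference_weights, grid)
    return float(np.max(np.abs(f_syn - f_ref)))

# Stand-in data only: random draws in place of real survey records.
rng = np.random.default_rng(42)
reference = rng.normal(29.6, 6.5, 5_000)
weights = rng.uniform(0.5, 2.0, 5_000)      # stand-in survey weights
synthetic = rng.normal(29.6, 6.5, 50_000)
print(round(ks_distance(synthetic, reference, weights), 3))
```

For the unweighted case, `scipy.stats.ks_2samp` returns the same statistic; the hand-rolled version is shown here because survey weights need a weighted CDF on the reference side.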

Three of the four body measures sit in the "excellent" tier (KS-D below 0.03); weight (0.035) and the vitals and labs are "good." HbA1c and systolic blood pressure carry a small known residual — they run a touch high — but neither crosses a tier boundary, and both are documented rather than smoothed away. The joint structure holds at the same scale: height and weight correlate at r = 0.51, BMI and waist at r = 0.86, both on their clinical targets.
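A correlation target can be checked in the same spirit. The sketch below is illustrative: the function name is hypothetical and the arrays are standardized stand-ins constructed to correlate at roughly 0.51, not generator output.

```python
import numpy as np

def correlation_gap(x, y, target_r):
    """Return Pearson r between x and y and its signed gap from target_r."""
    r = float(np.corrcoef(x, y)[0, 1])
    return r, r - target_r

# Stand-in data: build y so that corr(x, y) is about 0.51 by construction.
rng = np.random.default_rng(42)
n = 100_000
height_z = rng.normal(0.0, 1.0, n)
noise = rng.normal(0.0, 1.0, n)
weight_z = 0.51 * height_z + np.sqrt(1 - 0.51 ** 2) * noise

r, gap = correlation_gap(height_z, weight_z, target_r=0.51)
print(round(r, 3), round(gap, 3))
```

On real output the two arrays would be the generated height and weight columns, with the tolerance on the gap chosen to suit the sample size.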

§ · The verdict, and where it drifts

Of the seventeen marginals, fourteen land within two percentage points of the federal figure and every one is within 3.2. Five of the six numeric means are within one unit of the reference, and the full distributions — not just their averages — match NHANES to a KS-D under 0.04 for every body measure. Body measures, blood pressure, HbA1c, obesity, smoking, and the sex and ethnicity splits are effectively exact.

Two gaps are worth naming rather than hiding. Any-prescription use runs 3.2 points high (68.0% vs 64.8%), and mean sleep is about half an hour short of the NHANES workday figure (7.0 h vs 7.6 h). Neither is a scope artifact; both are real targets the generator can tighten. A handful of other rows carry a scope note — the federal source is all-ages while the generator is adults 18+ — where a small offset is expected, not an error.

§ · What calibration does not do

Matching marginals is necessary, not sufficient. A dataset can hit every one of these targets and still be full of impossible people if the fields do not move together correctly, which is the separate job of the joint structure. And calibration only reproduces the relationships it is built to model; a specific real-world correlation nobody encoded will not appear. For a conclusion that hinges on one particular association, validate it against real data first. That boundary is spelled out in the article on what synthetic data is.
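To make "impossible people" concrete, here is a small sketch of two cross-field checks: insulin only for diabetics, and BMI consistent with height and weight. The columns `bmi` and `diabetes_status` and the two diabetic labels appear in the reproduction snippet in this article; `height_cm`, `weight_kg`, `on_insulin`, the `"none"` label, and the 0.1 tolerance are assumptions made for illustration.

```python
import pandas as pd

def joint_violations(df: pd.DataFrame) -> pd.Series:
    """Count rows that break two simple cross-field rules."""
    bmi_from_parts = df["weight_kg"] / (df["height_cm"] / 100) ** 2
    diabetic = df["diabetes_status"].isin(["diagnosed_t2dm", "type1"])
    return pd.Series({
        # Stored BMI should agree with height and weight (tolerance assumed).
        "bmi_mismatch": int(((df["bmi"] - bmi_from_parts).abs() > 0.1).sum()),
        # Insulin use should only appear on diabetic records.
        "insulin_without_diabetes": int((df["on_insulin"] & ~diabetic).sum()),
    })

# Toy rows: the first breaks the insulin rule, the second the BMI rule.
toy = pd.DataFrame({
    "height_cm": [170.0, 160.0],
    "weight_kg": [72.25, 64.0],
    "bmi": [25.0, 31.0],
    "diabetes_status": ["none", "type1"],
    "on_insulin": [True, True],
})
print(joint_violations(toy))
```

Both toy rows could sit in a dataset whose marginals are perfect, which is the point: marginal checks cannot see either violation.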

§ · Reproduce it

Every figure here takes only a few lines to reproduce. Generate a sample, read it into pandas, and check any marginal against its published reference:

import pandas as pd

# One JSON object per line; the figures in this article use 20,000 rows, seed 42
df = pd.read_json("people.jsonl", lines=True)

# categorical marginals (%)
diabetic = df["diabetes_status"].isin(["diagnosed_t2dm", "type1"])
print(round(diabetic.mean() * 100, 1))                              # ~12.2  (NDSS 11.3)
print(round((df["bmi"] >= 30).mean() * 100, 1))                     # ~42.5  (NHANES 42.4)
print(round((df["smoking_status"] == "current").mean() * 100, 1))   # ~11.4  (NHIS 11.6)

# numeric means, in column order
print(df[["bmi", "systolic_bp_avg", "a1c_value"]].mean().round(1))  # 29.5 / 122.0 / 5.8

§ · Frequently asked

What does "calibrated" actually mean here?

Each field is drawn so its distribution matches a published US reference — NHANES, the Census, CDC, BLS, KFF. Where a field depends on others, the per-stratum rates are rescaled so the overall marginal still hits the national figure. It is explicit and deterministic, not a model learned from real records.

How close is the generated data to real US statistics?

Across 17 headline marginals measured on 20,000 records, 14 land within two percentage points of the federal benchmark and all within 3.2. Obesity is 42.5% vs 42.4%, mean BMI 29.5 vs 29.6, and mean HbA1c 5.8% vs 5.8%. Five of six numeric means are within one unit.

What are the sources?

A version-controlled benchmark library: US Census / ACS 2022 and CPS ASEC 2023 for demographics and finances, NHANES 2017–2023 for body measures and labs, CDC NDSS and NHANES for disease prevalence, and CDC NHIS / BRFSS for behavior. Each reference row carries its dataset, year, population scope, and a confidence grade.

Does matching the marginals make the data realistic?

It is necessary but not sufficient. Marginals can be perfect while the cross-field structure is broken. Calibration handles the distributions; the joint structure — insulin only for diabetics, BMI from height and weight — is a separate guarantee, covered in the joint-distributions study.