§ · Worked example

How We Validate Synthetic Patient Data Against Real US Statistics

"Calibrated" means every field is fitted to a published US reference so its distribution matches the real population — and validation means generating a fresh sample and checking that each marginal lands where the federal source says it should.

We generated 20,000 records and compared 17 headline marginals and 6 numeric means against a version-controlled benchmark library sourced from NHANES, the Census, CDC, BLS, and KFF. Fourteen of the seventeen land within two percentage points of the federal figure. Here is the method, the full table, and the two places it drifts.

Every number below is measured from generated data, seed 42. No signup for the sample.

§ · What "calibrated" means

A generator can produce values that are the right type — a number for age, a category for insurance — without producing the right distribution. Calibration is the step that fixes the distribution to a real one. For each field, we take a published reference — the share of adults who are obese, the split of household income brackets, the mean systolic blood pressure — and draw values so the generated marginal matches that target.
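As a minimal sketch of what "draw values so the generated marginal matches that target" can look like for a single yes/no field, the snippet below draws 20,000 flags at the obesity benchmark used later in this article. The function name `draw_obese` and the plain independent draw are assumptions for illustration, not the generator's actual code.

```python
import numpy as np

TARGET_OBESITY = 0.424   # obesity benchmark quoted in this article's table
N = 20_000               # sample size used in this article

def draw_obese(seed: int, n: int = N, p: float = TARGET_OBESITY) -> np.ndarray:
    """Draw n independent yes/no flags with probability p (illustrative only)."""
    rng = np.random.default_rng(seed)
    return rng.random(n) < p

flags = draw_obese(seed=42)
generated_share = flags.mean()

# Same seed, same draws: the sample is reproducible from run to run.
assert np.array_equal(flags, draw_obese(seed=42))

# One standard error of a proportion at this sample size, in percentage points.
standard_error_pp = 100 * np.sqrt(TARGET_OBESITY * (1 - TARGET_OBESITY) / N)

print(f"generated {100 * generated_share:.1f}% vs target {100 * TARGET_OBESITY:.1f}%")
print(f"standard error ~ {standard_error_pp:.2f} pp")
```

At this sample size the standard error works out to roughly 0.35 percentage points, so a gap of a few tenths of a point is within ordinary sampling noise.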

When a field is conditioned on others, the per-stratum rates are rescaled so the implied overall marginal still hits the published figure. Diabetes prevalence rises with age, for instance, but the age-weighted average across the whole population is held to the national rate. Calibration is explicit and auditable — every rate traces to a source — and it is deterministic: the same seed reproduces the same distribution every run. It is not a model learned from real records, so there is nothing to memorize or leak.
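The rescaling step described above can be sketched with a single multiplicative factor. This is one simple way to do it, not necessarily the method the generator uses; the age strata, their population shares, and the raw rates are made-up illustration values, and only the 11.3% national figure comes from this article.

```python
import numpy as np

# Hypothetical age strata; shares and raw rates are made-up illustration values.
age_share = np.array([0.30, 0.35, 0.35])   # 18-39, 40-59, 60+ (sums to 1)
raw_rate = np.array([0.03, 0.12, 0.25])    # uncalibrated prevalence per stratum
national_rate = 0.113                      # diagnosed-diabetes benchmark from this article

def rescale_rates(share, rate, target):
    """Scale every stratum rate by one factor so the weighted mean equals target."""
    implied = float(np.dot(share, rate))    # overall rate implied by the raw strata
    scaled = rate * (target / implied)
    if np.any(scaled > 1.0):
        raise ValueError("a scaled rate exceeds 1; a single factor is not enough here")
    return scaled

calibrated = rescale_rates(age_share, raw_rate, national_rate)
print(calibrated.round(4))                                  # still rises with age
print(round(float(np.dot(age_share, calibrated)), 4))       # 0.113
```

The relative shape across strata is preserved (prevalence still rises with age) while the population-weighted average is pinned to the published figure.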

§ · The benchmark library

Validation is only as good as the numbers you validate against. The reference values live in a version-controlled benchmark file, each row carrying its source dataset, year, population scope, and a confidence grade. Seventy-plus figures were adversarially reconciled against primary federal tables. A few honest scope notes come with the territory: some federal figures are all-ages while the generator is adults 18+, and the Census classifies areas urban/rural only, with no "suburban" tier — those are flagged rather than scored as misses.

| Domain | Reference sources |
| --- | --- |
| Demographics, geography, income, insurance | US Census / ACS 2022, CPS ASEC 2023 |
| Body measures, vitals, labs | NHANES 2017–2023 |
| Disease prevalence | CDC NDSS, NHANES, AHA |
| Behavioral (smoking, alcohol, sleep, activity) | CDC NHIS / BRFSS 2022 |
| Employment, medications, utilization | BLS CPS 2022, NHANES, KFF, NHIS |
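One way to picture a benchmark row and the scope flag described above is a small record type. This is a hypothetical sketch: the class name, field names, the "A" grade, and the "adults 18+" scope on the smoking row are assumptions, and the real file format is not shown in this article.

```python
from dataclasses import dataclass

GENERATOR_SCOPE = "adults 18+"   # population the generator produces, per this article

@dataclass(frozen=True)
class Benchmark:
    metric: str
    value_pct: float
    source: str
    year: str
    scope: str
    confidence: str   # placeholder grade; the real grading scale is not shown here

LIBRARY = [
    # Scope and grade on this row are assumed for illustration.
    Benchmark("Current smoker", 11.6, "CDC NHIS", "2022", "adults 18+", "A"),
    # Year taken from the sources table; the article marks this reference as all-ages.
    Benchmark("Uninsured", 8.0, "Census CPS ASEC", "2023", "all ages", "A"),
]

def needs_scope_note(row: Benchmark) -> bool:
    """True when the reference population differs from the generator's."""
    return row.scope != GENERATOR_SCOPE

flagged = [row for row in LIBRARY if needs_scope_note(row)]
for row in flagged:
    print(f"{row.metric}: reference is {row.scope}, generator is {GENERATOR_SCOPE}")
```

Keeping the scope on the row itself lets a mismatch be reported as a note instead of being counted as a calibration miss.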

§ · The check: 20,000 records against the references

Plot each generated marginal against its federal benchmark and the picture is one line. A point on the dashed diagonal is a perfect match; the seventeen metrics, spanning 6% (Asian race) to 90% (has a primary-care provider), all sit on it or within about three points of it.

Figure: Generated vs US benchmark, 17 marginals, 20,000 records. Every point sits on the line. (x-axis: US federal benchmark (%), ticks at 0, 30, 60, 90; y-axis: Generated (%), same ticks; dashed diagonal: perfect calibration, y = x.)

The full table, with the source behind each figure:

| Marginal | Generated | US benchmark | Δ (pp) | Source |
| --- | --- | --- | --- | --- |
| Female | 50.7% | 50.9% | −0.2 | Census 2022 |
| Hispanic ethnicity | 18.8% | 19.1% | −0.3 | Census 2022 |
| Asian race | 6.1% | 6.0% | +0.1 | Census 2022 |
| Household income <$25k | 17.8% | 18.5% | −0.7 | CPS ASEC 2023 |
| Household income $50–100k | 27.6% | 28.6% | −1.0 | CPS ASEC 2023 |
| Household income $150k+ | 19.4% | 21.1% | −1.7 | CPS ASEC 2023 |
| Homeowner | 67.6% | 65.8% | +1.8 | Census CPS/HVS |
| Uninsured (all-ages ref) | 9.2% | 8.0% | +1.2 | Census CPS ASEC |
| Current smoker | 11.4% | 11.6% | −0.2 | CDC NHIS 2022 |
| Obesity (BMI ≥ 30) | 42.5% | 42.4% | +0.1 | NHANES 2017–18 |
| Diagnosed diabetes (all-ages ref) | 12.2% | 11.3% | +0.9 | CDC NDSS |
| Prediabetes | 36.0% | 38.0% | −2.0 | CDC NDSS |
| Hypertension (any) | 49.6% | 47.7% | +1.9 | NHANES 2021–23 |
| CKD (stages 1–5) | 15.0% | 13.9% | +1.1 | NHANES 2017–20 |
| Has primary-care provider | 89.3% | 90.3% | −1.0 | CDC NHIS |
| Statin use (adults 40+) | 25.7% | 23.2% | +2.5 | NHANES DB177 |
| Any prescription (12-mo) | 68.0% | 64.8% | +3.2 | CDC NHIS |
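The Δ column and the tolerance count can be recomputed directly from the two percentage columns. The sketch below copies the generated and benchmark values from the table; the variable names are illustrative. Note that the count depends on whether a row at exactly 2.0 points (prediabetes) is treated as inside the tolerance.

```python
# (marginal, generated %, benchmark %) copied from the table above
ROWS = [
    ("Female", 50.7, 50.9),
    ("Hispanic ethnicity", 18.8, 19.1),
    ("Asian race", 6.1, 6.0),
    ("Household income <$25k", 17.8, 18.5),
    ("Household income $50-100k", 27.6, 28.6),
    ("Household income $150k+", 19.4, 21.1),
    ("Homeowner", 67.6, 65.8),
    ("Uninsured (all-ages ref)", 9.2, 8.0),
    ("Current smoker", 11.4, 11.6),
    ("Obesity (BMI >= 30)", 42.5, 42.4),
    ("Diagnosed diabetes (all-ages ref)", 12.2, 11.3),
    ("Prediabetes", 36.0, 38.0),
    ("Hypertension (any)", 49.6, 47.7),
    ("CKD (stages 1-5)", 15.0, 13.9),
    ("Has primary-care provider", 89.3, 90.3),
    ("Statin use (adults 40+)", 25.7, 23.2),
    ("Any prescription (12-mo)", 68.0, 64.8),
]

# Signed gap in percentage points, rounded to match the table's precision.
deltas = {name: round(gen - ref, 1) for name, gen, ref in ROWS}

strictly_under_2 = sum(abs(d) < 2.0 for d in deltas.values())
at_most_2 = sum(abs(d) <= 2.0 for d in deltas.values())
largest = max(deltas, key=lambda name: abs(deltas[name]))

print(strictly_under_2, at_most_2)     # 14 15
print(largest, deltas[largest])        # Any prescription (12-mo) 3.2
```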

The numeric means line up just as closely:

| Mean | Generated | US benchmark | Source |
| --- | --- | --- | --- |
| BMI | 29.5 | 29.6 (M 29.4 / F 29.8) | NHANES 2015–18 |
| Height, male | 175.4 cm | 175.0 cm | NHANES 2021–23 |
| Height, female | 161.7 cm | 161.3 cm | NHANES 2021–23 |
| Systolic BP | 122.0 mmHg | 123 mmHg | NHANES 2017–20 |
| HbA1c | 5.8% | 5.8% | NHANES 2011–20 |
| Sleep | 7.0 h | 7.6 h (workday) | NHANES 2017–20 |

§ · Beyond the average: distribution shape

Matching a mean is a low bar — two very different distributions can share one. The stronger test is whether the whole shape lines up. The Kolmogorov–Smirnov distance (KS-D) is the largest gap between the generated and reference cumulative distributions, from 0 (identical) to 1. We measure it against NHANES micro-data, survey-weighted, at population scale — one million records, where the synthetic-side sampling noise collapses and the number is the true distance.

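Figure: Distribution distance from NHANES, KS-D per variable (0 = identical distributions)

| Variable | KS-D |
| --- | --- |
| BMI | 0.02 |
| Waist | 0.02 |
| Height | 0.03 |
| Weight | 0.04 |
| Age | 0.04 |
| Systolic BP | 0.06 |
| HbA1c | 0.07 |
| Diastolic BP | 0.08 |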

Three of the four body measures sit in the "excellent" tier (KS-D below 0.03); weight (0.035) and the vitals and labs are "good." HbA1c and systolic blood pressure carry a small known residual — they run a touch high — but neither crosses a tier boundary, and both are documented rather than smoothed away. The joint structure holds at the same scale: height and weight correlate at r = 0.51, BMI and waist at r = 0.86, both on their clinical targets.
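The KS distance defined in this section can be computed from two weighted empirical CDFs. The following is a minimal sketch, assuming the reference side carries survey weights and the synthetic side is unweighted; the function name and its arguments are illustrative, and the pipeline behind the published numbers is not shown in this article.

```python
import numpy as np

def weighted_ks_distance(x, y, x_weights=None, y_weights=None):
    """Largest absolute gap between two (optionally weighted) empirical CDFs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    wx = np.ones_like(x) if x_weights is None else np.asarray(x_weights, dtype=float)
    wy = np.ones_like(y) if y_weights is None else np.asarray(y_weights, dtype=float)

    def ecdf_at(values, weights, grid):
        order = np.argsort(values)
        sorted_values = values[order]
        cumulative = np.cumsum(weights[order]) / weights.sum()
        # Number of sample values <= each grid point (0 means none yet).
        count = np.searchsorted(sorted_values, grid, side="right")
        return np.concatenate(([0.0], cumulative))[count]

    # Both CDFs only change at observed values, so that grid is sufficient.
    grid = np.union1d(x, y)
    return float(np.max(np.abs(ecdf_at(x, wx, grid) - ecdf_at(y, wy, grid))))

print(weighted_ks_distance([1, 2, 3], [1, 2, 3]))                 # 0.0
print(weighted_ks_distance([1, 2], [3, 4]))                       # 1.0
print(weighted_ks_distance([0, 1], [0, 1], x_weights=[3, 1]))     # 0.25
```

The third call shows why the weights matter: the same two values give a nonzero distance once one side is reweighted.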

§ · The verdict, and where it drifts

Of the seventeen marginals, fourteen land within two percentage points of the federal figure and every one is within 3.2. Five of the six numeric means are within one unit of the reference, and the full distributions — not just their averages — match NHANES to a KS-D under 0.04 for every body measure. Body measures, blood pressure, HbA1c, obesity, smoking, and the sex and ethnicity splits are effectively exact.

Two gaps are worth naming rather than hiding. Any-prescription use runs 3.2 points high (68.0% vs 64.8%), and mean sleep is about half an hour short of the NHANES workday figure (7.0 h vs 7.6 h). Neither is a scope artifact; both are real targets the generator can tighten. A handful of other rows carry a scope note — the federal source is all-ages while the generator is adults 18+ — where a small offset is expected, not an error.

§ · What calibration does not do

Matching marginals is necessary, not sufficient. A dataset can hit every one of these targets and still be full of impossible people if the fields do not move together correctly, which is the separate job of the joint structure. And calibration only reproduces the relationships it is built to model; a specific real-world correlation nobody encoded will not appear. For a conclusion that hinges on one particular association, validate it against real data first; where that line falls is covered in the piece on what synthetic data is.
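A marginal check cannot catch an "impossible person", but a row-level rule can. The sketch below tests two cross-field rules this article names elsewhere (insulin only for diabetics, BMI derived from height and weight). The column names `height_cm`, `weight_kg`, `on_insulin`, and `has_diabetes`, the 0.1 tolerance, and the three demo rows are all hypothetical.

```python
import pandas as pd

def joint_violations(df: pd.DataFrame, bmi_tolerance: float = 0.1) -> pd.DataFrame:
    """Return rows that break either cross-field rule."""
    # BMI should equal weight (kg) divided by height (m) squared.
    bmi_from_parts = df["weight_kg"] / (df["height_cm"] / 100) ** 2
    bmi_mismatch = (df["bmi"] - bmi_from_parts).abs() > bmi_tolerance
    # Insulin should only appear on records that also carry diabetes.
    insulin_without_diabetes = df["on_insulin"] & ~df["has_diabetes"]
    return df[bmi_mismatch | insulin_without_diabetes]

# Three made-up records: row 0 is consistent, rows 1 and 2 each break one rule.
demo = pd.DataFrame({
    "height_cm": [170.0, 160.0, 180.0],
    "weight_kg": [70.0, 90.0, 80.0],
    "bmi": [24.2, 35.2, 30.0],
    "on_insulin": [False, True, True],
    "has_diabetes": [False, False, True],
})

bad = joint_violations(demo)
print(list(bad.index))   # [1, 2]
```

Every column in `demo` could match its published marginal and the two flagged rows would still be wrong, which is the gap between calibration and joint structure.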

§ · Reproduce it

Every figure here is a few lines. Generate a sample, read it into pandas, and check any marginal against its published reference:

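```python
import pandas as pd

# Load the generated sample: one JSON record per line (20,000 rows, seed 42).
df = pd.read_json("people.jsonl", lines=True)

# Categorical marginals (%)
diabetes = df["diabetes_status"].isin(["diagnosed_t2dm", "type1"]).mean() * 100
obesity = (df["bmi"] >= 30).mean() * 100
smoker = (df["smoking_status"] == "current").mean() * 100
print(f"diagnosed diabetes: {diabetes:.1f}%")    # ~12.2
print(f"obesity (BMI >= 30): {obesity:.1f}%")    # ~42.5  (NHANES 42.4)
print(f"current smoker: {smoker:.1f}%")          # ~11.4  (NHIS 11.6)

# Numeric means
print(df[["bmi", "systolic_bp_avg", "a1c_value"]].mean().round(1))   # 29.5 / 122.0 / 5.8
```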

§ · Frequently asked

Q1
What does "calibrated" actually mean here?

Each field is drawn so its distribution matches a published US reference — NHANES, the Census, CDC, BLS, KFF. Where a field depends on others, the per-stratum rates are rescaled so the overall marginal still hits the national figure. It is explicit and deterministic, not a model learned from real records.

Q2
How close is the generated data to real US statistics?

Across 17 headline marginals measured on 20,000 records, 14 land within two percentage points of the federal benchmark and all within 3.2. Obesity is 42.5% vs 42.4%, mean BMI 29.5 vs 29.6, and mean HbA1c 5.8% vs 5.8%. Five of six numeric means are within one unit.

Q3
What are the sources?

A version-controlled benchmark library: US Census / ACS 2022 and CPS ASEC 2023 for demographics and finances, NHANES 2017–2023 for body measures and labs, CDC NDSS and NHANES for disease prevalence, and CDC NHIS / BRFSS for behavior. Each reference row carries its dataset, year, population scope, and a confidence grade.

Q4
Does matching the marginals make the data realistic?

It is necessary but not sufficient. Marginals can be perfect while the cross-field structure is broken. Calibration handles the distributions; the joint structure — insulin only for diabetics, BMI from height and weight — is a separate guarantee, covered in the joint-distributions study.