§ · Worked example

Analyzing Synthetic Patient Data: What Drives HbA1c?

In 5,000 synthetic US adults, HbA1c rises with age and body mass, not income. Age and BMI each push it up independently; household income shows essentially no association. The age, BMI, and diabetes-status gradients land where real US epidemiology says they should.

This is a real analysis, not a template. We set the question first, generated a seeded sample, ran the numbers, and report what came back, including the honest nulls. One command regenerates the data, and the snippets below reproduce every figure.

Figures below are from a July 2026 run at seed 42. Same seed, same numbers.

§ · The question, set first

Before touching the data we fixed one outcome to test: of age, body weight, and household income, which actually move a person's HbA1c — and is there any relationship at all? HbA1c is the standard three-month blood-sugar marker; clinically it climbs with age and adiposity, and the diagnostic bands are well defined (normal below 5.7%, prediabetes 5.7–6.4%, diabetes 6.5% or higher). A calibrated synthetic dataset should reproduce those relationships. If it does, you can prototype an analysis on it and expect the same directions you would see in real patient data.
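The diagnostic bands quoted above can be written as a small lookup. This is a minimal sketch: the function name `a1c_band` is hypothetical, and treating values between 6.4 and 6.5 as prediabetes (everything below 6.5) is an assumption about how to close the gap between the printed ranges.

```python
def a1c_band(a1c_percent: float) -> str:
    """Map an HbA1c value (%) to its diagnostic band.

    Thresholds follow the article: normal below 5.7, prediabetes 5.7-6.4,
    diabetes 6.5 or higher. Values in (6.4, 6.5) are treated as prediabetes,
    which is an assumption of this sketch.
    """
    if a1c_percent < 5.7:
        return "normal"
    if a1c_percent < 6.5:
        return "prediabetes"
    return "diabetes"


for value in (5.3, 5.7, 6.4, 6.5, 7.2):
    print(value, a1c_band(value))
```

Applied to a loaded frame, something like `df["a1c_value"].map(a1c_band)` would label every row by band.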

§ · The data, and how each field varies

Five thousand synthetic US adults, generated at seed 42 and loaded into pandas:

import pandas as pd
df = pd.read_csv("persons.csv")            # 5,000 rows, 77 columns
df[["age", "weight_kg", "bmi", "a1c_value"]].describe()

| Field | Mean | SD | Range |
| --- | --- | --- | --- |
| age (yrs) | 47.4 | 18.7 | 18 – 95 |
| weight (kg) | 84.2 | 22.2 | 34.9 – 200.1 |
| BMI | 29.5 | 6.9 | 15.0 – 65.0 |
| HbA1c (%) | 5.76 | 0.97 | 4.5 – 12.9 |

The mean BMI of 29.5 matches the US adult average from NHANES almost exactly, and HbA1c is right-skewed (median 5.5%, a long tail to 12.9%) — the shape of a population that is mostly non-diabetic with a diabetic minority.

Income is captured as a bracket, not a raw salary, so we look at its distribution rather than a mean:

| Household income | < $25k | $25–50k | $50–100k | $100–150k | $150k+ |
| --- | --- | --- | --- | --- | --- |
| share | 17.9% | 18.8% | 27.7% | 16.5% | 19.1% |
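A distribution like this comes from a normalized value count. The sketch below runs on a small toy frame rather than the real file. The column name and bracket labels are the ones the article's own snippets use; the row values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for persons.csv (values invented for illustration only).
toy = pd.DataFrame({"household_income_bracket": [
    "under25k", "r25to50k", "r50to100k", "r50to100k",
    "r100to150k", "over150k", "r50to100k", "under25k",
]})

# Lowest to highest, so the output reads like the table rather than by frequency.
order = ["under25k", "r25to50k", "r50to100k", "r100to150k", "over150k"]

shares = (
    toy["household_income_bracket"]
    .value_counts(normalize=True)   # fraction of rows per bracket
    .reindex(order)
)
print((shares * 100).round(1))
```

Swapping `toy` for the loaded `df` should give the shares for the full sample.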

§ · Does anything move HbA1c?

Correlate each candidate against HbA1c. Age, weight, and BMI are continuous, so we use Pearson; income is an ordinal bracket, so we rank it 1–5 and use Spearman:

rank = {"under25k":1, "r25to50k":2, "r50to100k":3, "r100to150k":4, "over150k":5}
df["income_rank"] = df["household_income_bracket"].map(rank)

df[["a1c_value", "age", "bmi", "weight_kg"]].corr()["a1c_value"]           # Pearson
df[["a1c_value", "income_rank"]].corr(method="spearman").iloc[0, 1]        # Spearman

| vs HbA1c | Correlation | Read |
| --- | --- | --- |
| age | +0.28 | moderate, positive |
| BMI | +0.22 | positive |
| weight (kg) | +0.20 | positive, weaker than BMI |
| income (rank) | +0.00 | none |

Weight tracks HbA1c a little less tightly than BMI, exactly as expected — BMI normalizes weight for height, so it is the cleaner signal. Income is flat: Spearman +0.004.
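Spearman suits the income bracket because it only uses order: it is Pearson correlation computed on ranks, with ties sharing an average rank. The sketch below checks that equivalence on random toy data (not the article's dataset), so the printed value itself carries no meaning.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # toy data, not the article's dataset
toy = pd.DataFrame({
    "income_rank": rng.integers(1, 6, size=500),    # ordinal 1..5, heavily tied
    "a1c_value": rng.normal(5.7, 0.9, size=500),
})

# Spearman as pandas computes it directly...
spearman = toy.corr(method="spearman").loc["a1c_value", "income_rank"]

# ...and the same quantity as Pearson on average ranks.
pearson_on_ranks = toy.rank().corr(method="pearson").loc["a1c_value", "income_rank"]

print(round(spearman, 6), round(pearson_on_ranks, 6))
```

Because only ranks matter, the result does not depend on how far apart the brackets are in dollars, which is why coding them 1 to 5 is harmless here.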

Figure: Correlation with HbA1c by predictor (age, BMI, weight, income). Age and BMI carry the signal; income does not.

§ · Which one drives it, holding the others fixed?

Simple correlations can be confounded, so we fit a standardized regression — HbA1c on age, BMI, and income together, all z-scored, so the coefficients are directly comparable:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(df[["age", "bmi", "income_rank"]])
y = StandardScaler().fit_transform(df[["a1c_value"]]).ravel()
LinearRegression().fit(X, y).coef_        # -> [0.281, 0.219, -0.036]

| Predictor | Standardized β | Verdict |
| --- | --- | --- |
| age | +0.28 | real driver |
| BMI | +0.22 | real driver |
| income | −0.04 | negligible |

Model R² = 0.13: age and BMI explain about 13% of the variance in HbA1c; the rest is clinical status and individual variation. The partial effects barely differ from the raw correlations because age and BMI are nearly independent here (r ≈ 0.01), so neither steals the other's credit. Their pushes on HbA1c simply add up.
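The R² can be cross-checked from the numbers already reported. For an ordinary least squares fit on z-scored variables, R² equals the sum over predictors of (standardized β × that predictor's Pearson correlation with the outcome). The sketch below verifies the identity on random toy data, then plugs in the article's rounded figures. One approximation: the income correlation reported here is Spearman rather than Pearson, but that term is negligible either way.

```python
import numpy as np

rng = np.random.default_rng(0)                  # toy data, not the article's dataset
n = 2000
X = rng.normal(size=(n, 3))
X[:, 2] += 0.3 * X[:, 0]                        # deliberately correlate two predictors
y = 0.3 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=n)


def zscore(a):
    """Center and scale to unit variance (population SD, like StandardScaler)."""
    return (a - a.mean(axis=0)) / a.std(axis=0)


Xz, yz = zscore(X), zscore(y)
beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)  # no intercept needed: all centered
r = Xz.T @ yz / n                               # Pearson r of each predictor with y

r2_direct = 1 - np.sum((yz - Xz @ beta) ** 2) / np.sum(yz ** 2)
r2_identity = float(beta @ r)
print(round(r2_direct, 6), round(r2_identity, 6))

# Same identity with the article's rounded betas and correlations.
reported = 0.281 * 0.28 + 0.219 * 0.22 + (-0.036) * 0.004
print(round(reported, 3))                       # about 0.13, as reported
```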

§ · The gradients, in plain numbers

Correlations hide the shape. Group means show it — and confirm the direction is monotonic, not noise:

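# Right-closed bins (17, 34], (34, 49], ... give 18–34, 35–49, 50–64, 65+ for whole-number ages
df.groupby(pd.cut(df.age, [17, 34, 49, 64, 120]))["a1c_value"].mean()
# Right-closed here too, so a BMI of exactly 25.0 or 30.0 falls in the lower category
df.groupby(pd.cut(df.bmi, [0, 25, 30, 100]))["a1c_value"].mean()
df.groupby("diabetes_status")["a1c_value"].mean()

| Age band | Mean HbA1c |
| --- | --- |
| 18–34 | 5.48% |
| 35–49 | 5.63% |
| 50–64 | 5.90% |
| 65+ | 6.19% |

| BMI category | Mean HbA1c |
| --- | --- |
| normal (<25) | 5.47% |
| overweight (25–30) | 5.75% |
| obese (30+) | 5.96% |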

Both climb cleanly, step by step. By income bracket, HbA1c is flat — 5.80, 5.76, 5.77, 5.73, 5.76 from lowest to highest. No gradient.
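The "monotonic, not noise" reading can be checked mechanically. The sketch below takes the group means reported in this section and asks two things of each series: does it never step down, and how wide is its spread? The dictionary labels are just for display.

```python
import pandas as pd

# Group means as reported in this section.
group_means = {
    "age band": pd.Series([5.48, 5.63, 5.90, 6.19],
                          index=["18-34", "35-49", "50-64", "65+"]),
    "BMI category": pd.Series([5.47, 5.75, 5.96],
                              index=["<25", "25-30", "30+"]),
    "income bracket": pd.Series([5.80, 5.76, 5.77, 5.73, 5.76],
                                index=["<25k", "25-50k", "50-100k", "100-150k", "150k+"]),
}

# is_monotonic_increasing is True when the values never decrease along the index order.
summary = {
    name: (bool(s.is_monotonic_increasing), round(float(s.max() - s.min()), 2))
    for name, s in group_means.items()
}

for name, (rising, spread) in summary.items():
    print(f"{name}: never decreases = {rising}, spread = {spread} points")
```

Age and BMI pass with spreads of 0.71 and 0.49 points. Income fails the ordering test and spans only 0.07 points, which is what a flat relationship looks like.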

Figure: Mean HbA1c (%) by age band.
Figure: Mean HbA1c (%) by BMI category.

And grouped by clinical status, the means line up with the diagnostic bands (the prediabetic mean sits right at the 5.7% threshold), which makes this the sharpest fidelity check of all:

| Diabetes status | n | Mean HbA1c | Diagnostic band |
| --- | --- | --- | --- |
| none | 2,430 | 5.32% | normal (<5.7) |
| prediabetic | 1,831 | 5.67% | prediabetes (5.7–6.4) |
| T2DM (diagnosed) | 547 | 7.28% | diabetes (6.5+) |
| type 1 | 67 | 8.96% | diabetes (6.5+) |

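The diabetic group averages 7.44% against 5.32% for non-diabetics, a 2.1-point gap. The non-diabetic, T2DM, and type 1 means sit squarely in their clinical ranges; the prediabetic mean of 5.67% lands a hair under the 5.7% cutoff and rounds to it at the one-decimal precision the bands are stated in.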
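A pooled subgroup mean is the n-weighted average of its parts, not the plain average of the group means. The sketch below pools the two diabetic rows listed in the table. It comes out slightly above the 7.44% quoted for the diabetic group. The four listed rows total 4,875 of the 5,000 records, so the quoted figure may also include diabetic records in statuses not shown. That reading is an assumption, not something the table confirms.

```python
# (n, mean HbA1c %) for the two diabetic rows shown in the table.
rows = {
    "T2DM (diagnosed)": (547, 7.28),
    "type 1": (67, 8.96),
}

total_n = sum(n for n, _ in rows.values())
pooled_mean = sum(n * mean for n, mean in rows.values()) / total_n

print(total_n, round(pooled_mean, 2))   # 614 7.46
```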

Figure: Mean HbA1c (%) by diabetes status. Dashed lines mark the clinical thresholds (prediabetes 5.7%, diabetes 6.5%).

§ · The outcome

HbA1c is driven by age, body mass, and clinical status, not by income. Age (β +0.28) and BMI (β +0.22) each raise it independently and additively; income moves it essentially not at all (β −0.04, flat group means). The diabetic subgroup runs 2.1 points higher, and the group means line up with their diagnostic bands (the prediabetic mean, 5.67%, sits right at the 5.7% threshold).

For age, adiposity, and clinical status, that is what real US epidemiology says: HbA1c rises with age and adiposity, the diabetes and prediabetes prevalences here (14.8% and 36.6%) closely track CDC figures, and the clinical gradients hold. The signal a real analyst would find in NHANES is present in the synthetic data, which is the whole point of calibration. Prototype your query here, and those relationships carry over.

One honest caveat. Generation only reproduces the relationships it models. Here age and BMI are drawn nearly independently (r ≈ 0.01), whereas real cohorts show a mild age–BMI correlation, so a study that hinges on that specific interaction should validate against real data first. Calibrated synthetic data is for realistic structure and privacy-safe iteration, not for discovering a correlation nobody built in. We cover that line in "What is synthetic data".

§ · Reproduce it

Generation is deterministic, so this command yields the exact 5,000 rows analyzed above:

# with the person CLI
person-cli -n 5000 -s 42 -f csv > persons.csv

# or pull the same volume from the API — see the Quickstart
#   POST /v1/datasets/person  { "clientId": "...", "count": 5000, "seed": 42 }
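One way to confirm a run is byte-identical is to hash the output file and compare digests across runs. This is a minimal sketch using only the standard library. The helper name `sha256_of` is hypothetical, and the final lines assume the file was written to `persons.csv` as in the command above.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Only hashes the file if it is present in the working directory.
if Path("persons.csv").exists():
    print(sha256_of("persons.csv"))
```

Generating twice at the same seed into two files and comparing their digests should give equal strings if the output is deterministic.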

New to the API? Start with the developer Quickstart →

§ · Frequently asked

Q1
Is a correlation found in synthetic data meaningful?

It is meaningful for the relationships the generator was built to model — here, HbA1c against age, BMI, and diabetes status, all calibrated to public references. It reflects real epidemiology by construction. It is not evidence of a novel correlation; for that, validate against real data.

Q2
Why doesn't income predict HbA1c here?

The generator models HbA1c from age, BMI, and diabetes status, and draws income largely independently of them. So income carries no signal for HbA1c in this dataset (Spearman +0.00, flat group means). Real-world data shows a weak socioeconomic gradient; this build does not encode one.

Q3
How do I reproduce these exact numbers?

Generate 5,000 records at seed 42 (person-cli -n 5000 -s 42 -f csv, or the same count and seed via the API), load into pandas, and run the snippets above. Output is byte-identical for a given seed, so the figures match.

Q4
Can I trust an analysis run on synthetic data?

For building and testing an analysis pipeline, demos, and many ML tasks, yes — the structure behaves like real data. For a published conclusion that depends on one specific real-world correlation, treat synthetic data as the rehearsal and confirm on real records.