Analyzing Synthetic Patient Data: What Drives HbA1c?
In 5,000 synthetic US adults, HbA1c rises with age and body mass, not income. Age and BMI each push it up independently; household income shows essentially no association. The age, BMI, and diabetes-status gradients land where real US epidemiology says they should.
This is a real analysis, not a template. We set the question first, generated a seeded sample, ran the numbers, and report what came back — including the honest nulls. You can reproduce every figure with one command.
Figures below are from a July 2026 run at seed 42. Same seed, same numbers.
§ · The question, set first
Before touching the data we fixed one outcome to test: of age, body weight, and household income, which actually move a person's HbA1c — and is there any relationship at all? HbA1c is the standard three-month blood-sugar marker; clinically it climbs with age and adiposity, and the diagnostic bands are well defined (normal below 5.7%, prediabetes 5.7–6.4%, diabetes 6.5% or higher). A calibrated synthetic dataset should reproduce those relationships. If it does, you can prototype an analysis on it and expect the same directions you would see in real patient data.
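The diagnostic bands quoted above translate directly into a lookup. The sketch below is a minimal illustration, not part of the generator: the function name `a1c_band` is ours, and it assumes the bands are half-open intervals, so a value between 6.4 and 6.5 still counts as prediabetes.

```python
def a1c_band(a1c_percent: float) -> str:
    """Map an HbA1c value (in %) to the diagnostic band quoted in the text.

    ASSUMPTION: bands are treated as half-open intervals,
    [0, 5.7) normal, [5.7, 6.5) prediabetes, [6.5, inf) diabetes.
    """
    if a1c_percent < 5.7:
        return "normal"
    if a1c_percent < 6.5:
        return "prediabetes"
    return "diabetes"


# Label a few illustrative values.
for value in (5.3, 5.7, 6.4, 6.5, 9.0):
    print(value, a1c_band(value))
```

With the dataset loaded, the same function could be applied row by row, for example `df["a1c_value"].map(a1c_band)`.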
§ · The data, and how each field varies
Five thousand synthetic US adults, generated at seed 42 and loaded into pandas:
import pandas as pd
df = pd.read_csv("persons.csv") # 5,000 rows, 77 columns
df[["age", "weight_kg", "bmi", "a1c_value"]].describe()

| Field | Mean | SD | Range |
|---|---|---|---|
| age (yrs) | 47.4 | 18.7 | 18 – 95 |
| weight (kg) | 84.2 | 22.2 | 34.9 – 200.1 |
| BMI | 29.5 | 6.9 | 15.0 – 65.0 |
| HbA1c (%) | 5.76 | 0.97 | 4.5 – 12.9 |
The mean BMI of 29.5 matches the US adult average from NHANES almost exactly, and HbA1c is right-skewed (median 5.5%, a long tail to 12.9%) — the shape of a population that is mostly non-diabetic with a diabetic minority.
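Right skew shows up as a mean above the median and a positive skewness statistic. The sketch below demonstrates the check on a made-up stand-in sample (a tight bulk plus a small high-valued minority); the mixture parameters are illustrative assumptions, not the generator's settings, and the numbers it produces are not the article's figures.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in sample, NOT the article's data: a mostly low-valued
# bulk plus a smaller high-valued minority, which produces a right tail.
rng = np.random.default_rng(0)
bulk = rng.normal(5.4, 0.3, size=850)
tail = rng.normal(7.5, 1.2, size=150)
a1c = pd.Series(np.concatenate([bulk, tail]), name="a1c_value")

summary = {
    "mean": a1c.mean(),
    "median": a1c.median(),
    "skew": a1c.skew(),  # sample skewness; positive means a right tail
}
print(summary)
```

On the real file, the same three statistics should be available from `df["a1c_value"].agg(["mean", "median", "skew"])`.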
Income is captured as a bracket, not a raw salary, so we look at its distribution rather than a mean:
| Household income | < $25k | $25–50k | $50–100k | $100–150k | $150k+ |
|---|---|---|---|---|---|
| share | 17.9% | 18.8% | 27.7% | 16.5% | 19.1% |
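A share table like this one can be produced with `value_counts(normalize=True)`, reindexed so the brackets appear in income order rather than by frequency. The sketch uses a tiny made-up frame in place of `persons.csv`; the bracket codes are the ones used in the rank mapping later in this article, and everything else is illustrative.

```python
import pandas as pd

# Bracket codes in ascending income order (as used in the rank mapping).
order = ["under25k", "r25to50k", "r50to100k", "r100to150k", "over150k"]

# Hypothetical toy rows standing in for the real file.
toy = pd.DataFrame({
    "household_income_bracket": [
        "under25k", "r50to100k", "r50to100k", "over150k",
        "r25to50k", "r50to100k", "r100to150k", "under25k",
    ]
})

shares = (
    toy["household_income_bracket"]
    .value_counts(normalize=True)  # fraction of rows per bracket
    .reindex(order)                # income order, not frequency order
    .mul(100)
    .round(1)
)
print(shares)
```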
§ · Does anything move HbA1c?
Correlate each candidate against HbA1c. Age, weight, and BMI are continuous, so we use Pearson; income is an ordinal bracket, so we rank it 1–5 and use Spearman:
rank = {"under25k":1, "r25to50k":2, "r50to100k":3, "r100to150k":4, "over150k":5}
df["income_rank"] = df["household_income_bracket"].map(rank)
df[["a1c_value", "age", "bmi", "weight_kg"]].corr()["a1c_value"] # Pearson
df[["a1c_value", "income_rank"]].corr(method="spearman").iloc[0, 1] # Spearman

| vs HbA1c | Correlation | Read |
|---|---|---|
| age | +0.28 | moderate, positive |
| BMI | +0.22 | positive |
| weight (kg) | +0.20 | positive, weaker than BMI |
| income (rank) | +0.00 | none |
Weight tracks HbA1c a little less tightly than BMI, exactly as expected — BMI normalizes weight for height, so it is the cleaner signal. Income is flat: Spearman +0.004.
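To make the height normalization concrete: BMI is weight in kilograms divided by height in metres squared, so two people with the same weight can land in different categories. The heights below are made-up inputs for illustration; we are not assuming anything about how the dataset stores height.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


# Same weight, two hypothetical heights: very different BMI.
shorter = bmi(84.0, 1.60)
taller = bmi(84.0, 1.90)
print(round(shorter, 1), round(taller, 1))
```

Raw weight cannot tell these two people apart, which is one reason it is the noisier predictor.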
§ · Which one drives it, holding the others fixed?
Simple correlations can be confounded, so we fit a standardized regression — HbA1c on age, BMI, and income together, all z-scored, so the coefficients are directly comparable:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(df[["age", "bmi", "income_rank"]])
y = StandardScaler().fit_transform(df[["a1c_value"]]).ravel()
LinearRegression().fit(X, y).coef_ # -> [0.281, 0.219, -0.036]

| Predictor | Standardized β | Verdict |
|---|---|---|
| age | +0.28 | real driver |
| BMI | +0.22 | real driver |
| income | −0.04 | negligible |
Model R² = 0.13 — age and BMI explain a real slice of HbA1c; the rest is clinical status and individual variation. The partial effects barely differ from the raw correlations because age and BMI are nearly independent here (r ≈ 0.01), so neither steals the other's credit. Their pushes on HbA1c simply add up.
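The additivity can be checked with arithmetic. When z-scored predictors are uncorrelated, R² equals the sum of the squared standardized coefficients. That holds only approximately here, since it assumes all three predictors are mutually near-independent, but it is a quick sanity check on the reported figures:

```python
# Standardized coefficients as reported in the regression table.
betas = {"age": 0.281, "bmi": 0.219, "income_rank": -0.036}

# With uncorrelated, z-scored predictors, each squared beta is that
# predictor's share of explained variance, and the shares sum to R^2.
shares = {name: beta ** 2 for name, beta in betas.items()}
r2_approx = sum(shares.values())

for name, share in shares.items():
    print(f"{name:12s} {share:.3f}")
print(f"approx R^2   {r2_approx:.3f}")
```

The sum comes to about 0.128, which rounds to the reported 0.13. Age accounts for roughly 0.079 of it, BMI for 0.048, and income for about 0.001.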
§ · The gradients, in plain numbers
Correlations hide the shape. Group means show it — and confirm the direction is monotonic, not noise:
df.groupby(pd.cut(df.age, [17, 34, 49, 64, 120]))["a1c_value"].mean()
df.groupby(pd.cut(df.bmi, [0, 25, 30, 100]))["a1c_value"].mean()
df.groupby("diabetes_status")["a1c_value"].mean()

| Age band | Mean HbA1c | BMI category | Mean HbA1c |
|---|---|---|---|
| 18–34 | 5.48% | normal (<25) | 5.47% |
| 35–49 | 5.63% | overweight (25–30) | 5.75% |
| 50–64 | 5.90% | obese (30+) | 5.96% |
| 65+ | 6.19% | — | — |
Both climb cleanly, step by step. By income bracket, HbA1c is flat — 5.80, 5.76, 5.77, 5.73, 5.76 from lowest to highest. No gradient.
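"Monotonic" is a checkable claim. The sketch below runs the check on the group means reported in this section; the helper name `is_strictly_increasing` is ours.

```python
def is_strictly_increasing(values):
    """True when every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# Group means as reported in this section, in band order.
age_means = [5.48, 5.63, 5.90, 6.19]
bmi_means = [5.47, 5.75, 5.96]
income_means = [5.80, 5.76, 5.77, 5.73, 5.76]

print("age monotonic:   ", is_strictly_increasing(age_means))
print("BMI monotonic:   ", is_strictly_increasing(bmi_means))
print("income monotonic:", is_strictly_increasing(income_means))

# Spread between the highest and lowest group mean, in HbA1c points.
income_spread = max(income_means) - min(income_means)
age_spread = max(age_means) - min(age_means)
print(round(income_spread, 2), round(age_spread, 2))
```

Age and BMI pass; income does not, and its total spread of 0.07 points is a tenth of the 0.71-point spread across age bands.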
And grouped by clinical status, the means track the diagnostic bands, which is the sharpest fidelity check of all:
| Diabetes status | n | Mean HbA1c | Diagnostic band |
|---|---|---|---|
| none | 2,430 | 5.32% | normal (<5.7) |
| prediabetic | 1,831 | 5.67% | prediabetes (5.7–6.4) |
| T2DM (diagnosed) | 547 | 7.28% | diabetes (6.5+) |
| type 1 | 67 | 8.96% | diabetes (6.5+) |
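The diabetic group averages 7.44% against 5.32% for non-diabetics, a 2.1-point gap. The non-diabetic and diabetic means sit inside their diagnostic bands; the prediabetic mean (5.67%) falls just under the 5.7% cutoff, so it matches its band only after rounding.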
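The group figures can be cross-checked against the overall mean with a weighted average. One assumption is needed and is flagged in the code: the two diabetic rows shown in the table total 614 records, so the sketch treats everyone who is neither "none" nor "prediabetic" as part of the pooled diabetic group behind the 7.44% figure.

```python
# Group sizes and means taken from the table and text in this section.
N_TOTAL = 5000
n_none, mean_none = 2430, 5.32
n_pre, mean_pre = 1831, 5.67
mean_diabetic = 7.44  # pooled diabetic mean quoted in the text

# ASSUMPTION: every record that is neither "none" nor "prediabetic"
# belongs to the pooled diabetic group.
n_diabetic = N_TOTAL - n_none - n_pre

# Size-weighted average of the three group means.
overall_mean = (
    n_none * mean_none + n_pre * mean_pre + n_diabetic * mean_diabetic
) / N_TOTAL
diabetic_share = 100 * n_diabetic / N_TOTAL
gap = mean_diabetic - mean_none

print(round(overall_mean, 2), round(diabetic_share, 1), round(gap, 1))
```

Under that assumption the weighted mean comes to 5.76%, matching the summary table, and the diabetic share comes to 14.8% of the sample.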
§ · The outcome
HbA1c is driven by age, body mass, and clinical status, not by income. Age (β +0.28) and BMI (β +0.22) each raise it independently and additively; income moves it essentially not at all (β −0.04, flat group means). The diabetic subgroup runs 2.1 points higher, and the group means line up with their diagnostic bands, with the prediabetic mean (5.67%) sitting just under the 5.7% cutoff.
For age and adiposity, that is what real US epidemiology says: HbA1c rises with both, the diabetes and prediabetes prevalences here (14.8% and 36.6%) are close to CDC estimates, and the clinical gradients hold. The signal a real analyst would find in NHANES for these variables is present in the synthetic data, which is the whole point of calibration. Prototype your query here, and those relationships carry over.
One honest caveat. Generation only reproduces the relationships it models. Here age and BMI are drawn nearly independently (r ≈ 0.01), whereas real cohorts show a mild age–BMI correlation, so a study that hinges on that specific interaction should validate against real data first. The same goes for income: real-world data shows a weak socioeconomic gradient that this build does not encode. Calibrated synthetic data is for realistic structure and privacy-safe iteration, not for discovering a correlation nobody built in. We cover that line in what is synthetic data →
§ · Reproduce it
Generation is deterministic, so this command yields the exact 5,000 rows analyzed above:
# with the person CLI
person-cli -n 5000 -s 42 -f csv > persons.csv
# or pull the same volume from the API — see the Quickstart
# POST /v1/datasets/person { "clientId": "...", "count": 5000, "seed": 42 }

New to the API? Start with the developer Quickstart →
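If you want to confirm the determinism yourself, one option is to generate the file twice and compare checksums. The helpers below are a minimal sketch using only the standard library; the function names and the second file name are ours, not part of the CLI.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a: str, path_b: str) -> bool:
    """True when both files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)
```

After running the generation command a second time into, say, `persons_rerun.csv` (a hypothetical file name), `same_bytes("persons.csv", "persons_rerun.csv")` should return `True` if the two runs produced identical output.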
§ · Frequently asked
An analysis run on this synthetic data is meaningful for the relationships the generator was built to model: here, HbA1c against age, BMI, and diabetes status, all calibrated to public references. It reflects real epidemiology by construction. It is not evidence of a novel correlation; for that, validate against real data.
The generator models HbA1c from age, BMI, and diabetes status, and draws income largely independently of them. So income carries no signal for HbA1c in this dataset (Spearman +0.00, flat group means). Real-world data shows a weak socioeconomic gradient; this build does not encode one.
Generate 5,000 records at seed 42 (person-cli -n 5000 -s 42 -f csv, or the same count and seed via the API), load into pandas, and run the snippets above. Output is byte-identical for a given seed, so the figures match.
Synthetic data can stand in for real data when building and testing an analysis pipeline, in demos, and in many ML tasks, because the structure behaves like real data. For a published conclusion that depends on one specific real-world correlation, treat synthetic data as the rehearsal and confirm on real records.