§ · Guide

Why Joint Distributions Matter in Synthetic Patient Data

A joint distribution is how fields move together — A1c with diabetes, blood pressure with age, insulin only for people who have diabetes. Getting every column's histogram right is not the same as getting the population right. A dataset can match the distribution of each field on its own and still be full of people who could not exist.

This is the difference between data that looks realistic one column at a time and data that holds up when you read across columns. It is also the single thing most synthetic-data tools leave out. Below we measure the joint structure in 8,000 SimpleIDGen records, then break it on purpose to show what is lost.

Every figure here is measured from generated data, seed 42. No signup for the sample.

§ · Marginals are not the population

A marginal distribution is one column on its own: the histogram of age, the share of each state, the spread of BMI. A joint distribution is how those columns relate: whether the 70-year-olds are also the ones with lower kidney function, whether the people on insulin are the ones who have diabetes.

When a tool says its data is "realistic," it almost always means the marginals look right. That is necessary but not sufficient. Age can have a perfect distribution, diabetes can have the right prevalence, and insulin use can hit the correct national rate — and the data can still put insulin in the hands of people who have no diabetes, because the two columns were filled in without reference to each other.
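A minimal sketch of that failure, using only the Python standard library. The two rates are made-up illustration values, not SimpleIDGen's calibration targets, and the variable names are hypothetical.

```python
import random

rng = random.Random(42)
N = 8_000
P_DIABETES = 0.12  # hypothetical prevalence, for illustration only
P_INSULIN = 0.03   # hypothetical overall insulin rate, for illustration only

# Independent columns: each field is drawn with no reference to the other.
diabetes = [rng.random() < P_DIABETES for _ in range(N)]
insulin = [rng.random() < P_INSULIN for _ in range(N)]

# Both marginals land close to their targets...
diabetes_rate = sum(diabetes) / N
insulin_rate = sum(insulin) / N

# ...but nothing stops insulin from landing on a person without diabetes.
impossible = sum(1 for d, i in zip(diabetes, insulin) if i and not d)

print(f"diabetes rate {diabetes_rate:.3f}, insulin rate {insulin_rate:.3f}")
print(f"on insulin without diabetes: {impossible}")
```

Each single-column check passes here, while the cross-column check does not.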

§ · What the structure actually looks like

Here is the joint structure in the numeric fields, measured across 8,000 records. Each cell is a Pearson correlation. The strong couplings are the ones a real population has: BMI tracks waist (+0.86) and weight (r = +0.89, because BMI is computed from height and weight; weight is not among the seven fields shown in the matrix); A1c tracks how long someone has had diabetes (+0.72); kidney function falls as age rises (eGFR against age, −0.55); systolic pressure climbs with age (+0.45).

Correlation matrix: 8,000 synthetic US person records (Pearson r)

          age     BMI     waist   A1c     sys BP  eGFR    dx yrs
age       1       -0.00   -0.00   +0.27   +0.45   -0.55   +0.28
BMI       -0.00   1       +0.86   +0.22   +0.18   -0.05   +0.18
waist     -0.00   +0.86   1       +0.20   +0.15   -0.04   +0.16
A1c       +0.27   +0.22   +0.20   1       +0.20   -0.33   +0.72
sys BP    +0.45   +0.18   +0.15   +0.20   1       -0.28   +0.20
eGFR      -0.55   -0.05   -0.04   -0.33   -0.28   1       -0.36
dx yrs    +0.28   +0.18   +0.16   +0.72   +0.20   -0.36   1

dx yrs: years with diabetes.

Categorical fields carry the same kind of structure, measured with Cramér's V — 0 when two fields are independent, 1 when one fully determines the other. Blood-pressure medication is tied to a hypertension diagnosis (0.74), insulin to diabetes (0.66), pregnancy to sex (0.62). None of these are coincidences in the data; each is a rule the generator enforces.

Categorical associations: Cramér's V (0 = independent, 1 = fully determined)

Field pair                Cramér's V
hypertension ~ BP-meds    0.74
diabetes ~ insulin        0.66
sex ~ pregnancy           0.62
lipids dx ~ statin        0.36
education ~ income        0.23
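For readers who want to check a pair themselves, here is a sketch of Cramér's V in its plain, uncorrected form, V = sqrt(chi² / (n · (min(rows, cols) − 1))). Whether the figures above apply a bias correction is not stated, so treat this as one reasonable definition and not necessarily the exact computation behind them. The function name and the toy data are illustrative.

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Uncorrected Cramér's V for two equal-length sequences of category labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed count per (x, y) cell
    row = Counter(xs)             # marginal counts of the first field
    col = Counter(ys)             # marginal counts of the second field
    chi2 = 0.0
    for x in row:
        for y in col:
            expected = row[x] * col[y] / n  # count expected under independence
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Toy data, not SimpleIDGen output.
dx = ["htn"] * 50 + ["none"] * 50
fully_determined = ["yes"] * 50 + ["no"] * 50  # meds exactly follow the diagnosis
independent = ["yes", "no"] * 50               # meds split evenly in both groups

print(cramers_v(dx, fully_determined))  # 1.0
print(cramers_v(dx, independent))       # 0.0
```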

§ · The test: shuffle the columns

There is a clean way to see what joint structure is worth. Take the same 8,000 generated records and shuffle each column independently: permute the age column, permute the diabetes column, permute the insulin column, each on its own. This is the best case for independent-column generation: every column keeps its exact distribution, but the links between columns are destroyed.

The marginals come through untouched: the count of each diabetes category is identical before and after the shuffle. And yet:

Rule that cannot be broken in a real person         Calibrated   Columns shuffled
Pregnant males                                      0            57
On insulin but no diabetes                          0            143
BMI does not match height and weight                0            7,446
Years-with-diabetes > 0 but no diabetes             0            569
Records with at least one impossible combination    0 (0.0%)     7,501 (93.8%)

More than nine in ten records became impossible, and not one histogram changed. That is the whole argument in one number: a generator can pass every single-column check and still produce a population where almost nobody could exist. Matching marginals is the easy part; the joints are what make the data usable.
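The four rules in the table can be written as a per-record check. This is a sketch, not the audit code behind the numbers above: `on_insulin` and `diabetes_status` appear in the snippet later in this guide, while `sex`, `pregnant`, `height_cm`, `weight_kg`, `bmi`, `years_with_diabetes` and the BMI tolerance are assumed names and values for illustration.

```python
def violations(rec, bmi_tol=0.1):
    """Return the names of the hard rules that a single record breaks."""
    broken = []
    if rec["sex"] == "male" and rec["pregnant"]:
        broken.append("pregnant male")
    if rec["on_insulin"] and rec["diabetes_status"] == "none":
        broken.append("insulin without diabetes")
    # BMI must equal weight (kg) divided by height (m) squared, within a tolerance.
    expected_bmi = rec["weight_kg"] / (rec["height_cm"] / 100) ** 2
    if abs(rec["bmi"] - expected_bmi) > bmi_tol:
        broken.append("BMI mismatch")
    if rec["years_with_diabetes"] > 0 and rec["diabetes_status"] == "none":
        broken.append("diabetes duration without diabetes")
    return broken

# Two hand-written records, not generator output.
ok = {
    "sex": "female", "pregnant": False,
    "on_insulin": True, "diabetes_status": "type 1", "years_with_diabetes": 12,
    "height_cm": 170, "weight_kg": 65.0, "bmi": 22.5,
}
bad = dict(ok, sex="male", pregnant=True, diabetes_status="none")

print(violations(ok))   # []
print(violations(bad))  # three rules broken; BMI still matches
```

Counting the records for which `violations` returns a non-empty list gives the "at least one impossible combination" row.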

Internally consistent records (%): same 8,000 rows, before and after shuffling each column independently

calibrated (joint)    100.00
columns shuffled      6.20

§ · What the joints encode

The relationships are not decoration. They are the gradients any downstream analysis relies on. Mean A1c climbs cleanly across diabetes status and tracks the diagnostic bands: both diabetes groups sit above the 6.5% threshold, and the prediabetic mean (5.67%) sits right at the 5.7% line. Insulin use rises from 0% in people without diabetes to 25% in diagnosed type-2 to 95% in type-1.

A1c by diabetes status (%), with diagnostic thresholds at 5.7 (prediabetes) and 6.5 (diabetes)

Diabetes status    Mean A1c (%)
none               5.32
prediabetic        5.67
T2DM               7.33
type 1             9.01

Age drives its own set of gradients — systolic pressure up, kidney function down — the kind of covariation a risk model or a cohort filter reads directly. Shuffle the columns and every one of these lines goes flat.

Age band    Mean systolic BP    Mean eGFR
18–34       113.7               107.9
35–49       119.8               101.7
50–64       125.9               91.9
65+         133.0               79.7
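A table like this is a group-by on age band. The sketch below uses the same four bands; the field name `systolic_bp` and the three toy records are assumptions for illustration, not the generator's schema or output.

```python
from statistics import mean

# (lowest age, highest age, label), matching the bands in the table above.
BANDS = [(18, 34, "18–34"), (35, 49, "35–49"), (50, 64, "50–64"), (65, 200, "65+")]

def age_band(age):
    """Map an age in years to its band label, or None if under 18."""
    for lo, hi, label in BANDS:
        if lo <= age <= hi:
            return label
    return None

def band_means(records, field):
    """Mean of `field` per age band, rounded to one decimal."""
    groups = {}
    for rec in records:
        label = age_band(rec["age"])
        if label is not None:
            groups.setdefault(label, []).append(rec[field])
    return {label: round(mean(vals), 1) for label, vals in groups.items()}

# Toy records, not generator output.
records = [
    {"age": 25, "systolic_bp": 110.0},
    {"age": 30, "systolic_bp": 116.0},
    {"age": 70, "systolic_bp": 135.0},
]
print(band_means(records, "systolic_bp"))
```

Run on shuffled columns, the per-band means should all collapse toward the overall mean, which is the "goes flat" effect described above.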

§ · Why most tools skip it

Not out of oversight — by design. Faker, Mockaroo, and most random-data tools generate each field independently from a type or a pick-list: a name here, an age there, a diagnosis from a weighted list. That is fast, it needs no reference data, and it is perfectly fine when all you want is well-formed values to fill a form or a UI.

It stops being fine the moment anything reads across columns. A risk score, a cohort query, a clinical rule, a join between tables, a model that learns from feature interactions — all of them see the shuffled version: correct margins, meaningless structure. This is the gap the comparison guides cover for Faker and Mockaroo.

§ · How SimpleIDGen keeps them

The generator draws fields in a fixed dependency order, and each field is conditioned on the ones already set. Age is drawn first. Conditions depend on age. Medications depend on conditions. BMI is computed from height and weight rather than sampled. Insulin is gated on a diabetes diagnosis. Because the draw order is fixed and seeded, the same seed reproduces the same people byte-for-byte, and the relationships hold on every row.

This is dependency-ordered conditional sampling, not a learned model — every rule is explicit and auditable. You can see the seed-invariance side of it in the determinism audit, and the joint structure at work in the HbA1c analysis.
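The shape of that approach can be sketched in a few lines. This is an illustration of dependency-ordered conditional sampling in general, not SimpleIDGen's code: every rate, distribution and field name below is a placeholder chosen for the example.

```python
import random

def make_person(rng):
    """Draw one record in a fixed dependency order (placeholder rates throughout)."""
    age = rng.randint(18, 90)                   # 1. age is drawn first
    p_diabetes = 0.03 + 0.002 * (age - 18)      # 2. the condition depends on age
    has_diabetes = rng.random() < p_diabetes
    u = rng.random()                            # always drawn, so the stream stays aligned
    on_insulin = has_diabetes and u < 0.25      # 3. the medication is gated on the condition
    height_cm = rng.gauss(170, 9)
    weight_kg = rng.gauss(80, 15)
    bmi = round(weight_kg / (height_cm / 100) ** 2, 1)  # 4. BMI is computed, never sampled
    return {
        "age": age,
        "has_diabetes": has_diabetes,
        "on_insulin": on_insulin,
        "height_cm": round(height_cm, 1),
        "weight_kg": round(weight_kg, 1),
        "bmi": bmi,
    }

def generate(n, seed):
    """One seeded stream, consumed in a fixed order, gives reproducible records."""
    rng = random.Random(seed)
    return [make_person(rng) for _ in range(n)]

people = generate(8_000, seed=42)
same_again = generate(8_000, seed=42)
print(people == same_again)  # same seed, same people
print(any(p["on_insulin"] and not p["has_diabetes"] for p in people))  # gate holds on every row
```

Because insulin is only ever set when the diabetes flag is already true, the impossible combination cannot occur by construction; no after-the-fact filtering is needed.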

§ · The honest limit

Generation only reproduces the joints it models. Look back at the matrix: age against BMI is r = -0.00, effectively zero, because BMI is not conditioned on age, whereas real cohorts show a mild age–BMI link. The structure that is built in is faithful; the structure nobody built in is absent. If a conclusion hinges on one specific real-world correlation, validate it against real data first. The guide on what synthetic data is for covers where that line falls.

§ · Reproduce the matrix

Both figures are a few lines. Generate a sample, read it into pandas, and the correlation matrix and the shuffle test fall out directly:

import pandas as pd

df = pd.read_json("people.jsonl", lines=True)

# the joint structure: the correlation matrix above
df.corr(numeric_only=True).round(2)

# the counterfactual: shuffle every column independently
# (no fixed integer random_state here: the same seed on every column
#  would apply the same permutation to all of them and keep the joints)
shuffled = df.apply(lambda col: col.sample(frac=1).to_numpy())

# marginals are identical, joints are gone
# sort_index() aligns the two count tables so tied counts cannot reorder them
(df["diabetes_status"].value_counts().sort_index()
    == shuffled["diabetes_status"].value_counts().sort_index()).all()   # True
(shuffled["on_insulin"] & (shuffled["diabetes_status"] == "none")).sum()  # > 0
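One pitfall is worth spelling out if you want the shuffle itself to be reproducible. Re-seeding per column with the same seed applies the same permutation to every column, so rows move together and the joints survive. To my understanding the same thing happens in pandas if an identical integer `random_state` is passed to each `col.sample` call. The standard-library sketch below shows both variants on toy columns; the function names are illustrative.

```python
import random

def shuffle_columns(columns, seed):
    """Permute each column independently, drawing from ONE shared seeded stream."""
    rng = random.Random(seed)
    out = {}
    for name, values in columns.items():
        shuffled = list(values)
        rng.shuffle(shuffled)  # each call advances the stream: a new permutation
        out[name] = shuffled
    return out

def shuffle_columns_wrong(columns, seed):
    """Pitfall: re-seeding per column applies the SAME permutation to every column."""
    out = {}
    for name, values in columns.items():
        shuffled = list(values)
        random.Random(seed).shuffle(shuffled)
        out[name] = shuffled
    return out

def insulin_without_diabetes(cols):
    return sum(i and not d for d, i in zip(cols["diabetes"], cols["on_insulin"]))

# Toy columns in which insulin is perfectly gated on diabetes.
cols = {
    "diabetes": [i % 10 == 0 for i in range(1000)],
    "on_insulin": [i % 10 == 0 for i in range(1000)],
}

wrong = shuffle_columns_wrong(cols, seed=42)
right = shuffle_columns(cols, seed=42)
print(insulin_without_diabetes(wrong))  # 0: the rows moved together, joints intact
print(insulin_without_diabetes(right))  # > 0: the joints are broken, as intended
```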

§ · Frequently asked

Q1
What is the difference between a marginal and a joint distribution?

A marginal distribution is one column on its own — the histogram of age, or the share of each diabetes category. A joint distribution is how columns relate to each other — whether the people with diabetes are the ones on insulin. Matching marginals makes each field look right in isolation; matching joints makes the records internally consistent.

Q2
If every column's distribution is correct, isn't the data realistic?

No. Take the 8,000 calibrated records and shuffle each column independently: every column keeps its exact distribution, yet 93.8% of the rows become impossible, with insulin without diabetes, pregnant males, and BMI that does not match height and weight. Correct marginals with broken joints is the most common failure mode in synthetic data.

Q3
Do Faker and Mockaroo model joint distributions?

Generally no. They generate each field independently from a type or a list, which is fast and fine for filling forms and UIs but produces no cross-field structure. Anything that reads across columns — a risk model, a cohort filter, a join — sees noise.

Q4
How does SimpleIDGen preserve them?

By drawing fields in a fixed dependency order, each conditioned on the fields already set: conditions depend on age, medications on conditions, BMI is computed from height and weight, insulin is gated on diabetes. It is deterministic by seed, so the relationships hold identically on every row and every run.