Is the synthetic data reproducible?

Yes, deterministically. The same seed and options produce byte-identical output on any machine — we verified two 5,000-record datasets at seed 42 hash to the same SHA-256. One seeded ChaCha20 stream drives every draw in a fixed order, with no clock or platform entropy.

Does changing the seed change the whole dataset?

Completely. Seed 42 and seed 7 share zero identical rows, zero IDs, and zero name+DOB identities — a seed change resamples the entire cohort, so you can mint independent train/validation/holdout sets with no leakage.

Are the records unique, or are there duplicates?

Every record is unique. In a 5,000-record dataset, IDs, full rows, and name+DOB identities are all 100% distinct, with zero coincidental collisions.

If the seed changes everything, is the data still calibrated the same way?

Yes, the population is seed-invariant. Across eight seeds, the coefficient of variation of mean age, BMI, and HbA1c stays under 1%, and that of diabetes prevalence is about 5%. The seed changes which people you get, not the statistical population they are drawn from.

§ · Determinism audit

Reproducible by Seed: a Determinism and Diversity Audit

Generate the same dataset twice with the same seed and you get byte-identical files. Change the seed and every single record changes — no shared rows, IDs, or identities. Yet the population statistics barely move. That combination — perfect reproducibility, complete seed-disjointness, and seed-invariant calibration — is what makes synthetic data safe to pin in a CI suite or a bug report.

We put the guarantee to the test on real generated data: two 5,000-record datasets at seed 42, one at seed 7, and eight more seeds for the population panel.

§ · 1 · Same seed → byte-identical

We generated 5,000 records at seed 42, twice, and hashed each file. They are identical to the byte; seed 7 is a different file:

Run	Seed	SHA-256 (first 16 hex)	Result
A	42	`ccb99fdfe6370252`	—
A′	42	`ccb99fdfe6370252`	identical to A ✓
B	7	`d12a1a913d7f3f1c`	different

Reproducibility is structural: one seeded ChaCha20 stream drives every draw, in a fixed order, with no clock or platform entropy. The same seed and options reproduce the same people on any machine.
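The structural argument can be illustrated with a toy generator. This is a minimal sketch, not the product's implementation: Python's `random.Random` (a Mersenne Twister) stands in for the seeded ChaCha20 stream, and `toy_dataset` and its three columns are hypothetical. It shows that when one seeded stream drives every draw in a fixed order, and nothing reads the clock or OS entropy, the output bytes depend only on the seed and options.

```python
import hashlib
import random


def toy_dataset(seed: int, count: int) -> bytes:
    """Build a small CSV from one seeded stream (hypothetical columns)."""
    rng = random.Random(seed)  # stand-in for the seeded ChaCha20 stream
    lines = ["id,age,bmi"]
    for _ in range(count):
        # Fixed draw order per record: id, then age, then BMI.
        record_id = rng.getrandbits(64)
        age = rng.randint(18, 90)
        bmi = round(rng.uniform(17.0, 45.0), 1)
        lines.append(f"{record_id:016x},{age},{bmi}")
    return ("\n".join(lines) + "\n").encode("utf-8")


def digest(seed: int, count: int = 1000) -> str:
    """SHA-256 hex digest of the generated bytes."""
    return hashlib.sha256(toy_dataset(seed, count)).hexdigest()


run_a = digest(42)
run_a_again = digest(42)
run_b = digest(7)
print(run_a == run_a_again)  # same seed: same bytes, same hash
print(run_a == run_b)        # different seed: different bytes
```

Reordering the draws inside the loop, or mixing in a timestamp, would break the equality, which is why the fixed draw order matters as much as the seed.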

§ · 2 · Different seed → a disjoint cohort; every record unique

A seed change is not a small perturbation — it resamples the entire population. Comparing seed 42 against seed 7 (5,000 records each), and checking seed 42 against itself for collisions:

Check	Result
Identical full rows shared between seed 42 and seed 7	0
Shared IDs between the two seeds	0
Shared name + date-of-birth identities	0
Distinct IDs within a 5,000-record dataset	100.00% (5,000 / 5,000)
Distinct full rows	100.00%
Distinct name + DOB identities	100.00% (0 coincidental collisions)

No duplicate people, no reused identities, and no overlap across seeds — so two seeds give you two independent cohorts, and one seed gives you a clean, collision-free dataset.
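The checks in the table reduce to set operations over three keys: the full row, the ID, and the name plus date-of-birth identity. The sketch below uses only the standard library; the column names (`id`, `first_name`, `last_name`, `date_of_birth`) are assumptions and should be adjusted to the real CSV header, and the two tiny cohorts are made-up rows for illustration only.

```python
import csv
import io

# Hypothetical column names; adjust to the real CSV header.
ID_COL = "id"
IDENTITY_COLS = ("first_name", "last_name", "date_of_birth")
KEY_NAMES = ("full_rows", "ids", "identities")


def load_rows(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def key_sets(rows):
    """Return the sets of full rows, IDs, and name+DOB identities."""
    full = {tuple(sorted(r.items())) for r in rows}
    ids = {r[ID_COL] for r in rows}
    identities = {tuple(r[c] for c in IDENTITY_COLS) for r in rows}
    return full, ids, identities


def overlap(rows_a, rows_b) -> dict[str, int]:
    """Count keys shared between two datasets (0 means disjoint)."""
    pairs = zip(KEY_NAMES, key_sets(rows_a), key_sets(rows_b))
    return {name: len(a & b) for name, a, b in pairs}


def distinct_fraction(rows) -> dict[str, float]:
    """Fraction of records that are distinct under each key (1.0 = all)."""
    return {name: len(s) / len(rows) for name, s in zip(KEY_NAMES, key_sets(rows))}


# Illustrative rows only, not generator output.
cohort_a = load_rows(
    "id,first_name,last_name,date_of_birth\n"
    "a1,Ada,Lopez,1980-02-01\n"
    "a2,Ben,Okafor,1975-11-30\n"
)
cohort_b = load_rows(
    "id,first_name,last_name,date_of_birth\n"
    "b1,Cy,Tran,1990-06-15\n"
    "b2,Dee,Novak,1962-03-09\n"
)
print(overlap(cohort_a, cohort_b))
print(distinct_fraction(cohort_a))
```

Run against two real exports, `overlap` corresponds to the first three rows of the table and `distinct_fraction` to the last three.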

§ · 3 · Every field carries entropy

A dataset can be "unique" per row yet have columns stuck on one value. We measured normalized Shannon entropy for every field (0 = constant, 1 = maximally diverse), and the fields span the full range, from balanced binaries to fully unique identifiers:

Fig 1 · Normalized entropy per field (0 = constant, 1 = maximally diverse)

Exactly three of the 77 columns are constant, and by design: country, locale, and preferred_language — the public API emits US / en-US records only. Every other field carries entropy, so nothing is silently frozen.
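For readers who want to recompute the metric, here is one possible definition. The audit does not state its normalization, so this sketch assumes the Shannon entropy of a column divided by log2 of the number of records: a constant column scores 0, a column where every value is distinct scores 1, and a balanced binary lands near the low end. The function name is hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(values) -> float:
    """Shannon entropy of a column divided by log2(number of records).

    Assumed normalization: 0.0 for a constant column, 1.0 when every
    value is distinct.
    """
    n = len(values)
    if n < 2:
        return 0.0
    counts = Counter(values)
    # H = sum over distinct values of p * log2(1 / p), with p = count / n.
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return entropy / math.log2(n)


constant = normalized_entropy(["US"] * 8)            # one value repeated
binary = normalized_entropy(["F", "M"] * 4)          # balanced two-way split
unique = normalized_entropy([f"id-{i}" for i in range(8)])  # all distinct
print(constant, binary, unique)
```

A different convention, dividing by log2 of the number of distinct values, would instead score any balanced column at 1, so it is worth confirming which one a published figure uses before comparing numbers.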

§ · 4 · Records are independent of row order

Each record is seeded independently, so consecutive rows carry no relationship and the order you receive them in is arbitrary. The lag-1 autocorrelation (a field at row i against row i+1) sits near zero across the numeric fields:

Field	lag-1 autocorrelation r(i, i+1)
age	−0.021
BMI	+0.002
HbA1c	−0.015
weight	−0.0004

All are within about ±0.02 of zero (the largest magnitude is 0.021, for age), so you can shard, shuffle, or stream the rows in any order without introducing bias.
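The statistic in the table is the Pearson correlation between each value and its successor. The sketch below is a standard-library version under that reading; the function name is hypothetical and the demo series are synthetic, not generator output. For independent rows the estimate scatters around zero with a standard error of roughly 1/sqrt(n), about 0.014 at n = 5,000, so values of the size reported above are what sampling noise alone would produce.

```python
import math
import random


def lag1_autocorrelation(xs) -> float:
    """Pearson correlation between xs[i] and xs[i + 1]."""
    a, b = xs[:-1], xs[1:]
    n = len(a)
    if n < 2:
        raise ValueError("need at least three values")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


# Synthetic demo series.
trending = [1, 2, 3, 4, 5, 6]            # each value follows its neighbor
alternating = [1, -1, 1, -1, 1, -1]      # each value opposes its neighbor
rng = random.Random(0)
independent = [rng.gauss(0.0, 1.0) for _ in range(5000)]

print(lag1_autocorrelation(trending))     # close to +1
print(lag1_autocorrelation(alternating))  # close to -1
print(lag1_autocorrelation(independent))  # close to 0
```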

§ · 5 · The population is seed-invariant

Here is the property that makes determinism useful rather than merely tidy: changing the seed changes who you get, but not the population. Across eight seeds (2,000 records each), the calibrated marginals barely move (Fig 2):

Fig 2a · Mean age across 8 seeds — CV 0.42%

Fig 2b · % diabetic across 8 seeds — CV 4.91%

Marginal	Mean across 8 seeds	SD	Coefficient of variation
Mean age	47.48	0.20	0.42%
Mean BMI	29.49	0.18	0.62%
Mean HbA1c	5.744	0.021	0.37%
% with diabetes	14.87	0.73	4.91%

The continuous means vary by well under 1%; the diabetes prevalence, a rarer binary outcome and therefore noisier, still holds to a coefficient of variation of about 5%. The calibration is a property of the generator, not of any one seed.
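The coefficient of variation in the table is the SD divided by the mean. The sketch below recomputes it from the rounded table values and, as a sanity check, estimates how much relative noise pure binomial sampling would add to a 14.87% prevalence at 2,000 records per seed. The function names are hypothetical, and because the inputs are rounded, a recomputed value can differ from the table in the last digit (BMI comes out at 0.61% rather than 0.62%).

```python
import math


def cv_percent(mean: float, sd: float) -> float:
    """Coefficient of variation as a percentage: 100 * SD / mean."""
    return 100.0 * sd / mean


def binomial_cv_percent(prevalence_pct: float, n: int) -> float:
    """Relative sampling noise for a prevalence estimated from n records."""
    p = prevalence_pct / 100.0
    standard_error = math.sqrt(p * (1.0 - p) / n)
    return 100.0 * standard_error / p


# (mean, SD) pairs copied from the table above.
table = {
    "Mean age": (47.48, 0.20),
    "Mean BMI": (29.49, 0.18),
    "Mean HbA1c": (5.744, 0.021),
    "% with diabetes": (14.87, 0.73),
}
for name, (mean, sd) in table.items():
    print(f"{name}: CV = {cv_percent(mean, sd):.2f}%")

# Sampling noise alone for a 14.87% prevalence at 2,000 records per seed.
print(f"expected binomial CV: {binomial_cv_percent(14.87, 2000):.2f}%")
```

The binomial estimate comes out a little over 5%, in line with the observed 4.91%, which suggests the spread in prevalence is consistent with sampling noise at this cohort size rather than with drift in the generator.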

§ · Why it matters

These three properties map directly onto real engineering needs. Reproducibility makes a synthetic fixture a stable contract: pin a seed in CI and a failing test means a real regression, not reshuffled mock data — and a bug report that says "seed 42, row 128" is exactly reproducible on any machine. Seed-disjointness lets you mint independent train / validation / holdout cohorts, or a fresh dataset per test, with a guarantee of no leakage between them. Seed-invariance means those independent cohorts are still drawn from the same calibrated population, so a model tuned on one behaves predictably on another.

§ · Reproduce it

Every number above is deterministic. Regenerate and re-hash:

# same seed twice -> identical hash
person-cli -n 5000 -s 42 -f csv | shasum -a 256
person-cli -n 5000 -s 42 -f csv | shasum -a 256   # matches
person-cli -n 5000 -s 7  -f csv | shasum -a 256   # differs

# or via the API (see the Quickstart): POST /v1/datasets/person
#   { "clientId": "...", "count": 5000, "seed": 42 }

Load into pandas to reproduce the uniqueness, entropy, autocorrelation, and seed-invariance figures.
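The same hash check can be scripted, for example inside a CI test. This is a sketch under stated assumptions: it assumes `person-cli` is on the PATH and uses only the flags shown in the shell commands above (`-n`, `-s`, `-f csv`); the helper names are hypothetical.

```python
import hashlib
import subprocess


def build_command(count: int, seed: int, fmt: str = "csv") -> list[str]:
    """Argument list mirroring the shell commands above."""
    return ["person-cli", "-n", str(count), "-s", str(seed), "-f", fmt]


def dataset_sha256(count: int, seed: int) -> str:
    """Run the generator and hash its standard output."""
    result = subprocess.run(
        build_command(count, seed), capture_output=True, check=True
    )
    return hashlib.sha256(result.stdout).hexdigest()


def assert_reproducible(count: int, seed: int) -> str:
    """Generate twice with the same seed and fail if the hashes differ."""
    first = dataset_sha256(count, seed)
    second = dataset_sha256(count, seed)
    if first != second:
        raise AssertionError(f"seed {seed} produced two different files")
    return first
```

Calling `assert_reproducible(5000, 42)` mirrors the first two shell commands; comparing its return value with `dataset_sha256(5000, 7)` mirrors the third.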