Reproducible by Seed: a Determinism and Diversity Audit
Generate the same dataset twice with the same seed and you get byte-identical files. Change the seed and every single record changes — no shared rows, IDs, or identities. Yet the population statistics barely move. That combination — perfect reproducibility, complete seed-disjointness, and seed-invariant calibration — is what makes synthetic data safe to pin in a CI suite or a bug report.
We put the guarantee to the test on real generated data: two 5,000-record datasets at seed 42, one at seed 7, and eight more seeds for the population panel.
§ · 1 · Same seed → byte-identical
We generated 5,000 records at seed 42, twice, and hashed each file. They are identical to the byte; seed 7 is a different file:
| Run | Seed | SHA-256 (first 16 hex) | Comparison |
|---|---|---|---|
| A | 42 | ccb99fdfe6370252 | — |
| A′ | 42 | ccb99fdfe6370252 | identical to A ✓ |
| B | 7 | d12a1a913d7f3f1c | different |
Reproducibility is structural: one seeded ChaCha20 stream drives every draw, in a fixed order, with no clock or platform entropy. The same seed and options reproduce the same people on any machine.
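The byte-level comparison can be sketched in Python using only the standard library. This is a minimal sketch, assuming each run has already been written to a file; the function names `sha256_prefix` and `same_bytes` are hypothetical, and the 16-character prefix simply mirrors the table above.

```python
import hashlib


def sha256_prefix(path, n_hex=16, chunk_size=1 << 20):
    """Return the first n_hex hex characters of a file's SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so a large dataset does not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:n_hex]


def same_bytes(path_a, path_b):
    """True when two files have the same full (64 hex character) SHA-256 digest."""
    return sha256_prefix(path_a, n_hex=64) == sha256_prefix(path_b, n_hex=64)
```

Comparing the full digest rather than the 16-character prefix avoids treating a prefix match as proof of identity; the prefix is only a display convenience.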
§ · 2 · Different seed → a disjoint cohort; every record unique
A seed change is not a small perturbation; it resamples the entire population. We compared seed 42 against seed 7 (5,000 records each) and checked seed 42 against itself for collisions:
| Check | Result |
|---|---|
| Identical full rows shared between seed 42 and seed 7 | 0 |
| Shared IDs between the two seeds | 0 |
| Shared name + date-of-birth identities | 0 |
| Distinct IDs within a 5,000-record dataset | 100.00% (5,000 / 5,000) |
| Distinct full rows | 100.00% |
| Distinct name + DOB identities | 100.00% (0 coincidental collisions) |
No duplicate people, no reused identities, and no overlap across seeds — so two seeds give you two independent cohorts, and one seed gives you a clean, collision-free dataset.
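The overlap and uniqueness checks reduce to set operations. The sketch below assumes each dataset has been loaded as a list of dicts (for example with `csv.DictReader`); the column names `id`, `name`, and `date_of_birth` are assumptions for illustration, not confirmed field names, and both function names are hypothetical.

```python
def overlap_report(records_a, records_b, id_key="id",
                   identity_keys=("name", "date_of_birth")):
    """Count full rows, IDs, and identities shared by two lists of dict records."""
    def rows(records):
        # A full row is the sorted tuple of its (column, value) pairs.
        return {tuple(sorted(r.items())) for r in records}

    def identities(records):
        return {tuple(r[k] for k in identity_keys) for r in records}

    ids_a = {r[id_key] for r in records_a}
    ids_b = {r[id_key] for r in records_b}
    return {
        "shared_rows": len(rows(records_a) & rows(records_b)),
        "shared_ids": len(ids_a & ids_b),
        "shared_identities": len(identities(records_a) & identities(records_b)),
    }


def distinct_fraction(records, keys):
    """Number of distinct key tuples divided by the number of records."""
    seen = {tuple(r[k] for k in keys) for r in records}
    return len(seen) / len(records)
```

A dataset with no internal collisions gives `distinct_fraction(records, ["id"]) == 1.0`, and two disjoint cohorts give zeros for every entry of `overlap_report`.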
§ · 3 · Every field carries entropy
A dataset can be "unique" per row yet have columns stuck on one value. Normalized Shannon entropy, measured for every field (0 = constant, 1 = maximally diverse), shows the fields spanning the full range, from balanced binaries to fully unique identifiers.
Exactly three of the 77 columns are constant, and by design: country, locale, and preferred_language — the public API emits US / en-US records only. Every other field carries entropy, so nothing is silently frozen.
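For readers who want to recompute the entropy scores, here is one possible definition. The audit does not state its normalizing denominator, so this sketch assumes the entropy is divided by the log of the row count: a constant column scores 0, a fully unique column scores 1, and a balanced binary lands low on the scale. Normalizing by the log of the number of distinct values instead would score a balanced binary as 1. The function name is hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of a column divided by log(number of rows).

    Assumption: the denominator is log(n_rows), not log(n_distinct).
    """
    n = len(values)
    if n < 2:
        return 0.0
    counts = Counter(values)
    # Sum of p * log(1 / p) over the distinct values, with p = count / n.
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    return entropy / math.log(n)
```

Under this assumption a constant column such as `country` returns 0.0, which gives a direct way to list frozen fields: keep every column whose score is exactly zero.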
§ · 4 · Records are independent of row order
Each record is seeded independently, so consecutive rows carry no relationship and the order you receive them in is arbitrary. The lag-1 autocorrelation (a field at row i against row i+1) sits near zero across the numeric fields:
| Field | lag-1 autocorrelation r(i, i+1) |
|---|---|
| age | −0.021 |
| BMI | +0.002 |
| HbA1c | −0.015 |
| weight | −0.0004 |
All four sit within about ±0.02 of zero (the largest in magnitude, age, is −0.021), so you can shard, shuffle, or stream the rows in any order without introducing bias.
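The lag-1 statistic is the Pearson correlation between a column and the same column shifted by one row. The sketch below assumes the column has been loaded as a list of numbers; the function names are hypothetical. It also includes the usual rule of thumb for how large r is expected to be by chance when rows are independent, roughly 1/√n.

```python
import math


def lag1_autocorrelation(xs):
    """Pearson correlation between xs[i] and xs[i + 1]."""
    a, b = xs[:-1], xs[1:]
    n = len(a)
    if n < 2:
        raise ValueError("need at least three values")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("constant series has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


def noise_scale(n_rows):
    """Approximate standard error of r for independent rows: 1 / sqrt(n)."""
    return 1 / math.sqrt(n_rows)
```

For 5,000 rows `noise_scale(5000)` is about 0.014, so values such as −0.021 or +0.002 are the size expected from sampling noise alone.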
§ · 5 · The population is seed-invariant
Here is the property that makes determinism useful rather than merely tidy: changing the seed changes who you get, but not the population. Across eight seeds (2,000 records each), the calibrated marginals barely move (Fig 2):
| Marginal | Mean across 8 seeds | SD | Coefficient of variation |
|---|---|---|---|
| Mean age | 47.48 | 0.20 | 0.42% |
| Mean BMI | 29.49 | 0.18 | 0.62% |
| Mean HbA1c | 5.744 | 0.021 | 0.37% |
| % with diabetes | 14.87 | 0.73 | 4.91% |
The continuous means vary by well under 1%; the diabetes prevalence — a rarer binary outcome, so noisier — still holds to ~5% relative. The calibration is a property of the generator, not of any one seed.
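The panel columns can be recomputed from one value per seed. This sketch assumes the sample standard deviation (n − 1 denominator); the audit does not say which form it used. The second helper estimates how much a prevalence should vary between seeds from binomial sampling alone. Both function names are hypothetical.

```python
import math
import statistics


def seed_panel_summary(per_seed_values):
    """Mean, sample SD, and coefficient of variation (%) of one marginal across seeds."""
    mean = statistics.mean(per_seed_values)
    sd = statistics.stdev(per_seed_values)  # sample SD (n - 1); an assumption
    return {"mean": mean, "sd": sd, "cv_percent": 100 * sd / mean}


def binomial_se_percent(prevalence_percent, n_records):
    """Seed-to-seed SD of a prevalence, in percentage points, expected from sampling alone."""
    p = prevalence_percent / 100
    return 100 * math.sqrt(p * (1 - p) / n_records)
```

With the table's figures, `binomial_se_percent(14.87, 2000)` is about 0.80 percentage points, close to the observed SD of 0.73. The spread in diabetes prevalence is therefore about what 2,000-record samples from a fixed population would produce, rather than a sign of drift between seeds.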
§ · Why it matters
These three properties map directly onto real engineering needs. Reproducibility makes a synthetic fixture a stable contract: pin a seed in CI and a failing test means a real regression, not reshuffled mock data — and a bug report that says "seed 42, row 128" is exactly reproducible on any machine. Seed-disjointness lets you mint independent train / validation / holdout cohorts, or a fresh dataset per test, with a guarantee of no leakage between them. Seed-invariance means those independent cohorts are still drawn from the same calibrated population, so a model tuned on one behaves predictably on another.
§ · Reproduce it
Every number above is deterministic. Regenerate and re-hash:
# same seed twice -> identical hash
person-cli -n 5000 -s 42 -f csv | shasum -a 256
person-cli -n 5000 -s 42 -f csv | shasum -a 256 # matches
person-cli -n 5000 -s 7 -f csv | shasum -a 256 # differs
# or via the API (see the Quickstart): POST /v1/datasets/person
# { "clientId": "...", "count": 5000, "seed": 42 }
Load into pandas to reproduce the uniqueness, entropy, autocorrelation, and seed-invariance figures. New here? Start with the developer Quickstart →