§ · Determinism audit

Reproducible by Seed: a Determinism and Diversity Audit

Generate the same dataset twice with the same seed and you get byte-identical files. Change the seed and every single record changes — no shared rows, IDs, or identities. Yet the population statistics barely move. That combination — perfect reproducibility, complete seed-disjointness, and seed-invariant calibration — is what makes synthetic data safe to pin in a CI suite or a bug report.

We put the guarantee to the test on real generated data: two 5,000-record datasets at seed 42, one at seed 7, and eight more seeds for the population panel.

§ · 1 · Same seed → byte-identical

We generated 5,000 records at seed 42, twice, and hashed each file. They are identical to the byte; seed 7 is a different file:

Run | Seed | SHA-256 (first 16 hex) | Result
A   | 42   | ccb99fdfe6370252       |
A′  | 42   | ccb99fdfe6370252       | identical to A ✓
B   | 7    | d12a1a913d7f3f1c       | different

Reproducibility is structural: one seeded ChaCha20 stream drives every draw, in a fixed order, with no clock or platform entropy. The same seed and options reproduce the same people on any machine.
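A minimal sketch of why a single seeded stream consumed in a fixed order yields byte-identical output. This is a toy, not the product's generator: Python's `random.Random` (Mersenne Twister) stands in for the ChaCha20 stream, and the `generate_csv` function and its three columns are hypothetical.

```python
import hashlib
import random


def generate_csv(seed: int, n: int) -> bytes:
    """Toy generator: every draw comes from one seeded stream, in a fixed order."""
    # Stand-in for the ChaCha20 stream (assumption); no clock or OS entropy is used.
    rng = random.Random(seed)
    lines = ["id,age,bmi"]
    for _ in range(n):
        # The order of draws per record never changes, so the bytes never change.
        record_id = f"{rng.getrandbits(64):016x}"
        age = rng.randint(18, 90)
        bmi = round(rng.uniform(17.0, 45.0), 1)
        lines.append(f"{record_id},{age},{bmi}")
    return ("\n".join(lines) + "\n").encode("utf-8")


def short_hash(data: bytes) -> str:
    """First 16 hex characters of the SHA-256 digest, as in the table above."""
    return hashlib.sha256(data).hexdigest()[:16]


run_a = short_hash(generate_csv(seed=42, n=5000))
run_a_again = short_hash(generate_csv(seed=42, n=5000))
run_b = short_hash(generate_csv(seed=7, n=5000))

print("A == A':", run_a == run_a_again)
print("A == B: ", run_a == run_b)
```

Any source of nondeterminism inside the loop (a timestamp, an unordered set iteration, a second unseeded generator) would break the first comparison, which is why the fixed draw order matters as much as the seed.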

§ · 2 · Different seed → a disjoint cohort; every record unique

A seed change is not a small perturbation: it resamples the entire population. We compared seed 42 against seed 7 (5,000 records each) and checked seed 42 against itself for collisions:

Check                                                  | Result
Identical full rows shared between seed 42 and seed 7  | 0
Shared IDs between the two seeds                       | 0
Shared name + date-of-birth identities                 | 0
Distinct IDs within a 5,000-record dataset             | 100.00% (5,000 / 5,000)
Distinct full rows                                     | 100.00%
Distinct name + DOB identities                         | 100.00% (0 coincidental collisions)

No duplicate people, no reused identities, and no overlap across seeds — so two seeds give you two independent cohorts, and one seed gives you a clean, collision-free dataset.
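The checks in the table reduce to set intersections and distinct counts. A sketch under assumptions: the column names `id`, `name`, and `date_of_birth` are hypothetical, and the two tiny cohorts are hand-written toy data rather than generator output.

```python
from typing import Callable, Dict, List, Set

Record = Dict[str, str]


def full_rows(records: List[Record]) -> Set[tuple]:
    """Each record as a hashable tuple of (column, value) pairs."""
    return {tuple(sorted(r.items())) for r in records}


def ids(records: List[Record]) -> Set[str]:
    return {r["id"] for r in records}


def identities(records: List[Record]) -> Set[tuple]:
    """Name + date-of-birth pairs (column names are assumptions)."""
    return {(r["name"], r["date_of_birth"]) for r in records}


def overlap_report(a: List[Record], b: List[Record]) -> Dict[str, int]:
    """Count rows, IDs, and identities that appear in both cohorts."""
    return {
        "shared_rows": len(full_rows(a) & full_rows(b)),
        "shared_ids": len(ids(a) & ids(b)),
        "shared_identities": len(identities(a) & identities(b)),
    }


def distinct_fraction(records: List[Record], key: Callable[[List[Record]], Set]) -> float:
    """Share of records that are distinct under the given key (1.0 = no collisions)."""
    return len(key(records)) / len(records)


# Toy cohorts standing in for the seed-42 and seed-7 datasets.
cohort_a = [
    {"id": "a1", "name": "Ada Reyes", "date_of_birth": "1980-02-14"},
    {"id": "a2", "name": "Tom Liang", "date_of_birth": "1975-11-03"},
]
cohort_b = [
    {"id": "b1", "name": "Mia Okafor", "date_of_birth": "1992-06-21"},
    {"id": "b2", "name": "Sam Novak", "date_of_birth": "1968-09-30"},
]

report = overlap_report(cohort_a, cohort_b)
print(report)
print(distinct_fraction(cohort_a, ids))
```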

§ · 3 · Every field carries entropy

A dataset can be "unique" per row yet have columns stuck on one value. We measured normalized Shannon entropy for every field (0 = constant, 1 = maximally diverse). The fields span the full range, from the constant country column through low-entropy fields such as death_cause up to balanced binaries and fully unique identifiers:

Fig 1 · Normalized entropy per field (0 = constant, 1 = maximally diverse)

Field              | Normalized entropy
country (US-only)  | 0.00
death_cause        | 0.13
diabetes_status    | 0.69
state              | 0.88
bmi                | 0.94
age                | 0.97
household income   | 0.99
sex_at_birth       | 1.00
id · email · phone | 1.00

Exactly three of the 77 columns are constant, and by design: country, locale, and preferred_language — the public API emits US / en-US records only. Every other field carries entropy, so nothing is silently frozen.
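One common way to compute a normalized Shannon entropy is to divide the entropy of the observed values by the log of the number of distinct values. The sketch below assumes that normalization; the audit does not state which one it used, so treat `normalized_entropy` as illustrative. It does reproduce the endpoints in Fig 1: a constant column scores 0, while a balanced binary and a fully unique identifier both score 1.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def normalized_entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy divided by log(k), where k is the number of distinct values.

    Returns 0.0 for a constant (or empty) column and 1.0 when every
    distinct value is equally frequent.
    """
    counts = Counter(values)
    k = len(counts)
    if k <= 1:
        return 0.0
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)


# Toy columns, not generator output.
constant = normalized_entropy(["US"] * 100)             # one value only
balanced = normalized_entropy(["F", "M"] * 50)          # 50 / 50 binary
unique = normalized_entropy(list(range(100)))           # every value distinct
skewed = normalized_entropy(["no"] * 90 + ["yes"] * 10)  # 90 / 10 binary

print(round(constant, 2), round(balanced, 2), round(unique, 2), round(skewed, 2))
```

Under this normalization a skewed binary lands well below 1, which is the pattern a low-scoring field such as death_cause would show if most records share one value.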

§ · 4 · Records are independent of row order

Each record is seeded independently, so consecutive rows carry no relationship: the order you receive them in is arbitrary. The lag-1 autocorrelation (a field at row i against row i+1) sits near zero across the numeric fields:

Field  | lag-1 autocorrelation r(i, i+1)
age    | −0.021
BMI    | +0.002
HbA1c  | −0.015
weight | −0.0004

All are within about ±0.02 of zero (the largest in magnitude is age at −0.021), so you can shard, shuffle, or stream the rows in any order without introducing bias.
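Lag-1 autocorrelation is the Pearson correlation between a column and itself shifted by one row. The sketch below is one straightforward way to compute it; the sample data is a toy Gaussian column, not generator output. As a rough yardstick, for n independent values the estimate has a standard error of about 1/√n, which is roughly 0.014 for 5,000 rows, so values like those in the table are what sampling noise alone would produce if they were computed on a dataset of that size.

```python
import math
import random
from typing import Sequence


def lag1_autocorrelation(x: Sequence[float]) -> float:
    """Pearson correlation between x[i] and x[i+1]."""
    a, b = x[:-1], x[1:]
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


# Independent draws: consecutive rows carry no relationship.
rng = random.Random(0)
independent = [rng.gauss(50.0, 15.0) for _ in range(5000)]
r_independent = lag1_autocorrelation(independent)

# A steadily increasing column: each row predicts the next almost perfectly.
r_trend = lag1_autocorrelation(list(range(100)))

print(f"independent: {r_independent:+.3f}")
print(f"trend:       {r_trend:+.3f}")
```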

§ · 5 · The population is seed-invariant

Here is the property that makes determinism useful rather than merely tidy: changing the seed changes who you get, but not the population. Across eight seeds (2,000 records each), the calibrated marginals barely move (Fig 2):

Fig 2a · Mean age across 8 seeds (CV 0.42%): one dot per seed (n = 8), mean 47.5, axis from 45.0 to 50.0
Fig 2b · % diabetic across 8 seeds (CV 4.91%): one dot per seed (n = 8), mean 14.9, axis from 10.0 to 20.0

Marginal        | Mean across 8 seeds | SD    | Coefficient of variation
Mean age        | 47.48               | 0.20  | 0.42%
Mean BMI        | 29.49               | 0.18  | 0.62%
Mean HbA1c      | 5.744               | 0.021 | 0.37%
% with diabetes | 14.87               | 0.73  | 4.91%

The continuous means vary by well under 1%. Diabetes prevalence, a rarer binary outcome and therefore noisier, still varies by only about 5% in relative terms. The calibration is a property of the generator, not of any one seed.
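The coefficient of variation is the standard deviation divided by the mean. The sketch below recomputes the CV column from the table's own mean and SD columns (the results agree to within rounding of those published figures) and adds one sanity check that is not in the audit: the spread in prevalence you would expect from binomial sampling alone, assuming 2,000 independent records per seed. Whether the audit used the sample or population SD is not stated; `coefficient_of_variation` assumes the sample SD.

```python
import math
import statistics
from typing import Dict, Sequence, Tuple


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Sample standard deviation over the mean, as a percentage (assumed convention)."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


# (mean across 8 seeds, SD) copied from the table above.
table: Dict[str, Tuple[float, float]] = {
    "Mean age": (47.48, 0.20),
    "Mean BMI": (29.49, 0.18),
    "Mean HbA1c": (5.744, 0.021),
    "% with diabetes": (14.87, 0.73),
}

cv_from_table = {
    name: round(100.0 * sd / mean, 2) for name, (mean, sd) in table.items()
}
for name, cv in cv_from_table.items():
    print(f"{name}: CV = {cv}%")

# Expected seed-to-seed SD of a prevalence estimate from sampling noise alone:
# sqrt(p * (1 - p) / n), here in percentage points.
p = 14.87 / 100.0
n = 2000
binomial_sd_pct = 100.0 * math.sqrt(p * (1.0 - p) / n)
print(f"binomial SD at n={n}: {binomial_sd_pct:.2f} percentage points")
```

The binomial figure comes out close to the observed SD of 0.73, which suggests the larger CV for diabetes prevalence reflects ordinary sampling noise at 2,000 records rather than drift in the calibration.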

§ · Why it matters

These three properties map directly onto real engineering needs. Reproducibility makes a synthetic fixture a stable contract: pin a seed in CI and a failing test means a real regression, not reshuffled mock data — and a bug report that says "seed 42, row 128" is exactly reproducible on any machine. Seed-disjointness lets you mint independent train / validation / holdout cohorts, or a fresh dataset per test, with a guarantee of no leakage between them. Seed-invariance means those independent cohorts are still drawn from the same calibrated population, so a model tuned on one behaves predictably on another.

§ · Reproduce it

Every number above is deterministic. Regenerate and re-hash:

# same seed twice -> identical hash
person-cli -n 5000 -s 42 -f csv | shasum -a 256
person-cli -n 5000 -s 42 -f csv | shasum -a 256   # matches
person-cli -n 5000 -s 7  -f csv | shasum -a 256   # differs

# or via the API (see the Quickstart): POST /v1/datasets/person
#   { "clientId": "...", "count": 5000, "seed": 42 }
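The same check can be scripted, for example as a CI guard. This sketch shells out to `person-cli` with the flags shown above and compares SHA-256 digests of its standard output; the helper names are hypothetical, and `check_determinism` is only defined here, not called, since it needs the CLI on your PATH.

```python
import hashlib
import subprocess
from typing import List


def sha256_of_stdout(command: List[str]) -> str:
    """Run a command and return the SHA-256 hex digest of its raw stdout bytes."""
    result = subprocess.run(command, check=True, capture_output=True)
    return hashlib.sha256(result.stdout).hexdigest()


def person_cli_command(seed: int, count: int = 5000) -> List[str]:
    """Argument list matching the shell commands above."""
    return ["person-cli", "-n", str(count), "-s", str(seed), "-f", "csv"]


def check_determinism() -> None:
    """Same seed twice must match; a different seed must differ."""
    first = sha256_of_stdout(person_cli_command(42))
    second = sha256_of_stdout(person_cli_command(42))
    other = sha256_of_stdout(person_cli_command(7))
    assert first == second, "seed 42 did not reproduce byte-identical output"
    assert first != other, "seed 7 unexpectedly matched seed 42"
```

Hashing the raw bytes rather than parsed rows keeps the check as strict as the `shasum` comparison: any change in column order, number formatting, or line endings shows up as a different digest.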

Load into pandas to reproduce the uniqueness, entropy, autocorrelation, and seed-invariance figures.