Synthetic · Not real PII · Joint realism: attributes co-vary · 69 attributes / record · Calibrated: NHANES / ACS / CDC / KFF / BLS · Deterministic by seed · v1.2.73

In the flattened CSV that's 77 columns — the 69 core attributes plus an 8-field T2DM staging/outcome block (the seven t2dm_* fields and death_cause) that is filled only for the diabetic/deceased subset and blank for everyone else.

Synthetic Person Data Generator

Synthetic person records with 69 jointly-distributed attributes — demographics, health, behavioral, financial — calibrated against public US reference data.

Demographic-first generator. Returns synthetic person records with 69 jointly-distributed attributes across 9 domains (identity, geography, social, financial, behavioral, health basics, health conditions, healthcare utilization, medications). Each marginal distribution cites a public source (ACS 2022, NHANES 2017-2020, CDC NDSS, KFF 2023, MEPS 2022, BLS 2023, USPS L005 2024). Cross-field invariants are enforced: BMI = weight_kg / (height_cm/100)², ZIP matches state per USPS SCF ranges, and insulin appears only on records with diabetes. Deterministic by seed: the same seed and count yield byte-identical records. Bulk generation is asynchronous: submit a job, poll its status, then download the JSONL or CSV file from its download URL.

Free sample: 1,000 rows, as CSV or JSONL. No signup required.

Need more than the free tier — a bigger one-off dataset, or a custom population? Email me — free, by request.

Parameters

Name	Type	Req	Default	Description
`clientId`	`string`	required	—	Your account's client ID (from /v1/auth/register or /v1/auth/me).
`count`	`integer`	optional	`10000`	Number of person records to generate. Range: 1–1,000,000 (free tier: 5,000 rows per UTC day).
`seed`	`integer`	optional	`(derived from job_id)`	RNG seed for reproducibility. Same seed + count = byte-identical records.
`formats`	`array`	optional	—	Extra output formats. jsonl + csv are always produced; pass ["fhir"] to also emit FHIR R4 bulk NDJSON (Patient / Condition / Observation / MedicationRequest / Coverage / Encounter).

Example record

{
  "id": "64PG6RYQXXD7XFEKZJ6AW616M7",
  "given_name": "Elizabeth", "family_name": "Robinson",
  "age": 31, "sex_at_birth": "female", "race": "white", "ethnicity": "hispanic",
  "state": "IL", "urbanicity": "suburban", "zip_code": "60614",
  "marital_status": "married", "education": "some_college", "employment_status": "employed",
  "household_income_bracket": "50k_75k", "insurance_type": "marketplace", "homeowner_status": "renter",
  "smoking_status": "never", "exercise_frequency": "regular", "sleep_hours_avg": 7.2,
  "height_cm": 165.4, "weight_kg": 70.8, "bmi": 25.9,
  "diabetes_status": "none", "hypertension_status": "none", "a1c_value": 5.4,
  "visits_past_year": 2, "number_of_prescriptions": 1
  // ... 42 more attributes spanning all 9 domains
}
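
The BMI invariant can be checked directly against the example record. The sketch below assumes the stored `bmi` is rounded to one decimal place, so the tolerance of 0.05 is an assumption, not a documented guarantee; the field names are the ones shown in the record above.

```python
def bmi_from(height_cm: float, weight_kg: float) -> float:
    """BMI = weight / (height / 100)^2, with height in cm and weight in kg."""
    return weight_kg / (height_cm / 100) ** 2


def bmi_is_consistent(record: dict, tolerance: float = 0.05) -> bool:
    """True if the stored bmi matches the recomputed value to within rounding.

    The tolerance is an assumption: half a unit in the last reported decimal.
    """
    expected = bmi_from(record["height_cm"], record["weight_kg"])
    return abs(record["bmi"] - expected) <= tolerance


# Values copied from the example record above
example = {"height_cm": 165.4, "weight_kg": 70.8, "bmi": 25.9}
print(round(bmi_from(165.4, 70.8), 1))  # 25.9
print(bmi_is_consistent(example))       # True
```

If height and weight are themselves rounded before export, a slightly wider tolerance may be needed.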

Call it

# 1. Register once — returns your clientId and sets a session cookie
curl -sS -c cookies.txt -X POST https://api.simpleidgen.com/v1/auth/register \
  -H 'Content-Type: application/json' \
  -d '{"name":"You","email":"you@company.com","password":"your-password"}'

# 2. Submit a generation job (uses the saved cookie)
curl -sS -b cookies.txt -X POST https://api.simpleidgen.com/v1/datasets/person \
  -H 'Content-Type: application/json' \
  -d '{"clientId":"<your client id>","count":1000,"seed":42}'

# 3. Poll status, then download the JSONL once completed
#    (replace <job_id>; the quotes stop the shell from reading < and > as redirections)
curl -sS -b cookies.txt 'https://api.simpleidgen.com/v1/datasets/<job_id>'

// After registering or logging in (session cookie set), submit a job:
const res = await fetch('https://api.simpleidgen.com/v1/datasets/person', {
  method: 'POST',
  credentials: 'include',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ clientId: '<your client id>', count: 1000, seed: 42 }),
});
const { jobId, statusUrl } = await res.json();

import requests

# A Session keeps the login cookie for the requests that follow
s = requests.Session()
s.post('https://api.simpleidgen.com/v1/auth/login', json={'email': 'you@company.com', 'password': '...'})

# Submit a generation job; the response carries the job ID and a status URL
job = s.post('https://api.simpleidgen.com/v1/datasets/person', json={'clientId': '<your client id>', 'count': 100000, 'seed': 42}).json()
print(job['jobId'], job['statusUrl'])

specimen · example record en-US

id: 64PG6RYQXXD7XF…
name: Elizabeth Robinson
age / sex: 31 · female
geo: IL · suburban
education: some college · married
income: $50–75k · renter
bmi / a1c: 25.9 · 5.4
diabetes: none

age μ = 47 · bmi μ = 29 · a1c μ = 5.7

byte-identical ✓ · 69 attributes / record

Get started

Generation requires a free account — it takes about 10 seconds and gives you a client ID and an API session.

Endpoint

POST /v1/datasets/person

Async: submit a job, poll /v1/datasets/{job_id}, then download the JSONL or CSV file.
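
A polling loop for the submit-then-poll flow might look like the sketch below. The shape of the status response is not documented here, so the `status` field and its `completed` / `failed` values are assumptions. The status fetch is passed in as a callable so the loop itself makes no claims about the HTTP client.

```python
import time
from typing import Callable


def wait_for_job(fetch_status: Callable[[], dict],
                 interval_s: float = 2.0,
                 timeout_s: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch_status() until the job completes, fails, or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        body = fetch_status()
        state = body.get("status")      # ASSUMED field name
        if state == "completed":        # ASSUMED value
            return body
        if state == "failed":           # ASSUMED value
            raise RuntimeError(f"job failed: {body}")
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not complete in time")
        sleep(interval_s)
```

With a logged-in `requests.Session` named `s`, the callable could be `lambda: s.get(f"https://api.simpleidgen.com/v1/datasets/{job_id}").json()`. Check the actual response body before relying on the assumed field names.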

Uniqueness & seed-independence audit — GREEN

An independent randomness & uniqueness audit of a freshly generated 10,000-record sample (seed 2025), plus cross-seed independence across eight cohorts (adjacent, close and distant seeds). It checks that every record is distinct, that no column is stuck or collapsed, that records are independent of row order, and that different seeds produce genuinely different data while the same seed reproduces byte-for-byte. Verdict: GREEN — engine v1.0.46; generator logic byte-identical through the current build.

100% unique rows · 100% unique IDs · 100% unique emails · 0 cross-seed collisions

In-seed uniqueness (seed 2025, N=10,000)

Every full row and every primary/identity key is distinct. “Soft” identifiers collide only at the rates the value space and the birthday paradox predict — expected, not a defect.

Key	Distinct	Assessment
Full row	10,000 / 10,000	100% — zero exact duplicates
`id` (ULID)	10,000 / 10,000	primary-key invariant holds
`email`	10,000 / 10,000	UUID-domain scheme — unique
`phone`	10,000 / 10,000	designed unique
name + DOB + ZIP	10,000 / 10,000	fully distinct identity tuple
`ssn_last_four`	6,074	expected — birthday paradox over a 10⁴ space
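
For intuition on the `ssn_last_four` row, the expected number of distinct values can be estimated under a simple model. This sketch assumes uniform, independent draws from all 10⁴ four-digit values; how the generator actually assigns the field is not described here, so treat it as a ballpark only.

```python
def expected_distinct(space_size: int, draws: int) -> float:
    """Expected number of distinct values after `draws` uniform, independent
    picks from `space_size` equally likely values."""
    return space_size * (1 - (1 - 1 / space_size) ** draws)


print(round(expected_distinct(10_000, 10_000)))  # 6321
```

The uniform model gives roughly 6,321 distinct values, the same order as the 6,074 reported above. A lower observed count is consistent with a value space that is restricted or not perfectly uniform.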

No constriction · spread · independence

The data is not a thin slice of the attribute space: full substantive records and even the categorical-only projection are 100% distinct, and the continuous fields fill a well-spread, high-dimensional manifold.

Check	Result	Threshold / note
Substantive records distinct (62 cols)	10,000 / 10,000	100% — no constriction
Categorical-only distinct (51 cols)	10,000 / 10,000	largest cluster = 1
Pairwise near-duplicate (max match)	0.758	well below the 0.95 concern line
Nearest-neighbour spread (min, 9-D)	0.214	0 near-coincident points
Effective dimensionality (PCA, 95% var)	7 / 9	not dimension-collapsed
Continuous decile fill	10 / 10	all smooth fields cover the full range
Per-row hash stream (χ², 20 bins)	20.4	PASS (< 30.1) — uniform
Row-order autocorrelation (lag-1)	|r| < 0.02	records independent of order
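
The last two checks in the table can be approximated with the standard library. The sketch below is a generic Pearson χ² test against equal bin counts and a plain lag-1 sample autocorrelation, not the audit's own code; how rows are hashed into bins is left out because it is not specified. The 30.1 threshold in the table matches the 5% critical value for 19 degrees of freedom (20 bins).

```python
from statistics import mean


def chi_square_uniform(counts: list[int]) -> float:
    """Pearson chi-square statistic against equal expected counts per bin."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def lag1_autocorrelation(values: list[float]) -> float:
    """Lag-1 sample autocorrelation of a numeric column taken in row order."""
    m = mean(values)
    num = sum((a - m) * (b - m) for a, b in zip(values, values[1:]))
    den = sum((v - m) ** 2 for v in values)
    return num / den


# A perfectly even 20-bin histogram scores 0; the table's pass line is 30.1
print(chi_square_uniform([500] * 20))  # 0.0
```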

Seed independence & determinism

Different base seeds must produce genuinely different cohorts; the same seed must reproduce byte-for-byte. Both hold across adjacent, close and distant seed pairs.

Seed pair	Identical rows	Shared IDs
Adjacent (2025 vs 2026)	0	0
Close (1 vs 2)	0	0
Distant (2025 vs 777; 1 vs 999,999; 2025 vs 80,000,000)	0	0
Determinism (seed 2025, regenerated ×2)	byte-identical (MD5 match)
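
Both properties can be spot-checked on downloaded files. This is a minimal sketch that assumes each output has been read as raw bytes (for example with `open(path, "rb").read()`); the helper names are illustrative and not part of the API.

```python
import hashlib


def md5_of(data: bytes) -> str:
    """Hex MD5 digest of a payload; equal digests indicate byte-identical output."""
    return hashlib.md5(data).hexdigest()


def shared_rows(jsonl_a: bytes, jsonl_b: bytes) -> int:
    """Number of byte-identical lines that appear in both JSONL payloads."""
    return len(set(jsonl_a.splitlines()) & set(jsonl_b.splitlines()))


# Toy payloads standing in for real downloads
run_1 = b'{"id":"A"}\n{"id":"B"}\n'
run_2 = b'{"id":"A"}\n{"id":"B"}\n'   # same seed, regenerated
other = b'{"id":"C"}\n{"id":"D"}\n'   # different seed

print(md5_of(run_1) == md5_of(run_2))  # True
print(shared_rows(run_1, other))       # 0
```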

GREEN: 0 critical, 0 major. Every row unique; no stuck or collapsed columns; healthy spread on a well-distributed 7-of-9-dimensional manifold; full cross-seed independence; determinism intact. One minor observation: one of eight death-cause categories did not surface in a single ~500-death subsample, which is pure sampling variance (the category is present at other seeds). The dataset is genuinely diverse, not a thin slice of the attribute space.

Method: an independent audit run directly on the deterministic person engine (14 sections — uniqueness, entropy, near-duplicate, k-anonymity, PCA, nearest-neighbour spread, cross-seed). Cohorts generated locally; full report retained in the project’s validation records.

Full scientific validation vs real NHANES — GOLD

An independent scientific validation of a 200,000-record cohort against real NHANES 2017–2020 MEC-weighted microdata (9,693 adults) and a maintained library of US benchmarks (NHANES, CDC NDSS, ADA Standards of Care, USRDS, Census/ACS, KFF). It scores structural plausibility, univariate marginals, distribution shape (Kolmogorov–Smirnov vs the real microdata), and clinical associations. Verdict: GOLD — engine v1.0.46; generator logic byte-identical through the current build.

row uniqueness: 100% · impossible records: 0/200k · KS vs NHANES (age): 0.042 · clinical ORs in band: 5/5

Layer summary

Layer	Status	Headline
L1 — structural / plausibility	GOLD	0 impossible records / 200,000 (detector self-tested 7/7)
L2 — marginal fidelity	GOLD	all headline rates in band
L3 — shape vs NHANES (KS)	GOLD	every variable in the “good”/“excellent” tier
L3a — clinical associations	GOLD	5 / 5 core odds ratios in band

L1 — structural integrity (zero tolerance)

Clinically or physically impossible combinations, counted across all 200,000 records. The detector is itself tested against hand-injected violators (7/7 caught, 0 false positives), so the zero is real, not a silent skip.

Impossibility rule	Hits
Non-diabetic with diabetic-range A1c (≥ 6.5)	0
Insulin without diabetes	0
Male & pregnant	0 / 99,061
Blood-pressure inversion (systolic ≤ diastolic)	0
Clinical stage > biological stage	0
Total impossibility rate	0.00000%
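
A detector of this kind, with a self-test on injected violators, could be sketched as follows. Only `diabetes_status`, `a1c_value` and `sex_at_birth` appear in the example record; `on_insulin`, `pregnant`, `systolic_bp` and `diastolic_bp` are hypothetical field names, and treating `"none"` as the only non-diabetic value is an assumption. The stage rule is omitted because its fields are not shown.

```python
def find_violations(r: dict) -> list[str]:
    """Return the names of the impossibility rules a record breaks."""
    hits = []
    non_diabetic = r.get("diabetes_status") == "none"      # ASSUMED encoding
    if non_diabetic and r.get("a1c_value", 0.0) >= 6.5:
        hits.append("non_diabetic_with_diabetic_a1c")
    if non_diabetic and r.get("on_insulin", False):         # hypothetical field
        hits.append("insulin_without_diabetes")
    if r.get("sex_at_birth") == "male" and r.get("pregnant", False):  # hypothetical field
        hits.append("male_and_pregnant")
    sbp, dbp = r.get("systolic_bp"), r.get("diastolic_bp")  # hypothetical fields
    if sbp is not None and dbp is not None and sbp <= dbp:
        hits.append("bp_inversion")
    return hits


# Self-test: a clean record passes, each hand-injected violator is caught
clean = {"diabetes_status": "none", "a1c_value": 5.4, "sex_at_birth": "female",
         "systolic_bp": 118, "diastolic_bp": 76}
injected = [
    {**clean, "a1c_value": 6.9},
    {**clean, "on_insulin": True},
    {**clean, "sex_at_birth": "male", "pregnant": True},
    {**clean, "systolic_bp": 70, "diastolic_bp": 80},
]
assert find_violations(clean) == []
assert all(find_violations(r) for r in injected)
```

Testing the detector on known violators first is what distinguishes a genuine zero from a rule that never fires.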

L2 — marginals vs US benchmarks

Metric	Synthetic (200K)	Benchmark	In band
Diabetes, any	14.62%	~14% (CDC NDSS)	✓
Prediabetes	35.90%	37–38%	✓
CKD, any	14.84%	13.9–14.5%	✓
Hypertension (measured)	49.19%	47.7%	✓
Obese (BMI ≥ 30)	42.33%	42.4%	✓
Mean A1c	5.745	5.8	✓
Insurance: employer / Medicare / uninsured	49.4% / 19.1% / 9.0%	53.7% / 18.9% / 8.0%	✓
Current smoker	11.57%	11.6% (NHIS)	✓
Bachelor’s+ (age 25+)	37.35%	37% (CPS ASEC)	✓

L3 — distribution shape vs real NHANES microdata

Kolmogorov–Smirnov distance between synthetic and MEC-weighted real NHANES adults (lower is closer; under ~0.05 is excellent, under ~0.08 good).

Variable	KS-D	Tier
age	0.042	excellent
BMI	0.020	excellent
weight	0.038	excellent
height	0.029	excellent
waist circumference	0.019	excellent
systolic BP	0.056	good
diastolic BP	0.074	good
A1c	0.063	good
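
For reference, the two-sample KS distance is the largest gap between two empirical CDFs. The sketch below is the plain unweighted version; the validation described here uses MEC-weighted NHANES data, so this does not reproduce the table's numbers.

```python
from bisect import bisect_right


def ks_distance(a: list[float], b: list[float]) -> float:
    """Maximum absolute gap between the empirical CDFs of two samples (unweighted)."""
    a, b = sorted(a), sorted(b)
    # Both ECDFs are step functions, so the largest gap occurs at a data point
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


print(ks_distance([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

Identical samples score 0 and fully separated samples score 1, which is the scale the tiers above refer to.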

Joint correlations hold too: height↔weight r = 0.514, BMI↔waist r = 0.858 — both in their real-world bands.

L3a — clinical associations

The relationships that make a record cohere, measured with Woolf–Haldane confidence intervals (Mantel–Haenszel, age-adjusted, for smoking→CVD).

Association	Effect	Expected band	In band
Obesity → type-2 diabetes	OR 5.01	4.0–5.5	✓
Obesity → hypertension	OR 3.08	2.0–3.5	✓
Diabetes → hypertension	OR 6.75	strong+	✓
Current smoker → CVD (age-adj.)	OR 1.93	1.4–2.5	✓
Higher A1c in retinopathy cases	+0.87 pts	≥ 0.5	✓
CKD stage → mortality	2.67×	monotone ≥ 2×	✓
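
An odds ratio with a confidence interval can be computed from a 2×2 table as sketched below. It reads "Woolf–Haldane" as the Woolf logit interval with a Haldane 0.5 continuity correction, which is an interpretation and not a confirmed description of the validation code. The counts in the example are made up for illustration and are not taken from the cohort.

```python
import math


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table.

    a: exposed cases, b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases.
    """
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))     # Haldane correction
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)    # Woolf SE of log(OR)
    lo = math.exp(math.log(odds_ratio) - z * se)
    hi = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lo, hi


# Hypothetical counts, for illustration only
or_, lo, hi = odds_ratio_ci(300, 700, 100, 900)
print(f"OR {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

An association is "in band" when the point estimate falls inside the expected range; the interval shows how much of that could be sampling noise.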

GOLD — 0 critical, 0 major, 5 minor. All minor residuals are by-design or accepted: type-1 share of diabetes slightly high, the 40–59 diabetes rate slightly conservative, and a residual joint structure in A1c × diabetes-status × age (a two-sample classifier separates synthetic from real at AUC 0.86 — driven by deliberately tight A1c calibration, not a marginal error). Certification holds; the data is safe to use.

Method: an independent biostatistician-grade validation run on the deterministic person engine — structural CKVR, marginals, MEC-weighted KS vs real NHANES microdata, odds ratios, and the T2DM stage dial. Synthetic cohorts generated locally; real comparison data is public NHANES microdata; full report retained in the project’s validation records.