Synthetic Person Profile — Large-Scale Fidelity Report

An independent quality-assurance review of a single large snapshot of synthetic person-profile records, generated end-to-end as one job. It tests uniqueness, fidelity to public reference distributions, and internal consistency at 2,000,000-record scale.

Engine

person-profile-advanced v0.5

This dataset was generated by version 0.5 of the person-profile-advanced engine. All quality metrics in this report are attributable to that specific version.

Reference benchmarks cited (8)

Source	Vintage
`ACS`	`2022`
`NHANES`	`2017-2020`
`CDC-NDSS`	`2022`
`CDC-mortality`	`2022`
`KFF`	`2023`
`MEPS`	`2022`
`BLS`	`2023`
`USPS-L005`	`2024`

How to read this report

This is a data fidelity report: an external quality-assurance review of one specific snapshot generated by the engine listed above. It tests three classes of claim:

Uniqueness. Are records distinct at the row, identifier, and identity-tuple levels? Reported with 95% Wilson confidence intervals.
Distribution fidelity. Do the dataset's marginal and joint distributions match published US-adult reference distributions (NHANES, ACS, CDC, KFF, BLS, US Census)? Numeric attributes are tested with two-sample Kolmogorov–Smirnov and categorical attributes with chi-squared goodness-of-fit. Each test receives a three-tier verdict based on Cohen's conventional effect-size bands: MATCH (Cramér's V < 0.10 for categorical; KS D < 0.05 for numeric; negligible effect); CLOSE (0.10 ≤ V < 0.30 or 0.05 ≤ D < 0.10; small but practically meaningful); DIVERGES (V ≥ 0.30 or D ≥ 0.10; medium-to-large, surfaced explicitly in the Limitations section). At sample sizes in the millions the p-value alone is uninformative, because even trivial departures become "significant", so effect size is the primary verdict driver.
Internal consistency. Do invariant relationships hold across the dataset (e.g., BMI consistent with height & weight; ZIP consistent with state; pregnancy_status consistent with sex_at_birth)?

The reference distributions cited above are comparison targets, not training inputs. Verdicts are MATCH, CLOSE, DIVERGES, SIGN MISMATCH (for correlations only), or N/A when the test is not applicable. Cross-field invariants are tested against absolute counts (we report any non-zero violation).

Executive summary

Scenario

single dataset, n = 2,000,000

Records

2,000,000

in this snapshot

Unique rows

100.0000%

0 duplicate row(s)

Unique IDs

100.0000%

0 ID collision(s)

Email format valid

100.00%

Marginal distribution matches (negligible effect)

17 / 18

1 close, 0 diverge

Invariant violations

0

across all checks

1. Dataset characteristics

Uniqueness with 95% Wilson confidence intervals

Definition	Unique count	Percentage	Collisions
Whole-row hash (every field byte-identical)	2,000,000	100.0000% (95% CI 99.9998–100.0000%)	0
ID field (record identifier)	2,000,000	100.0000% (95% CI 99.9998–100.0000%)	0
Identity tuple (given_name + family_name + date_of_birth)	1,908,416	95.4208% (95% CI 95.3917–95.4497%)	46,166

Identity-tuple collisions are structurally expected at scale — see the birthday-paradox section below for the analytical derivation. Whole-row and ID-field collisions are the diagnostic uniqueness signals.
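
The Wilson intervals in the table above can be reproduced from the counts alone. The following is a minimal sketch using only the Python standard library; the function name `wilson_interval` is illustrative and is not taken from the engine or its tooling.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


n = 2_000_000
row_lo, row_hi = wilson_interval(2_000_000, n)      # whole-row hash and ID field
tuple_lo, tuple_hi = wilson_interval(1_908_416, n)  # identity tuple
print(f"{row_lo:.4%} to {row_hi:.4%}")
print(f"{tuple_lo:.4%} to {tuple_hi:.4%}")
```

With zero observed collisions the lower bound reduces to n / (n + z²), which is why a perfect score at n = 2,000,000 is still reported with a lower limit of 99.9998% and not 100%.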

2. Distribution fidelity

For each marginal distribution where a published US-adult reference exists, the observed proportions are compared against the reference using a formal statistical test. At n = 2,000,000 almost every test rejects the null hypothesis at α = 0.05 (p < 1e-10 for most rows below), so the verdict column follows the effect-size bands defined in "How to read this report", not the p-value: MATCH means Cramér's V < 0.10 or KS D < 0.05.

Categorical attributes (chi-squared goodness-of-fit)

Attribute	Reference	χ²	df	p-value	Cramér's V	Verdict (effect size)
`ckd_status`	CDC 2022	19194.968	3	<1e-10	0.0980	MATCH
`diabetes_status`	CDC NDSS 2022	10122.693	4	<1e-10	0.0711	MATCH
`education`	ACS 2022	9307.666	5	<1e-10	0.0682	MATCH
`employment_status`	BLS 2023	6135.072	6	<1e-10	0.0554	MATCH
`ethnicity`	ACS 2022	124.31	1	<1e-10	0.0079	MATCH
`hypertension_status`	CDC 2022	499.426	2	<1e-10	0.0158	MATCH
`insurance_type`	KFF 2023	12906.311	6	<1e-10	0.0803	MATCH
`marital_status`	ACS 2022	3652.099	4	<1e-10	0.0427	MATCH
`race`	ACS 2022	19275.245	6	<1e-10	0.0982	MATCH
`sex_at_birth`	ACS 2022	0.049	1	0.825	0.0002	MATCH
`smoking_status`	CDC 2022	6663.676	2	<1e-10	0.0577	MATCH
`state`	ACS 2022	69.128	50	0.038	0.0059	MATCH
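
The gap between p-value and effect size in this table can be checked by hand. The reported effect sizes are reproduced by sqrt(χ² / n), with no further normalization by degrees of freedom; that observation comes from the numbers in the table, not from any documented formula of the engine. The sketch below applies it together with the three-tier bands; the function names are illustrative.

```python
from math import sqrt


def effect_size(chi2: float, n: int) -> float:
    """Effect size as sqrt(chi2 / n); this reproduces the table's values."""
    return sqrt(chi2 / n)


def categorical_verdict(v: float) -> str:
    """Three-tier verdict using the report's bands for categorical attributes."""
    if v < 0.10:
        return "MATCH"
    if v < 0.30:
        return "CLOSE"
    return "DIVERGES"


n = 2_000_000
for name, chi2 in [("ckd_status", 19194.968), ("race", 19275.245), ("state", 69.128)]:
    v = effect_size(chi2, n)
    print(f"{name}: V = {v:.4f} -> {categorical_verdict(v)}")
```

A χ² near 19,000 is overwhelmingly significant, yet at two million records it corresponds to an effect size just under 0.10, which is why `ckd_status` and `race` sit at the upper edge of the MATCH band.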

Numeric attributes (two-sample Kolmogorov–Smirnov)

Attribute	Observed (mean ± sd)	Reference	Source	KS D	p-value	Verdict (effect size)
`a1c_value`	5.69 ± 1.00	5.70 ± 0.95	NHANES 2017-2020 (LBXGH adult mean)	0.0955	<1e-10	CLOSE
`age`	47.49 ± 18.49	—	ACS 2022 (US adults)	0.0178	<1e-10	MATCH
`bmi`	29.49 ± 6.72	29.50 ± 6.80	NHANES 2017-2020 (US adults)	0.0193	<1e-10	MATCH
`height_cm`	168.48 ± 9.90	168.50 ± 10.00	NHANES 2017-2020 (adult height)	0.0226	<1e-10	MATCH
`waist_circumference_cm`	97.56 ± 15.82	98.00 ± 16.00	NHANES 2017-2020 (BMXWAIST adult mean)	0.0103	<1e-10	MATCH
`weight_kg`	84.18 ± 21.93	84.00 ± 22.00	NHANES 2017-2020 (adult weight)	0.0137	<1e-10	MATCH
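
For readers unfamiliar with the KS D column, the statistic is the largest vertical gap between two empirical cumulative distribution functions. The following is a small pure-Python sketch of the two-sample statistic and the verdict bands; it is written for clarity on small inputs and is not the report's own implementation.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b) -> float:
    """Two-sample KS statistic D: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both ECDFs are step functions, so the gap can only change at observed values.
    for x in set(a) | set(b):
        gap = abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        d = max(d, gap)
    return d


def numeric_verdict(d: float) -> str:
    """Three-tier verdict using the report's bands for numeric attributes."""
    if d < 0.05:
        return "MATCH"
    if d < 0.10:
        return "CLOSE"
    return "DIVERGES"


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # half of each sample lies apart
print(numeric_verdict(0.0955))                   # the a1c_value row
```

At multi-million-row scale a sorted merge or a library routine would be the practical choice; the loop above favours readability over speed.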

Observed vs reference — histograms

Bars show the observed empirical density; the red line is the parametric reference density at the same support.

3. Joint structure

3a. Diabetes prevalence by age band

Age band	n	Observed (95% CI)	Reference	Source	Verdict
18-44	940,227	3.55% (CI 3.52–3.59%)	4.0%	CDC NDSS 2022	MATCH
45-64	660,317	14.26% (CI 14.18–14.34%)	17.0%	CDC NDSS 2022	CLOSE
65-95	399,456	26.09% (CI 25.96–26.23%)	29.0%	CDC NDSS 2022	CLOSE

3b. Mean BMI by age band

Age band	n	Observed mean	Reference mean	Δ	Source	Verdict
18-44	940,227	29.49	28.4	+1.09	NHANES 2017-2020	DIVERGES
45-64	660,317	29.50	30.4	-0.90	NHANES 2017-2020	DIVERGES
65-95	399,456	29.49	29.6	-0.11	NHANES 2017-2020	MATCH

3c. Inter-attribute Pearson correlations

Practical-equivalence band ±0.15. Sign mismatches reported separately.

Pair	Observed r	Reference r	Δ	Source	Verdict
`bmi × waist_circumference_cm`	0.7604	0.85	-0.090	NHANES	MATCH
`bmi × weight_kg`	0.8888	0.78	+0.109	NHANES	MATCH
`height_cm × weight_kg`	0.5162	0.45	+0.066	NHANES	MATCH
`age × a1c_value`	0.2851	0.14	+0.145	NHANES	MATCH
`age × bmi`	0.0001	0.10	-0.100	NHANES	MATCH
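
The verdict logic for this table can be sketched as follows. The ±0.15 band comes from the report; treating the band as inclusive, checking the sign before the band, and treating a zero correlation as having no sign are assumptions made for this illustration.

```python
def correlation_verdict(observed: float, reference: float, band: float = 0.15) -> str:
    """Classify an observed Pearson r against a reference r."""
    # Assumption: opposite signs are reported as SIGN MISMATCH before the band is applied.
    if observed * reference < 0:
        return "SIGN MISMATCH"
    # Assumption: the practical-equivalence band is inclusive at its edge.
    return "MATCH" if abs(observed - reference) <= band else "DIVERGES"


pairs = {
    "bmi x waist_circumference_cm": (0.7604, 0.85),
    "bmi x weight_kg": (0.8888, 0.78),
    "height_cm x weight_kg": (0.5162, 0.45),
    "age x a1c_value": (0.2851, 0.14),
    "age x bmi": (0.0001, 0.10),
}
for name, (obs, ref) in pairs.items():
    print(f"{name}: delta = {obs - ref:+.3f} -> {correlation_verdict(obs, ref)}")
```

Note that a MATCH under this band can still hide a large relative gap: the `age × bmi` row passes with Δ = -0.100 even though the observed correlation is effectively zero.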

4. Uniqueness & birthday-paradox analysis

The identity-tuple uniqueness percentage decreases at scale because the (given_name × family_name × date_of_birth) value space is finite. The expected collision count is computed and compared against the observed.

Given-name distinct values in this dataset	60
Family-name distinct values in this dataset	30
Distinct dates of birth in this dataset	28,470
Identity-tuple pool size (product)	51,246,000
Records drawn (n)	2,000,000
Expected collisions (linearity of expectation, n(n-1)/2 / pool)	39,027.4
Observed collisions	46,166
Observed / Expected ratio	1.183
Verdict (test: normal_2sigma)	AS EXPECTED
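
The figures in this table follow from a few lines of arithmetic. The sketch below assumes every identity tuple in the pool is equally likely, which is the assumption behind the n(n-1)/2 / pool formula; the variable names are illustrative.

```python
from math import comb

# Distinct values observed in this dataset (from the table above).
given_names, family_names, birth_dates = 60, 30, 28_470
pool = given_names * family_names * birth_dates

n = 2_000_000
# Each of the n(n-1)/2 record pairs collides with probability 1/pool under uniform draws.
expected_pairs = comb(n, 2) / pool

observed = 46_166
ratio = observed / expected_pairs

print(f"pool size:        {pool:,}")
print(f"expected pairs:   {expected_pairs:,.1f}")
print(f"observed/expected: {ratio:.3f}")
```

The uniform-pool figure is a lower bound on the expectation: if some names or birth dates are drawn more often than others, the probability that two records share a tuple rises above 1/pool, which pushes the observed/expected ratio above 1.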

5. Format validity

Field	Valid format	95% CI
Email	2,000,000	100.0000% (95% CI 99.9998–100.0000%)
Phone	2,000,000	100.0000% (95% CI 99.9998–100.0000%)

6. Quality assurance summary

All cross-field invariant checks pass on this snapshot. 2,000,000 records, 0 violations.

The check battery covers: BMI consistency with height & weight; ZIP code consistency with state; pregnancy_status consistency with sex_at_birth; insulin use consistency with diabetes_status; prescription-count consistency with chronic-medication flags; age within the adult range; and ~30 other invariants.
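
A per-record check of three of these invariants might look like the sketch below. The field names `bmi`, `height_cm`, `weight_kg`, `age`, `pregnancy_status`, and `sex_at_birth` appear in this report; the category values (`"pregnant"`, `"female"`), the BMI tolerance, and the function name are hypothetical, and the 18 to 95 age range is inferred from the age bands in section 3.

```python
def check_record(rec: dict, bmi_tolerance: float = 0.1) -> list[str]:
    """Return the names of the invariants that this record violates."""
    violations = []

    # BMI must equal weight (kg) divided by height (m) squared, within rounding tolerance.
    height_m = rec["height_cm"] / 100
    expected_bmi = rec["weight_kg"] / (height_m ** 2)
    if abs(rec["bmi"] - expected_bmi) > bmi_tolerance:
        violations.append("bmi_vs_height_weight")

    # Hypothetical category values: only records with female sex_at_birth may be pregnant.
    if rec.get("pregnancy_status") == "pregnant" and rec.get("sex_at_birth") != "female":
        violations.append("pregnancy_vs_sex_at_birth")

    # Adult range inferred from the report's age bands (18-44, 45-64, 65-95).
    if not 18 <= rec["age"] <= 95:
        violations.append("age_out_of_adult_range")

    return violations


record = {"height_cm": 170.0, "weight_kg": 70.0, "bmi": 24.2, "age": 40,
          "sex_at_birth": "male", "pregnancy_status": "not_applicable"}
print(check_record(record))
```

Summing the length of the returned list over all records gives the absolute violation count that the report compares against zero.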

7. Limitations

Identity-tuple collisions are structurally expected at scale. At n = 2,000,000 records drawn from a finite name × DOB pool of size ~51,246,000, the expected number of colliding (given_name, family_name, date_of_birth) pairs under uniform sampling is approximately 39,027; the observed count of 46,166 is 1.183 times that estimate (verdict: AS EXPECTED). The uniform-pool figure is a lower bound when names or birth dates are drawn with unequal frequencies, so a ratio above 1 is the expected direction. This is a property of finite name pools, not a generator defect. Use the ID field for guaranteed uniqueness.
Mean BMI is nearly flat across age bands (29.49 to 29.50), so two of the three bands in section 3b receive a DIVERGES verdict against the NHANES reference means (Δ = +1.09 for ages 18-44, Δ = -0.90 for ages 45-64). The observed age × bmi correlation of 0.0001 against a reference of 0.10 reflects the same gap.
en-GB and en-IN locales use en-US health reference data as fallback, disclosed via a per-record metadata field. en-US is the only locale with native health-reference calibration in this engine version.
Cohort-level fidelity is out of scope for this report. Longitudinal events (encounters, prescriptions filled, lab observations over time) are produced by separate engines and reported separately.
This is engine v0.5, early in a versioned roadmap. Each release narrows specific gaps; this report is the public quality ledger for one snapshot generated by one engine version. The version stamp published on every record makes future reports directly comparable.

8. Privacy & re-identification

By construction, every value in this dataset is drawn from a synthetic distribution. No record corresponds to a real person; no field is derived from any real record. As a consequence:

The dataset contains no protected health information (PHI) in the HIPAA sense, because no value refers to a real individual. (A formal HIPAA de-identification attestation, such as Expert Determination by a qualified statistician, is a separate compliance artifact; this report makes the structural argument.)
Re-identification via attribute combination (the "linkage attack" threat against anonymized real data) does not apply. No record is derived from a real person, so a rare combination of attributes cannot single out an actual individual.
Identity-tuple collisions (multiple records sharing given_name + family_name + date_of_birth) are an expected statistical artifact of drawing from finite name pools, not a privacy concern. See the birthday-paradox derivation above.
Sensitive-format fields (ssn_last_four, employer_ein) use structurally invalid prefixes that cannot collide with real-world assignments, so a test ID never accidentally resembles a real one.

9. Reproducibility

Datasets of this size can be requested via the asynchronous endpoint:

POST /v1/datasets/person
{
  "count":   2000000,
  "seed":    <any uint64>,
  "locale":  "en-US",
  "preset":  "us_2024_adults"
}

The endpoint returns a job ID; poll for completion and retrieve a download link from the status endpoint. Each request with the same seed produces the same dataset at the engine version listed in the header.
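
The submit-then-poll flow described above could be wrapped as in the sketch below. Only the request fields (`count`, `seed`, `locale`, `preset`) come from this report. The status response fields `state` and `download_url`, the state values, and both function names are hypothetical placeholders, and the HTTP transport is left to the caller because the status endpoint's path is not given here.

```python
import time
from typing import Callable


def build_request(count: int, seed: int, locale: str = "en-US",
                  preset: str = "us_2024_adults") -> dict:
    """Build the JSON body for POST /v1/datasets/person."""
    if not 0 <= seed < 2**64:
        raise ValueError("seed must fit in an unsigned 64-bit integer")
    return {"count": count, "seed": seed, "locale": locale, "preset": preset}


def wait_for_download(job_id: str,
                      fetch_status: Callable[[str], dict],
                      interval_s: float = 5.0,
                      max_polls: int = 120,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll a caller-supplied status function until the job yields a download link."""
    for _ in range(max_polls):
        status = fetch_status(job_id)
        # Hypothetical response shape: {"state": ..., "download_url": ...}
        if status.get("state") == "completed":
            return status["download_url"]
        if status.get("state") == "failed":
            raise RuntimeError(f"job {job_id} failed")
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not complete after {max_polls} polls")


body = build_request(count=2_000_000, seed=42)
print(body)
```

Passing `fetch_status` and `sleep` as parameters keeps the helper independent of any HTTP client and makes it testable without network access. Recording the seed alongside the engine version is what makes a later regeneration comparable to this snapshot.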
