Synthetic Person Profile — Variance & Fidelity Report

An independent quality-assurance review of one snapshot of synthetic person-profile records. Tests uniqueness, fidelity to public reference distributions, and internal consistency across 2,000,000 records produced in 10 independent regenerations.

Engine

person-profile-advanced v0.5

This dataset was generated by version 0.5 of the person-profile-advanced engine. All quality metrics in this report are attributable to that specific version.

Show 8 reference benchmarks cited

Source	Vintage
`ACS`	`2022`
`NHANES`	`2017-2020`
`CDC-NDSS`	`2022`
`CDC-mortality`	`2022`
`KFF`	`2023`
`MEPS`	`2022`
`BLS`	`2023`
`USPS-L005`	`2024`

How to read this report

This is a data fidelity report: an external Quality Assurance review of one specific snapshot generated by the engine listed above. It tests three classes of claim:

Uniqueness. Are records distinct at the row, identifier, and identity-tuple levels? Reported with 95% Wilson confidence intervals.
Distribution fidelity. Do the dataset's marginal and joint distributions match published US-adult reference distributions (NHANES, ACS, CDC, KFF, BLS, US Census)? Tested with two-sample Kolmogorov–Smirnov (numeric) and chi-squared goodness-of-fit (categorical), with a three-tier verdict using Cohen's conventional effect-size bands: MATCH (Cramér's V < 0.10 for categorical; KS D < 0.05 for numeric — negligible effect); CLOSE (0.10 ≤ V < 0.30 or 0.05 ≤ D < 0.10 — small but practically meaningful); DIVERGES (V ≥ 0.30 or D ≥ 0.10 — medium-to-large, surfaced explicitly in the Limitations section). At sample sizes in the millions, p-value alone is uninformative — even trivial departures become "significant", so effect size is the primary verdict driver.
Internal consistency. Do invariant relationships hold across the dataset (e.g., BMI consistent with height & weight; ZIP consistent with state; pregnancy_status consistent with sex_at_birth)?

The reference distributions cited above are comparison targets, not training inputs. Verdicts are MATCH, DIVERGES, SIGN MISMATCH (for correlations only), or N/A when the test is not applicable. Cross-field invariants are tested against absolute counts (we report any non-zero violation).

Executive summary

Scenario

10 datasets × 200,000 records (pooled n = 2,000,000)

Records

2,000,000

in this snapshot

Unique rows

100.0000%

0 duplicate row(s)

Unique IDs

100.0000%

0 ID collision(s)

Email format valid

100.00%

Distribution matches (negligible effect)

17 / 18

1 close, 0 diverge

Invariant violations

across all checks

1. Dataset characteristics

10 independent datasets of approximately 200,000 records each. Each dataset is generated with a distinct base seed; downstream sections of this report use the pooled population (n = 2,000,000) for distribution tests, and the 10 individual datasets for the variance analysis in section 4.

Uniqueness with 95% Wilson confidence intervals

Definition	Unique count	Percentage	Collisions
Whole-row hash (every field byte-identical)	2,000,000	100.0000% (95% CI 99.9998–100.0000%)	0
ID field (record identifier)	2,000,000	100.0000% (95% CI 99.9998–100.0000%)	0
Identity tuple (given_name + family_name + date_of_birth)	1,908,187	95.4094% (95% CI 95.3803–95.4383%)	46,283

Identity-tuple collisions are structurally expected at scale — see the birthday-paradox section below for the analytical derivation. Whole-row and ID-field collisions are the diagnostic uniqueness signals.

2. Distribution fidelity

For each marginal distribution where a published US-adult reference exists, the observed proportions are compared against the reference using a formal statistical test. A MATCH indicates the null hypothesis (observed and reference are drawn from the same distribution) is not rejected at α = 0.05.

Categorical attributes (chi-squared goodness-of-fit)

Attribute	Reference	χ²	df	p-value	Cramér's V	Verdict (α=0.05)
`ckd_status`	CDC 2022	18816.524	3	<1e-10	0.097	MATCH
`diabetes_status`	CDC NDSS 2022	10353.591	4	<1e-10	0.0719	MATCH
`education`	ACS 2022	9639.285	5	<1e-10	0.0694	MATCH
`employment_status`	BLS 2023	5898.834	6	<1e-10	0.0543	MATCH
`ethnicity`	ACS 2022	95.369	1	<1e-10	0.0069	MATCH
`hypertension_status`	CDC 2022	555.759	2	<1e-10	0.0167	MATCH
`insurance_type`	KFF 2023	13123.677	6	<1e-10	0.081	MATCH
`marital_status`	ACS 2022	3659.249	4	<1e-10	0.0428	MATCH
`race`	ACS 2022	18795.175	6	<1e-10	0.0969	MATCH
`sex_at_birth`	ACS 2022	0.713	1	0.398	0.0006	MATCH
`smoking_status`	CDC 2022	6383.91	2	<1e-10	0.0565	MATCH
`state`	ACS 2022	35.012	50	0.947	0.0042	MATCH

Numeric attributes (two-sample Kolmogorov–Smirnov)

Attribute	Observed (mean ± sd)	Reference	Source	KS D	p-value	Verdict (α=0.05)
`a1c_value`	5.69 ± 1.00	5.70 ± 0.95	NHANES 2017-2020 (LBXGH adult mean)	0.0959	<1e-10	CLOSE
`age`	47.47 ± 18.49	—	ACS 2022 (US adults)	0.0176	<1e-10	MATCH
`bmi`	29.49 ± 6.71	29.50 ± 6.80	NHANES 2017-2020 (US adults)	0.0195	<1e-10	MATCH
`height_cm`	168.48 ± 9.90	168.50 ± 10.00	NHANES 2017-2020 (adult height)	0.0224	<1e-10	MATCH
`waist_circumference_cm`	97.56 ± 15.81	98.00 ± 16.00	NHANES 2017-2020 (BMXWAIST adult mean)	0.0106	<1e-10	MATCH
`weight_kg`	84.16 ± 21.91	84.00 ± 22.00	NHANES 2017-2020 (adult weight)	0.0132	<1e-10	MATCH

Observed vs reference — histograms

Bars show the observed empirical density; the red line is the parametric reference density at the same support. Lower KS statistics indicate closer match in distribution shape.

3. Joint structure

Marginal fidelity is necessary but not sufficient; an analyst will also ask whether the joint structure between attributes matches the real world. Two cross-tabs:

3a. Diabetes prevalence by age band

Age band	n	Observed (95% CI)	Reference	Source	Verdict
18-44	940,616	3.55% (CI 3.51–3.59%)	4.0%	CDC NDSS 2022	MATCH
45-64	659,912	14.32% (CI 14.23–14.4%)	17.0%	CDC NDSS 2022	CLOSE
65-95	399,472	26.1% (CI 25.96–26.23%)	29.0%	CDC NDSS 2022	CLOSE

3b. Mean BMI by age band

Age band	n	Observed mean	Reference mean	Δ	Source	Verdict
18-44	940,616	29.48	28.4	+1.08	NHANES 2017-2020	DIVERGES
45-64	659,912	29.49	30.4	-0.91	NHANES 2017-2020	DIVERGES
65-95	399,472	29.49	29.6	-0.11	NHANES 2017-2020	MATCH

3c. Inter-attribute Pearson correlations

Practical-equivalence band ±0.15. Sign mismatches (observed and reference of opposite sign with reference |r| ≥ 0.1) are reported separately as a stronger flag.

Pair	Observed r	Reference r	Δ	Source	Verdict
`bmi × waist_circumference_cm`	0.7599	0.85	-0.090	NHANES	MATCH
`bmi × weight_kg`	0.8885	0.78	+0.108	NHANES	MATCH
`height_cm × weight_kg`	0.5156	0.45	+0.066	NHANES	MATCH
`age × a1c_value`	0.2854	0.14	+0.145	NHANES	MATCH
`age × bmi`	0.001	0.1	-0.099	NHANES	MATCH

4. Variance across regenerations

A key quality of a synthetic data product is that regenerating the dataset yields a meaningfully different sample, not a near-copy. Tested by computing the Jaccard overlap of identity tuples across all C(10,2) = 45 dataset pairs.

Max pairwise overlap

0.2485%

across all pairs

Mean pairwise overlap

0.2350%

across all pairs

Pairs compared

from 10 datasets

Per-attribute distribution drift across regenerations

Total-variation distance between each pair of dataset's empirical distributions. Lower values mean more stable marginals across regenerations.

5. Uniqueness & birthday-paradox analysis

The identity-tuple uniqueness percentage decreases at scale because the (given_name × family_name × date_of_birth) value space is finite. The expected collision count under uniform sampling from that space is computed and compared against the observed count.

Given-name distinct values in this dataset	60
Family-name distinct values in this dataset	30
Distinct dates of birth in this dataset	28,469
Identity-tuple pool size (product)	51,244,200
Records drawn (n)	2,000,000
Expected collisions (linearity of expectation, n(n-1)/2 / pool)	39028.8
Observed collisions	46,283
Observed / Expected ratio	1.186
Verdict (test: normal_2sigma)	AS EXPECTED

6. Format validity

Field	Valid format	95% CI
Email	2,000,000	100.0000% (95% CI 99.9998–100.0000%)
Phone (locale-appropriate)	2,000,000	100.0000% (95% CI 99.9998–100.0000%)

7. Quality assurance summary

All cross-field invariant checks pass on this snapshot. 2,000,000 records, 0 violations.

The check battery covers: BMI consistency with height & weight; ZIP code consistency with state; pregnancy_status consistency with sex_at_birth; insulin use consistency with diabetes_status; prescription-count consistency with chronic-medication flags; age within the adult range; and ~30 other invariants.

8. Limitations

Identity-tuple collisions are structurally expected at scale. At n = 2,000,000 records drawn from a finite name × DOB pool of size ~51,244,200, the expected number of (given_name, family_name, date_of_birth) collisions is approximately 39028.8; the observed count of 46,283 matches the birthday-paradox prediction (AS EXPECTED). This is a property of finite name pools, not a generator defect. Use the ID field for guaranteed uniqueness.
en-GB and en-IN locales use en-US health reference data as fallback, disclosed via a per-record metadata field. en-US is the only locale with native health-reference calibration in this engine version.
Cohort-level fidelity is out of scope for this report. Longitudinal events (encounters, prescriptions filled, lab observations over time) are produced by separate engines and reported separately.
This is engine v0.4 — early in a versioned roadmap. Each release narrows specific gaps; this report is the public quality ledger for one snapshot generated by one engine version. The published version stamp on every record makes future reports trivially comparable.

9. Privacy & re-identification

By construction, every value in this dataset is drawn from a synthetic distribution. No record corresponds to a real person; no field is derived from any real record. As a consequence:

The dataset is HIPAA-equivalent on its face: there are no protected health identifiers because there are no real-world identifiers at all. (Formal HIPAA Safe Harbor attestation by a statistician is a separate compliance artifact; this report makes the structural argument.)
Re-identification via attribute combination (the "linkage attack" threat against anonymized real data) does not apply. There is no underlying real distribution whose tails could uniquely identify an individual.
Identity-tuple collisions (multiple records sharing first_name + family_name + date_of_birth) are an expected statistical artifact of drawing from finite name pools, not a privacy concern. See the birthday-paradox derivation above.
Sensitive-format fields (ssn_last_four, employer_ein) use structurally invalid prefixes that cannot collide with real-world assignments, so a test ID never accidentally resembles a real one.

10. Reproducibility

This dataset can be regenerated via the public API:

POST /v1/mock/person
{
  "count":   200000,
  "seed":    <any uint64>,
  "locale":  "en-US"
}

Each request is byte-identical for the same seed at the engine version listed in the header card.

Need this data for your use case?

Synthesize US-adult populations with the same fidelity. Configurable by preset, deterministic by seed, exportable to S3 at multi-million-row scale.

View pricing & access →