Synthetic Person Profile — Variance & Fidelity Report

An independent quality-assurance review of one snapshot of synthetic person-profile records. Tests uniqueness, fidelity to public reference distributions, and internal consistency across 2,000,000 records produced in 10 independent regenerations.

Engine
person-profile-advanced  v0.5

This dataset was generated by version 0.5 of the person-profile-advanced engine. All quality metrics in this report are attributable to that specific version.

Reference benchmarks cited (8)

Source          Vintage
ACS             2022
NHANES          2017-2020
CDC-NDSS        2022
CDC-mortality   2022
KFF             2023
MEPS            2022
BLS             2023
USPS-L005       2024

How to read this report

This is a data fidelity report: an external Quality Assurance review of one specific snapshot generated by the engine listed above. It tests three classes of claim:

  1. Uniqueness. Are records distinct at the row, identifier, and identity-tuple levels? Reported with 95% Wilson confidence intervals.
  2. Distribution fidelity. Do the dataset's marginal and joint distributions match published US-adult reference distributions (NHANES, ACS, CDC, KFF, BLS, US Census)? Numeric attributes are tested with two-sample Kolmogorov–Smirnov and categorical attributes with chi-squared goodness-of-fit. The verdict has three tiers, using Cohen's conventional effect-size bands:
     - MATCH: Cramér's V < 0.10 (categorical) or KS D < 0.05 (numeric); negligible effect.
     - CLOSE: 0.10 ≤ V < 0.30 or 0.05 ≤ D < 0.10; small but practically meaningful.
     - DIVERGES: V ≥ 0.30 or D ≥ 0.10; medium-to-large, surfaced explicitly in the Limitations section.
     At sample sizes in the millions the p-value alone is uninformative, because even trivial departures become "significant"; effect size is therefore the primary verdict driver.
  3. Internal consistency. Do invariant relationships hold across the dataset (e.g., BMI consistent with height & weight; ZIP consistent with state; pregnancy_status consistent with sex_at_birth)?

The reference distributions cited above are comparison targets, not training inputs. Verdicts are MATCH, CLOSE, DIVERGES, SIGN MISMATCH (for correlations only), or N/A when the test is not applicable. Cross-field invariants are tested against absolute counts (we report any non-zero violation).
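The three-tier verdict rule described above can be sketched as a pair of small functions. This is an illustrative sketch, not the report's own code: the function names are hypothetical, and only the thresholds come from the text.

```python
def categorical_verdict(cramers_v: float) -> str:
    """Map Cramér's V to the report's three-tier verdict (thresholds from the text)."""
    if cramers_v < 0.10:
        return "MATCH"
    if cramers_v < 0.30:
        return "CLOSE"
    return "DIVERGES"


def numeric_verdict(ks_d: float) -> str:
    """Map the KS D statistic to the same three tiers (thresholds from the text)."""
    if ks_d < 0.05:
        return "MATCH"
    if ks_d < 0.10:
        return "CLOSE"
    return "DIVERGES"


# Values taken from the tables in section 2.
print(categorical_verdict(0.097))   # ckd_status
print(numeric_verdict(0.0959))      # a1c_value
```

Note that the p-value does not enter either function, in line with the statement that effect size drives the verdict.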

Executive summary

Scenario: 10 datasets × 200,000 records (pooled n = 2,000,000)
Records: 2,000,000 in this snapshot
Unique rows: 100.0000% (0 duplicate rows)
Unique IDs: 100.0000% (0 ID collisions)
Email format valid: 100.00%
Distribution matches (negligible effect): 17 / 18 (1 close, 0 diverge)
Invariant violations: 0 across all checks

1. Dataset characteristics

10 independent datasets of approximately 200,000 records each. Each dataset is generated with a distinct base seed; downstream sections of this report use the pooled population (n = 2,000,000) for distribution tests, and the 10 individual datasets for the variance analysis in section 4.

Uniqueness with 95% Wilson confidence intervals

Definition                                                  Unique count   Percentage                                 Collisions
Whole-row hash (every field byte-identical)                 2,000,000      100.0000% (95% CI 99.9998–100.0000%)       0
ID field (record identifier)                                2,000,000      100.0000% (95% CI 99.9998–100.0000%)       0
Identity tuple (given_name + family_name + date_of_birth)   1,908,187      95.4094% (95% CI 95.3803–95.4383%)         46,283

Identity-tuple collisions are structurally expected at scale; section 5 gives the analytical derivation. Whole-row and ID-field collisions are the diagnostic uniqueness signals.
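For readers who want to check the intervals in the table, the Wilson score interval can be computed as follows. This is a minimal sketch using the standard formula; the function name is hypothetical, and z = 1.959964 is assumed for the 95% level.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# Counts taken from the uniqueness table above.
for label, unique in [("whole-row hash", 2_000_000), ("identity tuple", 1_908_187)]:
    lo, hi = wilson_interval(unique, 2_000_000)
    print(f"{label}: {lo * 100:.4f}% to {hi * 100:.4f}%")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays informative when the observed proportion is exactly 100%, which is why the all-unique rows still show a lower bound below 100%.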

2. Distribution fidelity

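For each marginal distribution where a published US-adult reference exists, the observed proportions are compared against the reference using a formal statistical test. At n = 2,000,000 almost every test rejects the null hypothesis (observed and reference are drawn from the same distribution) at α = 0.05, so the verdict column follows the effect-size bands defined in "How to read this report" (Cramér's V or KS D); p-values are reported for completeness.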

Categorical attributes (chi-squared goodness-of-fit)

Attribute             Reference       χ²          df   p-value   Cramér's V   Verdict (effect-size band)
ckd_status            CDC 2022        18816.524   3    <1e-10    0.097        MATCH
diabetes_status       CDC NDSS 2022   10353.591   4    <1e-10    0.0719       MATCH
education             ACS 2022        9639.285    5    <1e-10    0.0694       MATCH
employment_status     BLS 2023        5898.834    6    <1e-10    0.0543       MATCH
ethnicity             ACS 2022        95.369      1    <1e-10    0.0069       MATCH
hypertension_status   CDC 2022        555.759     2    <1e-10    0.0167       MATCH
insurance_type        KFF 2023        13123.677   6    <1e-10    0.081        MATCH
marital_status        ACS 2022        3659.249    4    <1e-10    0.0428       MATCH
race                  ACS 2022        18795.175   6    <1e-10    0.0969       MATCH
sex_at_birth          ACS 2022        0.713       1    0.398     0.0006       MATCH
smoking_status        CDC 2022        6383.91     2    <1e-10    0.0565       MATCH
state                 ACS 2022        35.012      50   0.947     0.0042       MATCH
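The chi-squared goodness-of-fit statistic and its effect size can be sketched as below. The tabulated Cramér's V values are numerically consistent with sqrt(χ²/n) at n = 2,000,000 (for ckd_status, sqrt(18816.524 / 2,000,000) ≈ 0.097). That normalization is inferred from the numbers rather than stated in the report, so treat it as an assumption. The function name and the toy counts are hypothetical.

```python
import math


def chi2_goodness_of_fit(observed_counts, reference_props):
    """Return (chi2, df, effect) for observed counts against reference proportions.

    effect = sqrt(chi2 / n); assumed here to correspond to the report's "Cramér's V".
    """
    if len(observed_counts) != len(reference_props):
        raise ValueError("counts and proportions must have the same length")
    if abs(sum(reference_props) - 1.0) > 1e-9:
        raise ValueError("reference proportions must sum to 1")
    n = sum(observed_counts)
    chi2 = 0.0
    for obs, prop in zip(observed_counts, reference_props):
        expected = n * prop
        chi2 += (obs - expected) ** 2 / expected
    df = len(observed_counts) - 1
    return chi2, df, math.sqrt(chi2 / n)


# Toy three-category example (made-up counts, not report data).
chi2, df, effect = chi2_goodness_of_fit([480, 320, 200], [0.5, 0.3, 0.2])
print(f"chi2={chi2:.3f} df={df} effect={effect:.4f}")
```

Because χ² grows linearly with n for a fixed discrepancy, dividing by n before taking the square root yields a measure that does not inflate with sample size, which is what makes it usable as a verdict driver at this scale.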

Numeric attributes (two-sample Kolmogorov–Smirnov)

Attribute                Observed (mean ± sd)   Reference (mean ± sd)   Source                                   KS D     p-value   Verdict (effect-size band)
a1c_value                5.69 ± 1.00            5.70 ± 0.95             NHANES 2017-2020 (LBXGH adult mean)      0.0959   <1e-10    CLOSE
age                      47.47 ± 18.49          not listed              ACS 2022 (US adults)                     0.0176   <1e-10    MATCH
bmi                      29.49 ± 6.71           29.50 ± 6.80            NHANES 2017-2020 (US adults)             0.0195   <1e-10    MATCH
height_cm                168.48 ± 9.90          168.50 ± 10.00          NHANES 2017-2020 (adult height)          0.0224   <1e-10    MATCH
waist_circumference_cm   97.56 ± 15.81          98.00 ± 16.00           NHANES 2017-2020 (BMXWAIST adult mean)   0.0106   <1e-10    MATCH
weight_kg                84.16 ± 21.91          84.00 ± 22.00           NHANES 2017-2020 (adult weight)          0.0132   <1e-10    MATCH
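The KS D statistic in the table is the largest vertical gap between two empirical cumulative distribution functions. A minimal standard-library sketch follows; the function name is hypothetical, and how the report draws its reference sample is not stated, so the inputs here are simply two lists of numbers.

```python
import bisect


def ks_statistic(sample_a, sample_b) -> float:
    """Two-sample Kolmogorov–Smirnov D: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The ECDFs only change at observed values, so checking those points suffices.
    for x in set(a) | set(b):
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


# Toy example: the second sample is the first shifted up by 2.
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

D is bounded between 0 (identical ECDFs) and 1 (no overlap), so the 0.05 and 0.10 thresholds read directly as a maximum difference in cumulative probability.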

Observed vs reference — histograms

Bars show the observed empirical density; the red line is the parametric reference density at the same support. Lower KS statistics indicate closer match in distribution shape.

[Figure: six histogram panels: Age distribution, BMI distribution, HbA1c distribution, Waist circumference, Height, Weight]

3. Joint structure

Marginal fidelity is necessary but not sufficient; an analyst will also ask whether the joint structure between attributes matches the real world. Two cross-tabs:

3a. Diabetes prevalence by age band

Age band   n         Observed (95% CI)            Reference   Source          Verdict
18-44      940,616   3.55% (CI 3.51–3.59%)        4.0%        CDC NDSS 2022   MATCH
45-64      659,912   14.32% (CI 14.23–14.40%)     17.0%       CDC NDSS 2022   CLOSE
65-95      399,472   26.10% (CI 25.96–26.23%)     29.0%       CDC NDSS 2022   CLOSE

[Figure: Diabetes prevalence by age]

3b. Mean BMI by age band

Age band   n         Observed mean   Reference mean   Δ       Source             Verdict
18-44      940,616   29.48           28.4             +1.08   NHANES 2017-2020   DIVERGES
45-64      659,912   29.49           30.4             -0.91   NHANES 2017-2020   DIVERGES
65-95      399,472   29.49           29.6             -0.11   NHANES 2017-2020   MATCH

[Figure: BMI by age]

3c. Inter-attribute Pearson correlations

Practical-equivalence band ±0.15. Sign mismatches (observed and reference of opposite sign with reference |r| ≥ 0.1) are reported separately as a stronger flag.

Pair                           Observed r   Reference r   Δ        Source   Verdict
bmi × waist_circumference_cm   0.7599       0.85          -0.090   NHANES   MATCH
bmi × weight_kg                0.8885       0.78          +0.108   NHANES   MATCH
height_cm × weight_kg          0.5156       0.45          +0.066   NHANES   MATCH
age × a1c_value                0.2854       0.14          +0.145   NHANES   MATCH
age × bmi                      0.001        0.1           -0.099   NHANES   MATCH

[Figure: Correlation matrix]
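The correlation verdict rule can be sketched as follows. The ±0.15 band and the 0.1 sign-mismatch floor come from the text. The function names are hypothetical, and three details are assumptions: that the band is inclusive, that the sign-mismatch check takes precedence, and that pairs outside the band are labelled DIVERGES.

```python
import math


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_verdict(observed_r: float, reference_r: float,
                        band: float = 0.15, min_ref: float = 0.1) -> str:
    """Classify an observed correlation against its reference value."""
    opposite_sign = observed_r * reference_r < 0
    if opposite_sign and abs(reference_r) >= min_ref:
        return "SIGN MISMATCH"        # assumed to outrank the band check
    if abs(observed_r - reference_r) <= band:
        return "MATCH"
    return "DIVERGES"                 # assumed label for out-of-band pairs


# Rows from the table above.
print(correlation_verdict(0.2854, 0.14))   # age × a1c_value
print(correlation_verdict(0.001, 0.1))     # age × bmi
```

The age × bmi row shows why the sign rule exists as a separate flag: an observed r near zero sits inside the band, and it is only labelled a match because it has not crossed to the opposite sign.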

4. Variance across regenerations

A key quality of a synthetic data product is that regenerating the dataset yields a meaningfully different sample, not a near-copy. Tested by computing the Jaccard overlap of identity tuples across all C(10,2) = 45 dataset pairs.

Max pairwise overlap: 0.2485% (across all pairs)
Mean pairwise overlap: 0.2350% (across all pairs)
Pairs compared: 45 (from 10 datasets)

[Figure: Cross-dataset overlap matrix]
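The pairwise overlap computation can be sketched as below, assuming each dataset is available as a list of (given_name, family_name, date_of_birth) tuples. The function names and the toy records are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(datasets):
    """Return (max, mean, number of pairs) of Jaccard overlap over all dataset pairs."""
    if len(datasets) < 2:
        raise ValueError("need at least two datasets")
    tuple_sets = [set(d) for d in datasets]
    scores = [jaccard(x, y) for x, y in combinations(tuple_sets, 2)]
    return max(scores), sum(scores) / len(scores), len(scores)


# Toy identity tuples (made-up values, not report data).
d1 = [("Ann", "Lee", "1990-01-01"), ("Bob", "Ray", "1985-05-05")]
d2 = [("Ann", "Lee", "1990-01-01"), ("Cy", "Poe", "1970-07-07")]
d3 = [("Di", "Fox", "2000-02-02")]
print(pairwise_overlap([d1, d2, d3]))
```

With 10 datasets, `combinations` yields the 45 pairs reported above.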

Per-attribute distribution drift across regenerations

Total-variation distance between each pair of datasets' empirical distributions. Lower values mean more stable marginals across regenerations.

[Figure: Per-attribute drift]
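For a categorical attribute, the total-variation distance between two datasets' empirical distributions can be sketched as follows. The function name is hypothetical. Numeric attributes would first need binning, and the report does not describe how it bins them.

```python
from collections import Counter


def total_variation(sample_a, sample_b) -> float:
    """Total-variation distance: 0.5 * sum over categories of |p_a(c) - p_b(c)|."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    count_a, count_b = Counter(sample_a), Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    categories = set(count_a) | set(count_b)
    # Counter returns 0 for categories missing from one of the samples.
    return 0.5 * sum(abs(count_a[c] / n_a - count_b[c] / n_b) for c in categories)


# Toy example: category shares of 50/50 versus 25/75.
print(total_variation(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

The distance is 0 for identical distributions and 1 for distributions with no category in common.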

5. Uniqueness & birthday-paradox analysis

The identity-tuple uniqueness percentage decreases at scale because the (given_name × family_name × date_of_birth) value space is finite. The expected collision count under uniform sampling from that space is computed and compared against the observed count.

Given-name distinct values in this dataset                          60
Family-name distinct values in this dataset                         30
Distinct dates of birth in this dataset                             28,469
Identity-tuple pool size (product)                                  51,244,200
Records drawn (n)                                                   2,000,000
Expected collisions (linearity of expectation, n(n-1)/2 / pool)     39,028.8
Observed collisions                                                 46,283
Observed / Expected ratio                                           1.186
Verdict (test: normal_2sigma)                                       AS EXPECTED
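The expected-collision figure can be reproduced from the numbers in the table. The sketch below covers only the pool size, the expectation under uniform sampling, and the observed/expected ratio. It does not attempt the normal_2sigma test, because the report does not say how that test is parameterized. The function name is hypothetical.

```python
def expected_collisions(n: int, pool: int) -> float:
    """Expected number of colliding pairs when n items are drawn uniformly from `pool` values.

    Each of the n(n-1)/2 pairs collides with probability 1/pool, so by
    linearity of expectation the total is n(n-1)/2 / pool.
    """
    return n * (n - 1) / 2 / pool


given_names, family_names, birth_dates = 60, 30, 28_469   # from the table above
pool = given_names * family_names * birth_dates
expected = expected_collisions(2_000_000, pool)
ratio = 46_283 / expected                                  # observed / expected

print(f"pool={pool:,} expected={expected:,.1f} ratio={ratio:.3f}")
```

Uniform sampling minimizes the collision probability for a given pool size, so any skew in name or birth-date frequencies pushes the observed count above this expectation. A ratio above 1 is the direction one would expect if the engine samples names non-uniformly.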

6. Format validity

Field                        Valid format   Percentage
Email                        2,000,000      100.0000% (95% CI 99.9998–100.0000%)
Phone (locale-appropriate)   2,000,000      100.0000% (95% CI 99.9998–100.0000%)

7. Quality assurance summary

All cross-field invariant checks pass on this snapshot. 2,000,000 records, 0 violations.

The check battery covers: BMI consistency with height & weight; ZIP code consistency with state; pregnancy_status consistency with sex_at_birth; insulin use consistency with diabetes_status; prescription-count consistency with chronic-medication flags; age within the adult range; and ~30 other invariants.
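Two of the invariants named above can be sketched as record-level predicates plus a violation counter. The field names bmi, height_cm, weight_kg, pregnancy_status and sex_at_birth come from this report. Everything else is an assumption made for illustration: the 0.1 BMI tolerance, the boolean encoding of pregnancy_status, the "male" value of sex_at_birth, and the function names.

```python
def bmi_consistent(record: dict, tolerance: float = 0.1) -> bool:
    """BMI must equal weight_kg / height_m^2 within a rounding tolerance (assumed 0.1)."""
    height_m = record["height_cm"] / 100
    return abs(record["bmi"] - record["weight_kg"] / height_m ** 2) <= tolerance


def pregnancy_consistent(record: dict) -> bool:
    """A record with sex_at_birth == "male" (assumed encoding) must not be flagged pregnant."""
    return not (record["pregnancy_status"] and record["sex_at_birth"] == "male")


CHECKS = {"bmi": bmi_consistent, "pregnancy": pregnancy_consistent}


def count_violations(records, checks=CHECKS) -> dict:
    """Return {check name: number of records that fail it}."""
    return {name: sum(1 for r in records if not check(r)) for name, check in checks.items()}


# Toy records: the first passes both checks, the second fails both.
records = [
    {"height_cm": 170.0, "weight_kg": 72.25, "bmi": 25.0,
     "sex_at_birth": "female", "pregnancy_status": True},
    {"height_cm": 170.0, "weight_kg": 72.25, "bmi": 30.0,
     "sex_at_birth": "male", "pregnancy_status": True},
]
print(count_violations(records))
```

Reporting absolute counts per check, rather than a pass rate, matches the report's policy of surfacing any non-zero violation.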

8. Limitations

9. Privacy & re-identification

By construction, every value in this dataset is drawn from a synthetic distribution. No record corresponds to a real person; no field is derived from any real record. As a consequence:

10. Reproducibility

This dataset can be regenerated via the public API:

POST /v1/mock/person
{
  "count":   200000,
  "seed":    <any uint64>,
  "locale":  "en-US"
}

Repeating a request with the same seed returns a byte-identical response at the engine version listed under Engine above.
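One way to verify the byte-identical claim is to hash two responses obtained with the same seed and compare the digests. The sketch below builds the request body shown above and compares payloads. The HTTP call itself is left out because the base URL and authentication scheme are not given in this report, and the function names are hypothetical.

```python
import hashlib
import json


def build_request_body(seed: int, count: int = 200_000, locale: str = "en-US") -> bytes:
    """JSON body for POST /v1/mock/person, using the fields shown in the request above."""
    if not 0 <= seed < 2 ** 64:
        raise ValueError("seed must fit in an unsigned 64-bit integer")
    return json.dumps({"count": count, "seed": seed, "locale": locale}).encode("utf-8")


def digest(payload: bytes) -> str:
    """SHA-256 hex digest of a raw response body."""
    return hashlib.sha256(payload).hexdigest()


def is_reproducible(first_response: bytes, second_response: bytes) -> bool:
    """True when two response bodies are byte-identical (compared via digest)."""
    return digest(first_response) == digest(second_response)


body = build_request_body(seed=42)
print(body.decode("utf-8"))
# Send `body` twice with an HTTP client of your choice, then:
#   is_reproducible(response_1_bytes, response_2_bytes)
```

Comparing digests rather than parsed JSON is deliberate: the claim is about bytes, so key order and whitespace differences should count as failures.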
