Synthetic Person Profile — Large-Scale Fidelity Report

This is an independent quality-assurance review of a single large snapshot of synthetic person-profile records, generated end-to-end as one job. It tests uniqueness, fidelity to public reference distributions, and internal consistency at 2,000,000-record scale.

Engine: person-profile-advanced v0.5

This dataset was generated by version 0.5 of the person-profile-advanced engine. All quality metrics in this report are attributable to that specific version.

Reference benchmarks cited (8)

| Source | Vintage |
| --- | --- |
| ACS | 2022 |
| NHANES | 2017-2020 |
| CDC-NDSS | 2022 |
| CDC-mortality | 2022 |
| KFF | 2023 |
| MEPS | 2022 |
| BLS | 2023 |
| USPS-L005 | 2024 |

How to read this report

This is a data fidelity report: an external Quality Assurance review of one specific snapshot generated by the engine listed above. It tests three classes of claim:

  1. Uniqueness. Are records distinct at the row, identifier, and identity-tuple levels? Reported with 95% Wilson confidence intervals.
  2. Distribution fidelity. Do the dataset's marginal and joint distributions match published US-adult reference distributions (NHANES, ACS, CDC, KFF, BLS, US Census)? Numeric attributes are tested with a two-sample Kolmogorov–Smirnov test and categorical attributes with a chi-squared goodness-of-fit test. The verdict has three tiers based on Cohen's conventional effect-size bands: MATCH (Cramér's V < 0.10 for categorical, KS D < 0.05 for numeric; negligible effect); CLOSE (0.10 ≤ V < 0.30 or 0.05 ≤ D < 0.10; small but practically meaningful); DIVERGES (V ≥ 0.30 or D ≥ 0.10; medium to large, surfaced explicitly in the Limitations section). At sample sizes in the millions a p-value alone is uninformative, because even trivial departures become "significant", so effect size is the primary verdict driver.
  3. Internal consistency. Do invariant relationships hold across the dataset (e.g., BMI consistent with height & weight; ZIP consistent with state; pregnancy_status consistent with sex_at_birth)?

The reference distributions cited above are comparison targets, not training inputs. Verdicts are MATCH, CLOSE, DIVERGES, SIGN MISMATCH (for correlations only), or N/A when the test is not applicable. Cross-field invariants are tested against absolute counts: any non-zero violation count is reported.
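
The three-tier bands described above can be sketched as a pair of small functions. This is an illustration of the stated thresholds only; the function names are hypothetical and not part of the engine or the review tooling.

```python
def categorical_verdict(cramers_v: float) -> str:
    """Map Cramér's V to the report's three-tier verdict."""
    if cramers_v < 0.10:
        return "MATCH"      # negligible effect
    if cramers_v < 0.30:
        return "CLOSE"      # small but practically meaningful
    return "DIVERGES"       # medium to large


def numeric_verdict(ks_d: float) -> str:
    """Map the Kolmogorov-Smirnov D statistic to the same three tiers."""
    if ks_d < 0.05:
        return "MATCH"
    if ks_d < 0.10:
        return "CLOSE"
    return "DIVERGES"


print(categorical_verdict(0.0982))  # race row of the categorical table
print(numeric_verdict(0.0955))      # a1c_value row of the numeric table
```

Note that the p-value never enters either function, which reflects the report's choice of effect size as the verdict driver.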

Executive summary

Scenario: single dataset, n = 2,000,000
Records: 2,000,000 in this snapshot
Unique rows: 100.0000% (0 duplicate rows)
Unique IDs: 100.0000% (0 ID collisions)
Email format valid: 100.00%
Distribution matches (negligible effect): 17 / 18 (1 close, 0 diverge)
Invariant violations: 0 across all checks

1. Dataset characteristics

Uniqueness with 95% Wilson confidence intervals

| Definition | Unique count | Percentage (95% CI) | Collisions |
| --- | --- | --- | --- |
| Whole-row hash (every field byte-identical) | 2,000,000 | 100.0000% (99.9998–100.0000%) | 0 |
| ID field (record identifier) | 2,000,000 | 100.0000% (99.9998–100.0000%) | 0 |
| Identity tuple (given_name + family_name + date_of_birth) | 1,908,416 | 95.4208% (95.3917–95.4497%) | 46,166 |

Identity-tuple collisions are structurally expected at scale — see the birthday-paradox section below for the analytical derivation. Whole-row and ID-field collisions are the diagnostic uniqueness signals.
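
The Wilson intervals quoted in the uniqueness table can be reproduced with the standard Wilson score formula. The sketch below is an independent re-computation, not the review's own code; the function name is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion (z = 1.96 gives 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at p = 0 or p = 1.
    return max(0.0, centre - half), min(1.0, centre + half)


# Whole-row and ID uniqueness: 2,000,000 unique out of 2,000,000.
print(wilson_interval(2_000_000, 2_000_000))
# Identity-tuple uniqueness: 1,908,416 unique out of 2,000,000.
print(wilson_interval(1_908_416, 2_000_000))
```

Unlike the normal-approximation interval, the Wilson interval stays informative at 100%: the lower bound is n / (n + z²), which is why a perfect score is still reported as 99.9998–100.0000%.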

2. Distribution fidelity

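For each marginal distribution where a published US-adult reference exists, the observed distribution is compared against the reference using a formal statistical test. The verdict follows the effect-size bands defined in "How to read this report" (Cramér's V for categorical attributes, KS D for numeric ones). p-values are reported for completeness: at n = 2,000,000 nearly every test rejects the null hypothesis at α = 0.05, so a MATCH here means a negligible effect size, not a non-rejected null.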

Categorical attributes (chi-squared goodness-of-fit)

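| Attribute | Reference | χ² | df | p-value | Cramér's V | Verdict (effect size) |
| --- | --- | --- | --- | --- | --- | --- |
| ckd_status | CDC 2022 | 19194.968 | 3 | <1e-10 | 0.098 | MATCH |
| diabetes_status | CDC NDSS 2022 | 10122.693 | 4 | <1e-10 | 0.0711 | MATCH |
| education | ACS 2022 | 9307.666 | 5 | <1e-10 | 0.0682 | MATCH |
| employment_status | BLS 2023 | 6135.072 | 6 | <1e-10 | 0.0554 | MATCH |
| ethnicity | ACS 2022 | 124.31 | 1 | <1e-10 | 0.0079 | MATCH |
| hypertension_status | CDC 2022 | 499.426 | 2 | <1e-10 | 0.0158 | MATCH |
| insurance_type | KFF 2023 | 12906.311 | 6 | <1e-10 | 0.0803 | MATCH |
| marital_status | ACS 2022 | 3652.099 | 4 | <1e-10 | 0.0427 | MATCH |
| race | ACS 2022 | 19275.245 | 6 | <1e-10 | 0.0982 | MATCH |
| sex_at_birth | ACS 2022 | 0.049 | 1 | 0.825 | 0.0002 | MATCH |
| smoking_status | CDC 2022 | 6663.676 | 2 | <1e-10 | 0.0577 | MATCH |
| state | ACS 2022 | 69.128 | 50 | 0.038 | 0.0059 | MATCH |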
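
As a rough illustration of how the categorical statistics relate, the sketch below computes a chi-squared goodness-of-fit statistic and an effect size from it. The effect-size formula sqrt(χ² / n) is an assumption inferred from the reported figures (it reproduces the tabulated V values at n = 2,000,000); the report does not state which normalization it uses, and the function names are hypothetical.

```python
import math


def chi_squared_gof(observed_counts, reference_props):
    """Chi-squared goodness-of-fit statistic and degrees of freedom.

    observed_counts: category counts in the dataset.
    reference_props: reference proportions in the same order, summing to 1.
    """
    n = sum(observed_counts)
    chi2 = sum(
        (obs - n * prop) ** 2 / (n * prop)
        for obs, prop in zip(observed_counts, reference_props)
    )
    return chi2, len(observed_counts) - 1


def effect_size(chi2: float, n: int) -> float:
    """sqrt(chi2 / n); assumed here to be the quantity reported as Cramér's V."""
    return math.sqrt(chi2 / n)


# Toy example: 60/40 split against a 50/50 reference.
print(chi_squared_gof([60, 40], [0.5, 0.5]))
# Reported chi-squared for ckd_status, at the report's n.
print(round(effect_size(19194.968, 2_000_000), 3))
```

Because χ² grows linearly with n for a fixed discrepancy, dividing by n removes the sample-size dependence, which is what makes the effect size usable as a verdict at this scale.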

Numeric attributes (two-sample Kolmogorov–Smirnov)

| Attribute | Observed (mean ± sd) | Reference (mean ± sd) | Source | KS D | p-value | Verdict (effect size) |
| --- | --- | --- | --- | --- | --- | --- |
| a1c_value | 5.69 ± 1.00 | 5.70 ± 0.95 | NHANES 2017-2020 (LBXGH adult mean) | 0.0955 | <1e-10 | CLOSE |
| age | 47.49 ± 18.49 |  | ACS 2022 (US adults) | 0.0178 | <1e-10 | MATCH |
| bmi | 29.49 ± 6.72 | 29.50 ± 6.80 | NHANES 2017-2020 (US adults) | 0.0193 | <1e-10 | MATCH |
| height_cm | 168.48 ± 9.90 | 168.50 ± 10.00 | NHANES 2017-2020 (adult height) | 0.0226 | <1e-10 | MATCH |
| waist_circumference_cm | 97.56 ± 15.82 | 98.00 ± 16.00 | NHANES 2017-2020 (BMXWAIST adult mean) | 0.0103 | <1e-10 | MATCH |
| weight_kg | 84.18 ± 21.93 | 84.00 ± 22.00 | NHANES 2017-2020 (adult weight) | 0.0137 | <1e-10 | MATCH |
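
The KS D statistic is the largest vertical gap between two empirical cumulative distribution functions. A minimal pure-Python sketch of the two-sample statistic follows; it is illustrative only (a production run at this scale would more likely use a vectorized library), and how the review draws its reference sample is not stated in the report.

```python
def ks_two_sample(sample_a, sample_b) -> float:
    """Two-sample Kolmogorov-Smirnov D: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    # Once one sample is exhausted its ECDF is 1 and the gap can only shrink.
    return d


print(ks_two_sample([1, 2, 3, 4], [3, 4, 5, 6]))
```

D is bounded between 0 (identical ECDFs) and 1 (no overlap), so the 0.05 and 0.10 thresholds can be read directly as maximum differences in cumulative probability.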

Observed vs reference — histograms

Bars show the observed empirical density; the red line is the parametric reference density at the same support.

[Figure panels: Age distribution, BMI distribution, HbA1c distribution, Waist circumference, Height, Weight]

3. Joint structure

3a. Diabetes prevalence by age band

| Age band | n | Observed (95% CI) | Reference | Source | Verdict |
| --- | --- | --- | --- | --- | --- |
| 18-44 | 940,227 | 3.55% (3.52–3.59%) | 4.0% | CDC NDSS 2022 | MATCH |
| 45-64 | 660,317 | 14.26% (14.18–14.34%) | 17.0% | CDC NDSS 2022 | CLOSE |
| 65-95 | 399,456 | 26.09% (25.96–26.23%) | 29.0% | CDC NDSS 2022 | CLOSE |

[Figure: Diabetes prevalence by age]

3b. Mean BMI by age band

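| Age band | n | Observed mean | Reference mean | Δ | Source | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| 18-44 | 940,227 | 29.49 | 28.4 | +1.09 | NHANES 2017-2020 | DIVERGES |
| 45-64 | 660,317 | 29.5 | 30.4 | -0.90 | NHANES 2017-2020 | DIVERGES |
| 65-95 | 399,456 | 29.49 | 29.6 | -0.11 | NHANES 2017-2020 | MATCH |

[Figure: BMI by age]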
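
The per-band comparison amounts to grouping records by age band, averaging BMI, and subtracting the reference mean. The sketch below assumes records are dictionaries with `age` and `bmi` keys, which is a guess at the record layout; the reference means are copied from the table above. It deliberately stops at Δ, because the report does not state the threshold that turns Δ into a verdict.

```python
from collections import defaultdict

# Age bands and reference means as listed in section 3b.
BANDS = [(18, 44), (45, 64), (65, 95)]
REFERENCE_MEAN_BMI = {"18-44": 28.4, "45-64": 30.4, "65-95": 29.6}


def age_band(age: int):
    """Return the band label for an age, or None if outside the adult range."""
    for low, high in BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    return None


def mean_bmi_delta(records):
    """Observed mean BMI minus reference mean, per age band present in the data."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        band = age_band(rec["age"])
        if band is None:
            continue
        sums[band] += rec["bmi"]
        counts[band] += 1
    return {band: sums[band] / counts[band] - REFERENCE_MEAN_BMI[band] for band in counts}


sample = [{"age": 30, "bmi": 29.0}, {"age": 40, "bmi": 30.0}, {"age": 70, "bmi": 29.6}]
print(mean_bmi_delta(sample))
```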

3c. Inter-attribute Pearson correlations

Practical-equivalence band ±0.15. Sign mismatches reported separately.

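| Pair | Observed r | Reference r | Δ | Source | Verdict |
| --- | --- | --- | --- | --- | --- |
| bmi × waist_circumference_cm | 0.7604 | 0.85 | -0.090 | NHANES | MATCH |
| bmi × weight_kg | 0.8888 | 0.78 | +0.109 | NHANES | MATCH |
| height_cm × weight_kg | 0.5162 | 0.45 | +0.066 | NHANES | MATCH |
| age × a1c_value | 0.2851 | 0.14 | +0.145 | NHANES | MATCH |
| age × bmi | 0.0001 | 0.1 | -0.100 | NHANES | MATCH |

[Figure: Correlation matrix]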
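
The correlation check can be sketched as a Pearson coefficient plus a comparison against the ±0.15 practical-equivalence band. The ordering of the checks (sign first, then band) and the inclusive boundary are assumptions; the report only states the band width and that sign mismatches are reported separately.

```python
import math


def pearson_r(xs, ys) -> float:
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_verdict(observed: float, reference: float, band: float = 0.15) -> str:
    """Assumed rule: opposite signs -> SIGN MISMATCH; else MATCH if |delta| <= band."""
    if observed * reference < 0:
        return "SIGN MISMATCH"
    return "MATCH" if abs(observed - reference) <= band else "DIVERGES"


print(correlation_verdict(0.2851, 0.14))  # age x a1c_value row
```

With a band this wide, the age × bmi pair (0.0001 observed against 0.1 reference) still passes, so a MATCH here does not imply the age dependence of BMI is reproduced.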

4. Uniqueness & birthday-paradox analysis

The identity-tuple uniqueness percentage decreases at scale because the (given_name × family_name × date_of_birth) value space is finite. The expected collision count is computed and compared against the observed.

Given-name distinct values in this dataset: 60
Family-name distinct values in this dataset: 30
Distinct dates of birth in this dataset: 28,470
Identity-tuple pool size (product): 51,246,000
Records drawn (n): 2,000,000
Expected collisions (linearity of expectation, n(n-1)/2 / pool): 39,027.4
Observed collisions: 46,166
Observed / Expected ratio: 1.183
Verdict (test: normal_2sigma): AS EXPECTED
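
The figures above can be re-derived in a few lines. The sketch assumes every identity tuple in the pool is equally likely, which is what the n(n-1)/2 / pool formula implies; the function name is hypothetical.

```python
def expected_colliding_pairs(n: int, pool: int) -> float:
    """Expected number of colliding pairs when n items are drawn uniformly from pool values.

    Each of the n(n-1)/2 pairs collides with probability 1/pool, and
    linearity of expectation sums those probabilities.
    """
    return n * (n - 1) / 2 / pool


pool = 60 * 30 * 28_470          # given names x family names x dates of birth
expected = expected_colliding_pairs(2_000_000, pool)
ratio = 46_166 / expected        # observed / expected

print(pool, round(expected, 1), round(ratio, 3))
```

Uniform sampling minimizes the collision probability for a fixed pool, so any skew in name or birth-date frequencies would push the observed count above this expectation; that may account for part of the 1.183 ratio, though the report does not say so.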

5. Format validity

| Field | Valid format (count) | Percentage (95% CI) |
| --- | --- | --- |
| Email | 2,000,000 | 100.0000% (99.9998–100.0000%) |
| Phone | 2,000,000 | 100.0000% (99.9998–100.0000%) |

6. Quality assurance summary

All cross-field invariant checks pass on this snapshot. 2,000,000 records, 0 violations.

The check battery covers: BMI consistency with height & weight; ZIP code consistency with state; pregnancy_status consistency with sex_at_birth; insulin use consistency with diabetes_status; prescription-count consistency with chronic-medication flags; age within the adult range; and ~30 other invariants.
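
Two of the listed invariants can be sketched as record-level predicates. The field names come from this report, but the category values (`"pregnant"`, `"female"`), the BMI tolerance, and the record-as-dictionary layout are all assumptions made for illustration; the actual check battery is not published here.

```python
def bmi_violation(rec: dict, tolerance: float = 0.1) -> bool:
    """True if bmi disagrees with weight_kg / height_m^2 by more than the tolerance."""
    height_m = rec["height_cm"] / 100
    expected_bmi = rec["weight_kg"] / height_m ** 2
    return abs(rec["bmi"] - expected_bmi) > tolerance


def pregnancy_violation(rec: dict) -> bool:
    """True if a record is marked pregnant but sex_at_birth is not female (assumed labels)."""
    return rec["pregnancy_status"] == "pregnant" and rec["sex_at_birth"] != "female"


def count_violations(records) -> dict:
    """Absolute violation counts per invariant; the report flags any non-zero count."""
    checks = {"bmi": bmi_violation, "pregnancy": pregnancy_violation}
    return {name: sum(1 for rec in records if check(rec)) for name, check in checks.items()}


good = {"height_cm": 170, "weight_kg": 72.25, "bmi": 25.0,
        "pregnancy_status": "not_pregnant", "sex_at_birth": "male"}
bad = {"height_cm": 170, "weight_kg": 72.25, "bmi": 30.0,
       "pregnancy_status": "pregnant", "sex_at_birth": "male"}
print(count_violations([good, bad]))
```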

7. Limitations

8. Privacy & re-identification

By construction, every value in this dataset is drawn from a synthetic distribution. No record corresponds to a real person; no field is derived from any real record. As a consequence:

9. Reproducibility

Datasets of this size can be requested via the asynchronous endpoint:

POST /v1/datasets/person
{
  "count":   2000000,
  "seed":    <any uint64>,
  "locale":  "en-US",
  "preset":  "us_2024_adults"
}

The endpoint returns a job ID; poll for completion and retrieve a download link from the status endpoint. Each request with the same seed produces the same dataset at the engine version listed in the header.
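
A client for this flow needs two pieces: a request body and a polling loop. The sketch below builds the body from the fields shown above and leaves the HTTP calls to the caller, because the base URL, the status endpoint path, and the shape of the status response are not documented in this report. `fetch_status` and `is_done` are hypothetical callables the caller supplies.

```python
import json
import time


def build_request_body(count: int, seed: int, locale: str = "en-US",
                       preset: str = "us_2024_adults") -> str:
    """Serialize the JSON body for POST /v1/datasets/person; seed must fit in a uint64."""
    if not 0 <= seed < 2 ** 64:
        raise ValueError("seed must be an unsigned 64-bit integer")
    return json.dumps({"count": count, "seed": seed, "locale": locale, "preset": preset})


def poll_until(fetch_status, is_done, interval_s: float = 5.0,
               max_attempts: int = 120, sleep=time.sleep):
    """Call fetch_status() until is_done(status) is true, then return that status.

    fetch_status and is_done are supplied by the caller, since the status
    endpoint and its response format are not specified in this report.
    """
    for _ in range(max_attempts):
        status = fetch_status()
        if is_done(status):
            return status
        sleep(interval_s)
    raise TimeoutError("job did not complete within the polling budget")


print(build_request_body(2_000_000, 42))
```

Recording the seed alongside the engine version is what makes a snapshot reproducible, so a client would typically log both with the job ID.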
