§ GL

Glossary

Plain-English definitions of every acronym, data source, clinical term and statistic used across the site — 138 in all. Elsewhere on the site, hover over any dotted-underlined term to see the same definition inline.

Data sources

ACS: American Community Survey — the US Census Bureau's annual population and demographics survey
ADA: American Diabetes Association — publisher of the standard US diabetes treatment guidelines (Standards of Care)
BLS: Bureau of Labor Statistics — the US government agency that publishes employment and wage data
CDC: Centers for Disease Control and Prevention — the US federal public-health agency
CPS ASEC: Current Population Survey, Annual Social and Economic Supplement — a US Census/BLS survey of income and education
KFF: KFF (formerly the Kaiser Family Foundation) — a US health-policy research organization
MEPS: Medical Expenditure Panel Survey — a US survey of healthcare use and costs
MITRE: a US nonprofit research organization (the maker of Synthea)
NDSS: National Diabetes Surveillance System — the CDC's diabetes-statistics program
NHANES: National Health and Nutrition Examination Survey — a large US government health survey run by the CDC
NHIS: National Health Interview Survey — a CDC household health survey
NIDDK: National Institute of Diabetes and Digestive and Kidney Diseases — a US National Institutes of Health institute
UKPDS: UK Prospective Diabetes Study — a landmark long-term diabetes study
US Census 2020: the 2020 US Decennial Census — the national population count
USPS L005: US Postal Service Labeling List L005 — the official reference that groups 3-digit ZIP-code prefixes by Sectional Center Facility
USPS SCF ranges: US Postal Service Sectional Center Facility ranges — the geographic blocks of ZIP codes assigned to each state
USRDS: United States Renal Data System — the national kidney-disease statistics registry
VEHSS: Vision and Eye Health Surveillance System — a CDC eye-health data source

Clinical & health

A1c: A1c (HbA1c) — a blood test showing average blood sugar over ~3 months, used to diagnose and track diabetes
ACC/AHA: American College of Cardiology / American Heart Association — their guideline defines high blood pressure as 130/80 or higher
antihypertensive: a blood-pressure-lowering medication
biological vs clinical stage: two ways of rating disease severity: biological = how far the disease has actually progressed in the body; clinical = how far it has been detected/diagnosed
BMI: Body Mass Index — a weight-for-height ratio used to screen for under/overweight (weight kg ÷ (height m)²)
CKD: Chronic Kidney Disease — long-term loss of kidney function (stage 1 = mild, stage 5 = kidney failure)
claims: insurance billing records for medical services
comorbidity / co-occurrence: an additional disease present alongside the main one, and how often two conditions appear together
CVD / cardiovascular event: cardiovascular disease — heart and blood-vessel disease; a cardiovascular event is a heart attack or stroke
detection lag: the years a patient had diabetes before it was diagnosed
eGFR: estimated glomerular filtration rate — a number measuring how well the kidneys filter blood (lower = worse)
EHR: Electronic Health Record — the digital medical chart a clinic or hospital keeps on a patient
EIN: Employer Identification Number — a business's federal (IRS) tax ID; the one in the data is fake
ESRD: end-stage renal disease — kidney failure needing dialysis or a transplant
HbA1c / glycohemoglobin: hemoglobin A1c — the formal lab name for the average-blood-sugar (A1c) test
hyperlipidemia: high blood cholesterol/fats
hypertension: high blood pressure
insulin: a blood-sugar-lowering hormone given as a diabetes medication; in the data it appears only for people with diagnosed diabetes
KDIGO: Kidney Disease: Improving Global Outcomes — the international body that sets kidney-disease staging guidelines
marketplace insurance: health insurance bought through the government ACA exchange (HealthCare.gov), vs employer/Medicare/Medicaid
Medicare: the US government health-insurance program, mainly for people aged 65+
microalbuminuria / macroalbuminuria / albuminuria: protein (albumin) leaking into the urine, a sign of kidney damage (micro = small amounts/early, macro = large amounts/advanced)
microvascular complications: small-blood-vessel damage from diabetes, affecting eyes, kidneys and nerves
nephropathy: diabetic kidney damage (to the kidneys' filtering units)
neuropathy: nerve damage (a common diabetes complication causing numbness/pain, often in the feet)
prediabetes: blood sugar higher than normal but not yet in the diabetic range
prevalence: the share of people in a group who have a given condition
retinopathy: diabetic eye damage to the retina that can cause vision loss
SSN / ssn_last_four: Social Security Number; ssn_last_four is the last four digits of a fake one (not a real SSN)
statin: a cholesterol-lowering medication
systolic / diastolic: the top (systolic, heart beating) and bottom (diastolic, heart resting) numbers of a blood-pressure reading
T2DM: Type 2 Diabetes Mellitus — the common, mostly adult-onset form of diabetes
urbanicity: how urban vs rural an area is (urban / suburban / rural)
vitals: basic body measurements such as blood pressure, heart rate, height and weight
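
Two of the definitions above are simple arithmetic, so a minimal Python sketch may help. It computes BMI from the formula given and applies the ACC/AHA 130/80 threshold. The function names are illustrative only, and treating a reading as high when either number meets its threshold is an assumption of this sketch rather than something stated on this page.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kg divided by height in metres, squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def is_high_blood_pressure(systolic: int, diastolic: int) -> bool:
    """True when either number meets the 130/80 threshold (sketch assumption)."""
    return systolic >= 130 or diastolic >= 80


print(round(bmi(70.0, 1.75), 1))        # 22.9
print(is_high_blood_pressure(128, 82))  # True: the diastolic number is 80 or higher
```

The same BMI function doubles as a cross-field check: a record whose stored BMI differs noticeably from the value computed from its height and weight is inconsistent.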

Statistics & validation

birthday paradox: the statistical fact that random duplicates appear sooner than intuition expects when values come from a limited set
co-vary / jointly-distributed / joint realism: attributes are generated together so they relate realistically (e.g. age, weight and blood sugar move together as in real people), not as independent random columns
cohort: a defined group of people sharing a characteristic (here, e.g. people at one diabetes severity level)
conditioned on: each value is chosen based on the other values already set for the same person, so they fit together realistically
correlation / correlated / r: how strongly two values move together (r = correlation coefficient: −1 = perfectly opposite, 0 = none, 1 = perfectly together)
Cramér's V: a 0-to-1 score of how strongly two categorical attributes are related (0 = unrelated, 1 = fully tied together)
cross-field invariants: rules that must always hold between fields (e.g. the ZIP code belongs to the state; BMI matches height and weight)
distribution: the spread/shape of values for a field — how often each value occurs across the population
dose-response: more of the cause produces more of the effect (e.g. worse kidney stage leads to higher mortality)
entropy: a measure of how much variety/randomness a field carries
incidence rate: how often a disease newly occurs in a population (vs prevalence, which is how many currently have it)
k-anonymity / quasi-identifiers: a privacy measure: every record looks identical to at least k−1 others on the fields (quasi-identifiers, e.g. ZIP + birth date + sex) that could single someone out
Kolmogorov–Smirnov (KS): a 0-to-1 statistic measuring how far apart two value distributions are (lower = closer; 0 = identical)
longitudinal vs cross-sectional: longitudinal = tracking the same person over time (a whole history); cross-sectional = a single point-in-time snapshot of many people
Mantel–Haenszel (age-adjusted): a method that combines results across age groups so age differences don't distort the link being measured
marginal / marginals / marginal distribution: the overall percentage of one attribute across the whole population (e.g. the share of people with diabetes)
MEC-weighted microdata: individual-level survey records weighted (MEC = NHANES Mobile Examination Center sample) to represent the whole US population
microdata: the raw record-by-record survey responses (individual people's data), not just summary totals
monotone / monotonic: values move in only one direction — each later/worse stage is higher than the one before, never dipping back
odds ratio (OR): how many times higher the odds of a condition are in one group vs another (1 = no difference); odds = the chance of having it ÷ the chance of not having it
PCA / effective dimensionality: Principal Component Analysis — a technique that finds the main independent directions of variation in the data; effective dimensionality is how many such directions the data genuinely has
pp (percentage points): the arithmetic gap between two percentages (e.g. 40% vs 44% is 4 pp)
reference distributions: the published real-world statistics (how values spread in the actual US population) used as the blueprint
sampling variance: normal random fluctuation from drawing a limited sample, not a real flaw
two-sample classifier (C2ST) / AUC: a machine-learning test of how distinguishable fake is from real (AUC: 0.5 = indistinguishable, 1.0 = perfectly separable)
Woolf–Haldane confidence interval: a standard statistical method for the uncertainty range around an odds ratio
μ (mu): the Greek letter mu — statistics shorthand for the average (mean) value
χ² (chi-squared): a statistical test of whether observed counts match an even/expected spread
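
As a worked illustration of the odds ratio and its Woolf–Haldane confidence interval, here is a minimal Python sketch for a 2×2 table. It assumes the common variant that adds 0.5 to every cell and uses z = 1.96 for an approximate 95% interval. The function name and the example counts are made up for illustration and are not taken from the site's validation code.

```python
import math


def odds_ratio_with_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and approximate confidence interval for a 2x2 table.

    a = group 1 with the condition,  b = group 1 without
    c = group 2 with the condition,  d = group 2 without
    """
    # Haldane correction: add 0.5 to every cell so a zero count
    # cannot cause a division by zero.
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    # Woolf's method: the log odds ratio is treated as roughly normal
    # with this standard error.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)


# Illustrative counts only: 30 of 100 in group 1 vs 15 of 100 in group 2.
estimate, low, high = odds_ratio_with_ci(30, 70, 15, 85)
print(f"OR = {estimate:.2f}, interval = ({low:.2f}, {high:.2f})")
```

An interval that excludes 1 suggests the difference between the groups is larger than sampling variance alone would explain.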

Formats & technical

API: Application Programming Interface — a way for your software to request data from this service automatically
argon2id / hash / salted: argon2id is a strong one-way password-scrambling (hashing) algorithm; a hash can't be reversed to the original; salting mixes in a random value so identical passwords don't store the same hash
async / asynchronous: you submit a request now and download the result later, instead of getting it back instantly
C-CDA: Consolidated Clinical Document Architecture — a standard XML format for US clinical summary documents
CI / CI fixtures: CI = Continuous Integration, the automated build-and-test pipeline run on every code change; fixtures = the fixed sample data those tests run against
CSV: comma-separated values — a plain spreadsheet/table file that opens in Excel or Google Sheets
diffusion models: AI models that generate data by gradually removing random noise (the technique behind many AI image generators)
drift: data quietly changing from one run to the next, so tests become inconsistent
edge network: servers spread worldwide so the site loads quickly from a location near each user
endpoint: a specific URL your software calls to use the service
feature matrix / ML features: the table of input columns/variables fed into a machine-learning model
FHIR / FHIR R4: Fast Healthcare Interoperability Resources (Release 4) — the standard format health-IT systems use to exchange medical records
FHIR resources (Patient/Condition/MedicationRequest/Coverage/Encounter): the named FHIR record types — e.g. MedicationRequest = a prescription, Coverage = insurance, Encounter = a clinical visit
GANs: generative adversarial networks — an AI model type where two networks compete to produce realistic synthetic data
interoperability: different health systems being able to share and understand each other's data
JSONL: JSON Lines — a text file with one JSON record per line
MCP: Model Context Protocol — a standard that lets AI assistants connect to this service as a tool
MD5: a hashing algorithm that produces a short digital fingerprint of a file; matching fingerprints show two runs produced identical data
ML / machine learning: training computer models to learn patterns from data
NDJSON: newline-delimited JSON — one JSON record per line, the standard format for bulk FHIR export
OpenAPI spec: a standard machine-readable description of the API's endpoints and parameters (for developers)
OTP: One-Time Password — a single-use sign-in code emailed to you at each login (two-factor authentication)
QA: Quality Assurance — the software-testing function
schema: the set of data fields/columns in each record and their format/structure
SKU: Stock Keeping Unit — a product's inventory ID code
SQL: Structured Query Language — the standard language for relational databases (here, exporting as database insert statements)
staging / non-production / production data: production = the live system real users touch; staging and non-production are pre-production test/dev copies; production data = the real, live data from the running system
test fixtures: the fixed sample data a software test runs against
TLS / encryption in transit: Transport Layer Security — encryption that protects data while it travels between your browser and the server (the browser lock icon)
ULID: Universally Unique Lexicographically Sortable Identifier — a unique, time-ordered record ID
US-Core: the US-specific set of FHIR rules defining required data elements for US health records
UTC: Coordinated Universal Time — the global reference clock the daily quota resets on (midnight UTC)
UUID: Universally Unique Identifier — a random ID with virtually no chance of repeating
variational autoencoders: AI models that compress data into a compact form and then generate new lookalike samples
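
To make the JSONL and MD5 entries concrete, the following Python sketch writes a few records as JSON Lines and fingerprints the resulting bytes. The record fields and function names are hypothetical, and sorting the keys is a choice made here so the output bytes are stable. None of this is the service's actual export code.

```python
import hashlib
import json


def to_jsonl(records: list[dict]) -> bytes:
    """Serialize records as JSON Lines: one JSON object per line."""
    lines = (json.dumps(r, sort_keys=True, separators=(",", ":")) for r in records)
    return "".join(line + "\n" for line in lines).encode("utf-8")


def md5_fingerprint(data: bytes) -> str:
    """Return the MD5 digest of the bytes as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical records, for illustration only.
records = [
    {"id": 1, "state": "OH", "a1c": 6.1},
    {"id": 2, "state": "TX", "a1c": 7.4},
]
first = to_jsonl(records)
second = to_jsonl(records)
print(md5_fingerprint(first) == md5_fingerprint(second))  # True
```

Changing even one value in one record yields a different fingerprint, which is what makes the comparison a quick check for drift.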

Product

byte-identical: exactly the same output every time, down to the last character
calibrated / calibration: tuned so the synthetic numbers match real-world statistics from trusted US data sources
data dictionary / data domains: a reference list of every field, what it means and its format; domains are the categories the fields group into (identity, geography, health, etc.)
deterministic / deterministic by seed: re-running with the same seed always produces the exact same data (repeatable, not random between runs)
GOLD: the top internal validation grade — passes all the data-quality layers
manifest / dataset manifest: a small summary file describing how a dataset was generated, including its seed
seed: a starting number you choose; the same seed always regenerates the exact same dataset
synthetic data: computer-generated fake records that look statistically realistic but describe no real person
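
The seed, deterministic and manifest entries fit together as in the toy Python sketch below. It assumes every random draw comes from a single generator created from the seed. The field names, value ranges and manifest keys are invented for illustration and do not describe how this service generates data.

```python
import random


def generate(seed: int, n: int) -> list[dict]:
    """Toy generator: all randomness flows from one seeded generator."""
    rng = random.Random(seed)
    return [
        {"id": i, "age": rng.randint(18, 90), "systolic": rng.randint(95, 180)}
        for i in range(n)
    ]


def manifest(seed: int, n: int) -> dict:
    """Hypothetical manifest: enough information to regenerate the dataset."""
    return {"seed": seed, "record_count": n}


run_a = generate(seed=42, n=5)
run_b = generate(seed=42, n=5)
print(run_a == run_b)  # True: same seed, same records
print(manifest(42, 5))
```

Determinism of this kind only holds if nothing else, such as the current time or an unseeded random call, feeds into the output.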

Legal & compliance

data minimisation: collecting only the data actually needed and no more
data portability: the right to receive your data in a reusable format so you can move it elsewhere
data subject: the real, identifiable person a piece of personal data is about (a GDPR term)
de-identification / anonymized data: real data with names and other identifiers stripped or masked so individuals can't be recognized — contrasted with fully made-up synthetic data
DPDP: Digital Personal Data Protection Act — India's data-privacy law
GDPR: General Data Protection Regulation — the EU's data-privacy law
HIPAA: Health Insurance Portability and Accountability Act — the US law protecting patient health-data privacy
legitimate interests / performance of a contract / legal obligation: the lawful reasons (legal bases) privacy law allows for using your data — e.g. to deliver the service, for sensible business needs, or because the law requires it
masking / tokenized: hiding or replacing sensitive values in real records with stand-in placeholders (e.g. blacking out a real SSN)
PHI / protected health information: real, identifiable medical data about a real person that privacy law protects
PII / personally identifiable information: data that can identify a real person (name, address, SSN, etc.)
processor: a company that handles data on our behalf and under our instructions (a GDPR/DPDP role)
Pvt Ltd: Private Limited — an Indian company type (similar to a US LLC/Inc.)
re-identification risk / residual linkage risk: the leftover chance someone could match a supposedly anonymized record back to a real person, often by cross-matching another dataset
Safe Harbor / Expert Determination: HIPAA's two official de-identification methods: Safe Harbor removes 18 specified identifiers; Expert Determination has a statistician certify the re-identification risk is very small
Standard Contractual Clauses: EU-approved legal contract terms that protect personal data sent to other countries