Glossary

Plain-English definitions of every acronym, data source, clinical term and statistic used across the site — 138 in all. Anywhere else on the site, hover a dotted-underlined term to see the same definition inline.

Data sources

ACS
American Community Survey — the US Census Bureau's annual population and demographics survey
ADA
American Diabetes Association — publisher of the standard US diabetes treatment guidelines (Standards of Care)
BLS
Bureau of Labor Statistics — the US government agency that publishes employment and wage data
CDC
Centers for Disease Control and Prevention — the US federal public-health agency
CPS ASEC
Current Population Survey, Annual Social and Economic Supplement — a US Census/BLS survey of income and education
KFF
KFF (formerly the Kaiser Family Foundation) — a US health-policy research organization
MEPS
Medical Expenditure Panel Survey — a US survey of healthcare use and costs
MITRE
a US nonprofit research organization (the maker of Synthea)
NDSS
National Diabetes Surveillance System — the CDC's diabetes-statistics program
NHANES
National Health and Nutrition Examination Survey — a large US government health survey run by the CDC
NHIS
National Health Interview Survey — a CDC household health survey
NIDDK
National Institute of Diabetes and Digestive and Kidney Diseases — a US National Institutes of Health institute
UKPDS
UK Prospective Diabetes Study — a landmark long-term diabetes study
US Census 2020
the 2020 US Decennial Census — the national population count
USPS L005
US Postal Service product L005 — the official ZIP-code reference data
USPS SCF ranges
US Postal Service Sectional Center Facility ranges — the geographic blocks of ZIP codes assigned to each state
USRDS
United States Renal Data System — the national kidney-disease statistics registry
VEHSS
Vision and Eye Health Surveillance System — a CDC eye-health data source

Clinical & health

A1c
A1c (HbA1c) — a blood test showing average blood sugar over ~3 months, used to diagnose and track diabetes
ACC/AHA
American College of Cardiology / American Heart Association — their guideline defines high blood pressure as 130/80 or higher
antihypertensive
a blood-pressure-lowering medication
biological vs clinical stage
two ways of rating disease severity: biological = how far the disease has actually progressed in the body; clinical = how far it has been detected/diagnosed
BMI
Body Mass Index — a weight-for-height ratio used to screen for under/overweight (weight kg ÷ (height m)²)
CKD
Chronic Kidney Disease — long-term loss of kidney function (stage 1 = mild, stage 5 = kidney failure)
claims
insurance billing records for medical services
comorbidity / co-occurrence
an additional disease present alongside the main one, and how often two conditions appear together
CVD / cardiovascular event
cardiovascular disease — heart and blood-vessel disease; a cardiovascular event is a heart attack or stroke
detection lag
the years a patient had diabetes before it was diagnosed
eGFR
estimated glomerular filtration rate — a number measuring how well the kidneys filter blood (lower = worse)
EHR
Electronic Health Record — the digital medical chart a clinic or hospital keeps on a patient
EIN
Employer Identification Number — a business's federal (IRS) tax ID; the one in the data is fake
ESRD
end-stage renal disease — kidney failure needing dialysis or a transplant
HbA1c / glycohemoglobin
hemoglobin A1c — the formal lab name for the average-blood-sugar (A1c) test
hyperlipidemia
high blood cholesterol/fats
hypertension
high blood pressure
insulin
a hormone that lowers blood sugar, given as a diabetes medication; in the data it is present only for people with diagnosed diabetes
KDIGO
Kidney Disease: Improving Global Outcomes — the international body that sets kidney-disease staging guidelines
marketplace insurance
health insurance bought through the government ACA exchange (HealthCare.gov), vs employer/Medicare/Medicaid
Medicare
the US government health-insurance program, mainly for people aged 65+
microalbuminuria / macroalbuminuria / albuminuria
protein (albumin) leaking into the urine, a sign of kidney damage (micro = small amounts/early, macro = large amounts/advanced)
microvascular complications
small-blood-vessel damage from diabetes, affecting eyes, kidneys and nerves
nephropathy
diabetic kidney damage (to the kidneys' filtering units)
neuropathy
nerve damage (a common diabetes complication causing numbness/pain, often in the feet)
prediabetes
blood sugar higher than normal but not yet in the diabetic range
prevalence
the share of people in a group who have a given condition
retinopathy
diabetic eye damage to the retina that can cause vision loss
SSN / ssn_last_four
Social Security Number; ssn_last_four is the last four digits of a fake one (not a real SSN)
statin
a cholesterol-lowering medication
systolic / diastolic
the top (systolic, heart beating) and bottom (diastolic, heart resting) numbers of a blood-pressure reading
T2DM
Type 2 Diabetes Mellitus — the common, mostly adult-onset form of diabetes
urbanicity
how urban vs rural an area is (urban / suburban / rural)
vitals
basic body measurements such as blood pressure, heart rate, height and weight
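
Two of the clinical definitions above reduce to simple arithmetic. The following is a minimal Python sketch, assuming metric units and whole-number blood-pressure readings. The function names are illustrative only and are not part of any site API, and treating either number reaching its threshold as "high" is an assumption about how the 130/80 cut-off is applied.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kg divided by height in metres, squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def is_high_blood_pressure(systolic: int, diastolic: int) -> bool:
    """ACC/AHA cut-off quoted in the glossary: 130/80 or higher.

    Assumption: a reading counts as high if EITHER number reaches
    its threshold (systolic >= 130 or diastolic >= 80).
    """
    return systolic >= 130 or diastolic >= 80


print(round(bmi(70, 1.75), 1))          # 22.9
print(is_high_blood_pressure(128, 82))  # True (diastolic is at least 80)
print(is_high_blood_pressure(120, 70))  # False
```

The same `bmi` calculation is what a cross-field check would use to confirm that a stored BMI agrees with the stored height and weight.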

Statistics & validation

birthday paradox
the statistical fact that random duplicates appear sooner than intuition expects when values come from a limited set
co-vary / jointly-distributed / joint realism
attributes are generated together so they relate realistically (e.g. age, weight and blood sugar move together as in real people), not as independent random columns
cohort
a defined group of people sharing a characteristic (here, e.g. people at one diabetes severity level)
conditioned on
each value is chosen based on the other values already set for the same person, so they fit together realistically
correlation / correlated / r
how strongly two values move together (r = correlation coefficient, from −1 to 1: 0 = no linear relationship, 1 = perfectly together, −1 = perfectly opposite)
Cramér's V
a 0-to-1 score of how strongly two categorical attributes are related (0 = unrelated, 1 = fully tied together)
cross-field invariants
rules that must always hold between fields (e.g. the ZIP code belongs to the state; BMI matches height and weight)
distribution
the spread/shape of values for a field — how often each value occurs across the population
dose-response
more of the cause produces more of the effect (e.g. worse kidney stage leads to higher mortality)
entropy
a measure of how much variety/randomness a field carries
incidence rate
how often a disease newly occurs in a population (vs prevalence, which is how many currently have it)
k-anonymity / quasi-identifiers
a privacy measure: every record looks identical to at least k−1 others on the fields (quasi-identifiers, e.g. ZIP + birth date + sex) that could single someone out
Kolmogorov–Smirnov (KS)
a 0-to-1 statistic measuring how far apart two value distributions are (lower = closer; 0 = identical)
longitudinal vs cross-sectional
longitudinal = tracking the same person over time (a whole history); cross-sectional = a single point-in-time snapshot of many people
Mantel–Haenszel (age-adjusted)
a method that combines results across age groups so age differences don't distort the link
marginal / marginals / marginal distribution
the overall percentage of one attribute across the whole population (e.g. the share of people with diabetes)
MEC-weighted microdata
individual-level survey records weighted (MEC = NHANES Mobile Examination Center sample) to represent the whole US population
microdata
the raw record-by-record survey responses (individual people's data), not just summary totals
monotone / monotonic
values move in only one direction — each later/worse stage is higher than the one before, never dipping back
odds ratio (OR)
how many times higher the odds of a condition are in one group vs another (1 = no difference)
PCA / effective dimensionality
Principal Component Analysis — a technique that finds the main directions of variation in the data; effective dimensionality is how many genuinely independent directions the data has
pp (percentage points)
percentage points — the arithmetic gap between two percentages (e.g. 40% vs 44% is 4 pp)
reference distributions
the published real-world statistics (how values spread in the actual US population) used as the blueprint
sampling variance
normal random fluctuation from drawing a limited sample, not a real flaw
two-sample classifier (C2ST) / AUC
a machine-learning test of how distinguishable fake is from real (AUC: 0.5 = indistinguishable, 1.0 = perfectly separable)
Woolf–Haldane confidence interval
a standard statistical method for the uncertainty range around an odds ratio
μ (mu)
the Greek letter mu — statistics shorthand for the average (mean) value
χ² (chi-squared)
a statistical test of whether observed counts match an even/expected spread
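
Several of the statistics above can be computed in a few lines. The sketch below is illustrative only, assuming plain Python with no third-party libraries. The function names (`odds_ratio_ci`, `ks_statistic`, `k_anonymity`) and the sample records are hypothetical and are not the site's validation code. The confidence interval follows one common reading of Woolf–Haldane: add 0.5 to every cell of the 2×2 table, then build the interval on the log scale.

```python
import bisect
import math
from collections import Counter


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% confidence interval.

    a = group 1 with condition, b = group 1 without,
    c = group 2 with condition, d = group 2 without.
    Assumption: 0.5 is added to every cell (Haldane correction) and the
    interval uses the Woolf log-scale standard error.
    """
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(ratio) - z * se)
    high = math.exp(math.log(ratio) + z * se)
    return ratio, low, high


def ks_statistic(xs, ys):
    """Largest gap between the two empirical cumulative distributions."""
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for v in set(xs) | set(ys):
        fx = bisect.bisect_right(xs, v) / len(xs)  # share of xs <= v
        fy = bisect.bisect_right(ys, v) / len(ys)  # share of ys <= v
        gap = max(gap, abs(fx - fy))
    return gap


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Hypothetical sample records, for illustration only.
people = [
    {"zip": "02139", "sex": "F"},
    {"zip": "02139", "sex": "F"},
    {"zip": "10001", "sex": "M"},
]

ratio, low, high = odds_ratio_ci(20, 80, 10, 90)
print(round(ratio, 2))                        # 2.19
print(ks_statistic([1, 2, 3], [1, 2, 3]))     # 0.0 (identical distributions)
print(k_anonymity(people, ["zip", "sex"]))    # 1 (one record is unique)
```

A result of k = 1 means at least one record is unique on the chosen fields, which is the situation k-anonymity is meant to flag.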

Formats & technical

API
Application Programming Interface — a way for your software to request data from this service automatically
argon2id / hash / salted
argon2id is a strong one-way password-scrambling (hashing) algorithm; a hash can't be reversed to the original; salting mixes in a random value so identical passwords don't store the same hash
async / asynchronous
you submit a request now and download the result later, instead of getting it back instantly
C-CDA
Consolidated Clinical Document Architecture — a standard XML format for US clinical summary documents
CI / CI fixtures
CI = Continuous Integration, the automated build-and-test pipeline run on every code change; fixtures = the fixed sample data those tests run against
CSV
comma-separated values — a plain spreadsheet/table file that opens in Excel or Google Sheets
diffusion models
AI models that generate data by gradually removing random noise (the technique behind many AI image generators)
drift
data quietly changing from one run to the next, so tests become inconsistent
edge network
servers spread worldwide so the site loads quickly from a location near each user
endpoint
a specific URL your software calls to use the service
feature matrix / ML features
the table of input columns/variables fed into a machine-learning model
FHIR / FHIR R4
Fast Healthcare Interoperability Resources (Release 4) — the standard format health-IT systems use to exchange medical records
FHIR resources (Patient/Condition/MedicationRequest/Coverage/Encounter)
the named FHIR record types — e.g. MedicationRequest = a prescription, Coverage = insurance, Encounter = a clinical visit
GANs
generative adversarial networks — an AI model type where two networks compete to produce realistic synthetic data
interoperability
different health systems being able to share and understand each other's data
JSONL
JSON Lines — a text file with one JSON record per line
MCP
Model Context Protocol — a standard that lets AI assistants connect to this service as a tool
MD5
a hash algorithm that produces a short digital fingerprint of a file; matching fingerprints show that two runs produced identical data
ML / machine learning
training computer models to learn patterns from data
NDJSON
newline-delimited JSON — one JSON record per line, the standard format for bulk FHIR export
OpenAPI spec
a standard machine-readable description of the API's endpoints and parameters (for developers)
OTP
One-Time Password — a single-use sign-in code emailed to you each login (two-factor authentication)
QA
Quality Assurance — the software-testing function
schema
the set of data fields/columns in each record and their format/structure
SKU
Stock Keeping Unit — a product's inventory ID code
SQL
Structured Query Language — the standard language for relational databases (here, exporting as database insert statements)
staging / non-production / production data
production = the live system real users touch; staging and non-production are pre-production test/dev copies; production data = the real, live data from the running system
test fixtures
the fixed sample data a software test runs against
TLS / encryption in transit
Transport Layer Security — encryption that protects data while it travels between your browser and the server (the browser lock icon)
ULID
Universally Unique Lexicographically Sortable Identifier — a unique, time-ordered record ID
US-Core
the US-specific set of FHIR rules defining required data elements for US health records
UTC
Coordinated Universal Time — the global reference clock the daily quota resets on (midnight UTC)
UUID
Universally Unique Identifier — a random ID with virtually no chance of repeating
variational autoencoders
AI models that compress data into a compact form and then generate new lookalike samples
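
To make a few of the format terms above concrete, here is a small Python sketch using only the standard library. It writes records as JSON Lines (the same one-record-per-line layout as NDJSON), reads them back, takes an MD5 fingerprint, and generates a UUID. The helper names and the two sample records are assumptions for illustration and do not reflect the service's actual export code.

```python
import hashlib
import json
import uuid

# Hypothetical sample records, for illustration only.
records = [
    {"id": 1, "state": "MA"},
    {"id": 2, "state": "NY"},
]


def to_jsonl(rows):
    """One JSON object per line; sort_keys keeps the key order stable."""
    return "".join(json.dumps(row, sort_keys=True) + "\n" for row in rows)


def from_jsonl(text):
    """Parse each non-empty line as its own JSON record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def md5_hex(text):
    """MD5 fingerprint of the text, as a 32-character hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


payload = to_jsonl(records)
print(from_jsonl(payload) == records)                   # True (round trip)
print(md5_hex(payload) == md5_hex(to_jsonl(records)))   # True (same bytes, same fingerprint)
print(len(md5_hex(payload)))                            # 32
print(uuid.uuid4())                                     # a new random UUID on every run
```

Because each line is a complete JSON record, a JSONL or NDJSON file can be processed one line at a time without loading the whole file.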

Product

byte-identical
exactly the same output every time, down to the last character
calibrated / calibration
tuned so the synthetic numbers match real-world statistics from trusted US data sources
data dictionary / data domains
a reference list of every field, what it means and its format; domains are the categories the fields group into (identity, geography, health, etc.)
deterministic / deterministic by seed
re-running with the same seed always produces the exact same data (repeatable, not random between runs)
GOLD
the top internal validation grade — passes all the data-quality layers
manifest / dataset manifest
a small summary file describing how a dataset was generated, including its seed
seed
a starting number you choose; the same seed always regenerates the exact same dataset
synthetic data
computer-generated fake records that look statistically realistic but describe no real person
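
The ideas of a seed, deterministic output, byte-identical results and a manifest fit together as sketched below. This is a toy generator written for illustration, assuming Python's standard `random` module. The field names, the `generate` and `manifest` helpers, and the manifest layout are all hypothetical and are not the product's real generator or manifest format.

```python
import hashlib
import json
import random


def generate(seed: int, n: int) -> bytes:
    """Toy generator: the same seed and row count give the same bytes."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    rows = [
        {"row": i, "age": rng.randint(18, 90), "sex": rng.choice("FM")}
        for i in range(n)
    ]
    text = "".join(json.dumps(r, sort_keys=True) + "\n" for r in rows)
    return text.encode("utf-8")


def manifest(seed: int, n: int) -> dict:
    """Hypothetical manifest: records the seed plus a fingerprint of the output."""
    data = generate(seed, n)
    return {"seed": seed, "rows": n, "md5": hashlib.md5(data).hexdigest()}


print(generate(42, 100) == generate(42, 100))   # True (byte-identical)
print(manifest(42, 100) == manifest(42, 100))   # True (same seed, same fingerprint)
print(manifest(42, 100)["md5"] == manifest(43, 100)["md5"])  # False (different seed, different data)
```

Storing the seed alongside a fingerprint is what lets someone regenerate a dataset later and confirm it matches the original run.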

Privacy & legal

data minimisation
collecting only the data actually needed and no more
data portability
the right to receive your data in a reusable format so you can move it elsewhere
data subject
the real, identifiable person a piece of personal data is about (a GDPR term)
de-identification / anonymized data
real data with names and other identifiers stripped or masked so individuals can't be recognized — contrasted with fully made-up synthetic data
DPDP
Digital Personal Data Protection Act — India's data-privacy law
GDPR
General Data Protection Regulation — the EU's data-privacy law
HIPAA
Health Insurance Portability and Accountability Act — the US law protecting patient health-data privacy
legal bases
the lawful reasons (legal bases) privacy law allows for using your data — e.g. to deliver the service, for sensible business needs, or because the law requires it
masking / tokenized
hiding or replacing sensitive values in real records with stand-in placeholders (e.g. blacking out a real SSN)
PHI / protected health information
real, identifiable medical data about a real person that privacy law protects
PII / personally identifiable information
data that can identify a real person (name, address, SSN, etc.)
processor
a company that handles data on our behalf and under our instructions (a GDPR/DPDP role)
Pvt Ltd
Private Limited — an Indian company type (similar to a US LLC/Inc.)
re-identification risk / residual linkage risk
the leftover chance someone could match a supposedly anonymized record back to a real person, often by cross-matching another dataset
Safe Harbor / Expert Determination
HIPAA's two official de-identification methods: Safe Harbor removes 18 specified identifiers; Expert Determination has a statistician certify the re-identification risk is very small
Standard Contractual Clauses
EU-approved legal contract terms that protect personal data sent to other countries