§ GL
Glossary
Plain-English definitions of every acronym, data source, clinical term and statistic used across the site — 138 in all. Anywhere else on the site, hover a dotted-underlined term to see the same definition inline.
Data sources
- ACS
- American Community Survey — the US Census Bureau's annual population and demographics survey
- ADA
- American Diabetes Association — publisher of the standard US diabetes treatment guidelines (Standards of Care)
- BLS
- Bureau of Labor Statistics — the US government agency that publishes employment and wage data
- CDC
- Centers for Disease Control and Prevention — the US federal public-health agency
- CPS ASEC
- Current Population Survey, Annual Social and Economic Supplement — a US Census/BLS survey of income and education
- KFF
- KFF (formerly the Kaiser Family Foundation) — a US health-policy research organization
- MEPS
- Medical Expenditure Panel Survey — a US survey of healthcare use and costs
- MITRE
- a US nonprofit research organization (the maker of Synthea)
- NDSS
- National Diabetes Surveillance System — the CDC's diabetes-statistics program
- NHANES
- National Health and Nutrition Examination Survey — a large US government health survey run by the CDC
- NHIS
- National Health Interview Survey — a CDC household health survey
- NIDDK
- National Institute of Diabetes and Digestive and Kidney Diseases — a US National Institutes of Health institute
- UKPDS
- UK Prospective Diabetes Study — a landmark long-term diabetes study
- US Census 2020
- the 2020 US Decennial Census — the national population count
- USPS L005
- US Postal Service Labeling List L005 — the official table of 3-digit ZIP Code prefix groups, used here as ZIP-code reference data
- USPS SCF ranges
- US Postal Service Sectional Center Facility ranges — the geographic blocks of ZIP codes assigned to each state
- USRDS
- United States Renal Data System — the national kidney-disease statistics registry
- VEHSS
- Vision and Eye Health Surveillance System — a CDC eye-health data source
Clinical & health
- A1c
- A1c (HbA1c) — a blood test showing average blood sugar over ~3 months, used to diagnose and track diabetes
- ACC/AHA
- American College of Cardiology / American Heart Association — their 2017 guideline defines high blood pressure as 130/80 mm Hg or higher
- antihypertensive
- a blood-pressure-lowering medication
- biological vs clinical stage
- two ways of rating disease severity: biological = how far the disease has actually progressed in the body; clinical = how far it has been detected/diagnosed
- BMI
- Body Mass Index — a weight-for-height ratio used to screen for under/overweight (weight kg ÷ (height m)²)
- CKD
- Chronic Kidney Disease — long-term loss of kidney function (stage 1 = mild, stage 5 = kidney failure)
- claims
- insurance billing records for medical services
- comorbidity / co-occurrence
- an additional disease present alongside the main one, and how often two conditions appear together
- CVD / cardiovascular event
- cardiovascular disease — heart and blood-vessel disease; a cardiovascular event is a heart attack or stroke
- detection lag
- the years a patient had diabetes before it was diagnosed
- eGFR
- estimated glomerular filtration rate — a number measuring how well the kidneys filter blood (lower = worse)
- EHR
- Electronic Health Record — the digital medical chart a clinic or hospital keeps on a patient
- EIN
- Employer Identification Number — a business's federal (IRS) tax ID; the one in the data is fake
- ESRD
- end-stage renal disease — kidney failure needing dialysis or a transplant
- HbA1c / glycohemoglobin
- hemoglobin A1c — the formal lab name for the average-blood-sugar (A1c) test
- hyperlipidemia
- high blood cholesterol/fats
- hypertension
- high blood pressure
- insulin
- a hormone that lowers blood sugar, prescribed as a diabetes medication; in the data it is only present for diagnosed diabetics
- KDIGO
- Kidney Disease: Improving Global Outcomes — the international body that sets kidney-disease staging guidelines
- marketplace insurance
- health insurance bought through a government ACA exchange (HealthCare.gov or a state-run marketplace), vs employer/Medicare/Medicaid
- Medicare
- the US government health-insurance program, mainly for people aged 65+
- microalbuminuria / macroalbuminuria / albuminuria
- protein (albumin) leaking into the urine, a sign of kidney damage (micro = small amounts/early, macro = large amounts/advanced)
- microvascular complications
- small-blood-vessel damage from diabetes, affecting eyes, kidneys and nerves
- nephropathy
- diabetic kidney damage (to the kidneys' filtering units)
- neuropathy
- nerve damage (a common diabetes complication causing numbness/pain, often in the feet)
- prediabetes
- blood sugar higher than normal but not yet in the diabetic range
- prevalence
- the share of people in a group who have a given condition
- retinopathy
- diabetic eye damage to the retina that can cause vision loss
- SSN / ssn_last_four
- Social Security Number; ssn_last_four is the last four digits of a fake one (not a real SSN)
- statin
- a cholesterol-lowering medication
- systolic / diastolic
- the top (systolic, heart beating) and bottom (diastolic, heart resting) numbers of a blood-pressure reading
- T2DM
- Type 2 Diabetes Mellitus — the common, mostly adult-onset form of diabetes
- urbanicity
- how urban vs rural an area is (urban / suburban / rural)
- vitals
- basic body measurements such as blood pressure, heart rate, height and weight
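As a worked illustration of the BMI entry above, the formula (weight in kg divided by height in metres squared) can be sketched in a few lines of Python. The function name `bmi` and the example values are assumptions made for illustration; they are not taken from the site's data or API.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / (height_m ** 2)


# Hypothetical example values, not drawn from any dataset.
print(round(bmi(70.0, 1.75), 1))  # 22.9
```

A check of this kind is also what a cross-field invariant such as "BMI matches height and weight" amounts to: recompute the value from the other two fields and compare.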
Statistics & validation
- birthday paradox
- the statistical fact that random duplicates appear sooner than intuition expects when values come from a limited set
- co-vary / jointly-distributed / joint realism
- attributes are generated together so they relate realistically (e.g. age, weight and blood sugar move together as in real people), not as independent random columns
- cohort
- a defined group of people sharing a characteristic (here, e.g. people at one diabetes severity level)
- conditioned on
- each value is chosen based on the other values already set for the same person, so they fit together realistically
- correlation / r
- how strongly two values move together (r = correlation coefficient: 0 = none, 1 = perfectly together, −1 = perfectly opposite)
- Cramér's V
- a 0-to-1 score of how strongly two categorical attributes are related (0 = unrelated, 1 = fully tied together)
- cross-field invariants
- rules that must always hold between fields (e.g. the ZIP code belongs to the state; BMI matches height and weight)
- distribution
- the spread/shape of values for a field — how often each value occurs across the population
- dose-response
- more of the cause produces more of the effect (e.g. worse kidney stage leads to higher mortality)
- entropy
- a measure of how much variety/randomness a field carries
- incidence rate
- how often a disease newly occurs in a population (vs prevalence, which is how many currently have it)
- k-anonymity / quasi-identifiers
- a privacy measure: every record looks identical to at least k−1 others on the fields (quasi-identifiers, e.g. ZIP + birth date + sex) that could single someone out
- Kolmogorov–Smirnov (KS)
- a 0-to-1 statistic measuring how far apart two value distributions are (lower = closer; 0 = identical)
- longitudinal vs cross-sectional
- longitudinal = tracking the same person over time (a whole history); cross-sectional = a single point-in-time snapshot of many people
- Mantel–Haenszel (age-adjusted)
- a method that combines results across age groups so that age differences don't distort the measured link between two conditions
- marginal / marginals / marginal distribution
- the overall percentage of one attribute across the whole population (e.g. the share of people with diabetes)
- MEC-weighted microdata
- individual-level NHANES survey records carrying the Mobile Examination Center (MEC) sample weights, so that results represent the whole US population
- microdata
- the raw record-by-record survey responses (individual people's data), not just summary totals
- monotone / monotonic
- values move in only one direction — each later/worse stage is higher than the one before, never dipping back
- odds ratio (OR)
- how many times higher the odds of a condition are in one group vs another (1 = no difference)
- PCA / effective dimensionality
- Principal Component Analysis — a technique that finds the main independent directions of variation in the data; effective dimensionality is how many of those directions genuinely matter
- pp (percentage points)
- percentage points — the arithmetic gap between two percentages (e.g. 40% vs 44% is 4 pp)
- reference distributions
- the published real-world statistics (how values spread in the actual US population) used as the blueprint
- sampling variance
- normal random fluctuation from drawing a limited sample, not a real flaw
- two-sample classifier (C2ST) / AUC
- a machine-learning test of how distinguishable fake is from real (AUC: 0.5 = indistinguishable, 1.0 = perfectly separable)
- Woolf–Haldane confidence interval
- a standard statistical method for the uncertainty range around an odds ratio
- μ (mu)
- the Greek letter mu — statistics shorthand for the average (mean) value
- χ² (chi-squared)
- a statistical test of whether observed counts match an even/expected spread
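Three of the entries above (birthday paradox, odds ratio with a Woolf–Haldane interval, and k-anonymity) are easier to follow with numbers. The sketch below is a minimal illustration, not the site's validation code. All function names are hypothetical, the records are made up, and the Haldane step assumes the common convention of adding 0.5 to every cell of the 2×2 table.

```python
import math
from collections import Counter


def duplicate_probability(n_draws: int, n_values: int) -> float:
    """Birthday paradox: chance that n_draws uniform random picks from
    n_values possible values contain at least one duplicate."""
    p_all_distinct = 1.0
    for i in range(n_draws):
        p_all_distinct *= max(0.0, 1.0 - i / n_values)
    return 1.0 - p_all_distinct


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio for a 2x2 table with a Woolf (log-scale) interval.

    a, b = cases / non-cases in group 1; c, d = cases / non-cases in group 2.
    Assumption: the Haldane correction adds 0.5 to every cell, so a zero
    count does not break the calculation.
    Returns (odds_ratio, lower_bound, upper_bound).
    """
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


def k_anonymity(records, quasi_identifiers) -> int:
    """Size of the smallest group when records are grouped by quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


# 23 people, 365 possible birthdays: a duplicate is already more likely than not.
print(round(duplicate_probability(23, 365), 3))  # 0.507

# Made-up records: the lone (94105, M) record makes the dataset only 1-anonymous.
people = [
    {"zip": "10001", "sex": "F"},
    {"zip": "10001", "sex": "F"},
    {"zip": "94105", "sex": "M"},
]
print(k_anonymity(people, ["zip", "sex"]))  # 1
```

An odds ratio whose interval includes 1 (for example, `odds_ratio_ci(20, 80, 10, 90)`) is one where "no difference" cannot be ruled out at the chosen confidence level.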
Formats & technical
- API
- Application Programming Interface — a way for your software to request data from this service automatically
- argon2id / hash / salted
- argon2id is a strong one-way password-scrambling (hashing) algorithm; a hash can't be reversed to the original; salting mixes in a random value so identical passwords don't store the same hash
- async / asynchronous
- you submit a request now and download the result later, instead of getting it back instantly
- C-CDA
- Consolidated Clinical Document Architecture — a standard XML format for US clinical summary documents
- CI / CI fixtures
- CI = Continuous Integration, the automated build-and-test pipeline run on every code change; fixtures = the fixed sample data those tests run against
- CSV
- comma-separated values — a plain spreadsheet/table file that opens in Excel or Google Sheets
- diffusion models
- AI models that generate data by gradually removing random noise (the technique behind many AI image generators)
- drift
- data quietly changing from one run to the next, so tests become inconsistent
- edge network
- servers spread worldwide so the site loads quickly from a location near each user
- endpoint
- a specific URL your software calls to use the service
- feature matrix / ML features
- the table of input columns/variables fed into a machine-learning model
- FHIR / FHIR R4
- Fast Healthcare Interoperability Resources (Release 4) — the standard format health-IT systems use to exchange medical records
- FHIR resources (Patient/Condition/MedicationRequest/Coverage/Encounter)
- the named FHIR record types — e.g. MedicationRequest = a prescription, Coverage = insurance, Encounter = a clinical visit
- GANs
- generative adversarial networks — an AI model type where two networks compete to produce realistic synthetic data
- interoperability
- different health systems being able to share and understand each other's data
- JSONL
- JSON Lines — a text file with one JSON record per line
- MCP
- Model Context Protocol — a standard that lets AI assistants connect to this service as a tool
- MD5
- a hashing algorithm that produces a short digital fingerprint (checksum) of a file; matching fingerprints show that two runs produced identical data
- ML / machine learning
- training computer models to learn patterns from data
- NDJSON
- newline-delimited JSON — one JSON record per line, the standard format for bulk FHIR export
- OpenAPI spec
- a standard machine-readable description of the API's endpoints and parameters (for developers)
- OTP
- One-Time Password — a single-use sign-in code emailed to you at each login (two-factor authentication)
- QA
- Quality Assurance — the software-testing function
- schema
- the set of data fields/columns in each record and their format/structure
- SKU
- Stock Keeping Unit — a product's inventory ID code
- SQL
- Structured Query Language — the standard language for relational databases (here, exporting as database insert statements)
- staging / non-production / production data
- production = the live system real users touch; staging and non-production are pre-production test/dev copies; production data = the real, live data from the running system
- test fixtures
- the fixed sample data a software test runs against
- TLS / encryption in transit
- Transport Layer Security — encryption that protects data while it travels between your browser and the server (the browser lock icon)
- ULID
- Universally Unique Lexicographically Sortable Identifier — a unique, time-ordered record ID
- US-Core
- the US-specific set of FHIR rules defining required data elements for US health records
- UTC
- Coordinated Universal Time — the global reference clock the daily quota resets on (midnight UTC)
- UUID
- Universally Unique Identifier — a random ID with virtually no chance of repeating
- variational autoencoders
- AI models that compress data into a compact form and then generate new lookalike samples
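To make the JSONL / NDJSON and MD5 entries above concrete, here is a minimal sketch using only the Python standard library. The records and field names are placeholders, not the site's schema, and the helper names are hypothetical.

```python
import hashlib
import json

# Hypothetical records; the field names are placeholders, not the site's schema.
records = [{"id": 1, "state": "NY"}, {"id": 2, "state": "CA"}]


def to_jsonl(rows) -> str:
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row, sort_keys=True) + "\n" for row in rows)


def from_jsonl(text: str):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def md5_fingerprint(data: bytes) -> str:
    """Return the MD5 digest of data as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


text = to_jsonl(records)
first = md5_fingerprint(text.encode("utf-8"))
second = md5_fingerprint(to_jsonl(records).encode("utf-8"))
print(first == second)  # True
```

Comparing fingerprints is a convenient way to confirm that two exports are the same file. MD5 is suitable for that kind of accidental-change check, but it is not considered safe against deliberate tampering.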
Product
- byte-identical
- exactly the same output every time, down to the last character
- calibrated / calibration
- tuned so the synthetic numbers match real-world statistics from trusted US data sources
- data dictionary / data domains
- a reference list of every field, what it means and its format; domains are the categories the fields group into (identity, geography, health, etc.)
- deterministic / deterministic by seed
- re-running with the same seed always produces the exact same data (repeatable, not random between runs)
- GOLD
- the top internal validation grade — passes all the data-quality layers
- manifest / dataset manifest
- a small summary file describing how a dataset was generated, including its seed
- seed
- a starting number you choose; the same seed always regenerates the exact same dataset
- synthetic data
- computer-generated fake records that look statistically realistic but describe no real person
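The ideas of a seed, deterministic output and byte-identical results can be shown with a toy generator. This is only an illustration built on Python's standard `random` module; it is not the service's generator, and the `generate` function and its fields (`age`, `a1c`) are invented for the example.

```python
import json
import random


def generate(seed: int, n: int = 3) -> bytes:
    """Toy generator: the same seed always yields the same serialized bytes."""
    rng = random.Random(seed)  # private generator, isolated from global random state
    rows = [
        {"age": rng.randint(18, 90), "a1c": round(rng.uniform(4.5, 12.0), 1)}
        for _ in range(n)
    ]
    # sort_keys fixes the key order so the serialized output is stable.
    return json.dumps(rows, sort_keys=True).encode("utf-8")


print(generate(42) == generate(42))  # True
```

Because the output depends only on the seed and the row count, recording those two values (as a manifest does) is enough to regenerate the same toy dataset later.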
Legal & compliance
- data minimisation
- collecting only the data actually needed and no more
- data portability
- the right to receive your data in a reusable format so you can move it elsewhere
- data subject
- the real, identifiable person a piece of personal data is about (a GDPR term)
- de-identification / anonymized data
- real data with names and other identifiers stripped or masked so individuals can't be recognized — contrasted with fully made-up synthetic data
- DPDP
- Digital Personal Data Protection Act — India's data-privacy law
- GDPR
- General Data Protection Regulation — the EU's data-privacy law
- HIPAA
- Health Insurance Portability and Accountability Act — the US law protecting patient health-data privacy
- legitimate interests / performance of a contract / legal obligation
- the lawful reasons (legal bases) privacy law allows for using your data — e.g. to deliver the service, for sensible business needs, or because the law requires it
- masking / tokenized
- hiding or replacing sensitive values in real records with stand-in placeholders (e.g. blacking out a real SSN)
- PHI / protected health information
- real, identifiable medical data about a real person that privacy law protects
- PII / personally identifiable information
- data that can identify a real person (name, address, SSN, etc.)
- processor
- a company that handles data on our behalf and under our instructions (a GDPR/DPDP role)
- Pvt Ltd
- Private Limited — an Indian company type (similar to a US LLC/Inc.)
- re-identification risk / residual linkage risk
- the leftover chance someone could match a supposedly anonymized record back to a real person, often by cross-matching another dataset
- Safe Harbor / Expert Determination
- HIPAA's two official de-identification methods: Safe Harbor removes 18 specified identifiers; Expert Determination has a statistician certify the re-identification risk is very small
- Standard Contractual Clauses
- EU-approved legal contract terms that protect personal data sent to other countries