Longitudinal Progression in Synthetic Type-2 Diabetes: A 2,500-Patient Timeline Study
We generated 20-year disease timelines for 2,500 synthetic type-2-diabetes patients — 64,269 person-years — and tested whether they reproduce the natural history of the disease. They do. Kidney function falls 2.2 mL/min/1.73m² per year, HbA1c drifts from 6.3% to 8.3% despite treatment, the cohort shifts from 90% early-stage to 54% advanced-stage disease over two decades, and complications accrue on the UKPDS/WESDR timetable. This is a worked, reproducible demonstration that the longitudinal engine encodes progression, not just cross-sectional realism.
§ · Abstract
Background. SimpleIDGen's cross-sectional generator produces a snapshot patient; its PersonTimeline engine extends each type-2-diabetes (T2DM) patient into a dated, year-by-year trajectory. We asked whether those trajectories reproduce the documented natural history of T2DM.
Methods. We generated 2,500 synthetic diagnosed/undiagnosed T2DM patients (seed 42) and simulated each forward 20 years, yielding 64,269 person-years in long format (one row per patient-year). Five drivers — HbA1c, eGFR, BMI, systolic BP, and diabetes duration — advance by benchmark-anchored annual drift plus heteroscedastic noise; ten dated clinical events fire as annual hazards; a monotone covariate function assigns a biological stage (1–5) each year. We measured driver trajectories by duration, biological-stage and albuminuria composition over time, cumulative complication incidence, the treatment cascade, renal outcomes, and mortality, then compared each against published cohorts.
Results. eGFR declined −2.2 mL/min/1.73m²/yr (0–20 yr); 69% of patients reached CKD stage 3 (eGFR <60) and 20% reached ESRD (<15). HbA1c rose monotonically 6.3%→8.3% over 20 years (secondary failure) despite metformin at diagnosis and insulin in 80% by year 20. Mean biological stage rose 1.14→3.40 by year 20 (2.9% still stage 1; 53.6% stage 4–5). Albuminuria progressed from 100% normoalbuminuric to 50% macroalbuminuric by year 25. Twenty-year cumulative incidence: retinopathy 84%, insulin initiation 80%, microalbuminuria 64%, CVD 47%, neuropathy 46%, macroalbuminuria 40%. Median time from diagnosis: metformin 0 yr, microalbuminuria 4 yr, insulin 11 yr.
Conclusions. The synthetic timelines reproduce the direction, magnitude, and timing of real T2DM progression — renal decline, glycemic drift, staged deterioration, and the complication cascade — with every rate traceable to a graded benchmark. They are suitable for prototyping longitudinal analyses, temporal ML, and cohort simulations; they are not a substitute for real outcome data where a specific effect size is the endpoint.
§ · Background
Most synthetic-data tools emit a static snapshot: a patient as they are today. But T2DM is a progressive disease — glycemic control drifts, kidneys decline, complications accumulate — and the questions that matter clinically are longitudinal. Testing a risk model, a care-gap alert, or a temporal ML pipeline needs patients who change over time in believable ways.
The PersonTimeline engine turns each cross-sectional T2DM patient into a dated trajectory anchored on that snapshot. It runs in two regimes. The reconstructed-history region (from diagnosis to "today") back-dates the patient's known snapshot state — the events that must have happened to arrive there. The forward region (past today) simulates the future stochastically, drawing each year's drift and each complication as an annual hazard. This study interrogates the forward-and-back trajectory of a whole cohort and asks a single question: does it look like real diabetes?
§ · Methods
Cohort. We generated 2,500 patients from the diabetic cohort preset (85% diagnosed / 15% undiagnosed T2DM, age 38–95, 55% male) at seed 42, and built a timeline for each with a 20-year forward horizon and 1-year cycles. Every patient's timeline runs from their diagnosis year (cycle 0) to death or the horizon, producing 64,269 patient-years (mean 25.7 per patient). The generator is deterministic: the same seed reproduces this cohort byte-for-byte.
The five time-varying drivers. Each advances annually by a mean drift plus heteroscedastic Gaussian noise, modulated by a patient-fixed progression tempo τ = 0.7 + 0.6·propensity ∈ [0.7, 1.3] (mean ≈ 1.0):
| Driver | Annual drift (×τ unless noted) | Noise SD | Anchor |
|---|---|---|---|
| HbA1c | +0.20%/yr, or +0.10 on insulin; −1.5% step at insulin start | 0.083 × A1c (CV 8.3%) | UKPDS 33/34 |
| eGFR | −1.9 / −2.1 / −3.15 by normo / micro / macro albuminuria | 0.055 × eGFR (CV 5.5%) | CRIC·DCCT·EDIC 2019 |
| BMI | +0.05 kg/m²/yr | 1.0 | Looker 2001 |
| Systolic BP | −1.0 (treated) / +1.0 (untreated) mmHg/yr (not τ-scaled) | 11.0 | DIAB-CORE 2015 |
| Diabetes duration | +1 yr / cycle (deterministic, no noise) | — | time axis |
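The update rule in the table can be sketched in a few lines. This is a minimal illustration, not the engine's code: the function names are hypothetical, propensity is assumed to lie in [0, 1], the noise is assumed to be added to the drifted value and carried forward, and the floor at zero is an assumption.

```python
import random

# Mean annual eGFR drift (mL/min/1.73m2 per year) by albuminuria tier, from the table above.
EGFR_DRIFT = {"normo": -1.9, "micro": -2.1, "macro": -3.15}


def tempo(propensity: float) -> float:
    """Patient-fixed tempo tau = 0.7 + 0.6 * propensity, for propensity in [0, 1]."""
    return 0.7 + 0.6 * propensity


def step_egfr(egfr: float, tier: str, tau: float, rng: random.Random) -> float:
    """Advance eGFR by one cycle: tau-scaled mean drift plus Gaussian noise."""
    drift = EGFR_DRIFT[tier] * tau
    noise = rng.gauss(0.0, 0.055 * egfr)  # heteroscedastic: SD is 5.5% of the current value
    return max(egfr + drift + noise, 0.0)  # assumption: eGFR is floored at zero


def step_sbp(sbp: float, treated: bool, rng: random.Random) -> float:
    """Advance systolic BP by one cycle; this driver is not tau-scaled."""
    drift = -1.0 if treated else 1.0
    return sbp + drift + rng.gauss(0.0, 11.0)


rng = random.Random(42)  # seeded so the sketch is repeatable
tau = tempo(0.5)         # a mid-range patient, tau close to 1.0
next_egfr = step_egfr(100.0, "macro", tau, rng)
```

Because the noise SD scales with the current value, a patient at eGFR 100 sees roughly twice the year-to-year scatter of a patient at eGFR 50, while the mean drift depends only on albuminuria tier and tempo.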
The event hazards. In the forward region, ten dated events fire. Complications are drawn as independent annual hazards (each × τ), and treatment escalates on glycemic rules:
| Event | Trigger | Annual rate (×τ) | Anchor |
|---|---|---|---|
| Metformin start | at diagnosis (cycle 0) | — | ADA 2025 · UKPDS 34 |
| Normo → microalbuminuria | tier = normo | 2.0% | UKPDS 64 |
| Micro → macroalbuminuria | tier = micro | 2.8% | UKPDS 64 |
| Retinopathy onset | not yet present | 11% | WESDR XIV |
| Neuropathy onset | not yet present | 2.5% | DPN meta-analysis |
| CVD event | not yet present | 2.7% | CARDS |
| Oral intensification | A1c > 8% for ≥ 1 yr | rule | UKPDS 33 |
| Insulin start | A1c ≥ 9%, or > 8% for 3 yr | rule | UKPDS 26/34 |
| Stage progression | a covariate crosses a stage threshold | derived | KDIGO · ADA |
| Death | imported from the record, placed on the calendar | — | NCHS life-table |
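To make the two kinds of trigger concrete, here is a hedged sketch of one hazard draw and one glycemic rule. The helper names are hypothetical, and "> 8% for 3 yr" is read here as three consecutive annual values above 8%, which is an assumption about how the engine counts.

```python
import random


def fires(annual_rate: float, tau: float, rng: random.Random) -> bool:
    """Draw one independent annual hazard, scaled by the patient's tempo tau."""
    return rng.random() < annual_rate * tau


def step_albuminuria(tier: str, tau: float, rng: random.Random) -> str:
    """Move at most one tier per cycle, using the UKPDS 64 anchored rates in the table."""
    if tier == "normo" and fires(0.020, tau, rng):
        return "micro"
    if tier == "micro" and fires(0.028, tau, rng):
        return "macro"
    return tier  # macroalbuminuria is absorbing in this sketch


def insulin_due(a1c_history: list[float]) -> bool:
    """Insulin rule: latest A1c >= 9%, or A1c > 8% in each of the last three cycles."""
    if not a1c_history:
        return False
    if a1c_history[-1] >= 9.0:
        return True
    last3 = a1c_history[-3:]
    return len(last3) == 3 and all(a > 8.0 for a in last3)
```

Complications are stochastic and tempo-scaled, whereas treatment escalation is a deterministic function of the A1c history, so two patients with identical A1c paths start insulin in the same cycle.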
Biological staging. Each year a monotone, non-decreasing function assigns stage 1–5 from the rounded covariates (first match wins, most-severe first): 5 if eGFR < 15 (ESRD); 4 if eGFR < 30 or a CVD event has occurred; 3 if eGFR < 60 or macroalbuminuria; 2 if A1c > 8% or microalbuminuria; 1 otherwise. Because a stage step requires a covariate to cross a threshold, staging and the driver values are concordant by construction. Stage is derived from the emitted (rounded) values, so a reviewer can re-derive it from the serialized row.
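The staging rule is simple enough to re-derive from a serialized row. The sketch below follows the thresholds stated above; the function names and the string coding of the albuminuria tier are assumptions, and the `max` ratchet is only one possible way to guarantee that stage never decreases.

```python
def biological_stage(egfr: float, a1c: float, albuminuria: str, cvd_event: bool) -> int:
    """Assign stage 1-5 from rounded covariates; most severe rule first, first match wins."""
    if egfr < 15:
        return 5  # ESRD
    if egfr < 30 or cvd_event:
        return 4
    if egfr < 60 or albuminuria == "macro":
        return 3
    if a1c > 8.0 or albuminuria == "micro":
        return 2
    return 1


def next_stage(previous_stage: int, egfr: float, a1c: float,
               albuminuria: str, cvd_event: bool) -> int:
    """Assumption: keep stage non-decreasing by never dropping below last year's stage."""
    return max(previous_stage, biological_stage(egfr, a1c, albuminuria, cvd_event))
```

For example, `biological_stage(57, 8.3, "normo", False)` returns 3: the eGFR rule matches before the A1c rule is consulted.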
Analysis. Driver trajectories are population means by integer diabetes duration. Stage and albuminuria composition are cross-sectional snapshots at 0/5/10/15/20/25 years. Cumulative incidence uses an at-risk denominator (patients whose timeline reaches duration t). Time-to-event is the median duration among patients who experienced it. The eGFR slope is an ordinary-least-squares fit over person-years at 0–20 years. Analysis was performed offline in Python; the figures below carry the measured values.
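The three estimators described above can be sketched without any dependencies. This is an illustrative reading of the definitions, not the analysis script itself; the tuple layout `(last_duration, event_duration_or_None)` is a hypothetical input format.

```python
from statistics import median


def cumulative_incidence(patients, t):
    """Share with the event by duration t, among patients whose timeline reaches t."""
    at_risk = [(last, ev) for last, ev in patients if last >= t]
    if not at_risk:
        return float("nan")
    hits = sum(1 for _, ev in at_risk if ev is not None and ev <= t)
    return hits / len(at_risk)


def median_time_to_event(patients):
    """Median duration at the event, among patients who experienced it."""
    return median(ev for _, ev in patients if ev is not None)


def ols_slope(xs, ys):
    """Ordinary-least-squares slope of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy input: (last observed duration, duration at event or None).
patients = [(20, 5), (20, None), (8, 3), (25, 15)]
ci_10 = cumulative_incidence(patients, 10)  # the patient censored at 8 yr leaves the denominator
```

Note that the median is conditional on having the event, so it is not a survival-curve median: it can sit well below the time at which half the cohort has been affected.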
§ · Results
Table 1 — cohort at diagnosis (reconstructed cycle 0).
| Characteristic | Mean ± SD | Median |
|---|---|---|
| Age at diagnosis (yr) | 51.6 ± 13.9 | 50 |
| HbA1c (%) | 6.30 ± 0.94 | 5.9 |
| BMI (kg/m²) | 32.2 ± 6.9 | 31.4 |
| eGFR (mL/min/1.73m²) | 100.5 ± 19.7 | 101.7 |
| Systolic BP (mmHg) | 130.2 ± 18.1 | 132 |
| Follow-up (yr) | 24.7 | 25 |
| Deaths in window | 209 (8.4%), median age 73 | |
Renal and glycemic trajectories. The two signature curves of T2DM are both present and monotone (Fig 1). eGFR falls in a near-straight line from 100 to 57 mL/min/1.73m² over 20 years — a slope of −2.2 — carrying the average patient across the CKD stage-2 and stage-3 thresholds; 69.4% of patients reach eGFR < 60 at some point, 33.2% reach < 30, and 20.2% reach ESRD (< 15). HbA1c drifts the other way: from 6.3% at diagnosis to 8.3% at 20 years and 9.0% at 25, the textbook picture of secondary failure — glycemic control erodes despite metformin at diagnosis and insulin in most patients.
Staged deterioration. The cohort visibly ages into the disease (Fig 2a). At diagnosis 90% of patients are stage 1; by year 10 the mode has moved to stage 4, and by year 20 just 2.9% remain stage 1 while 53.6% are stage 4–5. Mean biological stage climbs 1.14 → 2.20 → 3.10 → 3.40 at years 0/5/15/20. Albuminuria tells the renal half of the same story (Fig 2b): the cohort starts fully normoalbuminuric and, by year 25, half is macroalbuminuric — the pipeline that drives the steepening eGFR slope.
The complication cascade. Complications accumulate in the order and on the schedule the literature predicts (Fig 3). Retinopathy is the fastest and most common (53% by 10 yr, 84% by 20 yr). Insulin use nearly catches up with it late (80% by 20 yr) as glycemic control fails. Microalbuminuria comes early (median 4 yr) and plateaus as patients convert to macroalbuminuria; CVD and neuropathy track together near 2.5–2.7% per year. Read as a timetable (Fig 4), the diagnosis-to-milestone medians are: metformin at diagnosis, microalbuminuria at 4 years, macroalbuminuria at 8, retinopathy and oral intensification at 9, neuropathy at 10, and insulin at 11.
§ · Validation against published cohorts
Every rate in the engine is anchored to a graded real-world source. The measured cohort behaviour lands where those anchors predict:
| Metric | This cohort | Published benchmark | Source |
|---|---|---|---|
| eGFR decline (0–20 yr) | −2.2 mL/min/1.73m²/yr | −1.9 to −3.15 (by albuminuria tier) | CRIC·DCCT·EDIC, Diab Care 2019 |
| HbA1c drift | +0.10%/yr (6.3→8.3 / 20 yr) | +0.1 to +0.3%/yr | UKPDS 33; Turner 1999 |
| Insulin initiation | median 11 yr; 80% by 20 yr | median ~9–11 yr | UKPDS 26; US family practice |
| Retinopathy | 53% @10 yr, 84% @20 yr | 67% @10 yr (non-insulin T2DM) | WESDR XIV, 1994 |
| Neuropathy incidence | 2.5%/yr; 28% @10 yr | 24–27 / 1000 patient-yr | DPN meta-analysis (n=95,604) |
| Albuminuria normo→micro | median onset 4 yr | 2.0%/yr transition | UKPDS 64, 2003 |
| ESRD (eGFR < 15) reached | 20.2% of patients | staging cutoffs | KDIGO 2024 |
The population eGFR slope (−2.2) is a blend of the tiered anchors, steepening as the cohort converts to macroalbuminuria — exactly the mechanism the tiered slopes encode.
§ · Discussion
Three things make these trajectories usable rather than merely plausible. First, they are concordant: because biological stage is a function of the same covariates it summarizes, a patient never sits in stage 4 with stage-1 labs — the staging and the numbers cannot disagree. Second, they are monotone where the disease is monotone (stage, duration, cumulative complications never reverse) and drift where the disease drifts (glycemia, kidney function, blood pressure), so temporal features behave. Third, every rate is auditable — the drift and hazard constants trace one-to-one to UKPDS, WESDR, the CRIC/DCCT/EDIC pooled analysis, and ADA/KDIGO guidelines, so a reviewer can check the engine against its sources.
The practical upshot: you can prototype a longitudinal risk model, a renal-decline alert, or a temporal ML pipeline on this cohort and expect it to encounter the same relationships — accelerating eGFR loss, glycemic secondary failure, the retinopathy-first complication order — that it would meet in real EHR data, with none of the access constraints.
§ · Limitations
This is a calibrated model, and the honest boundaries matter:
- Reconstructed baselines skew optimistic. Because cycle 0 is obtained by removing mean drift (no noise) from the snapshot, reconstructed diagnosis-year eGFR (mean 100) runs higher, and HbA1c (6.3%) lower, than typical newly-diagnosed values (~75–90 and ~7–8%). The forward trajectory is the calibrated part; the deep-history baseline is an interpolation.
- Oral intensification fires late. The engine intensifies when A1c exceeds 8% (median 9 yr here), whereas guidelines intensify at the 7% target — so mid-disease treatment is later than a guideline-adherent clinic.
- CVD is a prevalent-cohort figure. The 31% 10-year cumulative incidence includes events back-dated from patients who already carried cardiovascular history at the snapshot, so it exceeds the ~12% 10-year risk of a newly-diagnosed cohort.
- Mortality is imported, not modelled as a hazard. Death is placed from the patient's cross-sectional record; the timeline does not add a diabetes excess-mortality multiplier of its own.
- Simplifications: systolic BP is the one stochastic driver not scaled by τ; there is no separate insulin-initiation BMI bump; macroalbuminuria's late renal decline is realized through the steeper eGFR slope rather than a discrete "elevated-creatinine" event.
- One seed. These figures are a single 2,500-patient draw; the relationships are structural, but exact decimals will vary with cohort size and seed.
None of these change the headline: the direction, ordering, and calibrated magnitude of progression are real. Validate against real outcome data before publishing an effect size that hinges on one of the boundaries above.
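The first limitation, back-dating by removing mean drift, can be made concrete with a small sketch. It assumes τ = 1, a single albuminuria tier across the whole history, and no insulin step; the function name and the example patient are hypothetical.

```python
def reconstruct_baseline(snapshot: float, mean_drift_per_year: float, years_since_dx: int) -> float:
    """Back-date a driver to cycle 0 by removing accumulated mean drift (no noise)."""
    return snapshot - mean_drift_per_year * years_since_dx


# Hypothetical patient, 8 years after diagnosis, normoalbuminuric and not on insulin.
egfr_at_dx = reconstruct_baseline(85.0, -1.9, 8)  # 85 + 1.9 * 8
a1c_at_dx = reconstruct_baseline(7.9, 0.20, 8)    # 7.9 - 0.20 * 8
```

Under these assumptions a snapshot eGFR of 85 maps to about 100 at diagnosis and a snapshot HbA1c of 7.9% maps to about 6.3%, which illustrates how a noise-free back-cast can land on a healthier baseline than newly diagnosed patients typically show.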
§ · Reproduce it
Generation is deterministic. This produces the exact 2,500-patient, 64,269-row cohort analyzed above:
# person CLI — long-format timelines, one row per patient-year
person-cli -n 2500 -s 42 --cohort diabetic --timeline 20 -f jsonl > t2dm_timeline.jsonl
# or via the API (see the Quickstart): POST /v1/datasets/t2dm
# { "clientId": "...", "count": 2500, "seed": 42, "timeline": true, "horizonYears": 20 }
Each row carries: id, index, date, age, diabetes_duration_years, biological_stage, a1c, egfr, bmi, systolic_bp, albuminuria, on_insulin, and dated events. Load into pandas and group by diabetes_duration_years to reproduce every figure. New to the API? Start with the developer Quickstart, or see the cross-sectional companion, "What drives HbA1c".
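As a rough sketch of the group-by step, the snippet below computes a mean driver curve by diabetes duration using only the standard library. The field names come from the row description above; the in-memory rows are toy values for illustration only, and the helper name is hypothetical.

```python
import io
import json
from collections import defaultdict
from statistics import mean


def mean_by_duration(lines, field):
    """Population mean of `field` at each integer diabetes duration."""
    buckets = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL stream
        row = json.loads(line)
        buckets[row["diabetes_duration_years"]].append(row[field])
    return {duration: mean(values) for duration, values in sorted(buckets.items())}


# Toy rows for illustration only; real rows come from t2dm_timeline.jsonl.
sample = io.StringIO(
    '{"id": "a", "diabetes_duration_years": 0, "egfr": 100.0}\n'
    '{"id": "b", "diabetes_duration_years": 0, "egfr": 90.0}\n'
    '{"id": "a", "diabetes_duration_years": 1, "egfr": 98.0}\n'
)
egfr_curve = mean_by_duration(sample, "egfr")
```

With the real file, pass an open file handle instead of `sample`. In pandas the equivalent should be `pd.read_json("t2dm_timeline.jsonl", lines=True).groupby("diabetes_duration_years")["egfr"].mean()`, assuming the rows are flat enough to load directly.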
§ · References
- UK Prospective Diabetes Study (UKPDS) 33. Intensive blood-glucose control. Lancet 1998;352:837–853.
- UKPDS 34. Effect of intensive blood-glucose control with metformin. Lancet 1998;352:854–865.
- Adler AI, et al. (UKPDS 64) Development and progression of nephropathy in type 2 diabetes. Kidney Int. 2003;63:225–232.
- Klein R, et al. (WESDR XIV) The Wisconsin Epidemiologic Study of Diabetic Retinopathy. Ophthalmology 1994;101:1061–1070.
- eGFR decline by albuminuria in diabetes (pooled CRIC / DCCT-EDIC / Framingham). Diabetes Care 2019.
- Colhoun HM, et al. (CARDS) Primary prevention of cardiovascular disease with atorvastatin in type 2 diabetes. Lancet 2004;364:685–696.
- Diabetes Prevention Program (DPP/DPPOS) placebo-arm progression. Diabetes Care 2025.
- American Diabetes Association. Standards of Care in Diabetes — 2025 (§2, §9, §11).
- KDIGO 2024 Clinical Practice Guideline for CKD Evaluation and Management.
- Gregg EW, et al. Prevalence of lower-extremity disease in the US diabetic population. Diabetes Care 2004.
- CDC National Diabetes Statistics Report / NDSS, 2024.
Every engine constant maps to a graded anchor in the project's T2DM progression benchmark library. Figures reflect a July 2026 run of the deterministic engine.