§ · Research note

Longitudinal Progression in Synthetic Type-2 Diabetes: A 2,500-Patient Timeline Study

We generated 20-year disease timelines for 2,500 synthetic type-2-diabetes patients — 64,269 person-years — and tested whether they reproduce the natural history of the disease. They do. Kidney function falls 2.2 mL/min/1.73m² per year, HbA1c drifts from 6.3% to 8.3% despite treatment, the cohort shifts from 90% early-stage to 54% advanced-stage disease over two decades, and complications accrue on the UKPDS/WESDR timetable. This is a worked, reproducible demonstration that the longitudinal engine encodes progression, not just cross-sectional realism.

§ · Abstract

Background. SimpleIDGen's cross-sectional generator produces a snapshot patient; its PersonTimeline engine extends each type-2-diabetes (T2DM) patient into a dated, year-by-year trajectory. We asked whether those trajectories reproduce the documented natural history of T2DM.

Methods. We generated 2,500 synthetic diagnosed/undiagnosed T2DM patients (seed 42) and simulated each forward 20 years, yielding 64,269 person-years in long format (one row per patient-year). Five drivers — HbA1c, eGFR, BMI, systolic BP, and diabetes duration — advance by benchmark-anchored annual drift plus heteroscedastic noise; ten dated clinical events fire as annual hazards; a monotone covariate function assigns a biological stage (1–5) each year. We measured driver trajectories by duration, biological-stage and albuminuria composition over time, cumulative complication incidence, the treatment cascade, renal outcomes, and mortality, then compared each against published cohorts.

Results. eGFR declined −2.2 mL/min/1.73m²/yr (0–20 yr); 69% of patients reached CKD stage 3 (eGFR <60) and 20% reached ESRD (<15). HbA1c rose monotonically 6.3%→8.3% over 20 years (secondary failure) despite metformin at diagnosis and insulin in 82%. Mean biological stage rose 1.14→3.40 by year 20 (2.9% still stage 1; 53.6% stage 4–5). Albuminuria progressed from 100% normoalbuminuric to 50% macroalbuminuric by year 25. Twenty-year cumulative incidence: retinopathy 84%, insulin initiation 80%, microalbuminuria 64%, CVD 47%, neuropathy 46%, macroalbuminuria 40%. Median time from diagnosis: metformin 0 yr, microalbuminuria 4 yr, insulin 11 yr.

Conclusions. The synthetic timelines reproduce the direction, magnitude, and timing of real T2DM progression — renal decline, glycemic drift, staged deterioration, and the complication cascade — with every rate traceable to a graded benchmark. They are suitable for prototyping longitudinal analyses, temporal ML, and cohort simulations; they are not a substitute for real outcome data where a specific effect size is the endpoint.

§ · Background

Most synthetic-data tools emit a static snapshot: a patient as they are today. But T2DM is a progressive disease — glycemic control drifts, kidneys decline, complications accumulate — and the questions that matter clinically are longitudinal. Testing a risk model, a care-gap alert, or a temporal ML pipeline needs patients who change over time in believable ways.

The PersonTimeline engine turns each cross-sectional T2DM patient into a dated trajectory anchored on that snapshot. It runs in two regimes. The reconstructed-history region (from diagnosis to "today") back-dates the patient's known snapshot state — the events that must have happened to arrive there. The forward region (past today) simulates the future stochastically, drawing each year's drift and each complication as an annual hazard. This study interrogates the forward-and-back trajectory of a whole cohort and asks a single question: does it look like real diabetes?
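The reconstructed-history region can be pictured with a small back-cast. The Limitations section notes that cycle 0 is obtained by removing mean drift (no noise) from the snapshot; the sketch below applies that idea to a single driver. The function name `backcast` and its signature are hypothetical, and the real engine also has to handle treatment steps and albuminuria tier changes that this ignores.

```python
def backcast(value_today: float, annual_drift: float, tau: float,
             years_since_dx: int) -> list:
    """Noise-free yearly values from diagnosis (index 0) to today (last index).

    Assumption: history is rebuilt by subtracting mean drift * tau for each
    year between the row and the snapshot, with no noise added.
    """
    return [
        value_today - annual_drift * tau * (years_since_dx - k)
        for k in range(years_since_dx + 1)
    ]

# Example: eGFR of 81 today, normoalbuminuric drift of -1.9 per year,
# tempo 1.0, diagnosed 10 years ago.
egfr_history = backcast(81.0, -1.9, 1.0, 10)
```

Under these assumptions the list starts near 100 at diagnosis and ends at the snapshot value of 81, which is why reconstructed baselines look cleaner than measured ones.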

§ · Methods

Cohort. We generated 2,500 patients from the diabetic cohort preset (85% diagnosed / 15% undiagnosed T2DM, age 38–95, 55% male) at seed 42, and built a timeline for each with a 20-year forward horizon and 1-year cycles. Every patient's timeline runs from their diagnosis year (cycle 0) to death or the horizon, producing 64,269 person-years (mean 25.7 per patient). The generator is deterministic: the same seed reproduces this cohort byte-for-byte.

The five time-varying drivers. Each advances annually by a mean drift plus heteroscedastic Gaussian noise, modulated by a patient-fixed progression tempo τ = 0.7 + 0.6·propensity ∈ [0.7, 1.3] (mean ≈ 1.0):

| Driver | Annual drift (×τ unless noted) | Noise SD | Anchor |
|---|---|---|---|
| HbA1c | +0.20%/yr, or +0.10 on insulin; −1.5% step at insulin start | 0.07 × A1c (CV 7%) | UKPDS 33/34 |
| eGFR | −1.9 / −2.1 / −3.15 mL/min/1.73m²/yr by normo / micro / macro albuminuria | 0.055 × eGFR (CV 5.5%) | CRIC·DCCT·EDIC 2019 |
| BMI | +0.05 kg/m²/yr | 1.0 | Looker 2001 |
| Systolic BP | −1.0 (treated) / +1.0 (untreated) mmHg/yr (not τ-scaled) | 11.0 | DIAB-CORE 2015 |
| Diabetes duration | +1 yr / cycle (deterministic) | none | time axis |
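A minimal Python sketch of one annual cycle for the drivers listed above. The names (`tempo`, `drift_step`, `observe`, `bp_treated`) are illustrative assumptions, not the engine's API. Splitting the update into a noise-free drift step and a separate noisy observation is one reading of "drift plus heteroscedastic noise"; the engine may combine them differently.

```python
import numpy as np

# Drift constants copied from the driver table.
EGFR_DRIFT = {"normo": -1.9, "micro": -2.1, "macro": -3.15}  # per year

def tempo(propensity: float) -> float:
    """Patient-fixed tempo: 0.7 + 0.6 * propensity, propensity in [0, 1]."""
    return 0.7 + 0.6 * propensity

def drift_step(state: dict, tau: float, insulin_started: bool = False) -> dict:
    """Advance the noise-free driver values by one annual cycle."""
    nxt = dict(state)
    on_insulin = state["on_insulin"] or insulin_started
    nxt["on_insulin"] = on_insulin
    # Assumption: the slower on-insulin drift applies from the start year.
    a1c_drift = 0.10 if on_insulin else 0.20
    step = -1.5 if insulin_started else 0.0  # one-off drop at insulin start
    nxt["a1c"] = state["a1c"] + a1c_drift * tau + step
    nxt["egfr"] = state["egfr"] + EGFR_DRIFT[state["albuminuria"]] * tau
    nxt["bmi"] = state["bmi"] + 0.05 * tau
    # Systolic BP is the one driver not scaled by tau.
    nxt["systolic_bp"] = state["systolic_bp"] + (-1.0 if state["bp_treated"] else 1.0)
    nxt["diabetes_duration_years"] = state["diabetes_duration_years"] + 1
    return nxt

def observe(state: dict, rng: np.random.Generator) -> dict:
    """Draw noisy values around the drifted means (SD scales with level for A1c, eGFR)."""
    return {
        "a1c": rng.normal(state["a1c"], 0.07 * state["a1c"]),
        "egfr": rng.normal(state["egfr"], 0.055 * state["egfr"]),
        "bmi": rng.normal(state["bmi"], 1.0),
        "systolic_bp": rng.normal(state["systolic_bp"], 11.0),
    }

start = {"a1c": 7.0, "egfr": 90.0, "bmi": 30.0, "systolic_bp": 130.0,
         "diabetes_duration_years": 3, "albuminuria": "normo",
         "on_insulin": False, "bp_treated": True}
after_one_year = drift_step(start, tau=tempo(0.5))
noisy_row = observe(after_one_year, np.random.default_rng(42))
```

Keeping `drift_step` deterministic makes the drift constants easy to check against the table, while `observe` isolates the random part behind a seeded generator.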

The event hazards. In the forward region, ten dated events fire. Complications are drawn as independent annual hazards (each × τ), and treatment escalates on glycemic rules:

| Event | Trigger | Annual rate (×τ) | Anchor |
|---|---|---|---|
| Metformin start | at diagnosis (cycle 0) | | ADA 2025 · UKPDS 34 |
| Normo → microalbuminuria | tier = normo | 2.0% | UKPDS 64 |
| Micro → macroalbuminuria | tier = micro | 2.8% | UKPDS 64 |
| Retinopathy onset | not yet present | 11% | WESDR XIV |
| Neuropathy onset | not yet present | 2.5% | DPN meta-analysis |
| CVD event | not yet present | 2.7% | CARDS |
| Oral intensification | A1c > 8% for ≥ 1 yr | rule | UKPDS 33 |
| Insulin start | A1c ≥ 9%, or > 8% for 3 yr | rule | UKPDS 26/34 |
| Stage progression | a covariate crosses a stage threshold | derived | KDIGO · ADA |
| Death | imported from the record, placed on the calendar | | NCHS life-table |
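The hazard and rule rows above can be sketched as one yearly draw per patient. This is an illustration under stated assumptions, not the engine's code: the function names and dictionary keys are hypothetical, each hazard is treated as a Bernoulli draw with probability rate × τ, at most one albuminuria step is allowed per year, and "> 8% for 3 yr" is read as three consecutive yearly values.

```python
import numpy as np

# Annual rates copied from the event table.
ONSET_HAZARD = {"retinopathy": 0.11, "neuropathy": 0.025, "cvd": 0.027}
ALBUMINURIA_STEP = {"normo": ("micro", 0.020), "micro": ("macro", 0.028)}

def draw_complications(patient: dict, tau: float, rng) -> list:
    """Fire this year's complication events in place; return their names."""
    fired = []
    for name, rate in ONSET_HAZARD.items():
        # Only patients without the complication are at risk.
        if not patient[name] and rng.random() < rate * tau:
            patient[name] = True
            fired.append(name)
    step = ALBUMINURIA_STEP.get(patient["albuminuria"])  # None once macro
    if step is not None:
        next_tier, rate = step
        if rng.random() < rate * tau:
            patient["albuminuria"] = next_tier
            fired.append(next_tier + "albuminuria")
    return fired

def insulin_due(a1c_by_year: list) -> bool:
    """Rule from the table: latest A1c >= 9%, or > 8% in each of the last 3 years."""
    last3 = a1c_by_year[-3:]
    return a1c_by_year[-1] >= 9.0 or (len(last3) == 3 and all(v > 8.0 for v in last3))

rng = np.random.default_rng(42)
patient = {"retinopathy": False, "neuropathy": False, "cvd": False,
           "albuminuria": "normo"}
events_this_year = draw_complications(patient, tau=1.0, rng=rng)
```

Passing the generator in explicitly keeps the draw reproducible from a seed, in the same spirit as the deterministic cohort generation described in Methods.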

Biological staging. Each year a monotone, non-decreasing function assigns stage 1–5 from the rounded covariates (first match wins, most-severe first): 5 if eGFR < 15 (ESRD); 4 if eGFR < 30 or a CVD event has occurred; 3 if eGFR < 60 or macroalbuminuria; 2 if A1c > 8% or microalbuminuria; 1 otherwise. Because a stage step requires a covariate to cross a threshold, staging and the driver values are concordant by construction. Stage is derived from the emitted (rounded) values, so a reviewer can re-derive it from the serialized row.
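The staging rule above translates directly into a first-match cascade. In this sketch the function names and the row keys `cvd_event` and `albuminuria` values are assumptions. The running maximum in `stage_path` is also an assumption about how non-decreasing staging is kept when a covariate moves back across a threshold (for example, A1c after an insulin start); the text does not spell out that mechanism.

```python
def stage_from_row(egfr: float, a1c: float, albuminuria: str, cvd_event: bool) -> int:
    """First match wins, most severe first, as described in the staging rule."""
    if egfr < 15:
        return 5  # ESRD
    if egfr < 30 or cvd_event:
        return 4
    if egfr < 60 or albuminuria == "macro":
        return 3
    if a1c > 8.0 or albuminuria == "micro":
        return 2
    return 1

def stage_path(rows: list) -> list:
    """Stage per year, kept non-decreasing with a running maximum (assumption)."""
    stages, highest = [], 1
    for row in rows:
        raw = stage_from_row(row["egfr"], row["a1c"], row["albuminuria"], row["cvd_event"])
        highest = max(highest, raw)
        stages.append(highest)
    return stages

# Illustrative four-year history; values are invented for the example.
history = [
    {"egfr": 95, "a1c": 6.5, "albuminuria": "normo", "cvd_event": False},
    {"egfr": 88, "a1c": 8.4, "albuminuria": "normo", "cvd_event": False},
    {"egfr": 85, "a1c": 7.0, "albuminuria": "normo", "cvd_event": False},
    {"egfr": 58, "a1c": 7.4, "albuminuria": "normo", "cvd_event": False},
]
stages = stage_path(history)
```

In the third year the raw rule alone would return stage 1 because A1c has dropped below 8%; the running maximum holds the patient at stage 2 until eGFR crosses 60.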

Analysis. Driver trajectories are population means by integer diabetes duration. Stage and albuminuria composition are cross-sectional snapshots at 0/5/10/15/20/25 years. Cumulative incidence uses an at-risk denominator (patients whose timeline reaches duration t). Time-to-event is the median duration among patients who experienced it. The eGFR slope is an ordinary-least-squares fit over person-years at 0–20 years. Analysis was performed offline in Python; the figures below carry the measured values.
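The three analysis definitions above can be sketched in pandas. The column names `id`, `diabetes_duration_years`, `on_insulin`, and `egfr` come from the row description in the reproduction section; the function names and the toy frame are illustrative, and this is not the authors' offline analysis script.

```python
import numpy as np
import pandas as pd

def cumulative_incidence(df: pd.DataFrame, flag: str, t: int) -> float:
    """Share of patients whose timeline reaches duration t with `flag` set by t."""
    last_seen = df.groupby("id")["diabetes_duration_years"].max()
    at_risk = last_seen[last_seen >= t].index  # at-risk denominator
    if len(at_risk) == 0:
        return float("nan")
    first_event = (df.loc[df[flag].astype(bool)]
                     .groupby("id")["diabetes_duration_years"].min())
    # Patients with no event become NaN after reindex and compare as False.
    reached = first_event.reindex(at_risk) <= t
    return float(reached.sum()) / len(at_risk)

def median_time_to_event(df: pd.DataFrame, flag: str) -> float:
    """Median first-event duration among patients who experienced the event."""
    first_event = (df.loc[df[flag].astype(bool)]
                     .groupby("id")["diabetes_duration_years"].min())
    return float(first_event.median())

def egfr_slope(df: pd.DataFrame, max_years: int = 20) -> float:
    """OLS slope of eGFR on duration over person-years at 0..max_years."""
    sub = df[df["diabetes_duration_years"] <= max_years]
    slope, _intercept = np.polyfit(sub["diabetes_duration_years"], sub["egfr"], 1)
    return float(slope)

# Toy long-format frame: p1 starts insulin at year 2, p2 never does,
# p3 starts at year 1 and leaves the data after year 1.
toy = pd.DataFrame(
    [{"id": "p1", "diabetes_duration_years": d, "on_insulin": d >= 2, "egfr": 100.0 - 2 * d} for d in range(5)]
    + [{"id": "p2", "diabetes_duration_years": d, "on_insulin": False, "egfr": 100.0 - 2 * d} for d in range(5)]
    + [{"id": "p3", "diabetes_duration_years": d, "on_insulin": d >= 1, "egfr": 100.0 - 2 * d} for d in range(2)]
)
```

On the toy frame, p3 counts in the denominator at year 1 but not at year 4, which is the effect of the at-risk denominator.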

Fig 5 · The biological-stage progression state machine
[Figure: five states in sequence. Stage 1: diagnosed / controlled. Stage 2: A1c > 8% or microalbuminuria. Stage 3: eGFR < 60 or macroalbuminuria. Stage 4: eGFR < 30 or CVD event. Stage 5: eGFR < 15 (ESRD).]
Annual hazards feed the covariates, covariates cross thresholds, and the stage steps up (monotone). eGFR falls −1.9 to −3.15 mL/min/yr (by albuminuria tier); A1c drifts +0.1–0.2%/yr; BMI rises +0.05/yr. Hazards are normo→micro 2.0%/yr, micro→macro 2.8%/yr, retinopathy 11%/yr, neuropathy 2.5%/yr, and CVD 2.7%/yr, each × tempo τ ∈ [0.7, 1.3]. Stage is a monotone covariate projection (never regresses); death is imported and right-censors the trajectory at any stage.

§ · Results

Table 1 — cohort at diagnosis (reconstructed cycle 0).

| Characteristic | Mean ± SD | Median |
|---|---|---|
| Age at diagnosis (yr) | 51.6 ± 13.9 | 50 |
| HbA1c (%) | 6.30 ± 0.94 | 5.9 |
| BMI (kg/m²) | 32.2 ± 6.9 | 31.4 |
| eGFR (mL/min/1.73m²) | 100.5 ± 19.7 | 101.7 |
| Systolic BP (mmHg) | 130.2 ± 18.1 | 132 |
| Follow-up (yr) | 24.7 | 25 |
| Deaths in window | 209 (8.4%), median age 73 | |

Renal and glycemic trajectories. The two signature curves of T2DM are both present and monotone (Fig 1). eGFR falls in a near-straight line from 100 to 57 mL/min/1.73m² over 20 years — a slope of −2.2 — carrying the average patient across the CKD stage-2 and stage-3 thresholds; 69.4% of patients reach eGFR < 60 at some point, 33.2% reach < 30, and 20.2% reach ESRD (< 15). HbA1c drifts the other way: from 6.3% at diagnosis to 8.3% at 20 years and 9.0% at 25, the textbook picture of secondary failure — glycemic control erodes despite metformin at diagnosis and insulin in most patients.

Fig 1a · Mean eGFR (mL/min/1.73m²) vs duration
[Figure: x-axis years since diagnosis (0–25); y-axis 30–110; reference lines at G2 90, G3 60, G4 30.]
Fig 1b · Mean HbA1c (%) vs duration
[Figure: x-axis years since diagnosis (0–25); y-axis 6–10; reference lines at target 7.0 and intensify 8.0.]

Staged deterioration. The cohort visibly ages into the disease (Fig 2a). At diagnosis 90% of patients are stage 1; by year 10 the mode has moved to stage 4, and by year 20 just 2.9% remain stage 1 while 53.6% are stage 4–5. Mean biological stage climbs 1.14 → 2.20 → 3.10 → 3.40 at years 0/5/15/20. Albuminuria tells the renal half of the same story (Fig 2b): the cohort starts fully normoalbuminuric and, by year 25, half is macroalbuminuric — the pipeline that drives the steepening eGFR slope.

Fig 2a · Biological stage share (%), x = years since dx
[Figure: share 0–100% at years 0–25. Legend: Stage 1 · controlled, Stage 2, Stage 3, Stage 4, Stage 5 · ESRD.]
Fig 2b · Albuminuria share (%), x = years since dx
[Figure: share 0–100% at years 0–25. Legend: normo, micro, macro.]

The complication cascade. Complications accumulate in the order and on the schedule the literature predicts (Fig 3). Retinopathy is the fastest and most common (53% by 10 yr, 84% by 20 yr). Insulin initiation nearly catches up late (80% by 20 yr) as glycemic control fails. Microalbuminuria comes early (median 4 yr) and plateaus as patients convert to macroalbuminuria; CVD and neuropathy track together near 2.5–2.7% per year. Read as a timetable (Fig 4), the diagnosis-to-milestone medians are: metformin at diagnosis, microalbuminuria at 4 years, macroalbuminuria at 8, retinopathy, oral intensification, and CVD events at 9, neuropathy at 10, and insulin at 11.

Fig 3 · Cumulative incidence (%) of complications and insulin initiation (at-risk denominator)
[Figure: x-axis years since diagnosis (0–20); y-axis 0–100%. Legend: Retinopathy, Insulin start, Microalbuminuria, Neuropathy, CVD event, Macroalbuminuria.]
Fig 4 · The therapeutic and complication cascade: median years from diagnosis (bar), % of cohort reaching it (label)

| Milestone | Median years from diagnosis | % of cohort reaching it |
|---|---|---|
| Metformin | 0 | 86 |
| Microalbuminuria | 4 | 65 |
| Macroalbuminuria | 8 | 42 |
| Retinopathy | 9 | 87 |
| Oral intensification | 9 | 72 |
| CVD event | 9 | 53 |
| Neuropathy | 10 | 50 |
| Insulin start | 11 | 82 |

§ · Validation against published cohorts

Every rate in the engine is anchored to a graded real-world source. The measured cohort behaviour lands where those anchors predict:

| Metric | This cohort | Published benchmark | Source |
|---|---|---|---|
| eGFR decline (0–20 yr) | −2.2 mL/min/1.73m²/yr | −1.9 to −3.15 (by albuminuria tier) | CRIC·DCCT·EDIC, Diab Care 2019 |
| HbA1c drift | +0.10%/yr (6.3→8.3 / 20 yr) | +0.1 to +0.3%/yr | UKPDS 33; Turner 1999 |
| Insulin initiation | median 11 yr; 80% by 20 yr | median ~9–11 yr | UKPDS 26; US family practice |
| Retinopathy | 53% @10 yr, 84% @20 yr | 67% @10 yr (non-insulin T2DM) | WESDR XIV, 1994 |
| Neuropathy incidence | 2.5%/yr; 28% @10 yr | 24–27 / 1000 patient-yr | DPN meta-analysis (n=95,604) |
| Albuminuria normo→micro | median onset 4 yr | 2.0%/yr transition | UKPDS 64, 2003 |
| ESRD (eGFR < 15) reached | 20.2% of patients | staging cutoffs | KDIGO 2024 |

The population eGFR slope (−2.2) is a blend of the tiered anchors, steepening as the cohort converts to macroalbuminuria — exactly the mechanism the tiered slopes encode.
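To make the blending argument concrete, here is a small sketch of a share-weighted mean of the tiered slopes. The function name is hypothetical and the shares in the example are invented for illustration, not measured cohort composition. The reported −2.2 is an OLS fit over person-years, so a simple weighted mean only approximates it.

```python
# Tiered eGFR drifts from the driver table (mL/min/1.73m² per year).
TIER_SLOPE = {"normo": -1.9, "micro": -2.1, "macro": -3.15}

def blended_slope(shares: dict) -> float:
    """Share-weighted mean slope; shares must sum to 1."""
    total = sum(shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("albuminuria shares must sum to 1")
    return sum(share * TIER_SLOPE[tier] for tier, share in shares.items())

# Hypothetical compositions, chosen only to show the direction of the effect.
early = blended_slope({"normo": 1.0})
later = blended_slope({"normo": 0.5, "micro": 0.2, "macro": 0.3})
```

With everyone normoalbuminuric the blend equals the −1.9 anchor; shifting weight toward the macro tier pulls it below −2, which is the steepening described above.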

§ · Discussion

Three things make these trajectories usable rather than merely plausible. First, they are concordant: because biological stage is a function of the same covariates it summarizes, a patient never sits in stage 4 with stage-1 labs — the staging and the numbers cannot disagree. Second, they are monotone where the disease is monotone (stage, duration, cumulative complications never reverse) and drift where the disease drifts (glycemia, kidney function, blood pressure), so temporal features behave. Third, every rate is auditable — the drift and hazard constants trace one-to-one to UKPDS, WESDR, the CRIC/DCCT/EDIC pooled analysis, and ADA/KDIGO guidelines, so a reviewer can check the engine against its sources.

The practical upshot: you can prototype a longitudinal risk model, a renal-decline alert, or a temporal ML pipeline on this cohort and expect it to encounter the same relationships — accelerating eGFR loss, glycemic secondary failure, the retinopathy-first complication order — that it would meet in real EHR data, with none of the access constraints.

§ · Limitations

This is a calibrated model, and the honest boundaries matter:

  • Reconstructed baselines skew optimistic. Because cycle 0 is obtained by removing mean drift (no noise) from the snapshot, reconstructed diagnosis-year eGFR (mean 100) runs higher, and HbA1c (6.3%) lower, than typical newly-diagnosed values (~75–90 and ~7–8%). The forward trajectory is the calibrated part; the deep-history baseline is an interpolation.
  • Oral intensification fires late. The engine intensifies when A1c exceeds 8% (median 9 yr here), whereas guidelines intensify at the 7% target — so mid-disease treatment is later than a guideline-adherent clinic.
  • CVD is a prevalent-cohort figure. The 31% 10-year cumulative includes events back-dated from patients who already carried cardiovascular history at the snapshot, so it exceeds the ~12% 10-year risk of a newly-diagnosed cohort.
  • Mortality is imported, not modelled as a hazard. Death is placed from the patient's cross-sectional record; the timeline does not add a diabetes excess-mortality multiplier of its own.
  • Simplifications: systolic BP is the one driver not scaled by τ; there is no separate insulin-initiation BMI bump; macroalbuminuria's late renal decline is realized through the steeper eGFR slope rather than a discrete "elevated-creatinine" event.
  • One seed. These figures are a single 2,500-patient draw; the relationships are structural, but exact decimals will vary with cohort size and seed.

None of these change the headline: the direction, ordering, and calibrated magnitude of progression are real. Validate against real outcome data before publishing an effect size that hinges on one of the boundaries above.

§ · Reproduce it

Generation is deterministic. This produces the exact 2,500-patient, 64,269-row cohort analyzed above:

# person CLI — long-format timelines, one row per patient-year
person-cli -n 2500 -s 42 --cohort diabetic --timeline 20 -f jsonl > t2dm_timeline.jsonl

# or via the API (see the Quickstart): POST /v1/datasets/t2dm
#   { "clientId": "...", "count": 2500, "seed": 42, "timeline": true, "horizonYears": 20 }

Each row carries: id, index, date, age, diabetes_duration_years, biological_stage, a1c, egfr, bmi, systolic_bp, albuminuria, on_insulin, and dated events. Load into pandas and group by diabetes_duration_years to reproduce every figure. New to the API? Start with the developer Quickstart, or see the cross-sectional companion note on what drives HbA1c.
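A minimal loading sketch for the JSONL output, assuming one JSON object per line with the column names listed above. The helper names are hypothetical, and the in-memory sample stands in for `t2dm_timeline.jsonl` with invented values so the snippet runs without the file.

```python
import io
import pandas as pd

def load_timeline(source) -> pd.DataFrame:
    """Read long-format timelines (one JSON object per line) from a path or buffer."""
    return pd.read_json(source, lines=True)

def mean_by_duration(df: pd.DataFrame, column: str) -> pd.Series:
    """Population mean of one driver at each integer diabetes duration."""
    return df.groupby("diabetes_duration_years")[column].mean()

# Tiny stand-in for the real file; values are invented for the demo.
sample = "\n".join([
    '{"id": "a", "diabetes_duration_years": 0, "egfr": 100.0, "a1c": 6.2}',
    '{"id": "a", "diabetes_duration_years": 1, "egfr": 98.0, "a1c": 6.4}',
    '{"id": "b", "diabetes_duration_years": 0, "egfr": 90.0, "a1c": 6.6}',
    '{"id": "b", "diabetes_duration_years": 1, "egfr": 88.0, "a1c": 6.8}',
])
demo = load_timeline(io.StringIO(sample))
egfr_curve = mean_by_duration(demo, "egfr")
```

For the real cohort, `load_timeline("t2dm_timeline.jsonl")` followed by `mean_by_duration(df, "egfr")` or `mean_by_duration(df, "a1c")` would give the kind of curves plotted in Fig 1.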

§ · References

  1. UK Prospective Diabetes Study (UKPDS) 33. Intensive blood-glucose control. Lancet 1998;352:837–853.
  2. UKPDS 34. Effect of intensive blood-glucose control with metformin. Lancet 1998;352:854–865.
  3. Adler AI, et al. (UKPDS 64) Development and progression of nephropathy in type 2 diabetes. Kidney Int. 2003;63:225–232.
  4. Klein R, et al. (WESDR XIV) The Wisconsin Epidemiologic Study of Diabetic Retinopathy. Ophthalmology 1994;101:1061–1070.
  5. eGFR decline by albuminuria in diabetes (pooled CRIC / DCCT-EDIC / Framingham). Diabetes Care 2019.
  6. Colhoun HM, et al. (CARDS) Primary prevention of cardiovascular disease with atorvastatin in type 2 diabetes. Lancet 2004;364:685–696.
  7. Diabetes Prevention Program (DPP/DPPOS) placebo-arm progression. Diabetes Care 2025.
  8. American Diabetes Association. Standards of Care in Diabetes — 2025 (§2, §9, §11).
  9. KDIGO 2024 Clinical Practice Guideline for CKD Evaluation and Management.
  10. Gregg EW, et al. Prevalence of lower-extremity disease in the US diabetic population. Diabetes Care 2004.
  11. CDC National Diabetes Statistics Report / NDSS, 2024.

Every engine constant maps to a graded anchor in the project's T2DM progression benchmark library. Figures reflect a July 2026 run of the deterministic engine.