§ · Tutorial

Synthetic Patient Data in Python

Get calibrated, PHI-free patient records into a pandas DataFrame in one line, then explore and validate them — no signup for the sample, deterministic by seed.

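This tutorial goes from a URL to a working DataFrame, then to the same checks our validation studies run. It starts with the ways to pull the data into Python, cheapest first: the static sample, a custom cohort over MCP, and, for bulk volumes, the REST job covered in the Quickstart.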

§ · 1 · The one-liner, no account

The no-signup sample is a static file, so pandas reads it straight from the URL — nothing to authenticate:

import pandas as pd

df = pd.read_csv("https://simpleidgen.com/synthetic-data/person/sample.csv")
df.shape          # (1000, 77)  — 1,000 US adults, 77 attributes

Prefer JSONL? Use pd.read_json("https://simpleidgen.com/synthetic-data/person/sample.jsonl", lines=True). The sample is a fixed file, so it is identical on every machine.
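If the sample feeds a notebook or a test, it can help to fail early when the file is not the shape you expect. The sketch below wraps the one-liner in a small helper; `load_sample` and its `expected_shape` parameter are hypothetical names introduced here, not part of any SimpleIDGen library, and the offline demo uses a two-row stand-in with made-up values rather than the real sample.

```python
import io

import pandas as pd

SAMPLE_CSV_URL = "https://simpleidgen.com/synthetic-data/person/sample.csv"


def load_sample(source=SAMPLE_CSV_URL, expected_shape=(1000, 77)):
    """Read the sample and raise if its shape differs from what we expect.

    `source` may be a URL, a local path, or a file-like object; pandas
    accepts all three. Pass expected_shape=None to skip the check.
    """
    df = pd.read_csv(source)
    if expected_shape is not None and df.shape != expected_shape:
        raise ValueError(f"unexpected shape {df.shape}, wanted {expected_shape}")
    return df


# Offline demonstration on a two-row stand-in (made-up values, not the real sample).
demo = load_sample(io.StringIO("age,bmi\n40,27.5\n61,31.2\n"), expected_shape=(2, 2))
print(demo.shape)
```

Calling `load_sample()` with no arguments reads the published URL and checks for the 1,000 by 77 shape quoted above.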

§ · 2 · A custom cohort over MCP

When you need specific parameters — a row count, a seed, a T2DM stage — call the MCP endpoint with your clientId as a Bearer token. It returns up to 100 records inline:

import io, requests, pandas as pd

resp = requests.post(
    "https://api.simpleidgen.com/mcp",
    headers={"Authorization": "Bearer YOUR_CLIENT_ID"},
    json={
        "jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {
            "name": "generate_people",
            "arguments": {"count": 100, "format": "csv", "seed": 42},
        },
    },
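    timeout=30,  # avoid hanging on a stalled connection
)
resp.raise_for_status()  # surface HTTP errors such as a rejected token

# The CSV arrives as text inside the JSON-RPC result.
csv_text = resp.json()["result"]["content"][0]["text"]
df = pd.read_csv(io.StringIO(csv_text))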

Swap generate_people for generate_t2dm_cohort (add "stage": 4) or generate_timeline. Your clientId is on your profile. For more than 100 rows, use the REST job in the Quickstart.
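Swapping tools changes only the tool name and its arguments, so the request body and the response parsing can be factored out. This is a minimal sketch: `build_tool_call` and `frame_from_result` are hypothetical helpers, the check for an `error` member follows the general JSON-RPC 2.0 convention and is an assumption about how this endpoint reports failures, and the response body in the demo is a hand-written stand-in.

```python
import io

import pandas as pd


def build_tool_call(tool, arguments, request_id=1):
    """Build the JSON-RPC body for a tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def frame_from_result(body):
    """Turn a parsed tools/call response (format csv) into a DataFrame."""
    # Assumption: failures come back as a JSON-RPC "error" member.
    if "error" in body:
        raise RuntimeError(f"tool call failed: {body['error']}")
    text = body["result"]["content"][0]["text"]
    return pd.read_csv(io.StringIO(text))


# Same arguments as above, plus the stage for the T2DM cohort tool.
payload = build_tool_call(
    "generate_t2dm_cohort",
    {"count": 100, "format": "csv", "seed": 42, "stage": 4},
)

# Hand-written stand-in for resp.json(); the values are made up.
fake_body = {"result": {"content": [{"type": "text", "text": "age,bmi\n52,33.1\n"}]}}
demo = frame_from_result(fake_body)
```

In a real call, `payload` would go in the `json=` argument of `requests.post` and `frame_from_result(resp.json())` would replace the last two lines of the snippet in this section.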

§ · 3 · Explore it

From here it is ordinary pandas. The columns cohere with each other — vitals track age and BMI, medications track conditions — so grouping and filtering behave like a real population:

df[["age", "bmi", "a1c_value", "systolic_bp_avg"]].describe()
df["diabetes_status"].value_counts(normalize=True).round(3)
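One way to see the columns moving together is to bucket by age and compare the means. The sketch below assumes the column names used above; the band edges are an illustrative choice, `summarize_by_age_band` is a hypothetical helper, and the toy frame holds made-up values so the example runs without a download.

```python
import pandas as pd


def summarize_by_age_band(df, bins=(18, 40, 60, 120)):
    """Mean BMI, HbA1c and systolic BP per age band (left-closed intervals)."""
    bands = pd.cut(df["age"], bins=list(bins), right=False)
    cols = ["bmi", "a1c_value", "systolic_bp_avg"]
    return df.groupby(bands, observed=True)[cols].mean().round(1)


# Toy frame with made-up values; on real data, pass the df loaded earlier.
toy = pd.DataFrame(
    {
        "age": [25, 35, 45, 55, 65, 75],
        "bmi": [24.0, 26.0, 28.0, 30.0, 29.0, 27.0],
        "a1c_value": [5.2, 5.4, 5.6, 6.0, 6.4, 6.8],
        "systolic_bp_avg": [112, 116, 122, 128, 134, 140],
    }
)
summary = summarize_by_age_band(toy)
print(summary)
```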

§ · 4 · Validate it yourself

Every claim on this site is checkable in a couple of lines. The marginals land on their US benchmarks:

(df["bmi"] >= 30).mean()          # ~0.42  → NHANES obesity 42.4%
df["bmi"].mean()                   # ~29.6  → NHANES 29.6

# and the joints hold — insulin appears only where diabetes is present
df.loc[df["on_insulin"], "diabetes_status"].unique()   # never "none"
df[["age", "bmi", "a1c_value", "egfr_value"]].corr().round(2)

Worked, full-length versions: validation vs US statistics, joint distributions, and what drives HbA1c.
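The checks above can be gathered into one function that a test suite calls. This is a sketch, not the method the validation studies use: the tolerances are arbitrary defaults chosen for illustration, `check_marginals_and_joints` is a hypothetical name, `on_insulin` is assumed to load as a boolean column, and the toy frame (including the non-"none" status labels) is made up.

```python
import pandas as pd


def check_marginals_and_joints(
    df,
    obesity_target=0.424,   # NHANES obesity prevalence quoted above
    mean_bmi_target=29.6,   # NHANES mean BMI quoted above
    obesity_tol=0.03,       # illustrative tolerance, not an official threshold
    bmi_tol=0.5,            # illustrative tolerance, not an official threshold
):
    """Return the measured values and pass/fail flags for each check."""
    obesity_rate = float((df["bmi"] >= 30).mean())
    mean_bmi = float(df["bmi"].mean())
    # Joint check: nobody on insulin should have diabetes_status "none".
    on_insulin = df.loc[df["on_insulin"], "diabetes_status"]
    insulin_without_diabetes = int((on_insulin == "none").sum())
    return {
        "obesity_rate": obesity_rate,
        "obesity_ok": bool(abs(obesity_rate - obesity_target) <= obesity_tol),
        "mean_bmi": mean_bmi,
        "bmi_ok": bool(abs(mean_bmi - mean_bmi_target) <= bmi_tol),
        "insulin_without_diabetes": insulin_without_diabetes,
    }


# Toy frame with made-up values; five rows are far too few to hit a benchmark.
toy = pd.DataFrame(
    {
        "bmi": [22.0, 31.0, 35.0, 28.0, 32.0],
        "on_insulin": [False, False, True, False, False],
        "diabetes_status": ["none", "none", "type2", "prediabetes", "none"],
    }
)
report = check_marginals_and_joints(toy)
print(report)
```

On the toy frame the obesity check fails, as expected for five hand-picked rows; the point is the shape of the report, which can be asserted on in a fixture test against the real sample.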

§ · 5 · Reproducible by seed

Generation is deterministic: the same seed and options return byte-identical records, so a notebook or a test fixture stays stable across runs and machines. The static sample is a fixed file for the same reason. To wire seeded fixtures into a pipeline, see the GitHub Actions tutorial; the guarantee itself is measured in the determinism study.
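A quick way to confirm that two pulls with the same seed match is to compare a digest of each frame. The sketch below hashes pandas' own CSV serialization of the frame, so it checks that the parsed records are identical rather than the raw bytes on the wire; `frame_digest` is a hypothetical helper and the frames are made-up stand-ins for two seeded pulls.

```python
import hashlib

import pandas as pd


def frame_digest(df):
    """SHA-256 of the frame re-serialized as CSV (values and column order)."""
    payload = df.to_csv(index=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Stand-ins for two pulls with the same seed, plus one that differs.
first = pd.DataFrame({"age": [40, 61], "bmi": [27.5, 31.2]})
second = first.copy()
changed = first.copy()
changed.loc[0, "bmi"] = 27.6

same = frame_digest(first) == frame_digest(second)
differs = frame_digest(first) != frame_digest(changed)
print(same, differs)
```

Storing the digest next to a committed fixture gives a test a single string to compare against after regenerating with the same seed and options.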

§ · Frequently asked

Do I need an account to use it in Python?

Not for the sample — pd.read_csv reads the 1,000-row file straight from the URL. A free clientId is needed only for custom cohorts over MCP or bulk jobs over REST.

How many rows can I pull?

The static sample is 1,000 rows. MCP returns up to 100 inline (25 for timelines). REST jobs go up to 1,000,000 rows but need a logged-in session — see the Quickstart.

Is the data safe to commit to a repo?

Yes. The data contains no PHI and no real people, only calibrated fake records. Generation is deterministic by seed, so a committed CSV never drifts from what the same seed and options produce.

Can I get FHIR instead of CSV?

Yes. Pass "format": "fhir" to the MCP tool, or formats: ["fhir"] to a REST job, for US-Core-aligned FHIR R4.