Synthetic Data Fixtures in CI with GitHub Actions
Reproducible, PHI-free patient fixtures in your pipeline — deterministic by seed, with no real data to guard and nothing to de-identify.
REST dataset jobs authenticate with a browser session, which does not fit a CI secret. Here are the three paths that do, with a working workflow step for each.
§ · Which path fits CI
| You want… | Use | Auth in CI |
|---|---|---|
| A fixed fixture, zero setup | The static sample (curl) | None |
| A custom small cohort (≤ 100) | The MCP endpoint | One Bearer secret |
| Bulk or fully custom fixtures | Generate once, commit the file | None at run time |
REST /v1/datasets jobs are cookie-authenticated for interactive and scripted use; they are not designed to run from a CI secret. The three paths below avoid that.
§ · Path 1 · Curl the static sample
The no-signup sample is a fixed 1,000-row file. Fetch it in a step and your tests read it — nothing to authenticate, byte-identical on every run:
```yaml
- name: Fetch synthetic fixtures
  run: |
    mkdir -p fixtures
    curl -sSf -o fixtures/people.csv \
      https://simpleidgen.com/synthetic-data/person/sample.csv
- name: Run tests
  run: pytest  # your tests read fixtures/people.csv
```

The file does not change between runs, so you can cache it with actions/cache on a fixed key and skip the refetch entirely.
§ · Path 2 · A custom cohort over MCP
When you need specific parameters, call the MCP endpoint with your clientId stored as a repository secret. Pin the seed and the fixture is stable across runs:
```yaml
- name: Generate cohort
  env:
    CLIENT_ID: ${{ secrets.SIMPLEIDGEN_CLIENT_ID }}
  run: |
    mkdir -p fixtures
    curl -sS https://api.simpleidgen.com/mcp \
      -H "Authorization: Bearer $CLIENT_ID" \
      -H "Content-Type: application/json" \
      -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"generate_people","arguments":{"count":100,"format":"csv","seed":42}}}' \
      | jq -r '.result.content[0].text' > fixtures/cohort.csv
```

Up to 100 people inline (25 for timelines). Swap in generate_t2dm_cohort with a "stage" of 1–5 for diabetic patients. Add the secret under Settings → Secrets and variables → Actions.
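One caveat with the jq filter: if the response carries no `.result.content[0].text` (for example, a rejected secret), `jq -r` prints `null` and exits successfully, so the step can write a one-line `null` file without failing. A stricter extraction could look like the sketch below. `extract_text` is a hypothetical helper, and the checks assume the server follows JSON-RPC 2.0 (a top-level `error` object) and the MCP convention of an `isError` flag on tool results. Neither is confirmed here for this endpoint.

```python
def extract_text(payload):
    """Return the first text item of a tools/call response, or raise."""
    if "error" in payload:
        # JSON-RPC 2.0 reports protocol-level failures in an "error" object.
        raise RuntimeError(f"MCP call failed: {payload['error']}")
    result = payload.get("result") or {}
    if result.get("isError"):
        # Assumption: the tool signals its own failures with isError.
        raise RuntimeError("MCP tool reported an error")
    content = result.get("content") or []
    first = content[0] if content else None
    if not isinstance(first, dict) or not isinstance(first.get("text"), str):
        raise RuntimeError("MCP response has no text content")
    # Same value the jq filter '.result.content[0].text' selects.
    return first["text"]
```

A small script could feed the parsed curl output through this function and write the result to fixtures/cohort.csv, so a bad response fails the step instead of producing an empty fixture.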
§ · Path 3 · Generate once, commit the file
For bulk or highly specific fixtures, the cleanest pipeline makes no API call at all. Generate the dataset once — from the web app or a REST job on your own machine — then commit the file and let CI read it. Because the output is deterministic by seed and carries no PHI, a committed synthetic dataset is safe to version, and it never drifts. Regenerate with the same seed only when you deliberately want to change it.
§ · Why determinism matters here
Same seed, same records — down to the byte. A fixture generated today matches one generated in next year's CI run, so a test asserting against it fails only when your code changes, not when the data shifts underneath you. That property is measured end to end in the determinism study, and it is what makes synthetic data a good fit for CI test fixtures in the first place.
§ · Frequently asked
REST dataset jobs authenticate with a browser session cookie, which is not built to live in a CI secret. Use the static sample, or MCP with a Bearer clientId, or commit a pre-generated seeded file.
The data does not change between runs. The static sample is a fixed file, and MCP is deterministic by seed. Pin the seed and the data is byte-identical on every run, so tests stay stable.
Synthetic fixtures are safe to commit to the repository. There is no PHI and no real person in any row (it is calibrated fake data), so a committed fixture carries none of the handling rules that real patient data would.
The same approach works on CI systems other than GitHub Actions. The three paths are just curl and a file: fetch the sample, call MCP with a secret, or read a committed dataset. Any runner that has curl and your test tool works the same way.