§ · Tutorial

Synthetic Data Fixtures in CI with GitHub Actions

Reproducible, PHI-free patient fixtures in your pipeline — deterministic by seed, with no real data to guard and nothing to de-identify.

REST dataset jobs authenticate with a browser session, which does not fit a CI secret. Here are the three paths that do, with a working workflow step for each.

§ · Which path fits CI

You want…	Use	Auth in CI
A fixed fixture, zero setup	The static sample (curl)	None
A custom small cohort (≤ 100)	The MCP endpoint	One Bearer secret
Bulk or fully custom fixtures	Generate once, commit the file	None at run time

REST /v1/datasets jobs authenticate with a browser session cookie. That works for interactive and scripted use from your own machine, but it is not designed to be stored as a CI secret. The three paths below avoid that.

§ · Path 1 · Curl the static sample

The no-signup sample is a fixed 1,000-row file. Fetch it in one step and have your tests read it. There is nothing to authenticate, and the file is byte-identical on every run:

- name: Fetch synthetic fixtures
  run: |
    mkdir -p fixtures
    curl -sSf -o fixtures/people.csv \
      https://simpleidgen.com/synthetic-data/person/sample.csv

- name: Run tests
  run: pytest        # your tests read fixtures/people.csv

The file does not change between runs, so you can cache it with actions/cache on a fixed key and skip the refetch entirely.
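The caching idea above could look like the following sketch. The step id `fixtures-cache` and the cache key `simpleidgen-person-sample-v1` are illustrative names, not part of the service; `actions/cache@v4` and its `cache-hit` output are the standard cache action.

```yaml
# Restore fixtures/ from the cache when the fixed key matches.
- name: Cache synthetic fixtures
  id: fixtures-cache            # hypothetical id, referenced in the next step
  uses: actions/cache@v4
  with:
    path: fixtures
    key: simpleidgen-person-sample-v1   # hypothetical key; bump it to force a refetch

# Only touch the network when the cache missed.
- name: Fetch synthetic fixtures
  if: steps.fixtures-cache.outputs.cache-hit != 'true'
  run: |
    mkdir -p fixtures
    curl -sSf -o fixtures/people.csv \
      https://simpleidgen.com/synthetic-data/person/sample.csv

- name: Run tests
  run: pytest        # your tests read fixtures/people.csv
```

Keeping the fetch step behind the `if:` matters because GitHub evicts cache entries that go unused for a while, so the workflow still needs a way to repopulate the file.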

§ · Path 2 · A custom cohort over MCP

When you need specific parameters, call the MCP endpoint with your clientId stored as a repository secret. Pin the seed and the fixture is stable across runs:

- name: Generate cohort
  env:
    CLIENT_ID: ${{ secrets.SIMPLEIDGEN_CLIENT_ID }}
  run: |
    # Fail the step if curl or jq fails, not only the last command in the pipe
    set -euo pipefail
    mkdir -p fixtures
    # -f makes curl exit non-zero on HTTP errors; jq -e exits non-zero on a null result
    curl -sSf https://api.simpleidgen.com/mcp \
      -H "Authorization: Bearer $CLIENT_ID" \
      -H "Content-Type: application/json" \
      -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"generate_people","arguments":{"count":100,"format":"csv","seed":42}}}' \
      | jq -er '.result.content[0].text' > fixtures/cohort.csv

The endpoint returns up to 100 people inline per call (25 for timelines). For diabetic patients, swap in generate_t2dm_cohort with a "stage" of 1–5. Add the secret under Settings → Secrets and variables → Actions.
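As a rough sketch of that swap, the step below changes only the tool name and adds a "stage". It assumes generate_t2dm_cohort accepts the same count, format, and seed arguments as generate_people, which the text above does not confirm; the output file name `fixtures/t2dm.csv` is also illustrative.

```yaml
- name: Generate T2DM cohort
  env:
    CLIENT_ID: ${{ secrets.SIMPLEIDGEN_CLIENT_ID }}
  run: |
    set -euo pipefail
    mkdir -p fixtures
    # ASSUMPTION: count, format, and seed carry over from generate_people.
    # Only the tool name and "stage" (1-5) come from the tutorial text.
    curl -sSf https://api.simpleidgen.com/mcp \
      -H "Authorization: Bearer $CLIENT_ID" \
      -H "Content-Type: application/json" \
      -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"generate_t2dm_cohort","arguments":{"stage":3,"count":100,"format":"csv","seed":42}}}' \
      | jq -er '.result.content[0].text' > fixtures/t2dm.csv
```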

§ · Path 3 · Generate once, commit the file

For bulk or highly specific fixtures, the cleanest pipeline makes no API call at all. Generate the dataset once, from the web app or a REST job on your own machine, then commit the file and let CI read it. Because the output is deterministic by seed and carries no PHI, a committed synthetic dataset is safe to version, and it never drifts. Regenerate it only when you deliberately want different data, by changing the seed or the parameters; rerunning with the same seed reproduces the same file.
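A minimal sketch of such a workflow follows. The checksum step is an optional addition, not something the service requires: it assumes you ran `sha256sum fixtures/*.csv > fixtures/SHA256SUMS` from the repository root when you generated the data and committed that file alongside the fixtures. The file name `SHA256SUMS` and the Python setup are illustrative.

```yaml
name: tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      # The fixtures arrive with the checkout; no network call to the generator.
      - uses: actions/checkout@v4

      # Optional guard: fail fast if a committed fixture was edited by accident.
      # Assumes fixtures/SHA256SUMS (hypothetical name) was committed with the data.
      - name: Verify committed fixtures
        run: sha256sum -c fixtures/SHA256SUMS

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Run tests
        run: |
          pip install pytest
          pytest        # your tests read the committed files under fixtures/
```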

§ · Why determinism matters here

Same seed, same records — down to the byte. A fixture generated today matches one generated in next year's CI run, so a test asserting against it fails only when your code changes, not when the data shifts underneath you. That property is measured end to end in the determinism study, and it is what makes synthetic data a good fit for CI test fixtures in the first place.
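If you would rather check this property in your own pipeline than take it on trust, one option is a step that issues the Path 2 request twice with the same seed and compares the results byte for byte. This is a sketch under the same assumptions as the Path 2 step; the helper name `generate` and the temporary file names are illustrative.

```yaml
- name: Check fixture determinism
  env:
    CLIENT_ID: ${{ secrets.SIMPLEIDGEN_CLIENT_ID }}
  run: |
    set -euo pipefail
    # Hypothetical helper: the same generate_people request used in Path 2.
    generate() {
      curl -sSf https://api.simpleidgen.com/mcp \
        -H "Authorization: Bearer $CLIENT_ID" \
        -H "Content-Type: application/json" \
        -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"generate_people","arguments":{"count":100,"format":"csv","seed":42}}}' \
        | jq -er '.result.content[0].text'
    }
    generate > "$RUNNER_TEMP/run-a.csv"
    generate > "$RUNNER_TEMP/run-b.csv"
    # cmp exits non-zero at the first differing byte, which fails the step.
    cmp "$RUNNER_TEMP/run-a.csv" "$RUNNER_TEMP/run-b.csv"
```

Comparing a fresh run against a committed copy of the file works the same way and also catches drift over time, not just within one job.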

§ · Frequently asked

Can I call the REST API from CI?

REST dataset jobs authenticate with a browser session cookie, which is not built to live in a CI secret. Use the static sample, or MCP with a Bearer clientId, or commit a pre-generated seeded file.

Will the fixtures change between runs?

No. The static sample is a fixed file, and MCP is deterministic by seed. Pin the seed and the data is byte-identical every run, so tests stay stable.

Is it safe to commit synthetic data to my repo?

Yes. There is no PHI and no real person in any row — it is calibrated fake data — so a committed fixture carries none of the handling rules real patient data would.

Does this work with GitLab CI or other runners?

Yes. The three paths are just curl and a file: fetch the sample, call MCP with a secret, or read a committed dataset. Any runner that has curl and your test tool works the same way.
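For instance, Path 1 might translate to GitLab CI roughly as below. The job name `test` and the `python:3.12` image are assumptions chosen for illustration; any image that provides curl and your test tool should do.

```yaml
# .gitlab-ci.yml
test:
  image: python:3.12      # assumed image; it ships with curl and pip
  script:
    - mkdir -p fixtures
    - curl -sSf -o fixtures/people.csv https://simpleidgen.com/synthetic-data/person/sample.csv
    - pip install pytest
    - pytest              # your tests read fixtures/people.csv
```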