AI/ML · dev tools · python · bash · automation

Building a Multi-Model AI Development Pipeline

A four-stage pipeline that chains Claude, GPT, and Gemini to write, review, and ship code automatically. Git commits at every stage mean a failed API call never requires starting over.

Apr 11, 2026 · 14 min read · Nsisong Effiong

The frustration is familiar. You ask an AI to write code, get output that's mostly right, then spend time catching the gaps: edge cases it didn't consider, security issues it glossed over, documentation it skipped. You're doing the review work yourself every time.

The question worth asking: what if the review was part of the pipeline?

That's the idea behind ai-workspace-scripts. A pipeline that takes a task description and passes it through four stages automatically — one model writes the code, a second reviews it, a third audits it, the first synthesises the feedback and ships the final version. Each stage commits to a Git feature branch. You review the output and merge when satisfied.

The repo is at github.com/nsisongeffiong/ai-workspace-scripts.


The pipeline model

Four stages. Three models. One feature branch per run.

| Stage | Model | Role |
|-------|-------|------|
| 1 | Claude Opus 4.6 | Initial implementation |
| 2 | GPT-5.4 | Code quality & documentation review |
| 3 | Gemini 2.5 Flash | Security & correctness audit |
| 4 | Claude Opus 4.6 | Final synthesis |

The choice to use three different models rather than one was deliberate. Each model has different blind spots. GPT catches things Claude misses. Gemini's security lens is distinct from GPT's quality lens. Claude at Stage 4 has the context of both reviews and the authority to act on them.

Using the same model at every stage means the same blind spots at every review. The diversity is the point.
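The stage-to-model wiring boils down to a small lookup table in code. A minimal sketch (the model labels mirror the table above; the exact identifier strings and where this table lives are assumptions, not taken from the repo):

```python
# Stage number -> (model, role). The label strings here are illustrative;
# the real orchestrator would use whatever IDs the provider SDKs expect.
STAGES = {
    1: ("Claude Opus 4.6", "Initial implementation"),
    2: ("GPT-5.4", "Code quality & documentation review"),
    3: ("Gemini 2.5 Flash", "Security & correctness audit"),
    4: ("Claude Opus 4.6", "Final synthesis"),
}

def model_for(stage: int) -> str:
    return STAGES[stage][0]
```

Stages 1 and 4 resolving to the same model is deliberate: the synthesis stage reads both reviews with full knowledge of how the original implementation was written.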


Why commit per stage

Every stage commits its output to the feature branch before the next stage runs.

if from_stage <= 1:
    stage_1_claude_code(task)
    git_commit(repo, "feat(claude): initial implementation", [PROJECT_ROOT / "src"])

if from_stage <= 2:
    stage_2_gpt_review()

if from_stage <= 3:
    stage_3_gemini_validate()

The --from-stage flag is what makes this useful. If a provider returns a 503 mid-run, you don't re-run everything before it. You resume:

python scripts/run.py --from-stage 3 "your task"

Stage 1 and 2 outputs are already in the commit history. Stage 3 picks up exactly where things stopped.

The alternative is a pipeline that either succeeds completely or forces a full restart. At the cost of a few extra git commits, the resumable model is strictly better.
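Stripped of the API calls and commits, the resume logic is just a threshold check on the stage number. A minimal sketch (the function name is hypothetical):

```python
def stages_to_run(from_stage: int, total: int = 4) -> list:
    """Every stage at or after from_stage still needs to run;
    everything before it is already in the commit history."""
    return [s for s in range(1, total + 1) if s >= from_stage]
```

This is why the `if from_stage <= N` chain in the orchestrator works: each guard is one membership test in this list.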


How `orchestrate.py` works

The core logic lives in a single file. Here's the shape of it:

def run(task: str, from_stage: int = 1) -> None:
    repo = Repo(PROJECT_ROOT)

    if from_stage > 1:
        branch = repo.active_branch.name
    else:
        branch = f"feature/{int(time.time())}"
        repo.git.checkout("-b", branch)

    if from_stage <= 1:
        stage_1_claude_code(task)
        git_commit(repo, "feat(claude): initial implementation", ...)

    if from_stage <= 2:
        stage_2_gpt_review()

    if from_stage <= 3:
        stage_3_gemini_validate()

    if from_stage <= 4:
        stage_4_claude_final()
        git_commit(repo, "review(claude): final synthesis", ...)

New run: creates a feature/{timestamp} branch and starts from Stage 1. Resume: stays on the current branch and starts from whatever stage you specify. Each run gets its own branch, so every pipeline execution is independently traceable in git history.


The file extraction problem

Stage 1 asks the model to write code. The model returns text. The pipeline has to turn that text into actual files on disk.

The convention: annotate every code block with a filepath on the fence line.

```tsx src/components/Hero.tsx
// component code here
```

The extractor parses this and writes each block to the correct path:

# LANGS and EXT are alternation lists of the language tags and file
# extensions the pipeline recognises, e.g. "python|tsx|css" / "py|tsx|css"
pattern1 = re.compile(
    rf"```(?:{LANGS})\s+([\w./\-]+\.(?:{EXT}))\n(.*?)```",
    re.DOTALL,
)
for m in pattern1.finditer(text):
    write_file(PROJECT_ROOT / m.group(1), m.group(2))

If no filepath is found, the fallback writes everything to src/generated.py. Ideally that fallback never fires; when it does, it means the task prompt wasn't explicit enough about the convention. The fix is one line in your project's CONTEXT.md:

Filepath on every opening fence line: ```tsx src/components/Hero.tsx

After that, files go to the right place.
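Putting the two paths together, the extractor reduces to: try the annotated-fence pattern first, fall back to a single catch-all file if nothing matches. A simplified sketch (the real pattern is built from the pipeline's LANGS/EXT lists; here the language tag and extension are matched loosely, and the function names are illustrative):

```python
import re

# ```lang path/to/file.ext ... ``` — a loose stand-in for the
# pipeline's LANGS/EXT-based pattern. `{3} matches the three backticks.
ANNOTATED = re.compile(r"`{3}\w+[ \t]+([\w./\-]+\.\w+)\n(.*?)`{3}", re.DOTALL)
# Any fenced block with no filepath after the language tag.
ANY_BLOCK = re.compile(r"`{3}\w*\n(.*?)`{3}", re.DOTALL)

def extract_files(text: str) -> dict:
    """Map filepath -> code for every annotated block in the model output."""
    files = {m.group(1): m.group(2) for m in ANNOTATED.finditer(text)}
    if not files:
        # Fallback: no filepath annotations found anywhere in the response.
        m = ANY_BLOCK.search(text)
        if m:
            files["src/generated.py"] = m.group(1)
    return files
```

The dict return keeps the write step trivial: iterate and write each value to PROJECT_ROOT joined with its key.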


The context injection problem

The pipeline is project-agnostic by design. But projects have specific requirements — design tokens, coding conventions, framework rules — that need to reach Stage 1 without bloating every stage's token budget.

The solution: context is opt-in per prompt via a {brand_context} placeholder.

def load_prompt(name: str) -> str:
    # Path resolution simplified here; per the workspace layout, the real
    # lookup checks the project's prompts/ first, then the shared defaults.
    p = PROJECT_ROOT / "prompts" / name
    text = p.read_text(encoding="utf-8")
    if "{brand_context}" in text:
        tokens_path = PROJECT_ROOT / "prompts" / "BRAND_TOKENS.md"
        brand_path  = PROJECT_ROOT / "prompts" / "BRAND.md"
        if tokens_path.exists():
            brand = tokens_path.read_text(encoding="utf-8")
        elif brand_path.exists():
            brand = brand_path.read_text(encoding="utf-8")
        else:
            brand = ""
        text = text.replace("{brand_context}", brand)
    return text

Only prompts that include {brand_context} get the context injected. The security audit prompt doesn't need design tokens, so it doesn't get them.

For projects with large context files, there's a pre-flight distillation step: Claude compresses your full BRAND.md into a BRAND_TOKENS.md, under 100 lines, containing only what Stage 1 actually needs. A full brand guide might run to 4000 tokens. The distilled version runs around 800. That's 3200 tokens back in the budget for code.
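One way to keep the distillation step cheap is to re-run it only when the source file has actually changed. A sketch of that pre-flight check (a hypothetical helper, not from the repo — it compares modification times, which is good enough for a single-user workspace):

```python
from pathlib import Path

def needs_distillation(brand: Path, tokens: Path) -> bool:
    """Re-distill only when BRAND.md is newer than BRAND_TOKENS.md."""
    if not brand.exists():
        return False   # nothing to distill
    if not tokens.exists():
        return True    # never distilled yet
    return brand.stat().st_mtime > tokens.stat().st_mtime
```

When this returns True, the pre-flight step pays one Claude call to regenerate BRAND_TOKENS.md; every subsequent run reuses the cached distillation for free.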


What the workspace structure looks like

~/ai-workspace/
  .shared/
    .env              ← API keys (chmod 600)
    prompts/          ← default stage prompts
    orchestrate.py    ← the pipeline runner
    .venv/            ← shared Python environment

  projects/
    your-project/
      prompts/
        claude_coder.md    ← Stage 1 system prompt (project override)
        BRAND.md           ← project context (optional)
        BRAND_TOKENS.md    ← distilled tokens (auto-generated)
      reviews/
        review-gpt.md      ← Stage 2 output
        review-gemini.md   ← Stage 3 output
        final-review.md    ← Stage 4 output
      scripts/
        run.py             ← project entry point
      src/                 ← generated source code

Each project overrides the shared prompts by placing its own versions in prompts/. The pipeline checks the project's prompts/ first, then falls back to the shared ones. The workspace itself stays untouched: one workspace, many projects.
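The override lookup is a two-step path check: project first, shared second. A minimal sketch (the directories are parameters here so the lookup is easy to test; presumably the real code hardcodes the project root and the shared workspace path):

```python
from pathlib import Path

def resolve_prompt(name: str, project_dir: Path, shared_dir: Path) -> Path:
    """Project prompts/ wins; fall back to the shared defaults."""
    for base in (project_dir / "prompts", shared_dir / "prompts"):
        candidate = base / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"no prompt named {name!r}")
```

Because the shared prompts are never edited, pulling an updated workspace can't clobber a project's customisations.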


What the source-reading stages do

Stages 2, 3, and 4 need to read what Stage 1 wrote. Reading everything in src/ is the naive approach. On any real project that's too much context — config files, lock files, generated types, things the reviewing model doesn't need to see.

The smarter approach: diff against the initial scaffold commit and read only the files the pipeline actually changed. None of that noise reaches the reviewing model.

# the root (scaffold) commit is the last line of `rev-list --max-parents=0`
initial = repo.git.rev_list("--max-parents=0", "HEAD").strip().splitlines()[-1]
# every path touched between the scaffold and HEAD
changed = set(repo.git.diff("--name-only", initial, "HEAD").strip().splitlines())
pipeline_files = sorted(
    PROJECT_ROOT / f
    for f in changed
    if (PROJECT_ROOT / f).is_file()  # skip paths deleted along the way
)

The reviewing stages see only what changed. The context stays focused and the reviews stay accurate.
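What a reviewing stage actually receives is then just those files concatenated under path headers, capped at a budget. A hypothetical sketch of that assembly step (function name, header style, and the character limit are all assumptions):

```python
def build_review_context(files: dict, limit: int = 60_000) -> str:
    """Concatenate changed files under path headers, truncating at a char budget."""
    parts = []
    total = 0
    for path, body in sorted(files.items()):
        chunk = f"### {path}\n{body}\n"
        if total + len(chunk) > limit:
            parts.append(f"### {path}\n[truncated: over context budget]\n")
            break
        parts.append(chunk)
        total += len(chunk)
    return "\n".join(parts)
```

The path headers matter: they let the reviewing model cite specific files in its feedback, which Stage 4 then acts on.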


Getting started

Install the workspace with one command:

curl -fsSL https://raw.githubusercontent.com/nsisongeffiong/ai-workspace-scripts/main/setup-workspace.sh -o setup-workspace.sh
bash setup-workspace.sh

The setup script asks for your three API keys (Anthropic, OpenAI, Google), stores them in ~/.ai-workspace/.shared/.env with chmod 600, installs all dependencies, and runs a smoke test against all three APIs.

Then create a project and run a task:

bash ~/new-project.sh my-project
cd ~/ai-workspace/projects/my-project
python scripts/run.py "Build a landing page with a hero section and three feature cards"

The pipeline creates a feature/{timestamp} branch, runs all four stages, commits at each step, and prints the branch name when done. Open a PR from that branch to main when the output looks right.


What I'd change

Two things, with hindsight.

Git tags per stage, not just commits. Resuming from Stage 3 requires knowing the branch is still active and the right commits are in history. A tag like stage-1-complete would make this explicit: check out the tag directly rather than inferring state from commit history.
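The tagging version would be a couple of lines per completed stage. A sketch (helper names are hypothetical; `tag -f` keeps re-runs idempotent by moving the tag if a resumed run re-completes the same stage):

```python
import subprocess

def stage_tag(stage: int) -> str:
    return f"stage-{stage}-complete"

def tag_stage(repo_dir: str, stage: int) -> None:
    # -f replaces an existing tag, so resuming and re-running a stage
    # simply moves its marker forward instead of failing.
    subprocess.run(
        ["git", "-C", repo_dir, "tag", "-f", stage_tag(stage)],
        check=True,
    )
```

A resume could then verify the expected tag exists before starting, instead of trusting commit messages.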

Typed task schemas. A task is currently a free-text string. A structured schema with explicit fields for language, framework, files to create, and constraints would make Stage 1 more predictable and the filepath convention less dependent on the prompt author remembering to include it.
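The structured version might be a small dataclass that renders itself into the Stage 1 prompt. A sketch (the field names are illustrative, not from the repo):

```python
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    description: str
    language: str = "tsx"
    framework: str = ""
    files: list = field(default_factory=list)        # paths Stage 1 must create
    constraints: list = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the spec as the free-text task Stage 1 currently expects."""
        lines = [self.description, f"Language: {self.language}"]
        if self.framework:
            lines.append(f"Framework: {self.framework}")
        if self.files:
            lines.append("Create exactly these files: " + ", ".join(self.files))
        lines.extend(f"Constraint: {c}" for c in self.constraints)
        return "\n".join(lines)
```

With an explicit files list, the filepath-on-fence convention stops depending on the prompt author remembering to state it: the orchestrator can inject it mechanically.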

Both are solvable. The pipeline works well as-is for the problems it was built to solve.


The repo is at github.com/nsisongeffiong/ai-workspace-scripts. The setup takes about ten minutes. If you run into anything, open an issue.
