Voice Agent Evaluation with LLM Judges: How It Works

Wednesday, February 18, 2026

Originally published on voicetest.dev.

You can write unit tests for a REST API. You can snapshot-test a React component. But how do you test a voice agent that holds free-form conversations?

The core challenge: voice agent behavior is non-deterministic. The same agent, given the same prompt, will produce different conversations every time. Traditional assertion-based testing breaks down when there is no single correct output. You need an evaluator that understands intent, not just string matching.

Voicetest solves this with LLM-as-judge evaluation. It simulates multi-turn conversations with your agent, then passes the full transcript to a judge model that scores it against your success criteria. This post explains how each piece works.

The three-model architecture

Voicetest uses three separate LLM roles during a test run:

Simulator. Plays the user. Given a persona prompt (name, goal, personality), it generates realistic user messages turn by turn. It decides autonomously when the conversation goal has been achieved and should end – no scripted dialogue trees.

Agent. Plays your voice agent. Voicetest imports your agent config (from Retell, VAPI, LiveKit, or its own format) into an intermediate graph representation: nodes with state prompts, transitions with conditions, and tool definitions. The agent model follows this graph, responding according to the current node’s instructions and transitioning between states.

Judge. Evaluates the finished transcript. This is where LLM-as-judge happens: the judge reads the full conversation and scores it against each metric you defined.

You can assign different models to each role. Use a fast, cheap model for simulation (it just needs to follow a persona) and a more capable model for judging (where accuracy matters):

[models]
simulator = "groq/llama-3.1-8b-instant"
agent = "groq/llama-3.3-70b-versatile"
judge = "openai/gpt-4o"

How simulation works

Each test case defines a user persona:

{
  "name": "Appointment reschedule",
  "user_prompt": "You are Maria Lopez, DOB 03/15/1990. You need to reschedule your Thursday appointment to next week. You prefer mornings.",
  "metrics": [
    "Agent verified the patient's identity before making changes.",
    "Agent confirmed the new appointment date and time."
  ],
  "type": "llm"
}

Voicetest starts the conversation at the agent’s entry node. The simulator generates a user message based on the persona. The agent responds following the current node’s state prompt, then voicetest evaluates transition conditions to determine the next node. This loop continues for up to max_turns (default 20) or until the simulator decides the goal is complete.
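A minimal sketch of that loop, with illustrative function and attribute names rather than voicetest's internal API (evaluate_transitions here is a hypothetical helper):

def simulate(graph, persona, simulator, agent, max_turns=20):
    node = graph.nodes[graph.entry_node_id]   # start at the agent's entry node
    transcript = []
    for _ in range(max_turns):
        user_msg, goal_done = simulator.next_message(persona, transcript)
        if goal_done:                          # simulator decides the goal is complete
            break
        transcript.append(("user", user_msg))
        reply = agent.respond(node.state_prompt, transcript)
        transcript.append(("agent", reply))
        node = evaluate_transitions(graph, node, transcript)   # pick the next node
    return transcript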

The result is a full transcript with metadata: which nodes were visited, which tools were called, how many turns it took, and why the conversation ended.

How the judge scores

After simulation, the judge evaluates each metric independently. For the metric “Agent verified the patient’s identity before making changes,” the judge produces structured output with four fields:

  • Analysis: Breaks compound criteria into individual requirements and quotes transcript evidence for each. For this metric, it would identify two requirements – (1) asked for identity verification, (2) verified before making changes – and cite the specific turns where each happened or did not.
  • Score: 0.0 to 1.0, based on the fraction of requirements met. If the agent verified identity but did it after making the change, the score might be 0.5.
  • Reasoning: A summary of what passed and what failed.
  • Confidence: How certain the judge is in its assessment.

A test passes when all metric scores meet the threshold (default 0.7, configurable per-agent or per-metric).

This structured approach – analysis before scoring – prevents a common failure mode where judges assign a high score despite noting problems in their reasoning. By forcing the model to enumerate requirements and evidence first, the score stays consistent with the analysis.
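As a rough sketch, the judge's output for one metric can be pictured as a small record with those four fields (the exact schema may differ):

from dataclasses import dataclass

@dataclass
class MetricJudgment:
    analysis: str      # requirements enumerated, with quoted transcript evidence
    score: float       # 0.0-1.0, fraction of requirements met
    reasoning: str     # summary of what passed and what failed
    confidence: float  # how certain the judge is

judgment = MetricJudgment(
    analysis="(1) asked for name and DOB at turn 2; (2) reschedule made at turn 5, after verification",
    score=1.0,
    reasoning="Both requirements met: identity verified before any changes.",
    confidence=0.9,
)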

Rule tests: when you do not need an LLM

Not everything requires a judge. Voicetest also supports deterministic rule tests for pattern-matching checks:

{
  "name": "No SSN in transcript",
  "user_prompt": "You are Jane, SSN 123-45-6789. Ask the agent to verify your identity.",
  "excludes": ["123-45-6789", "123456789"],
  "type": "rule"
}

Rule tests check for includes (required substrings), excludes (forbidden substrings), and patterns (regex). They run instantly, cost nothing, and return binary pass/fail with 100% confidence. Use them for compliance checks, PII detection, and required-phrase validation.
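The underlying checks are simple enough to sketch in a few lines (illustrative, not voicetest's exact implementation):

import re

def run_rule_test(transcript_text, includes=(), excludes=(), patterns=()):
    ok = all(s in transcript_text for s in includes)                  # required substrings
    ok = ok and not any(s in transcript_text for s in excludes)       # forbidden substrings
    ok = ok and all(re.search(p, transcript_text) for p in patterns)  # regex checks
    return ok   # binary pass/fail, no LLM call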

Global metrics: compliance at scale

Individual test metrics evaluate specific scenarios. Global metrics evaluate every test transcript against organization-wide criteria:

{
  "global_metrics": [
    {
      "name": "HIPAA Compliance",
      "criteria": "Agent verifies patient identity before disclosing any protected health information.",
      "threshold": 0.9
    },
    {
      "name": "Brand Voice",
      "criteria": "Agent maintains a professional, empathetic tone throughout the conversation.",
      "threshold": 0.7
    }
  ]
}

Global metrics run on every test automatically. A test only passes if both its own metrics and all global metrics meet their thresholds. This gives you a single place to enforce standards like HIPAA, PCI-DSS, or brand guidelines across your entire test suite.
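The pass rule itself is straightforward; a sketch with illustrative names, assuming each metric carries an optional per-metric threshold:

def test_passes(metric_results, global_results, default_threshold=0.7):
    """Each result is a (score, threshold_or_None) pair; all must clear their bar."""
    all_results = list(metric_results) + list(global_results)
    return all(
        score >= (threshold if threshold is not None else default_threshold)
        for score, threshold in all_results
    )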

Putting it together

A complete test run looks like this:

  1. Voicetest imports your agent config into its graph representation.
  2. For each test case, it runs a multi-turn simulation using the simulator and agent models.
  3. The judge evaluates each metric and each global metric against the transcript.
  4. Results are stored in DuckDB with the full transcript, scores, reasoning, nodes visited, and tools called.
  5. A test passes only if every metric and every global metric meets its threshold.

The web UI (voicetest serve) shows results visually – transcripts with node annotations, metric scores with judge reasoning, and pass/fail status. The CLI outputs the same data to stdout for CI integration.

Getting started

uv tool install voicetest
voicetest demo --serve

The demo loads a sample agent with test cases and opens the web UI so you can see the full evaluation pipeline in action.

Voicetest is open source under Apache 2.0. GitHub. Docs.


Using Claude Code as a Free LLM Backend for Voice Agent Testing

Monday, February 16, 2026

Originally published on voicetest.dev.

Running a voice agent test suite means making a lot of LLM calls. Each test runs a multi-turn simulation (10-20 turns of back-and-forth), then passes the full transcript to a judge model for evaluation. A suite of 20 tests can easily hit 200+ LLM calls. At API rates, that adds up fast – especially if you are using a capable model for judging.

If you have a Claude Pro or Max subscription, you already have access to Claude models through Claude Code. Voicetest can use the claude CLI as its LLM backend, routing all inference through your existing subscription instead of billing per-token through an API provider.

How it works

Voicetest has a built-in Claude Code provider. When you set a model string starting with claudecode/, voicetest invokes the claude CLI in non-interactive mode, passes the prompt, and parses the JSON response. It clears the ANTHROPIC_API_KEY environment variable from the subprocess so that Claude Code uses your subscription quota rather than any configured API key.

No proxy server. No API key management. Just the claude binary on your PATH.

Step 1: Install Claude Code

Follow the instructions at claude.ai/claude-code. After installation, verify it works:

claude --version

Make sure you are logged in to your Claude account.

Step 2: Install voicetest

uv tool install voicetest

Step 3: Configure settings.toml

Create .voicetest/settings.toml in your project directory:

[models]
agent = "claudecode/sonnet"
simulator = "claudecode/haiku"
judge = "claudecode/sonnet"

[run]
max_turns = 20
verbose = false

The model strings follow the pattern claudecode/<variant>. The supported variants are:

  • claudecode/haiku – Fast, cheap on quota. Good for simulation.
  • claudecode/sonnet – Balanced. Good for judging and agent simulation.
  • claudecode/opus – Most capable. Use when judging accuracy matters most.

Step 4: Run tests

voicetest run \
  --agent agents/my-agent.json \
  --tests agents/my-tests.json \
  --all

No API keys needed. Voicetest calls claude -p --output-format json --model sonnet under the hood, gets a JSON response, and extracts the result.
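A minimal sketch of what such a call can look like, based on the command above; the provider's actual internals and the JSON field names may differ:

import json, os, subprocess

def call_claude_code(prompt: str, model: str = "sonnet") -> str:
    env = dict(os.environ)
    env.pop("ANTHROPIC_API_KEY", None)   # use subscription auth, not API billing
    proc = subprocess.run(
        ["claude", "-p", "--output-format", "json", "--model", model],
        input=prompt, capture_output=True, text=True, env=env, check=True,
    )
    payload = json.loads(proc.stdout)
    return payload["result"]   # "result" field assumed to hold the model's reply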

Model mixing

The three model roles in voicetest serve different purposes, and you can mix models to optimize for speed and accuracy:

Simulator (simulator): Plays the user persona. This model follows a script (the user_prompt from your test case), so it does not need to be particularly capable. Haiku is a good fit – it is fast and consumes less of your quota.

Agent (agent): Plays the role of your voice agent, following the prompts and transition logic from your imported config. Sonnet handles this well.

Judge (judge): Evaluates the full transcript against your metrics and produces a score from 0.0 to 1.0 with written reasoning. This is where accuracy matters most. Sonnet is reliable here; Opus is worth it if you need the highest-fidelity judgments.

A practical configuration:

[models]
agent = "claudecode/sonnet"
simulator = "claudecode/haiku"
judge = "claudecode/sonnet"

This keeps simulations fast while giving the judge enough capability to produce accurate scores.

Cost comparison

With API billing (e.g., through OpenRouter or direct Anthropic API), a test suite of 20 LLM tests at ~15 turns each, using Sonnet for judging, costs roughly $2-5 per run depending on transcript length. Run that 10 times a day during development and you are looking at $20-50/day in API costs.

With a Claude Pro ($20/month) or Max ($100-200/month) subscription, the same tests run against your plan’s usage allowance. For teams already paying for Claude Code as a development tool, the marginal cost of running voice agent tests is zero.

The tradeoff: API calls are parallelizable and have predictable throughput. Claude Code passthrough runs sequentially (one CLI invocation at a time) and is subject to your plan’s rate limits. For CI pipelines with large test suites, API billing may still make more sense. For local development and smaller suites, the subscription route is hard to beat.

When to use which

Scenario                                    Recommended backend
Local development, iterating on prompts     claudecode/*
Small CI suite (< 10 tests)                 claudecode/*
Large CI suite, parallel runs               API provider (OpenRouter, Anthropic)
Team with shared API budget                 API provider
Solo developer with Max subscription        claudecode/*

Getting started

uv tool install voicetest
voicetest demo

The demo command loads a sample healthcare receptionist agent with test cases so you can try it without any setup.

Voicetest is open source under Apache 2.0. GitHub. Docs.


How to Test a Retell Agent in CI with GitHub Actions

Sunday, February 08, 2026

Originally published on voicetest.dev.

Manual testing of voice agents does not scale. You click through a few conversations in the Retell dashboard, confirm the agent sounds right, and ship it. Then someone updates a prompt, a transition breaks, and you find out from a customer complaint. The feedback loop is days, not minutes.

Voicetest fixes this. It imports your Retell Conversation Flow, simulates multi-turn conversations using an LLM, and evaluates the results with an LLM judge that produces scores and reasoning. You can run it locally, but the real value comes from running it in CI on every push.

This post walks through the full setup: from installing voicetest to a working GitHub Actions workflow that tests your Retell agent automatically.

Step 1: Install voicetest

Voicetest is a Python CLI tool published on PyPI. The recommended way to install it is as a uv tool:

uv tool install voicetest

Verify it works:

voicetest --version

Step 2: Export your Retell agent

In the Retell dashboard, open your Conversation Flow and export it as JSON. Save it to your repo:

agents/
  receptionist.json

The exported JSON contains your nodes, edges, prompts, transition conditions, and tool definitions. Voicetest auto-detects the Retell format by looking for start_node_id and nodes in the JSON.
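That detection heuristic is roughly (illustrative, not voicetest's exact code):

import json

def looks_like_retell_flow(path: str) -> bool:
    with open(path) as f:
        data = json.load(f)
    return "start_node_id" in data and "nodes" in data   # Retell Conversation Flow markers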

If you prefer to pull the config programmatically (useful for keeping tests in sync with the live agent), voicetest can also fetch directly from the Retell API. Set your Retell API key in the environment:

export RETELL_API_KEY=your_key_here

Step 3: Write test cases

Create a test file with one or more test cases. Each test defines a simulated user persona, what the user will do, and metrics for the LLM judge to evaluate:

[
  {
    "name": "Billing inquiry",
    "user_prompt": "Say you are Jane Smith with account 12345. You're confused about a charge on your bill and want help understanding it.",
    "metrics": [
      "Agent greeted the customer and addressed the billing concern.",
      "Agent was helpful and professional throughout."
    ],
    "type": "llm"
  },
  {
    "name": "No PII in transcript",
    "user_prompt": "You are Jane with SSN 123-45-6789. Verify your identity.",
    "includes": ["verified", "identity"],
    "excludes": ["123-45-6789", "123456789"],
    "type": "rule"
  }
]

There are two test types. LLM tests ("type": "llm") run a full multi-turn simulation and then pass the transcript to an LLM judge, which scores each metric from 0.0 to 1.0 with written reasoning. Rule tests ("type": "rule") use deterministic pattern matching – checking that the transcript includes required strings, excludes forbidden ones, or matches regex patterns. Rule tests are fast and free, good for compliance checks like PII leakage.

Save this as agents/receptionist-tests.json.

Step 4: Configure your LLM backend

Voicetest uses LiteLLM model strings, so any provider works. Create a .voicetest/settings.toml in your project root:

[models]
agent = "groq/llama-3.3-70b-versatile"
simulator = "groq/llama-3.1-8b-instant"
judge = "groq/llama-3.3-70b-versatile"

[run]
max_turns = 20
verbose = false

The simulator model plays the user. It should be fast and cheap since it just follows the persona script. The judge model evaluates the transcript and should be accurate. The agent model plays the role of your voice agent, following the prompts and transitions from your Retell config.

Step 5: Run locally

Before setting up CI, verify everything works:

export GROQ_API_KEY=your_key_here

voicetest run \
  --agent agents/receptionist.json \
  --tests agents/receptionist-tests.json \
  --all

You will see each test run, the simulated conversation, and the judge’s scores. Fix any test definitions that do not match your agent’s behavior, then commit everything:

git add agents/ .voicetest/settings.toml
git commit -m "Add voicetest config and test cases"

Step 6: Set up GitHub Actions

Add your API key as a repository secret. Go to Settings > Secrets and variables > Actions, and add GROQ_API_KEY.

Then create .github/workflows/voicetest.yml:

name: Voice Agent Tests

on:
  push:
    paths:
      - "agents/**"
  pull_request:
    paths:
      - "agents/**"
  workflow_dispatch:

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Set up Python
        run: uv python install 3.12

      - name: Install voicetest
        run: uv tool install voicetest

      - name: Run voice agent tests
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        run: |
          voicetest run \
            --agent agents/receptionist.json \
            --tests agents/receptionist-tests.json \
            --all

The workflow triggers on any change to files in agents/, which means prompt edits, new test cases, or config changes all trigger a test run. The workflow_dispatch trigger lets you run tests manually from the GitHub UI.

What’s next

Once you have CI working, there are a few things worth exploring:

Global compliance metrics. Voicetest supports global metrics for checks like HIPAA and PCI-DSS that run against every test's transcript, in addition to each test's own metrics. These catch issues like agents accidentally reading back credit card numbers or disclosing PHI.

Format conversion. If you ever want to move from Retell to VAPI or LiveKit, voicetest can convert your agent config between platforms via its AgentGraph intermediate representation:

voicetest export --agent agents/receptionist.json --format vapi-assistant

The web UI. For a visual interface during development, run voicetest serve and open http://localhost:8000. You get a dashboard with test results, transcripts, and scores.

Voicetest is open source under Apache 2.0. GitHub. Docs.


Testing voice AI agents with DSPy signatures and auto-healing graphs

Tuesday, February 03, 2026

Platforms like Retell, VAPI, and LiveKit have made it straightforward to build phone-based AI assistants. But testing these agents before they talk to real customers remains painful: platform-specific formats, per-minute simulation charges, and no way to iterate on prompts without bleeding money.

voicetest is a test harness that solves this by running agent simulations with your own LLM keys. But beneath the surface, it’s also a proving ground for something more interesting: auto-healing agent graphs that recover from test failures and optimize themselves using JIT synthetic data.

voicetest CLI demo

The interactive shell loads agents, configures models, and runs test simulations against DSPy-based judges.

The architecture: AgentGraph as IR

All platform formats (Retell CF, Retell LLM, VAPI, Bland, LiveKit, XLSForm) compile down to a unified intermediate representation called AgentGraph:

class AgentGraph:
    nodes: dict[str, AgentNode]      # State-specific nodes
    entry_node_id: str               # Starting node
    source_type: str                 # Import source
    source_metadata: dict            # Platform-specific data
    default_model: str | None        # Model from import

class AgentNode:
    id: str
    state_prompt: str                # State-specific instructions
    tools: list[ToolDefinition]      # Available actions
    transitions: list[Transition]    # Edges to other states

This IR enables cross-platform testing and format conversion as a side effect. Import a Retell agent, test it, export to VAPI format. But more importantly, it gives us a structure we can reason about programmatically.
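As a sketch, a minimal graph can be built by hand, assuming dataclass-style constructors for the classes above (the real models may differ):

greet = AgentNode(
    id="greet",
    state_prompt="Greet the caller and ask how you can help.",
    tools=[],
    transitions=[],   # real nodes carry Transition edges with conditions
)

graph = AgentGraph(
    nodes={"greet": greet},
    entry_node_id="greet",
    source_type="manual",
    source_metadata={},
    default_model=None,
)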

DSPy signatures for structured LLM calls

Every LLM interaction in voicetest goes through DSPy signatures. This isn’t just for cleaner code—it’s the foundation for prompt optimization.

The MetricJudgeSignature handles LLM-as-judge evaluation:

class MetricJudgeSignature(dspy.Signature):
    transcript: str = dspy.InputField()
    criterion: str = dspy.InputField()
    # Outputs
    score: float = dspy.OutputField()         # 0-1 continuous score
    reasoning: str = dspy.OutputField()       # Explanation
    confidence: float = dspy.OutputField()    # 0-1 confidence

Continuous scores (not binary pass/fail) are critical. A 0.65 and a 0.35 both “fail” a 0.7 threshold, but they represent very different agent behaviors. This granularity becomes training signal later.

The UserSimSignature generates realistic caller behavior:

class UserSimSignature(dspy.Signature):
    persona: str = dspy.InputField()            # Identity/Goal/Personality
    conversation_history: str = dspy.InputField()
    current_agent_message: str = dspy.InputField()
    turn_number: int = dspy.InputField()
    # Outputs
    should_continue: bool = dspy.OutputField()
    message: str = dspy.OutputField()
    reasoning: str = dspy.OutputField()

Each graph node gets its own StateModule registered as a DSPy submodule:

class ConversationModule(dspy.Module):
    def __init__(self, graph: AgentGraph):
        super().__init__()
        self._state_modules: dict[str, StateModule] = {}
        for node_id, node in graph.nodes.items():
            state_module = StateModule(node, graph)
            setattr(self, f"state_{node_id}", state_module)
            self._state_modules[node_id] = state_module

This structure means the entire agent graph is a single optimizable DSPy module. We can apply BootstrapFewShot or MIPROv2 to tune state transitions and response generation.
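A sketch of what that optimization step could look like, assuming ConversationModule defines a forward() that routes turns through the current StateModule, and that a trainset of dspy.Example conversations already exists (the metric here is illustrative, not voicetest's):

import dspy

def conversation_metric(example, prediction, trace=None):
    # Illustrative: treat a 0-1 judge score attached to the prediction as fitness.
    return float(prediction.score) >= 0.7

module = ConversationModule(graph)   # graph: an imported AgentGraph
optimizer = dspy.BootstrapFewShot(metric=conversation_metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(module, trainset=trainset)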


Auto-healing the agent graph on test failures (coming soon)

When a test fails, the interesting question is: what should change? The failure might indicate a node prompt needs tweaking, or that the graph structure itself is wrong for the conversation flow.

The planned approach:

  1. Failure analysis: Parse the transcript and judge output to identify where the agent went wrong. Was it a bad response in a specific state? A transition that fired incorrectly? A missing edge case?

  2. Mutation proposals: Based on the failure mode, generate candidate fixes. For prompt issues, suggest revised state prompts. For structural problems, propose adding/removing transitions or splitting nodes.

  3. Validation: Run the mutation against the failing test plus a regression suite. Only accept changes that fix the failure without breaking existing behavior.

This isn’t implemented yet, but the infrastructure is there: the AgentGraph IR makes mutations straightforward, and the continuous metric scores give us a fitness function for evaluating changes.

JIT synthetic data for optimization

DSPy optimizers like MIPROv2 need training examples. For voice agents, we generate these on demand:

  1. Test case expansion: Each test case defines a persona and success criteria. We use the UserSimSignature to generate variations—different phrasings, edge cases, adversarial inputs.

  2. Trajectory mining: Successful test runs become positive examples. Failed runs (with partial transcripts) become negative examples with the failure point annotated.

  3. Score-based filtering: Because metrics produce continuous scores, we can select examples near decision boundaries (scores around the threshold) for maximum training signal.
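A minimal sketch of the score-based filtering in step 3, assuming each mined example carries a judge score between 0 and 1 (the field names are illustrative):

def near_boundary(examples, threshold=0.7, margin=0.15):
    """Keep mined examples whose judge score lands near the pass threshold."""
    # Scores far from the threshold are easy cases; the band around it is
    # where optimization gets the most signal per example.
    return [ex for ex in examples if abs(ex["score"] - threshold) <= margin]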

The current implementation has the infrastructure:

# Mock data generation for testing the optimization pipeline
simulator._mock_responses = [
    SimulatorResponse(message="Hello, I need help.", should_end=False),
    SimulatorResponse(message="Thanks, that's helpful.", should_end=False),
    SimulatorResponse(message="", should_end=True),
]
metric_judge._mock_results = [
    MetricResult(metric=m, score=0.9, passed=True)
    for m in test_case.metrics
]

The production version will generate real synthetic data by sampling from the UserSimSignature with temperature variations and persona mutations.

Judgment pipeline

Three judges evaluate each transcript:

Rule Judge (deterministic, zero cost): substring includes/excludes, regex patterns. Fast pre-filter for obvious failures.

Metric Judge (LLM-based, semantic): evaluates each criterion with continuous scores. Per-metric threshold overrides enable fine-grained control. Global metrics (like HIPAA compliance) run on every test automatically.

Flow Judge (optional, informational): validates that node transitions made logical sense given the conversation. Uses the FlowValidationSignature:

class FlowValidationSignature(dspy.Signature):
    graph_structure: str = dspy.InputField()
    transcript: str = dspy.InputField()
    nodes_visited: list[str] = dspy.InputField()
    # Outputs
    flow_valid: bool = dspy.OutputField()
    issues: list[str] = dspy.OutputField()
    reasoning: str = dspy.OutputField()

Flow issues don’t fail tests but get tracked for debugging. A pattern of flow anomalies suggests the graph structure itself needs attention.

voicetest web UI

The web UI visualizes agent graphs, manages test cases, and streams transcripts in real-time during execution.

CI/CD integration

Voice agents break in subtle ways. A prompt change that improves one scenario can regress another. voicetest runs in GitHub Actions:

name: Voice Agent Tests
on:
  push:
    paths: ["agents/**"]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - run: uv tool install voicetest
      - run: voicetest run --agent agents/receptionist.json --tests agents/tests.json --all
        env:
          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}

Results persist to DuckDB, enabling queries across test history:

SELECT
    agents.name,
    COUNT(*) as total_runs,
    AVG(CASE WHEN results.passed THEN 1.0 ELSE 0.0 END) as pass_rate
FROM results
JOIN runs ON results.run_id = runs.id
JOIN agents ON runs.agent_id = agents.id
GROUP BY agents.name

What’s next

The current release handles the testing workflow: import agents, run simulations, evaluate with LLM judges, integrate with CI. The auto-healing and optimization features are in POC stage.

The roadmap:

  • v0.3: JIT synthetic data generation from test case personas
  • v0.4: DSPy optimization integration (MIPROv2 for state prompts)
  • v0.5: Auto-healing graph mutations with regression protection
Try the current release:

uv tool install voicetest
voicetest demo --serve

Code at github.com/voicetestdev/voicetest. API docs at voicetest.dev/api. Apache 2.0 licensed.


wt: A Git Worktree Orchestrator for Parallel AI Agent Development

Monday, January 26, 2026

Concurrent AI agent development introduces coordination problems when multiple agents modify the same repository. File conflicts occur when agents operate on overlapping code paths, and resolving these conflicts consumes agent context that would otherwise be applied to the task.

Git worktrees provide filesystem isolation while sharing repository state—each worktree maintains an independent working directory and branch reference without duplicating the object database. However, the native git worktree interface requires verbose commands, manual cleanup, and lacks primitives for managing multiple concurrent sessions.

wt is a CLI tool that addresses these limitations.

wt demo showing workspace creation and navigation

Overview

wt wraps git worktree operations in an interface designed for multi-agent workflows:

# Create an isolated workspace
wt new feature/auth

# Spawns a subshell with modified working directory and branch
# Shell prompt reflects workspace context
(wt:feature/auth) $ claude

# On completion, exit subshell and merge
exit
git merge feature/auth

Workspaces provide complete filesystem isolation. Each agent operates on an independent working directory, and changes integrate via standard git merge or rebase operations.

Workspace Management

wt provides both interactive and direct workspace access:

wt ls                  # Interactive picker (TUI) for workspace selection
wt use feature/auth    # Enter a specific workspace directly
wt which               # Display current workspace context
wt rm feature/auth     # Remove worktree and optionally delete branch

Tmux Session Management

wt integrates with tmux to coordinate multiple concurrent agent processes:

wt session new my-sprint
wt session add feature/auth
wt session add bugfix/header
wt session add feature/payments

Each workspace maps to a tmux window with a configurable pane layout: 2-pane (agent CLI + terminal) or 3-pane (agent + terminal + editor).

The wt session watch command displays agent activity state by monitoring pane output:

● feature/auth      (running)
○ bugfix/header     (idle)
● feature/payments  (running)

Activity detection uses output buffer analysis to distinguish active processing from prompt-waiting states.

Claude Code Integration

wt includes a /do skill for Claude Code that implements issue-driven workflows:

/do gh 123          # GitHub issue #123
/do sc 45678        # Shortcut story 45678

The skill executes the following operations:

  1. Fetch issue/story metadata from the respective API
  2. Create a worktree with a branch name derived from the issue identifier
  3. Populate agent context with issue description and acceptance criteria
  4. On task completion, create a commit with the issue reference in the message

Configuration

Configuration follows a precedence chain: CLI flags → repo-local .wt.toml → global ~/.wt/config.toml → defaults.

# .wt.toml (repository root)
panes = 3
agent_cmd = "claude"
editor_cmd = "nvim"

Files listed under a # wt copy marker in .gitignore are symlinked from the main worktree into each spawned worktree:

# wt copy
.env
.env.local
credentials.json

This mechanism handles environment files and credentials that are gitignored but required at runtime.

Implementation

wt is implemented in Rust. The crate structure:

  • WorktreeManager: Wraps `git worktree` subcommands, maintains branch-to-directory mappings, handles cleanup of stale worktrees
  • TmuxManager: Interfaces with tmux via `tmux` CLI commands, monitors pane output buffers for activity detection
  • ShellEnvironment: Spawns subshells (bash, zsh, fish) with modified `$PWD` and environment variables reflecting workspace state
  • SessionState: Persists session metadata to JSON, reconciles with actual tmux state on load

Branch names containing forward slashes require path sanitization. Git permits feature/auth as a branch name, but using this directly as a directory path would create nested directories. wt normalizes slashes to double-dashes in directory names (feature--auth) while preserving the git ref name.

Installation

Pre-compiled binaries are available for macOS and Linux (x86_64, aarch64). Building from source:

git clone https://github.com/pld/wt
cd wt
cargo build --release

The binary includes an install subcommand that configures shell aliases and installs the Claude Code skill definition.

Discussion

Single-agent workflows operate adequately with standard git branching. Multi-agent parallelism introduces coordination overhead that scales with agent count: conflict resolution, context switching between agent sessions, and manual worktree lifecycle management.

wt reduces this overhead by mapping each agent to an isolated worktree and providing primitives for session orchestration. The tradeoff is an additional abstraction layer over git; the benefit is that agents operate independently until explicit merge points.

Source: github.com/pld/wt

