
GitHub Copilot CLI Adds Rubber Duck Feature for Cross-Model AI Code Review

2026/04/09 01:06
3 min read


Jessie A Ellis Apr 08, 2026 17:06

GitHub's new Rubber Duck feature pairs Claude models with GPT-5.4 for independent code review, closing 74.7% of the performance gap between Sonnet and Opus.


GitHub just shipped a feature that addresses one of the most frustrating problems with AI coding assistants: they make confident mistakes that snowball into bigger messes. The new Rubber Duck capability, now available in experimental mode for Copilot CLI, brings in a second AI model from a completely different family to critique the primary agent's work.

Here's the setup: when you're running a Claude model as your main orchestrator, Rubber Duck deploys GPT-5.4 as an independent reviewer. The goal isn't just catching typos—it's questioning architectural decisions before they become expensive technical debt.

The Numbers Worth Knowing

GitHub tested this on SWE-Bench Pro, a benchmark of gnarly real-world coding problems from open-source repos. Claude Sonnet 4.6 paired with Rubber Duck closed 74.7% of the performance gap between Sonnet and the more expensive Opus model running solo.

The gains weren't uniform. Rubber Duck showed the strongest results on complex problems spanning 3+ files that typically require 70+ steps to resolve. On these harder tasks, the Sonnet + Rubber Duck combo scored 3.8% higher than baseline Sonnet, jumping to 4.8% higher on the most difficult problems identified across three trials.

What It Actually Catches

GitHub shared specific examples from their testing. In one OpenLibrary case, Rubber Duck flagged that a proposed scheduler would start and immediately exit without running any jobs—and spotted that even if fixed, one scheduled task contained an infinite loop.
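The article doesn't show the flagged code, but the bug class is familiar: a scheduler whose run loop guards on a flag that is never set, so it returns before executing a single job. A minimal hypothetical sketch (all names invented for illustration):

```python
import time

class Scheduler:
    """Hypothetical illustration: the run loop checks a flag that is
    never set to True, so start() exits immediately."""

    def __init__(self, jobs):
        self.jobs = jobs
        self.running = False   # bug: nothing ever flips this to True
        self.executed = []

    def start(self):
        # Loop condition is already False on entry, so no job ever runs
        # and no error is raised -- the failure is completely silent.
        while self.running:
            for job in self.jobs:
                self.executed.append(job())
            time.sleep(1)

scheduler = Scheduler(jobs=[lambda: "reindex"])
scheduler.start()
print(scheduler.executed)  # [] -- the scheduler "ran" but did nothing
```

Because the scheduler exits cleanly, nothing in logs or tests flags the problem, which is exactly why a second reviewer questioning the control flow is useful.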

Another catch: a single-line bug in a Solr integration where a loop silently overwrote the same dictionary key on every iteration. Three of four facet categories were being dropped from search queries with zero errors thrown. That's the kind of bug that passes code review and then haunts you in production for months.
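GitHub didn't publish the Solr code, but the pattern it describes is easy to reconstruct: a loop that assigns to the same dictionary key on every pass, so only the final value survives. A hypothetical sketch of the bug and one possible fix (function and field names are invented):

```python
facet_fields = ["author", "subject", "language", "publisher"]

def build_params_buggy(fields):
    params = {}
    for field in fields:
        # Same key every iteration: each assignment silently
        # overwrites the previous one, so only "publisher" remains.
        params["facet.field"] = field
    return params

def build_params_fixed(fields):
    # Collect all categories under the key instead of overwriting.
    return {"facet.field": list(fields)}

print(build_params_buggy(facet_fields))  # {'facet.field': 'publisher'}
print(build_params_fixed(facet_fields))  # all four categories preserved
```

No exception is thrown and the query still returns results, just with three of four facet categories missing, which matches the article's "zero errors thrown" description.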

A third example involved a NodeBB email confirmation flow where three files all read from a Redis key that new code stopped writing to. The confirmation UI and cleanup paths would have broken silently on deploy.
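The stale-key pattern can be sketched without Redis at all: a writer is refactored to a new key name while readers still look up the old one. A hypothetical illustration using a plain dict as a stand-in for Redis (key names and functions invented):

```python
store = {}  # stand-in for Redis

def confirm_email_new(store, user_id, code):
    # Refactored writer: stores the code under a new key name.
    store[f"pending:{user_id}"] = code

def render_confirmation_ui(store, user_id):
    # Reader untouched by the refactor: still reads the old key,
    # so after deploy it always sees nothing -- and raises nothing.
    return store.get(f"confirm:{user_id}")

confirm_email_new(store, "42", "abc123")
print(render_confirmation_ui(store, "42"))  # None -- UI breaks silently
```

Since `dict.get` (like a Redis GET on a missing key) returns an empty result rather than an error, all three reader paths would appear healthy until a user actually tried to confirm an email.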

When It Kicks In

Rubber Duck activates at three checkpoints: after drafting a plan (where GitHub expects the biggest wins), after complex implementations, and after writing tests but before running them. The agent can also call for a critique when it gets stuck in a loop.

Users can trigger a review manually at any point. Copilot queries Rubber Duck, processes the feedback, and shows what changed and why.

The feature works with all Claude family models—Opus, Sonnet, and Haiku—as orchestrators. GitHub says they're already exploring other model family pairings, including options for when GPT-5.4 serves as the primary orchestrator.

To access Rubber Duck, install GitHub Copilot CLI and run the /experimental slash command. You'll need GPT-5.4 enabled on your account and a Claude model selected from the model picker. Feedback goes to GitHub's community discussion board.

Image source: Shutterstock
  • github
  • ai coding
  • copilot cli
  • developer tools
  • machine learning
