A model in the suite · Anthropic

Sonnet 4.6

Anthropic · Claude · 2026-04-25

52/100

Strict suite average · Legacy 65 · 1 benchmark

Single-benchmark historical packet. The Dingo result shows broad artifact production but weak research/regulatory grounding and source-image usage, so it remains in interesting-but-unreliable territory.

Where it ranks

Sonnet 4.6 against the field

ANT · Claude Opus 4.8 · 81
OAI · GPT-5.5 · 71
GOO · Gemini 3.5 Flash (High) Fast · 56
ANT · Opus 4.7 · 54
ANT · Sonnet 4.6 · 52
OAI · GPT-5.4 · 51
GOO · Gemini 3.1 Pro · 38

Strict suite average · Legacy average · Scored 0–100 · higher is better

The runs

How Sonnet 4.6 handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

52 · legacy 65

Interesting but Unreliable

The submission is complete and strategically aware, but strict scoring should not treat completion as reliability. The evidence package lists only one URL in its sources, the regulatory analysis is broad and under-cited, the required visual artifacts ignore the provided source imagery, the spreadsheets have no charts, and the sales, deck, and dashboard outputs are valid but underproduced.

1. Claude Opus 4.8 · 80
2. GPT-5.5 · 78
3. Gemini 3.5 Flash (High) Fast · 62
4. Opus 4.7 · 54
5. Sonnet 4.6 · 52
6. Gemini 3.1 Pro · 38

What it nailed

All required files exist as real artifacts with exact filenames.
Assumptions file reconciles revenue, budget, launch date, attach-rate, and source-data contradictions.
Investor FAQ candidly addresses TAM, ethics, Alaska mismatch, import-created demand, and rehoming gaps.
Personas identify non-buyer curiosity traffic and adjacent wolfdog demand.

Where it slipped

00_sources.md has only one URL and relies on unverified Perplexity-style research notes.
Deck, PDF one-pager, and dashboard ignore provided source images despite the manifest claiming usage.
Jurisdiction tables do not adequately separate ownership, import, transport, quarantine, and local restrictions.
Sales one-pager and board deck are valid files but visually underproduced.
Spreadsheets have formulas but no charts.
Raw model output/transcript evidence is absent from the evaluation package.

No Research · Weak Regulatory Coverage · Ignored Source Images · Fake Visual Source Of Truth · Empty Raw Model Output Evidence Confidence Cap