A model in the suite · Anthropic

Sonnet 4.6

Anthropic · Claude · 2026-04-25

52/100
Strict suite average · Legacy 65 · 1 benchmark

Single-benchmark historical packet. The Dingo result shows broad artifact production but weak research/regulatory grounding and source-image usage, so it remains in interesting-but-unreliable territory.


Sonnet 4.6 against the field

How Sonnet 4.6 handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

52 · legacy 65
Interesting but Unreliable

The submission is complete and strategically aware, but strict scoring should not treat completion as reliability. The evidence package lists only one URL in its sources, the regulatory analysis is broad and under-cited, the required visual artifacts ignore the provided source imagery, the spreadsheets have no charts, and the sales/deck/dashboard outputs are valid but underproduced.

Radar axes: Instruction Following · Artifact Validity · Source Integrity · Research Grounding · Semantic Judgment · Quantitative Reasoning · Visual Storytelling · UX Reviewability · Production Readiness · Speed
  1. Claude Opus 4.8: 80
  2. GPT-5.5: 78
  3. Gemini 3.5 Flash (High) [Fast]: 62
  4. Opus 4.7: 54
  5. Sonnet 4.6: 52
  6. Gemini 3.1 Pro: 38

What it nailed

  • All required files exist as real artifacts with exact filenames.
  • Assumptions file reconciles revenue, budget, launch date, attach-rate, and source-data contradictions.
  • Investor FAQ candidly addresses TAM, ethics, Alaska mismatch, import-created demand, and rehoming gaps.
  • Personas identify non-buyer curiosity traffic and adjacent wolfdog demand.

Where it slipped

  • 00_sources.md has only one URL and relies on unverified Perplexity-style research notes.
  • Deck, PDF one-pager, and dashboard ignore provided source images despite the manifest claiming usage.
  • Jurisdiction tables do not adequately separate ownership, import, transport, quarantine, and local restrictions.
  • Sales one-pager and board deck are valid files but visually underproduced.
  • Spreadsheets have formulas but no charts.
  • Raw model output/transcript evidence is absent from the evaluation package.
No Research · Weak Regulatory Coverage · Ignored Source Images · Fake Visual Source Of Truth · Empty Raw Model Output Evidence Confidence Cap