A model in the suite · Anthropic

Opus 4.7

Anthropic · Claude · historical runs from 2026-04-19 through 2026-06-01 staging

54/100
Strict suite average · Legacy 65 · 3 benchmarks

Visually capable but comparatively fragile on source integrity, operational judgment, and factual discipline. The Artemis run preserves the evidence of visual strength but still carries a dedicated factual re-evaluation requirement.
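The 54/100 headline is consistent with a plain unweighted mean of the three strict benchmark scores listed further down this page. The sketch below assumes equal weighting, which the page does not state. Under that same assumption the three listed legacy scores average to 63 rather than the displayed 65, so the legacy figure may rely on a different weighting or run set.

```python
from statistics import mean

# Per-benchmark scores as shown on this page: (strict, legacy).
scores = {
    "Dingo & Co. Knowledge Work": (54, 67),
    "Car Wash Operations": (48, 62),
    "Artemis II Mission Visualization": (60, 60),
}

# Assumption: the suite average is an unweighted mean, rounded to an integer.
strict_avg = round(mean(strict for strict, _ in scores.values()))
legacy_avg = round(mean(legacy for _, legacy in scores.values()))

print(f"strict: {strict_avg}, legacy: {legacy_avg}")
```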


Opus 4.7 against the field

How Opus 4.7 handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

54 · legacy 67
Interesting but Unreliable

The run is complete, polished, and strategically useful, but the strict score is pulled below competent because it misses central Dingo legal/regulatory canaries: unsupported Alaska/NCI permit-path framing, several unverified case-by-case jurisdiction claims, stale or unsupported market/pricing research, and material cross-document drift in import counts and budget release amounts.

Radar axes: Instruction Following · Artifact Validity · Source Integrity · Research Grounding · Semantic Judgment · Quantitative Reasoning · Visual Storytelling · UX Reviewability · Production Readiness · Speed

1. Claude Opus 4.8 · 80
2. GPT-5.5 · 78
3. Gemini 3.5 Flash (High) Fast · 62
4. Opus 4.7 · 54
5. Sonnet 4.6 · 52
6. Gemini 3.1 Pro · 38

What it nailed

  • Completed all 23 required deliverables in real artifact formats with no validator errors.
  • Used provided source imagery in the board deck, sales one-pager, and dashboard.
  • Produced practical GTM, investor FAQ, persona, and email work rather than generic template filler.
  • Reconciled several core finance figures including revenue, unit count, CAC, LTV, burn, and cash.

Where it slipped

  • Claims an Alaska NCI permit path and several NCI case-by-case postures without official support; Alaska is directly contradicted by sampled ADFG evidence.
  • Uses stale or unsupported external market and pricing claims, including Grand View market sizes and Halo/SpotOn pricing anchors.
  • NCI completed imports drift between 7 and 16 across artifacts.
  • Executive summary asks for a $240K gate-1 release while deck and GTM ask for a $185K Phase-1 release.
  • Both spreadsheet workbooks contain formulas and structure but no charts.
  • Raw model output/transcript evidence is absent from the evaluation package.
Illegal Or Unverified Claims · Material Number Inconsistency · Fabricated Or Broken Sources · Publication Fact Errors · Empty Raw Model Output Evidence Confidence Cap
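The cross-document drift described above (import counts of 7 versus 16, a $240K versus $185K release ask) is the kind of inconsistency a simple consistency pass could surface. The following is a minimal sketch, not the suite's actual checker: it assumes key figures have already been extracted from each deliverable as (artifact, metric, value) triples. The metric names and the `artifact_a` / `artifact_b` labels are hypothetical, since the page does not say which artifacts carried which import count.

```python
from collections import defaultdict

# Hypothetical extraction output: (artifact, metric, value).
# Values mirror the drift reported above; artifact_a / artifact_b are placeholders.
claims = [
    ("artifact_a", "nci_completed_imports", 7),
    ("artifact_b", "nci_completed_imports", 16),
    ("executive_summary", "release_request_usd", 240_000),
    ("board_deck", "release_request_usd", 185_000),
    ("gtm_plan", "release_request_usd", 185_000),
]


def find_drift(claims):
    """Return metrics that appear with more than one distinct value."""
    by_metric = defaultdict(lambda: defaultdict(list))
    for artifact, metric, value in claims:
        by_metric[metric][value].append(artifact)
    return {
        metric: {value: sorted(artifacts) for value, artifacts in values.items()}
        for metric, values in by_metric.items()
        if len(values) > 1
    }


drift = find_drift(claims)
for metric, values in drift.items():
    print(metric, values)
```

Grouping by value rather than just counting distinct values keeps the list of offending artifacts attached to each figure, which makes the mismatch easier to trace back to a specific deliverable.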

Car Wash Operations

A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.

48 · legacy 62
Failed Core Purpose

Opus shipped a complete, fast, good-looking audit package with useful provenance, and it handled Terrence Blackwood better than the GPT-5.4 run. The strict rescore is harsh because Mickey Mouse, Test Customer, and Asdf Asdf survived as canonical customers, the SVC-007 file was not parsed despite report claims, the seven planted contact-conflict duplicates stayed split, and the report overstated what the code actually did. That combination fails the central trust test for an operational migration.

Radar axes: Instruction Following · Artifact Validity · Source Integrity · Semantic Judgment · Quantitative Reasoning · UX Reviewability · Production Readiness · Speed

1. Claude Opus 4.8 · 86
2. GPT-5.5 · 55
3. Gemini 3.5 Flash (High) Fast · 51
4. GPT-5.4 · 51
5. Opus 4.7 · 48

What it nailed

  • Completed the expected artifact set quickly.
  • Produced the stronger reviewer UI in the April 19 cross-review set.
  • Flagged Terrence Blackwood as unmatched rather than silently treating the order as clean canonical data.
  • Used source provenance tables that external review considered genuinely useful.

Where it slipped

  • Promoted Mickey Mouse, Test Customer, and Asdf Asdf to canonical customers.
  • Missed the SVC-007 conflict because deshawn_services.tsv was not actually parsed.
  • Kept all seven planted duplicate customer pairs separate.
  • Report/docs claimed coverage the implementation did not deliver.
Misses Three Or More Primary Canaries · Promotes Ghost Records · Report Honesty Failure
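The ghost-record failure above comes down to placeholder names being promoted instead of quarantined. As a rough illustration only, the sketch below routes the three names cited on this page into a quarantine list before anything is treated as canonical. The fixed denylist and the `triage` helper are assumptions for the example, and "Jane Example" is a made-up clean record; a real migration would need broader heuristics than an exact-name match.

```python
import re

# Placeholder names cited in this benchmark's findings (lowercased).
PLACEHOLDER_NAMES = {"mickey mouse", "test customer", "asdf asdf"}


def normalize(name):
    """Collapse whitespace and lowercase so trivial variants still match."""
    return re.sub(r"\s+", " ", name).strip().lower()


def triage(customer_names):
    """Split names into (canonical, quarantined) using the denylist."""
    canonical, quarantined = [], []
    for name in customer_names:
        if normalize(name) in PLACEHOLDER_NAMES:
            quarantined.append(name)
        else:
            canonical.append(name)
    return canonical, quarantined


# "Jane Example" is a hypothetical clean record, not from the dataset.
names = ["Mickey Mouse", "Test  Customer", " asdf asdf", "Jane Example"]
canonical, quarantined = triage(names)
print(canonical)
print(quarantined)
```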

Artemis II Mission Visualization

A fact sheet plus an interactive 3D visualization of the Artemis II mission. Tests factual grounding, source integrity, and the ability to dramatize the hard beats — launch, staging, re-entry, recovery.

60 · legacy 60
Competent Scaffold

The visualization is the strongest video-facing artifact in the prior Artemis set: cinematic full-screen Three.js, component inspection, camera modes, staged screenshots, launch/ascent, staging, TLI, lunar flyby, max-distance, re-entry, and trajectory views. Strict scoring is capped because the fact sheet and visualization include no traceable source URLs, directly contradict NASA's Artemis II CubeSat evidence by claiming no CubeSats flew, place major lunar-flyby timings far from NASA's official EDT/UTC sequence, and include unsupported anomaly/color/patch/crater/social-media details. The result is a visually impressive but source-unsafe scaffold.

Radar axes: Instruction Following · Artifact Validity · Source Integrity · Research Grounding · Semantic Judgment · Spatial Reasoning · Visual Storytelling · UX Reviewability · Production Readiness · Quantitative Reasoning · Speed

1. GPT-5.5 · 79
2. Claude Opus 4.8 · 76
3. Opus 4.7 · 60
4. Gemini 3.5 Flash (High) Fast · 54

What it nailed

  • Best visual storytelling of the two prior Artemis artifacts.
  • Cinematic full-screen Three.js presentation with camera modes, component inspection, trajectory toggle, labels, mission event list, and staged screenshot set.
  • Detailed procedural SLS/Orion model and stronger video/post b-roll value than GPT-5.5.
  • Broadly covers launch, ascent, staging, TLI, lunar flyby, max distance, re-entry, and full trajectory.

Where it slipped

  • No source URLs or source list were found in the fact sheet or visualization.
  • Material public-facing CubeSat claim contradicts NASA primary sources.
  • Major lunar-flyby timing values differ from NASA's official updates.
  • Several colorful anomaly, patch, livery, crater, and social-media claims are unsupported in this pass.
  • No formal historical scorecard or raw model output was found.
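The first finding above, the absence of any source URLs, is the easiest of these to check mechanically. Here is a minimal sketch, assuming the fact sheet is available as plain text; the regular expression is a deliberately loose pattern and the two sample strings are invented for illustration. It only detects whether links exist, not whether they actually support the claims next to them.

```python
import re

# Loose pattern for http(s) links; stops at whitespace and common closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def extract_source_urls(text):
    """Return the distinct URLs found in the text, sorted."""
    return sorted(set(URL_PATTERN.findall(text)))


def has_traceable_sources(text):
    """True if the text contains at least one URL."""
    return bool(extract_source_urls(text))


# Invented sample text, not taken from the evaluated artifacts.
unsourced = "Artemis II fact sheet: crew, timeline, and trajectory overview."
sourced = "Mission updates are published at https://www.nasa.gov/ by the agency."

print(has_traceable_sources(unsourced))
print(extract_source_urls(sourced))
```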

From the run