A model in the suite · OpenAI

GPT-5.5

OpenAI · GPT · historical runs from 2026-04-23 through 2026-06-01 staging

71/100
Strict suite average · Legacy 81 · 3 benchmarks

Strongest historical non-image packet in this backfill set, led by Dingo and Artemis. The Car Wash result keeps the strict average grounded because operational canary misses remain substantial; the Brick rover is preserved only as a single-prompt reference.
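
As a quick check on the headline number, the strict suite average is consistent with an unweighted mean of the three per-benchmark strict scores listed on this page, rounded to the nearest integer. This is only a sketch: equal weighting is an assumption, since the page does not state how the average is computed.

```python
# Strict per-benchmark scores as listed on this page.
strict_scores = {
    "Dingo & Co. Knowledge Work": 78,
    "Car Wash Operations": 55,
    "Artemis II Mission Visualization": 79,
}


def suite_average(scores):
    """Unweighted mean of benchmark scores (equal weighting is an assumption)."""
    return sum(scores.values()) / len(scores)


strict_mean = suite_average(strict_scores)
print(f"{strict_mean:.2f} -> {round(strict_mean)}")  # 70.67 -> 71
```

The same unweighted mean over the three legacy scores (87, 74, 79) gives 80 rather than the listed 81, so the legacy figure appears to use inputs or weighting that this page does not show.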



How GPT-5.5 handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

78 · legacy 87
Strong

This is the cleanest historical Dingo run: all 23 deliverables exist as real files, source integrity passed, regulatory/import ambiguity was handled unusually well, and the strategy work is coherent. It stays below the excellent band on the strict scale for four reasons: the board deck has a real PPTX XML/rendering defect, one visible NPS inconsistency crosses artifacts, the pricing research includes stale or imprecise claims, and there is no raw transcript evidence.

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Research Grounding, Semantic Judgment, Quantitative Reasoning, Visual Storytelling, UX Reviewability, Production Readiness, Speed

1. Claude Opus 4.8 · 80
2. GPT-5.5 · 78
3. Gemini 3.5 Flash (High) Fast · 62
4. Opus 4.7 · 54
5. Sonnet 4.6 · 52
6. Gemini 3.1 Pro · 38

What it nailed

  • Completed all 23 required deliverables as real files with valid types and preserved source integrity.
  • Handled dingo ownership, import-created demand, legal uncertainty, ethics, and Alaska/Australia mismatch as central operating constraints.
  • Used provided source imagery heavily in the deck, sales one-pager, and dashboard.
  • Delivered coherent GTM, board, pricing, risk, and investor-facing strategy with staged decisions and guardrails.

Where it slipped

  • Board deck contains invalid PPTX metadata XML because the Company value uses an unescaped ampersand, blocking Quick Look rendering.
  • Board deck slide 5 reports average NPS as 6.6 while source math and other artifacts use about 6.2.
  • Some pricing research was stale or imprecise, especially Halo membership pricing and PetSafe blended pricing.
  • Raw model output/transcript evidence is absent from the evaluation package.
PPTX Metadata XML Rendering Defect · Cross-Document Number Drift · Stale or Imprecise Pricing Claims · Empty Raw Model Output Evidence Confidence Cap
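
The unescaped-ampersand defect is easy to reproduce and to guard against. The sketch below is illustrative only: it assumes the Company value is "Dingo & Co." and uses a simplified, namespace-free `Properties`/`Company` structure, since the deck's actual metadata XML is not included in the evaluation notes.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumed value; the real metadata string is not shown in the source.
company = "Dingo & Co."

# A raw "&" is not legal in XML character data, so a strict parser rejects it.
broken = f"<Properties><Company>{company}</Company></Properties>"
try:
    ET.fromstring(broken)
    broken_parses = True
except ET.ParseError:
    broken_parses = False

# escape() rewrites "&", "<", and ">" as entities, giving well-formed XML.
fixed = f"<Properties><Company>{escape(company)}</Company></Properties>"
parsed_company = ET.fromstring(fixed).findtext("Company")

print(broken_parses)   # False
print(parsed_company)  # Dingo & Co.
```

Parsing every XML part of a generated PPTX with a strict parser before delivery would likely catch this class of defect, because strict consumers stop at the first malformed part.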

Car Wash Operations

A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.

55 · legacy 74
Interesting but Unreliable

This is the strongest inspected audit scaffold: complete artifacts, full source discovery, a working frontend, strong provenance, fake/test rejection, and an empirically verified idempotent rebuild. It still fails too many operational canaries for a high strict score: Terrence Blackwood became a canonical customer, SVC-007 was missed, department codes were dropped, status/payment enums stayed raw, canonical jobs were overcounted, and several name variants stayed split. The score sits at the cap imposed for multiple primary canary misses, not above it.

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Semantic Judgment, Quantitative Reasoning, UX Reviewability, Production Readiness, Speed

1. Claude Opus 4.8 · 86
2. GPT-5.5 · 55
3. Gemini 3.5 Flash (High) Fast · 51
4. GPT-5.4 · 51
5. Opus 4.7 · 48

What it nailed

  • Produced every expected artifact, including screenshots.
  • Discovered 465 of 465 source files and processed or partially processed almost all business-relevant files.
  • Rejected planted ghost/test records and preserved a large source-record provenance layer.
  • Passed an isolated idempotency rerun with identical counts.

Where it slipped

  • Created Terrence Blackwood as a canonical customer instead of an orphan review case.
  • Missed the DeShawn SVC-007 conflict and lacked a service-code column.
  • Dropped department/role code normalization and left raw status/payment values.
  • Overcounted jobs and retained several duplicate/nickname customer splits.
Misses Three or More Primary Canaries · Promotes Orphan Order
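
Two of the misses above (raw status values and an orphan order promoted to a canonical record), along with the idempotency check the run passed, can be illustrated with a small sketch. Everything here is hypothetical: the record shapes, the status variants, and the `build` function are stand-ins, not the benchmark's schema or the model's actual pipeline.

```python
# Hypothetical raw-to-canonical status map; the benchmark's real enum
# variants are not shown in the source.
STATUS_MAP = {
    "done": "COMPLETED",
    "complete": "COMPLETED",
    "completed": "COMPLETED",
    "cancelled": "CANCELLED",
    "canceled": "CANCELLED",
}


def build(customers, orders):
    """Return (canonical_orders, review_queue) without mutating the inputs."""
    known_ids = {c["id"] for c in customers}
    canonical, review = [], []
    for order in orders:
        status = STATUS_MAP.get(order["status"].strip().lower())
        if order["customer_id"] not in known_ids:
            # Orphan order: quarantine for review instead of inventing a customer.
            review.append({**order, "reason": "orphan_customer"})
        elif status is None:
            review.append({**order, "reason": "unknown_status"})
        else:
            canonical.append({**order, "status": status})
    return canonical, review


customers = [{"id": "C1", "name": "Example Customer"}]
orders = [
    {"id": "O1", "customer_id": "C1", "status": " Done "},
    {"id": "O2", "customer_id": "C9", "status": "completed"},  # no matching customer
    {"id": "O3", "customer_id": "C1", "status": "???"},
]

first = build(customers, orders)
second = build(customers, orders)  # rerun on the same inputs

print(len(first[0]), len(first[1]))  # 1 2
print(first == second)               # True
```

The point of the design is that anything the pipeline cannot confidently resolve lands in a review queue with a reason attached, and a rerun on unchanged inputs yields identical output.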

Artemis II Mission Visualization

A fact sheet plus an interactive 3D visualization of the Artemis II mission. Tests factual grounding, source integrity, and the ability to dramatize the hard beats — launch, staging, re-entry, recovery.

79 · legacy 79
Strong

The run produced a complete, runnable React/Vite/Three package with a separately maintained missionData.js source of truth, a detailed fact sheet, NASA-heavy citations, screenshots, desktop/mobile verification images, and mission-specific visual beats for launch, ascent, staging, TLI, lunar flyby, max distance, re-entry, splashdown, and recovery. It stays below excellent because several values need strict current-source cleanup, including closest lunar approach finalization, actual ascent milestone timings, official total miles after NASA's May 7 update, and a non-primary Orion helium-leak claim. The re-entry and recovery visuals are informative but still more app-like than cinematic.

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Research Grounding, Semantic Judgment, Spatial Reasoning, Visual Storytelling, UX Reviewability, Production Readiness, Quantitative Reasoning, Speed

1. GPT-5.5 · 79
2. Claude Opus 4.8 · 76
3. Opus 4.7 · 60
4. Gemini 3.5 Flash (High) Fast · 54

What it nailed

  • Complete fact sheet plus runnable React/Vite/Three visualization.
  • Separates mission facts, crew, vehicle facts, component details, events, and telemetry into missionData.js.
  • Uses many NASA source links directly in both fact sheet and visualization.
  • Covers the hard mission sequence rather than staying in generic orbit-only mode.
  • Includes 10 staged screenshots and desktop/mobile verification images.

Where it slipped

  • No formal historical scorecard or raw model output was found.
  • Some public-facing numbers need current-source reconciliation before publication.
  • Re-entry, splashdown, and recovery scenes are visually abstract and less useful for video than the stronger Opus visual treatment.
  • At least one anomaly/issue claim relies on non-primary reporting.

From the run