A model in the suite · OpenAI

GPT-5.4

OpenAI · GPT · 2026-04-19

51/100

Strict suite average · Legacy 67 · 1 benchmark

Single-benchmark historical packet. The Car Wash scaffold has enough recoverable model identity for a packet, but multiple primary canary failures keep it in interesting-but-unreliable territory.

Where it ranks

GPT-5.4 against the field

ANT · Claude Opus 4.8 · 81
OAI · GPT-5.5 · 71
GOO · Gemini 3.5 Flash (High) Fast · 56
ANT · Opus 4.7 · 54
ANT · Sonnet 4.6 · 52
OAI · GPT-5.4 · 51
GOO · Gemini 3.1 Pro · 38

Strict suite average · Legacy average · Scored 0–100 · higher is better

The runs

How GPT-5.4 handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Car Wash Operations

A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.

51 · legacy 67

Interesting but Unreliable

GPT-5.4 showed better file-level rigor than Opus on SVC-007 and duplicate-customer conflict evidence, but strict scoring centers migration safety. Ghost/test records survived as canonical data, Terrence Blackwood was promoted to a customer, status/payment values remained raw enough for magic/case variants to survive, and the canonical customer count ballooned. The output is a review scaffold, not a trustworthy migration.

1. Claude Opus 4.8 · 86
2. GPT-5.5 · 55
3. Gemini 3.5 Flash (High) Fast · 51
4. GPT-5.4 · 51
5. Opus 4.7 · 48

What it nailed

Completed the required artifact set.
Accounted for the full file corpus in the cross-review analysis.
Parsed deshawn_services.tsv and surfaced the SVC-007 conflict.
Preserved useful conflict evidence for some duplicate customer cases.

Where it slipped

Promoted Mickey Mouse, Test Customer, and Asdf Asdf into canonical data.
Promoted Terrence Blackwood to a canonical customer.
Left status and payment methods raw enough that case variants and magic survived.
Over-expanded the canonical customer table.
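
The first and third slips come down to promoting rows without a quarantine or normalization step. A minimal sketch of such a gate follows, under loud assumptions: the ghost names are the three cited above, while the `name` field, the status alias table, and every function name are hypothetical and not taken from the benchmark itself.

```python
# Names the review reports as wrongly promoted; a real run would need a broader rule set.
GHOST_NAMES = {"mickey mouse", "test customer", "asdf asdf"}

# Hypothetical alias table: the actual status vocabulary is not given in the review.
STATUS_ALIASES = {"completed": "completed", "complete": "completed", "done": "completed"}


def is_ghost(name: str) -> bool:
    """True when the whitespace- and case-normalized name is a known placeholder."""
    return " ".join(name.lower().split()) in GHOST_NAMES


def split_customers(rows):
    """Partition rows into (canonical, quarantined) instead of promoting everything."""
    canonical, quarantined = [], []
    for row in rows:
        # Assumed schema: each row is a dict with a "name" key.
        (quarantined if is_ghost(row.get("name", "")) else canonical).append(row)
    return canonical, quarantined


def normalize_status(raw: str):
    """Map a raw status to its canonical form; None flags the value for review."""
    return STATUS_ALIASES.get(raw.strip().lower())
```

Returning `None` for an unmapped status, rather than passing the raw string through, is the design choice that keeps case variants out of the canonical table: anything unrecognized goes to review instead of being silently promoted.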

Misses Three Or More Primary Canaries · Promotes Ghost Records · Promotes Orphan Order