Very fast scaffold generator with broad completion, but uneven judgment. The strict score is much lower than the legacy average because the run failed core semantic, factual, visual-storytelling, and physical-plausibility checks.
How Gemini 3.5 Flash (High) Fast handled each benchmark
Score, capability radar, and the honest read on what it nailed and where it slipped.
Dingo & Co. Knowledge Work
A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.
Score 62 (legacy 72)
Competent Scaffold
Full deliverable completion and decent strategic synthesis, but weak visual polish, thin source coverage, shallow legal/regulatory distinction, spreadsheet limitations, manual recovery cycles, and missing raw model evidence keep this well below strong long-term comparison territory.
What it nailed
Completed 23/23 required deliverables as real files.
Preserved a copy of the source inputs, verified by checksum.
Reconciled important financial contradictions.
Treated Northern Canid Imports as central rather than incidental.
Where it slipped
Only 15 URLs found against the benchmark's 20-URL expectation.
Regulatory handling collapsed the distinctions between ownership, import, transport, quarantine, and local law.
Deck and one-pager were not visually strong.
Spreadsheets lacked charts and one workbook was thin.
Raw model output evidence is absent.
Weak Regulatory Coverage, Fabricated Or Broken Sources Soft Cap, Empty Raw Model Output Evidence Confidence Cap
Wall clock 15m 11s
Partnered lane scored 67
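The checksum claim in this card can be illustrated with a short digest comparison. This is a minimal sketch, assuming SHA-256 and local file paths; the benchmark's actual hashing method is not stated, and the function names `sha256_of` and `copy_is_preserved` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 16) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_preserved(original: Path, copy: Path) -> bool:
    """True when the copy has the same digest as the original (assumed check)."""
    return sha256_of(original) == sha256_of(copy)
```

A matching digest only shows the copy is byte-identical to the input; it says nothing about whether the deliverables used that input correctly.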
Car Wash Operations
A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.
Score 51 (legacy 64)
Interesting but Unreliable
The run was fast and complete, but it failed the benchmark's most important operational canaries: ghost/test records survived, Terrence Blackwood was promoted instead of treated as orphaned, typo-order merges failed, department codes and enum normalization were weak, and provenance was incomplete.
What it nailed
Created all required artifacts and an openable database.
Accounted for 463 business files.
Preserved source checksums and avoided strict sensitive-term leaks.
Produced a usable static audit UI.
Where it slipped
Ghost/test records were promoted instead of quarantined.
Terrence Blackwood was created as a customer instead of flagged as orphaned.
All 13 planted typo orders stayed attached to typo-name customers.
Nickname variants remained split.
Status and payment methods remained raw variants.
Misses Three Or More Primary Canaries, Promotes Ghost Records, Promotes Orphan Order, Empty Raw Model Output Evidence Confidence Cap
Wall clock 6m 09s
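The canaries in this card (ghost records, orphaned orders, raw enum variants) come down to a triage decision per record: keep, quarantine, or flag. The sketch below shows one possible shape for that decision. Everything in it is an assumption for illustration: the `Order` fields, the alias table, the ghost markers, and the bucket names are hypothetical and are not taken from the benchmark dataset or its grader.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

# Hypothetical mapping of raw status variants to canonical values.
STATUS_ALIASES = {
    "complete": "completed",
    "completed": "completed",
    "done": "completed",
    "canceled": "cancelled",
    "cancelled": "cancelled",
    "pending": "pending",
}

# Hypothetical substrings that suggest a ghost or test record.
# Substring matching is crude and would need review on real names.
GHOST_MARKERS = ("test", "dummy", "ghost")


@dataclass
class Order:
    order_id: str
    customer_name: str
    status: str


def normalize_status(raw: str) -> Optional[str]:
    """Map a raw status string to its canonical form, or None if unknown."""
    return STATUS_ALIASES.get(raw.strip().lower())


def triage(orders: List[Order], known_customers: Set[str]) -> Dict[str, List[str]]:
    """Sort order IDs into clean, quarantined, and orphaned buckets."""
    buckets: Dict[str, List[str]] = {"clean": [], "quarantined": [], "orphaned": []}
    for order in orders:
        name = order.customer_name.strip().lower()
        if any(marker in name for marker in GHOST_MARKERS):
            buckets["quarantined"].append(order.order_id)
        elif name not in known_customers:
            # Flag for review instead of creating a new customer record.
            buckets["orphaned"].append(order.order_id)
        elif normalize_status(order.status) is None:
            # Unknown enum variant: hold back rather than store it raw.
            buckets["quarantined"].append(order.order_id)
        else:
            buckets["clean"].append(order.order_id)
    return buckets
```

The design choice worth noting is that an unknown customer never creates a customer record here; the order lands in `orphaned` for review, which is the behavior this card says the run missed.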
Brick — The AI LEGO Build
Four buildable LEGO models from prompt to part list to runnable browser guide. Tests spatial reasoning, physical plausibility, and whether large builds hold together or collapse into repetition.
Score 56 (legacy 67)
Interesting but Unreliable
The run completed all four prompts and maintained clean part accounting, but physical plausibility was weak, large prompts degraded into repetitive abstraction, the airship station missed core structure, and visual quality was only partial-pass.
What it nailed
Maintained unique IDs and one-step coverage for parts.
Produced runnable browser guides with controls.
Where it slipped
Many overlaps and unsupported-looking placements.
Large models used repetitive generated patterns.
Airship station collapsed the requested nine-chapter structure to five chapters.
Some HUD/text overlap and camera framing problems.
Large Scale Collapse, Severe Physical Implausibility, Empty Raw Model Output Evidence Confidence Cap
Wall clock 2m 35s
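The overlap and unsupported-placement problems noted in this card are the kind of thing a simple geometric check can surface. The sketch below is one way to do that, assuming each part is an axis-aligned box on an integer grid with `z` as the vertical axis. The `Brick` type and its fields are hypothetical, and this is not the benchmark's own plausibility check.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Brick:
    part_id: str
    x: int  # minimum corner, grid units
    y: int
    z: int
    w: int  # size along x
    d: int  # size along y
    h: int  # size along z (height)


def footprints_overlap(a: Brick, b: Brick) -> bool:
    """True when the x/y footprints share area (touching edges do not count)."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.d and b.y < a.y + a.d)


def overlaps(a: Brick, b: Brick) -> bool:
    """True when the two boxes share volume."""
    return footprints_overlap(a, b) and a.z < b.z + b.h and b.z < a.z + a.h


def find_overlaps(bricks: List[Brick]) -> List[Tuple[str, str]]:
    """Return ID pairs of bricks that intersect, checked pairwise."""
    return [(a.part_id, b.part_id)
            for a, b in combinations(bricks, 2) if overlaps(a, b)]


def is_supported(brick: Brick, bricks: List[Brick]) -> bool:
    """True when the brick rests on the ground or directly on another brick."""
    if brick.z == 0:
        return True
    return any(other is not brick
               and other.z + other.h == brick.z
               and footprints_overlap(brick, other)
               for other in bricks)
```

The pairwise scan is quadratic in the number of parts, which is fine for a sketch but would likely need a spatial index for the large builds this benchmark describes.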
Artemis II Mission Visualization
A fact sheet plus an interactive 3D visualization of the Artemis II mission. Tests factual grounding, source integrity, and the ability to dramatize the hard beats — launch, staging, re-entry, recovery.
Score 54 (legacy 68)
Interesting but Unreliable
The artifact was fast, complete, and runnable, but it generated broken source URLs, mixed supported facts with unsupported details, overstated post-flight/anomaly claims, and avoided the hardest visual mission beats like launch, staging, re-entry, and recovery.