Gemini 3.5 Flash (High) Fast

Google · Gemini · High / Fast · 2026-05-27

56/100

Strict suite average · Legacy 68 · 4 benchmarks

Very fast scaffold generator with broad completion, but uneven judgment. The strict score is much lower than the legacy average because the run failed core semantic, factual, visual-storytelling, and physical-plausibility checks.
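The two headline numbers can be checked against the per-benchmark scores reported in the runs section of this page. The sketch below assumes both figures are unweighted means rounded to the nearest integer; the page does not state the aggregation rule, although the reported numbers are consistent with it. The `suite_average` helper is a name introduced here for illustration.

```python
# Per-benchmark (strict, legacy) scores as reported on this page.
scores = {
    "Dingo & Co. Knowledge Work": (62, 72),
    "Car Wash Operations": (51, 64),
    "Brick (AI LEGO Build)": (56, 67),
    "Artemis II Mission Visualization": (54, 68),
}


def suite_average(values):
    """Unweighted mean rounded to the nearest integer (assumed aggregation rule)."""
    values = list(values)
    return round(sum(values) / len(values))


strict_avg = suite_average(strict for strict, _ in scores.values())
legacy_avg = suite_average(legacy for _, legacy in scores.values())
gap = legacy_avg - strict_avg

print(strict_avg, legacy_avg, gap)  # prints: 56 68 12
```

Under that assumption the strict scoring costs this run 12 points on average relative to the legacy scoring.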

Where it ranks

Gemini 3.5 Flash (High) Fast against the field

1. Claude Opus 4.8 (Anthropic): 81
2. GPT-5.5 (OpenAI): 71
3. Gemini 3.5 Flash (High) Fast (Google): 56
4. Opus 4.7 (Anthropic): 54
5. Sonnet 4.6 (Anthropic): 52
6. GPT-5.4 (OpenAI): 51
7. Gemini 3.1 Pro (Google): 38

Strict suite average · Legacy average · Scored 0–100 · higher is better

The runs

How Gemini 3.5 Flash (High) Fast handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

62 · legacy 72

Competent Scaffold

Full deliverable completion and decent strategic synthesis, but weak visual polish, thin source coverage, shallow legal/regulatory distinction, spreadsheet limitations, manual recovery cycles, and missing raw model evidence keep this well below strong long-term comparison territory.

1. Claude Opus 4.8: 80

2. GPT-5.5: 78

3. Gemini 3.5 Flash (High) Fast: 62

4. Opus 4.7: 54

5. Sonnet 4.6: 52

6. Gemini 3.1 Pro: 38

What it nailed

Completed 23/23 required deliverables as real files.
Preserved a checksum-verified copy of the source input.
Reconciled important financial contradictions.
Treated Northern Canid Imports as central rather than incidental.

Where it slipped

Only 15 URLs found against the benchmark's 20-URL expectation.
Regulatory handling compressed ownership, import, transport, quarantine, and local-law distinctions.
Deck and one-pager were not visually strong.
Spreadsheets lacked charts and one workbook was thin.
Raw model output evidence is absent.

Weak Regulatory Coverage · Fabricated Or Broken Sources Soft Cap · Empty Raw Model Output Evidence Confidence Cap

Wall clock 15m 11s · Partnered lane scored 67

Car Wash Operations

A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.

51 · legacy 64

Interesting but Unreliable

The run was fast and complete, but it failed the benchmark's most important operational canaries: ghost/test records survived, Terrence Blackwood was promoted instead of treated as orphaned, typo-order merges failed, department codes and enum normalization were weak, and provenance was incomplete.

1. Claude Opus 4.8: 86

2. GPT-5.5: 55

3. Gemini 3.5 Flash (High) Fast: 51

4. GPT-5.4: 51

5. Opus 4.7: 48

What it nailed

Created all required artifacts and an openable database.
Accounted for 463 business files.
Preserved source checksums and avoided strict sensitive-term leaks.
Produced a usable static audit UI.

Where it slipped

Ghost/test records were promoted instead of quarantined.
Terrence Blackwood was created as a customer instead of flagged as orphaned.
All 13 planted typo orders stayed attached to typo-name customers.
Nickname variants remained split.
Status and payment methods remained raw variants.
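The canaries above amount to a triage policy: quarantine ghost/test records, flag orders with no matching customer as orphaned, and normalize raw enum variants before promoting anything. The sketch below illustrates one such policy. Every field name, the `STATUS_MAP` vocabulary, and the `GHOST_PATTERN` heuristic are hypothetical stand-ins, since the benchmark's actual schema and enum variants are not shown on this page.

```python
import re

# Hypothetical canonical vocabulary; the benchmark's real enum variants are not shown.
STATUS_MAP = {
    "done": "completed",
    "complete": "completed",
    "completed": "completed",
    "cancelled": "canceled",
    "canceled": "canceled",
    "pending": "pending",
}

# Hypothetical heuristic for ghost/test records.
GHOST_PATTERN = re.compile(r"\b(test|dummy|ghost)\b", re.IGNORECASE)


def normalize_status(raw):
    """Map a raw status variant to its canonical form, or None if unknown."""
    return STATUS_MAP.get(raw.strip().lower())


def triage(record, known_customer_ids):
    """Return (bucket, record), where bucket says what to do with the order."""
    if GHOST_PATTERN.search(record.get("customer_name", "")):
        return "quarantine", record  # ghost/test data never reaches the clean tables
    if record.get("customer_id") not in known_customer_ids:
        return "orphaned", record  # flag it; do not invent a customer for it
    status = normalize_status(record.get("status", ""))
    if status is None:
        return "review", record  # unknown variant: a human decides
    return "promote", {**record, "status": status}


# Illustrative records only; the IDs and statuses are made up.
orders = [
    {"customer_name": "TEST USER", "customer_id": 1, "status": "done"},
    {"customer_name": "Terrence Blackwood", "customer_id": 999, "status": "Done"},
    {"customer_name": "A. Customer", "customer_id": 1, "status": " Complete "},
]
buckets = [triage(order, known_customer_ids={1})[0] for order in orders]
print(buckets)  # prints: ['quarantine', 'orphaned', 'promote']
```

The ordering of the checks is the design choice that matters: quarantine and orphan checks run before any normalization, so a suspicious record is never cleaned up and promoted by accident.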

Misses Three Or More Primary Canaries · Promotes Ghost Records · Promotes Orphan Order · Empty Raw Model Output Evidence Confidence Cap

Wall clock 6m 09s

Brick — The AI LEGO Build

Four buildable LEGO models from prompt to part list to runnable browser guide. Tests spatial reasoning, physical plausibility, and whether large builds hold together or collapse into repetition.

56 · legacy 67

Interesting but Unreliable

The run completed all four prompts and maintained clean part accounting, but physical plausibility was weak, large prompts degraded into repetitive abstraction, the airship station missed core structure, and visual quality was only partial-pass.

1. Claude Opus 4.8: 82

2. Gemini 3.5 Flash (High) Fast: 56

What it nailed

Completed all four benchmark prompts.
Hit exact target piece counts.
Maintained unique IDs and one-step coverage for parts.
Produced runnable browser guides with controls.

Where it slipped

Many overlaps and unsupported-looking placements.
Large models used repetitive generated patterns.
Airship station collapsed the requested nine-chapter structure to five.
Some HUD/text overlap and camera framing problems.
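Overlaps and unsupported placements are the kind of problem a simple geometric audit can surface. The sketch below assumes each part can be approximated as an axis-aligned box on an integer grid, which is a simplification of real LEGO geometry and not the benchmark's actual checker. The `Part` fields and the `audit` function are hypothetical names.

```python
from itertools import combinations
from typing import NamedTuple


class Part(NamedTuple):
    """Axis-aligned box: minimum corner (x, y, z) plus width, depth, height."""
    id: str
    x: int
    y: int
    z: int
    w: int
    d: int
    h: int


def spans_overlap(a0, a_len, b0, b_len):
    """True if the half-open intervals [a0, a0+a_len) and [b0, b0+b_len) intersect."""
    return a0 < b0 + b_len and b0 < a0 + a_len


def footprints_overlap(a, b):
    return spans_overlap(a.x, a.w, b.x, b.w) and spans_overlap(a.y, a.d, b.y, b.d)


def volumes_overlap(a, b):
    return footprints_overlap(a, b) and spans_overlap(a.z, a.h, b.z, b.h)


def is_supported(part, parts):
    """Supported if on the ground or resting directly on a part below it."""
    if part.z == 0:
        return True
    return any(
        other.id != part.id
        and other.z + other.h == part.z
        and footprints_overlap(part, other)
        for other in parts
    )


def audit(parts):
    ids = [p.id for p in parts]
    return {
        "duplicate_ids": sorted({i for i in ids if ids.count(i) > 1}),
        "overlaps": [(a.id, b.id) for a, b in combinations(parts, 2) if volumes_overlap(a, b)],
        "unsupported": [p.id for p in parts if not is_supported(p, parts)],
    }


# Illustrative model: p2 sits on p1, p3 floats in mid-air, p4 intersects p1.
model = [
    Part("p1", 0, 0, 0, 4, 2, 1),
    Part("p2", 0, 0, 1, 2, 2, 1),
    Part("p3", 10, 10, 5, 2, 2, 1),
    Part("p4", 1, 0, 0, 2, 2, 1),
]
report = audit(model)
print(report)
```

A clean part list can still fail this audit, which matches the pattern in this run: exact piece counts and unique IDs alongside weak physical plausibility.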

Large Scale Collapse · Severe Physical Implausibility · Empty Raw Model Output Evidence Confidence Cap

Wall clock 2m 35s

Artemis II Mission Visualization

A fact sheet plus an interactive 3D visualization of the Artemis II mission. Tests factual grounding, source integrity, and the ability to dramatize the hard beats — launch, staging, re-entry, recovery.

54 · legacy 68

Interesting but Unreliable

The artifact was fast, complete, and runnable, but it generated broken source URLs, mixed supported facts with unsupported details, overstated post-flight/anomaly claims, and avoided the hardest visual mission beats like launch, staging, re-entry, and recovery.

1. GPT-5.5: 79

2. Claude Opus 4.8: 76

3. Opus 4.7: 60

4. Gemini 3.5 Flash (High) Fast: 54

What it nailed

Completed a fact sheet and interactive 3D visualization in under three minutes.
Produced a usable dashboard shell with timeline, HUD, narrative panel, and camera modes.
Correctly treated Artemis II as a completed April 2026 mission.

Where it slipped

Six generated bibliography URLs returned 404.
Some timeline and closest-approach values were off.
Unsupported anomaly/post-flight details were overclaimed.
Visualization did not convincingly show launch, staging, re-entry, or recovery.
Visual result trailed prior Opus 4.7 and GPT-5.5 artifacts.
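Broken bibliography URLs like the six noted above can be caught with a basic link check before publication. The sketch below uses only the Python standard library. It is an illustration and not the benchmark's own checker: `http_status` and `broken_sources` are hypothetical names, and the example URLs are placeholders. Some servers reject HEAD requests, so a real checker might fall back to GET.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10.0):
    """Return the HTTP status code for a HEAD request, or None if unreachable."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a fabricated or moved page
    except (OSError, ValueError):
        return None  # DNS failure, timeout, refused connection, or malformed URL


def broken_sources(urls, status_of=http_status):
    """Return (url, status) pairs for every URL that is unreachable or returns >= 400."""
    broken = []
    for url in urls:
        code = status_of(url)
        if code is None or code >= 400:
            broken.append((url, code))
    return broken


# Offline demonstration with a stubbed status lookup (placeholder URLs).
fake_statuses = {
    "https://example.org/real-page": 200,
    "https://example.org/made-up-page": 404,
}
result = broken_sources(list(fake_statuses), status_of=fake_statuses.get)
print(result)  # prints: [('https://example.org/made-up-page', 404)]
```

Passing the status lookup as a parameter keeps the check testable without network access; calling `broken_sources(urls)` with the default performs live requests.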

Fabricated Or Broken Sources · Publication Fact Errors · Generic Orbit Scene · Empty Raw Model Output Evidence Confidence Cap

Wall clock 2m 53s
