Google's Gemini 3.1 Flash-Lite: Fast, Cheap, and Surprisingly Smart

Google's new Gemini 3.1 Flash-Lite runs at 370 tokens/sec, costs $0.25/M input tokens, and beats GPT-4o mini on most benchmarks. Here's what it's good at.

Hej Team

Google just shipped Gemini 3.1 Flash-Lite in preview, and the pitch is simple: near-Pro intelligence at a fraction of the cost. At $0.25 per million input tokens and $1.50 per million output tokens, it's one-eighth the price of Gemini 3.1 Pro. But the interesting part isn't the price. It's that Google built a budget model that genuinely competes on quality.

Flash-Lite landed on March 3, 2026 via Google AI Studio and Vertex AI. It's designed for the workloads where you need a model that's fast and cheap enough to call thousands of times per minute: translation, classification, content moderation, structured data extraction. The stuff that racks up token bills fast.

The speed numbers are real

Flash-Lite generates output at roughly 370 tokens per second, making it the second-fastest model on Artificial Analysis's leaderboard. Time to first token is 2.5x faster than Gemini 2.5 Flash, and overall output speed is 45% higher.

Speed comparison: Flash-Lite vs competitors

Approximate output speeds from Artificial Analysis benchmarks.

For high-volume workloads, this speed difference compounds. If you're processing 10,000 customer support tickets per hour or moderating user-generated content at scale, the gap between 195 and 370 tokens per second is the difference between keeping up and falling behind.
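To make that concrete, here's a back-of-envelope calculation for the support-ticket scenario. The 500-tokens-per-ticket figure is an illustrative assumption, not from the benchmarks; the two speeds are the ones quoted above.

```python
# Back-of-envelope throughput math for a high-volume pipeline.
# Assumes ~500 output tokens per ticket (illustrative assumption)
# and the quoted output speeds of 195 vs 370 tokens/sec.

TOKENS_PER_TICKET = 500
TICKETS_PER_HOUR = 10_000

def hours_of_compute(tokens_per_sec: float) -> float:
    """Wall-clock hours of generation needed per hour of ticket volume."""
    total_tokens = TOKENS_PER_TICKET * TICKETS_PER_HOUR
    return total_tokens / tokens_per_sec / 3600

slow = hours_of_compute(195)
fast = hours_of_compute(370)
print(f"195 tok/s: {slow:.1f}h, 370 tok/s: {fast:.1f}h per hour of traffic")
```

Both figures exceed one hour per hour of traffic, so a single stream can't keep up either way; the practical difference is needing roughly eight concurrent streams at 195 tokens/sec versus four at 370.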

Benchmarks: punching above its weight

Flash-Lite sits in the "small model" tier, but its benchmark scores tell a different story. It beat GPT-4o mini and Claude 4.5 Haiku on 6 out of 11 benchmarks Google tested.

The highlights:

  • GPQA Diamond (PhD-level science): 86.9%
  • MMMU Pro (multimodal reasoning): 76.8%
  • MMMLU (multilingual Q&A): 88.9%
  • LiveCodeBench (code generation): 72.0%
  • SimpleQA (factual accuracy): 43.3%

Benchmark comparison

Benchmark scores from Google DeepMind model card and Artificial Analysis.

The GPQA Diamond score is particularly notable. 86.9% on doctorate-level science questions from a model that costs $0.25 per million input tokens. A year ago, you needed a flagship model for that kind of reasoning performance.

On Arena.ai's community leaderboard, Flash-Lite holds an Elo score of 1432, ranking it well above most models in its price range.

Adjustable thinking: pick your trade-off

Flash-Lite introduces adjustable thinking levels, which let you control how much reasoning the model does before generating a response. This is Google's version of the "thinking budget" concept that's been spreading across the industry.

The idea: not every request needs deep reasoning. A translation task or content classification doesn't benefit from extended chain-of-thought. But a complex data extraction or multi-step analysis might. With thinking levels, you can dial reasoning up or down per request, trading latency for accuracy based on your actual use case.

This gives developers fine-grained control over the cost/quality/speed triangle. For classification tasks, turn thinking down and get responses in milliseconds. For complex queries that need careful reasoning, turn it up and accept higher latency.
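One way a team might operationalize this is a small routing table that picks a thinking level per task type. This is a sketch of the pattern, not Google's API: the level names, task labels, and the `thinking_level` field are all hypothetical, and the real Gemini request parameters may differ.

```python
# Hypothetical per-request routing of thinking levels by task type.
# The idea of a per-request reasoning knob comes from the article;
# the field name "thinking_level" and the level values are assumptions.

ROUTING = {
    "translate": "low",            # no benefit from extended chain-of-thought
    "classify": "low",
    "moderate": "low",
    "extract_structured": "high",  # multi-step reasoning pays off here
    "analyze": "high",
}

def thinking_level(task: str) -> str:
    """Pick a thinking level for a request, defaulting to low for speed."""
    return ROUTING.get(task, "low")

request_config = {
    "model": "gemini-3.1-flash-lite",  # preview model named in the article
    "thinking_level": thinking_level("classify"),
}
print(request_config)
```

The useful property of keeping this mapping in one place is that the cost/quality/speed trade-off becomes a config decision rather than something scattered across call sites.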

The architecture: mixture-of-experts on TPUs

Flash-Lite is built on the same mixture-of-experts (MoE) architecture as Gemini 3 Pro. MoE models only activate a subset of their parameters for each input, which is how Google gets Pro-level quality at a fraction of the compute cost.

The model was trained on Google's Tensor Processing Units using JAX and the ML Pathways framework. It accepts text, images, audio, and video as input (up to 1 million tokens of context) and outputs text (up to 64,000 tokens). Knowledge cutoff is January 2025.

That 1 million token context window is worth noting. You can feed it an entire codebase, a full-length book, or hours of meeting transcripts and get coherent responses about the content. Most models in this price tier max out at 128K or 200K tokens.

Where it makes sense (and where it doesn't)

Flash-Lite is built for high-volume, latency-sensitive workloads. The sweet spots:

  • Translation at scale. The multilingual benchmark scores (88.9% MMMLU) back this up. If you're translating product listings, support content, or UI strings across dozens of languages, this is the model to reach for.
  • Content moderation. Fast enough to process user-generated content in real time, smart enough to understand context and nuance.
  • Classification and tagging. Sorting support tickets, categorizing products, extracting structured data from unstructured text.
  • Dashboard and UI generation. Google specifically highlights its ability to generate visual assets like BI dashboards from natural language prompts.

Where it falls short: the model scores 12.3% on long-context accuracy at the full 1 million token window, and its Humanity's Last Exam (HLE) score of 16% lags well behind Gemini 3.1 Pro's 44.4%. For tasks requiring deep, sustained reasoning over very long documents or complex agentic workflows, you're better off with Pro or a flagship model.

It's also notably verbose. During Artificial Analysis testing, it generated 53 million output tokens compared to the average of 20 million. That verbosity can inflate costs if you're not managing output length with max token limits or system prompts.
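At Flash-Lite's $1.50 per million output tokens, that verbosity gap has a concrete price tag. Using the token counts quoted above:

```python
# What verbosity costs at Flash-Lite's $1.50/M output pricing,
# using the Artificial Analysis token counts cited in the article.

OUTPUT_PRICE_PER_M = 1.50  # dollars per million output tokens

def output_cost(tokens_millions: float) -> float:
    """Dollar cost of generating the given number of output tokens (in millions)."""
    return tokens_millions * OUTPUT_PRICE_PER_M

verbose = output_cost(53)  # Flash-Lite's observed output across the suite
typical = output_cost(20)  # the suite average
print(f"${verbose:.2f} vs ${typical:.2f} -> ${verbose - typical:.2f} extra")
```

Roughly $49.50 in extra output spend across that one benchmark run, which is why capping output length (via max token limits or a terse system prompt) matters more for this model than for its peers.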

Pricing in context

At $0.25/$1.50 per million tokens (input/output), Flash-Lite is aggressively priced but not the cheapest option on every axis.

Model                    Input (per 1M)    Output (per 1M)
Gemini 3.1 Flash-Lite    $0.25             $1.50
GPT-4o mini              $0.15             $0.60
Claude 4.5 Haiku         $0.80             $4.00
Gemini 3.1 Pro           $2.00             $18.00
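Raw per-token prices can be misleading because real requests mix input and output. Here's a blended per-request comparison using the table's prices and an illustrative workload of 2,000 input and 500 output tokens per request (the workload shape is an assumption):

```python
# Blended cost per 1,000 requests for each model in the table above,
# assuming an illustrative 2,000 input / 500 output tokens per request.

PRICES = {  # (input $/M tokens, output $/M tokens), from the table
    "Gemini 3.1 Flash-Lite": (0.25, 1.50),
    "GPT-4o mini": (0.15, 0.60),
    "Claude 4.5 Haiku": (0.80, 4.00),
    "Gemini 3.1 Pro": (2.00, 18.00),
}

def cost_per_1k_requests(model: str, in_tok: int = 2_000, out_tok: int = 500) -> float:
    """Dollar cost of 1,000 requests at the given token mix."""
    in_price, out_price = PRICES[model]
    return 1_000 * (in_tok * in_price + out_tok * out_price) / 1_000_000

for model in PRICES:
    print(f"{model}: ${cost_per_1k_requests(model):.2f} per 1k requests")
```

Under these assumptions, Flash-Lite lands around $1.25 per thousand requests versus $0.60 for GPT-4o mini and $13.00 for Gemini 3.1 Pro; whether the quality gap over GPT-4o mini is worth roughly 2x the spend depends on how many retries each model needs, as the next paragraph notes.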

GPT-4o mini is still cheaper on raw token cost. But Flash-Lite's higher benchmark scores mean you may need fewer retries and less prompt engineering to get usable outputs, which factors into real-world cost. And if you're already in the Google Cloud ecosystem, the Vertex AI integration makes it a natural choice.

The bottom line

Gemini 3.1 Flash-Lite is Google's bid to own the "good enough and very fast" tier of the model market. It won't replace Pro or flagship models for complex reasoning tasks. But for the 80% of production workloads that need speed, cost efficiency, and solid-but-not-cutting-edge intelligence, it's a strong contender.

The adjustable thinking levels are the feature to watch. If Google gets the developer experience right (easy to set per-request, predictable impact on latency and quality), it could make Flash-Lite the default choice for teams that run diverse workloads through a single model endpoint.

Flash-Lite is available now in preview via Google AI Studio and Vertex AI.