© 2025 ESSA MAMDANI

1,400 Tokens Per Second in Your Browser: How LFM2.5 230M and WebGPU Are Rewriting the Rules of AI Inference

> Liquid AI LFM2.5 230M runs at 1,400 tok/s entirely in the browser via WebGPU—faster than any cloud API, completely private, and costing exactly $0. The future of AI inference is local.

Verified by Essa Mamdani

1,400 tokens per second. In your browser. On a laptop.

No API calls. No cloud compute. No $200/month subscription. Just a 230-million-parameter model running entirely locally through WebGPU, delivering speed that would have required a data center just two years ago.

This is LFM2.5 230M from Liquid AI, and it's the most impressive browser-native AI demo of 2026.


The Demo: What Just Happened

On June 25, 2026, the webml-community published a Hugging Face Space that made much of the AI community do a double-take: Liquid AI's LFM2.5 230M, a 230-million-parameter language model, generating 1,400 tok/s entirely in the browser on custom WebGPU kernels.

The speed isn't theoretical. It's been verified by multiple users on different hardware:

  • M4 Max MacBook: 1,400 tok/s (the headline number)
  • Mid-range GPUs: 200–800 tok/s depending on hardware
  • Even busy systems: 220 tok/s while simultaneously running training workloads

Compare this to typical cloud API speeds:

  • GPT-4o via API: ~100–200 tok/s
  • Claude 3.5 Sonnet: ~80–150 tok/s
  • Gemini 1.5 Pro: ~120–180 tok/s

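On raw decode throughput, a 230M model in a browser outpaces these flagship APIs by 7–17× (1,400 ÷ 200 at the low end, 1,400 ÷ 80 at the high end). The comparison is not like-for-like, since the API models are far larger and more capable, but for tasks a small model can handle, the browser just became the fastest inference platform.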
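The speedup range can be reproduced directly from the figures quoted above. The short Python sketch below does only that arithmetic; the throughput numbers are the article's own and are not independently measured, and the function name `speedup_range` is just an illustrative helper.

```python
# Throughput figures (tokens/second) exactly as quoted in this article.
BROWSER_TOK_S = 1400
API_RANGES = {
    "GPT-4o": (100, 200),
    "Claude 3.5 Sonnet": (80, 150),
    "Gemini 1.5 Pro": (120, 180),
}


def speedup_range(local_tok_s, api_ranges):
    """Return (smallest, largest) speedup of local throughput over the API ranges."""
    fastest_api = max(high for _, high in api_ranges.values())
    slowest_api = min(low for low, _ in api_ranges.values())
    return local_tok_s / fastest_api, local_tok_s / slowest_api


low, high = speedup_range(BROWSER_TOK_S, API_RANGES)
print(f"{low:.1f}x to {high:.1f}x")  # prints "7.0x to 17.5x"
```

The low end compares against the fastest quoted API figure (200 tok/s) and the high end against the slowest (80 tok/s).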


The Technical Stack: How They Did It

LFM2.5: Efficiency by Design

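Liquid AI's LFM (Liquid Foundation Model) series departs from the standard transformer recipe. The company's name comes from liquid neural networks, a differential-equation-based line of research, but the published LFM2 architecture, on which LFM2.5 builds, is a hybrid: most layers are gated short convolutions, with a smaller number of grouped-query attention blocks. Swapping most attention layers for cheap convolutions is a large part of why these models run fast on consumer hardware.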

The result: a 230M parameter model that punches far above its weight class on reasoning, coding, and general language tasks. It's not just small—it's small and capable.

WebGPU: The Browser's Hidden Superpower

WebGPU is the successor to WebGL, giving browsers direct access to GPU compute capabilities. For AI inference, this means:

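  • Data stays on the GPU: Weights and activations live in GPU buffers between kernel launches instead of round-tripping through JavaScript
  • Compute shaders: General-purpose parallel kernels written in WGSL, filling the role CUDA kernels play on NVIDIA hardware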
  • Cross-platform: Works on Windows, macOS, Linux, and mobile devices with compatible GPUs

The webml-community's demo uses ONNX Runtime Web with its WebGPU backend: the runtime itself is compiled to WebAssembly, while the heavy math runs in GPU kernels.
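To make the compute-shader model concrete, here is a minimal CPU emulation in Python of how a 1-D dispatch maps invocations onto the output rows of a matrix-vector product. It is an illustrative sketch only: real kernels are written in WGSL and run in parallel on the GPU, and the names `matvec_invocation`, `dispatch`, and `WORKGROUP_SIZE` are hypothetical, not taken from the demo.

```python
WORKGROUP_SIZE = 4  # invocations per workgroup (illustrative; real kernels often use 64+)


def matvec_invocation(global_id, matrix, vector, out):
    """Body of one 'shader invocation': compute a single output row."""
    if global_id >= len(matrix):  # guard: the last workgroup may overshoot the data
        return
    out[global_id] = sum(m * v for m, v in zip(matrix[global_id], vector))


def dispatch(num_workgroups, kernel, *args):
    """Sequential emulation of a 1-D dispatch; a GPU runs these invocations in parallel."""
    for group in range(num_workgroups):
        for local in range(WORKGROUP_SIZE):
            kernel(group * WORKGROUP_SIZE + local, *args)


matrix = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
vector = [1, -1]
out = [0] * len(matrix)

# Ceiling division: 5 rows with 4 invocations per workgroup needs 2 workgroups.
num_groups = -(-len(matrix) // WORKGROUP_SIZE)
dispatch(num_groups, matvec_invocation, matrix, vector, out)
print(out)  # prints [-1, -1, -1, -1, -1]
```

The bounds check mirrors what real kernels do when the problem size is not a multiple of the workgroup size.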

The Secret Sauce: AI-Optimized Kernels

Here's where it gets wild. The custom WebGPU kernels weren't written by humans—they were optimized by an AI agent pipeline:

  1. Fable 5 (the first agent) started the kernel optimization process
  2. When Fable went offline, Claude Opus 4.8 picked up where it left off
  3. The agents iteratively benchmarked, modified, and recompiled WebGPU kernels
  4. Final result: kernels finely tuned for LFM2.5's specific architecture

This is agentic optimization in action. The kernels that deliver 1,400 tok/s were co-designed by AI systems, not human shader programmers.
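The benchmark, modify, and keep-if-faster loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the pipeline the agents used: `simulated_score` is a made-up stand-in for a real benchmark run, and the tunable parameters (workgroup size, tile size) are hypothetical examples of what such a loop might vary.

```python
from itertools import product


def simulated_score(workgroup_size, tile_size):
    """Stand-in for a real benchmark; an invented function that peaks at (128, 8)."""
    return 1000 - abs(workgroup_size - 128) - 10 * abs(tile_size - 8)


def tune(benchmark, candidates, start):
    """Try each candidate config and keep it only if it benchmarks faster."""
    best, best_score = start, benchmark(*start)
    history = [(best, best_score)]  # record every accepted improvement
    for config in candidates:
        score = benchmark(*config)
        if score > best_score:
            best, best_score = config, score
            history.append((best, best_score))
    return best, best_score, history


candidates = list(product([32, 64, 128, 256], [4, 8, 16]))
best, score, history = tune(simulated_score, candidates, start=(32, 4))
print(best, score)  # prints (128, 8) 1000
```

An agent-driven version would replace the fixed candidate grid with model-proposed kernel edits, but the accept-only-if-measurably-faster rule is the same idea.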


Why This Matters: 5 Implications

1. Privacy Becomes Default

When inference happens entirely in your browser, no data leaves your machine. No API logging. No training on your prompts. No subpoenas for your chat history.

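For healthcare, legal, financial, and enterprise applications, browser-native inference removes a large part of the compliance burden that cloud APIs carry. With no server receiving prompts, there is no third-party processor to vet and no server-side log to secure. HIPAA, GDPR, and SOX obligations do not vanish (the application itself still has to handle data responsibly), but the inference step drops out of scope.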

2. Latency Disappears

Cloud APIs have round-trip latency: your prompt travels to a data center, gets processed, and returns. Even on a fast connection, that adds roughly 50–200 ms before the first token arrives, on top of any queueing on the provider's side.

Once the model is downloaded, browser inference has zero network latency. The only delay is compute time, which at 1,400 tok/s is negligible for most applications. Real-time conversational AI, live coding assistance, and interactive tutoring become actually real-time.
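A simple way to reason about this is to model response time as network round trip plus generation time. The Python sketch below uses that simplified model; it deliberately ignores prompt processing, queueing, and model load time, and the 280-token reply, 140 tok/s API speed, and 100 ms round trip are illustrative values chosen from within the ranges quoted in this article.

```python
def response_time_s(num_tokens, tok_per_s, network_rtt_s=0.0):
    """Simplified model: one network round trip plus time to generate the tokens."""
    return network_rtt_s + num_tokens / tok_per_s


REPLY_TOKENS = 280  # illustrative reply length

local = response_time_s(REPLY_TOKENS, tok_per_s=1400)
cloud = response_time_s(REPLY_TOKENS, tok_per_s=140, network_rtt_s=0.1)

print(f"local: {local:.2f} s, cloud: {cloud:.2f} s")  # prints "local: 0.20 s, cloud: 2.10 s"
```

Under these assumptions most of the gap comes from generation speed rather than the round trip, so the network saving matters most for short replies.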

3. Cost Goes to Zero

Cloud APIs charge per token. At scale, this gets expensive fast:

  • 1M tokens on GPT-4o: ~$15
  • 1M tokens on Claude 3.5: ~$3
  • 1M tokens in your browser: $0

For applications with high token volume (content generation, code completion, automated testing) the savings are transformative. At the prices above, a startup processing 1B tokens/month saves roughly $3,000–$15,000 per month by moving to browser-native inference.
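The monthly bill at a given volume is straightforward to work out from the per-million prices listed above. The Python sketch below does that arithmetic; the prices are the article's approximate figures, and the $0 browser line excludes hardware, electricity, and the bandwidth used to download the model.

```python
# Approximate USD prices per 1M tokens, as listed in this article.
PRICE_PER_MILLION_USD = {
    "GPT-4o": 15.0,
    "Claude 3.5": 3.0,
    "Browser": 0.0,
}


def monthly_cost_usd(tokens_per_month, price_per_million_usd):
    """Cost of a month's tokens at a flat per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_million_usd


TOKENS_PER_MONTH = 1_000_000_000  # the 1B tokens/month scenario

for name, price in PRICE_PER_MILLION_USD.items():
    print(f"{name}: ${monthly_cost_usd(TOKENS_PER_MONTH, price):,.0f}")
```

For 1B tokens this gives $15,000 on the GPT-4o figure, $3,000 on the Claude 3.5 figure, and $0 in the browser.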

4. AI Works Offline

Browser-native models work without an internet connection. On planes, in remote areas, behind corporate firewalls, during network outages—your AI assistant keeps working.

This opens use cases that cloud APIs simply can't serve: field technicians in rural areas, military operations, maritime applications, and any environment with unreliable connectivity.

5. The Browser Becomes the Platform

For two decades, "the browser can't do X" was a constant refrain. It couldn't run games (until WebGL). It couldn't handle video calls (until WebRTC). It couldn't run AI (until now).

WebGPU + efficient small models changes the equation. The browser is no longer a document viewer—it's a general-purpose compute platform capable of running sophisticated AI locally. The implications for web development are profound:

  • Real-time translation on any webpage
  • Intelligent form filling and validation
  • Context-aware content recommendations
  • Local document analysis and summarization
  • In-browser coding assistants

All without a single API call.


The Bigger Context: Small Models Are Having a Moment

LFM2.5 230M isn't an isolated case. 2026 is the year small models got serious:

| Model | Parameters | Speed | Platform | Use Case |
|---|---|---|---|---|
| LFM2.5 230M | 230M | 1,400 tok/s | Browser/WebGPU | General LLM tasks |
| Ornith-1.0-9B | 9B | ~200 tok/s | Single GPU | Agentic coding |
| Phi-4-mini | 3.8B | ~500 tok/s | CPU/browser | Edge inference |
| Gemma 2B | 2B | ~800 tok/s | Mobile | On-device assistant |

The pattern is clear: parameter count is decoupling from capability. Modern training techniques—distillation, RL fine-tuning, and architectural innovations like liquid networks—are producing small models with outsized performance.

As we explored in our Ornith-1.0 analysis, a 9B model can now match 35B competitors on coding benchmarks. LFM2.5 230M extends this trend to the extreme low end: a model in the same size class as many embedding models, yet delivering practical, usable text generation.


Limitations: What 230M Parameters Can't Do

Let's be honest about what this model isn't:

  • No multimodal input: LFM2.5 230M is text-only (though Liquid AI's LFM2.5-VL-1.6B handles vision separately)
  • Limited reasoning depth: Complex multi-step reasoning still requires larger models
  • Narrow context window: 230M models typically handle 4K–8K context, not the 1M+ of frontier models
  • No tool use: The browser demo is inference-only, not agentic
  • Hardware dependency: WebGPU requires a reasonably modern GPU; older or low-end integrated graphics struggle

For serious coding tasks, you still want GLM 5.2 or Ornith-1.0. For research reproduction, you'll need larger open-weight models with more capacity.

But for the 80% of AI use cases that don't require frontier reasoning—summarization, translation, simple Q&A, content drafting, code completion—LFM2.5 230M is more than sufficient. And at 1,400 tok/s, it's faster than anything you can get from an API.


Try It Yourself

The Hugging Face Space is live and requires zero setup:

🔗 LFM2 WebGPU Kernels Demo

Requirements:

  • A modern browser (Chrome, Edge, or Firefox with WebGPU enabled)
  • A GPU with WebGPU support (most discrete GPUs from 2020+, Apple Silicon M1+)
  • ~500MB of browser cache for model weights

The model downloads automatically on first visit, then runs entirely locally. No account. No API key. No subscription.
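The ~500MB figure is consistent with a back-of-the-envelope estimate of weight storage. The Python sketch below computes it for several precisions; the article does not state which precision the demo ships, so 16-bit weights are an assumption here, and the estimate ignores tokenizer files and runtime overhead.

```python
def weight_size_mb(num_params, bits_per_param):
    """Approximate size of the raw weights in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


NUM_PARAMS = 230_000_000  # 230M parameters

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_size_mb(NUM_PARAMS, bits):,.0f} MB")
```

At 16 bits per weight the model comes to 460 MB, close to the stated cache requirement; an 8-bit or 4-bit export would be 230 MB or 115 MB respectively.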


The Future: Browser-Native AI Ecosystem

LFM2.5 230M at 1,400 tok/s is a proof of concept. It proves that browser-native AI isn't just possible—it's preferable for a huge class of applications.

Here's what's coming next:

2026 Q3–Q4

  • 1B-parameter browser models at 500+ tok/s with reasoning capabilities
  • WebGPU-optimized versions of popular open models (Llama, Qwen, Gemma)
  • Browser-based RAG: Local vector databases + local LLM = fully private knowledge assistants
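As a rough picture of what browser-based RAG involves, the Python sketch below shows the retrieval half: rank stored passages by cosine similarity to a query vector, then paste the best match into the prompt. Everything here is illustrative. The three-dimensional vectors are hand-made stand-ins for real embeddings, and `cosine`, `top_k`, and the sample passages are hypothetical names and data, not part of any announced product.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the k passages whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Tiny in-memory "vector database": (passage, toy embedding) pairs.
index = [
    ("WebGPU exposes compute shaders", [0.9, 0.1, 0.0]),
    ("Cloud APIs bill per token", [0.1, 0.9, 0.1]),
    ("Models are cached after first download", [0.2, 0.1, 0.9]),
]

query_vec = [0.8, 0.2, 0.1]  # toy embedding of a question about GPU compute
context = top_k(query_vec, index, k=1)
prompt = f"Context: {context[0]}\nQuestion: How does the browser reach the GPU?"
print(prompt)
```

In a real deployment both the embedding model and the generator would run locally, so neither the documents nor the question would leave the device.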

2027

  • 3–7B parameter models running at 200+ tok/s in browsers
  • Multimodal browser AI: Vision, audio, and text all processed locally
  • Progressive web apps that replace native AI clients

2028+

  • Browser models match 2024 API quality: The GPT-4 class becomes edge-runnable
  • Federated learning in browsers: Models improve from decentralized user data without centralization
  • AI as a web standard: WebGPU + WebNN become as fundamental as HTML and CSS


Conclusion

1,400 tokens per second in a browser isn't just a benchmark. It's a statement about where AI is heading.

The cloud API model—send your data to someone else's server, pay per token, hope they don't log your prompts—made sense in 2023 when models were too large to run locally. In 2026, it doesn't.

Small, efficient models like LFM2.5 230M prove that local inference can be faster, cheaper, and more private than cloud alternatives. Combined with agentic kernel optimization (AI improving AI), the browser is becoming the most important AI platform.

The future of AI isn't a data center. It's your laptop. And it's already here.

