#speculative-decoding

5 posts

Extracting hidden states from vLLM

Mar 30, 2026·8 min read

PR \#33736 (included in vllm>=v0.18.0) introduced a new hidden states extraction system to vLLM. This blog post explores the motivation, design, usage, and future direction of this feature, and...

P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM

Mar 13, 2026·12 min read

EAGLE is the state-of-the-art method for speculative decoding in large language model (LLM) inference, but its autoregressive drafting creates a hidden bottleneck: the more tokens that you...

Diving into speculative decoding training support for vLLM with Speculators v0.3.0

Dec 13, 2025·11 min read

- Speculative decoding serves as an optimization to improve inference performance; however, training a unique draft model for each LLM can be difficult and time-consuming, while production-ready...

Inside vLLM: Anatomy of a High-Throughput LLM Inference System

Sep 5, 2025·41 min read

In this post, I'll gradually introduce all of the core system components and advanced features that make up a modern high-throughput LLM inference system. In particular I'll be doing a breakdown...

How Speculative Decoding Boosts vLLM Performance by up to 2.8x

Oct 17, 2024·10 min read

Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll break down speculative decoding in...