Wide-EP and prefill/decode disaggregation APIs for vLLM are now available in Ray 2.52 🚀🚀
Validated at 2.4k tokens/s per H200 on Anyscale Runtime, these patterns maximize inference efficiency for sparse MoE models (DeepSeek, Kimi, Qwen3), but they often require non-trivial orchestration logic.
Here’s how they work…🧵
The catch: with these patterns, engine replicas can no longer be scaled independently.
Efficient serving now requires coordinating:
- data-parallel ranks
- topology-specific expert parallel ranks
- KV-cache transfer across deployments
- heterogeneous resource profiles (prefill vs decode)
Wide-EP distributes experts across GPUs + adds load balancing, expert replication, and optimized all-to-alls.
For high-throughput workloads, batch size can increase by more than 2x compared to tensor parallelism.
The complexity: replicas must form shared DP/EP groups, agree on IP/port #s, and scale ingress separately from engines.
Ray Serve LLM now exposes a builder API that integrates with vLLM and handles all of this automatically.
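Rough sketch of the builder pattern (the model, autoscaling values, and engine_kwargs here are illustrative assumptions; the wide-EP builder in 2.52 may expose different knobs):

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Sketch: one LLMConfig describing a wide-EP style deployment.
# data_parallel_size / enable_expert_parallel are standard vLLM engine args;
# the exact surface of the Ray 2.52 wide-EP builder may differ.
llm_config = LLMConfig(
    model_loading_config=dict(model_id="deepseek-ai/DeepSeek-V3"),  # illustrative model
    engine_kwargs=dict(
        data_parallel_size=8,          # DP ranks that form the shared EP group
        enable_expert_parallel=True,   # spread MoE experts across the DP ranks
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
)

# The builder produces an OpenAI-compatible Serve app; ingress scales separately
# from the engine replicas.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```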
Prefill/decode disaggregation separates input token processing and token generation into independent deployments with different scaling behaviors.
Prefill is compute-bound, while decode is memory bandwidth-bound. When the same replica handles both in the same batch, decode steps stall behind long prefills and throughput tanks.
Ray Serve LLM now has a build_pd_openai_app builder that:
- Creates prefill + decode deployments
- Sets up the NIXL KV transfer connector
- Routes requests through a PDProxyServer
- (Optionally) uses prefix cache-aware routing for prefill
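Roughly what usage looks like (a sketch only: build_pd_openai_app is named above, but the prefill_config/decode_config field names are my assumption about its schema, and the model/autoscaling values are illustrative):

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_pd_openai_app

# Sketch: separate configs for prefill and decode so each scales on its own
# resource profile. Per the release, the builder wires up the NIXL KV transfer
# connector and routes requests through the PDProxyServer for us.
prefill_config = LLMConfig(
    model_loading_config=dict(model_id="deepseek-ai/DeepSeek-V3"),  # illustrative model
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=2, max_replicas=4),  # compute-heavy profile
    ),
)

decode_config = LLMConfig(
    model_loading_config=dict(model_id="deepseek-ai/DeepSeek-V3"),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),  # bandwidth-bound profile
    ),
)

# Field names here (prefill_config / decode_config) are assumed, not a verified
# Ray 2.52 signature.
app = build_pd_openai_app(dict(prefill_config=prefill_config,
                               decode_config=decode_config))
serve.run(app)
```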
For the full writeup, see link in comments 🙂