AI Model Report

Infrastructure

Serving stacks, kernels, GPU economics, and the gap between published throughput and what reproduces on real hardware.


Feature · MAY 12, 2026

vLLM v0.20.2 ships Model Runner V2: up to 56% higher throughput on GB200

The May 2026 stable release of vLLM bundles a new GPU-native async-scheduling stack built on Triton kernels, adds FP8 inference, and makes continuous batching the default.

By Aiko Tanaka · Inference & serving
