ZFLOW AI today announced a performance optimization milestone on PaleBlueDot AI’s 8×NVIDIA B300 bare-metal platform, using simulation to identify an optimized DeepSeek V4-Pro serving configuration on an SGLang stack. To our knowledge, this is the first publicly documented simulation-guided serving optimization of a frontier open-source model on NVIDIA’s B300 production platform.
ZFLOW AI is building a neutral optimization and control layer for AI infrastructure. Sitting above serving runtimes and below the business decision, ZFLOW AI helps infrastructure teams find the lowest-cost, highest-performance way to run a given workload on a given cluster.
ZFLOW AI’s role is complementary to the serving runtime. Building on the high-performance DeepSeek V4 foundation provided by the SGLang ecosystem, ZFLOW AI applies an optimization intelligence layer on top of the runtime — profiling real workload behavior and using hardware-aware simulation to guide deployment and tuning decisions for a specific workload on specific hardware.
In this milestone, ZFLOW AI evaluated DeepSeek V4-Pro serving with SGLang and EAGLE speculative decoding, analyzing serving-architecture tradeoffs, high-concurrency throughput and latency, and next-step multi-node deployment. Under higher-concurrency traffic, the prefill-decode disaggregated configuration reached a peak throughput of 826 tokens/second, approximately 1.54× the non-disaggregated (monolithic) peak, with tail latency 2–3× lower. The monolithic path remained favorable for single-stream, low-concurrency, and long-context workloads, including the full 1M-token context.
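For readers who want the baseline behind the ratio, the figures above can be worked backward. The release does not state the monolithic peak directly, so the value below is only an estimate derived from the two quoted numbers (826 tokens/second and the approximate 1.54× ratio). The function and variable names are illustrative and do not come from ZFLOW AI.

```python
# Figures quoted in the release.
DISAGG_PEAK_TOK_S = 826.0  # prefill-decode disaggregated peak throughput
SPEEDUP_VS_MONO = 1.54     # "approximately 1.54x" the monolithic peak


def implied_monolithic_peak(disagg_peak: float, speedup: float) -> float:
    """Back out the baseline peak implied by a peak and its speedup ratio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return disagg_peak / speedup


# Estimate only: the ratio is approximate, so the result is too.
mono_peak = implied_monolithic_peak(DISAGG_PEAK_TOK_S, SPEEDUP_VS_MONO)
gain = DISAGG_PEAK_TOK_S - mono_peak

print(f"implied monolithic peak: ~{mono_peak:.0f} tokens/s")
print(f"absolute gain: ~{gain:.0f} tokens/s")
```

Under these assumptions the implied monolithic peak is roughly 536 tokens/second, which puts the absolute gain from disaggregation at about 290 tokens/second. Because the 1.54× figure is rounded, both numbers should be read as approximations.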
ZFLOW AI also observed that MTP/EAGLE speculative decoding improved throughput with no measured quality regression in this test run: GSM8K accuracy across EAGLE 3/1/4, EAGLE 1/1/2, and no-MTP configurations stayed within approximately ±1 percentage point. Broader evaluation is ongoing.
ZFLOW AI’s simulation further indicates that a two-node B300 configuration is a promising direction for production deployment, which the team plans to validate on hardware as a next step.
“Modern inference optimization is moving beyond manual tuning of individual runtime knobs,” said Dr. Zhibin Xiao, Founder and CEO of ZFLOW AI. “The next layer is a closed-loop workflow connecting real workload execution, hardware simulation, and optimization strategy. Our work on PaleBlueDot AI’s B300 platform shows how ZFLOW AI helps infrastructure teams turn raw hardware capability into a workload-specific deployment strategy.”
Full closed-loop auto-optimization for DeepSeek V4-Pro on B300 remains under active development. ZFLOW AI plans to publish a Technical Insights blog detailing the serving-architecture tradeoffs, MTP/EAGLE optimization, and multi-node deployment work.
Teams evaluating DeepSeek V4-Pro or other frontier models on B300 or other next-generation GPU platforms can contact ZFLOW AI at contact@zflow.ai to discuss optimization for their own workloads.
About ZFLOW AI
ZFLOW AI is building a neutral optimization and control layer for AI infrastructure. Sitting above serving runtimes (vLLM, SGLang, TensorRT-LLM, Dynamo) and below the business decision, ZFLOW AI finds the lowest-cost, highest-performance way to run a given workload on a given cluster — across heterogeneous GPU, LPU, NPU, and CPU systems, without locking teams into any single vendor or stack. Learn more at zflow.ai.
About PaleBlueDot AI
PaleBlueDot AI is a Silicon Valley-based AI compute platform with a growing global footprint, delivering high-performance AI compute through a unified platform for enterprise-scale deployment. Guided by its mission to make intelligence universally accessible, PaleBlueDot AI helps organizations build, deploy, and scale AI faster, better, and cheaper.
View source version on businesswire.com: https://www.businesswire.com/news/home/20260522229557/en/