Xiaomi MiMo-V2.5-Pro UltraSpeed hits 1,000+ tokens/s via FP4 + DFlash

Published: 2026-06-08 21:04:18

Xiaomi has released MiMo-V2.5-Pro-UltraSpeed, a new inference-serving mode for its trillion-parameter model that delivers more than 1,000 tokens per second, peaking near 1,200 in demos. The company says it reaches this speed on a standard 8-GPU commodity node without custom chips, and it is pitching the release at latency-sensitive AI use cases.

The speedup relies on two core techniques. First, FP4 quantization compresses only the model’s expert layers to 4-bit precision, aiming for near-zero quality loss while reducing memory and bandwidth pressure. Second, DFlash speculative decoding accelerates generation by proposing a whole block of tokens in a single forward pass instead of drafting them one at a time. Xiaomi claims that, on average, 6.3 of every 8 proposed tokens are accepted per verification round.

Xiaomi’s inference engine, TileRT, is positioned as the glue: it keeps the GPU compute pipeline continuously resident, which reduces operator-launch overhead. Xiaomi describes the overall approach as “extreme model-system codesign,” emphasizing the combined effect rather than any single optimization.

Commercially, the MiMo API trial runs June 9–23. Access is by application, with priority for enterprise and professional developers. Pricing is set at 3× the standard MiMo-V2.5-Pro rate for roughly 10× the output generation speed. Xiaomi also says the FP4-DFlash checkpoint will be open-sourced on Hugging Face for community testing.

For traders: this is an AI infrastructure milestone that can lift sentiment around AI compute efficiency, but it is not a direct catalyst for crypto token flows. Watch for any follow-on partnerships or spending signals from AI labs and enterprises.
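To put the reported decoding and pricing figures in perspective, the short Python sketch below works through what they imply when taken at face value. It is a back-of-the-envelope illustration, not Xiaomi's methodology. It assumes that each verification round emits only the accepted draft tokens and that billing is per token; neither is confirmed in the announcement. All function names are hypothetical.

```python
# Back-of-the-envelope arithmetic on the figures reported by Xiaomi.
# Assumptions (not confirmed by the announcement):
#   * each verification round emits only the accepted draft tokens
#   * the API is billed per token, so "3x the rate" means 3x per token

BLOCK_SIZE = 8             # draft tokens proposed per round (reported)
ACCEPTED_PER_ROUND = 6.3   # average accepted tokens per round (reported)
THROUGHPUT_TPS = 1000.0    # headline output speed in tokens/s (reported)
PRICE_MULTIPLE = 3.0       # price vs. standard MiMo-V2.5-Pro (reported)
SPEED_MULTIPLE = 10.0      # approximate speedup vs. standard (reported)


def acceptance_rate(accepted: float, block: int) -> float:
    """Fraction of proposed tokens that survive verification."""
    return accepted / block


def rounds_per_second(throughput_tps: float, accepted: float) -> float:
    """Verification rounds per second needed to sustain the throughput."""
    return throughput_tps / accepted


def round_budget_ms(throughput_tps: float, accepted: float) -> float:
    """Time available for one draft-plus-verify round, in milliseconds."""
    return 1000.0 / rounds_per_second(throughput_tps, accepted)


def relative_cost_and_latency(price_multiple: float, speed_multiple: float):
    """Cost and generation time of a fixed-length reply vs. the standard tier."""
    return price_multiple, 1.0 / speed_multiple


if __name__ == "__main__":
    rate = acceptance_rate(ACCEPTED_PER_ROUND, BLOCK_SIZE)
    rps = rounds_per_second(THROUGHPUT_TPS, ACCEPTED_PER_ROUND)
    budget = round_budget_ms(THROUGHPUT_TPS, ACCEPTED_PER_ROUND)
    cost, latency = relative_cost_and_latency(PRICE_MULTIPLE, SPEED_MULTIPLE)
    print(f"acceptance rate:   {rate:.1%}")
    print(f"rounds per second: {rps:.1f}")
    print(f"budget per round:  {budget:.1f} ms")
    print(f"same reply costs {cost:.0f}x and takes {latency:.0%} of the time")
```

Under these assumptions, roughly 79% of drafted tokens are accepted, and sustaining 1,000 tokens per second leaves about 6.3 ms for each draft-and-verify round. On the pricing side, a reply of a given length would cost three times as much as on the standard tier and arrive in about a tenth of the time, so the premium is for latency rather than for cheaper tokens.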

Neutral

This is a notable AI inference-efficiency announcement, but it has no direct connection to specific crypto assets, token unlocks, protocol incentives, or on-chain flows. As a result, traders are unlikely to see an immediate or lasting effect on crypto markets.

In the short term, the narrative could be mildly supportive for sentiment toward “AI compute” infrastructure, because lower inference cost and higher throughput can drive demand for accelerators and tooling. That said, similar technology milestones in the past, especially those concerning model-serving optimizations rather than blockchain integrations, have often failed to translate into sustained, crypto-wide price action.

In the long term, if Xiaomi’s MiMo UltraSpeed (FP4 + DFlash) leads to broader enterprise adoption, it could indirectly boost investment cycles in AI hardware and software ecosystems. But unless follow-on moves tie AI agents to crypto protocols (e.g., payments, staking, or DeFi execution rails), the effect on crypto trading volumes is likely to be limited. Hence the expected impact is neutral.