Huawei has unveiled new performance data for its CloudMatrix384 architecture, claiming that its system outperforms Nvidia’s H800 GPUs in running DeepSeek’s R1 model. The results, published in a detailed technical paper by Huawei researchers, highlight the company’s push to showcase leadership in AI infrastructure despite ongoing global competition and trade restrictions.
A Bold Performance Claim
The research, carried out in collaboration with Chinese AI infrastructure firm SiliconFlow, marks the first time Huawei has publicly shared in-depth numbers on CloudMatrix384. According to the company’s internal benchmarks, its system achieves higher efficiency and throughput than Nvidia’s GPUs in certain AI workloads.
While the reported figures are striking, it’s important to remember that all testing was conducted in-house. Without third-party validation, these claims remain provisional, though they demonstrate Huawei’s ambition to challenge the dominance of US-made processors.
Inside the CloudMatrix384 Architecture
At its core, CloudMatrix384 combines 384 Ascend 910C NPUs with 192 Kunpeng CPUs, linked together through a high-speed Unified Bus designed for ultra-low-latency communication.
Unlike traditional hierarchical data centre designs, Huawei’s setup uses a peer-to-peer architecture in which every processing unit can communicate directly with every other, pooling compute, memory, and networking resources dynamically. The result is a flexible system built to handle the demanding workloads of modern AI, particularly large language models and mixture-of-experts (MoE) frameworks.
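To make the pooling idea concrete, here is a toy sketch in Python. It is purely illustrative: the class names, allocator, and numbers are invented for this example and are not Huawei’s software. The point it shows is that, with a flat peer-to-peer interconnect, any free accelerators can be assigned to a job from one global pool rather than being tied to a particular server.

```python
# Illustrative sketch of a "flat pool" allocator, loosely inspired by the
# peer-to-peer design described above. All names and values are invented.
from dataclasses import dataclass, field

@dataclass
class Accelerator:
    uid: int            # position in the flat pool
    busy: bool = False

@dataclass
class FlatPool:
    """Every accelerator is directly addressable; no per-server boundary."""
    devices: list = field(default_factory=list)

    def allocate(self, n: int) -> list:
        # Any n free devices can serve a job, regardless of physical placement,
        # because the interconnect lets every unit talk to every other unit.
        free = [d for d in self.devices if not d.busy][:n]
        if len(free) < n:
            raise RuntimeError("not enough free accelerators in the pool")
        for d in free:
            d.busy = True
        return free

# Example: model a 384-NPU pool and grab 32 units for one inference job.
pool = FlatPool(devices=[Accelerator(uid=i) for i in range(384)])
job_devices = pool.allocate(32)
print(f"job runs on NPUs {job_devices[0].uid}..{job_devices[-1].uid}")
```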
Performance Metrics in Context
Huawei frames the results around the two phases of AI inference: prefill (where the system processes the input prompt) and decode (where it generates output tokens one at a time).
- In prefill, CloudMatrix reportedly processes 6,688 tokens per second per unit.
- During decoding, it delivers around 1,943 tokens per second per unit.
- The system’s time per output token (TPOT) is under 50 milliseconds, meaning each new token is generated in less than a twentieth of a second.
Efficiency is another area where Huawei claims an edge. The company reports that CloudMatrix achieves 4.45 tokens per second per TFLOPS in prefill and 1.29 tokens per second per TFLOPS in decoding—metrics that it argues surpass the performance of Nvidia’s H100 and H800 GPUs.
The system also sustains a throughput of 538 tokens per second under a stricter 15-millisecond latency constraint, which Huawei positions as evidence of robust responsiveness.
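As a rough, back-of-envelope illustration of how these figures relate (not a reproduction of Huawei’s methodology; the TFLOPS value below is a placeholder, not a specification from the paper), the snippet below re-derives per-token latency from TPOT and shows how a tokens-per-second-per-TFLOPS figure is normalised.

```python
# Back-of-envelope relations between the reported figures. Anything marked
# "illustrative" is an assumption made for this example, not a number from
# Huawei's paper.

tpot_s = 0.050                 # reported upper bound: 50 ms per output token
print(f"one token roughly every {tpot_s:.3f} s (under 1/20 of a second)")

# With batched decoding, per-unit throughput ~ concurrent sequences / TPOT,
# so ~1,943 tokens/s per unit at ~50 ms TPOT implies this much batching:
decode_tps = 1943              # reported decode throughput per unit
implied_concurrency = decode_tps * tpot_s
print(f"implied concurrent sequences per unit: ~{implied_concurrency:.0f}")

# The efficiency metric is simply throughput normalised by available compute:
peak_tflops = 1000             # illustrative placeholder, NOT the 910C's rating
print(f"tokens per second per TFLOPS: {decode_tps / peak_tflops:.2f}")
```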
Technical Innovations Driving Results
The CloudMatrix paper highlights three major innovations that underpin its performance:
- Peer-to-peer serving architecture – Breaking inference into prefill, decode, and caching subsystems allows each to scale independently.
- Expert parallelism at scale – Supporting configurations of up to EP320, with each NPU die hosting its own expert, optimises workloads for large models.
- Hardware-aware optimisation – Fine-tuned operators, microbatch pipelining, and INT8 quantisation reduce computational load while preserving accuracy.
Huawei reports that INT8 quantisation maintains model quality close to the official DeepSeek-R1 API across 16 internal benchmarks, though these results again lack external verification.
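INT8 quantisation itself is a standard technique: floating-point weights are mapped to 8-bit integers with a scale factor, trading a small amount of precision for much cheaper arithmetic. The minimal sketch below shows the generic per-tensor symmetric variant, not the specific recipe used in CloudMatrix.

```python
# Minimal per-tensor symmetric INT8 quantisation, as a generic illustration
# of the technique (not the specific scheme used in CloudMatrix).
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 values plus a single scale factor."""
    scale = np.max(np.abs(weights)) / 127.0            # largest value maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for comparison."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).max()
print(f"max absolute rounding error: {err:.4f}")       # small relative to weight range
```

Because 8-bit integer operations are far cheaper than 16- or 32-bit floating-point ones, the same hardware can push more tokens per second, which is where much of the claimed efficiency gain comes from.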
Strategic and Geopolitical Dimensions
Beyond the technical details, the announcement comes amid heightened competition between China and the United States in advanced semiconductors. Huawei’s founder Ren Zhengfei has acknowledged that the company’s chips lag behind US competitors in raw capability. However, he argues that clustering methods, as used in CloudMatrix, can compensate for this gap.
Industry leaders outside China echo the same idea. Nvidia’s CEO Jensen Huang has noted that AI workloads can often be scaled by simply adding more processors in parallel—a strategy China is well-positioned to pursue given its abundant energy and resources.
Huawei’s researchers frame their work as part of a broader mission: to strengthen confidence in China’s domestic technology ecosystem and reduce reliance on foreign hardware. Lead author Zuo Pengfei, part of Huawei’s “Genius Youth” program, described the project as a demonstration that locally developed NPUs can rival or even surpass Nvidia’s GPUs.
Waiting for Independent Validation
The numbers presented by Huawei suggest promising progress, but without external confirmation the industry will remain cautious. Independent benchmarking has long been the accepted basis for credible performance comparisons in the technology industry.
That said, Huawei’s CloudMatrix approach—its peer-to-peer serving design, resource disaggregation, and efficiency-driven optimisations—offers genuine insights into how AI infrastructure might evolve. Even if its performance leadership remains contested, the architecture itself shows a different way of tackling the immense computational demands of large-scale AI.
The Bigger Picture
Huawei’s announcement underscores the growing intensity of the AI hardware race. With Nvidia continuing to dominate the GPU market, rivals like Huawei are experimenting with alternative architectures to close the gap. Whether CloudMatrix can live up to its claims in independent trials remains to be seen, but the company has already made one thing clear: the race to redefine AI infrastructure is far from over.



