A video AI company has built a general-purpose robot brain. This is not a concept pitch; it is a verified reality.
Unlike traditional, task-specific robot controllers, this new "brain" integrates the predictive reasoning of a world model with the ability to output actionable commands, achieving a true synthesis of "knowing and doing."
The model, named MotuBrain, quietly topped two international benchmarks in mid-April. Its origin remained a mystery for three weeks, sparking widespread speculation among robotics experts.
Now, ShengShu Technology has stepped forward to claim it.
Yes, this is the same company behind Vidu, the video generation model used by CCTV to produce AI-driven Journey to the West animations. The announcement confirms that the company's generative model expertise extends from digital worlds into the physical realm.
Dual Benchmarks, Single Model: Unprecedented Dominance
MotuBrain’s performance is defined by its simultaneous victory on two distinct and challenging benchmarks: one tests the ability to understand the physical world, the other the capacity to act within it. Imagine a single champion topping both a theoretical physics competition and a hands-on forklift driving exam; that is the calibre of MotuBrain's achievement.
The report card is definitive:
WorldArena: Ranked first in Motion Quality and first in Motion Smoothness. This benchmark evaluates whether a model can truly simulate physical dynamics, predicting object trajectories, collisions, and continuous motion.
RoboTwin2.0: The only model to surpass an average score of 95 out of 100 in randomized environments. This benchmark sets 50 diverse tasks across standard and randomly perturbed settings, testing generalization.
Excelling at just one of these was considered a top-tier achievement. Dominating both simultaneously, and with the highest measurable motion smoothness and robustness to randomness, redefines the standard for generalist robot models.
From Pixels to Action: The World Action Model Approach
Why can a company famous for video AI leap ahead in embodied intelligence? The internal logic is compelling: the future of embodied AI rests on World Action Models, which must be built upon a video model's deep comprehension of the physical world. Understanding why a car drifts, why tires smoke, and predicting what happens next are all foundational capabilities shared between generating realistic video and directing a robot.
The secret to MotuBrain is its architecture. The industry typically chooses between a Vision-Language-Action (VLA) model that directly outputs actions, or a World Model that simulates outcomes for a separate planner. MotuBrain takes a third path: it fuses prediction and action generation into a single, unified model. This "predict-while-acting" design eliminates the lag of sequential imagination and execution, offering faster response and more coherent action sequences.
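The fused design described above can be sketched in miniature: a single shared latent representation feeds both a world-prediction head and an action head, so "imagining" the next world state and emitting a command happen in one forward pass rather than two sequential stages. Everything here (the class name, dimensions, and simple linear layers) is a toy illustration of the predict-while-acting idea, not MotuBrain's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class UnifiedWorldActionModel:
    """Toy sketch: one forward pass yields both a world prediction
    and an action, sharing a single latent representation."""

    def __init__(self, obs_dim, act_dim, latent_dim=32):
        # Shared encoder plus two heads; weights are random placeholders.
        self.W_enc = rng.normal(0, 0.1, (latent_dim, obs_dim))
        self.W_pred = rng.normal(0, 0.1, (obs_dim, latent_dim))  # world-prediction head
        self.W_act = rng.normal(0, 0.1, (act_dim, latent_dim))   # action head

    def step(self, obs):
        z = np.tanh(self.W_enc @ obs)      # shared latent state
        next_obs_pred = self.W_pred @ z    # "imagine" the next world state
        action = np.tanh(self.W_act @ z)   # act in the same pass
        return next_obs_pred, action

model = UnifiedWorldActionModel(obs_dim=8, act_dim=4)
obs = rng.normal(size=8)
pred, act = model.step(obs)  # prediction and command from one forward pass
```

A sequential world-model-plus-planner pipeline would instead run the prediction head to completion, hand the imagined rollout to a separate planner, and only then emit an action; fusing the two heads removes that hand-off latency.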
Capabilities Demonstrated: One Brain, Diverse Tasks
A newly released industrial-grade demo provides tangible proof. Without complex high-level planners or pre-scripted motions, MotuBrain exhibits four core capabilities across three different humanoid robot platforms:
Cross-Embodiment ("One Brain, Multiple Bodies"): The same model drives different robots with varying sensors and degrees of freedom, creating a universal intelligence base that improves with more diverse data.
Long-Horizon Task Execution: It completes complex, continuous tasks, such as flower arranging and tidying a sofa, that involve 10 or more atomic actions, maintaining context and fluidity without pauses.
Predictive and Adaptive Action: In a hotpot serving task, the robot notices an empty ladle, visually confirms the absence, and autonomously re-plans to scoop properly. It predicts world states and adjusts actions on the fly, rather than blindly following a script.
Multi-Task Proficiency: From precise cocktail mixing, which requires bimanual coordination and delicate pouring, to organizing a washbasin, the model demonstrates stable performance across a broad repertoire. As the number of diverse training tasks increases, MotuBrain's shared world knowledge expands and its average task success rate improves, a scaling law for task diversity.
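The cross-embodiment capability from the list above can be sketched as one shared "brain" network paired with per-robot output adapters that map a common latent policy into each body's action space. All names, platforms, and degree-of-freedom counts below are hypothetical illustrations, not details of MotuBrain or its partner robots.

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT = 16  # size of the shared latent policy space (illustrative)

# Hypothetical shared "brain": maps observations into one latent space.
W_brain = rng.normal(0, 0.1, (LATENT, 12))

# Per-robot adapters map the shared latent to each body's action space.
# DoF counts per platform are made up for the sketch.
adapters = {
    "humanoid_a": rng.normal(0, 0.1, (23, LATENT)),  # 23-DoF humanoid
    "humanoid_b": rng.normal(0, 0.1, (29, LATENT)),  # 29-DoF humanoid
    "arm_pair":   rng.normal(0, 0.1, (14, LATENT)),  # dual 7-DoF arms
}

def act(robot, obs):
    z = np.tanh(W_brain @ obs)   # same brain for every body
    return adapters[robot] @ z   # body-specific action head

obs = rng.normal(size=12)
commands = {robot: act(robot, obs) for robot in adapters}
```

Because the encoder is shared, experience gathered on any one platform refines the common latent space that every other platform draws on, which is the mechanism behind the "improves with more diverse data" claim.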
The Strategic Vision: Vidu and MotuBrain, Two Paths, One Foundation
ShengShu Technology’s emergence in embodied AI is not a sudden pivot but a calculated convergence. Both MotuBrain and its creative video sibling, Vidu, are built on the company's proprietary U-ViT architecture. This foundation unifies the processing of multimodal data—visual, auditory, and tactile—fostering a shared understanding of objects, motion, and causality. Vidu was trained to understand and generate a plausible physical world; MotuBrain was trained to act within it.
This dual-track strategy offers a distinct advantage. Few robot brain companies possess foundational video models, and few video AI companies have robot action data. ShengShu is one of the rare players with both. Backed by a recent Series B funding round of nearly 2 billion RMB led by Alibaba, the company has already formed strategic partnerships with embodied AI firms like Wujie Dynamics, Shenpu Intelligence, and Stardust Intelligence to drive industrial-scale deployment across manufacturing, commercial, and service scenarios.
As the consensus in the embodied AI industry shifts from building more dexterous hardware to creating a truly generalist "brain," ShengShu Technology has delivered a model that not only leads two critical benchmarks but also demonstrates that a unified path to general physical intelligence is operational. If generating video was mastering the digital world, MotuBrain represents the decisive next step: enabling AI to intelligently navigate and act within our own.