Nous Research Unveils Token Superposition Training, Cutting Pretraining Time by 2–3x Amid Convergent-Research Controversy
Nous Research has introduced a new large language model pretraining method called Token Superposition Training (TST), which shortens pretraining time by a factor of two to three at equivalent computational budgets by compressing adjacent tokens during the early stages of training.
TST operates in two phases. During the first 20% to 40% of training, instead of processing tokens individually, the model "packs" adjacent tokens by averaging their embeddings and feeds the averaged representation into the model. The output objective is then to predict which tokens are contained in the next pack, without regard to their internal ordering. After this phase, the model reverts to standard next-token prediction. Since the underlying architecture remains unchanged, the resulting model behaves identically to conventionally trained models during inference. The approach has been validated on Mixture-of-Experts models with up to 10 billion parameters.
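A minimal PyTorch sketch of what such a packing phase could look like, based only on the description above; the function names, the pack size, and the choice of a multi-label (set-prediction) loss are illustrative assumptions rather than details taken from the paper:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the packing phase: average every `pack_size`
# adjacent token embeddings and train the model to predict the *set*
# of tokens in the next pack (order-agnostic, multi-label objective).

def pack_embeddings(token_ids, embedding, pack_size=2):
    """Average adjacent token embeddings into packed representations."""
    emb = embedding(token_ids)                    # (batch, seq, dim)
    b, s, d = emb.shape
    s = (s // pack_size) * pack_size              # drop any trailing remainder
    emb = emb[:, :s].reshape(b, s // pack_size, pack_size, d)
    return emb.mean(dim=2)                        # (batch, seq // pack_size, dim)

def next_pack_targets(token_ids, vocab_size, pack_size=2):
    """Multi-hot targets: which tokens appear in each *next* pack."""
    b, s = token_ids.shape
    s = (s // pack_size) * pack_size
    packs = token_ids[:, :s].reshape(b, s // pack_size, pack_size)
    targets = torch.zeros(b, s // pack_size, vocab_size)
    targets.scatter_(2, packs, 1.0)               # mark tokens present in each pack
    return targets[:, 1:]                         # pack i is predicted from pack i-1

# Toy usage (all dimensions are placeholders):
vocab_size, dim, pack_size = 100, 16, 2
embedding = torch.nn.Embedding(vocab_size, dim)
token_ids = torch.randint(0, vocab_size, (4, 32))

packed = pack_embeddings(token_ids, embedding, pack_size)      # model input
targets = next_pack_targets(token_ids, vocab_size, pack_size)
logits = torch.randn(targets.shape)   # stand-in for model outputs on packed[:, :-1]
loss = F.binary_cross_entropy_with_logits(logits, targets)
```

After the packing phase, the same model would simply be trained on ordinary token embeddings with the usual next-token cross-entropy, which is why nothing changes at inference time.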
At its core, TST trades data for compute: it achieves faster training at the cost of higher data throughput. This trade-off could become a drawback if high-quality text corpora grow scarce in the future.
Notably, within hours of the paper's release, readers pointed out that TST's mechanism bears a striking resemblance to the approach described in the 2024 paper "Beyond Next Token Prediction." The authors subsequently acknowledged on Hugging Face that this represents an "unfortunate case of convergent research" and committed to updating their paper with proper citations.