RLT from Sakana AI: Small AI teachers train large models – breakthrough in machine learning

Sakana AI’s Reinforcement Learning Teachers (RLT) turn established approaches to training large language models on their head – small 7B-parameter models can now successfully teach 32B-parameter students.

The RLT methodology developed by Sakana AI is based on a fundamentally new principle: instead of rewarding teacher models for the correctness of their own solutions, they receive rewards based on how well their explanations help student models learn. This approach allows a 7B-parameter teacher to outperform both larger models such as QwQ-32B and established systems such as DeepSeek R1 in various benchmarks – at a fraction of the computational cost.

The two-phase training system combines supervised learning with reinforcement learning in an innovative way. In the first phase, teacher models undergo supervised fine-tuning on question-answer pairs from the bespokelabs/Bespoke-Stratos-17k dataset. The second phase uses reinforcement learning, where the reward signals are derived directly from the performance of the student models. This architecture requires only a few days of training, compared to the weeks required by traditional RLHF methods.
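
To make the two-phase structure concrete, here is a minimal sketch of the training loop; all class and function names are hypothetical stand-ins, not Sakana AI's actual API:

```python
# Minimal sketch of the two-phase RLT pipeline. All classes and helpers
# here are hypothetical placeholders, not Sakana AI's actual code.

class Teacher:
    def explain(self, question: str, solution: str) -> str:
        """Stub: generate a step-by-step explanation for a known solution."""
        return f"Step-by-step reasoning for: {question}"

def sft_step(teacher: Teacher, question: str, solution: str) -> None:
    """Phase 1 stub: one supervised fine-tuning step on a QA pair."""
    pass

def student_learning_reward(student, question: str,
                            explanation: str, solution: str) -> float:
    """Phase 2 stub: score how much the explanation helps the student
    reproduce the reference solution (higher = more teachable)."""
    return 0.0

def rl_step(teacher: Teacher, explanation: str, reward: float) -> None:
    """Phase 2 stub: one policy-gradient update of the teacher."""
    pass

def train_rlt_teacher(teacher: Teacher, student, qa_pairs):
    # Phase 1: supervised warm-up on question-answer pairs,
    # e.g. from bespokelabs/Bespoke-Stratos-17k.
    for question, solution in qa_pairs:
        sft_step(teacher, question, solution)
    # Phase 2: reinforcement learning, rewarding teachability
    # of the explanation instead of teacher correctness.
    for question, solution in qa_pairs:
        explanation = teacher.explain(question, solution)
        reward = student_learning_reward(student, question, explanation, solution)
        rl_step(teacher, explanation, reward)
    return teacher
```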

The performance data speaks for itself: the RLT-7B teacher scores 23.3% on the AIME 2024 benchmark, 82.8% on MATH-500 and 42.4% on GPQA Diamond. That beats not only the 7B Bespoke base model but also the more than four times larger QwQ-32B on GPQA Diamond. Scaled up to 32B students, RLT-32B achieves an impressive 89.7% on MATH-500 and 68.3% on GPQA Diamond.

Technical architecture and system design

The RLT pipeline differs fundamentally from traditional approaches through its teacher-student interface. Teachers receive question-solution pairs and generate step-by-step explanations with specialized reasoning tags. The reward model calculates reward = f(student_solution_accuracy | teacher_explanation) instead of the usual reward = f(teacher_solution_correctness).
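
As a rough illustration of such a student-based reward, the sketch below scores a teacher explanation by the mean log-probability the student assigns to the reference solution, using standard Hugging Face APIs. The published RLT reward includes additional terms omitted here, so treat this as an assumption-laden simplification, not the exact formula:

```python
# Illustrative reward: mean log-probability the student assigns to the
# reference solution, conditioned on the question and the teacher's
# explanation. A simplification of the RLT reward, not the exact formula.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def student_solution_logprob(student, tokenizer, question, explanation, solution):
    context = f"{question}\n{explanation}\n"
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    solution_ids = tokenizer(solution, return_tensors="pt",
                             add_special_tokens=False).input_ids
    input_ids = torch.cat([context_ids, solution_ids], dim=1)

    with torch.no_grad():
        logits = student(input_ids).logits

    # Each position t in logits predicts token t+1, so the solution tokens
    # are predicted by the logits one position earlier.
    start = context_ids.shape[1]
    log_probs = F.log_softmax(logits[:, start - 1:-1, :], dim=-1)
    token_logps = log_probs.gather(-1, solution_ids.unsqueeze(-1)).squeeze(-1)
    return token_logps.mean().item()  # higher = explanation more teachable

# Hypothetical usage with any Hugging Face causal LM as the student:
# student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
# reward = student_solution_logprob(student, tokenizer, q, expl, sol)
```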

The distillation process involves three critical adjustments for optimal results: for 32B students, multiple explanation traces are collected per question to prevent context-window overflows; quality control uses the raw RLT output without post-processing to preserve its educational value; and curriculum alignment keeps the hyperparameters identical, including a learning rate of 1e-6 and a batch size of 1024, as the configuration sketch below illustrates.
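
Collected as a configuration, these settings might look as follows; the field names and the trace count are illustrative assumptions, only the learning rate and batch size are reported values:

```python
# Hedged sketch of the distillation settings described above. Field names
# and the trace count are illustrative; only learning rate and batch size
# are reported values.
distill_config = {
    "learning_rate": 1e-6,        # identical across student sizes
    "batch_size": 1024,           # identical across student sizes
    "traces_per_question": 4,     # hypothetical value: multi-trace collection
                                  # for 32B students avoids context overflow
    "postprocess_traces": False,  # use raw RLT output to preserve
                                  # its educational value
}
```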

Practical implementation and economic impact

Hardware requirements for RLT remain moderate: 8× H100 GPUs for the warm-up phase, 4× parameter servers plus 4× learners for the RL phase, and 400GB of storage for training 32B students. The entire RLT training pipeline requires approximately $15,000 in computing resources compared to over $2 million for traditional RLHF implementations.
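
As a back-of-the-envelope view of these numbers (the figures come from the article; the layout of the dict itself is purely illustrative):

```python
# Reported resource plan and cost comparison; values are taken from the
# article, the structure of this dict is purely illustrative.
rlt_budget = {
    "warmup_gpus": "8x H100",
    "rl_topology": {"parameter_servers": 4, "learners": 4},
    "storage_gb": 400,                    # for training 32B students
    "rlt_cost_usd": 15_000,
    "traditional_rlhf_cost_usd": 2_000_000,
}

savings = rlt_budget["traditional_rlhf_cost_usd"] / rlt_budget["rlt_cost_usd"]
print(f"RLT is roughly {savings:.0f}x cheaper")  # ~133x
```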

The open-source availability of the RLT-7B and RLT-32B models on Hugging Face under Apache 2.0 license democratizes access to advanced reasoning capabilities. Smaller research groups and companies can now develop high-quality reasoning models without the need for massive computing resources. This accessibility could accelerate the development of specialized AI applications in various fields.

Summary:
RLT enables 7B models to successfully train 32B students – a paradigm shift in AI development
Reward system based on student performance rather than teacher correctness, enabling more efficient learning
Significantly lower costs: $15,000 vs. $2 million for traditional methods
Superior benchmark results: RLT-7B outperforms QwQ-32B in GPQA Diamond tests
Open source availability under Apache 2.0 license democratizes access to advanced reasoning models
Two-phase training combines supervised learning with reinforcement learning in just a few days
Practical applicability through moderate hardware requirements and detailed implementation instructions

Source: GitHub