That would obviously be bad, but it sounds like it's all handled with code.
Quote:
- Start with a smart normal model, like DeepSeek-V3, and perform the following reinforcement-learning loop:
- Ask that model to solve a mathematical problem, with a prompt that pushes it to think step-by-step
- Verify the answer in code (i.e. not with a model, but by directly parsing the answer and checking it)
- If correct, reward the model; if wrong, punish the model
- Repeat for a long time
The "ask the model" part is probably the more manual step, since they'd need to curate a list of problems with known answers, though I suspect much of that work is already done and pulled from existing datasets.