That would obviously be bad, but it sounds like it's all handled with code.
Quote:
- Start with a smart normal model, like DeepSeek-V3, and perform the following reinforcement-learning loop:
- Ask that model to solve a mathematical problem, with a prompt that pushes it to think step-by-step
- Verify the answer in code (i.e. not with a model, but by directly parsing the answer and checking it)
- If correct, reward the model; if wrong, punish the model
- Repeat for a long time
The "ask the model" part is probably the more manual step, since they'd need to curate a list of problems with known answers, though I suspect much of that work is already done and pulled from existing datasets.