Apple reveals a simple trick that makes AI 8% smarter

Scientists at Apple have presented a study showing that large language models (LLMs) can significantly improve task accuracy when using an old and proven tool – checklists.

Context: how language models are typically trained

After training, the LLM undergoes a fine-tuning phase using the Reinforcement Learning from Human Feedback (RLHF) method. At this stage, human reviewers rate the model’s responses: a “like” increases the likelihood of a similar response in the future, while a “dislike” decreases it. This approach helps to make the responses more useful and safer.

The RLHF has weaknesses, however. The model can learn to produce “seemingly correct” answers that don’t actually solve the problem, just give the illusion of correctness.

The model can learn to give “seemingly correct” answers that don’t actually solve the problem.

What Apple researchers suggested

In an article “Checklists Are Better Than Reward Models For Aligning Language Models” the company proposed a new method – Reinforcement Learning from Checklist Feedback (RLCF).

Instead of a general “like/dislike” rating, model responses are checked against a list of specific items (“Is it translated into Spanish?”, “Is there formatting?”, etc.).
Each item is rated on a scale of 0 to 100.
The more powerful model (“teacher”) checks the answers and assigns scores that become a signal for pre-training of the basic (“student”) model.

Apple has even created a dataset WildChecklists with 130,000 instructions and automatically generated checklists.

Results of the study

Using the RLCF method, the researchers tested the Qwen2.5-7B-Instruct model on five popular benchmarks. The results showed consistent improvement across all metrics:

+4 points in FollowBench
+6 points in InFoBench
+3 points in Arena-Hard

In individual tests, the improvement reached 8.2%. This makes checklists a more effective method than classical reward models.

Limitations of the approach

The method focuses on complex multi-step instructions, but may be less useful in other scenarios.
Validation requires a more powerful model, making the process resource-intensive.
Important: RLCF improves instruction adherence but does not solve the problem of security of models.

Why it matters

The authors emphasize: with the growing popularity of LLMs as personal assistants, users expect models to rigorously and accurately perform complex, step-by-step tasks. In the future, as such assistants gain more autonomy, accuracy in following instructions will be a key factor in their usefulness.

Apple reveals a simple trick that makes AI 8% smarter

Context: how language models are typically trained

What Apple researchers suggested

Results of the study

Limitations of the approach

Why it matters

Apple released macOS Tahoe 26 Public Beta 5: How to install the update

Blizzard is drastically changing the levels and rewards in Overwatch 2

More in:AI and neural networks

DeepSeek unveiled new open-source models, claiming advantages over GPT-5 and Gemini 3 Pro

Top announcements from the first day of AWS re:Invent 2025: AI agents, Trainium3 and new Nova models

OpenAI is preparing to add advertising to ChatGPT

Google announces a major update to the Gemini app

News

OnePlus temporarily disables artificial intelligence feature due to censorship allegations

Samsung will create a continuous zoom lens for smartphones

M.Video now sells motorcycles

The MAX messenger has been disrupted in Russia

Sber has released a new iOS app called Finance Online on the App Store

About ForGeeks.pro

ForGeeks.pro

Search

Context: how language models are typically trained

What Apple researchers suggested

Results of the study

Limitations of the approach

Why it matters

Share

You may also like

More in:AI and neural networks

News

Latest Posts