Apple reveals a simple trick that makes AI 8% smarter

Scientists at Apple have presented a study showing that large language models (LLMs) can significantly improve task accuracy when using an old and proven tool – checklists.

Context: how language models are typically trained

After training, the LLM undergoes a fine-tuning phase using the Reinforcement Learning from Human Feedback (RLHF) method. At this stage, human reviewers rate the model’s responses: a “like” increases the likelihood of a similar response in the future, while a “dislike” decreases it. This approach helps to make the responses more useful and safer.

The RLHF has weaknesses, however. The model can learn to produce “seemingly correct” answers that don’t actually solve the problem, just give the illusion of correctness.

The model can learn to give “seemingly correct” answers that don’t actually solve the problem.

What Apple researchers suggested

In an article “Checklists Are Better Than Reward Models For Aligning Language Models” the company proposed a new method – Reinforcement Learning from Checklist Feedback (RLCF).

Instead of a general “like/dislike” rating, model responses are checked against a list of specific items (“Is it translated into Spanish?”, “Is there formatting?”, etc.).
Each item is rated on a scale of 0 to 100.
The more powerful model (“teacher”) checks the answers and assigns scores that become a signal for pre-training of the basic (“student”) model.

Apple has even created a dataset WildChecklists with 130,000 instructions and automatically generated checklists.

Results of the study

Using the RLCF method, the researchers tested the Qwen2.5-7B-Instruct model on five popular benchmarks. The results showed consistent improvement across all metrics:

+4 points in FollowBench
+6 points in InFoBench
+3 points in Arena-Hard

In individual tests, the improvement reached 8.2%. This makes checklists a more effective method than classical reward models.

Limitations of the approach

The method focuses on complex multi-step instructions, but may be less useful in other scenarios.
Validation requires a more powerful model, making the process resource-intensive.
Important: RLCF improves instruction adherence but does not solve the problem of security of models.

Why it matters

The authors emphasize: with the growing popularity of LLMs as personal assistants, users expect models to rigorously and accurately perform complex, step-by-step tasks. In the future, as such assistants gain more autonomy, accuracy in following instructions will be a key factor in their usefulness.

Apple reveals a simple trick that makes AI 8% smarter

Context: how language models are typically trained

What Apple researchers suggested

Results of the study

Limitations of the approach

Why it matters

Apple released macOS Tahoe 26 Public Beta 5: How to install the update

Blizzard is drastically changing the levels and rewards in Overwatch 2

More in:AI and neural networks

Gemini has added an “Answer now” button to skip in-depth analysis

Samsung has confirmed the free status of key Galaxy AI features

Google launches personal intelligence feature for Gemini

OpenAI has launched a separate ChatGPT Translate service for text, speech and image translation

News

Anker launched a pink version of the 10,000 mAh Nano powerbank with a retractable USB-C cable

Vivo X200T will get Dimensity 9400+, 5000 nits screen and ZEISS cameras

Adobe has added new AI tools to Premiere and After Effects

Sony hands over control of its smart TV business to TCL

Insiders disagree on iPhone Air 2 release date

About ForGeeks.pro

ForGeeks.pro

Search

Context: how language models are typically trained

What Apple researchers suggested

Results of the study

Limitations of the approach

Why it matters

Share

You may also like

More in:AI and neural networks

News

Latest Posts