Why Smaller LLMs Fail at Critical Thinking

27 Aug 2025

Abstract and 1. Introduction

  2. Definition of Critique Ability

  3. Construction of CriticBench

    3.1 Data Generation

    3.2 Data Selection

  4. Properties of Critique Ability

    4.1 Scaling Law

    4.2 Self-Critique Ability

    4.3 Correlation to Certainty

  5. New Capacity with Critique: Self-Consistency with Self-Check

  6. Conclusion, References, and Acknowledgments

A. Notations

B. CriticBench: Sources of Queries

C. CriticBench: Data Generation Details

D. CriticBench: Data Selection Details

E. CriticBench: Statistics and Examples

F. Evaluation Settings

A NOTATIONS

The models used in this paper include PaLM-2 (Google et al., 2023), LLaMA (Touvron et al., 2023a), LLaMA-2[1] (Touvron et al., 2023b), and GPT (OpenAI, 2023) families.

For models available in multiple sizes, we explore scaling laws to show how their critique capabilities relate to model size. The exact parameter counts of the PaLM-2 series have not been made public; the models are instead categorized by T-shirt sizes (S, M, L) in Google et al. (2023). We extend this notation with two additional sizes, XXS and XS. When mentioned alone without a size specification, PaLM-2 refers to the large (L) version.
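Because exact parameter counts are unavailable, these sizes function only as an ordinal scale. A minimal sketch of that ordering (illustrative only; the list and helper below are not part of the paper):

```python
# Illustrative only: PaLM-2 parameter counts are not public, so T-shirt sizes
# are treated as an ordinal scale when examining scaling trends.
PALM2_SIZES = ["XXS", "XS", "S", "M", "L"]  # "PaLM-2" alone denotes the L version

def size_rank(size: str) -> int:
    """Return the position of a T-shirt size on the ordinal scaling axis."""
    return PALM2_SIZES.index(size)
```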

For the GPT family, we specifically evaluate the gpt-3.5-turbo-0613 and gpt-4-0613 models via OpenAI’s API[2]; these were the latest stable versions at the time of our study. For simplicity, we refer to gpt-3.5-turbo-0613 as ChatGPT and gpt-4-0613 as GPT-4 throughout this paper. Unless stated otherwise, all models are evaluated in their pretrained states, except for ChatGPT and GPT-4, which have undergone further fine-tuning.
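For concreteness, the pinned model versions can be queried through OpenAI’s chat completions API roughly as follows (a minimal sketch assuming the official `openai` Python client; the helper and prompt are illustrative and not the paper’s evaluation harness):

```python
# Minimal sketch: querying the pinned GPT models via OpenAI's chat completions
# API. Assumes the `openai` Python package (v1+ client) and an OPENAI_API_KEY
# environment variable; the prompt below is illustrative only.
from openai import OpenAI

client = OpenAI()

def query_model(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,  # e.g., "gpt-3.5-turbo-0613" (ChatGPT) or "gpt-4-0613" (GPT-4)
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # greedy decoding for reproducible evaluation
    )
    return response.choices[0].message.content

critique = query_model("gpt-4-0613", "Review the following solution and state whether it is correct.")
```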

B CRITICBENCH: SOURCES OF QUERIES

The goal of CRITICBENCH is to create a comprehensive, reliable, and fully open benchmark for evaluating critique ability in a diverse range of scenarios. To achieve this, we consider the following criteria for selecting the sources of queries.

Task Emergence A recent trend in rapidly developing large language models (LLMs) is fine-tuning a less capable LLM on outputs from a stronger proprietary model (Taori et al., 2023; Chiang et al., 2023). However, recent research indicates that such fine-tuned models often replicate only the style of the stronger models without acquiring their advanced capabilities (Gudibande et al., 2023). For instance, models like Alpaca (Taori et al., 2023) and Vicuna (Chiang et al., 2023) excel at tasks such as chitchat but underperform on complex tasks that demand emergent abilities (Wei et al., 2022a). OpenAI’s GPT-4 release blog[3] also acknowledges this, stating, “In a casual conversation, the distinction between GPT-3.5 and GPT-4 can be subtle. The difference comes out when the complexity of the task reaches a sufficient threshold.” Consequently, we focus on tasks that better differentiate models, i.e., tasks that demand advanced capabilities such as analytical and reasoning skills to perform well.

Task Diversity We aim to comprehensively evaluate the critique abilities of LLMs across a diverse range of tasks and scenarios, in contrast to previous studies such as Saunders et al. (2022), which typically focus on a single task. Our dataset selection strategy is largely inspired by the PaLM 2 and GPT-4 technical reports (Google et al., 2023; OpenAI, 2023), which offer valuable examples of, and guidelines for, categorizing tasks in a way that illuminates the core capabilities and applications of LLMs.

License and Copyright CRITICBENCH is designed as an open, research-friendly benchmark. We exclusively consider data sources available under permissive licenses, such as the MIT License[4] and Apache License 2.0[5]. In addition, special attention is given to copyright considerations. For instance, summarization datasets like XLSum (Hasan et al., 2021) are often derived from news articles, and redistributing these articles may constitute copyright infringement. Such datasets are therefore intentionally left out of our benchmark.

B.1 SELECTED TASKS

Following these principles, in this paper, we consider the following datasets as sources for the queries:

GSM8K (Cobbe et al., 2021). A dataset of 8.5K mathematical reasoning problems, widely used to evaluate models’ arithmetic reasoning and their ability to compose mathematical steps in natural language.

HumanEval (Chen et al., 2021). A dataset of 164 hand-written Python programming problems, complete with text comments and docstrings, designed to assess the coding abilities of models.

TruthfulQA (Lin et al., 2021). A question-answering dataset of 817 manually crafted questions that humans often answer incorrectly due to misconceptions or false beliefs. It evaluates whether models can produce outputs that align with real-world facts and common sense.

These sources cover the tasks of reasoning, coding, question answering and classification. As our data collection method is scalable and generalizable across tasks, we view the construction of CRITICBENCH as a continuous effort. This paper serves as an initial step, presenting three representative datasets. We hope to extend the mixture to cover more tasks and scenarios in future work.
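For reference, all three query sources are publicly available and can be loaded, for example, via the Hugging Face `datasets` library (a hedged sketch; the Hub identifiers below are the commonly used dataset names and are not prescribed by the paper):

```python
# Hedged sketch: loading the three query sources from the Hugging Face Hub.
# The identifiers are the commonly used community names, not a requirement of
# CriticBench itself.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")                   # 8.5K math word problems (train/test)
humaneval = load_dataset("openai_humaneval")            # 164 Python problems ("test" split)
truthfulqa = load_dataset("truthful_qa", "generation")  # 817 questions ("validation" split)

print(gsm8k["test"][0]["question"])
print(humaneval["test"][0]["prompt"])
print(truthfulqa["validation"][0]["question"])
```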

Authors:

(1) Liangchen Luo, Google Research ([email protected]);

(2) Zi Lin, UC San Diego;

(3) Yinxiao Liu, Google Research;

(4) Yun Zhu, Google Research;

(5) Jingbo Shang, UC San Diego;

(6) Lei Meng, Google Research ([email protected]).


This paper is available on arxiv under CC BY 4.0 DEED license.

[1] All access to the LLaMA-2 model was performed by Zi. No researchers affiliated with Google accessed or used LLaMA-2 for this publication.

[2] https://platform.openai.com/docs/models

[3] https://openai.com/research/gpt-4

[4] https://opensource.org/license/mit/

[5] https://www.apache.org/licenses/LICENSE-2.0