Orca 2: Enhancing Reasoning in Smaller Language Models - Preliminaries

29 May 2024


(1) Arindam Mitra;

(2) Luciano Del Corro, work done while at Microsoft;

(3) Shweti Mahajan, work done while at Microsoft;

(4) Andres Codas, denote equal contributions;

(5) Clarisse Simoes, denote equal contributions;

(6) Sahaj Agarwal;

(7) Xuxi Chen, work done while at Microsoft;;

(8) Anastasia Razdaibiedina, work done while at Microsoft;

(9) Erik Jones, work done while at Microsoft;

(10) Kriti Aggarwal, work done while at Microsoft;

(11) Hamid Palangi;

(12) Guoqing Zheng;

(13) Corby Rosset;

(14) Hamed Khanpour;

(15) Ahmed Awadall.

Abstract and Introduction


Teaching Orca 2 to be a Cautious Reasoner

Technical Details

Experimental Setup

Evaluation Results


Conclusions and References

A. AGIEval Subtask Metrics

B. BigBench-Hard Subtask Metrics

C. Evaluation of Grounding in Abstractive Summarization

D. Evaluation of Safety

E. Prompts used in Evaluation

F. Illustrative Example from Evaluation Benchmarks and Corresponding Model Outpu

2 Preliminaries

2.1 Instruction Tuning

Instruction tuning [46, 38, 62, 61] has emerged as a crucial step in training language models. Instruction tuning involves learning from input-output pairs where the input is natural language task description,and the output is a demonstration of the desired behavior. Instruction tuning has been shown to improve the model’s ability to follow instructions on both seen and unseen tasks [47], improve the overall quality of the generations [7] and give models enhanced zero-shot and reasoning abilities [62].

Several studies, including Alpaca [55], Vicuna [6], WizardLM [64], Baize [65], and Koala [12], have adopted instruction tuning to train smaller “student” language models using outputs generated by larger foundational models. This behavior cloning has been shown to be very effective in mimicking the style of the teacher model. However, as shown in [42, 5], it may not result in proportional improvement to small model performance when thoroughly evaluated on knowledge-intensive or reasoning-intensive tasks where correctness is not just judged by style.

We note that instruction tuning, while very beneficial for teaching the model how to solve a task, does not necessarily teach the model new knowledge. Hence instruction tuned models will be always limited by the knowledge learned during pre-training. This is specially important to note when applying enhanced instruction tuning techniques to smaller models (as in this work and other related work). As such smaller language models with enhanced reasoning are perhaps best used as reasoning engines over knowledge provided to the model in its context window, or when specialized to narrower domains.

2.2 Explanation Tuning

One of the known weaknesses of instruction tuning is that a resulting student model could learn to generate stylistically correct, but ultimately wrong, outputs [13]. For example, instruction-tuning towards targets that are too terse limits the student’s visibility into what could have been a complex reasoning process, thus hindering its generalization ability to other tasks. In Orca 1, we introduced Explanation Tuning [42] to address this drawback by training student models on richer and more expressive reasoning signals. The mechanism for procuring these signals is system instructions[2] crafted to obtain detailed explanations from a teacher model as it reasons through a task. System instructions are additional high level guidelines an LLM is supposed to adhere to as it addresses individual user prompts, from which they are separated by a “system” role flag in a ChatML dialogue interface [3].

Explanation tuning begins with a compilation of N hand-crafted, general purpose system instructions designed to elicit more careful reasoning. Some examples include “think step-by-step”, “generate detailed answers”, etc. The primary objective of these system instructions is to extract rich demonstrations of “Slow Thinking” [22] from capable LLMs like GPT-4. They are then combined with user prompts from a vast and diverse set of tasks to yield a dataset of (system instruction, user prompt, LLM answer) triplets. The student model is trained to predict the LLM answer from the other two inputs.

If user prompts can be grouped into M distinct clusters representing similar kinds of questions, then Explanation Tuning naively yields a cross product of M × N different answers addressing different aspects of the task. Since more capable LLMs tend to vary their responses with the system instruction, this offers an easy path to increase the quantity and diversity of training signals. Numerous models such as Orca 1 [42], StableBeluga [35] and Dolphin[4] have capitalized on Explanation Tuning to demonstrate substantial improvements over traditional instruction-tuned models, especially in complex zero-shot reasoning tasks.

This paper is available on arxiv under CC 4.0 license.

[2] used interchangeably with system message

[3] e.g. see https://platform.openai.com/docs/api-reference/making-requests

[4] https://huggingface.co/datasets/ehartford/dolphin