Don’t make your LLM an evaluation benchmark cheater

It’s a somewhat provocative title. Not mine! It is the title of a very serious pre-print recently published on arXiv by highly competent scientists: “Don’t make your LLM an evaluation benchmark cheater!”

What is this about? As you can imagine, it is about benchmarking the performance of LLMs, and more specifically about not doing it in a way that could be biased. The authors study an interesting concept related to LLM training and benchmarking: benchmark leakage. Moreover, they conducted numerous experiments to evaluate how much the measured performance of an LLM on a test benchmark like MMLU is influenced by the presence of MMLU data in its pre-training phase.

Training models to check the influence of data leakage

Before going into the details, a quick reminder on how LLMs and generative tools are trained, as most of the paper’s experiments were made possible because the authors – to make their point and demonstrate their theory – completely trained some open-source LLMs from scratch.

As a reminder, an LLM is built through multiple phases. The most important ones are pre-training and fine-tuning.

The first one (and the most demanding in terms of computing power) is pre-training, which takes a long time – a few days to a few months – to complete. Auto-regressive models (e.g. GPT, Bard) are uni-directional: they are trained to predict the next word without seeing the succeeding ones, because those models are specifically optimized for language generation. During pre-training, we are therefore not training the model for specific language tasks (like generation or named entity recognition) but only making it learn how to predict words in a sentence. This pre-training process builds the pre-trained language model (PLM). Training a PLM is usually costly (a few thousand to more than a million dollars), which makes the experiments presented in the paper we describe here very ambitious.
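To make this concrete, here is a minimal sketch of the next-word-prediction objective that pre-training optimizes. This is a toy illustration in PyTorch, not the actual training code of any of these models: the sizes are invented and the single linear layer stands in for a full Transformer stack.

```python
import torch
import torch.nn as nn

# Toy stand-in for a PLM: embedding + one linear layer instead of a
# full Transformer stack. Sizes are illustrative.
vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 16))  # a fake token sequence

# Auto-regressive objective: predict token t+1 from the tokens before it.
logits = lm_head(embed(tokens[:, :-1]))
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),  # predicted next-token distributions
    tokens[:, 1:].reshape(-1),       # the actual next tokens
)
loss.backward()  # pre-training repeats this step over a huge corpus
```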

To write this post we consulted:
Zhou, Kun, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han. “Don’t Make Your LLM an Evaluation Benchmark Cheater”. arXiv, 3 November 2023. https://doi.org/10.48550/arXiv.2311.01964.

After pre-training, the fine-tuning process

During the fine-tuning process, a task-specific layer (e.g. sentence classification, named entity recognition, question answering) is added to the PLM, and the usual backpropagation method is carried out with a suitable loss function. Reinforcement Learning from Human Feedback (RLHF), used for generative models such as GPT or Claude, is a fine-tuning method that relies on samples of prompts (prototypes of questions and answers corrected by humans).
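As a rough sketch of the first mechanism (the added task-specific layer), again in toy PyTorch with illustrative sizes and a deliberately crude pooling step:

```python
import torch
import torch.nn as nn

d_model, num_labels = 64, 2
pretrained_encoder = nn.Embedding(1000, d_model)  # stand-in for the PLM
task_head = nn.Linear(d_model, num_labels)        # the added task-specific layer

tokens = torch.randint(0, 1000, (8, 16))     # a fake batch of 8 sentences
labels = torch.randint(0, num_labels, (8,))  # fake classification labels

features = pretrained_encoder(tokens).mean(dim=1)  # crude sentence pooling
loss = nn.functional.cross_entropy(task_head(features), labels)
loss.backward()  # the usual backpropagation, now on the task loss
```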

The question of data leakage during training

The issue is that if, during any phase (and most specifically the pre-training phase), part of the training data includes the answers to the questions asked by standardized test benchmarks, those benchmarks are biased.

By biased we mean that the benchmark might not measure what it claims to measure (like reasoning, in the case of MMLU), as we explained in this post, because the model has already seen the answer to the question during training.

And “To make matters worse, the detailed composition (e.g., data sources) of the training corpus is often regarded as the core “secret” of existing LLMs. Therefore, it becomes difficult to directly examine the contamination issues when evaluating benchmark maintainers.” (Zhou et al., 2023, p. 2)

Such leakage has already been demonstrated in multiple instances: it has been shown that GPT-3’s pre-training corpus included the Children’s Book Test dataset (another test benchmark) (Hill et al., 2016), and the LLaMA-2 authors have mentioned that the contexts in the BoolQ dataset (Clark et al., 2019) are extracted verbatim from webpages that may be included in publicly available corpora. We have also shown in our previous articles that some bar exams used in the MMLU benchmark are available online (with answers) and could have been used to train Gemini (the chat LLM from Google).
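For the record, a crude way to look for such verbatim leakage, assuming you have the training corpus at hand, is to test whether long word n-grams of a benchmark item also occur in the corpus. This is a simplistic sketch of the idea, not the method used in the paper:

```python
def ngrams(text: str, n: int = 13) -> set:
    """All n-word shingles of a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item: str, corpus_ngrams: set) -> bool:
    """Flag the item if any of its 13-grams also appears in the corpus."""
    return not ngrams(benchmark_item).isdisjoint(corpus_ngrams)

# Usage: build corpus_ngrams once over the training data, then scan every
# benchmark question; any hit suggests the item leaked into training.
```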

Demonstrating how leakage can boost benchmark results

So it is known that benchmark data can leak into training data, but we do not know how much (because of the secrecy around the datasets used), and since we do not know the size and the nature of the potential leaks, it is difficult to evaluate their impact. Here come our authors: they do not answer the question of the volume of the leaks, but they built an experiment that tells us what the impact of a leak would be.

For this empirical study, they selected the MMLU benchmark (frequently claimed to be a reasoning and reading comprehension test) for evaluation. That is particularly interesting for us, as it is precisely this MMLU benchmark that we challenged in our last post.

What they did then was ambitious: they retrained from scratch four real open-source models (by real, we mean models for which both the code and the training data are publicly available) in five different configurations, with and without a leak of MMLU benchmark data. We don’t know where they found the money to conduct such experiments, but they did it! They trained:

  • GPT-Neo-1.3B (Black et al., 2021): a Transformer-based model with the GPT-3 architecture, pre-trained on the Pile dataset (Gao et al., 2021).
  • phi-1.5 (Li et al., 2023): a 1.3B model trained on “textbook quality” data of ≈27B tokens, which can achieve performance comparable to that of much larger models.
  • OpenLLaMA-3B (Geng and Liu, 2023): an open-source project to reproduce the LLaMA model with a permissive license, pre-trained on the RedPajama dataset (Computer, 2023) of over 1.2T tokens.
  • LLaMA-2-7B (Touvron et al., 2023b): an updated version of LLaMA (Touvron et al., 2023a), pre-trained on a mixture of publicly available online data of 2T tokens.

As you can see, they also retrained LLaMA-2, and I am still puzzled by this, as the training data for this model are not documented as far as I know (the paper should be more detailed on this point). The five configurations were as follows:

  • Model with the original training data
  • Model with the original training data plus the MMLU training data
  • Model with the original training data plus the training data of all other tests
  • Model with the original training data, the training data of all other tests, and their test data
  • A fifth configuration, which the authors suggest setting aside for now for experimental reasons.

Then all those models were tested on 8 benchmarks, and the results are below. We will not comment on all the results in detail (we suggest readers dive into the paper for that). We only focus on MMLU, and we see, with no doubt, that when you include the answers to the tests in the training data, the model performs better on MMLU reasoning tasks!

According to the arXiv paper: the comparison among three benchmark leakage settings and the original LLMs on MMLU and QA tasks. “Train S”, “Test P” and “Test P&S” denote the data leakage scenarios that use the training set, test prompt, and both test set and test prompt during training, respectively. The task abbreviations are as follows: HSwag (HellaSwag), WG (WinoGrande), ARC-E (ARC-Easy), ARC-C (ARC-Challenge), and OBQA (OpenBookQA). The results in gray are the worst leakage settings using all the test sets and are reported only for reference. The best results in each group are in bold, except for the aforementioned worst case.

As the authors state: the experimental results reveal that benchmark leakage can lead to an unfair boost in the evaluation performance of LLMs. Smaller LLMs (e.g., a 1.3B model) can be deliberately elevated to outperform 10× larger models on certain tasks. As a side effect, the performance of these specially trained LLMs on other, normally tested tasks would likely be adversely affected if we fine-tune or train the model only with these leaked data.

Some recommendations for LLM developers

As said previously, this work does not prove that benchmark data (with questions and answers) are used to train the big names of LLMs. But it gives a very good idea of what would happen if that were the case. And that leads to some recommendations for LLM practitioners.

To improve the use of existing evaluation benchmarks, the authors present several guidelines for both LLM developers and benchmark maintainers. They hope this work can draw attention to the need for better training and evaluation of LLMs. I would add that this is very important, especially for companies deploying LLMs in real-world applications, which would otherwise have difficulty understanding why their implementation does not perform according to the benchmarks published by the vendors of LLM APIs!

AI news of the year 2023, week 51

Hello there! Every Monday, find here some news about AI that attracted our attention (and maybe should attract yours too!). This week, we discovered an evaluation of the medical capabilities of ChatGPT, covered the launch of new open LLMs from Mistral AI, and looked at a new analysis of the open-model trend that shows a great acceleration in the availability of open generative AI.

The Stanford Institute for Human-Centered AI tested the medical capabilities of generative AI (and it is not good…)

AI physicians are not so good…

The Stanford Institute for Human-Centered AI shows, in an article titled “How well do Large Language Models Support Clinician Information Needs?”, that GPT-4 is not robust enough for use as a medical co-pilot. Using a set of 64 questions from a repository of ~150 clinical questions created as part of the Green Button project, the authors prompted ChatGPT and measured the quality of the answers. They found that the answers are:

  • Non-deterministic: low similarity and high variability were found in responses to the same question; Jaccard and cosine similarity coefficients were merely 0.29 and 0.45, respectively.
  • Inaccurate: only 41% of GPT-4 responses agreed with the known answer to medical questions, according to a consensus of 12 physicians.
  • Potentially harmful: 7% of answers were deemed potentially harmful by the consensus physicians.

You can read the article here.

Mixtral 8x7B released on the 11th of December!

The Mistral team revealed its latest Mixtral LLM at NeurIPS. Mixtral 8x7B is an open-weight mixture-of-experts model. Mixtral matches or outperforms Llama 2 70B and GPT-3.5 on most benchmarks and has the inference speed of a 12B dense model. It supports a context length of 32k tokens. Mixtral has an architecture similar to Mistral 7B’s, with the difference that each layer is composed of 8 feedforward blocks. For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Mixtral has been trained on a lot of multilingual data and significantly outperforms Llama 2 70B on French, German, Spanish, and Italian benchmarks.
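To picture what “a router selects two experts” means, here is a toy top-2 mixture-of-experts layer. The sizes are invented and the experts are plain linear layers; in the real Mixtral, the experts are SwiGLU feedforward networks inside a Transformer block.

```python
import torch
import torch.nn as nn

d_model, n_experts, top_k = 64, 8, 2
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
router = nn.Linear(d_model, n_experts)  # scores the 8 experts for each token

x = torch.randn(10, d_model)  # hidden states of 10 tokens

scores = router(x)                            # (10, 8) expert scores
weights, chosen = scores.topk(top_k, dim=-1)  # keep the 2 best per token
weights = weights.softmax(dim=-1)             # normalize their mixing weights

out = torch.zeros_like(x)
for t in range(x.size(0)):      # combine the outputs of the 2 chosen experts
    for k in range(top_k):
        out[t] += weights[t, k] * experts[int(chosen[t, k])](x[t])
```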

Mistral AI is now a major player in the field of generative models. The French artificial intelligence start-up, founded in May 2023 by industry heavyweights, announced on Sunday, December 10, that it had raised €385 million, becoming one of Europe’s two AI champions. The venture, founded by three French AI experts trained at École Polytechnique and ENS, is now valued at some $2 billion. Mistral’s ambition is to become the leading supporter of the open generative AI community and to bring open models to state-of-the-art performance.

The open-source AI community is gaining traction and seeks to rival private models

Open-source generative artificial intelligence (AI) models are gaining ground, challenging the dominance of centralized, cloud-backed models like ChatGPT. Leading players in the generative AI field, such as Google and OpenAI, have traditionally followed a centralized approach, restricting public access to their data sources and training models. Research conducted by Cathie Wood’s ARK Invest suggests a potential shift, with open-source AI models outperforming their centralized counterparts by 2024.

MMLU: discover the generative AI benchmark that would prove Gemini (and others) are reasoning (or maybe not)

Suppose you are interested in chat generative models and do not live on a distant snow-covered mountain without a properly functioning Starlink terminal. In that case, you have probably already heard about Gemini. It is the new underlying set of models powering Bard, the chat generative product of Google, built by DeepMind.

Gemini is claimed to outperform ChatGPT, Bing, and the other Claudes of this world. The time will come when we describe Gemini in detail, but not before digesting for you the 63-page report from Google (anecdote: it is said that Sergey Brin himself spent days and nights coding on it!).

Why and how can Google claim that Gemini is the best?

So before going deep into Gemini, we found it interesting to understand how Google can claim that its generative models are better than its competitors’. To do so, as is very common for any NLP application, Google uses a standardized test framework, and one in particular, called MMLU, an acronym for Massive Multitask Language Understanding!

To write this post we consulted:

Gemini report paper:
“Gemini: A Family of Highly Capable Multimodal Models”, n.d. https://goo.gle/GeminiPaper.

Bard: https://bard.google.com/

MMLU paper:
Hendrycks, Dan, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. “Measuring Massive Multitask Language Understanding”. arXiv, 12 January 2021. https://doi.org/10.48550/arXiv.2009.03300.

MMLU dataset: https://github.com/hendrycks/test

Why is MMLU an important benchmark?

So what is MMLU, and why is it so important for Gemini to perform well on it? First, MMLU is a test framework, or a benchmark: a very important piece of every scientific claim and of any industrial or engineering process or product. To know how well your system performs, you need to find a way to measure it. This paradigm is valid in every industry, from chemicals to rocket engines.

And if you want to evaluate the performance of a system, and by extension compare it with other systems, you need a standard of comparison: a way to measure the performance of the given systems with stability and consistency across different times and conditions (this is the concept of reproducibility). Most often, this kind of measurement has to be automated to cover numerous cases, an exhaustive list that would be too expensive and difficult to work through manually. This is what a test framework like MMLU achieves.

Test frameworks have been the companions of machine learning and artificial intelligence applications since the beginning of computer science (and they have nothing to do with software development QA tests). They are necessary if you want to build experiments, evaluate, monitor, and industrialize AI systems.

In the AI context, a test framework is made of at least three components: first, a standard and clearly defined methodology to build a test; second, a set of data with which to deploy this methodology in real life; and third, a set of metrics to measure the performance of a system on the given data, according to the methodology. Let’s have a look at those three components.

The MMLU methodology

First, the methodology: MMLU’s is a common one. It uses methodology previously defined in many scientific evaluation campaigns (NIST, CoNLL, TREC, ESTER). The MMLU principle for testing chat generative models is as follows: a question is submitted to a generative engine, the engine returns what it considers the correct answer, a script checks whether that answer is the correct one, and a score is calculated.

In MMLU, each base test item is made of the statement of a question, 4 candidate answers to the question (lettered A to D), and the letter of the correct answer. We reproduce an example of this data structure below.
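For illustration, an item of this shape could be represented as follows (the question below is invented; in the GitHub release of MMLU, each task is a CSV whose rows carry the same fields):

```python
# Invented example, for illustration only: the shape of one MMLU item.
item = {
    "question": "Which planet is known as the Red Planet?",
    "choices": {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"},
    "answer": "B",  # the letter of the correct option
}
```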

Why choose this specific methodology? Because of both engineering and scientific constraints.

First, the engineering constraint: to check the result of an automated test, you need a way to code the check. Imagine we decide to test a chat model’s generative capacities with an open answer to a specific question: it is easy to send the question and retrieve the answer, but we then need to analyze this open-text answer to decide whether it is good or bad. And here comes the scientific constraint: we do not yet know how to build the NLP components, analysis engines, and information extraction tools that could analyze open answers to a given question and give a verdict on their validity.

To illustrate this, let’s look at one of the samples from the formal logic test of MMLU:

Select the best translation into predicate logic. All kings are luckier than all paupers. (Kx: x is a king; Px: x is a pauper; Lxy: x is luckier than y)

According to MMLU, the correct answer is formulation A, written exactly as: (∀x)[Kx ⊃ (∀y)(Py ⊃ Lxy)]

But there would be numerous ways to answer this question in formal logic, with many different acceptable words, formulations, orderings, and formalisms. That’s why all the benchmarks in the MMLU family of tests use closed sets of valid answers: to be able to check them automatically.

MMLU benchmarks are question answering tests

And that’s why most of the benchmarking of generative chat models is more a question-answering task than a reasoning test. One way or another, the test framework has to provide a choice of answers: the DROP test framework, for example (another standardized test also used to evaluate Gemini), provides question passages (from Wikipedia) that contain the answers.

This specific way of testing systems does not mean at all that Bard or ChatGPT do not reason, but it means that what is measured by those tests is far from being reasoning capacity. Only humans can analyze and evaluate true reasoning in open answers to questions like this philosophical one (very common in high school philosophy classes):

Q: Is the meaning of life the same for animals and humans?

The particularity of this question is that it does not contain the answer and that many valid answers exist (a philosophy teacher would not consider a plain Yes or No a valid demonstration of reasoning). Those answers would be nourished by the history, the culture, and the beliefs of the entity (human or AI model) that answers.

To be clear, generative models are capable of answering this question (and Bard or ChatGPT are pretty good at it, as shown below), but it is not technically feasible to build an automated and exhaustive test benchmark that would check the quality of the reasoning and the validity of the answer.

MMLU data

To be as exhaustive as possible, the MMLU dataset is made of 57 tasks in total (57 lists of thematic questions to submit to a generative engine). These include practice questions for tests such as the Graduate Record Examination and the United States Medical Licensing Examination. Some tasks cover a subject, like psychology, at a specific level of difficulty, such as “Elementary”, “High School”, “College”, or “Professional”. For example, the “Professional Psychology” task draws on freely available practice questions for the Examination for Professional Practice in Psychology, while the “High School Psychology” task has questions like those of Advanced Placement Psychology examinations.

The questions in the dataset were manually collected by graduate and undergraduate students from freely available sources online (we will see that this is a very important point). 15,908 questions in total were collected for those 57 tasks, split into a development set, a validation set, and a test set. The few-shot development set has 5 questions per subject, the validation set (which may be used for selecting hyper-parameters) is made of 1,540 questions, and the test set has 14,079 questions. Each subject contains at least 100 test examples, which is longer than most exams designed to assess people.

MMLU metrics

The metric used by MMLU – the calculation that defines a measure of system performance – is accuracy (the percentage of correct answers over all the questions asked). To calculate it, the script proceeds as follows: each prompt sent to the generative engine API begins with a line like “The following are multiple choice questions (with answers) about [subject].”. For zero-shot evaluation, the question is appended directly to this prompt. For few-shot evaluation, up to 5 demonstration examples with answers are added before appending the question. All prompts end with “Answer:”.

The model then produces probabilities for the answers “A,” “B,” “C,” and “D,” and MMLU takes the highest-probability option as the prediction. If this prediction is correct it scores 1, and 0 otherwise; the sum of the scores divided by the total number of questions gives the accuracy.
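Putting the two previous paragraphs together, the scoring loop looks roughly like this. It is a hedged sketch: `letter_probs` is a placeholder for whatever API returns per-letter probabilities, not a real library call, and the item fields match the illustrative data structure shown earlier.

```python
def build_prompt(subject, question, choices, shots=()):
    """Assemble an MMLU prompt; `shots` holds up to 5 solved examples."""
    prompt = ("The following are multiple choice questions (with answers) "
              f"about {subject}.\n\n")
    for shot in shots:            # few-shot demonstrations, already formatted
        prompt += shot + "\n\n"
    prompt += question + "\n"
    for letter in "ABCD":
        prompt += f"{letter}. {choices[letter]}\n"
    return prompt + "Answer:"

def mmlu_accuracy(items, letter_probs):
    """Fraction of items where the highest-probability letter is correct."""
    correct = 0
    for item in items:
        prompt = build_prompt(item["subject"], item["question"], item["choices"])
        probs = letter_probs(prompt)              # e.g. {"A": 0.1, "B": 0.7, ...}
        prediction = max(probs, key=probs.get)    # highest-probability letter
        correct += prediction == item["answer"]   # 1 if right, 0 otherwise
    return correct / len(items)
```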

This methodology gives an 87.29% MMLU score to GPT-4. Models like Llama 2 or Mixtral 8x7B, whose results were published on the 10th of December, got slightly lower scores, as illustrated below. A correct reading of this would be that GPT-4 answers correctly 87.29% of the ~14k questions of the MMLU test set, and Mixtral 8x7B, 70%.

How Gemini and its competitors perform on MMLU

So let’s now analyze Google’s claims in its Gemini report. According to the authors, Gemini was tested using a broad range of benchmarks [showing] that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks. The report also claims that [Gemini is] the first model to achieve human-expert performance on the well-studied exam benchmark MMLU.

We can read in the Gemini report that Gemini Ultra is the first model to achieve human-expert performance on MMLU — a prominent benchmark testing knowledge and reasoning via a suite of exams — with a score above 90%. We reproduce the report’s score table below.

What we can ask here is how those metrics should be read and what they mean. In other words, when Gemini answers correctly 90% of the questions of the MMLU test and Mixtral 70%, is Gemini 20 points better at reasoning than Mixtral, as claimed… or is it something else?

As explained previously, MMLU is a question-answering test based on the methodology of multiple-choice questions: this is not very new, and similar test corpora have existed since the beginning of the 2000s (see for example the benchmarks of the TREC tracks). MMLU is more exhaustive, and larger, but not so different.

We have shown that when a system is tested using the MMLU benchmark, you validate its capacity to select, given the statement of a problem, the correct predetermined answer within a restricted set of answers. This is certainly a very difficult information retrieval task (exactly like the TREC tracks), but is it enough to affirm that this is human-like reasoning?

The essential role played by learning data

For example, there would be no reasoning at all if the training data contained the correct answer to a question. That would be – at best – a very specific form of over-fitting and, at worst, a bias. Dr. LeCun dropped a (not so) mysterious post on X (Twitter) about exactly that recently!

Unfortunately, to check this assumption that the MMLU test is biased, we would need to inspect the training corpora of those generative models, but we have very little information about the data used to train Gemini, LLaMA, or GPT. We have had glimpses of it through some published hacks, but not much more. We know that web content is used. Probably some books too, for Gemini and GPT. Wikipedia for all models. But what about the data related to the MMLU tests? We experimented with some MMLU questions, like the one below, cited in the MMLU paper:

We found that this sample, used as part of bar exams, exists in many documents available online. For this very specific example, we found a source on Google Books (suspected to be used to train Gemini) with the answer in the text:

So it is possible, for this specific example, that if the document is part of the corpus used to train the various chat generative models, the proper answer is returned during the MMLU test not because of reasoning but because the model learned it. In this case, MMLU test results are certainly not reasoning results: they are – at best – information retrieval results. One way to prove this would be to know exactly what data were fed to the models!

An important question to solve to build generative models applications

In this post, we explained the true nature of the reference benchmark tests currently used in the literature to evaluate chat generative models. We showed that those tests use a structure and a methodology that do not demonstrate that reasoning can be measured with them. We also showed that the lack of information about the data used to train generative models (even the ‘open’ ones) makes it difficult to check whether MMLU and its likes measure information retrieval or reasoning.

This question of the true nature of MMLU tests and their like is emerging in the recent literature, and we will discuss a paper on this topic next week!