# MMLU

> Mediated Wiki article. Canonical URL: https://mediated.wiki/source/MMLU
> Markdown URL: https://mediated.wiki/source/MMLU.md
> Source: https://en.wikipedia.org/wiki/MMLU
> Source revision: 1354176701
> License: Creative Commons Attribution-ShareAlike 4.0 International (https://creativecommons.org/licenses/by-sa/4.0/)

Measuring '''Massive Multitask Language Understanding''' ('''MMLU''') is a popular [benchmark](/source/Benchmarks_for_artificial_intelligence) for evaluating the capabilities of [large language models](/source/large_language_models). It has inspired several variants and spin-offs, such as MMLU-Pro,<ref>{{Citation |title=TIGER-AI-Lab/MMLU-Pro |date=2026-05-13 |url=https://github.com/TIGER-AI-Lab/MMLU-Pro |access-date=2026-05-14 |publisher=TIGER Lab}}</ref> MMMLU<ref>{{Cite web |last=Stats |first=L. L. M. |date=2026-05-14 |title=MMMLU Benchmark Leaderboard |url=https://llm-stats.com/benchmarks/mmmlu |access-date=2026-05-14 |website=LLM Stats |language=en}}</ref> and MMLU-Redux.<ref>{{Citation |last=Gema |first=Aryo Pradipta |title=aryopg/mmlu-redux |date=2026-02-07 |url=https://github.com/aryopg/mmlu-redux |access-date=2026-05-14}}</ref>

==Overview==
MMLU consists of 15,908 four-option multiple-choice questions, of which 1,540 are set aside for selecting and assessing model settings such as temperature, [batch size](/source/Batch_processing) and [learning rate](/source/learning_rate). The questions span 57 subjects, ranging from highly complex [STEM](/source/Science%2C_technology%2C_engineering%2C_and_mathematics) fields and international law to nutrition and religion. It became one of the most commonly used [benchmarks](/source/Benchmarks_for_artificial_intelligence) for comparing the capabilities of [large language models](/source/Large_language_model), with over 100 million downloads as of July 2024.<ref name=":1">{{Cite journal |last=Hendrycks |first=Dan |last2=Burns |first2=Collin |last3=Basart |first3=Steven |last4=Zou |first4=Andy |last5=Mazeika |first5=Mantas |last6=Song |first6=Dawn |last7=Steinhardt |first7=Jacob |date=2021 |title=Measuring Massive Multitask Language Understanding |journal=ICLR |arxiv=2009.03300}}</ref><ref>{{Cite web |date=2024-07-08 |title=cais/mmlu |url=https://huggingface.co/datasets/cais/mmlu |access-date=2024-07-24 |website=Hugging Face}}</ref>

The benchmark was released by [Dan Hendrycks](/source/Dan_Hendrycks) and a team of researchers on 7 September 2020. It was designed to be more challenging than the benchmarks existing at the time, such as [General Language Understanding Evaluation](/source/General_Language_Understanding_Evaluation) (GLUE), because models had begun outperforming humans on easier tests. When MMLU was released, most existing language models scored near the level of random chance (25%). The best-performing model, [GPT-3](/source/GPT-3) 175B, achieved 43.9% accuracy. The creators of MMLU estimated that human domain experts achieve around 89.8% accuracy.<ref name=":1" /> By mid-2024, leading language models such as [Claude 3.5 Sonnet](/source/Claude_3.5_Sonnet), [GPT-4o](/source/GPT-4o) and [Llama 3.1](/source/Llama_3.1) 405B were scoring around 88%.<ref>{{Cite web |title=Introducing Claude 3.5 Sonnet |url=https://www.anthropic.com/news/claude-3-5-sonnet |access-date=2025-04-06 |website=Anthropic |language=en}}</ref><ref>{{Cite web |date=2024-05-13 |title=Hello GPT-4o |url=https://openai.com/index/hello-gpt-4o/ |access-date=2025-04-06 |website=OpenAI |language=en-US}}</ref><ref>{{Cite web |date=2024-07-23 |title=Introducing Llama 3.1: Our most capable models to date |url=https://ai.meta.com/blog/meta-llama-3-1/ |access-date=2025-04-06 |website=Meta blog}}</ref> As of 2025, MMLU has been partially phased out in favor of more difficult alternatives.
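
The 25% chance level follows from each question offering four answer options. The sketch below illustrates how accuracy on such a benchmark could be scored and why uniform random guessing lands near 25%. It is an illustration only, not the evaluation code used by the benchmark's authors: the function names, the letter-based answer format and the randomly generated answer key are all assumptions made for this example.

```python
import random

LETTERS = "ABCD"  # assumed encoding of the four answer options


def accuracy(predictions, answers):
    """Fraction of questions where the predicted letter matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("no questions to score")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def random_baseline(answers, seed=0):
    """Accuracy of a guesser that picks one of the four letters uniformly."""
    rng = random.Random(seed)
    guesses = [rng.choice(LETTERS) for _ in answers]
    return accuracy(guesses, answers)


# Hypothetical answer key (NOT real MMLU data), used only to show the chance level.
key_rng = random.Random(1)
answer_key = [key_rng.choice(LETTERS) for _ in range(20000)]

print(f"random guessing: {random_baseline(answer_key):.3f}")
```

With a large enough answer key, the printed value should be close to 0.25, which is the floor against which the reported model scores are compared.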

== Limitations ==
On 5 June 2024, researchers released a paper detailing their manual analysis of 5,700 questions in the benchmark, which revealed a significant number of ground-truth errors. For example, 57% of questions in the "[Virology](/source/Virology)" subset were marked as containing errors, such as multiple correct answers (4%), unclear questions (14%), or completely incorrect answers (33%). Overall, they estimated that 6.5% of questions in MMLU contained an error, suggesting that the maximum attainable score was significantly below 100%.<ref>{{cite arXiv |eprint=2406.04127 |class=cs.CL |first1=Aryo Pradipta |last1=Gema |first2=Joshua Ong Jun |last2=Leang |title=Are We Done with MMLU? |date=2024-06-07 |last3=Hong |first3=Giwon |last4=Devoto |first4=Alessio |last5=Mancino |first5=Alberto Carlo Maria |last6=Saxena |first6=Rohit |last7=He |first7=Xuanli |last8=Zhao |first8=Yu |last9=Du |first9=Xiaotang |last10=Madani |first10=Mohammad Reza Ghasemi |last11=Barale |first11=Claire |last12=McHardy |first12=Robert |last13=Harris |first13=Joshua |last14=Kaddour |first14=Jean |last15=Krieken |first15=Emile van |last16=Minervini |first16=Pasquale}}</ref> Data contamination also poses a significant threat to the benchmark's validity: companies could easily include its questions and answers in their models' training data, which would make the resulting scores meaningless as a measure of capability.<ref>{{Cite news |last=Roose |first=Kevin |date=2024-04-15 |title=A.I. Has a Measurement Problem |url=https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html |access-date=2024-04-21 |work=The New York Times |language=en-US |issn=0362-4331}}</ref>
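
The effect of the estimated 6.5% error rate on the attainable score can be illustrated with simple arithmetic. The sketch below is not taken from the cited paper. The function `score_ceiling` and its `credit_on_flawed` parameter are hypothetical, and the calculation assumes a model that answers every error-free question correctly and matches the answer key on a chosen fraction of the flawed questions.

```python
def score_ceiling(error_rate, credit_on_flawed=0.0):
    """Best score under the stated assumptions.

    error_rate:        fraction of questions containing an error (0..1)
    credit_on_flawed:  assumed fraction of flawed questions on which the
                       model still happens to match the answer key (0..1)
    """
    for name, value in (("error_rate", error_rate),
                        ("credit_on_flawed", credit_on_flawed)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return (1.0 - error_rate) + error_rate * credit_on_flawed


# Pessimistic assumption: every flawed question is scored as wrong.
print(f"{score_ceiling(0.065):.3f}")

# Assumption: the model matches the key on a quarter of flawed questions.
print(f"{score_ceiling(0.065, credit_on_flawed=0.25):.3f}")
```

Under the pessimistic assumption the ceiling is 93.5%. The true ceiling depends on how each kind of error interacts with scoring, for instance whether a question with multiple correct answers still happens to be marked as correct.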

== Examples ==
The following examples are sourced from the "[Abstract Algebra](/source/Abstract_algebra)", "[International Law](/source/International_law)" and "Professional [Medicine](/source/Medicine)" tasks, respectively.<ref name=":1" /> The correct answers are marked in boldface:

'''Question 1:''' 

Find all <math>c</math> in <math>\mathbb{Z}_3</math> such that <math>\mathbb{Z}_3[x]/(x^2 + c)</math> is a field.

(A) 0 │ '''(B) 1''' │ (C) 2 │ (D) 3
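
The marked answer to Question 1 can be checked by brute force. The quotient ring is a field exactly when <math>x^2 + c</math> is irreducible over <math>\mathbb{Z}_3</math>, and a quadratic polynomial over a field is irreducible exactly when it has no root in that field. The following sketch, in which the function name `field_constants` is chosen for this illustration, tests every candidate value of <math>c</math>; it is valid only for a prime modulus.

```python
def field_constants(p=3):
    """Return every c in Z_p for which x^2 + c has no root in Z_p.

    For prime p, these are the values of c for which Z_p[x]/(x^2 + c)
    is a field, because a quadratic without roots is irreducible.
    """
    return [
        c
        for c in range(p)
        if all((x * x + c) % p != 0 for x in range(p))
    ]


print(field_constants(3))  # [1]
```

Only <math>c = 1</math> survives: <math>x^2</math> has the root 0 and <math>x^2 + 2</math> has the root 1, which agrees with option (B).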

'''Question 2:'''

Would a reservation to the definition of torture in the [International Covenant on Civil and Political Rights](/source/International_Covenant_on_Civil_and_Political_Rights) (ICCPR) be acceptable in contemporary practice?

(A) This is an acceptable reservation if the reserving country’s legislation employs a different definition.<br>'''(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR.'''<br>(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law.<br>(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties.

'''Question 3:'''

A 33-year-old man undergoes a radical thyroidectomy for thyroid cancer. During the operation, moderate hemorrhaging requires ligation of several vessels in the left side of the neck. Postoperatively, serum studies show a calcium concentration of 7.5 mg/dL, albumin concentration of 4 g/dL, and parathyroid hormone concentration of 200 pg/mL. Damage to which of the following vessels caused the findings in this patient?

(A) Branch of the costocervical trunk.<br>(B) Branch of the external carotid artery.<br>'''(C) Branch of the thyrocervical trunk.'''<br>(D) Tributary of the internal jugular vein.

==References==
{{reflist}}


---
Adapted from the Wikipedia article [MMLU](https://en.wikipedia.org/wiki/MMLU) by Wikipedia contributors ([contributor history](https://en.wikipedia.org/wiki/MMLU?action=history)). Available under [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/). Changes may have been made.
