Llama.cpp

{{Short description|Software library for LLM inference}} {{Lowercase title}} {{Infobox software | name = llama.cpp | logo = File:Llama1-logo.svg | screenshot = File:Llama-dot-cpp-screenshot.png | author = Georgi Gerganov | developer = Georgi Gerganov and community | released = {{Start date and age|2023|3|10}}<ref name="githubrelease"/> | programming language = C++, C | genre = Library for large language models | license = MIT License<ref name="license"/> | repo = {{URL|github.com/ggml-org/llama.cpp}} }}

'''llama.cpp''' is an open source software library that performs inference on various large language models such as Llama.<ref name="register-llamafile"/> It is co-developed alongside the {{Tooltip|GGML|Georgi Gerganov Machine Learning}} project, a general-purpose tensor library.<ref name="ggml"/>

Command-line tools are included with the library,<ref name="theregister 14 Jul 2024" /> alongside a server with a simple web interface.<ref name="lwn" /><ref name="theregister 15 December 2024" />

llama.cpp has been considered as the de facto standard as the core of almost all local inference tools, including Ollama and LM Studio.<ref>{{Cite web |title=Forensic Implications of Localized AI: Artifact Analysis of Ollama, LM Studio, and llama.cpp |url=https://arxiv.org/html/2603.23996v1 |access-date=2026-05-30 |website=arxiv.org}}</ref>

== Background == Towards the end of September 2022, Georgi Gerganov started work on the GGML library, a C library implementing tensor algebra. Gerganov developed the library with the intention of strict memory management and multi-threading. The creation of GGML was inspired by Fabrice Bellard's work on LibNC.<ref name="changelog-podcast-mar-2023" />

Before llama.cpp, Gerganov worked on a similar library called whisper.cpp which implemented Whisper, a speech to text model by OpenAI.<ref name="whisper"/>

== Development == llama.cpp began development in March 2023 by Georgi Gerganov as an implementation of the Llama inference code in pure C/C++ with no dependencies. This improved performance on computers without GPU or other dedicated hardware, which was a project goal.<ref name="register-llamafile" /><ref name="arstechnica"/><ref name="Wiest" /> llama.cpp gained traction with users lacking specialized hardware, as it could run on just a CPU. While initially designed for CPUs, GPU and NPU backend support was later added.<ref name="Rajput" /> As of May 2026 it has more than 109,000 stars on GitHub.<ref name="llama.cpprepo" />

On April 30, 2024, FlashAttention was introduced.{{citation needed|date=May 2026}}

On April 10, 2025, libmtmd was introduced, which reinvigorated support for multimodal models that had previously been stagnant.{{citation needed|date=May 2026}}

On December 17, 2025, full acceleration on Android and ChromeOS devices was introduced via a new GUI binding<ref>{{Cite web |last=ggml-org |title=llama.cpp/docs/android.md at master · ggml-org/llama.cpp |url=https://github.com/ggml-org/llama.cpp/blob/master/docs/android.md |access-date=2025-12-20 |website=GitHub |language=en}}</ref>, which unlocks native app development beyond the previous approach of cross-compiling and running CLI <ref name="arstechnica" /><ref name="mozilla-introducing-llamafile">{{cite web |last1=Hood |first1=Stephen |title=llamafile: bringing LLMs to the people, and to your own computer |url=https://future.mozilla.org/builders/news_insights/introducing-llamafile/ |access-date=28 July 2024 |website=Mozilla Innovations |language=en}}</ref><ref>{{cite web |title=Democratizing AI with open-source language models |url=https://lwn.net/Articles/931853/ |access-date=28 July 2024 |website=lwn.net}}</ref> in an [https://developer.android.com/tools/adb#shellcommands adb shell].

== Architecture == llama.cpp supports multiple hardware targets, including x86, ARM, Metal, BLAS, BLIS, [https://github.com/IBM/zDNN zDNN], [https://github.com/amd/ZenDNN ZenDNN], SYCL, MUSA, CUDA, HIP, CANN, OpenCL, RPC and Vulkan (version 1.2 or greater).<ref name="Gerganov Slaren Nguyen Introduction to ggml"/><ref name="Kluska" /><ref name="Run LLMs on Intel GPUs Using llama.cpp" /><ref name="Bolz" /> These back-ends make up the GGML tensor library, used by the front-end model-specific llama.cpp code.<ref name="tomshardware"/> llama.cpp makes use of several CPU extensions for optimization:

* AVX, AVX2, AVX-512, AVX-VNNI and AMX for X86-64. * Neon, [https://developer.arm.com/community/arm-community-blogs/b/ai-blog/posts/optimize-llama-cpp-with-arm-i8mm-instruction i8MM], [https://developer.arm.com/documentation/102476/0101/Introducing-SVE SVE], [https://developer.arm.com/documentation/102340/0100/Introducing-SVE2 SVE2], [https://developer.arm.com/documentation/109246/0101/SME-Overview/SME-and-SME2 SME and SME2] for AArch64 (ARM64). * VXE2 (Vector Enhancement Facility 2) for S390x. * Apple silicon is an important target for the project.<ref name="llama.cpprepo" /><ref name="phoronix-llamafile" />

llama.cpp supports a variety of features aimed at inference on edge devices, such as: * Ahead of time model quantization and on-the-fly kv-cache quantization.<ref name="Walkowiak"/> * Speculative decoding.<ref name="theregister 15 December 2024" /> * Partial offloading of model layers to system RAM, allowing devices to load models that would be too large to fit solely in GPU VRAM.

In addition, llama.cpp supports a variety of features and APIs for frontend communication, such as: * OpenAI-compatible endpoints like <code>v1/chat/completions</code>. * Grammar-based output formatting as JSON.<ref name="Wiest" />

== GGUF file format == {{main|GGUF}} {{Infobox file format | name = GGUF | icon = GGML_logo.svg | _noextcode = on | extension = {{code|.gguf}} | magic = {{code|0x47}} {{code|0x47}} {{code|0x55}} {{code|0x46}} | developer = Georgi Gerganov and community | released = {{Start date and age|2023|8|22}}<ref name="githubgguf"/> | latest_release_version = v3<ref name="ggufdoc"/> | type = Machine-learning tensors }}

The GGUF ({{Tooltip|GGML|Georgi Gerganov Machine Learning}} Universal File)<ref name="gguf-py"/> file format is a binary format that stores both tensors and metadata in a single file, and is designed for fast saving, and loading of model data.<ref name="huggingface"/> It was introduced in August 2023 by the llama.cpp project to better maintain backwards compatibility as support was added for other model architectures.<ref name="Rajput" /><ref name="ibm-gguf-vs-ggml" /> It superseded previous formats used by the project such as GGML and is typically produced by converting models developed with a different machine learning library such as PyTorch.<ref name="huggingface"/>

=== Design === GGUF focuses on quantization, the act of reducing precision in the model weights. This can lead to reduced memory usage and increased speed, albeit at the cost of reduced model accuracy.<ref name="towardsdatascience" /><ref name="ibm-gguf-vs-ggml" />

GGUF supports 2-bit to 8-bit quantized integer types,<ref name="Cabezas" /> common floating-point data formats such as float32, float16, and bfloat16, and 1.58 bit quantization.<ref name="theregister 14 Jul 2024" />

GGUF contains information necessary for running a GPT-like language model such as the tokenizer vocabulary, context length, tensor info and other attributes.<ref name="Accelerating GGUF Models with Transformers" />

=== Byte-level structure (little-endian) === {| class="wikitable" ! Bytes !! Description<ref name="gguf.md"/> |- | 4 || GGUF magic number, currently set to <code>0x47 0x47 0x55 0x46</code> |- | 4 || GGUF version, currently set to <code>3</code> |- | 8 || <code>UINT64 tensor_count</code>: number of tensors |- | 8 || <code>UINT64 metadata_kv_count</code>: number of metadata key-value pairs |- | Variable || Metadata block, containing ''metadata_kv_count'' key-value pairs |- | Variable || Tensors info block, containing ''tensor_count'' values |- | Variable || <code>uint8_t tensor_data[]</code>, weight bits block |}

==== Metadata block ==== <syntaxhighlight lang="php"> // example metadata general.architecture: 'llama', general.name: 'LLaMA v2', llama.context_length: 4096, ... , general.file_type: 10, // (typically indicates quantization level, here "MOSTLY_Q2_K") tokenizer.ggml.model: 'llama', tokenizer.ggml.tokens: [ '<unk>', '<s>', '</s>', '<0x00>', '<0x01>', '<0x02>', '<0x03>', '<0x04>', '<0x05>', '<0x06>', '<0x07>', '<0x08>', ... ], ... </syntaxhighlight>

==== Tensors info block ==== <syntaxhighlight lang="c"> // n-th tensor name: GGUF string, // ex: "blk.0.ffn_gate.weight" n_dimensions: UINT32, // ex: 2 dimensions: UINT64[], // ex: [ 4096, 32000 ] type: UINT32, // ex: 10 (typically indicates quantization level, here "GGML_TYPE_Q2_K") offset: UINT64 // starting position within the tensor_data block, relative to the start of the block // (n+1)-th tensor ... </syntaxhighlight>

== Models == Llama.cpp supports many large language models, including Llama, Mistral, Gemma, DeepSeek, gpt-oss, Phi and Qwen.<ref>{{Citation |title=ggml-org/llama.cpp |date=2026-04-19 |url=https://github.com/ggml-org/llama.cpp |access-date=2026-04-19 |publisher=ggml}}</ref>

== See also == * Ollama * SGLang – framework for structured generation and high-performance large language model inference and serving * vLLM – large language model inference and serving engine

== References ==

{{Reflist|refs=

<ref name="githubrelease">{{cite web |title=Initial release · ggerganov/llama.cpp@26c0846 |url=https://github.com/ggerganov/llama.cpp/commit/26c084662903ddaca19bef982831bfb0856e8257 |website=GitHub |access-date=15 May 2024 |language=en}}</ref>

<ref name="license">{{cite web |title=llama.cpp/LICENSE at master · ggerganov/llama.cpp |url=https://github.com/ggerganov/llama.cpp/blob/master/LICENSE |website=GitHub |language=en}}</ref>

<ref name="ggml">{{cite web |last1=Gerganov |first1=Georgi |title=ggerganov/ggml |website=GitHub |url=https://github.com/ggerganov/ggml |date=17 May 2024}}</ref>

<ref name="register-llamafile">{{cite web |last1=Connatser |first1=Matthew |title=How this open source LLM chatbot runner hit the gas on x86, Arm CPUs |url=https://www.theregister.com/2024/04/03/llamafile_performance_gains/ |website=theregister.com |access-date=15 April 2024}}</ref>

<ref name="llama.cpprepo">{{cite web |title=ggerganov/llama.cpp |website=GitHub |url=https://github.com/ggerganov/llama.cpp}}</ref>

<ref name="whisper">{{cite web |title=ggerganov/whisper.cpp |website=GitHub |url=https://github.com/ggerganov/whisper.cpp}}</ref>

<ref name="arstechnica">{{cite web |last1=Edwards |first1=Benj |title=You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi |url=https://arstechnica.com/information-technology/2023/03/you-can-now-run-a-gpt-3-level-ai-model-on-your-laptop-phone-and-raspberry-pi/ |website=arstechnica.com |date=13 March 2023 |access-date=15 April 2024}}</ref>

<ref name="tomshardware">{{cite web |last1=Pounder |first1=Les |title=How To Create Your Own AI Chatbot Server With Raspberry Pi 4 |url=https://www.tomshardware.com/how-to/create-ai-chatbot-server-on-raspberry-pi |website=tomshardware.com |date=25 March 2023 |access-date=16 April 2024}}</ref>

<ref name="Walkowiak">{{cite journal |last1=Walkowiak |first1=Bartosz |last2=Walkowiak |first2=Tomasz |journal=International Journal of Electronics and Telecommunications|date=2024 |volume=70 |issue=1 |pages=153–159 |doi=10.24425/ijet.2024.149525 |url=https://journals.pan.pl/Content/130704/18_4466_Walkowiak_L_sk.pdf |access-date=8 May 2024| title=Implementation of language models within an infrastructure designed for Natural Language Processing}}</ref>

<ref name="huggingface">{{cite web |title=GGUF |url=https://huggingface.co/docs/hub/gguf |website=huggingface.co |access-date=9 May 2024}}</ref>

<ref name="ggufdoc">{{cite web |title=ggml/docs/gguf.md at master · ggerganov/ggml |url=https://github.com/ggerganov/ggml/blob/master/docs/gguf.md |website=GitHub |language=en}}</ref>

<ref name="gguf-py">{{cite web |title=ggerganov/llama.cpp/gguf-py/README.md |url=https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/README.md |website=GitHub |access-date=10 November 2024}}</ref>

<ref name="towardsdatascience">{{cite web |last1=Labonne |first1=Maxime |title=Quantize Llama models with GGUF and llama.cpp |url=https://towardsdatascience.com/quantize-llama-models-with-ggml-and-llama-cpp-3612dfbcc172 |website=Medium |publisher=Towards Data Science |access-date=9 May 2024 |language=en |date=29 November 2023}}</ref>

<ref name="githubgguf">{{cite web |title=GGUF by ggerganov · Pull Request #2398 · ggerganov/llama.cpp |url=https://github.com/ggerganov/llama.cpp/pull/2398 |website=GitHub |language=en}}</ref>

<ref name="Cabezas">{{cite book |last1=Cabezas |first1=Darío |last2=Fonseca-Delgado |first2=Rigoberto |last3=Reyes-Chacón |first3=Iván |last4=Vizcaino-Imacaña |first4=Paulina |last5=Morocho-Cayamcela |first5=Manuel |title=Proceedings of the 19th International Conference on Software Technologies |chapter=Integrating a LLaMa-based Chatbot with Augmented Retrieval Generation as a Complementary Educational Tool for High School and College Students |date=2024 |pages=395–402 |doi=10.5220/0012763000003753|isbn=978-989-758-706-1 }}</ref>

<ref name="Kluska">{{cite journal |last1=Kluska |first1=Piotr |last2=Castell´o |first2=Adri´an |last3=Scheidegger |first3=Florian |last4=I. Malossi |first4=A. Cristiano |last5=Quintana-Ort´ı |first5=Enrique |title=QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers |journal=Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops |date=June 2024 |url=https://openaccess.thecvf.com/content/CVPR2024W/ELVM/papers/Kluska_QAttn_Efficient_GPU_Kernels_for_Mixed-precision_Vision_Transformers_CVPRW_2024_paper.pdf}}</ref>

<ref name="theregister 14 Jul 2024">{{cite web |last1=Mann |first1=Tobias |title=Honey, I shrunk the LLM! A beginner's guide to quantization – and testing it |url=https://www.theregister.com/2024/07/14/quantization_llm_feature/ |website=theregister |date=14 Jul 2024}}</ref>

<ref name="Rajput">{{cite book |last1=Rajput |first1=Saurabhsingh |last2=Sharma |first2=Tushar |chapter=Benchmarking Emerging Deep Learning Quantization Methods for Energy Efficiency |title=2024 IEEE 21st International Conference on Software Architecture Companion (ICSA-C) |date=4 June 2024 |pages=238–242 |doi=10.1109/ICSA-C63560.2024.00049|isbn=979-8-3503-6625-9 }}</ref>

<ref name="ibm-gguf-vs-ggml">{{cite web |last1=Mucci |first1=Tim |title=GGUF versus GGML |url=https://www.ibm.com/think/topics/gguf-versus-ggml |website=www.ibm.com |access-date=26 July 2024 |language=en-us |date=3 July 2024}}</ref>

<ref name="Wiest">{{cite journal |last1=Wiest |first1=Isabella Catharina |last2=Ferber |first2=Dyke |last3=Zhu |first3=Jiefu |last4=van Treeck |first4=Marko |last5=Meyer |first5=Sonja K. |last6=Juglan |first6=Radhika |last7=Carrero |first7=Zunamys I. |last8=Paech |first8=Daniel |last9=Kleesiek |first9=Jens |last10=Ebert |first10=Matthias P. |last11=Truhn |first11=Daniel |last12=Kather |first12=Jakob Nikolas |title=Privacy-preserving large language models for structured medical information retrieval |journal=npj Digital Medicine |date=2024 |volume=7 |issue=257 |page=257 |doi=10.1038/s41746-024-01233-2|pmid=39304709 |pmc=11415382 }}</ref>

<ref name="Accelerating GGUF Models with Transformers">{{cite magazine |last1=Dong |first1=Bo |last2=Lin |first2=Jun |last3=Yu |first3=Zhentao |last4=Xu |first4=Zhenzhong |last5=Luo |first5=Yu |last6=Chang |first6=Hanwen |last7=Shen |first7=Haihao |title=Accelerating GGUF Models with Transformers |journal=The Parallel Universe |date=July 2024 |issue=57 |pages=28–33 |url=https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-gguf-models-with-transformers.html |language=en |publisher=Intel}}</ref>

<ref name="Run LLMs on Intel GPUs Using llama.cpp">{{cite magazine |last1=Jianyu |first1=Zhang |last2=Hengyu |first2=Meng |last3=Ying |first3=Hu |last4=Yu |first4=Luo |last5=Xiaoping |first5=Duan |last6=Corporation |first6=Majumder Abhilash Intel |title=Run LLMs on Intel GPUs Using llama.cpp|journal=The Parallel Universe |date=July 2024 |issue=57 |pages=34–37 |url=https://www.intel.com/content/www/us/en/developer/articles/technical/run-llms-on-gpus-using-llama-cpp.html |publisher=Intel |language=en}}</ref>

<ref name="changelog-podcast-mar-2023">{{cite web |title=Bringing Whisper and LLaMA to the masses with Georgi Gerganov (Changelog Interviews #532) |url=https://changelog.com/podcast/532 |website=Changelog |access-date=28 July 2024 |language=en |date=22 March 2023}}</ref>

<ref name="phoronix-llamafile">{{cite web |last1=Larabel |first1=Michael |title=Llamafile 0.7 Brings AVX-512 Support: 10x Faster Prompt Eval Times For AMD Zen 4 |url=https://www.phoronix.com/news/Llamafile-0.7 |website=www.phoronix.com |language=en}}</ref>

<ref name="lwn">{{cite web |last1=Alden |first1=Daroc |title=Portable LLMs with llamafile [LWN.net] |url=https://lwn.net/Articles/971195/ |website=lwn.net |access-date=30 July 2024}}</ref>

<ref name="Gerganov Slaren Nguyen Introduction to ggml">{{cite web |last1=Gerganov |first1=Georgi |last2=Nguyen |first2=Xuan Son |author3=Slaren |title=Introduction to ggml |url=https://huggingface.co/blog/introduction-to-ggml |website=Huggingface |date=August 13, 2024}}</ref>

<ref name="theregister 15 December 2024">{{cite web |last1=Mann |first1=Tobias |title=Intro to speculative decoding: Cheat codes for faster LLMs |url=https://www.theregister.com/2024/12/15/speculative_decoding/ |website=theregister |language=en |date=15 December 2024}}</ref>

<ref name="Bolz">{{cite web |last1=Bolz |first1=Jeff |title=Machine Learning in Vulkan with Cooperative Matrix 2 |url=https://vulkan.org/user/pages/09.events/vulkanised-2025/T47-Jeff-Bolz-NVIDIA.pdf |publisher=The Khronos Group/Nvidia |location=Cambridge, UK |language=en |date=February 11–13, 2025}}</ref>

<ref name="gguf.md">{{cite web |title=GGUF specification (ggml/docs/gguf.md at master · ggml-org/ggml) |url=https://github.com/ggml-org/ggml/blob/master/docs/gguf.md |language=en}}</ref>

}}

Category:Large language models Category:Open-source artificial intelligence Category:Software using the MIT license Category:Free computer libraries Category:Free software programmed in C++