llama.cpp

llama.cpp
Original author(s): Georgi Gerganov
Developer(s): Georgi Gerganov and community
Initial release: March 10, 2023[1]
Repository: github.com/ggerganov/llama.cpp
Written in: C++
License: MIT License[2]

llama.cpp is an open-source software library, written in C++, that performs inference on various large language models, such as Llama.[3] It is co-developed alongside the ggml library, a general-purpose tensor library.[4]

History

Georgi Gerganov began developing llama.cpp to implement Llama in pure C++ with no dependencies. The advantage of this approach is that it can run on more hardware than inference libraries that depend on closed-source, hardware-specific libraries such as CUDA.[3] As of May 2024, the repository has about 55,000 stars on GitHub.[5] Before llama.cpp, Gerganov worked on a similar library called whisper.cpp,[6] which implements Whisper, a speech-to-text model by OpenAI. llama.cpp gained traction from users without specialized hardware, as it can run on a CPU alone, including on Android devices.[7]

Architecture

llama.cpp initially ran only on CPUs, but it can now run on GPUs through multiple back-ends, including Vulkan and SYCL. These back-ends make up the ggml tensor library, which is used by the front-end, model-specific llama.cpp code.[8] llama.cpp has its own model file format, called GGUF (previously referred to as the GGML format).[9] llama.cpp supports ahead-of-time model quantization, as opposed to on-the-fly quantization.[10]
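The following program is a minimal sketch of driving the library through its C API: it initializes the back-ends, loads a GGUF model, and creates an inference context. Function names follow the llama.h header circa 2024; the API has changed between versions, so this is an illustration rather than a definitive usage pattern.

    #include "llama.h"
    #include <cstdio>

    int main(int argc, char ** argv) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
            return 1;
        }
        llama_backend_init();  // initialize the ggml back-ends

        llama_model_params mparams = llama_model_default_params();
        mparams.n_gpu_layers = 0;  // 0 = CPU only; > 0 offloads layers to a GPU back-end

        llama_model * model = llama_load_model_from_file(argv[1], mparams);
        if (model == NULL) {
            fprintf(stderr, "failed to load model '%s'\n", argv[1]);
            return 1;
        }

        llama_context * ctx = llama_new_context_with_model(model, llama_context_default_params());
        // ... tokenize a prompt and run llama_decode() here ...

        llama_free(ctx);
        llama_free_model(model);
        llama_backend_free();
        return 0;
    }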

Supported models

GGUF file format

GGUF
Filename extension: .gguf
Magic number: 0x47 0x47 0x55 0x46 ("GGUF")
Developed by: Georgi Gerganov and community
Initial release: August 22, 2023[11]
Type of format: Machine learning

The GGUF file format is a binary format that stores both tensors and metadata.[12] GGUF files are typically created by converting a model developed with another machine-learning library, such as PyTorch. The format is intended to make model files easy and fast to load in llama.cpp and other ggml projects.[13]
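Per the GGUF specification, every file begins with a fixed header: the four magic bytes, a format version, a tensor count, and a metadata key/value count, with all multi-byte fields little-endian.[13] The sketch below reads these fields; it assumes specification version 2 or later (where the two counts are 64-bit integers) and a little-endian host.

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main(int argc, char ** argv) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
            return 1;
        }
        FILE * f = fopen(argv[1], "rb");
        if (f == NULL) {
            perror("fopen");
            return 1;
        }

        char     magic[4];       // the bytes 'G' 'G' 'U' 'F' (0x47 0x47 0x55 0x46)
        uint32_t version   = 0;  // format version (3 at the time of writing)
        uint64_t n_tensors = 0;  // number of tensors stored in the file
        uint64_t n_kv      = 0;  // number of metadata key/value pairs

        // all multi-byte header fields are little-endian on disk
        if (fread(magic, 1, sizeof magic, f) != sizeof magic ||
            fread(&version,   sizeof version,   1, f) != 1   ||
            fread(&n_tensors, sizeof n_tensors, 1, f) != 1   ||
            fread(&n_kv,      sizeof n_kv,      1, f) != 1) {
            fprintf(stderr, "truncated header\n");
            fclose(f);
            return 1;
        }
        fclose(f);

        if (memcmp(magic, "GGUF", 4) != 0) {
            fprintf(stderr, "not a GGUF file\n");
            return 1;
        }
        printf("GGUF v%u: %llu tensors, %llu metadata pairs\n",
               version, (unsigned long long) n_tensors, (unsigned long long) n_kv);
        return 0;
    }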

GGUF was created to replace previous file formats used by the project, which did not include architecture metadata and therefore made it difficult to extend the software without breaking backwards compatibility.[13]

The format focuses on supporting different quantization types, which can reduce memory usage and increase speed, at the expense of lower model precision.[14]
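As a rough illustration of the memory savings: a model with 7 billion parameters stored as 16-bit floats occupies about 14 GB, while the same weights quantized at an effective 4.5 bits per weight (the cost of llama.cpp's 4-bit Q4_0 type once its per-block scale factors are counted) occupy about 4 GB.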

Supported data types

GGUF supports the common floating-point formats float32 and float16, as well as bfloat16.

GGUF also supports various quantized integer types; a simplified sketch of how such block quantization works follows the list:

  • 1.5-bit
  • 2-bit
  • 3-bit
  • 4-bit
  • 5-bit
  • 6-bit
  • 8-bit
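These low bit widths are achieved with block-based quantization: weights are grouped into small blocks, and each block stores a scale factor alongside its low-bit integers. The sketch below is a simplified 4-bit scheme in the spirit of llama.cpp's Q4_0 type; the actual ggml kernels use 32-element blocks with a float16 scale and differ in detail, so this illustrates the idea rather than the real implementation.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    constexpr int BLOCK = 32;  // weights per quantization block

    struct Block4 {
        float   scale;          // the real Q4_0 stores this as float16
        uint8_t qs[BLOCK / 2];  // 32 signed 4-bit values, two packed per byte
    };

    Block4 quantize_block(const float * x) {
        float amax = 0.0f;  // largest magnitude in the block
        for (int i = 0; i < BLOCK; i++) {
            amax = std::fmax(amax, std::fabs(x[i]));
        }
        Block4 b;
        b.scale = amax / 7.0f;  // map [-amax, amax] onto integers [-7, 7]
        const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
        for (int i = 0; i < BLOCK; i += 2) {
            // round to the nearest integer, clamp, then bias into [0, 15]
            int q0 = std::clamp((int) std::lround(x[i]     * inv), -8, 7) + 8;
            int q1 = std::clamp((int) std::lround(x[i + 1] * inv), -8, 7) + 8;
            b.qs[i / 2] = (uint8_t) (q0 | (q1 << 4));
        }
        return b;
    }

    float dequantize(const Block4 & b, int i) {
        const uint8_t byte = b.qs[i / 2];
        const int q = (i % 2 == 0 ? (byte & 0x0F) : (byte >> 4)) - 8;  // undo bias
        return b.scale * (float) q;  // approximate original weight
    }

    int main() {
        float x[BLOCK];
        for (int i = 0; i < BLOCK; i++) x[i] = std::sin((float) i);  // example weights
        Block4 b = quantize_block(x);
        printf("x[5] = %+.4f  reconstructed = %+.4f\n", x[5], dequantize(b, 5));
        return 0;
    }

Storing one scale per block is what pushes the effective cost above the raw bit width: in the real Q4_0 layout, a two-byte float16 scale accompanies every 16 bytes of packed nibbles, so each 32-weight block costs 18 bytes, or 4.5 bits per weight.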

References

  1. ^ "Initial release · ggerganov/llama.cpp@26c0846". GitHub. Retrieved 15 May 2024.
  2. ^ "llama.cpp/LICENSE at master · ggerganov/llama.cpp". GitHub.
  3. ^ a b Connatser, Matthew. "How this open source LLM chatbot runner hit the gas on x86, Arm CPUs". theregister.com. Retrieved 15 April 2024.
  4. ^ Gerganov, Georgi (17 May 2024). "ggerganov/ggml". GitHub.
  5. ^ "ggerganov/llama.cpp". GitHub.
  6. ^ "ggerganov/whisper.cpp". GitHub.
  7. ^ Edwards, Benj (13 March 2023). "You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi". arstechnica.com. Retrieved 15 April 2024.
  8. ^ "GGML - AI at the edge". ggml.ai. Retrieved 16 April 2024.
  9. ^ Pounder, Les (25 March 2023). "How To Create Your Own AI Chatbot Server With Raspberry Pi 4". tomshardware.com. Retrieved 16 April 2024.
  10. ^ Walkowiak, Bartosz; Walkowiak, Tomasz (2024). "Implementation of language models within an infrastructure designed for Natural Language Processing" (PDF). International Journal of Electronics and Telecommunications. 70 (1): 153–159. doi:10.24425/ijet.2024.149525. Retrieved 8 May 2024.
  11. ^ "GGUF by ggerganov · Pull Request #2398 · ggerganov/llama.cpp". GitHub.
  12. ^ "GGUF". huggingface.co. Retrieved 9 May 2024.
  13. ^ a b "ggml/docs/gguf.md at master · ggerganov/ggml". GitHub.
  14. ^ Labonne, Maxime (29 November 2023). "Quantize Llama models with GGUF and llama.cpp". Medium. Towards Data Science. Retrieved 9 May 2024.