
Llama.cpp – Run LLM Inference in C/C++


Llama.cpp (LLaMA C++) lets you run efficient large language model inference in pure C/C++. It supports a wide range of powerful models, including all LLaMA models, Falcon and RefinedWeb, Mistral models, Gemma from Google, Phi, Qwen, Yi, Solar 10.7B, and Alpaca.

You do not need to pay for Llama.cpp or buy a subscription. It is completely free, open-source, constantly updated, and available under the MIT license.

Purely C/C++ Based

Zero External Dependencies

Cross-Platform Support

About Llama.cpp

Author and Development

Llama.cpp was created by Georgi Gerganov (@ggerganov), a software engineer based in Bulgaria. Georgi developed llama.cpp shortly after Meta released its LLaMA models so that users could run them on everyday consumer hardware, without expensive GPUs or cloud infrastructure.

It has become one of the most influential and impactful open-source AI projects on GitHub. Georgi’s focus on extreme optimization, minimal dependencies, and usability resonated with developers around the world. He also created the ggml tensor library, which powers llama.cpp as well as other machine learning projects. His work on quantization techniques, specifically the k-quants system, has been groundbreaking in making large language models accessible to everyone.

The project has grown into a massive success, with a welcoming community and many contributors. Visit Georgi’s GitHub profile to explore his other projects, including whisper.cpp (speech-to-text) and other innovative tools.

This website is an unofficial website built for informational purposes only.

Features

Built in C/C++

Built entirely in pure C/C++ with no external dependencies: no Python runtime, no complex dependency chains, and no version conflicts over time. The entire codebase compiles to a single binary that you can run almost anywhere, from high-end servers to a Raspberry Pi.

Hardware Acceleration

Llama.cpp supports hardware acceleration on all major platforms. On Apple’s M1/M2/M3/M4 chips it leverages Metal for GPU compute on the unified memory architecture. AMD and Intel CPUs benefit from optimized AVX, AVX2, and AVX-512 SIMD instructions. NVIDIA GPUs use CUDA, with support for tensor cores. AMD GPUs work with ROCm via optimized kernels.

Quantization Support

Llama.cpp includes top-notch quantization capabilities at precisions from 2-bit to 8-bit. The k-quants system (Q4_K_M, Q5_K_S, Q6_K, and so on) uses block-wise quantization, which preserves model quality while dramatically reducing the memory footprint. For example, a 7B-parameter model that typically requires about 14 GB in 16-bit precision can run in roughly 4 GB with 4-bit quantization.
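The memory figures above follow from simple arithmetic: bytes of weight storage ≈ parameters × bits per weight ÷ 8. A minimal sketch (the numbers are illustrative; real quantized files are slightly larger because k-quants also store per-block scale factors):

```python
# Back-of-envelope memory estimate for quantized model weights.
# Real GGUF files are a bit larger: k-quants store per-block scales too.

def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9  # a "7B" model

fp16 = weight_memory_gb(n_params, 16)  # 16-bit weights
q4 = weight_memory_gb(n_params, 4)     # 4-bit weights, before block scales

print(f"FP16: {fp16:.1f} GB, 4-bit: {q4:.1f} GB")
```

This is why 4-bit quantization brings a 14 GB model down to the 4 GB range once per-block metadata is included.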

OpenAI Compatible API

It comes with a built-in HTTP server that implements OpenAI’s API specification, making Llama.cpp a worthy drop-in replacement for OpenAI API calls. Supported endpoints include /v1/completions, /v1/chat/completions, and /v1/embeddings, using the same request and response formats.

Use Multiple Interfaces

Enjoy access via multiple interfaces so you can adapt to various types of workloads. The CLI provides direct model interaction with full control over the parameters. The interactive chat mode offers a conversational experience with persistent context and multi-turn dialogue. The built-in HTTP/REST server allows integration with any programming language or tool.

Multi-Model Architecture Support

Experience comprehensive model architecture support covering the entire landscape of available LLMs. New architectures are constantly added as they are released, allowing you to experiment with different models without changing your underlying infrastructure. You can also compare performance between different model architectures.

Privacy Focused

You have complete control over data sovereignty thanks to local execution. All tokens are processed on your own hardware; no data is sent to any external server. This helps privacy- and security-focused users who need to process confidential business documents, personal information, medical records, or legal documents.

Memory Friendly

Llama.cpp uses advanced memory optimization techniques that let you run larger models on older, lower-spec hardware. Memory mapping loads models directly from disk without copying them into RAM up front, so pages are read only as they are needed. KV cache quantization applies 8-bit or lower precision to the key-value cache, cutting its memory usage during generation, often by half or more.
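The KV cache savings are easy to estimate: the cache holds a key and a value vector per layer per position, so its size is 2 × layers × KV heads × head dimension × context length × bytes per element. A sketch with illustrative 7B-class numbers (check your model’s metadata for the real values):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int) -> int:
    """Keys + values for every layer across the whole context window."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical 7B-class settings: 32 layers, 32 KV heads, head dim 128, 4K context.
fp16 = kv_cache_bytes(32, 32, 128, 4096, 2)  # 16-bit cache
q8 = kv_cache_bytes(32, 32, 128, 4096, 1)    # 8-bit cache: exactly half

print(f"FP16 cache: {fp16 / 2**30:.2f} GiB, 8-bit cache: {q8 / 2**30:.2f} GiB")
```

Halving bytes per element halves the cache, which is where the "up to 50%" figure for 8-bit KV cache quantization comes from.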

Advanced Sampling

It offers sophisticated text-generation controls that let you fine-tune output quality. Temperature controls randomness (around 0.1 for focused output, 1.0 for creative output). Top-p (nucleus) sampling dynamically adjusts the candidate token pool based on cumulative probability mass. Top-k limits selection to the k most likely tokens. Repeat penalty discourages repetitive text by penalizing recently used tokens.
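To make the interplay concrete, here is a toy sampler implementing the same pipeline: temperature scaling, then top-k truncation, then nucleus (top-p) cutoff, then a random draw. It is a teaching sketch, not llama.cpp’s actual implementation:

```python
import math
import random

def sample(logits: dict, temperature: float = 1.0, top_k: int = 0,
           top_p: float = 1.0, rng=random):
    """Toy sampler: temperature scaling, then top-k, then top-p (nucleus)."""
    # Sort candidates by logit, most likely first.
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        items = items[:top_k]  # keep only the k most likely tokens
    # Softmax with temperature over the surviving candidates.
    m = max(v for _, v in items)
    exps = [(t, math.exp((v - m) / temperature)) for t, v in items]
    total = sum(e for _, e in exps)
    probs = [(t, e / total) for t, e in exps]
    # Nucleus cutoff: smallest prefix whose probability mass reaches top_p.
    kept, mass = [], 0.0
    for t, p in probs:
        kept.append((t, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize over the kept tokens and draw one.
    z = sum(p for _, p in kept)
    r, acc = rng.random() * z, 0.0
    for t, p in kept:
        acc += p
        if acc >= r:
            return t
    return kept[-1][0]
```

With `top_k=1` this reduces to greedy decoding; lowering `temperature` sharpens the softmax so the top token dominates, and lowering `top_p` shrinks the nucleus the same way.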

Why choose Llama.cpp?

Open-Source and Free

MIT-licensed and free to use, modify, and distribute, making it an ideal choice. It also has an active community of thousands of contributors and is updated constantly.

You are in Control

You get fine-grained control over your inference parameters, and you can adjust memory usage, speed, and quality to match your requirements exactly.

Top Performance

Highly optimized inference code with assembly-level optimizations. This achieves optimal performance on CPU and GPU hardware with minimal overhead.

How It Works

1. Load the LLM Models:

Download any pre-trained model in the GGUF format (or convert your own from PyTorch or SafeTensors formats). Quantized LLM models typically range from 2 to 10 GB for the common 7B-13B parameter sizes.

The GGUF format bundles all of the necessary metadata, tokenizer information, and model weights into a single portable file.
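A GGUF file starts with a small fixed header: the 4-byte magic "GGUF", then (little-endian) a uint32 format version and two uint64 counts for tensors and metadata key/value pairs, followed by the metadata itself. A minimal sketch of parsing that header, exercised against a fake in-memory header rather than a real model file (see the GGUF specification in the ggml repository for the authoritative layout):

```python
import struct

GGUF_MAGIC = b"GGUF"

def parse_gguf_header(blob: bytes) -> dict:
    """Read the fixed-size prefix of a GGUF file: magic, version, counts."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", blob, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Build a fake header in memory just to exercise the parser.
fake = struct.pack("<4sIQQ", GGUF_MAGIC, 3, 291, 24)
print(parse_gguf_header(fake))
```

In practice you would read the first 24 bytes of the `.gguf` file and pass them in; the metadata section that follows carries the tokenizer and architecture details mentioned above.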

2. Optimize the Execution:

llama.cpp automatically detects your hardware, including CPU features and available GPU(s), and configures optimal execution paths using SIMD instructions and GPU kernels.

The engine automatically selects the best quantization kernels for your processor, determines how many layers to offload to the GPU if one is available, and configures memory mapping.
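The layer-offload decision is essentially a packing problem: how many transformer layers fit in VRAM after reserving room for the KV cache and compute buffers. A simplified stand-in for that heuristic (the numbers and the 512 MiB reserve are hypothetical, not llama.cpp’s actual values):

```python
def layers_to_offload(vram_bytes: int, n_layers: int, layer_bytes: int,
                      reserve_bytes: int = 512 * 2**20) -> int:
    """How many transformer layers fit in VRAM, keeping a reserve
    for the KV cache and compute buffers."""
    usable = max(0, vram_bytes - reserve_bytes)
    return min(n_layers, usable // layer_bytes)

# Hypothetical numbers: an 8 GiB card and 32 layers of ~110 MiB each
# (roughly a 4-bit 7B model) -- everything fits on the GPU.
print(layers_to_offload(8 * 2**30, 32, 110 * 2**20))
```

When the whole model fits, all layers are offloaded; otherwise the remaining layers run on the CPU, which is the hybrid mode llama.cpp falls back to on smaller GPUs.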

3. Run your Inference:

Prompts are processed through the model using quantized weights and optimized attention mechanisms, generating responses in real time. The system maintains a key-value cache for efficient multi-turn conversations, streams tokens as they are generated for a responsive user experience, and applies your chosen sampling parameters to control output quality.
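The streaming behavior boils down to an autoregressive loop that yields each token as soon as it is sampled, stopping at an end-of-sequence token or a length limit. A toy sketch with a stand-in model function (not llama.cpp’s real inference loop):

```python
def generate_stream(step, prompt_tokens, max_new: int = 8, stop_token: str = "<eos>"):
    """Toy autoregressive loop: `step` maps the full context to the next
    token; tokens are yielded one by one, as a streaming server emits them."""
    ctx = list(prompt_tokens)
    for _ in range(max_new):
        tok = step(ctx)
        if tok == stop_token:
            break
        ctx.append(tok)
        yield tok

# Dummy "model": emits placeholder tokens until the context reaches length 5.
def dummy_step(ctx):
    return "<eos>" if len(ctx) >= 5 else f"t{len(ctx)}"

print(list(generate_stream(dummy_step, ["a", "b"])))
```

In the real engine, `step` is a forward pass that reuses the KV cache, so each new token only costs one incremental evaluation rather than reprocessing the whole context.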

You can adjust temperature, penalties, and other settings on the fly to tune generation behavior for specific use cases.

Technologies and Architecture

ggml Tensor Library

The core computational engine is built on ggml, a custom tensor library written in C that provides efficient operations for ML inference with minimal dependencies.

SIMD Optimization

Hand-tuned SIMD code using the AVX, AVX2, AVX-512, and NEON instruction sets delivers maximum CPU throughput on matrix operations and attention mechanisms.

GPU Compute APIs

llama.cpp integrates natively with CUDA, ROCm from AMD, Vulkan, OpenCL, and SYCL for accelerated inference.

KV Cache Management

Sophisticated key-value cache management, with quantization support, enables efficient memory usage during long-context generation runs and conversations.

Build Systems

llama.cpp supports CMake, Make, and various platform-specific build tools, enabling easy compilation across Linux, macOS, Android, and Windows 10/11.

System Requirements

Minimum:

  • C++11 compatible compiler
  • 4 GB RAM (for small LLM models)
  • Any modern CPU
  • Linux, macOS, or Windows

Recommended:

  • 16 GB+ RAM
  • Modern CPU with AVX2
  • NVIDIA/AMD GPU (optional)
  • SSD for model storage

Supported Platforms:

  • Linux (x86, ARM)
  • macOS (Intel & Apple Silicon)
  • Windows (x86)
  • Android, iOS, FreeBSD

GPU Support:

  • NVIDIA CUDA (compute 6.0+)
  • AMD ROCm
  • Apple Metal
  • Vulkan, OpenCL, SYCL

Core Dependencies

Required:

  • C++11 compiler (GCC, Clang, or MSVC)
  • Standard C++ library
  • No external runtime dependencies

Optional (GPU acceleration):

  • CUDA Toolkit from NVIDIA
  • ROCm from AMD
  • Vulkan SDK
  • Intel oneAPI (SYCL)

Build Tools:

  • CMake 3.14+
  • Make/Ninja
  • Platform SDK

Frequently Asked Questions

Below are questions that users frequently ask about llama.cpp. We hope these answer any outstanding questions you have about running LLM inference with llama.cpp.