How It Works
1. Load the Model:
Download a pre-trained model in GGUF format (or convert your own from PyTorch or SafeTensors checkpoints). Quantized models in the popular 7B-13B parameter range typically come in at roughly 2-10 GB.
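As a rough sanity check, those sizes follow directly from parameter count times bits per weight. The sketch below is illustrative arithmetic only, assuming typical 4-6 bit quantization schemes:

```python
def quantized_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk size: parameters x bits/weight, ignoring metadata overhead."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# A 7B model at ~4.5 bits/weight (a 4-bit scheme plus per-block scales):
print(quantized_size_gb(7, 4.5))   # 3.9375 (≈4 GB)
# A 13B model at 6 bits/weight:
print(quantized_size_gb(13, 6.0))  # 9.75 (≈10 GB)
```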
The GGUF format packs all of the necessary metadata, tokenizer information, and model weights into a single portable file.
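To make "single portable file" concrete, here is a minimal Python sketch of parsing the fixed GGUF header, which (as of GGUF version 3) consists of a `GGUF` magic, a version, a tensor count, and a metadata key/value count, all little-endian, followed by the metadata and weights:

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header: magic (4 bytes), version (uint32),
    tensor count (uint64), metadata key/value count (uint64), little-endian."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# Synthetic header for illustration (version 3, 291 tensors, 24 metadata keys):
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(header))
```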
2. Optimize the Execution:
llama.cpp automatically detects your hardware, including CPU features and available GPUs, and configures optimal execution paths using SIMD instructions and GPU kernels.
The engine selects the best quantization kernels for your processor, determines how many layers to offload to the GPU if one is available, and configures memory mapping.
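The offload decision can be pictured with a simple heuristic. This is a hypothetical sketch, not llama.cpp's actual logic — in practice you control the split yourself with the `n_gpu_layers` (`-ngl`) option:

```python
def layers_to_offload(vram_gb: float, model_size_gb: float, n_layers: int,
                      reserve_gb: float = 1.0) -> int:
    """Hypothetical heuristic: offload as many whole layers as fit in VRAM,
    keeping a reserve for the KV cache and scratch buffers."""
    per_layer_gb = model_size_gb / n_layers
    budget = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(budget // per_layer_gb))

# 8 GB card, 4 GB model with 32 layers: everything fits on the GPU.
print(layers_to_offload(8.0, 4.0, 32))  # 32
# 4 GB card: only part of the model fits.
print(layers_to_offload(4.0, 4.0, 32))  # 24
```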
3. Run Inference:
Process prompts through the model using quantized weights and optimized attention mechanisms, generating responses in real time. The system maintains a key-value cache for efficient multi-turn conversations, streams tokens as they are generated for a responsive user experience, and applies your chosen sampling parameters to control output quality.
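The role of the key-value cache in multi-turn chat can be sketched with a toy model (illustrative Python, not the real tensor cache): only tokens beyond the already-cached prefix need a new forward pass.

```python
class KVCache:
    """Toy key-value cache: past keys/values are appended once per token,
    so a follow-up prompt only pays for its new tokens."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

def process(tokens: list, cache: KVCache) -> int:
    """Return how many tokens actually needed computation this turn."""
    new = tokens[len(cache):]  # skip the cached prefix
    for t in new:
        cache.append(("k", t), ("v", t))  # stand-ins for real tensors
    return len(new)

cache = KVCache()
print(process([1, 2, 3, 4], cache))        # first turn: 4 tokens computed
print(process([1, 2, 3, 4, 5, 6], cache))  # follow-up: only the 2 new tokens
```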
You can adjust temperature, penalties, and other sampling settings on the fly to tune generation behavior for specific use cases.
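How temperature and a repetition penalty shape the output can be sketched in a few lines. This is a simplified stand-in for the kind of sampler llama.cpp exposes, with illustrative formulas, not its exact implementation:

```python
import math, random

def sample(logits, temperature=0.8, repeat_penalty=1.1, recent=()):
    """Sketch of sampling: penalize recently seen token ids, then apply a
    temperature-scaled softmax and draw a token id from the result."""
    adjusted = list(logits)
    for t in recent:
        # Shrink positive logits, push negative ones further down.
        adjusted[t] = adjusted[t] / repeat_penalty if adjusted[t] > 0 else adjusted[t] * repeat_penalty
    scaled = [l / temperature for l in adjusted]
    m = max(scaled)  # subtract the max for numerical stability
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    return random.choices(range(len(probs)), weights=[p / total for p in probs])[0]

random.seed(0)
token = sample([2.0, 1.0, 0.5, -1.0], temperature=0.7, recent=[0])
print(token)
```

Lower temperatures concentrate probability on the top logit (approaching greedy decoding), while higher values flatten the distribution and make outputs more varied.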
Technologies and Architecture
System Requirements
Minimum:
- C++11 compatible compiler
- 4 GB RAM (for small models)
- Any modern CPU
- Linux, macOS, or Windows

Recommended:
- 16 GB+ RAM
- Modern CPU with AVX2
- NVIDIA/AMD GPU (optional)
- SSD for model storage

Supported platforms:
- Linux (x86, ARM)
- macOS (Intel & Apple Silicon)
- Windows (x86)
- Android, iOS, FreeBSD

GPU acceleration:
- NVIDIA CUDA (compute 6.0+)
- AMD ROCm
- Apple Metal
- Vulkan, OpenCL, SYCL
Core Dependencies
Build essentials:
- C++11 compiler (GCC, Clang, or MSVC)
- Standard C++ library
- No external runtime dependencies

Optional GPU backends:
- CUDA Toolkit (NVIDIA)
- ROCm (AMD)
- Vulkan SDK
- Intel oneAPI (SYCL)

Build tools:
- CMake 3.14+
- Make/Ninja
- Platform SDK
Frequently Asked Questions
Below are questions that users frequently ask about llama.cpp. We hope these answer any outstanding questions you have about running LLM inference with llama.cpp.