
Llama.cpp – Run LLM Inference in C/C++


Llama.cpp (LLaMA C++) lets you run efficient large language model inference in pure C/C++. It supports a wide range of powerful models, including all LLaMA models, Falcon and RefinedWeb, Mistral models, Gemma from Google, Phi, Qwen, Yi, Solar 10.7B, and Alpaca.

You do not need to pay for Llama.cpp or buy a subscription. It is completely free, open-source, constantly updated, and available under the MIT license.

Purely C/C++ Based

Zero External Dependencies

Cross-Platform Support

About Llama.cpp

Author and Development

Llama.cpp was created by Georgi Gerganov (@ggerganov), a software engineer based in Bulgaria. Georgi developed llama.cpp shortly after Meta released its LLaMA models so that users could run them on everyday consumer hardware, without expensive GPUs or cloud infrastructure.

It has become one of the most influential and impactful open-source AI projects on GitHub. Georgi’s focus on extreme optimization, minimal dependencies, and usability resonated with developers around the world. He also created the ggml tensor library, which powers llama.cpp as well as other machine learning projects. His work on quantization techniques, specifically the k-quants system, has been groundbreaking in making large language models accessible to everyone.

The project has grown into a massive success, with a welcoming community and many contributors. Visit Georgi’s GitHub profile to explore his other projects, including whisper.cpp (speech-to-text) and other innovative tools.

This website is an unofficial website built for informational purposes only.

Features

Built in C/C++

Built entirely in pure C/C++ with no external dependencies: no Python runtime, no complex dependency chains, and no version conflicts over time. The entire codebase compiles to a single binary that you can run almost anywhere, from high-end servers to a Raspberry Pi.

Hardware Acceleration

Llama.cpp supports hardware acceleration on all major platforms. On Apple’s M1/M2/M3/M4 chips it leverages Metal for GPU compute on the unified memory architecture. AMD and Intel CPUs benefit from optimized AVX, AVX2, and AVX-512 SIMD instructions. NVIDIA GPUs use CUDA, with support for tensor cores. AMD GPUs work with ROCm via optimized kernels.

Quantization Support

Llama.cpp includes top-notch quantization capabilities at precisions from 2-bit to 8-bit. The k-quants system (Q4_K_M, Q5_K_S, Q6_K, and so on) uses block-wise quantization, which preserves model quality while dramatically reducing the memory footprint. For example, a 7B-parameter model that typically requires about 14 GB in 16-bit precision can run in roughly 4 GB with 4-bit quantization.
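The memory figures above follow from simple arithmetic: bytes of weight storage ≈ parameters × bits per weight ÷ 8. A minimal sketch (the numbers are illustrative; real quantized files are slightly larger because k-quants also store per-block scale factors):

```python
# Back-of-envelope memory estimate for quantized model weights.
# Real GGUF files are a bit larger: k-quants store per-block scales too.

def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9  # a "7B" model

fp16 = weight_memory_gb(n_params, 16)  # 16-bit weights
q4 = weight_memory_gb(n_params, 4)     # 4-bit weights, before block scales

print(f"FP16: {fp16:.1f} GB, 4-bit: {q4:.1f} GB")
```

This is why 4-bit quantization brings a 14 GB model down to the 4 GB range once per-block metadata is included.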

OpenAI Compatible API

It comes with a built-in HTTP server that implements OpenAI’s API specification, making Llama.cpp a worthy drop-in replacement for OpenAI API calls. Supported endpoints include /v1/completions, /v1/chat/completions, and /v1/embeddings, using the same request and response formats.

Use Multiple Interfaces

Enjoy access via multiple interfaces so you can adapt to various types of workloads. The CLI provides direct model interaction with full control over the parameters. The interactive chat mode offers a conversational experience with persistent context and multi-turn dialogue. The built-in HTTP/REST server allows integration with any programming language or tool.

Multi-Model Architecture Support

Experience comprehensive model architecture support covering the entire landscape of available LLMs. New architectures are constantly added as they are released, allowing you to experiment with different models without changing your underlying infrastructure. You can also compare performance between different model architectures.

Privacy Focused

You have complete control over data sovereignty thanks to local execution. All tokens are processed on your own hardware; no data is sent to any external server. This helps privacy- and security-focused users who need to process confidential business documents, personal information, medical records, or legal documents.

Memory Friendly

Llama.cpp uses advanced memory optimization techniques that let you run larger models on older, lower-spec hardware. Memory mapping loads models directly from disk without copying them into RAM up front, so pages are read only as they are needed. KV cache quantization applies 8-bit or lower precision to the key-value cache, cutting its memory usage during generation, often by half or more.
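The KV cache savings are easy to estimate: the cache holds a key and a value vector per layer per position, so its size is 2 × layers × KV heads × head dimension × context length × bytes per element. A sketch with illustrative 7B-class numbers (check your model’s metadata for the real values):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int) -> int:
    """Keys + values for every layer across the whole context window."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical 7B-class settings: 32 layers, 32 KV heads, head dim 128, 4K context.
fp16 = kv_cache_bytes(32, 32, 128, 4096, 2)  # 16-bit cache
q8 = kv_cache_bytes(32, 32, 128, 4096, 1)    # 8-bit cache: exactly half

print(f"FP16 cache: {fp16 / 2**30:.2f} GiB, 8-bit cache: {q8 / 2**30:.2f} GiB")
```

Halving bytes per element halves the cache, which is where the "up to 50%" figure for 8-bit KV cache quantization comes from.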

Advanced Sampling

It offers sophisticated text-generation controls that let you fine-tune output quality. Temperature controls randomness (around 0.1 for focused output, 1.0 for creative output). Top-p (nucleus) sampling dynamically adjusts the candidate token pool based on cumulative probability mass. Top-k limits selection to the k most likely tokens. Repeat penalty discourages repetitive text by penalizing recently used tokens.
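To make the interplay concrete, here is a toy sampler implementing the same pipeline: temperature scaling, then top-k truncation, then nucleus (top-p) cutoff, then a random draw. It is a teaching sketch, not llama.cpp’s actual implementation:

```python
import math
import random

def sample(logits: dict, temperature: float = 1.0, top_k: int = 0,
           top_p: float = 1.0, rng=random):
    """Toy sampler: temperature scaling, then top-k, then top-p (nucleus)."""
    # Sort candidates by logit, most likely first.
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        items = items[:top_k]  # keep only the k most likely tokens
    # Softmax with temperature over the surviving candidates.
    m = max(v for _, v in items)
    exps = [(t, math.exp((v - m) / temperature)) for t, v in items]
    total = sum(e for _, e in exps)
    probs = [(t, e / total) for t, e in exps]
    # Nucleus cutoff: smallest prefix whose probability mass reaches top_p.
    kept, mass = [], 0.0
    for t, p in probs:
        kept.append((t, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize over the kept tokens and draw one.
    z = sum(p for _, p in kept)
    r, acc = rng.random() * z, 0.0
    for t, p in kept:
        acc += p
        if acc >= r:
            return t
    return kept[-1][0]
```

With `top_k=1` this reduces to greedy decoding; lowering `temperature` sharpens the softmax so the top token dominates, and lowering `top_p` shrinks the nucleus the same way.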

Why choose Llama.cpp?

Open-Source and Free

MIT-licensed and free to use, modify, and distribute, making it an ideal choice. It also has an active community of thousands of contributors and is updated constantly.

You are in Control

You get fine-grained control over your inference parameters, and you can adjust memory usage, speed, and quality to match your requirements exactly.

Top Performance

Highly optimized inference code with assembly-level optimizations. This achieves optimal performance on CPU and GPU hardware with minimal overhead.

How It Works

1. Load the LLM Models:

Download any pre-trained model in the GGUF format (or convert your own from PyTorch or SafeTensors formats). Quantized LLM models typically range from 2 to 10 GB for the common 7B-13B parameter sizes.

The GGUF format bundles all of the necessary metadata, tokenizer information, and model weights into a single portable file.
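A GGUF file starts with a small fixed header: the 4-byte magic "GGUF", then (little-endian) a uint32 format version and two uint64 counts for tensors and metadata key/value pairs, followed by the metadata itself. A minimal sketch of parsing that header, exercised against a fake in-memory header rather than a real model file (see the GGUF specification in the ggml repository for the authoritative layout):

```python
import struct

GGUF_MAGIC = b"GGUF"

def parse_gguf_header(blob: bytes) -> dict:
    """Read the fixed-size prefix of a GGUF file: magic, version, counts."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", blob, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Build a fake header in memory just to exercise the parser.
fake = struct.pack("<4sIQQ", GGUF_MAGIC, 3, 291, 24)
print(parse_gguf_header(fake))
```

In practice you would read the first 24 bytes of the `.gguf` file and pass them in; the metadata section that follows carries the tokenizer and architecture details mentioned above.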

2. Optimize the Execution:

llama.cpp automatically detects your hardware, including CPU features and available GPU(s), and configures optimal execution paths using SIMD instructions and GPU kernels.

The engine automatically selects the best quantization kernels for your processor, determines how many layers to offload to the GPU if one is available, and configures memory mapping.
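The layer-offload decision is essentially a packing problem: how many transformer layers fit in VRAM after reserving room for the KV cache and compute buffers. A simplified stand-in for that heuristic (the numbers and the 512 MiB reserve are hypothetical, not llama.cpp’s actual values):

```python
def layers_to_offload(vram_bytes: int, n_layers: int, layer_bytes: int,
                      reserve_bytes: int = 512 * 2**20) -> int:
    """How many transformer layers fit in VRAM, keeping a reserve
    for the KV cache and compute buffers."""
    usable = max(0, vram_bytes - reserve_bytes)
    return min(n_layers, usable // layer_bytes)

# Hypothetical numbers: an 8 GiB card and 32 layers of ~110 MiB each
# (roughly a 4-bit 7B model) -- everything fits on the GPU.
print(layers_to_offload(8 * 2**30, 32, 110 * 2**20))
```

When the whole model fits, all layers are offloaded; otherwise the remaining layers run on the CPU, which is the hybrid mode llama.cpp falls back to on smaller GPUs.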

3. Run your Inference:

Prompts are processed through the model using quantized weights and optimized attention mechanisms, generating responses in real time. The system maintains a key-value cache for efficient multi-turn conversations, streams tokens as they are generated for a responsive user experience, and applies your chosen sampling parameters to control output quality.
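The streaming behavior boils down to an autoregressive loop that yields each token as soon as it is sampled, stopping at an end-of-sequence token or a length limit. A toy sketch with a stand-in model function (not llama.cpp’s real inference loop):

```python
def generate_stream(step, prompt_tokens, max_new: int = 8, stop_token: str = "<eos>"):
    """Toy autoregressive loop: `step` maps the full context to the next
    token; tokens are yielded one by one, as a streaming server emits them."""
    ctx = list(prompt_tokens)
    for _ in range(max_new):
        tok = step(ctx)
        if tok == stop_token:
            break
        ctx.append(tok)
        yield tok

# Dummy "model": emits placeholder tokens until the context reaches length 5.
def dummy_step(ctx):
    return "<eos>" if len(ctx) >= 5 else f"t{len(ctx)}"

print(list(generate_stream(dummy_step, ["a", "b"])))
```

In the real engine, `step` is a forward pass that reuses the KV cache, so each new token only costs one incremental evaluation rather than reprocessing the whole context.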

You can adjust temperature, penalties, and other settings on the fly to tune generation behavior for specific use cases.

Technologies and Architecture

ggml Tensor Library

The core computational engine is built on ggml, a custom tensor library written in C that provides efficient operations for ML inference with minimal dependencies.

SIMD Optimization

Hand-tuned SIMD code using the AVX, AVX2, AVX-512, and NEON instruction sets delivers maximum CPU throughput on matrix operations and attention mechanisms.

GPU Compute APIs

llama.cpp integrates natively with CUDA, ROCm from AMD, Vulkan, OpenCL, and SYCL for accelerated inference.

KV Cache Management

Sophisticated key-value cache management, with quantization support, enables efficient memory usage during long-context generation runs and conversations.

Build Systems

llama.cpp supports CMake, Make, and various platform-specific build tools, enabling easy compilation across Linux, macOS, Android, and Windows 10/11.

System Requirements

Minimum:

  • C++11 compatible compiler
  • 4 GB RAM (for small LLM models)
  • Any modern CPU
  • Linux, macOS, or Windows

Recommended:

  • 16 GB+ RAM
  • Modern CPU with AVX2
  • NVIDIA/AMD GPU (optional)
  • SSD for model storage

Supported Platforms:

  • Linux (x86, ARM)
  • macOS (Intel & Apple Silicon)
  • Windows (x86)
  • Android, iOS, FreeBSD

GPU Support:

  • NVIDIA CUDA (compute 6.0+)
  • AMD ROCm
  • Apple Metal
  • Vulkan, OpenCL, SYCL

Core Dependencies

Required:

  • C++11 compiler (GCC, Clang, or MSVC)
  • Standard C++ library
  • No external runtime dependencies

Optional (GPU acceleration):

  • CUDA Toolkit from NVIDIA
  • ROCm from AMD
  • Vulkan SDK
  • Intel oneAPI (SYCL)

Build Tools:

  • CMake 3.14+
  • Make/Ninja
  • Platform SDK

Frequently Asked Questions

Below are questions that users frequently ask about llama.cpp. We hope these answer any outstanding questions you have about running LLM inference with llama.cpp.