DeepSeek Model Quantization
Technical Background
Large language models (LLMs) can be quantized to reduce memory/VRAM usage and lower communication overhead, which in turn speeds up inference. The most common approach converts Float16 floating-point weights into low-precision integers such as Int4. In the most extreme case, the parameters can be reduced to binary values (only 0 and 1), but such aggressive quantization tends to hurt inference quality. A common rule of thumb is to use Q8 for models below 70B parameters and Q4 for models above 70B. The underlying theory, including symmetric and asymmetric quantization, is not covered here; this article focuses on the engineering side, using llama.cpp to perform the quantization.
Installing llama.cpp
Here we build llama.cpp from source on Ubuntu. First, clone the repository from GitHub:
$ git clone https://github.com/ggerganov/llama.cpp.git
Cloning into 'llama.cpp'...
remote: Enumerating objects: 43657, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 43657 (delta 3), reused 5 (delta 1), pack-reused 43642 (from 3)
Receiving objects: 100% (43657/43657), 88.26 MiB | 8.30 MiB/s, done.
Resolving deltas: 100% (31409/31409), done.
It is best to create a virtual environment to avoid dependency conflicts; Python 3.10 is recommended:
# Create the virtual environment
$ conda create -n llama python=3.10
# Activate the virtual environment
$ conda activate llama
Enter the cloned llama.cpp directory and install all of its dependencies:
$ cd llama.cpp/
$ python3 -m pip install -e .
Create a build directory and run the build commands:
$ mkdir build
$ cd build/
$ cmake ..
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.25.1")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Configuring done
-- Generating done
-- Build files have been written to: /datb/DeepSeek/llama/llama.cpp/build
$ cmake --build . --config Release
Scanning dependencies of target ggml-base
Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o
Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o
Linking CXX executable ../../bin/llama-vdot
Built target llama-vdot
At this point the CPU version of llama.cpp has been built successfully and is ready to use. If you need the GPU-accelerated build, see the next subsection; if that sounds like too much trouble, just skip it.
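Before moving on, it does no harm to do a quick sanity check that the binaries used later in this article were actually produced (a sketch; the exact set of executables varies by llama.cpp version, but llama-quantize appears in the build output and is used below):
# still inside build/: list the freshly built executables
$ ls bin/ | grep llama-
# the quantization tool used later should be among them
$ ./bin/llama-quantize --help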
CUDA Acceleration for llama.cpp
Building the GPU version of llama.cpp requires a few extra dependencies first:
$ sudo apt install curl libcurl4-openssl-dev
The main difference from the CPU build is the cmake configuration command (if you have already built the CPU version, it is best to clear out the build directory first):
$ cmake .. -DCMAKE_CUDA_COMPILER=/usr/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17
The extra flag -DCMAKE_CUDA_STANDARD=17 works around an issue reported in the llama.cpp repository; without it you may hit an error like this:
CMake Error in ggml/src/ggml-cuda/CMakeLists.txt:
  Target "ggml-cuda" requires the language dialect "CUDA17" (with compiler
  extensions), but CMake does not know the compile flags to use to enable it.
If everything goes well, run the following command; if it compiles through cleanly, you are done:
$ cmake --build . --config Release
If, like me, you get error messages instead, they need to be dealt with one by one.
/datb/DeepSeek/llama/llama.cpp/ggml/src/ggml-cuda/vendors/cuda.h:6:10: fatal error: cuda_bf16.h: No such file or directory
 #include <cuda_bf16.h>
          ^~~~~~~~~~~~~
compilation terminated.
This error means the header file cannot be found. Running find / -name cuda_bf16.h in the environment shows that the header does in fact exist:
/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/cuda_bf16.h
/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/triton/backends/nvidia/include/cuda_bf16.h
The fix is to add this path to CPATH:
$ export CPATH=$CPATH:/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/
If you then see this error:
/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/cuda_fp16.h:4100:10: fatal error: nv/target: No such file or directory
 #include <nv/target>
          ^~~~~~~~~~~
compilation terminated.
it means the nv/target directory cannot be found. If a copy of it exists locally, that path can likewise be added to CPATH:
$ export CPATH=/home/dechin/anaconda3/pkgs/cupy-core-13.3.0-py310h5da974a_2/lib/python3.10/site-packages/cupy/_core/include/cupy/_cccl/libcudacxx/:$CPATH
If you instead get errors like the following:
/datb/DeepSeek/llama/llama.cpp/ggml/src/ggml-cuda/common.cuh(138): error: identifier "cublasGetStatusString" is undefined
/datb/DeepSeek/llama/llama.cpp/ggml/src/ggml-cuda/common.cuh(417): error: A __device__ variable cannot be marked constexpr
/datb/DeepSeek/llama/llama.cpp/ggml/src/ggml-cuda/common.cuh(745): error: identifier "CUBLAS_TF32_TENSOR_OP_MATH" is undefined
3 errors detected in the compilation of "/tmp/tmpxft_000a126f_00000000-9_acc.compute_75.cpp1.ii".
make: *** Error 1
make: *** Error 2
make: *** Error 2
then the CUDA toolkit version is most likely the problem; try installing CUDA 12:
$ conda install nvidia::cuda-toolkit
If conda itself fails during installation with something like this:
Collecting package metadata (current_repodata.json): failed
# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<
Traceback (most recent call last):
  File "/home/dechin/anaconda3/lib/python3.8/site-packages/conda/gateways/repodata/__init__.py", line 132, in conda_http_errors
    yield
  File "/home/dechin/anaconda3/lib/python3.8/site-packages/conda/gateways/repodata/__init__.py", line 101, in repodata
    response.raise_for_status()
  File "/home/dechin/anaconda3/lib/python3.8/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://conda.anaconda.org/defaults/linux-64/current_repodata.json
then the conda channel configuration is probably at fault. Remove the old channels and either fall back to the defaults or configure a mirror that is reachable from your network:
$ conda config --remove-key channels
$ conda config --remove-key default_channels
$ conda config --append channels conda-forge
After reinstalling, the path to nvcc changes, so remember to update the -DCMAKE_CUDA_COMPILER argument in the cmake command:
$ cmake .. -DCMAKE_CUDA_COMPILER=/home/dechin/anaconda3/envs/llama/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17
If the following error appears:
-- Unable to find cuda_runtime.h in "/home/dechin/anaconda3/envs/llama/include" for CUDAToolkit_INCLUDE_DIR.
-- Could NOT find CUDAToolkit (missing: CUDAToolkit_INCLUDE_DIR)
CMake Error at ggml/src/ggml-cuda/CMakeLists.txt:151 (message):
  CUDA Toolkit not found
-- Configuring incomplete, errors occurred!
See also "/datb/DeepSeek/llama/llama.cpp/build/CMakeFiles/CMakeOutput.log".
See also "/datb/DeepSeek/llama/llama.cpp/build/CMakeFiles/CMakeError.log".
it means CMake cannot locate CUDAToolkit_INCLUDE_DIR; just add an include path to the cmake command:
$ cmake .. -DCMAKE_CUDA_COMPILER=/home/dechin/anaconda3/envs/llama/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17 -DCUDAToolkit_INCLUDE_DIR=/home/dechin/anaconda3/envs/llama/targets/x86_64-linux/include/ -DCURL_LIBRARY=/usr/lib/x86_64-linux-gnu/
If, after all of the above, you still get errors, my suggestion is to fall back to Docker, or simply run the quantization with the CPU build and serve the model with Ollama, which is more convenient.
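For the Docker route, a minimal sketch is shown below. The image name and tag are assumptions based on the images published by the llama.cpp project, and the mounted path is illustrative; check the project's Docker documentation for the current image:
# a sketch, not a verified recipe: image name/tag and mounted paths are illustrative
$ docker run --gpus all -it --rm \
    -v /datb/DeepSeek/models:/models \
    --entrypoint /bin/bash \
    ghcr.io/ggml-org/llama.cpp:full-cuda
# inside the container, the llama.cpp tools (e.g. llama-quantize) come prebuilt,
# so the quantization steps later in this article can be run against /models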
Downloading the Hugging Face Model
Because many GGUF files that have already been quantized cannot simply be quantized a second time (re-quantizing tends to degrade quality), it is best to download the original safetensors model files from Hugging Face, convert the HF model to GGUF with a Python script shipped with llama.cpp, and then quantize it with llama.cpp.
As for downloading the model: since access to Hugging Face can be restricted at times, the first choice here is the ModelScope platform. To download models from ModelScope, install the modelscope Python package:
$ python3 -m pip install modelscope
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: modelscope in /home/dechin/anaconda3/lib/python3.8/site-packages (1.22.3)
Requirement already satisfied: requests>=2.25 in /home/dechin/.local/lib/python3.8/site-packages (from modelscope) (2.25.1)
Requirement already satisfied: urllib3>=1.26 in /home/dechin/.local/lib/python3.8/site-packages (from modelscope) (1.26.5)
Requirement already satisfied: tqdm>=4.64.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from modelscope) (4.67.1)
Requirement already satisfied: certifi>=2017.4.17 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (2021.5.30)
Requirement already satisfied: chardet<5,>=3.0.2 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (4.0.0)
Requirement already satisfied: idna<3,>=2.5 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (2.10)
Then download the model with modelscope:
$ modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
If this fails with the error below (if it does not, just ignore this and wait for the download to finish):
safetensors integrity check failed, expected sha256 signature is xxx
you can try an alternative download method. First install git-lfs:
$ sudo apt install git-lfs
Then download the model:
$ git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B.git
Cloning into 'DeepSeek-R1-Distill-Qwen-32B'...
remote: Enumerating objects: 52, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 52 (delta 17), reused 42 (delta 13), pack-reused 0
Unpacking objects: 100% (52/52), 2.27 MiB | 2.62 MiB/s, done.
Filtering content: 100% (8/8), 5.02 GiB | 912.00 KiB/s, done.
Encountered 8 file(s) that may not have been copied correctly on Windows:
  model-00005-of-000008.safetensors
  model-00004-of-000008.safetensors
  model-00008-of-000008.safetensors
  model-00002-of-000008.safetensors
  model-00007-of-000008.safetensors
  model-00003-of-000008.safetensors
  model-00006-of-000008.safetensors
  model-00001-of-000008.safetensors
See: `git lfs help smudge` for more details.
This step takes a long time, so please be patient until the download completes. Once it is done, check the directory:
$ cd DeepSeek-R1-Distill-Qwen-32B/
$ ll
total 63999072
drwxrwxr-x 4 dechin dechin       4096 Feb 12 19:22 ./
drwxrwxr-x 3 dechin dechin       4096 Feb 12 17:46 ../
-rw-rw-r-- 1 dechin dechin        664 Feb 12 17:46 config.json
-rw-rw-r-- 1 dechin dechin         73 Feb 12 17:46 configuration.json
drwxrwxr-x 2 dechin dechin       4096 Feb 12 17:46 figures/
-rw-rw-r-- 1 dechin dechin        181 Feb 12 17:46 generation_config.json
drwxrwxr-x 9 dechin dechin       4096 Feb 12 19:22 .git/
-rw-rw-r-- 1 dechin dechin       1519 Feb 12 17:46 .gitattributes
-rw-rw-r-- 1 dechin dechin       1064 Feb 12 17:46 LICENSE
-rw-rw-r-- 1 dechin dechin 8792578462 Feb 12 19:22 model-00001-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906899 Feb 12 19:03 model-00002-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 19:18 model-00003-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 18:56 model-00004-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 18:38 model-00005-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 19:19 model-00006-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 19:15 model-00007-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 4073821536 Feb 12 19:02 model-00008-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin      64018 Feb 12 17:46 model.safetensors.index.json
-rw-rw-r-- 1 dechin dechin      18985 Feb 12 17:46 README.md
-rw-rw-r-- 1 dechin dechin       3071 Feb 12 17:46 tokenizer_config.json
-rw-rw-r-- 1 dechin dechin    7031660 Feb 12 17:46 tokenizer.json
The download has completed successfully.
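Since the clone above warned that some LFS files may not have been copied correctly, it is worth verifying them before converting. A minimal sketch, assuming git-lfs is installed:
# verify the LFS objects against their recorded checksums, then re-fetch anything missing or corrupted
$ git lfs fsck
$ git lfs pull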
Converting the HF Model to GGUF
In the built llama.cpp directory (here llama/llama.cpp/) you will find the conversion script; take a look at its usage first:
$ python3 convert_hf_to_gguf.py --help
usage: convert_hf_to_gguf.py [-h] [--vocab-only] [--outfile OUTFILE] [--outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}] [--bigendian]
                             [--use-temp-file] [--no-lazy] [--model-name MODEL_NAME] [--verbose] [--split-max-tensors SPLIT_MAX_TENSORS]
                             [--split-max-size SPLIT_MAX_SIZE] [--dry-run] [--no-tensor-first-split] [--metadata METADATA]
                             [--print-supported-models]
                             model

Convert a huggingface model to a GGML compatible file

positional arguments:
  model                 directory containing model file

options:
  -h, --help            show this help message and exit
  --vocab-only          extract only the vocab
  --outfile OUTFILE     path to write to; default: based on input. {ftype} will be replaced by the outtype.
  --outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}
                        output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, tq1_0 or tq2_0 for ternary,
                        and auto for the highest-fidelity 16-bit float type depending on the first loaded tensor type
  --bigendian           model is executed on big endian machine
  --use-temp-file       use the tempfile library while processing (helpful when running out of memory, process killed)
  --no-lazy             use more RAM by computing all outputs before writing (use in case lazy evaluation is broken)
  --model-name MODEL_NAME
                        name of the model
  --verbose             increase output verbosity
  --split-max-tensors SPLIT_MAX_TENSORS
                        max tensors in each split
  --split-max-size SPLIT_MAX_SIZE
                        max size per split N(M|G)
  --dry-run             only print out a split plan and exit, without writing any new files
  --no-tensor-first-split
                        do not add tensors to the first split (disabled by default)
  --metadata METADATA   Specify the path for an authorship metadata override file
  --print-supported-models
                        Print the supported models
Then run the conversion to GGUF:
$ python3 convert_hf_to_gguf.py /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B --outfile /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf: n_tensors = 771, total_size = 65.5G
Writing: 100%|██████████████████████████████████████████████████████████████| 65.5G/65.5G
INFO:hf-to-gguf:Model successfully exported to /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf
Once the conversion finishes, a single all-in-one GGUF file appears at the specified path. Without an explicit --outtype it is written at 16-bit precision (the 65.5 GB size corresponds to roughly two bytes per parameter for a 32B model), and it can now be used as the input for the quantization step.
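If disk space is tight, the conversion step itself can emit a lower-precision file via the --outtype option shown in the help above; for example (paths are illustrative):
# convert straight to Q8_0 during export instead of writing the full 16-bit file first
$ python3 convert_hf_to_gguf.py /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B \
    --outtype q8_0 \
    --outfile /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf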
Quantizing the GGUF Model
In the build/bin/ directory of the compiled llama.cpp you will find the quantization executable:
$ ./llama-quantize --help
usage: ./llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights]
       [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf type

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type
  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
  --include-weights tensor_name: use importance matrix for this/these tensor(s)
  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
  --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
  --keep-split: will generate quantized model in the same shards as input
  --override-kv KEY=TYPE:VALUE
      Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together

Allowed quantization types:
   2  or  Q4_0    :  4.34G, +0.4685 ppl @ Llama-3-8B
   3  or  Q4_1    :  4.78G, +0.4511 ppl @ Llama-3-8B
   8  or  Q5_0    :  5.21G, +0.1316 ppl @ Llama-3-8B
   9  or  Q5_1    :  5.65G, +0.1062 ppl @ Llama-3-8B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5  bpw quantization
  29  or  IQ2_M   :  2.7  bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  36  or  TQ1_0   :  1.69 bpw ternarization
  37  or  TQ2_0   :  2.06 bpw ternarization
  10  or  Q2_K    :  2.96G, +3.5199 ppl @ Llama-3-8B
  21  or  Q2_K_S  :  2.96G, +3.1836 ppl @ Llama-3-8B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    : alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  3.41G, +1.6321 ppl @ Llama-3-8B
  12  or  Q3_K_M  :  3.74G, +0.6569 ppl @ Llama-3-8B
  13  or  Q3_K_L  :  4.03G, +0.5562 ppl @ Llama-3-8B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    : alias for Q4_K_M
  14  or  Q4_K_S  :  4.37G, +0.2689 ppl @ Llama-3-8B
  15  or  Q4_K_M  :  4.58G, +0.1754 ppl @ Llama-3-8B
  17  or  Q5_K    : alias for Q5_K_M
  16  or  Q5_K_S  :  5.21G, +0.1049 ppl @ Llama-3-8B
  17  or  Q5_K_M  :  5.33G, +0.0569 ppl @ Llama-3-8B
  18  or  Q6_K    :  6.14G, +0.0217 ppl @ Llama-3-8B
   7  or  Q8_0    :  7.96G, +0.0026 ppl @ Llama-3-8B
   1  or  F16     : 14.00G, +0.0020 ppl @ Mistral-7B
  32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
   0  or  F32     : 26.00G              @ 7B
          COPY    : only copy tensors, no quantizing
Here you can see the full list of precisions the tool can quantize to. For example, we can quantize the 32B model to q4_0:
$ ./llama-quantize /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf q4_0
A comparison of the resulting file sizes (the Q8_0 file here is someone else's Q8_0 quantization downloaded directly from the model repository):
-rw-rw-r-- 1 dechin dechin 65535969184 Feb 13 09:33 DeepSeek-R1-Distill-Qwen-32B.gguf
-rw-rw-r-- 1 dechin dechin 18640230304 Feb 13 09:51 DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf
-rw-rw-r-- 1 dechin dechin 34820884384 Feb  9 01:44 DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf
Going from the unquantized GGUF to Q8 and then Q4, there is a clear drop in memory footprint. You can decide how aggressively to quantize based on the compute resources available on your own machine.
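More aggressive levels from the table above work the same way; for instance, the Q2_K model that shows up in the ollama list below can be produced with a command like this (a sketch, with illustrative paths):
$ ./llama-quantize /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q2_K.gguf Q2_K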
Once quantization is finished, you can build the Ollama model by following the article referenced here; after the import succeeds, ollama list shows all of your local models:
$ ollama list
NAME                            ID              SIZE      MODIFIED
deepseek-r1:32b-q2k             8d2a0c19f6e0    12 GB     5 seconds ago
deepseek-r1:32b-q40             13c7c287f615    18 GB     3 minutes ago
deepseek-r1:32b                 91f2de3dd7fd    34 GB     42 hours ago
nomic-embed-text-v1.5:latest    5b3683392ccb    274 MB    43 hours ago
deepseek-r1:14b                 ea35dfe18182    9.0 GB    7 days ago
Here q2k is the locally quantized Q2_K model. The size reduction from Q4_0 to Q2_K is no longer that significant, which is why most people stop at Q4_0: it offers a good balance between performance and accuracy.
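For reference, the import itself boils down to a small Ollama Modelfile. A minimal sketch, assuming the Q4_0 GGUF path used above; the tag name matches the listing but is otherwise illustrative:
# Modelfile: point Ollama at the locally quantized GGUF, then create and run the model
$ cat > Modelfile << 'EOF'
FROM /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf
EOF
$ ollama create deepseek-r1:32b-q40 -f Modelfile
$ ollama run deepseek-r1:32b-q40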
Handling Other Errors
If running the llama-quantize executable produces this error:
./xxx/llama-quantize: error while loading shared libraries: libllama.so: cannot open shared object file: No such file or directory
it means the dynamic library path LD_LIBRARY_PATH has not been set. Alternatively, you can simply cd into the bin/ directory and run the executable from there.
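Setting the path explicitly looks like this (a sketch; the prefix is illustrative, and the shared library usually sits next to the binaries in build/bin, so adjust it to your own build tree):
# add the directory containing libllama.so to the dynamic linker search path
$ export LD_LIBRARY_PATH=/datb/DeepSeek/llama/llama.cpp/build/bin:$LD_LIBRARY_PATH
$ /datb/DeepSeek/llama/llama.cpp/build/bin/llama-quantize --help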
Summary
This article introduced llama.cpp, a tool for working with large models. Since Ollama is already used here to run the models, the article only covered using llama.cpp to convert HF models to GGUF and to quantize large models. Parameter quantization makes it possible to run DeepSeek's distilled models locally even on a limited hardware budget.
Copyright Notice
This article was first published at: https://www.cnblogs.com/dechinphy/p/quantize.html
Author ID: DechinPhy
More original articles: https://www.cnblogs.com/dechinphy/
Buy the author a coffee: https://www.cnblogs.com/dechinphy/gallery/image/379634.html