
Installing the Large-Model Tool KTransformers

Technical Background

I have written several articles about DeepSeek before, covering loading models with Ollama and quantizing models with llama.cpp (llama.cpp can also load models itself, with functionality similar to Ollama). Here I introduce another high-performance large-model loading tool, developed in China: KTransformers. This article only covers how to install KTransformers, because my local GPU is too old to actually run it; the build and installation process itself, however, went through without problems.
Installing KTransformers from Source

First clone the KTransformers repository from GitHub and follow the official build instructions (do not use version 0.2.1! According to the maintainers, 0.2.1 degrades the model's output quality; this has been fixed in the latest version, so it is best to install from the latest source code):
$ git clone https://github.com/kvcache-ai/ktransformers.git
Cloning into 'ktransformers'...
remote: Enumerating objects: 1866, done.
remote: Counting objects: 100% (655/655), done.
remote: Compressing objects: 100% (300/300), done.
remote: Total 1866 (delta 440), reused 359 (delta 355), pack-reused 1211 (from 2)
Receiving objects: 100% (1866/1866), 9.33 MiB | 7.65 MiB/s, done.
Resolving deltas: 100% (990/990), done.
$ cd ktransformers/
$ git submodule init
Submodule 'third_party/llama.cpp' (https://github.com/ggerganov/llama.cpp.git) registered for path 'third_party/llama.cpp'
Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/pybind11'
$ git submodule update
Cloning into '/datb/DeepSeek/ktransformers/third_party/llama.cpp'...
Cloning into '/datb/DeepSeek/ktransformers/third_party/pybind11'...
Submodule path 'third_party/llama.cpp': checked out 'a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f'
Submodule path 'third_party/pybind11': checked out 'bb05e0810b87e74709d9f4c4545f1f57a1b386f5'
$ bash install.sh
Successfully built ktransformers
Installing collected packages: wcwidth, zstandard, tomli, tenacity, sniffio, six, pyproject_hooks, pydantic-core, psutil, propcache, orjson, ninja, multidict, jsonpointer, h11, greenlet, frozenlist, exceptiongroup, colorlog, click, attrs, async-timeout, annotated-types, aiohappyeyeballs, yarl, uvicorn, SQLAlchemy, requests-toolbelt, pydantic, jsonpatch, httpcore, build, blessed, anyio, aiosignal, starlette, httpx, aiohttp, langsmith, fastapi, accelerate, langchain-core, langchain-text-splitters, langchain, ktransformers
Successfully installed SQLAlchemy-2.0.38 accelerate-1.3.0 aiohappyeyeballs-2.4.6 aiohttp-3.11.12 aiosignal-1.3.2 annotated-types-0.7.0 anyio-4.8.0 async-timeout-4.0.3 attrs-25.1.0 blessed-1.20.0 build-1.2.2.post1 click-8.1.8 colorlog-6.9.0 exceptiongroup-1.2.2 fastapi-0.115.8 frozenlist-1.5.0 greenlet-3.1.1 h11-0.14.0 httpcore-1.0.7 httpx-0.28.1 jsonpatch-1.33 jsonpointer-3.0.0 ktransformers-0.2.1+cu128torch26fancy langchain-0.3.18 langchain-core-0.3.35 langchain-text-splitters-0.3.6 langsmith-0.3.8 multidict-6.1.0 ninja-1.11.1.3 orjson-3.10.15 propcache-0.2.1 psutil-7.0.0 pydantic-2.10.6 pydantic-core-2.27.2 pyproject_hooks-1.2.0 requests-toolbelt-1.0.0 six-1.17.0 sniffio-1.3.1 starlette-0.45.3 tenacity-9.0.0 tomli-2.2.1 uvicorn-0.34.0 wcwidth-0.2.13 yarl-1.18.3 zstandard-0.23.0
Installation completed successfully

At this point the installation has succeeded. If errors come up during installation, the notes below may help with debugging.
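As a quick sanity check (my own addition, not part of the original walkthrough), you can confirm the package is visible to Python before moving on:

$ python3 -m pip show ktransformers    # prints the name, version and install location if the wheel installed correctly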
Troubleshooting

If you see this error:
FileNotFoundError: No such file or directory: '/xxx/nvcc'

it means CUDA_HOME is set incorrectly. First check where nvcc actually lives with which nvcc:
$ which nvcc
/home/llama/bin/nvcc

Then put that path (adjusted to your own environment) into the environment variable:
$ export CUDA_HOME=/home/llama/targets/x86_64-linux

If you see this error:
sh: 1: cicc: not found

it means cicc is not on the system PATH; add the directory containing the cicc executable to the environment variable:
$ export PATH=$PATH:/home/llama/nvvm/bin

A particularly annoying problem is this one:
make: *** No rule to make target '/home/lib64/libcudart.so'

After fiddling with PATH, CUDA_HOME and LD_LIBRARY_PATH for a long time, I found that this path hard-codes lib64, while a local CUDA install usually uses lib. My quick fix was simply to copy the libraries over:
$ cp -r /home/lib/ /home/lib64/

Because dynamic library loading is involved, this approach is not exactly elegant, but it did resolve the error and avoided reinstalling CUDA.
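A possibly cleaner alternative (my own untested suggestion, not part of the original workflow) is a symbolic link instead of a copy, which satisfies the hard-coded lib64 path without duplicating the libraries:

$ # Create /home/lib64 as a link to the existing lib directory (adjust paths to your CUDA install;
$ # this only works if /home/lib64 does not already exist)
$ ln -s /home/lib /home/lib64

A symlink also keeps the two paths in sync automatically if the CUDA libraries are ever updated.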
Testing the Installation

After a successful installation, you can check that everything works with the following command:
$ python -m ktransformers.local_chat --help
NAME
    local_chat.py
SYNOPSIS
    local_chat.py <flags>
FLAGS
    --model_path=MODEL_PATH
      Type: Optional
      Default: None
    -o, --optimize_rule_path=OPTIMIZE_RULE_PATH
      Type: Optional
      Default: None
    -g, --gguf_path=GGUF_PATH
      Type: Optional
      Default: None
    --max_new_tokens=MAX_NEW_TOKENS
      Type: int
      Default: 300
    -c, --cpu_infer=CPU_INFER
      Type: int
      Default: 10
    -u, --use_cuda_graph=USE_CUDA_GRAPH
      Type: bool
      Default: True
    -p, --prompt_file=PROMPT_FILE
      Type: Optional
      Default: None
    --mode=MODE
      Type: str
      Default: 'normal'
    -f, --force_think=FORCE_THINK
      Type: bool
      Default: False

Loading a Model with KTransformers

To run local_chat you also need to install flash-attn:
$ python3 -m pip install flash-attn --no-build-isolation

Recommended model list: (given as a table in the original post, not preserved in this copy; an equivalent list can be found in the official KTransformers documentation)
To use KTransformers' chat functionality, the model directories need to live under the KTransformers root directory, i.e. the directory you cd into after git clone. When loading a model with KTransformers, both the safetensors model and the GGUF model are required. For example, to use the DeepSeek-V2.5-IQ4_XS model from ModelScope, run the following in the KTransformers root directory:
$ git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-V2.5.git
Cloning into 'DeepSeek-V2.5'...
remote: Enumerating objects: 81, done.
remote: Counting objects: 100% (81/81), done.
remote: Compressing objects: 100% (80/80), done.
remote: Total 81 (delta 6), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (81/81), 1.47 MiB | 1017.00 KiB/s, done.
Filtering content: 100% (55/55), 3.10 GiB | 60.00 KiB/s, done.
Encountered 55 file(s) that may not have been copied correctly on Windows:
        model-00013-of-000055.safetensors
        model-00052-of-000055.safetensors
        model-00050-of-000055.safetensors
        model-00027-of-000055.safetensors
        model-00053-of-000055.safetensors
        model-00016-of-000055.safetensors
        model-00026-of-000055.safetensors
        model-00051-of-000055.safetensors
        model-00028-of-000055.safetensors
        model-00039-of-000055.safetensors
        model-00015-of-000055.safetensors
        model-00014-of-000055.safetensors
        model-00049-of-000055.safetensors
        model-00040-of-000055.safetensors
        model-00038-of-000055.safetensors
        model-00021-of-000055.safetensors
        model-00020-of-000055.safetensors
        model-00025-of-000055.safetensors
        model-00024-of-000055.safetensors
        model-00037-of-000055.safetensors
        model-00046-of-000055.safetensors
        model-00047-of-000055.safetensors
        model-00045-of-000055.safetensors
        model-00012-of-000055.safetensors
        model-00011-of-000055.safetensors
        model-00023-of-000055.safetensors
        model-00044-of-000055.safetensors
        model-00048-of-000055.safetensors
        model-00010-of-000055.safetensors
        model-00035-of-000055.safetensors
        model-00022-of-000055.safetensors
        model-00033-of-000055.safetensors
        model-00036-of-000055.safetensors
        model-00043-of-000055.safetensors
        model-00034-of-000055.safetensors
        model-00031-of-000055.safetensors
        model-00032-of-000055.safetensors
        model-00054-of-000055.safetensors
        model-00042-of-000055.safetensors
        model-00030-of-000055.safetensors
        model-00019-of-000055.safetensors
        model-00002-of-000055.safetensors
        model-00008-of-000055.safetensors
        model-00018-of-000055.safetensors
        model-00009-of-000055.safetensors
        model-00007-of-000055.safetensors
        model-00001-of-000055.safetensors
        model-00041-of-000055.safetensors
        model-00055-of-000055.safetensors
        model-00005-of-000055.safetensors
        model-00006-of-000055.safetensors
        model-00004-of-000055.safetensors
        model-00003-of-000055.safetensors
        model-00017-of-000055.safetensors
        model-00029-of-000055.safetensors
See: `git lfs help smudge` for more details.

Then, in the KTransformers root, create a directory for the GGUF model, DeepSeek-V2.5-GGUF, cd into it, and download the corresponding GGUF files:
$ modelscope download --model bartowski/DeepSeek-V2.5-GGUF DeepSeek-V2.5-IQ4_XS/DeepSeek-V2.5-IQ4_XS-00001-of-00004.gguf --local_dir ./
$ modelscope download --model bartowski/DeepSeek-V2.5-GGUF DeepSeek-V2.5-IQ4_XS/DeepSeek-V2.5-IQ4_XS-00002-of-00004.gguf --local_dir ./
$ modelscope download --model bartowski/DeepSeek-V2.5-GGUF DeepSeek-V2.5-IQ4_XS/DeepSeek-V2.5-IQ4_XS-00003-of-00004.gguf --local_dir ./
$ modelscope download --model bartowski/DeepSeek-V2.5-GGUF DeepSeek-V2.5-IQ4_XS/DeepSeek-V2.5-IQ4_XS-00004-of-00004.gguf --local_dir ./

Once the model files have finished downloading, go back to the KTransformers root directory and load the model with KTransformers:
$ python -m ktransformers.local_chat --model_path ./DeepSeek-V2.5/ --gguf_path ./DeepSeek-V2.5-GGUF/

The startup log looks roughly like this:
(startup-log screenshot in the original post)
If your local hardware is up to it and the software installed correctly, then after a long stream of log output you land in the chat prompt. If you run into errors instead, the debugging notes below may help.
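For reference, here is a rough sketch (my own summary, not from the original post; file names illustrative) of the directory layout the commands above assume, relative to the KTransformers root:

ktransformers/                    # repository root; run local_chat from here
    DeepSeek-V2.5/                # safetensors checkpoint cloned from ModelScope (config, tokenizer, weights)
    DeepSeek-V2.5-GGUF/           # directory you created for the quantized GGUF shards (IQ4_XS, 4 files)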
Troubleshooting

The first issue you may run into concerns GPU memory management.
My local hardware is two 8 GB GPUs, and this model needs about 10 GB of VRAM, so in theory it should fit, but I hit an OOM error:
Traceback (most recent call last):
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/datb/DeepSeek/ktransformers/ktransformers/local_chat.py", line 179, in <module>
    fire.Fire(local_chat)
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/datb/DeepSeek/ktransformers/ktransformers/local_chat.py", line 110, in local_chat
    optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
  File "/datb/DeepSeek/ktransformers/ktransformers/optimize/optimize.py", line 129, in optimize_and_load_gguf
    load_weights(module, gguf_loader)
  File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 87, in load_weights
    module.load()
  File "/datb/DeepSeek/ktransformers/ktransformers/operators/base_operator.py", line 60, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 87, in load_weights
    module.load()
  File "/datb/DeepSeek/ktransformers/ktransformers/operators/base_operator.py", line 60, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 87, in load_weights
    module.load()
  File "/datb/DeepSeek/ktransformers/ktransformers/operators/linear.py", line 432, in load
    self.generate_linear.load(w=w)
  File "/datb/DeepSeek/ktransformers/ktransformers/operators/linear.py", line 215, in load
    w_ref, marlin_q_w, marlin_s, g_idx, sort_indices, _ = marlin_quantize(
  File "/datb/DeepSeek/ktransformers/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/marlin_utils.py", line 93, in marlin_quantize
    w_ref, q_w, s, g_idx, rand_perm = quantize_weights(w, num_bits, group_size,
  File "/datb/DeepSeek/ktransformers/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/quant_utils.py", line 84, in quantize_weights
    q_w = reshape_w(q_w)
  File "/datb/DeepSeek/ktransformers/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/quant_utils.py", line 81, in reshape_w
    w = w.reshape((size_k, size_n)).contiguous()
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 7.78 GiB of which 243.12 MiB is free. Including non-PyTorch memory, this process has 6.96 GiB memory in use. Of the allocated memory 6.25 GiB is allocated by PyTorch, and 623.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This happens because KTransformers only uses the VRAM of a single card by default; according to the maintainers, PCIe communication overhead between multiple cards is significant and performance may actually degrade. Still, some of us want to try it on low-end GPUs anyway. The optimize directory of the official repository contains some multi-GPU rule files (YAML), mainly for V3 and R1, but it turns out that V2.5 can apparently reuse the V2 optimize rule directly. You can run it like this:
$ python -m ktransformers.local_chat --model_path ./DeepSeek-V2.5/ --gguf_path ./DeepSeek-V2.5-GGUF/ -f true --port 11434 --optimize_rule_path ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu.yaml --cpu_infer 45

This manually places part of the model's layers on the second GPU, and the OOM problem goes away.
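If memory is still tight, the OOM message above itself suggests enabling expandable segments to reduce allocator fragmentation. As an untested sketch, you could combine that setting with the multi-GPU rule:

$ # expandable_segments only mitigates fragmentation; it cannot create VRAM that is not there
$ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
$ python -m ktransformers.local_chat --model_path ./DeepSeek-V2.5/ --gguf_path ./DeepSeek-V2.5-GGUF/ --optimize_rule_path ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu.yaml --cpu_infer 45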
Configuring an HF Mirror

All of the models I use locally were downloaded from ModelScope, but if KTransformers cannot find the model files locally, it will look them up on Hugging Face and download them from there. Because of network restrictions in mainland China, that direct download may fail. One workaround is to configure an HF mirror endpoint:
$ export HF_ENDPOINT=https://hf-mirror.com

I have not tested this approach myself; I am only noting it down here, so try it at your own discretion.
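If the mirror works for you, one common way (my addition, not from the original post) to make the setting persist across shell sessions is to append it to your shell profile:

$ echo 'export HF_ENDPOINT=https://hf-mirror.com' >> ~/.bashrc
$ source ~/.bashrc    # reload the profile in the current shell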
Docker Installation

If you really do not want to set up the environment or fight through build problems, you can install via Docker; a ready-made image exists: approachingai/ktransformers:0.2.1. Pulling from Docker Hub directly from mainland China often runs into network issues as well, so you may need to find a domestic mirror registry. For example, I used a temporary mirror at docker.1ms.run:
$ docker pull docker.1ms.run/approachingai/ktransformers:0.2.1
0.2.1: Pulling from approachingai/ktransformers
43f89b94cd7d: Pull complete
dd4939a04761: Pull complete
b0d7cc89b769: Pull complete
1532d9024b9c: Pull complete
04fc8a31fa53: Pull complete
a14a8a8a6ebc: Pull complete
7d61afc7a3ac: Pull complete
8bd2762ffdd9: Pull complete
2a5ee6fadd42: Pull complete
22ba0fb08ae2: Pull complete
4d37a6bba88f: Pull complete
4bc954eb910a: Pull complete
165aad6fabd0: Pull complete
c49c630af66e: Pull complete
4f4fb700ef54: Pull complete
ad3e53217d5f: Pull complete
4e5bba86d044: Pull complete
30e549924dd6: Pull complete
d01b314170e5: Pull complete
819a99ec0367: Pull complete
Digest: sha256:7de7da1cca07197aed37aff6e25daeb56370705f0ee05dbe3ad4803c51cc50a3
Status: Downloaded newer image for docker.1ms.run/approachingai/ktransformers:0.2.1
docker.1ms.run/approachingai/ktransformers:0.2.1

After the pull completes, check the running containers:
$ docker ps -a

Nothing is running yet. Now check the local images:
$ docker images
REPOSITORY                                    TAG      IMAGE ID       CREATED      SIZE
docker.1ms.run/approachingai/ktransformers    0.2.1    614daa66a726   3 days ago   18.5GB

Create a container from the image, mapping in the local directories you need:
$ docker run --gpus all -v /datb/DeepSeek/:/models --name ktransformers -itd approachingai/ktransformers:0.2.1

Check the running containers again:
$ docker ps -a
CONTAINER ID   IMAGE                                              COMMAND               CREATED         STATUS         PORTS   NAMES
0d3fb9d57002   docker.1ms.run/approachingai/ktransformers:0.2.1   "tail -f /dev/null"   3 minutes ago   Up 3 minutes           ktransformers

The container is now up; the next step is to get a shell inside it:
$ docker exec -it ktransformers /bin/bash

Check the working directory and the mapped directory:
root@0d3fb9d57002:/workspace# ll
total 12
drwxr-xr-x 1 root root 4096 Feb 15 05:45 ./
drwxr-xr-x 1 root root 4096 Feb 19 00:54 ../
drwxrwxr-x 8 1004 1004 4096 Feb 15 06:40 ktransformers/
root@0d3fb9d57002:/workspace# ll /models/
total 72
drwxr-xr-x  7 1001 1001  4096 Feb 17 06:19 ./
drwxr-xr-x  1 root root  4096 Feb 19 00:54 ../
drwxrwxr-x  3 1001 1001  4096 Feb  6 09:19 DeepSeek32B/
drwxrwxr-x 12 1001 1001  4096 Feb 17 09:14 ktransformers/
drwxrwxr-x  4 1001 1001  4096 Feb 12 02:31 llama/
drwxrwxr-x  7 1001 1001  4096 Feb 18 06:46 models/
-rwxrwxr-x  1 1001 1001 13389 Feb  5 04:16 ollama_install.sh*
drwxrwxr-x  7 1001 1001  4096 Feb 14 08:31 page-share-app/
-rw-rw-r--  1 1001 1001 22239 Feb 12 02:14 running.log
-rw-rw-r--  1 1001 1001   183 Feb 12 02:09 test_llama_cpp.py

Similar to the usage shown above, launch the chat directly, only with the paths adjusted accordingly:
root@0d3fb9d57002:/models# python3 -m ktransformers.local_chat --gguf_path ktransformers/DeepSeek-V2.5-GGUF --model_path ktransformers/DeepSeek-V2.5 --cpu_infer 45 -f true --port 11434 --optimize_rule_path ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu.yaml

If the local hardware is up to it, this should drop you straight into the chat prompt.
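Before launching, it can be worth confirming that the container actually sees the GPUs (a quick check of my own, not part of the original walkthrough):

root@0d3fb9d57002:/models# nvidia-smi    # should list the GPUs passed through via --gpus all
root@0d3fb9d57002:/models# python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"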
TORCH_USE_CUDA_DSA Errors

Finally, a problem with no real workaround. During the kernel compilation step that happens when running local_chat, the following error may pop up:
loading token_embd.weight to cpu
/opt/conda/lib/python3.10/site-packages/ktransformers/util/custom_gguf.py:644: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905979055/work/torch/csrc/utils/tensor_numpy.cpp:206.)
  data = torch.from_numpy(data)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/local_chat.py", line 179, in <module>
    fire.Fire(local_chat)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/local_chat.py", line 110, in local_chat
    optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/optimize/optimize.py", line 129, in optimize_and_load_gguf
    load_weights(module, gguf_loader)
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 85, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 87, in load_weights
    module.load()
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/operators/base_operator.py", line 60, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 85, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 85, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 85, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 87, in load_weights
    module.load()
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/operators/base_operator.py", line 60, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 85, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 87, in load_weights
    module.load()
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/operators/linear.py", line 430, in load
    self.generate_linear.load(w=w)
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/operators/linear.py", line 215, in load
    w_ref, marlin_q_w, marlin_s, g_idx, sort_indices, _ = marlin_quantize(
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/marlin_utils.py", line 93, in marlin_quantize
    w_ref, q_w, s, g_idx, rand_perm = quantize_weights(w, num_bits, group_size,
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/quant_utils.py", line 61, in quantize_weights
    w = w.reshape((group_size, -1))
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Or, after you have reached the chat prompt and typed any question, the following error may appear:
Chat: who are you?
Traceback (most recent call last):
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/datb/DeepSeek/ktransformers/ktransformers/local_chat.py", line 179, in <module>
    fire.Fire(local_chat)
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/datb/DeepSeek/ktransformers/ktransformers/local_chat.py", line 173, in local_chat
    generated = prefill_and_generate(
  File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 154, in prefill_and_generate
    logits = model(
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/datb/DeepSeek/ktransformers/ktransformers/models/modeling_deepseek.py", line 1731, in forward
    outputs = self.model(
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/datb/DeepSeek/ktransformers/ktransformers/operators/models.py", line 722, in forward
    layer_outputs = decoder_layer(
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/datb/DeepSeek/ktransformers/ktransformers/models/modeling_deepseek.py", line 1238, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/datb/DeepSeek/ktransformers/ktransformers/operators/attention.py", line 407, in forward
    return self.forward_windows(
  File "/datb/DeepSeek/ktransformers/ktransformers/operators/attention.py", line 347, in forward_windows
    return self.forward_chunck(
  File "/datb/DeepSeek/ktransformers/ktransformers/operators/attention.py", line 85, in forward_chunck
    q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/datb/DeepSeek/ktransformers/ktransformers/models/modeling_deepseek.py", line 113, in forward
    hidden_states = hidden_states.to(torch.float32)
RuntimeError: CUDA error: invalid device function
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

For TORCH_USE_CUDA_DSA errors like these, the official answer is that they are most likely caused by the GPU being too old. There is no real fix for this; the only way out is to switch to a newer GPU.
Summary

This article mainly describes how to install KTransformers, a high-performance large-model loading tool developed in China. The reason it stops at installation rather than usage is that the tool does place real demands on local hardware: on a GPU that is too old, you may hit the TORCH_USE_CUDA_DSA-related errors described above, and the only fix is a newer GPU. For that reason I could not fully test it locally, but both the source installation and the Docker installation were confirmed to work.
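If you want to check in advance whether your own GPU is likely to fall into the "too old" category, a quick check (my addition, not from the original post) is to print its CUDA compute capability:

$ python3 -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
$ # 'no kernel image is available for execution on the device' typically means the prebuilt
$ # kernels were not compiled for this compute capability; the Marlin quantization kernels
$ # seen in the tracebacks above generally target newer architectures.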
Copyright Notice

This article was first published at: https://www.cnblogs.com/dechinphy/p/ktransformer.html
Author ID: DechinPhy
More original articles: https://www.cnblogs.com/dechinphy/
Buy the author a coffee: https://www.cnblogs.com/dechinphy/gallery/image/379634.html