tootfsg

V2EX member #674121, joined on 2024-01-31 00:23:51 +08:00

tootfsg's recent replies

May 22

Replied to a topic by schen1027a1 › MacBook Pro › Torn over a second-hand M1 Pro, could you all help me take a look

May 20

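Replied to a topic by tootfsg › Local LLM › Thoughts on 5070 Ti model inference speed and local deployment

@uprit With -ngl 99 set, if the model can't be fully loaded into VRAM, startup fails with an error and exits right away; it never falls back to system RAM.
Are you auto-replying with an AI? This feels far too much like the web version of Gemini to me, the same dumbed-down feel, exactly the same.
"The model is close to 16 GB"... I posted above that the model takes up 13.3 GB.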

May 20

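Replied to a topic by tootfsg › Local LLM › Thoughts on 5070 Ti model inference speed and local deployment

@uprit I just recompiled the latest llama.cpp with CUDA 13.1 and found the problem. It isn't the CUDA version, it's the context.
When I kept asking in the earlier chat, it was still 10 t/s from the start. Then I opened a new chat: the speed started at 45 t/s, dropped steadily as the output went on, and finished at 32.08 t/s, 1424 tokens in total, 44 s.
Then I asked a second question. It started at 22 t/s, again slowing down as the output went on, and finished at 17.72 t/s, 1813 tokens, 1 min 42 s.
Both questions were programming questions, each implementing a small feature.
With that, the cause is finally out in the open.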
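
The figures above are self-consistent if the final t/s value is read as the average over the whole response. A minimal Python sketch of that check, using only the numbers quoted in the reply (the function name is made up for illustration, and the times are only accurate to the second):

```python
def average_tokens_per_second(tokens: int, seconds: float) -> float:
    """Average generation speed over a whole response."""
    return tokens / seconds

# (tokens generated, wall time in seconds, t/s reported at completion)
runs = [
    (1424, 44, 32.08),    # first question in the new chat
    (1813, 102, 17.72),   # second question, 1 min 42 s
]

for tokens, seconds, reported in runs:
    avg = average_tokens_per_second(tokens, seconds)
    print(f"{tokens} tokens / {seconds} s = {avg:.2f} t/s (reported {reported})")
```

Both computed averages land within rounding distance of the reported values, which fits the explanation that speed falls as the context grows.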

May 20

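Replied to a topic by tootfsg › Local LLM › Thoughts on 5070 Ti model inference speed and local deployment

Speaking of dense models, Mistral released a 110B dense model a few days ago. It is frighteningly large and I wanted to try it, but none of the API relay services sell access to it.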

May 20

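Replied to a topic by tootfsg › Local LLM › Thoughts on 5070 Ti model inference speed and local deployment

@uprit Dense models are especially compute-hungry. On paper alone, the compute requirement is 6 times that of 26B A4B.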
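
As a rough paper comparison, per-token compute is often approximated as about 2 FLOPs per active parameter. The sketch below assumes the dense model being compared has about 26B parameters, which this reply does not state; under that assumption the ratio against 26B A4B (roughly 4B parameters active per token) comes out close to the 6x mentioned.

```python
def flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2.0 * active_params

DENSE_PARAMS = 26e9       # ASSUMPTION: dense model size, not given in the reply
MOE_ACTIVE_PARAMS = 4e9   # "A4B": roughly 4B parameters active per token

ratio = flops_per_token(DENSE_PARAMS) / flops_per_token(MOE_ACTIVE_PARAMS)
print(f"dense / MoE compute per token: {ratio:.1f}x")
```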

May 20

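Replied to a topic by tootfsg › Local LLM › Thoughts on 5070 Ti model inference speed and local deployment

@uprit VRAM didn't overflow. I run all my models with -ngl 99. My CPU is weak and couldn't manage even 10 t/s, and if VRAM had overflowed it simply wouldn't have started; I adjusted the context size many times just to get it to fit.
It has gotten slower, though. I remember it used to be a bit over 20 t/s. I suspect it may be a CUDA version problem: my machine has both 13.1 and 13.2, and someone mentioned that 13.2 may have issues.

Launch parameters: ngl 99, fa on, jinja, ctk q8, ctv q4, np 1, c 25600. After startup, VRAM usage is 15193, with 13302 for the model and 1625 for the context.

My CPU is a 12400F with DDR4; that can't run at 10 t/s, can it?
The only possibility is that I may have specified CUDA 13.2 when compiling llama.cpp.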
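
For reference, the launch parameters listed above map onto llama.cpp command-line options roughly as follows. This is a sketch under assumptions: the binary name `llama-server`, the model path, and the cache type spellings `q8_0`/`q4_0` are guesses, since the reply only gives shorthand. The VRAM figures are assumed to be in MiB.

```python
import shlex

# ASSUMPTIONS: binary name and model path are placeholders; "q8"/"q4" are
# taken to mean the q8_0/q4_0 KV-cache types.
argv = [
    "llama-server",
    "-m", "model.gguf",   # hypothetical path, not given in the reply
    "-ngl", "99",         # offload up to 99 layers (effectively all) to the GPU
    "-fa", "on",          # flash attention
    "--jinja",            # use the model's Jinja chat template
    "-ctk", "q8_0",       # K cache quantization
    "-ctv", "q4_0",       # V cache quantization
    "-np", "1",           # one parallel slot
    "-c", "25600",        # context size in tokens
]
print(shlex.join(argv))

# VRAM figures from the reply, assumed to be MiB.
total_used, model, context = 15193, 13302, 1625
other = total_used - model - context   # whatever is left beyond model and context
print(f"model + context = {model + context}, other = {other}")
```

The model and context together account for nearly all of the reported usage; the small remainder is presumably compute buffers and other overhead.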

May 19

Replied to a topic by tootfsg › Local LLM › Thoughts on 5070 Ti model inference speed and local deployment

As you can see, unified memory is only suited to MoE models.
