xinference-run-llm: 使用 xinference 部署大模型。支持函数调用

使用平台

在autodl 上面使用。 https://www.autodl.com/create

选择 pytroch 2.1 版本，python3.10

先创建相关配置的容器，然后克隆本项目，执行运行某些模型脚本：

git clone https://gitee.com/fly-llm/xinference-run-llm.git

使用xinference的特点是

功能特点	Xinference	FastChat	OpenLLM	RayLLM
兼容 OpenAI 的 RESTful API	✅	✅	✅	✅
vLLM 集成	✅	✅	✅	✅
更多推理引擎（GGML、TensorRT）	✅	❌	✅	✅
更多平台支持（CPU、Metal）	✅	✅	❌	❌
分布式集群部署	✅	❌	❌	✅
图像模型（文生图）	✅	✅	❌	❌
文本嵌入模型	✅	❌	❌	❌
多模态模型	✅	❌	❌	❌
语音识别模型	✅	❌	❌	❌
更多 OpenAI 功能 (函数调用)	✅	❌	❌	❌

官网文档： https://inference.readthedocs.io/zh-cn/latest/getting_started/

github地址： https://github.com/xorbitsai/inference/blob/main/README_zh_CN.md

关于 ChatGML3 大模型

https://www.modelscope.cn/models/ZhipuAI/chatglm3-6b/summary

ChatGLM3-6B 是 ChatGLM 系列最新一代的开源模型，在保留了前两代模型对话流畅、部署门槛低等众多优秀特性的基础上，ChatGLM3-6B 引入了如下特性：

更强大的基础模型： ChatGLM3-6B 的基础模型 ChatGLM3-6B-Base 采用了更多样的训练数据、更充分的训练步数和更合理的训练策略。在语义、数学、推理、代码、知识等不同角度的数据集上测评显示，ChatGLM3-6B-Base 具有在 10B 以下的预训练模型中最强的性能。更完整的功能支持： ChatGLM3-6B 采用了全新设计的 Prompt 格式，除正常的多轮对话外。同时原生支持工具调用（Function Call）、代码执行（Code Interpreter）和 Agent 任务等复杂场景。更全面的开源序列：除了对话模型 ChatGLM3-6B 外，还开源了基础模型 ChatGLM-6B-Base、长文本对话模型 ChatGLM3-6B-32K。

执行:

bash run_xinference.sh

export XINFERENCE_ENDPOINT=http://127.0.0.1:6006
启动成功执行
# https://inference.readthedocs.io/zh-cn/latest/models/builtin/llm/chatglm3.html#
xinference launch --model-name chatglm3 --size-in-billions 6 --model-format pytorch --quantization 8-bit

关于 Baichuan2 大模型

https://www.modelscope.cn/models/baichuan-inc/Baichuan2-7B-Chat/summary

Baichuan 2 是百川智能推出的新一代开源大语言模型，采用 2.6 万亿 Tokens 的高质量语料训练。 Baichuan 2 在多个权威的中文、英文和多语言的通用、领域 benchmark 上取得同尺寸最佳的效果。本次发布包含有 7B、13B 的 Base 和 Chat 版本，并提供了 Chat 版本的 4bits 量化。

执行:

bash run_xinference.sh

export XINFERENCE_ENDPOINT=http://127.0.0.1:6006
启动成功执行
# https://inference.readthedocs.io/zh-cn/latest/models/builtin/llm/baichuan-2-chat.html
xinference launch --model-name baichuan-2-chat --size-in-billions 7 --model-format pytorch

关于通义千问-7B 大模型

https://www.modelscope.cn/models/qwen/Qwen-7B-Chat/summary

**通义千问-7B（Qwen-7B）**是阿里云研发的通义千问大模型系列的70亿参数规模的模型。Qwen-7B是基于Transformer的大语言模型, 在超大规模的预训练数据上进行训练得到。预训练数据类型多样，覆盖广泛，包括大量网络文本、专业书籍、代码等。同时，在Qwen-7B的基础上，我们使用对齐机制打造了基于大语言模型的AI助手Qwen-7B-Chat。相较于最初开源的Qwen-7B模型，我们现已将预训练模型和Chat模型更新到效果更优的版本。

执行:

bash run_xinference.sh

启动成功执行
# https://inference.readthedocs.io/zh-cn/latest/models/builtin/llm/qwen-chat.html

export XINFERENCE_ENDPOINT=http://127.0.0.1:6006

xinference launch --model-name qwen-chat --size-in-billions 7 --model-format pytorch --quantization 8-bit

xinference launch --model-name qwen-chat --size-in-billions 14 --model-format pytorch --quantization 8-bit

Yi-VL-6B 模型，需要显存13.7G

https://www.modelscope.cn/models/01ai/Yi-VL-6B/summary

# 不支持： --quantization 8-bit
xinference launch --model-name yi-vl-chat --size-in-billions 6 --model-format pytorch

Yi-VL-34B 模型，需要显存G

https://www.modelscope.cn/models/01ai/Yi-VL-34B/summary

# 不支持： --quantization 8-bit
xinference launch --model-name yi-vl-chat --size-in-billions 34 --model-format pytorch

Qwen-VL-Chat 模型，需要显存18.7G

https://www.modelscope.cn/models/qwen/Qwen-VL-Chat/summary

https://inference.readthedocs.io/zh-cn/latest/models/model_abilities/vision.html

# 不支持： --quantization 8-bit
xinference launch --model-name qwen-vl-chat --size-in-billions 7 --model-format pytorch

embedding 模型

https://modelscope.cn/models/Xorbits/bge-large-zh-v1.5/summary

文档： https://inference.readthedocs.io/zh-cn/latest/models/builtin/embedding/bge-large-zh.html

export XINFERENCE_ENDPOINT=http://127.0.0.1:6006
xinference launch --model-name bge-large-zh --model-type embedding

rerank 模型

https://modelscope.cn/models/Xorbits/bge-reranker-large/summary

文档 https://inference.readthedocs.io/zh-cn/latest/models/builtin/rerank/bge-reranker-large.html

export XINFERENCE_ENDPOINT=http://127.0.0.1:6006
xinference launch --model-name bge-reranker-large --model-type rerank

audio 模型

https://inference.readthedocs.io/zh-cn/latest/user_guide/client_api.html#audio

export XINFERENCE_ENDPOINT=http://127.0.0.1:6006

xinference launch --model-uid whisper-1 --model-name whisper-large-v3 --model-type audio

xinference launch --model-uid whisper-1 --model-name whisper-tiny --model-type audio

图片模型

https://inference.readthedocs.io/zh-cn/latest/user_guide/client_api.html#audio

export XINFERENCE_ENDPOINT=http://127.0.0.1:6006
xinference launch --model-name sdxl-turbo --model-type image

xinference launch --model-name sd-turbo --model-type image

fly-llm / xinference-run-llm

使用平台

关于 ChatGML3 大模型

关于 Baichuan2 大模型

关于通义千问-7B 大模型

Yi-VL-6B 模型，需要显存13.7G

Yi-VL-34B 模型，需要显存G

Qwen-VL-Chat 模型，需要显存18.7G

embedding 模型

rerank 模型

audio 模型

图片模型

简介

发行版

贡献者

近期动态

fly-llm / xinference-run-llm .gitee-modal { width: 500px !important; }

使用平台

关于 ChatGML3 大模型

关于 Baichuan2 大模型

关于 通义千问-7B 大模型

Yi-VL-6B 模型，需要显存13.7G

Yi-VL-34B 模型，需要显存G

Qwen-VL-Chat 模型，需要显存18.7G

embedding 模型

rerank 模型

audio 模型

图片 模型

简介

发行版

贡献者

近期动态

搜索帮助

fly-llm / xinference-run-llm

关于通义千问-7B 大模型

图片模型