This repository is a mirror intended to improve download speed within China and is synchronized once a day. Original repository: https://github.com/google/sentencepiece
Taku Kudo committed on 2017-03-07 19:43: Initialize repository

Use custom symbols

A SentencePiece model supports two types of special symbols.

Control symbol

Control symbols encode special indicators that let the decoder change its behavior dynamically. Examples include the language indicators in multilingual models. <s> and </s> are reserved control symbols. Control symbols must be inserted outside of the SentencePiece segmentation: developers are responsible for inserting them during data generation and decoding.

It is guaranteed that control symbols have no corresponding surface strings in the original user input. Control symbols are decoded into empty strings.
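The behavior described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the SentencePiece implementation: the helper names `add_control_symbols` and `decode` and the language indicator `<2ja>` are hypothetical, and the pieces are written by hand rather than produced by a trained model.

```python
# Toy illustration only; this is not the SentencePiece implementation.
# "<2ja>" is a hypothetical language indicator chosen for this example.
CONTROL_SYMBOLS = {"<s>", "</s>", "<2ja>"}


def add_control_symbols(pieces, lang_tag="<2ja>"):
    """Frame already-segmented pieces with control symbols.

    This happens outside segmentation: the caller inserts the symbols.
    """
    return [lang_tag, "<s>", *pieces, "</s>"]


def decode(pieces):
    """Join pieces into text; control symbols contribute empty strings."""
    text = "".join("" if p in CONTROL_SYMBOLS else p for p in pieces)
    # U+2581 is the whitespace marker used in SentencePiece pieces.
    return text.replace("\u2581", " ").strip()


pieces = ["\u2581Hello", "\u2581world"]  # hand-written example pieces
framed = add_control_symbols(pieces)
decoded = decode(framed)
print(framed)
print(decoded)
```

The point of the sketch is that the control symbols appear only in the piece sequence and leave no trace in the decoded text.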

User defined symbol

A user-defined symbol is handled as a single piece in any context. If the symbol appears in the input text, it is always extracted as one piece.
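As a rough illustration of "always extracted as one piece", the following sketch isolates user-defined symbols from the surrounding text before any further segmentation. It is an assumption-laden toy: `split_on_user_symbols` is a hypothetical helper, and the real library handles this inside its own segmentation rather than with a regular expression.

```python
import re


def split_on_user_symbols(text, user_symbols):
    """Split text so that each user-defined symbol becomes its own chunk.

    Hypothetical helper for illustration; not part of SentencePiece.
    """
    if not user_symbols:
        return [text] if text else []
    # Try longer symbols first so overlapping symbols match greedily.
    ordered = sorted(user_symbols, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(s) for s in ordered) + ")"
    # A capturing group makes re.split keep the matched symbols.
    return [chunk for chunk in re.split(pattern, text) if chunk]


chunks = split_on_user_symbols("call<user1>now<user2>", ["<user1>", "<user2>"])
print(chunks)
```

In this sketch the remaining chunks ("call", "now") would still be segmented normally, while each symbol stays intact regardless of its neighbors.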

Specify special symbols at training time

Use the --control_symbols and --user_defined_symbols flags as follows:

% spm_train --control_symbols=<foo>,<bar> --user_defined_symbols=<user1>,<user2> --input=<input file> --model_prefix=<model file> --vocab_size=8000
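When the command is assembled from a script, the angle brackets in the symbol names need quoting, since most shells treat < and > as redirections. The sketch below builds the argument list with only the flags shown above; `build_spm_train_args` is a hypothetical helper, and `corpus.txt` and `m` are placeholder values for the input file and model prefix.

```python
import shlex


def build_spm_train_args(input_path, model_prefix, vocab_size,
                         control_symbols=(), user_defined_symbols=()):
    """Build an spm_train argument list (hypothetical helper)."""
    args = [
        "spm_train",
        f"--input={input_path}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
    ]
    # Both flags take a comma-separated list of symbols.
    if control_symbols:
        args.append("--control_symbols=" + ",".join(control_symbols))
    if user_defined_symbols:
        args.append("--user_defined_symbols=" + ",".join(user_defined_symbols))
    return args


# "corpus.txt" and "m" are placeholders, not files from the original document.
args = build_spm_train_args(
    "corpus.txt", "m", 8000,
    control_symbols=["<foo>", "<bar>"],
    user_defined_symbols=["<user1>", "<user2>"],
)
# shlex.join quotes arguments that contain shell-special characters.
command = shlex.join(args)
print(command)
```

Passing `args` as a list to `subprocess.run` would avoid the shell entirely; the quoted `command` string is only useful when the command has to be pasted into a terminal.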