A SentencePiece model supports two types of special symbols: control symbols and user-defined symbols.
Control symbols encode special indicators that let the decoder change its behavior dynamically.
Examples include the language indicators in multilingual models. <s> and </s> are reserved control symbols.
Control symbols must be inserted outside of the SentencePiece segmentation. Developers are responsible for inserting these symbols during data generation and decoding.
Control symbols are guaranteed to have no corresponding surface strings in the original user input, and they are decoded into empty strings.
A user-defined symbol is handled as one piece in any context. If the symbol appears in the input text, it is always extracted as one piece.
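The two behaviors described above can be illustrated with a toy sketch. This is not SentencePiece's implementation and does not call the library: the symbol lists and the helper names `split_user_defined` and `decode` are hypothetical, and whitespace handling and subword segmentation are left out entirely. It only shows that a user-defined symbol comes out as one piece wherever it appears, and that control symbols decode to empty strings.

```python
import re

# Hypothetical symbol sets, chosen for illustration only.
USER_DEFINED = ["<user1>", "<user2>"]
CONTROL = {"<s>", "</s>", "<foo>", "<bar>"}


def split_user_defined(text, symbols=USER_DEFINED):
    """Split text so that every user-defined symbol is its own chunk."""
    # Try longer symbols first so that overlapping symbols prefer the longest match.
    ordered = sorted(symbols, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(s) for s in ordered) + ")"
    # The capturing group keeps the matched symbols in the result list.
    return [chunk for chunk in re.split(pattern, text) if chunk]


def decode(pieces, control=CONTROL):
    """Join pieces back into text, mapping control symbols to empty strings."""
    return "".join("" if piece in control else piece for piece in pieces)


chunks = split_user_defined("hello<user1>world")
print(chunks)  # ['hello', '<user1>', 'world']

# Control symbols are added by the developer, outside of segmentation.
print(decode(["<s>"] + chunks + ["</s>"]))  # hello<user1>world
```

In a real model, the chunks between user-defined symbols would be segmented further into subword pieces. The sketch leaves them whole to keep the focus on the special symbols.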
Use the --control_symbols and --user_defined_symbols flags as follows:
% spm_train --control_symbols=<foo>,<bar> --user_defined_symbols=<user1>,<user2> --input=<input file> --model_prefix=<model file> --vocab_size=8000
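When the command is typed into a POSIX shell, characters such as < and > are normally treated as redirection operators, so symbol names like <foo> need quoting. One way to sidestep this is to build the argument list in a script and pass it to the program directly. The sketch below assumes `spm_train` is on the PATH; the helper name `build_spm_train_argv` and the file names `corpus.txt` and `m` are placeholders rather than anything SentencePiece defines. It uses only the flags shown in the command above.

```python
import shlex


def build_spm_train_argv(input_path, model_prefix, vocab_size,
                         control_symbols=(), user_defined_symbols=()):
    """Build an argument list for spm_train (hypothetical helper)."""
    argv = [
        "spm_train",
        f"--input={input_path}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
    ]
    # Both flags take a comma-separated list of symbols.
    if control_symbols:
        argv.append("--control_symbols=" + ",".join(control_symbols))
    if user_defined_symbols:
        argv.append("--user_defined_symbols=" + ",".join(user_defined_symbols))
    return argv


argv = build_spm_train_argv(
    "corpus.txt",  # placeholder input file
    "m",           # placeholder model prefix
    8000,
    control_symbols=["<foo>", "<bar>"],
    user_defined_symbols=["<user1>", "<user2>"],
)

# shlex.join shows the shell-safe form, with the angle brackets quoted.
print(shlex.join(argv))

# To actually run the training, assuming spm_train is installed:
# import subprocess
# subprocess.run(argv, check=True)
```

Passing the list to `subprocess.run` without `shell=True` hands each argument to the program unchanged, so no quoting is needed in that case.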