Unicode Text Segmentation for Go

This Go package implements Unicode Text Segmentation according to Unicode Standard Annex #29 and Unicode Line Breaking according to Unicode Standard Annex #14 (Unicode version 14.0.0).

Background

Grapheme Clusters

In Go, strings are read-only slices of bytes. They can be turned into Unicode code points using a for range loop or by converting: []rune(str). However, multiple code points may be combined into one user-perceived character, which the Unicode specification calls a "grapheme cluster". Here are some examples:

| String | Bytes (UTF-8) | Code points (runes) | Grapheme clusters |
|---|---|---|---|
| Käse | 6 bytes: 4b 61 cc 88 73 65 | 5 code points: 4b 61 308 73 65 | 4 clusters: [4b],[61 308],[73],[65] |
| 🏳️‍🌈 | 14 bytes: f0 9f 8f b3 ef b8 8f e2 80 8d f0 9f 8c 88 | 4 code points: 1f3f3 fe0f 200d 1f308 | 1 cluster: [1f3f3 fe0f 200d 1f308] |
| 🇩🇪 | 8 bytes: f0 9f 87 a9 f0 9f 87 aa | 2 code points: 1f1e9 1f1ea | 1 cluster: [1f1e9 1f1ea] |

This package provides tools to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.
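
The byte and code point counts for the example strings can be reproduced with the standard library alone, which also shows why it is not enough: neither number matches the count of user-perceived characters. The following is a minimal sketch; the helper name `sizes` is hypothetical and not part of this package, and the strings are written with escape sequences so that the combining characters are visible.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes is a hypothetical helper (not part of uniseg). It returns the number
// of UTF-8 bytes and the number of code points (runes) in s. It knows nothing
// about grapheme clusters.
func sizes(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	samples := []struct{ name, text string }{
		{"Käse (a + combining diaeresis)", "Ka\u0308se"},
		{"rainbow flag", "\U0001F3F3\uFE0F\u200D\U0001F308"},
		{"German flag", "\U0001F1E9\U0001F1EA"},
	}
	for _, s := range samples {
		b, r := sizes(s.text)
		// %x on a []rune prints the code points in hexadecimal.
		fmt.Printf("%s: %d bytes, %d code points: %x\n", s.name, b, r, []rune(s.text))
	}
}
```

The two flags are reported as 4 and 2 code points respectively, although each is displayed as a single character. Closing that gap is what grapheme cluster segmentation is for.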

Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection), cursor movement ("move to next word" control-arrow keys), and the dialog option "Whole Word Search" for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another. Searching may also use word boundaries in determining matching items. This package provides tools to determine word boundaries within strings.

Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of selecting or iterating through blocks of text that are larger than single words. They are also used to determine whether words occur within the same sentence in database queries. This package provides tools to determine sentence boundaries within strings.

Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. This package provides tools to determine where a string may or may not be broken and where it must be broken (for example after newline characters).
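
To make the difference between "may break" and "must break" concrete, here is a deliberately naive sketch that uses only the standard library. It is not this package's algorithm and does not implement Unicode Standard Annex #14: it assumes plain space-separated text, allows a break after a run of spaces, and forces a break after a newline and at the end of the text. The helper name `markBreaks` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// markBreaks is a hypothetical helper (not part of uniseg). It copies s and
// inserts "|" where a line may be broken and "‖" where it must be broken,
// using two naive rules instead of the real UAX #14 rules.
func markBreaks(s string) string {
	var b strings.Builder
	runes := []rune(s)
	for i, r := range runes {
		b.WriteRune(r)
		last := i == len(runes)-1
		switch {
		case r == '\n' || last:
			// Mandatory break: after a newline and at the end of the text.
			b.WriteString("‖")
		case r == ' ' && runes[i+1] != ' ' && runes[i+1] != '\n':
			// Optional break: after the last space of a run of spaces.
			b.WriteString("|")
		}
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", markBreaks("First line.\nSecond line."))
}
```

Real text needs far more than these two rules (non-breaking spaces, hyphens, scripts without spaces, emoji sequences), which is why the package follows the Unicode line breaking rules instead.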

Installation

go get github.com/rivo/uniseg

Examples

Counting Characters in a String

n := uniseg.GraphemeClusterCount("🇩🇪🏳️‍🌈")
fmt.Println(n)
// 2

Using the Graphemes Type

This is the most convenient method of iterating over grapheme clusters:

gr := uniseg.NewGraphemes("👍🏼!")
for gr.Next() {
	fmt.Printf("%x ", gr.Runes())
}
// [1f44d 1f3fc] [21]

Using the Step or StepString Function

This is orders of magnitude faster than the Graphemes type, but it requires the handling of states and boundaries:

str := "🇩🇪🏳️‍🌈"
state := -1
var c string
for len(str) > 0 {
	c, str, _, state = uniseg.StepString(str, state)
	fmt.Printf("%x ", []rune(c))
}
// [1f1e9 1f1ea] [1f3f3 fe0f 200d 1f308]

Advanced Examples

Breaking into grapheme clusters and evaluating line breaks:

str := "First line.\nSecond line."
state := -1
var (
	c          string
	boundaries int
)
for len(str) > 0 {
	c, str, boundaries, state = uniseg.StepString(str, state)
	fmt.Print(c)
	if boundaries&uniseg.MaskLine == uniseg.LineCanBreak {
		fmt.Print("|")
	} else if boundaries&uniseg.MaskLine == uniseg.LineMustBreak {
		fmt.Print("‖")
	}
}
// First |line.
// ‖Second |line.‖

If you're only interested in word segmentation, use FirstWord or FirstWordInString:

str := "Hello, world!"
state := -1
var c string
for len(str) > 0 {
	c, str, state = uniseg.FirstWordInString(str, state)
	fmt.Printf("(%s)\n", c)
}
// (Hello)
// (,)
// ( )
// (world)
// (!)

Similarly, use

Documentation

Refer to https://pkg.go.dev/github.com/rivo/uniseg for the package's documentation.

Dependencies

This package does not depend on any packages outside the standard library.

Sponsor this Project

Become a Sponsor on GitHub to support this project!

Your Feedback

Open an issue on GitHub, preferably before submitting any PRs. Feel free to get in touch if you have any questions.

MIT License

Copyright (c) 2019 Oliver Kuederle

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
