diff --git a/Readme.md b/Readme.md
index 5ed80ed..afcf1c1 100644
--- a/Readme.md
+++ b/Readme.md
@@ -1,37 +1,31 @@
-## ideaseg
 
-`ideaseg` 是一个基于最新的 [HanLP](https://github.com/hankcs/HanLP/tree/1.x) 自然语言处理工具包实现的中文分词器,
-包含了最新的模型数据,同时移除了 HanLP 所包含的非商业友好许可的 [NeuralNetworkParser](https://github.com/hankcs/HanLP/issues/644) 相关代码和数据。
+ideaseg is a Chinese tokenizer based on the latest [hanlp](https://github.com/hankcs/hanlp/tree/1.x) natural language processing toolkit, which includes the latest model data and removes the non-commercial friendly license related [neuralnetworkparser](https://github.com/hankcs/hanlp/issues/644) code and data contained in hanlp.
 
-`HanLP` 相比其他诸如 `IK`、`jcseg` 等分词器而言,在分词的准确率上有巨大的提升,但速度上有所牺牲。
-通过对 `HanLP` 进行优化配置,`ideaseg` 在准确度和分词速度上取得了最佳的平衡。
+Compared with other tokenizers such as ik, jcseg, hanlp greatly improves the accuracy of tokenization, but sacrifices the speed. Through optimization and configuration of hanlp, ideaseg has achieved the best balance in accuracy and tokenization speed.
 
-与其他基于 `HanLP` 的插件相比,`ideaseg` 同步了最新 `HanLP` 的代码和数据,去除了无法商用的相关内容;实现了自动配置;
-包含了模型数据,无需自行下载,使用简单方便。
+Compared with other plugins based on hanlp, ideaseg synchronizes the latest hanlp code and data, removes the content that cannot be used commercially; implements automatic configuration; contains model data, no need to download by yourself, simple and convenient to use.
 
-`ideaseg` 提供三个模块包括:
+ideaseg provides three modules including:
 
-1. `core` ~ 核心分词器模块
-2. `elasticsearch` ~ ElasticSearch 的 ideaseg 分词插件 (最高支持 7.10.2 版本)
-3. `opensearch` ~ OpenSearch 的 ideaseg 分词插件 (默认版本 2.4.1)
+1. `core` ~ core tokenizer module
+2. `elasticsearch` ~ ideaseg tokenizer plugin for elasticsearch (up to version 7.10.2)
+3. `opensearch` ~ ideaseg tokenizer plugin for opensearch (default version 2.4.1)
 
-**关于 `ElasticSearch` 的版本说明,由于从 7.11.1 版本开始 Elastic 修改 ES 的许可证,同时修改了插件的权限策略,
-不再允许插件对文件进行读写。由于 `HanLP` 本身的模型数据很大,为了提升速度其处理机制需要在插件的数据目录下生成一些相当于缓存的文件。
-因此,如果你使用的是 `ElasticSearch` 请尽量用 7.10.2 或者以下的版本,推荐使用 `OpenSearch` 。**
+**Note about the version of elasticsearch. Since version 7.11.1, elastic has modified the license of es and changed the permission policy of plugins. It no longer allows plugins to read and write files. Because the model data of hanlp itself is very large, in order to improve the speed, its processing mechanism needs to generate some files in the data directory of the plugin as caches. So if you are using elasticsearch, please try to use version 7.10.2 or lower, and recommend using opensearch.**
 
-此外 `data` 包含 `HanLP` 的模型数据。
+In addition, the data folder contains model data of hanlp.
 
-由于包含数据模型体积较大(打包后四五百兆),再加上 `ElasticSearch` 的插件机制严格绑定引擎本身的版本,而且版本众多,因此本项目不提供预编译的二进制版本,你需要自行下载源码进行构建。
+Because the volume of the data model is large (400-500M after packaging), and the plugin mechanism of elasticsearch is strictly bound to the version of the engine itself, and the versions are numerous, this project does not provide pre-compiled binary versions, so you need to download the source code for building.
 
-### 构建
+### Building
 
-以下是插件的构建过程,在开始之前请先安装好 `git`、`java`、`maven` 等相关工具。
+The following is the process of building the plugin. Before starting, please install git, java, maven and other related tools.
 
-首先确定你的 `ElasticSearch` 或者 `OpenSearch` 的具体版本,假设你使用的是 `ElasticSearch` 7.10.2 版本,
-请使用文本编辑器打开 `ideaseg/elasticsearch/pom.xml` 文件,修改 `elasticsearch.version` 对应的值为 `7.10.2`
-(如果是 `OpenSearch` 请修改 `opensearch/pom.xml`)。
+First, determine the specific version of your elasticsearch or opensearch, assuming you are using elasticsearch 7.10.2,
+open the `ideaseg/elasticsearch/pom.xml` file with a text editor, and modify the value of `elasticsearch.version` to `7.10.2`
+(if it is opensearch, please modify `opensearch/pom.xml`).
 
-保存文件并打开命令行窗口,执行如下命令开始构建:
+Save the file and open the command line window, and execute the following command to start building:
 
 ```shell
 $ git clone https://gitee.com/indexea/ideaseg
@@ -39,33 +33,33 @@ $ cd ideaseg
 $ mvn install
 ```
 
-构建完成后,将在 `elasticsearch/target` 和 `opensearch/target` 各生成两个插件文件为 `ideaseg.zip` 。
+After the build is completed, two plugin files `ideaseg.zip` will be generated in `elasticsearch/target` and `opensearch/target` respectively.
 
-### 安装
+### Installation
 
-构建完成后,我们可以利用 `ElasticSearch` 或 `OpenSearch` 提供的插件管理工具进行安装。
+After the build is completed, we can use the plugin management tool provided by elasticsearch or opensearch to install.
 
-`ElasticSearch` 对应的插件管理工具为 `/bin/elasticsearch-plugin` ,
-而 `OpenSearch` 对应的管理工具为 `/bin/opensearch-plugin`。
-其中 `` 和 `` 为两个服务安装后所在的目录。
+The corresponding plugin management tool for elasticsearch is `/bin/elasticsearch-plugin`,
+while the corresponding management tool for opensearch is `/bin/opensearch-plugin`.
+The `` and `` are the respective directories of the two services after installation.
 
-#### ElasticSearch 安装 ideaseg 插件
+#### Install ideaseg plugin for elasticsearch
 
 ```shell
 $ bin/elasticsearch-plugin install file:////elasticsearch/target/ideaseg.zip
 ```
 
-#### OpenSearch 安装 ideaseg 插件
+#### Install ideaseg plugin for opensearch
 
 ```shell
 $ bin/opensearch-plugin install file:////opensearch/target/ideaseg.zip
 ```
 
-其中 `` 为 `ideaseg` 源码所在的路径。要特别注意到是路径前必须有 `file://` ,如果是 Windows 系统,则需要路径前添加 `file:///` ,例如 `file:///D:\WORKDIR\Indexea\ideaseg\elasticsearch\target\ideaseg.zip`。
+where `` is the path to the `ideaseg` source code. Please note that the path must have `file://` before it. If it is a windows system, the path needs to be added with `file:///`, such as `file:///d:\workdir\indexea\ideaseg\elasticsearch\target\ideaseg.zip`.
 
-安装过程会询问插件所需的权限,回车确认即可完成安装,安装完毕需要重启服务才能让插件生效。
+During the installation process, the plugin will prompt for permissions, just press enter to confirm to complete the installation, and restart the service after the installation.
 
-接下来你可以使用分词测试工具来对插件进行测试,如下所示:
+Next, you can use the word segmentation test tool to test the plugin as follows:
 
 ```
 POST _analyze
@@ -75,12 +69,12 @@ POST _analyze
 }
 ```
 
-关于分词测试的详情请参考 [ElasticSearch 官方文档](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/test-analyzer.html)。
+For more information on word segmentation testing, please refer to [ElasticSearch Documentation](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/test-analyzer.html)。
 
-### 反馈问题
+### Feedback
 
-如果你在使用 `ideaseg` 过程中有任何问题,请通过 [Issues](https://gitee.com/indexea/ideaseg/issues) 提出。
+If you have any questions about using 'ideaseg', please raise them via [Issues](https://gitee.com/indexea/ideaseg/issues).
 
-### 特别感谢
+### Special thanks
 
 https://github.com/KennFalcon/elasticsearch-analysis-hanlp
\ No newline at end of file
diff --git a/Readme_en.md b/Readme_en.md
deleted file mode 100644
index afcf1c1..0000000
--- a/Readme_en.md
+++ /dev/null
@@ -1,80 +0,0 @@
-
-ideaseg is a Chinese tokenizer based on the latest [hanlp](https://github.com/hankcs/hanlp/tree/1.x) natural language processing toolkit, which includes the latest model data and removes the non-commercial friendly license related [neuralnetworkparser](https://github.com/hankcs/hanlp/issues/644) code and data contained in hanlp.
-
-Compared with other tokenizers such as ik, jcseg, hanlp greatly improves the accuracy of tokenization, but sacrifices the speed. Through optimization and configuration of hanlp, ideaseg has achieved the best balance in accuracy and tokenization speed.
-
-Compared with other plugins based on hanlp, ideaseg synchronizes the latest hanlp code and data, removes the content that cannot be used commercially; implements automatic configuration; contains model data, no need to download by yourself, simple and convenient to use.
-
-ideaseg provides three modules including:
-
-1. `core` ~ core tokenizer module
-2. `elasticsearch` ~ ideaseg tokenizer plugin for elasticsearch (up to version 7.10.2)
-3. `opensearch` ~ ideaseg tokenizer plugin for opensearch (default version 2.4.1)
-
-**Note about the version of elasticsearch. Since version 7.11.1, elastic has modified the license of es and changed the permission policy of plugins. It no longer allows plugins to read and write files. Because the model data of hanlp itself is very large, in order to improve the speed, its processing mechanism needs to generate some files in the data directory of the plugin as caches. So if you are using elasticsearch, please try to use version 7.10.2 or lower, and recommend using opensearch.**
-
-In addition, the data folder contains model data of hanlp.
-
-Because the volume of the data model is large (400-500M after packaging), and the plugin mechanism of elasticsearch is strictly bound to the version of the engine itself, and the versions are numerous, this project does not provide pre-compiled binary versions, so you need to download the source code for building.
-
-### Building
-
-The following is the process of building the plugin. Before starting, please install git, java, maven and other related tools.
-
-First, determine the specific version of your elasticsearch or opensearch, assuming you are using elasticsearch 7.10.2,
-open the `ideaseg/elasticsearch/pom.xml` file with a text editor, and modify the value of `elasticsearch.version` to `7.10.2`
-(if it is opensearch, please modify `opensearch/pom.xml`).
-
-Save the file and open the command line window, and execute the following command to start building:
-
-```shell
-$ git clone https://gitee.com/indexea/ideaseg
-$ cd ideaseg
-$ mvn install
-```
-
-After the build is completed, two plugin files `ideaseg.zip` will be generated in `elasticsearch/target` and `opensearch/target` respectively.
-
-### Installation
-
-After the build is completed, we can use the plugin management tool provided by elasticsearch or opensearch to install.
-
-The corresponding plugin management tool for elasticsearch is `/bin/elasticsearch-plugin`,
-while the corresponding management tool for opensearch is `/bin/opensearch-plugin`.
-The `` and `` are the respective directories of the two services after installation.
-
-#### Install ideaseg plugin for elasticsearch
-
-```shell
-$ bin/elasticsearch-plugin install file:////elasticsearch/target/ideaseg.zip
-```
-
-#### Install ideaseg plugin for opensearch
-
-```shell
-$ bin/opensearch-plugin install file:////opensearch/target/ideaseg.zip
-```
-
-where `` is the path to the `ideaseg` source code. Please note that the path must have `file://` before it. If it is a windows system, the path needs to be added with `file:///`, such as `file:///d:\workdir\indexea\ideaseg\elasticsearch\target\ideaseg.zip`.
-
-During the installation process, the plugin will prompt for permissions, just press enter to confirm to complete the installation, and restart the service after the installation.
-
-Next, you can use the word segmentation test tool to test the plugin as follows:
-
-```
-POST _analyze
-{
-  "analyzer": "ideaseg",
-  "text": "你好,我用的是 ideaseg 分词插件。"
-}
-```
-
-For more information on word segmentation testing, please refer to [ElasticSearch Documentation](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/test-analyzer.html)。
-
-### Feedback
-
-If you have any questions about using 'ideaseg', please raise them via [Issues](https://gitee.com/indexea/ideaseg/issues).
-
-### Special thanks
-
-https://github.com/KennFalcon/elasticsearch-analysis-hanlp
\ No newline at end of file
diff --git a/Readme_zh.md b/Readme_zh.md
new file mode 100644
index 0000000..5ed80ed
--- /dev/null
+++ b/Readme_zh.md
@@ -0,0 +1,86 @@
+## ideaseg
+
+`ideaseg` 是一个基于最新的 [HanLP](https://github.com/hankcs/HanLP/tree/1.x) 自然语言处理工具包实现的中文分词器,
+包含了最新的模型数据,同时移除了 HanLP 所包含的非商业友好许可的 [NeuralNetworkParser](https://github.com/hankcs/HanLP/issues/644) 相关代码和数据。
+
+`HanLP` 相比其他诸如 `IK`、`jcseg` 等分词器而言,在分词的准确率上有巨大的提升,但速度上有所牺牲。
+通过对 `HanLP` 进行优化配置,`ideaseg` 在准确度和分词速度上取得了最佳的平衡。
+
+与其他基于 `HanLP` 的插件相比,`ideaseg` 同步了最新 `HanLP` 的代码和数据,去除了无法商用的相关内容;实现了自动配置;
+包含了模型数据,无需自行下载,使用简单方便。
+
+`ideaseg` 提供三个模块包括:
+
+1. `core` ~ 核心分词器模块
+2. `elasticsearch` ~ ElasticSearch 的 ideaseg 分词插件 (最高支持 7.10.2 版本)
+3. `opensearch` ~ OpenSearch 的 ideaseg 分词插件 (默认版本 2.4.1)
+
+**关于 `ElasticSearch` 的版本说明,由于从 7.11.1 版本开始 Elastic 修改 ES 的许可证,同时修改了插件的权限策略,
+不再允许插件对文件进行读写。由于 `HanLP` 本身的模型数据很大,为了提升速度其处理机制需要在插件的数据目录下生成一些相当于缓存的文件。
+因此,如果你使用的是 `ElasticSearch` 请尽量用 7.10.2 或者以下的版本,推荐使用 `OpenSearch` 。**
+
+此外 `data` 包含 `HanLP` 的模型数据。
+
+由于包含数据模型体积较大(打包后四五百兆),再加上 `ElasticSearch` 的插件机制严格绑定引擎本身的版本,而且版本众多,因此本项目不提供预编译的二进制版本,你需要自行下载源码进行构建。
+
+### 构建
+
+以下是插件的构建过程,在开始之前请先安装好 `git`、`java`、`maven` 等相关工具。
+
+首先确定你的 `ElasticSearch` 或者 `OpenSearch` 的具体版本,假设你使用的是 `ElasticSearch` 7.10.2 版本,
+请使用文本编辑器打开 `ideaseg/elasticsearch/pom.xml` 文件,修改 `elasticsearch.version` 对应的值为 `7.10.2`
+(如果是 `OpenSearch` 请修改 `opensearch/pom.xml`)。
+
+保存文件并打开命令行窗口,执行如下命令开始构建:
+
+```shell
+$ git clone https://gitee.com/indexea/ideaseg
+$ cd ideaseg
+$ mvn install
+```
+
+构建完成后,将在 `elasticsearch/target` 和 `opensearch/target` 各生成两个插件文件为 `ideaseg.zip` 。
+
+### 安装
+
+构建完成后,我们可以利用 `ElasticSearch` 或 `OpenSearch` 提供的插件管理工具进行安装。
+
+`ElasticSearch` 对应的插件管理工具为 `/bin/elasticsearch-plugin` ,
+而 `OpenSearch` 对应的管理工具为 `/bin/opensearch-plugin`。
+其中 `` 和 `` 为两个服务安装后所在的目录。
+
+#### ElasticSearch 安装 ideaseg 插件
+
+```shell
+$ bin/elasticsearch-plugin install file:////elasticsearch/target/ideaseg.zip
+```
+
+#### OpenSearch 安装 ideaseg 插件
+
+```shell
+$ bin/opensearch-plugin install file:////opensearch/target/ideaseg.zip
+```
+
+其中 `` 为 `ideaseg` 源码所在的路径。要特别注意到是路径前必须有 `file://` ,如果是 Windows 系统,则需要路径前添加 `file:///` ,例如 `file:///D:\WORKDIR\Indexea\ideaseg\elasticsearch\target\ideaseg.zip`。
+
+安装过程会询问插件所需的权限,回车确认即可完成安装,安装完毕需要重启服务才能让插件生效。
+
+接下来你可以使用分词测试工具来对插件进行测试,如下所示:
+
+```
+POST _analyze
+{
+  "analyzer": "ideaseg",
+  "text": "你好,我用的是 ideaseg 分词插件。"
+}
+```
+
+关于分词测试的详情请参考 [ElasticSearch 官方文档](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/test-analyzer.html)。
+
+### 反馈问题
+
+如果你在使用 `ideaseg` 过程中有任何问题,请通过 [Issues](https://gitee.com/indexea/ideaseg/issues) 提出。
+
+### 特别感谢
+
+https://github.com/KennFalcon/elasticsearch-analysis-hanlp
\ No newline at end of file
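
The `file://` rule in the installation notes of this README (a `file://` prefix before the path, `file:///` on Windows) is easy to get wrong by hand, so it may help to sketch it as a small shell function. This is only an illustration: the function name `ideaseg_plugin_url` and its two arguments are hypothetical and not part of ideaseg, and joining a Windows-style path with forward slashes is an untested assumption.

```shell
# Hypothetical helper (not part of ideaseg): build the file:// URL that the
# plugin tools are given, from the source checkout path and a module name.
ideaseg_plugin_url() {
  src_dir=$1   # path to the ideaseg source checkout
  module=$2    # "elasticsearch" or "opensearch"
  case "$src_dir" in
    /*) prefix='file://' ;;    # absolute POSIX path already starts with "/"
    *)  prefix='file:///' ;;   # e.g. a Windows drive path such as D:\WORKDIR\...
  esac
  # %s prints the arguments literally, so backslashes in the path are kept.
  printf '%s%s/%s/target/ideaseg.zip\n' "$prefix" "$src_dir" "$module"
}

# Example, run from the Elasticsearch installation directory:
# bin/elasticsearch-plugin install "$(ideaseg_plugin_url "$HOME/src/ideaseg" elasticsearch)"
```

The install command itself is left commented out so the block can be sourced without touching a running installation. The README's own requirement that the service be restarted after installation still applies.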