ideaseg is a Chinese tokenizer based on the latest [hanlp](https://github.com/hankcs/hanlp/tree/1.x) natural language processing toolkit. It includes the latest model data and removes the [neuralnetworkparser](https://github.com/hankcs/hanlp/issues/644) code and data contained in hanlp, whose license does not permit commercial use.

Compared with other tokenizers such as ik and jcseg, hanlp greatly improves tokenization accuracy at the cost of speed. ideaseg optimizes and configures hanlp to strike a better balance between accuracy and tokenization speed.

Compared with other plugins based on hanlp, ideaseg:

- synchronizes the latest hanlp code and data, and removes the content that cannot be used commercially;
- implements automatic configuration;
- bundles the model data, so there is nothing to download separately.

ideaseg provides three modules:

1. `core` ~ core tokenizer module
2. `elasticsearch` ~ ideaseg tokenizer plugin for elasticsearch (up to version 7.10.2)
3. `opensearch` ~ ideaseg tokenizer plugin for opensearch (default version 2.4.1)

**Note about the version of elasticsearch: since version 7.11.1, elastic has modified the license of es and changed the permission policy of plugins, which are no longer allowed to read and write files. Because the model data of hanlp is very large, its processing mechanism generates cache files in the data directory of the plugin to improve speed. So if you are using elasticsearch, please use version 7.10.2 or lower; opensearch is recommended.**

In addition, the `data` folder contains the model data of hanlp.

The model data is large (400 to 500 MB after packaging), and the plugin mechanism of elasticsearch is strictly bound to the version of the engine itself, of which there are many. For these reasons this project does not provide pre-compiled binaries, and you need to download the source code and build it yourself.
### Building

The following is the process of building the plugin. Before starting, please install git, java, maven and other related tools.

First, determine the specific version of your elasticsearch or opensearch. Assuming you are using elasticsearch 7.10.2, open the `ideaseg/elasticsearch/pom.xml` file with a text editor and change the value of `elasticsearch.version` to `7.10.2` (for opensearch, modify `opensearch/pom.xml` instead).

Save the file, open a command line window, and execute the following commands to start the build:

```shell
$ git clone https://gitee.com/indexea/ideaseg
$ cd ideaseg
$ mvn install
```

After the build is completed, a plugin file named `ideaseg.zip` will be generated in each of `elasticsearch/target` and `opensearch/target`.

### Installation

After the build is completed, use the plugin management tool provided by elasticsearch or opensearch to install the plugin. For elasticsearch the tool is `bin/elasticsearch-plugin`, and for opensearch it is `bin/opensearch-plugin`, each located under the installation directory of the respective service.

#### Install ideaseg plugin for elasticsearch

```shell
$ bin/elasticsearch-plugin install file:////elasticsearch/target/ideaseg.zip
```

#### Install ideaseg plugin for opensearch

```shell
$ bin/opensearch-plugin install file:////opensearch/target/ideaseg.zip
```

In both commands, the path to the `ideaseg` source code precedes `/elasticsearch/target/ideaseg.zip` or `/opensearch/target/ideaseg.zip`.

Please note that the path must be prefixed with `file://`. On a windows system the prefix is `file:///`, for example `file:///d:\workdir\indexea\ideaseg\elasticsearch\target\ideaseg.zip`.

During installation, the plugin will prompt for permissions. Press enter to confirm and complete the installation, then restart the service.
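Getting the `file://` prefix right is the step most prone to typos, so the URL can be assembled by a small shell function. This is only a sketch: `plugin_url` is a hypothetical helper that is not part of ideaseg, `/opt/src/ideaseg` is an assumed source location, and the function handles Unix-style absolute paths only.

```shell
#!/bin/sh
# Hypothetical helper (not part of ideaseg): print the file:// URL
# of a built plugin zip.
#   $1 = absolute Unix-style path to the ideaseg source directory (assumed)
#   $2 = module name: elasticsearch or opensearch
plugin_url() {
  # file:// followed by an absolute path yields file:///...
  printf 'file://%s/%s/target/ideaseg.zip\n' "$1" "$2"
}

# Example with an assumed source location
plugin_url /opt/src/ideaseg opensearch
# prints file:///opt/src/ideaseg/opensearch/target/ideaseg.zip
```

The result can then be passed to the plugin tool, for example `bin/opensearch-plugin install "$(plugin_url /opt/src/ideaseg opensearch)"`.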
Next, you can use the word segmentation test tool to test the plugin as follows:

```
POST _analyze
{
  "analyzer": "ideaseg",
  "text": "你好,我用的是 ideaseg 分词插件。"
}
```

For more information on word segmentation testing, please refer to the [ElasticSearch Documentation](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/test-analyzer.html).

### Feedback

If you have any questions about using `ideaseg`, please raise them via [Issues](https://gitee.com/indexea/ideaseg/issues).

### Special thanks

https://github.com/KennFalcon/elasticsearch-analysis-hanlp