ideaseg is a Chinese tokenizer based on the latest [hanlp](https://github.com/hankcs/hanlp/tree/1.x) natural language processing toolkit. It bundles the latest model data and removes the [neuralnetworkparser](https://github.com/hankcs/hanlp/issues/644) code and data contained in hanlp, whose license does not permit commercial use.

Compared with other tokenizers such as ik and jcseg, hanlp is considerably more accurate but slower. By optimizing and configuring hanlp, ideaseg strikes a balance between accuracy and tokenization speed.

Compared with other hanlp-based plugins, ideaseg tracks the latest hanlp code and data, removes the content that cannot be used commercially, configures itself automatically, and ships with the model data, so nothing needs to be downloaded separately.

ideaseg provides three modules:

1. `core` ~ core tokenizer module
2. `elasticsearch` ~ ideaseg tokenizer plugin for elasticsearch (up to version 7.10.2)
3. `opensearch` ~ ideaseg tokenizer plugin for opensearch (default version 2.4.1)

**Note on elasticsearch versions: since version 7.11.1, elastic has changed the license of es and the permission policy for plugins, which are no longer allowed to read and write files. Because the hanlp model data is very large, hanlp generates cache files in the plugin's data directory to speed up processing. If you use elasticsearch, please stay on version 7.10.2 or lower; opensearch is recommended.**
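
The version constraint in the note can be checked in a script before building. The following is a minimal sketch, not part of ideaseg: the function name `es_version_supported` is hypothetical, and it assumes a `sort` that supports version ordering via `-V`.

```shell
# Hypothetical helper: succeeds (exit status 0) when the given
# elasticsearch version is 7.10.2 or lower, as the note recommends.
# Assumption: `sort -V` (version sort) is available on this system.
es_version_supported() {
  max_version="7.10.2"
  # Sort the two versions; the last line is the higher one.
  highest=$(printf '%s\n%s\n' "$1" "$max_version" | sort -V | tail -n 1)
  [ "$highest" = "$max_version" ]
}

# Usage: es_version_supported 7.9.3 && echo "ok for the elasticsearch module"
```

The function only compares version strings; it does not detect which version is actually installed.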
In addition, the `data` folder contains the hanlp model data.

Because the model data is large (400-500 MB after packaging), and elasticsearch plugins are strictly bound to the engine version, of which there are many, this project does not provide pre-compiled binaries. You need to download the source code and build the plugin yourself.
### Building

The following describes how to build the plugin. Before starting, install git, java, maven and the other related tools.

First, determine the exact version of your elasticsearch or opensearch. Assuming you are using elasticsearch 7.10.2, open `ideaseg/elasticsearch/pom.xml` in a text editor and set the value of `elasticsearch.version` to `7.10.2` (for opensearch, edit `opensearch/pom.xml` instead).
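
If you prefer to script the version change instead of using a text editor, something like the sketch below may work. It is an illustration only: the helper name `set_pom_property` is hypothetical, and it assumes the version is declared in the POM as a single-line element such as `<elasticsearch.version>...</elasticsearch.version>`, which you should verify in your checkout.

```shell
# Hypothetical helper: set_pom_property <pom-file> <property> <value>
# Assumption: the property appears on one line as <property>old</property>.
set_pom_property() {
  pom_file=$1
  prop_name=$2
  prop_value=$3
  # Write to a temporary file and move it into place, which avoids
  # the differences between GNU and BSD `sed -i`.
  sed "s|<$prop_name>[^<]*</$prop_name>|<$prop_name>$prop_value</$prop_name>|" \
    "$pom_file" > "$pom_file.tmp" && mv "$pom_file.tmp" "$pom_file"
}

# Usage, from the ideaseg source root:
#   set_pom_property elasticsearch/pom.xml elasticsearch.version 7.10.2
```

Check the result with a diff before building, since the pattern replaces every matching element in the file.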
Save the file, open a terminal, and run the following commands to build:

```shell
$ git clone https://gitee.com/indexea/ideaseg
$ cd ideaseg
$ mvn install
```

When the build completes, a plugin file `ideaseg.zip` is generated in each of `elasticsearch/target` and `opensearch/target`.
### Installation

After the build completes, install the plugin with the plugin management tool provided by elasticsearch or opensearch. For elasticsearch the tool is `<elasticsearch>/bin/elasticsearch-plugin`; for opensearch it is `<opensearch>/bin/opensearch-plugin`. Here `<elasticsearch>` and `<opensearch>` are the installation directories of the two services.
#### Install ideaseg plugin for elasticsearch

```shell
$ bin/elasticsearch-plugin install file:///<ideaseg>/elasticsearch/target/ideaseg.zip
```

#### Install ideaseg plugin for opensearch

```shell
$ bin/opensearch-plugin install file:///<ideaseg>/opensearch/target/ideaseg.zip
```

Here `<ideaseg>` is the path to the `ideaseg` source code. Note that the path must be prefixed with `file://`. On Windows, use the `file:///` prefix instead, for example `file:///d:\workdir\indexea\ideaseg\elasticsearch\target\ideaseg.zip`.
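
On Unix-like systems, the `file://` URL can be derived from the location of the built archive rather than typed by hand. This is a rough sketch under that assumption; `plugin_url` is a hypothetical helper, not something shipped with ideaseg, and it does not handle Windows paths.

```shell
# Hypothetical helper: print a file:// URL for an existing plugin archive.
# Resolves the containing directory to an absolute path first, so the
# argument may be a relative path.
plugin_url() {
  zip_dir=$(cd "$(dirname "$1")" && pwd) || return 1
  printf 'file://%s/%s\n' "$zip_dir" "$(basename "$1")"
}

# Usage:
#   bin/elasticsearch-plugin install "$(plugin_url /path/to/ideaseg/elasticsearch/target/ideaseg.zip)"
```

Because an absolute Unix path already starts with `/`, the printed URL begins with `file:///`.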
During installation, the plugin tool prompts for permissions; press Enter to confirm, then restart the service once installation finishes.

Next, you can test the plugin with the word segmentation test tool as follows:

```
POST _analyze
{
  "analyzer": "ideaseg",
  "text": "你好,我用的是 ideaseg 分词插件。"
}
```

For more information on word segmentation testing, please refer to the [ElasticSearch Documentation](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/test-analyzer.html).
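
The same `_analyze` request can be sent from a terminal with curl. The sketch below assumes the node listens on `http://localhost:9200` without authentication, which may not match your setup (opensearch in particular often runs with HTTPS and credentials enabled). The function names and the `ES_URL` variable are illustrative.

```shell
# Assumption: an unauthenticated node at http://localhost:9200.
# Override by exporting ES_URL before calling analyze.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the JSON request body for the ideaseg analyzer.
# Note: the text is inserted as-is, so it must not contain
# double quotes or backslashes.
analyze_body() {
  printf '{"analyzer": "ideaseg", "text": "%s"}\n' "$1"
}

# POST the text to the _analyze endpoint and print the raw JSON response.
analyze() {
  curl -s -X POST "$ES_URL/_analyze" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1")"
}

# Usage (requires a running node with the plugin installed):
#   analyze "你好,我用的是 ideaseg 分词插件。"
```

Keeping the body in its own function makes it easy to inspect the request without contacting the server.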
### Feedback

If you have any questions about using `ideaseg`, please raise them via [Issues](https://gitee.com/indexea/ideaseg/issues).

### Special thanks

https://github.com/KennFalcon/elasticsearch-analysis-hanlp
