ElasticsearchAnalyzer and ElasticsearchTextAnalyzer analyze text using Elasticsearch's analysis feature, so any text analyzer available in Elasticsearch can be used from Python.
First, set up Elasticsearch,
$ wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/zip/elasticsearch/2.3.1/elasticsearch-2.3.1.zip
$ unzip elasticsearch-2.3.1.zip
$ cd elasticsearch-2.3.1
$ echo 'cluster.name: es-ml' >> config/elasticsearch.yml
$ echo 'network.host: "0"' >> config/elasticsearch.yml
$ ./bin/plugin install org.codelibs/elasticsearch-analyze-api/2.3.0
install the analysis plugins you need,
$ ./bin/plugin install analysis-kuromoji
$ ./bin/plugin install analysis-icu
$ ./bin/plugin install org.codelibs/elasticsearch-analysis-synonym/2.3.0 -b
$ ./bin/plugin install org.codelibs/elasticsearch-analysis-ja/2.3.0 -b
$ ./bin/plugin install org.codelibs/elasticsearch-analysis-kuromoji-neologd/2.3.0 -b
and then start Elasticsearch.
$ ./bin/elasticsearch &
To analyze text, create an Elasticsearch index that defines the analyzers you want to use.
$ curl -XPUT localhost:9200/.analyzer -d '
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_neologd_tokenizer": {
            "discard_punctuation": "false",
            "type": "kuromoji_neologd_tokenizer",
            "mode": "normal"
          }
        },
        "analyzer": {
          "kuromoji_neologd_analyzer": {
            "tokenizer": "kuromoji_neologd_tokenizer",
            "type": "custom"
          }
        }
      },
      "number_of_replicas": "0",
      "number_of_shards": "10",
      "refresh_interval": "60s"
    }
  }
}'
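The same index definition can also be prepared from Python. The following is a minimal sketch using only the standard library; the helper name `build_index_request` is hypothetical (it is not part of commonml), and the `Content-Type` header is an assumption that Elasticsearch 2.x does not require. The request is built but not sent, so no running Elasticsearch is needed to try it.

```python
import json
import urllib.request


def build_index_request(host="localhost:9200", index=".analyzer"):
    """Build (but do not send) the PUT request that creates the analyzer index.

    Hypothetical helper for illustration; it mirrors the curl command above.
    """
    settings = {
        "settings": {
            "index": {
                "analysis": {
                    "tokenizer": {
                        "kuromoji_neologd_tokenizer": {
                            "discard_punctuation": "false",
                            "type": "kuromoji_neologd_tokenizer",
                            "mode": "normal",
                        }
                    },
                    "analyzer": {
                        "kuromoji_neologd_analyzer": {
                            "tokenizer": "kuromoji_neologd_tokenizer",
                            "type": "custom",
                        }
                    },
                },
                "number_of_replicas": "0",
                "number_of_shards": "10",
                "refresh_interval": "60s",
            }
        }
    }
    body = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        "http://%s/%s" % (host, index),
        data=body,
        method="PUT",
        # Assumption: harmless on 2.x, where the header is optional.
        headers={"Content-Type": "application/json"},
    )


request = build_index_request()
# To actually create the index (requires a running Elasticsearch):
# urllib.request.urlopen(request)
```

Keeping the settings as a Python dict makes it easy to add further tokenizers or analyzers before the index is created.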
To check that the _analyze_api endpoint works, send the following request:
$ curl -XPOST "localhost:9200/.analyzer/_analyze_api?pretty&analyzer=kuromoji_neologd_analyzer&part_of_speech=true" -d '
{
  "data": {
    "text": "今日の天気は晴れです。"
  }
}'
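For reference, the check above can be expressed in Python as well. This is a sketch under stated assumptions: `build_analyze_request` is a hypothetical helper (not a commonml function), it only reproduces the URL, query parameters, and body of the curl command, and it leaves out `pretty` because that parameter only affects output formatting. The response format is not shown here since it depends on the plugin.

```python
import json
import urllib.parse
import urllib.request


def build_analyze_request(text, host="localhost:9200", index=".analyzer",
                          analyzer="kuromoji_neologd_analyzer"):
    """Build (but do not send) a POST request to the _analyze_api endpoint.

    Hypothetical helper for illustration; it mirrors the curl command above.
    """
    query = urllib.parse.urlencode(
        {"analyzer": analyzer, "part_of_speech": "true"}
    )
    url = "http://%s/%s/_analyze_api?%s" % (host, index, query)
    # The text to analyze is wrapped as {"data": {"text": ...}}.
    body = json.dumps({"data": {"text": text}}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


analyze_request = build_analyze_request("今日の天気は晴れです。")
# To actually send it (requires a running Elasticsearch with the plugin):
# print(urllib.request.urlopen(analyze_request).read().decode("utf-8"))
```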
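In Python, build an analyzer from the URL of that index and analyzer, then call it on a text:
from commonml import es

# Here the URL names the host and port, the index, and the analyzer
analyzer_url = 'es://localhost:9200/.analyzer/kuromoji_neologd_analyzer'
es_analyzer = es.build_analyzer(analyzer_url)
# Print each term the analyzer produces for the text
for term in es_analyzer('今日の天気は晴れです。'):
    print(term)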
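Judging from the index and analyzer names defined earlier, the `es://` URL appears to combine the Elasticsearch host and port, the index name, and the analyzer name. The sketch below shows that reading with the standard library; `split_analyzer_url` is a hypothetical helper, and this is an assumption about the URL layout, not a description of how commonml parses it internally.

```python
from urllib.parse import urlsplit


def split_analyzer_url(analyzer_url):
    """Split an es:// analyzer URL into (host, port, index, analyzer).

    Hypothetical helper; assumes the layout es://host:port/index/analyzer.
    """
    parts = urlsplit(analyzer_url)
    index, analyzer = parts.path.strip("/").split("/", 1)
    return parts.hostname, parts.port, index, analyzer


host, port, index, analyzer = split_analyzer_url(
    'es://localhost:9200/.analyzer/kuromoji_neologd_analyzer'
)
print(host, port, index, analyzer)
```

To use a different analyzer defined in the same index, only the last path segment of the URL would need to change under this reading.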