[ELK Stack] Counting Specific Strings in an Elasticsearch Text Field (Document Count)

Arc Lab. 2017. 1. 6. 18:14

[Updated 2017.01.06 17:55]

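Given a scenario like the one below, I looked for a way to count the strings contained in a particular text field.

The method I have found so far counts a string once per document, even if the same string appears several times in that document's text field.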

- index: logstash-app-name

- type: data1

- field: AppNameList (* the field must be analyzed for full-text search to work; in Elasticsearch 5.x a field of type text is analyzed by default.)

- field text : "word, test, word, rundll32, autocad, autocad"
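
To make the "once per document" behavior concrete, here is a minimal Python sketch that mimics it outside Elasticsearch. The analyze function is only a rough stand-in for the standard analyzer (lowercase, split on non-word characters), and the function names are hypothetical, not part of any Elasticsearch API.

```python
import re
from collections import Counter


def analyze(text):
    # Rough stand-in for the standard analyzer (an assumption):
    # lowercase the text and split it on non-word characters.
    return re.findall(r"\w+", text.lower())


def doc_counts(docs):
    # Number of documents that contain each term.
    counts = Counter()
    for text in docs:
        # set() collapses repeated terms, so one document adds at most 1 per term.
        counts.update(set(analyze(text)))
    return counts


def term_occurrences(docs):
    # Raw number of occurrences of each term, for comparison.
    counts = Counter()
    for text in docs:
        counts.update(analyze(text))
    return counts


docs = ["word, test, word, rundll32, autocad, autocad"]
print(sorted(doc_counts(docs).items()))
# [('autocad', 1), ('rundll32', 1), ('test', 1), ('word', 1)]
print(sorted(term_occurrences(docs).items()))
# [('autocad', 2), ('rundll32', 1), ('test', 1), ('word', 2)]
```

The first result corresponds to what this post counts; the second is the raw occurrence count, which this method does not produce.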

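First, fielddata must be set to true in the field's mapping properties, ideally before data is indexed into Elasticsearch through Logstash or a similar tool. The default is false, and running the aggregation without it produces the following error: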

"Fielddata is disabled on text fields by default. Set fielddata=true on AppNameList in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."

* Reference: https://www.elastic.co/guide/en/elasticsearch/reference/5.1/fielddata.html

Update the field properties of the existing index as shown below, or put the setting into an index template in advance.

PUT logstash-app-name/_mapping/data1
{
  "properties": {
    "AppNameList": { 
      "type":     "text",
      "fielddata": true
    }
  }
}

PUT _template/logstash_appname
{
  "template": "logstash-app-name*",
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "1",
      "mapping": {
        "total_fields": {
          "limit": "10000"
        }
      }
    }
  },
  "mappings": {
    "data1": {
      "properties": {
        "AppNameList": {
          "type": "text",
          "fielddata": true
        }
      }
    },
    "_default_": {
      "numeric_detection": true,
      "dynamic_templates": [],
      "_all": {
        "enabled": false
      }
    }
  }
}
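
If you would rather apply the mapping update from a script than from the Kibana console, the request could be built roughly like this. This is only a sketch using the Python standard library: the base URL http://localhost:9200 and the helper names are assumptions for illustration, and the request is built but not sent.

```python
import json
import urllib.request


def fielddata_mapping(field):
    # Mapping body that enables fielddata on a text field.
    return {"properties": {field: {"type": "text", "fielddata": True}}}


def build_put_mapping(base_url, index, doc_type, field):
    # Build (but do not send) the PUT <index>/_mapping/<type> request.
    url = "{}/{}/_mapping/{}".format(base_url.rstrip("/"), index, doc_type)
    body = json.dumps(fielddata_mapping(field)).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# The base URL is an assumption; adjust it to your cluster.
req = build_put_mapping(
    "http://localhost:9200", "logstash-app-name", "data1", "AppNameList"
)
print(req.get_method(), req.full_url)

# To actually send it against a running cluster:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```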

Now run a query through the Search API as shown below, with the buckets sorted by count in descending order.

* Reference: https://www.elastic.co/guide/en/elasticsearch/reference/5.1/search-aggregations-bucket-terms-aggregation.html

GET logstash-app-name/_search
{
  "size": 0,
  "aggs": {
    "count_by_type": {
      "terms": {
        "field": "AppNameList",
        "order": { "_count": "desc" }
      }
    }
  }
}

Given the following field text,

- field text : "word, test, word, rundll32, autocad, autocad"

the response below shows that each string in the AppNameList text field is counted per document (duplicates within a document are not counted again).

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "count_by_type": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "autocad",
          "doc_count": 1
        },
        {
          "key": "test",
          "doc_count": 1
        },
        {
          "key": "word",
          "doc_count": 1
        },
        {
          "key": "rundll32",
          "doc_count": 1
        }
      ]
    }
  }
}
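
When the response is consumed from a script, the buckets can be flattened into simple (key, doc_count) pairs. The following is a small sketch under that assumption: bucket_counts is a hypothetical helper name, and the sample is the aggregations part of the response above, trimmed for brevity.

```python
import json


def bucket_counts(response, agg_name):
    # Extract (key, doc_count) pairs from a terms aggregation, in response order.
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]


# Trimmed copy of the response shown above.
sample = json.loads("""
{
  "aggregations": {
    "count_by_type": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {"key": "autocad", "doc_count": 1},
        {"key": "test", "doc_count": 1},
        {"key": "word", "doc_count": 1},
        {"key": "rundll32", "doc_count": 1}
      ]
    }
  }
}
""")

for key, count in bucket_counts(sample, "count_by_type"):
    print(key, count)
```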

In Kibana, setting the Y-Axis to Count and the X-Axis to Aggregation > Terms > AppNameList displays these per-document string counts as a graph.

* In an index template, the _default_ mapping is applied first, followed by the custom (type-specific) mapping.
