Auto-tagging on Elasticsearch (with an introduction to the Melon ES search system) - 송준이

LOEN Entertainment
Platform Development Team
2016.03.31
송준이
(socurites@gmai.com)

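Contents
• Introduction to the Melon ES search system
– Indexing process
– ES search system architecture
– Search application structure
• DeepDetect: API + Server
– DeepDetect overview
– Installation
– Training and serving
• DeepDetect /predict
– Classifying images
– Getting a pre-trained model
– Registering a service
– Predicting image classes
• DeepDetect /train
– Preparing the input dataset
– Registering a service
– Training a model
– Predicting text classes
• Operating DeepDetect
– Registering a pre-trained image classification model
– Registering a pre-trained text classification model
– Separating the training server from the production server
• Auto-tagging on Elasticsearch
– DeepDetect + Elasticsearch
– Auto-tagging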

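Introduction to the Melon ES search system

Indexing process
– Indexing cycles
• Initial indexing (full / indexing)
– Indexes the entire table once
• Real-time indexing (realtime / indexing)
– Indexes data as it changes (create/update)
– Asynchronous, so it can fail
• Incremental indexing
– Periodically re-indexes the changed data to repair documents whose real-time indexing failed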
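The three indexing cycles above can be sketched roughly as follows. This is only an illustration, not the Melon implementation: a plain dict stands in for the Elasticsearch index, and the names `full_index`, `realtime_index`, `incremental_index`, the `send` callback, and the `updated_at` field are all hypothetical.

```python
# Hypothetical sketch: a dict stands in for the Elasticsearch index.

def full_index(index, table):
    """Initial indexing: index every row of the source table once."""
    index.clear()
    for row in table:
        index[row["id"]] = dict(row)


def realtime_index(index, row, send=None):
    """Real-time indexing: index one changed row; the send step may fail."""
    try:
        if send is not None:
            send(row)  # assumed asynchronous call to the search cluster
        index[row["id"]] = dict(row)
        return True
    except Exception:
        return False  # left for the incremental run to repair


def incremental_index(index, table, since):
    """Incremental indexing: re-index every row changed at or after `since`."""
    repaired = 0
    for row in table:
        if row["updated_at"] >= since:
            index[row["id"]] = dict(row)
            repaired += 1
    return repaired
```

A failed real-time update leaves the old document in the index until the next incremental run picks the row up again through its change timestamp.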
[Figure: indexing flow in six numbered steps. Labels: source table; full / import; full / indexing; changes (periodic); partial(changes) / import; incremental / indexing; API server; create / update; realtime / indexing]

ES search system architecture
[Figure: architecture diagram. Labels: client; API cluster (search API servers); http; Elasticsearch (ES) cluster with #master nodes, #data nodes, and #client nodes; indexing batch server with #bulk node; legend: ES node, ES server]

Search application structure
• Class hierarchy
– Based on spring-data-elasticsearch
• https://github.com/spring-projects/spring-data-elasticsearch

Search application structure
• Implementing the search methods
– Search
– Aggregation

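DeepDetect overview
• DeepDetect features
– Provides an API and a server for deep learning
– Open source
– Written in C++11
– Supports the Caffe deep learning library (https://github.com/BVLC/caffe)
– Implements supervised deep learning for images, text, and CSV files
– Provides not only an API but also a server, for both training and production serving
– Allows an architecture that separates the development server used for training from the production server used for serving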
https://www.elastic.co/blog/categorizing-images-with-deep-learning-into-elasticsearch

Installation
• Prerequisites
$ sudo apt-get install build-essential libgoogle-glog-dev libgflags-dev libeigen3-dev libopencv-dev libcppnetlib-dev \
libboost-dev libcurlpp-dev libcurl4-openssl-dev protobuf-compiler libopenblas-dev libhdf5-dev libprotobuf-dev \
libleveldb-dev libsnappy-dev liblmdb-dev libutfcpp-dev cmake
• Compiling
$ git clone https://github.com/beniz/deepdetect.git
$ cd deepdetect
$ mkdir build
$ cd build
$ cmake ..
$ make
Installing DeepDetect, http://www.deepdetect.com/overview/installing/

Installation
• Running the server
– The server listens for HTTP on port 8080
$ ./main/dede
DeepDetect [ commit c8556f0b3e7d970bcd9861b910f9eae87cfd4b0c ]
Running DeepDetect HTTP server on localhost:8080
• Checking the server status
$ curl -XGET 'http://localhost:8080/info' | json_pp
{
"status" : {
"msg" : "OK",
"code" : 200
},
"head" : {
"version" : "0.1",
"services" : [],
"branch" : "master",
"commit" : "c8556f0b3e7d970bcd9861b910f9eae87cfd4b0c",
"method" : "/info"
}
}
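A script could check the /info response along these lines. This is a minimal sketch that only parses a response body like the one above; the function name `server_is_ready` is hypothetical, and the assumption that each entry of `services` is either a plain name or an object with a `name` key is not confirmed by the output shown here.

```python
import json


def server_is_ready(info_text):
    """Return (ok, service_names) from the body of a GET /info response."""
    info = json.loads(info_text)
    ok = info.get("status", {}).get("code") == 200
    services = info.get("head", {}).get("services", [])
    # Assumption: entries are either strings or objects with a "name" key.
    names = [s.get("name") if isinstance(s, dict) else s for s in services]
    return ok, names
```

With the response above, the status code is 200 and the service list is empty, because nothing has been registered yet.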

Training and serving
• Image classifier
1. Train on images whose class is known
2. The deep learning engine builds a model from the training result
3. Request a class prediction for a new image
4. The application service predicts the class of the new image based on the generated model

Training and serving
• Classifying images with DeepDetect
1. Train on images whose class is known
2. The deep learning engine builds a model from the training result
3. Register a service for the model
4. Request a class prediction for a new image
5. The application service predicts the class of the new image based on the generated model
DeepDetect = API + Server

DeepDetect /predict
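Image classification example

Classifying images
• Classifying images with a trained model
– If a trained model already exists:
• Register a service for the model
• Predict the class of a new image
– Training is covered later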
Setup of image classifier, http://www.deepdetect.com/tutorials/imagenet-classifier/

Getting a pre-trained model
• Download the GoogleNet model
$ cd build/caffe_dd/src/caffe_dd
$ ./scripts/download_model_binary.py models/bvlc_googlenet/
$ ls -l models/bvlc_googlenet/*.caffemodel
• Create the model repository to serve
– Create the model repository directory that the server will use
• The repository path is specified when the service is registered, so it can live anywhere
• Create one repository directory per service; this example uses imgnet
$ cd deepdetect
$ mkdir models
$ mkdir models/imgnet
$ mv build/caffe_dd/src/caffe_dd/models/bvlc_googlenet/bvlc_googlenet.caffemodel models/imgnet

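Getting a pre-trained model
• Add the class mapping data
– This file is generated automatically during training
– The class mapping data for the pre-trained model used in this example is already provided
$ cp datasets/imagenet/corresp_ilsvrc12.txt models/imgnet/corresp.txt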
$ head models/imgnet/corresp.txt
0 n01440764 tench, Tinca tinca
1 n01443537 goldfish, Carassius auratus
2 n01484850 great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias
3 n01491361 tiger shark, Galeocerdo cuvieri
4 n01494475 hammerhead, hammerhead shark
5 n01496331 electric ray, crampfish, numbfish, torpedo
…
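Each line of this corresp.txt appears to hold a class index, a synset ID, and a comma-separated list of names. A minimal parsing sketch under that assumption follows; the helper name `parse_corresp_line` is hypothetical, and the layout is inferred only from the lines shown above.

```python
def parse_corresp_line(line):
    """Split '1 n01443537 goldfish, Carassius auratus' into its three parts."""
    index, synset, names = line.strip().split(" ", 2)
    labels = [name.strip() for name in names.split(",")]
    return int(index), synset, labels
```

Mapping files that carry only an index and a label per line would need a different split.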

Registering a service
• Register the service
$ cd deepdetect/build/main
$ ./dede
$ curl -X PUT "http://localhost:8080/services/imageserv" -d '{
"mllib":"caffe",
"description":"image classification service",
"type":"supervised",
"parameters":{
"input":{
"connector":"image"
},
"mllib":{
"template":"googlenet",
"nclasses":1000
}
},
"model":{
"templates":"../templates/caffe/",
"repository":"../../models/imgnet"
}
}' | json_pp
{
"status" : {
"code" : 201,
"msg" : "Created"
}
}

– Service name: imageserv
• mllib: Machine Learning LIBrary
• type: supervised
• connector: one of image, txt, csv
• nclasses: Number of CLASSES, the number of classes to predict
• repository: location where the model to serve is stored
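The registration body can also be assembled programmatically. The sketch below builds only the fields used in this deck; the function name `build_service_payload` and its argument names are hypothetical, and the set of accepted connectors is limited to the three listed above.

```python
import json


def build_service_payload(description, connector, nclasses, repository,
                          template=None, templates_dir=None):
    """Build the JSON body for PUT /services/<name> from the fields shown above."""
    if connector not in ("image", "txt", "csv"):
        raise ValueError("unknown connector: %s" % connector)
    mllib_params = {"nclasses": nclasses}
    if template is not None:
        mllib_params["template"] = template
    model = {"repository": repository}
    if templates_dir is not None:
        model["templates"] = templates_dir
    return {
        "mllib": "caffe",
        "description": description,
        "type": "supervised",
        "parameters": {"input": {"connector": connector}, "mllib": mllib_params},
        "model": model,
    }


payload = build_service_payload(
    "image classification service", "image", 1000, "../../models/imgnet",
    template="googlenet", templates_dir="../templates/caffe/")
body = json.dumps(payload)  # string to send as the request body
```

Validating the connector up front catches a misspelling such as "text" before the request reaches the server.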

Predicting image classes
• Predicting from a local image file
$ wget http://www.deepdetect.com/img/ambulance.jpg
$ curl -X POST "http://localhost:8080/predict" -d'
{
"service":"imageserv",
"parameters":{
"input":{
"width":224,
"height":224
},
"output":{
"best":3
}
},
"data":[
"ambulance.jpg"
]
}' | json_pp

• Predicting from a local image file
– Service name: imageserv
• best (under output): maximum number of classes to include in the prediction result
{
"status" : {
"code" : 200,
"msg" : "OK"
},
"body" : {
"predictions" : {
"uri" : "ambulance.jpg",
"classes" : [
{
"prob" : 0.992852032184601,
"cat" : "n02701002 ambulance"
},
{
"prob" : 0.0069321496412158,
"cat" : "n03977966 police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria"
},
{
"cat" : "n03769881 minibus",
"last" : true,
"prob" : 6.9531706685666e-05
}
]
}
},
"head" : {
"time" : 1211,
"service" : "imageserv",
"method" : "/predict"
}
}
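A response like the one above can be reduced to a short list of usable tags. This is a sketch only: the function name `top_tags` and the `min_prob` threshold are hypothetical, and it assumes `classes` is a list and that a leading token of the form `n` plus eight digits is a synset ID to drop.

```python
def top_tags(response, min_prob=0.01):
    """Return (label, prob) pairs from a /predict response with prob >= min_prob."""
    classes = response["body"]["predictions"]["classes"]
    tags = []
    for entry in classes:
        if entry["prob"] < min_prob:
            continue
        synset, _, rest = entry["cat"].partition(" ")
        # Drop a leading synset ID such as "n02701002"; keep other labels whole.
        is_synset = len(synset) == 9 and synset[0] == "n" and synset[1:].isdigit()
        tags.append((rest if is_synset else entry["cat"], entry["prob"]))
    return sorted(tags, key=lambda tag: tag[1], reverse=True)
```

For the ambulance response, a threshold of 0.005 keeps the first two classes and drops minibus.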

• Predicting from an image URL
$ curl -X POST "http://localhost:8080/predict" -d'
{
"service":"imageserv",
"parameters":{
"input":{
"width":224,
"height":224
},
"output":{
"best":3
}
},
"data":[
"http://i.ytimg.com/vi/0vxOhd4qlnA/maxresdefault.jpg"
]
}' | json_pp
…
"predictions" : {
"classes" : [
{
"cat" : "n03868863 oxygen mask",
"prob" : 0.225514054298401
},
{
"prob" : 0.209176555275917,
"cat" : "n03127747 crash helmet"
},
],
"uri" : "http://i.ytimg.com/vi/0vxOhd4qlnA/maxresdefault.jpg"
…

DeepDetect /train
Text training example

Preparing the input dataset
• news20 data
– Email (newsgroup post) data covering 20 topics
– At most 1,000 text email files per topic
$ cd deepdetect/
$ mkdir input
$ mkdir input/models
$ mkdir input/models/n20
$ cd input/models/n20
$ wget http://www.deepdetect.com/dd/examples/all/n20/news20.tar.bz2
$ tar xvjf news20.tar.bz2
$ rm -rf news20.tar.bz2
$ ll news20/
drwxr-xr-x 2 socurites socurites 32768 9월 30 00:52 alt_atheism/
drwxr-xr-x 2 socurites socurites 36864 9월 30 00:52 comp_graphics/
drwxr-xr-x 2 socurites socurites 36864 9월 30 00:52 comp_os_ms_windows_m/
…
$ cat news20/rec_autos/000000431.eml
From: dduff@col.hp.com (Dave Duff)
Subject: Re: Waxing a new car
I just had my 41 Chrysler painted. I was told to refrain from waxing it and
to leave it out in the sun!! Supposedly this let's the volatiles escape from
the paint over a month or so (I can smell it 15 feet away on a hot day) and
lets any slight irregularites in the surface flow out, as the paint remains
a little soft for a while.
Training a model from text, http://www.deepdetect.com/tutorials/txt-training/

Registering a service
• Register the service
$ ./dede
$ cd deepdetect/models
$ mkdir n20
$ curl -X PUT "http://localhost:8080/services/n20" -d '{
"mllib":"caffe",
"description":"newsgroup classification service",
"parameters":{
"input":{
"connector":"txt"
},
"mllib":{
"template":"mlp",
"nclasses":20,
"layers":[200, 200],
"activation":"relu"
}
},
"model":{
"templates":"../../templates/caffe/",
"repository":"../../models/n20"
}
}' | json_pp

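– Service name: n20
• layers: [200, 200], that is, two hidden layers of 200 hidden nodes each
• activation: uses relu (Rectified Linear Unit) as the activation function
• All relative paths are resolved from the directory where the DeepDetect server was started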

Training a model
• Training with a BOW (Bag Of Words) model
$ curl -X POST "http://localhost:8080/train" -d '{
"service":"n20",
"async":true,
"parameters":{
"mllib":{
"gpu":true,
"solver":{
"iterations":2000,
"test_interval":200,
"base_lr":0.05
},
"net":{
"batch_size":300
}
},
"input":{
"shuffle":true,
"test_split":0.2,
"min_count":2,
"min_word_length":2,
"count":false
},
"output":{
"measure":[
"mcll",
"f1"
]
}
},
"data":[
"../../input/models/n20/news20"
]
}'

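Training a model
• Training with a BOW (Bag Of Words) model
– Parameters
• gpu: whether to use the GPU for computation
• iterations: number of training iterations
• test_split: fraction of the input data held out as test data (20%)
• min_count: minimum word frequency; words that occur fewer times are excluded from the BOW
• min_word_length: minimum word length; shorter words are excluded from the BOW
• count: whether word frequency values are used in training
– Server log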
INFO - source=../../templates/caffe/mlp/
INFO - dest=../../models/n20/mlp.prototxt
list subdirs size=20
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0309 18:11:13.121186 31042 txtinputfileconn.cc:182] vocabulary size=88631
data split test size=3770 / remaining data size=15078
vocab size=88631
INFO - user batch_size=300 / inputc batch_size=15078
INFO - batch_size=359 / test_batch_size=290 / test_iter=13
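To make the `min_count`, `min_word_length`, and `count` parameters concrete, here is a small bag-of-words sketch. It is not DeepDetect's implementation: the tokenizer (lowercased `\w+` runs), the use of total occurrences for `min_count`, and the function names `build_vocab` and `bow_vector` are all assumptions made for illustration.

```python
from collections import Counter
import re


def build_vocab(docs, min_count=2, min_word_length=2):
    """Keep words that are long enough and frequent enough; map word -> index."""
    counts = Counter()
    for doc in docs:
        counts.update(word for word in re.findall(r"\w+", doc.lower())
                      if len(word) >= min_word_length)
    kept = sorted(word for word, n in counts.items() if n >= min_count)
    return {word: i for i, word in enumerate(kept)}


def bow_vector(doc, vocab, count=False):
    """Encode one document: word counts if count=True, else 0/1 presence."""
    vec = [0] * len(vocab)
    for word in re.findall(r"\w+", doc.lower()):
        if word in vocab:
            vec[vocab[word]] = vec[vocab[word]] + 1 if count else 1
    return vec
```

In the server log above, 3770 test documents plus 15078 remaining documents give 18848 in total, and 3770 is about 20% of that, which matches `test_split` 0.2.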

Training a model
• BOW (Bag Of Words) word list
$ cd deepdetect
$ cd models/n20/
$ ll
total 1264
drwxrwxr-x 2 socurites socurites 4096 3월 9 18:11 ./
drwxrwxr-x 4 socurites socurites 4096 3월 9 17:50 ../
-rw-rw-r-- 1 socurites socurites 372 3월 9 18:11 corresp.txt
-rw-rw-r-- 1 socurites socurites 1065 3월 9 18:11 deploy.prototxt
-rw-rw-r-- 1 socurites socurites 1704 3월 9 18:11 mlp.prototxt
-rw-rw-r-- 1 socurites socurites 290 3월 9 18:10 mlp_solver.prototxt
-rw-rw-r-- 1 socurites socurites 4389 3월 9 18:11 model.json
-rw-rw-r-- 1 socurites socurites 1259121 3월 9 18:11 vocab.dat
$ head vocab.dat
autoposting,0
seidov,1
ambulances,2
isqat,3
earring,4
grigorevna,5
barfling,6
13271@cs,7
024858,8
aryeh,9
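The vocab.dat lines shown above look like `word,index` pairs. A minimal sketch for loading them into a dictionary follows; the function name `parse_vocab` is hypothetical, and the format is inferred only from the ten lines printed by `head`.

```python
def parse_vocab(lines):
    """Parse 'word,index' lines into a dict, splitting on the last comma only."""
    vocab = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, index = line.rsplit(",", 1)
        vocab[word] = int(index)
    return vocab
```

Splitting from the right keeps a token intact even if the token itself were to contain a comma.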

Training a model
• Monitoring training progress
$ curl -X GET "http://localhost:8080/train?service=n20&job=1" | json_pp
{
"status" : {
"msg" : "OK",
"code" : 200
},
"head" : {
"time" : 791,
"status" : "running",
"method" : "/train",
"job" : 1
},
"body" : {
"measure" : {
"accp" : 0.90053050397878,
"iteration" : 800,
"recall" : 0.902855203525655,
"train_loss" : 0.034172598272562,
"precision" : 0.897884533058947,
"mcll" : 0.389328922262699,
"f1" : 0.900363007899212
}
}
}
## Print the status every second
$ while :; do curl -X GET "http://localhost:8080/train?service=n20&job=1"; sleep 1; echo ""; done

Training a model
• Training finished
$ curl -X GET "http://localhost:8080/train?service=n20&job=1" | json_pp
{
"status":{
"msg":"OK",
"code":200
},
"body":{
"parameters":{
"mllib":{
"batch_size":359
}
},
"measure":{
"f1":0.8919178423728972,
"train_loss":0.0016851313412189484,
"mcll":0.5737156999301365,
"recall":0.8926410552973584,
"iteration":1999.0,
"precision":0.8911958003860988,
"accp":0.8936339522546419
}
},
"head":{
"status":"finished",
"job":1,
"method":"/train",
"time":541.0
}
}
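A polling script needs to decide when to stop and which numbers to report. The sketch below works on already-parsed responses of the two shapes shown in this section; the function names `summarize_job` and `should_keep_polling` are hypothetical, and only the `running` and `finished` status values seen above are assumed.

```python
def summarize_job(response):
    """Return (status, iteration, f1) from a parsed GET /train response."""
    status = response["head"]["status"]
    measure = response.get("body", {}).get("measure", {})
    iteration = measure.get("iteration")
    if iteration is not None:
        iteration = int(iteration)  # the finished response reports 1999.0
    return status, iteration, measure.get("f1")


def should_keep_polling(response):
    """Keep polling only while the job reports itself as running."""
    return response["head"]["status"] == "running"
```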

Predicting text classes
• /predict
– Service name: n20
$ curl -X POST 'http://localhost:8080/predict' -d '{
"service":"n20",
"parameters":{
"mllib":{
"gpu":true
}
},
"data":[
"my computer runs linux"
]
}' | json_pp
{
"status" : {
"msg" : "OK",
"code" : 200
},
"body" : {
"predictions" : {
"classes" : {
"prob" : 0.392086714506149,
"last" : true,
"cat" : "comp_windows_x"
},
"uri" : "0"
}
},
"head" : {
"method" : "/predict",
"service" : "n20",
"time" : 925
}
}
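In the outputs shown in this deck, `classes` is a list for the image service but a single object for this text prediction. Client code may want to treat both shapes the same way; the following is a sketch under that observation, with the hypothetical helper name `classes_by_uri`.

```python
def classes_by_uri(response):
    """Map each prediction uri to a list of (cat, prob), whatever the shape."""
    predictions = response["body"]["predictions"]
    if isinstance(predictions, dict):
        predictions = [predictions]
    result = {}
    for prediction in predictions:
        classes = prediction["classes"]
        if isinstance(classes, dict):  # a single class, as in the n20 output
            classes = [classes]
        result[prediction["uri"]] = [(c["cat"], c["prob"]) for c in classes]
    return result
```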

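Operating DeepDetect
Separating the training server from the serving server

Registering a pre-trained image classification model
• Examples of pre-trained image classification models
– Clothing
• Number of classes: 304
• http://www.deepdetect.com/models/clothing.tar.bz2
– Bags
• Number of classes: 37
• http://www.deepdetect.com/models/bags.tar.bz2
– Footwear
• Number of classes: 51
• http://www.deepdetect.com/models/footwear.tar.bz2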
Application-Ready Deep Neural Net Models, http://www.deepdetect.com/applications/model/

• Installing the pre-trained clothing classification model
$ cd deepdetect/
$ cd models/
$ mkdir dd
$ cd dd
$ wget http://www.deepdetect.com/models/clothing.tar.bz2
$ bunzip2 clothing.tar.bz2
$ tar xvf clothing.tar
$ rm -rf clothing.tar
$ ll clothing/
total 99688
drwxrwxr-x 2 socurites socurites 4096 11월 26 06:18 ./
drwxrwxr-x 3 socurites socurites 4096 3월 10 10:12 ../
-rw-rw-r-- 1 socurites socurites 6738 11월 23 16:28 corresp.txt
-rw-rw-r-- 1 socurites socurites 35884 11월 20 04:16 deploy.prototxt
-rw-rw-r-- 1 socurites socurites 395 11월 19 01:46 final.json
-rw-rw-r-- 1 socurites socurites 40791 11월 20 04:16 googlenet.prototxt
-rw-rw-r-- 1 socurites socurites 295 11월 20 04:16 googlenet_solver.prototxt
-rw-rw-r-- 1 socurites socurites 602126 11월 16 00:22 mean.binaryproto
-rw-rw-r-- 1 socurites socurites 50682165 11월 26 06:03 model_iter_300000.caffemodel
-rw-rw-r-- 1 socurites socurites 50661476 11월 26 06:03 model_iter_300000.solverstate
-rw-rw-r-- 1 socurites socurites 13896 11월 20 04:16 model.json
$ head clothing/corresp.txt
303 camisole
302 array, raiment, regalia
301 tricorn, tricorne
300 crash helmet
299 ensemble
298 robe
297 seat belt, seatbelt
296 parka, windbreaker, windcheater, anorak

• Registering the pre-trained clothing classification service
$ curl -X PUT "http://localhost:8080/services/clothing" -d '{
"mllib":"caffe",
"description":"clothes classification",
"parameters":{
"input":{
"connector":"image",
"height":224,
"width":224
},
"mllib":{
"nclasses":304
}
},
"model":{
"repository":"../../models/dd/clothing"
}
}' | json_pp

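Registering a pre-trained text classification model
• Registering the pre-trained n20 classification service
– Services registered with the dede (DeepDetect) server are held in memory only
– They must be registered again whenever the server restarts
– Register the n20 model trained earlier as a service
– Note that this differs from the service registration used for training
$ curl -X PUT "http://localhost:8080/services/n20" -d '{
"mllib":"caffe",
"description":"newsgroup classification service",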
"parameters":{
"input":{
"connector":"txt"
},
"mllib":{
"nclasses":20
}
},
"model":{
"repository":"../../models/n20"
}
}' | json_pp
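Because registered services are lost on restart, it can help to keep their definitions in one place and replay them. The sketch below only prepares the requests and does not send them; the `SERVICES` table, the function name `registration_requests`, and the base URL default are hypothetical, with the field values copied from the registration calls in this deck (the description strings are free text).

```python
import json

# Service definitions copied from the registration calls in this deck.
SERVICES = {
    "clothing": {
        "mllib": "caffe",
        "description": "clothes classification",
        "parameters": {
            "input": {"connector": "image", "height": 224, "width": 224},
            "mllib": {"nclasses": 304},
        },
        "model": {"repository": "../../models/dd/clothing"},
    },
    "n20": {
        "mllib": "caffe",
        "description": "newsgroup classification service",
        "parameters": {
            "input": {"connector": "txt"},
            "mllib": {"nclasses": 20},
        },
        "model": {"repository": "../../models/n20"},
    },
}


def registration_requests(services, base_url="http://localhost:8080"):
    """Return (method, url, body) for each service to re-register after a restart."""
    return [
        ("PUT", "%s/services/%s" % (base_url, name), json.dumps(services[name]))
        for name in sorted(services)
    ]
```

Each tuple corresponds to one `curl -X PUT` call of the kind shown above.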

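Separating the training server from the production server
• Server layout
[Figure: server layout. Labels: training server; serving server]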

Auto-tagging on Elasticsearch
DeepDetect + Elasticsearch

• Indexing the image classification prediction as a document
– url: Elasticsearch indexing URL
$ curl -XPOST "http://localhost:8080/predict" -d'
{
"service":"clothing",
"parameters":{
"mllib":{
"gpu":true
},
"input":{
"width":224,
"height":224
},
"output":{
"best":3,
"network":{
"url":"http://localhost:9200/images/img",
"http_method":"POST"
}
}
},
"data":[
"http://i.ytimg.com/vi/0vxOhd4qlnA/maxresdefault.jpg"
]
}' | json_pp
{
"created" : true,
"_type" : "img",
"_index" : "images",
"_version" : 1,
"_id" : "AVNeQmBY4KIRwyO4ideH"
}
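The request above differs from an ordinary /predict call only in the `network` object under `output`. A sketch that builds this body follows; the function name `build_autotag_request` and its argument names are hypothetical, and only the fields that appear in the request above are used.

```python
import json


def build_autotag_request(service, image_url, es_index_url,
                          best=3, size=224, gpu=True):
    """Build a /predict body whose result is posted to Elasticsearch."""
    return {
        "service": service,
        "parameters": {
            "mllib": {"gpu": gpu},
            "input": {"width": size, "height": size},
            "output": {
                "best": best,
                # DeepDetect sends the prediction to this URL instead of
                # returning it, as the Elasticsearch reply above suggests.
                "network": {"url": es_index_url, "http_method": "POST"},
            },
        },
        "data": [image_url],
    }


request = build_autotag_request(
    "clothing",
    "http://i.ytimg.com/vi/0vxOhd4qlnA/maxresdefault.jpg",
    "http://localhost:9200/images/img")
body = json.dumps(request)
```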

• Indexing result
$ curl -XGET 'http://localhost:9200/images/img/_search' | json_pp
…
{
"_source" : {
"body" : {
"predictions" : {
"classes" : [
{
"cat" : "spacesuit",
"prob" : 0.935437917709351
},
{
"cat" : "military uniform",
"prob" : 0.0494481474161148
}
],
"uri" : "http://i.ytimg.com/vi/0vxOhd4qlnA/maxresdefault.jpg"
}
},
"network" : {
"http_method" : "POST",
"url" : "http://localhost:9200/images/img"
},
"head" : {
"time" : 1875,
"service" : "clothing",
"method" : "/predict"
},
},
…

• Searching
– Search for images that contain a spacesuit
$ curl -XGET "http://localhost:9200/images/_search?q=spacesuit" | json_pp
…
"hits" : {
"hits" : [
{
"_id" : "AVNeNkfE4KIRwyO4idNw",
"_source" : {
"status" : {
"msg" : "OK",
"code" : 200
},
"head" : {
"time" : 1330,
"method" : "/predict",
"service" : "clothing"
},
"network" : {
"url" : "http://localhost:9200/images/img",
"http_method" : "POST"
},
"body" : {
"predictions" : {
"classes" : [
{
"prob" : 0.935437917709351,
"cat" : "spacesuit"
},
{
"prob" : 0.0494481474161148,
"cat" : "military uniform"
…
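Search hits shaped like the ones above can be reduced to the matching images and their probabilities. This sketch assumes the `_source` layout shown in this section, with `classes` as a list; the function name `matching_images` is hypothetical.

```python
def matching_images(search_response, tag):
    """Return [(doc_id, uri, prob)] for hits whose predictions include `tag`."""
    results = []
    for hit in search_response["hits"]["hits"]:
        predictions = hit["_source"]["body"]["predictions"]
        for entry in predictions["classes"]:
            if entry["cat"] == tag:
                results.append((hit["_id"], predictions["uri"], entry["prob"]))
    return results
```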

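Auto-tagging
• Tagging documents automatically
– Request a class prediction for a document (image, text, and so on)
– Include the predicted classes in the document and index it on the Elasticsearch server
– Send search requests against the indexed documents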
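The three steps above can be tied together in a few lines. This is a sketch with stand-ins rather than real clients: `predict` and `index` are hypothetical callables supplied by the caller, the `tags` field name and the `min_prob` threshold are assumptions, and `search` filters an in-memory list instead of querying Elasticsearch.

```python
def auto_tag(documents, predict, index, min_prob=0.05):
    """Predict classes for each document, attach them as tags, and index it."""
    for doc in documents:
        classes = predict(doc["uri"])  # step 1: class prediction request
        tagged = dict(doc)
        tagged["tags"] = [c["cat"] for c in classes if c["prob"] >= min_prob]
        index(tagged)                  # step 2: index the document with its tags


def search(indexed, tag):
    """Step 3: return the indexed documents that carry `tag`."""
    return [doc for doc in indexed if tag in doc["tags"]]
```

Passing the prediction and indexing steps in as callables keeps the flow testable without a running DeepDetect or Elasticsearch server.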

References
• Categorizing images with deep learning into Elasticsearch
https://www.elastic.co/blog/categorizing-images-with-deep-learning-into-elasticsearch
• DeepDetect Homepage
http://www.deepdetect.com/
• DeepDetect: Installing DeepDetect
http://www.deepdetect.com/overview/installing/
• DeepDetect: Setup of an image classifier
http://www.deepdetect.com/tutorials/imagenet-classifier/
• DeepDetect: Training a model from text
http://www.deepdetect.com/tutorials/txt-training/
• DeepDetect: Application-Ready Deep Neural Net Models
http://www.deepdetect.com/applications/model/
