🧪 Knowledge Distillation
Presenter: YongSang Yoo (유용상)
Date: February 23, 2023
Domain: Other

Contents
What is Knowledge Distillation?
Why is it needed?
Distilling the Knowledge in a Neural Network (NIPS 2014)
Procedure
Soft Labels
Distillation Loss
Various KD models
DistilBERT (NIPS 2019)
TinyBERT (EMNLP 2020)
SEED: Self-Supervised Distillation for Visual Representation (ICLR 2021)
References
Knowledge + Distillation
→ The process of transferring knowledge distilled from a Teacher Network into a Student Network
Why is it needed?
When it was first introduced → the argument was that it is needed for model deployment
Today → an actively studied area, pursued for many reasons: building lightweight models, reducing the resources required during training, and more!
Distilling the Knowledge in a Neural Network (NIPS 2014)
The paper that first defined the concept of KD
Because deploying a complex model (e.g., an ensemble) to users is difficult, KD is used to transfer what the large model has learned to a small model, and the performance of the receiving model is then evaluated
Dataset used: MNIST (multi-class classification)
Procedure
Train the Teacher Network
▼
Extract Soft Labels (soft outputs, "dark knowledge") from the Teacher Network
▼
Combine the term that matches the extracted knowledge with the CE loss between the Student model's predictions and the ground truth to form the Distillation Loss
Soft Label
Suppose a typical classification model distinguishes 🐮, 🐶, 😺, 🚗.
Ground truth (hard label, original target): (one-hot vector, shown in the slide figure)
The classification model's inferred output: (probability distribution, shown in the slide figure)
The paper focuses not on the probability of the correct class but on the remaining values, and calls them Dark Knowledge
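For intuition, a hypothetical 4-class example (the numbers are illustrative, not taken from the paper):

```latex
% Hypothetical example for classes (🐮, 🐶, 😺, 🚗); values are illustrative only.
y_{\text{hard}} = (0,\ 1,\ 0,\ 0), \qquad
p_{\text{teacher}} = (0.08,\ 0.85,\ 0.06,\ 0.01)
```

The relative sizes of the non-target probabilities (🐮 and 😺 much larger than 🚗) encode how similar the teacher considers each wrong class to the input, which is exactly the information a one-hot label discards.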
But!! The softmax function commonly used for classification makes large values larger and small values smaller
So, to extract the Teacher Model's dark knowledge properly, the output distribution needs to be made a bit softer!!
A temperature T is added to the ordinary softmax: higher T → softer, lower T → harder
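Concretely, the temperature-scaled softmax from the paper, with z_i the logits and T the temperature:

```latex
q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
```

T = 1 recovers the ordinary softmax; larger T flattens (softens) the distribution, while smaller T sharpens it.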
Distillation Loss
🧐: How do we actually teach the Student Model the knowledge extracted from the Teacher Model??
→ Train the Student Model to reproduce the Teacher Model's Soft Labels! (KD loss)
🧐: But doesn't training only on Soft Labels produce a model that merely imitates the 'distribution' of the Teacher Model's outputs rather than predicting the correct label??
→ Add a CE loss that pulls the Student model's predictions toward the ground truth (hard label)
The sum of these two losses is defined as the final loss function
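A minimal PyTorch-style sketch of this combined loss, assuming the common formulation with a KL-divergence soft-target term (equivalent to soft cross-entropy up to a constant); the temperature T and the weight alpha are illustrative hyperparameters, not values taken from the slides.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-target term + hard-label CE; T and alpha are illustrative."""
    # Soften both output distributions with temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # Soft-target term: KL divergence between the softened distributions.
    # The T**2 factor keeps its gradient magnitude comparable across temperatures,
    # as noted in the original paper.
    kd_term = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)
    # Hard-label term: ordinary cross-entropy against the ground truth.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term
```

Usage would look like `loss = distillation_loss(student(x), teacher(x).detach(), y)`, with the teacher frozen so only the student receives gradients.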
Experimental results
All digit-3 examples were removed from the MNIST training data and the student model was trained with knowledge distillation → although it never learned from any images of the digit 3, training only on the information carried in the soft labels gave it 98.6% accuracy on test images of 3
The student model shows accuracy comparable to an ensemble of 10 models; considering the cost of ensembling 10 models, knowledge distillation is remarkably effective!!
PyTorch implementation blog:
https://deep-learning-study.tistory.com/700
Various KD models
DistilBERT (NIPS 2019)
Teacher: a pre-trained BERT model (trained with dynamic masking, as in RoBERTa)
Student: token-type embeddings and the pooler removed + the number of layers cut in half
Uses three losses (combined in the sketch at the end of this subsection)
1. Distillation Loss
: CE loss between the soft targets and the soft predictions
2. Masked Language Modeling Loss
: CE loss between the hard targets and the hard predictions
3. Cosine Embedding Loss
: a distance between the Teacher's and Student's hidden state vectors that encourages the two models' states to point in the same direction
→ Compared to base BERT: half as many layers (207 MB model size) + similar (97%) performance + 60% faster inference
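A hedged sketch of how the three terms might be combined; the loss weights, the temperature, and the use of nn.CosineEmbeddingLoss are assumptions made for illustration, not the exact DistilBERT training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distilbert_style_loss(s_logits, t_logits, labels, s_hidden, t_hidden,
                          T=2.0, w_kd=5.0, w_mlm=2.0, w_cos=1.0):
    """Three-term DistilBERT-style loss; weights and temperature are illustrative."""
    # 1. Distillation loss: soft CE (KL form) between teacher and student distributions.
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * (T ** 2)
    # 2. Masked language modeling loss on the hard (masked-token) targets.
    mlm = F.cross_entropy(s_logits.view(-1, s_logits.size(-1)),
                          labels.view(-1), ignore_index=-100)
    # 3. Cosine embedding loss aligning the direction of the hidden states.
    target = torch.ones(s_hidden.size(0) * s_hidden.size(1), device=s_hidden.device)
    cos = nn.CosineEmbeddingLoss()(s_hidden.view(-1, s_hidden.size(-1)),
                                   t_hidden.view(-1, t_hidden.size(-1)), target)
    return w_kd * kd + w_mlm * mlm + w_cos * cos
```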
TinyBERT (EMNLP 2020)
Three losses (the Transformer-layer term is sketched after this list)
1. Transformer Distillation
: learn the Teacher Model's Transformer-layer attention matrices (before normalization, i.e., pre-softmax)
+ learn the Transformer-layer outputs (= hidden states)
2. Embedding-layer Distillation
: learn the Teacher Model's embedding outputs
3. Prediction-layer Distillation
: soft CE loss on the final layer's outputs
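A hedged sketch of the Transformer-layer term for one mapped (student layer, teacher layer) pair: MSE on the pre-softmax attention score matrices plus MSE on the hidden states, with a learnable projection because the student and teacher hidden sizes differ. The class and argument names are illustrative; the prediction-layer term would reuse a temperature-softened CE like the one sketched earlier.

```python
import torch.nn as nn
import torch.nn.functional as F

class TransformerLayerDistill(nn.Module):
    """TinyBERT-style loss for one (student layer, teacher layer) pair; illustrative."""
    def __init__(self, d_student, d_teacher):
        super().__init__()
        # Learnable projection so student hidden states can be compared to the teacher's.
        self.proj = nn.Linear(d_student, d_teacher, bias=False)

    def forward(self, s_attn_scores, t_attn_scores, s_hidden, t_hidden):
        # MSE on un-normalized (pre-softmax) attention score matrices, per head.
        attn_loss = F.mse_loss(s_attn_scores, t_attn_scores)
        # MSE on layer outputs (hidden states) after projecting the student.
        hidden_loss = F.mse_loss(self.proj(s_hidden), t_hidden)
        return attn_loss + hidden_loss
```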
Two distillation stages
1. General Distillation
: perform [Transformer Distillation, Embedding-layer Distillation] against the Teacher Model
2. Task-Specific Distillation (addresses over-parameterization)
a. Data Augmentation
b. Task-Specific Distillation (fine-tuning)
→ 4-layer version: 7.5× smaller and 9.4× faster than BERT_base + 96.8% of its performance
→ 6-layer version: 40% fewer parameters + 2× faster + performance maintained
SEED: Self-Supervised Distillation for Visual Representation (ICLR 2021)
→ introduces KD into contrastive learning
A pre-trained Teacher Model is frozen and distilled into a smaller model
For the features computed from a randomly augmented image, the similarity between the two models' probability scores is measured with CE
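A hedged sketch of this objective under simplified assumptions (the maintained feature queue, the handling of the current teacher feature, and the temperature values are glossed over or made up): soften each model's cosine similarities to a shared queue and take the cross-entropy from the teacher's distribution to the student's.

```python
import torch
import torch.nn.functional as F

def seed_style_loss(s_feat, t_feat, queue, T_s=0.2, T_t=0.07):
    """SEED-style distillation over a shared feature queue; temperatures are illustrative.

    s_feat, t_feat: (B, D) features of the same randomly augmented images.
    queue:          (K, D) features maintained from past teacher outputs.
    """
    s_feat = F.normalize(s_feat, dim=-1)
    t_feat = F.normalize(t_feat, dim=-1)
    queue = F.normalize(queue, dim=-1)
    # Similarity "probability scores" of each model against the shared queue.
    s_logits = s_feat @ queue.t() / T_s          # (B, K)
    t_logits = t_feat @ queue.t() / T_t          # (B, K)
    t_prob = F.softmax(t_logits, dim=-1)
    # Cross-entropy from the teacher's distribution to the student's.
    return -(t_prob * F.log_softmax(s_logits, dim=-1)).sum(dim=-1).mean()
```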
References
https://baeseongsu.github.io/posts/knowledge-distillation/
https://deep-learning-study.tistory.com/699
https://deep-learning-study.tistory.com/700
https://velog.io/@dldydldy75/지식-증류-Knowledge-Distillation
https://syj9700.tistory.com/38
https://3months.tistory.com/436
https://facerain.club/distilbert-paper/
https://littlefoxdiary.tistory.com/64
A good article on representation learning, the broader concept that covers KD, SSL, and more:
https://89douner.tistory.com/339