2023/3/6
Characterizing English Variation across Social Media Communities with BERT
YongSang Yoo (유용상)
TACL 2021
Introduction
• Much previous work characterizing language variation across Internet social groups has focused on the types of words used by these groups.
• This work employs BERT to characterize variation in the senses of words as well, analyzing two months of English comments in 474 Reddit communities.
Related Work
• Online language contains an abundance of “nonstandard” words (Rotabi and Kleinberg, 2016).
• Online communities’ linguistic norms and differences are often defined by which words are used (Zhang et al., 2017).
• BERT’s strength at capturing word senses presents a new opportunity to measure semantic variation in online communities of practice (Devlin et al., 2019).
• Different senses tend to be segregated into different regions of BERT’s embedding space (Wiedemann et al., 2019).
Data
• Select the top 500 most popular subreddits by number of comments, then filter out some subreddits (474 communities are analyzed)
• Randomly sample 80,000 comments
• Exclude terms that are too general, i.e. not specific to a community
• Remove 1044 multi-word expressions from the analysis
• 2226 unique glossary words
Word Sense Disambiguation
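The Korean word 배 has several unrelated senses; disambiguation assigns each occurrence to a known sense:
1. 배가 불러서 더 이상 못 먹겠다. ("My belly is full, so I can't eat any more." 배 = belly)
2. 올해에는 배가 풍년이다. ("There is a bumper crop of pears this year." 배 = pear)
3. 내가 너보다 몇 배는 더 빠르다. ("I am several times faster than you." 배 = times)
4. 사촌이 땅을 사면 내 배가 아프다. ("When my cousin buys land, my belly aches." 배 = belly)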
Word Sense Induction
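Induction groups the occurrences by sense without a predefined sense inventory:
1. 배가 불러서 ... (belly)
4. 내 배가 아프다 ... (belly)
3. 몇 배는 더 ... (times)
2. ... 배가 풍년이다. (pear)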
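Word sense induction can be sketched as unsupervised grouping of per-occurrence context vectors. The example below is only an illustration under stated assumptions, not the method from the paper: the 3-D vectors are invented stand-ins for contextual embeddings, and `induce_senses` is a hypothetical helper that greedily groups occurrences by cosine similarity against an arbitrary threshold.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def induce_senses(vectors, threshold=0.9):
    """Greedy grouping: an occurrence joins the first cluster whose first
    member is similar enough, otherwise it starts a new induced sense."""
    clusters = []  # each cluster is a list of occurrence ids
    for occ_id, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(occ_id)
                break
        else:
            clusters.append([occ_id])
    return clusters

# Invented toy vectors for the four example sentences (assumption:
# real vectors would come from a contextual model such as BERT).
toy_vectors = {
    1: (0.9, 0.1, 0.0),  # belly
    2: (0.0, 0.1, 0.9),  # pear
    3: (0.1, 0.9, 0.1),  # times
    4: (0.8, 0.2, 0.1),  # belly
}

senses = induce_senses(toy_vectors)
print(senses)
```

With these toy vectors, sentences 1 and 4 fall into one group while 2 and 3 each form their own, mirroring the grouping shown on the slide.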
Methods for Identifying Community-Specific Language: Type
• Focus on lexical choice, examining the word types unique to a community
• Pointwise mutual information (PMI)
• TF-IDF
• TextRank
• Jensen-Shannon divergence (JSD)
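Two of the type-based measures listed above can be sketched in a few lines. This is a minimal illustration with made-up word counts, not the exact formulation used in the paper: `pmi` scores how much more likely a word is inside one community than overall, and `jsd` compares two communities' word distributions. Both helper names and the toy counts are assumptions, and the functions expect `collections.Counter` inputs so that missing words count as zero.

```python
import math
from collections import Counter

def pmi(word, community_counts, global_counts):
    """PMI(word; community) = log2( P(word | community) / P(word) )."""
    p_w_given_c = community_counts[word] / sum(community_counts.values())
    p_w = global_counts[word] / sum(global_counts.values())
    return math.log2(p_w_given_c / p_w)

def jsd(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2) between two word distributions."""
    vocab = set(p_counts) | set(q_counts)
    p_total, q_total = sum(p_counts.values()), sum(q_counts.values())
    divergence = 0.0
    for w in vocab:
        p = p_counts[w] / p_total
        q = q_counts[w] / q_total
        m = (p + q) / 2  # mixture distribution
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence

# Made-up counts for two hypothetical communities.
community = Counter({"ow": 4, "game": 4})
other = Counter({"game": 8})
everything = community + other

ow_pmi = pmi("ow", community, everything)
divergence = jsd(community, other)
print(ow_pmi, divergence)
```

A positive PMI marks a word as over-represented in the community; JSD is 0 for identical distributions and at most 1 with base-2 logarithms.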
Methods for Identifying Community-Specific Language: Meaning
• Example: "OW" means Overwatch in r/overwatch, Off-White in r/sneakers, and Opening Week in r/BoxOffice
• BERT embeddings
• Clustering representatives that contain word substitutes predicted by BERT
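The substitute-based idea can be illustrated with a small sketch. Everything here is an assumption for demonstration: the substitute sets are hand-written rather than predicted by BERT, the occurrence ids are invented, and `cluster_by_substitutes` is a hypothetical single-link grouping by Jaccard overlap, which is a simplification and not necessarily the clustering procedure used in the paper.

```python
def jaccard(a, b):
    """Overlap between two substitute sets: intersection size over union size."""
    return len(a & b) / len(a | b)

def cluster_by_substitutes(representatives, threshold=0.3):
    """Single-link grouping: occurrences whose substitute sets overlap by
    at least `threshold` end up in the same induced sense."""
    ids = list(representatives)
    parent = {i: i for i in ids}  # union-find forest

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for x in range(len(ids)):
        for y in range(x + 1, len(ids)):
            a, b = ids[x], ids[y]
            if jaccard(representatives[a], representatives[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hand-written substitutes for occurrences of "OW" (assumption: in practice
# these would be a masked language model's predictions for the masked word).
representatives = {
    "overwatch_1": {"overwatch", "game", "fps"},
    "overwatch_2": {"game", "overwatch", "shooter"},
    "sneakers_1": {"nike", "brand", "offwhite"},
    "sneakers_2": {"brand", "offwhite", "jordan"},
    "boxoffice_1": {"weekend", "debut", "opening"},
}

ow_senses = cluster_by_substitutes(representatives)
print(ow_senses)
```

Occurrences from the same community share substitutes and are grouped together, so "OW" ends up with three induced senses in this toy data.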
Methods for Identifying Community-Specific Language: Meaning (Cont’d)
Evaluation
Conclusion
• This work sets a foundation for further investigation of how BERT could help define unknown words or meanings in niche communities.
• Future work could develop annotated WSI datasets for online language, similar to the standard SemEval benchmarks.
Thank you
