Mining The Social Web
Ch8 Blogs et al.: Natural Language Processing (and Beyond) Ⅰ

Presenter: 김연기
Naver cafe: "People Who Dream of Becoming Architects" (네이버 아키텍트를 꿈꾸는 사람들)
http://Cafe.naver.com/architect1
Natural Language Processing
• Let's split text into sentences at periods!
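
Splitting at every period sounds simple, but it breaks as soon as the text contains an abbreviation. The sketch below (Python 3) illustrates the problem; the sample text and the function name `split_on_periods` are assumptions made for illustration and do not come from the slides.

```python
# Sample text chosen for illustration only.
text = "Mr. Green killed Colonel Mustard. Mr. Green is not a nice fellow."


def split_on_periods(text):
    """Naive EOS detection: treat every period as a sentence boundary."""
    return [part.strip() for part in text.split(".") if part.strip()]


fragments = split_on_periods(text)
for fragment in fragments:
    print(repr(fragment))
```

The text holds two sentences, yet the naive split yields four fragments because each "Mr." is cut off as if it ended a sentence. This is why the pipeline treats end-of-sentence detection as a step of its own.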
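NLP Pipeline With NLTK
• End-of-sentence (EOS) detection
• Tokenization (splitting sentences into words)
• Part-of-speech (POS) tagging
• Assigning meaning to words
• Extraction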
Natural Language Processing
• End-of-sentence (EOS) detection
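
As a rough illustration of what EOS detection has to handle, here is a minimal Python 3 sketch that only ends a sentence at a period when the word before it is not a known abbreviation. The abbreviation set and the function name are hypothetical and deliberately tiny; in practice a trained tokenizer such as the one behind NLTK's `sent_tokenize` would take this role.

```python
# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"Mr", "Mrs", "Dr"}


def split_sentences(text):
    """Close a sentence at a token ending in '.', unless it is an abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token[:-1] not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that lacks a final period
        sentences.append(" ".join(current))
    return sentences


sample = "Mr. Green killed Colonel Mustard. Mr. Green is not a nice fellow."
sentences = split_sentences(sample)
print(sentences)
```

A fixed list like this cannot cover every abbreviation, decimal number, or ellipsis, which is the motivation for using a statistical sentence tokenizer instead.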
Natural Language Processing
• Part-of-speech (POS) tagging
Natural Language Processing
• Extraction
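
One simple way to approach extraction, sketched here in Python 3, is to merge runs of consecutive tokens that share the same noun tag into a single candidate entity. The tagged tokens below are hand-written sample data standing in for a POS tagger's output, and `extract_noun_chunks` is a hypothetical helper name; neither comes from the slides.

```python
# Hand-written (token, tag) pairs standing in for real POS tagger output.
tagged = [("Mr.", "NNP"), ("Green", "NNP"), ("killed", "VBD"),
          ("Colonel", "NNP"), ("Mustard", "NNP"), ("in", "IN"),
          ("the", "DT"), ("study", "NN"), (".", ".")]


def extract_noun_chunks(tagged_tokens):
    """Group consecutive tokens that share the same NN* tag into one chunk."""
    chunks = []
    current_tokens, current_tag = [], None
    for token, tag in tagged_tokens:
        if tag.startswith("NN") and tag == current_tag:
            current_tokens.append(token)  # extend the running chunk
            continue
        if current_tokens:  # the run ended, so store it
            chunks.append((" ".join(current_tokens), current_tag))
        if tag.startswith("NN"):
            current_tokens, current_tag = [token], tag  # start a new run
        else:
            current_tokens, current_tag = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_tag))
    return chunks


entities = extract_noun_chunks(tagged)
print(entities)
```

The result quality depends entirely on the tags it is given, so errors made during POS tagging carry straight through to the extracted entities.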
Natural Language Processing
import feedparser
from bs4 import BeautifulSoup

def cleanHtml(html):
    # get_text() strips the markup and returns text with HTML entities decoded
    return BeautifulSoup(html, 'html.parser').get_text()

# FEED_URL must hold the address of the blog feed to fetch
fp = feedparser.parse(FEED_URL)
print("Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title))

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})
Natural Language Processing
# Basic stats (fdist is an nltk.FreqDist built from the post's words)
num_words = sum([i[1] for i in fdist.items()])
num_unique_words = len(fdist.keys())

# Hapaxes are words that appear only once
num_hapaxes = len(fdist.hapaxes())

# most_common() yields (word, count) pairs, most frequent first
top_10_words_sans_stop_words = [w for w in fdist.most_common()
                                if w[0] not in stop_words][:10]

print(post['title'])
print('\tNum Sentences:'.ljust(25), len(sentences))
print('\tNum Words:'.ljust(25), num_words)
print('\tNum Unique Words:'.ljust(25), num_unique_words)
print('\tNum Hapaxes:'.ljust(25), num_hapaxes)
print('\tTop 10 Most Frequent Words (sans stop words):\n\t\t',
      '\n\t\t'.join(['%s (%s)' % (w[0], w[1])
                     for w in top_10_words_sans_stop_words]))
print()
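
The statistics above rely on a frequency distribution that the slide does not build. As a self-contained illustration, the following Python 3 sketch computes the same figures with `collections.Counter` from the standard library as a stand-in for NLTK's frequency distribution. The token list and the stop-word set are made-up sample data.

```python
from collections import Counter

# Made-up sample data for illustration.
tokens = "the cat sat on the mat and the dog sat too".split()
stop_words = {"the", "on", "and"}  # hypothetical, tiny stop-word list

counts = Counter(tokens)

num_words = sum(counts.values())   # total number of tokens
num_unique_words = len(counts)     # number of distinct tokens
hapaxes = sorted(w for w, c in counts.items() if c == 1)  # words seen once

# most_common() returns (word, count) pairs ordered by descending count.
top_words = [(w, c) for w, c in counts.most_common()
             if w not in stop_words][:3]

print("Num Words:", num_words)
print("Num Unique Words:", num_unique_words)
print("Hapaxes:", hapaxes)
print("Top words (sans stop words):", top_words)
```

Filtering stop words before taking the top entries matters here: without the filter, "the" would rank first even though it says nothing about the content.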
Natural Language Processing
import numpy

# Summarization Approach 1:
# Filter out non-significant sentences by using the average score plus a
# fraction of the std dev as a filter
avg = numpy.mean([s[1] for s in scored_sentences])
std = numpy.std([s[1] for s in scored_sentences])
mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
               if score > avg + 0.5 * std]

# Summarization Approach 2:
# Another approach would be to return only the top N ranked sentences
top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
# Sort the selected sentences by index to restore document order
top_n_scored = sorted(top_n_scored, key=lambda s: s[0])
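
To make the two selection strategies concrete, here is a small Python 3 sketch run on made-up scores. It uses `statistics.pstdev` from the standard library, which computes the population standard deviation just as `numpy.std` does by default. The score values and `TOP_SENTENCES = 2` are assumptions chosen for illustration.

```python
import statistics

# Made-up (sentence_index, score) pairs for illustration.
scored_sentences = [(0, 1.0), (1, 4.0), (2, 2.0), (3, 9.0), (4, 5.0)]
TOP_SENTENCES = 2  # hypothetical setting

scores = [score for _, score in scored_sentences]
avg = statistics.mean(scores)
std = statistics.pstdev(scores)  # population std dev, like numpy.std

# Approach 1: keep sentences scoring above mean + 0.5 * std dev.
mean_scored = [(idx, score) for idx, score in scored_sentences
               if score > avg + 0.5 * std]

# Approach 2: keep the N best sentences, then restore document order.
top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
top_n_scored = sorted(top_n_scored, key=lambda s: s[0])

print("threshold:", avg + 0.5 * std)
print("mean_scored:", mean_scored)
print("top_n_scored:", top_n_scored)
```

The threshold approach returns a variable number of sentences (possibly none for very uniform scores), while the top-N approach always returns a summary of fixed length.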
Natural Language Processing
– Luhn's Summarization Algorithm
  • Score = (number of important words in the sentence)^2 / (total number of words in the sentence)
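
The score formula on this slide can be sketched in Python 3 as follows. This is a simplified per-sentence reading of the formula; Luhn's original method scores clusters of significant words inside each sentence. The function name, the sample sentence, and the set of important words are all assumptions for illustration.

```python
def sentence_score(sentence_tokens, important_words):
    """Score = (important words in the sentence)^2 / (total words in the sentence)."""
    if not sentence_tokens:
        return 0.0  # avoid dividing by zero for an empty sentence
    significant = sum(1 for token in sentence_tokens
                      if token.lower() in important_words)
    return significant ** 2 / len(sentence_tokens)


# Made-up sample data: in practice the important words would be the most
# frequent non-stop-words of the document.
important_words = {"nltk", "sentence", "scoring"}
tokens = ["NLTK", "makes", "sentence", "scoring", "simple"]

score = sentence_score(tokens, important_words)
print(score)  # 3 important words out of 5, so 9 / 5
```

Squaring the count rewards sentences that are dense in important words, while dividing by the sentence length keeps long sentences from winning on size alone.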

