LAZY MAN’S LEARNING
How to Build Your Own Text Summarizer
Sho Fola Soboyejo, Digital Architect, Kroger Co.
April 19th, 2018
@shoreason
I’VE GOT A FEVER AND THE ONLY
PRESCRIPTION IS … MORE BOOKS
NATURAL LANGUAGE
PROCESSING (NLP) DOMAINS
• Mostly Solved: Spam detection, part-of-speech tagging, named entity recognition
• Making Progress: Sentiment analysis, coreference
resolution, word sense disambiguation, parsing,
machine translation, information extraction
• Still Really Hard: Question answering, paraphrase, summarization and dialogue
PROBLEMS IN NLP
• Ambiguity: Red Tape Holds Up New Bridges
• Idioms: Get Cold Feet, Dark Horse
• Neologisms: Bromance, Unfriend, Retweet
• Tricky named entities: Where is Black Panther Playing?
• Non-Standard English: #challengeday, @mlmeetup
Stanford NLP: Dan Jurafsky
“HOW CAN YOU
SAY THE MOST
IMPORTANT THINGS
IN THE SHORTEST
AMOUNT OF TIME?”
- Siraj Raval
PRACTICAL APPLICATIONS
FOR SUMMARIZATION
• Headlines (from around the world)
• Outlines (notes for students)
• Minutes (of a meeting)
• Previews (of movies)
• Synopses (soap opera listings)
• Reviews (of a book, CD, movie, etc.)
• Bulletins (weather forecasts/stock market
reports)
• Sound bites (politicians on a current issue)
- Page 1, Advances in Automatic Text
Summarization, 1999.
FORMS OF SUMMARIZATION
Single Document vs Multi Document
APPROACHES
Extractive vs Abstractive
EXTRACTIVE
• Figure out the most important sentences in the document. Then simply extract and order those.
• Same words and sentences as in the document. No abstraction.
• Ranking phrase relevance
ABSTRACTIVE
• Boil down the gist of a
document into an abstract
likely using new words in
summary.
• Very much what you and I
would do.
• Much harder
“IT’S FAR EASIER TO
RECOGNIZE
WORDS THAN IT IS
TO UNDERSTAND
THE MEANING”
- Laura Klein (Design for Voice
Interfaces)
SPEED READING TIPS
• 1st and last sentence
(Order in text)
• Title and other paragraphs
(Connection to other
sentences)
• Index (Word Frequency)
• Focus on Keywords
BASIC CLEAN UP EXPECTED
• Remove Stop Words
• Stemming
• Lower case
• Remove Punctuation
• Remove Numbers
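The clean-up steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `STOP_WORDS` is a tiny made-up list (real lists are far longer), and `crude_stem` is a hypothetical stand-in for a proper stemmer such as Porter's. Lower-casing is done first so that the stop-word lookup works.

```python
import re

# Assumption: a tiny illustrative stop-word list, not a real one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "was", "of", "to", "in", "on", "it"}


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Lower-case, drop numbers and punctuation, remove stop words, then stem."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


print(clean("The 2 cats are jumping over 10 walls!"))
# ['cat', 'jump', 'over', 'wall']
```

A crude suffix stripper like this will mangle some words; it is here only to show where stemming sits in the pipeline.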
STAGES
CONTENT SELECTION
▸ Sentence Segmentation
▸ Sentence Extraction
▸ Sentence weight
INFORMATION ORDERING
▸ Document order
SENTENCE REALIZATION
▸ Keep original sentences
▸ Sentence simplification
SUMMARY OPTIONS
Algorithmia
Gensim (summarization)
OFF TO THE RACES
Algorithmia &
Gensim in Action
NAIVE ALGORITHM
• Determine the most frequent content words in the original document (word frequency table)
• Store and sort the N most common words (N = 100)
• Score each sentence based on how many high-frequency words it contains
• Build the summary by compiling sentences above a certain score threshold
• Select the top N sentences and sort them by their order in the original text
https://koko-summarizer.herokuapp.com/content
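A minimal sketch of the naive frequency-based steps listed above, not the code behind the demo. The names `STOP_WORDS`, `split_sentences`, `content_words`, `top_words` and `top_sentences` are assumptions made for illustration, sentence splitting is a naive regex, and the sketch takes the top-N route rather than a score threshold.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "was", "of", "to", "in", "on", "it"}


def split_sentences(text):
    """Naive segmentation: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lower-cased words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, top_words=100, top_sentences=2):
    sentences = split_sentences(text)
    # Word frequency table over the whole document.
    freq = Counter(w for s in sentences for w in content_words(s))
    keywords = {w for w, _ in freq.most_common(top_words)}
    # Score = number of high-frequency words in the sentence.
    scored = [(sum(1 for w in content_words(s) if w in keywords), i)
              for i, s in enumerate(sentences)]
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_sentences]
    # Restore original document order for the selected sentences.
    return [sentences[i] for i in sorted(i for _, i in best)]


text = ("Cats sleep a lot. Cats and dogs play in the garden. "
        "The weather was mild. Dogs chase cats around the garden.")
print(summarize(text, top_words=3, top_sentences=2))
# ['Cats and dogs play in the garden.', 'Dogs chase cats around the garden.']
```

With a short text and a large `top_words`, nearly every word counts as a keyword, so longer sentences tend to win; a small `top_words` makes the effect of the frequency table easier to see.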
NAIVE 1.0
ALGORITHM
IN
ACTION
NAIVE EXTRACTIVE
ALGORITHM 2.0
• Compare each sentence in the document against every other sentence and determine their intersection
• [0][2] = intersection score of comparing sentence 1 to sentence 3
• Treat each sentence as a node; the intersection score between two sentences is the weight of the edge connecting their nodes
• Calculate the score of each sentence/node as a key-value pair {sentence: nodeScore}
• NodeScore = sum of all intersections with other sentences, excluding itself, i.e. the sum of all edges connected to the node
• Split the text into paragraphs and pick the best sentence in each paragraph. Essentially, treat each paragraph as a subset of the graph and pick the best node in each subset
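The steps above could look roughly like the sketch below. It is an illustration under assumptions, not the code the talk builds on: the function names are made up here, sentences are split with a naive regex, paragraphs are assumed to be separated by a blank line, and each sentence is reduced to the set of its unique lower-cased words.

```python
import re


def split_sentences(text):
    """Naive segmentation: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_set(sentence):
    return set(re.findall(r"[a-z']+", sentence.lower()))


def intersection_score(a, b):
    """Shared words divided by the average number of unique words."""
    s1, s2 = word_set(a), word_set(b)
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2)


def node_scores(sentences):
    """Return {sentence: nodeScore}, the sum of edge weights to all other sentences."""
    n = len(sentences)
    # matrix[0][2] is the intersection score of sentence 1 and sentence 3.
    matrix = [[intersection_score(sentences[i], sentences[j]) for j in range(n)]
              for i in range(n)]
    return {sentences[i]: sum(matrix[i][j] for j in range(n) if j != i)
            for i in range(n)}


def summarize(text):
    """Pick the highest-scoring sentence from each paragraph."""
    paragraphs = [split_sentences(p) for p in text.split("\n\n") if p.strip()]
    scores = node_scores([s for p in paragraphs for s in p])
    return [max(p, key=lambda s: scores[s]) for p in paragraphs if p]


text = ("Cats chase mice. Cats chase birds and mice. Rain fell.\n\n"
        "Dogs chase cats. Snow melted.")
print(summarize(text))
# ['Cats chase mice.', 'Dogs chase cats.']
```

Because scores are keyed by sentence text, two identical sentences share one entry; that mirrors the {sentence: nodeScore} pair on the slide but would need care in a real implementation.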
SENTENCE INTERSECTIONS
• s1 = "my friend's car is nicer than mine but my wife is way more beautiful"
• s2 = "my wife is more beautiful and has brown eyes"
• s1.intersection(s2) = {'is', 'wife', 'beautiful', 'my', 'more'}, treating each sentence as the set of its unique words
• Intersection score = len(s1.intersection(s2)) / ((len(s1) + len(s2)) / 2) = 5 / ((12 + 9) / 2) = .4762
• Lower score means less similarity, higher score means more similarity
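The score on this slide can be checked directly in Python. The only assumption is that each sentence is split on whitespace and turned into a set, so repeated words such as "my" and "is" count once; that is what makes the numbers come out to .4762.

```python
# Each sentence becomes the set of its unique whitespace-separated words.
s1 = set("my friend's car is nicer than mine but my wife is way more beautiful".split())
s2 = set("my wife is more beautiful and has brown eyes".split())

common = s1.intersection(s2)
# Shared words divided by the average size of the two sets.
score = len(common) / ((len(s1) + len(s2)) / 2)

print(sorted(common))   # ['beautiful', 'is', 'more', 'my', 'wife']
print(len(s1), len(s2)) # 12 9
print(round(score, 4))  # 0.4762
```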
GRAPH THEORY IMPLICATIONS
[Figure: graph of sentences as nodes connected by weighted edges]
WHY THIS MIGHT WORK
• Again, a paragraph can be treated as a self-contained piece of a text
• Sentences with a strong intersection likely hold the same or very similar information
• A sentence that intersects with many other sentences is likely key to the text
NAIVE 2.0
ALGORITHM
IN
ACTION
built on code by Shlomi Babluki
https://koko-summarizer.herokuapp.com/content
GOING MUCH FURTHER
• Bi-Grams
• TF-IDF (frequent in a
document but not across
documents)
• Including Title
• Apply stemming
• RNN (Recurrent Neural
Network)
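Two of the ideas above, bi-grams and TF-IDF, are easy to sketch with the standard library. The formula below (term count divided by document length, times log(N / document frequency)) is one common variant chosen here for illustration; libraries use slightly different smoothing. The function names are assumptions.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Adjacent word pairs, e.g. ['a', 'b', 'c'] -> [('a', 'b'), ('b', 'c')]."""
    return list(zip(tokens, tokens[1:]))


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]
weights = tf_idf(docs)
print(weights[0]["the"])                      # 0.0, appears in every document
print(weights[0]["cat"] > weights[0]["the"])  # True, specific to one document
print(bigrams(["text", "summarizer", "demo"]))
```

A word that appears in every document gets weight 0, which is the "frequent in a document but not across documents" idea from the bullet.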
GOAL
Train an encoder-decoder recurrent neural network
with LSTM units and attention for generating
summaries using the texts of news articles from the
Gigaword dataset
WHAT IS A NEURAL
NETWORK?
• Modeled after the human brain
(neurons) and nervous system
• Built from neuron-like units arranged in input, hidden and output layers
• The network initializes with a guess and then learns, adjusting as more data passes through it
• Deep learning uses a neural network with more hidden layers
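To make "starts with a guess, then adjusts" concrete, here is the smallest possible case: a single artificial neuron (a perceptron, with no hidden layer) learning the logical OR function. It is a toy sketch, far simpler than the encoder-decoder network described in the goal, and the names and the learning rate are assumptions chosen for illustration.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train(samples, epochs=10, learning_rate=1.0):
    weights, bias = [0.0, 0.0], 0.0  # the initial "guess"
    for _ in range(epochs):
        for inputs, target in samples:
            # Adjust only when the prediction is wrong (error is -1, 0 or 1).
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Truth table for logical OR: (inputs, expected output).
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(OR_SAMPLES)
print([predict(weights, bias, inputs) for inputs, _ in OR_SAMPLES])
# [0, 1, 1, 1]
```

Networks with hidden layers follow the same loop of predict, measure the error, adjust, but use gradient-based updates across many units.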
NEURAL NETWORKS (WHITE
PAPERS)
SEQ TO SEQ LEARNING
Courtesy: Quoc V. Le & Mike Schuster, Research Scientists,
Google Brain Team
SALESFORCE PAPER
https://www.salesforce.com/products/einstein/ai-research/tl-dr-reinforced-model-abstractive-summarization/
BRINGING IT TOGETHER
• Abstractive: Neural Networks
• Extractive: Algorithmia, Gensim, Naive 1.0 and 2.0
GETTING STARTED
• Try out Algorithmia and
Gensim
• Fork my GitHub code and try
your hand at Naive 3.0
• Explore some NLP and
Machine Learning intro
courses
• Check out the White Papers
I referenced in this talk
ACCESS TO RICH DATASETS
• CNN/Daily Mail Stories (Kyunghyun Cho)
• https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ
• BBC Stories
• http://mlg.ucd.ie/
• Annotated English Gigaword
• https://catalog.ldc.upenn.edu/LDC2012T21
Look out for the deck on SlideShare
@shoreason
www.shoreason.com
github.com/shoreason
