University of Sheffield, NLP
TwitIE: An Open-Source Information Extraction
Pipeline for Microblog Text
Kalina Bontcheva
Leon Derczynski
Adam Funk
Mark A. Greenwood
Diana Maynard
Niraj Aswani
© The University of Sheffield, 1995-2013
This work is licensed under
the Creative Commons Attribution-NonCommercial-NoDerivs Licence
The Problem
• Running ANNIE on 300 news articles – 87% f-score
• Running ANNIE on some tweets - < 40% f-score
Example: Persons in news articles
Example: Persons in tweets
Genre Differences in Entity Types
PER
• News: Politicians, business leaders, journalists, celebrities
• Tweets: Sportsmen, actors, TV personalities, celebrities, names of friends
LOC
• News: Countries, cities, rivers, and other places related to current affairs
• Tweets: Restaurants, bars, local landmarks/areas, cities, rarely countries
ORG
• News: Public and private companies, government organisations
• Tweets: Bands, internet companies, sports clubs
Tweet-specific NER challenges
• Capitalisation is not indicative of named entities
• All uppercase, e.g. APPLE IS AWSOME
• All lowercase, e.g. all welcome, joe included
• All letters upper initial, e.g. 10 Quotes from Amy Poehler
That Will Get You Through High School
• Unusual spelling, acronyms, and abbreviations
• Social media conventions:
• Hashtags, e.g. #ukuncut, #RusselBrand, #taxavoidance
• @Mentions, e.g. @edchi (PER), @mcg_graz (LOC),
@BBC (ORG)
TwitIE: GATE’s new Twitter NER pipeline
Importing tweets into GATE
• GATE now supports JSON format import for tweets
• Located in the Format_Twitter plugin
• Automatically used for files *.json
• Alternatively, specify text/x-json-twitter as a mime type
• The tweet text becomes the document, all other JSON
fields become features
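The mapping described above can be sketched outside GATE as follows. This is a minimal Python illustration, not the Format_Twitter plugin's code: the function name `tweet_to_document` and the returned dictionary layout are assumptions made for the example. Only the idea that the tweet text becomes the document and the other JSON fields become features comes from the slide.

```python
import json


def tweet_to_document(raw_json):
    """Split one tweet JSON object into document text and a feature map.

    Hypothetical helper: mirrors the behaviour described on the slide,
    not the actual GATE implementation.
    """
    tweet = json.loads(raw_json)
    text = tweet.get("text", "")
    # Every field other than the text is kept as a document feature.
    features = {key: value for key, value in tweet.items() if key != "text"}
    return {"text": text, "features": features}


if __name__ == "__main__":
    sample = '{"text": "all welcome, joe included", "lang": "en", "id": 1}'
    doc = tweet_to_document(sample)
    print(doc["text"])
    print(doc["features"])
```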
Language Detection: Less than 50% English
• The main challenges on tweets/Facebook status updates:
– the small number of tokens (10 tokens/tweet on average)
– the noisy nature of the words (abbreviations, misspellings)
• Because the text is so short, we can assume that each tweet is written in only one language
• We have adapted the TextCat language identification plugin
• Provided fingerprints for 5 languages: DE, EN, FR, ES, NL
• You can extend it to new languages easily
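TextCat-style identification compares a ranked list of frequent character n-grams (a fingerprint) for the input against one fingerprint per language. The sketch below illustrates that general idea in Python. It is not the adapted GATE plugin: the function names, the parameter values and the two toy training strings in `TOY_SAMPLES` are assumptions for illustration, and real fingerprints would be built from much larger corpora.

```python
from collections import Counter

# Hypothetical toy training text; far too small for real use.
TOY_SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und die katze",
}


def fingerprint(text, max_n=3, top_k=300):
    """Return the most frequent character n-grams, most frequent first."""
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count, then alphabetically, so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_fp, lang_fp):
    """Sum of rank differences; n-grams absent from lang_fp get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_fp)}
    penalty = len(lang_fp)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_fp)
    )


def identify(text, fingerprints):
    """Pick the language whose fingerprint is closest to the text's fingerprint."""
    doc_fp = fingerprint(text)
    return min(fingerprints, key=lambda lang: out_of_place(doc_fp, fingerprints[lang]))


if __name__ == "__main__":
    fps = {lang: fingerprint(sample) for lang, sample in TOY_SAMPLES.items()}
    print(identify("the dog and the cat", fps))
```

Under this scheme, extending to a new language amounts to adding one more fingerprint to the dictionary. With only around ten tokens per tweet the input fingerprint is short, which is part of why short, noisy text is hard for this family of methods.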
Language Detection Examples
Tokenisation
• Splitting a text into its constituent parts
• Plenty of “unusual”, but very important tokens in social media:
– @Apple – mentions of company/brand/person names
– #fail, #SteveJobs – hashtags expressing sentiment, person or company names
– :-(, :-), :-P – emoticons (punctuation and optionally letters)
– URLs
• Tokenisation is key for entity recognition and opinion mining
• A study of 1.1 million tweets: 26% of English tweets have a URL, 16.6% a hashtag, and 54.8% a user name mention [Carter, 2013].
Example
– #WiredBizCon #nike vp said when @Apple saw what
http://nikeplus.com did, #SteveJobs was like wow I didn't
expect this at all.
– Tokenising on white space doesn't work that well:
• Nike and Apple are company names, but if we have
tokens such as #nike and @Apple, this will make the
entity recognition harder, as it will need to look at
sub-token level
– Tokenising on white space and punctuation characters
doesn't work well either: URLs get separated (http,
nikeplus), as are emoticons and email addresses
The TwitIE Tokeniser
• Treat RTs and URLs as 1 token each
• #nike is two tokens (# and nike) plus a separate annotation HashTag covering both. Same for @mentions -> UserID
• Capitalisation is preserved, but an orthography feature is added: all caps, lowercase, mixCase
• Date and phone number normalisation, lowercasing, and emoticons are optionally done later in separate modules
• Consequently, tokenisation is faster and more generic
• Also, more tailored to our NER module
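The tokenisation rules listed above can be approximated with a short regular-expression tokeniser. The Python sketch below is illustrative only and is not the TwitIE tokeniser itself. The pattern, the dictionary layout and the labels `allCaps`, `lowercase` and `mixCase` are assumptions modelled on the slide's wording, while the annotation names HashTag and UserID are taken from the slide.

```python
import re

# Order matters: URLs are tried first so they are not split at punctuation.
# Simplification: a URL is taken to run up to the next whitespace character.
TOKEN_RE = re.compile(
    r"(?P<url>https?://\S+)"
    r"|(?P<hashtag>#\w+)"
    r"|(?P<mention>@\w+)"
    r"|(?P<word>\w+(?:'\w+)?)"
    r"|(?P<punct>[^\w\s])"
)


def orthography(string):
    """Describe the casing of a purely alphabetic token; None otherwise."""
    if not string.isalpha():
        return None
    if string.isupper():
        return "allCaps"
    if string.islower():
        return "lowercase"
    return "mixCase"


def make_token(text, start, end):
    string = text[start:end]
    return {"string": string, "start": start, "end": end, "orth": orthography(string)}


def tokenise(text):
    """Return (tokens, annotations); the original capitalisation is preserved."""
    tokens, annotations = [], []
    for match in TOKEN_RE.finditer(text):
        kind, start, end = match.lastgroup, match.start(), match.end()
        if kind in ("hashtag", "mention"):
            # The marker (# or @) and the name are two separate tokens,
            # with one annotation spanning both of them.
            tokens.append(make_token(text, start, start + 1))
            tokens.append(make_token(text, start + 1, end))
            label = "HashTag" if kind == "hashtag" else "UserID"
            annotations.append({"type": label, "start": start, "end": end})
        else:
            tokens.append(make_token(text, start, end))
    return tokens, annotations


if __name__ == "__main__":
    toks, anns = tokenise("#nike vp said when @Apple saw what http://nikeplus.com did")
    print([t["string"] for t in toks])
    print(anns)
```

Keeping `nike` and `Apple` as ordinary tokens lets a downstream entity recogniser match them directly, while the spanning annotation still records that they were part of a hashtag or mention.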
POS Tagging
• The accuracy of the Stanford POS tagger drops from about
97% on news to 80% on tweets (Ritter, 2011)
• Need for an adapted POS tagger, specifically for tweets
• We re-trained the Stanford POS tagger using some hand-annotated
tweets, IRC and news texts
• Next we compare the differences between the ANNIE POS
Tagger and the Tweet POS Tagger on the example tweets
POS Tagging Example
• TwitIE POS tagger on the left
• ANNIE POS tagger on the right
• The TwitIE POS tagger is a separate paper at RANLP’2013
• Beats Ritter (2011); uses a grown-up tag set (cf. Gimpel, 2011)
Tweet Normalisation
• “RT @Bthompson WRITEZ: @libbyabrego honored?! Everybody knows the libster is nice with it...lol...(thankkkks a bunch;))”
• OMG! I’m so guilty!!! Sprained biibii’s leg! ARGHHHHHH!!!!!!
• Similar to SMS normalisation
• For some components to work well (POS tagger, parser), it is necessary to produce a normalised version of each token
• BUT uppercasing, and letter and exclamation mark repetition often convey strong sentiment
• Therefore some choose not to normalise, while others keep both versions of the tokens
A normalised example
• Normaliser currently based on spelling correction and some lists of common abbreviations
• Outstanding issues:
– Insert new Token annotations, so easier to POS tag, etc.? For example: “trying to” is now 1 annotation
– Some abbreviations which span token boundaries (e.g. gr8, do n’t) are difficult to handle
– Capitalisation and punctuation normalisation
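A list-based normaliser that keeps both versions of each token might look like the sketch below. It is a deliberately crude Python illustration and not the TwitIE normaliser: the abbreviation entries are hypothetical, and collapsing runs of three or more repeated characters stands in for real spelling correction.

```python
import re

# Hypothetical abbreviation list; a real one would be far larger.
ABBREVIATIONS = {"gr8": "great", "u": "you", "thx": "thanks"}

# Three or more repeats of the same character, e.g. "kkkk" or "!!!!!!".
REPEAT_RE = re.compile(r"(.)\1{2,}")


def normalise_token(token):
    """Return a normalised form: abbreviation lookup, else squeeze repeats."""
    lowered = token.lower()
    if lowered in ABBREVIATIONS:
        return ABBREVIATIONS[lowered]
    # Crude heuristic: collapse each long run to a single character.
    return REPEAT_RE.sub(r"\1", token)


def normalise(tokens):
    """Keep the original string next to its normalised form."""
    return [{"string": t, "normalised": normalise_token(t)} for t in tokens]


if __name__ == "__main__":
    for item in normalise(["thankkkks", "a", "bunch", "ARGHHHHHH", "!!!!!!", "gr8"]):
        print(item)
```

Storing the original string next to the normalised one means a sentiment component can still see the repetition and capitalisation, while the POS tagger works on the cleaner form. The heuristic over-corrects words whose standard spelling has a double letter (for example, "coooool" becomes "col"), which is one reason a dictionary-backed spelling corrector is preferable.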
TwitIE NER Results
Trying TwitIE
• Plugin in the latest GATE snapshot and forthcoming 7.2
release
• Download details at: https://gate.ac.uk/wiki/twitie.html
• Available soon as a web service on the forthcoming
AnnoMarket NLP cloud marketplace:
• https://annomarket.com/
Coming Soon: TwitIE-as-a-Service
Preview of some text analytics services on AnnoMarket.com
Acknowledgements
• Kalina Bontcheva is supported by a Career Acceleration
Fellowship from the Engineering and Physical Sciences
Research Council (grant EP/I004327/1)
• This research is also partially supported by the EU-funded
FP7 TrendMiner project (http://www.trendminer-project.eu)
and the CHIST-ERA uComp project (http://www.ucomp.eu)
Thank you for your time!

Más contenido relacionado

La actualidad más candente

Natural Language processing
Natural Language processingNatural Language processing
Natural Language processingSanzid Kawsar
 
The Role of Natural Language Processing in Information Retrieval
The Role of Natural Language Processing in Information RetrievalThe Role of Natural Language Processing in Information Retrieval
The Role of Natural Language Processing in Information RetrievalTony Russell-Rose
 
Natural language processing
Natural language processingNatural language processing
Natural language processingprashantdahake
 
Natural Language Processing: Definition and Application
Natural Language Processing: Definition and ApplicationNatural Language Processing: Definition and Application
Natural Language Processing: Definition and ApplicationStephen Shellman
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language ProcessingJaganadh Gopinadhan
 
Natural language processing
Natural language processingNatural language processing
Natural language processingAbash shah
 
Natural Language Processing for Games Research
Natural Language Processing for Games ResearchNatural Language Processing for Games Research
Natural Language Processing for Games ResearchJose Zagal
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processingMinh Pham
 
Practical sentiment analysis
Practical sentiment analysisPractical sentiment analysis
Practical sentiment analysisDiana Maynard
 
Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 wordsananth
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language ProcessingDavid Rostcheck
 
Natural language processing
Natural language processingNatural language processing
Natural language processingYogendra Tamang
 
Natural Language Processing: L01 introduction
Natural Language Processing: L01 introductionNatural Language Processing: L01 introduction
Natural Language Processing: L01 introductionananth
 
Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?Marina Santini
 
Natural language processing
Natural language processingNatural language processing
Natural language processingKarenVacca
 
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...PhD Assistance
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyMarina Santini
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingAkhilPolisetty
 
Natural language processing and its application in ai
Natural language processing and its application in aiNatural language processing and its application in ai
Natural language processing and its application in aiRam Kumar
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 

La actualidad más candente (20)

Natural Language processing
Natural Language processingNatural Language processing
Natural Language processing
 
The Role of Natural Language Processing in Information Retrieval
The Role of Natural Language Processing in Information RetrievalThe Role of Natural Language Processing in Information Retrieval
The Role of Natural Language Processing in Information Retrieval
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing: Definition and Application
Natural Language Processing: Definition and ApplicationNatural Language Processing: Definition and Application
Natural Language Processing: Definition and Application
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language Processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing for Games Research
Natural Language Processing for Games ResearchNatural Language Processing for Games Research
Natural Language Processing for Games Research
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
Practical sentiment analysis
Practical sentiment analysisPractical sentiment analysis
Practical sentiment analysis
 
Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 words
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing: L01 introduction
Natural Language Processing: L01 introductionNatural Language Processing: L01 introduction
Natural Language Processing: L01 introduction
 
Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language Technology
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural language processing and its application in ai
Natural language processing and its application in aiNatural language processing and its application in ai
Natural language processing and its application in ai
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 

Destacado

Semantic Search Engines
Semantic Search EnginesSemantic Search Engines
Semantic Search EnginesAtul Shridhar
 
Intriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platformIntriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platformtoncho11
 
Ontological approach for improving semantic web search results
Ontological approach for improving semantic web search resultsOntological approach for improving semantic web search results
Ontological approach for improving semantic web search resultseSAT Journals
 
Adding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to DeliveryAdding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to DeliveryOntotext
 
In Search of a Semantic Book Search Engine: Are We There Yet?
In Search of a Semantic Book Search Engine: Are We There Yet?In Search of a Semantic Book Search Engine: Are We There Yet?
In Search of a Semantic Book Search Engine: Are We There Yet?Irfan Ullah
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalMauro Dragoni
 
Semantics And Search
Semantics And SearchSemantics And Search
Semantics And SearchVestforsk.no
 
Semantic data mining: an ontology based approach
Semantic data mining: an ontology based approachSemantic data mining: an ontology based approach
Semantic data mining: an ontology based approachAgnieszka Ławrynowicz
 
Semantic security framework and context-aware role-based access control ontol...
Semantic security framework and context-aware role-based access control ontol...Semantic security framework and context-aware role-based access control ontol...
Semantic security framework and context-aware role-based access control ontol...Natalia Díaz Rodríguez
 
Semantic Search at Yahoo
Semantic Search at YahooSemantic Search at Yahoo
Semantic Search at YahooPeter Mika
 
Use of ontologies in natural language processing
Use of ontologies in natural language processingUse of ontologies in natural language processing
Use of ontologies in natural language processingATHMAN HAJ-HAMOU
 
Semantic Relation Classification: Task Formalisation and Refinement
Semantic Relation Classification: Task Formalisation and RefinementSemantic Relation Classification: Task Formalisation and Refinement
Semantic Relation Classification: Task Formalisation and RefinementAndre Freitas
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudOntotext
 
NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache SparkNLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache SparkMartin Goodson
 
Indexing in Search Engine
Indexing in Search EngineIndexing in Search Engine
Indexing in Search EngineShikha Gupta
 

Destacado (17)

Semantic Search Engines
Semantic Search EnginesSemantic Search Engines
Semantic Search Engines
 
Intriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platformIntriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platform
 
Ontological approach for improving semantic web search results
Ontological approach for improving semantic web search resultsOntological approach for improving semantic web search results
Ontological approach for improving semantic web search results
 
Adding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to DeliveryAdding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to Delivery
 
A Taxonomy of Semantic Web data Retrieval Techniques
A Taxonomy of Semantic Web data Retrieval TechniquesA Taxonomy of Semantic Web data Retrieval Techniques
A Taxonomy of Semantic Web data Retrieval Techniques
 
In Search of a Semantic Book Search Engine: Are We There Yet?
In Search of a Semantic Book Search Engine: Are We There Yet?In Search of a Semantic Book Search Engine: Are We There Yet?
In Search of a Semantic Book Search Engine: Are We There Yet?
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
 
Semantics And Search
Semantics And SearchSemantics And Search
Semantics And Search
 
Semantic data mining: an ontology based approach
Semantic data mining: an ontology based approachSemantic data mining: an ontology based approach
Semantic data mining: an ontology based approach
 
Semantic security framework and context-aware role-based access control ontol...
Semantic security framework and context-aware role-based access control ontol...Semantic security framework and context-aware role-based access control ontol...
Semantic security framework and context-aware role-based access control ontol...
 
Semantic Search at Yahoo
Semantic Search at YahooSemantic Search at Yahoo
Semantic Search at Yahoo
 
Use of ontologies in natural language processing
Use of ontologies in natural language processingUse of ontologies in natural language processing
Use of ontologies in natural language processing
 
Semantic Relation Classification: Task Formalisation and Refinement
Semantic Relation Classification: Task Formalisation and RefinementSemantic Relation Classification: Task Formalisation and Refinement
Semantic Relation Classification: Task Formalisation and Refinement
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
 
NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache SparkNLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
 
Ontology-based Classification and Faceted Search Interface for APIs
Ontology-based Classification and Faceted Search Interface for APIsOntology-based Classification and Faceted Search Interface for APIs
Ontology-based Classification and Faceted Search Interface for APIs
 
Indexing in Search Engine
Indexing in Search EngineIndexing in Search Engine
Indexing in Search Engine
 

Similar a TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text

Word representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2VecWord representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2Vecananth
 
Growing Your Freelance Business (Olga Melnikova)
Growing Your Freelance Business (Olga Melnikova)Growing Your Freelance Business (Olga Melnikova)
Growing Your Freelance Business (Olga Melnikova)Olga Melnikova
 
Starting to Process Social Media
Starting to Process Social MediaStarting to Process Social Media
Starting to Process Social MediaLeon Derczynski
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Alia Hamwi
 
Effective communication via email
Effective communication via emailEffective communication via email
Effective communication via emailMarianna Semenova
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesMax Irwin
 
Electronic writing processes
Electronic writing processesElectronic writing processes
Electronic writing processesRabin Bhandari
 
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptxNLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptxrohithprabhas1
 
Feb.2016 Demystifying Digital Humanities - Workshop 2
Feb.2016 Demystifying Digital Humanities - Workshop 2Feb.2016 Demystifying Digital Humanities - Workshop 2
Feb.2016 Demystifying Digital Humanities - Workshop 2Paige Morgan
 
MODULE 4-Text Analytics.pptx
MODULE 4-Text Analytics.pptxMODULE 4-Text Analytics.pptx
MODULE 4-Text Analytics.pptxnikshaikh786
 
Resume Advice - Intuit Careers Facebook Video Chat Feb 2011
Resume Advice - Intuit Careers Facebook Video Chat Feb 2011Resume Advice - Intuit Careers Facebook Video Chat Feb 2011
Resume Advice - Intuit Careers Facebook Video Chat Feb 2011Gail Houston
 
SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...
SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...
SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...Cornelius Puschmann
 
Technical resumes with Dean Liesl Folks_fall2014_sept
Technical resumes with Dean Liesl Folks_fall2014_septTechnical resumes with Dean Liesl Folks_fall2014_sept
Technical resumes with Dean Liesl Folks_fall2014_septHolly M. Justice
 
Information retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.pptInformation retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.pptSamuelKetema1
 
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位eLearning Consortium 電子學習聯盟
 
Writing Skills Ii
Writing Skills IiWriting Skills Ii
Writing Skills Iitabraiz123
 

Similar a TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text (20)

Word representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2VecWord representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2Vec
 
Growing Your Freelance Business (Olga Melnikova)
Growing Your Freelance Business (Olga Melnikova)Growing Your Freelance Business (Olga Melnikova)
Growing Your Freelance Business (Olga Melnikova)
 
Starting to Process Social Media
Starting to Process Social MediaStarting to Process Social Media
Starting to Process Social Media
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
Email Tips and Trends 2010
Email Tips and Trends 2010Email Tips and Trends 2010
Email Tips and Trends 2010
 
Effective communication via email
Effective communication via emailEffective communication via email
Effective communication via email
 
Email Tips 2010
Email Tips 2010Email Tips 2010
Email Tips 2010
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
Electronic writing processes
Electronic writing processesElectronic writing processes
Electronic writing processes
 
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptxNLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
 
Feb.2016 Demystifying Digital Humanities - Workshop 2
Feb.2016 Demystifying Digital Humanities - Workshop 2Feb.2016 Demystifying Digital Humanities - Workshop 2
Feb.2016 Demystifying Digital Humanities - Workshop 2
 
MODULE 4-Text Analytics.pptx
MODULE 4-Text Analytics.pptxMODULE 4-Text Analytics.pptx
MODULE 4-Text Analytics.pptx
 
Ir 03
Ir   03Ir   03
Ir 03
 
Resume Advice - Intuit Careers Facebook Video Chat Feb 2011
Resume Advice - Intuit Careers Facebook Video Chat Feb 2011Resume Advice - Intuit Careers Facebook Video Chat Feb 2011
Resume Advice - Intuit Careers Facebook Video Chat Feb 2011
 
SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...
SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...
SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...
 
Technical resumes with Dean Liesl Folks_fall2014_sept
Technical resumes with Dean Liesl Folks_fall2014_septTechnical resumes with Dean Liesl Folks_fall2014_sept
Technical resumes with Dean Liesl Folks_fall2014_sept
 
LLM.pdf
LLM.pdfLLM.pdf
LLM.pdf
 
Information retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.pptInformation retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.ppt
 
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
 
Writing Skills Ii
Writing Skills IiWriting Skills Ii
Writing Skills Ii
 

Más de Leon Derczynski

Joint Rumour Stance and Veracity
Joint Rumour Stance and VeracityJoint Rumour Stance and Veracity
Joint Rumour Stance and VeracityLeon Derczynski
 
State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018Leon Derczynski
 
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceBroad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceLeon Derczynski
 
Handling and Mining Linguistic Variation in UGC
Handling and Mining Linguistic Variation in UGCHandling and Mining Linguistic Variation in UGC
Handling and Mining Linguistic Variation in UGCLeon Derczynski
 
Efficient named entity annotation through pre-empting
Efficient named entity annotation through pre-emptingEfficient named entity annotation through pre-empting
Efficient named entity annotation through pre-emptingLeon Derczynski
 
Leveraging the Power of Social Media
Leveraging the Power of Social MediaLeveraging the Power of Social Media
Leveraging the Power of Social MediaLeon Derczynski
 
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
Corpus Annotation through Crowdsourcing: Towards Best Practice GuidelinesCorpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
Corpus Annotation through Crowdsourcing: Towards Best Practice GuidelinesLeon Derczynski
 
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...Leon Derczynski
 
Christmas Presentation at Aarhus: What I do
Christmas Presentation at Aarhus: What I doChristmas Presentation at Aarhus: What I do
Christmas Presentation at Aarhus: What I doLeon Derczynski
 
Recognising and Interpreting Named Temporal Expressions
Recognising and Interpreting Named Temporal ExpressionsRecognising and Interpreting Named Temporal Expressions
Recognising and Interpreting Named Temporal ExpressionsLeon Derczynski
 
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text

  • 1. University of Sheffield, NLP TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text Kalina Bontcheva Leon Derczynski Adam Funk Mark A. Greenwood Diana Maynard Niraj Aswani © The University of Sheffield, 1995-2013 This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs Licence
  • 2. The Problem • Running ANNIE on 300 news articles: 87% F-score • Running ANNIE on some tweets: < 40% F-score
  • 3. Example: Persons in news articles
  • 4. Example: Persons in tweets
  • 5. Genre Differences in Entity Types (News vs. Tweets) • PER: News: politicians, business leaders, journalists, celebrities; Tweets: sportsmen, actors, TV personalities, celebrities, names of friends • LOC: News: countries, cities, rivers, and other places related to current affairs; Tweets: restaurants, bars, local landmarks/areas, cities, rarely countries • ORG: News: public and private companies, government organisations; Tweets: bands, internet companies, sports clubs
  • 6. Tweet-specific NER challenges • Capitalisation is not indicative of named entities • All uppercase, e.g. APPLE IS AWSOME • All lowercase, e.g. all welcome, joe included • All letters upper initial, e.g. 10 Quotes from Amy Poehler That Will Get You Through High School • Unusual spelling, acronyms, and abbreviations • Social media conventions: • Hashtags, e.g. #ukuncut, #RusselBrand, #taxavoidance • @Mentions, e.g. @edchi (PER), @mcg_graz (LOC), @BBC (ORG)
  • 7. TwitIE: GATE’s new Twitter NER pipeline
  • 8. Importing tweets into GATE • GATE now supports JSON format import for tweets • Located in the Format_Twitter plugin • Automatically used for *.json files • Alternatively, specify text/x-json-twitter as a MIME type • The tweet text becomes the document, all other JSON fields become features
  • 9. Language Detection: Less than 50% English • The main challenges on tweets/Facebook status updates: the small number of tokens (10 tokens/tweet on average) and the noisy nature of the words (abbreviations, misspellings) • Because the text is so short, we can assume that one tweet is written in only one language • We have adapted the TextCat language identification plugin • Provided fingerprints for 5 languages: DE, EN, FR, ES, NL • You can extend it to new languages easily
  • 10. Language Detection Examples
  • 11. Tokenisation • Splitting a text into its constituent parts • Plenty of “unusual”, but very important tokens in social media: – @Apple – mentions of company/brand/person names – #fail, #SteveJobs – hashtags expressing sentiment, person or company names – :-(, :-), :-P – emoticons (punctuation and optionally letters) – URLs • Tokenisation is key for entity recognition and opinion mining • A study of 1.1 million tweets: 26% of English tweets have a URL, 16.6% a hashtag, and 54.8% a user name mention [Carter, 2013].
  • 12. Example – #WiredBizCon #nike vp said when @Apple saw what http://nikeplus.com did, #SteveJobs was like wow I didn't expect this at all. – Tokenising on white space doesn't work that well: • Nike and Apple are company names, but if we have tokens such as #nike and @Apple, this will make the entity recognition harder, as it will need to look at sub-token level – Tokenising on white space and punctuation characters doesn't work well either: URLs get separated (http, nikeplus), as do emoticons and email addresses
  • 13. The TwitIE Tokeniser • Treat RTs and URLs as 1 token each • #nike is two tokens (# and nike) plus a separate annotation HashTag covering both • Same for @mentions -> UserID • Capitalisation is preserved, but an orthography feature is added: all caps, lowercase, mixCase • Date and phone number normalisation, lowercasing, and emoticons are optionally done later in separate modules • Consequently, tokenisation is faster and more generic • Also, more tailored to our NER module
  • 14. POS Tagging • The accuracy of the Stanford POS tagger drops from about 97% on news to 80% on tweets (Ritter, 2011) • Need for an adapted POS tagger, specifically for tweets • We re-trained the Stanford POS tagger using some hand-annotated tweets, IRC and news texts • Next we compare the differences between the ANNIE POS Tagger and the Tweet POS Tagger on the example tweets
  • 15. POS Tagging Example • TwitIE POS tagger on the left • ANNIE POS tagger on the right • The TwitIE POS tagger is a separate paper at RANLP’2013 • Beats Ritter (2011); uses a grown-up tag set (cf. Gimpel, 2011)
  • 16. Tweet Normalisation • “RT @Bthompson WRITEZ: @libbyabrego honored?! Everybody knows the libster is nice with it...lol...(thankkkks a bunch;))” • OMG! I’m so guilty!!! Sprained biibii’s leg! ARGHHHHHH!!!!!! • Similar to SMS normalisation • For some components to work well (POS tagger, parser), it is necessary to produce a normalised version of each token • BUT uppercasing, and letter and exclamation mark repetition often convey strong sentiment • Therefore some choose not to normalise, while others keep both versions of the tokens
  • 17. A normalised example • Normaliser currently based on spelling correction and some lists of common abbreviations • Outstanding issues: • Insert new Token annotations, so easier to POS tag, etc.? For example: “trying to” now 1 annotation • Some abbreviations which span token boundaries (e.g. gr8, do n’t) difficult to handle • Capitalisation and punctuation normalisation
  • 18. TwitIE NER Results
  • 19. Trying TwitIE • Plugin in the latest GATE snapshot and forthcoming 7.2 release • Download details at: https://gate.ac.uk/wiki/twitie.html • Available soon as a web service on the forthcoming AnnoMarket NLP cloud marketplace: • https://annomarket.com/
  • 20. Coming Soon: TwitIE-as-a-Service • Preview of some text analytics services on AnnoMarket.com
  • 21. Acknowledgements • Kalina Bontcheva is supported by a Career Acceleration Fellowship from the Engineering and Physical Sciences Research Council (grant EP/I004327/1) • This research is also partially supported by the EU-funded FP7 TrendMiner project (http://www.trendminer-project.eu) and the CHIST-ERA uComp project (http://www.ucomp.eu) • Thank you for your time!
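
The tokenisation rules on slide 13 can be illustrated with a small sketch. This is not the TwitIE tokeniser itself (which is a GATE component); it is a minimal Python approximation, and the regular expression, the `Span` type, and the feature value names (`allCaps`, `lowercase`, `mixCase`) are assumptions made for illustration. It keeps a URL as one token, splits `#nike` into `#` and `nike`, and adds a covering `HashTag` (or `UserID`) span.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    kind: str       # "Token", "HashTag" or "UserID" (names taken from the slide)
    start: int      # character offset, inclusive
    end: int        # character offset, exclusive
    text: str
    orth: str = ""  # hypothetical orthography feature value


# Alternatives are tried left to right, so a URL is matched whole
# before the word and punctuation patterns get a chance to split it.
TOKEN_RE = re.compile(
    r"(?P<url>https?://\S+)"
    r"|(?P<hashtag>#\w+)"
    r"|(?P<mention>@\w+)"
    r"|(?P<word>\w+(?:'\w+)?)"
    r"|(?P<punct>[^\w\s])"
)


def orthography(text: str) -> str:
    """Classify case without changing the token text."""
    if not any(c.isalpha() for c in text):
        return ""
    if text.isupper():
        return "allCaps"
    if text.islower():
        return "lowercase"
    return "mixCase"


def tokenise(text: str) -> List[Span]:
    spans = []
    for m in TOKEN_RE.finditer(text):
        start, end = m.span()
        tok = m.group()
        if m.lastgroup in ("hashtag", "mention"):
            # Two tokens (the marker and the name) plus one covering span.
            cover = "HashTag" if m.lastgroup == "hashtag" else "UserID"
            spans.append(Span("Token", start, start + 1, tok[0]))
            spans.append(Span("Token", start + 1, end, tok[1:], orthography(tok[1:])))
            spans.append(Span(cover, start, end, tok))
        else:
            spans.append(Span("Token", start, end, tok, orthography(tok)))
    return spans


for span in tokenise("#nike vp said when @Apple saw what http://nikeplus.com did"):
    print(span)
```

Because `nike` and `Apple` come out as ordinary tokens, a later entity recogniser does not need to look below token level. The sketch has known gaps: punctuation directly after a URL is swallowed into the URL token, and emoticons are split into separate punctuation tokens.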

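
The import behaviour described on slide 8 (tweet text becomes the document, every other JSON field becomes a feature) can be sketched in the same way. This is only an illustration of the mapping, not GATE's Format_Twitter code; the function name `tweet_to_document` and the assumption that the text lives in a top-level `text` field are hypothetical.

```python
import json
from typing import Any, Dict, Tuple


def tweet_to_document(raw_json: str, text_field: str = "text") -> Tuple[str, Dict[str, Any]]:
    """Split one tweet's JSON into document text and a feature map."""
    tweet = json.loads(raw_json)
    # Assumption: the tweet body is stored under a top-level "text" key.
    text = tweet.get(text_field, "")
    # Everything else is kept as document-level features.
    features = {key: value for key, value in tweet.items() if key != text_field}
    return text, features


raw = '{"text": "all welcome, joe included", "lang": "en", "id": 1}'
text, features = tweet_to_document(raw)
print(text)
print(features)
```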
Editor's notes

  1. Leon, in the paper you show ANNIE 60% on the dev set. The above 40% is on the entire ds that’s in svn. Feel free to replace that table, as you like. I could not load the dev set into GATE, due to its strange format. I am sure there’s a script somewhere that’ll convert it into a proper .conll format, I just had no time to find and run it. It’s ok, nobody will notice perhaps :)
  2. These are mostly politicians. Often their names are preceded by their titles. There is also bigger context, within which entity coreference helps with detection (e.g. Atef and Mohammed Atef; bin Laden and Osama bin Laden).
  3. These are names of friends, singers, artists, sportspeople, and celebrities. Often in lowercase, referred to by first or surname only and sometimes misspelled.
  4. Hashtags: some contain locations, some contain person names, and others are phrases. For the @mentions: IIRC, Ritter (or some similar recent paper on Twitter NER) wrote that @mentions were excluded from their evaluation, since they are trivially recognisable as persons. Well, the point is that they are not all persons (that used to be true). Now we have locations/facilities, organisations, as well as some products, research projects, and the like. Hence, even though it’s trivial to identify an @mention as an NE, assigning it the appropriate NE type is far from a solved problem!