SlideShare una empresa de Scribd logo
1 de 59
Descargar para leer sin conexión
University of Sheffield, NLP
Text Analysis with GATE
Diana Maynard
University of Sheffield
Big Social Data Workshop
Reading, UK, April 2015
© The University of Sheffield, 2015
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike Licence
University of Sheffield, NLP
Outline of the Tutorial
• Introduction to NLP and Information Extraction
• Introduction to GATE
• Social media analysis
• Example applications
• Semantic Search, Nesta, Decarbonet.
University of Sheffield, NLP
We are all connected to each other...
● Information,
thoughts and
opinions are
shared prolifically
on the social web
these days
● 72% of online
adults use social
networking sites
University of Sheffield, NLP
Your grandmother is three times as likely to
use a social networking site now as in 2009
University of Sheffield, NLP
There are hundreds of tools for social media
analytics
● Most of them are commercial and not freely available
● The research tools tend to focus on specific topics and
scenarios, and aren't easily adaptable
● The analysis they do often doesn't go much beyond number
crunching, e.g.
– look at number of tweets, retweets, favourites
– filter by hashtag or keyword for topic categorisation
– use off-the-shelf sentiment tools
– use counts of word length, POS categories etc
– very little semantics, don't deal with variation, ambiguity,
slang, sarcasm etc.
University of Sheffield, NLP
Analysing Social Media is harder than it sounds
There are lots
of things to
think about!
University of Sheffield, NLP
Analysing language in social media is hard
● Grundman:politics makes #climatechange scientific issue,people
don’t like knowitall rational voice tellin em wat 2do
● @adambation Try reading this article , it looks like it would be
really helpful and not obvious at all. http://t.co/mo3vODoX
● Want to solve the problem of #ClimateChange? Just #vote for a
#politician! Poof! Problem gone! #sarcasm #TVP #99%
● Human Caused #ClimateChange is a Monumental Scam!
http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!!
Lying to us like MOFO's Tax The Air We Breath! F**k Them!
University of Sheffield, NLP
Let's search for keywords like “Arctic”
Oops!
University of Sheffield, NLP
Seems like we need something to help!
How about NLP?
University of Sheffield, NLP
It is difficult to access unstructured information
efficiently
Information extraction tools can help you:
● Save time and money on management of text and data from
multiple sources
● Find hidden links scattered across huge volumes of diverse
information
● Integrate structured data from variety of sources
● Interlink text and data
● Collect information and extract new facts
University of Sheffield, NLP
What is Entity Recognition?
●
Entity Recognition is about recogising and classifying key Named
Entities and terms in the text
●
A Named Entity is a Person, Location, Organisation, Date etc.
●
A term is a key concept or phrase that is representative of the text
●
Entities and terms may be described in different ways but refer to
the same thing. We call this co-reference.
Mitt Romney, the favorite to win the Republican nomination for president in 2012
DatePerson Term
The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior.
Organisation
co-reference
Location
University of Sheffield, NLP
What is Event Recognition?
●
An event is an action or situation relevant to the domain
expressed by some relation between entities or terms.
●
It is always grounded in time, e.g. the performance of a
band, an election, the death of a person
Mitt Romney, the favorite to win the Republican nomination for president in 2012
Event DatePerson
Relation Relation
University of Sheffield, NLP
Why are Entities and Events Useful?
● They can help answer the “Big 5” journalism questions
(who, what, when, where, why)
● They can be used to categorise the texts in different ways
– look at all texts about Obama.
● They can be used as targets for opinion mining
– find out what people think about President Obama
● When linked to an ontology and/or combined with other
information, they can be used for reasoning about things
not explicit in the text
– seeing how opinions about different American
presidents have changed over the years
University of Sheffield, NLP
Approaches to Information Extraction
Knowledge Engineering

rule based

developed by
experienced language
engineers

make use of human
intuition

easier to understand
results

development could be
very time consuming

some changes may be
hard to accommodate
Learning Systems

use statistics or other
machine learning

developers do not need
LE expertise

requires large amounts of
annotated training data

some changes may
require re-annotation of
the entire training corpus
University of Sheffield, NLP
Seems like we need a tool to do this clever
stuff for us.
How about GATE?
University of Sheffield, NLP
What is GATE?
GATE is an NLP toolkit developed at the University of Sheffield
over the last 20 years.
It's open source and freely available. http://gate.ac.uk
• components for language processing, e.g. parsers, machine
learning tools, stemmers, IR tools, IE components for various
languages...
• tools for visualising and manipulating text, annotations,
ontologies, parse trees, etc.
• various information extraction tools
• evaluation and benchmarking tools
University of Sheffield, NLP
GATE components
● Language Resources (LRs), e.g. lexicons, corpora,
ontologies
● Processing Resources (PRs), e.g. parsers, generators,
taggers
●
Visual Resources (VRs), i.e. visualisation and editing
components
● Algorithms are separated from the data, which means:
– the two can be developed independently by users with
different expertise.
– alternative resources of one type can be used without
affecting the other, e.g. a different visual resource can be
used with the same language resource
University of Sheffield, NLP
ANNIE
• ANNIE is GATE's rule-based IE system
• It uses the language engineering approach (though we also
have tools in GATE for ML)
• Distributed as part of GATE
• Uses a finite-state pattern-action rule language, JAPE
• ANNIE contains a reusable and easily extendable set of
components:
– generic preprocessing components for tokenisation,
sentence splitting etc
– components for performing NE on general open domain text
University of Sheffield, NLP
ANNIE Modules
University of Sheffield, NLP
20
Document with Tokens
University of Sheffield, NLP
21
Document with Sentences
University of Sheffield, NLP
Gazetteer editor
definition file
entries entries for selected list
University of Sheffield, NLP
Named Entity Grammars
• Hand-coded rules written in JAPE applied to annotations to
identify NEs
• Phases run sequentially and constitute a cascade of FSTs over
annotations
• Annotations from format analysis, tokeniser. splitter, POS tagger,
morphological analysis, gazetteer etc.
• Because phases are sequential, annotations can be built up over
a period of phases, as new information is gleaned
• Standard named entities: persons, locations, organisations, dates,
addresses, money
• Basic NE grammars can be adapted for new applications, domains
and languages
University of Sheffield, NLP
Document with NEs
University of Sheffield, NLP
Coreference
University of Sheffield, NLP
Hands-on: Running ANNIE on news texts
●
Start GATE by double clicking on the icon
●
Because ANNIE is a ready-made application, we can just load it
directly from the menu
●
Right click on Applications, select “Ready-made applications”
and then ANNIE
●
Create a new corpus (right click on Language Resources →
New → GATE corpus and click “OK)
●
Populate the corpus (right click on the corpus, and use the file
chooser in Directory URL to select the hands-on/corpora/news-
texts directory. Click “Open” and then “OK”)
●
Double click on ANNIE and select “Run this application”
●
This will now run the application on all the documents in your
corpus
University of Sheffield, NLP
Hands-on: Looking at the results
● Now you can inspect your annotated data: double click on a
document to open it
● Select “Annotation Sets” and “Annotation List” from the tabs
● Click on the top arrow above “Original markups” in the right
pane
● You should see a mixture of Named Entity annotations
(Person, Location etc) and some other linguistic annotations
(Token, Sentence etc) in the Default annotation set
● Click on the name of an annotation in the right hand pane that
you want to view (you can select multiple annotations)
● Hover over the annotation in the text to get more info
● Try the Annotation Stack for a different view
University of Sheffield, NLP
3. Analysing Social Media
University of Sheffield, NLP
Analysing language in social media is hard
● Grundman:politics makes #climatechange scientific issue,people
don’t like knowitall rational voice tellin em wat 2do
● @adambation Try reading this article , it looks like it would be
really helpful and not obvious at all. http://t.co/mo3vODoX
● Want to solve the problem of #ClimateChange? Just #vote for a
#politician! Poof! Problem gone! #sarcasm #TVP #99%
● Human Caused #ClimateChange is a Monumental Scam!
http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!!
Lying to us like MOFO's Tax The Air We Breath! F**k Them!
University of Sheffield, NLP
Challenges for NLP
● Noisy language: unusual punctuation, capitalisation, spelling, use
of slang, sarcasm etc.
● Terse nature of microposts such as tweets
● Use of hashtags, @mentions etc causes problems for
tokenisation #thisistricky
● Lack of context gives rise to ambiguities
● NER performs poorly on microposts, mainly because of linguistic
pre-processing failure
– Performance of standard IE tools decreases from ~90% to
~40% when run on tweets rather than news articles
University of Sheffield, NLP
Lack of context causes ambiguity
Branching out from Lincoln park after dark ... Hello Russian
Navy, it's like the same thing but with glitter!
??
University of Sheffield, NLP
Getting the NEs right is crucial
Branching out from Lincoln park after dark ... Hello Russian
Navy, it's like the same thing but with glitter!
University of Sheffield, NLP
Hands-on with TwitIE
● First we need to load the TWITIE plugin
● Click on the green jigsaw-shaped icon at the top to load the
plugins manager (this takes a few seconds to load)
● Scroll down to the Twitter plugin and select “Load now”.
● Click “Apply All” and then “Close”.
● Now right-click on “Applications”, select “Ready-made
applications” and “TwitIE”
● Create another new corpus, name it “Tweets” and populate
with texts from hands-on/corpora/tweets
● Once loaded, double click on TwitIE to open it, and then select
“Run this application” (make sure the tweets corpus is chosen)
● Use the Annotation Stack view to look at Tokens in hashtags
University of Sheffield, NLP
Semantic Search with GATE
University of Sheffield, NLP
GATE Mímir
• can be used to index and search over text, annotations,
semantic metadata (concepts and instances)
• allows queries that arbitrarily mix full-text, structural, linguistic
and semantic annotations
• is open source
University of Sheffield, NLP
What can GATE Mímir do that Google can't?
Show me:
• all documents mentioning a temperature between 30 and 90
degrees F (expressed in any unit)
• all abstracts written in French on patent documents from the last
6 months which mention any form of the word “transistor” in the
English abstract
• the names of the patent inventors of those abstracts
• all documents mentioning steel industries in the UK, along with
their location
University of Sheffield, NLP
Search News Articles for
Politicians born in Sheffield
http://demos.gate.ac.uk/mimir/gpd/search/gus
University of Sheffield, NLP
MIMIR: Searching Text Mining Results
●
Searching and managing text annotations, semantic
information, and full text documents in one search engine
●
Queries over annotation graphs
●
Regular expressions, Kleene operators
●
Designed to be integrated as a web service in custom end-user
systems with bespoke interfaces
●
Demos at http://services.gate.ac.uk/mimir/
University of Sheffield, NLP
Hands-on with Semantic Search
Try these queries on the BBC News demo:
http://services.gate.ac.uk/mimir/gpd/search/index
● Gordon Brown
● Gordon Brown said
●
Gordon Brown [0..3] root:say
●
{Person} [0..3] root:say
●
{Person.gender=female}[0..3] root:say
●
Make sure you type the queries EXACTLY or they probably won't
work!
●
Try making up some of your own queries.
University of Sheffield, NLP
Try your hand with some SPARQL
(for the more adventurous!)
●
{Person inst ="http://dbpedia.org/resource/Gordon_Brown"} [0..3]
root:say
●
{Person sparql="SELECT ?inst WHERE { ?inst a :Politician }"}
[0..3] root:say
●
{Person sparql = "SELECT ?inst WHERE { ?inst :party
<http://dbpedia.org/resource/Labour_Party_%28UK%29> }" }
[0..3] root:say
●
{Person sparql = "SELECT ?inst WHERE {
?inst :party <http://dbpedia.org/resource/Labour_Party_%28UK
%29> .
?inst :almaMater
<http://dbpedia.org/resource/University_of_Edinburgh> }"
} [0..3] root:say
Can you work out what these do?
University of Sheffield, NLP
We don't just have to look at politicians saying and
measuring things
● If we first process the text with other NLP tools such as
sentiment analysis, we can also search for positive or negative
documents
● Or positive/negative comments about certain
people/topics/things/companies
● In the DecarboNet project, we're looking at people's opinions
about topics relating to climate change, e.g. fracking
● We could index on the sentiment annotations too
● Other people are using the combination of opinion mining and
MIMIR to look at e.g. customer feedback about products /
companies on a huge corpus
University of Sheffield, NLP
A positive tweet
University of Sheffield, NLP
A negative tweet
University of Sheffield, NLP
A Sarcastic Tweet
University of Sheffield, NLP
What diseases are in these documents?
University of Sheffield, NLP
What Pathogens?
University of Sheffield, NLP
Disease vs Disease Co-ocurrences
University of Sheffield, NLP
Diseases vs Pathogens
University of Sheffield, NLP
The Political Futures Tracker
● Example of using the technology on a real scenario - analysing
political tweets in the run-up to the UK elections
● Project funded by Nesta http://www.nesta.org.uk/
● Series of blog posts about the project, leading up to the
election
http://www.nesta.org.uk/news/political-futures-tracker
● Some of our own more technical blog posts about the
underlying tools:
http://gate.ac.uk/projects/pft/
University of Sheffield, NLP
Recognition of MPs / Candidates
University of Sheffield, NLP
Topic recognition
This term is found by first
recognising the head
word “job” from a list
under the theme
“employment” and
matching against its root
form in the text, i.e. “job”.
It is then extended to
include the adjectival
modifier “British”, which
is not present in a list
anywhere.
University of Sheffield, NLP
Positive opinion about science and innovation
University of Sheffield, NLP
What do the different parties talk about?
Conservative vs Labour
Transport, Europe and employment are
mentioned more by Conservatives
NHS is mentioned more by Labour
University of Sheffield, NLP
Where did people tweet about the economy?
University of Sheffield, NLP
Terms that co-occur with immigration topic
University of Sheffield, NLP
Which tweets express sentiment?
University of Sheffield, NLP
Summary
● Text mining is a very useful pre-requisite for doing all kinds of
more interesting things, like semantic search
● Semantic web search allows you to do much more interesting
kinds of search than a standard text-based search
● Text mining is hard, so it won't always be correct, however
● This is especially true on lower quality text such as social
media
● On the other hand, social media has some of the most
interesting material to look at
● Still plenty of work to be done, but there are lots of tools that
you can use now and get useful results
● We run annual GATE training courses where you can spend a
whole week learning all this and more!
University of Sheffield, NLP
Acknowledgements and further information
● Research partially supported by the European Union/EU under the Information
and Communication Technologies (ICT) theme of the 7th Framework Programme
for R&D (FP7) DecarboNet (610829) http://www.decarbonet.eu and Nesta
http://nesta.org.uk
● Annual GATE training course in June in
Sheffield:https://gate.ac.uk/family/training.html
● Download gate: http://gate.ac.uk/download
University of Sheffield, NLP
Relevant publications
● K Bontcheva, V Tablan, H Cunningham. Semantic Search over Documents
and Ontologies (2014) Bridging Between Information Retrieval and
Databases, 31-53
● V. Tablan, K. Bontcheva, I. Roberts, and H. Cunningham. Mímir: An open-
source semantic search framework for interactive information seeking and
discovery. Journal of Web Semantics: Science, Services and Agents on the
World Wide Web, 2014
● A. Dietzel and D. Maynard. Climate Change: A Chance for Political Re-
Engagement? In Proc. of the Political Studies Association 65th Annual
International Conference, April 2015, Sheffield, UK.
● Diana Maynard. Challenges in Analysing Social Media. In Adrian Duşa,
Dietrich Nelle, Günter Stock and Gert G. Wagner (eds.) (2014): Facing the
Future: European Research Infrastructures for the Humanities and Social
Sciences. SCIVERO Verlag, Berlin, 2014.
● These and other relevant papers on the GATE website:
http://gate.ac.uk/gate/doc/papers.html

Más contenido relacionado

La actualidad más candente

Example Content Strategy
Example Content Strategy Example Content Strategy
Example Content Strategy
Arbell Noach
 

La actualidad más candente (20)

Website giới thiệu sản phẩm
Website giới thiệu sản phẩmWebsite giới thiệu sản phẩm
Website giới thiệu sản phẩm
 
Level Up Your Content Strategy – 5 Steps To SEO Success.pdf
Level Up Your Content Strategy – 5 Steps To SEO Success.pdfLevel Up Your Content Strategy – 5 Steps To SEO Success.pdf
Level Up Your Content Strategy – 5 Steps To SEO Success.pdf
 
Lập Trình an toàn - Secure programming
Lập Trình an toàn - Secure programmingLập Trình an toàn - Secure programming
Lập Trình an toàn - Secure programming
 
Blockchain là gì? Những ứng dụng của công nghệ Blockchain 2019
Blockchain là gì? Những ứng dụng của công nghệ Blockchain 2019Blockchain là gì? Những ứng dụng của công nghệ Blockchain 2019
Blockchain là gì? Những ứng dụng của công nghệ Blockchain 2019
 
các khái niệm cơ bản dự án phần mềm
các khái niệm cơ bản dự án phần mềmcác khái niệm cơ bản dự án phần mềm
các khái niệm cơ bản dự án phần mềm
 
Example Content Strategy
Example Content Strategy Example Content Strategy
Example Content Strategy
 
Personal branding on LinkedIn
Personal branding on LinkedInPersonal branding on LinkedIn
Personal branding on LinkedIn
 
Tackling SharePoint Site And Microsoft Teams Sprawl In Microsoft 365 What You...
Tackling SharePoint Site And Microsoft Teams Sprawl In Microsoft 365 What You...Tackling SharePoint Site And Microsoft Teams Sprawl In Microsoft 365 What You...
Tackling SharePoint Site And Microsoft Teams Sprawl In Microsoft 365 What You...
 
List of do follow websites used to build backlinks
List of do follow websites used to build backlinksList of do follow websites used to build backlinks
List of do follow websites used to build backlinks
 
2010 inDinero Pitch Deck
2010 inDinero Pitch Deck 2010 inDinero Pitch Deck
2010 inDinero Pitch Deck
 
Visão geral da plataforma: Soluções de Marketing do LinkedIn
Visão geral da plataforma: Soluções de Marketing do LinkedInVisão geral da plataforma: Soluções de Marketing do LinkedIn
Visão geral da plataforma: Soluções de Marketing do LinkedIn
 
Import Payslip Input in odoo (Hr Payroll odoo )
Import Payslip Input in odoo  (Hr Payroll odoo )Import Payslip Input in odoo  (Hr Payroll odoo )
Import Payslip Input in odoo (Hr Payroll odoo )
 
Top 40 Ways to Make Money With ChatGPT Today.pdf
Top 40 Ways to Make Money With ChatGPT Today.pdfTop 40 Ways to Make Money With ChatGPT Today.pdf
Top 40 Ways to Make Money With ChatGPT Today.pdf
 
Đề tài: Lập trình game trên thiết bị di động, HOT, 9đ
Đề tài: Lập trình game trên thiết bị di động, HOT, 9đĐề tài: Lập trình game trên thiết bị di động, HOT, 9đ
Đề tài: Lập trình game trên thiết bị di động, HOT, 9đ
 
Introduction to LinkedIn
Introduction to LinkedInIntroduction to LinkedIn
Introduction to LinkedIn
 
Semantic Search for Sourcing and Recruiting
Semantic Search for Sourcing and RecruitingSemantic Search for Sourcing and Recruiting
Semantic Search for Sourcing and Recruiting
 
ChatGPT Content Creation Master Class - Leah Faul, 15000 Cubits
ChatGPT Content Creation Master Class - Leah Faul, 15000 CubitsChatGPT Content Creation Master Class - Leah Faul, 15000 Cubits
ChatGPT Content Creation Master Class - Leah Faul, 15000 Cubits
 
Smarter Link Building: How to Use Machine Learning to Accelerate Organic Growth
Smarter Link Building: How to Use Machine Learning to Accelerate Organic Growth Smarter Link Building: How to Use Machine Learning to Accelerate Organic Growth
Smarter Link Building: How to Use Machine Learning to Accelerate Organic Growth
 
Core Web Vitals and SEO: Don't Panic. Improve.
Core Web Vitals and SEO: Don't Panic. Improve.Core Web Vitals and SEO: Don't Panic. Improve.
Core Web Vitals and SEO: Don't Panic. Improve.
 
So analysierst du deinen Search Intent programmatisch!
So analysierst du deinen Search Intent programmatisch!So analysierst du deinen Search Intent programmatisch!
So analysierst du deinen Search Intent programmatisch!
 

Destacado

Semantic Search Engines
Semantic Search EnginesSemantic Search Engines
Semantic Search Engines
Atul Shridhar
 
Intriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platformIntriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platform
toncho11
 

Destacado (20)

Text analysis and Semantic Search with GATE
Text analysis and Semantic Search with GATEText analysis and Semantic Search with GATE
Text analysis and Semantic Search with GATE
 
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog TextTwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
 
NLP from scratch
NLP from scratch NLP from scratch
NLP from scratch
 
Text Analysis and Semantic Search with GATE
Text Analysis and Semantic Search with GATEText Analysis and Semantic Search with GATE
Text Analysis and Semantic Search with GATE
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics Architecture
 
Big Data for Big Results in Chinese Social Media
Big Data for Big Results in Chinese Social MediaBig Data for Big Results in Chinese Social Media
Big Data for Big Results in Chinese Social Media
 
Tools for (Almost) Real-Time Social Media Analysis
Tools for (Almost) Real-Time Social Media AnalysisTools for (Almost) Real-Time Social Media Analysis
Tools for (Almost) Real-Time Social Media Analysis
 
NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2
 
Rd 60
Rd 60Rd 60
Rd 60
 
Erba investor presentation
Erba investor presentationErba investor presentation
Erba investor presentation
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA Introduction
 
Semantic Search Engines
Semantic Search EnginesSemantic Search Engines
Semantic Search Engines
 
Intriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platformIntriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platform
 
Adding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to DeliveryAdding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to Delivery
 
Ontological approach for improving semantic web search results
Ontological approach for improving semantic web search resultsOntological approach for improving semantic web search results
Ontological approach for improving semantic web search results
 
A Taxonomy of Semantic Web data Retrieval Techniques
A Taxonomy of Semantic Web data Retrieval TechniquesA Taxonomy of Semantic Web data Retrieval Techniques
A Taxonomy of Semantic Web data Retrieval Techniques
 
In Search of a Semantic Book Search Engine: Are We There Yet?
In Search of a Semantic Book Search Engine: Are We There Yet?In Search of a Semantic Book Search Engine: Are We There Yet?
In Search of a Semantic Book Search Engine: Are We There Yet?
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
 
Semantics And Search
Semantics And SearchSemantics And Search
Semantics And Search
 

Similar a GATE: a text analysis tool for social media

Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?
Diana Maynard
 
Introduction to NLP_1.pptx
Introduction to NLP_1.pptxIntroduction to NLP_1.pptx
Introduction to NLP_1.pptx
jkamble
 
Introduction to NLP.pptx
Introduction to NLP.pptxIntroduction to NLP.pptx
Introduction to NLP.pptx
jkamble
 

Similar a GATE: a text analysis tool for social media (20)

Text analysis-semantic-search
Text analysis-semantic-searchText analysis-semantic-search
Text analysis-semantic-search
 
Scale2014
Scale2014Scale2014
Scale2014
 
Arcomem training entities-and-events_advanced
Arcomem training entities-and-events_advancedArcomem training entities-and-events_advanced
Arcomem training entities-and-events_advanced
 
Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?
 
Natural Language Processing: L01 introduction
Natural Language Processing: L01 introductionNatural Language Processing: L01 introduction
Natural Language Processing: L01 introduction
 
Arcomem training opinions_advanced
Arcomem training opinions_advancedArcomem training opinions_advanced
Arcomem training opinions_advanced
 
Twitter data analysis using R
Twitter data analysis using RTwitter data analysis using R
Twitter data analysis using R
 
A tailor-made one-size-fits-all approach to sentiment analysis
A tailor-made one-size-fits-all approach to sentiment analysisA tailor-made one-size-fits-all approach to sentiment analysis
A tailor-made one-size-fits-all approach to sentiment analysis
 
call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...
 
Unit 5f.pptx
Unit 5f.pptxUnit 5f.pptx
Unit 5f.pptx
 
DMDS Winter 2015 Workshop 1 slides
DMDS Winter 2015 Workshop 1 slidesDMDS Winter 2015 Workshop 1 slides
DMDS Winter 2015 Workshop 1 slides
 
NATURAL LANGUAGE PROCESSING.pptx
NATURAL LANGUAGE PROCESSING.pptxNATURAL LANGUAGE PROCESSING.pptx
NATURAL LANGUAGE PROCESSING.pptx
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translation
 
Addis Ababa University.pptx
Addis Ababa University.pptxAddis Ababa University.pptx
Addis Ababa University.pptx
 
ChatGPT in academic settings H2.de
ChatGPT in academic settings H2.deChatGPT in academic settings H2.de
ChatGPT in academic settings H2.de
 
Impact the UX of Your Website with Contextual Inquiry
Impact the UX of Your Website with Contextual InquiryImpact the UX of Your Website with Contextual Inquiry
Impact the UX of Your Website with Contextual Inquiry
 
Jon Rubin & Katherine Spivey - User-Useful Government Websites: Intersection ...
Jon Rubin & Katherine Spivey - User-Useful Government Websites: Intersection ...Jon Rubin & Katherine Spivey - User-Useful Government Websites: Intersection ...
Jon Rubin & Katherine Spivey - User-Useful Government Websites: Intersection ...
 
Growing Your Freelance Business (Olga Melnikova)
Growing Your Freelance Business (Olga Melnikova)Growing Your Freelance Business (Olga Melnikova)
Growing Your Freelance Business (Olga Melnikova)
 
Introduction to NLP_1.pptx
Introduction to NLP_1.pptxIntroduction to NLP_1.pptx
Introduction to NLP_1.pptx
 
Introduction to NLP.pptx
Introduction to NLP.pptxIntroduction to NLP.pptx
Introduction to NLP.pptx
 

Más de Diana Maynard

Más de Diana Maynard (20)

Filth and lies: analysing social media
Filth and lies: analysing social mediaFilth and lies: analysing social media
Filth and lies: analysing social media
 
Adding value to NLP: a little semantics goes a long way
Adding value to NLP: a little semantics goes a long wayAdding value to NLP: a little semantics goes a long way
Adding value to NLP: a little semantics goes a long way
 
Methodological possibilities for strengthening the monitoring of SDG indicato...
Methodological possibilities for strengthening the monitoring of SDG indicato...Methodological possibilities for strengthening the monitoring of SDG indicato...
Methodological possibilities for strengthening the monitoring of SDG indicato...
 
Getting the-most-out-of-conferences
Getting the-most-out-of-conferencesGetting the-most-out-of-conferences
Getting the-most-out-of-conferences
 
Understanding the world with NLP: interactions between society, behaviour and...
Understanding the world with NLP: interactions between society, behaviour and...Understanding the world with NLP: interactions between society, behaviour and...
Understanding the world with NLP: interactions between society, behaviour and...
 
The language of social media
The language of social mediaThe language of social media
The language of social media
 
Using language to save the world: interactions between society, behaviour and...
Using language to save the world: interactions between society, behaviour and...Using language to save the world: interactions between society, behaviour and...
Using language to save the world: interactions between society, behaviour and...
 
Adapting NLP tools to diverse data: challenges and solutions
Adapting NLP tools to diverse data: challenges and solutionsAdapting NLP tools to diverse data: challenges and solutions
Adapting NLP tools to diverse data: challenges and solutions
 
Ontologies as bridges between data sources and user queries: the KNOWMAK proj...
Ontologies as bridges between data sources and user queries: the KNOWMAK proj...Ontologies as bridges between data sources and user queries: the KNOWMAK proj...
Ontologies as bridges between data sources and user queries: the KNOWMAK proj...
 
20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...
20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...
20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...
 
Challenges of social media analysis in the real world
Challenges of social media analysis in the real worldChallenges of social media analysis in the real world
Challenges of social media analysis in the real world
 
Social media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATESocial media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATE
 
Cls8 decarbonet
Cls8 decarbonetCls8 decarbonet
Cls8 decarbonet
 
Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?
 
Do we really know what people mean when they tweet?
Do we really know what people mean when they tweet?Do we really know what people mean when they tweet?
Do we really know what people mean when they tweet?
 
Disability and Adventure Travel: the Double-Edged Sword
Disability and Adventure Travel: the Double-Edged SwordDisability and Adventure Travel: the Double-Edged Sword
Disability and Adventure Travel: the Double-Edged Sword
 
Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...
Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...
Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...
 
Multimodal opinion mining from social media
Multimodal opinion mining from social mediaMultimodal opinion mining from social media
Multimodal opinion mining from social media
 
Practical Opinion Mining for Social Media
Practical Opinion Mining for Social MediaPractical Opinion Mining for Social Media
Practical Opinion Mining for Social Media
 
What do you really mean when you tweet? Challenges for opinion mining on soci...
What do you really mean when you tweet? Challenges for opinion mining on soci...What do you really mean when you tweet? Challenges for opinion mining on soci...
What do you really mean when you tweet? Challenges for opinion mining on soci...
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

GATE: a text analysis tool for social media

  • 1. University of Sheffield, NLP Text Analysis with GATE Diana Maynard University of Sheffield Big Social Data Workshop Reading, UK, April 2015 © The University of Sheffield, 2015 This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike Licence
  • 2. University of Sheffield, NLP Outline of the Tutorial • Introduction to NLP and Information Extraction • Introduction to GATE • Social media analysis • Example applications • Semantic Search, Nesta, Decarbonet.
  • 3. University of Sheffield, NLP We are all connected to each other... ● Information, thoughts and opinions are shared prolifically on the social web these days ● 72% of online adults use social networking sites
  • 4. University of Sheffield, NLP Your grandmother is three times as likely to use a social networking site now as in 2009
  • 5. University of Sheffield, NLP There are hundreds of tools for social media analytics ● Most of them are commercial and not freely available ● The research tools tend to focus on specific topics and scenarios, and aren't easily adaptable ● The analysis they do often doesn't go much beyond number crunching, e.g. – look at number of tweets, retweets, favourites – filter by hashtag or keyword for topic categorisation – use off-the-shelf sentiment tools – use counts of word length, POS categories etc – very little semantics, don't deal with variation, ambiguity, slang, sarcasm etc.
  • 6. University of Sheffield, NLP Analysing Social Media is harder than it sounds There are lots of things to think about!
  • 7. University of Sheffield, NLP Analysing language in social media is hard ● Grundman:politics makes #climatechange scientific issue,people don’t like knowitall rational voice tellin em wat 2do ● @adambation Try reading this article , it looks like it would be really helpful and not obvious at all. http://t.co/mo3vODoX ● Want to solve the problem of #ClimateChange? Just #vote for a #politician! Poof! Problem gone! #sarcasm #TVP #99% ● Human Caused #ClimateChange is a Monumental Scam! http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!! Lying to us like MOFO's Tax The Air We Breath! F**k Them!
  • 8. University of Sheffield, NLP Let's search for keywords like “Arctic” Oops!
  • 9. University of Sheffield, NLP Seems like we need something to help! How about NLP?
  • 10. University of Sheffield, NLP It is difficult to access unstructured information efficiently Information extraction tools can help you: ● Save time and money on management of text and data from multiple sources ● Find hidden links scattered across huge volumes of diverse information ● Integrate structured data from variety of sources ● Interlink text and data ● Collect information and extract new facts
  • 11. University of Sheffield, NLP What is Entity Recognition? ● Entity Recognition is about recogising and classifying key Named Entities and terms in the text ● A Named Entity is a Person, Location, Organisation, Date etc. ● A term is a key concept or phrase that is representative of the text ● Entities and terms may be described in different ways but refer to the same thing. We call this co-reference. Mitt Romney, the favorite to win the Republican nomination for president in 2012 DatePerson Term The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior. Organisation co-reference Location
  • 12. University of Sheffield, NLP What is Event Recognition? ● An event is an action or situation relevant to the domain expressed by some relation between entities or terms. ● It is always grounded in time, e.g. the performance of a band, an election, the death of a person Mitt Romney, the favorite to win the Republican nomination for president in 2012 Event DatePerson Relation Relation
  • 13. University of Sheffield, NLP Why are Entities and Events Useful? ● They can help answer the “Big 5” journalism questions (who, what, when, where, why) ● They can be used to categorise the texts in different ways – look at all texts about Obama. ● They can be used as targets for opinion mining – find out what people think about President Obama ● When linked to an ontology and/or combined with other information, they can be used for reasoning about things not explicit in the text – seeing how opinions about different American presidents have changed over the years
  • 14. University of Sheffield, NLP Approaches to Information Extraction Knowledge Engineering  rule based  developed by experienced language engineers  make use of human intuition  easier to understand results  development could be very time consuming  some changes may be hard to accommodate Learning Systems  use statistics or other machine learning  developers do not need LE expertise  requires large amounts of annotated training data  some changes may require re-annotation of the entire training corpus
  • 15. University of Sheffield, NLP Seems like we need a tool to do this clever stuff for us. How about GATE?
  • 16. University of Sheffield, NLP What is GATE? GATE is an NLP toolkit developed at the University of Sheffield over the last 20 years. It's open source and freely available. http://gate.ac.uk • components for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages... • tools for visualising and manipulating text, annotations, ontologies, parse trees, etc. • various information extraction tools • evaluation and benchmarking tools
  • 17. University of Sheffield, NLP GATE components ● Language Resources (LRs), e.g. lexicons, corpora, ontologies ● Processing Resources (PRs), e.g. parsers, generators, taggers ● Visual Resources (VRs), i.e. visualisation and editing components ● Algorithms are separated from the data, which means: – the two can be developed independently by users with different expertise. – alternative resources of one type can be used without affecting the other, e.g. a different visual resource can be used with the same language resource
  • 18. University of Sheffield, NLP ANNIE • ANNIE is GATE's rule-based IE system • It uses the language engineering approach (though we also have tools in GATE for ML) • Distributed as part of GATE • Uses a finite-state pattern-action rule language, JAPE • ANNIE contains a reusable and easily extendable set of components: – generic preprocessing components for tokenisation, sentence splitting etc – components for performing NE on general open domain text
  • 19. University of Sheffield, NLP ANNIE Modules
  • 20. University of Sheffield, NLP 20 Document with Tokens
  • 21. University of Sheffield, NLP 21 Document with Sentences
  • 22. University of Sheffield, NLP Gazetteer editor definition file entries entries for selected list
  • 23. University of Sheffield, NLP Named Entity Grammars • Hand-coded rules written in JAPE applied to annotations to identify NEs • Phases run sequentially and constitute a cascade of FSTs over annotations • Annotations from format analysis, tokeniser. splitter, POS tagger, morphological analysis, gazetteer etc. • Because phases are sequential, annotations can be built up over a period of phases, as new information is gleaned • Standard named entities: persons, locations, organisations, dates, addresses, money • Basic NE grammars can be adapted for new applications, domains and languages
  • 24. University of Sheffield, NLP Document with NEs
  • 25. University of Sheffield, NLP Coreference
  • 26. University of Sheffield, NLP Hands-on: Running ANNIE on news texts ● Start GATE by double clicking on the icon ● Because ANNIE is a ready-made application, we can just load it directly from the menu ● Right click on Applications, select “Ready-made applications” and then ANNIE ● Create a new corpus (right click on Language Resources → New → GATE corpus and click “OK) ● Populate the corpus (right click on the corpus, and use the file chooser in Directory URL to select the hands-on/corpora/news- texts directory. Click “Open” and then “OK”) ● Double click on ANNIE and select “Run this application” ● This will now run the application on all the documents in your corpus
  • 27. University of Sheffield, NLP Hands-on: Looking at the results ● Now you can inspect your annotated data: double click on a document to open it ● Select “Annotation Sets” and “Annotation List” from the tabs ● Click on the top arrow above “Original markups” in the right pane ● You should see a mixture of Named Entity annotations (Person, Location etc) and some other linguistic annotations (Token, Sentence etc) in the Default annotation set ● Click on the name of an annotation in the right hand pane that you want to view (you can select multiple annotations) ● Hover over the annotation in the text to get more info ● Try the Annotation Stack for a different view
  • 28. University of Sheffield, NLP 3. Analysing Social Media
  • 29. University of Sheffield, NLP Analysing language in social media is hard ● Grundman:politics makes #climatechange scientific issue,people don’t like knowitall rational voice tellin em wat 2do ● @adambation Try reading this article , it looks like it would be really helpful and not obvious at all. http://t.co/mo3vODoX ● Want to solve the problem of #ClimateChange? Just #vote for a #politician! Poof! Problem gone! #sarcasm #TVP #99% ● Human Caused #ClimateChange is a Monumental Scam! http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!! Lying to us like MOFO's Tax The Air We Breath! F**k Them!
  • 30. University of Sheffield, NLP Challenges for NLP ● Noisy language: unusual punctuation, capitalisation, spelling, use of slang, sarcasm etc. ● Terse nature of microposts such as tweets ● Use of hashtags, @mentions etc causes problems for tokenisation #thisistricky ● Lack of context gives rise to ambiguities ● NER performs poorly on microposts, mainly because of linguistic pre-processing failure – Performance of standard IE tools decreases from ~90% to ~40% when run on tweets rather than news articles
  • 31. University of Sheffield, NLP Lack of context causes ambiguity Branching out from Lincoln park after dark ... Hello Russian Navy, it's like the same thing but with glitter! ??
  • 32. University of Sheffield, NLP Getting the NEs right is crucial Branching out from Lincoln park after dark ... Hello Russian Navy, it's like the same thing but with glitter!
  • 33. University of Sheffield, NLP Hands-on with TwitIE ● First we need to load the TWITIE plugin ● Click on the green jigsaw-shaped icon at the top to load the plugins manager (this takes a few seconds to load) ● Scroll down to the Twitter plugin and select “Load now”. ● Click “Apply All” and then “Close”. ● Now right-click on “Applications”, select “Ready-made applications” and “TwitIE” ● Create another new corpus, name it “Tweets” and populate with texts from hands-on/corpora/tweets ● Once loaded, double click on TwitIE to open it, and then select “Run this application” (make sure the tweets corpus is chosen) ● Use the Annotation Stack view to look at Tokens in hashtags
  • 34. University of Sheffield, NLP Semantic Search with GATE
  • 35. University of Sheffield, NLP GATE Mímir • can be used to index and search over text, annotations, semantic metadata (concepts and instances) • allows queries that arbitrarily mix full-text, structural, linguistic and semantic annotations • is open source
  • 36. University of Sheffield, NLP What can GATE Mímir do that Google can't? Show me: • all documents mentioning a temperature between 30 and 90 degrees F (expressed in any unit) • all abstracts written in French on patent documents from the last 6 months which mention any form of the word “transistor” in the English abstract • the names of the patent inventors of those abstracts • all documents mentioning steel industries in the UK, along with their location
  • 37. University of Sheffield, NLP Search News Articles for Politicians born in Sheffield http://demos.gate.ac.uk/mimir/gpd/search/gus
  • 38. University of Sheffield, NLP MIMIR: Searching Text Mining Results ● Searching and managing text annotations, semantic information, and full text documents in one search engine ● Queries over annotation graphs ● Regular expressions, Kleene operators ● Designed to be integrated as a web service in custom end-user systems with bespoke interfaces ● Demos at http://services.gate.ac.uk/mimir/
  • 39. University of Sheffield, NLP Hands-on with Semantic Search Try these queries on the BBC News demo: http://services.gate.ac.uk/mimir/gpd/search/index ● Gordon Brown ● Gordon Brown said ● Gordon Brown [0..3] root:say ● {Person} [0..3] root:say ● {Person.gender=female}[0..3] root:say ● Make sure you type the queries EXACTLY or they probably won't work! ● Try making up some of your own queries.
  • 40. University of Sheffield, NLP Try your hand with some SPARQL (for the more adventurous!) ● {Person inst ="http://dbpedia.org/resource/Gordon_Brown"} [0..3] root:say ● {Person sparql="SELECT ?inst WHERE { ?inst a :Politician }"} [0..3] root:say ● {Person sparql = "SELECT ?inst WHERE { ?inst :party <http://dbpedia.org/resource/Labour_Party_%28UK%29> }" } [0..3] root:say ● {Person sparql = "SELECT ?inst WHERE { ?inst :party <http://dbpedia.org/resource/Labour_Party_%28UK %29> . ?inst :almaMater <http://dbpedia.org/resource/University_of_Edinburgh> }" } [0..3] root:say Can you work out what these do?
  • 41. University of Sheffield, NLP We don't just have to look at politicians saying and measuring things ● If we first process the text with other NLP tools such as sentiment analysis, we can also search for positive or negative documents ● Or positive/negative comments about certain people/topics/things/companies ● In the DecarboNet project, we're looking at people's opinions about topics relating to climate change, e.g. fracking ● We could index on the sentiment annotations too ● Other people are using the combination of opinion mining and MIMIR to look at e.g. customer feedback about products / companies on a huge corpus
  • 42. University of Sheffield, NLP A positive tweet
  • 43. University of Sheffield, NLP A negative tweet
  • 44. University of Sheffield, NLP A Sarcastic Tweet
  • 45. University of Sheffield, NLP What diseases are in these documents?
  • 46. University of Sheffield, NLP What Pathogens?
  • 47. University of Sheffield, NLP Disease vs Disease Co-ocurrences
  • 48. University of Sheffield, NLP Diseases vs Pathogens
  • 49. University of Sheffield, NLP The Political Futures Tracker ● Example of using the technology on a real scenario - analysing political tweets in the run-up to the UK elections ● Project funded by Nesta http://www.nesta.org.uk/ ● Series of blog posts about the project, leading up to the election http://www.nesta.org.uk/news/political-futures-tracker ● Some of our own more technical blog posts about the underlying tools: http://gate.ac.uk/projects/pft/
  • 50. University of Sheffield, NLP Recognition of MPs / Candidates
  • 51. University of Sheffield, NLP Topic recognition This term is found by first recognising the head word “job” from a list under the theme “employment” and matching against its root form in the text, i.e. “job”. It is then extended to include the adjectival modifier “British”, which is not present in a list anywhere.
  • 52. University of Sheffield, NLP Positive opinion about science and innovation
  • 53. University of Sheffield, NLP What do the different parties talk about? Conservative vs Labour Transport, Europe and employment are mentioned more by Conservatives NHS is mentioned more by Labour
  • 54. University of Sheffield, NLP Where did people tweet about the economy?
  • 55. University of Sheffield, NLP Terms that co-occur with immigration topic
  • 56. University of Sheffield, NLP Which tweets express sentiment?
  • 57. University of Sheffield, NLP Summary ● Text mining is a very useful pre-requisite for doing all kinds of more interesting things, like semantic search ● Semantic web search allows you to do much more interesting kinds of search than a standard text-based search ● Text mining is hard, so it won't always be correct, however ● This is especially true on lower quality text such as social media ● On the other hand, social media has some of the most interesting material to look at ● Still plenty of work to be done, but there are lots of tools that you can use now and get useful results ● We run annual GATE training courses where you can spend a whole week learning all this and more!
  • 58. University of Sheffield, NLP Acknowledgements and further information ● Research partially supported by the European Union/EU under the Information and Communication Technologies (ICT) theme of the 7th Framework Programme for R&D (FP7) DecarboNet (610829) http://www.decarbonet.eu and Nesta http://nesta.org.uk ● Annual GATE training course in June in Sheffield:https://gate.ac.uk/family/training.html ● Download gate: http://gate.ac.uk/download
  • 59. University of Sheffield, NLP Relevant publications ● K Bontcheva, V Tablan, H Cunningham. Semantic Search over Documents and Ontologies (2014) Bridging Between Information Retrieval and Databases, 31-53 ● V. Tablan, K. Bontcheva, I. Roberts, and H. Cunningham. Mímir: An open- source semantic search framework for interactive information seeking and discovery. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 2014 ● A. Dietzel and D. Maynard. Climate Change: A Chance for Political Re- Engagement? In Proc. of the Political Studies Association 65th Annual International Conference, April 2015, Sheffield, UK. ● Diana Maynard. Challenges in Analysing Social Media. In Adrian Duşa, Dietrich Nelle, Günter Stock and Gert G. Wagner (eds.) (2014): Facing the Future: European Research Infrastructures for the Humanities and Social Sciences. SCIVERO Verlag, Berlin, 2014. ● These and other relevant papers on the GATE website: http://gate.ac.uk/gate/doc/papers.html