News construction from microblogging
posts using open data
Francisco Berrizbeitia
Universidad Simón Bolívar
Caracas, Venezuela
fberrizbeitia@gmail.com
June, 2014
Abstract
Information access can be limited in situations where traditional media outlets cannot
cover events because of geographical limitations or censorship. Examples of such situations
are civil unrest, war and natural disasters. In these situations citizen journalism replaces or
complements traditional media in the documentation of events. Microblogging services
such as Twitter have become very useful in these scenarios due to their mobile nature and
multimedia capabilities.
In this research we propose a method to create searchable, semantically annotated news
articles from tweets in an automated way using the cloud of linked open data.
Keywords
Semantic web, News, Microblogging, Twitter, Automatic document generation, Data
journalism, Citizen journalism.
1 Introduction
Citizen journalism has become a very common practice with the arrival of smartphones
and microblogging services such as Twitter. Thanks to the multimedia capabilities of the devices
and the mobile nature of the social network, people all over the world are documenting all sorts
of events and publishing them on the Web in real time. This type of journalism is particularly
important in situations where the traditional media cannot cover the events, such as natural
disasters, war or civil unrest, or because of government or self-imposed censorship.
Citizen journalism is protected by article 19 of the Universal Declaration of Human Rights
(United Nations):
“Everyone has the right to freedom of opinion and expression; this right includes
freedom to hold opinions without interference and to seek, receive and impart
information and ideas through any media and regardless of frontiers.”
This protection has had tremendous implications in the recent past, in situations where the
only available information was found on social networks and in international media outlets
with very limited coverage. We believe it’s of great importance to develop a technology that
allows the creation of “fair” documents from all the contributions made by the users during
such events.
The hope is that the automated documents created by this technology will be closer to what
really happened and guarantee impartiality.
As a first step in this research we want to construct a news article from a single 140-character
message using the open data cloud.
In the rest of the report we first describe our overall approach to the problem,
then describe the system we developed for this task, and finally look at the results.
2 Related Work
Information extraction from Twitter and other microblogging platforms has been done in
the past. (David Laniado, 2010) explored the semantic value of hashtags as identifiers for the
semantic web. (Shinavier, 2010) proposed the possibility of creating a real-time semantic web
using structured microblogging messages. (Ritter, 2012) uses natural language processing and
information extraction techniques over a corpus of tweets to extract machine-readable
information.
Sentiment analysis has also been a topic of research, as in the work of (Alexander Pak, 2010),
who propose a machine learning method to classify tweets as positive, negative or
neutral.
3 Description
The main objective is to obtain the semantically meaningful concepts expressed in the
micropost from the Open Data Cloud and then create a document that extends the original
text with the retrieved concepts. If we succeed in this task we will end up with a news article
in which the answers to the questions who, what, where, when and why (Wikipedia, 2014) are
derived from the micropost and extended with the linked open data cloud.
Figure 1. Overall view of the process
Figure 1 shows the overall process of the news creation. Since this is our first approach to
the problem, we decided to limit the sources of information to Twitter as the only microblog
input and DBpedia as our source of semantically annotated information.
The system was implemented as a web application written in PHP. In the next sections we
describe each part of the system.
3.1 Information gathering and text preparation
The first task consists of gathering the information posted by a user of the social network; we
collect not only the published text, but also the media when available and information about
the author. We obtain all the information using the public API provided by Twitter. As shown in
figure 2, the only input the system needs is the tweet ID.
Figure 2. Input screen of the system
After the text is retrieved it must be “denoised” before any further processing. At this point all
the stop words are removed, as well as links and Twitter-specific words such as RT or FF. The
hashtag character (#) is removed, leaving the word that follows it.
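The denoising step described above can be sketched as follows. This is a minimal illustration in Python, not the PHP code of the system: the stop-word list is a tiny placeholder, and the function name `denoise` and the tokenization rule are assumptions made for the example.

```python
import re

# Tiny illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "in", "on", "at", "of", "to", "and", "or",
              "is", "are", "was", "were", "for", "with", "by"}
# Twitter-specific words named in the text.
TWITTER_WORDS = {"rt", "ff"}

URL_RE = re.compile(r"https?://\S+")


def denoise(text):
    """Return the tokens of a tweet with links, stop words and RT/FF removed."""
    text = URL_RE.sub(" ", text)      # drop links
    text = text.replace("#", " ")     # drop '#', keep the hashtag word
    tokens = re.findall(r"\w+", text)  # split on anything that is not a word character
    noise = STOP_WORDS | TWITTER_WORDS
    return [t for t in tokens if t.lower() not in noise]


if __name__ == "__main__":
    tweet = "RT @reporter: Police fire tear gas in #SaoPaulo http://t.co/abc123 #FF"
    print(denoise(tweet))
```

The text does not say how user mentions are handled; in this sketch the "@" is simply dropped by the tokenizer and the user name is kept as an ordinary token.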
3.2 Candidate selection
Before querying the DBpedia endpoint we first run a local analysis using a version of the
WordNet database. Each word is analyzed and a matrix of the possible senses of the words is
created. Following a set of rules we create a list of possible 2-word and 1-word candidates that
may be relevant concepts, places or persons. By doing this we wanted to reduce the number of
queries we need to make to the endpoint.
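As a rough illustration of the candidate list, the sketch below enumerates 2-word and 1-word candidates from the denoised tokens. The WordNet-based rules are not specified in the text, so they are represented here only by a hypothetical `keep` predicate that accepts everything by default; the function name `candidates` is likewise an assumption.

```python
def candidates(tokens, keep=lambda phrase: True):
    """List 2-word candidates (adjacent token pairs) followed by 1-word candidates.

    `keep` is a hypothetical stand-in for the WordNet-based selection rules,
    which the text does not describe in detail.
    """
    two_word = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    one_word = list(tokens)

    seen = set()
    result = []
    for phrase in two_word + one_word:
        key = phrase.lower()
        # Skip case-insensitive duplicates and anything the rules reject.
        if key not in seen and keep(phrase):
            seen.add(key)
            result.append(phrase)
    return result


if __name__ == "__main__":
    print(candidates(["Police", "fire", "tear", "gas"]))
```

Listing the 2-word candidates first is a design choice of this sketch: a phrase such as "tear gas" is usually more specific than either of its words alone. Any filter that rejects candidates early reduces the number of remote queries made in the following steps.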
Since Wikipedia and DBpedia are tightly related, we decided to query Wikipedia first,
using its API to obtain the URL of the Wikipedia page of which the candidate is the
main topic.
At the end of this process we have a list of candidates with known Wikipedia
pages.
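The Wikipedia lookup can be sketched as below. The text does not say which API call the system used; this example assumes the MediaWiki `action=query` endpoint with `prop=info` and `inprop=url`, and it only builds the request URL and parses a response, without performing the network call. The function names and the sample response are illustrative.

```python
import json
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_lookup_url(candidate):
    """Build a MediaWiki API request asking for the page URL of a title."""
    params = {
        "action": "query",
        "titles": candidate,
        "prop": "info",
        "inprop": "url",   # include the full page URL in the answer
        "redirects": 1,    # follow redirects to the target page
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def extract_page_url(response_text):
    """Return the page URL from an API response, or None if no page exists."""
    data = json.loads(response_text)
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        if "missing" not in page and "invalid" not in page and "fullurl" in page:
            return page["fullurl"]
    return None


if __name__ == "__main__":
    print(build_lookup_url("tear gas"))
    # Hand-written sample in the shape of an API answer (illustrative only).
    sample = json.dumps({"query": {"pages": {"1": {
        "title": "Tear gas",
        "fullurl": "https://en.wikipedia.org/wiki/Tear_gas"}}}})
    print(extract_page_url(sample))
```

A page existing under a given title does not by itself guarantee that the candidate is its main topic; the sketch leaves that check to the caller.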
3.3 Semantically annotated information retrieval from the Open Data Cloud
The next step is to query DBpedia’s SPARQL endpoint to retrieve the semantically annotated
information related to the tweet topics detected in the previous step. Once the information is
received from the endpoint, it is put together with the author information from Twitter in a
Turtle file in order to make it available via a SPARQL endpoint. We used a subset of the rNews
Ontology (International Press Telecommunications Council, 2011) shown in Figure 3.
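The DBpedia query can be sketched as follows. The text does not list the properties that were retrieved, so the choice of `rdfs:label` and `dbo:abstract` is an assumption, as are the function names. The sketch relies on the convention that a DBpedia resource URI reuses the title of the corresponding English Wikipedia page, and it only builds the request without sending it.

```python
from urllib.parse import urlencode, urlparse

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"


def dbpedia_uri(wikipedia_url):
    """Map an English Wikipedia page URL to the matching DBpedia resource URI."""
    path = urlparse(wikipedia_url).path          # e.g. "/wiki/Tear_gas"
    if not path.startswith("/wiki/"):
        raise ValueError("not a Wikipedia article URL: " + wikipedia_url)
    return "http://dbpedia.org/resource/" + path[len("/wiki/"):]


def build_query(resource_uri, lang="en"):
    """Build a SPARQL query for the label and abstract of one resource."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?label ?abstract WHERE {\n"
        f"  <{resource_uri}> rdfs:label ?label ;\n"
        "      dbo:abstract ?abstract .\n"
        f'  FILTER (lang(?label) = "{lang}" && lang(?abstract) = "{lang}")\n'
        "}"
    )


def build_request_url(query):
    """Build the GET request URL for the public DBpedia SPARQL endpoint."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return SPARQL_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    uri = dbpedia_uri("https://en.wikipedia.org/wiki/Tear_gas")
    query = build_query(uri)
    print(query)
    print(build_request_url(query))
```

A candidate whose Wikipedia page has no DBpedia entry would return an empty result set for this query.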
Figure 3. Subset of the rNews Ontology used for the project
4 Results
To test the approach and the system we selected 90 tweets directly from the Twitter search on
3 subjects: the Brazilian riots during the 2014 World Cup, Barack Obama and Venezuela. To
collect the microposts we ran the search through the API and collected
the first 30 messages with an associated picture, repeating the same process for each of the
selected topics.
After the sample was obtained we proceeded to manually tag each tweet. This was done twice,
by different persons, to minimize human error. After the sample was manually
tagged we ran the automated process for each tweet and saved the results for each case. The
results can be seen in Figure 4. We expected to find 415 terms for all tweets and found 433. Of
those, 317 were an exact match to what was expected from the manual process, 63 resulted in
information that is not wrong but adds no real value, and 53 were wrong concepts. This gives a
precision of 76.36%, that is, the share of expected terms that were automatically detected by the
method, and an error rate of 12.24%.
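The arithmetic behind these figures can be checked with a short sketch using only the counts reported above. The function and variable names are illustrative. The share of expected terms detected is computed as exact matches over expected terms (317/415), and the error rate as wrong concepts over found terms (53/433). The first ratio comes out at about 76.4%, marginally different from the reported 76.36%, while the second matches the reported 12.24%.

```python
def percentage(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


def summarize(expected, exact, no_value, wrong):
    """Compute the ratios discussed in the text from the raw counts."""
    found = exact + no_value + wrong
    return {
        "found": found,
        # Share of the manually expected terms that were matched exactly.
        "detected_share": percentage(exact, expected),
        # Share of the retrieved terms that were wrong concepts.
        "error_rate": percentage(wrong, found),
        # Share of the retrieved terms that were exact matches.
        "exact_over_found": percentage(exact, found),
    }


if __name__ == "__main__":
    # Counts as reported in the text.
    print(summarize(expected=415, exact=317, no_value=63, wrong=53))
```

The last ratio, exact matches over all retrieved terms, is what is conventionally called precision; it is included for comparison only and is not a figure reported in the text.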
Figure 4. Result of the test cases
Analyzing the errors, we noticed that the automatically retrieved concept carried the wrong
meaning for the context. For example, in the context of the Brazilian riots, the concept “fire”
was defined as in “a burning fire” instead of “fire a gun”. Similar cases can be found in the
other topics that were tested.
The terms that were not detected by the automated method were candidates with known
Wikipedia pages that had no corresponding entry in DBpedia.
5 Future work
The results obtained encourage us to further develop the method and include
automated context detection as a way to maximize precision. A possible approach to
this is described in (Esther Villar Rodríguez, 2012) and (Nebhi, 2012).
We would also like to further develop the system so that it not only detects, retrieves and saves
information from one message, but can also create complete documentation of an event
over an extended period of time, based on several microblogging platforms and media outlets,
both independent and corporate. The end result we hope to reach is a fully searchable,
semantically annotated news stream that will serve as a neutral and centralized endpoint for
data journalism.
6 Conclusions
In this research we proposed a method to automatically create a news article from a tweet
using the cloud of linked open data. To do so, we implemented a web system that
takes a tweet ID as input and generates a semantically annotated news article based on a subset
of the rNews Ontology. To test our approach we collected a group of 90 tweets on three
subjects: the Brazilian riots during the 2014 World Cup, Barack Obama and Venezuela. The
messages were tagged manually and then compared with the automatically found annotations.
Our method was able to capture 76.36% of the manually detected terms with an error rate of
12.24%, due mostly to disambiguation problems.
These results encourage us to further develop the method and the system, first to solve the
disambiguation problems and then to pursue a more ambitious approach that will allow us to create
a semantically annotated news stream based not only on tweets, but also on other
microblogging services, independent blogs and corporate media outlets, which can serve as a
centralized semantic endpoint for data journalism.
7 References
Alexander Pak, P. P. (2010). Twitter as a Corpus for Sentiment Analysis and Opinion Mining.
Valletta, Malta: Proceedings of the Seventh International Conference on Language
Resources and Evaluation.
David Laniado, P. M. (2010). Making sense of Twitter. Shanghai, China: ISWC 2010.
Esther Villar Rodríguez, A. I. (2012). Using Linked Open Data sources for Entity Disambiguation.
Rome: CLEF Initiative.
International Press Telecommunications Council. (2011, 10 7). rNews. Retrieved 6 21, 2014, from
IPTC site for developers: http://dev.iptc.org/rNews
Nebhi, K. (2012). Ontology-Based Information Extraction from Twitter. (pp. 17-22). Mumbai:
Proceedings of the Workshop on Information Extraction and Entity Analytics on Social
Media Data.
Ritter, A. (2012). Extracting Knowledge from Twitter and The Web. Doctorate Thesis. University
of Washington.
Shinavier, J. (2010). Realtime #SemanticWeb in <= 140 Characters. WWW2010. Raleigh, North
Carolina.
United Nations. (n.d.). United Nations. Retrieved 6 22, 2014, from The Universal Declaration of
Human Rights: http://www.un.org/en/documents/udhr/index.shtml
Wikipedia. (2014, 6 11). Five Ws. Retrieved 6 20, 2014, from wikipedia.org:
http://en.wikipedia.org/wiki/Five_Ws

SEM - Search Engine Marketing / Mercadeo en búscadoresFrancisco Berrizbeitia
 
Imágenes Digitales. Raster y Vectoriales
Imágenes Digitales. Raster y VectorialesImágenes Digitales. Raster y Vectoriales
Imágenes Digitales. Raster y VectorialesFrancisco Berrizbeitia
 

Más de Francisco Berrizbeitia (20)

Trabajo 1 - Definición de un sitio web de contenido multimedia
Trabajo 1 - Definición de un sitio web de contenido multimediaTrabajo 1 - Definición de un sitio web de contenido multimedia
Trabajo 1 - Definición de un sitio web de contenido multimedia
 
Introducción al el mercadeo en Internet
Introducción al el mercadeo en InternetIntroducción al el mercadeo en Internet
Introducción al el mercadeo en Internet
 
¿ Cómo empezar con mi sitio web?
¿ Cómo empezar con mi sitio web?¿ Cómo empezar con mi sitio web?
¿ Cómo empezar con mi sitio web?
 
2013 digital future_in_focus_venezuela
2013 digital future_in_focus_venezuela2013 digital future_in_focus_venezuela
2013 digital future_in_focus_venezuela
 
Tiene sentido crear contenido audiovisual para ser difundido exclusivamente ...
Tiene sentido crear contenido audiovisual para ser difundido  exclusivamente ...Tiene sentido crear contenido audiovisual para ser difundido  exclusivamente ...
Tiene sentido crear contenido audiovisual para ser difundido exclusivamente ...
 
Caracterización de la popularidad de los archivos de un wiki a gran escala v3
Caracterización de la popularidad de los archivos de un wiki a gran escala v3Caracterización de la popularidad de los archivos de un wiki a gran escala v3
Caracterización de la popularidad de los archivos de un wiki a gran escala v3
 
Formación en salud y seguridad industrial llave en mano
Formación en salud y seguridad industrial llave en manoFormación en salud y seguridad industrial llave en mano
Formación en salud y seguridad industrial llave en mano
 
Listado de cursos manual rse
Listado de cursos manual rseListado de cursos manual rse
Listado de cursos manual rse
 
Text mining
Text miningText mining
Text mining
 
AID Aprendizaje - Nosotros
AID Aprendizaje - NosotrosAID Aprendizaje - Nosotros
AID Aprendizaje - Nosotros
 
Keylight ae user guide
Keylight ae user guideKeylight ae user guide
Keylight ae user guide
 
Personalizacion de blogspot
Personalizacion de blogspotPersonalizacion de blogspot
Personalizacion de blogspot
 
Trabajo 1 - Conceptualización del proyecto de difusión audiovisual
Trabajo 1 - Conceptualización del proyecto de difusión audiovisualTrabajo 1 - Conceptualización del proyecto de difusión audiovisual
Trabajo 1 - Conceptualización del proyecto de difusión audiovisual
 
Clase 3 estrategias de difusion
Clase 3   estrategias de difusionClase 3   estrategias de difusion
Clase 3 estrategias de difusion
 
Emprendimiento en web 2.0 / Cifras y casos de exito
Emprendimiento en web 2.0 / Cifras y casos de exitoEmprendimiento en web 2.0 / Cifras y casos de exito
Emprendimiento en web 2.0 / Cifras y casos de exito
 
Clase02
Clase02Clase02
Clase02
 
Internet en América Latina - ¿ Por qué generar contenido multimedia para la red?
Internet en América Latina - ¿ Por qué generar contenido multimedia para la red?Internet en América Latina - ¿ Por qué generar contenido multimedia para la red?
Internet en América Latina - ¿ Por qué generar contenido multimedia para la red?
 
SEM - Search Engine Marketing / Mercadeo en búscadores
SEM - Search Engine Marketing / Mercadeo en búscadoresSEM - Search Engine Marketing / Mercadeo en búscadores
SEM - Search Engine Marketing / Mercadeo en búscadores
 
Estrategías de difusión en web 2.0
Estrategías de difusión en web 2.0Estrategías de difusión en web 2.0
Estrategías de difusión en web 2.0
 
Imágenes Digitales. Raster y Vectoriales
Imágenes Digitales. Raster y VectorialesImágenes Digitales. Raster y Vectoriales
Imágenes Digitales. Raster y Vectoriales
 

Último

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Último (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

News construction from microblogging posts using open data

  • 1. News construction from microblogging posts using open data
Francisco Berrizbeitia
Universidad Simón Bolívar
Caracas, Venezuela
fberrizbeitia@gmail.com
June, 2014
Abstract
Access to information can be limited when traditional media outlets cannot cover events because of geographical limitations or censorship, for example during civil unrest, war or natural disasters. In these situations citizen journalism replaces or complements traditional media in documenting such events. Microblogging services such as Twitter have become very useful in these scenarios due to their mobile nature and multimedia capabilities.
In this research we propose a method to automatically create searchable, semantically annotated news articles from tweets using the cloud of linked open data.
Keywords
Semantic web, News, Microblogging, Twitter, Automatic document generation, Data journalism, Citizen journalism.
1 Introduction
Citizen journalism has become a very common practice with the arrival of smartphones and microblogging services such as Twitter. Thanks to the multimedia capabilities of the devices and the mobile nature of the social network, people all over the world are documenting all sorts of events and publishing them on the Web in real time. This type of journalism is particularly important in situations where traditional media cannot cover the events, such as natural disasters, war or civil unrest, or because of government or self-imposed censorship.
Citizen journalism is protected by the Universal Declaration of Human Rights, article 19 (United Nations):
“Everyone has the right to freedom of opinion and expression; this right includes freedom to hold opinions without interference and to seek, receive and impart information and ideas through any media and regardless of frontiers.”
  • 2. This protection has had tremendous implications in the recent past, in situations where the only available information was found on social networks and in international media outlets with very limited coverage.
We believe it is of great importance to develop a technology that allows the creation of “fair” documents from all the contributions made by users during such events. The hope is that the documents created automatically by this technology will be closer to what really happened and will guarantee impartiality.
As a first step in this research we want to construct a news article from a single 140-character message using the open data cloud.
In the rest of the report we first describe our overall approach to the problem, then describe the system we developed for this task, and finally look at the results.
2 Related Work
Information extraction from Twitter and other microblogging platforms has been done in the past. (David Laniado, 2010) explored the semantic value of hashtags as identifiers for the semantic web. (Shinavier, 2010) proposed the possibility of creating a real-time semantic web using structured microblogging messages. (Ritter, 2012) uses natural language processing and information extraction techniques over a corpus of tweets to extract machine-readable information. Sentiment analysis has also been a topic of research, as in the work of (Alexander Pak, 2010), who propose a machine learning method to classify tweets as positive, negative or neutral.
3 Description
The main objective is to obtain the semantically meaningful concepts expressed in the micropost from the Open Data Cloud and then create a document that extends the original text with the retrieved concepts. If we succeed in this task we will end up with a news article in which the answers to the questions who, what, where, when and why (Wikipedia, 2014) are derived from the micropost and extended with the linked open data cloud.
  • 3. Figure 1. Overall view of the process
Figure 1 shows the overall process of news creation. Since this is our first approach to the problem, we decided to limit the sources of information to Twitter as the only microblog input and DBpedia as our source of semantically annotated information. The system was implemented as a web application written in PHP. In the next sections we describe each part of the system.
3.1 Information gathering and text preparation
The first task consists of gathering the information posted by a user of the social network; we collect not only the published text, but also the media, when available, and information about the author. We obtain all the information using the public API provided by Twitter. As shown in Figure 2, the only input the system needs is the tweet ID.
Figure 2. Input screen of the system
  • 4. After the text is retrieved it must be “denoised” before any further processing. At this point all stop words are removed, as well as links and Twitter-specific words such as RT or FF. The hashtag character (#) is removed, leaving the remaining word.
3.2 Candidate selection
Before querying the DBpedia endpoint we first run a local analysis using a version of the WordNet database. Each word is analyzed and a matrix of the possible senses of the words is created. Following a set of rules we create a list of possible two-word and one-word candidates that may be relevant concepts, places or persons. By doing this we wanted to reduce the number of queries we need to make to the endpoint.
Since Wikipedia and DBpedia are tightly related, we decided to query Wikipedia first, using its API to obtain the URL of the Wikipedia page of which the candidate is the main topic. At the end of this process we have a list of candidates with known Wikipedia pages.
3.3 Semantically annotated information retrieval from the Open Data Cloud
The next step is to query DBpedia's SPARQL endpoint to retrieve the semantically annotated information related to the tweet topics detected in the previous step.
Once the information is received from the endpoint, it is put together with the author information from Twitter in a Turtle file in order to make it available via a SPARQL endpoint. We used a subset of the rNews Ontology (International Press Telecommunications Council, 2011), shown in Figure 3.
Figure 3. Subset of the rNews Ontology used for the project
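The text preparation, candidate generation and DBpedia lookup described in sections 3.1 to 3.3 can be illustrated with a minimal Python sketch. It is not the author's PHP implementation: the stop-word list, the function names (`denoise`, `candidates`, `dbpedia_query`) and the selected properties (`rdfs:label`, `dbo:abstract`) are assumptions made for illustration, the WordNet-based rules are not reproduced because the paper does not spell them out, and the sketch only builds a query string without contacting any endpoint.

```python
import re

# Assumed, deliberately tiny stop-word list; the paper does not say which list was used.
STOP_WORDS = {"a", "an", "the", "in", "on", "at", "of", "to", "and",
              "is", "are", "for", "with", "during"}
# Twitter-specific tokens named in the text.
TWITTER_WORDS = {"rt", "ff"}
NOISE = STOP_WORDS | TWITTER_WORDS

URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[A-Za-z0-9_']+")


def denoise(text):
    """Drop links, the '#' character, stop words and Twitter-specific words."""
    text = URL_RE.sub(" ", text)   # remove links
    text = text.replace("#", "")   # keep the hashtag word, drop the '#'
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if t.lower() not in NOISE]


def candidates(tokens):
    """Return two-word candidates followed by one-word candidates.

    Hypothetical simplification: every adjacent pair is a candidate.
    The paper filters candidates with WordNet-based rules instead.
    """
    pairs = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return pairs + list(tokens)


def dbpedia_query(wikipedia_title):
    """Build a SPARQL query for the DBpedia resource matching a Wikipedia title."""
    resource = "http://dbpedia.org/resource/" + wikipedia_title.replace(" ", "_")
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?label ?abstract WHERE {\n"
        f"  <{resource}> rdfs:label ?label ; dbo:abstract ?abstract .\n"
        '  FILTER (lang(?label) = "en" && lang(?abstract) = "en")\n'
        "}"
    )
```

Because stop words are removed before pairing, a two-word candidate in this sketch can join words that were not adjacent in the original tweet; whether the original system behaves the same way is not stated in the paper.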
  • 5. 4 Results
To test the approach and the system we selected 90 tweets directly from the Twitter search on three subjects: the Brazilian riots during the 2014 World Cup, Barack Obama and Venezuela. To collect the microposts we ran a search through the API and kept the first 30 messages with an associated picture, repeating the process for each of the selected topics.
After the sample was obtained we manually tagged each tweet. This was done twice, by different persons, to minimize human error. After the sample was manually tagged we ran the automated process for each tweet and saved the results for each case.
The results can be seen in Figure 4. We expected to find 415 terms across all tweets and found 433. Of those, 317 were an exact match to what was expected from the manual process, 63 returned information that is not wrong but adds no real value, and 53 were wrong concepts. This gives a precision of 76.36%, that is, the share of expected terms that were automatically detected by the method, and an error rate of 12.24%.
Figure 4. Result of the test cases
Analyzing the errors, we noticed that the automatically retrieved concept carried the wrong meaning for the context. For example, in the context of the Brazilian riots, the concept “fire” was defined as in “a burning fire” instead of “fire a gun”. Similar cases can be found in the other topics that were tested. The terms that were not detected by the automated method were candidates with known Wikipedia pages that had no corresponding entry in DBpedia.
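The reported percentages can be checked against the counts given above. The short Python sketch below recomputes them; the variable names are illustrative, and the choice of denominators (expected terms for the detection share, found terms for the error rate) is an assumption inferred from the wording of the text.

```python
# Counts taken from the results paragraph.
expected_terms = 415   # terms tagged manually
found_terms = 433      # terms returned by the automated process
exact = 317            # exact matches with the manual tags
no_value = 63          # not wrong, but adding no real value
wrong = 53             # wrong concepts

# The three outcome categories account for every term found.
assert exact + no_value + wrong == found_terms

# Share of expected terms that were detected (assumed denominator: expected terms).
detected_share = exact / expected_terms
# Share of found terms that were wrong (assumed denominator: found terms).
error_rate = wrong / found_terms
# Share of found terms that were exact matches, for comparison.
exact_share_of_found = exact / found_terms

print(f"detected share of expected terms: {100 * detected_share:.2f}%")
print(f"error rate over found terms:      {100 * error_rate:.2f}%")
print(f"exact matches over found terms:   {100 * exact_share_of_found:.2f}%")
```

Under these assumptions the error rate reproduces the reported 12.24%, while 317/415 evaluates to about 76.39%, marginally above the reported 76.36%; the source does not explain the small difference. Since the first ratio is taken over the expected terms, it corresponds to what is usually called recall, whereas the exact matches over the terms found come to about 73.21%.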
  • 6. 5 Future work
The results obtained encourage us to further develop the method and include automated context detection as a way to maximize precision. A possible approach to this problem is described in (Esther Villar Rodríguez, 2012) and (Nebhi, 2012).
We would also like to further develop the system so that it not only detects, retrieves and saves information about one message, but can also create complete documentation of an event over an extended period of time, based on several microblogging platforms and media outlets, both independent and corporate.
The end result we hope to reach is a fully searchable, semantically annotated news stream that will serve as a neutral and centralized endpoint for data journalism.
6 Conclusions
In this research we proposed a method to automatically create a news article from a tweet using the cloud of linked open data. To do so, we successfully implemented a web system that takes a tweet ID as input and generates a semantically annotated news article based on a subset of the rNews Ontology.
To test our approach we collected a group of 90 tweets on three subjects: the Brazilian riots during the 2014 World Cup, Barack Obama and Venezuela. The messages were tagged manually and then compared with the automatically found annotations. Our method was able to capture 76.36% of the manually detected terms, with an error rate of 12.24% due mostly to disambiguation problems.
These results encourage us to further develop the method and the system, first to solve the disambiguation problems and then to pursue a more ambitious approach that will allow us to create a semantically annotated news stream based not only on tweets, but also on other microblogging services, independent blogs and corporate media outlets, which can serve as a centralized semantic endpoint for data journalism.
7 References
Alexander Pak, P. P. (2010). Twitter as a Corpus for Sentiment Analysis and Opinion Mining. Valletta, Malta: Proceedings of the Seventh International Conference on Language Resources and Evaluation.
David Laniado, P. M. (2010). Making sense of Twitter. Shanghai, China: ISWC 2010.
Esther Villar Rodríguez, A. I. (2012). Using Linked Open Data sources for Entity Disambiguation. Rome: CLEF Initiative.
International Press Telecommunications Council. (2011, 10 7). rNews. Retrieved 6 21, 2014, from IPTC site for developers: http://dev.iptc.org/rNews
  • 7. Nebhi, K. (2012). Ontology-Based Information Extraction from Twitter. (pp. 17-22). Mumbai: Proceedings of the Workshop on Information Extraction and Entity Analytics on Social Media Data.
Ritter, A. (2012). Extracting Knowledge from Twitter and The Web. Doctorate Thesis. University of Washington.
Shinavier, J. (2010). Realtime #SemanticWeb in <= 140 Characters. WWW2010. Raleigh, North Carolina.
United Nations. (n.d.). United Nations. Retrieved 6 22, 2014, from The Universal Declaration of Human Rights: http://www.un.org/en/documents/udhr/index.shtml
Wikipedia. (2014, 6 11). Five Ws. Retrieved 6 20, 2014, from wikipedia.org: http://en.wikipedia.org/wiki/Five_Ws