DS8008 Natural Language Processing Project
Named Entity Recognition from Online News
(April 2018)
Fadel, Fady – Kashyap, Akshat – Najlis, Bernardo
Masters in Data Science, Ryerson University, Toronto, Canada

Abstract—This project aimed to create a series of models for the
extraction of Named Entities (People, Locations, Organizations,
Dates) from news headlines obtained online. We created two
models: a traditional Natural Language Processing model using
Maximum Entropy, and a Deep Neural Network model using pre-
trained word embeddings. Accuracy results of the two models show
similar performance, but their requirements and limitations
differ and can help determine what type of model is
best suited for each specific use case.
This project was completed as part of the DS8008 Natural
Language Processing Course at the Masters in Data Science
Program at Ryerson University in Toronto, Ontario during the
months of January through April 2018. All code is available online
at https://github.com/bnajlis/named_entity_recognition
Index Terms—Named Entity Recognition, Online News,
Natural Language Processing, Maximum Entropy, Deep Neural
Networks, Long Short-Term Memory, LSTM
I. CASE DESCRIPTION AND PROBLEM PRESENTATION
NAMED ENTITY RECOGNITION (NER) is one of the
most valuable tasks in natural language processing (NLP).
It is widely used in information retrieval systems, machine
translation, question answering systems, and summarization
tools. It is also one of the most popular methodologies to extract
information from unstructured data (emails, blogs, documents,
news articles, etc.).
NER systems serve various needs and, depending on their
requirements, can produce different types of entities. The most
common types are:
• person,
• location,
• organization,
• date, and
• money.
The problem with named entities is that names form an open
class: new words can be added to the class as often as new
entities are created. These new entities have to be added to a
large dictionary of entities called a gazetteer. The gazetteer is
difficult to maintain given the pace of innovation, product
development, new explorations, and newly popular acronyms. A
good example of the above is the area of cryptocurrency, where
various coins and tokens such as Bitcoin have been widely
adopted and featured in major news topics in recent months.
The second problem with entities is the surprising ambiguity
between types. For example, Ford could refer to a person or to
the Ford company; Jordan can refer to a person or to the country
of Jordan; April could be either a month or a person; and the
same goes for Ryerson, which can refer to either a person or a
university.
II. RELATED WORK
Research has focused on improving the accuracy of named
entity recognition with the use of machine learning, specifically
in natural language processing. Among the most common
models used are the Decision Tree model, the Naive Bayes
classifier, the Maximum Entropy model [1], and the Hidden
Markov model [2].
Newer research has also focused on the use of Deep Neural
Network models to improve the performance of Named Entity
Recognition. The use of Long Short-Term Memory cells and
word embeddings is common in these types of models.
III. METHODOLOGY
Our approach is to apply an NER system trained on a dataset
in the domain of online news.
A. News Tagged Dataset
Pre-tagged datasets for specific domains are nearly non-
existent, and where one exists there is no assurance of quality.
In order to train our models in the news domain and ensure that
we obtained a sample of recent named entities to train and
evaluate our models, we opted to produce the Gold Corpus
ourselves by manually extracting online news articles and
pre-processing the data.
The data was obtained from the Ryerson University Library
& Archives newspaper database (RULA) [5]. It consists of
2,116 news articles from The Globe and Mail, a Canadian
newspaper, for the period of February and March 2018.
B. Obtaining the Gold Corpus
We pre-processed the data from its raw form, in which each
article was delimited by multiple underscores. Each individual
article was then cleaned of unnecessary fields that were either
manually entered into the news field in error or simply did not
belong to the news article (e.g., author, credits, keywords). The
cleaned news articles were then each saved as an individual file
for easy referencing and accessibility during later validation.
Field Name   Description
Unigram      Word in the sentence
POS tag      Part-of-speech tag for the word
IOB tag      IOB tag for the word (entity boundary and type)
Table 1. Schema for the Gold Corpus dataset created
Fig. 2.a Sample of raw data obtained from the RULA system
Fig. 2.b High level workflow to obtain the Gold Corpus used in the project
and sample of IOB tagged data
The data was obtained in the form of a raw unstructured text
file [Fig. 2.a]. We then did extensive cleaning of the raw data,
which was found to produce errors even with restricted
validation. After further analysis, we discovered that the file
did not only contain the main news content: there were
instances where it contained fields that had been entered in
error by the news writer in the incorrect field. Once these
errors were corrected, we split the content into individual files
so that we could reference and validate them in later steps.
Next, we used an assisted tagging method that leaned on an
open-source natural language processing library called
spaCy [4]. After obtaining the required format, we validated
the tagged dataset manually to ensure it was free of errors.
The final dataset was then saved in CoNLL format, which
consists of tab-separated (word, POS tag, IOB tag) triples, to
ease model ingestion in the later steps, where we used the
built-in NLTK CoNLL corpus reader to feed the data to each
model.
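To make the assisted-tagging step concrete, the following is a minimal sketch of how spaCy's pre-trained tagger can emit CoNLL-style lines. The model name, the entity-type filter, and the direct mapping from spaCy's entity labels to our tag set are illustrative assumptions, not the project's exact code.

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Entity types we keep; everything else is tagged O.
# (spaCy's ORG label would still need mapping to our ORGANIZATION tag.)
KEEP = {"PERSON", "GPE", "ORG", "DATE"}

def to_conll(text):
    """Pre-tag raw article text and emit CoNLL-style lines:
    word <tab> POS tag <tab> IOB tag, one token per line."""
    doc = nlp(text)
    lines = []
    for token in doc:
        if token.ent_iob_ in ("B", "I") and token.ent_type_ in KEEP:
            iob = f"{token.ent_iob_}-{token.ent_type_}"
        else:
            iob = "O"
        lines.append(f"{token.text}\t{token.tag_}\t{iob}")
    return "\n".join(lines)

print(to_conll("Ryerson University is located in Toronto."))
```

The output of this step was then reviewed by hand, as described above, before being accepted into the Gold Corpus.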
C. Maximum Entropy Model
1. Introduction
The maximum entropy framework estimates probabilities
based on the principle of making as few assumptions as
possible, other than the constraints imposed. Such constraints
are derived from training data, expressing some relationship
between features and outcome. The probability distribution that
satisfies the above property is the one with the highest entropy.
Fig. 1. IOB (Inside, Outside, Beginning) Tagging example for a news article headline
> DS 8008 NATURAL LANGUAGE PROCESSING – NAMED ENTITY RECOGNITION FROM ONLINE NEWS (APRIL 2018) < 3
It is unique, agrees with the maximum-likelihood distribution,
and has the exponential form (Della Pietra et al., 1997):

$$p(o \mid h) = \frac{1}{Z(h)} \prod_{j} \alpha_j^{f_j(h, o)}$$

where o refers to the outcome, h the history (or context), and
Z(h) is a normalization function. The features used in the
maximum entropy framework are binary. An example of a
feature function is

$$f_j(h, o) = \begin{cases} 1 & \text{if the current word in } h \text{ is capitalized and } o = \text{B-PERSON} \\ 0 & \text{otherwise} \end{cases}$$
The parameters αj are estimated by a procedure called
MEGAM. MEGAM (MEGA Model Optimization Package) is
an OCaml-based maximum entropy optimization package that
originated at the University of Utah. MEGAM tends to perform
much better in terms of speed and resource consumption.
The maximum entropy classifier is used to classify each word
as one of the following: the beginning of a NE (B tag), a word
inside a NE (I tag), or a word outside a NE (O tag). During
testing, it is possible that the classifier produces a sequence of
inadmissible classes (e.g., O followed by I). To eliminate
such sequences, we define a transition probability between
word classes P(Ci|Cj) to be equal to 1 if the sequence is
admissible, and 0 otherwise. The probability of the classes
C1,...,Cn assigned to the words in a sentence s in a document
d is defined as follows:

$$P(C_1, \ldots, C_n \mid s, d) = \prod_{i=1}^{n} P(C_i \mid s, d) \cdot P(C_i \mid C_{i-1})$$

where P(Ci|s, d) is determined by the maximum entropy
classifier. The Viterbi algorithm is then used to select the
sequence of word classes with the highest probability.
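To make the admissibility constraint and the Viterbi decoding step concrete, here is a minimal sketch over a toy three-tag set; the per-word probabilities and the simplified tag set are illustrative, not the project's actual classifier output.

```python
import numpy as np

TAGS = ["O", "B", "I"]

def admissible(prev, curr):
    # An I tag may only follow B or I; every other transition is allowed.
    # (Start-of-sentence constraints are omitted for brevity.)
    return 0.0 if (curr == "I" and prev == "O") else 1.0

def viterbi(probs):
    """probs: array of shape (n_words, n_tags) with per-word class
    probabilities from the maximum entropy classifier.
    Returns the highest-probability admissible tag sequence."""
    n, k = probs.shape
    score = np.zeros((n, k))
    back = np.zeros((n, k), dtype=int)
    score[0] = probs[0]
    for i in range(1, n):
        for c in range(k):
            cand = [score[i - 1, p] * admissible(TAGS[p], TAGS[c]) * probs[i, c]
                    for p in range(k)]
            back[i, c] = int(np.argmax(cand))
            score[i, c] = cand[back[i, c]]
    # Trace the best path backwards from the final word.
    path = [int(np.argmax(score[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(back[i, path[-1]])
    return [TAGS[c] for c in reversed(path)]

# Toy example: three words with made-up classifier probabilities.
p = np.array([[0.2, 0.7, 0.1],   # likely B
              [0.3, 0.2, 0.5],   # likely I
              [0.8, 0.1, 0.1]])  # likely O
print(viterbi(p))  # ['B', 'I', 'O']
```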
2. Feature Representation
We used two types of features in our model: global and
local. Global features are generalized features that can be
used universally with any corpus, while local features are
context specific and depend on the training and testing corpus.
Global features: these features do not depend on the
domain or language; they are quite generic and can be applied
to any dataset.
The global features include:
• Bigrams: combination of the current word with its
previous and next word
• Trigrams: combination of the current word with its
next, next-to-next, previous, and previous-to-previous
word
• Bigrams of POS tags: combination of the POS tags of
the current word and its next and previous word
• Trigrams of POS tags: combination of the POS tags of
the current word and its next, next-to-next, previous, and
previous-to-previous word
• Previous IOB tag: IOB tag of the previous word
Local features: these features are contextual in nature to
our training data. They are specific and cannot be generalized
to other datasets.
• Lemmatization: the lemmatized word; the lemmatizer
we used is specific to the English language
• Capitalized, PrevCapitalized, NextCapitalized: we
tagged words whose first letter is capital, which is very
important in English, where it identifies entities
such as geopolitical entities
• isNumeric: this tag is also specific to the representation
of entities like money, dates, etc.; it might not work
with other datasets if they use word representations
for these entities
• Tags-since-DT: this tag is also language specific;
other languages might not have the concept of
determiners.
A sketch of how these features combine into a feature
dictionary is shown below.
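As a rough sketch of how such features can be assembled into the dictionary that NLTK classifiers expect, the function below combines a few of the global and local features listed above; the feature names and the exact neighbour handling are illustrative.

```python
from nltk.stem import WordNetLemmatizer

# Requires the WordNet data: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def word_features(sent, i, prev_iob):
    """Build the feature dict for the i-th (word, POS) pair in a sentence.
    Global features use neighbouring words/POS tags; local features use
    English-specific cues such as capitalization."""
    word, pos = sent[i]
    prev_word, prev_pos = sent[i - 1] if i > 0 else ("<START>", "<START>")
    next_word, next_pos = sent[i + 1] if i < len(sent) - 1 else ("<END>", "<END>")
    return {
        # Global features
        "bigram-prev": prev_word + " " + word,
        "bigram-next": word + " " + next_word,
        "pos-bigram": prev_pos + " " + pos,
        "pos-trigram": prev_pos + " " + pos + " " + next_pos,
        "prev-iob": prev_iob,
        # Local (English-specific) features
        "lemma": lemmatizer.lemmatize(word.lower()),
        "capitalized": word[:1].isupper(),
        "prev-capitalized": prev_word[:1].isupper(),
        "next-capitalized": next_word[:1].isupper(),
        "is-numeric": word.isdigit(),
    }
```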
3. Implementation
We read the pre-tagged data through the CoNLL corpus
reader in the format (word, POS tag, IOB tag), delimited by
sentence. We had the following IOB tags as part of our corpus
("O", "I", "B", "B-PERSON", "I-PERSON", "B-GPE", "I-
GPE", "B-DATE", "I-DATE", "B-ORGANIZATION", "I-
ORGANIZATION"). We split the data into a 70/30 ratio using
the scikit-learn library, and prepared the feature list based on
the global and local features. We defined a maxent classifier
from the NLTK library with the MEGAM procedure, which is
good for speed and resource usage. We gave the 70% split and
the feature list as input for training the max-ent classifier. Once
training finished, we tested it with the remaining 30% of the
data and obtained 93.8% accuracy.
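The following sketch outlines this pipeline: reading the gold corpus with NLTK's built-in CoNLL reader, splitting 70/30 with scikit-learn, and training an NLTK maxent classifier with the MEGAM algorithm. The file paths are hypothetical, and MEGAM is an external binary that must be installed and registered separately.

```python
import nltk
from nltk.corpus.reader import ConllCorpusReader
from nltk.classify import MaxentClassifier
from sklearn.model_selection import train_test_split

# Hypothetical corpus location; columns are word, POS tag, IOB chunk tag.
reader = ConllCorpusReader("data/", "gold_corpus.conll",
                           columntypes=("words", "pos", "chunk"))

def word_features(tokens, i, prev_iob):
    # Minimal stand-in; the fuller feature dictionary is sketched above.
    word, pos = tokens[i]
    return {"word": word.lower(), "pos": pos, "prev-iob": prev_iob,
            "capitalized": word[:1].isupper()}

# Build (features, label) pairs, one per word.
samples = []
for sent in reader.iob_sents():
    prev_iob = "<START>"
    tokens = [(w, p) for (w, p, _) in sent]
    for i, (_, _, iob) in enumerate(sent):
        samples.append((word_features(tokens, i, prev_iob), iob))
        prev_iob = iob

train, test = train_test_split(samples, test_size=0.3, random_state=42)

# Requires the external megam binary; register it first, e.g.:
#   nltk.config_megam("/usr/local/bin/megam")
classifier = MaxentClassifier.train(train, algorithm="megam")
print("accuracy:", nltk.classify.accuracy(classifier, test))
```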
Fig. 3. High level design for Maximum Entropy model workflow.
4. Challenges
The first and biggest challenge was understanding the data
format requirement: we had to first understand the IOB
format and come up with the proper format required by the
classifier. We went through multiple iterations of data
formatting and cleaning across the entire workflow,
beginning with the corpus reader and ending with
classifier testing.
We also had challenges with training and testing after
incrementally adding new features to the feature list,
because of the computation power required. Execution time
kept increasing with each additional feature. After reaching
the current list, it was very difficult to add and test further
features, so we decided to stop at 93.8%.
5. Future Enhancements
Although the Max-Ent classifier has inherently low bias, we
could improve it further with k-fold cross-validation.
We could also use a high-performance computing machine
or distributed computing to train and test on an expanded
feature list.
D. Deep Neural Network model with LSTM and GloVe
pre-trained embeddings
1. Introduction
A less traditional and more recent approach to Named Entity
Recognition makes use of Deep Neural Networks. The
approach relies on the good performance of DNNs on
classification tasks: in essence, a NER problem can be
interpreted as a classification problem where every word
belongs to a certain class (the IOB tag for the word). This
simple analogy over-simplifies the need for the word's context
and the fact that some names or entities span more than one
word, so some additional details have to be taken into account
for the network to recognize this.
Another tool that we can leverage in solving this NLP
problem with neural networks is embeddings. These are
vector translations of the words, as opposed to the traditional
one-hot encoding over the complete vocabulary analyzed. In a
sense, embeddings are a dimensionality reduction of the
vocabulary, so they also help reduce the number of input
dimensions (and computations) required for classification.
We can build these embedding vectors from our corpus
vocabulary, or use pre-trained embeddings. Pre-trained
embeddings are usually built on large corpora of documents
(Wikipedia, web scrapes, Twitter corpora) and would require
a long time and substantial computation to calculate, so it is
not uncommon to reuse them in other projects.
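As a minimal sketch, pre-trained GloVe vectors can be loaded into a simple word-to-vector dictionary as shown below; the file name follows the published Wikipedia+Gigaword ("glove.6B") distribution used in this project.

```python
import numpy as np

def load_glove(path, dim):
    """Load GloVe vectors from a text file into a word -> vector dict.
    Each line has the form: word v1 v2 ... v_dim."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vec = np.asarray(parts[1:], dtype="float32")
            assert vec.shape == (dim,)  # sanity-check the dimension
            embeddings[parts[0]] = vec
    return embeddings

# The Wikipedia+Gigaword distribution ships 50d, 100d, 200d and 300d files.
glove = load_glove("glove.6B.50d.txt", dim=50)
print(glove["news"].shape)  # (50,)
```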
2. Implementation
Fig. 4. Deep Neural Network structure: input unigram, translation into embedding vector, input layer, hidden layers, one-hot encoded output layer and
translation back to IOB tag.
In our implementation, we used Keras to simplify the
construction of the network, with Tensorflow as the tensor
calculation backend. The network has multiple layers (input,
several hidden, and output). As each word embedding has 50
dimensions, the input layer has 50 units. The output layer has
as many units as there are IOB tags in our training dataset, in
our case 14 ("O", "I", "B", "B-PERSON", "I-PERSON",
"B-GPE", "I-GPE", "B-DATE", "I-DATE",
"B-ORGANIZATION", "I-ORGANIZATION").
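A minimal Keras sketch of such a network is shown below, assuming the LSTM architecture named in this section's title: a 50-dimensional GloVe embedding layer feeding an LSTM, with a 14-way softmax output per word. The vocabulary size, padded sentence length, and hidden-layer width are illustrative assumptions, not figures from the project.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, TimeDistributed, Dense
from tensorflow.keras.initializers import Constant

VOCAB_SIZE = 20000   # illustrative vocabulary size
EMBED_DIM = 50       # GloVe vector dimension (input units, as in the text)
MAX_LEN = 40         # illustrative sentence length after padding
N_TAGS = 14          # number of IOB tags in the training set

# embedding_matrix: rows are GloVe vectors for each vocabulary word,
# built from the dict returned by load_glove() above (zeros here as a stub).
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))

model = Sequential([
    # Frozen embedding layer initialized with the pre-trained vectors.
    Embedding(VOCAB_SIZE, EMBED_DIM,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False),
    LSTM(100, return_sequences=True),          # hidden width is illustrative
    # One-hot (softmax) output per word, matching Fig. 4.
    TimeDistributed(Dense(N_TAGS, activation="softmax")),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.build(input_shape=(None, MAX_LEN))
model.summary()
# Training mirrors the experiments in Section IV, e.g.:
# model.fit(X_train, y_train, epochs=6, validation_split=0.1)
```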
The pre-trained embedding set used for our project is GloVe
[3] (Global Vectors), created at Stanford University. According
to the project website: "GloVe is an unsupervised learning
algorithm for obtaining vector representations for words.
Training is performed on aggregated global word-word co-
occurrence statistics from a corpus, and the resulting
representations showcase interesting linear substructures of the
word vector space."
GloVe provides various sets of word vectors, pre-trained on
different datasets (Wikipedia, Common Crawl, Twitter) and
with different vector dimension sizes. Based on the nature of
our data domain (news articles), we selected the Wikipedia
embeddings, as they were also trained on the Gigaword dataset
(an archive of newswire text data). As this set is provided in
multiple vector dimensions (50d, 100d, 200d, 300d), we were
able to experiment with dimensionality as another factor to
vary and evaluate for its impact on accuracy.
3. Challenges
As expected, most of the challenges appeared in the data
transformation phase: adapting the data into the shape and
format to be fed into the network. Loading the training/test
dataset and manipulating it using the CoNLL2002 format
reader alleviated some of that (obtaining just words, or words
and IOB tags, skipping POS tags), so it would be a good
practice to keep all data in this format, as it is the easiest to
manipulate using this NLTK helper object.
Another challenge (also associated with data manipulation)
was the word embeddings: loading the complete set takes
some time due to its volume, which also depends on the
number of dimensions selected (i.e., GloVe has sets with
50, 100, 200 and 300 dimensions).
Features         Accuracy   Run duration
Global           88%        30 min
Local + Global   93.8%      3 hr
Table II. Max Entropy results and training times for different
feature sets
IV. RESULTS
A. Max Entropy Model
For model evaluation, we split the dataset into 70% training
and 30% testing. We ran the model multiple times, with global
and local features. Accuracy increased with the number of
features, but it plateaus after a certain threshold, and it becomes
very time- and resource-consuming to run with an increasing
number of features.
B. Deep Neural Network model
For model evaluation, we split the dataset into 70% training
and 30% testing, the same as for the Max Entropy model.
The model was run multiple times, using different sizes of
GloVe embedding vectors: 50, 100, 200 and 300 dimensions.
As expected (and as presented in the GloVe paper),
dimensionality does have some impact on accuracy, but only
up to a certain point, after which it becomes detrimental. In the
case of our dataset, there were no changes (increase or
decrease) in the model accuracy across dimensions.
Fig. 5. Improvement in accuracy with number of training epochs; plateaus
at 6 epochs.
For all the vector dimensions, we ran the model for 1 to 10
epochs: results improved up until 6 epochs, and then plateaued
at 93.5% accuracy.
V. CONCLUSIONS
Several conclusions were extracted from this project:
Dataset: obtaining a gold dataset for Named Entity
Recognition is not a simple feat. We ran into a multitude of
challenges (IOB format, manual vs. assisted tagging, data
sanitization), and there is always a medium-to-high degree of
manual effort required that should not be minimized.
Accuracy: performance for Maximum Entropy (93.8%) and
the Deep Neural Network (93.5%) is comparable; there is no
significant difference. We do not think this conclusion can be
generalized (as the literature still generally reports higher
accuracy for Deep Neural Network models than for traditional
Maximum Entropy), and it may be attributable to the dataset used.
Limitations: the performance obtained by the Maximum
Entropy model is limited by the number of features. There is no
way to draw a linear relationship between the number of
features and model performance: each feature (and each
combination of features) contributes independently to the
model's performance. On the other hand, the performance
obtained by the Deep Neural Network depends on the
embeddings used: pre-trained vs. custom-built embeddings, as
well as the number of dimensions, have an impact on the
accuracy obtained.
Domain knowledge: the Maximum Entropy model is
highly dependent on domain knowledge of the corpus
language, as its features are directly related to the language's
grammar rules (e.g., proper nouns in English are capitalized).
The Deep Neural Network model is completely independent
of language grammar rules, so the process of creating a model
can be reproduced independently of language, as it depends only
on the stochastic properties of the corpora used for the
embeddings and the IOB-tagged dataset.
Computing resources: Maximum Entropy (as is common
with more traditional machine learning models) uses fewer
computing resources than the Deep Neural Network, which
required more processing capacity to train in reasonable time.
The final conclusion is that, as the Deep Learning model is
less dependent on specific language grammar rules, it is more
generalizable (given that embeddings and some labeled corpora
are provided in the target language), whereas the Maximum
Entropy model will perform poorly on a language for which
there is no domain knowledge to create the required features.
VI. TECHNICAL TOOLS
1. Python Programming Language on Jupyter Notebooks:
All programming was done using the Python
programming language on Jupyter notebook
environments.
2. NLTK library for Natural Language Processing: One of
the most popular natural language processing packages
in Python, it provides functionality for reading data
formatted in the CoNLL2002 format.
3. SpaCy for assisted tagging. Raw data was pre-tagged
using spaCy and then manually reviewed.
4. Keras and Tensorflow for Deep Neural Networks. Keras
simplified the job of creating the network; Tensorflow
provided the backend for mathematical processing.
5. Google Cloud Datalab: For Deep Neural Network
processing, some limitations in our local environments
were resolved by using a Jupyter Notebook cloud
managed environment.
VII. REFERENCES, RESOURCES AND LITERATURE REVIEW
[1] Ridong Jiang, Rafael E. Banchs, Haizhou Li,
"Evaluating and Combining Named Entity Recognition
Systems."
http://www.aclweb.org/anthology/W16-2703
[2] Jason P.C. Chiu, Eric Nichols, "Named Entity
Recognition with Bidirectional LSTM-CNNs." University of
British Columbia, Honda Research Institute Japan Co., Ltd.
https://www.aclweb.org/anthology/Q16-1026
[3] Jeffrey Pennington, Richard Socher, Christopher D.
Manning, "GloVe: Global Vectors for Word Representation."
Computer Science Department, Stanford University.
https://www.aclweb.org/anthology/D14-1162
[4] spaCy, open-source software library for advanced
natural language processing. Accessed April 3, 2018.
https://spacy.io/usage/linguistic-features
[5] Ryerson University Library – Newspapers and
Journals Library / Online edition. Accessed April 22, 2018.
http://learn.library.ryerson.ca/newspapers
Fady Fadel (Masters of Data Science and Analytics, Ryerson
University 2018) was born in Basrah, Iraq in 1980. He received
a Bachelor of Engineering in Information Technology from
McMaster University in Hamilton, Ontario, Canada in 2009,
and is a candidate for the Masters of Data Science and Analytics
at Ryerson University in Toronto, Ontario, Canada.
He is currently a Senior Data Analyst in the Infrastructure
Capacity Engineering team at Cogeco Connexion, a Canadian
telecommunications company serving residential video,
high-speed Internet and telephony customers, along with data
and voice transmission services and cloud-based applications
for businesses.
Akshat Kashyap (Masters of Data Science and Analytics,
Ryerson University 2018) was born in Ujjain, India in 1987. He
received a Bachelor of Engineering in Information Technology
from Rajiv Gandhi Technical University in Indore, India in
2009, and is a candidate for the Masters of Data Science and
Analytics at Ryerson University in Toronto, Canada.
He is currently a Senior Middleware Specialist in the Ministry
of Government and Consumer Services, Ontario, which
delivers vital programs, services, and products to Ontario
citizens, ranging from health cards, driver's licences, and birth
certificates to consumer protection and public safety.
Bernardo Najlis (Masters of Data Science and Analytics,
Ryerson University 2018) was born in Buenos Aires, Argentina
in 1977. He received a BS in Systems Analysis from
Universidad CAECE in Buenos Aires, Argentina in 2007, and
is a candidate for the Masters of Data Science and Analytics at
Ryerson University in Toronto, Canada.
He is currently a Solution Principal in the Information
Management & Analytics team at Slalom, a worldwide
consulting firm, working on Cloud Big Data and Advanced
Analytics projects.
More Related Content

What's hot

A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbAlexander Decker
 
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...IRJET Journal
 
A Linked Data Prototype for the Union Catalog of Digital Archives Taiwan
A Linked Data Prototype for the Union Catalog of Digital Archives TaiwanA Linked Data Prototype for the Union Catalog of Digital Archives Taiwan
A Linked Data Prototype for the Union Catalog of Digital Archives Taiwanandrea huang
 
20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogsandrea huang
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Andre Freitas
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalMauro Dragoni
 
Omitola birmingham cityuniv
Omitola birmingham cityunivOmitola birmingham cityuniv
Omitola birmingham cityunivTope Omitola
 
Knowledge Discovery in Remote Access Databases
Knowledge Discovery in Remote Access Databases Knowledge Discovery in Remote Access Databases
Knowledge Discovery in Remote Access Databases Zakaria Zubi
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Ontotext
 
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASESEFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASESIJCSEIT Journal
 
A unified approach for spatial data query
A unified approach for spatial data queryA unified approach for spatial data query
A unified approach for spatial data queryIJDKP
 
LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013Luis Daniel Ibáñez
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrievalrchbeir
 
Inference on the Semantic Web
Inference on the Semantic WebInference on the Semantic Web
Inference on the Semantic WebMyungjin Lee
 
Knowledge graphs on the Web
Knowledge graphs on the WebKnowledge graphs on the Web
Knowledge graphs on the WebArmin Haller
 
Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data csandit
 

What's hot (19)

A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
 
A Linked Data Prototype for the Union Catalog of Digital Archives Taiwan
A Linked Data Prototype for the Union Catalog of Digital Archives TaiwanA Linked Data Prototype for the Union Catalog of Digital Archives Taiwan
A Linked Data Prototype for the Union Catalog of Digital Archives Taiwan
 
20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)
 
disertation
disertationdisertation
disertation
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
 
Omitola birmingham cityuniv
Omitola birmingham cityunivOmitola birmingham cityuniv
Omitola birmingham cityuniv
 
Web search engines
Web search enginesWeb search engines
Web search engines
 
Knowledge Discovery in Remote Access Databases
Knowledge Discovery in Remote Access Databases Knowledge Discovery in Remote Access Databases
Knowledge Discovery in Remote Access Databases
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
 
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASESEFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
 
A unified approach for spatial data query
A unified approach for spatial data queryA unified approach for spatial data query
A unified approach for spatial data query
 
LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
 
Inference on the Semantic Web
Inference on the Semantic WebInference on the Semantic Web
Inference on the Semantic Web
 
Knowledge graphs on the Web
Knowledge graphs on the WebKnowledge graphs on the Web
Knowledge graphs on the Web
 
Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data
 

Similar to Named Entity Recognition from Online News

Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory acijjournal
 
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
 An Investigation of Keywords Extraction from Textual Documents using Word2Ve... An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...IJCSIS Research Publications
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT ecij
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT ecij
 
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALijaia
 
Requirementv4
Requirementv4Requirementv4
Requirementv4stat
 
SURE Research Report
SURE Research ReportSURE Research Report
SURE Research ReportAlex Sumner
 
A Review on Text Mining in Data Mining
A Review on Text Mining in Data Mining  A Review on Text Mining in Data Mining
A Review on Text Mining in Data Mining ijsc
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningIOSR Journals
 
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSINGRAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSINGijaia
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsDerek Kane
 
A Survey of Ontology-based Information Extraction for Social Media Content An...
A Survey of Ontology-based Information Extraction for Social Media Content An...A Survey of Ontology-based Information Extraction for Social Media Content An...
A Survey of Ontology-based Information Extraction for Social Media Content An...ijcnes
 
A Review on Text Mining in Data Mining
A Review on Text Mining in Data MiningA Review on Text Mining in Data Mining
A Review on Text Mining in Data Miningijsc
 
Topic Mining on disaster data (Robert Monné)
Topic Mining on disaster data (Robert Monné)Topic Mining on disaster data (Robert Monné)
Topic Mining on disaster data (Robert Monné)Robert Monné
 
NLP_A Chat-Bot_answering_queries_of_UT-Dallas_Students
NLP_A Chat-Bot_answering_queries_of_UT-Dallas_StudentsNLP_A Chat-Bot_answering_queries_of_UT-Dallas_Students
NLP_A Chat-Bot_answering_queries_of_UT-Dallas_StudentsHimanshu kandwal
 
Temporal Information Processing: A Survey
Temporal Information Processing: A SurveyTemporal Information Processing: A Survey
Temporal Information Processing: A Surveykevig
 
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU Love Arora
 
Comp_Ling_Resume_Generator.pdf
Comp_Ling_Resume_Generator.pdfComp_Ling_Resume_Generator.pdf
Comp_Ling_Resume_Generator.pdfHanaBaSabaa
 

Similar to Named Entity Recognition from Online News (20)

Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory
 
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
 An Investigation of Keywords Extraction from Textual Documents using Word2Ve... An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
 
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
 
Requirementv4
Requirementv4Requirementv4
Requirementv4
 
SURE Research Report
SURE Research ReportSURE Research Report
SURE Research Report
 
A Review on Text Mining in Data Mining
A Review on Text Mining in Data Mining  A Review on Text Mining in Data Mining
A Review on Text Mining in Data Mining
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
 
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSINGRAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
 
The Value and Benefits of Data-to-Text Technologies
The Value and Benefits of Data-to-Text TechnologiesThe Value and Benefits of Data-to-Text Technologies
The Value and Benefits of Data-to-Text Technologies
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
A Survey of Ontology-based Information Extraction for Social Media Content An...
A Survey of Ontology-based Information Extraction for Social Media Content An...A Survey of Ontology-based Information Extraction for Social Media Content An...
A Survey of Ontology-based Information Extraction for Social Media Content An...
 
A Review on Text Mining in Data Mining
A Review on Text Mining in Data MiningA Review on Text Mining in Data Mining
A Review on Text Mining in Data Mining
 
Topic Mining on disaster data (Robert Monné)
Topic Mining on disaster data (Robert Monné)Topic Mining on disaster data (Robert Monné)
Topic Mining on disaster data (Robert Monné)
 
NLP_A Chat-Bot_answering_queries_of_UT-Dallas_Students
NLP_A Chat-Bot_answering_queries_of_UT-Dallas_StudentsNLP_A Chat-Bot_answering_queries_of_UT-Dallas_Students
NLP_A Chat-Bot_answering_queries_of_UT-Dallas_Students
 
Temporal Information Processing: A Survey
Temporal Information Processing: A SurveyTemporal Information Processing: A Survey
Temporal Information Processing: A Survey
 
G0361034038
G0361034038G0361034038
G0361034038
 
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
 
Comp_Ling_Resume_Generator.pdf
Comp_Ling_Resume_Generator.pdfComp_Ling_Resume_Generator.pdf
Comp_Ling_Resume_Generator.pdf
 

More from Bernardo Najlis

Toastmasters speech #7 - Research your Subject
Toastmasters speech #7  - Research your SubjectToastmasters speech #7  - Research your Subject
Toastmasters speech #7 - Research your SubjectBernardo Najlis
 
Toastmasters project #5 - Just a jump
Toastmasters project #5  - Just a jumpToastmasters project #5  - Just a jump
Toastmasters project #5 - Just a jumpBernardo Najlis
 
Business Intelligence Presentation - Data Mining (2/2)
Business Intelligence Presentation - Data Mining (2/2)Business Intelligence Presentation - Data Mining (2/2)
Business Intelligence Presentation - Data Mining (2/2)Bernardo Najlis
 
Business Intelligence Presentation (1/2)
Business Intelligence Presentation (1/2)Business Intelligence Presentation (1/2)
Business Intelligence Presentation (1/2)Bernardo Najlis
 

More from Bernardo Najlis (8)

#FluxFlow
#FluxFlow#FluxFlow
#FluxFlow
 
Introduction to knime
Introduction to knimeIntroduction to knime
Introduction to knime
 
Toastmasters speech #7 - Research your Subject
Toastmasters speech #7  - Research your SubjectToastmasters speech #7  - Research your Subject
Toastmasters speech #7 - Research your Subject
 
Toastmasters project #5 - Just a jump
Toastmasters project #5  - Just a jumpToastmasters project #5  - Just a jump
Toastmasters project #5 - Just a jump
 
What is lomography?
What is lomography?What is lomography?
What is lomography?
 
Plethora
PlethoraPlethora
Plethora
 
Business Intelligence Presentation - Data Mining (2/2)
Business Intelligence Presentation - Data Mining (2/2)Business Intelligence Presentation - Data Mining (2/2)
Business Intelligence Presentation - Data Mining (2/2)
 
Business Intelligence Presentation (1/2)
Business Intelligence Presentation (1/2)Business Intelligence Presentation (1/2)
Business Intelligence Presentation (1/2)
 

Recently uploaded

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 

Recently uploaded (20)

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 

Named Entity Recognition from Online News

  • 1. > DS 8008 NATURAL LANGUAGE PROCESSING – NAMED ENTITY RECOGNITION FROM ONLINE NEWS (APRIL 2018) < 1  Abstract—This project aimed to create a series of models for the extraction of Named Entities (People, Locations, Organizations, Dates) from news headlines obtained online. We created two models: a traditional Natural Processing Language Model using Maximum Entropy , and a Deep Neural Network Model using pre- trained word embeddings. Accuracy results of both models show similar performance, but the requirements and limitations of both models are different and can help determine what type of model is best suited for each specific use case. This project was completed as part of the DS8008 Natural Language Processing Course at the Masters in Data Science Program at Ryerson University in Toronto, Ontario during the months of January through April 2018. All code is available online at https://github.com/bnajlis/named_entity_recognition Index Terms— Named Entity Recognition, Online News, Natural Language Processing, Maximum Entropy, Deep Neural Networks, Long-Short Term Memory, LSTM I. CASE DESCRIPTION AND PROBLEM PRESENTATION AMED ENTITY RECOGNITION (NER) is one of the most valuable tasks in natural language processing (NLP). It is widely used in information retrieval systems, machine translation, question answering systems, and summarization tools. It is also one of the most popular methodologies to extract information from unstructured data (emails, blogs, documents, news articles, etc.). NER systems have various needs and depending on their requirements can produce different type of entities, the most common types are: • person, • location, • organization, • date and • money The problem with named entities is that the names are part of an open class where new words can be added to the class as they often as new ones get created. The new entities have to be added into large dictionary called gazetteer that contains various entities. The gazetteer can is difficult to maintain with the growth of innovation, products development, new explorations and as new acronyms become popular. A good example of the above is the area of Cryptocurrency, where various coins and tokens such as Bitcoin have been widely adopted and used in major news topics in the most recent months. The second problem with entities is the surprising ambiguity between various types for example Ford could refer to a person or the company Ford, Jordan can refer to a person or the location of the country Jordan, April could either be a month or a person and the same goes for Ryerson that can refer to either a person or a university organization. II. RELATED WORK Research has been focused on improving the accuracy of named entity recognition with the use of machine learning specifically in natural language processing. Among the most common models used, there are Decision Tree Model, Naive Bayes Classifier, Maximum Entropy Model [1], Hidden Markov Model [2]. Newer research has also focused on the use of Deep Neural Network models to improve performance of Named Entity Recognition. Usage of Long-Short Term Memory cells and word embeddings is common in this type of models. III. METHODOLOGY Our approach is to apply an NER system using dataset for training in the domain of online news. A. News Tagged Dataset Pre-tagged dataset for specific domain are nearly non existent, if such exists there is no assurance of quality. 
In order to train our model in the news domain and ensure that we obtain sample of recent named entities to train and evaluate our models, we opted to produce the gold corpus by manually extracting online news articles and pre-processing the data to obtain the Gold Corpus. DS8008 Natural Language Processing Project Named Entity Recognition from Online News (April 2018) Fadel, Fady – Kashyap, Akshat - Najlis, Bernardo Masters in Data Science, Ryerson University, Toronto, Canada N
  • 2. > DS 8008 NATURAL LANGUAGE PROCESSING – NAMED ENTITY RECOGNITION FROM ONLINE NEWS (APRIL 2018) < 2 The data was obtained from the Ryerson University Library & Archived newspaper database (RULA) [RULA]. The data consists of 2,116 news articles from The Globe and Mail, a Canadian newspaper for the periods of February and March 2018. B. Obtaining the Gold Corpus We proceeded with pre-processing the data from its raw form where each article was delimited by multiple underscore. Then each individual article was cleaned from unnecessary fields that were either manual entered by the in the news field in error or simply did not belong to the news article eg. author, credits, keywords. The cleaned news articles were then saved each with an and individual for easy referencing and accessibility for later validation. Field Name Description Unigram Word in the POS tag Value for the DJIA at market open, in points IOB tag Highest value for the day, in points Table 1. Schema for the Gold Corpus dataset created Fig. 2.a Sample of raw data obtained from the Rula system Fig. 2.b High level workflow to obtain the Gold Corpus used in the project and sample of IOB tagged data The data was obtained in the form of raw unstructured text file[Fig. 2.a], we then did extensive cleaning of the raw data and found to produce errors even with restricted validation. After further analysis, we discovered that it did not only contain the main new content but there were instances where the it contained fields that were manually imputed in error by the news writer in the incorrect field. Once the previous errors where validated, we split the content into individual files in order to reference and being able validate in future step. Next, we used assisted tagging method where we leaned on the use of an open source natural language application called SpaCy[4]. After further obtaining the required format we validated the tagged dataset manually to ensure accuracy of our dataset is free of errors. The final dataset was then saved using CoNLL format which consisted of tab separated (word, POS tag, IOB tag), to ease the data model ingestion in the later steps where we utilized the built-in nltk CoNLL corpus reader to feed the data to each model. C. Maximum Entropy Model 1. Introduction The maximum entropy framework estimates probabilities based on the principle of making as few assumptions as possible, other than the constraints imposed. Such constraints are derived from training data, expressing some relationship between features and outcome. The probability distribution that satisfies the above property is the one with the highest entropy. Fig. 1. IOB (Inside, Outside, Beginning) Tagging example for a news article headline
  • 3. > DS 8008 NATURAL LANGUAGE PROCESSING – NAMED ENTITY RECOGNITION FROM ONLINE NEWS (APRIL 2018) < 3 It is unique, agrees with the maximum-likelihood distribution, and has the exponential form (Della Pietra et al., 1997). where refers to the outcome, h the history (or context), and Z(h) is a normalization function. The features used in the maximum entropy framework are binary. An example of a feature function is The parameters αj are estimated by a procedure called MEGAM. MEGAM (MEGA Model Optimization Package) is an OCaml based Maximum Entropy project that originated from Utah university. MEGAM tends to perform much better in terms of speed and resource consum. The maximum entropy classifier is used to classify each word as one of the following: the beginning of a NE (B tag), a word inside a NE (I tag), NE delimiter word (O tag). During testing, it is possible that the classifier produces a sequence of inadmissible classes (e.g., O followed by I). To eliminate such sequences, we define a transition probability between word classes P(Ci|Cj) to be equal to 1 if the sequence is admissible, and 0 otherwise. The probability of the classes C1,...,Cn assigned to the words in a sentence 4 in a document 5 is defined as follows: where is determined by the maximum entropy classifier. The Viterbi algorithm is then used to select the sequence of word classes with the highest probability 2. Feature Representation We have used 2 types of features in our model, Global and Local. Global features are generalized features, they can be used universally with any corpus while local features are context specific, they depend on training and testing corpus. Global features - These features do not depend on the domain, language, they are quite generic and can be applied in any dataset. The global features include: • Bigrams: combination of current word with it’s previous, next word • Trigrams: combination of current word and it’s next, next to next, previous and previous to previous word • Bigrams of POS tags: combination of POS tags of current word and it’s next, previous word • Trigrams of POS tags: combination of POS tags of current word and it’s next, next to next, previous and previous to previous word • Previous IOB tag: IOB tag of previous word Local Features: These Features are contextual in nature to our training data. They are specific and can’t be generalized to other datasets. • Lemmatization: lemmatized word, lemmatizer that we used is specific to English language • Capitalized, PrevCapitalized, NextCapitalized: we tagged words with first letter capital, it’s very important in english language, which identify entities like geopolitical entities etc. • isNumeric: this tag is also specific to representation of entities like money, date etc. it might not work with other datasets, if they use word representations for these entities. • Tags-since-DT: this tag is also specific to language; other languages might not have concept of determiners. 3. Implementation We read the pre-tagged data from CoNLL Corpus Reader in the format (word, POS tag, IOB tag), which was sentence delimited. We had following IOB tags as part of our corpus ("O", "I", "B", "B-PERSON", "I-PERSON", "B-GPE", "I- GPE", “B-DATE”, “I-DATE”, “B-ORGANIZATION”, “I- ORGANIZATION”). We split the data using Sklearn library in 70/30 ratio. We prepared the feature list based on global and local features. We defined maxent classifier from NLTK library with MEGAM procedure. MEGAM is good for speed and resources. 
We gave 70% of the data, together with the feature list, as input to train the max-ent classifier. Once training finished, we tested it with the remaining 30% of the data and obtained 93.8% accuracy.
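A hedged sketch of this training and evaluation step, building on the reader and feature extractor sketched earlier, is shown below; variable names are illustrative, and the external megam binary must be installed and on the path for NLTK to find it.

    import nltk
    from sklearn.model_selection import train_test_split

    def to_instances(tagged_sents):
        """Turn (word, POS tag, IOB tag) sentences into (features, IOB tag) pairs."""
        instances = []
        for sent in tagged_sents:
            words = [(w, p) for (w, p, _) in sent]
            prev_iob = 'O'
            for i, (_, _, iob) in enumerate(sent):
                instances.append((word_features(words, i, prev_iob), iob))
                prev_iob = iob
        return instances

    data = to_instances(reader.iob_sents())
    train_set, test_set = train_test_split(data, train_size=0.7, random_state=42)

    nltk.config_megam()  # locate the external megam binary
    classifier = nltk.classify.MaxentClassifier.train(train_set, algorithm='megam')
    print('accuracy:', nltk.classify.accuracy(classifier, test_set))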
Fig. 3. High-level design of the Maximum Entropy model workflow

4. Challenges

The first and biggest challenge was understanding the data format requirements: we had to learn the IOB format and arrive at the exact format the classifier required. This took multiple iterations of data formatting and cleaning across the entire workflow, from the corpus reader through to classifier testing. We also had challenges with training and testing after incrementally adding new features to the feature list, because of the computational power required: execution time kept increasing with each additional feature. After reaching the current feature list it became very difficult to add and test further features, so we decided to stop at 93.8%.

5. Future Enhancements

Although the max-ent classifier is inherently low-bias, we could improve it further with k-fold cross-validation. We could also use a high-performance machine or distributed computing to train and test an expanded feature list.

D. Deep Neural Network Model with LSTM and GloVe Pre-trained Embeddings

1. Introduction

A less traditional and more recent approach to Named Entity Recognition makes use of Deep Neural Networks (DNNs). The approach relies on the good performance of DNNs on classification tasks: in essence, a NER problem can be interpreted as a classification problem in which every word belongs to a certain class (the IOB tag for the word). This simple analogy over-simplifies the need for the word's context and the fact that some named entities span more than one word, so additional details have to be taken into account for the network to recognize this.

Another tool we can leverage in solving this NLP problem with neural networks is embeddings. These are vector representations of words, used instead of the traditional one-hot encoding over the complete vocabulary analyzed. In a sense, embeddings are a dimensionality reduction of the vocabulary, so they also help reduce the number of input dimensions (and computations) required for the classification. We can build these embedding vectors from our own corpus vocabulary, or use pre-trained embeddings. Pre-trained embeddings are usually built on large corpora of documents (Wikipedia, web crawls, Twitter corpora) and would require a long time and significant computation to calculate, so it is not uncommon to reuse them across projects.
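As a hedged illustration, the GloVe vectors used later in this section can be loaded into a plain Python dictionary as follows; the file name matches the standard GloVe 6B (Wikipedia + Gigaword) distribution, and the local path is an assumption.

    import numpy as np

    # Map each vocabulary word to its 50-dimensional GloVe vector.
    embeddings = {}
    with open('glove.6B.50d.txt', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')

    print(len(embeddings), embeddings['news'].shape)  # ~400,000 entries of shape (50,)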
2. Implementation

Fig. 4. Deep Neural Network structure: input unigram, translation into an embedding vector, input layer, hidden layers, one-hot encoded output layer, and translation back to an IOB tag

In our implementation, we used Keras to simplify the construction of the network, with TensorFlow as the tensor calculation backend. The network has multiple layers (input, several hidden, and output). As each word is represented by a 50-dimensional embedding vector, the input layer has 50 units. The output layer has one unit per IOB tag in our training dataset ("O", "I", "B", "B-PERSON", "I-PERSON", "B-GPE", "I-GPE", "B-DATE", "I-DATE", "B-ORGANIZATION", "I-ORGANIZATION").

The pre-trained embedding set used for our project is GloVe [3] (Global Vectors), created at Stanford University. According to the project website: "GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space."

GloVe provides several sets of word vectors, pre-trained on different datasets (Wikipedia, Common Crawl, Twitter) and with different vector dimension sizes. Given our data domain (news articles), we selected the Wikipedia embeddings, which are also trained on the Gigaword dataset (an archive of newswire text). Because this set is provided in multiple vector dimensions (50d, 100d, 200d, 300d), we were able to treat dimensionality as another factor to vary and evaluate for its impact on accuracy. A sketch of a network of this general shape follows.
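The text above does not fix the exact layer configuration, so the following is only a sketch of a network of the shape described, with an LSTM hidden layer (per the section title) over a padded sequence of GloVe vectors and a softmax over the IOB tag set; the sequence length and hidden layer size are assumptions.

    from tensorflow import keras

    TAGS = ["O", "I", "B", "B-PERSON", "I-PERSON", "B-GPE", "I-GPE",
            "B-DATE", "I-DATE", "B-ORGANIZATION", "I-ORGANIZATION"]
    MAX_LEN = 40   # hypothetical padded sentence length
    EMB_DIM = 50   # GloVe vector dimensionality

    model = keras.Sequential([
        # Each sentence enters as a padded sequence of pre-trained GloVe vectors.
        keras.layers.LSTM(64, return_sequences=True, input_shape=(MAX_LEN, EMB_DIM)),
        # One-hot style softmax output over the IOB tag set, at every word position.
        keras.layers.TimeDistributed(keras.layers.Dense(len(TAGS), activation='softmax')),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()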
3. Challenges

As expected, most of the challenges appeared in the data transformation phase: adapting the data into the shape and format to be fed into the network. Loading the training/test dataset and manipulating it through the CoNLL2002 format reader alleviated some of that (obtaining just the words, or words and IOB tags, while skipping POS tags), so we recommend keeping all data in this format, as it is the easiest to manipulate with this NLTK helper object. Another challenge (also associated with data manipulation) was the word embeddings: loading the complete embedding set takes some time, and the volume depends on the number of dimensions selected (GloVe ships sets with 50, 100, 200 and 300 dimensions).

Features         Accuracy   Run duration
Global           88%        30 mins
Local + Global   93.8%      3 hr

Table II. Max Entropy results and training times for different feature sets

IV. RESULTS

A. Max Entropy Model

For model evaluation, we split the dataset into 70% training and 30% testing. We ran the model multiple times with global and local features. Accuracy increased with the number of features, but it plateaus after a certain threshold, and running with ever more features becomes very time and resource consuming (Table II).

B. Deep Neural Network Model

For model evaluation, we split the dataset into 70% training and 30% testing, the same as for the Max Entropy model. The model was run multiple times, using different sizes of GloVe embedding vectors: 50, 100, 200 and 300 dimensions. As expected (and as presented in the GloVe paper), dimensionality does have some impact on accuracy, increasing it up to a certain point, after which it becomes detrimental. In the case of our dataset, however, there were no changes (increase or decrease) in model accuracy across dimensions.

Fig. 5. Improvement in accuracy with the number of training epochs; accuracy plateaus at 6 epochs

For all vector dimensions, we ran the model for 1 to 10 epochs: results improved up to 6 epochs and then plateaued at 93.5% accuracy.
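For reference, a minimal sketch of that epoch sweep with Keras is shown below; X_train, y_train, X_test and y_test are hypothetical arrays of padded embedding sequences and one-hot IOB tags prepared as described above.

    # Train for 10 epochs and inspect validation accuracy after each one;
    # in our runs it stopped improving around epoch 6.
    history = model.fit(X_train, y_train,
                        validation_data=(X_test, y_test),
                        epochs=10, batch_size=32)
    print(history.history['val_accuracy'])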
V. CONCLUSIONS

Several conclusions were drawn from this project.

Dataset: obtaining a gold dataset for Named Entity Recognition is not a simple feat. We ran into a multitude of challenges (IOB format, manual vs. assisted tagging, data sanitization), and there is always a medium-to-high degree of manual effort required that should not be minimized.

Accuracy: performance for Maximum Entropy (93.8%) and the Deep Neural Network (93.5%) is comparable; there is no significant difference. We do not think this conclusion can be generalized (the literature still generally reports higher accuracy for Deep Neural Network models than for traditional Maximum Entropy), and it may be attributable to the dataset used.

Limitations: the performance obtained by the Maximum Entropy model is limited by the number of features, and there is no linear relationship between the number of features and model performance: each feature (and each combination of features) contributes independently to the model's performance. The performance of the Deep Neural Network, on the other hand, depends on the embeddings used: pre-trained vs. custom-built embeddings, as well as the number of dimensions, have an impact on the accuracy obtained.

Domain knowledge: the Maximum Entropy model is highly dependent on domain knowledge of the corpus language, as its features are directly related to the language's grammar rules (e.g., proper nouns in English are capitalized). The Deep Neural Network model is completely independent of grammar rules, so the process of creating a model can be reproduced in any language, as it depends only on the statistical properties of the corpora used for the embeddings and on the IOB-tagged dataset.

Computing resources: Maximum Entropy (as is common with traditional machine learning models) uses fewer computing resources than the Deep Neural Network, which required more processing capacity to train in reasonable time.

The final conclusion is that, because the Deep Learning model is less dependent on specific language grammar rules, it is more generalizable (provided embeddings and some labeled corpus are available in the target language), whereas the Maximum Entropy model will perform poorly on a language for which there is no domain knowledge to create the required features.

VI. TECHNICAL TOOLS

1. Python programming language on Jupyter Notebooks: all programming was done in Python in Jupyter notebook environments.
2. NLTK library for natural language processing: one of the most popular NLP packages in Python; it provides functionality for reading data in the CoNLL2002 format.
3. spaCy for assisted tagging: raw data was pre-tagged using spaCy and then manually reviewed.
4. Keras and TensorFlow for Deep Neural Networks: Keras simplified the job of creating the network; TensorFlow provides the backend for mathematical processing.
5. Google Cloud Datalab: for Deep Neural Network processing, some limitations of our local environments were resolved by using a cloud-managed Jupyter Notebook environment.

VII. REFERENCES, RESOURCES AND LITERATURE REVIEW

[1] R. Jiang, R. E. Banchs, and H. Li, "Evaluating and Combining Named Entity Recognition Systems." http://www.aclweb.org/anthology/W16-2703
[2] J. P. C. Chiu and E. Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs."
University of British Columbia and Honda Research Institute Japan Co., Ltd. https://www.aclweb.org/anthology/Q16-1026
[3] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global Vectors for Word Representation." Computer Science Department, Stanford University. https://www.aclweb.org/anthology/D14-1162
[4] spaCy, open-source software library for advanced natural language processing. Accessed April 3, 2018. https://spacy.io/usage/linguistic-features
[5] Ryerson University Library – Newspapers and Journals, online edition. Accessed April 22, 2018. http://learn.library.ryerson.ca/newspapers
Fady Fadel (Masters of Data Science and Analytics, Ryerson University 2018) was born in Basrah, Iraq in 1980. He received a Bachelor of Engineering in Information Technology from McMaster University in Hamilton, Ontario, Canada in 2009, and is a candidate for the Masters of Data Science and Analytics at Ryerson University in Toronto, Ontario, Canada. He is currently a Senior Data Analyst in the Infrastructure Capacity Engineering team at Cogeco Connexion, a Canadian telecommunications company serving residential video, high-speed Internet and telephony customers, along with data and voice transmission services and cloud-based applications for businesses.

Akshat Kashyap (Masters of Data Science and Analytics, Ryerson University 2018) was born in Ujjain, India in 1987. He received a Bachelor of Engineering in Information Technology from Rajiv Gandhi Technical University in Indore, India in 2009, and is a candidate for the Masters of Data Science and Analytics at Ryerson University in Toronto, Canada. He is currently a Senior Middleware Specialist at the Ministry of Government and Consumer Services, Ontario, which delivers vital programs, services, and products to Ontario citizens, ranging from health cards, driver's licences, and birth certificates to consumer protection and public safety.

Bernardo Najlis (Masters of Data Science and Analytics, Ryerson University 2018) was born in Buenos Aires, Argentina in 1977. He received a BS in Systems Analysis from Universidad CAECE in Buenos Aires, Argentina in 2007, and is a candidate for the Masters of Data Science and Analytics at Ryerson University in Toronto, Canada. He is currently a Solution Principal in the Information Management & Analytics team at Slalom, a worldwide consulting firm, working on cloud Big Data and advanced analytics projects.