DS8008 Natural Language Processing Project
Named Entity Recognition from Online News
(April 2018)
Fadel, Fady – Kashyap, Akshat – Najlis, Bernardo
Masters in Data Science, Ryerson University, Toronto, Canada

Abstract—This project aimed to create a series of models for the
extraction of Named Entities (People, Locations, Organizations,
Dates) from news headlines obtained online. We created two
models: a traditional Natural Language Processing model using
Maximum Entropy, and a Deep Neural Network model using pre-
trained word embeddings. Accuracy results of the two models show
similar performance, but their requirements and limitations
differ and can help determine what type of model is
best suited for each specific use case.
This project was completed as part of the DS8008 Natural
Language Processing Course at the Masters in Data Science
Program at Ryerson University in Toronto, Ontario during the
months of January through April 2018. All code is available online
at https://github.com/bnajlis/named_entity_recognition
Index Terms—Named Entity Recognition, Online News,
Natural Language Processing, Maximum Entropy, Deep Neural
Networks, Long Short-Term Memory, LSTM
I. CASE DESCRIPTION AND PROBLEM PRESENTATION
NAMED ENTITY RECOGNITION (NER) is one of the
most valuable tasks in natural language processing (NLP).
It is widely used in information retrieval systems, machine
translation, question answering systems, and summarization
tools. It is also one of the most popular methodologies to extract
information from unstructured data (emails, blogs, documents,
news articles, etc.).
NER systems serve various needs and, depending on their
requirements, can produce different types of entities. The most
common types are:
• person,
• location,
• organization,
• date, and
• money.
The problem with named entities is that names form an open
class: new words can be added to the class as often as new
entities are created. These new entities have to be added to a
large dictionary of entities called a gazetteer. The gazetteer is
difficult to maintain given the pace of innovation, product
development, new explorations, and newly popular acronyms. A
good example of the above is the area of cryptocurrency, where
various coins and tokens such as Bitcoin have been widely
adopted and featured in major news topics in recent months.
The second problem with entities is the surprising ambiguity
between types. For example, Ford could refer to a person or to
the Ford company; Jordan can refer to a person or to the country
of Jordan; April could be either a month or a person; and the
same goes for Ryerson, which can refer to either a person or a
university.
II. RELATED WORK
Research has focused on improving the accuracy of named
entity recognition with the use of machine learning, specifically
in natural language processing. Among the most common
models used are the Decision Tree model, the Naive Bayes
classifier, the Maximum Entropy model [1], and the Hidden
Markov model [2].
Newer research has also focused on the use of Deep Neural
Network models to improve the performance of Named Entity
Recognition. The use of Long Short-Term Memory cells and
word embeddings is common in these types of models.
III. METHODOLOGY
Our approach is to apply an NER system trained on a dataset
in the domain of online news.
A. News Tagged Dataset
Pre-tagged datasets for specific domains are nearly non-
existent, and where one exists there is no assurance of quality.
In order to train our models in the news domain and ensure that
we obtained a sample of recent named entities to train and
evaluate our models, we opted to produce the Gold Corpus
ourselves by manually extracting online news articles and
pre-processing the data.
The data was obtained from the Ryerson University Library
& Archives newspaper database (RULA) [5]. It consists of
2,116 news articles from The Globe and Mail, a Canadian
newspaper, for the period of February and March 2018.
B. Obtaining the Gold Corpus
We pre-processed the data from its raw form, in which each
article was delimited by multiple underscores. Each individual
article was then cleaned of unnecessary fields that were either
manually entered into the news field in error or simply did not
belong to the news article (e.g., author, credits, keywords). The
cleaned news articles were then each saved as an individual file
for easy referencing and accessibility during later validation.
Field Name   Description
Unigram      Word in the sentence
POS tag      Part-of-speech tag for the word
IOB tag      IOB tag for the word (entity boundary and type)
Table 1. Schema for the Gold Corpus dataset created
Fig. 2.a Sample of raw data obtained from the RULA system
Fig. 2.b High level workflow to obtain the Gold Corpus used in the project
and sample of IOB tagged data
The data was obtained in the form of a raw unstructured text
file [Fig. 2.a]. We then did extensive cleaning of the raw data,
which was found to produce errors even with restricted
validation. After further analysis, we discovered that the file
did not only contain the main news content: there were
instances where it contained fields that had been entered in
error by the news writer in the incorrect field. Once these
errors were corrected, we split the content into individual files
so that we could reference and validate them in later steps.
Next, we used an assisted tagging method that leaned on an
open-source natural language processing library called
spaCy [4]. After obtaining the required format, we validated
the tagged dataset manually to ensure it was free of errors.
The final dataset was then saved in CoNLL format, which
consists of tab-separated (word, POS tag, IOB tag) triples, to
ease model ingestion in the later steps, where we used the
built-in NLTK CoNLL corpus reader to feed the data to each
model.
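To make the assisted-tagging step concrete, the following is a minimal sketch of how spaCy's pre-trained tagger can emit CoNLL-style lines. The model name, the entity-type filter, and the direct mapping from spaCy's entity labels to our tag set are illustrative assumptions, not the project's exact code.

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Entity types we keep; everything else is tagged O.
# (spaCy's ORG label would still need mapping to our ORGANIZATION tag.)
KEEP = {"PERSON", "GPE", "ORG", "DATE"}

def to_conll(text):
    """Pre-tag raw article text and emit CoNLL-style lines:
    word <tab> POS tag <tab> IOB tag, one token per line."""
    doc = nlp(text)
    lines = []
    for token in doc:
        if token.ent_iob_ in ("B", "I") and token.ent_type_ in KEEP:
            iob = f"{token.ent_iob_}-{token.ent_type_}"
        else:
            iob = "O"
        lines.append(f"{token.text}\t{token.tag_}\t{iob}")
    return "\n".join(lines)

print(to_conll("Ryerson University is located in Toronto."))
```

The output of this step was then reviewed by hand, as described above, before being accepted into the Gold Corpus.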
C. Maximum Entropy Model
1. Introduction
The maximum entropy framework estimates probabilities
based on the principle of making as few assumptions as
possible, other than the constraints imposed. Such constraints
are derived from training data, expressing some relationship
between features and outcome. The probability distribution that
satisfies the above property is the one with the highest entropy.
Fig. 1. IOB (Inside, Outside, Beginning) Tagging example for a news article headline
> DS 8008 NATURAL LANGUAGE PROCESSING – NAMED ENTITY RECOGNITION FROM ONLINE NEWS (APRIL 2018) < 3
It is unique, agrees with the maximum-likelihood distribution,
and has the exponential form (Della Pietra et al., 1997):

$$p(o \mid h) = \frac{1}{Z(h)} \prod_{j} \alpha_j^{f_j(h, o)}$$

where o refers to the outcome, h the history (or context), and
Z(h) is a normalization function. The features used in the
maximum entropy framework are binary. An example of a
feature function is

$$f_j(h, o) = \begin{cases} 1 & \text{if the current word in } h \text{ is capitalized and } o = \text{B-PERSON} \\ 0 & \text{otherwise} \end{cases}$$
The parameters αj are estimated by a procedure called
MEGAM. MEGAM (MEGA Model Optimization Package) is
an OCaml-based maximum entropy optimization package that
originated at the University of Utah. MEGAM tends to perform
much better in terms of speed and resource consumption.
The maximum entropy classifier is used to classify each word
as one of the following: the beginning of a NE (B tag), a word
inside a NE (I tag), or a word outside a NE (O tag). During
testing, it is possible that the classifier produces a sequence of
inadmissible classes (e.g., O followed by I). To eliminate
such sequences, we define a transition probability between
word classes P(Ci|Cj) to be equal to 1 if the sequence is
admissible, and 0 otherwise. The probability of the classes
C1,...,Cn assigned to the words in a sentence s in a document
d is defined as follows:

$$P(C_1, \ldots, C_n \mid s, d) = \prod_{i=1}^{n} P(C_i \mid s, d) \cdot P(C_i \mid C_{i-1})$$

where P(Ci|s, d) is determined by the maximum entropy
classifier. The Viterbi algorithm is then used to select the
sequence of word classes with the highest probability.
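To make the admissibility constraint and the Viterbi decoding step concrete, here is a minimal sketch over a toy three-tag set; the per-word probabilities and the simplified tag set are illustrative, not the project's actual classifier output.

```python
import numpy as np

TAGS = ["O", "B", "I"]

def admissible(prev, curr):
    # An I tag may only follow B or I; every other transition is allowed.
    # (Start-of-sentence constraints are omitted for brevity.)
    return 0.0 if (curr == "I" and prev == "O") else 1.0

def viterbi(probs):
    """probs: array of shape (n_words, n_tags) with per-word class
    probabilities from the maximum entropy classifier.
    Returns the highest-probability admissible tag sequence."""
    n, k = probs.shape
    score = np.zeros((n, k))
    back = np.zeros((n, k), dtype=int)
    score[0] = probs[0]
    for i in range(1, n):
        for c in range(k):
            cand = [score[i - 1, p] * admissible(TAGS[p], TAGS[c]) * probs[i, c]
                    for p in range(k)]
            back[i, c] = int(np.argmax(cand))
            score[i, c] = cand[back[i, c]]
    # Trace the best path backwards from the final word.
    path = [int(np.argmax(score[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(back[i, path[-1]])
    return [TAGS[c] for c in reversed(path)]

# Toy example: three words with made-up classifier probabilities.
p = np.array([[0.2, 0.7, 0.1],   # likely B
              [0.3, 0.2, 0.5],   # likely I
              [0.8, 0.1, 0.1]])  # likely O
print(viterbi(p))  # ['B', 'I', 'O']
```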
2. Feature Representation
We used two types of features in our model: global and
local. Global features are generalized features that can be
used universally with any corpus, while local features are
context specific and depend on the training and testing corpus.
Global features: these features do not depend on the
domain or language; they are quite generic and can be applied
to any dataset.
The global features include:
• Bigrams: combination of the current word with its
previous and next word
• Trigrams: combination of the current word with its
next, next-to-next, previous, and previous-to-previous
word
• Bigrams of POS tags: combination of the POS tags of
the current word and its next and previous word
• Trigrams of POS tags: combination of the POS tags of
the current word and its next, next-to-next, previous, and
previous-to-previous word
• Previous IOB tag: IOB tag of the previous word
Local features: these features are contextual in nature to
our training data. They are specific and cannot be generalized
to other datasets.
• Lemmatization: the lemmatized word; the lemmatizer
we used is specific to the English language
• Capitalized, PrevCapitalized, NextCapitalized: we
tagged words whose first letter is capital, which is very
important in English, where it identifies entities
such as geopolitical entities
• isNumeric: this tag is also specific to the representation
of entities like money, dates, etc.; it might not work
with other datasets if they use word representations
for these entities
• Tags-since-DT: this tag is also language specific;
other languages might not have the concept of
determiners.
A sketch of how these features combine into a feature
dictionary is shown below.
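As a rough sketch of how such features can be assembled into the dictionary that NLTK classifiers expect, the function below combines a few of the global and local features listed above; the feature names and the exact neighbour handling are illustrative.

```python
from nltk.stem import WordNetLemmatizer

# Requires the WordNet data: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def word_features(sent, i, prev_iob):
    """Build the feature dict for the i-th (word, POS) pair in a sentence.
    Global features use neighbouring words/POS tags; local features use
    English-specific cues such as capitalization."""
    word, pos = sent[i]
    prev_word, prev_pos = sent[i - 1] if i > 0 else ("<START>", "<START>")
    next_word, next_pos = sent[i + 1] if i < len(sent) - 1 else ("<END>", "<END>")
    return {
        # Global features
        "bigram-prev": prev_word + " " + word,
        "bigram-next": word + " " + next_word,
        "pos-bigram": prev_pos + " " + pos,
        "pos-trigram": prev_pos + " " + pos + " " + next_pos,
        "prev-iob": prev_iob,
        # Local (English-specific) features
        "lemma": lemmatizer.lemmatize(word.lower()),
        "capitalized": word[:1].isupper(),
        "prev-capitalized": prev_word[:1].isupper(),
        "next-capitalized": next_word[:1].isupper(),
        "is-numeric": word.isdigit(),
    }
```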
3. Implementation
We read the pre-tagged data through the CoNLL corpus
reader in the format (word, POS tag, IOB tag), delimited by
sentence. We had the following IOB tags as part of our corpus
("O", "I", "B", "B-PERSON", "I-PERSON", "B-GPE", "I-
GPE", "B-DATE", "I-DATE", "B-ORGANIZATION", "I-
ORGANIZATION"). We split the data into a 70/30 ratio using
the scikit-learn library, and prepared the feature list based on
the global and local features. We defined a maxent classifier
from the NLTK library with the MEGAM procedure, which is
good for speed and resource usage. We gave the 70% split and
the feature list as input for training the max-ent classifier. Once
training finished, we tested it with the remaining 30% of the
data and obtained 93.8% accuracy.
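The following sketch outlines this pipeline: reading the gold corpus with NLTK's built-in CoNLL reader, splitting 70/30 with scikit-learn, and training an NLTK maxent classifier with the MEGAM algorithm. The file paths are hypothetical, and MEGAM is an external binary that must be installed and registered separately.

```python
import nltk
from nltk.corpus.reader import ConllCorpusReader
from nltk.classify import MaxentClassifier
from sklearn.model_selection import train_test_split

# Hypothetical corpus location; columns are word, POS tag, IOB chunk tag.
reader = ConllCorpusReader("data/", "gold_corpus.conll",
                           columntypes=("words", "pos", "chunk"))

def word_features(tokens, i, prev_iob):
    # Minimal stand-in; the fuller feature dictionary is sketched above.
    word, pos = tokens[i]
    return {"word": word.lower(), "pos": pos, "prev-iob": prev_iob,
            "capitalized": word[:1].isupper()}

# Build (features, label) pairs, one per word.
samples = []
for sent in reader.iob_sents():
    prev_iob = "<START>"
    tokens = [(w, p) for (w, p, _) in sent]
    for i, (_, _, iob) in enumerate(sent):
        samples.append((word_features(tokens, i, prev_iob), iob))
        prev_iob = iob

train, test = train_test_split(samples, test_size=0.3, random_state=42)

# Requires the external megam binary; register it first, e.g.:
#   nltk.config_megam("/usr/local/bin/megam")
classifier = MaxentClassifier.train(train, algorithm="megam")
print("accuracy:", nltk.classify.accuracy(classifier, test))
```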
Fig. 3. High level design for Maximum Entropy model workflow.
4. Challenges
The first and biggest challenge was understanding the data
format requirement: we had to first understand the IOB
format and come up with the proper format required by the
classifier. We went through multiple iterations of data
formatting and cleaning across the entire workflow,
beginning with the corpus reader and ending with
classifier testing.
We also had challenges with training and testing after
incrementally adding new features to the feature list,
because of the computation power required. Execution time
kept increasing with each additional feature. After reaching
the current list, it was very difficult to add and test further
features, so we decided to stop at 93.8%.
5. Future Enhancements
Although the Max-Ent classifier has inherently low bias, we
could improve it further with k-fold cross-validation.
We could also use a high-performance computing machine
or distributed computing to train and test on an expanded
feature list.
D. Deep Neural Network model with LSTM and GloVe
pre-trained embeddings
1. Introduction
A less traditional and more recent approach to Named Entity
Recognition makes use of Deep Neural Networks. The
approach relies on the good performance of DNNs on
classification tasks: in essence, a NER problem can be
interpreted as a classification problem where every word
belongs to a certain class (the IOB tag for the word). This
simple analogy over-simplifies the need for the word's context
and the fact that some names or entities span more than one
word, so some additional details have to be taken into account
for the network to recognize this.
Another tool that we can leverage in solving this NLP
problem with neural networks is embeddings. These are
vector translations of the words, as opposed to the traditional
one-hot encoding over the complete vocabulary analyzed. In a
sense, embeddings are a dimensionality reduction of the
vocabulary, so they also help reduce the number of input
dimensions (and computations) required for classification.
We can build these embedding vectors from our corpus
vocabulary, or use pre-trained embeddings. Pre-trained
embeddings are usually built on large corpora of documents
(Wikipedia, web scrapes, Twitter corpora) and would require
a long time and substantial computation to calculate, so it is
not uncommon to reuse them in other projects.
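As a minimal sketch, pre-trained GloVe vectors can be loaded into a simple word-to-vector dictionary as shown below; the file name follows the published Wikipedia+Gigaword ("glove.6B") distribution used in this project.

```python
import numpy as np

def load_glove(path, dim):
    """Load GloVe vectors from a text file into a word -> vector dict.
    Each line has the form: word v1 v2 ... v_dim."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vec = np.asarray(parts[1:], dtype="float32")
            assert vec.shape == (dim,)  # sanity-check the dimension
            embeddings[parts[0]] = vec
    return embeddings

# The Wikipedia+Gigaword distribution ships 50d, 100d, 200d and 300d files.
glove = load_glove("glove.6B.50d.txt", dim=50)
print(glove["news"].shape)  # (50,)
```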
2. Implementation
Fig. 4. Deep Neural Network structure: input unigram, translation into embedding vector, input layer, hidden layers, one-hot encoded output layer and
translation back to IOB tag.
In our implementation, we used Keras to simplify the
construction of the network, with Tensorflow as the tensor
calculation backend. The network has multiple layers (input,
several hidden, and output). As each word embedding has 50
dimensions, the input layer has 50 units. The output layer has
as many units as there are IOB tags in our training dataset, in
our case 14 ("O", "I", "B", "B-PERSON", "I-PERSON",
"B-GPE", "I-GPE", "B-DATE", "I-DATE",
"B-ORGANIZATION", "I-ORGANIZATION").
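A minimal Keras sketch of such a network is shown below, assuming the LSTM architecture named in this section's title: a 50-dimensional GloVe embedding layer feeding an LSTM, with a 14-way softmax output per word. The vocabulary size, padded sentence length, and hidden-layer width are illustrative assumptions, not figures from the project.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, TimeDistributed, Dense
from tensorflow.keras.initializers import Constant

VOCAB_SIZE = 20000   # illustrative vocabulary size
EMBED_DIM = 50       # GloVe vector dimension (input units, as in the text)
MAX_LEN = 40         # illustrative sentence length after padding
N_TAGS = 14          # number of IOB tags in the training set

# embedding_matrix: rows are GloVe vectors for each vocabulary word,
# built from the dict returned by load_glove() above (zeros here as a stub).
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))

model = Sequential([
    # Frozen embedding layer initialized with the pre-trained vectors.
    Embedding(VOCAB_SIZE, EMBED_DIM,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False),
    LSTM(100, return_sequences=True),          # hidden width is illustrative
    # One-hot (softmax) output per word, matching Fig. 4.
    TimeDistributed(Dense(N_TAGS, activation="softmax")),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.build(input_shape=(None, MAX_LEN))
model.summary()
# Training mirrors the experiments in Section IV, e.g.:
# model.fit(X_train, y_train, epochs=6, validation_split=0.1)
```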
The pre-trained embedding set used for our project is GloVe
[3] (Global Vectors), created at Stanford University. According
to the project website: "GloVe is an unsupervised learning
algorithm for obtaining vector representations for words.
Training is performed on aggregated global word-word co-
occurrence statistics from a corpus, and the resulting
representations showcase interesting linear substructures of the
word vector space."
GloVe provides various sets of word vectors, pre-trained on
different datasets (Wikipedia, Common Crawl, Twitter) and
with different vector dimension sizes. Based on the nature of
our data domain (news articles), we selected the Wikipedia
embeddings, as they were also trained on the Gigaword dataset
(an archive of newswire text data). As this set is provided in
multiple vector dimensions (50d, 100d, 200d, 300d), we were
able to experiment with dimensionality as another factor to
vary and evaluate for its impact on accuracy.
3. Challenges
As expected, most of the challenges appeared in the data
transformation phase: adapting the data into the shape and
format to be fed into the network. Loading the training/test
dataset and manipulating it using the CoNLL2002 format
reader alleviated some of that (obtaining just words, or words
and IOB tags, skipping POS tags), so it would be a good
practice to keep all data in this format, as it is the easiest to
manipulate using this NLTK helper object.
Another challenge (also associated with data manipulation)
was the word embeddings: loading the complete set takes
some time due to its volume, which also depends on the
number of dimensions selected (i.e., GloVe has sets with
50, 100, 200 and 300 dimensions).
Features         Accuracy   Run duration
Global           88%        30 min
Local + Global   93.8%      3 hr
Table II. Max Entropy results and training times for different
feature sets
IV. RESULTS
A. Max Entropy Model
For model evaluation, we split the dataset into 70% training
and 30% testing. We ran the model multiple times, with global
and local features. Accuracy increased with the number of
features, but it plateaus after a certain threshold, and it becomes
very time- and resource-consuming to run with an increasing
number of features.
B. Deep Neural Network model
For model evaluation, we split the dataset into 70% training
and 30% testing, the same as for the Max Entropy model.
The model was run multiple times, using different sizes of
GloVe embedding vectors: 50, 100, 200 and 300 dimensions.
As expected (and as presented in the GloVe paper),
dimensionality does have some impact on accuracy, but only
up to a certain point, after which it becomes detrimental. In the
case of our dataset, there were no changes (increase or
decrease) in the model accuracy across dimensions.
Fig. 5. Improvement in accuracy with number of training epochs; plateaus
at 6 epochs.
For all the vector dimensions, we ran the model for 1 to 10
epochs: results improved up until 6 epochs, and then plateaued
at 93.5% accuracy.
V. CONCLUSIONS
Several conclusions were extracted from this project:
Dataset: obtaining a gold dataset for Named Entity
Recognition is not a simple feat. We ran into a multitude of
challenges (IOB format, manual vs. assisted tagging, data
sanitization), and there is always a medium-to-high degree of
manual effort required that should not be minimized.
Accuracy: performance for Maximum Entropy (93.8%) and
the Deep Neural Network (93.5%) is comparable; there is no
significant difference. We do not think this conclusion can be
generalized (as the literature still generally reports higher
accuracy for Deep Neural Network models than for traditional
Maximum Entropy), and it may be attributable to the dataset used.
Limitations: the performance obtained by the Maximum
Entropy model is limited by the number of features. There is no
way to draw a linear relationship between the number of
features and model performance: each feature (and each
combination of features) contributes independently to the
model's performance. On the other hand, the performance
obtained by the Deep Neural Network depends on the
embeddings used: pre-trained vs. custom-built embeddings, as
well as the number of dimensions, have an impact on the
accuracy obtained.
Domain knowledge: the Maximum Entropy model is
highly dependent on domain knowledge of the corpus
language, as its features are directly related to the language's
grammar rules (e.g., proper nouns in English are capitalized).
The Deep Neural Network model is completely independent
of language grammar rules, so the process of creating a model
can be reproduced independently of language, as it depends only
on the stochastic properties of the corpora used for the
embeddings and the IOB-tagged dataset.
Computing resources: Maximum Entropy (as is common
with more traditional machine learning models) uses fewer
computing resources than the Deep Neural Network, which
required more processing capacity to train in reasonable time.
The final conclusion is that, as the Deep Learning model is
less dependent on specific language grammar rules, it is more
generalizable (given that embeddings and some labeled corpora
are provided in the target language), whereas the Maximum
Entropy model will perform poorly on a language for which
there is no domain knowledge to create the required features.
VI. TECHNICAL TOOLS
1. Python Programming Language on Jupyter Notebooks:
All programming was done using the Python
programming language on Jupyter notebook
environments.
2. NLTK library for Natural Language Processing: One of
the most popular natural language processing packages
in Python, it provides functionality for reading data
formatted in the CoNLL2002 format.
3. SpaCy for assisted tagging. Raw data was pre-tagged
using spaCy and then manually reviewed.
4. Keras and Tensorflow for Deep Neural Networks. Keras
simplified the job of creating the network; Tensorflow
provided the backend for mathematical processing.
5. Google Cloud Datalab: For Deep Neural Network
processing, some limitations in our local environments
were resolved by using a Jupyter Notebook cloud
managed environment.
VII. REFERENCES, RESOURCES AND LITERATURE REVIEW
[1] Ridong Jiang, Rafael E. Banchs, Haizhou Li,
"Evaluating and Combining Named Entity Recognition
Systems."
http://www.aclweb.org/anthology/W16-2703
[2] Jason P.C. Chiu, Eric Nichols, "Named Entity
Recognition with Bidirectional LSTM-CNNs." University of
British Columbia, Honda Research Institute Japan Co., Ltd.
https://www.aclweb.org/anthology/Q16-1026
[3] Jeffrey Pennington, Richard Socher, Christopher D.
Manning, "GloVe: Global Vectors for Word Representation."
Computer Science Department, Stanford University.
https://www.aclweb.org/anthology/D14-1162
[4] spaCy, open-source software library for advanced
natural language processing. Accessed April 3, 2018.
https://spacy.io/usage/linguistic-features
[5] Ryerson University Library – Newspapers and
Journals Library / Online edition. Accessed April 22, 2018.
http://learn.library.ryerson.ca/newspapers
Fady Fadel (Masters of Data Science and Analytics, Ryerson
University 2018) was born in Basrah, Iraq in 1980. He received
a Bachelor of Engineering in Information Technology from
McMaster University in Hamilton, Ontario, Canada in 2009,
and is a candidate for the Masters of Data Science and Analytics
at Ryerson University in Toronto, Ontario, Canada.
He is currently a Senior Data Analyst in the Infrastructure
Capacity Engineering team at Cogeco Connexion, a Canadian
telecommunications company serving residential video,
high-speed Internet and telephony customers, along with data
and voice transmission services and cloud-based applications
for businesses.
Akshat Kashyap (Masters of Data Science and Analytics,
Ryerson University 2018) was born in Ujjain, India in 1987. He
received a Bachelor of Engineering in Information Technology
from Rajiv Gandhi Technical University in Indore, India in
2009, and is a candidate for the Masters of Data Science and
Analytics at Ryerson University in Toronto, Canada.
He is currently a Senior Middleware Specialist in the Ministry
of Government and Consumer Services, Ontario, which
delivers vital programs, services, and products to Ontario
citizens, ranging from health cards, driver's licences, and birth
certificates to consumer protection and public safety.
Bernardo Najlis (Masters of Data Science and Analytics,
Ryerson University 2018) was born in Buenos Aires, Argentina
in 1977. He received a BS in Systems Analysis from
Universidad CAECE in Buenos Aires, Argentina in 2007, and
is a candidate for the Masters of Data Science and Analytics at
Ryerson University in Toronto, Canada.
He is currently a Solution Principal in the Information
Management & Analytics team at Slalom, a worldwide
consulting firm, working on Cloud Big Data and Advanced
Analytics projects.
More Related Content

What's hot

A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbAlexander Decker
 
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...IRJET Journal
 
A Linked Data Prototype for the Union Catalog of Digital Archives Taiwan
A Linked Data Prototype for the Union Catalog of Digital Archives TaiwanA Linked Data Prototype for the Union Catalog of Digital Archives Taiwan
A Linked Data Prototype for the Union Catalog of Digital Archives Taiwanandrea huang
 
20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogsandrea huang
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Andre Freitas
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalMauro Dragoni
 
Omitola birmingham cityuniv
Omitola birmingham cityunivOmitola birmingham cityuniv
Omitola birmingham cityunivTope Omitola
 
Knowledge Discovery in Remote Access Databases
Knowledge Discovery in Remote Access Databases Knowledge Discovery in Remote Access Databases
Knowledge Discovery in Remote Access Databases Zakaria Zubi
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Ontotext
 
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASESEFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASESIJCSEIT Journal
 
A unified approach for spatial data query
A unified approach for spatial data queryA unified approach for spatial data query
A unified approach for spatial data queryIJDKP
 
LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013Luis Daniel Ibáñez
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrievalrchbeir
 
Inference on the Semantic Web
Inference on the Semantic WebInference on the Semantic Web
Inference on the Semantic WebMyungjin Lee
 
Knowledge graphs on the Web
Knowledge graphs on the WebKnowledge graphs on the Web
Knowledge graphs on the WebArmin Haller
 
Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data csandit
 

What's hot (19)

A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
 
A Linked Data Prototype for the Union Catalog of Digital Archives Taiwan
A Linked Data Prototype for the Union Catalog of Digital Archives TaiwanA Linked Data Prototype for the Union Catalog of Digital Archives Taiwan
A Linked Data Prototype for the Union Catalog of Digital Archives Taiwan
 
20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)
 
disertation
disertationdisertation
disertation
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
 
Omitola birmingham cityuniv
Omitola birmingham cityunivOmitola birmingham cityuniv
Omitola birmingham cityuniv
 
Web search engines
Web search enginesWeb search engines
Web search engines
 
Knowledge Discovery in Remote Access Databases
Knowledge Discovery in Remote Access Databases Knowledge Discovery in Remote Access Databases
Knowledge Discovery in Remote Access Databases
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
 
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASESEFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
 
A unified approach for spatial data query
A unified approach for spatial data queryA unified approach for spatial data query
A unified approach for spatial data query
 
LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
 
Inference on the Semantic Web
Inference on the Semantic WebInference on the Semantic Web
Inference on the Semantic Web
 
Knowledge graphs on the Web
Knowledge graphs on the WebKnowledge graphs on the Web
Knowledge graphs on the Web
 
Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data
 

Similar to Named Entity Recognition from Online News

Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory acijjournal
 
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
 An Investigation of Keywords Extraction from Textual Documents using Word2Ve... An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...IJCSIS Research Publications
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT ecij
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT ecij
 
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALijaia
 
Requirementv4
Requirementv4Requirementv4
Requirementv4stat
 
SURE Research Report
SURE Research ReportSURE Research Report
SURE Research ReportAlex Sumner
 
A Review on Text Mining in Data Mining
A Review on Text Mining in Data Mining  A Review on Text Mining in Data Mining
A Review on Text Mining in Data Mining ijsc
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningIOSR Journals
 
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSINGRAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSINGijaia
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsDerek Kane
 
A Survey of Ontology-based Information Extraction for Social Media Content An...
A Survey of Ontology-based Information Extraction for Social Media Content An...A Survey of Ontology-based Information Extraction for Social Media Content An...
A Survey of Ontology-based Information Extraction for Social Media Content An...ijcnes
 
A Review on Text Mining in Data Mining
A Review on Text Mining in Data MiningA Review on Text Mining in Data Mining
A Review on Text Mining in Data Miningijsc
 
Topic Mining on disaster data (Robert Monné)
Topic Mining on disaster data (Robert Monné)Topic Mining on disaster data (Robert Monné)
Topic Mining on disaster data (Robert Monné)Robert Monné
 
NLP_A Chat-Bot_answering_queries_of_UT-Dallas_Students
NLP_A Chat-Bot_answering_queries_of_UT-Dallas_StudentsNLP_A Chat-Bot_answering_queries_of_UT-Dallas_Students
NLP_A Chat-Bot_answering_queries_of_UT-Dallas_StudentsHimanshu kandwal
 
Temporal Information Processing: A Survey
Temporal Information Processing: A SurveyTemporal Information Processing: A Survey
Temporal Information Processing: A Surveykevig
 
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU Love Arora
 
Comp_Ling_Resume_Generator.pdf
Comp_Ling_Resume_Generator.pdfComp_Ling_Resume_Generator.pdf
Comp_Ling_Resume_Generator.pdfHanaBaSabaa
 

Similar to Named Entity Recognition from Online News (20)

Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory
 
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
 An Investigation of Keywords Extraction from Textual Documents using Word2Ve... An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
 
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
 
Requirementv4
Requirementv4Requirementv4
Requirementv4
 
SURE Research Report
SURE Research ReportSURE Research Report
SURE Research Report
 
A Review on Text Mining in Data Mining
A Review on Text Mining in Data Mining  A Review on Text Mining in Data Mining
A Review on Text Mining in Data Mining
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
 
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSINGRAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
 
The Value and Benefits of Data-to-Text Technologies
The Value and Benefits of Data-to-Text TechnologiesThe Value and Benefits of Data-to-Text Technologies
The Value and Benefits of Data-to-Text Technologies
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
A Survey of Ontology-based Information Extraction for Social Media Content An...
A Survey of Ontology-based Information Extraction for Social Media Content An...A Survey of Ontology-based Information Extraction for Social Media Content An...
A Survey of Ontology-based Information Extraction for Social Media Content An...
 
A Review on Text Mining in Data Mining
A Review on Text Mining in Data MiningA Review on Text Mining in Data Mining
A Review on Text Mining in Data Mining
 
Topic Mining on disaster data (Robert Monné)
Topic Mining on disaster data (Robert Monné)Topic Mining on disaster data (Robert Monné)
Topic Mining on disaster data (Robert Monné)
 
NLP_A Chat-Bot_answering_queries_of_UT-Dallas_Students
NLP_A Chat-Bot_answering_queries_of_UT-Dallas_StudentsNLP_A Chat-Bot_answering_queries_of_UT-Dallas_Students
NLP_A Chat-Bot_answering_queries_of_UT-Dallas_Students
 
Temporal Information Processing: A Survey
Temporal Information Processing: A SurveyTemporal Information Processing: A Survey
Temporal Information Processing: A Survey
 
G0361034038
G0361034038G0361034038
G0361034038
 
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
 
Comp_Ling_Resume_Generator.pdf
Comp_Ling_Resume_Generator.pdfComp_Ling_Resume_Generator.pdf
Comp_Ling_Resume_Generator.pdf
 

More from Bernardo Najlis

Toastmasters speech #7 - Research your Subject
Toastmasters speech #7  - Research your SubjectToastmasters speech #7  - Research your Subject
Toastmasters speech #7 - Research your SubjectBernardo Najlis
 
Toastmasters project #5 - Just a jump
Toastmasters project #5  - Just a jumpToastmasters project #5  - Just a jump
Toastmasters project #5 - Just a jumpBernardo Najlis
 
Business Intelligence Presentation - Data Mining (2/2)
Business Intelligence Presentation - Data Mining (2/2)Business Intelligence Presentation - Data Mining (2/2)
Business Intelligence Presentation - Data Mining (2/2)Bernardo Najlis
 
Business Intelligence Presentation (1/2)
Business Intelligence Presentation (1/2)Business Intelligence Presentation (1/2)
Business Intelligence Presentation (1/2)Bernardo Najlis
 

More from Bernardo Najlis (8)

#FluxFlow
#FluxFlow#FluxFlow
#FluxFlow
 
Introduction to knime
Introduction to knimeIntroduction to knime
Introduction to knime
 
Toastmasters speech #7 - Research your Subject
Toastmasters speech #7  - Research your SubjectToastmasters speech #7  - Research your Subject
Toastmasters speech #7 - Research your Subject
 
Toastmasters project #5 - Just a jump
Toastmasters project #5  - Just a jumpToastmasters project #5  - Just a jump
Toastmasters project #5 - Just a jump
 
What is lomography?
What is lomography?What is lomography?
What is lomography?
 
Plethora
PlethoraPlethora
Plethora
 
Business Intelligence Presentation - Data Mining (2/2)
Business Intelligence Presentation - Data Mining (2/2)Business Intelligence Presentation - Data Mining (2/2)
Business Intelligence Presentation - Data Mining (2/2)
 
Business Intelligence Presentation (1/2)
Business Intelligence Presentation (1/2)Business Intelligence Presentation (1/2)
Business Intelligence Presentation (1/2)
 

Recently uploaded

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 

Recently uploaded (20)

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 

Named Entity Recognition from Online News

  • 1. > DS 8008 NATURAL LANGUAGE PROCESSING – NAMED ENTITY RECOGNITION FROM ONLINE NEWS (APRIL 2018) < 1  Abstract—This project aimed to create a series of models for the extraction of Named Entities (People, Locations, Organizations, Dates) from news headlines obtained online. We created two models: a traditional Natural Processing Language Model using Maximum Entropy , and a Deep Neural Network Model using pre- trained word embeddings. Accuracy results of both models show similar performance, but the requirements and limitations of both models are different and can help determine what type of model is best suited for each specific use case. This project was completed as part of the DS8008 Natural Language Processing Course at the Masters in Data Science Program at Ryerson University in Toronto, Ontario during the months of January through April 2018. All code is available online at https://github.com/bnajlis/named_entity_recognition Index Terms— Named Entity Recognition, Online News, Natural Language Processing, Maximum Entropy, Deep Neural Networks, Long-Short Term Memory, LSTM I. CASE DESCRIPTION AND PROBLEM PRESENTATION AMED ENTITY RECOGNITION (NER) is one of the most valuable tasks in natural language processing (NLP). It is widely used in information retrieval systems, machine translation, question answering systems, and summarization tools. It is also one of the most popular methodologies to extract information from unstructured data (emails, blogs, documents, news articles, etc.). NER systems have various needs and depending on their requirements can produce different type of entities, the most common types are: • person, • location, • organization, • date and • money The problem with named entities is that the names are part of an open class where new words can be added to the class as they often as new ones get created. The new entities have to be added into large dictionary called gazetteer that contains various entities. The gazetteer can is difficult to maintain with the growth of innovation, products development, new explorations and as new acronyms become popular. A good example of the above is the area of Cryptocurrency, where various coins and tokens such as Bitcoin have been widely adopted and used in major news topics in the most recent months. The second problem with entities is the surprising ambiguity between various types for example Ford could refer to a person or the company Ford, Jordan can refer to a person or the location of the country Jordan, April could either be a month or a person and the same goes for Ryerson that can refer to either a person or a university organization. II. RELATED WORK Research has been focused on improving the accuracy of named entity recognition with the use of machine learning specifically in natural language processing. Among the most common models used, there are Decision Tree Model, Naive Bayes Classifier, Maximum Entropy Model [1], Hidden Markov Model [2]. Newer research has also focused on the use of Deep Neural Network models to improve performance of Named Entity Recognition. Usage of Long-Short Term Memory cells and word embeddings is common in this type of models. III. METHODOLOGY Our approach is to apply an NER system using dataset for training in the domain of online news. A. News Tagged Dataset Pre-tagged dataset for specific domain are nearly non existent, if such exists there is no assurance of quality. 
In order to train our model in the news domain and ensure that we obtain sample of recent named entities to train and evaluate our models, we opted to produce the gold corpus by manually extracting online news articles and pre-processing the data to obtain the Gold Corpus. DS8008 Natural Language Processing Project Named Entity Recognition from Online News (April 2018) Fadel, Fady – Kashyap, Akshat - Najlis, Bernardo Masters in Data Science, Ryerson University, Toronto, Canada N
  • 2. > DS 8008 NATURAL LANGUAGE PROCESSING – NAMED ENTITY RECOGNITION FROM ONLINE NEWS (APRIL 2018) < 2 The data was obtained from the Ryerson University Library & Archived newspaper database (RULA) [RULA]. The data consists of 2,116 news articles from The Globe and Mail, a Canadian newspaper for the periods of February and March 2018. B. Obtaining the Gold Corpus We proceeded with pre-processing the data from its raw form where each article was delimited by multiple underscore. Then each individual article was cleaned from unnecessary fields that were either manual entered by the in the news field in error or simply did not belong to the news article eg. author, credits, keywords. The cleaned news articles were then saved each with an and individual for easy referencing and accessibility for later validation. Field Name Description Unigram Word in the POS tag Value for the DJIA at market open, in points IOB tag Highest value for the day, in points Table 1. Schema for the Gold Corpus dataset created Fig. 2.a Sample of raw data obtained from the Rula system Fig. 2.b High level workflow to obtain the Gold Corpus used in the project and sample of IOB tagged data The data was obtained in the form of raw unstructured text file[Fig. 2.a], we then did extensive cleaning of the raw data and found to produce errors even with restricted validation. After further analysis, we discovered that it did not only contain the main new content but there were instances where the it contained fields that were manually imputed in error by the news writer in the incorrect field. Once the previous errors where validated, we split the content into individual files in order to reference and being able validate in future step. Next, we used assisted tagging method where we leaned on the use of an open source natural language application called SpaCy[4]. After further obtaining the required format we validated the tagged dataset manually to ensure accuracy of our dataset is free of errors. The final dataset was then saved using CoNLL format which consisted of tab separated (word, POS tag, IOB tag), to ease the data model ingestion in the later steps where we utilized the built-in nltk CoNLL corpus reader to feed the data to each model. C. Maximum Entropy Model 1. Introduction The maximum entropy framework estimates probabilities based on the principle of making as few assumptions as possible, other than the constraints imposed. Such constraints are derived from training data, expressing some relationship between features and outcome. The probability distribution that satisfies the above property is the one with the highest entropy. Fig. 1. IOB (Inside, Outside, Beginning) Tagging example for a news article headline
  • 3. > DS 8008 NATURAL LANGUAGE PROCESSING – NAMED ENTITY RECOGNITION FROM ONLINE NEWS (APRIL 2018) < 3 It is unique, agrees with the maximum-likelihood distribution, and has the exponential form (Della Pietra et al., 1997). where refers to the outcome, h the history (or context), and Z(h) is a normalization function. The features used in the maximum entropy framework are binary. An example of a feature function is The parameters αj are estimated by a procedure called MEGAM. MEGAM (MEGA Model Optimization Package) is an OCaml based Maximum Entropy project that originated from Utah university. MEGAM tends to perform much better in terms of speed and resource consum. The maximum entropy classifier is used to classify each word as one of the following: the beginning of a NE (B tag), a word inside a NE (I tag), NE delimiter word (O tag). During testing, it is possible that the classifier produces a sequence of inadmissible classes (e.g., O followed by I). To eliminate such sequences, we define a transition probability between word classes P(Ci|Cj) to be equal to 1 if the sequence is admissible, and 0 otherwise. The probability of the classes C1,...,Cn assigned to the words in a sentence 4 in a document 5 is defined as follows: where is determined by the maximum entropy classifier. The Viterbi algorithm is then used to select the sequence of word classes with the highest probability 2. Feature Representation We have used 2 types of features in our model, Global and Local. Global features are generalized features, they can be used universally with any corpus while local features are context specific, they depend on training and testing corpus. Global features - These features do not depend on the domain, language, they are quite generic and can be applied in any dataset. The global features include: • Bigrams: combination of current word with it’s previous, next word • Trigrams: combination of current word and it’s next, next to next, previous and previous to previous word • Bigrams of POS tags: combination of POS tags of current word and it’s next, previous word • Trigrams of POS tags: combination of POS tags of current word and it’s next, next to next, previous and previous to previous word • Previous IOB tag: IOB tag of previous word Local Features: These Features are contextual in nature to our training data. They are specific and can’t be generalized to other datasets. • Lemmatization: lemmatized word, lemmatizer that we used is specific to English language • Capitalized, PrevCapitalized, NextCapitalized: we tagged words with first letter capital, it’s very important in english language, which identify entities like geopolitical entities etc. • isNumeric: this tag is also specific to representation of entities like money, date etc. it might not work with other datasets, if they use word representations for these entities. • Tags-since-DT: this tag is also specific to language; other languages might not have concept of determiners. 3. Implementation We read the pre-tagged data from CoNLL Corpus Reader in the format (word, POS tag, IOB tag), which was sentence delimited. We had following IOB tags as part of our corpus ("O", "I", "B", "B-PERSON", "I-PERSON", "B-GPE", "I- GPE", “B-DATE”, “I-DATE”, “B-ORGANIZATION”, “I- ORGANIZATION”). We split the data using Sklearn library in 70/30 ratio. We prepared the feature list based on global and local features. We defined maxent classifier from NLTK library with MEGAM procedure. MEGAM is good for speed and resources. 
We gave 70% of the data, together with the feature list, as input to train the max-ent classifier. Once training finished, we tested it with the remaining 30% of the data and obtained 93.8% accuracy.
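A hedged sketch of this training and evaluation step, building on the reader and feature extractor sketched earlier, is shown below; variable names are illustrative, and the external megam binary must be installed and on the path for NLTK to find it.

    import nltk
    from sklearn.model_selection import train_test_split

    def to_instances(tagged_sents):
        """Turn (word, POS tag, IOB tag) sentences into (features, IOB tag) pairs."""
        instances = []
        for sent in tagged_sents:
            words = [(w, p) for (w, p, _) in sent]
            prev_iob = 'O'
            for i, (_, _, iob) in enumerate(sent):
                instances.append((word_features(words, i, prev_iob), iob))
                prev_iob = iob
        return instances

    data = to_instances(reader.iob_sents())
    train_set, test_set = train_test_split(data, train_size=0.7, random_state=42)

    nltk.config_megam()  # locate the external megam binary
    classifier = nltk.classify.MaxentClassifier.train(train_set, algorithm='megam')
    print('accuracy:', nltk.classify.accuracy(classifier, test_set))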
Fig. 3. High-level design of the Maximum Entropy model workflow

4. Challenges

The first and biggest challenge was understanding the data format requirements: we had to learn the IOB format and arrive at the exact format the classifier required. This took multiple iterations of data formatting and cleaning across the entire workflow, from the corpus reader through to classifier testing. We also had challenges with training and testing after incrementally adding new features to the feature list, because of the computational power required: execution time kept increasing with each additional feature. After reaching the current feature list it became very difficult to add and test further features, so we decided to stop at 93.8%.

5. Future Enhancements

Although the max-ent classifier is inherently low-bias, we could improve it further with k-fold cross-validation. We could also use a high-performance machine or distributed computing to train and test an expanded feature list.

D. Deep Neural Network Model with LSTM and GloVe Pre-trained Embeddings

1. Introduction

A less traditional and more recent approach to Named Entity Recognition makes use of Deep Neural Networks (DNNs). The approach relies on the good performance of DNNs on classification tasks: in essence, a NER problem can be interpreted as a classification problem in which every word belongs to a certain class (the IOB tag for the word). This simple analogy over-simplifies the need for the word's context and the fact that some named entities span more than one word, so additional details have to be taken into account for the network to recognize this.

Another tool we can leverage in solving this NLP problem with neural networks is embeddings. These are vector representations of words, used instead of the traditional one-hot encoding over the complete vocabulary analyzed. In a sense, embeddings are a dimensionality reduction of the vocabulary, so they also help reduce the number of input dimensions (and computations) required for the classification. We can build these embedding vectors from our own corpus vocabulary, or use pre-trained embeddings. Pre-trained embeddings are usually built on large corpora of documents (Wikipedia, web crawls, Twitter corpora) and would require a long time and significant computation to calculate, so it is not uncommon to reuse them across projects.
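As a hedged illustration, the GloVe vectors used later in this section can be loaded into a plain Python dictionary as follows; the file name matches the standard GloVe 6B (Wikipedia + Gigaword) distribution, and the local path is an assumption.

    import numpy as np

    # Map each vocabulary word to its 50-dimensional GloVe vector.
    embeddings = {}
    with open('glove.6B.50d.txt', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')

    print(len(embeddings), embeddings['news'].shape)  # ~400,000 entries of shape (50,)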
2. Implementation

Fig. 4. Deep Neural Network structure: input unigram, translation into an embedding vector, input layer, hidden layers, one-hot encoded output layer, and translation back to an IOB tag

In our implementation, we used Keras to simplify the construction of the network, with TensorFlow as the tensor calculation backend. The network has multiple layers (input, several hidden, and output). As each word is represented by a 50-dimensional embedding vector, the input layer has 50 units. The output layer has one unit per IOB tag in our training dataset ("O", "I", "B", "B-PERSON", "I-PERSON", "B-GPE", "I-GPE", "B-DATE", "I-DATE", "B-ORGANIZATION", "I-ORGANIZATION").

The pre-trained embedding set used for our project is GloVe [3] (Global Vectors), created at Stanford University. According to the project website: "GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space."

GloVe provides several sets of word vectors, pre-trained on different datasets (Wikipedia, Common Crawl, Twitter) and with different vector dimension sizes. Given our data domain (news articles), we selected the Wikipedia embeddings, which are also trained on the Gigaword dataset (an archive of newswire text). Because this set is provided in multiple vector dimensions (50d, 100d, 200d, 300d), we were able to treat dimensionality as another factor to vary and evaluate for its impact on accuracy. A sketch of a network of this general shape follows.
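The text above does not fix the exact layer configuration, so the following is only a sketch of a network of the shape described, with an LSTM hidden layer (per the section title) over a padded sequence of GloVe vectors and a softmax over the IOB tag set; the sequence length and hidden layer size are assumptions.

    from tensorflow import keras

    TAGS = ["O", "I", "B", "B-PERSON", "I-PERSON", "B-GPE", "I-GPE",
            "B-DATE", "I-DATE", "B-ORGANIZATION", "I-ORGANIZATION"]
    MAX_LEN = 40   # hypothetical padded sentence length
    EMB_DIM = 50   # GloVe vector dimensionality

    model = keras.Sequential([
        # Each sentence enters as a padded sequence of pre-trained GloVe vectors.
        keras.layers.LSTM(64, return_sequences=True, input_shape=(MAX_LEN, EMB_DIM)),
        # One-hot style softmax output over the IOB tag set, at every word position.
        keras.layers.TimeDistributed(keras.layers.Dense(len(TAGS), activation='softmax')),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()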
3. Challenges

As expected, most of the challenges appeared in the data transformation phase: adapting the data into the shape and format to be fed into the network. Loading the training/test dataset and manipulating it through the CoNLL2002 format reader alleviated some of that (obtaining just the words, or words and IOB tags, while skipping POS tags), so we recommend keeping all data in this format, as it is the easiest to manipulate with this NLTK helper object. Another challenge (also associated with data manipulation) was the word embeddings: loading the complete embedding set takes some time, and the volume depends on the number of dimensions selected (GloVe ships sets with 50, 100, 200 and 300 dimensions).

Features         Accuracy   Run duration
Global           88%        30 mins
Local + Global   93.8%      3 hr

Table II. Max Entropy results and training times for different feature sets

IV. RESULTS

A. Max Entropy Model

For model evaluation, we split the dataset into 70% training and 30% testing. We ran the model multiple times with global and local features. Accuracy increased with the number of features, but it plateaus after a certain threshold, and running with ever more features becomes very time and resource consuming (Table II).

B. Deep Neural Network Model

For model evaluation, we split the dataset into 70% training and 30% testing, the same as for the Max Entropy model. The model was run multiple times, using different sizes of GloVe embedding vectors: 50, 100, 200 and 300 dimensions. As expected (and as presented in the GloVe paper), dimensionality does have some impact on accuracy, increasing it up to a certain point, after which it becomes detrimental. In the case of our dataset, however, there were no changes (increase or decrease) in model accuracy across dimensions.

Fig. 5. Improvement in accuracy with the number of training epochs; accuracy plateaus at 6 epochs

For all vector dimensions, we ran the model for 1 to 10 epochs: results improved up to 6 epochs and then plateaued at 93.5% accuracy.
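For reference, a minimal sketch of that epoch sweep with Keras is shown below; X_train, y_train, X_test and y_test are hypothetical arrays of padded embedding sequences and one-hot IOB tags prepared as described above.

    # Train for 10 epochs and inspect validation accuracy after each one;
    # in our runs it stopped improving around epoch 6.
    history = model.fit(X_train, y_train,
                        validation_data=(X_test, y_test),
                        epochs=10, batch_size=32)
    print(history.history['val_accuracy'])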
V. CONCLUSIONS

Several conclusions were drawn from this project.

Dataset: obtaining a gold dataset for Named Entity Recognition is not a simple feat. We ran into a multitude of challenges (IOB format, manual vs. assisted tagging, data sanitization), and there is always a medium-to-high degree of manual effort required that should not be minimized.

Accuracy: performance for Maximum Entropy (93.8%) and the Deep Neural Network (93.5%) is comparable; there is no significant difference. We do not think this conclusion can be generalized (the literature still generally reports higher accuracy for Deep Neural Network models than for traditional Maximum Entropy), and it may be attributable to the dataset used.

Limitations: the performance obtained by the Maximum Entropy model is limited by the number of features, and there is no linear relationship between the number of features and model performance: each feature (and each combination of features) contributes independently to the model's performance. The performance of the Deep Neural Network, on the other hand, depends on the embeddings used: pre-trained vs. custom-built embeddings, as well as the number of dimensions, have an impact on the accuracy obtained.

Domain knowledge: the Maximum Entropy model is highly dependent on domain knowledge of the corpus language, as its features are directly related to the language's grammar rules (e.g., proper nouns in English are capitalized). The Deep Neural Network model is completely independent of grammar rules, so the process of creating a model can be reproduced in any language, as it depends only on the statistical properties of the corpora used for the embeddings and on the IOB-tagged dataset.

Computing resources: Maximum Entropy (as is common with traditional machine learning models) uses fewer computing resources than the Deep Neural Network, which required more processing capacity to train in reasonable time.

The final conclusion is that, because the Deep Learning model is less dependent on specific language grammar rules, it is more generalizable (provided embeddings and some labeled corpus are available in the target language), whereas the Maximum Entropy model will perform poorly on a language for which there is no domain knowledge to create the required features.

VI. TECHNICAL TOOLS

1. Python programming language on Jupyter Notebooks: all programming was done in Python in Jupyter notebook environments.
2. NLTK library for natural language processing: one of the most popular NLP packages in Python; it provides functionality for reading data in the CoNLL2002 format.
3. spaCy for assisted tagging: raw data was pre-tagged using spaCy and then manually reviewed.
4. Keras and TensorFlow for Deep Neural Networks: Keras simplified the job of creating the network; TensorFlow provides the backend for mathematical processing.
5. Google Cloud Datalab: for Deep Neural Network processing, some limitations of our local environments were resolved by using a cloud-managed Jupyter Notebook environment.

VII. REFERENCES, RESOURCES AND LITERATURE REVIEW

[1] R. Jiang, R. E. Banchs, and H. Li, "Evaluating and Combining Named Entity Recognition Systems." http://www.aclweb.org/anthology/W16-2703
[2] J. P. C. Chiu and E. Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs."
University of British Columbia and Honda Research Institute Japan Co., Ltd. https://www.aclweb.org/anthology/Q16-1026
[3] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global Vectors for Word Representation." Computer Science Department, Stanford University. https://www.aclweb.org/anthology/D14-1162
[4] spaCy, open-source software library for advanced natural language processing. Accessed April 3, 2018. https://spacy.io/usage/linguistic-features
[5] Ryerson University Library – Newspapers and Journals, online edition. Accessed April 22, 2018. http://learn.library.ryerson.ca/newspapers
Fady Fadel (Masters of Data Science and Analytics, Ryerson University 2018) was born in Basrah, Iraq in 1980. He received a Bachelor of Engineering in Information Technology from McMaster University in Hamilton, Ontario, Canada in 2009, and is a candidate for the Masters of Data Science and Analytics at Ryerson University in Toronto, Ontario, Canada. He is currently a Senior Data Analyst in the Infrastructure Capacity Engineering team at Cogeco Connexion, a Canadian telecommunications company serving residential video, high-speed Internet and telephony customers, along with data and voice transmission services and cloud-based applications for businesses.

Akshat Kashyap (Masters of Data Science and Analytics, Ryerson University 2018) was born in Ujjain, India in 1987. He received a Bachelor of Engineering in Information Technology from Rajiv Gandhi Technical University in Indore, India in 2009, and is a candidate for the Masters of Data Science and Analytics at Ryerson University in Toronto, Canada. He is currently a Senior Middleware Specialist at the Ministry of Government and Consumer Services, Ontario, which delivers vital programs, services, and products to Ontario citizens, ranging from health cards, driver's licences, and birth certificates to consumer protection and public safety.

Bernardo Najlis (Masters of Data Science and Analytics, Ryerson University 2018) was born in Buenos Aires, Argentina in 1977. He received a BS in Systems Analysis from Universidad CAECE in Buenos Aires, Argentina in 2007, and is a candidate for the Masters of Data Science and Analytics at Ryerson University in Toronto, Canada. He is currently a Solution Principal in the Information Management & Analytics team at Slalom, a worldwide consulting firm, working on cloud Big Data and advanced analytics projects.