Reaction Paper Discussing Articles in Fields of Outlier Detection & Sentiment Analysis
Khan Mostafa
Graduate Student, Computer Science, Stony Brook University, NY 11794, USA
Email: khan.mostafa@stonybrook.edu
Student ID# 109365509
ABSTRACT
This reaction paper is submitted as an assignment to critique and brainstorm upon
reading a few papers.
1 INTRODUCTION
In this article I discuss four papers, two in the field of outlier detection and two in the field of
sentiment analysis. The first publication introduces LOF (Local Outlier Factor), a density-based
approach to detecting outliers. LOF is a very useful and widely employed approach, although it
does not work well in high dimensions. The second article proposes using angular measures to
detect outliers in high dimensions. The next two articles address sentiment analysis; one examines
appraisal taxonomies for sentiment analysis, and the other is about using Twitter as a corpus for
sentiment analysis.
2 RELATED TERMS
2.1 Outlier detection
Two of the discussed papers address outlier detection. An outlier is significantly different from
the rest of the data (a.k.a. the normal points). In terms of clusters, an outlier is a point that does
not fit into any cluster. Outliers can be of interest to many. In particular, it is interesting to detect
anomalies. Anomalous events or objects cannot be detected using supervised learning, because the
nature of anomalies is unknown in advance; hence an unsupervised method is more suitable.
Outlier detection can also be applied before clustering a dataset: removing outlying objects
beforehand leads to better clusters. Sometimes, outliers are the outstanding or crucial points of a
system.
2.2 Sentiment Analysis
Identifying sentiment is important to many. In particular, corporations, politicians, and banks
want to know how people feel about a certain product, campaign, or thing. Sentiment is generally
an expression of emotion or feeling regarding some object.
Generally, a human can identify sentiment by reading text. But to understand public opinion,
applications need to extract sentiment from massive amounts of text. To do this, approaches are
drawn from fields spanning data mining, natural language processing, and statistics.
(Reaction Paper submitted for CSE590 Networks and Data Mining Techniques on 22/10/2013.)
The trend in sentiment analysis is to identify whether some text is subjective and whether it
conveys positive or negative sentiment.
3 REACTIONS AND OUTLINES
In this section, the method presented in each paper is briefly outlined and then reacted upon.
3.1 Local Outlier Factor for Outlier Detection
Earlier approaches to outlier detection considered outliers globally. However, a more appropriate
way of measuring outliers is to measure how far they deviate from the cluster they would belong
to if they were not outliers; outlierness should be calculated locally, based on how deviant a point
is from its neighbors. One early approach to considering outliers locally was by Knorr and Ng
(Knorr and Ng, Finding Intensional Knowledge of Distance-Based Outliers 1999) (Knorr and Ng,
Algorithms for Mining Distance-Based Outliers in Large Datasets 1998), who proposed the notion
of distance-based outlier detection. A more efficient algorithm was later proposed that considers
the distance to the k nearest neighbors (Ramaswamy, Rastogi and Shim 2000). However, distance
is not an appropriate measure when the density of clusters varies. The work being examined
(Breunig, et al. 2000) advanced the local approach by introducing a density-based concept, the
Local Outlier Factor (LOF).
The authors posit that “being outlying is not a binary property”. Hence, for each point a score, the
LOF, is calculated, which estimates its degree of being an outlier. For this, they calculate the
reachability distance of each point and then estimate its reachability density. The reachability
density is the inverse of the average reachability distance with respect to the k (= MinPts) nearest
neighbors. The LOF of a point p is then calculated as “the average of the ratio of the local reachability
density of p and those of p’s MinPts-nearest neighbors”. LOF is higher the further a point lies from its
nearest neighbors. A point deep inside a cluster (a non-outlying point) has reachability distances
similar to those of its neighbors, and hence a similar reachability density; it has been shown that
the LOF of such non-outlying points is approximately one.
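The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the function names, the toy data, and the choice of k (playing the role of MinPts) are ours.

```python
# Minimal sketch of LOF: k-distance, reachability distance, local
# reachability density, and the LOF score itself, on small 2-D points.
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i], plus its k-distance."""
    dists = sorted(
        (math.dist(points[i], points[j]), j) for j in range(len(points)) if j != i
    )
    return [j for _, j in dists[:k]], dists[k - 1][0]

def reach_dist(points, k, i, j):
    """Reachability distance of i w.r.t. j: max(k-distance(j), d(i, j))."""
    _, kdist_j = knn(points, j, k)
    return max(kdist_j, math.dist(points[i], points[j]))

def lrd(points, k, i):
    """Local reachability density: inverse of the mean reachability distance."""
    nbrs, _ = knn(points, i, k)
    return len(nbrs) / sum(reach_dist(points, k, i, j) for j in nbrs)

def lof(points, k, i):
    """LOF: mean ratio of the neighbours' lrd to the point's own lrd."""
    nbrs, _ = knn(points, i, k)
    return sum(lrd(points, k, j) for j in nbrs) / (len(nbrs) * lrd(points, k, i))
```

On a tight cluster plus one distant point, the cluster members score close to 1 while the distant point scores well above 1, matching the behavior described above.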
The estimation of LOF is largely influenced by the parameter MinPts, the number of neighboring
points with respect to which the reachability density is measured. If MinPts is larger than the size
of some cluster C, then all points in C will have LOF much larger than 1. Conversely, if MinPts is
much smaller, then outliers that are neighbors to n < MinPts other outlying points may have a
LOF score of approximately one. Therefore, the authors suggest choosing MinPts heuristically.
Since its proposal, LOF has gained a lot of attention and has been widely studied over the last
decade. As LOF depends on the parameter MinPts, another approach, the Local Correlation
Integral (LOCI), was suggested (Papadimitriou, et al. 2003). Here, a sampling neighborhood of
radius r and a counting neighborhood of radius αr are considered. For a point p, the points in its
sampling neighborhood are taken, and for each such point the cardinality of its counting
neighborhood is computed. These counts, combined with their mean and standard deviation, are
used to score outliers. The authors also introduced the concept of the LOCI plot. However,
LOCI's estimates depend on the input parameter α.
Some other variations and extensions that have been studied are:
- Outlier Detection using In-degree Number (ODIN) (Hautamaki, Karkkainen and Franti 2004)
- Connectivity-based Outlier Factor (COF) (Tang, et al. 2002)
- using a probabilistic suffix tree (PST) to detect nearest neighbors
- an approach to enhance efficiency by first creating micro-clusters (Jin, Tung and Han 2001)
- another study covering the case when clusters of different densities lie close together
Many more variations and extensions have been studied; they are not surveyed here.
LOF is a very useful measure, as it can identify outliers in a local context. It also covers global
outliers, since every global outlier is a local outlier as well. A major weakness, however, is its
computational cost of O(n²). To reduce this complexity, one approach might be to use some kind
of locality hashing, where a prior pass hashes each point into a bucket consisting of its neighboring
points. A grid-based approach can also be employed: for a point, k (= MinPts) points are chosen
at random from the bucket (or grid cell) it belongs to, and if a cell holds fewer than k points,
nearby cells can be consulted. A further improvement could be to precompute the reachability
distances for each cell while building the grid.
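The grid-bucketing idea above might be sketched as follows. The cell size h, the restriction to 2-D points, and the helper names are simplifying assumptions for illustration; neighbour candidates are drawn from a point's own cell and the adjacent cells instead of scanning all n points.

```python
# Hash 2-D points into square cells of side h; neighbour candidates for
# a query come from its own cell plus the 8 adjacent cells.
from collections import defaultdict

def build_grid(points, h):
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(int(x // h), int(y // h))].append(idx)
    return grid

def neighbour_candidates(grid, point, h):
    """Indices of points in the cell of `point` and the adjacent cells."""
    cx, cy = int(point[0] // h), int(point[1] // h)
    cands = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            cands.extend(grid.get((cx + dx, cy + dy), []))
    return cands
```

If a cell and its neighbours yield fewer than k candidates, the search would widen to farther cells, as suggested above.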
Some normal points can have very few neighbors; in such cases, LOF might assign them a high
score, falsely indicating them as outlying.
LOF is density-based, and density is defined in terms of distance. In higher dimensions, distances
become almost identical for all points (the curse of dimensionality), so LOF cannot be directly
employed. However, feature bagging is often suggested as a remedy.
LOF can also be used in high dimensions when a few features need to be selected. In this case, a
few features can be used at a time to estimate LOFs. Those sets of features which yield less diverse
LOFs (i.e., high LOF scores for fewer points) are potentially good feature approximations.
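The per-subspace scoring idea above can be sketched roughly as follows. For brevity a k-th-nearest-neighbour distance stands in for the per-subspace outlier score (a full LOF could be substituted), and all parameter choices are illustrative.

```python
# Feature bagging sketch: score each point in several random feature
# subsets and average the scores across rounds.
import math
import random

def knn_score(points, i, k):
    """Distance to the k-th nearest neighbour of points[i] (proxy score)."""
    dists = sorted(math.dist(points[i], points[j])
                   for j in range(len(points)) if j != i)
    return dists[k - 1]

def feature_bagged_scores(data, k=3, rounds=10, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    totals = [0.0] * len(data)
    for _ in range(rounds):
        # pick a random subset of between dim//2 and dim features
        size = rng.randint(max(1, dim // 2), dim)
        feats = rng.sample(range(dim), size)
        proj = [tuple(row[f] for f in feats) for row in data]
        for i in range(len(data)):
            totals[i] += knn_score(proj, i, k)
    return [t / rounds for t in totals]
```

A point that is outlying in every subspace accumulates a high average score; subsets in which few points score highly would, per the suggestion above, be candidate feature sets.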
LOF is a spatial algorithm; hence, it cannot be used in situations where no distance measure
exists.
LOFs can also be used to cluster points, for instance within a hierarchical clustering scheme:
when a point's LOF is calculated to be approximately 1, it can be assigned to the cluster its
neighbors belong to.
LOF can further be used to identify anomalies within clusters. Say a small region within a cluster
is significantly more or less dense than the rest; the points in that region will receive LOF scores
that differ from those of the cluster's other points.
3.2 Angle-Based Outlier Detection in Higher Dimensions
In higher dimensions, distance becomes uniform. However, (Kriegel, Schubert and Zimek 2008)
posed the assumption that angles are more stable and that outliers reside in the periphery. Their
method (ABOD) calculates the angles to other points from the point in question. A point is said
to be outlying if most points lie to one side of it, i.e., if the angles seen from this point are all
similar, while normal points see a broad spread of angles.
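A simplified sketch of this angle-based idea: for each point, compute the spread (variance) of the cosines of the angles between vectors to all pairs of other points; a small spread suggests an outlier. The actual ABOF in the paper also weights each term by the distances involved, which this sketch omits.

```python
# Unweighted angle-spread sketch: outliers see the rest of the data on
# one side, so the angles (here, cosines) from them vary little.
import itertools
import math

def cos_angle(p, a, b):
    """Cosine of the angle at p between vectors p->a and p->b."""
    va = [ai - pi for ai, pi in zip(a, p)]
    vb = [bi - pi for bi, pi in zip(b, p)]
    dot = sum(x * y for x, y in zip(va, vb))
    return dot / (math.hypot(*va) * math.hypot(*vb))

def angle_spread(points, i):
    """Variance of cosines over all pairs of other points, seen from points[i]."""
    p = points[i]
    others = [q for j, q in enumerate(points) if j != i]
    cosines = [cos_angle(p, a, b) for a, b in itertools.combinations(others, 2)]
    mean = sum(cosines) / len(cosines)
    return sum((c - mean) ** 2 for c in cosines) / len(cosines)
```

For a cluster plus one distant point, the distant point's spread is near zero while cluster members show a markedly larger spread.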
A drawback of ABOD is its very high computational complexity. This can be mitigated by
sampling random points, since the angles from an outlying point will be similar even if the
reference points are chosen at random. Another way is to divide the features into subspaces and
calculate the angles within each subspace; these measures can then be combined.
3.3 Appraisal Taxonomies in Sentiment Analysis
Sentiment analysis has been heavily investigated for more than a decade, instigated by Pang, et
al. (Pang, Lee and Vaithyanathan 2002), who attempted to solve the problem of sentiment
classification as a case of topic-based categorization. Sentiment, however, at any granularity level
(viz. article, paragraph, or sentence) is generally perceived by humans as appraisal. Many words
and phrases are used to praise, and many are used to express negative comments about things.
This case was investigated by Whitelaw, et al. (Whitelaw, Garg and Argamon 2005).
They identified the need for semantic analysis of attitude expressions and hypothesized that the
atomic units of sentiment expression are not individual words but rather appraisal groups
(Attitude, Orientation, Graduation and Polarity) [see Appraisal Theory (Martin and White
2005)]. Based on WordNet and two other thesauri, they constructed a lexicon, using a coarse
ranking of relevance to enlist terms; the final set of terms, however, was produced by manual
examination. They then tested several feature sets and found that the union of bag-of-words and
appraisal groups by attitude and orientation (BoW + GAO) yields the best result.
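The BoW + GAO combination could be illustrated roughly as below. The tiny appraisal lexicon here is hypothetical; the paper derives its lexicon from WordNet and two thesauri, with manual review.

```python
# Combine plain bag-of-words counts with appraisal-group features
# (attitude/orientation) looked up from a small, hypothetical lexicon.
from collections import Counter

APPRAISAL = {
    "excellent": ("appreciation", "positive"),
    "awful":     ("appreciation", "negative"),
    "happy":     ("affect",       "positive"),
}

def features(text):
    words = text.lower().split()
    feats = Counter(words)                  # bag-of-words counts
    for w in words:
        if w in APPRAISAL:
            att, ori = APPRAISAL[w]
            feats[f"GAO={att}/{ori}"] += 1  # appraisal-group feature
    return feats
```

For example, features("an excellent camera") contains both the word count for "excellent" and the group feature "GAO=appreciation/positive"; a classifier would then be trained on the union of both feature types.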
The proposed approach is not very scalable, as it requires a lot of manual labor and cannot create
absolute appraisal estimates for a large number of words. It also employs a computationally
intensive classification technique. Still, this investigation brings to light the fact that appraisal is
not expressed through adjectives alone; other parts of speech in a sentence are also responsible
for its sentiment. Several studies have since tried to employ adverbs and verbs along with
adjectives to estimate sentiment.
Overall, it can be said that sentiment is expressed through the tone of a sentence, and different
parts of speech occur differently in positive and negative statements. Hence, the subjectivity of
statements can be scored by part-of-speech tagging them and then estimating with some classifier.
Furthermore, nouns and names can also embody polarity, especially when comparative phrases
are used. The same word can also express different feelings in different contexts. One approach
could be to enlist appraisal scores of words along with their contexts. Yet some problems may
remain, since qualifiers may indicate opposite feelings when used with different words (viz. fast
access as opposed to fast heating in a PC RAM description).
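The context-dependent qualifier problem suggests keying an appraisal lexicon by (qualifier, head term) pairs rather than by the qualifier alone. The entries below are hypothetical examples, not drawn from any of the discussed papers.

```python
# A (qualifier, context) appraisal lexicon lets the same word flip
# polarity between contexts, e.g. "fast access" vs "fast heating".
CONTEXT_LEXICON = {
    ("fast", "access"):  +1,   # fast access is desirable
    ("fast", "heating"): -1,   # fast heating (of RAM) is not
    ("fast", None):      +1,   # fallback when the context is unknown
}

def qualifier_score(qualifier, head):
    """Score a qualifier in context, falling back to its default polarity."""
    return CONTEXT_LEXICON.get(
        (qualifier, head), CONTEXT_LEXICON.get((qualifier, None), 0)
    )
```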
3.4 Sentiment Analysis in Twitter
Twitter is a widely used micro-blogging platform where people often convey sentiment, and
several studies have tried to analyze sentiment on it. One of them is by A. Pak and P. Paroubek
(Pak and Paroubek 2010), in which the authors build a sentiment corpus from tweets. They
exploited the fact that users put emoticons in tweets, and used those emoticons to build a
sentiment lexicon. Along with that, they trained classifiers based on part-of-speech tagging, and
then used these, with the help of n-gram and POS features, to estimate the sentiment of tweets.
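The pipeline can be sketched as follows: label tweets noisily by the emoticons they contain, count unigrams per label, and classify with Naive Bayes. The emoticon sets and sample logic are illustrative assumptions; the paper's actual system additionally uses n-grams and POS-tag features and treats newswire tweets as objective.

```python
# Emoticon-labeled corpus + unigram Naive Bayes with add-one smoothing.
import math
from collections import Counter, defaultdict

POS_EMO = {":)", ":-)", ":D"}
NEG_EMO = {":(", ":-("}

def emoticon_label(tweet):
    """Noisy label from emoticons, as in the corpus-building step."""
    tokens = tweet.split()
    if any(t in POS_EMO for t in tokens):
        return "positive"
    if any(t in NEG_EMO for t in tokens):
        return "negative"
    return None  # no emoticon: leave the tweet unlabeled here

def train(tweets):
    """Per-label unigram counts from emoticon-labeled tweets."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        label = emoticon_label(tweet)
        if label is None:
            continue
        words = [t.lower() for t in tweet.split() if t not in POS_EMO | NEG_EMO]
        counts[label].update(words)
    return counts

def classify(counts, text):
    """Pick the label maximizing the smoothed unigram log-likelihood."""
    vocab = {w for c in counts.values() for w in c}
    def loglik(label):
        c, total = counts[label], sum(counts[label].values())
        return sum(math.log((c[w] + 1) / (total + len(vocab)))
                   for w in text.lower().split())
    return max(counts, key=loglik)
```

Because labeling is fully automatic, the corpus can grow as large as the tweet stream allows, which is what makes this approach attractive.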
Their work is an elegant one, as it can estimate sentiment in real time. However, one extension
to their work would be to use a sentiment score in place of strict classification (negative, positive,
objective). Their approach also cannot deal with contextual sentiment dependency. One way to
handle this would be to first extract keywords from the tweet, then associate appraisal keywords
with objective key terms to estimate how each subjective word expresses sentiment for that
particular key term. The key terms could even be used to identify a context category first.
Sentiment analysis struggles when sarcastic or ironic speech is used. Although there are some
studies addressing this, it requires more investigation, and more rigorous language processing
and logic techniques may be needed to estimate irony effectively. Hence, a perfect sentiment
analysis tool is yet to emerge.
4 CONCLUSION
In this paper, four articles are discussed. The first pair of papers, on outlier detection, address
different facets of the same problem. LOF is widely studied and used; hence there is a multitude
of approaches to enhance and extend it, as well as to improve its computational complexity. The
second article (ABOD) is also motivated by LOF. At present, not much connection is found
between sentiment analysis and outlier detection. However, outlier detection can be useful in
opinion mining of mass data: when opinion about some entity is mined, the first step is to pull
statements about that entity, and these pulled statements may include some that are actually not
about that very entity. Such outlying statements can be filtered out to better reflect sentiment
about the entity.
5 BIBLIOGRAPHY
Breunig, Markus M., Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. "LOF:
Identifying Density-Based Local Outliers." International Conference on Management of Data SIGMOD. ACM. 93-104.
Kriegel, Hans-Peter, Matthias Schubert, and Arthur Zimek. 2008. "Angle-based outlier detection
in high-dimensional data." Knowledge Discovery and Data Mining - KDD. 444-452.
Hautamaki, Ville, Ismo Karkkainen, and Pasi Franti. 2004. "Outlier detection using k-nearest
neighbour graph." Proceedings of the 17th International Conference on Pattern Recognition, ICPR
2004. IEEE. 430-433.
Jin, Wen, Anthony K. H. Tung, and Jiawei Han. 2001. "Mining top-n local outliers in large
databases." Knowledge Discovery and Data Mining - KDD. 293-298.
Knorr, Edwin M., and Raymond T. Ng. 1998. "Algorithms for Mining Distance-Based Outliers in
Large Datasets." Very Large Data Bases - VLDB. 392-403.
—. 1999. "Finding Intensional Knowledge of Distance-Based Outliers." Very Large Data Bases -
VLDB. 211-222.
Martin, J. R., and P. R. R. White. 2005. Language of Evaluation: Appraisal in English. London: Palgrave.
http://grammatics.com/appraisal/.
Pak, Alexander, and Patrick Paroubek. 2010. "Twitter as a Corpus for Sentiment Analysis and
Opinion Mining." Language Resources and Evaluation. 1320-1326.
Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. "Thumbs up? Sentiment Classification
using Machine Learning Techniques." Proceedings of the ACL-02 conference on Empirical methods
in natural language processing. Philadelphia, PA, USA: Association for Computational
Linguistics. 79-86.
Papadimitriou, Spiros, Hiroyuki Kitagawa, Phillip B. Gibbons, and Christos Faloutsos. 2003.
"LOCI: Fast outlier detection using the local correlation integral." Proceedings. 19th
International Conference on Data Engineering. IEEE. 315-326.
Ramaswamy, Sridhar, Rajeev Rastogi, and Kyuseok Shim. 2000. "Efficient Algorithms for Mining
Outliers from Large Data Sets." Proc. ACM SIGMOD Int. Conf. on Management of Data.
ACM. 427-438.
Tang, Jian, Zhixiang Chen, Ada Wai-Chee Fu, and David W. Cheung. 2002. "Enhancing
effectiveness of outlier detections for low density patterns." In Advances in Knowledge
Discovery and Data Mining, 535-548. Springer Berlin Heidelberg.
Whitelaw, Casey, Navendu Garg, and Shlomo Argamon. 2005. "Using appraisal groups for
sentiment analysis." Proceedings of the 14th ACM international conference on Information and
knowledge management. ACM. 625-631.