Name: Jisu Oh, Shan Huang
   Date : April 12, 2004
   Course : Csci 8715
   Professor : Shashi Shekhar


                        Project Report (draft version)
                         “Spatial Outlier Detection”
   Shan Huang, Jisu Oh
   Computer Science Department, University of Minnesota, 200 Union Street SE,
   Minneapolis, MN 55455, U.S.A
   E-mail: shahuang@cs.umn.edu, joh@cs.umn.edu
   http://www-users.cs.umn.edu/~joh/csci8715/HW-list.htm




1. Introduction


   A spatial outlier is a spatially referenced object whose non-spatial attribute values
   differ significantly from those of the objects in its spatial neighborhood. Identifying
   spatial outliers can lead to the discovery of unexpected, interesting, and useful spatial
   patterns for further analysis. WEKA is a collection of machine learning algorithms
   for solving real-world data mining problems. It is written in Java and runs on almost
   any platform. WEKA implements basic data mining functions as well as regression,
   association rule, and clustering algorithms, but these algorithms can only operate on
   traditional non-spatial databases. The purpose of this project is to build a new class
   that can detect spatial outliers in a spatial data set.



2. Motivation


   Machine learning and data mining discover patterns or structure in data that were
   previously unknown to humans. They enable a computer program to analyze large-scale
   data automatically and decide what information is most important. We can then use this
   information to make predictions or to make decisions faster and more accurately.



   Many organizations rely on spatial analysis to make business and agency decisions
   and to conduct research. The main difference between data mining in relational
   databases and in spatial databases is that the attributes of neighboring objects may
   influence the current object, so the neighboring objects have to be considered as
   well. The explicit location and extension of spatial objects define implicit relations of
   spatial neighborhood, which are used by spatial data mining algorithms. Therefore,
   new techniques are required for effective and efficient spatial data mining.


   WEKA is a collection of machine learning algorithms for solving real-world data
   mining problems. It is written in Java and runs on almost any platform. WEKA
   implements basic data mining functions as well as regression, association rule, and
   clustering algorithms, but these algorithms can only operate on traditional
   non-spatial databases.


   The aim of this project is to build new classes and algorithms that can handle spatial
   data, for tasks such as spatial regression, spatial association rules (co-location), and
   spatial outlier detection.

2. Related work


   Detecting spatial outliers is useful in many applications of geographic information
   systems, including transportation, ecology, public safety, public health, climatology,
   and location based services [2].


   Shekhar et al. introduced a method for detecting spatial outliers in graph data sets
   based on the distribution property of the difference between an attribute value and the
   average attribute value of its neighbors [3]. Shekhar et al. also proposed an algorithm
   to find all outliers in a data set, which replaces many statistical discordance tests and
   requires no knowledge of the underlying distribution of the attributes [7].
   Stephen D. Bay et al. introduced a simple nested-loop algorithm to detect distance-based
   outliers, which gives near-linear time performance when the data is in random order and
   a simple pruning rule is used [4]. Existing methods for finding outliers can only deal
   efficiently with two dimensions/attributes of a data set.


A distance-based detection method was introduced by Sridhar Ramaswamy et al. It ranks
each point on the basis of its distance to its kth nearest neighbor and declares the top
n points in this ranking to be outliers. A highly efficient partition-based algorithm was
also introduced in the same paper [6]. Edwin M. Knorr et al. proposed another
distance-based outlier detection method that runs efficiently on large data sets and on
k-dimensional data sets with large values of k [9]. Spatial outliers are most often
represented as point data, but they also frequently appear as regions, i.e., groups of
points. Jiang Zhao et al. proposed a wavelet-analysis-based approach to detect region
outliers [5].
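
The kth-nearest-neighbor ranking described above can be illustrated with a minimal brute-force sketch. It is not the partition-based algorithm of [6]; the function name, the sample points, and the use of Euclidean distance are assumptions made only for illustration.

```python
import math

def knn_distance_outliers(points, k, n):
    """Return indices of the n points farthest from their kth nearest neighbor.

    Brute force, O(N^2 log N); assumes len(points) > k.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, in ascending order.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))  # distance to the kth nearest neighbor
    scores.sort(reverse=True)  # largest kth-neighbor distance first
    return [i for _, i in scores[:n]]

# Hypothetical data: four points in a unit square and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
top = knn_distance_outliers(points, k=2, n=1)
print(top)  # index of the isolated point
```

Here the isolated point at (10, 10) has by far the largest distance to its second nearest neighbor, so it is ranked first.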


Markus M. Breunig et al. took a different approach to detecting outliers: each object is
assigned a degree of being an outlier. This degree, called the local outlier factor of
the object, depends on how isolated the object is with respect to its surrounding
neighborhood [10].


Currently, many spatial statistics software packages are available. S-PLUS spatial
statistics is the first comprehensive, object-oriented software package for the
analysis of spatial data. It includes a fairly wide range of techniques for spatial data
analysis.
R is a language similar to S for statistical data analysis, based on modern
programming concepts and released under the GNU General Public License. Its
spatial statistics functions broadly follow the existing collections of functions written
for S and cover three types of spatial data: spatially continuous data, point pattern
data, and area data.


SAS is another powerful analytical and reporting system. The SAS Bridge for ESRI
provides a way to exchange spatial attribute data between ArcGIS, the market-leading
geographic information system (GIS) software from ESRI, and SAS. The product links
spatial, numeric, and textual data through a single interface to improve efficiency,
produce more intelligent results, and communicate those results more effectively.

3. Problem Statement


  The input data sets used in this project were collected from sensor stations
  embedded in the Interstate highways surrounding the Twin Cities area in Minnesota, US.
  Each station measures the traffic volume and occupancy on a particular stretch of
  highway at 5-min intervals. Each data set consists of 288 rows of 5-min detector
  records covering one day, starting from 0:00 AM; each row contains 300 tuples of
  (volume, occupancy) for 150 stations; each tuple in the row represents the traffic
  volume and occupancy recorded by a detector within the 5-min period. Neighbors are
  defined in terms of topological rather than Euclidean distance. Our objective is to
  determine the stations that are “outliers” based on the traffic volume measured at
  each station.
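
To make the record layout more concrete, here is a small sketch of how a row index maps to a time of day and how a row might be split into (volume, occupancy) tuples. The flat alternating layout `[v0, o0, v1, o1, ...]` and both function names are assumptions; the report does not specify the actual file format.

```python
def row_to_time(row_index, interval_min=5):
    """Clock time (HH:MM) at which a record starts; row 0 starts at 00:00."""
    minutes = row_index * interval_min
    return f"{minutes // 60:02d}:{minutes % 60:02d}"

def split_row(values):
    """Group a flat row [v0, o0, v1, o1, ...] into (volume, occupancy) tuples.

    The alternating layout is an assumption made for this sketch.
    """
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of values")
    return [(values[i], values[i + 1]) for i in range(0, len(values), 2)]

print(row_to_time(0))    # first record of the day
print(row_to_time(12))   # 12 * 5 min = 60 min after midnight
print(row_to_time(287))  # last of the 288 records
print(split_row([250, 30, 4, 1, 7, 2]))  # hypothetical readings
```

Since 288 records of 5 minutes each cover 1440 minutes, one data set spans exactly 24 hours.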


  A spatial outlier is a spatially referenced object whose non-spatial attribute values are
  significantly different from those of other spatially referenced objects in its spatial
  neighborhood. In this application, an outlier is a station that detects a very high
  volume compared to its neighboring stations. For instance, if at 1:00 AM station A
  detects a volume of 250 while its two neighbor stations B and C record only
  single-digit volumes, then station A would be considered a local outlier.


  The algorithm used in this project was proposed in the paper “A Unified Approach to
  Detecting Spatial Outliers” [7]. Each location is compared to its neighborhood using
  the function

      $S(x) = f(x) - E_{y \in N(x)}(f(y))$, where

          f(x) - attribute value for a location x
          N(x) - set of neighbors of x
          $E_{y \in N(x)}(f(y))$ - average attribute value for the neighbors of x
          S(x) - difference between the attribute value of a sensor located at x and
                 the average attribute value of x's neighbors.
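
As an illustration of the comparison function S(x), the following sketch computes it for a handful of stations. The station numbers, volumes, and the chain-shaped neighbor map are hypothetical; only the formula itself comes from the text.

```python
def neighborhood_difference(values, neighbors):
    """Compute S(x) = f(x) - average of f(y) over the neighbors y of x."""
    s = {}
    for x, fx in values.items():
        nbr_vals = [values[y] for y in neighbors[x]]
        e_x = sum(nbr_vals) / len(nbr_vals)  # E_{y in N(x)}(f(y))
        s[x] = fx - e_x
    return s

# Hypothetical volumes for five consecutive stations; station 3 reads 250.
values = {1: 5, 2: 4, 3: 250, 4: 8, 5: 6}
# Topological neighbors along the highway (end stations have one neighbor).
neighbors = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

s = neighborhood_difference(values, neighbors)
for station, diff in s.items():
    print(station, diff)
```

Station 3 gets S = 250 - (4 + 8) / 2 = 244, while its neighbors 2 and 4 get large negative values because the high reading enters their neighborhood averages.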


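    For normally distributed f(x), the following spatial statistic is used to detect
    spatial outliers; a location x is flagged as an outlier when

        $Z_{S(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$

    $\mu_s$ - mean value of S(x)
    $\sigma_s$ - standard deviation of S(x)
    $\theta$ - threshold determined by the specified confidence level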
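
A minimal sketch of this test follows. Two points are assumptions rather than statements from the report: the threshold is derived from the confidence level as a two-sided standard-normal quantile, and the population standard deviation is used. The S(x) values are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def theta_for_confidence(confidence):
    """Two-sided standard-normal threshold (assumed mapping), e.g. 0.95 -> ~1.96."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def z_scores(s_values):
    """Z_{S(x)} = |(S(x) - mu_s) / sigma_s| for every location x."""
    vals = list(s_values.values())
    mu = mean(vals)
    sigma = pstdev(vals)  # population standard deviation (an assumption)
    return {x: abs((s - mu) / sigma) for x, s in s_values.items()}

# Hypothetical S(x) values for ten stations; station 4 stands out.
s_values = {0: 1.0, 1: -1.5, 2: 1.5, 3: -123.0, 4: 244.0,
            5: -121.0, 6: 0.0, 7: 0.0, 8: -1.5, 9: 2.0}

theta = theta_for_confidence(0.95)
z = z_scores(s_values)
flagged = [x for x, zx in z.items() if zx > theta]
print(round(theta, 2), flagged)
```

With these values only station 4 exceeds the 95% threshold; its two neighbors have elevated but sub-threshold scores.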



4. Implementation

4.1 Algorithm
The algorithm is divided into two parts: (1) model construction and (2) outlier detection.
The first part computes E(x) = $E_{y \in N(x)}(f(y))$, the average attribute value for the
neighbors of x. For each station, its two neighbor stations are retrieved, and the average
of their volumes is computed. The second part detects one outlier per iteration. First,
S(x) = f(x) - E(x), the function that compares a station with its neighborhood, is
computed for each station; then, following the definitions in Section 3, the mean
$\mu_s$ and standard deviation $\sigma_s$ of all S(x) values are computed. Lastly, the
spatial statistic $Z_{S(x)} = \left| (S(x) - \mu_s) / \sigma_s \right|$ is computed and
compared to $\theta$, a user-specified value. In the outlier detection program, $\theta$
corresponds to a 68%, 95%, or 99% confidence interval. Once an outlier is identified, its
original value is replaced with the average value of its neighborhood, and the algorithm
starts over to find the second outlier, and so on. The number of outliers detected
depends on the user's specification; for instance, if the user needs to find 10 outliers
in a given data set, the algorithm runs for 10 iterations.
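
The two-part procedure can be sketched as follows. This is an illustration under stated assumptions, not the project's WEKA class: the function name is hypothetical, each iteration picks the station with the largest Z value, the loop stops early when no Z value exceeds the threshold, and the population standard deviation is used.

```python
from statistics import mean, pstdev

def detect_outliers(values, neighbors, theta, max_outliers):
    """Iteratively detect up to max_outliers stations, one per iteration."""
    values = dict(values)  # work on a copy so the caller's data is untouched
    found = []
    for _ in range(max_outliers):
        # Part 1, model construction: E(x) is the average value of x's neighbors.
        e = {x: sum(values[y] for y in neighbors[x]) / len(neighbors[x])
             for x in values}
        # Part 2, outlier detection: S(x), then its mean and standard deviation.
        s = {x: values[x] - e[x] for x in values}
        mu = mean(list(s.values()))
        sigma = pstdev(list(s.values()))  # population std (an assumption)
        if sigma == 0:
            break  # all stations agree with their neighborhoods
        z = {x: abs((s[x] - mu) / sigma) for x in values}
        worst = max(z, key=z.get)  # assumed rule: take the largest Z value
        if z[worst] <= theta:
            break  # assumed early stop: nothing exceeds the threshold
        found.append(worst)
        values[worst] = e[worst]  # replace the outlier with its neighborhood average
    return found

# Hypothetical volumes for ten stations in a chain; station 4 reads 250.
volumes = [5, 4, 6, 5, 250, 7, 6, 5, 4, 6]
values = dict(enumerate(volumes))
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(volumes)]
             for i in range(len(volumes))}

found = detect_outliers(values, neighbors, theta=1.96, max_outliers=3)
print(found)
```

After station 4 is replaced by its neighborhood average, the large negative differences at stations 3 and 5 disappear, so no further station exceeds the threshold in this example.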



4.2 User Interface


The user interface of our application is based on WEKA; in other words, it runs within the
WEKA environment. Its interface therefore looks like WEKA's, but it is extended to deal
with spatial outliers effectively. To find outliers, the user specifies three settings:
the data file, the confidence interval, and the number of outliers. These settings let
users see how the detected outlier set changes with their choices. Users can also run
the detection repeatedly, obtaining different outlier sets on the same data set.


Our system presents the detected outliers in three ways: plain text, overall traffic
volume for one day, and the neighbor relationship between stations. The ‘Outlier result’
panel displays plain text with detailed information about the time slot of the day, the
measured time, the station, and its volume. Users can also see an overview of this
information in one image containing two graphs: one shows the average traffic volume at
each time and each station, and the other shows the detected outliers by time slot and
station. Different colors in the graphs indicate different volumes, which helps give an
overall picture of the outliers. The last visual result is an image showing the volume of
a user-specified station and its neighborhood. In this image, users see three traffic
volume graphs and can compare them with each other, which lets them analyze the
relationship between the user-specified station and its neighborhood. For example,
suppose we want to see the traffic volume of station 24. The system displays the traffic
volumes of stations 23, 24, and 25. If the traffic volume patterns of stations 23 and 25
are very similar but that of station 24 is not, station 24 is likely to be one of the
outliers. In summary, the interface of our system consists of several visual components
that are easier to use than a command line. This user-centered interface is a major
difference from existing systems.


5. Methodology

We construct several experiments to test how accurately outliers are found on different
spatial data sets.
    1) Case study
We will find a set of outliers using different data sets and then analyze how accurately
they are found.



6. Contributions

The major contribution of this project is the development of an application that finds
spatial outliers within the WEKA system. WEKA provides basic data mining functions, but
these work only on non-spatial databases. Building a new class that can detect sets of
spatial outliers in a given spatial data set, and incorporating that class into the
existing WEKA, will enable the discovery of unexpected, interesting, and useful spatial
patterns for further analysis.



7. Conclusion
(Still in progress.)


8. Future work
      - Support additional file formats and data types
      - Provide a written analysis of the outlier information
      - Run experiments with different outlier detection algorithms to find a more
        efficient one
      - Add a tool to compare and contrast the results obtained with different outlier
        detection options



References

[1]    EXPLORATORY ANALYSIS OF SPATIAL DATA
[2]    Chang-Tien Lu, Dechang Chen, Yufeng Kou, “Algorithms for Spatial Outlier
       Detection”, 15th IEEE International Conference on Tools with Artificial
       Intelligence (ICTAI'03), November 3-5, 2003
[3]    Shashi Shekhar, Chang-Tien Lu, Pusheng Zhang, “Detecting graph-based spatial
       outliers: algorithms and applications (a summary of results)”, Proceedings of the
       seventh ACM SIGKDD international conference on Knowledge discovery and
       data mining, San Francisco, CA, USA, ACM, 2001
[4]    Stephen D. Bay, Mark Schwabacher, “Mining distance-based outliers in near
       linear time with randomization and a simple pruning rule”, Proceedings of the
       ninth ACM SIGKDD international conference on Knowledge discovery and
       data mining, pp. 29-38, Washington, D.C., ACM, 2003
[5]    Jiang Zhao, Chang-Tien Lu, Yufeng Kou, “Detecting region outliers in
       meteorological data”, Proceedings of the eleventh ACM international symposium
       on Advances in geographic information systems, pp. 49-55, New Orleans,
       Louisiana, USA, 2003
[6]    Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim, “Efficient algorithms for
       mining outliers from large data sets”, 2000 ACM SIGMOD international
       conference on Management of data, pp. 427-438, Dallas, Texas, USA, ACM, 2000
[7]    S. Shekhar, C. T. Lu, and P. Zhang, “A Unified Approach to Detecting Spatial
       Outliers”, GeoInformatica, pp. 139-166, 2003
[8]    Edwin M. Knorr, Raymond T. Ng, “A unified approach for mining outliers”,
       Proceedings of the 1997 conference of the Centre for Advanced Studies on
       Collaborative research, pp. 11, Toronto, Ontario, Canada, 1997
[9]    Edwin M. Knorr, Raymond T. Ng, Vladimir Tucakov, “Distance-based outliers:
       algorithms and applications”, The VLDB Journal - The International Journal on
       Very Large Data Bases, Volume 8, Issue 3-4, pp. 237-253, 2000
[10]   Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander, “LOF:
       identifying density-based local outliers”, 2000 ACM SIGMOD international
       conference on Management of data, pp. 93-104, ACM, New York, NY, USA,
       2000
[11]   Ian H. Witten and Eibe Frank, “Data Mining: Practical machine learning tools
       with Java implementations”, Morgan Kaufmann, San Fran




A Novel Framework For Numerical Character Recognition With Zoning Distance Fe...A Novel Framework For Numerical Character Recognition With Zoning Distance Fe...
A Novel Framework For Numerical Character Recognition With Zoning Distance Fe...
 
C1804011117
C1804011117C1804011117
C1804011117
 

report2.doc

Name: Jisu Oh, Shan Huang
Date: April 12, 2004
Course: Csci 8715
Professor: Shashi Shekhar

Project Report (draft version)
“Spatial Outlier Detection”
Shan Huang, Jisu Oh
Computer Science Department, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455, U.S.A.
E-mail: shahuang@cs.umn.edu, joh@cs.umn.edu
http://www-users.cs.umn.edu/~joh/csci8715/HW-list.htm

1. Introduction

A spatial outlier is a spatially referenced object whose non-spatial attribute values are significantly different from the values of its neighborhood. Identification of spatial outliers can lead to the discovery of unexpected, interesting, and useful spatial patterns for further analysis. WEKA is a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. Basic data mining functions as well as regression, association rules, and clustering algorithms have been implemented in WEKA, but these algorithms can operate only on traditional non-spatial databases. The purpose of this project is to build a new class that can detect spatial outliers in a spatial data set.

2. Motivation

Machine learning and data mining discover new facts or structure unknown to humans. They enable a computer program to automatically analyze large-scale data and decide what information is most important. We can then use this information to make predictions or to make decisions faster and more accurately.
Many organizations rely on spatial analysis to make business and agency decisions and to conduct research. The main difference between data mining in relational DBS and in spatial DBS is that the attributes of neighboring objects may influence the current object, so neighboring objects have to be considered as well. The explicit location and extension of spatial objects define implicit relations of spatial neighborhood, which are used by spatial data mining algorithms. Therefore, new techniques are required for effective and efficient spatial data mining.

WEKA is a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. Basic data mining functions as well as regression, association rules, and clustering algorithms have been implemented in WEKA, but these algorithms can operate only on traditional non-spatial databases. The aim of this project is to build new classes and algorithms that can handle spatial data, such as spatial regression, spatial association rules (co-location), and spatial outlier detection.

2. Related works

Detecting spatial outliers is useful in many applications of geographic information systems, including transportation, ecology, public safety, public health, climatology, and location-based services [2]. Shekhar et al. introduced a method for detecting spatial outliers in graph data sets based on the distribution property of the difference between an attribute value and the average attribute value of its neighbors [3]. Shekhar also proposed an algorithm to find all outliers in a dataset, which replaces many statistical discordance tests, regardless of any knowledge about the underlying distribution of the attributes [7]. Stephen D. Bay et al. introduced a simple nested loop algorithm to detect spatial outliers, which gives linear time performance when the data is in random order and a simple pruning rule is used [4].

Existing methods for finding outliers can only deal efficiently with two dimensions/attributes of a dataset. A distance-based detection method was introduced by Sridhar Ramaswamy et al., which ranks each point on the basis of its distance to its kth nearest neighbor and declares the top n points in this ranking to be outliers. A highly efficient partition-based algorithm was also introduced in that paper [6]. Edwin M. Knorr et al. proposed another distance-based outlier detection method that can be applied efficiently to large datasets, and to k-dimensional datasets with large values of k [9].

Spatial outliers are most often represented as point data, but they are also frequently represented as regions, i.e., groups of points. Jiang Zhao et al. proposed a wavelet-analysis-based approach to detect region outliers [5]. Markus M. Breunig et al. showed a different approach to detecting spatial outliers: each object is assigned a degree of being an outlier. This degree, called the local outlier factor of an object, depends on how isolated the object is with respect to its surrounding neighborhood [10].

Currently, many spatial statistics software packages are available. S-PLUS spatial statistics is the first comprehensive, object-oriented software package for the analysis of spatial data. It includes a fairly wide range of techniques for spatial data analysis. R is a language similar to S for statistical data analysis, based on modern programming concepts and released under the GNU General Public License. It follows a broad outline of existing collections of functions for spatial statistics written for S. Functions for three types of spatial statistics are covered: spatially continuous data, point pattern data, and area data. SAS is another powerful analytical and reporting system. The SAS Bridge for ESRI provides a new way to exchange spatial attribute data between ArcGIS, the market-leading geographic information system (GIS) software from ESRI, and SAS. This new product links spatial, numeric, and textual data through a single interface to improve efficiency, produce more intelligent results, and communicate those results more effectively.

3. Problem Statement

The input data sets used in this project were collected from sensor stations embedded in Interstate highways surrounding the Twin Cities area in Minnesota, US. Each station measures the traffic volume and occupancy on a particular stretch of the highway at 5-min intervals. Each data set consists of 288 rows of 5-min detector records, starting at 0:00 AM; each row contains 300 tuples of (volume, occupancy) for 150 stations; each tuple in the row represents the traffic volume and occupancy of the detector within the 5-min period. Neighbors are defined in terms of topological rather than Euclidean distance. Our objective is to determine the stations that are “outliers” based on the traffic volume measurements from each station.

A spatial outlier is a spatially referenced object whose non-spatial attribute values are significantly different from those of other spatially referenced objects in its spatial neighborhood. In this application, an outlier is a station that detects a very high volume compared with its neighboring stations. For instance, if at 1:00 AM station A detects a volume of 250 while its two neighbor stations B and C record only single-digit volumes, then station A would be considered a local outlier.

The algorithm used in this project was proposed in the paper “A Unified Approach to Detecting Spatial Outliers” [7]. A location is compared to its neighborhood using the function

$$S(x) = f(x) - E_{y \in N(x)}(f(y))$$

where
$f(x)$: attribute value for a location $x$
$N(x)$: set of neighbors of $x$
$E_{y \in N(x)}(f(y))$: average attribute value for the neighbors of $x$
$S(x)$: difference between the attribute value of a sensor located at $x$ and the average attribute value of $x$'s neighbors

The following spatial statistic is used for detecting spatial outliers for normally distributed $f(x)$:

$$Z_{S(x)} = \frac{S(x) - \mu_s}{\sigma_s} > \theta$$

$\mu_s$: mean value of $S(x)$
$\sigma_s$: standard deviation of $S(x)$
$\theta$: specified confidence level

4. Implementation

4.1 Algorithm

The algorithm is divided into two parts: (1) model construction and (2) outlier detection.

The first part of the algorithm finds $E_{y \in N(x)}(f(y))$, written $E(x)$ below, the average attribute value for the neighbors of $x$. For each station, its two neighbor stations are retrieved, and the average of the neighbor stations' volumes is computed.

The second part of the algorithm detects one outlier per iteration. For each station, $S(x) = f(x) - E(x)$ is computed; $S(x)$ is the function that compares a station with its neighborhood. The mean $\mu_s$ and the standard deviation $\sigma_s$ of $S(x)$ over all stations are then computed. Lastly, the spatial statistic $Z_{S(x)} = (S(x) - \mu_s)/\sigma_s$ is computed and compared to $\theta$, a user-specified value. In the outlier detection program, $\theta$ corresponds to a 68%, 95%, or 99% confidence interval. Once an outlier is identified, its original value is replaced with the average value of its neighborhood, and the algorithm starts over to find the second outlier, and so on. The number of outliers detected depends on the user's specification; for instance, if the user needs to find 10 outliers in a given data set, the algorithm will run for 10 iterations.
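The two parts of the algorithm can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's WEKA class: it assumes the stations lie along a single stretch of highway so that the neighbors of station i are simply stations i - 1 and i + 1, and the names `neighbor_average` and `detect_outliers`, the default `theta = 2.0`, and the early stop when no station exceeds the threshold are all assumptions made for the example.

```python
from statistics import mean, pstdev


def neighbor_average(volumes, i):
    """E(x): average volume of the stations adjacent to station i.

    Assumption: neighbors are the stations at indices i - 1 and i + 1;
    stations at either end of the list have a single neighbor.
    """
    nbrs = [volumes[j] for j in (i - 1, i + 1) if 0 <= j < len(volumes)]
    return mean(nbrs)


def detect_outliers(volumes, theta=2.0, max_outliers=10):
    """Return up to max_outliers (station index, original volume) pairs."""
    values = list(volumes)  # work on a copy; outliers get replaced below
    outliers = []
    for _ in range(max_outliers):
        # Part 1, model construction: E(x) for every station.
        e = [neighbor_average(values, i) for i in range(len(values))]
        # Part 2, outlier detection: S(x) = f(x) - E(x), then Z_S(x).
        s = [f - ex for f, ex in zip(values, e)]
        mu, sigma = mean(s), pstdev(s)
        if sigma == 0:
            break  # all differences are identical, nothing stands out
        z = [(sx - mu) / sigma for sx in s]
        best = max(range(len(values)), key=lambda i: z[i])
        if z[best] <= theta:
            break  # assumption: stop early when no station exceeds theta
        outliers.append((best, values[best]))
        # Replace the outlier with its neighborhood average and start over.
        values[best] = e[best]
    return outliers


if __name__ == "__main__":
    # One time slot: station 2 reports 250 while the rest are single digits.
    print(detect_outliers([5, 6, 250, 7, 5, 6, 4, 5, 6, 5], theta=2.0))
```

Replacing each detected value with its neighborhood average before the next pass keeps one extreme reading from inflating the mean and standard deviation used to judge the remaining stations.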
4.2 User Interface

The user interface of our application is based on WEKA; in other words, it runs in the WEKA environment. Its interface therefore looks like WEKA's, but it is designed to handle spatial outliers effectively. To find outliers, the user specifies three settings: the data file, the type of confidence interval, and the number of outliers. These settings allow users to see how the detected outlier sets differ depending on their choices. Users can also run the detection repeatedly, so they can obtain different outlier sets from the same data set.

Our system presents the detected outliers in three different ways: plain text, overall traffic volume for one day, and the neighbor relationship between stations. The ‘Outlier result’ panel displays plain text, which consists of detailed information about the time slots of one day, the measured time, the stations, and their volumes. Users can also see an overall view of this information in a single image with two graphs: one shows the average traffic volume at each time and each station, and the other shows the detected outliers by time slot and station. Different colors in the graphs indicate different volumes. This view helps users get an overall picture of the outliers. The last visual result is an image showing the volume of a user-specified station and its neighborhood. Using this image, users see three different traffic volume graphs and can compare them with each other. This enables users to analyze the relationship between the user-specified station and its neighborhood. For example, suppose we want to see the traffic volume of station 24. The system displays the traffic volume of stations 23, 24, and 25. From this display, users can see that the traffic volume patterns of stations 23 and 25 are very similar but that of station 24 is not, so station 24 should be one of the outliers.

As described above, the interface of our system consists of several visual components, which makes it easier to use than a command line. This user-centered interface is a major difference from existing systems.

5. Methodology

We will construct several experiments to test how accurately outliers are found using different spatial data.

1) Case study

We will find a set of outliers using different data sets and then analyze how accurately they are found.

6. Contributions

The major contribution of this project is the development of an application to find spatial outliers using the WEKA system. WEKA provides basic data mining functions, but these work only on non-spatial databases. Building a new class that can detect sets of spatial outliers in a given spatial data set, and incorporating that class into the existing WEKA, will enable the discovery of unexpected, interesting, and useful spatial patterns for further analysis.

7. Conclusion

Still in progress.

8. Future work

- Upgrade to allow various file formats and data types
- Provide written analysis of outlier information
- Run experiments to find a more efficient algorithm using different outlier detection algorithms
- Provide a tool to compare or contrast the results obtained from different outlier detection options

References

[1] EXPLORATORY ANALYSIS OF SPATIAL DATA
[2] Chang-Tien Lu, Dechang Chen, Yufeng Kou, “Algorithms for Spatial Outlier Detection”, 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'03), November 03 - 05, 2003
[3] Shashi Shekhar, Chang-Tien Lu, Pusheng Zhang, “Detecting graph-based spatial outliers: algorithms and applications (a summary of results)”, Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, San Francisco, CA, USA. ACM, 2001
[4] Stephen D. Bay, Mark Schwabacher, “Research track: Mining distance-based outliers in near linear time with randomization and a simple pruning rule”, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 29-38, Washington, D.C. ACM, 2003
[5] Jiang Zhao, Chang-Tien Lu, Yufeng Kou, “Detecting region outliers in meteorological data”, Proceedings of the eleventh ACM international symposium on Advances in geographic information systems, pp. 49-55, New Orleans, Louisiana, USA, 2003
[6] Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim, “Efficient algorithms for mining outliers from large data sets”, 2000 ACM SIGMOD international conference on Management of data, pp. 427-438, Dallas, Texas, USA. ACM, 2000
[7] S. Shekhar, C. T. Lu, and P. Zhang, “A Unified Approach to Detecting Spatial Outliers”, GeoInformatica, pp. 139-166, 2003
[8] Edwin M. Knorr, Raymond T. Ng, “A unified approach for mining outliers”, Proceedings of the 1997 conference of the Centre for Advanced Studies on Collaborative research, pp. 11, Toronto, Ontario, Canada, 1997
[9] Edwin M. Knorr, Raymond T. Ng, Vladimir Tucakov, “Distance-based outliers: algorithms and applications”, The VLDB Journal - The International Journal on Very Large Data Bases, pp. 237-253, Volume 8, Issue 3-4, 2000
[10] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander, “LOF: identifying density-based local outliers”, 2000 ACM SIGMOD international conference on Management of data, pp. 93-104, ACM, New York, NY, USA, 2000
[11] Ian H. Witten and Eibe Frank, Morgan Kaufmann, “Data Mining: Practical machine learning tools with Java implementations,” San Fran