Introduction to Text Mining & Support Vector Machines (SVM)

Dr. Anton Heijs, CEO
July 2012

Treparel
Delftechpark 26
2628 XH Delft
The Netherlands
www.treparel.com
KMX enables information and knowledge professionals
to gain faster, more reliable, and more precise insights into large,
complex, unstructured data sets, allowing them to make
better-informed decisions.




                   Treparel is a leading technology solution provider in
                         Big Data Text Analytics & Visualization

Treparel KMX – All rights reserved 2012   www.treparel.com                 2
Topics covered in this presentation


         • Who is Treparel?
         • Introduction to Text Mining
         • What is Automated Classification & Clustering?
         • Introducing Support Vector Machines




Nexus of Forces: Social, Cloud, Mobile, Information
         IT Market shift driving Big Data challenges
                                                                                 Copyright: Gartner, 2011




                 80% of data is Unstructured (Documents, Text, Images, Graphs)



About Treparel

         • Delft, The Netherlands, 2006.
         • Treparel is an innovative technology solution provider in Big Data
           Analytics, Text Mining and Visualization.
         • KMX is an integrated data analysis toolset that provides faster,
           more reliable, intelligent insights into large, complex, unstructured data sets to
           allow companies to make better-informed decisions.
         • Clients: Philips, Bayer, Abbott, European Patent Office, European
           Commission
         • Part of a research center and university ecosystem: TU Delft and the
           Universities of Paris and São Paulo
         • More info: www.treparel.com




Positioning of Treparel’s KMX technology

Text Acquisition & Preparation ('Seek')
         • External sources: Patents, Legal, Research, Media / Publishers
         • Other sources: Documents, Websites, Blogs, Newsfeeds, Email,
           Application notes, Search results, Social networks

Analysis and processing ('Model')
         • Text preprocessing
         • Indexing
         • Clustering
         • Classification
         • Semantic Analysis
         • Visualization

Output and display ('Adapt')
         • Reporting & Presentation
         • Media and publishing databases
         • Content management systems
         • Line-of-business applications
         • Research applications
         • Search engines

Supporting layers across all three stages
         • Information extraction (entities, facts, relationships, concepts, patents)
         • Management, Development and Configuration

Copyright: Gartner, J. Popkin 2010
Getting to know the basics

        PART A: Intro to Text Mining
        • The Data (text & image) Mining evolution
        • What is Data Mining: inside or outside the database
        • The Data Mining process
        • Two types of Data Mining tasks: Predictive and Descriptive
        • Two modes of Data Mining tasks: Supervised and Unsupervised
        • The most important algorithms per category


        PART B: SVM
        • Machine Learning & Support Vector Machines (SVM)
        • What makes SVM unique
        • When and How to deploy SVM
        • Case Studies & Examples


The Data/Text/Image mining evolution
         The Road ahead
         [Figure: application value (low to high) plotted against usability (hard to use to easy to use)]
         Chart labels: Traditional Data Mining; "Easy-to-Use" Data Mining Tools; OLAP;
         Query and Reporting; SVM Predictive Modeling; Analytical Modeling; Enterprise Text Analytics
         Time markers: 1980's, 1990's, 1995 - 2000, Today, Future

Knowledge Mining
         Different levels of depth in knowledge discovery

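         [Figure: depth of knowledge discovery over time, from Data Collection (Seek) up to Visualization (Adapt)]
         Levels shown: Filtered data; Meta Data; Models of meta data; Models of data; Models of semantic data
         Techniques shown: Data Mining, Text Mining, Graph Mining (Knowledge Discovery)
         Horizontal axis: Time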
What is Data Mining?
           Getting to know the basics
        • Most businesses hold an enormous amount of data with a great deal of
          information hidden in it. The data is also growing faster than the knowledge
          extracted from it, which leads to a growing gap between
          data and knowledge.
        • Data mining provides a way to automatically extract information buried in the
          data.
        • Data mining creates mathematical models that describe patterns in large,
          complex collections of data.
        • These patterns elude traditional statistical approaches because of the large
          number of attributes, the complexity of the patterns, or the difficulty of performing
          the analysis.
        • Mining the data directly in the database has advantages:
          less data movement, more data security, and a single source of the
          data.
        • Basically, two types of data exist:
              – Structured (tables & numbers): 20% of data volume
              – Unstructured (text, images): 80% of data volume
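
Before unstructured text can be mined, it is usually converted into a structured numeric form. The following is a minimal sketch of one common approach, a bag-of-words term count stored sparsely. The function names and sample documents are illustrative assumptions and are not taken from KMX.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Map every distinct token in the corpus to a column index."""
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}

def to_sparse_vector(text, vocabulary):
    """Return {column index: term count}, storing only non-zero entries."""
    counts = Counter(tokenize(text))
    return {vocabulary[t]: c for t, c in counts.items() if t in vocabulary}

# Hypothetical sample documents, for illustration only.
docs = [
    "Patent claims a new battery electrode",
    "Court ruling on patent infringement",
    "Battery research shows longer battery life",
]
vocabulary = build_vocabulary(docs)
vectors = [to_sparse_vector(d, vocabulary) for d in docs]

for doc, vec in zip(docs, vectors):
    print(f"{len(vec)} of {len(vocabulary)} columns used: {doc}")
```

Each document touches only a few of the vocabulary columns, which is why text data is typically sparse and high-dimensional.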




The Data & Text Mining process
            Automating the mining steps; adding new features

                    Understanding the knowledge mining value chain




         [Figure: the value chain as a sequence of steps, left to right]
         • Data Collection & Understanding
         • Data Preparation & Cleansing
         • Algorithm Selection
         • Model Building & Testing
         • Model Deployment
         • Model generation (All models) & coordination
         • Visualization
         Labels under the chain: Traditional Players (left); Treparel's Focus & Core competence (right)


2 types of Data Mining Functions
         Predictive Data Mining (supervised):
         •    Used to predict a value; requires the specification of a
              target (known outcome)
         •    Targets are either binary attributes indicating yes/no decisions or
              multi-class targets indicating a preferred alternative (color of
              sweater, salary range)
         •    Constructs one or more models; these models are used to predict
              outcomes for data sets
         Descriptive Data Mining (unsupervised):
         •    Used to find the intrinsic structure, relations, or affinities in
              data
         •    Describes a data set in a concise way and presents interesting
              characteristics of the data
         •    The functions are: clustering, association models, and feature
              extraction

How does Automated Classification & Clustering
         work?
         • Classification consists of dividing the items that make up a collection into
           categories or classes.
         • The goal is to accurately predict the target class for each
           record in new data.
         • Algorithms for classification: different algorithms for
           different problems
                  Naïve Bayes
                  Adaptive Bayes Network
                  Support Vector Machine
                  Decision Tree

         Classification is used in: customer segmentation, sentiment analysis,
         competitive analysis, business modeling, credit analysis, smart content,
         fraud and terrorist detection, diagnosis support, patent & drug discovery
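
To make "predict the target class for each record in new data" concrete, here is a minimal sketch of the first algorithm in the list, a multinomial Naive Bayes text classifier with add-one smoothing. The training sentences, labels, and function names are hypothetical, and the sketch does not represent how KMX implements classification.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train_naive_bayes(labeled_docs):
    """Count documents per class and token occurrences per class."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for tok in tokenize(text):
            token_counts[label][tok] += 1
            vocabulary.add(tok)
    return class_counts, token_counts, vocabulary

def predict(text, model):
    """Return the class with the highest log posterior score."""
    class_counts, token_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_tokens = sum(token_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            p = (token_counts[label][tok] + 1) / (total_tokens + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data (the "known outcome" targets).
training = [
    ("great product works well", "positive"),
    ("excellent service very happy", "positive"),
    ("terrible product broke quickly", "negative"),
    ("very poor service unhappy", "negative"),
]
model = train_naive_bayes(training)
print(predict("happy with this great product", model))
print(predict("terrible and poor", model))
```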
Text Mining algorithms and features

         Feature                        Naive Bayes            Adaptive Bayes Network   Support Vector Machine      Decision Tree
         Speed                          Very fast              Fast                     Fast with active learning   Fast
         Accuracy                       Good in many domains   Good in many domains     Significant                 Good in many domains
         Transparency                   No rules (black box)   Rules for                No rules (black box)        Rules
         Missing value interpretation   Missing value          Missing value            Sparse data                 Missing value




What is Support Vector Machine Learning?
        State of the Art algorithm
        • SVM is a state-of-the-art classification and regression algorithm
        • The SVM optimization procedure balances predictive accuracy against
          model complexity, which helps avoid over-fitting the training data
        • SVM projects the input data into a kernel space, then builds a
          linear model in this kernel space
        • SVM performs well in real-world applications such as
          classifying text, recognizing hand-written characters, classifying
          images, as well as bioinformatics and bio sequence analysis
        • SVMs are standard tools for machine learning and data mining
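
The idea of a linear model in kernel space can be illustrated with a small sketch. It assumes a degree-2 polynomial kernel on 2-D inputs, chosen only for illustration, and shows that the kernel value equals an ordinary dot product after an explicit feature map, so the projection never has to be computed directly.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

def feature_map(x):
    """Explicit map for 2-D input whose dot product reproduces K."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly_kernel(x, z)                        # computed in input space
explicit_value = dot(feature_map(x), feature_map(z))    # computed in kernel space
print(kernel_value, explicit_value)
```

Both values agree up to floating-point rounding. A linear separator in the 3-D mapped space corresponds to a quadratic boundary in the original 2-D space.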




What is Support Vector Machine Learning?
                 Classical Data Mining vs SVM

                     Classical Statistics                 SVM - Support Vector Machines

                     Hypothesis on data                   Study of the model family:
                     distribution                         the VC dimension

                     Large number of dimensions           Number of dimensions can be
                     implies large number of model        very high because generalization
                     parameters, which leads to           is controlled
                     generalization problems

                     Modeling seeks to get the best       Modeling seeks to get the best
                     fit                                  compromise between fit and
                                                          robustness

                     Manual iterations and time           Automation is possible
                     are necessary
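
The compromise between fit and robustness can be sketched with a soft-margin linear SVM trained by batch subgradient descent on the regularized hinge loss. This is an illustrative assumption about one way to train such a model, not the method used by KMX; the data points, `lam`, `learning_rate`, and `epochs` values are hypothetical.

```python
def train_linear_svm(points, labels, lam=0.01, learning_rate=0.1, epochs=200):
    """Minimize (lam/2)*||w||^2 + mean(max(0, 1 - y*(w.x + b)))."""
    n, dim = len(points), len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regularization term
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:                # hinge loss is active for this point
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b

def predict(w, b, x):
    """Classify by the sign of the linear decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable toy data with labels +1 / -1.
points = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
print([predict(w, b, x) for x in points])
```

The `lam` term penalizes large weights (robustness) while the hinge term penalizes training mistakes (fit); raising `lam` shifts the balance toward a wider margin.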



What makes SVM such a unique technology?
         • Strong theoretical foundation (Vapnik-Chervonenkis theory)
         • There is no upper limit on the number of attributes; the only constraint is
           the hardware
         • Good generalization to novel data
         • SVM is the preferred algorithm for sparse data
         • Algorithm of choice for challenging high-dimensional data
         • SVM supports active learning.
               – SVM models grow as the size of the training set increases, so big data
                 sets would otherwise be difficult to handle.
               – Active learning forces the SVM algorithm to restrict learning to the
                 most informative training examples.
         • The kernel can be selected automatically
         • You can control both the model quality (accuracy) and the performance
           (build time)
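
One common way to pick the "most informative" examples is margin-based uncertainty sampling: choose the candidates closest to the current decision boundary. The sketch below assumes that heuristic; the slides do not say which selection strategy is actually used, and the model weights and candidate pool are hypothetical.

```python
def decision_value(w, b, x):
    """Signed score of x under a linear model; near 0 means near the boundary."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def most_informative(w, b, candidates, k):
    """Return the k candidates closest to the decision boundary."""
    return sorted(candidates, key=lambda x: abs(decision_value(w, b, x)))[:k]

# Hypothetical current model and pool of candidate training examples.
w, b = [1.0, 1.0], 0.0
pool = [(3.0, 3.0), (0.2, -0.1), (-4.0, -1.0), (-0.5, 0.3), (1.0, 1.5)]
selected = most_informative(w, b, pool, k=2)
print(selected)
```

Points far from the boundary are already classified with confidence, so training on them adds little; restricting training to the selected points keeps the model and build time small.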

What makes SVM unique?
         SVM gives you control over the models
         [Figure: robustness (low to high) plotted against quality of fit (low accuracy to high accuracy)]
         • Under Fit Model: high robustness; training error = test error
         • Robust Model: low training error, low test error
         • Over Fit Model: low robustness; no training error, high test error
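
The three regions of the chart can be expressed as a simple check on training and test error. This is a rough sketch; the threshold values `max_error` and `max_gap` are arbitrary assumptions chosen for illustration and do not come from the slides.

```python
def diagnose_fit(train_error, test_error, max_error=0.10, max_gap=0.05):
    """Label a model as 'under fit', 'over fit', or 'robust'.

    max_error and max_gap are illustrative thresholds, not standard values.
    """
    gap = test_error - train_error
    if gap > max_gap:
        return "over fit"    # training error far below test error
    if train_error > max_error:
        return "under fit"   # both errors high and about equal
    return "robust"          # both errors low

print(diagnose_fit(0.30, 0.31))
print(diagnose_fit(0.03, 0.05))
print(diagnose_fit(0.00, 0.25))
```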
What makes SVM unique?
         SVM gives you control over the models




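         [Figure: robustness (low to high) plotted against quality (low to high)]
         • High robustness, low quality: need more training data (rows)
         • High robustness, high quality: safe to deploy
         • Low robustness, low quality: need more data (rows/columns) or different model type
         • Low robustness, high quality: need more variables (columns) or different model type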

Treparel is a leading technology solution provider
       in Big Data Text Analytics & Visualization


                                              Treparel
                                           Delftechpark 26
                                            2628 XH Delft
                                          The Netherlands
                                          www.treparel.com



Presentation on how to chat with PDF using ChatGPT code interpreter
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Support Vector Machines (SVM) - Text Analytics algorithm introduction 2012

  • 1. Introduction to Text Mining & Support Vector Machines (SVM). Dr. Anton Heijs, CEO, Treparel. July 2012. Delftechpark 26, 2628 XH Delft, The Netherlands. www.treparel.com
  • 2. KMX enables information and knowledge professionals to gain faster, more reliable and more precise insights into large, complex unstructured data sets, allowing them to make better informed decisions. Treparel is a leading technology solution provider in Big Data Text Analytics & Visualization. Treparel KMX – All rights reserved 2012, www.treparel.com
  • 3. Topics covered in this presentation
      • Who is Treparel?
      • Introduction to Text Mining
      • What is Automated Classification & Clustering?
      • Introducing Support Vector Machines
  • 4. Nexus of Forces: Social, Cloud, Mobile, Information. IT market shift driving Big Data challenges (Copyright: Gartner, 2011). 80% of data is unstructured (documents, text, images, graphs).
  • 5. About Treparel
      • Delft, The Netherlands, 2006.
      • Treparel is an innovative technology solution provider in Big Data Analytics, Text Mining and Visualization.
      • KMX is an integrated data analysis toolset which provides faster, more reliable and intelligent insights into large, complex unstructured data sets, allowing companies to make better informed decisions.
      • Clients: Philips, Bayer, Abbott, European Patent Office, European Commission
      • Part of the research center and university ecosystem: TU Delft, Universities of Paris and Sao Paulo
      • More info: www.treparel.com
  • 6. Positioning of Treparel’s KMX technology (Copyright: Gartner, J. Popkin 2010)
      • Text Acquisition & Preparation (‘Seek’): external sources, patents, legal, research databases, media / publishers, other sources, documents, websites, blogs, newsfeeds, email, application notes, search results, social networks
      • Analysis and processing (‘Model’): text preprocessing, indexing, clustering, classification, semantic analysis, visualization, information extraction (entities, facts, relationships, concepts, patents)
      • Output and display (‘Adapt’): reporting & presentation, media and publishing, content management systems, line-of-business applications, research applications, search engines
      • Management, Development and Configuration
  • 7. Getting to know the basics
      PART A: Introduction to Text Mining
      • The Data (text & image) Mining evolution
      • What is Data Mining: inside or outside the database
      • The Data Mining process
      • Two types of Data Mining tasks: Predictive and Descriptive
      • Two modes of Data Mining tasks: Supervised and Unsupervised
      • The most important algorithms per category
      PART B: SVM
      • Machine Learning & Support Vector Machines (SVM)
      • What makes SVM unique
      • When and how to deploy SVM
      • Case Studies & Examples
  • 8. The Data/Text/Image mining evolution: The Road ahead. [Chart: Application Value (Low to High) against Usability (Hard to use to Easy to use). Labels: 1980’s OLAP, Query and Reporting; 1980’s Traditional Data Mining; 1990’s “Easy-to-Use” Data Mining Tools; 1995 - 2000 Predictive Modeling; Today and Future: Analytical Modeling, SVM, Enterprise Text Analytics.]
  • 9. Knowledge Mining: Different levels of depth in knowledge discovery. [Diagram: over time, from Data Collection (Seek) to Visualization (Adapt). Labels: Data Mining, Text Mining, Graph Mining, Knowledge Discovery; Meta Data, Filtered data; Models of meta data, Models of data, Models of semantic data.]
  • 10. What is Data Mining? Getting to know the basics
      • Most businesses have an enormous amount of data, with a great deal of information hiding within it. The data is also growing faster than the knowledge extracted from it, which leads to a growing gap between data and knowledge.
      • Data mining provides a way to automatically extract information buried in the data.
      • Data mining creates mathematical models which describe patterns in large, complex collections of data.
      • Patterns elude traditional statistical approaches to analysis because of the large number of attributes, the complexity of the patterns, or the difficulty of performing the analysis.
      • Mining the data directly in the database has advantages: less data movement, more data security, one source of the data.
      • Basically 2 types of data exist:
          – Structured (tables & numbers): 20% of data volume
          – Unstructured (text, images): 80% of data volume
  • 11. The Data & Text Mining process: automating the mining steps; adding new features. Understanding the knowledge mining value chain: Data Collection, Data Preparation & Cleansing, Algorithm Selection, Model Building & Testing, Model Understanding, Model Deployment, Model generation & coordination (All models), Visualization. [Chart labels: Treparel's Focus & Core competence; Traditional Players.]
  • 12. 2 types of Data Mining Functions
      Predictive Data Mining (supervised):
      • Predictive functions are used to predict a value; they require the specification of a target (known outcome).
      • Targets are either binary attributes (indicating yes/no decisions) or multi-class targets indicating a preferred alternative (color of sweater, salary range).
      • Predictive mining constructs one or more models; these models are used to predict outcomes for data sets.
      Descriptive Data Mining (unsupervised):
      • Descriptive functions are used to find the intrinsic structure, relations, or affinities in data.
      • Descriptive mining describes a data set in a concise way and presents interesting characteristics of the data.
      • The functions are: clustering, association models, and feature extraction.
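As an illustration of the descriptive (unsupervised) mode on slide 12, the following is a minimal k-means clustering sketch in plain Python. It is not taken from the slides or from KMX; the function name `kmeans`, the choice of the first `k` points as initial centroids, and the sample points are assumptions made for the example. Note that no target labels are supplied: the grouping comes only from the structure of the data.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Plain k-means; returns (centroids, labels).

    Assumption for this sketch: the first k points seed the centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return centroids, labels

# Two visibly separate groups, given without any target labels.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.8, 8.2), (0.9, 1.1), (8.1, 7.9)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

A predictive (supervised) function would instead receive a known outcome for every training record and learn to reproduce it on new data.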
  • 13. How do Automated Classification & Clustering work?
      • Classification consists of dividing the items that make up a collection into categories or classes.
      • The goal is to accurately predict the target class for each record in new data.
      • Algorithms for classification (different algorithms for different problems):
          – Naïve Bayes
          – Adaptive Bayes Network
          – Support Vector Machine
          – Decision Tree
      • Classification is used in: customer segmentation, sentiment analysis, competitive analysis, business modeling, credit analysis, smart content, fraud and terrorist detection, diagnosis support, patent & drug discovery.
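To make the classification goal on slide 13 concrete (predict the target class for each new record), here is a minimal sketch of a Naïve Bayes text classifier, the first algorithm in the list. The helper names `train_naive_bayes` and `classify`, the whitespace tokenization, the add-one smoothing and the tiny training set are all assumptions for illustration, not the method used in KMX.

```python
from collections import Counter, defaultdict
from math import log

def train_naive_bayes(documents):
    """documents: list of (text, label) pairs. Returns the fitted counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in documents:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log posterior for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior of the class.
        score = log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are skipped
                # Log likelihood with add-one (Laplace) smoothing.
                score += log((word_counts[label][word] + 1)
                             / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training records (the "known outcomes").
training = [
    ("patent claims a new battery electrode", "patent"),
    ("invention relates to a battery cell", "patent"),
    ("customer loved the fast delivery", "review"),
    ("delivery was late and customer unhappy", "review"),
]
model = train_naive_bayes(training)
print(classify(model, "battery electrode invention"))
print(classify(model, "unhappy customer"))
```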
  • 14. Text Mining algorithms and features
      – Speed: Naive Bayes: Very fast; Adaptive Bayes Network: Fast; Support Vector Machine: Fast with active learning; Decision Tree: Fast
      – Accuracy: Naive Bayes: Good in many domains; Adaptive Bayes Network: Good in many domains; Support Vector Machine: Significant; Decision Tree: Good in many domains
      – Transparency: Naive Bayes: No rules (black box); Adaptive Bayes Network: Rules for; Support Vector Machine: No rules (black box); Decision Tree: Rules
      – Missing value interpretation: Naive Bayes: Missing value; Adaptive Bayes Network: Missing value; Support Vector Machine: Sparse Data; Decision Tree: Missing value
  • 15. What is Support Vector Machine Learning? State of the Art algorithm
      • SVM is a state-of-the-art classification and regression algorithm.
      • The SVM optimization procedure maximizes predictive accuracy while automatically avoiding over-fitting the training data.
      • SVM projects the input data into a kernel space. Then it builds a linear model in this kernel space.
      • SVM performs well in real-world applications such as classifying text, recognizing hand-written characters, classifying images, as well as bioinformatics and bio sequence analysis.
      • SVMs are standard tools for machine learning and data mining.
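The statement on slide 15 that SVM projects the input data into a kernel space and builds a linear model there can be illustrated with a small numeric check. The sketch below uses the homogeneous degree-2 polynomial kernel as an assumed example (the slides do not name a specific kernel) and shows that evaluating the kernel in the input space gives the same number as a dot product after an explicit feature map. The names `poly_kernel` and `feature_map` are hypothetical.

```python
from math import sqrt, isclose

def poly_kernel(x, y):
    """Degree-2 polynomial kernel on 2-D inputs: k(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    """Explicit map phi into 3-D with phi(x) . phi(y) == poly_kernel(x, y)."""
    return (x[0] ** 2, sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly_kernel(x, y)                     # computed in input space
mapped_value = dot(feature_map(x), feature_map(y))   # computed in kernel space
print(isclose(kernel_value, mapped_value))

# A circle is not a linear boundary in the input space, but in the kernel
# space it is the linear rule w . phi(x) < 1 with w = (1, 0, 1).
w = (1.0, 0.0, 1.0)
inside = dot(w, feature_map((0.5, 0.5))) < 1.0
outside = dot(w, feature_map((1.0, 1.0))) < 1.0
print(inside, outside)
```

The practical point is that the kernel never needs the mapped vectors themselves, which is what keeps very high-dimensional kernel spaces affordable.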
  • 16. What is Support Vector Machine Learning? Classical Data Mining vs SVM
      Classical Statistics:
      – Hypothesis on data distribution
      – Large number of dimensions implies large number of model parameters, which leads to generalization problems
      – Modeling seeks to get the best fit
      – Manual iterations and time are necessary
      SVM - Support Vector Machines:
      – Study of the model family: the VC dimension
      – Number of dimensions can be very high because generalization is controlled
      – Modeling seeks to get the best compromise between fit and robustness
      – Automation is possible
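The "compromise between fit and robustness" on slide 16 corresponds, in the usual soft-margin formulation, to minimizing a regularization term plus the average hinge loss. The following is a minimal sketch of a linear SVM trained by full-batch subgradient descent on that objective. It is an illustrative stand-in, not the solver used by KMX; the function names, the parameter names `lam`, `lr` and `epochs`, and the toy data are assumptions.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean(max(0, 1 - y * (w . x + b))).

    lam weights robustness (a wide margin) against fit (low hinge loss).
    Labels must be +1 or -1.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]   # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                # hinge loss is active for this point
                for j, xj in enumerate(x):
                    grad_w[j] -= y * xj / n
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable data (hypothetical).
samples = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0),
           (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(samples, labels)
print(predict(w, b, (2.5, 2.5)), predict(w, b, (-2.5, -2.5)))
```

Raising `lam` favors a simpler, more robust model at the cost of fit; lowering it does the opposite.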
  • 17. What makes SVM such a unique technology?
      • Strong theoretical foundation (Vapnik-Chervonenkis theory)
      • There is no upper limit on the number of attributes; the only constraint is the hardware
      • Good generalization to novel data
      • SVM is the preferred algorithm for sparse data
      • Algorithm of choice for challenging high-dimensional data
      • SVM supports active learning.
          – SVM models grow as the size of the training set increases, so big data sets would be difficult to handle.
          – Active learning forces the SVM algorithm to restrict learning to the most informative training examples.
      • SVM automatically selects a kernel
      • You can control both the model quality (accuracy) and the performance (build time)
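Slide 17 says active learning restricts learning to the most informative training examples but does not say how they are chosen. One common strategy, assumed here purely for illustration, is uncertainty sampling: pick the unlabeled examples closest to the current separating hyperplane, since those are the ones the model is least sure about. The model weights and the pool below are hypothetical.

```python
def decision_value(w, b, x):
    """Signed score of x under a linear model; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def most_informative(w, b, unlabeled, batch_size=2):
    """Return the unlabeled samples with the smallest |w . x + b|."""
    ranked = sorted(unlabeled, key=lambda x: abs(decision_value(w, b, x)))
    return ranked[:batch_size]

w, b = (1.0, 1.0), 0.0   # hypothetical current model
pool = [(3.0, 3.0), (0.2, -0.1), (-4.0, -1.0), (-0.3, 0.5), (2.0, -2.5)]
queries = most_informative(w, b, pool)
print(queries)
```

Only the selected examples would then be labeled and added to the training set, which keeps the model from growing with every available record.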
  • 18. What makes SVM unique? SVM gives you control over the models. [Chart: Robustness (Low to High) against Quality of fit (Low accuracy to High accuracy).]
      – Under Fit Model: high robustness, training error = test error
      – Robust Model: high robustness, low training error, low test error
      – Over Fit Model: low robustness, no training error, high test error
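The three regions on slide 18 can be read as a simple rule on training and test error. The sketch below encodes that reading; the function name `diagnose_fit` and the thresholds `max_error` and `max_gap` are assumptions chosen for the example, not values from the slides.

```python
def diagnose_fit(train_error, test_error, max_error=0.1, max_gap=0.05):
    """Label a model as 'over fit', 'under fit' or 'robust'.

    max_error and max_gap are illustrative thresholds (assumptions).
    """
    gap = test_error - train_error
    if gap > max_gap:
        # Much worse on unseen data than on training data: low robustness.
        return "over fit"
    if train_error > max_error:
        # Errors are similar but both high: robust yet inaccurate.
        return "under fit"
    return "robust"

print(diagnose_fit(0.00, 0.30))
print(diagnose_fit(0.25, 0.26))
print(diagnose_fit(0.04, 0.06))
```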
  • 19. What makes SVM unique? SVM gives you control over the models. [Quadrant chart: Robustness (Low to High) against Quality (Low to High). Quadrant labels: Safe to Deploy; Need more training data (rows); Need more variables (columns) or different model type; Need more data (rows/columns) or different model type.]
  • 20. Treparel is a leading technology solution provider in Big Data Text Analytics & Visualization. Treparel, Delftechpark 26, 2628 XH Delft, The Netherlands. www.treparel.com