Identification of Relevant Sections in Web Pages Using a
               Machine Learning Approach




                                  Jerrin Shaji George

                                      NIT Calicut


                                  November 8, 2012
Introduction

  There is a massive amount of data available on the internet.
  Extracting only the relevant content has become very important.
  A Machine Learning approach is suitable as it can adapt to the
  rapidly changing dynamics of the internet.




Machine Learning

  The science of getting computers to act without being explicitly
  programmed.
  A method of teaching computers to make and improve predictions
  or behaviors based on some data.
  Machine Learning Algorithms:
          Supervised Machine Learning
          Unsupervised Machine Learning




Supervised Learning

  Machine learning task of inferring a function from labeled training
  data.




           Figure: Supervised Learning Model (courtesy scikit-learn)
Supervised Learning

  Example of a classification problem - discrete valued output.




                   Figure: Copyright © Victor Lavrenko

Supervised Learning

  Example of a regression problem - continuous valued output.




                   Figure: Copyright © Victor Lavrenko
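
A minimal sketch of regression as supervised learning: a line is fitted to labeled (x, y) pairs by ordinary least squares and then used to predict a continuous value. The toy data and the name `fit_line` are made up for illustration and do not come from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy labeled training data (illustrative): y is exactly 2x + 1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(train_x, train_y)
prediction = slope * 5.0 + intercept  # continuous-valued output for an unseen x
```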

Unsupervised Learning

  The data has no labels. The algorithm tries to find similarities
  between the objects in question.




          Figure: Unsupervised Learning Model (courtesy scikit-learn)
Unsupervised Learning

  Example of a clustering problem




                   Figure: Copyright © Victor Lavrenko
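
One way to picture clustering on unlabeled data is k-means, sketched below for 1D points. The slides do not name a specific clustering algorithm, so k-means, the toy points and the starting centers are all assumptions chosen for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on 1D points, starting from the given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Unlabeled toy data (illustrative) with two visible groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
```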
Support Vector Machines (SVM)

  A supervised learning model.
  Used for classification and regression analysis.
  The basic SVM:
          A non-probabilistic binary linear classifier.
          Classifies each given input into one of two possible classes,
          which form the output.




The SVM Algorithm

   Inputs are formulated as feature vectors.
   The feature vectors are mapped into a feature space by using a
   kernel function.
   A division is computed in the feature space to optimally separate
   the classes of training vectors.
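
The three steps above can be sketched in plain Python for the simplest case, a linear kernel, where the feature space is the input space itself. This sketch trains by subgradient descent on the regularized hinge loss, which is a simplified stand-in for the quadratic-programming solvers that SVM libraries typically use. The data, hyperparameters and function names are illustrative assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Subgradient descent on the regularized hinge loss (linear kernel)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:
                # Point is inside the margin or misclassified: move the hyperplane.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Point is safely classified: only apply regularization shrinkage.
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Feature vectors and their class labels (toy values).
X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```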




The SVM Algorithm

               φ: the feature map associated with the kernel function




Formal Definition of SVM

   An SVM constructs a hyperplane or set of hyperplanes in a high-
   or infinite-dimensional space.
   It can be used for classification and regression.
   A good separation is achieved by the hyperplane that has the
   largest distance to the nearest training data point of any class
   (called the functional margin).
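
For reference, the separation described above is commonly written as the hard-margin optimization problem below. This is the standard textbook form for linearly separable data, not a formula taken from the slides; $w$, $b$, $x_i$ and $y_i \in \{-1, +1\}$ are the usual symbols for the hyperplane normal, the offset, the training vectors and their labels.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```

Under these constraints the nearest training points lie at distance $1/\lVert w \rVert$ from the hyperplane, so minimizing $\lVert w \rVert$ maximizes that distance.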




Optimal Separating Hyperplane




                 Figure: Courtesy Steve Gunn

Functional Margin

   The vectors (points) that constrain the width of the margin are the
   support vectors.
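
As a small illustration, the sketch below measures how far each training point lies from a given hyperplane and picks out the closest ones, which play the role of support vectors. The hyperplane `w`, `b` and the points are hypothetical values assumed for the example, not the output of a trained model.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return abs(sum(wj * xj for wj, xj in zip(w, x)) + b) / norm

# Hypothetical separating hyperplane and training points (illustrative values).
w, b = (1.0, 1.0), 0.0
points = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-3.0, 0.0)]

distances = [distance_to_hyperplane(w, b, p) for p in points]
margin = min(distances)
# Support vectors are the points that sit exactly on the margin boundary.
support_vectors = [p for p, d in zip(points, distances) if math.isclose(d, margin)]
```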




                       Figure: Image from scikit-learn
Mapping to Higher Dimensions

   Sometimes data is not linearly separable.
   If the original finite-dimensional space is mapped into a much
   higher-dimensional space, the separation is made easier in that
   space.
   This is achieved by the SVM using the Kernel Trick.
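
A short sketch of both ideas, under assumed toy data. First, 1D points that no single threshold can separate become separable after the mapping x -> (x, x^2). Second, for the degree-2 polynomial kernel k(u, v) = (u.v)^2, the kernel value equals an inner product in a 3D feature space, so the mapping never has to be computed explicitly. The feature maps and sample values are illustrative choices, not taken from the slides.

```python
import math

def phi_1d(x):
    """Map a 1D point into 2D as (x, x^2)."""
    return (x, x * x)

# Inner points are class +1, outer points class -1: not separable by one threshold.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [1 if abs(x) <= 1 else -1 for x in xs]
# After mapping, the horizontal line x2 = 2.5 separates the two classes.
mapped_predictions = [1 if phi_1d(x)[1] < 2.5 else -1 for x in xs]

def phi_2d(v):
    """Explicit 2D -> 3D feature map matching the kernel (u.v)^2."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(u, v):
    """Degree-2 polynomial kernel, evaluated in the original 2D space."""
    return (u[0] * v[0] + u[1] * v[1]) ** 2

u, v = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi_2d(u), phi_2d(v)))  # inner product in 3D
implicit = poly_kernel(u, v)                                  # same value, no mapping
```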




Mapping to Higher Dimensions

   Mapping from 1D to 2D




   Mapping from 2D to 3D




                     Figure: Courtesy Steve Gunn
Identification of Relevant Sections in a Web Page for
Web Search

   Shallow techniques like keyword matching give unsatisfactory
   results.
   Search methodologies must focus more on contextual information
   than just keyword occurrences.
           The search term might not be a very differentiating term.
           It might not appear in the section at all.

   SQUINT: an SVM-based approach to identify sections of a Web
   page relevant to a Web Search.



Overall Architecture




Feature Generation

   Word Rank Based Features
   Bigram Rank Based Features
   Coverage of Top Ranked Tokens
   Query Word Frequency
   Distance from the Query




Word Rank Based Features

   The rank of a word is its position in the list of all words
   ordered by frequency of occurrence across all search results.
   The value of this feature is the frequency of the particular word in
   the given section.
   Bucketing can be used to reduce dimensionality.
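
A minimal sketch of how such features could be computed. Whitespace tokenization, alphabetical tie-breaking, and summing frequencies within each bucket of consecutive ranks are all assumptions made for illustration; the slides do not specify these details, and the function names are hypothetical.

```python
from collections import Counter

def rank_words(search_results):
    """Order all words by descending frequency across the search results."""
    counts = Counter(w for doc in search_results for w in doc.lower().split())
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(counts, key=lambda w: (-counts[w], w))

def word_rank_features(section, ranked_words, bucket_size):
    """One feature per bucket of ranks: total in-section frequency of its words."""
    section_counts = Counter(section.lower().split())
    n_buckets = (len(ranked_words) + bucket_size - 1) // bucket_size
    features = [0] * n_buckets
    for rank, word in enumerate(ranked_words):
        features[rank // bucket_size] += section_counts[word]
    return features

# Toy search results and section text (illustrative).
results = ["svm kernel svm margin", "svm kernel trick"]
ranked = rank_words(results)
features = word_rank_features("the svm margin svm", ranked, bucket_size=2)
```

With a bucket size of 1 this reduces to one feature per ranked word; larger buckets trade detail for a shorter feature vector.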




Bigram Rank Based Features

   A bigram is defined to be two consecutive words occurring in a
   section.
   E.g., "machine learning" may be more important than "machine" and
   "learning" separately.
   The value of the feature is calculated in the same way as for Word
   Rank Based Features.
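
Bigram extraction can be sketched as follows, assuming simple lowercase whitespace tokenization (an illustrative choice). The resulting counts could then be ranked and bucketed in the same way as single words.

```python
from collections import Counter

def bigrams(text):
    """Return the list of consecutive word pairs in the text."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

# Toy section text (illustrative).
section = "machine learning makes machine learning systems"
bigram_counts = Counter(bigrams(section))
```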




Coverage of Top Ranked Tokens

   Relevance may also be determined by the number of top ranked
   words which occur in the section.
   The value of this feature is the coverage of top ranked words per
   bucket.
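
One plausible reading of "coverage per bucket" is the fraction of each bucket's words that appear at least once in the section. The sketch below implements that reading; the definition, the ranking and the function name are assumptions for illustration.

```python
def coverage_per_bucket(section, ranked_words, bucket_size):
    """Fraction of each rank bucket's words that occur in the section."""
    present = set(section.lower().split())
    coverage = []
    for start in range(0, len(ranked_words), bucket_size):
        bucket = ranked_words[start:start + bucket_size]
        covered = sum(1 for w in bucket if w in present)
        coverage.append(covered / len(bucket))
    return coverage

ranked_words = ["svm", "kernel", "margin", "trick"]  # hypothetical ranking
features = coverage_per_bucket("the svm kernel has a margin", ranked_words, bucket_size=2)
```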




Distance from the Query

   The intuition here is that the closer a section is to the query in the
   Web page, the more likely it is to be relevant.
   The value of this feature is the section-wise distance between the
   section in question and the nearest section which contains the
   query.
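
A small sketch of this feature, measuring distance as the difference in section index. Matching the query by case-insensitive substring search and returning `None` when no section contains the query are assumptions made for the example.

```python
def distance_from_query(sections, query):
    """For each section, the index distance to the nearest section containing the query."""
    q = query.lower()
    hits = [i for i, text in enumerate(sections) if q in text.lower()]
    if not hits:
        # No section mentions the query, so the distance is undefined.
        return [None] * len(sections)
    return [min(abs(i - h) for h in hits) for i in range(len(sections))]

# Toy page split into sections (illustrative).
sections = ["Intro to search", "SVM basics", "Feature design", "Evaluation", "Why SVM works"]
distances = distance_from_query(sections, "svm")
```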




Query Word Frequency

   The value of this feature is the frequency of the query word in the
   section.
   The value is normalized by the number of words in the section.
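
In code, the normalized frequency might look like this, assuming a single-word query and lowercase whitespace tokenization (both illustrative choices).

```python
def query_word_frequency(section, query_word):
    """Occurrences of the query word divided by the number of words in the section."""
    tokens = section.lower().split()
    if not tokens:
        return 0.0
    return tokens.count(query_word.lower()) / len(tokens)

value = query_word_frequency("SVM margins make the SVM robust", "svm")
```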




Training Set Generation

   Query Google to get a set of pages.
   Clean each page: remove scripts, pictures, links, etc.
   Break each page into sections.
   Label each section of every page.
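
The cleaning and sectioning steps can be sketched with Python's standard `html.parser` module. Starting a new section at every heading tag and dropping link text entirely are assumptions made for illustration; the slides do not say how pages were actually split or cleaned. Fetching the pages and labeling the sections are left out.

```python
from html.parser import HTMLParser

class SectionExtractor(HTMLParser):
    """Collect visible text, skipping scripts, styles and links, split at headings."""
    SKIP = {"script", "style", "a"}
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.sections = [[]]
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS and self.sections[-1]:
            # A heading opens a new section (assumed sectioning rule).
            self.sections.append([])
        # Images carry no text, so <img> tags are simply ignored.

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.sections[-1].append(text)

def split_into_sections(html):
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    return [" ".join(parts) for parts in parser.sections if parts]

page = ("<h1>SVM</h1><p>Margins matter.</p><script>var x = 1;</script>"
        "<h2>Kernels</h2><p>See <a href='#'>this link</a> for more.</p><img src='a.png'>")
sections = split_into_sections(page)
```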




Learning Algorithm

   A Support Vector Machine with a linear kernel is used.
   Given the relatively high dimensionality of the feature vector, it is a
   reasonable choice to use an SVM.
   The predicted margin of each sample is used to get a non-binary
   metric of how relevant each section is.
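
To illustrate the last point: with a linear kernel, the predicted margin of a section is the signed score w.x + b, and sorting by that score gives a relevance ranking rather than a yes/no label. The weights, bias and feature vectors below are hypothetical numbers, not values from a trained SQUINT model.

```python
def decision_value(w, b, x):
    """Signed score of a linear SVM: positive means the 'relevant' side."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# Hypothetical trained weights and per-section feature vectors.
w, b = [0.5, -0.2, 1.0], -0.1
section_features = {
    "A": [2.0, 0.0, 1.0],
    "B": [0.0, 3.0, 0.0],
    "C": [1.0, 1.0, 0.0],
}

scores = {name: decision_value(w, b, x) for name, x in section_features.items()}
# Higher score = more relevant; the ordering is the non-binary relevance metric.
ranking = sorted(scores, key=scores.get, reverse=True)
```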




Conclusion

   Support Vector Machines are an attractive approach to data
   modelling.
   Evaluations suggest that information-retrieval-inspired features
   and some basic hints from summarization give respectable accuracy
   in detecting the most relevant section in a page.
   Thus SQUINT can have a large impact on the user’s overall search
   experience.




References

   Nello Cristianini and John Shawe-Taylor. An Introduction to
   Support Vector Machines and Other Kernel-based Learning Methods.
   Cambridge University Press, 2000.
   Siddharth Jonathan J.B., Riku Inoue and Jyotika Prasad. SQUINT:
   SVM for Identification of Relevant Sections in Web Pages for Web
   Search.
   Wikipedia article on Machine Learning,
   http://en.wikipedia.org/wiki/Support_vector_machine
   Machine Learning Course on Coursera,
   https://class.coursera.org/ml-2012-002/class/index



28 of 28

Más contenido relacionado

La actualidad más candente

Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...Edureka!
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningAmAn Singh
 
Machine Learning Unit 1 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 1 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 1 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 1 Semester 3 MSc IT Part 2 Mumbai UniversityMadhav Mishra
 
Application of machine learning in industrial applications
Application of machine learning in industrial applicationsApplication of machine learning in industrial applications
Application of machine learning in industrial applicationsAnish Das
 
Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network HamdaAnees
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clusteringArshad Farhad
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learningKnoldus Inc.
 
Machine Learning and Applications
Machine Learning and ApplicationsMachine Learning and Applications
Machine Learning and ApplicationsGeeta Arora
 
Lecture #1: Introduction to machine learning (ML)
Lecture #1: Introduction to machine learning (ML)Lecture #1: Introduction to machine learning (ML)
Lecture #1: Introduction to machine learning (ML)butest
 
Introduction To Machine Learning
Introduction To Machine LearningIntroduction To Machine Learning
Introduction To Machine LearningKnoldus Inc.
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataDataminingTools Inc
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learningAnil Yadav
 
Building Azure Machine Learning Models
Building Azure Machine Learning ModelsBuilding Azure Machine Learning Models
Building Azure Machine Learning ModelsEng Teong Cheah
 
notes as .ppt
notes as .pptnotes as .ppt
notes as .pptbutest
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.butest
 
Introduction into machine learning
Introduction into machine learningIntroduction into machine learning
Introduction into machine learningmohamed Naas
 

La actualidad más candente (20)

Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
 
Techniques Machine Learning
Techniques Machine LearningTechniques Machine Learning
Techniques Machine Learning
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Machine Learning Unit 1 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 1 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 1 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 1 Semester 3 MSc IT Part 2 Mumbai University
 
Application of machine learning in industrial applications
Application of machine learning in industrial applicationsApplication of machine learning in industrial applications
Application of machine learning in industrial applications
 
ML Basics
ML BasicsML Basics
ML Basics
 
Machine learning
Machine learning Machine learning
Machine learning
 
Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learning
 
Machine Learning and Applications
Machine Learning and ApplicationsMachine Learning and Applications
Machine Learning and Applications
 
Lecture #1: Introduction to machine learning (ML)
Lecture #1: Introduction to machine learning (ML)Lecture #1: Introduction to machine learning (ML)
Lecture #1: Introduction to machine learning (ML)
 
Introduction To Machine Learning
Introduction To Machine LearningIntroduction To Machine Learning
Introduction To Machine Learning
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
 
Building Azure Machine Learning Models
Building Azure Machine Learning ModelsBuilding Azure Machine Learning Models
Building Azure Machine Learning Models
 
notes as .ppt
notes as .pptnotes as .ppt
notes as .ppt
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
 
Introduction into machine learning
Introduction into machine learningIntroduction into machine learning
Introduction into machine learning
 

Similar a Identification of Relevant Sections in Web Pages Using a Machine Learning Approach

A survey of modified support vector machine using particle of swarm optimizat...
A survey of modified support vector machine using particle of swarm optimizat...A survey of modified support vector machine using particle of swarm optimizat...
A survey of modified support vector machine using particle of swarm optimizat...Editor Jacotech
 
Network intrusion detection using supervised machine learning technique with ...
Network intrusion detection using supervised machine learning technique with ...Network intrusion detection using supervised machine learning technique with ...
Network intrusion detection using supervised machine learning technique with ...CloudTechnologies
 
RESUME SCREENING USING LSTM
RESUME SCREENING USING LSTMRESUME SCREENING USING LSTM
RESUME SCREENING USING LSTMIRJET Journal
 
Student Performance Predictor
Student Performance PredictorStudent Performance Predictor
Student Performance PredictorIRJET Journal
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
Dive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptxDive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptxRakshaAgrawal21
 
Dive into Machine Learning Event--MUGDSC
Dive into Machine Learning Event--MUGDSCDive into Machine Learning Event--MUGDSC
Dive into Machine Learning Event--MUGDSCRakshaAgrawal21
 
A Comparative Study on Identical Face Classification using Machine Learning
A Comparative Study on Identical Face Classification using Machine LearningA Comparative Study on Identical Face Classification using Machine Learning
A Comparative Study on Identical Face Classification using Machine LearningIRJET Journal
 
Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...IRJET Journal
 
IRJET- Sentiment Analysis to Segregate Attributes using Machine Learning Tech...
IRJET- Sentiment Analysis to Segregate Attributes using Machine Learning Tech...IRJET- Sentiment Analysis to Segregate Attributes using Machine Learning Tech...
IRJET- Sentiment Analysis to Segregate Attributes using Machine Learning Tech...IRJET Journal
 
Record matching over multiple query result - Document
Record matching over multiple query result - DocumentRecord matching over multiple query result - Document
Record matching over multiple query result - DocumentNishna Ma
 
Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelDr. Abdul Ahad Abro
 
MACHINE LEARNING TOOLBOX
MACHINE LEARNING TOOLBOXMACHINE LEARNING TOOLBOX
MACHINE LEARNING TOOLBOXmlaij
 
Top 50 ML Ques & Ans.pdf
Top 50 ML Ques & Ans.pdfTop 50 ML Ques & Ans.pdf
Top 50 ML Ques & Ans.pdfJetender Sharma
 
A Survey on Machine Learning Algorithms
A Survey on Machine Learning AlgorithmsA Survey on Machine Learning Algorithms
A Survey on Machine Learning AlgorithmsAM Publications
 
An Overview of Supervised Machine Learning Paradigms and their Classifiers
An Overview of Supervised Machine Learning Paradigms and their ClassifiersAn Overview of Supervised Machine Learning Paradigms and their Classifiers
An Overview of Supervised Machine Learning Paradigms and their ClassifiersIJAEMSJORNAL
 
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET Journal
 

Similar a Identification of Relevant Sections in Web Pages Using a Machine Learning Approach (20)

A survey of modified support vector machine using particle of swarm optimizat...
A survey of modified support vector machine using particle of swarm optimizat...A survey of modified support vector machine using particle of swarm optimizat...
A survey of modified support vector machine using particle of swarm optimizat...
 
Network intrusion detection using supervised machine learning technique with ...
Network intrusion detection using supervised machine learning technique with ...Network intrusion detection using supervised machine learning technique with ...
Network intrusion detection using supervised machine learning technique with ...
 
RESUME SCREENING USING LSTM
RESUME SCREENING USING LSTMRESUME SCREENING USING LSTM
RESUME SCREENING USING LSTM
 
Student Performance Predictor
Student Performance PredictorStudent Performance Predictor
Student Performance Predictor
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Dive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptxDive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptx
 
Dive into Machine Learning Event--MUGDSC
Dive into Machine Learning Event--MUGDSCDive into Machine Learning Event--MUGDSC
Dive into Machine Learning Event--MUGDSC
 
International Journal of Engineering Inventions (IJEI),
International Journal of Engineering Inventions (IJEI), International Journal of Engineering Inventions (IJEI),
International Journal of Engineering Inventions (IJEI),
 
A Comparative Study on Identical Face Classification using Machine Learning
A Comparative Study on Identical Face Classification using Machine LearningA Comparative Study on Identical Face Classification using Machine Learning
A Comparative Study on Identical Face Classification using Machine Learning
 
Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...
 
IRJET- Sentiment Analysis to Segregate Attributes using Machine Learning Tech...
IRJET- Sentiment Analysis to Segregate Attributes using Machine Learning Tech...IRJET- Sentiment Analysis to Segregate Attributes using Machine Learning Tech...
IRJET- Sentiment Analysis to Segregate Attributes using Machine Learning Tech...
 
Record matching over multiple query result - Document
Record matching over multiple query result - DocumentRecord matching over multiple query result - Document
Record matching over multiple query result - Document
 
Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms Excel
 
MACHINE LEARNING TOOLBOX
MACHINE LEARNING TOOLBOXMACHINE LEARNING TOOLBOX
MACHINE LEARNING TOOLBOX
 
Top 50 ML Ques & Ans.pdf
Top 50 ML Ques & Ans.pdfTop 50 ML Ques & Ans.pdf
Top 50 ML Ques & Ans.pdf
 
A Survey on Machine Learning Algorithms
A Survey on Machine Learning AlgorithmsA Survey on Machine Learning Algorithms
A Survey on Machine Learning Algorithms
 
IJET-V3I2P2
IJET-V3I2P2IJET-V3I2P2
IJET-V3I2P2
 
An Overview of Supervised Machine Learning Paradigms and their Classifiers
An Overview of Supervised Machine Learning Paradigms and their ClassifiersAn Overview of Supervised Machine Learning Paradigms and their Classifiers
An Overview of Supervised Machine Learning Paradigms and their Classifiers
 
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
 
Journal Publishers
Journal PublishersJournal Publishers
Journal Publishers
 

Último

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 

Último (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 

Identification of Relevant Sections in Web Pages Using a Machine Learning Approach

  • 1. Identification of Relevant Sections in Web Pages Using a Machine Learning Approach. Jerrin Shaji George, NIT Calicut, November 8, 2012.
  • 2. Introduction: There is a massive amount of data available on the internet. Extracting only the relevant content has become very important. A machine learning approach is suitable as it can adapt to the rapidly changing dynamics of the internet.
  • 3. Machine Learning: The science of getting computers to act without being explicitly programmed. A method of teaching computers to make and improve predictions or behaviors based on some data. Machine learning algorithms: supervised machine learning and unsupervised machine learning.
  • 4. Supervised Learning: The machine learning task of inferring a function from labeled training data. Figure: Supervised Learning Model (courtesy scikit-learn).
  • 5. Supervised Learning: Example of a classification problem (discrete-valued output). Figure: Copyright © Victor Lavrenko.
  • 6. Supervised Learning: Example of a regression problem (continuous-valued output). Figure: Copyright © Victor Lavrenko.
  • 7. Unsupervised Learning: The data has no labels. The algorithm tries to find similarities between the objects in question. Figure: Unsupervised Learning Model (courtesy scikit-learn).
  • 8. Unsupervised Learning: Example of a clustering problem. Figure: Copyright © Victor Lavrenko.
  • 9. Support Vector Machines (SVM): A supervised learning model. Used for classification and regression analysis. The basic SVM: a non-probabilistic binary linear classifier. It classifies each input into one of two possible classes, which forms the output.
  • 10. The SVM Algorithm: Inputs are formulated as feature vectors. The feature vectors are mapped into a feature space by using a kernel function. A division is computed in the feature space to optimally separate the classes of training vectors.
  • 11. The SVM Algorithm: φ is the feature mapping associated with the kernel function.
  • 12. Formal Definition of SVM: An SVM constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space. It can be used for classification and regression. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (called the functional margin).
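For the linearly separable case, the largest-distance hyperplane of slide 12 is usually written as the optimization problem below. The notation is the standard textbook one and is an assumption here, not taken from the slides: training pairs $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$, weight vector $w$ and offset $b$.

```latex
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to}\quad & y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n .
\end{aligned}
```

Under this scaling the nearest training points lie at distance $1/\lVert w \rVert$ from the hyperplane $w \cdot x + b = 0$, so minimizing $\lVert w \rVert$ maximizes the total margin width $2/\lVert w \rVert$.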
  • 13. Optimal Separating Hyperplane. Figure: Courtesy Steve Gunn.
  • 14. Functional Margin: The vectors (points) that constrain the width of the margin are the support vectors. Figure: Image from scikit-learn.
  • 15. Mapping to Higher Dimensions: Sometimes data is not linearly separable. If the original finite-dimensional space is mapped into a much higher-dimensional space, the separation is made easier in that space. The SVM achieves this using the kernel trick.
  • 16. Mapping to Higher Dimensions: Mapping from 1D to 2D. Mapping from 2D to 3D. Figure: Courtesy Steve Gunn.
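The 1D to 2D mapping of slides 15 and 16 can be illustrated with a small sketch. The data set, the map `phi` (x to (x, x²)) and the threshold 2.5 are all hypothetical choices made for this illustration; they are not taken from the slides.

```python
# Toy 1D data: the positive class sits on both sides of the negative class,
# so no single threshold on x can separate the two classes.
points = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [+1, +1, -1, -1, -1, +1, +1]


def phi(x):
    """Hypothetical feature map from 1D to 2D: x -> (x, x^2)."""
    return (x, x * x)


def linearly_separable_1d(xs, ys):
    """True if one threshold puts all of one class below it and the other above."""
    pos = [x for x, y in zip(xs, ys) if y > 0]
    neg = [x for x, y in zip(xs, ys) if y < 0]
    return max(pos) < min(neg) or max(neg) < min(pos)


def classify_2d(point, threshold=2.5):
    """Linear rule in the mapped space: the horizontal line x2 = threshold."""
    _, x2 = point
    return +1 if x2 > threshold else -1


mapped = [phi(x) for x in points]
predictions = [classify_2d(p) for p in mapped]
```

In the mapped space a straight line reproduces every label, although no threshold works in the original space. A kernel SVM would not normally compute `phi` explicitly; the kernel trick evaluates inner products in the mapped space directly.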
  • 17. Identification of Relevant Sections in a Web Page for Web Search: Shallow techniques like keyword matching give unsatisfactory results. Search methodologies must focus more on contextual information than just keyword occurrences. The search term might not be a very differentiating term. It might not appear in the section at all. SQUINT: an SVM-based approach to identify sections of a Web page relevant to a Web search.
  • 19. Feature Generation: Word Rank Based Features; Bigram Rank Based Features; Coverage of Top Ranked Tokens; Query Word Frequency; Distance from the Query.
  • 20. Word Rank Based Features: The rank of a word is its position in the list of all words ordered by frequency of occurrence across all search results. The value of this feature is the frequency of the particular word in the given section. Bucketing can be used to reduce dimensionality.
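A minimal sketch of the word rank features of slide 20. The slides do not specify the bucketing scheme, so the fixed-width rank buckets, the whitespace tokenizer and the names `word_ranks` and `word_rank_features` are assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()


def word_ranks(all_sections):
    """Map each word to its rank (1 = most frequent) across all sections."""
    counts = Counter(tok for sec in all_sections for tok in tokenize(sec))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def word_rank_features(section, ranks, bucket_size=2, num_buckets=3):
    """One feature per rank bucket: occurrences in `section` of that bucket's words."""
    features = [0] * num_buckets
    for tok in tokenize(section):
        rank = ranks.get(tok)
        if rank is None:
            continue
        bucket = (rank - 1) // bucket_size
        if bucket < num_buckets:
            features[bucket] += 1
    return features


# Hypothetical sections standing in for the search results.
sections = [
    "svm svm kernel margin",
    "svm kernel cooking recipes",
    "svm margin",
]
ranks = word_ranks(sections)
features = word_rank_features(sections[0], ranks)
```

Without bucketing there would be one feature per ranked word. Grouping ranks into buckets keeps the feature vector short, at the cost of merging words with similar ranks.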
  • 21. Bigram Rank Based Features: A bigram is defined to be two consecutive words occurring in a section. E.g., the bigram "machine learning" may be more important than "machine" and "learning" separately. The value of the feature is calculated in the same way as for Word Rank Based Features.
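The same idea applied to bigrams might look as follows. This is a sketch under assumptions: the tokenizer, the tie-breaking rule and the example sections are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Pairs of consecutive lowercase tokens within one section."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def bigram_ranks(all_sections):
    """Map each bigram to its rank (1 = most frequent) across all sections."""
    counts = Counter(bg for sec in all_sections for bg in bigrams(sec))
    ordered = sorted(counts, key=lambda bg: (-counts[bg], bg))
    return {bg: rank for rank, bg in enumerate(ordered, start=1)}


# Hypothetical sections: "machine" alone also appears in an unrelated context.
sections = [
    "machine learning beats rules",
    "machine learning needs data",
    "washing machine repair",
]
ranks = bigram_ranks(sections)
section_counts = Counter(bigrams(sections[0]))
```

Here `("machine", "learning")` gets rank 1, while the single word "machine" also occurs in the unrelated third section. This is the distinction the slide points at.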
  • 22. Coverage of Top Ranked Tokens: Relevance may also be determined by the number of top ranked words which occur in the section. The value of this feature is the coverage of top ranked words per bucket.
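One plausible reading of "coverage per bucket" is the fraction of each bucket's words that occur at least once in the section. The sketch below assumes that reading; the function name and the ranked word list are hypothetical.

```python
def coverage_per_bucket(section, ranked_words, bucket_size=2):
    """Fraction of each rank bucket's words that occur at least once in `section`."""
    present = set(section.lower().split())
    coverage = []
    for start in range(0, len(ranked_words), bucket_size):
        bucket = ranked_words[start:start + bucket_size]
        hits = sum(1 for word in bucket if word in present)
        coverage.append(hits / len(bucket))
    return coverage


# Hypothetical ranking, most frequent word first.
ranked_words = ["svm", "kernel", "margin", "cooking"]
coverage = coverage_per_bucket("svm margin margin theory", ranked_words)
```

Unlike the word rank features, repeated occurrences do not count here: "margin" appears twice in the section but covers its bucket only once.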
  • 23. Distance from the Query: The intuition here is that the closer a section is to the query in the Web page, the more likely it is to be relevant. The value of this feature is the section-wise distance between the section in question and the nearest section which contains the query.
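The section-wise distance can be sketched as a difference of section indices. The single-word query, the whitespace tokenization and the `None` value for pages that never mention the query are assumptions of this sketch.

```python
def distance_from_query(sections, query):
    """For each section, the number of sections to the nearest one containing `query`.

    Returns None for every section when the query occurs nowhere on the page.
    """
    q = query.lower()
    hits = [i for i, sec in enumerate(sections) if q in sec.lower().split()]
    if not hits:
        return [None] * len(sections)
    return [min(abs(i - h) for h in hits) for i in range(len(sections))]


# Hypothetical page, already split into sections in document order.
page = ["site navigation", "intro to svm", "kernel details", "margin details", "footer"]
distances = distance_from_query(page, "svm")
```

A section that contains the query gets distance 0, and the value grows by one for each section further away in page order.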
  • 24. Query Word Frequency: The value of this feature is the frequency of the query word in the section. The value is normalized by the number of words in the section.
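The normalized query word frequency is a one-line computation. The whitespace tokenizer and the handling of empty sections are assumptions of this sketch.

```python
def query_word_frequency(section, query_word):
    """Occurrences of `query_word` in `section` divided by the section's word count."""
    tokens = section.lower().split()
    if not tokens:
        return 0.0  # an empty section has no occurrences to count
    return tokens.count(query_word.lower()) / len(tokens)


value = query_word_frequency("SVM margins make the SVM robust", "svm")
```

Here the query word occurs twice among six words, so the feature value is one third. The normalization keeps long sections from scoring higher only because of their length.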
  • 25. Training Set Generation: Query Google to get a set of pages. Clean each page: remove scripts, pictures, links, etc. Break each page into sections. Label each section of every page.
  • 26. Learning Algorithm: A Support Vector Machine with a linear kernel is used. Given the relatively high dimensionality of the feature vector, it is a reasonable choice to use an SVM. The predicted margins of each sample are used to get a non-binary metric of how relevant each section is.
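Using predicted margins as a graded relevance score can be sketched with a linear decision function. The weights, bias and feature vectors below are hypothetical stand-ins for a trained model; they are not values from SQUINT.

```python
def decision_value(weights, bias, features):
    """Signed score w . x + b of a linear SVM; the sign gives the predicted class."""
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical trained parameters and per-section feature vectors.
weights = [0.8, 0.3, -0.5]
bias = -0.4
section_features = {
    "intro": [0.0, 1.0, 2.0],
    "svm overview": [3.0, 1.0, 0.0],
    "footer": [0.0, 0.0, 3.0],
}

scores = {name: decision_value(weights, bias, x) for name, x in section_features.items()}
# Rank sections from most to least relevant by their signed score.
ranking = sorted(scores, key=scores.get, reverse=True)
```

The sign of the score alone would give only a relevant or not-relevant decision. Sorting by the score itself, which is proportional to the signed distance from the hyperplane, yields an ordering of sections within a page.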
  • 27. Conclusion: Support Vector Machines are an attractive approach to data modelling. Evaluations suggest that using information retrieval inspired features and some basic hints from summarization gives respectable accuracy with respect to detecting the most relevant section in a page. Thus SQUINT can have a large impact on the user’s overall search experience.
  • 28. References: (1) Cristianini, Nello and Shawe-Taylor, John. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000. (2) Siddharth Jonathan J.B., Riku Inoue and Jyotika Prasad. SQUINT: SVM for Identification of Relevant Sections in Web Pages for Web Search. (3) Wikipedia article on Machine Learning, http://en.wikipedia.org/wiki/Support_vector_machine (4) Machine Learning Course on Coursera, https://class.coursera.org/ml-2012-002/class/index