Topic: Machine learning, imbalanced data sets
Synthetic minority over-sampling technique (SMOTE)
Presented by Hector Franco, TCD
Basic concepts
1. Introduction
2. Recent developments
3. Algorithm description
4. Evaluation
5. Discussion
Multi-class problems become imbalanced when one class is compared against all the others.
In some cases the data set is too small to generalize well.
Text classification is an example of imbalanced data.
It can be used with tree kernels.
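The one-against-all point can be illustrated with a small count. This is only a sketch: the class names and sizes below are made up for illustration and do not come from the slides.

```python
from collections import Counter

def one_vs_rest_counts(labels, positive):
    """Count positives and negatives when `positive` is compared against all other classes."""
    counts = Counter(labels)
    pos = counts[positive]
    neg = len(labels) - pos
    return pos, neg

# Hypothetical, perfectly balanced four-class data set: 25 examples per class.
labels = ["sport"] * 25 + ["politics"] * 25 + ["science"] * 25 + ["arts"] * 25

pos, neg = one_vs_rest_counts(labels, "sport")
print(pos, neg)  # the "sport" class is now outnumbered three to one
```

Even though every class has the same size, each one-against-all sub-problem has three times as many negatives as positives.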

Effect of SMOTE and DEC (SDC)
[Figure: two panels, "After DEC alone" and "After SMOTE and DEC". Legend: majority sample, minority sample, synthetic sample.]
Introduction
By convention, the class with fewer examples is called the minority (or positive) class.
Recent developments in learning from imbalanced data sets
Between-class imbalance (the focus here).
Within-class imbalance.
It is important in text classification.
We focus on the minority class: we want high prediction accuracy for the minority class.
Two-class problem = multi-class problem.

[Figure: evaluation measure, annotated "not very good in unbalanced data".]
Popular evaluation measure for the imbalance problem.
Usually β = 1; β = 1 is also used in this paper.
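The formulas on this slide did not survive extraction. As a hedged sketch, assume the measure with the β parameter is the standard F-value built from precision and recall, and that the measure marked as unsuitable is plain accuracy. The function names and the confusion-matrix counts are illustrative assumptions.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all examples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def f_value(tp, fp, fn, beta=1.0):
    """Standard F-beta score for the positive (minority) class."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical data set: 10 minority examples out of 1000.
# A classifier that always predicts "majority" looks excellent on accuracy...
print(accuracy(tp=0, fp=0, tn=990, fn=10))  # 0.99
# ...but finds no minority example at all.
print(f_value(tp=0, fp=0, fn=10))           # 0.0
```

With β = 1 the F-value is the harmonic mean of precision and recall, so it only rises when the minority class is actually detected.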
AUC: area under the ROC curve (TP rate plotted against FP rate).
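A minimal sketch of how the area under the ROC curve can be computed from classifier scores, using the trapezoid rule over the (FP rate, TP rate) points. The scores and labels are invented for illustration, and the code assumes both classes are present.

```python
def roc_points(scores, labels):
    """Return ROC points (FP rate, TP rate), sweeping the threshold from high to low.

    labels are 1 for the positive (minority) class and 0 for the negative class.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoid-rule area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical scores: higher means "more likely positive".
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_points(scores, labels)))
```

In this example 8 of the 9 positive/negative pairs are ranked correctly, which matches the computed area of 8/9.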
Data level: change the distribution
    ◦ Make the data balanced
Algorithm level: modify the existing data mining algorithms
    ◦ Make new algorithms
Random over-sampling: duplicate minority examples.
Random under-sampling: can remove important data.
Remove noise.
SMOTE.
Combine under-sampling and over-sampling.
Find the hard examples and over-sample them.
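The two random strategies in the list above can be sketched in a few lines. The function names, the target sizes and the toy points are assumptions made for illustration.

```python
import random

def random_oversample(minority, target_size, rng):
    """Duplicate randomly chosen minority examples until there are target_size of them."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return list(minority) + extra

def random_undersample(majority, target_size, rng):
    """Keep a random subset of the majority class; the discarded examples are lost."""
    return rng.sample(majority, target_size)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority = [(1.0, 1.0), (1.2, 0.9)]
majority = [(float(i), 0.0) for i in range(10)]

over = random_oversample(minority, len(majority), rng)
under = random_undersample(majority, len(minority), rng)
print(len(over), len(under))  # 10 2
```

Over-sampling adds no new information because every extra example is an exact copy, while under-sampling throws away eight of the ten majority examples, which is the risk the slide points out.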
AdaBoost (increases the weights of misclassified examples) does not perform well on imbalanced data sets.
Improving how the weights of TP and FP are updated works better than weighting predictions based on TP and FP.
Use an SVM kernel.
Use a BMPM (Biased Minimax Probability Machine).
There are other cost-based learning methods…
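For reference, the weight update that the AdaBoost bullet refers to can be sketched as follows. This is the plain textbook update, not the modified schemes the slide alludes to; the function name and the toy weights are assumptions, and the code assumes normalized weights and a weighted error strictly between 0 and 0.5.

```python
import math

def adaboost_reweight(weights, correct):
    """One round of the standard AdaBoost update.

    weights: current example weights (summing to 1).
    correct: True where the weak learner classified the example correctly.
    Returns (alpha, new_weights).
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified examples are scaled up, correct ones scaled down.
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

# Ten equally weighted examples, the last two misclassified.
weights = [0.1] * 10
correct = [True] * 8 + [False] * 2
alpha, new_weights = adaboost_reweight(weights, correct)
print(round(alpha, 3), [round(w, 4) for w in new_weights])
```

After the update the misclassified examples carry half of the total weight. The update treats all errors alike, regardless of class, which is one way to read the slide's remark about imbalanced data.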

A new Over-Sampling Method:
Borderline-SMOTE.
Algorithms usually try to learn the borderline as exactly as possible.
Borderline-SMOTE1
Borderline-SMOTE2
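The slides that described the algorithm were images and are missing from the text. The sketch below follows the commonly published description of Borderline-SMOTE1 as an assumption: a minority example is in DANGER when at least half, but not all, of its m nearest neighbours belong to the majority class, and synthetic examples are interpolated between each DANGER example and some of its k nearest minority neighbours. The names `m`, `k`, `s` and the toy points are illustrative.

```python
import math
import random

def nearest(point, candidates, k):
    """Return the k candidates closest to point (Euclidean), skipping the point object itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]

def danger_set(minority, majority, m):
    """Minority examples with at least half, but not all, of their m neighbours in the majority class."""
    everything = minority + majority
    majority_ids = {id(x) for x in majority}
    danger = []
    for p in minority:
        m_major = sum(1 for n in nearest(p, everything, m) if id(n) in majority_ids)
        if m / 2 <= m_major < m:  # m_major == m would be treated as noise
            danger.append(p)
    return danger

def borderline_smote1(minority, majority, m, k, s, rng):
    """Create up to s synthetic examples per DANGER example, towards minority neighbours only."""
    synthetic = []
    for p in danger_set(minority, majority, m):
        neighbours = nearest(p, minority, k)
        for n in rng.sample(neighbours, min(s, len(neighbours))):
            gap = rng.random()  # random number in [0, 1)
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, n)))
    return synthetic

# Toy data: a safe minority cluster on the left, two minority points next to the majority.
minority = [(0.0, 0.0), (0.2, 0.0), (0.4, 0.0), (1.9, 0.0), (2.0, 0.0)]
majority = [(2.2, 0.0), (2.4, 0.0), (2.6, 0.0), (3.0, 0.0), (3.2, 0.0)]

print(danger_set(minority, majority, m=3))
print(borderline_smote1(minority, majority, m=3, k=3, s=2, rng=random.Random(0)))
```

Only the two minority points that sit next to the majority class end up in DANGER, so only they are over-sampled.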

Borderline-SMOTE2 also interpolates towards majority-class neighbours.
For those neighbours the random number is between 0 and 0.5, so the synthetic examples stay closer to the minority example.
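The effect of limiting the random number to the range 0 to 0.5 can be sketched with a single interpolation step. The function name and the flag `neighbour_is_majority` are hypothetical; the full range of 0 to 1 for minority neighbours is assumed from the usual SMOTE interpolation.

```python
import random

def interpolate(p, neighbour, neighbour_is_majority, rng):
    """Synthetic example between p and a neighbour.

    For a majority neighbour the gap is drawn from [0, 0.5), so the new
    point never passes the midpoint and stays on the minority side.
    """
    upper = 0.5 if neighbour_is_majority else 1.0
    gap = rng.random() * upper
    return tuple(a + gap * (b - a) for a, b in zip(p, neighbour))

rng = random.Random(0)
p = (0.0, 0.0)              # minority example in DANGER
majority_neighbour = (2.0, 0.0)

points = [interpolate(p, majority_neighbour, True, rng) for _ in range(5)]
print(points)  # every x coordinate is below 1.0, the midpoint
```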
Experiments
Nothing: baseline.
SMOTE
Random over-sampling
Borderline-SMOTE1
Borderline-SMOTE2

k = 5
10-fold cross-validation.
C4.5 classifier.
We only want to improve the prediction of the minority class.
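A rough sketch of how a 10-fold cross-validation split can be produced. The slides do not say how the folds were built or at which point the over-sampling was applied, so this only shows the index split; the function names and the data set size are assumptions.

```python
import random

def k_fold_indices(n, k, rng):
    """Shuffle range(n) and split it into k folds whose sizes differ by at most one."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[i::k] for i in range(k)]

def train_test_splits(n, k, rng):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    folds = k_fold_indices(n, k, rng)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical data set with 53 examples, split into 10 folds.
splits = list(train_test_splits(53, 10, random.Random(0)))
print(len(splits), [len(test) for _, test in splits])
```

Each example appears in exactly one test fold, so every example is used for testing once and for training nine times.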
Conclusion
Working with imbalanced data sets is a common problem.
Borderline examples are more easily misclassified.
Our methods are better than traditional SMOTE.
Open to research:
    ◦ How to define DANGER examples.
    ◦ How to determine the number of examples in DANGER.
    ◦ Combination with data mining algorithms.
You are free:
• to copy, distribute, display, and perform the work
• to make derivative works

Under the following conditions:
• Attribution. You must give the original author credit.
What does "Attribute this work" mean?
The page you came from contained embedded licensing metadata, including how the creator wishes to be attributed for re-use. You can use the HTML here to cite the work. Doing so will also include metadata on your page so that others can find the original work as well.

• Non-Commercial. You may not use this work for commercial purposes.
• For any reuse or distribution, you must make clear to others the licence terms of this work.
• Any of these conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.
