SlideShare una empresa de Scribd logo
1 de 28
Descargar para leer sin conexión
Understanding your data
with Bayesian networks
(in python)
Bartek Wilczyński
bartek@mimuw.edu.pl
University of Warsaw
PyData Silicon Valey, May 5th 2014
Are you confused enough?
Or should I confuse you a bit more ?
Image from xkcd.org/552/
Data show: Confused students score better!
Data from Eric Mazur
There may be factors we haven't thought about
● Maybe confusion helps
with learning?
● Or maybe there is
an alternative explanation?
● As long as these are just
cartoon models – we
cannot really rule out any
structure
Paying
attention
Being
confused
Correct
answer
Being
confused
Correct
answer
or
What do I mean by data?
Sex Age Smoking Stress Lung Heart Feel
M 0-20 never N No no great
F 70 sometimes N minor no OK
M 50-70 daily Y no severe Not-so-well
M 20-50 daily N no minor OK
F 70 never N no minor great
F 20-50 sometimes Y severe minor Not-so-well
F 20-50 never Y no no great
M 20-50 sometimes N minor no great
M 50-70 never Y severe no OK
F 0-20 never N no severe OK
M 20-50 daily Y no no OK
M 0-20 daily N no no Not-so-well
M 20-50 never N minor no OK
.... ... ... ... ... ... ...
Network of connections
Smoking
(daily, sometimes, never)
Age
(0-20,20-50, 50-70,70+)
Stressful job
(yes,no)
Lung problems
(no,minor,severe)
Heart problems
(no,minor,severe)
Sex
(male,female)
How did you feel this morning?
(great, OK, not-so-well, terrible)
What is a Bayesian Network ?
●
A directed acyclic graph without cycles
●
with nodes representing random variables
●
and edges between nodes representing dependencies
(not necessarily causal)
●
Each edge is directed from a parent to a child, so all
nodes with connections to a given node constitute its
set of parents
●
Each variable is associated with a value domain and a
probability distribution conditional on parents' values
Back to our confused students
● Let us consider our model of
confused students
● We can consider the model
with an additional variable
● We need to heve data on the
additional variable to be
predictive
● Sometimes we need to use
“wrong” models if they are
predictive
Paying
attention
Being
confused
Correct
answer
Paying attention
yes no
confused 80% 0%
not confused 20% 100%
Paying
attention
Being
confused
Correct
answer
Paying attention
yes no
correct 50% 20%
incorrect 50% 80%
Can we find the “best” Bayesian Network?
● Given a dataset with observations,
we can try to find the “best”
network topology (i.e. the best
collection of parents' sets)
● In order to do it automatically we
need a scoring function to define
what we mean by “best”
● A score function is useful if it can
be written as a sum over
variables, i.e. the best network
consists of best parent sets for
variables (modulo acyclicity)
How to find the best network?
● There are generally three main approaches to defining BN scores:
– Bayesian statistics, e.g. BDe (Herskovits et al. '95)
– Information Theoretic, e.g. MDL (Lam et al. '94)
– Hypothesis testing, e.g. MMPC (Salehi et al. '10)
● There are also hybrid approaches, like the recent MIT (de Campos '06)
approach that uses information theory and hypothesis testing
● We have two issues:
– There are exponentially many potential parent sets
– The desired network needs to have no cycles
● The second issue is more important and makes the problem NP-complete
(Chickering '96)
Cycles are not always a problem
● Dynamic Bayesian
Networks are avariant of
BN models that describe
temporal dependencies
● We can safely assume that
the causal links only go
forward in time
● That breaks the problem of
cycles as we now have two
versions of each variable:
“before” and “after”
X1
X2
X3
X1 X1
t t+1
X2 X2
X3 X3
Different types of variables
● Another common situation is
when we have different types
of variables
● We may know that only
certain types of connections
are causal
● Or we may be interested only in
certain types of connections
● This breaks the cycles as well
Mutations
Protein expression
Diseases
BNFinder – python library for Bayesian Networks
● A library for identification of
optimal Bayesian Networks
● Works under assumption of
acyclicity by external
constraints (disjoint sets of
variables or dynamic
networks)
● fast and efficient (relatively)
Example1 – the simplest possible
Now, parallellize!
● Since we have external
constraints on acyclicity, we
can search for parent sets
independently
● This leads to a simple
parallelization scheme and
good efficiency
Bonn et al. Nat. Genet, 2012
Active Inactive
Making the training set for “activity” variable
Handling continuous data
Network model
Does it provide useful predictions?
• 12 positive and 4 negative predictions tested
• >90% success (1 error)
Some more continuous data with perturbations
• 8008 enhancers compiled
from 15 ChIP experiments
(almost 20k binding peaks)
• Activity data for ~140
enhancers divided into
– 3 tissues (MESO, VM, SM)
– 5 stages
(4-6,7-8,9-10,1112,13-16)
• Gene expression data for
5082 genes from the BDGP
database
Wilczynski et al.PLoS Comp.Biol 2012
Predictions validated:
19/20 correct stage, 10/20 correct tissue
Summary
● Bayesian Networks can provide predictive models based on
conditional probability distributions
● BNFinder is an effective tool for finding optimal networks given
tabular data. And it's open source!
● It can be used as a commandline tool or as a library
● It can use continuous data as well as discrete
● Can be run in parallel on multiple cores (with good efficiency)
● Convenience functions (cross-validation, ROC plots) included
http://launchpad.net/bnfinder
Thanks!
● Norbert Dojer
● Alina Frolova
● Paweł Bednarz
● Agnieszka Podsiadło
● Questions?

Más contenido relacionado

La actualidad más candente

Social Network Analysis
Social Network AnalysisSocial Network Analysis
Social Network AnalysisSujoy Bag
 
Learning Convolutional Neural Networks for Graphs
Learning Convolutional Neural Networks for GraphsLearning Convolutional Neural Networks for Graphs
Learning Convolutional Neural Networks for GraphsMathias Niepert
 
DeepWalk: Online Learning of Representations
DeepWalk: Online Learning of RepresentationsDeepWalk: Online Learning of Representations
DeepWalk: Online Learning of RepresentationsBryan Perozzi
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural Networksmilad abbasi
 
Training language models to follow instructions with human feedback.pdf
Training language models to follow instructions
with human feedback.pdfTraining language models to follow instructions
with human feedback.pdf
Training language models to follow instructions with human feedback.pdfPo-Chuan Chen
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBenjamin Bengfort
 
CounterFactual Explanations.pdf
CounterFactual Explanations.pdfCounterFactual Explanations.pdf
CounterFactual Explanations.pdfBong-Ho Lee
 
Principal component analysis, Code and Time Complexity
Principal component analysis, Code and Time ComplexityPrincipal component analysis, Code and Time Complexity
Principal component analysis, Code and Time ComplexityYounesCharfaoui
 
Social Recommendation
Social RecommendationSocial Recommendation
Social Recommendationgu wendong
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed BanditsDongmin Lee
 
Generative Adversarial Network (+Laplacian Pyramid GAN)
Generative Adversarial Network (+Laplacian Pyramid GAN)Generative Adversarial Network (+Laplacian Pyramid GAN)
Generative Adversarial Network (+Laplacian Pyramid GAN)NamHyuk Ahn
 
Generative Adversarial Networks and Their Medical Imaging Applications
Generative Adversarial Networks and Their Medical Imaging ApplicationsGenerative Adversarial Networks and Their Medical Imaging Applications
Generative Adversarial Networks and Their Medical Imaging ApplicationsKyuhwan Jung
 
Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)
Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)
Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)Krishnaram Kenthapadi
 
Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Krishnaram Kenthapadi
 
Graph Representation Learning
Graph Representation LearningGraph Representation Learning
Graph Representation LearningJure Leskovec
 
“Responsible AI and ModelOps in Industry: Practical Challenges and Lessons Le...
“Responsible AI and ModelOps in Industry: Practical Challenges and Lessons Le...“Responsible AI and ModelOps in Industry: Practical Challenges and Lessons Le...
“Responsible AI and ModelOps in Industry: Practical Challenges and Lessons Le...Edge AI and Vision Alliance
 
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
 A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs) A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)Thomas da Silva Paula
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networksDing Li
 
Autoencoders
AutoencodersAutoencoders
AutoencodersCloudxLab
 

La actualidad más candente (20)

Social Network Analysis
Social Network AnalysisSocial Network Analysis
Social Network Analysis
 
Learning Convolutional Neural Networks for Graphs
Learning Convolutional Neural Networks for GraphsLearning Convolutional Neural Networks for Graphs
Learning Convolutional Neural Networks for Graphs
 
DeepWalk: Online Learning of Representations
DeepWalk: Online Learning of RepresentationsDeepWalk: Online Learning of Representations
DeepWalk: Online Learning of Representations
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural Networks
 
Training language models to follow instructions with human feedback.pdf
Training language models to follow instructions
with human feedback.pdfTraining language models to follow instructions
with human feedback.pdf
Training language models to follow instructions with human feedback.pdf
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix Factorization
 
CounterFactual Explanations.pdf
CounterFactual Explanations.pdfCounterFactual Explanations.pdf
CounterFactual Explanations.pdf
 
Principal component analysis, Code and Time Complexity
Principal component analysis, Code and Time ComplexityPrincipal component analysis, Code and Time Complexity
Principal component analysis, Code and Time Complexity
 
Social Recommendation
Social RecommendationSocial Recommendation
Social Recommendation
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed Bandits
 
Generative Adversarial Network (+Laplacian Pyramid GAN)
Generative Adversarial Network (+Laplacian Pyramid GAN)Generative Adversarial Network (+Laplacian Pyramid GAN)
Generative Adversarial Network (+Laplacian Pyramid GAN)
 
Generative Adversarial Networks and Their Medical Imaging Applications
Generative Adversarial Networks and Their Medical Imaging ApplicationsGenerative Adversarial Networks and Their Medical Imaging Applications
Generative Adversarial Networks and Their Medical Imaging Applications
 
Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)
Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)
Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)
 
Machine Learning for Dummies
Machine Learning for DummiesMachine Learning for Dummies
Machine Learning for Dummies
 
Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)
 
Graph Representation Learning
Graph Representation LearningGraph Representation Learning
Graph Representation Learning
 
“Responsible AI and ModelOps in Industry: Practical Challenges and Lessons Le...
“Responsible AI and ModelOps in Industry: Practical Challenges and Lessons Le...“Responsible AI and ModelOps in Industry: Practical Challenges and Lessons Le...
“Responsible AI and ModelOps in Industry: Practical Challenges and Lessons Le...
 
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
 A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs) A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 

Similar a Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

The current state of prediction in neuroimaging
The current state of prediction in neuroimagingThe current state of prediction in neuroimaging
The current state of prediction in neuroimagingSaigeRutherford
 
Machine Learning Interview Questions Answers
Machine Learning Interview Questions AnswersMachine Learning Interview Questions Answers
Machine Learning Interview Questions AnswersShareDocView.com
 
41 essential machine learning interview questions!
41 essential machine learning interview questions!41 essential machine learning interview questions!
41 essential machine learning interview questions!SrinevethaAR
 
Module 1 introduction to machine learning
Module 1  introduction to machine learningModule 1  introduction to machine learning
Module 1 introduction to machine learningSara Hooker
 
Mixed Effects Models - Random Intercepts
Mixed Effects Models - Random InterceptsMixed Effects Models - Random Intercepts
Mixed Effects Models - Random InterceptsScott Fraundorf
 
AI Unit 5 machine learning
AI Unit 5 machine learning AI Unit 5 machine learning
AI Unit 5 machine learning Narayan Dhamala
 
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docxrhetttrevannion
 
Machine learning, health data & the limits of knowledge
Machine learning, health data & the limits of knowledgeMachine learning, health data & the limits of knowledge
Machine learning, health data & the limits of knowledgePaul Agapow
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.butest
 
Analysing & interpreting data.ppt
Analysing & interpreting data.pptAnalysing & interpreting data.ppt
Analysing & interpreting data.pptmanaswidebbarma1
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreTuri, Inc.
 
Lecture7 Ml Machines That Can Learn
Lecture7 Ml Machines That Can LearnLecture7 Ml Machines That Can Learn
Lecture7 Ml Machines That Can LearnKodok Ngorex
 
Module 1.3 data exploratory
Module 1.3  data exploratoryModule 1.3  data exploratory
Module 1.3 data exploratorySara Hooker
 
Big Data & ML for Clinical Data
Big Data & ML for Clinical DataBig Data & ML for Clinical Data
Big Data & ML for Clinical DataPaul Agapow
 
Dive into the Data
Dive into the DataDive into the Data
Dive into the Datadr_jp_ebejer
 
Ai4life aiml-xops-sig
Ai4life aiml-xops-sigAi4life aiml-xops-sig
Ai4life aiml-xops-sigmadhucharis
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsPaul Groth
 

Similar a Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014 (20)

The current state of prediction in neuroimaging
The current state of prediction in neuroimagingThe current state of prediction in neuroimaging
The current state of prediction in neuroimaging
 
Machine Learning Interview Questions Answers
Machine Learning Interview Questions AnswersMachine Learning Interview Questions Answers
Machine Learning Interview Questions Answers
 
41 essential machine learning interview questions!
41 essential machine learning interview questions!41 essential machine learning interview questions!
41 essential machine learning interview questions!
 
Module 1 introduction to machine learning
Module 1  introduction to machine learningModule 1  introduction to machine learning
Module 1 introduction to machine learning
 
Mixed Effects Models - Random Intercepts
Mixed Effects Models - Random InterceptsMixed Effects Models - Random Intercepts
Mixed Effects Models - Random Intercepts
 
Machine Learning - Deep Learning
Machine Learning - Deep LearningMachine Learning - Deep Learning
Machine Learning - Deep Learning
 
AI Unit 5 machine learning
AI Unit 5 machine learning AI Unit 5 machine learning
AI Unit 5 machine learning
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
 
Machine learning, health data & the limits of knowledge
Machine learning, health data & the limits of knowledgeMachine learning, health data & the limits of knowledge
Machine learning, health data & the limits of knowledge
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
 
Analysing & interpreting data.ppt
Analysing & interpreting data.pptAnalysing & interpreting data.ppt
Analysing & interpreting data.ppt
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
 
Lecture7 Ml Machines That Can Learn
Lecture7 Ml Machines That Can LearnLecture7 Ml Machines That Can Learn
Lecture7 Ml Machines That Can Learn
 
Module 1.3 data exploratory
Module 1.3  data exploratoryModule 1.3  data exploratory
Module 1.3 data exploratory
 
Big Data & ML for Clinical Data
Big Data & ML for Clinical DataBig Data & ML for Clinical Data
Big Data & ML for Clinical Data
 
Dive into the Data
Dive into the DataDive into the Data
Dive into the Data
 
Ai4life aiml-xops-sig
Ai4life aiml-xops-sigAi4life aiml-xops-sig
Ai4life aiml-xops-sig
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
 

Más de PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshPyData
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiPyData
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...PyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerPyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...PyData
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottPyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...PyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydPyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverPyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldPyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardPyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
 

Más de PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 

Último (20)

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 

Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

  • 1. Understanding your data with Bayesian networks (in python) Bartek Wilczyński bartek@mimuw.edu.pl University of Warsaw PyData Silicon Valey, May 5th 2014
  • 2. Are you confused enough? Or should I confuse you a bit more ? Image from xkcd.org/552/
  • 3. Data show: Confused students score better! Data from Eric Mazur
  • 4. There may be factors we haven't thought about ● Maybe confusion helps with learning? ● Or maybe there is an alternative explanation? ● As long as these are just cartoon models – we cannot really rule out any structure Paying attention Being confused Correct answer Being confused Correct answer or
  • 5. What do I mean by data? Sex Age Smoking Stress Lung Heart Feel M 0-20 never N No no great F 70 sometimes N minor no OK M 50-70 daily Y no severe Not-so-well M 20-50 daily N no minor OK F 70 never N no minor great F 20-50 sometimes Y severe minor Not-so-well F 20-50 never Y no no great M 20-50 sometimes N minor no great M 50-70 never Y severe no OK F 0-20 never N no severe OK M 20-50 daily Y no no OK M 0-20 daily N no no Not-so-well M 20-50 never N minor no OK .... ... ... ... ... ... ...
  • 6. Network of connections Smoking (daily, sometimes, never) Age (0-20,20-50, 50-70,70+) Stressful job (yes,no) Lung problems (no,minor,severe) Heart problems (no,minor,severe) Sex (male,female) How did you feel this morning? (great, OK, not-so-well, terrible)
  • 7. What is a Bayesian Network ? ● A directed acyclic graph without cycles ● with nodes representing random variables ● and edges between nodes representing dependencies (not necessarily causal) ● Each edge is directed from a parent to a child, so all nodes with connections to a given node constitute its set of parents ● Each variable is associated with a value domain and a probability distribution conditional on parents' values
  • 8. Back to our confused students ● Let us consider our model of confused students ● We can consider the model with an additional variable ● We need to heve data on the additional variable to be predictive ● Sometimes we need to use “wrong” models if they are predictive Paying attention Being confused Correct answer Paying attention yes no confused 80% 0% not confused 20% 100% Paying attention Being confused Correct answer Paying attention yes no correct 50% 20% incorrect 50% 80%
  • 9. Can we find the “best” Bayesian Network? ● Given a dataset with observations, we can try to find the “best” network topology (i.e. the best collection of parents' sets) ● In order to do it automatically we need a scoring function to define what we mean by “best” ● A score function is useful if it can be written as a sum over variables, i.e. the best network consists of best parent sets for variables (modulo acyclicity)
  • 10. How to find the best network? ● There are generally three main approaches to defining BN scores: – Bayesian statistics, e.g. BDe (Herskovits et al. '95) – Information Theoretic, e.g. MDL (Lam et al. '94) – Hypothesis testing, e.g. MMPC (Salehi et al. '10) ● There are also hybrid approaches, like the recent MIT (de Campos '06) approach that uses information theory and hypothesis testing ● We have two issues: – There are exponentially many potential parent sets – The desired network needs to have no cycles ● The second issue is more important and makes the problem NP-complete (Chickering '96)
  • 11. Cycles are not always a problem ● Dynamic Bayesian Networks are avariant of BN models that describe temporal dependencies ● We can safely assume that the causal links only go forward in time ● That breaks the problem of cycles as we now have two versions of each variable: “before” and “after” X1 X2 X3 X1 X1 t t+1 X2 X2 X3 X3
  • 12. Different types of variables ● Another common situation is when we have different types of variables ● We may know that only certain types of connections are causal ● Or we may be interested only in certain types of connections ● This breaks the cycles as well Mutations Protein expression Diseases
  • 13. BNFinder – python library for Bayesian Networks ● A library for identification of optimal Bayesian Networks ● Works under assumption of acyclicity by external constraints (disjoint sets of variables or dynamic networks) ● fast and efficient (relatively)
  • 14. Example1 – the simplest possible
  • 15. Now, parallellize! ● Since we have external constraints on acyclicity, we can search for parent sets independently ● This leads to a simple parallelization scheme and good efficiency
  • 16. Bonn et al. Nat. Genet, 2012
  • 18. Making the training set for “activity” variable
  • 21.
  • 22. Does it provide useful predictions? • 12 positive and 4 negative predictions tested • >90% success (1 error)
  • 23. Some more continuous data with perturbations
  • 24. • 8008 enhancers compiled from 15 ChIP experiments (almost 20k binding peaks) • Activity data for ~140 enhancers divided into – 3 tissues (MESO, VM, SM) – 5 stages (4-6,7-8,9-10,1112,13-16) • Gene expression data for 5082 genes from the BDGP database Wilczynski et al.PLoS Comp.Biol 2012
  • 25.
  • 26. Predictions validated: 19/20 correct stage, 10/20 correct tissue
  • 27. Summary ● Bayesian Networks can provide predictive models based on conditional probability distributions ● BNFinder is an effective tool for finding optimal networks given tabular data. And it's open source! ● It can be used as a commandline tool or as a library ● It can use continuous data as well as discrete ● Can be run in parallel on multiple cores (with good efficiency) ● Convenience functions (cross-validation, ROC plots) included http://launchpad.net/bnfinder
  • 28. Thanks! ● Norbert Dojer ● Alina Frolova ● Paweł Bednarz ● Agnieszka Podsiadło ● Questions?