Big Data and Machine Learning
An introduction to Key Ideas
Mauritian JEDI
Bruce Bassett
bruce@saao.ac.za
AIMS/SAAO/UCT
Jan 2015
History of the JEDI concept
We developed the format at several SA workshops (2005-
2008)
NRF-Royal Society 5 year Bilateral with Portsmouth, Sussex
and Oxford: train new researchers & do excellent
cosmology research
• JEDI 1 – Langebaan 2008
• JEDI 2 – STIAS/Avalon 2008
• We are now past JEDI X…
Aim of the JEDI series: explore to find the most efficient way
of teaching & learning research, building new
collaborations and doing excellent research
“Sciama” Principles
• Creativity has to be nurtured creatively
• Ideas are a non-linear function of interaction – want as much
discussion/interaction as possible
• Learning is most efficient when it is fun, informal and playful.
• Academia is a small-world network…
• Hence personal contacts and networking are crucial for progress
• Being part of the “fratelli fisici” (Coleman) is important. People
need to know and trust you…
“Google” Principles
• Take good people and treat them really well.
• Trust that good things will come out…things that you can’t
predict beforehand.
• Get out of your comfort zone!
“Creativity requires chaos”. Talk to people you would not
normally talk to. Do things that scare you!
• Attitude and atmosphere are crucial: be friendly, have fun,
relax, enjoy yourself, be proactive, interact, work hard.
How does the JEDI work?
• Research is best learned by doing it with people who
do it better or differently than you.
• Work with a “screw-it let’s do it” attitude
• Work on coming up with and evaluating new ideas
• Work on real research projects in teams.
• You choose the projects you are interested in and how
you spend your time.
Success on different timescales
• 1-3 years: are there any ongoing projects between people
who met at the JEDI?
• 10-20 years: Successful if two people can look back and say,
“actually I first worked or became good friends with X at JEDI
and we have since written papers together, they took my
students for post-docs, they wrote a letter of reference for
me, examined my student’s thesis, helped referee my grant,
helped get me promoted etc…”
Brain Teaser
• A man tosses a coin 30 times and it comes up
heads 30 times in a row.
• What is the probability that it comes up heads
on the 31st coin toss?
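One way to reason about the teaser, offered as a sketch and not as the speaker's intended answer: the result depends on what we assume about the coin. If the coin is known to be fair, the history is irrelevant. If the bias is unknown and we assume a uniform prior on it (an assumption introduced here), Laplace's rule of succession applies. The function names below are hypothetical.

```python
from fractions import Fraction


def prob_next_heads_fair() -> Fraction:
    """If the coin is known to be fair, past tosses are irrelevant."""
    return Fraction(1, 2)


def prob_next_heads_unknown_bias(heads: int, tosses: int) -> Fraction:
    """Posterior predictive under a uniform Beta(1, 1) prior on the bias
    (Laplace's rule of succession): (heads + 1) / (tosses + 2)."""
    return Fraction(heads + 1, tosses + 2)


p_fair = prob_next_heads_fair()
p_bayes = prob_next_heads_unknown_bias(heads=30, tosses=30)

# Probability that a truly fair coin produces 30 heads in a row.
p_run = Fraction(1, 2) ** 30

print(p_fair, p_bayes, float(p_run))
```

The tiny value of `p_run` is the reason to question the fair-coin model before answering "one half".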
What is the scientific method?
• What is the first thing we do when we try to
understand something with physics/applied
mathematics?
• We build a toy model of it, a representation,
that we can study.
• We then study this simplified model and make
predictions.
Machine Learning
• In machine learning, we do the same. We
must choose a set of features that we think
are the most important to achieve our goals
• We then train the machine learning algorithm, and use it
to make predictions.
www.quora.com
Data Science in 3 nutshells
The Deeper Drivers
Data Science is really driven by the intersection of:
• Moore’s Law – cheaper, faster, smaller…
• Development of powerful, fast new algorithms that
take advantage of the computing power (e.g. Bayesian
methods)
• Turing completeness which allows near universal
application of the algorithms…
Moore’s Law applies to lots of things…
250,000 x more storage and
about 10 x Cheaper!
The Lean Startup Model
• What we are trying is very close to running a
startup in a competitive landscape
• In Lean Startup, the Minimum Viable Product
is central… test basic assumptions!
• The same is true in data science – start with
something very basic. You will learn a lot…
then build a better model.
A Very Simple & Brief Intro to
Machine Learning
Typically there are two classes of
problems people want solved…
• Classification – what group does this data fall
into? (e.g. male vs female, big spender vs
frugal shopper etc…)
• Regression – predict the value of this variable.
(e.g. how much money will our store make
next year?)
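The difference between the two problem classes can be sketched in a few lines. This is a minimal illustration using made-up toy numbers and two deliberately simple methods chosen here (a least-squares line and a nearest-class-mean rule); it is not a method taken from the talk.

```python
from statistics import mean

# Regression: predict a number. Toy, made-up revenue figures per year.
years = [1, 2, 3, 4]
revenue = [10.0, 12.0, 14.0, 16.0]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(years, revenue)
next_year_revenue = slope * 5 + intercept  # the output is a continuous value

# Classification: predict a group. Toy, made-up monthly spend per customer.
spend = [5.0, 8.0, 7.0, 90.0, 120.0, 100.0]
group = ["frugal", "frugal", "frugal", "big", "big", "big"]


def fit_class_means(xs, labels):
    """Store the mean feature value of each class."""
    return {c: mean([x for x, l in zip(xs, labels) if l == c]) for c in set(labels)}


def classify(x, class_means):
    """Assign x to the class whose mean is closest."""
    return min(class_means, key=lambda c: abs(x - class_means[c]))


class_means = fit_class_means(spend, group)
predicted_group = classify(60.0, class_means)  # the output is a discrete label

print(next_year_revenue, predicted_group)
```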
Separate these two classes…
Campbell et al, 2012
There are two basic steps in machine
learning
1. Feature extraction – what information do you pull
from the data to learn from?
(e.g. “you dunt neid atl the leytirs to reqd tjis”)
2. Apply the learning algorithm – feed the features to
the algorithm you have chosen and get the answers.
You can play with either step to get better results (and
there are algorithms that do both in one step, e.g.
deep learning, convnets).
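The garbled sentence above makes a handy toy for the two steps. In this sketch the features are the first letter, last letter and length of each word, and the "learning algorithm" is a trivial lookup table. Both choices, and the vocabulary (the presumed intended sentence), are assumptions made for illustration.

```python
def extract_features(word: str) -> tuple:
    """Step 1: keep only the first letter, last letter and length."""
    return (word[0], word[-1], len(word))


def train(vocabulary):
    """Step 2: a deliberately trivial learner that memorises features -> word."""
    return {extract_features(w): w for w in vocabulary}


def predict(model, word):
    """Look the features up; fall back to the input if they were never seen."""
    return model.get(extract_features(word), word)


# Assumed vocabulary: the words of the sentence as it was presumably meant.
vocabulary = ["you", "dont", "need", "all", "the", "letters", "to", "read", "this"]
model = train(vocabulary)

garbled = "you dunt neid atl the leytirs to reqd tjis"
decoded = " ".join(predict(model, w) for w in garbled.split())
print(decoded)
```

Changing either step changes the result: richer features would separate words that share first letter, last letter and length, and a smarter learner could cope with features it has never seen.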
There are typically two types of ML
problems…
• Supervised – “here are some examples with the
model answers. Learn from these and apply to
new examples…” (labeled data). Just like school.
Learn from Training set → Apply to Test data set
• Unsupervised – “Here is some data. I don’t know
anything, figure everything out yourself.”
(unlabeled data). This is basically clustering →
Nadeem’s dataset.
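A minimal sketch of the contrast, on made-up one-dimensional points. The supervised part uses the labels to compute one centre per class; the unsupervised part is a bare-bones 2-means clustering that never sees a label. The initialisation and iteration count are simplifying assumptions.

```python
from statistics import mean

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]  # toy, made-up measurements

# Supervised: labels ("model answers") come with the training data.
labels = ["a", "a", "a", "b", "b", "b"]
centres_supervised = {
    c: mean([p for p, l in zip(points, labels) if l == c]) for c in set(labels)
}


def predict(x, centres):
    """Return the label whose centre is closest to x."""
    return min(centres, key=lambda c: abs(x - centres[c]))


# Unsupervised: no labels at all; find two groups by 2-means clustering.
def two_means(xs, iterations=10):
    centres = [min(xs), max(xs)]  # simple deterministic initialisation
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centres[0]) <= abs(x - centres[1]) else 1
            clusters[nearest].append(x)
        # Move each centre to the mean of its cluster (keep it if empty).
        centres = [mean(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters


centres, clusters = two_means(points)
print(predict(7.0, centres_supervised), centres, clusters)
```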
Pitfalls and Warnings
https://www.topstocks.com.au/
1. Correlation is not causation…
If you look through enough correlations (and algorithms),
some of them will appear significant, just by chance…
But they have no real value.
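This effect is easy to reproduce. The sketch below generates a random target and 1000 equally random, unrelated features, then looks for the "best" correlation. The sample sizes, the seed and the 0.444 cut-off (roughly the two-sided 5% threshold for 20 samples) are choices made for this illustration.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)
n_samples, n_features = 20, 1000

# Pure noise: the target has nothing to do with any feature.
target = [rng.gauss(0, 1) for _ in range(n_samples)]
correlations = [
    pearson_r([rng.gauss(0, 1) for _ in range(n_samples)], target)
    for _ in range(n_features)
]

best_abs_r = max(abs(r) for r in correlations)
n_significant = sum(abs(r) > 0.444 for r in correlations)

print(f"strongest |r| = {best_abs_r:.2f}, 'significant' features = {n_significant}")
```

Around 5% of the noise features should clear the threshold, which is what "significant at the 5% level" means when nothing is going on.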
2. Representative training data
• If the data you train on is not similar to the
test data, you will usually get very bad results!
Representative Training
The Ugly Duckling lacked representative training data…
3. Overfitting
If your friend says “I know how to get to the
supermarket, follow me” and then goes to the
toilet before getting in the car, you probably
don’t need to follow them into the
bathroom…
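In code, following your friend into the bathroom looks like a model that copies every quirk of its training data. The sketch below uses made-up data where 20% of the labels are flipped at random, and compares a memorising model (1-nearest-neighbour) with a single learned threshold. The data generator, seed and both models are assumptions chosen for illustration.

```python
import random

rng = random.Random(1)


def make_data(n, noise=0.2):
    """Label is 1 when x > 0.5, but each label is flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data


train, test = make_data(200), make_data(2000)


def memoriser(x):
    """1-nearest-neighbour: copies the label of the closest training point, noise included."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


# Simple model: one threshold halfway between the two class means
# (assumes class 1 sits at larger x than class 0, as it does here).
xs0 = [x for x, y in train if y == 0]
xs1 = [x for x, y in train if y == 1]
threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2


def simple(x):
    return int(x > threshold)


def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)


for name, model in [("memoriser", memoriser), ("simple", simple)]:
    print(name, accuracy(model, train), accuracy(model, test))
```

The memoriser scores perfectly on the data it has seen and worse on new data; the gap between the two numbers is the signature of overfitting.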
Robust Classification…
Overfitting
Data Science: First Steps
Step 1. Determine sample size, an indicator of data depth.
Step 2. Know the number of numeric and character variables, an indicator
of data breadth.
Step 3. Calculate the percentage of missing data for each numeric variable.
Step 4. Histogram, plot or otherwise map each variable
Step 5. Start a search for unexpected values of each variable: improbable
values, and undefined values due to division by zero (e.g. 1/0).
Step 6. Know the nature of numeric variables. I.e., declare the formats of
the numerics as decimal, integer or date.
If your data has some nasty peculiarities you don’t know about, it can
really upset a clever algorithm.
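The steps above can be sketched with nothing but the standard library. The records are made up, `None` marks a missing value, and treating negative numbers as "improbable" is an assumption that only suits this toy table.

```python
import math
from collections import Counter

# Toy, made-up records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000.0, "city": "Port Louis"},
    {"age": None, "income": 48000.0, "city": "Curepipe"},
    {"age": 29, "income": float("inf"), "city": "Port Louis"},
    {"age": -5, "income": None, "city": None},
]

# Step 1: sample size.
n_rows = len(rows)
columns = list(rows[0])


def is_number(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def column_kind(col):
    """Steps 2 and 6: call a column numeric if all its present values are numbers."""
    values = [r[col] for r in rows if r[col] is not None]
    return "numeric" if values and all(is_number(v) for v in values) else "character"


kinds = {c: column_kind(c) for c in columns}
numeric = [c for c in columns if kinds[c] == "numeric"]

# Step 3: percentage of missing data for each numeric variable.
missing_pct = {c: 100 * sum(r[c] is None for r in rows) / n_rows for c in numeric}

# Step 4: a crude text "histogram" of a character variable.
city_counts = Counter(r["city"] for r in rows if r["city"] is not None)

# Step 5: non-finite values (e.g. from a division by zero) and improbable values.
suspicious = [
    (i, c, r[c])
    for i, r in enumerate(rows)
    for c in numeric
    if r[c] is not None and (not math.isfinite(r[c]) or r[c] < 0)
]

print(n_rows, kinds, missing_pct, dict(city_counts), suspicious)
```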
• Machine learning competition site
(kaggle.com)
• They give a training dataset and a test set for
which we need to predict the answers.
• We can submit up to 5 test submissions per
day until the competition closes.
• The final score is based on an unknown subset of
the test data.
The Titanic Problem
• Start with: https://www.kaggle.com/c/titanic-gettingStarted
• Do the tutorials!
• Read the forums (https://www.kaggle.com/c/titanic-gettingStarted/forums)
• Download the ipython notebook:
https://www.kaggle.com/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster
• This is a classification problem (0 = died, 1 = survived)
• Good luck!

Más contenido relacionado

La actualidad más candente

[Webinar] How Big Data and Machine Learning Are Transforming ITSM
[Webinar] How Big Data and Machine Learning Are Transforming ITSM[Webinar] How Big Data and Machine Learning Are Transforming ITSM
[Webinar] How Big Data and Machine Learning Are Transforming ITSMSunView Software, Inc.
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big DataDataWorks Summit
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learningPruet Boonma
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloOCTO Technology
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRaveen Perera
 
Machine Learning Introduction for Digital Business Leaders
Machine Learning Introduction for Digital Business LeadersMachine Learning Introduction for Digital Business Leaders
Machine Learning Introduction for Digital Business LeadersSudha Jamthe
 
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Ilkay Altintas, Ph.D.
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentationDavid Raj Kanthi
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceSampath Kumar
 
GTU GeekDay Data Science and Applications
GTU GeekDay Data Science and ApplicationsGTU GeekDay Data Science and Applications
GTU GeekDay Data Science and ApplicationsKürşat İNCE
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Edureka!
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksBICA Labs
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceNiko Vuokko
 
MIT Sloan: Intro to Machine Learning
MIT Sloan: Intro to Machine LearningMIT Sloan: Intro to Machine Learning
MIT Sloan: Intro to Machine LearningLex Fridman
 
Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI dayMohammed Barakat
 
Programming for data science in python
Programming for data science in pythonProgramming for data science in python
Programming for data science in pythonUmmeSalmaM1
 
Machine learning and big data
Machine learning and big dataMachine learning and big data
Machine learning and big dataPoo Kuan Hoong
 
How to become a Data Scientist?
How to become a Data Scientist? How to become a Data Scientist?
How to become a Data Scientist? HackerEarth
 
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine LearningIntroduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine LearningNik Spirin
 

La actualidad más candente (20)

[Webinar] How Big Data and Machine Learning Are Transforming ITSM
[Webinar] How Big Data and Machine Learning Are Transforming ITSM[Webinar] How Big Data and Machine Learning Are Transforming ITSM
[Webinar] How Big Data and Machine Learning Are Transforming ITSM
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao Paulo
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Machine Learning Introduction for Digital Business Leaders
Machine Learning Introduction for Digital Business LeadersMachine Learning Introduction for Digital Business Leaders
Machine Learning Introduction for Digital Business Leaders
 
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentation
 
Managing machine learning
Managing machine learningManaging machine learning
Managing machine learning
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
GTU GeekDay Data Science and Applications
GTU GeekDay Data Science and ApplicationsGTU GeekDay Data Science and Applications
GTU GeekDay Data Science and Applications
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
MIT Sloan: Intro to Machine Learning
MIT Sloan: Intro to Machine LearningMIT Sloan: Intro to Machine Learning
MIT Sloan: Intro to Machine Learning
 
Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI day
 
Programming for data science in python
Programming for data science in pythonProgramming for data science in python
Programming for data science in python
 
Machine learning and big data
Machine learning and big dataMachine learning and big data
Machine learning and big data
 
How to become a Data Scientist?
How to become a Data Scientist? How to become a Data Scientist?
How to become a Data Scientist?
 
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine LearningIntroduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
 

Similar a Mauritius Big Data and Machine Learning JEDI workshop

Morgan uw mse900 2020 040-25 v2.0
Morgan uw mse900 2020 040-25 v2.0Morgan uw mse900 2020 040-25 v2.0
Morgan uw mse900 2020 040-25 v2.0ddm314
 
Clare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science OnlineClare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science Onlinesfdatascience
 
Human computation, crowdsourcing and social: An industrial perspective
Human computation, crowdsourcing and social: An industrial perspectiveHuman computation, crowdsourcing and social: An industrial perspective
Human computation, crowdsourcing and social: An industrial perspectiveoralonso
 
Core Methods In Educational Data Mining
Core Methods In Educational Data MiningCore Methods In Educational Data Mining
Core Methods In Educational Data Miningebelani
 
How to Find your Research Idea? KMCA workshop3-1-2012
How to Find your Research Idea? KMCA workshop3-1-2012How to Find your Research Idea? KMCA workshop3-1-2012
How to Find your Research Idea? KMCA workshop3-1-2012Anwar F.A. Dafa-Alla
 
Computational Thinking - a 4 step approach and a new pedagogy
Computational Thinking - a 4 step approach and a new pedagogyComputational Thinking - a 4 step approach and a new pedagogy
Computational Thinking - a 4 step approach and a new pedagogyPaul Herring
 
Research Challenges – Am I Doing “Real” Research?
Research Challenges – Am I Doing “Real” Research?Research Challenges – Am I Doing “Real” Research?
Research Challenges – Am I Doing “Real” Research?Dr. Mazlan Abbas
 
CS Education for All. A new wave of opportunity
CS Education for All. A new wave of opportunityCS Education for All. A new wave of opportunity
CS Education for All. A new wave of opportunityPeter Donaldson
 
Research and Commercialisation Challenges
Research and Commercialisation ChallengesResearch and Commercialisation Challenges
Research and Commercialisation ChallengesDr. Mazlan Abbas
 
5th grade-Junior Achievement -our-nation-new-curriculum
5th grade-Junior Achievement -our-nation-new-curriculum5th grade-Junior Achievement -our-nation-new-curriculum
5th grade-Junior Achievement -our-nation-new-curriculumJW William
 
Data Science-1 (1).ppt
Data Science-1 (1).pptData Science-1 (1).ppt
Data Science-1 (1).pptSanjayAcharaya
 
Jordan Engbers - Making an Effective Data Scientist
Jordan Engbers - Making an Effective Data ScientistJordan Engbers - Making an Effective Data Scientist
Jordan Engbers - Making an Effective Data ScientistCybera Inc.
 
Reimagining authentic curriculum in the age of AI
Reimagining authentic curriculum in the age of AIReimagining authentic curriculum in the age of AI
Reimagining authentic curriculum in the age of AICharles Darwin University
 
How AI will change the way you help students succeed - SchooLinks
How AI will change the way you help students succeed - SchooLinksHow AI will change the way you help students succeed - SchooLinks
How AI will change the way you help students succeed - SchooLinksKatie Fang
 
Landing your first Data Science Job: The Technical Interview
Landing your first Data Science Job: The Technical InterviewLanding your first Data Science Job: The Technical Interview
Landing your first Data Science Job: The Technical InterviewAnidata
 
6_2019_10_31!10_52_47_PM.PPT
6_2019_10_31!10_52_47_PM.PPT6_2019_10_31!10_52_47_PM.PPT
6_2019_10_31!10_52_47_PM.PPTharvinderjabbal
 
The data science handbook pre release (1)
The data science handbook   pre release (1)The data science handbook   pre release (1)
The data science handbook pre release (1)Lakshmi Prasanna
 
Career introduction of Engineering Student SSVIT rizwan
Career introduction of Engineering Student SSVIT rizwanCareer introduction of Engineering Student SSVIT rizwan
Career introduction of Engineering Student SSVIT rizwanRizwan Khan
 

Similar a Mauritius Big Data and Machine Learning JEDI workshop (20)

Morgan uw mse900 2020 040-25 v2.0
Morgan uw mse900 2020 040-25 v2.0Morgan uw mse900 2020 040-25 v2.0
Morgan uw mse900 2020 040-25 v2.0
 
Clare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science OnlineClare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science Online
 
Smith "A Case Study in User Needs for Text Analysis"
Smith "A Case Study in User Needs for Text Analysis"Smith "A Case Study in User Needs for Text Analysis"
Smith "A Case Study in User Needs for Text Analysis"
 
PPT
PPTPPT
PPT
 
Human computation, crowdsourcing and social: An industrial perspective
Human computation, crowdsourcing and social: An industrial perspectiveHuman computation, crowdsourcing and social: An industrial perspective
Human computation, crowdsourcing and social: An industrial perspective
 
Core Methods In Educational Data Mining
Core Methods In Educational Data MiningCore Methods In Educational Data Mining
Core Methods In Educational Data Mining
 
How to Find your Research Idea? KMCA workshop3-1-2012
How to Find your Research Idea? KMCA workshop3-1-2012How to Find your Research Idea? KMCA workshop3-1-2012
How to Find your Research Idea? KMCA workshop3-1-2012
 
Computational Thinking - a 4 step approach and a new pedagogy
Computational Thinking - a 4 step approach and a new pedagogyComputational Thinking - a 4 step approach and a new pedagogy
Computational Thinking - a 4 step approach and a new pedagogy
 
Research Challenges – Am I Doing “Real” Research?
Research Challenges – Am I Doing “Real” Research?Research Challenges – Am I Doing “Real” Research?
Research Challenges – Am I Doing “Real” Research?
 
CS Education for All. A new wave of opportunity
CS Education for All. A new wave of opportunityCS Education for All. A new wave of opportunity
CS Education for All. A new wave of opportunity
 
Research and Commercialisation Challenges
Research and Commercialisation ChallengesResearch and Commercialisation Challenges
Research and Commercialisation Challenges
 
5th grade-Junior Achievement -our-nation-new-curriculum
5th grade-Junior Achievement -our-nation-new-curriculum5th grade-Junior Achievement -our-nation-new-curriculum
5th grade-Junior Achievement -our-nation-new-curriculum
 
Data Science-1 (1).ppt
Data Science-1 (1).pptData Science-1 (1).ppt
Data Science-1 (1).ppt
 
Jordan Engbers - Making an Effective Data Scientist
Jordan Engbers - Making an Effective Data ScientistJordan Engbers - Making an Effective Data Scientist
Jordan Engbers - Making an Effective Data Scientist
 
Reimagining authentic curriculum in the age of AI
Reimagining authentic curriculum in the age of AIReimagining authentic curriculum in the age of AI
Reimagining authentic curriculum in the age of AI
 
How AI will change the way you help students succeed - SchooLinks
How AI will change the way you help students succeed - SchooLinksHow AI will change the way you help students succeed - SchooLinks
How AI will change the way you help students succeed - SchooLinks
 
Landing your first Data Science Job: The Technical Interview
Landing your first Data Science Job: The Technical InterviewLanding your first Data Science Job: The Technical Interview
Landing your first Data Science Job: The Technical Interview
 
6_2019_10_31!10_52_47_PM.PPT
6_2019_10_31!10_52_47_PM.PPT6_2019_10_31!10_52_47_PM.PPT
6_2019_10_31!10_52_47_PM.PPT
 
The data science handbook pre release (1)
The data science handbook   pre release (1)The data science handbook   pre release (1)
The data science handbook pre release (1)
 
Career introduction of Engineering Student SSVIT rizwan
Career introduction of Engineering Student SSVIT rizwanCareer introduction of Engineering Student SSVIT rizwan
Career introduction of Engineering Student SSVIT rizwan
 

Más de CosmoAIMS Bassett

Testing dark energy as a function of scale
Testing dark energy as a function of scaleTesting dark energy as a function of scale
Testing dark energy as a function of scaleCosmoAIMS Bassett
 
Seminar by Prof Bruce Bassett at IAP, Paris, October 2013
Seminar by Prof Bruce Bassett at IAP, Paris, October 2013Seminar by Prof Bruce Bassett at IAP, Paris, October 2013
Seminar by Prof Bruce Bassett at IAP, Paris, October 2013CosmoAIMS Bassett
 
Cosmology with the 21cm line
Cosmology with the 21cm lineCosmology with the 21cm line
Cosmology with the 21cm lineCosmoAIMS Bassett
 
Tuning your radio to the cosmic dawn
Tuning your radio to the cosmic dawnTuning your radio to the cosmic dawn
Tuning your radio to the cosmic dawnCosmoAIMS Bassett
 
A short introduction to massive gravity... or ... Can one give a mass to the ...
A short introduction to massive gravity... or ... Can one give a mass to the ...A short introduction to massive gravity... or ... Can one give a mass to the ...
A short introduction to massive gravity... or ... Can one give a mass to the ...CosmoAIMS Bassett
 
Decomposing Profiles of SDSS Galaxies
Decomposing Profiles of SDSS GalaxiesDecomposing Profiles of SDSS Galaxies
Decomposing Profiles of SDSS GalaxiesCosmoAIMS Bassett
 
Cluster abundances and clustering Can theory step up to precision cosmology?
Cluster abundances and clustering Can theory step up to precision cosmology?Cluster abundances and clustering Can theory step up to precision cosmology?
Cluster abundances and clustering Can theory step up to precision cosmology?CosmoAIMS Bassett
 
An Overview of Gravitational Lensing
An Overview of Gravitational LensingAn Overview of Gravitational Lensing
An Overview of Gravitational LensingCosmoAIMS Bassett
 
Testing cosmology with galaxy clusters, the CMB and galaxy clustering
Testing cosmology with galaxy clusters, the CMB and galaxy clusteringTesting cosmology with galaxy clusters, the CMB and galaxy clustering
Testing cosmology with galaxy clusters, the CMB and galaxy clusteringCosmoAIMS Bassett
 
Galaxy Formation: An Overview
Galaxy Formation: An OverviewGalaxy Formation: An Overview
Galaxy Formation: An OverviewCosmoAIMS Bassett
 
Spit, Duct Tape, Baling Wire & Oral Tradition: Dealing With Radio Data
Spit, Duct Tape, Baling Wire & Oral Tradition: Dealing With Radio DataSpit, Duct Tape, Baling Wire & Oral Tradition: Dealing With Radio Data
Spit, Duct Tape, Baling Wire & Oral Tradition: Dealing With Radio DataCosmoAIMS Bassett
 
From Darkness, Light: Computing Cosmological Reionization
From Darkness, Light: Computing Cosmological ReionizationFrom Darkness, Light: Computing Cosmological Reionization
From Darkness, Light: Computing Cosmological ReionizationCosmoAIMS Bassett
 
WHAT CAN WE DEDUCE FROM STUDIES OF NEARBY GALAXY POPULATIONS?
WHAT CAN WE DEDUCE FROM STUDIES OF NEARBY GALAXY POPULATIONS?WHAT CAN WE DEDUCE FROM STUDIES OF NEARBY GALAXY POPULATIONS?
WHAT CAN WE DEDUCE FROM STUDIES OF NEARBY GALAXY POPULATIONS?CosmoAIMS Bassett
 
Binary pulsars as tools to study gravity
Binary pulsars as tools to study gravityBinary pulsars as tools to study gravity
Binary pulsars as tools to study gravityCosmoAIMS Bassett
 
Cross Matching EUCLID and SKA using the Likelihood Ratio
Cross Matching EUCLID and SKA using the Likelihood RatioCross Matching EUCLID and SKA using the Likelihood Ratio
Cross Matching EUCLID and SKA using the Likelihood RatioCosmoAIMS Bassett
 
Machine Learning Challenges in Astronomy
Machine Learning Challenges in AstronomyMachine Learning Challenges in Astronomy
Machine Learning Challenges in AstronomyCosmoAIMS Bassett
 
Cosmological Results from Planck
Cosmological Results from PlanckCosmological Results from Planck
Cosmological Results from PlanckCosmoAIMS Bassett
 

Más de CosmoAIMS Bassett (20)

Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clustering
 
Testing dark energy as a function of scale
Testing dark energy as a function of scaleTesting dark energy as a function of scale
Testing dark energy as a function of scale
 
Seminar by Prof Bruce Bassett at IAP, Paris, October 2013
Seminar by Prof Bruce Bassett at IAP, Paris, October 2013Seminar by Prof Bruce Bassett at IAP, Paris, October 2013
Seminar by Prof Bruce Bassett at IAP, Paris, October 2013
 
Cosmology with the 21cm line
Cosmology with the 21cm lineCosmology with the 21cm line
Cosmology with the 21cm line
 
Tuning your radio to the cosmic dawn
Tuning your radio to the cosmic dawnTuning your radio to the cosmic dawn
Tuning your radio to the cosmic dawn
 
A short introduction to massive gravity... or ... Can one give a mass to the ...
A short introduction to massive gravity... or ... Can one give a mass to the ...A short introduction to massive gravity... or ... Can one give a mass to the ...
A short introduction to massive gravity... or ... Can one give a mass to the ...
 
Decomposing Profiles of SDSS Galaxies
Decomposing Profiles of SDSS GalaxiesDecomposing Profiles of SDSS Galaxies
Decomposing Profiles of SDSS Galaxies
 
Cluster abundances and clustering Can theory step up to precision cosmology?
Cluster abundances and clustering Can theory step up to precision cosmology?Cluster abundances and clustering Can theory step up to precision cosmology?
Cluster abundances and clustering Can theory step up to precision cosmology?
 
An Overview of Gravitational Lensing
An Overview of Gravitational LensingAn Overview of Gravitational Lensing
An Overview of Gravitational Lensing
 
Testing cosmology with galaxy clusters, the CMB and galaxy clustering
Testing cosmology with galaxy clusters, the CMB and galaxy clusteringTesting cosmology with galaxy clusters, the CMB and galaxy clustering
Testing cosmology with galaxy clusters, the CMB and galaxy clustering
 
Galaxy Formation: An Overview
Galaxy Formation: An OverviewGalaxy Formation: An Overview
Galaxy Formation: An Overview
 
Spit, Duct Tape, Baling Wire & Oral Tradition: Dealing With Radio Data
Spit, Duct Tape, Baling Wire & Oral Tradition: Dealing With Radio DataSpit, Duct Tape, Baling Wire & Oral Tradition: Dealing With Radio Data
Spit, Duct Tape, Baling Wire & Oral Tradition: Dealing With Radio Data
 
MeerKAT: an overview
MeerKAT: an overviewMeerKAT: an overview
MeerKAT: an overview
 
Casa cookbook for KAT 7
Casa cookbook for KAT 7Casa cookbook for KAT 7
Casa cookbook for KAT 7
 
From Darkness, Light: Computing Cosmological Reionization
From Darkness, Light: Computing Cosmological ReionizationFrom Darkness, Light: Computing Cosmological Reionization
From Darkness, Light: Computing Cosmological Reionization
 
WHAT CAN WE DEDUCE FROM STUDIES OF NEARBY GALAXY POPULATIONS?
WHAT CAN WE DEDUCE FROM STUDIES OF NEARBY GALAXY POPULATIONS?WHAT CAN WE DEDUCE FROM STUDIES OF NEARBY GALAXY POPULATIONS?
WHAT CAN WE DEDUCE FROM STUDIES OF NEARBY GALAXY POPULATIONS?
 
Binary pulsars as tools to study gravity
Binary pulsars as tools to study gravityBinary pulsars as tools to study gravity
Binary pulsars as tools to study gravity
 
Cross Matching EUCLID and SKA using the Likelihood Ratio
Cross Matching EUCLID and SKA using the Likelihood RatioCross Matching EUCLID and SKA using the Likelihood Ratio
Cross Matching EUCLID and SKA using the Likelihood Ratio
 
Machine Learning Challenges in Astronomy
Machine Learning Challenges in AstronomyMachine Learning Challenges in Astronomy
Machine Learning Challenges in Astronomy
 
Cosmological Results from Planck
Cosmological Results from PlanckCosmological Results from Planck
Cosmological Results from Planck
 

Último

Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
Mauritius Big Data and Machine Learning JEDI workshop

  • 1. Big Data and Machine Learning An introduction to Key Ideas Mauritian JEDI Bruce Bassett bruce@saao.ac.za AIMS/SAAO/UCT Jan 2015
  • 2. History of the JEDI concept We developed the format at several SA workshops (2005- 2008) NRF-Royal Society 5 year Bilateral with Portsmouth, Sussex and Oxford: train new researchers & do excellent cosmology research • JEDI 1 – Langebaan 2008 • JEDI 2 – STIAS/Avalon 2008 • We are now past JEDI X… Aim of the JEDI series: explore to find the most efficient way of teaching & learning research, building new collaborations and doing excellent research
  • 3. “Sciama” Principles • Creativity has to be nurtured creatively • Ideas are a non-linear function of interaction – want as much discussion/interaction as possible • Learning is most efficient when it is fun, informal and playful. • Academia is a small-world network… • Hence personal contacts and networking are crucial for progress • Being part of the “fratelli fisici” (Coleman) is important. People need to know and trust you…
  • 4. “Google” Principles • Take good people and treat them really well. • Trust that good things will come out…things that you can’t predict beforehand. • Get out of your comfort zone! “Creativity requires chaos”. Talk to people you would not normally talk to. Do things that scare you! • Attitude and atmosphere are crucial: be friendly, have fun, relax, enjoy yourself, be proactive, interact, work hard.
  • 5. How does the JEDI work? • Research is best learned by doing it with people who do it better or differently than you. • Work with a “screw-it let’s do it” attitude • Work on coming up with and evaluating new ideas • Work on real research projects in teams. • You choose the projects you are interested in and how you spend your time.
  • 6. Success on different timescales • 1-3 years: are there any ongoing projects between people who met at the JEDI? • 10-20 years: Successful if two people can look back and say, “actually I first worked or became good friends with X at JEDI and we have since written papers together, they took my students for post-docs, they wrote a letter of reference for me, examined my student’s thesis, helped referee my grant, get me promoted etc…”
  • 7. Brain Teaser • A man tosses a coin 30 times and it comes up heads 30 times in a row. • What is the probability that it comes up heads on the 31st coin toss?
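The slide leaves the answer open, and it depends on what you assume about the coin. A minimal sketch of two possible readings follows: if the coin is known to be fair, the tosses are independent and the answer is 1/2; if the bias is unknown and we assume a uniform prior on it (an assumption added here, not stated on the slide), Laplace's rule of succession gives (heads + 1) / (tosses + 2).

```python
from fractions import Fraction

n_heads = 30  # heads observed so far, out of 30 tosses

# Reading A (assumption): the coin is known to be fair, so tosses are independent.
p_next_fair = Fraction(1, 2)
# Under that reading, a run of 30 heads is itself extremely unlikely.
p_streak_fair = Fraction(1, 2) ** n_heads

# Reading B (assumption): unknown bias with a uniform prior on P(heads).
# Laplace's rule of succession: (heads + 1) / (tosses + 2).
p_next_unknown = Fraction(n_heads + 1, n_heads + 2)

print(f"fair coin, P(next head)    = {float(p_next_fair)}")
print(f"fair coin, P(30 heads)     = {float(p_streak_fair):.3e}")
print(f"unknown bias, P(next head) = {float(p_next_unknown)}")
```

The gap between the two answers is the point of the teaser: the data can tell you that your starting model (a fair coin) deserves a second look.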
  • 8. What is the scientific method?
  • 9. • What is the first thing we do when we try to understand something with physics/applied mathematics? • We build a toy model of it, a representation, that we can study. • We then study this simplified model and make predictions.
  • 10. Machine Learning • In machine learning, we do the same. We must choose a set of features that we think are the most important to achieve our goals. • We then train the machine learning algorithm and use it to make predictions.
  • 12. The Deeper Drivers Data Science is really driven by the intersection of: • Moore’s Law – cheaper, faster, smaller… • Development of powerful, fast new algorithms that take advantage of the computing power (e.g. Bayesian methods) • Turing completeness which allows near universal application of the algorithms…
  • 13. Moore’s Law applies to lots of things…
  • 14. 250,000 x more storage and about 10 x Cheaper!
  • 15. The Lean Startup Model • What we are trying is very close to running a startup in a competitive landscape • In Lean Startup, the Minimum Viable Product is central… test basic assumptions! • The same is true in data science – start with something very basic. You will learn a lot… then build a better model.
  • 16. A Very Simple & Brief Intro to Machine Learning
  • 17. Typically there are two classes of problems people want solved… • Classification – what group does this data fall into? (e.g. male vs female, big spender vs frugal shopper etc…) • Regression – predict the value of this variable. (e.g. how much money will our store make next year?)
  • 18. Separate these two classes… Campbell et al, 2012
  • 19. There are two basic steps in machine learning 1. Feature extraction – what information do you pull from the data to learn from? (e.g. “you dunt neid atl the leytirs to reqd tjis”) 2. Apply the learning algorithm – feed the features to the algorithm you have chosen and get the answers. You can play with either step to get better results (and there are algorithms that do both in one step, e.g. deep learning, convnets).
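The two steps above can be sketched with a toy problem. Everything here is made up for illustration: the data (an inner disc and an outer ring of points), the two candidate features, and the very simple learner (a one-feature threshold rule). The same learning algorithm is run twice, once on a poor feature and once on a good one, to show how much the feature extraction step matters.

```python
import math
import random

rng = random.Random(0)

def make_points(n):
    """Toy data: label 0 = inner disc (radius 0-1), label 1 = outer ring (radius 2-3)."""
    points = []
    for i in range(n):
        label = i % 2
        radius = rng.uniform(0.0, 1.0) + 2.0 * label
        angle = rng.uniform(0.0, 2.0 * math.pi)
        points.append(((radius * math.cos(angle), radius * math.sin(angle)), label))
    return points

# Step 1: feature extraction. Two candidate features from the same raw (x, y).
def feature_x(point):
    return point[0]

def feature_radius(point):
    return math.hypot(point[0], point[1])

# Step 2: learning algorithm. A one-feature threshold rule ("decision stump").
def train_stump(values, labels):
    mean0 = sum(v for v, l in zip(values, labels) if l == 0) / labels.count(0)
    mean1 = sum(v for v, l in zip(values, labels) if l == 1) / labels.count(1)
    # Cut halfway between the class means; remember which class lies above the cut.
    return (mean0 + mean1) / 2.0, mean1 > mean0

def predict(stump, value):
    threshold, one_is_above = stump
    return int((value > threshold) == one_is_above)

def accuracy(feature, train, test):
    stump = train_stump([feature(p) for p, _ in train], [l for _, l in train])
    hits = sum(predict(stump, feature(p)) == l for p, l in test)
    return hits / len(test)

train, test = make_points(200), make_points(200)
acc_x = accuracy(feature_x, train, test)
acc_radius = accuracy(feature_radius, train, test)
print(f"feature x: {acc_x:.2f}, feature radius: {acc_radius:.2f}")
```

With the radius as the feature the two classes separate cleanly; with the raw x coordinate the same algorithm does little better than guessing.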
  • 20. There are typically two types of ML problems… • Supervised – “here are some examples with the model answers. Learn from these and apply to new examples…” (labeled data). Just like school. Learn from training set → apply to test data set • Unsupervised – “Here is some data. I don’t know anything, figure everything out yourself.” (unlabeled data). This is basically clustering → Nadeem’s dataset.
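A minimal sketch of the difference, using made-up one-dimensional measurements. The supervised learner here is a nearest-neighbour rule and the unsupervised one is k-means with two clusters; both are stand-ins chosen for brevity, not methods named in the slides.

```python
# Toy 1-D measurements (made up for illustration).
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["a", "a", "a", "b", "b", "b"]  # only the supervised learner sees these

def nearest_neighbour(train_values, train_labels, query):
    """Supervised: copy the label of the closest training example."""
    best = min(range(len(train_values)), key=lambda i: abs(train_values[i] - query))
    return train_labels[best]

def two_means(data, iterations=10):
    """Unsupervised: split unlabeled data into two clusters (k-means with k = 2)."""
    centres = [min(data), max(data)]
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centre.
        assignment = [0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1 for v in data]
        # Move each centre to the mean of its members.
        for k in (0, 1):
            members = [v for v, a in zip(data, assignment) if a == k]
            if members:
                centres[k] = sum(members) / len(members)
    return assignment, centres

predicted = nearest_neighbour(values, labels, 4.0)
clusters, centres = two_means(values)
print("supervised prediction for 4.0:", predicted)
print("unsupervised cluster ids:", clusters)
```

The clustering recovers the same two groups, but it can only number them; it has no way to know that one of them is called "a".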
  • 22. https://www.topstocks.com.au/ 1. Correlation is not causation… If you look through enough correlations (and algorithms), some of them will appear significant, just by chance… But they have no real value.
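The "just by chance" effect is easy to reproduce. The sketch below is an illustration with assumed sizes (20 samples, 1000 candidate variables): every series is pure random noise, yet the best of the candidates still correlates noticeably with the target.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(42)
n_samples, n_candidates = 20, 1000  # assumed sizes for the illustration

# The target and every candidate are independent noise: no real relationship exists.
target = [rng.gauss(0, 1) for _ in range(n_samples)]
strongest = 0.0
for _ in range(n_candidates):
    candidate = [rng.gauss(0, 1) for _ in range(n_samples)]
    strongest = max(strongest, abs(pearson(candidate, target)))

print(f"strongest |correlation| among {n_candidates} noise series: {strongest:.2f}")
```

Reporting only the winner of such a search, without saying how many candidates were tried, is how worthless correlations end up looking significant.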
  • 23. 2. Representative training data • If the data you train on is not similar to the test data, you will usually get very bad results!
  • 24. Representative Training The Ugly Duckling lacked representative training data…
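A minimal sketch of what goes wrong, using made-up one-dimensional data and a simple threshold classifier (both assumptions for illustration). The same trained model is scored on test data drawn like the training data, and on test data whose values have all been shifted.

```python
import random

rng = random.Random(1)

def sample(n, shift=0.0):
    """Toy 1-D data: class 0 centred on 0, class 1 centred on 4, plus an optional shift."""
    return [(rng.gauss(4.0 * (i % 2) + shift, 1.0), i % 2) for i in range(n)]

def train_threshold(data):
    """Place the decision threshold halfway between the two class means."""
    zeros = [x for x, y in data if y == 0]
    ones = [x for x, y in data if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0

def accuracy(threshold, data):
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)

threshold = train_threshold(sample(1000))
acc_same = accuracy(threshold, sample(1000))                # test data like the training data
acc_shifted = accuracy(threshold, sample(1000, shift=3.0))  # test data drawn differently

print(f"representative test set: {acc_same:.2f}")
print(f"shifted test set:        {acc_shifted:.2f}")
```

Nothing about the algorithm changed between the two scores; only the match between training and test data did.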
  • 25. 3. Overfitting If your friend says “I know how to get to the supermarket, follow me” and then goes to the toilet before getting in the car, you probably don’t need to follow them into the bathroom…
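In code, "following your friend into the bathroom" means copying details of the training data that have nothing to do with the real rule. The sketch below uses made-up data in which 20% of the labels are deliberately wrong, and compares a model that memorises every training example (1-nearest-neighbour) with a single threshold rule. Both models and all numbers are assumptions for illustration.

```python
import random

rng = random.Random(7)

def sample(n, noise=0.2):
    """Toy data: the true rule is y = 1 when x > 0.5; each label is flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def memoriser(train):
    """1-nearest-neighbour: reproduces every training label, noise included."""
    def predict(x):
        return min(train, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def threshold_rule(train):
    """Simple model: the single cut-off that gets the most training labels right."""
    def hits(t):
        return sum(int(x > t) == y for x, y in train)
    best = max((x for x, _ in train), key=hits)
    return lambda x: int(x > best)

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

train, test = sample(200), sample(2000)
memo, rule = memoriser(train), threshold_rule(train)
memo_train, memo_test = accuracy(memo, train), accuracy(memo, test)
rule_train, rule_test = accuracy(rule, train), accuracy(rule, test)

print(f"memoriser: train {memo_train:.2f}, test {memo_test:.2f}")
print(f"threshold: train {rule_train:.2f}, test {rule_test:.2f}")
```

A perfect training score combined with a clearly worse test score is the usual warning sign of overfitting.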
  • 28. Data Science: First Steps Step 1. Determine sample size, an indicator of data depth. Step 2. Know the number of numeric and character variables, an indicator of data breadth. Step 3. Calculate the percentage of missing data for each numeric variable. Step 4. Histogram, plot or otherwise map each variable. Step 5. Start a search for unexpected values of each variable: improbable values, and undefined values due to dividing by zero (e.g. 1/0). Step 6. Know the nature of numeric variables, i.e. declare the formats of the numerics as decimal, integer or date. If your data has some nasty peculiarities you don’t know about, it can really upset a clever algorithm.
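Steps 1, 2, 3 and 5 above can be sketched in a few lines of plain Python. The records, the column names and the "plausible" ranges below are all hypothetical, invented for this illustration; a real project would more likely load the data with a dataframe library.

```python
import math

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 34.0, "fare": 7.25, "name": "A"},
    {"age": None, "fare": 71.28, "name": "B"},
    {"age": 29.0, "fare": float("inf"), "name": "C"},
    {"age": 410.0, "fare": 8.05, "name": "D"},
]
# Assumed plausible ranges, chosen by hand for this illustration.
plausible = {"age": (0.0, 120.0), "fare": (0.0, 1000.0)}

def is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def first_look(rows, plausible):
    columns = list(rows[0])
    # Step 2: split the variables into numeric and character ones.
    numeric = [c for c in columns if all(r[c] is None or is_number(r[c]) for r in rows)]
    character = [c for c in columns if c not in numeric]
    # Step 3: percentage of missing data for each numeric variable.
    missing_pct = {c: 100.0 * sum(r[c] is None for r in rows) / len(rows) for c in numeric}
    # Step 5: improbable values, plus inf/NaN such as a division by zero can leave behind.
    suspicious = []
    for i, row in enumerate(rows):
        for c in numeric:
            value = row[c]
            if value is None:
                continue
            low, high = plausible[c]
            if math.isnan(value) or math.isinf(value) or not low <= value <= high:
                suspicious.append((i, c, value))
    return {
        "n_rows": len(rows),  # Step 1: sample size
        "numeric": numeric,
        "character": character,
        "missing_pct": missing_pct,
        "suspicious": suspicious,
    }

report = first_look(rows, plausible)
for key, value in report.items():
    print(key, "=", value)
```

Step 4 (plotting each variable) and step 6 (declaring formats) are left out to keep the sketch short.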
  • 29. • Machine learning competition site (kaggle.com) • They give a training dataset and a test set for which we need to predict the answers. • We can submit up to 5 test submissions per day until the competition closes. • The final score is based on an unknown subset of the test data.
  • 30. The Titanic Problem • Start with: https://www.kaggle.com/c/titanic-gettingStarted • Do the tutorials! • Read the forums (https://www.kaggle.com/c/titanic-gettingStarted/forums) • Download the ipython notebook: https://www.kaggle.com/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster • This is a classification problem (0 = died, 1 = survived) • Good luck!