Mauritius Big Data and Machine Learning JEDI workshop
1. Big Data and Machine Learning
An introduction to Key Ideas
Mauritian JEDI
Bruce Bassett
bruce@saao.ac.za
AIMS/SAAO/UCT
Jan 2015
2. History of the JEDI concept
We developed the format at several SA workshops (2005-2008)
NRF-Royal Society 5-year bilateral with Portsmouth, Sussex and Oxford: train new researchers & do excellent cosmology research
• JEDI 1 – Langebaan 2008
• JEDI 2 – STIAS/Avalon 2008
• We are now past JEDI X…
Aim of the JEDI series: to find the most efficient way of teaching & learning research, building new collaborations and doing excellent research
3. “Sciama” Principles
• Creativity has to be nurtured creatively
• Ideas are a non-linear function of interaction – we want as much discussion/interaction as possible
• Learning is most efficient when it is fun, informal and playful.
• Academia is a small-world network…
• Hence personal contacts and networking are crucial for progress
• Being part of the “fratelli fisici” (Coleman) is important. People
need to know and trust you…
4. “Google” Principles
• Take good people and treat them really well.
• Trust that good things will come out…things that you can’t predict beforehand.
• Get out of your comfort zone!
“Creativity requires chaos”. Talk to people you would not
normally talk to. Do things that scare you!
• Attitude and atmosphere are crucial: be friendly, have fun, relax, enjoy yourself, be proactive, interact, work hard.
5. How does the JEDI work?
• Research is best learned by doing it with people who do it better than you, or differently from you.
• Work with a “screw-it let’s do it” attitude
• Work on coming up with and evaluating new ideas
• Work on real research projects in teams.
• You choose the projects you are interested in and how
you spend your time.
6. Success on different timescales
• 1-3 years: are there any ongoing projects between people who met at the JEDI?
• 10-20 years: successful if two people can look back and say, “actually, I first worked with or became good friends with X at JEDI, and we have since written papers together; they took my students as post-docs, wrote a letter of reference for me, examined my student’s thesis, helped referee my grant, got me promoted, etc…”
7. Brain Teaser
• A man tosses a coin 30 times and it comes up
heads 30 times in a row.
• What is the probability that it comes up heads
on the 31st coin toss?
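The slides leave the teaser open, but the key point is independence: for a fair coin, the 31st toss comes up heads with probability 1/2 regardless of the previous 30 (although 30 heads in a row is strong evidence that the coin may not be fair at all). A quick simulation sketch, assuming a fair coin (Python is my choice here, not the slides'):

```python
import random

random.seed(42)
tosses = [random.random() < 0.5 for _ in range(100_000)]

# Independence check: the fraction of heads immediately following a
# head should match the overall fraction of heads (about 0.5).
after_head = [b for a, b in zip(tosses, tosses[1:]) if a]
p_heads = sum(tosses) / len(tosses)
p_heads_after_head = sum(after_head) / len(after_head)
print(p_heads, p_heads_after_head)
```

Both fractions come out near 0.5: the coin has no memory of its history.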
9. • What is the first thing we do when we try to
understand something with physics/applied
mathematics?
• We build a toy model of it, a representation,
that we can study.
• We then study this simplified model and make
predictions.
10. Machine Learning
• In machine learning we do the same: we must choose the set of features that we think are the most important for achieving our goals.
• We then train the machine learning algorithm and use it to make predictions.
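A minimal sketch of this choose-features / train / predict loop, using scikit-learn (an assumed tool, not prescribed by the slides; the features and numbers are invented):

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical chosen features: [height_cm, weight_kg]; labels 0 or 1.
X_train = [[150, 50], [160, 55], [175, 80], [180, 85]]
y_train = [0, 0, 1, 1]

# Train the algorithm on the chosen features...
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# ...then use it to make predictions on new examples.
print(model.predict([[155, 52], [178, 82]]))
```

Everything downstream depends on that first choice: swap in uninformative features and no algorithm will save you.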
12. The Deeper Drivers
Data Science is really driven by the intersection of:
• Moore’s Law – cheaper, faster, smaller…
• Development of powerful, fast new algorithms that
take advantage of the computing power (e.g. Bayesian
methods)
• Turing completeness which allows near universal
application of the algorithms…
15. The Lean Startup Model
• What we are trying to do is very close to running a startup in a competitive landscape
• In Lean Startup, the Minimum Viable Product
is central… test basic assumptions!
• The same is true in data science – start with
something very basic. You will learn a lot…
then build a better model.
17. Typically there are two classes of
problems people want solved…
• Classification – what group does this data fall
into? (e.g. male vs female, big spender vs
frugal shopper etc…)
• Regression – predict the value of this variable.
(e.g. how much money will our store make
next year?)
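The two classes of problem can be sketched side by side (a toy illustration with scikit-learn; all data is made up):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: which group does this example belong to?
clf = DecisionTreeClassifier(random_state=0)
clf.fit([[20], [25], [60], [70]], ["small", "small", "big", "big"])
print(clf.predict([[65]]))            # predicts a group label

# Regression: what value should we predict? (e.g. store revenue by year)
reg = LinearRegression()
reg.fit([[2011], [2012], [2013], [2014]], [100, 110, 120, 130])
print(reg.predict([[2015]]))          # predicts a number (~140 here)
```

Same fit/predict pattern in both cases; only the type of answer differs.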
19. There are two basic steps in machine
learning
1. Feature extraction – what information do you pull
from the data to learn from?
(e.g. “you dunt neid atl the leytirs to reqd tjis”)
2. Apply the learning algorithm – feed the features to
the algorithm you have chosen and get the answers.
You can play with either step to get better results (and
there are algorithms that do both in one step, e.g.
deep learning, convnets).
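The two steps can be written out explicitly, here as a scikit-learn pipeline for a made-up spam/ham example (both the library and the data are illustrative assumptions, not from the slides):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: feature extraction (word counts); step 2: learning algorithm.
pipe = Pipeline([
    ("features", CountVectorizer()),
    ("learner", MultinomialNB()),
])
texts = ["free money now", "win cash prize", "meeting at noon", "lunch tomorrow"]
labels = ["spam", "spam", "ham", "ham"]
pipe.fit(texts, labels)
print(pipe.predict(["free cash", "see you at lunch"]))
```

You can tune either stage independently: a better vectorizer, or a better learner, each improves the end result.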
20. There are typically two types of ML
problems…
• Supervised – “Here are some examples with the model
answers. Learn from these and apply to new
examples…” (labeled data). Just like school: learn from
the training set, then apply to the test set.
• Unsupervised – “Here is some data. I don’t know
anything, figure everything out yourself.”
(unlabeled data). This is basically clustering
(e.g. Nadeem’s dataset).
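A minimal unsupervised sketch (scikit-learn's KMeans on invented 2-D points; Nadeem's actual dataset is not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data in two obvious clumps; no answers are provided.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
              [8.0, 8.1], [7.9, 8.0], [8.1, 7.9]])

# The algorithm has to figure out the group structure by itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # two groups, discovered without any labels
```

The first three points land in one cluster and the last three in the other, with no labels ever supplied.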
22. https://www.topstocks.com.au/
1. Correlation is not causation…
If you look through enough correlations (and algorithms),
some of them will appear significant, just by chance…
But they have no real value.
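This pitfall is easy to demonstrate: correlate a random target against many equally random candidate series, and the best-looking one still appears impressive. A sketch in Python/NumPy (assumed tooling; all data is pure noise):

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=50)

# 1000 completely unrelated random series: the best of them will
# still correlate noticeably with the target, purely by chance.
candidates = rng.normal(size=(1000, 50))
corrs = [abs(np.corrcoef(c, target)[0, 1]) for c in candidates]
print(max(corrs))   # surprisingly large, yet entirely meaningless
```

This is the multiple-comparisons trap: search enough noise and you will always find a "signal".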
23. 2. Representative training data
• If the data you train on is not similar to the
test data, you will usually get very bad results!
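A small sketch of the problem (hypothetical numbers, scikit-learn assumed): a linear model trained only on y = x² for x in [0, 1] looks fine in that range, but fails badly when the test data comes from a different region:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# True relation is y = x**2; train on x in [0, 1], test on x in [3, 4].
x_train = rng.uniform(0, 1, 200)
x_test = rng.uniform(3, 4, 200)

model = LinearRegression().fit(x_train[:, None], x_train ** 2)
err_in = np.mean((model.predict(x_train[:, None]) - x_train ** 2) ** 2)
err_out = np.mean((model.predict(x_test[:, None]) - x_test ** 2) ** 2)
print(err_in, err_out)   # error explodes on unrepresentative test data
```

The model was never wrong about its own world; the world it was shown was simply not the one it was tested on.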
25. 3. Overfitting
If your friend says “I know how to get to the
supermarket, follow me” and then goes to the
toilet before getting in the car, you probably
don’t need to follow them into the
bathroom…
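In code, overfitting looks like this (a NumPy sketch with invented noisy data): a degree-9 polynomial follows every wiggle of the training noise, while a straight-line fit keeps only the real pattern:

```python
import numpy as np

rng = np.random.default_rng(2)

# Ten noisy samples of the straight line y = 2x.
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(scale=0.1, size=10)

# Degree 9 passes through every training point -- it memorises the
# noise (it follows the friend into the bathroom). Degree 1 does not.
overfit = np.polyfit(x, y, 9)
sensible = np.polyfit(x, y, 1)

x_new = np.linspace(0.05, 0.95, 10)        # unseen points
y_new = 2 * x_new + rng.normal(scale=0.1, size=10)

def mse(coeffs, xs, ys):
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

print(mse(overfit, x, y))        # essentially zero: the noise is memorised
print(mse(overfit, x_new, y_new))  # far larger on points it has not seen
```

A near-perfect training score is a warning sign, not an achievement.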
28. Data Science: First Steps
Step 1. Determine sample size, an indicator of data depth.
Step 2. Know the number of numeric and character variables, an indicator
of data breadth.
Step 3. Calculate the percentage of missing data for each numeric variable.
Step 4. Histogram, plot or otherwise map each variable
Step 5. Search each variable for unexpected values: improbable
values, and undefined values due to division by zero.
Step 6. Know the nature of numeric variables. I.e., declare the formats of
the numerics as decimal, integer or date.
If your data has some nasty peculiarities you don’t know about, it can
really upset a clever algorithm.
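With pandas (an assumed tool, not named in the slides), most of these first steps are one-liners. The toy table below is invented, including a deliberately improbable age:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a real dataset (all values made up).
df = pd.DataFrame({
    "age": [23, 45, np.nan, 31, 200],          # note the impossible 200
    "income": [50_000, 64_000, 58_000, np.nan, 61_000],
    "name": ["a", "b", "c", "d", "e"],
})

print(len(df))            # Step 1: sample size
print(df.dtypes)          # Steps 2 & 6: number and nature of variables
print(df.isna().mean())   # Step 3: fraction missing per variable
print(df.describe())      # Steps 4-5: ranges quickly expose odd values
```

Five minutes of this kind of inspection catches the "nasty peculiarities" before they upset a clever algorithm.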
• Kaggle (kaggle.com) is a machine learning
competition site
• They give a training dataset and a test set for
which we need to predict the answers.
• We can submit up to 5 test submissions per
day until the competition closes.
• The final score is based on an unknown subset of
the test data.
30. The Titanic Problem
• Start with: https://www.kaggle.com/c/titanic-gettingStarted
• Do the tutorials!
• Read the forums (https://www.kaggle.com/c/titanic-gettingStarted/forums)
• Download the IPython notebook: https://www.kaggle.com/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster
• This is a classification problem (0 = died, 1 = survived)
• Good luck!
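As a taste of a first baseline (the tiny table below is an invented stand-in for Kaggle's train.csv, which does contain Sex and Survived columns): predict that women survived and men did not, score it, then improve from there:

```python
import pandas as pd

# Invented mini-sample shaped like Kaggle's Titanic training data.
train = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0],
})

# Classic first baseline: predict 1 (survived) for women, 0 for men.
baseline = (train["Sex"] == "female").astype(int)
accuracy = (baseline == train["Survived"]).mean()
print(accuracy)
```

A one-feature rule like this is exactly the Minimum Viable Product from the lean-startup slide: crude, fast, and a benchmark every fancier model must beat.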