This document discusses machine learning and its use in business. It explains that machine learning algorithms are trained on data to make predictions, getting more accurate over time as more data is fed to the algorithms. Typical machine learning involves training algorithms on a dataset with known values, then testing the trained algorithms on "fresh" test data to measure prediction accuracy. While machine learning focuses on prediction rather than causality, data scientists aim to avoid overfitting to improve out-of-sample prediction accuracy. Competitions like Kaggle also use a standard training and test set approach to benchmark machine learning models.
1. MK99 – Big Data
Big data & cross-platform analytics
MOOC lectures Pr. Clement Levallois
2. MK99 – Big Data
A short note on machine learning for business
3. MK99 – Big Data
Machine Learning
•A family of techniques for making predictions from data
•Why is it called machine learning?
–Machine: it is about algorithms running on computers, not equations solved with pen and paper
–Learning: the algorithms start out with poor accuracy. They then become more accurate as they are fed data: the algorithm refines its parameters, it “learns”.
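A minimal sketch of what "refining its parameters" can look like. The slides do not name a specific method: gradient descent, the toy data, the single parameter `w` and the step size are all assumptions chosen for illustration.

```python
# Toy data (made up for illustration): the true relation is y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def mean_squared_error(w):
    """Average squared gap between the predictions w * x and the actual y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


w = 0.0                # the single parameter, starting from an uninformed guess
learning_rate = 0.01   # assumed step size, small enough to converge on this data
errors = []            # error recorded before each update

for step in range(200):
    errors.append(mean_squared_error(w))
    # Gradient of the mean squared error with respect to w.
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Nudge the parameter in the direction that reduces the error.
    w -= learning_rate * gradient

print(f"error before training: {errors[0]:.2f}")
print(f"error after training:  {mean_squared_error(w):.6f}")
print(f"learned parameter w:   {w:.3f}")
```

The error shrinks at every step while `w` moves toward the value that generated the data: this is the "learning" in the sense used above.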
4. MK99 – Big Data
Typical set up
1. We start with a training set
-> Data already collected: we know the actual values to be predicted
-> Ex: a list of consumers, their characteristics and their associated credit scores
2. The algorithms are trained on this set
-> A series of algorithms run on the training set. Their parameters are adjusted step by step so that the predictions match the actual values as closely as possible.
3. A test set (“fresh data”) is brought in
-> A list of consumer characteristics. Their credit scores are known but hidden from the algorithms.
4. The trained algorithms are run on the test set
-> Predict the credit score for each consumer in the test set, using the algorithms that were trained in step 2
5. Accuracy is measured
-> Given the correct values in the test set, how accurate were the algorithms?
-> Were the credit scores accurately predicted?
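The five steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the method used in the course: the data is invented, a single characteristic (income) stands in for the consumer characteristics, and a one-variable least-squares line stands in for "the algorithms".

```python
# Step 1: training set (hypothetical): (yearly income in thousands, credit score).
training_set = [(20, 500), (30, 560), (40, 610), (50, 670), (60, 720)]

# Step 3: test set (hypothetical). The scores exist but are not shown to the model.
test_set = [(25, 530), (45, 640), (55, 690)]


def train(data):
    """Step 2: fit score = intercept + slope * income by ordinary least squares."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, income):
    """Step 4: apply the trained model to one consumer."""
    intercept, slope = model
    return intercept + slope * income


model = train(training_set)
predictions = [predict(model, income) for income, _ in test_set]

# Step 5: reveal the hidden scores and measure the average prediction error.
mean_abs_error = sum(
    abs(predicted - actual)
    for predicted, (_, actual) in zip(predictions, test_set)
) / len(test_set)

print(f"mean absolute error on the test set: {mean_abs_error:.2f} points")
```

Mean absolute error is only one possible accuracy measure, chosen here because credit scores are numbers; a yes/no prediction would instead be scored by the share of correct answers.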
5. MK99 – Big Data
Vocabulary
•Data scientists “train” their model and then test it
•They are concerned with “out-of-sample” prediction
–That their model accurately predicts data points in the training set (the “sample”) is trivial
–It is the accuracy on the test set that matters!
–Predicting on the test set is called “out-of-sample” prediction
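Why in-sample accuracy is "trivial" can be seen with a model that simply memorizes its training set. A sketch on invented data: the two models below (`train_memorizer`, `train_threshold`) are hypothetical illustrations, not algorithms from the course.

```python
from collections import Counter

# Made-up data: (income in thousands, label), 1 = good credit, 0 = poor credit.
training_set = [(10, 0), (20, 0), (30, 0), (40, 0), (60, 1), (70, 1), (80, 1)]
test_set = [(15, 0), (35, 0), (65, 1), (75, 1), (90, 1)]


def train_memorizer(data):
    """Memorizing model: store every training point, and fall back to the
    most frequent training label for any point it has never seen."""
    lookup = dict(data)
    fallback = Counter(label for _, label in data).most_common(1)[0][0]
    return lambda x: lookup.get(x, fallback)


def train_threshold(data):
    """Simpler model: one cut-off halfway between the two class averages."""
    lows = [x for x, label in data if label == 0]
    highs = [x for x, label in data if label == 1]
    cutoff = (sum(lows) / len(lows) + sum(highs) / len(highs)) / 2
    return lambda x: 1 if x >= cutoff else 0


def accuracy(model, data):
    """Share of data points whose label the model predicts correctly."""
    return sum(model(x) == label for x, label in data) / len(data)


memorizer = train_memorizer(training_set)
threshold = train_threshold(training_set)

for name, model in [("memorizer", memorizer), ("threshold", threshold)]:
    print(f"{name}: in-sample {accuracy(model, training_set):.2f}, "
          f"out-of-sample {accuracy(model, test_set):.2f}")
```

Both models are perfect on the training set, but only the simpler one holds up on fresh data. The in-sample score cannot tell them apart; the out-of-sample score can.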
6. MK99 – Big Data
Why is machine learning (ML) so different from statistics?
•ML does not focus on causality – just prediction!
–Note: for this reason, ML cannot predict the effect of an intervention: it has no causal model.
•ML has a special concern for out-of-sample prediction
–It is especially careful to avoid over-fitting
•ML picks its algorithms from different academic disciplines
–Text, network relations, clustering, not just traditional statistics
•Coming from computer science, ML has affinities with big data
–Procedures optimized for speed and scale
But the best data scientists often started as statisticians / econometricians:
See Hal Varian: Chief Economist at Google
7. MK99 – Big Data
•Kaggle is a website hosting ML competitions; anybody can join
•Goal: make the best prediction on a dataset, with cash prizes
•From predicting clicks on ads to epileptic seizures
•Always the same setup: a training set, a test set, a scoring based on accuracy.
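A rough sketch of how such a scoring can rank participants. This is not Kaggle's actual code or API: the team names, labels and submissions are invented, and plain accuracy is assumed as the metric (real competitions each define their own).

```python
# Hidden correct answers for the test set (made up), known only to the organizer.
hidden_labels = [1, 0, 1, 1, 0, 0, 1, 0]

# Hypothetical submissions: each team's predictions for the same test set.
submissions = {
    "team_a": [1, 0, 0, 1, 0, 1, 1, 0],
    "team_b": [1, 0, 1, 1, 0, 0, 1, 1],
    "team_c": [0, 0, 1, 0, 1, 0, 1, 0],
}


def accuracy(predictions, actual):
    """Share of test-set rows where the prediction matches the hidden label."""
    correct = sum(p == a for p, a in zip(predictions, actual))
    return correct / len(actual)


# Score every submission against the hidden labels, best score first.
leaderboard = sorted(
    ((team, accuracy(preds, hidden_labels)) for team, preds in submissions.items()),
    key=lambda entry: entry[1],
    reverse=True,
)

for rank, (team, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {team}: {score:.3f}")
```

Participants never see `hidden_labels`, so the ranking reflects out-of-sample accuracy only.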
8. MK99 – Big Data
This slide presentation is part of a course offered by EMLYON Business School (www.em-lyon.com)
Contact Clement Levallois (levallois [at] em-lyon.com) for more information.