Data-driven modeling
                                           APAM E4990


                                           Jake Hofman

                                          Columbia University


                                         January 23, 2012




Jake Hofman   (Columbia University)        Data-driven modeling   January 23, 2012   1 / 19
Data-dependent products



              • Effective/practical systems that learn from experience impact
                our daily lives, e.g.:
                    •   Recommendation systems
                    •   Spam detection
                    •   Optical character recognition
                    •   Face recognition
                    •   Fraud detection
                    •   Machine translation
                    •   ...




Learning by example




              • How did you solve this problem?
              • Can you make this process explicit (e.g. write code to do so)?


              • We learn quickly from few, relatively unstructured examples ...
                but we don’t understand how we accomplish this
              • We’d like to develop algorithms that enable machines to learn
                by example from large data sets

Got data?
              • Web service APIs expose lots of data




              • Many free, public data sets available online




Black-boxified?




Roadmap?




                                      Step 1: Have data
                                      Step 2: ???
                                      Step 3: Profit




Roadmap, take two


              1    Get data
              2    Visualize/perform sanity checks
              3    Clean/filter observations
              4    Choose features to represent data
              5    Specify model
              6    Specify loss function
              7    Develop algorithm to minimize loss
              8    Choose performance measure
              9    “Train” to minimize loss
              10   “Test” to evaluate generalization
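
The ten steps above can be sketched end to end on a toy problem. This is only an illustration under stated assumptions, not the course's own code: the data are synthetic (a noisy line with made-up slope 2 and intercept 1), the model is linear, the loss is mean squared error, and the minimizer is plain batch gradient descent. The names `predict`, `mse`, and `fit`, the learning rate, and the 150/50 split are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Steps 1-3: get data (synthetic here), sanity-check it, drop bad rows.
x = rng.uniform(-3, 3, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)  # made-up ground truth
keep = np.isfinite(x) & np.isfinite(y)
x, y = x[keep], y[keep]

# Step 4: features are a constant column plus the raw input.
X = np.column_stack([np.ones_like(x), x])

# Step 5: the model is linear, y_hat = X @ w.
def predict(X, w):
    return X @ w

# Steps 6 and 8: mean squared error serves as both loss and performance measure.
def mse(X, y, w):
    return float(np.mean((predict(X, w) - y) ** 2))

# Step 7: batch gradient descent on the mean squared error.
def fit(X, y, lr=0.05, steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 * X.T @ (predict(X, w) - y) / len(y)
        w -= lr * grad
    return w

# Steps 9-10: train on the first 150 rows, test on the held-out rest.
split = 150
w = fit(X[:split], y[:split])
train_mse = mse(X[:split], y[:split], w)
test_mse = mse(X[split:], y[split:], w)
print("weights:", w)
print("train MSE:", train_mse, "test MSE:", test_mse)
```

Using the same quantity as loss and performance measure is a simplification; in practice the two often differ (for example, training on log loss but reporting accuracy).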



Topics


        • Supervised
            • k-nearest neighbors
            • Naive Bayes
            • Linear regression
            • Logistic regression
            • Support vector machines
            • Collaborative filtering
            • Matrix factorization

        • Unsupervised
            • K-means
            • Mixture models
            • Principal components analysis
            • Topic models


              • Data representation: feature space, selection, normalization
              • Model assessment: complexity control, cross-validation, ROC
                curve, Bayesian Occam’s razor
              • Large-scale learning


Everything old is new again¹


              • Many fields ...
                    • Statistics
                    • Pattern recognition
                    • Data mining
                    • Machine learning
              • ... similar goals
                    • Extract and recognize patterns in data
                    • Interpret or explain observations
                    • Test validity of hypotheses
                    • Efficiently search the space of hypotheses
                    • Design efficient algorithms enabling machines to learn from
                      data




              ¹ http://cbcl.mit.edu/publications/theses/thesis-rifkin.pdf
Statistics vs. machine learning²

                          Glossary

    Machine learning                 Statistics

    network, graphs                  model
    weights                          parameters
    learning                         fitting
    generalization                   test set performance
    supervised learning              regression/classification
    unsupervised learning            density estimation, clustering
    large grant = $1,000,000         large grant = $50,000
    nice place to have a meeting:    nice place to have a meeting:
    Snowbird, Utah, French Alps      Las Vegas in August

    ² http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/
Philosophy

              • We would like models that:
                 • Provide predictive and explanatory power
                 • Are complex enough to describe observed phenomena
                 • Are simple enough to generalize to future observations
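
The tension between "complex enough" and "simple enough" can be made concrete with a small experiment. The sketch below is an illustration only, under assumptions that are not in the slides: the data are a noisy sine curve, model complexity is the polynomial degree, and the sample sizes, noise level, and degrees are arbitrary. Training error can only fall as the degree grows, while error on held-out points typically stops improving and then worsens.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(1)

def make_data(n):
    # Hypothetical data source: sin(2*pi*x) plus Gaussian noise.
    x = rng.uniform(0, 1, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # held-out "future observations"

def errors(degree):
    # Least-squares polynomial fit using the training set only.
    coefs = P.polyfit(x_train, y_train, degree)
    train = float(np.mean((P.polyval(x_train, coefs) - y_train) ** 2))
    test = float(np.mean((P.polyval(x_test, coefs) - y_test) ** 2))
    return train, test

results = {d: errors(d) for d in (0, 1, 3, 9)}
for d, (train, test) in results.items():
    print(f"degree {d}: train MSE {train:.3f}, test MSE {test:.3f}")
```

Comparing the two columns of the printout for each degree shows where extra complexity stops paying off on this particular draw; other seeds will give different numbers.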




Example: Netflix Prize




Shipping = Feature




Disclaimer


       You may be bored if you already know how to ...
              • Acquire data from APIs
              • Clean/explore/visualize data
              • Classify and cluster various types of data (e.g., images, text)
              • Code in Python, R, SciPy/NumPy, etc.
              • Scale solutions to large data sets (e.g. Hadoop, SGD)
              • Script with unix tools on the command line, e.g.

                $ sed -e 's/<[^>]*>//g' < page.html > page.txt
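
For readers more at home in Python than in sed, the one-liner above can be approximated as follows. This is a rough sketch, not part of the original slides: `strip_tags` is a hypothetical helper name, and, like the sed command, it works line by line and simply deletes anything between `<` and `>`, so it is a quick hack and not a real HTML parser.

```python
import re

# Same pattern as the sed expression: "<", any run of non-">" characters, ">".
TAG = re.compile(r"<[^>]*>")

def strip_tags(html):
    """Delete <...> spans one line at a time, as sed does."""
    return "".join(TAG.sub("", line) for line in html.splitlines(keepends=True))

print(strip_tags("<p>Hello, <b>world</b>!</p>"))  # prints: Hello, world!
```

Processing each line separately mirrors sed's behaviour: a tag that is split across two lines is left untouched by both versions.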




Themes

                                      Data jeopardy

         Regardless of scale, it’s difficult to find the right questions to ask
                                    of the data

                                      Data hacking

        Cleaning and normalizing data is a substantial amount of the work
                        (and likely impacts results)

               The ability to iterate quickly, asking and answering many
                                 questions, is crucial

                  Hacks happen: sed/awk/grep are useful, and scale

                                      “Data science”

              Simple methods (e.g., linear models) work surprisingly well,
                            especially with lots of data

              It’s easy to cover your tracks; things are often much more
                             complicated than they appear

Predicting consumer activity with Web search
with Sharad Goel, Sébastien Lahaie, David Pennock, Duncan Watts

   [Figure: "Right Round" (panel c). Rank by Week, Mar-09 to Aug-09; series: Billboard, Search]

Jake Hofman   (Yahoo! Research)                Learning from Online Activity                 November 30, 2011   6 / 71
Search predictions
Motivation

   Does collective search activity provide useful predictive signal
   about real-world outcomes?

   [Figure: "Right Round" (panel c). Rank by Week, Mar-09 to Aug-09; series: Billboard, Search]

        Past work mainly focuses on predicting the present¹ and ignores
              baseline models trained on publicly available data

        [Figure: Flu Level (Percent) by Date, 2004 to 2010; series: Actual, Search, Autoregressive]

          ¹ Varian, 2009

        We predict future sales for movies, video games, and music

        [Figure, three panels: (a) "Transformers 2", Search Volume by Time to Release (Days), −30 to 30; (b) "Tom Clancy's HAWX", Search Volume by Time to Release (Days), −30 to 30; (c) "Right Round", Rank by Week, Mar-09 to Aug-09; series: Billboard, Search]

Search predictions
Search models




      For movies and video games, predict opening weekend box office
      and first month sales, respectively:

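          $$\log(\mathrm{revenue}) = \beta_0 + \beta_1 \log(\mathrm{search}) + \epsilon$$

      For music, predict following week’s Billboard Hot 100 rank:

          $$\mathrm{billboard}_{t+1} = \beta_0 + \beta_1\,\mathrm{search}_t + \beta_2\,\mathrm{search}_{t-1} + \epsilon$$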
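
As a rough illustration of the first model, a log-log regression of revenue on search volume can be fit by ordinary least squares. Everything concrete below is an assumption made for the sketch: the data are simulated, the "true" coefficients (3.0 and 0.8) and the noise level are invented, and `predict_revenue` is a hypothetical helper. None of it reproduces the authors' data or results.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated stand-in data; the coefficients below are made up for illustration.
beta0_true, beta1_true = 3.0, 0.8
search = rng.lognormal(mean=8.0, sigma=1.0, size=300)
noise = rng.normal(scale=0.3, size=300)
revenue = np.exp(beta0_true + beta1_true * np.log(search) + noise)

# Design matrix with an intercept column, then ordinary least squares
# for log(revenue) = beta0 + beta1 * log(search) + noise.
X = np.column_stack([np.ones(len(search)), np.log(search)])
beta_hat, *_ = np.linalg.lstsq(X, np.log(revenue), rcond=None)
print("estimated beta0, beta1:", beta_hat)

def predict_revenue(search_volume):
    """Point prediction on the original scale (ignores retransformation bias)."""
    return np.exp(beta_hat[0] + beta_hat[1] * np.log(search_volume))

print("predicted revenue at search volume 1000:", predict_revenue(1000.0))
```

The music model has the same shape: its design matrix would hold an intercept column plus the current and previous week's search values, with next week's rank as the target.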




Search predictions
Search volume




Jake Hofman   (Yahoo! Research)   Learning from Online Activity   November 30, 2011   11 / 71
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01

Más contenido relacionado

La actualidad más candente

Uncertainty Quantification in AI
Uncertainty Quantification in AIUncertainty Quantification in AI
Uncertainty Quantification in AIFlorian Wilhelm
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big DataDataWorks Summit
 
Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniquesVenkata Reddy Konasani
 
前端专利那些事儿
前端专利那些事儿前端专利那些事儿
前端专利那些事儿tblanlan
 
Introduction to Statistical Machine Learning
Introduction to Statistical Machine LearningIntroduction to Statistical Machine Learning
Introduction to Statistical Machine Learningmahutte
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysisVishwas N
 
Application of-statistics-in-CSE
Application of-statistics-in-CSEApplication of-statistics-in-CSE
Application of-statistics-in-CSEMashudRana9
 
Big Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityBig Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityRTTS
 
Overview of Data Cleaning.pdf
Overview of Data Cleaning.pdfOverview of Data Cleaning.pdf
Overview of Data Cleaning.pdfSheetalDandge
 
Logistic regression
Logistic regressionLogistic regression
Logistic regressionMartinHogg9
 
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation MaximizationAndres Mendez-Vazquez
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and AnalyticsSrinath Perera
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random ForestsCloudxLab
 
Data Mining Primitives, Languages & Systems
Data Mining Primitives, Languages & SystemsData Mining Primitives, Languages & Systems
Data Mining Primitives, Languages & SystemsNiloy Sikder
 
Bayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionBayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionAdnan Masood
 
Predictive analysis and modelling
Predictive analysis and modellingPredictive analysis and modelling
Predictive analysis and modellinglalit Lalitm7225
 

La actualidad más candente (20)

Uncertainty Quantification in AI
Uncertainty Quantification in AIUncertainty Quantification in AI
Uncertainty Quantification in AI
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
 
Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniques
 
前端专利那些事儿
前端专利那些事儿前端专利那些事儿
前端专利那些事儿
 
Introduction to Statistical Machine Learning
Introduction to Statistical Machine LearningIntroduction to Statistical Machine Learning
Introduction to Statistical Machine Learning
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 
Uncertainty in Deep Learning
Uncertainty in Deep LearningUncertainty in Deep Learning
Uncertainty in Deep Learning
 
Application of-statistics-in-CSE
Application of-statistics-in-CSEApplication of-statistics-in-CSE
Application of-statistics-in-CSE
 
Big Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityBig Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data Quality
 
Overview of Data Cleaning.pdf
Overview of Data Cleaning.pdfOverview of Data Cleaning.pdf
Overview of Data Cleaning.pdf
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization
 
2. visualization in data mining
2. visualization in data mining2. visualization in data mining
2. visualization in data mining
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and Analytics
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
Data Mining Primitives, Languages & Systems
Data Mining Primitives, Languages & SystemsData Mining Primitives, Languages & Systems
Data Mining Primitives, Languages & Systems
 
Bayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionBayesian Networks - A Brief Introduction
Bayesian Networks - A Brief Introduction
 
Predictive analysis and modelling
Predictive analysis and modellingPredictive analysis and modelling
Predictive analysis and modelling
 
Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clustering
 
Predictive Model
Predictive ModelPredictive Model
Predictive Model
 

Destacado

Paris e suas igrejas
Paris e suas igrejasParis e suas igrejas
Paris e suas igrejasfilipj2000
 
Summarizing ppoint
Summarizing ppointSummarizing ppoint
Summarizing ppointSusan Isbell
 
Lb oso polar-polar bear
Lb oso polar-polar bearLb oso polar-polar bear
Lb oso polar-polar bearfilipj2000
 
Jim sterne terametric_twitter-webinar_120910
Jim sterne terametric_twitter-webinar_120910Jim sterne terametric_twitter-webinar_120910
Jim sterne terametric_twitter-webinar_120910Terametric
 
Defrag: Applying Twitter Analytics in Real Time
Defrag: Applying Twitter Analytics in Real TimeDefrag: Applying Twitter Analytics in Real Time
Defrag: Applying Twitter Analytics in Real TimeTerametric
 
Hieu qua san xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao Viet
Hieu qua san  xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao VietHieu qua san  xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao Viet
Hieu qua san xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao VietHo Cao Viet
 
Visualizing geolocated Internet measurements
Visualizing geolocated Internet measurementsVisualizing geolocated Internet measurements
Visualizing geolocated Internet measurementsClaudio Squarcella
 
7강 기업교육론 20110413
7강 기업교육론 201104137강 기업교육론 20110413
7강 기업교육론 20110413조현경
 
Stroke symposiuma tpaper91411
Stroke symposiuma tpaper91411Stroke symposiuma tpaper91411
Stroke symposiuma tpaper91411Pat Maher
 
10강 기업교육론 20110504
10강 기업교육론 2011050410강 기업교육론 20110504
10강 기업교육론 20110504조현경
 
기업교육론 6장 학생발표자료
기업교육론 6장 학생발표자료기업교육론 6장 학생발표자료
기업교육론 6장 학생발표자료조현경
 
Learning from Web Activity
Learning from Web ActivityLearning from Web Activity
Learning from Web Activityjakehofman
 
Maherprofessional c vbio216
Maherprofessional c vbio216Maherprofessional c vbio216
Maherprofessional c vbio216Pat Maher
 

Destacado (20)

Creativity
CreativityCreativity
Creativity
 
移动产品界面适配设计
移动产品界面适配设计移动产品界面适配设计
移动产品界面适配设计
 
Niver helen
Niver helenNiver helen
Niver helen
 
Maine prevention savings 11 22-10
Maine prevention savings 11 22-10Maine prevention savings 11 22-10
Maine prevention savings 11 22-10
 
Paris e suas igrejas
Paris e suas igrejasParis e suas igrejas
Paris e suas igrejas
 
Tecnicas
TecnicasTecnicas
Tecnicas
 
Baile
Baile Baile
Baile
 
Summarizing ppoint
Summarizing ppointSummarizing ppoint
Summarizing ppoint
 
Lb oso polar-polar bear
Lb oso polar-polar bearLb oso polar-polar bear
Lb oso polar-polar bear
 
Jim sterne terametric_twitter-webinar_120910
Jim sterne terametric_twitter-webinar_120910Jim sterne terametric_twitter-webinar_120910
Jim sterne terametric_twitter-webinar_120910
 
Defrag: Applying Twitter Analytics in Real Time
Defrag: Applying Twitter Analytics in Real TimeDefrag: Applying Twitter Analytics in Real Time
Defrag: Applying Twitter Analytics in Real Time
 
Hieu qua san xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao Viet
Hieu qua san  xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao VietHieu qua san  xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao Viet
Hieu qua san xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao Viet
 
Visualizing geolocated Internet measurements
Visualizing geolocated Internet measurementsVisualizing geolocated Internet measurements
Visualizing geolocated Internet measurements
 
7강 기업교육론 20110413
7강 기업교육론 201104137강 기업교육론 20110413
7강 기업교육론 20110413
 
Stroke symposiuma tpaper91411
Stroke symposiuma tpaper91411Stroke symposiuma tpaper91411
Stroke symposiuma tpaper91411
 
10강 기업교육론 20110504
10강 기업교육론 2011050410강 기업교육론 20110504
10강 기업교육론 20110504
 
Spectra-Profile
Spectra-ProfileSpectra-Profile
Spectra-Profile
 
기업교육론 6장 학생발표자료
기업교육론 6장 학생발표자료기업교육론 6장 학생발표자료
기업교육론 6장 학생발표자료
 
Learning from Web Activity
Learning from Web ActivityLearning from Web Activity
Learning from Web Activity
 
Maherprofessional c vbio216
Maherprofessional c vbio216Maherprofessional c vbio216
Maherprofessional c vbio216
 

Similar a Data-driven modeling: Lecture 01

Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02jakehofman
 
Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09jakehofman
 
Computational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to CountingComputational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to Countingjakehofman
 
Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10jakehofman
 
Machine Learning: Learning with data
Machine Learning: Learning with dataMachine Learning: Learning with data
Machine Learning: Learning with dataONE Talks
 
One talk Machine Learning
One talk Machine LearningOne talk Machine Learning
One talk Machine LearningONE Talks
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLBigML, Inc
 
A data science observatory based on RAMP - rapid analytics and model prototyping
A data science observatory based on RAMP - rapid analytics and model prototypingA data science observatory based on RAMP - rapid analytics and model prototyping
A data science observatory based on RAMP - rapid analytics and model prototypingAkin Osman Kazakci
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine LearningLarge Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learningjaumebp
 
MACHINE LEARNING LIFE CYCLE
MACHINE LEARNING LIFE CYCLEMACHINE LEARNING LIFE CYCLE
MACHINE LEARNING LIFE CYCLEBhimsen Joshi
 
Dm sei-tutorial-v7
Dm sei-tutorial-v7Dm sei-tutorial-v7
Dm sei-tutorial-v7CS, NcState
 
Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...Robert Williams
 
Machine-Learned Ranking using Distributed Parallel Genetic Programming
Machine-Learned Ranking using Distributed Parallel Genetic ProgrammingMachine-Learned Ranking using Distributed Parallel Genetic Programming
Machine-Learned Ranking using Distributed Parallel Genetic Programmingrusho1234
 
Connections b/w active learning and model extraction
Connections b/w active learning and model extractionConnections b/w active learning and model extraction
Connections b/w active learning and model extractionAnmol Dwivedi
 
Algorithmic Fairness: A Brief Introduction
Algorithmic Fairness: A Brief IntroductionAlgorithmic Fairness: A Brief Introduction
Algorithmic Fairness: A Brief IntroductionAnthonyMelson
 

Similar a Data-driven modeling: Lecture 01 (20)

Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02
 
Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09
 
Computational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to CountingComputational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to Counting
 
Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 01

  • 1. Data-driven modeling APAM E4990 Jake Hofman Columbia University January 23, 2012 Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 1 / 19
  • 2. Data-dependent products
      • Effective/practical systems that learn from experience impact our daily lives, e.g.:
          • Recommendation systems
          • Spam detection
          • Optical character recognition
          • Face recognition
          • Fraud detection
          • Machine translation
          • ...
  • 3. Learning by example
  • 4. Learning by example
      • How did you solve this problem?
      • Can you make this process explicit (e.g. write code to do so)?
  • 5. Learning by example
      • We learn quickly from few, relatively unstructured examples ... but we don’t understand how we accomplish this
      • We’d like to develop algorithms that enable machines to learn by example from large data sets
  • 6. Got data?
      • Web service APIs expose lots of data
  • 7. Got data?
      • Many free, public data sets available online
  • 8. Black-boxified?
  • 9. Black-boxified?
  • 10. Roadmap?
      Step 1: Have data
      Step 2: ???
      Step 3: Profit
  • 11. Roadmap, take two
      1. Get data
  • 12. Roadmap, take two
      1. Get data; 2. Visualize/perform sanity checks; 3. Clean/filter observations; 4. Choose features to represent data
  • 13. Roadmap, take two
      1. Get data; 2. Visualize/perform sanity checks; 3. Clean/filter observations; 4. Choose features to represent data; 5. Specify model; 6. Specify loss function
  • 14. Roadmap, take two
      1. Get data; 2. Visualize/perform sanity checks; 3. Clean/filter observations; 4. Choose features to represent data; 5. Specify model; 6. Specify loss function; 7. Develop algorithm to minimize loss
  • 15. Roadmap, take two
      1. Get data
      2. Visualize/perform sanity checks
      3. Clean/filter observations
      4. Choose features to represent data
      5. Specify model
      6. Specify loss function
      7. Develop algorithm to minimize loss
      8. Choose performance measure
      9. “Train” to minimize loss
      10. “Test” to evaluate generalization
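As a minimal sketch of steps 5 through 10 of the roadmap, the following assumes synthetic one-dimensional data, a linear model, squared loss, and plain batch gradient descent. None of these choices comes from the slides, and all function names are hypothetical; the point is only to show where each step lands in code.

```python
import random

def make_data(n, seed=0):
    """Synthetic observations y = 2x + 1 + noise (an assumption for illustration)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]
    return list(zip(xs, ys))

def predict(w, b, x):
    # Step 5: the model, a line with slope w and intercept b.
    return w * x + b

def mean_squared_error(w, b, data):
    # Steps 6 and 8: squared loss doubles as the performance measure here.
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)

def train(data, lr=0.1, steps=500):
    # Steps 7 and 9: batch gradient descent on the training loss.
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2.0 * (predict(w, b, x) - y) * x for x, y in data) / n
        grad_b = sum(2.0 * (predict(w, b, x) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

data = make_data(200)
train_set, test_set = data[:150], data[150:]   # hold out data for step 10
w, b = train(train_set)
train_mse = mean_squared_error(w, b, train_set)
test_mse = mean_squared_error(w, b, test_set)  # step 10: estimate of generalization
print(f"w={w:.2f} b={b:.2f} train MSE={train_mse:.4f} test MSE={test_mse:.4f}")
```

Keeping the test set out of `train` is what makes the final number an estimate of generalization rather than of fit.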
  • 16. Topics
      • Supervised
          • k-nearest neighbors
          • Naive Bayes
          • Linear regression
          • Logistic regression
          • Support vector machines
          • Collaborative filtering
      • Unsupervised
          • K-means
          • Mixture models
          • Principal components analysis
          • Topic models
          • Matrix factorization
      • Data representation: feature space, selection, normalization
      • Model assessment: complexity control, cross-validation, ROC curve, Bayesian Occam’s razor
      • Large-scale learning
  • 17. Everything old is new again [1]
      • Many fields ...
          • Statistics
          • Pattern recognition
          • Data mining
          • Machine learning
      • ... similar goals
          • Extract and recognize patterns in data
          • Interpret or explain observations
          • Test validity of hypotheses
          • Efficiently search the space of hypotheses
          • Design efficient algorithms enabling machines to learn from data
      [1] http://cbcl.mit.edu/publications/theses/thesis-rifkin.pdf
  • 18. Statistics vs. machine learning [2]
      Glossary
      Machine learning                 | Statistics
      network, graphs                  | model
      weights                          | parameters
      learning                         | fitting
      generalization                   | test set performance
      supervised learning              | regression/classification
      unsupervised learning            | density estimation, clustering
      large grant = $1,000,000         | large grant = $50,000
      nice place to have a meeting:    | nice place to have a meeting:
      Snowbird, Utah, French Alps      | Las Vegas in August
      [2] http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/
  • 19. Philosophy
      • We would like models that:
          • Provide predictive and explanatory power
          • Are complex enough to describe observed phenomena
          • Are simple enough to generalize to future observations
  • 20. Example: Netflix Prize
  • 21. Example: Netflix Prize
  • 22. Shipping = Feature
  • 23. References
  • 24. Disclaimer
      You may be bored if you already know how to ...
      • Acquire data from APIs
      • Clean/explore/visualize data
      • Classify and cluster various types of data (e.g., images, text)
      • Code in Python, R, SciPy/NumPy, etc.
      • Scale solutions to large data sets (e.g. Hadoop, SGD)
      • Script with unix tools on the command line, e.g.
          $ sed -e 's/<[^>]*>//g' < page.html > page.txt
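For readers more at home in Python than in sed, the tag-stripping one-liner on this slide can be approximated with the standard `re` module. This is a sketch rather than an exact equivalent: the function name `strip_tags` is hypothetical, and unlike sed (which works line by line) the pattern here also removes tags that span several lines when the whole page is passed as one string.

```python
import re

# Same pattern as the sed expression: "<", any run of characters other than ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]*>")

def strip_tags(html):
    """Remove anything that looks like an HTML tag; a crude hack, not a parser."""
    return TAG_PATTERN.sub("", html)

sample = "<p>Hello, <b>data</b>!</p>"
print(strip_tags(sample))  # Hello, data!
```

Like the sed version, this leaves entities, scripts, and comments containing ">" unhandled, which is usually acceptable for quick exploration.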
  • 25. Themes
      Data jeopardy: Regardless of scale, it’s difficult to find the right questions to ask of the data
  • 26. Themes
      Data hacking: Cleaning and normalizing data is a substantial amount of the work (and likely impacts results)
  • 27. Themes
      Data hacking: The ability to iterate quickly, asking and answering many questions, is crucial
  • 28. Themes
      Data hacking: Hacks happen: sed/awk/grep are useful, and scale
  • 29. Themes
      “Data science”: Simple methods (e.g., linear models) work surprisingly well, especially with lots of data
  • 30. Themes
      “Data science”: It’s easy to cover your tracks; things are often much more complicated than they appear
  • 31.
  • 32. Predicting consumer activity with Web search
      with Sharad Goel, Sébastien Lahaie, David Pennock, Duncan Watts
      [Figure, panel c, "Right Round": Rank by Week, Mar-09 to Aug-09; series: Billboard, Search]
      Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 6 / 71
  • 33. Search predictions: Motivation
      Does collective search activity provide useful predictive signal about real-world outcomes?
      [Figure, panel c, "Right Round": Rank by Week, Mar-09 to Aug-09; series: Billboard, Search]
  • 34. Search predictions: Motivation
      Past work mainly focuses on predicting the present [1] and ignores baseline models trained on publicly available data
      [Figure: Flu Level (Percent) by Date, 2004 to 2010; series: Actual, Search, Autoregressive]
      [1] Varian, 2009
  • 35. Search predictions: Motivation
      We predict future sales for movies, video games, and music
      [Figure, panel a, "Transformers 2": Search Volume by Time to Release (Days), -30 to 30]
      [Figure, panel b, "Tom Clancy's HAWX": Search Volume by Time to Release (Days), -30 to 30]
      [Figure, panel c, "Right Round": Rank by Week, Mar-09 to Aug-09; series: Billboard, Search]
  • 36. Search predictions: Search models
      For movies and video games, predict opening weekend box office and first month sales, respectively:
          $$\log(\text{revenue}) = \beta_0 + \beta_1 \log(\text{search}) + \epsilon$$
      For music, predict following week’s Billboard Hot 100 rank:
          $$\text{billboard}_{t+1} = \beta_0 + \beta_1 \, \text{search}_t + \beta_2 \, \text{search}_{t-1} + \epsilon$$
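To make the log-log model for movies and video games concrete, here is a minimal sketch that fits log(revenue) against log(search) by ordinary least squares. The data are synthetic and generated with made-up coefficients (intercept 1.0, slope 0.8). The slide does not say how the original models were fit, so the closed-form estimator and the helper name `fit_simple_ols` are assumptions for illustration.

```python
import math
import random

def fit_simple_ols(xs, ys):
    """Closed-form least squares for y = b0 + b1 * x with a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Synthetic (made-up) data: log(revenue) = 1.0 + 0.8 * log(search) + noise.
rng = random.Random(1)
search = [math.exp(rng.uniform(5.0, 15.0)) for _ in range(300)]
revenue = [math.exp(1.0 + 0.8 * math.log(s) + rng.gauss(0.0, 0.2)) for s in search]

# Take logs of both sides, then fit a straight line.
log_search = [math.log(s) for s in search]
log_revenue = [math.log(r) for r in revenue]
beta0, beta1 = fit_simple_ols(log_search, log_revenue)
print(f"beta0={beta0:.2f} beta1={beta1:.2f}")
```

In a log-log fit the slope reads as an elasticity: with a slope of 0.8, a 1% increase in search volume goes with roughly a 0.8% increase in predicted revenue.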
  • 37. Search predictions: Search volume