Public Data and Data Mining Competitions: What Are the Lessons?
Gregory Piatetsky-Shapiro
KDnuggets
© KDnuggets 2013
My Data
• PhD (‘84) in applying Machine Learning to databases
• Researcher at GTE Labs – started the first project on
Knowledge Discovery in Databases in 1989
• Organized first 3 Knowledge Discovery and Data Mining
(KDD) workshops (1989-93), cofounded Knowledge
Discovery and Data Mining (KDD) conferences (1995)
• Chief Scientist at 2 analytics startups 1998-2001
• Co-founder SIGKDD (1998), Chair, 2005-2009
• Analytics/Data Mining Consultant, 2001-
• Editor, KDnuggets, 1994-, full time 2001-
Patterns – Key Part of Intelligence
• Evolution: Animals better able
to find, use patterns – more
likely to survive
• People have an ability and
desire to find patterns
• People's “pattern intuition” does
not scale
• Science is what helps separate
valid from invalid patterns
(astrology, fake cures, …)
Horoscope for August: The
Mars-Jupiter tandem in
Cancer seems to indicate a
febrile activity related to the
accommodation, houses,
premises, real estate
investments. You'll build,
redecorate, move out, change
your furniture, refurbish, set
up your yard or garden …
Outline
• What do we call it?
• Data competitions – short history
• Government and Public Data
• Big Data Hype and Reality
What do we call it?
• Statistics
• Data mining
• Knowledge Discovery in
Data (KDD)
• Business Analytics
• Predictive Analytics
• Data Science
• Big Data
• … ?
Same Core Idea:
Finding Useful
Patterns in Data
Different
Emphasis
20th Century
Statistics dominates
Chart (Google Ngrams, smoothing=1): “statistics”
Note: Google Ngrams are case-sensitive. Lower case is used here as more representative.
“Data Mining” surges in 1996,
peaks in 2004-5
Chart (Google Ngrams, smoothing=1): “analytics” vs. “data mining”, annotated with:
• KDD-95, 1st Conference on Knowledge Discovery and Data Mining, Montreal
• Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, Eds:
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy
Analytics surges in 2006,
after Google Analytics was introduced
Chart (Google Trends, Jan 2005 – July 2013): searches for “analytics”
• Google Analytics introduced, Dec 2005
• “analytics - google” is 50% of “analytics” searches
• Slow-down in analytics in 2012?
In 2013: Big Data > Data Mining >
Business Analytics > Predictive Analytics
> Data Science
Chart (Google Trends search, Jan 2008 - July 2013): “Big Data” vs. “Data mining”
• Big Data slowdown?
History
• 1900 - Statistics
• 1960s Data Mining = bad activity, data “dredging”
• 1990 - “Data Mining” is good, surges in 1996
• 2003 - “Data Mining” peaks, image tarnished
(Total Information Awareness, invasion of privacy)
• 2006 - Google Analytics appears
• 2007 - Business/Data/Predictive Analytics
• 2012 - Big Data surge
• 2013 - Data Science
• 2015 - ??
Data Competitions –
Short History
1st Data Mining Competition:
KDD-CUP 1997
– Organized by Ismail Parsa (then at Epsilon)
– Task: given data on past responders to fund-raising,
predict most likely responders for new campaign
– Data:
• Population of 750K prospects, 300+ variables
• 10K (1.4%) responded to a broad campaign mailing
• Competition file was a stratified sample of 10K responders and
26K non-responders (28.7% response rate)
– Big effort on leaker detection (false predictors)
• KDD Cup was almost cancelled several times
• Charles Elkan found leakers in training data
Evaluating Targeted List: Cumulative Pct Hits (Gains)
Chart: Cumulative % Hits (y-axis, 0-100) vs. Pct list (x-axis, 5-95), with curves for Model and Random
5% of random list have 5% of targets,
but 5% of model ranked list have 21% of targets
Cum Pct Hits (5%, model) = 21%.
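The gains measure on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `cumulative_gains` and the toy scores and labels are made up here and are not the KDD Cup data or the organizers' evaluation code.

```python
def cumulative_gains(scores, labels, pct):
    """Percent of all targets (label 1) found in the top `pct` percent of the
    list when records are ranked by descending model score.
    Assumes at least one target is present."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cutoff = round(len(ranked) * pct / 100)
    hits_in_top = sum(label for _, label in ranked[:cutoff])
    total_hits = sum(labels)
    return 100.0 * hits_in_top / total_hits

# Toy data (hypothetical): 20 prospects, 4 of them responders.
scores = [0.9, 0.8, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4, 0.35,
          0.3, 0.28, 0.25, 0.2, 0.18, 0.15, 0.1, 0.08, 0.05, 0.01]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
          0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

gains_at_10 = cumulative_gains(scores, labels, 10)  # top 2 of 20 records
gains_at_40 = cumulative_gains(scores, labels, 40)  # top 8 of 20 records
print(gains_at_10, gains_at_40)
```

A random ordering would be expected to capture about `pct` percent of the targets at every cutoff, which is the diagonal "Random" line on the chart.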
KDD-CUP Participant Statistics
– 45 companies/institutions participated
• 23 research prototypes
• 22 commercial tools
– 16 contestants turned in their results
• 9 research prototypes
• 7 commercial tools
– Evaluation: Best Gains (CPH) at 40% and 10%
– Joint winners:
• Charles Elkan (UCSD) with BNB, Boosted Naive Bayesian Classifier
• Urban Science Applications, Inc. with commercial Gain, Direct
Marketing Selection System
• 3rd place: MineSet (SGI, Ronny Kohavi)
KDD-CUP Results Discussion
– Top finishers very close
– Naïve Bayes algorithm was used by 2 of the top 3
contestants (BNB and 3rd place MineSet)
– Naïve Bayes tools did little data preprocessing, used
small number of variables
– Urban Science implemented a tremendous amount
of automated data preprocessing and exploratory
data analysis and developed more than 50 models in
an automated fashion to get to their results
KDD Cup 1997: Top 3 results
Top 3 finishers
are very close
KDD Cup 1997 – worst results
Note that the worst
result (C6) was actually
worse than random.
Competitor names were
kept anonymous,
apart from top 3 winners
KDD Cup Lessons
• Data Preparation is key, especially eliminating
“leakers” (false predictors)
• Avoid overfitting the test data
• Simple models work well for predicting human
behavior
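One possible way to screen for leakers is to check whether any single field, on its own, predicts the target far better than the majority-class baseline. The sketch below is an assumed heuristic, not the procedure used for the KDD Cup. The field names (`age_band`, `thank_you_sent`), the threshold, and the data are hypothetical.

```python
from collections import Counter, defaultdict

def single_feature_accuracy(values, labels):
    """In-sample accuracy of predicting the label from one feature alone,
    using the majority label seen for each feature value."""
    by_value = defaultdict(Counter)
    for v, y in zip(values, labels):
        by_value[v][y] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(labels)

def flag_possible_leakers(rows, labels, threshold=0.9):
    """Flag features whose error reduction over the majority-class baseline
    is at least `threshold`. Note: unique IDs would also be flagged, since
    this is an in-sample check; flagged fields still need human review."""
    baseline = Counter(labels).most_common(1)[0][1] / len(labels)
    if baseline == 1.0:
        return []  # only one class present, nothing to explain
    flagged = []
    for name in rows[0]:
        acc = single_feature_accuracy([r[name] for r in rows], labels)
        if (acc - baseline) / (1.0 - baseline) >= threshold:
            flagged.append(name)
    return flagged

# Hypothetical data: "thank_you_sent" is recorded only after a response,
# so it leaks the outcome and must not be used as a predictor.
rows = [
    {"age_band": "30-39", "thank_you_sent": 1},
    {"age_band": "30-39", "thank_you_sent": 0},
    {"age_band": "40-49", "thank_you_sent": 1},
    {"age_band": "40-49", "thank_you_sent": 0},
    {"age_band": "50-59", "thank_you_sent": 0},
    {"age_band": "50-59", "thank_you_sent": 1},
    {"age_band": "30-39", "thank_you_sent": 0},
    {"age_band": "40-49", "thank_you_sent": 0},
]
labels = [1, 0, 1, 0, 0, 1, 0, 0]

suspects = flag_possible_leakers(rows, labels)
print(suspects)
```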
Big Competition Successes
• Ansari X-Prize 2004:
SpaceShipOne went to
space twice in 2 weeks
• DARPA Grand
Challenge, 2005: 150 mi
Off-road robotic car
navigation
Netflix Prize
• Started in 2006, with 100M
ratings, 500K users, 18K
movies, $1M prize
• Goal: reduce RMSE of predicted “star”
ratings by 10% (was 0.95 for Netflix's
own system, Cinematch)
• Public training data, public & secret
test sets
Netflix Prize Milestones
• In just one week, the WXYZ consulting team
beat the Netflix system with RMSE 0.9430
• Progress in 2007-8 was very slow
• In a 2007 KDnuggets Poll,
32% thought the prize would
never be won
• Took 3 years to reach
10% improvement
Netflix Prize Winners
• Winning team used a complex
ensemble of many algorithms
• Two teams had exactly the same RMSE
of 0.8567, but the winner submitted 20
minutes earlier!
Netflix Prize lessons, 1
• Competitions work
• Limits to predicting human behavior –
inherent randomness, noisy data
• Privacy concerns
– Researchers found a few people with matching
IMDB and Netflix ratings – potential privacy
breach
– 4 Netflix users sued
– Netflix Prize Sequel – cancelled
Netflix Prize lessons, 2
• Winning algorithm was too complex, too
tailored to the specific data set, and was never used
– Netflix blog, Apr 2012
• A basic SVD algorithm, proposed by Simon
Funk (KDnuggets Interview w. Simon Funk)
got ~6% improvement
• SVD++ version by Yehuda Koren & winning
team reached ~ 8% improvement, was used
by Netflix
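The basic idea behind the Funk-style approach can be sketched as regularized matrix factorization trained by stochastic gradient descent. This is a simplified illustration, not Simon Funk's code or the Netflix implementation: it trains all factors at once, uses only a global mean as baseline, and the function names, hyperparameters, and toy ratings are all assumptions.

```python
import random

def train_funk_mf(ratings, n_users, n_items, n_factors=2,
                  lr=0.02, reg=0.02, n_epochs=500, seed=0):
    """Learn user factors P and item factors Q so that
    rating(u, i) is approximated by mean + dot(P[u], Q[i])."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    mean = sum(r for _, _, r in ratings) / len(ratings)
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mean + sum(p * q for p, q in zip(P[u], Q[i])))
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mean, P, Q

def predict(model, u, i):
    mean, P, Q = model
    return mean + sum(p * q for p, q in zip(P[u], Q[i]))

def rmse(model, ratings):
    se = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5

# Toy (user, movie, stars) triples, made up for illustration.
ratings = [(0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 1, 5), (1, 3, 1),
           (2, 0, 1), (2, 2, 5), (2, 3, 4), (3, 1, 1), (3, 2, 4), (3, 3, 5)]

baseline_rmse = rmse(train_funk_mf(ratings, 4, 4, n_epochs=0), ratings)
trained_rmse = rmse(train_funk_mf(ratings, 4, 4), ratings)
print(baseline_rmse, trained_rmse)
```

Despite the name, this is not a classical singular value decomposition: only observed ratings enter the loss, so missing entries never need to be filled in.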
Netflix Prize lessons, 3
• Wrong question was asked! (Minimizing RMSE of
predicted vs actual ratings)
• RMSE gives a big penalty for errors > 2 stars, so an
algorithm that fails big a few times will score worse than
an algorithm that is often off by 1.
• Errors are not equal, but RMSE treats 2 vs 3 stars the
same as 4 vs 5 or 1 vs 2.
• Also, Netflix Instant became more popular
• A better question would be “what do you like to
watch” (anything on Instant likely to rank > 3)
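The point about RMSE and big errors can be checked with a small calculation. The ratings below are made up for illustration: predictor A is exactly right 9 times out of 10 but misses once by 4 stars, while predictor B is off by 1 star every time.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

actual = [5, 4, 3, 5, 4, 3, 5, 4, 3, 5]
pred_a = [5, 4, 3, 5, 4, 3, 5, 4, 3, 1]  # one 4-star miss, otherwise exact
pred_b = [4, 3, 2, 4, 3, 2, 4, 3, 2, 4]  # always off by 1 star

rmse_a = rmse(pred_a, actual)  # sqrt(16 / 10)
rmse_b = rmse(pred_b, actual)  # sqrt(10 / 10)
print(rmse_a, rmse_b)
```

Under RMSE, A scores worse than B even though A is exactly right on 90% of the ratings, because the single large error is squared before averaging.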
Focus on the right question
and the right GOAL
Kaggle Competition Platform
• Launched by Anthony Goldbloom in 2010
• Quickly became the top platform for
competitions
– Few people know of TunedIT competition
platform launched in 2009
• Kaggle in Class – free for Universities
• Reached 100,000 members in July 2013
Kaggle Successes
• Allstate competition: Winning model was 270%
more accurate than baseline
• Identified sound of the endangered North
Atlantic right whale in audio recordings
• GE FlightQuest
• Heritage Health Prize - $3M
competition, 2011-13
• But competitions are very time consuming
Kaggle Business Model
• Initial business model - % of prize
• Kaggle Job Boards (currently free)
• Kaggle Connect: Offers consulting with top
0.5% of Kagglers (at $300/hr? see post), or
$30-100K/month (IW, Mar 2013)
• Private competitions (Masters) open to top
Kagglers
– Heritage Health Prize 2
Winning on Kaggle
• Kaggle Chief Scientist: Specialist knowledge –
useless & unhelpful (Slate, Dec 2012)
• Big-data approaches
• Use good tools: R, Random forests
• Curiosity, Creativity, Persistence, Team, Luck?
(also Quora answer)
• Many (most?) winners – not professional data
scientists (physicists, math profs, actuary)
(RW, Apr 2012)
“your Ivy League diploma and IBM
resume don't matter so much
as my Kaggle score”
Almost true
Data:
Public, Government, Portals, Marketplaces
Public Data
www.KDnuggets.com/datasets/
• Government, Federal, State, City, Local and public data sites and portals
• Data APIs, Hubs, Marketplaces, Platforms, Portals, and Search Engines.
• Data Markets: DataMarket
• Data Platforms: Enigma, InfoChimps (acq. by CSC), Knoema, Exversion, …
• Data Search Engines: Quandl, qunb, Zanran
• Location: Factual
• People and places: Freebase
Public and Government Data
• Datamob.org: tracks government data in
developer-friendly format
data about U.S. state legislative
activities, including bill
summaries, votes, sponsorships, legislators
and committees.
US Project Open Data
• In May 2013, the White House announced Project
Open Data
• “information is a valuable national asset whose
value is multiplied when it is made easily
accessible to the public”.
• “The Executive Order requires that, going
forward, data generated by the government be
made available in open, machine-readable
formats, while appropriately safeguarding
privacy, confidentiality, and security.”
Using Public Data
• Google – biggest success ?
• Data Science for Social Good (Chicago) (Fast
Company, Aug 2013)
– predict when bikeshare stations run out of bikes
– forecast local crime
– warn local hospitals about impending heart
attacks
Big Data
• 2nd Industrial Revolution
• Do old activities better
• Create new activities/businesses
Doing Old Things Better
Application areas
– Direct marketing/Customer modeling
– Churn prediction
– Recommendations
– Fraud detection
– Security/Intelligence
– …
• Improvement will be real, but limited because of
human randomness
• Competition will level companies
Big Data Enables New Things!
– Google – first big success of big data
– Social networks (Facebook, Twitter, LinkedIn, …)
success depends on network size, i.e. big data
– Location analytics
– Health-care
• Personalized medicine
– Semantics and AI ?
• Imagine IBM Watson, Google Now, Siri in 2023 ?
Big Data Bubble?
Gartner Hype Cycle for Big Data, 2012
Chart annotations:
• Data Scientist, 2-5 yrs
• Social Network Analysis, 5-10 yrs
• Social Analytics, 2-5 yrs
• Predictive Analytics, <2 yrs
• MapReduce & Alternatives - Disillusionment
Questions?
KDnuggets: Analytics, Big Data, Data Mining
• News, Jobs, Software, Courses, Data, Meetings,
Publications, Webcasts, …
www.KDnuggets.com/news
• Subscribe to KDnuggets News email at
www.KDnuggets.com/subscribe.html
• Twitter: @kdnuggets
• Email to editor1@kdnuggets.com

 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Public Data and Data Mining Competitions - What are Lessons?

  • 1. Public Data and Data Mining Competitions – what are the Lessons? Gregory Piatetsky-Shapiro, KDnuggets. © KDnuggets 2013
  • 2. My Data • PhD (‘84) in applying Machine Learning to databases • Researcher at GTE Labs – started the first project on Knowledge Discovery in Databases in 1989 • Organized first 3 Knowledge Discovery and Data Mining (KDD) workshops (1989-93), cofounded Knowledge Discovery and Data Mining (KDD) conferences (1995) • Chief Scientist at 2 analytics startups 1998-2001 • Co-founder SIGKDD (1998), Chair, 2005-2009 • Analytics/Data Mining Consultant, 2001- • Editor, KDnuggets, 1994-, full time 2001-
  • 3. Patterns – Key Part of Intelligence • Evolution: Animals better able to find, use patterns – more likely to survive • People have an ability and desire to find patterns • People's “pattern intuition” does not scale • Science is what helps separate valid from invalid patterns (astrology, fake cures, …) • Horoscope for August: The Mars-Jupiter tandem in Cancer seems to indicate a febrile activity related to the accommodation, houses, premises, real estate investments. You'll build, redecorate, move out, change your furniture, refurbish, set up your yard or garden …
  • 4. Outline • What do we call it? • Data competitions – short history • Government and Public Data • Big Data Hype and Reality
  • 5. What do we call it? • Statistics • Data mining • Knowledge Discovery in Data (KDD) • Business Analytics • Predictive Analytics • Data Science • Big Data • … ? • Same Core Idea: Finding Useful Patterns in Data • Different Emphasis
  • 6. 20th Century: Statistics dominates • [Chart: Google Ngrams, smoothing=1; series: statistics] • Note: Google Ngrams are case-sensitive. Lower case is used here as more representative
  • 7. “Data Mining” surges in 1996, peaks in 2004-5 • [Chart: Google Ngrams, smoothing=1; series: analytics, data mining] • KDD-95, 1st Conference on Knowledge Discovery and Data Mining, Montreal • Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, Eds: U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy
  • 8. Analytics surges in 2006, after Google Analytics introduced • [Chart: Google Trends, Jan 2005 – July 2013; series: analytics] • Google Analytics introduced, Dec 2005 • “analytics - google” is 50% of “analytics” searches • Slow-down in analytics in 2012?
  • 9. In 2013: Big Data > Data Mining > Business Analytics > Predictive Analytics > Data Science • [Chart: Google Trends search, Jan 2008 - July 2013; series labels: Big Data, Data mining] • Big Data slowdown?
  • 10. History • 1900 - Statistics • 1960s - Data Mining = bad activity, data “dredging” • 1990 - “Data Mining” is good, surges in 1996 • 2003 - “Data Mining” peaks, image tarnished (Total Information Awareness, invasion of privacy) • 2006 - Google Analytics appears • 2007 - Business/Data/Predictive Analytics • 2012 - Big Data surge • 2013 - Data Science • 2015 - ??
  • 11. Data Competitions – Short History
  • 12. 1st Data Mining Competition: KDD-CUP 1997 – Organized by Ismail Parsa (then at Epsilon) – Task: given data on past responders to fund-raising, predict most likely responders for new campaign – Data: • Population of 750K prospects, 300+ variables • 10K (1.4%) responded to a broad campaign mailing • Competition file was a stratified sample of 10K responders, 26K non-responders (28.7% response rate) – Big effort on leaker detection (false predictors) – KDD Cup was almost cancelled, several times – Charles Elkan found leakers in training data
  • 13. Evaluating Targeted List: Cumulative Pct Hits (Gains) • [Chart: Cumulative % Hits vs. Pct list; series: Model, Random] • 5% of random list have 5% of targets, but 5% of model-ranked list have 21% of targets • Cum Pct Hits (5%, model) = 21%
  • 14. KDD-CUP Participant Statistics – 45 companies/institutions participated • 23 research prototypes • 22 commercial tools – 16 contestants turned in their results • 9 research prototypes • 7 commercial tools – Evaluation: Best Gains (CPH) at 40% and 10% – Joint winners: • Charles Elkan (UCSD) with BNB, Boosted Naive Bayesian Classifier • Urban Science Applications, Inc. with commercial Gain, Direct Marketing Selection System • 3rd place: MineSet (SGI, Ronny Kohavi)
  • 15. KDD-CUP Results Discussion – Top finishers very close – Naïve Bayes algorithm was used by 2 of the top 3 contestants (BNB and 3rd place MineSet) – Naïve Bayes tools did little data preprocessing, used small number of variables – Urban Science implemented a tremendous amount of automated data preprocessing and exploratory data analysis and developed more than 50 models in an automated fashion to get to their results
  • 16. KDD Cup 1997: Top 3 results • Top 3 finishers are very close
  • 17. KDD Cup 1997 – worst results • Note that the worst result (C6) was actually worse than random. Competitor names were kept anonymous, apart from top 3 winners
  • 18. KDD Cup Lessons • Data Preparation is key, especially eliminating “leakers” (false predictors) • Avoid overfitting the test data • Simple models work well for predicting human behavior
  • 19. Big Competition Successes • Ansari X-Prize 2004: Spaceship One went to space twice in 2 weeks • DARPA Grand Challenge, 2005: 150 mi Off-road robotic car navigation
  • 20. Netflix Prize • Started in 2006, with 100M ratings, 500K users, 18K movies, $1M prize • Goal: reduce RMSE in “star” rating by 10% (was 0.95 for Netflix's own system, Cinematch) • Public training data, public & secret test sets • [Chart: Predicted vs. Actual]
  • 21. Netflix Prize Milestones • In just one week, WXYZ consulting team beat Netflix system with RMSE 0.9430 • Progress in 2007-8 was very slow • In a 2007 KDnuggets Poll, 32% thought the prize would never be won • Took 3 years to reach 10% improvement
  • 22. Netflix Prize Winners • Winning team used a complex ensemble of many algorithms • Two teams had exactly the same RMSE of 0.8567, but the winner submitted 20 minutes earlier!
  • 23. Netflix Prize lessons, 1 • Competitions work • Limits to predicting human behavior – inherent randomness, noisy data • Privacy concerns – Researchers found a few people with matching IMDB and Netflix ratings – potential privacy breach – 4 Netflix users sued – Netflix Prize Sequel – cancelled
  • 24. Netflix Prize lessons, 2 • Winning algorithm was too complex, too tailored to specific data set, never used – Netflix blog, Apr 2012 • A basic SVD algorithm, proposed by Simon Funk (KDnuggets Interview w. Simon Funk), got ~6% improvement • SVD++ version by Yehuda Koren & winning team reached ~8% improvement, was used by Netflix
  • 25. Netflix Prize lessons, 3 • Wrong question was asked! (Minimizing RMSE of predicted vs actual ratings) • RMSE gives a big penalty for errors > 2 stars, so an algorithm that fails big a few times will score worse than an algorithm that is often off by 1 star • Errors are not equal, but RMSE treats 2 vs 3 stars the same as 4 vs 5 or 1 vs 2 • Also, Netflix Instant became more popular • Better question would be “what do you like to watch” (anything on Instant likely to rank > 3)
  • 26. Focus on the right question and the right GOAL
  • 27. Kaggle Competition Platform • Launched by Anthony Goldbloom in 2010 • Quickly became the top platform for competitions – Few people know of TunedIT competition platform launched in 2009 • Kaggle in Class – free for Universities • Achieved 100,000 members in July 2013
  • 28. Kaggle Successes • Allstate competition: Winning model was 270% more accurate than baseline • Identified sound of the endangered North American Right whale in audio recordings • GE FlightQuest • Heritage Health Prize - $3M competition, 2011-13 • But … Competitions - very time consuming
  • 29. Kaggle Business Model • Initial business model - % of prize • Kaggle Job Boards (currently free) • Kaggle Connect: Offers consulting with top 0.5% of Kagglers (at $300/hr? see post), or $30-100K/month (IW, Mar 2013) • Private competitions (Masters) open to top Kagglers – Heritage Health Prize 2
  • 30. Winning on Kaggle • Kaggle Chief Scientist: Specialist knowledge – useless & unhelpful (Slate, Dec 2012) • Big-data approaches • Use good tools: R, Random forests • Curiosity, Creativeness, Persistence, Team, Luck? (also Quora answer) • Many (most?) winners – not professional data scientists (physicists, math profs, actuary) (RW, Apr 2012)
  • 31. “your Ivy League diploma and IBM resume don't matter so much as my Kaggle score” • Almost true
  • 32. Data: Public, Government, Portals, Marketplaces
  • 33. Public Data www.KDnuggets.com/datasets/ • Government, Federal, State, City, Local and public data sites and portals • Data APIs, Hubs, Marketplaces, Platforms, Portals, and Search Engines • Data Markets: DataMarket • Data Platforms: Enigma, InfoChimps (acq. by CSC), Knoema, Exversion, … • Data Search Engines: Quandl, qunb, Zanran • Location: Factual • People and places: Freebase
  • 34. Public and Government Data • Datamob.org: tracks government data in developer-friendly format • Example: data about U.S. state legislative activities, including bill summaries, votes, sponsorships, legislators and committees
  • 35. US Project Open Data • In May 2013, White House announced Project Open Data • “information is a valuable national asset whose value is multiplied when it is made easily accessible to the public”. • “The Executive Order requires that, going forward, data generated by the government be made available in open, machine-readable formats, while appropriately safeguarding privacy, confidentiality, and security.”
  • 36. Using Public Data • Google – biggest success ? • Data Science for Social Good (Chicago) (Fast Company, Aug 2013) – predict when bikeshare stations run out of bikes – forecast local crime – warn local hospitals about impending heart attacks
  • 37. Big Data • 2nd Industrial Revolution • Do old activities better • Create new activities/businesses
  • 38. Doing Old Things Better • Application areas – Direct marketing/Customer modeling – Churn prediction – Recommendations – Fraud detection – Security/Intelligence – … • Improvement will be real, but limited because of human randomness • Competition will level companies
  • 39. Big Data Enables New Things! – Google – first big success of big data – Social networks (Facebook, Twitter, LinkedIn, …) success depends on network size, i.e. big data – Location analytics – Health-care • Personalized medicine – Semantics and AI ? • Imagine IBM Watson, Google Now, Siri in 2023 ?
  • 40. Copyright © 2003 KDnuggets
  • 41. Big Data Bubble? • [Chart: Gartner Hype Cycle; label: Big Data]
  • 42. Gartner Hype Cycle for Big Data, 2012 • Data Scientist, 2-5 yrs • Social Network Analysis, 5-10 • Social Analytics, 2-5 • Predictive Analytics, <2 • MapReduce & Alternative - Disillusionment
  • 43. Questions? KDnuggets: Analytics, Big Data, Data Mining • News, Jobs, Software, Courses, Data, Meetings, Publications, Webcasts, … www.KDnuggets.com/news • Subscribe to KDnuggets News email at www.KDnuggets.com/subscribe.html • Twitter: @kdnuggets • Email to editor1@kdnuggets.com
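
The Cumulative Pct Hits (Gains) measure on slide 13 can be sketched in a few lines of Python. This is only an illustration: the function name, the toy scores and the toy labels are assumptions made up for the example, not KDD Cup 1997 data. It ranks prospects by model score and reports what percent of all responders fall in the top part of the list.

```python
def cumulative_pct_hits(scores, labels, pct_list):
    """Percent of all targets found in the top `pct_list` percent of the ranked list.

    scores: model scores, higher means more likely to respond (hypothetical).
    labels: 1 for a responder (target), 0 otherwise.
    """
    total_hits = sum(labels)
    if total_hits == 0:
        raise ValueError("no targets in the list")
    # Rank prospects from highest to lowest model score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cutoff = round(len(ranked) * pct_list / 100)
    hits_in_top = sum(label for _, label in ranked[:cutoff])
    return 100.0 * hits_in_top / total_hits


# Toy list of 20 prospects, 4 of them responders (made-up data).
toy_scores = [i / 20 for i in range(20, 0, -1)]  # 1.0, 0.95, ..., 0.05
toy_labels = [1, 1, 0, 1, 0, 0, 0, 1] + [0] * 12

gains_at_25 = cumulative_pct_hits(toy_scores, toy_labels, 25)
gains_at_40 = cumulative_pct_hits(toy_scores, toy_labels, 40)
print(gains_at_25, gains_at_40)
```

A random ordering would be expected to capture about 25% of the targets in the top 25% of the list, so the gap between that and the model's figure is what the gains chart displays.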
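
The point on slide 25 about RMSE penalizing a few big misses can be checked with a small worked example. The two "algorithms" below are hypothetical, with made-up ratings: A is exactly right on 9 of 10 ratings and off by 4 stars once, while B is off by 1 star on every rating.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up actual star ratings for 10 movies.
actual = [5, 3, 4, 2, 3, 4, 3, 2, 4, 3]

# Hypothetical algorithm A: exact everywhere except one 4-star miss (5 predicted as 1).
predicted_a = [1, 3, 4, 2, 3, 4, 3, 2, 4, 3]

# Hypothetical algorithm B: always off by exactly 1 star.
predicted_b = [4, 4, 3, 3, 2, 3, 4, 3, 3, 4]

rmse_a = rmse(predicted_a, actual)  # sqrt(16 / 10)
rmse_b = rmse(predicted_b, actual)  # sqrt(10 / 10)
print(rmse_a, rmse_b)
```

Under RMSE, A scores worse than B even though A gets 90% of the ratings exactly right, which is the sense in which the choice of metric shapes what a competition rewards.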

Editor's Notes

  1. The future is bright for Big Data, but caution is needed when evaluating claims