6. Talk outline
• The evolution of data-driven applications
• 5 generations
• Lessons and Opportunities
• From the intersection of startups, venture capital, and
research
• Key theme: Disruption vs Optimization
• Conclusion
8. Follow the Data!
• Value-creation has followed the most
valuable data sources available!
• 5 overlapping generations
9. Data-Driven Apps: The First Generation
• All about leveraging private, structured data
assets for competitive advantage
• E.g., Sales, inventory, payroll, …
11. Data-Driven Apps: The Third Generation
• Leveraging the power of “semi-public”
Social + Mobile Data
• Personal data shared in a frictionless manner with
user’s consent
13. Data-Driven Apps: The Fourth Generation
• Combining public, semi-public, and private
data
14. 4G Example: Paysa
• Am I being compensated fairly?
• 2012 Stanford CS grad
• Java, C++, Ruby, and Machine Learning
• Software Eng II at Google
15. 4G Example: Paysa
• Salaries: 35M+ salary datapoints
• Companies: 500k+ companies
• People: Professional DNA of 15M tech employees
• Jobs: Millions of job postings, updated daily
• Data sources: Local/National Government Databases, Partnerships (e.g., Udacity), Recruiters, Companies, Web Crawl, Social Media
• Source axis: Private to Public
16. The Fifth Generation: Just add AI!
• Companies generate massive amounts of
training data
• New class of proprietary data
21. Lessons and Opportunities
1. Startup and Investment Landscape
2. Disruption vs Optimization
3. Human-Machine Collaboration
4. Rise of the Cyborg
5. The Data is not a Given
32. Trends and Takeaways
• Infrastructure is available and solid
• Major transition from Hadoop to Spark
• Investment focus on “Vertical” analytics
plays
• e.g., Cuberon, Ayasdi
• The Age of the Intelligent App has dawned
• Major opportunities and investment dollars flowing here!
• e.g., Troo.ly, Descartes Labs, DocsApp
33. Lessons and Opportunities
1. Startup and Investment Landscape
2. Disruption vs Optimization
3. Human-Machine Collaboration
4. Rise of the Cyborg
5. The Data is not a Given
37. Why does disruption happen?
• Data scientist as advisor, not decision maker
• Domain expertise and experience often win out over data
• Data-driven approach enables a completely
different business model
• E.g., A la carte streaming vs fixed number of channels
• Cannibalization concerns
• Fear of making mistakes
• Algorithms can make mistakes
• But algorithms can learn and improve much faster with data!
38. Why does disruption happen?
• Classic Innovator’s Dilemma with a turbo-boost: data network effects
• Accelerates the pace of disruption
39. Disruption Example: Venture Capital
• Venture Capital has been an established
industry for several decades
• Process has not changed much since early days
• VC firms expect entrepreneurs to approach them with
pitches
• Some VC firms have tried using data
• Data scientists in advisory role
• Not partners who make investment decisions
• High concentration in Silicon Valley
• And a few other places…
40. Sets the stage for…
rocketship.vc
Venture Investing through Data Science
41. More Global Startups
• Reduced costs to launch a startup
• Large consolidating markets; smartphone ubiquity
• Emerging Market Opportunities
• Untapped talent pools
42. Beyond Human Scale
• 2.1 Million “Startups”
• 115K need funding at any time
• 90% outside Silicon Valley
• 12.8 Million Companies
45. Business Model Innovation
• Proactively identify interesting companies and
reach out to them at the appropriate moment
[Pie chart: regional breakdown]
• South America: 9%
• East Europe: 11%
• China: 13%
• India: 7%
• Other East Asia: 11%
• Other Europe: 5%
• Other North America: 7%
• US SF: 11%
• US Other: 22%
• Unknown: 4%
46. Optimize or Disrupt?
• Key question for every entrepreneur (and
researcher too!)
• Often difference between success and failure
• Hard to answer in general, but look out for
disruption cues
• Established, fragmented industry
• Slow to adopt latest technology trend
• Asset-heavy models
• Risk/reward tradeoff
• Disruption is much riskier but the rewards compensate
47. Lessons and Opportunities
1. Startup and Investment Landscape
2. Disruption vs Optimization
3. Human-Machine Collaboration
4. Rise of the Cyborg
5. The Data is not a Given
50. Peripheral Vision
• To make optimal decisions, humans must
provide “peripheral vision” to the model
• Is this data point an outlier or does it fit the
model?
• e.g., Geo or category in VC
• Is there bias in the model?
• e.g., historical racial gap in sentencing and parole decisions
• Has the world changed in a way that
invalidates the assumption of the model?
• e.g., flash crash on Wall Street
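The first question above (outlier, or does it fit the model?) can be sketched as a simple routing rule. This is only an illustration under stated assumptions: `needs_human_review`, the z-score test, and the threshold of 3 are hypothetical choices, not something the talk prescribes. Inputs that look unlike the data the model was fit on are handed to a person instead of being scored automatically.

```python
from statistics import mean, stdev

def needs_human_review(value, history, z_threshold=3.0):
    """Return True when `value` lies far from the data the model was fit on.

    `history` is a list of past values for one feature (hypothetical input).
    """
    if len(history) < 2:
        return True  # too little history to judge; defer to a person
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # any deviation from a constant history is unusual
    return abs(value - mu) / sigma > z_threshold

history = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
print(needs_human_review(10.4, history))  # typical value
print(needs_human_review(25.0, history))  # far outside the observed range
```

A z-score only covers the outlier question. Bias and a changed world need other checks, such as auditing outcomes by group or monitoring error over time.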
51. The Problem
• Must judges, policemen, doctors, bureaucrats understand the nuances of the data and the model?
• Even trickier when we consider complex workflows involving multiple decision makers
• e.g., a drug trial
52. The Opportunity
• Systems that include humans and models
as peers
• Can also be complex workflows that involve many
humans and models
• How best to structure such systems to
produce optimal decisions?
• Model might need to be tuned to work with a specific
human
• Model Invalidation
• Can models know when they are no longer valid?
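One hedged way to approach the model-invalidation question is to let the model watch its own recent error. The sketch below is an assumption-laden illustration: `ValidityMonitor`, the window size, and the tolerance are hypothetical names and values. It compares the error rate over a sliding window of recent predictions with the error measured at validation time and reports the model as no longer valid when the gap grows too large.

```python
from collections import deque

class ValidityMonitor:
    """Track recent prediction errors and compare them with a baseline."""

    def __init__(self, baseline_error, window=100, tolerance=0.10):
        self.baseline_error = baseline_error  # error rate seen at validation time
        self.tolerance = tolerance            # allowed increase before raising a flag
        self.recent = deque(maxlen=window)    # True for each recent wrong prediction

    def record(self, prediction, outcome):
        self.recent.append(prediction != outcome)

    def is_valid(self):
        if len(self.recent) < self.recent.maxlen:
            return True  # not enough evidence yet to call the model invalid
        recent_error = sum(self.recent) / len(self.recent)
        return recent_error <= self.baseline_error + self.tolerance

monitor = ValidityMonitor(baseline_error=0.05, window=10)
for _ in range(10):
    monitor.record(prediction=1, outcome=1)
print(monitor.is_valid())
for _ in range(5):
    monitor.record(prediction=1, outcome=0)
print(monitor.is_valid())
```

This only works when true outcomes arrive soon after predictions. When they do not, a human reviewer or a check on the input distribution has to stand in.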
53. Is it time to disrupt Mechanical Turk?
• The world has changed a lot
since Mechanical Turk was
introduced in 2005
• Can we move closer to true
hybrid human-machine
computing?
• Harness both human initiative and
computing power
• Harness sensors in phones
• Reimagine problems, tasks and
incentives
54. Lessons and Opportunities
1. Startup and Investment Landscape
2. Disruption vs Optimization
3. Human-Machine Collaboration
4. Rise of the Cyborg
5. The Data is not a Given
56. The Agency Problem
• Each model is optimized for the good of the company that owns it
• Often our goals and the company’s goals are in alignment but not always!
57. Problems
• Privacy
• Everyone has your data and is modeling your actions
• Pricing and Discovery disadvantage
• You discover only what they choose to show you
• You are not a population
• Each service models its population of users
• And is optimizing for its own ends
• Would you rather be explored or exploited?
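The closing question plays on the explore/exploit trade-off. As a rough illustration only (an epsilon-greedy rule is one common textbook strategy, and nothing in the talk says any particular service uses it), a service might pick what to show a user like this. `choose_item` and the item names are hypothetical.

```python
import random

def choose_item(estimated_value, epsilon, rng):
    """Pick an item: explore with probability epsilon, otherwise exploit.

    `estimated_value` maps item name -> the service's current value estimate.
    """
    items = list(estimated_value)
    if rng.random() < epsilon:
        return rng.choice(items)                 # explore: try something at random
    return max(items, key=estimated_value.get)   # exploit: show the best-known item

rng = random.Random(0)
estimates = {"item_a": 0.2, "item_b": 0.7, "item_c": 0.4}
print(choose_item(estimates, epsilon=0.0, rng=rng))  # pure exploitation picks item_b
```

Either way the value estimates belong to the service, so both branches serve the service's objective, not necessarily the user's.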
63. Cyborg Layer Services
• Privacy protection
• e.g., using Differential Privacy techniques
• Or by strategically spreading interactions across services
• e.g., watch some movies on Netflix and some on Amazon
• Discovery and Pricing
• Looks at a larger selection and picks items for you
• Acts strictly as your agent; no conflict
• Combine personal and population models
• Cyborg has complete access to all my data
• External services have population data, but only limited
window
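To make the differential-privacy bullet concrete, here is a minimal sketch of the Laplace mechanism applied to a single count. The assumptions are that one person changes the count by at most 1 (sensitivity 1) and that `private_count` is a hypothetical helper name. It is an illustration of the general technique, not a description of how a cyborg layer would actually be built.

```python
import random

def private_count(true_count, epsilon, rng):
    """Return true_count plus Laplace noise calibrated for epsilon-DP.

    Assumes one person changes the count by at most 1 (sensitivity 1).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rate = epsilon  # Laplace scale is 1/epsilon, so each exponential has rate epsilon
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise

rng = random.Random(7)
# e.g., a noisy version of "movies watched this year" released to a service
print(private_count(120, epsilon=0.5, rng=rng))
```

Smaller `epsilon` means more noise and stronger privacy. Repeated releases consume privacy budget, which this sketch does not track.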
65. Lessons and Opportunities
1. The Age of the App
2. Disruption vs Optimization
3. Human-Machine Collaboration
4. The Rise of the Cyborg
5. The Data is not a Given
66. How to build a Model: Conventional View
• Use ground truth to build the best model
possible
• Feature engineering + model selection
• Maybe some data cleaning and integration
69. Can you trust the ground truth?
• Bad users might have a good label if they haven’t engaged in bad activity yet
• Labels may be incorrect if they are coming from bad internal models
• Labels may be incorrect because of wrong attributions in bad transactions
70. Rocketship.vc: company data
• How to trade off data sources
based on Coverage, Accuracy,
Depth, Freshness, and Cost?
• Which subset of data sources
yields the best model?
• Which subset of data sources
will identify promising
companies most quickly?
• Promising start
• Dong et al, VLDB 2012
• Rekatsinas et al, SIGMOD
2014
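The source-selection questions above can be illustrated with a deliberately simplified greedy heuristic. It models only Coverage and Cost and ignores Accuracy, Depth, and Freshness. It is not the method of the cited papers, and `select_sources`, the source names, and the numbers are all hypothetical.

```python
def select_sources(sources, budget):
    """Greedily pick sources by newly covered companies per unit cost.

    `sources` maps name -> (set of covered company ids, cost > 0).
    Returns the chosen source names in pick order and the covered set.
    """
    chosen, covered, spent = [], set(), 0.0
    remaining = dict(sources)

    def gain_per_cost(name):
        companies, cost = remaining[name]
        return len(companies - covered) / cost

    while remaining:
        affordable = [n for n in remaining if spent + remaining[n][1] <= budget]
        if not affordable:
            break
        best = max(affordable, key=gain_per_cost)
        if gain_per_cost(best) == 0:
            break  # nothing new left to cover
        companies, cost = remaining.pop(best)
        chosen.append(best)
        covered |= companies
        spent += cost
    return chosen, covered

sources = {
    "web_crawl": ({"a", "b", "c", "d"}, 4.0),
    "registry": ({"a", "b"}, 1.0),
    "paid_feed": ({"c", "d", "e", "f"}, 8.0),
}
chosen, covered = select_sources(sources, budget=6.0)
print(chosen, sorted(covered))
```

A fuller treatment would weight each source by estimated accuracy and freshness, which is where the trade-off becomes hard.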
72. Summary
• Cannot trust the given data completely
• Ground truth is often neither true nor grounded
• Data may have bias
• Look for additional data that can improve
the model
• Quality/cost tradeoff?
• Generate your own training data!
• E.g., Polarr photo-editing app
• Data Programming (Ratner et al, 2016)
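The "generate your own training data" idea can be sketched with labeling functions, the building block of data programming. One important assumption: data programming as described by Ratner et al. learns how much to trust each labeling function, whereas this sketch substitutes a plain unweighted majority vote. The labeling functions and example texts are hypothetical.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Each labeling function encodes one noisy heuristic and may abstain.
def lf_mentions_funding(text):
    return POSITIVE if "raised" in text.lower() else ABSTAIN

def lf_mentions_shutdown(text):
    return NEGATIVE if "shut down" in text.lower() else ABSTAIN

def lf_mentions_hiring(text):
    return POSITIVE if "hiring" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_funding, lf_mentions_shutdown, lf_mentions_hiring]

def weak_label(text):
    """Combine labeling-function votes by unweighted majority; ties abstain."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    positives = sum(votes)            # POSITIVE is 1 and NEGATIVE is 0
    negatives = len(votes) - positives
    if positives == negatives:
        return ABSTAIN
    return POSITIVE if positives > negatives else NEGATIVE

print(weak_label("Acme raised a seed round and is hiring"))
print(weak_label("Acme shut down last month"))
```

The resulting weak labels are noisy by construction, so they are meant as training input for a downstream model, not as ground truth.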
74. Summary
• 5 generations of data-driven applications
• Lessons and Opportunities
1. The Age of the Intelligent App
2. Disruption vs Optimization
3. Human-Machine Collaboration
4. Rise of the Cyborg
5. The Data is not a Given
77. Data impacts every human endeavor
[Diagram: Data linked to Entertainment, Transportation, Government, Manufacturing, Sciences, Education, Security, Commerce]
78. Data + X
• Core identity of the field is to create value
from data
• Never a better time for it!
• Data is now a key part of every field of
human endeavor
• Stanford CS+X
• The value of being an outsider
79. Go Forth And Disrupt!
[Diagram: Entertainment, Transportation, Government, Manufacturing, Sciences, Education, Security, Commerce]
81. IIT Madras CS Visiting Chair Program
• Focus area: data-driven
approaches to tackle important
problems
• Leading faculty/researchers
from around the world welcome!
• Flexible time commitment
• Minimum 2 weeks
• Endowed by Venky Harinarayan
and Anand Rajaraman
82. Confirmed Visiting Chairs so far…
• Jeff Ullman, Professor Emeritus, CS, Stanford
• Randy Katz, Distinguished Professor, EECS, UC Berkeley
• Hari Balakrishnan, Professor, EECS, MIT