This presentation by Hal Varian, Professor at the Berkeley School of Information, was made during the discussion on "Big Data: Bringing competition policy to the digital era" held during the 126th meeting of the OECD Competition Committee on 29 November 2016. More papers and presentations on the topic can be found at www.oecd.org/daf/competition/big-data-bringing-competition-policy-to-the-digital-era.htm
Big data: Bringing competition policy to the digital era – VARIAN – November 2016 OECD discussion
1. Big Data, Personalization, and Competition
Hal Varian
Nov 2016
These slides do not necessarily represent the view of the author’s employer.
2. Outline
1. Analytics and competitive advantage
2. Sources of positive feedback
3. Diminishing returns to data
4. Search advertising
5. Long tail queries
6. Online advertising
7. Competition and entry
a. Competition among incumbents
b. Startups in Europe
3. Analytics and competitive advantage
Cheap data. Collecting data is very inexpensive so
businesses automatically get a lot of it. (Point of sale
registers, web logs, sensors, etc.)
Example: a 2-year-old, 6-person web startup for product
recommendations: 600,000 unique visitors, 92 million web log
records, 4 gigabytes of data per month.
Data is useless by itself. It is only valuable if it can be
turned into information, knowledge and action. Data analytics
requires investment in complementary assets such as
hardware, software, and expertise. (These have also
become inexpensive but to a lesser degree.)
4. Requirements for data analytics
Hardware: Cloud computing is cheap and easy to rent from
Amazon, Google, Microsoft, IBM, etc. Services are highly
portable due to technologies developed in France (containers,
Docker).
Software: Primarily open source. LAMP: developed in Europe.
Tools: Python, R, TensorFlow.
Labor: Universities and online tutorials have done a great job
Services: Even expertise can be outsourced to companies like
Kaggle, and customer support to companies like ZenDesk.
Result: Fixed costs have been turned into variable costs. Entry is
easier than it has ever been due to scale flexibility.
6. Data scientists by country in Kaggle competitions
US 197267
IN 69660
CN 31583
GB 22533
RU 15376
CA 14501
AU 14268
DE 12986
FR 12593
JP 7521
KR 7349
ES 7168
NL 7124
TW 6803
BR 6532
SG 6325
IT 4826
PL 4408
CH 3826
IL 3303
7. Outline
1. Analytics and competitive advantage
2. Sources of positive feedback
3. Diminishing returns to data
4. Search advertising
5. Long tail queries
6. Online advertising
7. Competition and entry
a. Competition among incumbents
b. Startups in Europe
8. Sources of positive feedback
● Demand side economies of scale. The value to a user
increases as the number of users increases. (Network
effects, network externality).
○ Indirect network effects. Increased usage of
product A stimulates usage of complementary product
B which in turn increases usage of product A.
● Supply side economies of scale. The cost per unit
decreases (or quality increases) as output increases.
(Increasing returns to scale, MC < AC, decreasing AC,
often due to fixed costs.)
● Learning by doing. The cost per unit decreases (or
quality increases) as experience increases. (Learning
curve, experience curve.)
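The two supply-side mechanisms above can be made concrete with a small sketch. It assumes a firm with a fixed cost F and a constant marginal cost c, so that AC(q) = F/q + c, and a textbook experience curve in which unit cost falls to a fixed fraction (the "progress ratio") with every doubling of cumulative output. The function names and all numbers are hypothetical and are not taken from the slides.

```python
import math


def average_cost(fixed_cost: float, marginal_cost: float, quantity: float) -> float:
    """Average cost AC(q) = F/q + c for fixed cost F and constant marginal cost c."""
    return fixed_cost / quantity + marginal_cost


def learning_curve_cost(first_unit_cost: float, cumulative_units: float,
                        progress_ratio: float) -> float:
    """Unit cost after `cumulative_units` of experience.

    Cost is multiplied by `progress_ratio` each time cumulative output doubles.
    """
    return first_unit_cost * cumulative_units ** math.log2(progress_ratio)


# Hypothetical numbers, for illustration only.
F, c = 1000.0, 2.0
for q in (10, 100, 1000):
    # AC stays above MC = c but falls toward it as output grows.
    print(f"q={q:5d}  AC={average_cost(F, c, q):7.2f}  MC={c:.2f}")

for n in (1, 2, 4, 8):
    # With an 80% progress ratio, cost falls 20% per doubling of experience.
    print(f"cumulative units={n}  unit cost={learning_curve_cost(100.0, n, 0.8):6.2f}")
```

Scale economies depend on current output q, while learning depends on accumulated experience, which is why the two are treated as separate sources of positive feedback.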
9. Classic supply side returns to scale
2000: Fixed costs   | 2015: Variable costs
Data center         | Cloud computing (Amazon, Google)
Custom software     | Open source tools (Hadoop, Python, R)
Productivity tools  | Low price or free (Google)
Communication       | Email, video conferencing (Skype)
User support        | Call centers (Zendesk)
Fund raising        | Angel funding (AngelList)
Hiring              | Online labor markets (LinkedIn, Kaggle)
Sales               | Cloud CRM (Salesforce, etc.)
Fixed costs have become variable costs,
barriers to entry have fallen.
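A rough way to see why this lowers entry barriers is to compare an owned facility (large fixed cost, low unit cost) with a rented service (no fixed cost, higher unit cost). The sketch below assumes linear costs. Every figure and function name in it is hypothetical and chosen only for illustration.

```python
def own_cost(quantity: float, fixed: float = 500_000.0, unit: float = 0.02) -> float:
    """Total cost of owning capacity: up-front fixed cost plus a low unit cost."""
    return fixed + unit * quantity


def rent_cost(quantity: float, unit: float = 0.10) -> float:
    """Total cost of renting capacity: no fixed cost, higher unit cost."""
    return unit * quantity


def break_even_quantity(fixed: float, own_unit: float, rent_unit: float) -> float:
    """Output level at which owning and renting cost the same."""
    return fixed / (rent_unit - own_unit)


# Hypothetical numbers: an entrant serving 100,000 units versus a large firm.
for q in (100_000, 1_000_000, 10_000_000):
    print(f"q={q:>10,}  own={own_cost(q):>12,.0f}  rent={rent_cost(q):>12,.0f}")

print(f"break-even at about {break_even_quantity(500_000.0, 0.02, 0.10):,.0f} units")
```

Under these assumed numbers a small entrant pays far less by renting, and the fixed-cost option only wins at several million units, so the entrant's cost scales with its own output.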
10. Data network effects?
Some argue for a “data network effect”
● Users search ⇒ Click on results ⇒ search engine
learns from data ⇒ provides better search results
This is a supply side phenomenon known as “learning by
doing”.
● Present in virtually all industries
● But learning is not “free” like returns to scale
■ Requires a serious investment and commitment
■ Requires data, hardware, software, tools, expertise
■ Requires putting learning into production
● There are diminishing returns to all these factors of
production, including data
11. Why the distinction is important
● Share is relevant for simple network effects story
○ Two nightclubs in a small city
○ More people makes one club more attractive
● Size is relevant for supply side economies of scale
○ There may be a Minimum Efficient Scale
○ Can have many competing firms at or above MES
● Experience is relevant for learning by doing
○ How long they’ve been around and what they have learned
○ Companies with different cost functions can co-exist in industry
● Upsets happen
○ MySpace/Facebook, Google/Yahoo, Apple/Nokia, etc.
○ Particularly when technology changes and experience is no longer
relevant.
■ Early days of search
■ Mobile revolution
12. Outline
1. Analytics and competitive advantage
2. Sources of positive feedback
3. Diminishing returns to data
4. Search advertising
5. Long tail queries
6. Online advertising
7. Competition and entry
a. Competition among incumbents
b. Startups in Europe
13. Diminishing returns to data
Simple model: Y = μ + error
● n observations
● Estimation error for μ
○ Goes down as 1/n
○ Very rapid decline
● Unavoidable error
○ Asymptote
○ How to improve fit?
■ Better predictors
■ Better algorithms
■ Better hardware
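The diminishing-returns argument on this slide can be checked with a short simulation. The sketch assumes the simple model is a constant plus noise, written here as Y = mu + error (the symbol mu and the noise level sigma are assumptions), and that mu is estimated by the sample mean. In that case the mean squared estimation error is sigma²/n, so it falls as 1/n (the standard error falls as 1/√n), while the error in predicting a new Y cannot fall below sigma².

```python
import random
import statistics


def estimation_mse(n, trials, mu=5.0, sigma=2.0, rng=None):
    """Mean squared error of the sample mean as an estimate of mu."""
    rng = rng or random.Random(0)
    squared_errors = []
    for _ in range(trials):
        sample = [mu + rng.gauss(0.0, sigma) for _ in range(n)]
        squared_errors.append((statistics.fmean(sample) - mu) ** 2)
    return statistics.fmean(squared_errors)


SIGMA = 2.0  # hypothetical noise level
rng = random.Random(42)
results = {n: estimation_mse(n, trials=1000, sigma=SIGMA, rng=rng)
           for n in (10, 100, 1000)}

for n, mse in results.items():
    # Error when predicting a new observation: irreducible noise + estimation error.
    prediction_mse = SIGMA ** 2 + mse
    print(f"n={n:5d}  estimation MSE={mse:.4f}  "
          f"theory={SIGMA ** 2 / n:.4f}  prediction MSE={prediction_mse:.4f}")
```

Each tenfold increase in data removes about 90% of the remaining estimation error, but the prediction error flattens out near sigma². Past that point, only better predictors, algorithms, or hardware improve the fit.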
14. Comparison of Machine Learning Algorithms
http://stackoverflow.com/questions/25665017/does-the-dataset-size-influence-a-machine-learning-algorithm
15. Algorithms, hardware, or data?
The ImageNet Large Scale Visual Recognition Challenge uses a
fixed training set of 10 million labeled images. Google
recently contributed another 9 million.
16. Outline
1. Analytics and competitive advantage
2. Sources of positive feedback
3. Diminishing returns to data
4. Personalization and online advertising
5. Long tail queries
6. Online advertising
7. Competition and entry
a. Competition among incumbents
b. Startups in Europe
17. Personalization and online advertising
● Search: user sees ads related to query
○ Query is very strong signal, personalization not very
helpful. Location, recent searches/visits are helpful.
● Contextual: user sees ads related to content of web page
○ Visit a fishing site, see ads for fishing rods
● Display: user sees ads based on browsing history
○ Useful when content is not commercial
○ What ad is relevant to “earthquake in Haiti”?
○ Cost per impression is about 0.2 cent for untargeted,
0.3 cent for targeted
○ 91% of traffic to news sites sees targeted display ads
since these sites have no other useful signals. [See Do-Not-Track
and the Economics of Third-Party Advertising, Microsoft Labs.]
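The per-impression figures above can be converted to the more familiar CPM (cost per thousand impressions) quote. This is a small arithmetic sketch using exact fractions to avoid floating-point noise. The function name is illustrative.

```python
from fractions import Fraction

# Figures from the slide, in cents per impression.
UNTARGETED_CENTS = Fraction(2, 10)
TARGETED_CENTS = Fraction(3, 10)


def cpm_dollars(cents_per_impression: Fraction) -> Fraction:
    """Cost per thousand impressions, in dollars."""
    return cents_per_impression * 1000 / 100


uplift = TARGETED_CENTS / UNTARGETED_CENTS - 1

print(f"untargeted CPM: ${float(cpm_dollars(UNTARGETED_CENTS)):.2f}")
print(f"targeted CPM:   ${float(cpm_dollars(TARGETED_CENTS)):.2f}")
print(f"targeting premium: {float(uplift):.0%}")
```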
18. Forms of display ad targeting
● Re-targeting
○ Cookie is set by advertiser
○ Ad server only sees User List of cookies, often does
not know where cookies came from
● Interest based
○ Interests based on prior web site visits
○ Sensitive information is not used
● Comparative value
○ Recency is important: advertisers want shoppers who
are in market; first hour is most important
○ Re-targeting is worth about twice as much as interest
based advertising
20. Value of personalization for organic search
Query is a very strong signal, personalization is not very
helpful. What is helpful: location, recent searches, re-ranking.
Re-ranking: if you search for [weather] and always click on
10-day weather, it moves up. [This can be turned off.]
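The re-ranking idea can be illustrated with a toy sketch: results that a user has clicked more often for a query move up, and everything else keeps its default order. This is a simplified illustration under assumed inputs, not a description of how any search engine implements it. The result labels and the `rerank` function are hypothetical.

```python
from collections import Counter


def rerank(results, click_counts):
    """Move results the user clicked more often toward the top.

    `sorted` is stable, so results with equal click counts keep their default order.
    """
    return sorted(results, key=lambda result: -click_counts.get(result, 0))


# Hypothetical default ranking for the query [weather].
default_ranking = ["current conditions", "hourly forecast", "10-day weather", "radar map"]

# The user has repeatedly clicked the 10-day forecast for this query.
clicks = Counter()
for _ in range(5):
    clicks["10-day weather"] += 1

print(rerank(default_ranking, clicks))
# With personalization turned off (no click history), the order is unchanged.
print(rerank(default_ranking, {}))
```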
References
○ Of Magic Keywords & Flavors Of Personalized Search At Google,
Search Engine Land
○ Dou, Song, Wen, A large-scale evaluation and analysis of personalized
search strategies, Microsoft Labs
○ Hannak et al., Measuring personalization of web search, Northeastern
University
○ Google blog post on personalized search
21. Outline
1. Analytics and competitive advantage
2. Sources of positive feedback
3. Diminishing returns to data
4. Personalization and online advertising
5. Long tails and fat heads
6. Competition and entry
a. Competition among incumbents
b. Startups in Europe
22. Search terms: fat heads and long tails
Microsoft
● Fat head: re-ranking is good for repeated queries
○ Top 3% of queries are issued by more than 47% of users
○ 72% of repeated queries are repeated by same user
○ Re-ranking is useful for fat head queries
● Long tail: re-ranking is useless for long tail queries
○ 80% of distinct queries are issued only once in 12-day period
Google on long tail
● 15% of queries in a given day have never been seen before
● 37% of distinct queries in a given day have never been seen before
● Fraction has been essentially constant since 2009
Conclusion
● Re-ranking is useless for long tail queries
● What is useful: algorithms, machine learning, 200 search signals
● See what makes a page good in Google search quality guidelines
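The two Google figures above measure different things: the first is a share of query volume, the second a share of distinct queries. The sketch below shows how both could be computed from a query log and why the distinct-query share is the larger one. The log contents and the function name are made up for illustration.

```python
def new_query_shares(seen_before, todays_queries):
    """Return (share of today's query volume, share of today's distinct queries)
    made up of queries that were never seen before."""
    distinct_today = set(todays_queries)
    new_distinct = distinct_today - seen_before
    new_volume = sum(1 for query in todays_queries if query not in seen_before)
    return new_volume / len(todays_queries), len(new_distinct) / len(distinct_today)


# Hypothetical log: a few fat-head queries repeated often, plus one-off long-tail queries.
history = {"weather", "news"}
today = (["weather"] * 6 + ["news"] * 2
         + ["plano foundation repair", "boiler rental richomd va"])

volume_share, distinct_share = new_query_shares(history, today)
print(f"new queries as share of volume:   {volume_share:.0%}")
print(f"new queries as share of distinct: {distinct_share:.0%}")
```

Because fat-head queries are repeated many times, they dominate volume but count only once among distinct queries, so never-seen queries make up a larger share of the distinct set.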
23. Long tail queries are not esoteric
Long tail queries are often long and misspelled, and often have
geographic qualifiers. Here are some queries that generated ad
clicks and occurred only once in a particular day:
Recover information from digital camera hard drive
Hosted exchange pricing
Alpharetta air duct cleaning
House tinting houston
Stone mountain carpet cleaning
Disk data recovery cleveland
Plano foundation repair
Personal injury attorney dallas
Cheap but good auto insurance in los angels ca
Flower mound foundation repair
Boiler rental richomd va
Marietta carpet cleaner
New York wedding band
Criminal attorney fort meyers
24. Outline
1. Analytics and competitive advantage
2. Sources of positive feedback
3. Diminishing returns to data
4. Personalization and online advertising
5. Long tails and fat heads
6. Competition and entry
a. Competition among incumbents
b. Startups in Europe
26. Competition from entrants
● Incumbent has: labor, capital, expertise, data
● Entrant: at most they have expertise
● Opportunities arise when technology disruption shakes up
conventional wisdom
● Examples
○ Uber: ride with a stranger?
○ Apple: phone with no keyboard?
○ Google: text ads?
○ Amazon: sell books online?
○ Microsoft: sell operating system with no hardware?
○ ...and many other “crazy” ideas
28. VC funding in Europe
Source: Sand Hill Econometrics
Since 2010, there have been 4,000 new companies founded in
Europe, which have raised $27B. Note: 2016 only shows ½ year.
29. Summary
1. Analytics and competitive advantage
2. Sources of positive feedback
3. Diminishing returns to data
4. Search advertising
5. Long tail queries
6. Online advertising
7. Competition and entry
a. Competition among incumbents
b. Startups in Europe