Big data in the web
1. 6/28/13
Big Data in the Web
Ricardo Baeza-Yates
Yahoo! Labs
Barcelona & Santiago de Chile
Agenda
• Big Data
• Asking the Right Questions
• Wisdom of Crowds in the Web
• The Long Tail
• Issues and Examples
• Concluding Remarks
Big Data
§ Capture, transfer, store, search, share, analyze,
and visualize large data in reasonable time
§ Large volume and growth
§ Petabytes to exabytes
§ Growth is estimated at 3 exabytes per day
§ Structured vs. non-structured data
§ Diversity
§ Types, formats, complexity, topics, etc.
§ Best Public Data Example: The Web
§ Content: text, multimedia
§ Structure: graphs
§ Usage: real time streams
Big Data
§ Focus on analytics
§ Many storage technologies:
§ DBs, DWs, distributed file systems, …
§ Many processing technologies:
§ Cloud computing, map-reduce (Hadoop), …
§ Data mining, clustering, classification, …
§ Machine learning, A/B testing, NLP, …
§ Simulation
§ Several technology providers
§ Initial best practices (see TDWI report, 2011)
§ Main challenges: scalability, online processing
Big Data: The Five V’s
Characteristic | Data Issue                                | Computing Issue
Volume         | Scale, Redundancy                         | Scalability
Variety        | Heterogeneity, Complexity                 | Adaptability, Extensibility
Veracity       | Completeness, Bias, Sparsity, Noise, Spam | Reliability, Trust
Velocity       | Real time                                 | Online
Value          | Usefulness, Privacy                       | Business dependent
Asking the Right Questions
§ Problem Driven
§ What data do we need? How much?
§ How do we collect it? How do we store and transfer it?
§ Understanding the Data
§ How sparse is the data? How much noise?
§ Is there redundancy? Are there biases?
§ Is there spam? Any outliers?
§ Analyzing the Data
§ Any privacy issues? Do we need to anonymize?
§ How well do our algorithms scale?
§ Can we visualize the results?
Too Much Data Available
§ The Web is a database!
§ Data does not imply information
§ Many analyses for the sake of it (data driven)
§ Analyzing data is not CS per se
§ Publish in the right forum!
§ Big Data or Right Data?
The Different Facets of the Web
The Structure of the Web
Big Data in the Web
[Figure: Web data sources. Labels: Metadata, RDF, Wordnet, Wikipedia, ODP, Flickr, Y! Answers, UGC, Blogs, Groups, Text, Anchors + links, Logs (Clicks + Queries), Private. Groupings: Explicit, Implicit, Scale, Quality?]
Noise and Spam
§ Noise may come from many places:
§ Instruments that measure
§ How we interpret the data (example later)
§ Spam is everywhere
Web Spam
Deceiving text, links, clicks…
due to an economic incentive
Depending on the goal and the data,
spam is easier to generate
Depending on the type & target data,
spam is easier to fight
Disincentives for spammers?
• Social
• Economic
Web Spam is NOT Mail Spam
Content and Metadata Trends
[Ramakrishnan and Tomkins 2007]
Web Data Trends
• User Generated Content
– Massive (quality vs. quantity)
– Social Networks
– Real time (people + physical sensors)
• Impact
– Fragmentation of ownership
– Fragmentation of access (longer heavy tail)
– Fragmentation of right to access
• Viability
– Business model based on advertising
The Wisdom of Crowds
• James Surowiecki, a New Yorker columnist,
published this book in 2004
– “Under the right circumstances, groups are
remarkably intelligent”
• Importance of diversity, independence and
decentralization
“large groups of people are smarter than an elite few,
no matter how brilliant—they are better at solving
problems, fostering innovation, coming to wise
decisions, even predicting the future”.
Aggregating data
Web Data Mining
• Content: text & multimedia mining
• Structure: link analysis, graph mining
• Usage: log analysis, query mining
• Relate all of the above
– Web characterization
– Particular applications
Flickr: Clustering Pictures
“Crowd Sourcing”
Web-based “peer production” has produced a number of
successful products and communities
• Wikipedia, Y! Answers, YouTube, Flickr, Digg, ...
Can this form of production be harnessed for other ends?
• Existing successes are hard to replicate at will
Amazon Mechanical Turk (AMT)
• Like outsourcing, but in a micro-distributed fashion
• Thousands of “turkers” working on hundreds of “HITs” (tasks)
• Rates are typically a few cents per task
• The quality of their work has been evaluated positively (e.g., in IR)
The Wisdom of (Large) Crowds
– Crucial for Search Ranking
– Text: Web Writers & Editors
• not only for the Web!
– Links: Web Publishers
– Tags: Web Taggers
– Queries: All Web Users!
• Queries and actions (or no action!)
The crowd implicitly
knows the experts!
Scalability
§ How to scale?
§ In the best case, doubling the data doubles the time
§ Time complexity vs. result quality trade-off
§ Example: entity detection in linear time at almost state-of-the-art quality
§ That implies that there exists a text size n* beyond which, in the same time,
the linear algorithm will produce more correct entities than a slower, higher-quality one
§ Distributed parallel processing
§ Map-reduce does not always work
§ Parallelism is problem dependent
§ Online processing needs a different approach
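The time vs. quality trade-off in the entity detection example can be made concrete with a small calculation. The sketch below is only an illustration under stated assumptions, not the method behind the slide: it assumes the slower algorithm costs c_slow · n · log2(n), the linear one costs c_fast · n, and both find correct entities in proportion to the text they cover. All names and constants are hypothetical.

```python
import math

def correct_entities_slow(n, q_slow, density=0.05):
    """Correct entities the slower algorithm finds in a text of n tokens."""
    return q_slow * density * n

def correct_entities_fast(n, q_fast, c_fast, c_slow, density=0.05):
    """Correct entities the linear algorithm finds in the time the slower
    algorithm needs for n tokens (assumed cost: c_slow * n * log2(n))."""
    budget = c_slow * n * math.log2(n)   # time spent by the slower algorithm
    tokens = budget / c_fast             # text the linear algorithm covers in that time
    return q_fast * density * tokens

def crossover_size(q_fast, q_slow, c_fast, c_slow, max_n=10**6):
    """Smallest n (the n* of the slide) at which the linear algorithm wins, or None."""
    for n in range(2, max_n + 1):
        fast = correct_entities_fast(n, q_fast, c_fast, c_slow)
        if fast > correct_entities_slow(n, q_slow):
            return n
    return None

# Hypothetical numbers: 90% vs. 95% quality, linear algorithm 4x costlier per token.
n_star = crossover_size(q_fast=0.90, q_slow=0.95, c_fast=4.0, c_slow=1.0)
```

Under these assumptions the crossover depends only on the ratio of qualities and per-token costs, so a lower-quality linear algorithm always wins eventually if enough text is available.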
Redundancy and Bias
§ Is there any dependency in the data?
§ Is there any duplication?
§ Lexical duplication in the Web is around 25%
§ Semantic duplication is larger
§ Are there any biases?
§ Example 1: clicks in search engines
§ Bias to the ranking and the interface
§ There is a ranking bias in the Web content
§ Example 2: tag recommendation
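Lexical duplication, as mentioned above, is commonly estimated by comparing overlapping word sequences. The sketch below uses word shingles and Jaccard similarity, a standard technique chosen here for illustration; the slides do not say how the 25% figure was measured, and the shingle size and threshold are assumptions.

```python
def shingles(text, k=3):
    """Set of k-word shingles (overlapping word sequences) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(text_a, text_b, k=3, threshold=0.8):
    """True if the two texts share at least `threshold` of their shingles."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold

# Toy documents, made up for this example.
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "The quick brown fox jumps over the lazy dog"
doc_c = "big data in the web is a mirror of society"
same = is_near_duplicate(doc_a, doc_b)
different = is_near_duplicate(doc_a, doc_c)
```

Comparing every pair of pages does not scale to the Web; at that size one would typically hash or sketch the shingle sets first, which this example leaves out.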
We can suggest tags: nice but ....
Privacy Example:
AOL Query Logs Release Incident
No. 4417749 conducted hundreds of searches over a
three-month period on topics ranging from “numb
fingers” to “60 single men”.
Other queries: “landscapers in Lilburn, Ga,” several
people with the last name Arnold and “homes sold
in shadow lake subdivision gwinnett county
georgia.”
Data trail led to Thelma Arnold, a 62-year-old widow
who lives in Lilburn, Ga., frequently researches her
friends’ medical ailments and loves her three dogs.
A Face Is Exposed for AOL Searcher No. 4417749,
By MICHAEL BARBARO and TOM ZELLER Jr,
The New York Times, Aug 9 2006
Risks of Privacy
(ZIP code, date of birth, gender)
is enough to identify 87% of
US citizens using public DB
(Sweeney, 2001)
K-anonymity
Suppress or generalize attributes until
each entry is identical to at least k-1
other entries
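The k-anonymity definition above can be checked mechanically. The following is a minimal sketch, assuming records are dictionaries and that ZIP code, birth date, and gender are the quasi-identifiers; the field names, the generalization rule, and the toy records are all made up for illustration.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("zip", "birth", "gender")  # assumed field names

def is_k_anonymous(records, k, fields=QUASI_IDENTIFIERS):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[f] for f in fields) for r in records)
    return all(count >= k for count in groups.values())

def generalize(record):
    """Hypothetical generalization: keep a 3-digit ZIP prefix and the birth year only."""
    out = dict(record)
    out["zip"] = record["zip"][:3] + "**"
    out["birth"] = record["birth"][:4]
    return out

# Toy records, not real people.
records = [
    {"zip": "12345", "birth": "1944-03-02", "gender": "F"},
    {"zip": "12349", "birth": "1944-11-20", "gender": "F"},
    {"zip": "54321", "birth": "1970-01-15", "gender": "M"},
    {"zip": "54329", "birth": "1970-06-30", "gender": "M"},
]
raw_ok = is_k_anonymous(records, k=2)
generalized = [generalize(r) for r in records]
generalized_ok = is_k_anonymous(generalized, k=2)
```

Here the raw records are all unique, while the generalized ones fall into two groups of two, so they are 2-anonymous but not 3-anonymous.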
Federal Trade Commission in
US: Privacy policies should
“address the collection of data
itself and not just how the
data is used”, Dec 2010.
Data Protection Directive in EU
Risks of Privacy: Query Logs
Profile: [Jones, Kumar, Pang, Tomkins, CIKM 2007]
• Gender: 84%
• Age (±10): 79%
• Location (ZIP3): 35%
Vanity Queries: [Jones et al, CIKM 2008]
• Partial name: 8.9%
• Complete: 1.2%
More information:
• A Survey of query log privacy-enhancing techniques
from a policy perspective [Cooper, ACM TWEB 2008]
Good anonymization is still an open problem
Sparsity
§ The Long Tail is always Sparse
§ Why is there a long tail?
§ When the crowd dominates
§ Empowering the tail
§ Example: Relations from Query Logs
The Wisdom of Crowds
– Popularity
– Diversity
– Quality
– Coverage
Long tail
Heavy tail
The Long Tail
Most measures in the Web follow a power law
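To get a feel for how much weight a power law puts in its tail, the sketch below computes the share of total volume that falls outside the most popular items. It assumes an idealized Zipf distribution in which the item at rank r has frequency proportional to 1 / r^s; the exponent and sizes are illustrative choices, not measurements from the slides.

```python
def zipf_weights(n_items, exponent=1.0):
    """Unnormalized Zipf frequencies for ranks 1..n_items."""
    return [1.0 / rank ** exponent for rank in range(1, n_items + 1)]

def tail_share(n_items, head_size, exponent=1.0):
    """Fraction of total volume contributed by items ranked below the head."""
    weights = zipf_weights(n_items, exponent)
    return sum(weights[head_size:]) / sum(weights)

# Share of volume outside the top 10 of 1000 items, with exponent 1.
share = tail_share(n_items=1000, head_size=10)
```

With these numbers the tail holds about 61% of the volume, even though each tail item is individually rare.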
Heavy tail of user interests: one explanation
Many queries, each asked very few times, make
up a large fraction of all queries
Movies watched, blogs read, words used, …
[Figure: People vs. Interests, with regions labeled “Normal people” and “Weirdos”]
Heavy tail of user interests: the reality
Many queries, each asked very few times, make
up a large fraction of all queries
Applies to word usage, web page access, …
We are all partially eclectic
[Figure: People vs. Interests]
Broder, Gabrilovich, Goel, Pang; WSDM 2009
Example: Click Distribution
User interaction follows a power law!
(Zipf’s principle of minimal effort)
When the crowd dominates
Kills the long tail
See the (now obsolete) “shwarzneger” example
Empowering the Tail
The Filter “Bubble”, Eli Pariser
• Avoid the Poor get Poorer Syndrome
Solutions:
• Diversity
• Novelty
• Serendipity
Explore & Exploit
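One common way to balance exploring and exploiting is an epsilon-greedy rule, sketched below. This is a generic illustration and not necessarily the approach the slide has in mind; the function name, the click-rate estimates, and the value of epsilon are assumptions.

```python
import random

def epsilon_greedy(click_rates, epsilon, rng):
    """Pick an item index.

    With probability epsilon, explore: choose uniformly at random, which gives
    tail items a chance to be shown. Otherwise exploit: choose the item with
    the highest estimated click rate.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(click_rates))
    return max(range(len(click_rates)), key=lambda i: click_rates[i])

# Hypothetical click-rate estimates for three items.
rates = [0.10, 0.50, 0.20]
rng = random.Random(0)
picks = [epsilon_greedy(rates, epsilon=0.1, rng=rng) for _ in range(1000)]
```

With epsilon set to 0 only the current best item is ever shown, which is the “poor get poorer” situation; a small positive epsilon keeps collecting evidence about the rest.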
How to Circumvent Sparsity?
Wisdom of “ad-hoc” crowds?
Aggregate data in the “right way”
When data is sparse
Aggregate users around the same intent, task, facet, …
Change granularity “ad hoc”
• Middle-aged men
• Fans of Messi
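Changing granularity when data is sparse can be sketched as a simple back-off: use a user's own history when there is enough of it, otherwise fall back to an aggregate over a group of similar users. The example below is a minimal illustration with made-up events; the field names, the segment label, and the `min_events` threshold are assumptions.

```python
from collections import Counter

def topic_counts(events, field, value):
    """Count topics over the events whose `field` equals `value`."""
    return Counter(e["topic"] for e in events if e[field] == value)

def interest_profile(events, user, segment, min_events=3):
    """Return (level, topic counts), backing off from user to segment when sparse."""
    own = topic_counts(events, "user", user)
    if sum(own.values()) >= min_events:
        return "user", own
    return "segment", topic_counts(events, "segment", segment)

# Toy events, made up for this example.
events = [
    {"user": "u1", "segment": "fans of Messi", "topic": "football"},
    {"user": "u2", "segment": "fans of Messi", "topic": "football"},
    {"user": "u3", "segment": "fans of Messi", "topic": "football"},
    {"user": "u3", "segment": "fans of Messi", "topic": "travel"},
    {"user": "u4", "segment": "other", "topic": "cooking"},
    {"user": "u4", "segment": "other", "topic": "cooking"},
    {"user": "u4", "segment": "other", "topic": "cooking"},
]
sparse_level, sparse_profile = interest_profile(events, "u1", "fans of Messi")
dense_level, dense_profile = interest_profile(events, "u4", "other")
```

User u1 has a single event, so the profile comes from the whole segment; u4 has enough history to be profiled individually.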
Example: Mining Geo/time Data
• Optimal Touristic Paths from Flickr
• Good for tourists and locals
De Choudhury et al, HT 2010
Aggregating in the Long Tail
• The long tail is important not only for e-commerce, but because we are all there
• Personalization vs. Contextualization
User interaction is another long tail
[Figure: People vs. Interests]
Epilogue
• The Web is scientifically young
• The Web is intellectually diverse
• The technology mirrors the economic, legal and sociological reality
• Data must be interesting! (Gerhard Weikum)
• Problem driven
• Plenty of challenges
Mirror of Society
Exports/Imports vs. Domain Links
Baeza-Yates & Castillo, WWW2006