Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/crunching-data-with-google-bigquery/jordan-tigani
12. MapReduce is Flexible but Heavy
• Master constructs the plan and begins spinning up workers
• Mappers read and write to distributed storage
• Map => Shuffle => Reduce
• Reducers read and write to distributed storage
[Diagram: Master, two Mappers, Reducer, Distributed Storage]
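The Map => Shuffle => Reduce flow above can be sketched in a few lines of Python. This is a single-process toy, not how a real MapReduce cluster is implemented: the `storage` dict stands in for distributed storage, and the function names and sample records are made up for illustration.

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit (key, value) pairs, here (state, 1) per record."""
    return [(state, 1) for state in records]

def shuffle_phase(pairs):
    """Shuffle: group all values by key so each key reaches one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: combine the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Every stage reads its input from, and writes its output to, "storage".
storage = {"input": ["CA", "TX", "CA", "NY", "CA"]}
storage["mapped"] = map_phase(storage["input"])
storage["output"] = reduce_phase(shuffle_phase(storage["mapped"]))
print(storage["output"])
```

Each intermediate result is written back to `storage` before the next step reads it, which mirrors the bullets about mappers and reducers going through distributed storage.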
13. MapReduce is Flexible but Heavy
[Diagram: Stage 1 and Stage 2, each with a Master, two Mappers, and a Reducer, connected through Distributed Storage]
14. Dremel vs MapReduce
• MapReduce
o Flexible batch processing
o High overall throughput
o High latency
• Dremel
o Optimized for interactive SQL queries
o Very low latency
16. Simple Query
SELECT
state, COUNT(*) count_babies
FROM [publicdata:samples.natality]
WHERE
year >= 1980 AND year < 1990
GROUP BY state
ORDER BY count_babies DESC
LIMIT 10
17. [Diagram: execution tree for the query on slide 16, top to bottom]
Mixer 0: COUNT(*) GROUP BY state, ORDER BY count_babies DESC, LIMIT 10
  receives O(50 states) from each Mixer 1
Mixer 1 (x2): COUNT(*) GROUP BY state
  receives O(50 states) from the Leaf nodes
Leaf (x4): COUNT(*) GROUP BY state, WHERE year >= 1980 and year < 1990
  receives O(Rows ~140M) from Distributed Storage
Distributed Storage: SELECT state, year
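The tree above can be approximated with a small Python sketch: leaves filter and pre-aggregate, mixers merge partial counts, and the root sorts and applies the limit. The function names and the sample rows are assumptions for illustration, not Dremel internals.

```python
from collections import Counter

def leaf(rows):
    """Leaf: apply the WHERE filter, then a partial COUNT(*) per state."""
    return Counter(state for state, year in rows if 1980 <= year < 1990)

def mixer(partials):
    """Mixer: merge partial counts from child nodes by summing per state."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def root(partials, limit=10):
    """Mixer 0: final merge, then ORDER BY count DESC and LIMIT."""
    merged = mixer(partials)
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

# Made-up (state, year) rows, split across four leaves.
shards = [
    [("CA", 1981), ("TX", 1985), ("CA", 1979)],
    [("CA", 1989), ("NY", 1982)],
    [("TX", 1990), ("TX", 1984)],
    [("NY", 1980), ("CA", 1983)],
]
leaf_out = [leaf(shard) for shard in shards]
mixer1_out = [mixer(leaf_out[:2]), mixer(leaf_out[2:])]
result = root(mixer1_out)
print(result)
```

Only the leaves touch every row. Everything passed upward is at most one entry per state, which is why the upper levels of the tree are labelled O(50 states).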
23. Modeling Event Data: Social Music Store
logs.oct_24_2012_song_activities
USERNAME  ACTIVITY  Cost  SONG           ARTIST      TIMESTAMP
Michael   LISTEN          Too Close      Alex Clare  1351065562
Michael   LISTEN          Gangnam Style  PSY         1351105150
Jim       LISTEN          Complications  Deadmau5    1351075720
Michael   PURCHASE  0.99  Gangnam Style  PSY         1351115962
24. Users Who Listened to More than 10 Songs/Day
SELECT
UserId, COUNT(*) as ListenActivities
FROM
[logs.oct_24_2012_song_activities]
GROUP EACH BY
UserId
HAVING
ListenActivities > 10
25. How Many Songs Listened to Total by Listeners of PSY?
SELECT
UserId, count(*) as ListenActivities
FROM
[logs.oct_24_2012_song_activities]
WHERE UserId IN (
SELECT
UserId
FROM
[logs.oct_24_2012_song_activities]
WHERE artist = 'PSY')
GROUP EACH BY UserId
HAVING
ListenActivities > 10
27. Which Users Have Listened to Beyonce?
SELECT
UserID,
COUNT(ListenActivities.artist) WITHIN RECORD
AS song_count
FROM
[logs.oct_24_2012_songactivities]
WHERE
UserID IN (SELECT UserID
FROM [logs.oct_24_2012_songactivities]
WHERE ListenActivities.artist = 'Beyonce');
28. What Position are PSY songs in our Users' Daily Playlists?
SELECT
UserID,
POSITION(ListenActivities.artist)
FROM
[sample_music_logs.oct_24_2012_songactivities]
WHERE
ListenActivities.artist = 'PSY';
29. Average Position of Songs by PSY in All Daily Playlists?
SELECT
AVG(POSITION(ListenActivities.artist))
FROM
[sample_music_logs.oct_24_2012_songactivities],
[sample_music_logs.oct_23_2012_songactivities],
/* etc... */
WHERE
ListenActivities.artist = 'PSY';
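As a rough illustration of what the nested-record queries on slides 27 to 29 compute, here is a Python sketch over a few hand-made records. It assumes one record per user with a repeated `ListenActivities` field kept in playlist order, and that POSITION is 1-based. The user "Ana" and all playlist contents are invented for the example.

```python
# Hypothetical nested records: one per user, playlist order preserved.
records = [
    {"UserID": "Michael", "ListenActivities": [
        {"artist": "Alex Clare"}, {"artist": "PSY"}]},
    {"UserID": "Jim", "ListenActivities": [
        {"artist": "Deadmau5"}]},
    {"UserID": "Ana", "ListenActivities": [
        {"artist": "PSY"}, {"artist": "Beyonce"}, {"artist": "PSY"}]},
]

def positions(record, artist):
    """1-based positions of an artist inside one record's repeated field."""
    return [i for i, activity in enumerate(record["ListenActivities"], start=1)
            if activity["artist"] == artist]

# Slide 27 analogue: per-record song count for users who played Beyonce.
beyonce_song_counts = {
    r["UserID"]: len(r["ListenActivities"])
    for r in records if positions(r, "Beyonce")
}

# Slide 28 analogue: positions of PSY songs in each user's playlist.
psy_positions = {r["UserID"]: positions(r, "PSY") for r in records}

# Slide 29 analogue: average position across all playlists.
flat = [p for ps in psy_positions.values() for p in ps]
average_position = sum(flat) / len(flat)
print(beyonce_song_counts, psy_positions, average_position)
```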
30. Summary: Choosing a BigQuery Data Model
• "Shard" your Data Using Multiple Tables
• Source Data Files
o CSV format
o Newline-delimited JSON
• Using Nested and Repeated Records
o Simplify Some Types of Queries
o Often Matches Document Database Models
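To make the nested and repeated model concrete, the sketch below turns the flat event rows from slide 23 into one record per user and serializes them as newline-delimited JSON. The field names `UserID` and `ListenActivities` follow the later queries; `song`, `timestamp`, and the helper `to_nested` are assumptions for illustration, not a prescribed schema.

```python
import json
from itertools import groupby

# Flat rows as on slide 23: (username, activity, cost, song, artist, timestamp).
flat_rows = [
    ("Michael", "LISTEN", None, "Too Close", "Alex Clare", 1351065562),
    ("Michael", "LISTEN", None, "Gangnam Style", "PSY", 1351105150),
    ("Jim", "LISTEN", None, "Complications", "Deadmau5", 1351075720),
    ("Michael", "PURCHASE", 0.99, "Gangnam Style", "PSY", 1351115962),
]

def to_nested(rows):
    """Group LISTEN rows per user into one record with a repeated field."""
    listens = sorted((r for r in rows if r[1] == "LISTEN"),
                     key=lambda r: (r[0], r[5]))
    for user, group in groupby(listens, key=lambda r: r[0]):
        yield {
            "UserID": user,
            "ListenActivities": [
                {"song": r[3], "artist": r[4], "timestamp": r[5]}
                for r in group
            ],
        }

# Newline-delimited JSON: one complete JSON object per line.
ndjson = "\n".join(json.dumps(record) for record in to_nested(flat_rows))
print(ndjson)
```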
37. Libraries - Example JavaScript Query
var request = gapi.client.bigquery.jobs.query({
'projectId': project_id,
'timeoutMs': '30000',
'query': 'SELECT state, AVG(mother_age) AS theav ' +
'FROM [publicdata:samples.natality] ' +
'WHERE year=2000 AND ever_born=1 ' +
'GROUP BY state ' +
'ORDER BY theav DESC;'
});
request.execute(function(response) {
console.log(response);
$.each(response.result.rows, function(i, item) {
...
42. BigQuery - Aggregate Big Data Analysis in Seconds
• Full table scans FAST
• Aggregate Queries on Massive Datasets
• Supports Flat and Nested/Repeated Data Models
• It's an API
Get started now:
http://developers.google.com/bigquery/
43. SELECT questions FROM audience
SELECT 'Thank You!'
FROM jordan
http://developers.google.com/bigquery
45. Schema definition
birth_record: parent_id_mother, parent_id_father, plurality, is_male, race, weight
parents: id, race, age, cigarette_use, state
67. A Little Later ...
Underlying table:
• Wikipedia page revision records
• Rows: 314 million
• Byte size: 35.7 GB
Query Stats:
• Scanned 7 GB of data
• <5 seconds
• ~ 100M rows scanned / second

Row  wp_namespace  Revs
1    0             53697002
2    1             6151228
3    3             5519859
4    4             4184389
5    2             3108562
6    10            1052044
7    6             877417
8    14            838940
9    5             651749
10   11            192534
11   100           148135
68. [Diagram: execution tree, top to bottom]
Mixer 0: COUNT (revision_id) GROUP BY wp_namespace, ORDER BY Revs DESC
Mixer 1 (x2): COUNT (revision_id) GROUP BY wp_namespace
Leaf (x4): COUNT (revision_id) GROUP BY wp_namespace, WHERE timestamp > CUTOFF
  reads at 10 GB / s from Distributed Storage
Distributed Storage: SELECT wp_namespace, revision_id
69. "Multi-stage" Query
SELECT
LogEdits, COUNT(contributor_id) Contributors
FROM (
SELECT
contributor_id,
INTEGER(LOG10(COUNT(revision_id))) LogEdits
FROM [publicdata:samples.wikipedia]
GROUP EACH BY contributor_id)
GROUP BY LogEdits
ORDER BY LogEdits DESC
70. [Diagram: execution tree for the multi-stage query, top to bottom]
Mixer 0: COUNT(contributor_id) GROUP BY LogEdits, ORDER BY LogEdits DESC
Mixer 1 (x2): COUNT(contributor_id) GROUP BY LogEdits
Leaf (x2) and Shuffler (x2): COUNT(contributor_id) GROUP BY LogEdits; SELECT LE, Id; COUNT(*) GROUP BY contributor_id
Shuffle by contributor_id (labels: N^2, GB/s)
Distributed Storage: SELECT contributor_id
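The shuffle step in this tree can be sketched in Python as a hash partition on contributor_id followed by per-shuffler aggregation. This is a simplified model under stated assumptions: the node roles are collapsed into three hypothetical functions, `zlib.crc32` stands in for whatever partitioning function the real system uses, and the contributor ids are made up.

```python
import math
import zlib
from collections import Counter

def shard_for(contributor_id, n_shufflers):
    """Stable hash partition: all rows for one id land on one shuffler."""
    return zlib.crc32(str(contributor_id).encode()) % n_shufflers

def shuffle(leaf_outputs, n_shufflers):
    """Route every contributor_id read by the leaves to its shuffler."""
    buckets = [[] for _ in range(n_shufflers)]
    for rows in leaf_outputs:
        for contributor_id in rows:
            buckets[shard_for(contributor_id, n_shufflers)].append(contributor_id)
    return buckets

def shuffler(bucket):
    """Inner query on one shuffler: COUNT per contributor, bucket by
    INTEGER(LOG10(count)), then a partial COUNT per LogEdits bucket."""
    edits = Counter(bucket)
    return Counter(int(math.log10(n)) for n in edits.values())

def mixer(partials):
    """Mixer: sum the partial per-LogEdits counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Made-up contributor ids as read by three leaves (one entry per revision).
leaf_outputs = [[1, 1, 2, 3], [1, 2, 3, 3] + [3] * 8, [4]]
buckets = shuffle(leaf_outputs, n_shufflers=2)
partials = [shuffler(bucket) for bucket in buckets]
result = sorted(mixer(partials).items(), reverse=True)  # ORDER BY LogEdits DESC
print(result)
```

Because each contributor_id reaches exactly one shuffler, the per-contributor counts are complete there, and no single node has to hold every distinct contributor in memory.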
71. When to use EACH
• Shuffle definitely adds some overhead
• Poor query performance if used incorrectly
• GROUP BY
o Groups << Rows => Unbalanced load
o Example: GROUP BY state
• GROUP EACH BY
o Groups ~ Rows
o Example: GROUP EACH BY user_id
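One way to read the load-balance point above is that partitioning rows by group key only spreads work evenly when there are many distinct keys. The Python sketch below measures that under simplifying assumptions: keys are integers, `key % n_workers` stands in for a hash function, and the row counts are invented.

```python
def load_per_worker(keys, n_workers):
    """Rows each worker receives when rows are partitioned by group key."""
    loads = [0] * n_workers
    for key in keys:
        loads[key % n_workers] += 1  # modulo stands in for a hash function
    return loads

def imbalance(loads):
    """Busiest worker's load relative to a perfectly even split."""
    return max(loads) / (sum(loads) / len(loads))

N_WORKERS = 8

# Groups << Rows: 3 distinct keys (think: states) over 10,000 rows.
few_groups = [0] * 9000 + [1] * 900 + [2] * 100
# Groups ~ Rows: 10,000 distinct keys (think: user ids), one row each.
many_groups = list(range(10000))

few_loads = load_per_worker(few_groups, N_WORKERS)
many_loads = load_per_worker(many_groups, N_WORKERS)
print(few_loads, imbalance(few_loads))
print(many_loads, imbalance(many_loads))
```

With three keys, one worker receives most of the rows and five receive none. With one key per row, every worker receives the same share.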