Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012

Crunching Data with BigQuery
Fast analysis of Big Data

Jordan Tigani, Software Engineer


Big Data at Google

72 hours

100 million gigabytes

SELECT
kick_ass_product_plan AS strategy,
AVG(kicking_factor) AS awesomeness
FROM
lots_of_data
GROUP BY
strategy

+-------------+-------------+
| strategy    | awesomeness |
+-------------+-------------+
| "Forty-two" | 1000000.01  |
+-------------+-------------+
1 row in result set (10.2 s)
Scanned 100GB

Regular expressions on 13 billion rows...

13 Billion rows
1 TB of data in 4 tables
FAST!

Google's Internal Technology:
Dremel

MapReduce is Flexible but Heavy

• Master constructs the plan and begins spinning up workers
• Mappers read and write to distributed storage
• Map => Shuffle => Reduce
• Reducers read and write to distributed storage

[Diagram: Master, Mappers, Reducer, Distributed Storage]

MapReduce is Flexible but Heavy

[Diagram: two MapReduce stages (Stage 1, Stage 2), each with a Master, Mappers, and a Reducer, connected through Distributed Storage]

Dremel vs MapReduce

• MapReduce
o Flexible batch processing
o High overall throughput
o High latency

• Dremel
o Optimized for interactive SQL queries
o Very low latency

Dremel Architecture

• Partial Reduction
• Diskless data flow
• Long lived shared serving tree
• Columnar Storage

[Diagram: serving tree with Mixer 0 at the root, two Mixer 1 nodes, four Leaf nodes, and Distributed Storage at the bottom]

Simple Query
SELECT
state, COUNT(*) count_babies
FROM [publicdata:samples.natality]
WHERE
year >= 1980 AND year < 1990
GROUP BY state
ORDER BY count_babies DESC
LIMIT 10

[Diagram: execution of the query across the serving tree]
Mixer 0: COUNT(*) GROUP BY state; ORDER BY count_babies DESC; LIMIT 10
  receives O(50 states) from each Mixer 1
Mixer 1 (x2): COUNT(*) GROUP BY state
  receives O(50 states) from each Leaf
Leaf (x4): COUNT(*) GROUP BY state WHERE year >= 1980 and year < 1990
  reads O(Rows ~140M)
Distributed Storage: SELECT state, year
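
The way the simple query flows through the serving tree can be imitated in a few lines of Python. This is only a minimal sketch, not Dremel's implementation: `leaf_count`, `mix`, `top_states`, and the sample rows are hypothetical names and data chosen for illustration. Each leaf filters and counts its own slice of rows, and each mixer only merges small per-state dictionaries, which is why intermediate results stay at O(states) rather than O(rows).

```python
from collections import Counter

def leaf_count(rows):
    """Leaf: apply the WHERE filter, then COUNT(*) GROUP BY state on one shard."""
    counts = Counter()
    for state, year in rows:
        if 1980 <= year < 1990:
            counts[state] += 1
    return counts

def mix(partials):
    """Mixer: merge the partial counts of its children (partial reduction)."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def top_states(shards, limit=10):
    """Root mixer: merge, then ORDER BY count DESC and LIMIT."""
    leaves = [leaf_count(shard) for shard in shards]
    half = len(leaves) // 2
    # Two intermediate mixers, each merging half of the leaves.
    level1 = [mix(leaves[:half]), mix(leaves[half:])]
    merged = mix(level1)
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

# Hypothetical sample data: four shards of (state, year) rows.
shards = [
    [("CA", 1981), ("NY", 1985), ("CA", 1979)],
    [("CA", 1989), ("TX", 1990)],
    [("NY", 1980), ("TX", 1984)],
    [("CA", 1983)],
]
result = top_states(shards)
print(result)
```

Only the leaves ever see individual rows; everything above them handles at most one entry per state.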

Example: Daily Weather Station Data

weather_station_data
station  lat    long    mean_temp  humidity  timestamp   year  month  day
9384     33.57  86.75   89.3       .35       1351005129  2011  04     19
2857     36.77  119.72  78.5       .24       1351005135  2011  04     19
3475     40.77  73.98   68         .35       1351015930  2011  04     19
etc...

Example: Daily Weather Station Data

station, lat, long, mean_temp, year, mon, day
999999, 36.624, -116.023, 63.6, 2009, 10, 9
911904, 20.963, -156.675, 83.4, 2009, 10, 9
916890, -18133, 178433, 76.9, 2009, 10, 9
943320, -20678, 139488, 73.8, 2009, 10, 9

CSV

Organizing BigQuery Tables

[Diagram: Your Source Data loaded into one table per day: October 22, October 23, October 24]

Modeling Event Data: Social Music Store

logs.oct_24_2012_song_activities
USERNAME  ACTIVITY  Cost  SONG           ARTIST      TIMESTAMP
Michael   LISTEN          Too Close      Alex Clare  1351065562
Michael   LISTEN          Gangnam Style  PSY         1351105150
Jim       LISTEN          Complications  Deadmau5    1351075720
Michael   PURCHASE  0.99  Gangnam Style  PSY         1351115962

Users Who Listened to More than 10 Songs/Day
SELECT
UserId, COUNT(*) as ListenActivities
FROM
[logs.oct_24_2012_song_activities]
GROUP EACH BY
UserId
HAVING
ListenActivities > 10

How Many Songs Listened to Total by Listeners of PSY?
SELECT
UserId, count(*) as ListenActivities
FROM
[logs.oct_24_2012_song_activities]
WHERE UserId IN (
SELECT
UserId
FROM
[logs.oct_24_2012_song_activities]
WHERE artist = 'PSY')
GROUP EACH BY UserId
HAVING
ListenActivities > 10

Modeling Event Data: Nested and Repeated Values
{"UserID" : "Michael",
"Listens": [
{"TrackId":1234,"Title":"Gangnam Style",
"Artist":"PSY","Timestamp":1351075700},
{"TrackId":1234,"Title":"Too Close",
"Artist":"Alex Clare","Timestamp":1351075700}
],
"Purchases": [
{"Track":2345,"Title":"Gangnam Style",
"Artist":"PSY","Timestamp":1351075700,"Cost":0.99}
]}

JSON
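
One way to get from the flat activity log to a nested record like this is to group the rows by user before loading. The following is a rough sketch under stated assumptions: `nest_events` is a hypothetical helper, the input rows are the four sample rows from the flat table, and the output keeps only the fields visible on the slides (no track IDs, since the flat table has none). Each record is written on its own line, matching the newline-delimited JSON source format.

```python
import json

def nest_events(events):
    """Group flat activity rows into one record per user with repeated fields."""
    users = {}
    for e in events:
        record = users.setdefault(
            e["username"],
            {"UserID": e["username"], "Listens": [], "Purchases": []},
        )
        item = {"Title": e["song"], "Artist": e["artist"], "Timestamp": e["timestamp"]}
        if e["activity"] == "PURCHASE":
            item["Cost"] = e["cost"]
            record["Purchases"].append(item)
        else:
            record["Listens"].append(item)
    return list(users.values())

# Sample rows taken from the flat song-activity table.
events = [
    {"username": "Michael", "activity": "LISTEN", "cost": None,
     "song": "Too Close", "artist": "Alex Clare", "timestamp": 1351065562},
    {"username": "Michael", "activity": "LISTEN", "cost": None,
     "song": "Gangnam Style", "artist": "PSY", "timestamp": 1351105150},
    {"username": "Jim", "activity": "LISTEN", "cost": None,
     "song": "Complications", "artist": "Deadmau5", "timestamp": 1351075720},
    {"username": "Michael", "activity": "PURCHASE", "cost": 0.99,
     "song": "Gangnam Style", "artist": "PSY", "timestamp": 1351115962},
]

records = nest_events(events)
# Newline-delimited JSON: one complete record per line.
ndjson = "\n".join(json.dumps(r) for r in records)
print(ndjson)
```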

Which Users Have Listened to Beyonce?
SELECT
UserID,
COUNT(ListenActivities.artist) WITHIN RECORD
AS song_count
FROM
[logs.oct_24_2012_songactivities]
WHERE
UserID IN (SELECT UserID
FROM [logs.oct_24_2012_songactivities]
WHERE ListenActivities.artist = 'Beyonce');

What Position are PSY songs in our Users' Daily Playlists?
SELECT
UserID,
POSITION(ListenActivities.artist)
FROM
[sample_music_logs.oct_24_2012_songactivities]
WHERE
ListenActivities.artist = 'PSY';

Average Position of Songs by PSY in All Daily Playlists?
SELECT
AVG(POSITION(ListenActivities.artist))
FROM
[sample_music_logs.oct_24_2012_songactivities],
[sample_music_logs.oct_23_2012_songactivities],
/* etc... */
WHERE
ListenActivities.artist = 'PSY';
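
With one table per day, the comma-separated table list in the FROM clause grows quickly. A small helper can generate it from a date range. This is a sketch only: `daily_tables` is a hypothetical function, and the naming pattern is inferred from the two table names shown in the query, so it may not match a real dataset.

```python
from datetime import date, timedelta

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def daily_tables(dataset, start, end, suffix="songactivities"):
    """Return bracketed table names, one per day from start to end inclusive."""
    names = []
    day = start
    while day <= end:
        # e.g. oct_24_2012, following the naming seen in the query
        prefix = f"{MONTHS[day.month - 1]}_{day.day:02d}_{day.year}"
        names.append(f"[{dataset}.{prefix}_{suffix}]")
        day += timedelta(days=1)
    return names

tables = daily_tables("sample_music_logs", date(2012, 10, 22), date(2012, 10, 24))
query = (
    "SELECT AVG(POSITION(ListenActivities.artist))\n"
    "FROM\n  " + ",\n  ".join(tables) + "\n"
    "WHERE ListenActivities.artist = 'PSY';"
)
print(query)
```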

Summary: Choosing a BigQuery Data Model
• "Shard" your Data Using Multiple Tables
• Source Data Files
  o CSV format
  o Newline-delimited JSON
• Using Nested and Repeated Records
  o Simplify Some Types of Queries
  o Often Matches Document Database Models

Upload Your Data

[Diagram: Google Cloud Storage => BigQuery]

Load your Data into BigQuery
"jobReference":{
"projectId":"605902584318"},
"configuration":{
"load":{
"destinationTable":{
"projectId":"605902584318",
"datasetId":"my_dataset",
"tableId":"widget_sales"},
"sourceUris":[
"gs://widget-sales-data/2012080100.csv"],
"schema":{
"fields":[{
"name":"widget",
"type":"string"},
...

POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
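
The request body above can be assembled programmatically before it is posted. The sketch below only builds and prints the JSON; it makes no HTTP call and handles no authentication. `build_load_job` is a hypothetical helper, and only the one schema field visible on the slide is included, since the remaining fields are elided there.

```python
import json

def build_load_job(project_id, dataset_id, table_id, source_uris, fields):
    """Build a load job request body with the structure shown on the slide."""
    return {
        "jobReference": {"projectId": project_id},
        "configuration": {
            "load": {
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                "sourceUris": list(source_uris),
                "schema": {"fields": list(fields)},
            }
        },
    }

project_id = "605902584318"
body = build_load_job(
    project_id,
    "my_dataset",
    "widget_sales",
    ["gs://widget-sales-data/2012080100.csv"],
    [{"name": "widget", "type": "string"}],  # other fields are elided on the slide
)
url = "https://www.googleapis.com/bigquery/v2/projects/%s/jobs" % project_id
print("POST", url)
print(json.dumps(body, indent=2))
```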

Query Away!

"jobReference":{
"projectId":"605902584318",
"query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count
FROM widget_sales",
"maxResults":100,
"apiVersion":"v2"
}

POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs

Libraries

• Python • JavaScript
• Java • Go
• .NET • PHP
• Ruby • Objective-C

Libraries - Example JavaScript Query

var request = gapi.client.bigquery.jobs.query({
  'projectId': project_id,
  'timeoutMs': '30000',
  // Concatenate the pieces: a quoted string cannot span several lines.
  'query': 'SELECT state, AVG(mother_age) AS theav ' +
           'FROM [publicdata:samples.natality] ' +
           'WHERE year=2000 AND ever_born=1 ' +
           'GROUP BY state ' +
           'ORDER BY theav DESC;'
});

request.execute(function(response) {
console.log(response);
$.each(response.result.rows, function(i, item) {
...

Custom Code and the Google Chart Tools API

Commercial Visualization Tools

Demo: Using BigQuery on BigQuery

BigQuery - Aggregate Big Data Analysis in Seconds

• Full table scans FAST
• Aggregate Queries on Massive Datasets
• Supports Flat and Nested/Repeated Data Models
• It's an API

Get started now:
http://developers.google.com/bigquery/

SELECT questions FROM audience

SELECT 'Thank You!'
FROM jordan

http://developers.google.com/bigquery

Schema definition

birth_record        parents
parent_id_mother    id
parent_id_father    race
plurality           age
is_male             cigarette_use
race                state
weight

Schema definition

birth_record
mother_race
mother_age
mother_cigarette_use
mother_state
father_race
father_age
father_cigarette_use
father_state
plurality
is_male
race
weight
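
The flat schema is what you get by joining each birth record with its two parent rows before loading. A minimal sketch of that preparation step follows, assuming the normalized columns from the two-table schema; `denormalize` is a hypothetical helper and the sample values are made up.

```python
def denormalize(birth, parents_by_id):
    """Fold the mother and father rows into one flat birth_record row."""
    flat = {}
    for role in ("mother", "father"):
        parent = parents_by_id[birth["parent_id_" + role]]
        for column in ("race", "age", "cigarette_use", "state"):
            flat[role + "_" + column] = parent[column]
    for column in ("plurality", "is_male", "race", "weight"):
        flat[column] = birth[column]
    return flat

# Hypothetical sample rows.
parents = {
    1: {"id": 1, "race": 1, "age": 29, "cigarette_use": False, "state": "CA"},
    2: {"id": 2, "race": 1, "age": 31, "cigarette_use": True, "state": "CA"},
}
birth = {
    "parent_id_mother": 1, "parent_id_father": 2,
    "plurality": 1, "is_male": True, "race": 1, "weight": 7.5,
}

row = denormalize(birth, parents)
print(list(row))
```

The join happens once, at load time, so queries never need to join against the parents table.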

Tools to prepare your data

• App Engine MapReduce
• Commercial ETL tools
  o Pervasive
  o Informatica
  o Talend
• UNIX command-line

Schema definition - sharding
birth_record_2011, birth_record_2012, birth_record_2013,
birth_record_2014, birth_record_2015, birth_record_2016

Each yearly table has the same columns:
mother_race
mother_age
mother_cigarette_use
mother_state
father_race
father_age
father_cigarette_use
father_state
plurality
is_male
race
weight

“ If you do a table scan over a 1TB table,
you're going to have a bad time. ”

Anonymous
16th century Italian Philosopher-Monk

Goal: Perform a 1 TB table scan in 1 second
Parallelize Parallelize Parallelize!

• Reading 1 TB / second from disk:
  o 10k+ disks
• Processing 1 TB / sec:
  o 5k processors
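
The per-device load implied by these numbers is easy to check. The sketch below assumes a decimal terabyte (10^12 bytes) and exactly 10,000 disks and 5,000 processors; the slide says "10k+" and "5k", so treat the results as rough upper bounds per device.

```python
TB = 10**12  # decimal terabyte in bytes (an assumption; the slide does not say)
MB = 10**6

disks = 10_000
processors = 5_000

# Throughput each device must sustain to cover 1 TB in one second.
per_disk_mb_s = TB / disks / MB
per_processor_mb_s = TB / processors / MB

print(f"Each disk must read {per_disk_mb_s:.0f} MB/s")
print(f"Each processor must handle {per_processor_mb_s:.0f} MB/s")
```

Under these assumptions each disk reads about 100 MB/s and each processor handles about 200 MB/s.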

Data access: Column Store

Record Oriented Storage Column Oriented Storage
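
A toy comparison shows why the column layout helps full scans. This is an illustration only, not how BigQuery stores data: `to_columns` is a hypothetical helper, and the rows reuse a few values from the weather station example. A query over a single column visits only that column's values instead of every field of every record.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

rows = [
    {"station": 9384, "mean_temp": 89.3, "humidity": 0.35},
    {"station": 2857, "mean_temp": 78.5, "humidity": 0.24},
    {"station": 3475, "mean_temp": 68.0, "humidity": 0.35},
]
columns = to_columns(rows)

# Record-oriented: every field of every row is read to average one column.
values_touched_rows = sum(len(row) for row in rows)
# Column-oriented: only the mean_temp column is read.
values_touched_cols = len(columns["mean_temp"])

avg_temp = sum(columns["mean_temp"]) / len(columns["mean_temp"])
print(values_touched_rows, values_touched_cols, round(avg_temp, 1))
```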

BigQuery Architecture
Mixer 0

Mixer 1 Mixer 1 Mixer 1
Shard 0-8 Shard 9-16 Shard 17-24

Shard 0 Shard 10 Shard 12 Shard 20 Shard 24

Distributed Storage (e.g. GFS)

BigQuery SQL Example: Simple aggregates

SELECT COUNT(foo), MAX(foo), STDDEV(foo)
FROM ...

BigQuery SQL Example: Complex Processing

SELECT ... FROM ....
WHERE REGEXP_MATCH(url, ".com$")
AND user CONTAINS 'test'

BigQuery SQL Example: Nested SELECT

SELECT COUNT(*) FROM
(SELECT foo ..... )
GROUP BY foo

BigQuery SQL Example: Small JOIN

SELECT huge_table.foo
FROM huge_table
JOIN small_table
ON small_table.foo = huge_table.foo

BigQuery Architecture: Small Join
Mixer 0

Mixer 1 Mixer 1
Shard 0-8 Shard 17-24

Shard 0 Shard 20 Shard 24

Distributed Storage (e.g. GFS)

Batch queries!

• Don't need interactive queries for some jobs?
• priority: "BATCH"

That's it

• API
• Column-based datastore
• Full table scans FAST
• Aggregates
• Commercial tool support
• Use cases

SELECT questions FROM audience

SELECT 'Thank You!'
FROM ryan

http://developers.google.com/bigquery

@ryguyrg http://profiles.google.com/ryan.boyd

A Little Later ...

Row  wp_namespace  Revs
1    0             53697002
2    1             6151228
3    3             5519859
4    4             4184389
5    2             3108562
6    10            1052044
7    6             877417
8    14            838940
9    5             651749
10   11            192534
11   100           148135

Underlying table:
• Wikipedia page revision records
• Rows: 314 million
• Byte size: 35.7 GB

Query Stats:
• Scanned 7 GB of data
• <5 seconds
• ~ 100M rows scanned / second

[Diagram: execution of the query across the serving tree]
Mixer 0: COUNT (revision_id) GROUP BY wp_namespace; ORDER BY Revs DESC
Mixer 1 (x2): COUNT (revision_id)
Leaf (x4): COUNT (revision_id) WHERE timestamp > CUTOFF
  reads at 10 GB / s
Distributed Storage: SELECT wp_namespace, revision_id

"Multi-stage" Query
SELECT
LogEdits, COUNT(contributor_id) Contributors
FROM (
SELECT
contributor_id,
INTEGER(LOG10(COUNT(revision_id))) LogEdits
FROM [publicdata:samples.wikipedia]
GROUP EACH BY contributor_id)
GROUP BY LogEdits
ORDER BY LogEdits DESC

[Diagram: execution of the multi-stage query with a shuffle step]
Mixer 0: COUNT(contributor_id) GROUP BY LogEdits; ORDER BY LogEdits DESC
Mixer 1 (x2): COUNT(contributor_id) GROUP BY LogEdits
Leaf (x2) and Shuffler (x2): COUNT(contributor_id) GROUP BY LogEdits; SELECT LE, Id; COUNT(*) GROUP BY contributor_id
  N^2 shuffle by contributor_id (GB/s)
Distributed Storage: SELECT contributor_id

When to use EACH

• Shuffle definitely adds some overhead
• Poor query performance if used incorrectly

• GROUP BY
o Groups << Rows => Unbalanced load
o Example: GROUP BY state

• GROUP EACH BY
o Groups ~ Rows
o Example: GROUP EACH BY user_id
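
The shuffle behind GROUP EACH BY can be pictured as hash partitioning: every row is routed to a shuffler chosen by its key, so each key is counted in exactly one place. The sketch below is an illustration under stated assumptions, not the actual Dremel shuffle; `shuffle`, `group_each_by`, the CRC32 hash, and the sample user IDs are all hypothetical choices.

```python
import zlib
from collections import Counter

def shuffle(keys, num_shufflers):
    """Route each key to a shuffler chosen by a stable hash of the key."""
    buckets = [[] for _ in range(num_shufflers)]
    for key in keys:
        index = zlib.crc32(key.encode("utf-8")) % num_shufflers
        buckets[index].append(key)
    return buckets

def group_each_by(leaf_outputs, num_shufflers=2):
    """Count rows per key; every key lands on exactly one shuffler."""
    buckets = [[] for _ in range(num_shufflers)]
    for keys in leaf_outputs:
        for i, part in enumerate(shuffle(keys, num_shufflers)):
            buckets[i].extend(part)
    return [Counter(bucket) for bucket in buckets]

# Hypothetical output of two leaves: one user_id per scanned row.
leaf_outputs = [["u1", "u2", "u1"], ["u3", "u1", "u2"]]
per_shuffler = group_each_by(leaf_outputs)

totals = Counter()
for counts in per_shuffler:
    totals.update(counts)
print(dict(totals))
```

When there are only a handful of distinct keys, most rows pile onto a few shufflers, which is the unbalanced load the slide warns about.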
