Google BigQuery 101 & What’s New

Section Slide Template Option 2
Put your subtitle here. Feel free to pick from the handful of pretty Google colors available to you.
Make the subtitle something clever. People will think it’s neat.
Google BigQuery 101 & What’s New
Vadim Solovey - CTO, DoIT International
Google Cloud Developer Expert | Authorized Trainer
vadim@doit-intl.com

DoIT International confidential │ Do not distribute
About me..
Vadim Solovey - CTO, DoiT International
Google Cloud Developer Expert | AWS Solutions Architect
vadim@doit-intl.com

Agenda
Google BigQuery 101
Partitioned Tables
Standard SQL & New DML Statements
1
2
3
New Formats4
Cost Optimization
6 Q & A
5

BigQuery 101
Google’s Highly Distributed Columnar Database optimized for Analytics
● Fully managed NoOps service
● Multi petabyte scale & zero sizing required
● Ingestion + Analytics + Storage + API
● No indexes, only full table scans (!)
● Pre-integrated with other Google Cloud services:
○ Dataproc (Hadoop/Spark)

BigQuery 101
Continue...
● Supports nested and repeated fields/columns
● Google’s SQL Dialect
● Query results are cached for up to 24 hours (no charge)
● Charged for storage ($10-$20 per TB/month) and for data scans ($5/TB)
○ No idle costs
○ Highly cost optimizable

Based on Dremel
Google File System (GFS)
Leaf Leaf Leaf Leaf Leaf Leaf
Mixer 1 Mixer 1
Mixer 0
BigQuery in 60 Seconds
● Long Lived Shared Tree
● Mixer = Master & Reducer
● Leaf = Mapper
● Partial Reduction
● Diskless Data flow
Columnar Storage
● Execution Independent
● Reduces Disk Time

Demo

What’s New
New features:
● Table Partitions
● Insert/Update/Delete DML
● Standard ANSI SQL 2011
● Identity and Access Management
● Stackdriver for Monitoring
● New data formats for import/export

Table Partitions
New way to shard the data to minimize amount of data being scanned by a query:
● Integrated with Streaming API for easy partition creation and update
● _PARTITIONTIME pseudo column
● Current release supports partition by DAY
Creating partitioned table (using CLI)
● bq mk --time_partitioning_type=DAY mydataset.table1
● bq mk --time_partitioning_type=DAY --time_partitioning_expiration=259200 mydataset.table2
Accessing partitioned data:
● Query all partitions: SELECT * from mydataset.table
● Query specific partition: SELECT * from mydataset.table$20161109
● Query range: SELECT * FROM mydataset.table WHERE _PARTITIONTIME BETWEEN
TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')

Insert, Update & Delete DML
BigQuery is not append-only anymore ;-)
Data Manipulation Language (DML) supporting these statements:
● INSERT
● UPDATE
● DELETE
Every statement is implicit transactions, no multi-statement transactions yet.
Quotas:
● Maximum UPDATE/DELETE statements per day per table: 48
● Maximum UPDATE/DELETE statements per day per project: 500
● Maximum INSERT statements per day per table: 1,000
● Maximum INSERT statements per day per project: 10,000

Standard SQL
Full ANSI SQL 2011
● With extensions to support nested and repeated fields
● ‘Legacy SQL’ is still supported
Set a desired dialect using prefix, i.e.:
● #legacySQL or #standardSQL
#standardSQL
SELECT
weight_pounds, state, year, gestation_weeks
FROM
`bigquery-public-data.samples.natality`
ORDER BY weight_pounds DESC
LIMIT 10;

Import/Export Formats
Data is importable (and exportable) into/from the following formats:
● *CV files
● JSON
● AVRO
● PARQUET

Cost Optimization Tips
Some query optimization strategies:
● Use CONTAINS() instead of REGEXP_MATCH(), where possible..
● Sometimes, the sample of data is enough. Use HASH() function to sample the data.
● Use JSON_EXTRACT() if you have raw, unstructured json data in your data
● Avoid nondeterministic queries, i.e. things like NOW() etc. to improve caching
● Don’t query the table which you stream data into (cache will be immediately invalidated)
● Keep query result < 128MB, otherwise it won’t get cached as well
● Use the __TABLES__ & __DATASET__ metadata table for house-keeping goals

Are you paying too much?
BigQuery is a Columnar Datastore, and maximum performance is achieved on denormalized data sets:
● Pre-Filter with Destination Table when running many similar queries (in WHERE clause)
● Use static tables to optimize BigQuery’s cache
○ If streaming/uploading frequently, create daily/hourly ‘snapshots’ and query them instead of
primary table
● Always prefer storage over compute!
● Set TableExpiration on datasets/partitions for automatic data lifecycle management
● Fetch only required columns in your SELECT clause
● Use dryRun & EXPLAIN to find most cost efficient query
● Set Cost Controls to cap your BigQuery spending

Google BigQuery 101 & What’s New

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (11)

Similar a Google BigQuery 101 & What’s New

Similar a Google BigQuery 101 & What’s New (20)

Más de DoiT International

Más de DoiT International (15)

Último

Último (20)

Google BigQuery 101 & What’s New

Notas del editor