Vadim Solovey, CTO of DoiT International, has helped implement Google BigQuery as a cloud data warehouse for many medium and large data and analytics initiatives. BigQuery’s serverless architecture has redefined what it means to be fully managed for hundreds of Israeli startups.
Recently, Google announced an update to BigQuery that dramatically advances cloud data analytics for large-scale businesses. BigQuery now supports Standard SQL, implementing the SQL 2011 standard, and new ODBC drivers make it possible to use BigQuery with a number of tools, ranging from Microsoft Excel to traditional business intelligence systems such as MicroStrategy and Qlik.
Agenda:
• Partitioned tables
• The ability to update and delete rows and columns using SQL
• Integration with IAM for fine-grained security policies
• Monitoring with Stackdriver to track performance and usage
• Query sharing via links, to foster knowledge within orgs
• Cost optimisation strategies
Google BigQuery 101 & What’s New
Vadim Solovey - CTO, DoiT International
Google Cloud Developer Expert | Authorized Trainer
vadim@doit-intl.com
2. DoIT International confidential │ Do not distribute
About me
Vadim Solovey - CTO, DoiT International
Google Cloud Developer Expert | AWS Solutions Architect
vadim@doit-intl.com
Agenda
1. Google BigQuery 101
2. Partitioned Tables
3. Standard SQL & New DML Statements
4. New Formats
5. Cost Optimization
6. Q & A
BigQuery 101
Google’s Highly Distributed Columnar Database optimized for Analytics
● Fully managed NoOps service
● Multi petabyte scale & zero sizing required
● Ingestion + Analytics + Storage + API
● No indexes, only full table scans (!)
● Pre-integrated with other Google Cloud services:
○ Dataproc (Hadoop/Spark)
BigQuery 101 (continued)
● Supports nested and repeated fields/columns
● Google’s SQL Dialect
● Query results are cached for up to 24 hours (no charge)
● Charged for storage ($10-$20 per TB/month) and for data scans ($5/TB)
○ No idle costs
○ Highly cost optimizable
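The pricing bullets above reduce to simple arithmetic. The following is a minimal sketch that uses only the rates quoted on the slide ($10-$20 per TB/month for storage, $5 per TB scanned); the function name and the sample volumes are illustrative assumptions, and real bills depend on current pricing and on which storage rate applies.

```python
from typing import Tuple

# Rates as quoted on the slide; actual pricing may differ.
STORAGE_USD_PER_TB_MONTH = (10.0, 20.0)  # low and high end of the quoted range
SCAN_USD_PER_TB = 5.0


def monthly_cost_range(stored_tb: float, scanned_tb: float) -> Tuple[float, float]:
    """Return a (low, high) monthly cost estimate in USD.

    stored_tb  -- data kept in BigQuery, in TB
    scanned_tb -- data scanned by queries during the month, in TB
    """
    scan_cost = scanned_tb * SCAN_USD_PER_TB
    low = stored_tb * STORAGE_USD_PER_TB_MONTH[0] + scan_cost
    high = stored_tb * STORAGE_USD_PER_TB_MONTH[1] + scan_cost
    return low, high


if __name__ == "__main__":
    # Hypothetical workload: 10 TB stored, 40 TB scanned per month.
    low, high = monthly_cost_range(stored_tb=10, scanned_tb=40)
    print(f"Estimated monthly cost: ${low:.2f} to ${high:.2f}")
```

With no queries in a month, the scan term is zero, which is what "no idle costs" means in practice: only storage is billed.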
Based on Dremel
[Diagram: Dremel serving tree. A root mixer (Mixer 0) sits above intermediate mixers (Mixer 1), which sit above leaf nodes that read from Google File System (GFS).]
BigQuery in 60 Seconds
● Long Lived Shared Tree
● Mixer = Master & Reducer
● Leaf = Mapper
● Partial Reduction
● Diskless Data flow
Columnar Storage
● Execution Independent
● Reduces Disk Time
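To make the "Leaf = Mapper, Mixer = Reducer" idea concrete, here is a toy sketch of partial reduction along a tree. It is not Dremel's implementation; the function names, the fan-in of 2, and the sample rows are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Dict, List


def leaf_scan(rows: List[Dict[str, str]], column: str) -> Counter:
    """Leaf (mapper): scan a local shard and return partial counts per value."""
    return Counter(row[column] for row in rows)


def mixer_merge(partials: List[Counter]) -> Counter:
    """Mixer (reducer): combine the partial results of its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_tree(shards: List[List[Dict[str, str]]], column: str, fan_in: int = 2) -> Counter:
    """Aggregate shard by shard, then merge level by level up to a single root."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = [leaf_scan(shard, column) for shard in shards]
    if not level:
        return Counter()
    # Each pass is one layer of mixers; the last remaining node is the root.
    while len(level) > 1:
        level = [mixer_merge(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]


if __name__ == "__main__":
    shards = [
        [{"state": "CA"}, {"state": "NY"}],
        [{"state": "CA"}],
        [{"state": "TX"}, {"state": "CA"}],
    ]
    print(run_tree(shards, "state"))
```

Because every leaf returns an already-reduced result, only small partial aggregates travel up the tree, never the raw rows.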
What’s New
New features:
● Table Partitions
● Insert/Update/Delete DML
● Standard ANSI SQL 2011
● Identity and Access Management
● Stackdriver for Monitoring
● New data formats for import/export
Table Partitions
A new way to shard data that minimizes the amount of data scanned by a query:
● Integrated with Streaming API for easy partition creation and update
● _PARTITIONTIME pseudo column
● Current release supports partition by DAY
Creating a partitioned table (using the CLI):
● bq mk --time_partitioning_type=DAY mydataset.table1
● bq mk --time_partitioning_type=DAY --time_partitioning_expiration=259200 mydataset.table2
Accessing partitioned data:
● Query all partitions: SELECT * FROM mydataset.table
● Query specific partition: SELECT * FROM mydataset.table$20161109
● Query range: SELECT * FROM mydataset.table WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')
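The partition suffix and the _PARTITIONTIME filter shown above are easy to generate from dates. The sketch below only builds strings and does not call BigQuery; the helper names are hypothetical. It also converts the --time_partitioning_expiration value from the CLI example, which is given in seconds, into days.

```python
from datetime import date

SECONDS_PER_DAY = 86400


def partition_decorator(table: str, day: date) -> str:
    """Build the table$YYYYMMDD form used to address a single daily partition."""
    return f"{table}${day:%Y%m%d}"


def partition_range_filter(start: date, end: date) -> str:
    """Build a _PARTITIONTIME predicate covering start..end (inclusive)."""
    return (
        f"_PARTITIONTIME BETWEEN TIMESTAMP('{start:%Y-%m-%d}') "
        f"AND TIMESTAMP('{end:%Y-%m-%d}')"
    )


def expiration_days(seconds: int) -> float:
    """Convert a partition expiration given in seconds into days."""
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print(partition_decorator("mydataset.table", date(2016, 11, 9)))
    print(partition_range_filter(date(2016, 1, 1), date(2016, 1, 2)))
    print(expiration_days(259200))
```

The expiration of 259200 seconds used for mydataset.table2 therefore corresponds to three days.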
Insert, Update & Delete DML
BigQuery is not append-only anymore ;-)
Data Manipulation Language (DML) supporting these statements:
● INSERT
● UPDATE
● DELETE
Every statement runs as an implicit transaction; multi-statement transactions are not supported yet.
Quotas:
● Maximum UPDATE/DELETE statements per day per table: 48
● Maximum UPDATE/DELETE statements per day per project: 500
● Maximum INSERT statements per day per table: 1,000
● Maximum INSERT statements per day per project: 10,000
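The quotas above have practical consequences for scheduling. As a rough sketch, the snippet below encodes the four limits from the slide and works out the minimum average spacing between statements; the dictionary layout and function names are assumptions, and the quota values may have changed since this deck was presented.

```python
MINUTES_PER_DAY = 24 * 60

# Daily DML quotas as listed on the slide: (statement kind, scope) -> limit.
DAILY_QUOTAS = {
    ("UPDATE/DELETE", "table"): 48,
    ("UPDATE/DELETE", "project"): 500,
    ("INSERT", "table"): 1000,
    ("INSERT", "project"): 10000,
}


def min_spacing_minutes(kind: str, scope: str) -> float:
    """Average minutes between statements if the daily quota is spread evenly."""
    return MINUTES_PER_DAY / DAILY_QUOTAS[(kind, scope)]


def within_quota(kind: str, scope: str, planned_statements: int) -> bool:
    """Check whether a planned number of daily statements fits the quota."""
    return planned_statements <= DAILY_QUOTAS[(kind, scope)]


if __name__ == "__main__":
    print(min_spacing_minutes("UPDATE/DELETE", "table"))
    print(within_quota("INSERT", "table", 1200))
```

At 48 UPDATE/DELETE statements per table per day, that is one statement every 30 minutes on average, so row-by-row updates are not viable and changes need to be batched into few statements.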
Standard SQL
Full ANSI SQL 2011
● With extensions to support nested and repeated fields
● ‘Legacy SQL’ is still supported
Set the desired dialect using a prefix, i.e.:
● #legacySQL or #standardSQL
#standardSQL
SELECT
weight_pounds, state, year, gestation_weeks
FROM
`bigquery-public-data.samples.natality`
ORDER BY weight_pounds DESC
LIMIT 10;
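When queries are assembled in application code, the dialect prefix can be added programmatically. This is a small illustrative helper, not part of any BigQuery client library; the function name and the "standard"/"legacy" keys are assumptions.

```python
DIALECT_PREFIXES = {
    "standard": "#standardSQL",
    "legacy": "#legacySQL",
}


def with_dialect(query: str, dialect: str = "standard") -> str:
    """Prepend the dialect prefix as the first line of the query text."""
    if dialect not in DIALECT_PREFIXES:
        raise ValueError(f"unknown dialect: {dialect!r}")
    return f"{DIALECT_PREFIXES[dialect]}\n{query.lstrip()}"


if __name__ == "__main__":
    sql = """
    SELECT weight_pounds, state, year, gestation_weeks
    FROM `bigquery-public-data.samples.natality`
    ORDER BY weight_pounds DESC
    LIMIT 10;
    """
    print(with_dialect(sql))
```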
Import/Export Formats
Data can be imported from (and exported to) the following formats:
● CSV files
● JSON
● AVRO
● PARQUET
Cost Optimization Tips
Some query optimization strategies:
● Use CONTAINS() instead of REGEXP_MATCH() where possible.
● Sometimes a sample of the data is enough. Use the HASH() function to sample the data.
● Use JSON_EXTRACT() if you have raw, unstructured JSON in your data.
● Avoid nondeterministic queries (e.g. ones using NOW()) to improve caching.
● Don’t query the table you stream data into (the cache will be invalidated immediately).
● Keep query results under 128 MB; larger results won’t be cached.
● Use the __TABLES__ & __DATASET__ metadata tables for housekeeping.
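The hash-based sampling tip can be illustrated outside BigQuery. The sketch below shows the general technique (hash a stable key, keep rows whose hash falls into a fixed share of buckets); it uses Python's hashlib rather than BigQuery's HASH(), so bucket assignments will not match BigQuery's, and the key format is hypothetical.

```python
import hashlib


def in_sample(key: str, percent: int) -> bool:
    """Deterministically decide whether a key belongs to a percent-sized sample.

    The same key always gets the same answer, so repeated runs
    select the same subset of rows.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.md5(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(10000)]  # hypothetical row keys
    sampled = [k for k in keys if in_sample(k, 10)]
    print(f"kept {len(sampled)} of {len(keys)} keys")
```

Unlike random sampling, this kind of sample is deterministic, which fits the advice above about avoiding nondeterministic queries.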
Are you paying too much?
BigQuery is a Columnar Datastore, and maximum performance is achieved on denormalized data sets:
● Pre-filter into a destination table when running many similar queries (with the same WHERE clause)
● Use static tables to make the most of BigQuery’s cache
○ If streaming/uploading frequently, create daily/hourly ‘snapshots’ and query them instead of the primary table
● Always prefer storage over compute!
● Set TableExpiration on datasets/partitions for automatic data lifecycle management
● Fetch only the required columns in your SELECT clause
● Use dryRun & EXPLAIN to find the most cost-efficient query
● Set Cost Controls to cap your BigQuery spending
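Because storage is columnar and billing is per byte scanned, selecting fewer columns directly lowers the cost of a query. The sketch below compares a full scan with a pruned one using the $5/TB rate quoted earlier in the deck; the column sizes are made up, and treating 1 TB as 2^40 bytes is an assumption.

```python
from typing import Dict, Iterable, Optional

SCAN_USD_PER_TB = 5.0    # rate quoted in this deck
BYTES_PER_TB = 2 ** 40   # assumption: binary terabytes


def scan_cost_usd(column_bytes: Dict[str, int],
                  selected: Optional[Iterable[str]] = None) -> float:
    """Estimate query cost from the sizes of the columns it reads.

    column_bytes -- stored size of each column, in bytes
    selected     -- columns the query references; None means SELECT *
    """
    columns = column_bytes.keys() if selected is None else selected
    scanned = sum(column_bytes[name] for name in columns)
    return scanned / BYTES_PER_TB * SCAN_USD_PER_TB


if __name__ == "__main__":
    # Hypothetical table: one small column and one large payload column.
    table = {"user_id": 1 * BYTES_PER_TB, "payload": 3 * BYTES_PER_TB}
    print(scan_cost_usd(table))               # SELECT *
    print(scan_cost_usd(table, ["user_id"]))  # SELECT user_id
```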
Editor’s notes
Before we talk about the next generation stack, let’s look at the principles that underlie it.
The storage is a very high-performance, distributed, high-bandwidth platform.
Each leaf reads only a few MB of data.
Mixers combine data along the tree.