Se ha denunciado esta presentación.
Utilizamos tu perfil de LinkedIn y tus datos de actividad para personalizar los anuncios y mostrarte publicidad más relevante. Puedes cambiar tus preferencias de publicidad en cualquier momento.
Gabriel PREDA
@eRadical
(Almost) Serverless Analytics System
with BigQuery & AppEngine
Agenda
Going Serverless with
AppEngine & Tasks
Pub/Sub, DataStore
BigQuery
Load
Batch
Streaming Inserts
Query
UDF
Export
....
AeonsSome years ago...
~ 500,000 - 2,000,000 events / day
(on average)
Some time ago...
~2,000,000 - 22,000,000 events / day
Dec 2014: 57,430,000 events / day
1 day to recompute » 12 hours
NOW()
22,000,000 - 70,000,000 events / day
AVG » 40,000,000 events / day
Processing ~30GB-70GB / day
Recompute 1 day » 10-...
serverless?
Desired for: https://www.innertrends.com
other... (almost) serverless products
Cloud Functions (alpha - Node.JS)
Cloud DataFlow (Java, Python - beta)
BigQuery
https://cloud.google.com/bigquery/docs/
BigQuery - data types
● STRING - UTF-8 (2 bytes + encoded string size)
● BYTES - base64 encoded (except in Avro)
● INTEGER...
BigQuery -> loadData()
Formats: CSV, JSON (newline delimited), Avro, Parquet (experimental)
Tools: Web UI, bq, API
Source:...
BigQuery -> loadData()
bq load ...
BigQuery -> loadData()
Got some rows?
BigQuery -> SELECT … FROM surprise…
query:
SELECT { * | field_path.* | expression } [ [ AS ] alias ] [ , ... ]
[ FROM from...
BigQuery -> SELECT … FROM surprise…
Date-Partitioned Tables [demo]
Table Decorators - See the past w/ @
Table Wildcard Fun...
bigquery.defineFunction(
'expandAssetLibrary', // Name of the function exported to SQL
['user_id', 'video_id', 'stage_sett...
BigQuery -> DML
Standard SQL only
Maximum UPDATE/DELETE statements per day per table: 48
Maximum UPDATE/DELETE statements ...
BigQuery -> export()
To: Google Cloud Storage
Format: CSV, JSON [.gz], Avro
…1G files
BigQuery -> some (Big)Queries
SELECT year, count(1)
FROM [bigquery-public-data:samples.natality]
WHERE father_age < 18
GRO...
BigQuery -> some (Big)Queries
SELECT REGEXP_EXTRACT(path, r'.*.(.*)$') AS file_extension,
COUNT(1) AS k
FROM [bigquery-pub...
Próxima SlideShare
Cargando en…5
×

(Almost) Serverless Analytics System with BigQuery & AppEngine

The story of a gradual moved to a(n almost) serverless analytics system and an how to start with BigQuery. A way to create a serverless analytics system in Google Cloud. Live queries on 'huge datasets' were be performed.

  • Inicia sesión para ver los comentarios

  • Sé el primero en recomendar esto

(Almost) Serverless Analytics System with BigQuery & AppEngine

  1. 1. Gabriel PREDA @eRadical (Almost) Serverless Analytics System with BigQuery & AppEngine
  2. 2. Agenda Going Serverless with AppEngine & Tasks Pub/Sub, DataStore BigQuery Load Batch Streaming Inserts Query UDF Export ...some BigQueries...
  3. 3. AeonsSome years ago... ~ 500,000 - 2,000,000 events / day (on average)
  4. 4. Some time ago... ~2,000,000 - 22,000,000 events / day Dec 2014: 57,430,000 events / day 1 day to recompute » 12 hours
  5. 5. NOW() 22,000,000 - 70,000,000 events / day AVG » 40,000,000 events / day Processing ~30GB-70GB / day Recompute 1 day » 10-20 minutes
  6. 6. serverless? Desired for: https://www.innertrends.com
  7. 7. other... (almost) serverless products Cloud Functions (alpha - Node.JS) Cloud DataFlow (Java, Python - beta)
  8. 8. BigQuery https://cloud.google.com/bigquery/docs/
  9. 9. BigQuery - data types ● STRING - UTF-8 (2 bytes + encoded string size) ● BYTES - base64 encoded (except in Avro) ● INTEGER - 64-bit signed (8 bytes) ● FLOAT (8 bytes) ● BOOLEAN - true/false, 1/0 only in CSV (1 byte) ● TIMESTAMP ex:”2014-08-19 12:41:35.220 UTC” (8 bytes) ● DATE, TIME, DATETIME - limited support in Legacy SQL ● RECORD - a collection of fields (size of fields) https://cloud.google.com/bigquery/data-types
  10. 10. BigQuery -> loadData() Formats: CSV, JSON (newline delimited), Avro, Parquet (experimental) Tools: Web UI, bq, API Source: local files, Cloud Storage, [demo] Cloud Datastore (backup files), POST requests, SQL DML* Google Sheets - Federated Data Sources - Streaming Inserts
  11. 11. BigQuery -> loadData() bq load ...
  12. 12. BigQuery -> loadData() Got some rows?
  13. 13. BigQuery -> SELECT … FROM surprise… query: SELECT { * | field_path.* | expression } [ [ AS ] alias ] [ , ... ] [ FROM from_body [ WHERE bool_expression ] [ OMIT RECORD IF bool_expression] [ GROUP [ EACH ] BY [ ROLLUP ] { field_name_or_alias } [ , ... ] ] [ HAVING bool_expression ] [ ORDER BY field_name_or_alias [ { DESC | ASC } ] [, ... ] ] [ LIMIT n ] ]; from_body: from_item [, ...] | # Warning: Comma means UNION ALL here from_item [ join_type ] JOIN [ EACH ] from_item [ ON join_predicate ] | (FLATTEN({ table_name | (query) }, field_name_or_alias)) | table_wildcard_function from_item: { table_name | (query) } [ [ AS ] alias ] join_type: { INNER | [ FULL ] [ OUTER ] | RIGHT [ OUTER ] | LEFT [ OUTER ] | CROSS }
  14. 14. BigQuery -> SELECT … FROM surprise… Date-Partitioned Tables [demo] Table Decorators - See the past w/ @ Table Wildcard Functions - TABLE_DATE_RANGE() & TABLE_QUERY() Interesting functions - DateTime » UTC_USEC_TO_DAY/HOUR/MONTH/WEEK/YEAR() » Shifts a UNIX timestamp in microseconds to the beginning of the period it occurs in. - JSON_EXTRACT[_SCALAR]() - URL functions » HOST(), DOMAIN(), TLD() - REGEXP_MATCH(), REGEXP_EXTRACT()
  15. 15. bigquery.defineFunction( 'expandAssetLibrary', // Name of the function exported to SQL ['user_id', 'video_id', 'stage_settings'], // Names of input columns [ {'name': 'user_id', 'type': 'integer'}, // Output schema {'name': 'video_id', 'type': 'string'}, {'name': 'asset', 'type': 'string'} ], expandAssetLibrary // Reference to JavaScript UDF ); function expandAssetLibrary(row, emit) { ………………………… emit({ user_id: row.user_id, video_id: row.video_id, asset: ss.url.replace('http://', '')); } BigQuery -> User Defined Functions
  16. 16. BigQuery -> DML Standard SQL only Maximum UPDATE/DELETE statements per day per table: 48 Maximum UPDATE/DELETE statements per day per project: 500 Maximum INSERT statements per day per table: 1,000 Maximum INSERT statements per day per project: 10,000
  17. 17. BigQuery -> export() To: Google Cloud Storage Format: CSV, JSON [.gz], Avro …1G files
  18. 18. BigQuery -> some (Big)Queries SELECT year, count(1) FROM [bigquery-public-data:samples.natality] WHERE father_age < 18 GROUP BY year ORDER BY year SELECT year, count(1) FROM [bigquery-public-data:samples.natality] WHERE mother_age < 18 GROUP BY year ORDER BY year SELECT table_id, row_count, CEIL(size_bytes/POW(1024, 3)) AS gb FROM [bigquery-public-data:ghcn_m.__TABLES__] ORDER BY gb DESC
  19. 19. BigQuery -> some (Big)Queries SELECT REGEXP_EXTRACT(path, r'.*.(.*)$') AS file_extension, COUNT(1) AS k FROM [bigquery-public-data:github_repos.files] GROUP BY file_extension ORDER BY k DESC LIMIT 20 SELECT table_id, row_count, CEIL(size_bytes/POW(1024, 3)) AS gb FROM [bigquery-public-data:github_repos.__TABLES__] ORDER BY gb DESC

×