Complex realtime event analytics using BigQuery @Crunch Warmup

Complex event analytics solutions require massive architecture and know-how to build a fast real-time computing system. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google's infrastructure. In this presentation we will see how BigQuery meets our ultimate goal: store everything, accessible by SQL immediately, at petabyte scale. We will discuss some common use cases: funnels, user retention, and affiliate metrics.

Published in: Software

  1. Complex Realtime Event Analytics using BigQuery
     Márton Kodok, Senior Software Engineer at REEA
     twitter: martonkodok, stackoverflow: pentium10, github: pentium10
     Crunch Warm Up - October 2015 - Budapest
  2. Agenda
     1. Big Data movement
     2. Analytics Project - Background
     3. Challenges - Why is it so hard?
     4. Approach - Strategy - Application
     5. Use Cases - Implementations
     6. Exploring Big Data (GDELT, Hackernews, Reddit)
     Complex Realtime Event Analytics using BigQuery @martonkodok
  3. Big data analyses movement
     Every scientist who needs big data analytics to save millions of lives should have that power.
  4. Challenging experience
     The simple fact is that you are brilliant but your brilliant ideas require complex big data analytics.
  5. Project: One-size-fits-all problem
     Need a backend to store, query, extract for deep analytics:
     ● Events (product, app, site email events)
     ● Achievements ("tag" users on the go, retention)
     ● Entities (split tests, user profiles, business entities)
     ● Metrics (app profiler data, custom)
     ● Email activity (click-map, engagement, ISP, Spam)
     ● 3rd party Analytics (good to have: Google Analytics)
     ● Systems generated data (log file entries, unstructured)
  6. Desired system/platform
     ● Terabyte scalable storage
     ● Real-time event ingestion
     ● Ask sophisticated queries (optional: without Dev)
     ● Query-performance
     ● Low-maintenance
     ● Cost effective
     ● Wire them up easily
     Goal: Store everything accessible by SQL immediately.
  7. Equipment strategy
     ● In-House
     ● Hosted
     ● Managed
     * people still required
     Services:
     ❏ ELK Stack (Elastic-Logstash-Kibana)...
     ❏ Cassandra, Hive, Hadoop...
     ❏ Amazon RedShift, Google BigQuery...
  8. Google BigQuery
  9. What is BigQuery?
     ● Analytics-as-a-Service - Data Warehouse in the Cloud
     ● Fully-Managed
     ● Scales into Petabytes
     ● Ridiculously fast
     ● Decent pricing (queries: $5/TB, storage: $20/TB)
     ● 100,000 rows/sec Streaming API
     * October 2015 pricing
  10. BigQuery: Big Data Analytics in the Cloud
      ● Convenience of SQL
      ● Familiar DB Structure (table, column, views, JSON)
      ● Open Interfaces (REST, Web UI, ODBC)
      ● Fast atomic imports JSON/CSV (file size up to 5TB)
      ● Simple data ingest from GCS or Hadoop
      ● Web UI + bq CLI
      ● Connectors: Hadoop, Tableau, R, Talend, Logstash
      ● US or EU zone
  11. BigQuery: Convenience of SQL/JSON/JS
      ● Append-only tables
      ● Batch load file size limits: 5TB (CSV or JSON)
      ● ACL - row level locking (individual or group based)
      ● Columnar storage (max 10 000 columns in table)
      ● Rich SQL: JSON, IP, Math, RegExp, Window functions
      ● Datatypes: String 2MB, Record, Nested …
      ● UDF (User defined functions): Javascript
      Note: Store what you can in columns, the rest in JSON.
  12. BigQuery Costs - October 2015
      * 1 Petabyte storage, 100 TB rows insert, 100 TB queries => 26,000 USD
      Queries:
      ➔ 1 TB per month free
      ➔ 5 USD per TB
      ➔ only pay for the columns you use in your query
      Storage:
      ➔ 20 USD per TB
      Ingestion:
      ➔ Batch load free (CSV/JSON)
      ➔ Exporting free
      ➔ Table copy free
      ➔ 1 USD per 20TB data
      Estimate 1 - Storage 5 TB - Streaming Inserts 5TB - Queries 3 TB. Monthly total: 110 USD
      Estimate 2 - Storage 20 TB - Streaming Inserts 10TB - Queries 10 TB. Monthly total: 455 USD
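
The arithmetic behind Estimate 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rates are the October 2015 figures quoted on the slide (not current pricing), the streaming rate is taken verbatim from the slide's "1 USD per 20TB", and `monthly_cost` is a hypothetical helper, not part of any BigQuery tool.

```python
# Rates as quoted on the slide (October 2015); not current pricing.
STORAGE_USD_PER_TB = 20.0
QUERY_USD_PER_TB = 5.0
FREE_QUERY_TB = 1.0  # first TB of queries per month is free
# Assumption: streaming rate copied verbatim from the slide ("1 USD per 20TB").
STREAMING_USD_PER_TB = 1.0 / 20.0


def monthly_cost(storage_tb, query_tb, streaming_tb=0.0):
    """Estimate a monthly bill in USD from the slide's quoted rates."""
    storage = storage_tb * STORAGE_USD_PER_TB
    # Only the query volume above the free tier is billed.
    queries = max(query_tb - FREE_QUERY_TB, 0.0) * QUERY_USD_PER_TB
    streaming = streaming_tb * STREAMING_USD_PER_TB
    return storage + queries + streaming


if __name__ == "__main__":
    # Estimate 1 from the slide: 5 TB stored, 3 TB queried, 5 TB streamed.
    print(monthly_cost(storage_tb=5, query_tb=3, streaming_tb=5))
```

With these inputs the sketch lands at roughly the 110 USD of Estimate 1. The slide's other totals are not reproduced exactly by the quoted rates, so the streaming figure in particular should be treated as uncertain.
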
  13. UDF - Power of JavaScript
      ● for logic that is impossible to express in SQL: loops, complex conditionals, string parsing or transformations
      ● UDFs are similar to map functions in MapReduce
      ● inline JS or from GCS (gs://some-bucket/js/lib.js)
      Some UDF use cases:
      ● take one row and emit zero or more rows
      ● decoding URL-encoded strings
      ● text readability
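
The "one row in, zero or more rows out" contract can be illustrated outside BigQuery. BigQuery UDFs themselves are written in JavaScript; the Python sketch below only mimics the map-style shape, combined with the URL-decoding use case. The row layout (`user_id`, `query`) and the function name are assumptions made for the example.

```python
from urllib.parse import unquote


def url_decode_udf(row):
    """Map-style function: take one input row, yield zero or more output rows.

    Hypothetical row shape: {"user_id": ..., "query": "k1=v1&k2=v2"}.
    """
    query = row.get("query") or ""
    for pair in query.split("&"):
        if "=" not in pair:
            continue  # emit nothing for empty or malformed fragments
        key, value = pair.split("=", 1)
        yield {"user_id": row["user_id"], "key": unquote(key), "value": unquote(value)}


rows = [
    {"user_id": 1, "query": "utm_source=news%20letter&utm_medium=email"},
    {"user_id": 2, "query": ""},  # produces no output rows
]
decoded = [out for row in rows for out in url_decode_udf(row)]
```

The first input row expands into two output rows and the second into none, which is the behaviour that plain SQL projections cannot express directly.
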
  14. Append only tables - Get last value
      1. Use aggregation MIN/MAX on timestamp to find first/last and join back to the same table.
      2. Use analytic functions FIRST_VALUE and LAST_VALUE.
           SELECT LAST_VALUE(email) OVER(PARTITION BY user_id ORDER BY timestamp ASC) AS email_last ...
      3. Using Window Functions
           SELECT email, firstname, lastname
           FROM (
             SELECT email, firstname, lastname,
                    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) seqnum
             FROM [profile_event]
           )
           WHERE seqnum=1
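
The slide gives no query for option 1 (aggregate, then join back), so here is a minimal sketch of that pattern. It runs against an in-memory SQLite database through Python's `sqlite3` module as a stand-in for BigQuery, and it uses standard SQL rather than the legacy BigQuery dialect shown above. The table name comes from the slide; the `ts` column and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profile_event (user_id INTEGER, email TEXT, ts INTEGER)")
# Append-only history: user 1 changed email, user 2 has a single record.
conn.executemany(
    "INSERT INTO profile_event VALUES (?, ?, ?)",
    [
        (1, "old@example.com", 100),
        (1, "new@example.com", 200),
        (2, "only@example.com", 150),
    ],
)

# Step 1: find the latest timestamp per user.
# Step 2: join back to the same table to fetch the full row at that timestamp.
LAST_VALUE_SQL = """
SELECT p.user_id, p.email
FROM profile_event AS p
JOIN (
    SELECT user_id, MAX(ts) AS last_ts
    FROM profile_event
    GROUP BY user_id
) AS latest
  ON latest.user_id = p.user_id AND latest.last_ts = p.ts
ORDER BY p.user_id
"""

latest_emails = conn.execute(LAST_VALUE_SQL).fetchall()
```

Swapping `MAX` for `MIN` returns the first recorded value instead. If two rows for one user share the same timestamp, the join returns both, which is one reason the window-function variant can be preferable.
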
  15. Table wildcard functions
      This example assumes the following tables exist:
      ● mydata.people20140323
      ● mydata.people20140324
      ● mydata.people20140325
        SELECT name
        FROM (TABLE_DATE_RANGE(mydata.people,
                               DATE_ADD(CURRENT_TIMESTAMP(), -2, 'DAY'),
                               CURRENT_TIMESTAMP()))
        WHERE age >= 35
      #... another example with RegExp
        ... FROM (TABLE_QUERY(mydata, 'REGEXP_MATCH(table_id, r"^boo[\d]{3,5}")'))
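
The date-range wildcard relies on tables being named as a prefix followed by a `YYYYMMDD` suffix. The sketch below shows that naming convention from the client side, for example when deciding which daily table an event should be written to. `daily_table_names` is a hypothetical helper, not a BigQuery API.

```python
from datetime import date, timedelta


def daily_table_names(prefix, start, end):
    """List the date-suffixed table names from start to end, inclusive."""
    days = (end - start).days
    return [
        f"{prefix}{(start + timedelta(days=offset)).strftime('%Y%m%d')}"
        for offset in range(days + 1)
    ]


# The three tables assumed on the slide.
tables = daily_table_names("mydata.people", date(2014, 3, 23), date(2014, 3, 25))
```
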
  16. Infrastructure
  17. Schema modelling
      +--------------------------+-----------+----------+
      | order_id                 | INTEGER   | REQUIRED |
      | ...                      |           |          |
      | products                 | RECORD    | REPEATED |
      | products.product_id      | INTEGER   | NULLABLE |
      | products.attributes      | STRING    | REPEATED |
      | products.price           | FLOAT     | NULLABLE |
      | products.name            | STRING    | NULLABLE |
      | ...                      |           |          |
      | common                   | RECORD    | NULLABLE |
      | common.insert_id         | INTEGER   | REQUIRED |
      | common.tenant            | INTEGER   | REQUIRED |
      | common.event             | INTEGER   | REQUIRED |
      | common.user_id           | INTEGER   | REQUIRED |
      | common.timestamp         | TIMESTAMP | REQUIRED |
      | ....                     |           |          |
      | common.utm               | RECORD    | NULLABLE |
      | common.utm.source        | STRING    | NULLABLE |
      | common.utm.medium        | STRING    | NULLABLE |
      | common.utm.campaign      | STRING    | NULLABLE |
      | common.utm.content       | STRING    | NULLABLE |
      | common.utm.term          | STRING    | NULLABLE |
      | meta                     | STRING    | NULLABLE |
      +--------------------------+-----------+----------+
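
To make the nesting concrete, here is what a single row shaped like this schema might look like when serialized as one line of newline-delimited JSON. All values are invented for illustration, only the fields named in the schema are included (the elided ones are left out), and storing a JSON document in the `meta` string is an assumption based on the earlier note about keeping the rest in JSON.

```python
import json

event_row = {
    "order_id": 1001,
    # REPEATED RECORD: a list of nested objects.
    "products": [
        {
            "product_id": 42,
            "attributes": ["red", "large"],  # REPEATED STRING
            "price": 19.99,
            "name": "T-shirt",
        },
    ],
    # Single nested RECORD shared by every event type.
    "common": {
        "insert_id": 1,
        "tenant": 7,
        "event": 3,
        "user_id": 555,
        "timestamp": "2015-10-01 12:00:00",
        "utm": {
            "source": "newsletter",
            "medium": "email",
            "campaign": None,  # NULLABLE fields may be null
            "content": None,
            "term": None,
        },
    },
    # Assumption: free-form extras kept as a JSON string in the STRING column.
    "meta": json.dumps({"browser": "firefox"}),
}

# One row per line, as a newline-delimited JSON load file would expect.
ndjson_line = json.dumps(event_row)
```
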
  18. Streaming insert time (ms) - last 6M
  19. Achievements
      ● Funnel Analysis
  20. Attribute orders to first article visited
      Example:
      ● article1 -> page2 -> page3 -> page4 -> orderpage1 -> thankyoupage1
      ● page1 -> article2 -> page3 -> orderpage2 -> ...
      Problem: When an order is made, attribute a credit to the first article visited by that user!
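
The attribution rule stated in the problem can be sketched in plain Python before it is turned into a query. The slides do not show the author's actual query, so this is only an illustration of the logic. It assumes that page types can be told apart by the name prefixes used in the example (`article`, `orderpage`) and that each event carries a user id and a sortable timestamp.

```python
from collections import Counter


def attribute_orders_to_first_article(events):
    """Credit each order to the first article its user visited.

    events: iterable of (user_id, timestamp, page) tuples, in any order.
    Returns a Counter mapping article name to number of credited orders.
    """
    first_article = {}
    credits = Counter()
    # Process each user's events in chronological order.
    for user_id, _ts, page in sorted(events, key=lambda e: (e[0], e[1])):
        if page.startswith("article"):
            # Remember only the first article seen for this user.
            first_article.setdefault(user_id, page)
        elif page.startswith("orderpage") and user_id in first_article:
            credits[first_article[user_id]] += 1
    return credits


# The two visit paths from the slide, as (user_id, timestamp, page).
events = [
    ("a", 1, "article1"), ("a", 2, "page2"), ("a", 3, "page3"),
    ("a", 4, "page4"), ("a", 5, "orderpage1"), ("a", 6, "thankyoupage1"),
    ("b", 1, "page1"), ("b", 2, "article2"), ("b", 3, "page3"),
    ("b", 4, "orderpage2"),
]
credits = attribute_orders_to_first_article(events)
```

An order placed before the user has visited any article receives no credit in this sketch; whether that matches the intended business rule is not stated in the slides.
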
  21. Achievements
      ● Funnel Analysis
      ● Email URL click heatmap
  22. Email URL clicks map (79GB in 2.4sec)
  23. Achievements Continued
      ● Funnel Analysis
      ● Email URL click heatmap
      ● Email Dashboard (Trends, SPAM, ISP deferral)
      ● Split tests (by content, region, device, during the day)
      ● Ability for advanced segmentation as all raw data is stored
      ● Behavioral analytics (engaged users, recommendations)
  24. Our benefits
      ● no provisioning/deploy
      ● no running out of resources
      ● no more focus on large scale execution plan
      ● no need to re-implement tricky concepts (time windows / join streams)
      ● pay only for the columns we use in our queries
      ● run raw ad-hoc queries (either by analysts/sales or Devs)
      ● no more throwing away, expiring, or aggregating old data
  25. BigQuery: Sample projects to try out
      1. githubarchive.org: 20+ event types available since 2012
         a. pull request latency
         b. expressions, emotions in commit messages
      2. httparchive.org: Trends in web technology
         a. popular scripts
         b. website performance
      3. raw Google Analytics data (*only Premium Customers)
      4. GDELT - Global Database of Events, Language, and Tone
         GKG - Global Knowledge Graph
      5. GSOD - samples of weather (rainfall, temp…)
      6. 1.6 billion Reddit comments
      7. Hackernews data
      8. Wikipedia edits
  26. HttpArchive - .HU Javascript frameworks
  27. GDELT - News Coverage: Orbán Viktor
  28. GDELT - News Coverage: Beata Szydlo
  29. Reddit - books community talks about
  30. Questions? Thank you.
