
Building Analytics Infrastructure for Growing Tech Companies

  1. Holistics.io
      Building Analytics Infrastructure for Growing Tech Companies (*)
      Huy Nguyen, CTO, Cofounder - Holistics.io
      Data Night Singapore, Aug 2016
      (*) aka Data Pipeline
  2. About Me
      ● Cofounder of Holistics.io
        ○ Data Reporting (BI) and Infrastructure SaaS
      ● Previous
        ○ Built Data Pipeline at Viki (Singapore)
        ○ Growth Team at Facebook (US)
  3. Agenda
      ● The Data Problem
      ● Typical Data Pipeline (Startup)
      ● Choosing An Analytics DB
      ● Choosing A BI Tool
  4. Background: What is Analytics/DW?
  5. A Typical Web Application
      Data-related Business Questions:
      • Daily/weekly registered users by different platforms, countries?
      • How many video uploads do we have every day?
      Diagram labels: Android, iOS, Web; APIs; Production DBs (Live Databases)
  6. A Typical Web Application (same business questions as slide 5)
      Diagram labels added: Daily Snapshot; Analytics Database; Reporting / BI
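
The first business question on slides 5 and 6 can be sketched as a query against the analytics database. This is a minimal sketch, not something shown in the talk: the `users` table and its columns (`platform`, `country`, `created_at`) are assumptions, since the slides do not give a schema.

```sql
-- Hypothetical users table as it might arrive in the analytics database
-- via the daily snapshot; the slides do not show a schema.
CREATE TABLE users (
    id         bigserial   PRIMARY KEY,
    platform   text        NOT NULL,   -- e.g. 'android', 'ios', 'web'
    country    text,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Weekly registered users by platform and country.
-- Swap 'week' for 'day' to get the daily version of the report.
SELECT date_trunc('week', created_at)::date AS week_start,
       platform,
       country,
       count(*) AS registered_users
FROM users
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
```

Running such reports against a snapshot rather than the live databases keeps long aggregations away from the production workload.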
  7. A Typical Web Application
      Data-related Business Questions:
      • How did my marketing campaigns affect registrations?
      Diagram labels: as on slide 6
  8. A Typical Web Application (same business question as slide 7)
      Diagram labels added: GA, FB Ads, Adwords...
  9. A Typical Data Pipeline
  10. Pipeline diagram
      Operational Data (sources): Production DBs (Live Databases); Event Logs (behavioural data); CSVs / Excels / Google Sheets; 3rd-party Tracking: GA, FB Ads, Adwords...
      Ingestion steps: Daily Snapshot; Load; Pre-aggregate; API Import
      Data Warehouse: Analytics Database (Update / Transform / Aggregate)
      Reporting / Analysis: Reporting / BI; Data Analysis; Data Science / ML
  11. Data Warehouse diagram
      (1) Import: same sources and ingestion steps as on slide 10
      (2) Process: Tables in the Analytics Database; Transform / Aggregate into Derived Tables
      (3) Present
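
The "Transform / Aggregate into derived tables" step on slide 11 could look roughly like the following. This is a sketch under assumptions: the slide does not say how derived tables are built, so a materialized view (a feature listed later on slide 30) is used here as one possible mechanism. The `pageviews` column names come from slide 24; the column types and the view name are hypothetical.

```sql
-- Raw table, with the column names from slide 24 (types are assumed).
CREATE TABLE pageviews (
    date_d    date    NOT NULL,
    country   text,
    user_id   bigint,
    browser   text,
    page_name text,
    views     integer
);

-- Derived table: one row per day and country instead of one per raw record.
CREATE MATERIALIZED VIEW daily_pageviews_by_country AS
SELECT date_d,
       country,
       sum(views)              AS views,
       count(DISTINCT user_id) AS visitors
FROM pageviews
GROUP BY date_d, country;

-- Re-run after each load so the derived data catches up with the raw table.
REFRESH MATERIALIZED VIEW daily_pageviews_by_country;
```

The raw table is never modified, which fits the "Immutable Data" principle on slide 12: derived data can always be rebuilt from the originals.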
  12. Data Pipeline Philosophy
      Centralize Your Data: join/cross-reference your data
      Unix Philosophy: Each component does one thing well
      Immutable Data: Don’t modify the original data
  13. Choosing An Analytics DB
  14. Pipeline diagram (as on slide 10): What database should we pick?
  15. Transactional DBs vs. Analytics DBs
      Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 5)
      Transactional DBs
        Data:
        ● Many single-row writes
        ● Current, single data
        Queries:
        ● Generated by user activities; 10 to 1000 users
        ● < 1s response time
        ● Short queries
      Analytics DBs
        Data:
        ● Few large batch imports
        ● Years of data, many sources
        Queries:
        ● Generated by large reports; 1 to 10 users
        ● Queries run for hours
        ● Long, complex queries
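
To make the contrast on slide 15 concrete, here is a small sketch of the two query shapes side by side. It is illustrative only: the table reuses the `pageviews` column names from slide 24 with assumed types, and the sample values are made up.

```sql
CREATE TABLE pageviews (
    date_d    date    NOT NULL,
    country   text,
    user_id   bigint,
    browser   text,
    page_name text,
    views     integer
);
CREATE INDEX ON pageviews (date_d);

-- Transactional shape: a short statement that writes a single row.
INSERT INTO pageviews (date_d, country, user_id, browser, page_name, views)
VALUES (DATE '2015-09-01', 'SG', 42, 'firefox', 'home', 1);

-- Analytical shape: scans and aggregates a whole year of rows for a report.
SELECT date_trunc('month', date_d)::date AS month,
       country,
       sum(views)                        AS views,
       count(DISTINCT user_id)           AS visitors
FROM pageviews
WHERE date_d >= DATE '2015-01-01'
  AND date_d <  DATE '2016-01-01'
GROUP BY 1, 2
ORDER BY 1, 2;
```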
  16. Complex Query...
      Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 8)
  17. Requirements
      1. Simple to Get Started
      2. Rich Features for Analytics
         – Data Pipeline (ETL)
         – Data Analysis (SQL et al)
      3. Easy to Scale Up
      Data Growth: (1) Start, (2) Grow, (3) Scale
  18. Requirements (as on slide 17), adding: Personal Recommendation:
  20. Requirement 1: Simple to Get Started
      ● Data requests grow gradually as your company grows
      ● Business users care about results (not backend)
      → Need something quick to start, easy to fine-tune along the way
      Postgres:
      ● Free (open-source)
      ● Easy to set up
      Progress: 1. Simple start, 2. Rich features, 3. Scale up
  22. Pipeline diagram (as on slide 10), split into two parts: Data Pipeline (ETL) and Data Analysis / Reporting
  23. Requirement 2a: Data Pipeline (ETL) & Performance
      ● Managing Table Data: table partitioning
      ● Managing Disk Space: tablespace
      ● Improve Write Performance: unlogged table
      ● Others: foreign data wrapper, point-in-time recovery
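
The slides go on to illustrate partitioning and tablespaces but not the "unlogged table" item, so here is a minimal sketch of how it might be used as an ETL staging area. The staging table name, the CSV path, and the clean-up rules are all assumptions for illustration; the `pageviews` column names are from slide 24 with assumed types.

```sql
CREATE TABLE pageviews (
    date_d    date    NOT NULL,
    country   text,
    user_id   bigint,
    browser   text,
    page_name text,
    views     integer
);

-- UNLOGGED skips the write-ahead log, which makes bulk writes faster.
-- The trade-off: the table is emptied after a crash and is not replicated,
-- so it only suits data that can be reloaded from its source.
CREATE UNLOGGED TABLE pageviews_staging (LIKE pageviews);

-- Hypothetical server-side file path. From psql, \copy reads a file on
-- the client machine instead.
COPY pageviews_staging FROM '/tmp/pageviews_2015_09.csv'
    WITH (FORMAT csv, HEADER true);

-- Clean up in staging, then move the rows into the durable table.
INSERT INTO pageviews
SELECT date_d, upper(country), user_id, browser, page_name, views
FROM pageviews_staging
WHERE views > 0;

TRUNCATE pageviews_staging;
```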
  24. Managing Data Tables
      Analytics tables hold lots of data.
      pageviews (+100k records a day): date_d | country | user_id | browser | page_name | views
      ⇒ The table grows big quickly and becomes difficult to manage!
      Solution: Split (partition) it into multiple tables: pageviews_2015_06, pageviews_2015_07, pageviews_2015_08, pageviews_2015_09
      Problem: It is then difficult to query data across multiple months.
  25. Managing Data Tables: parent table
      pageviews_parent (parent table) with children pageviews_2015_06, pageviews_2015_07, pageviews_2015_08, pageviews_2015_09, …
      ALTER TABLE pageviews_2015_09 INHERIT pageviews_parent;
      ALTER TABLE pageviews_2015_09 ADD CHECK (date_d >= '2015-09-01' AND date_d < '2015-10-01');
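
For context, the two `ALTER TABLE` statements on slide 25 could fit into a complete setup roughly like this. It is a sketch, not the author's script: the column types and the sample row are assumptions, and only the table names and the constraint range come from the slides.

```sql
-- Parent table: holds no rows itself, only defines the shared columns.
CREATE TABLE pageviews_parent (
    date_d    date    NOT NULL,
    country   text,
    user_id   bigint,
    browser   text,
    page_name text,
    views     integer
);

-- One child table per month, with the same columns as the parent.
CREATE TABLE pageviews_2015_09 (LIKE pageviews_parent INCLUDING ALL);
ALTER TABLE pageviews_2015_09
    ADD CHECK (date_d >= DATE '2015-09-01' AND date_d < DATE '2015-10-01');
ALTER TABLE pageviews_2015_09 INHERIT pageviews_parent;

-- With inheritance, rows are loaded into the child directly;
-- the parent does not route inserts on its own.
INSERT INTO pageviews_2015_09
VALUES (DATE '2015-09-03', 'SG', 42, 'firefox', 'home', 5);

-- Querying the parent covers every child. With constraint_exclusion at its
-- default setting ('partition'), children whose CHECK constraint cannot
-- match the WHERE clause are skipped by the planner.
SELECT country, sum(views) AS views
FROM pageviews_parent
WHERE date_d >= DATE '2015-09-01'
  AND date_d <  DATE '2015-09-08'
GROUP BY country;
```

This addresses the problem from slide 24: the data stays split by month, yet cross-month queries go through a single table name. PostgreSQL 10 and later also offer declarative partitioning (`PARTITION BY RANGE`), which postdates this talk.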
  26. Managing Disk Space
      An analytics DB holds lots of data; hardware space is limited.
      Data have different access frequencies:
      ● Hot Data
      ● Warm Data
      ● Cold Data
  27. Managing Disk Space: tablespace
      Tablespace: Define where your tables are stored on disk
      CREATE TABLESPACE hot_data LOCATION '/disk0/ssd/';
      CREATE TABLESPACE warm_data LOCATION '/disk1/sata2/';
      -- beginning of the month
      CREATE TABLE pageviews_2016_08 (LIKE pageviews_parent) TABLESPACE hot_data;
      ALTER TABLE pageviews_2016_07 SET TABLESPACE warm_data;
  28. Combining TABLESPACE and PARENT TABLE
      pageviews_parent (parent table) with children pageviews_2015_06, pageviews_2015_07, pageviews_2015_08, pageviews_2015_09, …
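
Combining the two ideas, a monthly rotation job might look like the sketch below. The tablespace names and disk paths follow slide 27 and the table names follow slides 25 and 27; the column types and the exact sequence of steps are assumptions. `CREATE TABLESPACE` needs superuser rights and an existing, empty directory owned by the PostgreSQL system user.

```sql
CREATE TABLESPACE hot_data  LOCATION '/disk0/ssd';
CREATE TABLESPACE warm_data LOCATION '/disk1/sata2';

CREATE TABLE pageviews_parent (
    date_d    date    NOT NULL,
    country   text,
    user_id   bigint,
    browser   text,
    page_name text,
    views     integer
);

-- Last month's partition, created on the fast disk while it was current.
CREATE TABLE pageviews_2016_07 (LIKE pageviews_parent INCLUDING ALL)
    TABLESPACE hot_data;
ALTER TABLE pageviews_2016_07
    ADD CHECK (date_d >= DATE '2016-07-01' AND date_d < DATE '2016-08-01');
ALTER TABLE pageviews_2016_07 INHERIT pageviews_parent;

-- Beginning of August: the new partition goes on the fast disk...
CREATE TABLE pageviews_2016_08 (LIKE pageviews_parent INCLUDING ALL)
    TABLESPACE hot_data;
ALTER TABLE pageviews_2016_08
    ADD CHECK (date_d >= DATE '2016-08-01' AND date_d < DATE '2016-09-01');
ALTER TABLE pageviews_2016_08 INHERIT pageviews_parent;

-- ...and July moves to the slower disk. This rewrites the table and holds
-- an exclusive lock while it runs, so it is best scheduled off-peak.
ALTER TABLE pageviews_2016_07 SET TABLESPACE warm_data;
```

Queries against `pageviews_parent` keep working throughout, because the tablespace only changes where a child table is stored, not how it is addressed. Indexes are not moved along with the table; each one needs its own `ALTER INDEX ... SET TABLESPACE`.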
  29. Requirement 2b: Data Analysis (writing SQL)
      ● Extract / transform
      ● Aggregate / summarize
      ● Statistical analysis
      Diagram labels: Analytics Database; Data Analysis; Reporting / BI; Data Science / ML
  30. Requirement 2b: Data Analysis
      ● SQL features
        ○ WITH clause
        ○ Window functions
        ○ Aggregation functions
        ○ Statistical functions
      ● Data structures
        ○ JSON / JSONB
        ○ Arrays
        ○ PostGIS (geo data)
        ○ Geometry (point, line, etc)
        ○ HyperLogLog (extension)
      ● PL/pgSQL
      ● Full-text search (n-gram)
      ● Performance:
        ○ Parallel queries (pg9.6)
        ○ Materialized views
        ○ BRIN index
      ● Others:
        ○ DISTINCT ON
        ○ VALUES
        ○ generate_series()
        ○ Support FULL OUTER JOIN
        ○ Better EXPLAIN
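
Several of the features listed on slide 30 can be combined in a single report. The sketch below uses a WITH clause, `generate_series()`, and a window function to produce daily signups with a running total, including days with no signups, and then shows `DISTINCT ON` separately. The `users` table and its columns are hypothetical, as is the date range.

```sql
-- Hypothetical table; the slides do not show a schema.
CREATE TABLE users (
    id         bigserial   PRIMARY KEY,
    country    text,
    created_at timestamptz NOT NULL
);

WITH days AS (
    -- generate_series() yields one row per day, so days without
    -- signups still appear in the report.
    SELECT d::date AS day
    FROM generate_series(TIMESTAMP '2016-08-01',
                         TIMESTAMP '2016-08-07',
                         INTERVAL '1 day') AS d
),
daily AS (
    SELECT created_at::date AS day, count(*) AS signups
    FROM users
    GROUP BY 1
)
SELECT days.day,
       coalesce(daily.signups, 0) AS signups,
       -- Window function: cumulative signups up to and including each day.
       sum(coalesce(daily.signups, 0)) OVER (ORDER BY days.day) AS running_total
FROM days
LEFT JOIN daily USING (day)
ORDER BY days.day;

-- DISTINCT ON: keep only the most recent signup per country.
SELECT DISTINCT ON (country) country, id, created_at
FROM users
ORDER BY country, created_at DESC;
```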
  34. Requirement 3: Scaling Up
      ● Transactional DB’s downsides:
        ○ Optimized for transactional applications
        ○ Single-core execution; row-based storage
      ● CitusDB Extension (Postgres)
        ○ Automated data sharding and parallelization
        ○ Columnar Storage Format (better storage and performance)
      ● Amazon Redshift
        ○ Based on ParAccel DB, which derives from PostgreSQL 8.0.2
        ○ Columnar Storage & Parallel Executions
        ○ Pay per hour per instance type
      ● Google BigQuery
        ○ Built on Google’s Dremel
        ○ Pay per query, per data access
  35. Other DW Databases (Relational)
      ● Greenplum
      ● Teradata
      ● Infobright
      ● Google BigQuery
      ● Aster Data
      Related to Postgres:
      ● ParAccel (Postgres fork)
      ● Vertica (from Postgres author)
      ● CitusDB (Postgres extension)
      ● Amazon Redshift (from ParAccel)
  36. Compare: Popular SQL Databases
                          PostgreSQL          MySQL               Oracle     SQL Server
      License / Cost      Free / Open-source  Free / Open-source  Expensive  Expensive
      Analytics Features  Strong              Weak                Strong     Strong
  37. Hadoop vs. MPP databases
      Hadoop is a data storage and processing framework
      – HDFS: data-storage layer
      – YARN: resource management
      – MapReduce/Pig/Hive/Spark: processing layer
      MPP database (massively parallel processing)
      – Columnar-storage database; meant for analytics purposes
      – OLAP – Online Analytical Processing
      – Examples: Vertica, Amazon Redshift, ParAccel
  38. Choosing A BI Tool
  39. Pipeline diagram (as on slide 10): Which BI software?
  40. Choosing A BI Tool: Criteria
      Use cases: Pretty Visualizations vs. Detailed Data Access, Embedded, Email Schedules, etc
      Report Creation: Technical vs. Non-technical
      Data Ownership: Your own database vs. BI Software’s storage
  41. Other Criteria
      ● Process: Direct Access vs. Data Models
      ● ETL: How is the ETL process managed?
      ● License Fee: Upfront Investment vs. Value-Based Pricing
      ● Implementation Fee: Self Service vs. 3rd Party
      ● Training Fee: Training Costs vs. Specialized Skill Sets
  42. BI Tools
      BI features:
      ● Heavy Visualization: Tableau, Qlik
      ● Medium + Other Features: Holistics, Chartio, Periscope
      Report Creation:
      ● SQL: Holistics, Periscope
      ● Drag-and-Drop/Excel: Tableau, Qlik, Sisense, PowerBI
      Data Ownership:
      ● Store your data: Periscope, Tableau, Sisense
      ● Doesn’t store your data: Holistics, Chartio, Looker
  43. Comparing BI tools
      Tool            Visualization  Report Creation                   Data Storage   Pricing                       Notes
      Tableau         Strong         Drag & Drop                       Store          Per desktop + server license  Leader in Visualization
      Qlik            Strong         Drag & Drop                       Store          Desktop + Server License      ETL + Visualization
      Holistics       Standard       SQL                               Doesn’t store  Users + Usage                 Strong permission management, detailed data extractions
      Periscope Data  Standard       SQL                               Store          No. of records                Auto-cache your data using Amazon Redshift
      Chartio         Standard       Drag & Drop                       Doesn’t store  User + Feature                Transform your data on Chartio server
      Looker          Standard       Data Model (LookML) based on SQL  Doesn’t store  User Packs                    Accessed through the Looker Data Model
  44. Summary
      ● The Data Problem
      ● Typical Pipeline (Startup)
      ● Choosing An Analytics DB
      ● Choosing A BI Tool
  45. Not Covered
      ● Setting Up & Performance Optimizations
      ● Other data types: time-series data, geo data, search data
      ● Big Data frameworks: Hadoop, Spark, HDFS, etc
      ● Real-time Data Processing, Stream Processing (Storm, Kafka, Kinesis)
  46. Holistics.io
  47. Holistics.io: Huy Nguyen, huy@holistics.io


Talk presented at Data Night Singapore, Aug 2016. http://www.meetup.com/DataNight/events/233108459/
