Big Data On Google Cloud
Tu Pham - IO Extended 2017
CTO @ Dyno
A Data-as-a-Service company
Technologies: Java, Python, all kinds of databases, and cloud
platforms from Google, AWS, and Azure.
Interests: Cloud computing / architecture, technology
evolution, distributed systems.
Husband, Father, GDE, Open source contributor.
Tu Pham
Photo: Lars Kruse, Aarhus Universitet
Introducing Dyno:
- Tech marketing & digital agency
Google Cloud Platform
Source: Boston Consulting Group,
The Mobile Revolution: How Mobile Technologies Drive a Trillion-Dollar Impact
IDC, 2015
By 2020, there will be 8 billion connected smartphones,
2x more than today.
And 32 billion connected "IoT" devices,
6x more than today.
A common configuration: draw conclusions
Diagram: events, metrics, etc. arrive as a stream; raw logs, files, assets, Google Analytics data, etc. arrive in batch. Results reach Cloud Datalab, visualization and BI, co-workers, and applications and reports.
Confidential & Proprietary
A serverless big data stack that scales automatically
Transform Data into Actions
Mobile apps; Sensors and devices; Web apps
Data Ingestion: Messaging, Logs
Databases / Storage: Relational, Key-value, Document, SQL, Wide column, Object
Data Preparation & Processing: Stream processing, Batch processing, Data preparation
Analytics: Federated query, Data catalog
Exploration & Collaboration: Data exploration, Data visualization
Advanced Analytics & Intelligence: Development environment for Machine Learning, Pre-trained Machine Learning models
Developers; Data scientists; Business analysts
Transform Data into Actions
Mobile apps; Sensors and devices; Web apps
Data Ingestion: Cloud Pub/Sub, App Engine
Databases / Storage: Cloud SQL, Cloud Bigtable, Cloud Datastore, Cloud Storage
Data Preparation & Processing: Cloud Dataflow, Cloud Dataproc
Analytics: Google BigQuery, Google Analytics 360, Cloud Dataproc, Google Drive
Exploration & Collaboration: Google BigQuery, Cloud Datalab, Google Analytics 360, Cloud Dataproc
Advanced Analytics & Intelligence: Cloud Machine Learning, Translate API, Vision API, Speech API
Developers; Data scientists; Business analysts
Apache Spark and Apache Hadoop should be
fast, easy, and cost-effective.
Google Cloud Dataproc
Google Cloud Dataproc - under the hood
Dataproc Jobs: Spark, PySpark, Spark SQL, MapReduce, Pig, Hive
Dataproc Cluster: applications on the cluster (Spark & Hadoop OSS), Cloud Dataproc Agent
GCP Products: Google Cloud Services
Diagram labels: Dataproc Jobs, Features, Data Outputs
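MapReduce is one of the job types listed for Dataproc. As a rough illustration of the programming model only, the sketch below runs the map, shuffle, and reduce steps of a word count in plain Python on one machine. It does not use Dataproc, Hadoop, or Spark APIs; the function names and the sample input are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word, as a MapReduce mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key; on a cluster this step moves data between workers."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values for each key, as a MapReduce reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on local, in-memory data."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, not taken from any real dataset.
    sample = ["big data on google cloud", "big data at dyno"]
    print(word_count(sample))
```

On a real cluster the same three steps are distributed across workers; the managed service handles cluster creation and job submission, so the job author mostly writes the mapper and reducer logic.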
Easy, fast, cost-effective
Fast
Things take seconds to minutes, not hours or weeks
Easy
Be an expert with your data, not your data infrastructure
Cost-effective
Pay for exactly what you use
Running Hadoop on Google Cloud
Diagram: four options compared side by side: On Premise, Vendor Hadoop, bdutil (Free OSS Toolkit), and Dataproc (Managed Hadoop).
Each option shows the same stack of responsibilities: Custom Code, Monitoring/Health, Dev Integration, Scaling (Manual Scaling in one column), Job Submission, GCP Connectivity, Deployment, Creation.
Legend: Google Managed, Customer Managed.
Cloud Dataproc - integrated
Cloud Dataproc is natively integrated with several Google Cloud Platform products as part of an integrated data platform.
Storage, Operations, Data
Where Cloud Dataproc fits into GCP
Google Bigtable (HBase)
Google BigQuery (Analytics, Data warehouse)
Stackdriver Logging (Logging Ops.)
Google Cloud Dataflow (Batch/Stream Processing)
Google Cloud Storage (HCFS/HDFS)
Stackdriver Monitoring (Monitoring)
Scales automatically
No setup or administration
Stream up to 100,000 rows/sec
Easily integrates with third-party software
Google BigQuery
makes complex data analysis simple
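To make the streaming figure above more concrete, here is a minimal sketch of client-side batching for streamed rows. The `send` callable is a hypothetical stand-in for a real client call (with the `google-cloud-bigquery` library, something like a wrapper around `Client.insert_rows_json` could play that role); the batch size of 500 is an assumption for illustration, and only the 100,000 rows/sec rate comes from the slide.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def batches(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


def stream_rows(
    rows: Iterable[Row],
    send: Callable[[List[Row]], object],
    batch_size: int = 500,  # assumed value, tune for the real service
) -> int:
    """Send rows batch by batch through `send`; return the number of rows sent."""
    sent = 0
    for batch in batches(rows, batch_size):
        send(batch)  # hypothetical hook for the real streaming call
        sent += len(batch)
    return sent


def seconds_at_rate(row_count: int, rows_per_second: int = 100_000) -> float:
    """Time needed to stream row_count rows at the slide's stated peak rate."""
    return row_count / rows_per_second


if __name__ == "__main__":
    collected: List[List[Row]] = []
    total = stream_rows(({"id": i} for i in range(1200)), collected.append)
    print(total, [len(b) for b in collected])
    print(seconds_at_rate(1_000_000))
```

Passing the sender in as a parameter keeps the batching logic testable without credentials or network access.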
Before: 1000-core Hadoop cluster = 2.5 hours
After: making ad hoc queries with BigQuery < 5 min
● 500+ Games
● Hundreds of Analysts
● Terabytes of Data Daily
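The before/after figures on this slide (2.5 hours on a 1000-core Hadoop cluster versus under 5 minutes with BigQuery) imply a speedup that can be worked out directly. The sketch below only does that arithmetic; since the slide gives "< 5 min", the result is a lower bound, and the function name is made up for this example.

```python
def speedup(before_minutes: float, after_minutes: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    if after_minutes <= 0:
        raise ValueError("after_minutes must be positive")
    return before_minutes / after_minutes


# Figures from the slide: 2.5 hours before, under 5 minutes after.
HADOOP_MINUTES = 2.5 * 60   # 150 minutes
BIGQUERY_MINUTES = 5.0      # upper bound, so the speedup is a lower bound

if __name__ == "__main__":
    factor = speedup(HADOOP_MINUTES, BIGQUERY_MINUTES)
    print(f"At least {factor:.0f}x faster")
```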
Big Data Challenges At Dyno
- Multi-TB data warehouse
- Raw input: > 100 GB of new raw data per day (structured
and unstructured)
- 65 online data sources
- Unlimited offline data sources
- Facing data quality problems every day
- From user information and behavior to user interest and
intention
- Managing a high-performance, cost-effective system
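As a rough sizing exercise for the raw-input figure above, the sketch below projects how the stated 100 GB per day accumulates. It assumes decimal units (1 TB = 1000 GB), a constant daily rate, and no compression or deletion; the slide says "> 100 GB", so the results are lower bounds. The function names are made up for this example.

```python
import math

GB_PER_TB = 1000  # decimal units, an assumption for this estimate


def yearly_volume_tb(daily_gb: float, days: int = 365) -> float:
    """Raw data accumulated over `days` at a constant daily rate, in TB."""
    return daily_gb * days / GB_PER_TB


def days_to_reach(target_tb: float, daily_gb: float) -> int:
    """Whole days needed to accumulate target_tb at a constant daily rate."""
    if daily_gb <= 0:
        raise ValueError("daily_gb must be positive")
    return math.ceil(target_tb * GB_PER_TB / daily_gb)


if __name__ == "__main__":
    # 100 GB/day is the lower bound stated on the slide.
    print(yearly_volume_tb(100))   # TB of new raw data per year
    print(days_to_reach(10, 100))  # days to add 10 TB
```

At this rate the raw input alone adds tens of terabytes per year, which is consistent with the multi-TB warehouse listed in the same slide.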
JOIN THE FLIGHT - WE ARE HIRING
IO Extended 2017
Twitter: @phamptu
Email: tu@dyno.vn
Frontend Developer: goo.gl/EY8RvV
Backend Developer: goo.gl/BnmmK6