Evolution of Data Platform at GoPro
ABOUT SPEAKERS
• Chester Chen
• Head of Data Science & Engineering (DSE) at GoPro
• Previously, Director of Engineering, Alpine Data Labs
• Founder and Organizer of SF Big Analytics meetup
• David Winters
• Data Architect of Data Science & Engineering (DSE) at GoPro
• Previously worked at Splice Machine, Apple
• Hao Zou
• Senior Software Engineer of Data Science & Engineering (DSE) at GoPro
• Previously worked at Alpine Data Labs, Pivotal
AGENDA
• Business Use Cases
• Evolution of GoPro Data Platform
• Platform Architecture Transformation & Streaming to S3
• Configurable Spark Batch Framework
• Data Democratization
• Data Management & Visualization
• Data Metrics Delivery
• Initial exploration in ML feature visualization
GROWING DATA NEED FROM GOPRO ECOSYSTEM
DATA
Analytics
Platform
Consumer Devices GoPro Apps
E-Commerce Social Media/OTT
3rd party data
Product Insight
User segmentation
CRM/Marketing
/Personalization
EXAMPLES OF ANALYTICS USE CASES
• Product Analytics
• Web/E-Commerce Analytics
• Camera Analytics
• Mobile Analytics
• GoPro Plus Analytics
• CRM Analytics
• Digital Marketing Analytics
• Social Media Analytics
• Cloud Media Analysis
Evolution of Data Platform
EVOLUTION OF DATA PLATFORM
EVOLUTION OF DATA PLATFORM
FIXED CLUSTER ARCHITECTURE
ETL Cluster
•Aggregations and Joins
•Hive and Spark jobs
•Map/Reduce
•Airflow
Secure Data Mart
Cluster
•End User Query
•Impala / Sentry
•Parquet
•Kerberos & LDAP
Analytics Apps
•Hue
•Tableau
•Plotly
•Python
•R
Streaming Cluster
•Log file streaming
•RESTful service
•Kafka
•Spark Streaming
•HBase
Batch Induction
Framework
•Batch files
•Scheduled downloads
•Pre-processing
•Java App
•Airflow
JSON
JSON
Parquet
DDL
• Rest API
• FTP downloads
• S3 sync
Streaming
Batch
Download
STREAMING ENDPOINT
ELB / HTTP
Pipeline for processing of streaming logs
To ETL Cluster
events
events
state
SPARK STREAMING PIPELINE
/path1/…
/path2/…
/path3/…
To ETL
Cluster
/path4/…
events
state
events
events
events
state
state
state
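The slide above diagrams chained Spark Streaming jobs that read events from Kafka and land JSON under per-category paths. Below is a minimal sketch of one such stage in Scala, assuming hypothetical broker, topic, group, and bucket names; the real jobs also do PII scrubbing, geo lookups, and routing between topics, which are omitted here.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object EventStreamStage {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("event-stream-stage"), Seconds(60))

    // Hypothetical broker, topic, and bucket names, for illustration only.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka-broker:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "event-stream-stage",
      "auto.offset.reset"  -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events-topic"), kafkaParams))

    // Land each micro-batch of JSON events under a timestamped path.
    stream.map(_.value).foreachRDD { rdd =>
      if (!rdd.isEmpty())
        rdd.saveAsTextFile(s"s3a://databucket/events/json/batch=${System.currentTimeMillis()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}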
ETL PIPELINE
HDFS
Hive Metastore
To SDM Cluster
From Streaming Cluster
Batch
Induction
Framework
state
snapshot
DATA DELIVERY!
HDFS
Hive Metastore
Applications
Thrift
ODBC
Server
User
Studio
Studio - Staging
GDA
Report
SDM Cluster
From ETL Cluster
PROS AND CONS OF OLD SYSTEM
• Isolation of workloads
• Fast ingest
• Secure
• Fast delivery/queries
• Loosely coupled clusters
• Multiple copies of data
• Tightly coupled storage and compute
• Lack of elasticity
• Operational overhead of multiple clusters
DYNAMIC ELASTIC ARCHITECTURE
Data Files
Streaming Cluster #1
Metastore
Ephemeral
ETL
Cluster #1
Parquet
+
DDL
Aggregates
Events
+
State
Ephemeral
Analytical
Cluster #1
Streaming
State Messages
Streaming Cluster #2
Streaming Cluster #N
Dynamic
DDL
Ephemeral
ETL
Cluster #2
Ephemeral
ETL
Cluster #N
Ephemeral
Analytical
Cluster #2
Ephemeral
Analytical
Cluster #N
Centralized Data Repository
Batch
Induction
Framework
• Rest API
• FTP downloads
• S3 sync
Batch
Download
Improvements
Single copy of data
Separate storage from compute
Elastic clusters
Reduced long running clusters to maintain
Parquet
+
DDL
Notebooks
STREAMING PIPELINES
Spark Cluster
Long Running Cluster
BATCH JOBS
Job Gateway
Spark Cluster: Scheduled Jobs
New cluster per Job
Dev
Machines
Spark Cluster: Dev Jobs
New or existing cluster
Production
Job.conf
Dev
Job.conf
INTERACTIVE/NOTEBOOKS
Spark Cluster
Long Running Clusters
Notebooks Scripts
(SQL, Python, Scala)
Scheduled Notebook Jobs
auto-scale
mixed on-demand &
spot Instances
TAKEAWAYS
Key Changes
•Centralized Hive metastore
•Leveraged S3 as centralized storage (see the config sketch after this list)
•Separated compute and storage
•Provided horizontal scalability with
cluster elasticity
•Less time in managing infrastructure
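A minimal sketch of what the centralized-metastore-plus-S3 change looks like from a job's point of view, assuming a shared external Hive metastore endpoint and an S3 warehouse bucket; both names are placeholders, not GoPro's actual endpoints.

import org.apache.spark.sql.SparkSession

// Every ephemeral cluster points at the same external Hive metastore and the
// same S3 warehouse, so data written by one cluster is visible to all others.
val spark = SparkSession.builder()
  .appName("ephemeral-etl")
  .config("hive.metastore.uris", "thrift://shared-metastore:9083")   // placeholder host
  .config("spark.sql.warehouse.dir", "s3a://databucket/warehouse")   // placeholder bucket
  .enableHiveSupport()
  .getOrCreate()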
TAKEAWAYS
Key Challenges
• Pushing data to S3
• Made use of parallel writes with multipart uploads (see the S3A tuning sketch after this list)
• Moving from Hadoop YARN to Spark Standalone
• Changed from fewer large EC2 instances to many
smaller instances
• Combined Spark Streaming jobs
• Considering a move to containers for further
improved instance utilization.
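For the multipart-upload challenge above, a hedged sketch of the kind of S3A tuning involved. The property names are standard Hadoop S3A settings; the values are illustrative only, not the settings used in production.

import org.apache.spark.sql.SparkSession

// Illustrative S3A upload tuning; exact values depend on instance size and workload.
def tuneS3AUploads(spark: SparkSession): Unit = {
  val hc = spark.sparkContext.hadoopConfiguration
  hc.set("fs.s3a.fast.upload", "true")          // buffer and upload parts in parallel
  hc.set("fs.s3a.multipart.size", "134217728")  // 128 MB multipart parts
  hc.set("fs.s3a.threads.max", "64")            // concurrent part uploads
  hc.set("fs.s3a.connection.maximum", "128")
}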
TAKEAWAYS
Key Benefits
• Cost
• Reduced redundant storage and compute costs
• Use of smaller instance types
• 60% AWS cost savings compared to 1 year ago
• Operation
• Reduced the complexity of DevOps support
• Analytics tools
• SQL only => Notebook with (SQL, Python, Scala)
CONFIGURABLE SPARK BATCH INGESTION FRAMEWORK
HIVE SQL  Spark
EVOLUTION OF DATA PLATFORM
BATCH INGESTION
GoPro Product data
3rd Party Data
3rd Party Data
3rd Party Data
Rest APIs
sftp
s3 sync
s3 sync
Batch Data Downloads Input File Formats: CSV, JSON
Spark Cluster
New cluster per Job
TABLE WRITER JOBS
SparkJob
HiveTableWriter
JDBCToHiveTableWriter
AbstractCSVHiveTableWriter AbstractJSONHiveTableWriter
CSVTableWriter JSONTableWriter
FileToHiveTableWriter
HBaseToHiveTableWriter TableToHiveTableWriter
HBaseSnapshotJob
TableSnapshotJob
CoreTableWriter
Customized JSON Job / Customized CSV Job
mixin
All jobs have the same configuration loading,
job state, and error reporting.
All table writers have Dynamic DDL
capabilities; once the inputs become DataFrames,
they all behave the same (see the sketch below).
CSV and JSON have
different loaders.
A different loader is needed to
load HBase records into a
DataFrame.
Aggregate Jobs
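A minimal sketch of the Dynamic DDL idea referenced above: once an input is a DataFrame, its schema can drive the Hive DDL, so new fields simply become new columns. The helper name, arguments, and external-table layout are assumptions, not the framework's actual API.

import org.apache.spark.sql.DataFrame

// Derive Hive DDL from whatever schema the DataFrame carries.
def createTableIfMissing(df: DataFrame, db: String, table: String, location: String): Unit = {
  val cols = df.schema.fields
    .map(f => s"`${f.name}` ${f.dataType.catalogString}")
    .mkString(",\n  ")
  df.sparkSession.sql(
    s"""CREATE EXTERNAL TABLE IF NOT EXISTS $db.$table (
       |  $cols
       |) STORED AS PARQUET
       |LOCATION '$location'""".stripMargin)
}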
ETL JOB CONFIGURATION
gopro.dse.config.etl {
mobile-job {
conf {}
process {}
input {}
output {}
post.process {}
}
}
include classpath("conf/production/etl_mobile_quik.conf")
include classpath("conf/production/etl_mobile_capture.conf")
include classpath("conf/production/etl_mobile_product_events.conf")
Job-level conf overrides JobType conf (see the loading sketch below)
Job specifics include
JobType
JobName
Input & output specification
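A sketch of how a job might resolve its configuration from this HOCON tree using Typesafe Config, with job-level keys overriding the job-type defaults. The fallback logic is assumed; only the config path comes from the slide.

import com.typesafe.config.{Config, ConfigFactory}

// Load the merged HOCON tree (application.conf plus the included per-job files),
// then let a job-specific block override the job-type defaults via withFallback.
val root: Config        = ConfigFactory.load()
val jobTypeConf: Config = root.getConfig("gopro.dse.config.etl.mobile-job")
def effectiveConf(jobSpecificConf: Config): Config =
  jobSpecificConf.withFallback(jobTypeConf)   // job-level keys win over job-type keys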
ETL JOB CONFIGURATION
xyz {
process {}
input {
delimiter = ","
inputDirPattern = "s3a://teambucket/xyz/raw/production"
file.ext = "csv"
file.format = "csv"
date.format = "yyyy-MM-dd hh:mm:ss"
table.name.extractor.method.name = "com.gopro.dse.batch.spark.job.FromFileName"
}
output {
database = "mobile",
file.format = "parquet"
date.format = "yyyy-MM-dd hh:mm:ss"
partitions = 2
file.compression.codec.key = "spark.sql.parquet.compression.codec"
file.compression.codec.value = "gzip"
save.mode = "append"
transformers = [com.gopro.dse.batch.spark.transformer.csv.xyz.XYZColumnTransformer]
}
post.process {
deleteSource = true
}
}
Save Mode
JobName
Input specification
output specification
Data Transformation
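The output block above lists a transformers class; the deck does not show the transformer interface, so the trait and the example body below are purely illustrative assumptions about what such a column transformer might look like.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_timestamp}

// Hypothetical transformer contract; the real interface is not shown in the deck.
trait ColumnTransformer {
  def transform(df: DataFrame): DataFrame
}

class XYZColumnTransformer extends ColumnTransformer {
  override def transform(df: DataFrame): DataFrame =
    df.withColumn("event_ts", to_timestamp(col("event_ts"), "yyyy-MM-dd hh:mm:ss"))  // assumed column
      .withColumnRenamed("usr_id", "user_id")                                        // assumed rename
}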
ETL With SQL & Scala
DATA TRANSFORMATION
• Hive SQL over JDBC via Beeline
• Suitable for non-Java/Scala/Python programmers
• Spark Job
• Requires Spark and Scala knowledge; need to set up jobs, configurations, etc.
• Dynamic Scala Scripts
• Scala as script: compile Scala at runtime, mixed with Spark SQL
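A minimal sketch of the compile-Scala-at-runtime approach using the Scala ToolBox (requires scala-compiler on the classpath). It assumes the script's final expression evaluates to a SparkJob instance, as in the example on the next slide; SparkJob is the framework's own job class.

import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

// Compile a Scala "script" at runtime and evaluate it; the script's last expression
// is expected to be a SparkJob instance.
def loadJob(scriptSource: String): SparkJob = {
  val toolbox = currentMirror.mkToolBox()
  toolbox.eval(toolbox.parse(scriptSource)).asInstanceOf[SparkJob]
}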
SCALA SCRIPTS EXAMPLES -- ONE SCALA SCRIPT FILE
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
// SparkJob and HiveContextFactory are provided by the batch framework.
class CameraAggCaptureMainJob extends SparkJob {
  def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
    val sqlContext: SQLContext = HiveContextFactory.getOrCreate(sc)
    val cameraCleanDataSchema = … // define DataFrame schema
    val cameraCleanDataStageDF = sqlContext.read.schema(cameraCleanDataSchema)
      .json("s3a://databucket/camera/work/production/clean-events/final/*")
    cameraCleanDataStageDF.createOrReplaceTempView("camera_clean_data")
    // Hive settings are issued as individual SET statements.
    sqlContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
    sqlContext.sql("set hive.enforce.bucketing=false")
    sqlContext.sql("set hive.auto.convert.join=false")
    sqlContext.sql("set hive.merge.mapredfiles=true")
    sqlContext.sql("""insert overwrite table work.camera_setting_shutter_dse_on
      select row_number() over (partition by metadata_file_name order by log_ts), …""")
    // rest of code
  }
}
new CameraAggCaptureMainJob
Data Democratization,
Visualization and Data
Management
EVOLUTION OF DATA PLATFORM
DATA DEMOCRATIZATION & MANAGEMENT FOCUS AREAS
• BedRock: Self-Service & Data Management (ongoing project)
• Pipeline Monitoring
• Product Analytics Visualization
• Self-service Ingestion
• Other tool services
DSE WEB ARCHITECTURE
DEMO
dse.gopro-platform.com
EVOLUTION OF DATA PLATFORM
Delivering Metrics via Slack
SLACK METRICS DELIVERY
(Redacted screenshot of a Slack metrics message)
SLACK METRICS DELIVERY
• Why Slack?
• Push vs. pull -- easy access
• Avoids another login when viewing metrics
• When connected to Slack, you are already logged in
• Puts metrics generation into the software engineering process
• SQL code is under source control
• The publishing job is scheduled and its performance is monitored
• Discussions, questions, and comments on specific metrics can be
done directly in the channel with the people involved.
SLACK DELIVERY FRAMEWORK
• Slack Metrics Delivery Framework
• Configuration Driven
• Multiple private Channels : Mobile/Cloud/Subscription/Web etc.
• Daily/Weekly/Monthly Delivery and comparison
• New metrics can be added easily with new SQL and configurations
SLACK METRICS CONCEPTS
• Slack Job →
• Channels (private channels) →
• Metrics Groups →
• Metrics1
• …
• MetricsN
• Main Query
• Compare Query (Optional)
• Chart Query (Optional)
• Persistence (optional)
• Hive + S3
• Additional deliveries (Optional)
• Kafka
• Other Cache stores (Http Post)
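One way to picture the hierarchy above as a configuration model. The case-class names below are assumptions for illustration only, not the framework's real types.

// Hypothetical configuration model for the Slack metrics framework.
case class Metric(
  name: String,
  mainQuery: String,                       // the main SQL query
  compareQuery: Option[String] = None,     // optional comparison query
  chartQuery: Option[String] = None)       // optional query backing a chart

case class MetricsGroup(name: String, metrics: Seq[Metric])

case class ChannelConfig(
  channel: String,                         // private channel: mobile, cloud, subscription, web, ...
  groups: Seq[MetricsGroup],
  persistToHive: Boolean = false,          // optional Hive + S3 persistence
  extraDeliveries: Seq[String] = Nil)      // optional Kafka topics or HTTP endpoints

case class SlackJob(name: String, schedule: String, channels: Seq[ChannelConfig])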
SLACK KPI DELIVERY ARCHITECTURE
Slack message json
HTTP POST Rest API Server
Rest API Server
generate graph / Metrics JSON
Return Image
HTTP POST
Save/Get Image
Plot.ly json
Save Metrics to Hive Table
Slack Spark Job
Get Image URL
Webhooks
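A minimal sketch of the final hop in the diagram: POSTing the rendered message JSON to a Slack incoming webhook. The webhook URL is a placeholder; error handling and retries are omitted.

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// POST the rendered message JSON to a Slack incoming webhook; Slack returns 200 on success.
def postToSlack(webhookUrl: String, messageJson: String): Int = {
  val conn = new URL(webhookUrl).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setDoOutput(true)
  val out = conn.getOutputStream
  try out.write(messageJson.getBytes(StandardCharsets.UTF_8)) finally out.close()
  conn.getResponseCode
}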
SLACK DELIVERY BENEFITS
• Pros:
• Quick and easy access via Slack
• Can quickly deliver to engineering managers, executives, business owners, and product
managers
• 100+ members have subscribed to different channels since we launched the service
• Cons
• Limited by Slack UI real estate: can only display key metrics in a two-column format,
so it is only suitable for high-level summary metrics
Machine Learning Feature
Visualization with Facets + Spark
EVOLUTION OF DATA PLATFORM
FEATURE VISUALIZATION
• Explore Feature Visualization via Google Facets
• Part 1 : Overview
• Part 2: Dive
• What is Facets Overview ?
FACETS OVERVIEW INTRODUCTION
• From Facets Home Page
• https://pair-code.github.io/facets/
• "Facets Overview "takes input feature data from any number of datasets, analyzes them feature by
feature and visualizes the analysis.
• Overview can help uncover issues with datasets, including the following:
• Unexpected feature values
• Missing feature values for a large number of examples
• Training/serving skew
• Training/test/validation set skew
• Key aspects of the visualization are outlier detection and distribution comparison across multiple
datasets.
• Interesting values (such as a high proportion of missing data, or very different distributions of a
feature across multiple datasets) are highlighted in red.
• Features can be sorted by values of interest such as the number of missing values or the skew
between the different datasets.
FACETS OVERVIEW SAMPLE
FACETS OVERVIEW IMPLEMENTATIONS
• The Facets Overview implementation consists of
• Feature Statistics Protocol Buffer definition
• Feature Statistics Generation
• Visualization
• Visualization
• The visualizations are implemented as Polymer web components, backed
by TypeScript code
• It can be embedded into Jupyter notebooks or webpages.
• Feature Statistics Generation
• There are two implementations for stats generation: Python and JavaScript
• Python: uses NumPy and pandas to generate stats
• JavaScript: uses JavaScript to generate stats
• Both implementations run stats generation in the browser
FACETS OVERVIEW
FEATURE OVERVIEW SPARK
• Initial exploration attempt
• Is it possible to generate stats for larger datasets while keeping the stats size small?
• Can we generate stats leveraging the distributed computing capability
of Spark instead of just using one node?
• Can we generate the stats in Spark, and then use them from Python
and/or JavaScript?
FACETS OVERVIEW + SPARK
ScalaPB
PREPARE SPARK DATA FRAME
case class NamedDataFrame(name:String, data: DataFrame)
val features = Array("Age", "Workclass", ….)
val trainData: DataFrame = loadCSVFile("./adult.data.csv")
val testData = loadCSVFile("./adult.test.txt")
val train = trainData.toDF(features: _*)
val test = testData.toDF(features: _*)
val dataframes = List(NamedDataFrame(name = "train", train),
NamedDataFrame(name = "test", test))
SPARK FACETS STATS GENERATOR
val generator = new FeatureStatsGenerator(DatasetFeatureStatisticsList())
val proto = generator.protoFromDataFrames(dataframes)
persistProto(proto)
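persistProto is also not shown; here is a plausible version based on the findings later in the deck, writing both the raw protobuf bytes and a Base64-encoded copy (the Base64 form is the one reported to load in both the Python and JavaScript front ends). The output directory and file names are placeholders; toByteArray is the ScalaPB-generated serializer.

import java.nio.file.{Files, Paths}
import java.util.Base64

// Plausible persistProto: raw protobuf plus a Base64 copy for browser/notebook use.
def persistProto(proto: DatasetFeatureStatisticsList, dir: String = "/tmp/facets"): Unit = {
  val bytes = proto.toByteArray
  Files.createDirectories(Paths.get(dir))
  Files.write(Paths.get(dir, "stats.pb"), bytes)
  Files.write(Paths.get(dir, "stats.pb.b64"), Base64.getEncoder.encode(bytes))
}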
SPARK FACETS STATS GENERATOR
def protoFromDataFrames(dataFrames: List[NamedDataFrame],
                        features: Set[String] = Set.empty[String],
                        histgmCatLevelsCount: Option[Int] = None): DatasetFeatureStatisticsList
FACET OVERVIEW SPARK
FACET OVERVIEW SPARK
DEMO
INITIAL FINDINGS
• Implementation
• The first-pass implementation is not efficient
• We make multiple passes over the data for each feature; as the number of features grows,
performance suffers, which limits the number of features that can be used
• The size of the dataset used to generate stats also determines the size of the generated protobuf file
• I haven't dug deeper into what contributes to the change in size
• The combination of data size and feature count can produce a large file, which won't fit in the browser
• With Spark DataFrames, we can't support TensorFlow Records
• The Base64-encoded protobuf string can be loaded by Python or JavaScript
• The protobuf binary file can also be loaded by Python
• But somehow it cannot be loaded by JavaScript.
WHAT’S NEXT?
• Improve implementation performance
• When we have a lot of data and features, what’s the proper size that
generate proper stats size that can be loaded into browser or notebook ?
• For example, One experiments: 300 Features  200MB size
• How do we efficiently partition the features so that can be viewable ?
• Data is changing : how can we incremental update the stats on the regular
basis ?
• How to integrate this into production?
FINAL THOUGHTS
• We are still in the early stage of Data Platform Evolution.
• We will continue to share our experiences with you along the way.
• Questions?
Thank You
Data Science & Engineering
GoPro
Speaker Notes
1. High-level architecture of the Data Platform
• Isolation of workloads → 3 clusters (ingest, ETL, delivery); Lambda architecture; input and output data formats; cadence of clusters
• A word about data sources:
  - IoT data: logs from devices, applications (desktop and mobile), external systems and services, ERP, web/email marketing, etc. Some raw and gzip, some binary and JSON; some streaming and some batch
  - Batch data: web marketing, campaigns, social media, ERP, CRM
• Lambda architecture: both batch and stream processing
• Basic needs/workloads in a data platform: high-throughput ingestion; transformations (joins, aggregations, etc.); fast queries
• Today, we have 3 clusters to isolate these workloads. We started with one cluster (ETL) where everything ran: ingest (Flume), batch (framework), ETL (Hive), analytical (Impala), with lots of resource contention (I/O, memory, cores). To alleviate the resource contention, we opted for 3 clusters to isolate the workloads:
  - Ingest cluster for near real-time streaming: Kafka, Spark Streaming (Cloudera Parcels); input: logs, output: JSON; minutes cadence, moving towards more real-time in seconds; Induction framework for scheduled batch ingestion
  - ETL cluster for heavy-duty aggregation: input: JSON flat files, output: aggregated Parquet files; Hive (Map/Reduce); hourly cadence
  - Secure Data Mart: Kerberos, LDAP, Active Directory, Apache Sentry (Cloudera committers); input: compressed Parquet files; analytical SQL engine for Tableau, ad-hoc queries (Hue), data wrangling (Trifacta), and data science (Jupyter Notebooks and RStudio)
• With all that said, we will examine newer technologies that will enable us to simplify our architecture and merge clusters in the future. Kudu is one possible technology that could help us consolidate some of the clusters.
2. Let's take a deeper dive into our streaming ingestion…
• Logs are streamed from devices and software applications (desktop and mobile) to the web service endpoint. The endpoint is an elastic pool of Tomcat servers sitting behind an ELB in AWS. A custom servlet pushes logs into Kafka topics by environment. A series of Spark Streaming jobs process the logs from Kafka. The landing place in the ingestion cluster is HDFS with JSON flat files.
• Rationalization of tech stacks…
• Why Kafka? Unrivaled write throughput for a queue (traditional queue throughput: 100K writes/sec on the biggest box you can buy; Kafka throughput: 1M writes/sec on 3-4 commodity servers); strong ordering policy of messages; distributed; fault-tolerant through replication; supports synchronous and asynchronous writes; pairs nicely with Spark Streaming for simpler scaling out (Kafka topic partitions map directly to Spark RDD partitions/tasks).
• Why Spark Streaming? Strong transactional semantics ("exactly once" processing); leverage Spark technology for both data ingest and analytics; horizontally scalable, high throughput from micro-batching; large open source community.
3. Keyword: impedance mismatch
• As previously stated, logs are streamed from devices and software applications (desktop and mobile) to the web service endpoint. Logs are diverse: gzipped, raw, binary, JSON, batched events, streamed single events, and vary significantly in size from < 1 KB to > 1 MB.
• Logs are redirected based on data category and routed to the appropriate Kafka topic and respective Spark Streaming job. Logs move from Kafka topic to Kafka topic, with each topic having a Spark Streaming job that consumes the log, processes it, and writes it to another topic: a tree-like structure of jobs with more generic logic towards the root and more specialized logic towards the leaf nodes.
• There are generic jobs/services and specialized jobs/services. Generic services include PII removal and hashing, IP-to-geo lookups, and batched writing to HDFS. We perform batched HDFS writing since Kafka likes small messages (1 KB ideal) and HDFS likes large files (100+ MB). Specialized services contain business logic.
• Finally, the logs are written into HDFS as JSON flat files (sometimes compressed depending on the type of data). Scheduled ETL jobs perform a distributed copy (distcp) to move the data to the ETL cluster for further heavier aggregations.
4. A few things…
• Two flows of data: streaming and batch. Join data sources, aggregate data sources, convert to compressed columnar format (gzipped Parquet files).
• The ETL cluster is where we do our heavy lifting: almost entirely Hive Map/Reduce jobs, with some Impala to make the really big gnarly aggregations more performant. Previously, we had a custom Java Map/Reduce job for sessionization of events; this has been replaced with a Spark Streaming job on the ingestion cluster. In the future, we want to push as much of the ETL processing back into the ingestion cluster for more real-time processing.
• We also have a custom Java Induction framework which ingests data from external services that only make data available on slower schedules (daily, twice daily, etc.).
• The output from the ETL cluster is Parquet files that are added to partitioned managed tables in the Hive metastore. The Parquet files are then copied via distcp to the Secure Data Mart.
5. Parquet files are copied from the ETL cluster and added to partitioned managed tables in the Hive Metastore of the Secure Data Mart.
• The Secure Data Mart is protected with Apache Sentry.
• Kerberos is used for authentication (corporate standard).
• Active Directory stores the groups (corporate standard).
• Access control is role-based and the roles are assigned with Sentry. Hue has a Sentry UI app to manage authorization.
6. Store data in one place → Data (S3) + Structure (Hive Metastore)
• Separate compute nodes from storage nodes
• Elasticity → size of clusters and number of clusters
• Lower operational overhead of maintaining HDFS storage nodes
• Promote Kafka to be a centralized service (data hub)