Evolution of Data Platform at GoPro
ABOUT SPEAKERS
• Chester Chen
• Head of Data Science & Engineering (DSE) at GoPro
• Previously, Director of Engineering, Alpine Data Labs
• Founder and Organizer of SF Big Analytics meetup
• David Winters
• Data Architect of Data Science & Engineering (DSE) at GoPro
• Previously worked at Splice Machine, Apple
• Hao Zou
• Senior Software Engineer of Data Science & Engineering (DSE) at GoPro
• Previously worked at Alpine Data Labs, Pivotal
AGENDA
• Business Use Cases
• Evolution of GoPro Data Platform
• Platform Architecture Transformation & Streaming to S3
• Configurable Spark Batch Framework
• Data Democratization
• Data Management & Visualization
• Data Metrics Delivery
• Initial exploration in ML feature visualization
GROWING DATA NEED FROM GOPRO ECOSYSTEM
DATA
Analytics
Platform
Consumer Devices GoPro Apps
E-Commerce Social Media/OTT
3rd party data
Product Insight
User segmentation
CRM/Marketing
/Personalization
EXAMPLES OF ANALYTICS USE CASES
• Product Analytics
• Web/E-Commerce Analytics
• Camera Analytics
• Mobile Analytics
• GoPro Plus Analytics
• CRM Analytics
• Digital Marketing Analytics
• Social Media Analytics
• Cloud Media Analysis
Evolution of Data Platform
EVOLUTION OF DATA PLATFORM
EVOLUTION OF DATA PLATFORM
FIXED CLUSTER ARCHITECTURE
ETL Cluster
•Aggregations and Joins
•Hive and Spark jobs
•Map/Reduce
•Airflow
Secure Data Mart
Cluster
•End User Query
•Impala / Sentry
•Parquet
•Kerberos & LDAP
Analytics Apps
•Hue
•Tableau
•Plotly
•Python
•R
Streaming Cluster
•Log file streaming
•RESTful service
•Kafka
•Spark Streaming
•HBase
Batch Induction
Framework
•Batch files
•Scheduled downloads
•Pre-processing
•Java App
•Airflow
JSON
JSON
Parquet
DDL
• Rest API
• FTP downloads
• S3 sync
Streaming
Batch
Download
STREAMING ENDPOINT
ELB / HTTP
Pipeline for processing of streaming logs
To ETL Cluster
events
events
state
SPARK STREAMING PIPELINE
/path1/…
/path2/…
/path3/…
To ETL
Cluster
/path4/…
events
state
events
events
events
state
state
state
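The slide above diagrams chained Spark Streaming jobs that read events from Kafka and land JSON under per-category paths. Below is a minimal sketch of one such stage in Scala, assuming hypothetical broker, topic, group, and bucket names; the real jobs also do PII scrubbing, geo lookups, and routing between topics, which are omitted here.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object EventStreamStage {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("event-stream-stage"), Seconds(60))

    // Hypothetical broker, topic, and bucket names, for illustration only.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka-broker:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "event-stream-stage",
      "auto.offset.reset"  -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events-topic"), kafkaParams))

    // Land each micro-batch of JSON events under a timestamped path.
    stream.map(_.value).foreachRDD { rdd =>
      if (!rdd.isEmpty())
        rdd.saveAsTextFile(s"s3a://databucket/events/json/batch=${System.currentTimeMillis()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}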
ETL PIPELINE
HDFS
Hive Metastore
To SDM Cluster
From Streaming Cluster
Batch
Induction
Framework
state
snapshot
DATA DELIVERY!
HDFS
Hive Metastore
Applications
Thrift
ODBC
Server
User
Studio
Studio - Staging
GDA
Report
SDM Cluster
From ETL Cluster
PROS AND CONS OF OLD SYSTEM
• Isolation of workloads
• Fast ingest
• Secure
• Fast delivery/queries
• Loosely coupled clusters
• Multiple copies of data
• Tightly coupled storage and compute
• Lack of elasticity
• Operational overhead of multiple clusters
DYNAMIC ELASTIC ARCHITECTURE
Data Files
Streaming Cluster #1
Metastore
Ephemeral
ETL
Cluster #1
Parquet
+
DDL
Aggregates
Events
+
State
Ephemeral
Analytical
Cluster #1
Streaming
State Messages
Streaming Cluster #2
Streaming Cluster #N
Dynamic
DDL
Ephemeral
ETL
Cluster #2
Ephemeral
ETL
Cluster #N
Ephemeral
Analytical
Cluster #2
Ephemeral
Analytical
Cluster #N
Centralized Data Repository
Batch
Induction
Framework
• Rest API
• FTP downloads
• S3 sync
Batch
Download
Improvements
Single copy of data
Separate storage from compute
Elastic clusters
Reduced long running clusters to maintain
Parquet
+
DDL
Notebooks
STREAMING PIPELINES
Spark Cluster
Long Running Cluster
BATCH JOBS
Job Gateway
Spark Cluster: Scheduled Jobs
New cluster per Job
Dev
Machines
Spark Cluster: Dev Jobs
New or existing cluster
Production
Job.conf
Dev
Job.conf
INTERACTIVE/NOTEBOOKS
Spark Cluster
Long Running Clusters
Notebooks Scripts
(SQL, Python, Scala)
Scheduled Notebook Jobs
auto-scale
mixed on-demand &
spot Instances
TAKEAWAYS
Key Changes
•Centralized Hive metastore
•Leveraged S3 as centralized storage (see the config sketch after this list)
•Separated compute and storage
•Provided horizontal scalability with
cluster elasticity
•Less time in managing infrastructure
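A minimal sketch of what the centralized-metastore-plus-S3 change looks like from a job's point of view, assuming a shared external Hive metastore endpoint and an S3 warehouse bucket; both names are placeholders, not GoPro's actual endpoints.

import org.apache.spark.sql.SparkSession

// Every ephemeral cluster points at the same external Hive metastore and the
// same S3 warehouse, so data written by one cluster is visible to all others.
val spark = SparkSession.builder()
  .appName("ephemeral-etl")
  .config("hive.metastore.uris", "thrift://shared-metastore:9083")   // placeholder host
  .config("spark.sql.warehouse.dir", "s3a://databucket/warehouse")   // placeholder bucket
  .enableHiveSupport()
  .getOrCreate()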
TAKEAWAYS
Key Challenges
• Pushing data to S3
• Made use of parallel writes with multipart uploads (see the S3A tuning sketch after this list)
• Moving from Hadoop YARN to Spark Standalone
• Changed from fewer large EC2 instances to many
smaller instances
• Combined Spark Streaming jobs
• Considering a move to containers for further
improved instance utilization.
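For the multipart-upload challenge above, a hedged sketch of the kind of S3A tuning involved. The property names are standard Hadoop S3A settings; the values are illustrative only, not the settings used in production.

import org.apache.spark.sql.SparkSession

// Illustrative S3A upload tuning; exact values depend on instance size and workload.
def tuneS3AUploads(spark: SparkSession): Unit = {
  val hc = spark.sparkContext.hadoopConfiguration
  hc.set("fs.s3a.fast.upload", "true")          // buffer and upload parts in parallel
  hc.set("fs.s3a.multipart.size", "134217728")  // 128 MB multipart parts
  hc.set("fs.s3a.threads.max", "64")            // concurrent part uploads
  hc.set("fs.s3a.connection.maximum", "128")
}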
TAKEAWAYS
Key Benefits
• Cost
• Reduced redundant storage and compute costs
• Use of smaller instance types
• 60% AWS cost savings compared to 1 year ago
• Operation
• Reduced the complexity of DevOps support
• Analytics tools
• SQL only => Notebook with (SQL, Python, Scala)
CONFIGURABLE SPARK BATCH INGESTION FRAMEWORK
HIVE SQL  Spark
EVOLUTION OF DATA PLATFORM
BATCH INGESTION
GoPro Product data
3rd Party Data
3rd Party Data
3rd Party Data
Rest APIs
sftp
s3 sync
s3 sync
Batch Data Downloads Input File Formats: CSV, JSON
Spark Cluster
New cluster per Job
TABLE WRITER JOBS
SparkJob
HiveTableWriter
JDBCToHiveTableWriter
AbstractCSVHiveTableWriter AbstractJSONHiveTableWriter
CSVTableWriter JSONTableWriter
FileToHiveTableWriter
HBaseToHiveTableWriter TableToHiveTableWriter
HBaseSnapshotJob
TableSnapshotJob
CoreTableWriter
Customized JSON Job / Customized CSV Job
mixin
All jobs have the same configuration loading,
job state, and error reporting.
All table writers have Dynamic DDL
capabilities; once the inputs become DataFrames,
they all behave the same (see the sketch below).
CSV and JSON have
different loaders.
A different loader is needed to
load HBase records into a
DataFrame.
Aggregate Jobs
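A minimal sketch of the Dynamic DDL idea referenced above: once an input is a DataFrame, its schema can drive the Hive DDL, so new fields simply become new columns. The helper name, arguments, and external-table layout are assumptions, not the framework's actual API.

import org.apache.spark.sql.DataFrame

// Derive Hive DDL from whatever schema the DataFrame carries.
def createTableIfMissing(df: DataFrame, db: String, table: String, location: String): Unit = {
  val cols = df.schema.fields
    .map(f => s"`${f.name}` ${f.dataType.catalogString}")
    .mkString(",\n  ")
  df.sparkSession.sql(
    s"""CREATE EXTERNAL TABLE IF NOT EXISTS $db.$table (
       |  $cols
       |) STORED AS PARQUET
       |LOCATION '$location'""".stripMargin)
}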
ETL JOB CONFIGURATION
gopro.dse.config.etl {
mobile-job {
conf {}
process {}
input {}
output {}
post.process {}
}
}
include classpath("conf/production/etl_mobile_quik.conf")
include classpath("conf/production/etl_mobile_capture.conf")
include classpath("conf/production/etl_mobile_product_events.conf")
Job-level conf overrides JobType conf (see the loading sketch below)
Job specifics include
JobType
JobName
Input & output specification
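A sketch of how a job might resolve its configuration from this HOCON tree using Typesafe Config, with job-level keys overriding the job-type defaults. The fallback logic is assumed; only the config path comes from the slide.

import com.typesafe.config.{Config, ConfigFactory}

// Load the merged HOCON tree (application.conf plus the included per-job files),
// then let a job-specific block override the job-type defaults via withFallback.
val root: Config        = ConfigFactory.load()
val jobTypeConf: Config = root.getConfig("gopro.dse.config.etl.mobile-job")
def effectiveConf(jobSpecificConf: Config): Config =
  jobSpecificConf.withFallback(jobTypeConf)   // job-level keys win over job-type keys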
ETL JOB CONFIGURATION
xyz {
process {}
input {
delimiter = ","
inputDirPattern = "s3a://teambucket/xyz/raw/production"
file.ext = "csv"
file.format = "csv"
date.format = "yyyy-MM-dd hh:mm:ss"
table.name.extractor.method.name = "com.gopro.dse.batch.spark.job.FromFileName"
}
output {
database = "mobile",
file.format = "parquet"
date.format = "yyyy-MM-dd hh:mm:ss"
partitions = 2
file.compression.codec.key = "spark.sql.parquet.compression.codec"
file.compression.codec.value = "gzip"
save.mode = "append"
transformers = [com.gopro.dse.batch.spark.transformer.csv.xyz.XYZColumnTransformer]
}
post.process {
deleteSource = true
}
}
Save Mode
JobName
Input specification
output specification
Data Transformation
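The output block above lists a transformers class; the deck does not show the transformer interface, so the trait and the example body below are purely illustrative assumptions about what such a column transformer might look like.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_timestamp}

// Hypothetical transformer contract; the real interface is not shown in the deck.
trait ColumnTransformer {
  def transform(df: DataFrame): DataFrame
}

class XYZColumnTransformer extends ColumnTransformer {
  override def transform(df: DataFrame): DataFrame =
    df.withColumn("event_ts", to_timestamp(col("event_ts"), "yyyy-MM-dd hh:mm:ss"))  // assumed column
      .withColumnRenamed("usr_id", "user_id")                                        // assumed rename
}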
ETL With SQL & Scala
DATA TRANSFORMATION
• Hive SQL over JDBC via Beeline
• Suitable for non-Java/Scala/Python programmers
• Spark Job
• Requires Spark and Scala knowledge; need to set up jobs, configurations, etc.
• Dynamic Scala Scripts
• Scala as script: compile Scala at runtime, mixed with Spark SQL
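A minimal sketch of the compile-Scala-at-runtime approach using the Scala ToolBox (requires scala-compiler on the classpath). It assumes the script's final expression evaluates to a SparkJob instance, as in the example on the next slide; SparkJob is the framework's own job class.

import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

// Compile a Scala "script" at runtime and evaluate it; the script's last expression
// is expected to be a SparkJob instance.
def loadJob(scriptSource: String): SparkJob = {
  val toolbox = currentMirror.mkToolBox()
  toolbox.eval(toolbox.parse(scriptSource)).asInstanceOf[SparkJob]
}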
SCALA SCRIPTS EXAMPLES -- ONE SCALA SCRIPT FILE
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
// SparkJob and HiveContextFactory are provided by the batch framework.
class CameraAggCaptureMainJob extends SparkJob {
  def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
    val sqlContext: SQLContext = HiveContextFactory.getOrCreate(sc)
    val cameraCleanDataSchema = … // define DataFrame schema
    val cameraCleanDataStageDF = sqlContext.read.schema(cameraCleanDataSchema)
      .json("s3a://databucket/camera/work/production/clean-events/final/*")
    cameraCleanDataStageDF.createOrReplaceTempView("camera_clean_data")
    // Hive settings are issued as individual SET statements.
    sqlContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
    sqlContext.sql("set hive.enforce.bucketing=false")
    sqlContext.sql("set hive.auto.convert.join=false")
    sqlContext.sql("set hive.merge.mapredfiles=true")
    sqlContext.sql("""insert overwrite table work.camera_setting_shutter_dse_on
      select row_number() over (partition by metadata_file_name order by log_ts), …""")
    // rest of code
  }
}
new CameraAggCaptureMainJob
Data Democratization,
Visualization and Data
Management
EVOLUTION OF DATA PLATFORM
DATA DEMOCRATIZATION & MANAGEMENT FOCUS AREAS
• BedRock: Self-Service & Data Management (ongoing project)
• Pipeline Monitoring
• Product Analytics Visualization
• Self-service Ingestion
• Other tool services
DSE WEB ARCHITECTURE
DEMO
dse.gopro-platform.com
EVOLUTION OF DATA PLATFORM
Delivering Metrics via Slack
SLACK METRICS DELIVERY
(Redacted screenshot of a Slack metrics message)
SLACK METRICS DELIVERY
• Why Slack?
• Push vs. pull -- easy access
• Avoids another login when viewing metrics
• When connected to Slack, you are already logged in
• Puts metrics generation into the software engineering process
• SQL code is under source control
• The publishing job is scheduled and its performance is monitored
• Discussions, questions, and comments on specific metrics can be
done directly in the channel with the people involved.
SLACK DELIVERY FRAMEWORK
• Slack Metrics Delivery Framework
• Configuration Driven
• Multiple private Channels : Mobile/Cloud/Subscription/Web etc.
• Daily/Weekly/Monthly Delivery and comparison
• New metrics can be added easily with new SQL and configurations
SLACK METRICS CONCEPTS
• Slack Job →
• Channels (private channels) →
• Metrics Groups →
• Metrics1
• …
• MetricsN
• Main Query
• Compare Query (Optional)
• Chart Query (Optional)
• Persistence (optional)
• Hive + S3
• Additional deliveries (Optional)
• Kafka
• Other Cache stores (Http Post)
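One way to picture the hierarchy above as a configuration model. The case-class names below are assumptions for illustration only, not the framework's real types.

// Hypothetical configuration model for the Slack metrics framework.
case class Metric(
  name: String,
  mainQuery: String,                       // the main SQL query
  compareQuery: Option[String] = None,     // optional comparison query
  chartQuery: Option[String] = None)       // optional query backing a chart

case class MetricsGroup(name: String, metrics: Seq[Metric])

case class ChannelConfig(
  channel: String,                         // private channel: mobile, cloud, subscription, web, ...
  groups: Seq[MetricsGroup],
  persistToHive: Boolean = false,          // optional Hive + S3 persistence
  extraDeliveries: Seq[String] = Nil)      // optional Kafka topics or HTTP endpoints

case class SlackJob(name: String, schedule: String, channels: Seq[ChannelConfig])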
SLACK KPI DELIVERY ARCHITECTURE
Slack message json
HTTP POST Rest API Server
Rest API Server
generate graph / Metrics JSON
Return Image
HTTP POST
Save/Get Image
Plot.ly json
Save Metrics to Hive Table
Slack Spark Job
Get Image URL
Webhooks
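A minimal sketch of the final hop in the diagram: POSTing the rendered message JSON to a Slack incoming webhook. The webhook URL is a placeholder; error handling and retries are omitted.

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// POST the rendered message JSON to a Slack incoming webhook; Slack returns 200 on success.
def postToSlack(webhookUrl: String, messageJson: String): Int = {
  val conn = new URL(webhookUrl).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setDoOutput(true)
  val out = conn.getOutputStream
  try out.write(messageJson.getBytes(StandardCharsets.UTF_8)) finally out.close()
  conn.getResponseCode
}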
SLACK DELIVERY BENEFITS
• Pros:
• Quick and easy access via Slack
• Can quickly deliver to engineering managers, executives, business owners, and product
managers
• 100+ members have subscribed to different channels since we launched the service
• Cons
• Limited by Slack UI real estate: can only display key metrics in a two-column format,
so it is only suitable for high-level summary metrics
Machine Learning Feature
Visualization with Facets + Spark
EVOLUTION OF DATA PLATFORM
FEATURE VISUALIZATION
• Explore Feature Visualization via Google Facets
• Part 1 : Overview
• Part 2: Dive
• What is Facets Overview ?
FACETS OVERVIEW INTRODUCTION
• From Facets Home Page
• https://pair-code.github.io/facets/
• "Facets Overview "takes input feature data from any number of datasets, analyzes them feature by
feature and visualizes the analysis.
• Overview can help uncover issues with datasets, including the following:
• Unexpected feature values
• Missing feature values for a large number of examples
• Training/serving skew
• Training/test/validation set skew
• Key aspects of the visualization are outlier detection and distribution comparison across multiple
datasets.
• Interesting values (such as a high proportion of missing data, or very different distributions of a
feature across multiple datasets) are highlighted in red.
• Features can be sorted by values of interest such as the number of missing values or the skew
between the different datasets.
FACETS OVERVIEW SAMPLE
FACETS OVERVIEW IMPLEMENTATIONS
• The Facets Overview implementation consists of
• Feature Statistics Protocol Buffer definition
• Feature Statistics Generation
• Visualization
• Visualization
• The visualizations are implemented as Polymer web components, backed
by TypeScript code
• It can be embedded into Jupyter notebooks or webpages.
• Feature Statistics Generation
• There are two implementations for stats generation: Python and JavaScript
• Python: uses NumPy and pandas to generate stats
• JavaScript: uses JavaScript to generate stats
• Both implementations run stats generation in the browser
FACETS OVERVIEW
FEATURE OVERVIEW SPARK
• Initial exploration attempt
• Is it possible to generate stats for larger datasets while keeping the stats size small?
• Can we generate stats leveraging the distributed computing capability
of Spark instead of just using one node?
• Can we generate the stats in Spark, and then use them from Python
and/or JavaScript?
FACETS OVERVIEW + SPARK
ScalaPB
PREPARE SPARK DATA FRAME
case class NamedDataFrame(name:String, data: DataFrame)
val features = Array("Age", "Workclass", ….)
val trainData: DataFrame = loadCSVFile("./adult.data.csv")
val testData = loadCSVFile("./adult.test.txt")
val train = trainData.toDF(features: _*)
val test = testData.toDF(features: _*)
val dataframes = List(NamedDataFrame(name = "train", train),
NamedDataFrame(name = "test", test))
SPARK FACETS STATS GENERATOR
val generator = new FeatureStatsGenerator(DatasetFeatureStatisticsList())
val proto = generator.protoFromDataFrames(dataframes)
persistProto(proto)
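persistProto is also not shown; here is a plausible version based on the findings later in the deck, writing both the raw protobuf bytes and a Base64-encoded copy (the Base64 form is the one reported to load in both the Python and JavaScript front ends). The output directory and file names are placeholders; toByteArray is the ScalaPB-generated serializer.

import java.nio.file.{Files, Paths}
import java.util.Base64

// Plausible persistProto: raw protobuf plus a Base64 copy for browser/notebook use.
def persistProto(proto: DatasetFeatureStatisticsList, dir: String = "/tmp/facets"): Unit = {
  val bytes = proto.toByteArray
  Files.createDirectories(Paths.get(dir))
  Files.write(Paths.get(dir, "stats.pb"), bytes)
  Files.write(Paths.get(dir, "stats.pb.b64"), Base64.getEncoder.encode(bytes))
}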
SPARK FACETS STATS GENERATOR
def protoFromDataFrames(dataFrames: List[NamedDataFrame],
                        features: Set[String] = Set.empty[String],
                        histgmCatLevelsCount: Option[Int] = None): DatasetFeatureStatisticsList
FACET OVERVIEW SPARK
FACET OVERVIEW SPARK
DEMO
INITIAL FINDINGS
• Implementation
• The first-pass implementation is not efficient
• We make multiple passes over the data for each feature; as the number of features grows,
performance suffers, which limits the number of features that can be used
• The size of the dataset used to generate stats also determines the size of the generated protobuf file
• I haven't dug deeper into what contributes to the change in size
• The combination of data size and feature count can produce a large file, which won't fit in the browser
• With Spark DataFrames, we can't support TensorFlow Records
• The Base64-encoded protobuf string can be loaded by Python or JavaScript
• The protobuf binary file can also be loaded by Python
• But somehow it cannot be loaded by JavaScript.
WHAT’S NEXT?
• Improve implementation performance
• When we have a lot of data and features, what’s the proper size that
generate proper stats size that can be loaded into browser or notebook ?
• For example, One experiments: 300 Features  200MB size
• How do we efficiently partition the features so that can be viewable ?
• Data is changing : how can we incremental update the stats on the regular
basis ?
• How to integrate this into production?
FINAL THOUGHTS
• We are still in the early stage of Data Platform Evolution.
• We will continue to share our experiences with you along the way.
• Questions?
Thank You
Data Science & Engineering
GoPro
Speaker Notes
1. High-level architecture of the Data Platform
• Isolation of workloads → 3 clusters (ingest, ETL, delivery); Lambda architecture; input and output data formats; cadence of clusters
• A word about data sources:
  - IoT data: logs from devices, applications (desktop and mobile), external systems and services, ERP, web/email marketing, etc. Some raw and gzip, some binary and JSON; some streaming and some batch
  - Batch data: web marketing, campaigns, social media, ERP, CRM
• Lambda architecture: both batch and stream processing
• Basic needs/workloads in a data platform: high-throughput ingestion; transformations (joins, aggregations, etc.); fast queries
• Today, we have 3 clusters to isolate these workloads. We started with one cluster (ETL) where everything ran: ingest (Flume), batch (framework), ETL (Hive), analytical (Impala), with lots of resource contention (I/O, memory, cores). To alleviate the resource contention, we opted for 3 clusters to isolate the workloads:
  - Ingest cluster for near real-time streaming: Kafka, Spark Streaming (Cloudera Parcels); input: logs, output: JSON; minutes cadence, moving towards more real-time in seconds; Induction framework for scheduled batch ingestion
  - ETL cluster for heavy-duty aggregation: input: JSON flat files, output: aggregated Parquet files; Hive (Map/Reduce); hourly cadence
  - Secure Data Mart: Kerberos, LDAP, Active Directory, Apache Sentry (Cloudera committers); input: compressed Parquet files; analytical SQL engine for Tableau, ad-hoc queries (Hue), data wrangling (Trifacta), and data science (Jupyter Notebooks and RStudio)
• With all that said, we will examine newer technologies that will enable us to simplify our architecture and merge clusters in the future. Kudu is one possible technology that could help us consolidate some of the clusters.
2. Let's take a deeper dive into our streaming ingestion…
• Logs are streamed from devices and software applications (desktop and mobile) to the web service endpoint. The endpoint is an elastic pool of Tomcat servers sitting behind an ELB in AWS. A custom servlet pushes logs into Kafka topics by environment. A series of Spark Streaming jobs process the logs from Kafka. The landing place in the ingestion cluster is HDFS with JSON flat files.
• Rationalization of tech stacks…
• Why Kafka? Unrivaled write throughput for a queue (traditional queue throughput: 100K writes/sec on the biggest box you can buy; Kafka throughput: 1M writes/sec on 3-4 commodity servers); strong ordering policy of messages; distributed; fault-tolerant through replication; supports synchronous and asynchronous writes; pairs nicely with Spark Streaming for simpler scaling out (Kafka topic partitions map directly to Spark RDD partitions/tasks).
• Why Spark Streaming? Strong transactional semantics ("exactly once" processing); leverage Spark technology for both data ingest and analytics; horizontally scalable, high throughput from micro-batching; large open source community.
3. Keyword: impedance mismatch
• As previously stated, logs are streamed from devices and software applications (desktop and mobile) to the web service endpoint. Logs are diverse: gzipped, raw, binary, JSON, batched events, streamed single events, and vary significantly in size from < 1 KB to > 1 MB.
• Logs are redirected based on data category and routed to the appropriate Kafka topic and respective Spark Streaming job. Logs move from Kafka topic to Kafka topic, with each topic having a Spark Streaming job that consumes the log, processes it, and writes it to another topic: a tree-like structure of jobs with more generic logic towards the root and more specialized logic towards the leaf nodes.
• There are generic jobs/services and specialized jobs/services. Generic services include PII removal and hashing, IP-to-geo lookups, and batched writing to HDFS. We perform batched HDFS writing since Kafka likes small messages (1 KB ideal) and HDFS likes large files (100+ MB). Specialized services contain business logic.
• Finally, the logs are written into HDFS as JSON flat files (sometimes compressed depending on the type of data). Scheduled ETL jobs perform a distributed copy (distcp) to move the data to the ETL cluster for further heavier aggregations.
4. A few things…
• Two flows of data: streaming and batch. Join data sources, aggregate data sources, convert to compressed columnar format (gzipped Parquet files).
• The ETL cluster is where we do our heavy lifting: almost entirely Hive Map/Reduce jobs, with some Impala to make the really big gnarly aggregations more performant. Previously, we had a custom Java Map/Reduce job for sessionization of events; this has been replaced with a Spark Streaming job on the ingestion cluster. In the future, we want to push as much of the ETL processing back into the ingestion cluster for more real-time processing.
• We also have a custom Java Induction framework which ingests data from external services that only make data available on slower schedules (daily, twice daily, etc.).
• The output from the ETL cluster is Parquet files that are added to partitioned managed tables in the Hive metastore. The Parquet files are then copied via distcp to the Secure Data Mart.
5. Parquet files are copied from the ETL cluster and added to partitioned managed tables in the Hive Metastore of the Secure Data Mart.
• The Secure Data Mart is protected with Apache Sentry.
• Kerberos is used for authentication (corporate standard).
• Active Directory stores the groups (corporate standard).
• Access control is role-based and the roles are assigned with Sentry. Hue has a Sentry UI app to manage authorization.
6. Store data in one place → Data (S3) + Structure (Hive Metastore)
• Separate compute nodes from storage nodes
• Elasticity → size of clusters and number of clusters
• Lower operational overhead of maintaining HDFS storage nodes
• Promote Kafka to be a centralized service (data hub)