Building a big data intelligent application on top of xPatterns, using tools that leverage Spark, Shark, Mesos, Tachyon and Cassandra; Jaws, the open-sourcing of our own Spark SQL RESTful service; our contributions to the Spark and Mesos projects; and lessons learned.
• Hadoop MR -> Spark
o core + graphx
• Hive -> Shark -> Spark SQL
o CLI + SharkServer2 + ... Jaws!
• No resource manager -> Mesos
o Spark Job Servers, Jaws, SharkServer2, Hadoop, Aurora
• No cache -> Tachyon
o sharing data between contexts, satellite cluster file system, faster for long-running queries … GC-friendlier, survives JVM crashes
• Hadoop distro dashboards -> Ganglia (+Nagios & Graphite)
• Databricks certified!
Hadoop to BDAS
• Jaws, the xPatterns HTTP Spark SQL server, open-sourced today!
http://github.com/Atigeo/http-spark-sql-server
Backward compatible with Shark and the Spark 0.x stack
• Spark Job Server
multiple Spark contexts in same JVM, job submission in Java + Scala
• Mesos framework starvation bug
submitted patch… detailed Tech Blog link soon at http://xpatterns.com
• *SchedulerBackend update, job cancellation in Mesos fine-grained mode, 0.9.0 patches (shuffle spill, Mesos fine-grained)
BDAS to BDAS++
• 0.8.0 - first POC … lots of OOM
• 0.8.1 – first production deployment, still lots of OOM
20 billion healthcare records, 200 TB of compressed hdfs data
Hadoop MR: 100 m1.xlarge (4c x 15GB)
BDAS: 20 cc2.8xlarge (32c x 60.8 GB), still lots of OOM on the map & reducer side
Perf gains of 4x to 40x, required individual dataset and query fine-tuning
Mixed Hive & Shark workloads where it made sense
Daily processing reduced from 14 hours to 1.5 hours!
• 0.9.0 - fixed many of the problems, but still requires patches! (spill & Mesos fine-grained); spilling on the reducer side fixed (fewer OOMs)
• 0.9.1 – in production today
• 1.0 upgrade in progress, Jaws being migrated to Spark SQL
Spark … 0.8.0 to 1.0
• Highly available, scalable and resilient distributed download tool exposed through a RESTful API & GUI
• Supports encryption/decryption, compression/decompression, automatic backup & restore
(aws S3) and geo-failover (hdfs and S3 in both us-east and us-west ec2 regions)
• Supports multiple input sources: sftp, S3 and 450+ sources through Talend Integration
• Configurable throughput (number of parallel Spark processors, in both fine-grained and
coarse-grained Mesos modes)
• File Transfer log and file transition state history for auditing purposes (pluggable persistence
model, Cassandra/hdfs), configurable alerts, reports
• Ingest + Backup: download + decompression + hdfs persistence + encryption + S3 upload
• Restore: S3 download + decryption + decompress + hdfs persistence
• Geo-failover: backup on S3 us-east + restore from S3 us-east into west-coast hdfs + backup on
S3 us-west
• Ingestion jobs can be resumed from any stage after failure (# of Spark task retries exhausted)
• Logs, job status and progress pushed asynchronously to GUI through web sockets
• Http streaming API exposed for high-throughput push model ingestion (ingestion into Kafka
pub-sub, batch Spark job for transfer into hdfs)
Distributed Data Ingestion API & GUI
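The resume-from-any-stage behavior described above can be sketched as a checkpointed stage chain. This is a hypothetical illustration only (the stage names and checkpoint structure are invented for the sketch), not the actual xPatterns implementation:

```python
# Hypothetical sketch of a resumable ingestion pipeline: each stage is
# checkpointed, so after a failure (e.g. Spark task retries exhausted)
# the job restarts from the first incomplete stage, not from scratch.
STAGES = ["download", "decompress", "hdfs_persist", "encrypt", "s3_upload"]

def run_pipeline(job, checkpoint=None):
    """Run all stages, skipping those already completed per `checkpoint`."""
    done = set(checkpoint or [])
    for stage in STAGES:
        if stage in done:
            continue  # already completed in a previous attempt, skip it
        job[stage] = "ok"  # stand-in for the real stage's work
        done.add(stage)
    return sorted(done)
```

A rerun after a failure would pass the list of completed stages as the checkpoint, so only the remaining stages execute.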
T-Component API & GUI
• Data Transformation component for building a data pipeline with monitoring and quality gates
• Exposes all of Oozie’s action types and adds Spark (Java & Scala) and Shark (QL) stages
• Uses our own Spark JobServer (multiple Spark contexts in same JVM!)
• Spark stage required to run code that accepts an xPatterns-managed Spark context (coarse-grained or
fine-grained) as parameter
• DAG and job execution info persistence in Hive Metastore
• Exposes full API for job, stages, resources management and scheduled pipeline execution
• Logs, job status and progress pushed asynchronously to GUI through web sockets
• T-component DAG executed by Oozie
• Shark stage executed through shark CLI for now (SharkServer2 in the future)
• Support for pySpark stage coming soon
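The pipeline-with-quality-gates idea above can be sketched as a tiny stage runner, where each transformation's output must pass a gate predicate before the next stage runs. The stage and gate functions here are illustrative only:

```python
# Hypothetical sketch of a data pipeline with quality gates: each
# stage's output is validated before the next stage executes, halting
# the pipeline on a failed gate (the real T-Component runs via Oozie).
def run_dag(stages, data):
    """stages: ordered list of (transform, quality_gate) pairs."""
    for transform, gate in stages:
        data = transform(data)
        if not gate(data):
            raise RuntimeError("quality gate failed, pipeline halted")
    return data

stages = [
    (lambda rows: [r for r in rows if r is not None],   # drop nulls
     lambda rows: len(rows) > 0),                        # gate: non-empty
    (lambda rows: [r * 2 for r in rows],                 # transform
     lambda rows: all(isinstance(r, int) for r in rows)),  # gate: all ints
]
```

Running `run_dag(stages, [1, None, 2])` drops the null, doubles the rest, and returns the cleaned output.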
• Jaws: a highly scalable and resilient restful (http) interface on top of a managed Shark session
that can concurrently and asynchronously submit Shark queries, return persisted results
(automatically limited in size or paged), execution logs and job information (Cassandra or hdfs
persisted).
• Jaws can be load balanced for higher availability and scalability and it fuels a web-based GUI
that is integrated in the xPatterns Management Console (Warehouse Explorer)
• Jaws exposes configuration options for fine-tuning Spark & Shark performance and running
against a stand-alone Spark deployment, with or without Tachyon as in-memory distributed
file system on top of HDFS, and with or without Mesos as resource manager
• Shark editor provides analysts and data scientists with a view into the warehouse through a metadata explorer, a query editor with intelligent features like auto-complete, a results viewer, a logs viewer, and historical queries for asynchronously retrieving persisted results, logs and query information for both running and historical queries
• web-style pagination and query cancellation, spray.io HTTP layer (REST on Akka)
• Open Sourced at the Summit! http://github.com/Atigeo/http-shark-server
• Jaws will be upgraded to Spark SQL soon!
Jaws REST SharkServer & GUI
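The web-style pagination over persisted results that Jaws offers can be sketched as an offset/limit reader. This is only an illustration of the paging idea; the real Jaws API shape and storage layout may differ:

```python
# Hypothetical sketch of web-style pagination over persisted query
# results: each call returns one page plus the offset of the next page
# (None when the result set is exhausted).
def get_page(results, offset, limit):
    """Return (page_of_rows, next_offset_or_None)."""
    page = results[offset:offset + limit]
    next_offset = offset + limit if offset + limit < len(results) else None
    return page, next_offset
```

A client would loop, feeding each returned `next_offset` into the following request until it is `None`.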
Export to NoSql API
• Datasets in the warehouse need to be exposed to high-throughput low-latency real-time
APIs. Each application requires extra processing performed on top of the core datasets,
hence additional transformations are executed for building data marts inside the
warehouse
• Exporter tool builds the efficient data model and runs an export of data from a Shark/Hive
table to a Cassandra Column Family, through a custom Spark job with configurable
throughput (configurable Spark processors against a Cassandra ring) (instrumentation
dashboard embedded; logs, progress and instrumentation events pushed through SSE)
• Data Modeling is driven by the read access patterns provided by an application engineer
building dashboards and visualizations: lookup key, columns (record fields to read), paging,
sorting, filtering
• The end result of a job run is a REST API endpoint (instrumented, monitored, resilient, geo-replicated) that uses the underlying generated Cassandra data model and fuels the data in the dashboards
• Configuration API provided for creating export jobs and executing them (ad-hoc or
scheduled).
• Logs, job status and progress pushed asynchronously to GUI through web sockets
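How a read access pattern (lookup key, columns, sorting) can drive a wide-row layout, as described above, can be sketched in a few lines. The record fields and helper below are invented for illustration; the exporter's actual Cassandra schema generation is more involved:

```python
# Hypothetical sketch: derive a wide-row layout from a read access
# pattern. One row per lookup key; only the requested record fields are
# kept, pre-sorted so paged reads come back in order.
def build_wide_rows(records, lookup_key, columns, sort_by):
    rows = {}
    for rec in records:
        projected = {c: rec[c] for c in columns}  # keep requested fields only
        rows.setdefault(rec[lookup_key], []).append(projected)
    for key in rows:
        rows[key].sort(key=lambda r: r[sort_by])  # pre-sorted for paging
    return rows
```

For example, exporting claim records keyed by member id with `columns=["date", "cost"]` and `sort_by="date"` yields one sorted wide row per member, matching the dashboard's lookup-then-page read path.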
Demo: building a simple recommender system using a memory-based method … vector similarity
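The demo's memory-based, vector-similarity approach can be sketched in plain Python. The toy ratings data and function names below are invented; the demo itself runs on the Spark stack:

```python
import math

# Minimal memory-based recommender using cosine similarity between
# user rating vectors: score items the user hasn't seen by the
# similarity-weighted ratings of other users.
def cosine(u, v):
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def recommend(ratings, user):
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, r in their.items():
            if item not in ratings[user]:  # only unseen items
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)
```

At scale the same computation is expressed as Spark jobs over the warehouse tables rather than in-memory dicts.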
The logical architecture diagram shows the three logical layers of xPatterns (Infrastructure, Analytics and Visualization) and the roles: ELT Engineer, Data Scientist and Application Engineer.
xPatterns is a big data analytics platform as a service that enables rapid development of enterprise-grade analytical applications. It provides tools, API sets and a management console for building an ELT pipeline with data monitoring and quality gates; a data warehouse for ad-hoc and scheduled querying, analysis, model building and experimentation; tools for exporting data to NoSql and a solrCloud cluster for real-time access through low-latency/high-throughput APIs; as well as dashboard and visualization APIs/tools leveraging the available data and models. In this presentation we showcase one of the analytical applications built on top of xPatterns for our largest customer, which runs xPatterns in production on top of a data warehouse of several hundred TB of medical, pharmacy and lab data, tens of billions of records in total. We showcase the xPatterns components, in the form of APIs and tools, employed throughout the entire lifecycle of this application.
The physical architecture diagram for our largest customer deployment, demonstrating the enterprise-grade attributes of the platform (scalability, high availability, performance, resilience, manageability) while providing means for geo-failover (warehouse), geo-replication (real-time DB), data and system monitoring, instrumentation, and backup & restore.
Cassandra rings are DC-replicated across EC2 east and west coast regions, data between geo-replicas synchronized in real time through an ipsec tunnel (VPC-to-VPC).
Geo-replicated APIs behind an AWS Route 53 DNS service (latency-based resource record sets) and ELBs ensure user requests are served from the closest geographical location. Failure of an entire region (it happened to us during a big conference!) does not affect our availability and SLAs.
User-facing dashboards are served from Cassandra (real-time store), with data being exported from a data warehouse (Shark/Hive) built on top of a Mesos-managed Spark/Hadoop cluster.
Export jobs are instrumented and provide a throttling mechanism to control throughput.
Export jobs run on the east-coast only, data is synchronized in real time with the west coast ring. Generated apis are automatically instrumented (Graphite) and monitored (Nagios).
This is by no means a complete architecture … Missing here: Aurora, Sharkserver2, iPython Notebook nodes, model publishing services, solrCloud(lucene) cluster, RabbitMQ
Hadoop -> Spark: faster distributed computing engine leveraging in-memory computation at a much lower operational cost, machine learning primitives, simpler programming model (Scala, Python, Java), faster job submission, shell for quick prototyping and testing, ideal for our iterative algorithms
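The "simpler programming model" point can be illustrated without a cluster. The snippet below is not Spark itself, just plain Python standing in for the functional, in-memory style (chained transformations over a cached dataset) that made porting iterative MR jobs attractive:

```python
from functools import reduce

# Plain-Python stand-in for Spark's functional style: a cached dataset
# flows through lazy transformations and a final action, instead of
# hand-written Hadoop MR mapper/reducer classes.
records = ["3", "7", "11", "2"]            # stands in for a cached RDD
parsed = map(int, records)                  # transformation
big = filter(lambda x: x > 2, parsed)       # transformation
total = reduce(lambda a, b: a + b, big)     # action
```

In Spark the same chain would be `rdd.map(...).filter(...).reduce(...)` over partitions held in memory, which is what makes iterative algorithms cheap to re-run.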
Hive -> Shark: interactive queries on large datasets have become reasonable requests (in-memory caching yields 4-20x performance improvements); ELT script base migration required minimal effort (same familiar HiveQL, with a few exceptions)
No resource manager -> Mesos: multiple workloads from multiple frameworks can co-exist and fairly consume the cluster resources. More mature than YARN, it allows us to separate production from experimentation workloads, co-locate legacy Hadoop MR jobs, multiple Shark servers (Jaws), multiple Spark Job servers and mixed Hive and Shark queries (ELT), and establish priority queues: no more unmanageable contention and delayed execution, while maximizing cluster utilization (dynamic scheduling)
No cache -> Tachyon: in-memory distributed file system with HDFS backup, resilience through lineage rather than replication; our out-of-process cache that survives Spark JVM restarts and allows fine-tuning performance and experimenting against cached warehouse tables without reload. Faster than an in-process cache, which is subject to GC pauses. Provides data sharing between multiple Spark/Shark jobs, and efficient in-memory columnar storage with compression support for a minimal footprint
Hadoop distro dashboards-> Ganglia: distributed monitoring system for dashboards with historical metrics data (CPU, RAM, disk I/O, network I/O) and Spark/Hadoop metrics. This is a nice addition to our Nagios (monitoring and alerts) and Graphite (instrumentation dashboards)
Thank you team Alpha for your work on Jaws!
Different datasets require different cluster configurations; the biggest problem of Spark 0.8.1 is that intermediate output and reducer input are not spilled to disk, causing frequent OOMs (0.9.0 solves the reducer part).
Processing pipeline, a mixture of custom MR and mostly Hive scripts, converted to Spark and Shark, with performance gains of 3-4x (for disk intensive operations) to 20-40x for queries on cached tables (Spark cache or Tachyon which is slightly faster with added resilience benefits)
Shark 0.8.1 does not support map-join auto-conversion, automatic calculation of the number of reducers, disk spills in the map-output or reduce phases, skew joins etc. We either manually fine-tune the cluster and the query for the specific dataset, or we are better off with Hive under these circumstances … so we use Mesos to manage Hadoop and Spark on the same cluster, mixing Hive and Shark workloads
Tested against multiple cluster configurations of the same cost, using 3 types of instances: m1.xlarge (4c x 15GB), m2.4xlarge (8c x 68.4GB) and cc2.8xlarge (32c x 60.8GB).
set mapreduce.job.reduces=…
set shark.column.compress=true
spark.default.parallelism=…
spark.storage.memoryFraction=0.3
spark.shuffle.memoryFraction=0.6
spark.shuffle.consolidateFiles=true
spark.shuffle.spill=false|true
Jaws: a highly scalable and resilient restful (http) interface on top of a managed Shark session that can concurrently and asynchronously submit Shark queries, return persisted results (automatically limited in size), execution logs and job information (Cassandra or hdfs persisted).
Jaws can be load balanced for higher availability and scalability and it fuels a web-based GUI called Shark Editor that is integrated in the xPatterns Management Console
Jaws exposes configuration options for fine-tuning Spark & Shark performance and running against a stand-alone Spark deployment, with or without Tachyon as in-memory distributed file system on top of HDFS, and with or without Mesos as resource manager
Provides different deployment recipes for all combinations of Spark, Mesos and Tachyon
Shark editor provides analysts, data scientists with a view into the warehouse through a metadata explorer, provides a query editor with intelligent features like auto-complete, a results viewer, logs viewer and historical queries for asynchronously retrieving persisted results, logs and query information for both running and historical queries (DEMO)
Datasets in the warehouse need to be exposed to high-throughput low-latency real-time APIs. Each application requires extra processing performed on top of the core datasets, hence additional transformations are executed for building data marts inside the warehouse
Pre-optimization Shark/Hive queries required for building an efficient data model for Cassandra persistence: minimal number of column families, wide rows (50-100 MB compressed). Resulting data model is efficient for both read (dashboard/API) and write (export/updates) requests
Exporter tool builds the efficient data model and runs an export of data from a Shark/Hive table to a Cassandra Column Family, through a custom Spark job with configurable throughput (configurable Spark processors against a Cassandra ring)
Data Modeling is driven by the read access patterns: lookup key, columns (record fields to read), paging, sorting, filtering.
The data access pattern is used for automatically publishing a REST API that uses the underlying generated Cassandra data model and fuels the data in the dashboards
Execution logs behind workflows, progress report and instrumentation events for the dashboard are pushed to the browser through websockets (RabbitMQ)
Execution logs behind workflows, progress report and instrumentation events for the dashboard are pushed to the browser through SSE (Zookeeper watchers used for synchronization)
Mesos/Spark context (CoarseGrainedMode) with a fixed 120 cores spread out across 4 nodes for the export job
Instrumentation dashboard showcasing the write latency measured during the export to noSql job (7ms max). Writes are performed against the east-coast DC … they are propagated to the west coast, however the JMX metric exposed (Write.Latency.OneMinuteRate) does not reflect it … need to build a new dashboard with different metrics!
Nagios monitoring for the geo-replicated, instrumented generated APIs. The APIs (readers) and the Spark executors (writers) have a retry mechanism (AOP aspects) that implements throttling when Cassandra is under siege …
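The retry-with-throttling idea can be sketched as exponential backoff between attempts, so readers and writers ease pressure on an overloaded Cassandra ring. This is an illustrative sketch only; the real mechanism is implemented with AOP aspects:

```python
import time

# Hypothetical sketch of retry-with-throttling: back off exponentially
# between attempts so clients reduce load on Cassandra when it is
# overloaded, re-raising only after the final attempt fails.
def with_retries(op, attempts=4, base_delay=0.01, sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, propagate the failure
            sleep(base_delay * (2 ** attempt))  # exponential backoff
```

Injecting `sleep` makes the backoff schedule easy to test without real waiting; production code would also cap the delay and retry only on transient errors.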