Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
Michael Fagan & Prashant Khanolkar, Comcast
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
INTRODUCTION
WE ARE THE DATA EXPERIENCE TEAM
We empower the enterprise’s data needs:
• Metadata and Discovery
• Data Governance
• Data Security
• Tools for Analytics and Reporting
BENEFITS OF HYBRID CLOUD
• Leverage current investment in the on-prem ecosystem
• Continual build vs buy cost modeling
• On-demand provisioning vs procurement cycles
• Take advantage of public cloud’s on-demand scaling
• Take advantage of public cloud’s reliability
• Not all workloads work in a public cloud
DATA ACCESS CHALLENGES IN HYBRID CLOUD
• Network latency for data access
• Normalized security models and data access
• Managing data dependent workloads
• Cloud Export Tax
• Storage protocol mismatch
• Hiding the storage environment
OUR DATA PROCESSING ECOSYSTEM
[Diagram labels: Hybrid, Presto, Direct Connect, Hive]
Components
• Multiple Storage Sources:
A mix of Hadoop/HDFS, AWS/S3, and MinIO/S3 as the
filesystem and/or object storage.
• Alluxio:
- Universal data plane across a variety of storage
systems and clouds, so Presto can query data stored
anywhere.
- Future-proofs Comcast: we started our journey with
AWS S3 as the cloud object store and later deployed
MinIO as the on-premise object store, and we were
able to accommodate that seamlessly using the
existing Alluxio infrastructure.
• Presto Query Fabric:
- Coupled with Alluxio, enables true separation of
storage and compute while preserving data locality,
and provides memory-speed response times from the
respective storage subsystems.
- Data egress is provided by a Query Fabric that
leverages Presto as a distributed SQL query engine.
• Privacera:
Fine-grained data access governance and policies are
provided by Privacera, which leverages the Apache
Ranger architecture.
ACCESSING S3 USING PRESTO ORIGINAL DESIGN
ACCESSING S3 USING PRESTO WITHOUT A CACHE
• Presto is an open-source, distributed, SQL query engine that
is optimized for running interactive queries on large data sets.
• Presto relies on a metadata catalog, typically a Hive Metastore for HDFS;
on AWS, the Glue Data Catalog provides this essential service.
• Presto can execute queries quite efficiently because it runs entirely in
memory, but it lacks any kind of persistent or transient cache.
• Subsequent or concurrent Presto queries run completely independently of
each other and do not benefit from any form of caching or data sharing.
• Without a caching solution at the intersection of on-premise and cloud,
we incur data egress charges even when the same data is egressed
multiple times by the same or similar queries.
• These costs, which we refer to as the Cloud Export Tax, can and do add up
over time.
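The way these charges accumulate can be shown with a little arithmetic. The sketch below compares egress cost when every query pulls the data out of the cloud against the case where a cache serves repeat reads. The dataset size, run count, and price per GB are hypothetical values chosen for illustration (not figures from this talk or from AWS pricing), and the model assumes a cached dataset is egressed exactly once.

```python
def egress_cost_usd(dataset_gb: float, runs: int, price_per_gb: float, cached: bool) -> float:
    """Estimate egress charges for `runs` queries that each scan `dataset_gb`.

    Assumption: with a cache, the data leaves the cloud only on the first
    read; without one, every run egresses the full dataset again.
    """
    if runs <= 0:
        return 0.0
    egress_events = 1 if cached else runs
    return dataset_gb * egress_events * price_per_gb


# Hypothetical inputs, chosen only to illustrate the arithmetic.
DATASET_GB = 500.0
RUNS = 40
PRICE_PER_GB = 0.09  # assumed rate, not a quoted AWS price

without_cache = egress_cost_usd(DATASET_GB, RUNS, PRICE_PER_GB, cached=False)
with_cache = egress_cost_usd(DATASET_GB, RUNS, PRICE_PER_GB, cached=True)

print(f"without cache: ${without_cache:,.2f}")
print(f"with cache:    ${with_cache:,.2f}")
print(f"saved:         ${without_cache - with_cache:,.2f}")
```

Under this simplified model the uncached cost grows linearly with the number of runs while the cached cost stays flat, which is why repeated or similar queries dominate the bill.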
ACCESSING S3 DATA USING PRESTO AND ALLUXIO
ACCESSING S3 USING PRESTO AND ALLUXIO CACHING
• Since Presto lacks a cache, we chose to leverage Alluxio as a caching layer
for Presto.
• Specific S3 buckets are mounted to Alluxio, depending on whether they need
caching for performance or to reduce AWS egress costs.
• External tables are created on top of the Alluxio path, which points to the
respective under store (in this case, an S3 bucket).
• Alluxio allows Presto to access data regardless of data source and
transparently caches frequently accessed data.
• We can either let the cache warm up naturally as queries run against it,
or preload all files in an Alluxio directory, in advance or on demand,
using the Alluxio interface.
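The mount, external-table, and preload steps above can be sketched as three small helpers that build the corresponding commands. This is an illustration under assumptions, not the configuration used in the talk: the bucket, mount point, table name, columns, Parquet format, and master address are hypothetical, and the `alluxio fs mount` and `alluxio fs load` invocations follow the Alluxio 2.x command-line interface, so they should be checked against the installed version.

```python
import shlex
from typing import List, Tuple


def mount_command(alluxio_path: str, s3_uri: str, read_only: bool = True) -> str:
    """Build an `alluxio fs mount` invocation for an S3 bucket or prefix."""
    parts = ["alluxio", "fs", "mount"]
    if read_only:
        parts.append("--readonly")
    parts += [alluxio_path, s3_uri]
    return " ".join(shlex.quote(p) for p in parts)


def external_table_ddl(table: str, columns: List[Tuple[str, str]],
                       master: str, alluxio_path: str) -> str:
    """Build Hive DDL for an external table whose location is an Alluxio path."""
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (f"CREATE EXTERNAL TABLE {table} ({cols}) "
            f"STORED AS PARQUET "  # storage format is an assumption
            f"LOCATION 'alluxio://{master}{alluxio_path}'")


def preload_command(alluxio_path: str) -> str:
    """Build an `alluxio fs load` invocation that warms the cache for a path."""
    return " ".join(shlex.quote(p) for p in ["alluxio", "fs", "load", alluxio_path])


# Hypothetical names used only for illustration.
MASTER = "alluxio-master:19998"
print(mount_command("/mnt/events", "s3://example-bucket/events"))
print(external_table_ddl("events", [("event_id", "BIGINT"), ("payload", "STRING")],
                         MASTER, "/mnt/events"))
print(preload_command("/mnt/events"))
```

Because the table location is the Alluxio path rather than the S3 URI, queries against the table go through the cache without any change to the SQL itself.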
ON-PREM SPARK DATA PIPELINES
ACCESSING S3 USING SPARK AND ALLUXIO
• We have several Spark jobs that run from our on-premise Spark cluster that
need to access data from AWS S3 and join it against other on-prem data
sources.
• Alluxio provides fast storage access and sharing of data across Spark jobs.
• Alluxio hides the real data path in the persistent under storage from
Spark, which provides a single data access plane.
• Spark jobs can consume AWS S3 data either through the Query
Fabric/Presto or natively via the Alluxio filesystem path.
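A minimal sketch of the native path, assuming Parquet data and a hypothetical mount table: one helper translates an under-store URI into the Alluxio URI that fronts it, and a second joins the cloud data with an on-prem dataset. The `spark` argument is expected to be an existing `pyspark.sql.SparkSession` with the Alluxio client available on its classpath; every name, path, and the join key here is illustrative.

```python
from typing import Dict


def to_alluxio_uri(ufs_uri: str, mounts: Dict[str, str], master: str) -> str:
    """Translate an under-store URI to the Alluxio URI that fronts it.

    `mounts` maps Alluxio mount points to under-store prefixes. The longest
    matching prefix wins so nested mounts resolve to the most specific one.
    """
    by_prefix_length = sorted(mounts.items(), key=lambda kv: len(kv[1]), reverse=True)
    for mount_point, prefix in by_prefix_length:
        base = prefix.rstrip("/")
        if ufs_uri == base or ufs_uri.startswith(base + "/"):
            return f"alluxio://{master}{mount_point.rstrip('/')}{ufs_uri[len(base):]}"
    raise ValueError(f"{ufs_uri} is not under any mounted prefix")


def join_cloud_with_onprem(spark, cloud_path: str, onprem_path: str, key: str):
    """Join data read through Alluxio with an on-prem dataset.

    The job only sees filesystem paths, so the cloud side can be an
    alluxio:// URI while the other side is, for example, an hdfs:// URI.
    """
    cloud_df = spark.read.parquet(cloud_path)
    onprem_df = spark.read.parquet(onprem_path)
    return cloud_df.join(onprem_df, on=key, how="inner")


# Hypothetical mount table and master address.
MOUNTS = {"/mnt/events": "s3://example-bucket/events"}
print(to_alluxio_uri("s3://example-bucket/events/2020/part-0.parquet",
                     MOUNTS, "alluxio-master:19998"))
```

The join helper never mentions S3, which is the point of the single data access plane: swapping the under store changes the mount table, not the job.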
ASSESSING SPARK ETL PERFORMANCE
ALLUXIO FOR ENHANCING SPARK ETL
• In the past we had to copy data from AWS S3 to HDFS for Spark jobs
that ran on-premise but needed to access data from the cloud.
• We realized an immediate improvement in Spark ETL job performance
after introducing Alluxio in our ecosystem, and operations have become
a lot simpler because we no longer need to manage data copies.
• Alluxio provides data locality to compute frameworks like Spark.
• Alluxio improves completion times and reduces performance variability
for Spark pipelines to the cloud.
• The attached chart illustrates performance gains that we realized.
• There are cloud egress cost savings as well. These are hard to quantify
because they grow with the number of jobs that run against cached
copies.
Query type      S3 access via data copy to HDFS   S3 access via QF / Presto   S3 access via Alluxio
Count query     30.583 sec                        4.681 sec                   2.781 sec
Limit query     6.531 sec                         3.388 sec                   2.664 sec
Complex query   37.129 sec                        12.701 sec                  5.810 sec
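To read the reported timings as relative gains, the short sketch below divides each baseline time by the Alluxio time for the same query. The numbers are copied from the timings above; the dictionary keys and the `speedup` helper are names introduced here for illustration.

```python
# Timings in seconds, copied from the reported measurements.
TIMINGS = {
    "count":   {"hdfs_copy": 30.583, "presto": 4.681, "alluxio": 2.781},
    "limit":   {"hdfs_copy": 6.531, "presto": 3.388, "alluxio": 2.664},
    "complex": {"hdfs_copy": 37.129, "presto": 12.701, "alluxio": 5.810},
}


def speedup(baseline_s: float, improved_s: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    return baseline_s / improved_s


for query, t in TIMINGS.items():
    vs_copy = speedup(t["hdfs_copy"], t["alluxio"])
    vs_presto = speedup(t["presto"], t["alluxio"])
    print(f"{query:8s} vs HDFS copy: {vs_copy:.1f}x   vs QF/Presto: {vs_presto:.1f}x")
```

The ratios vary considerably by query type, so a single headline speedup would hide the fact that the limit query gains far less than the count and complex queries.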
ADDRESSING THE CHALLENGES
• Network latency for data access: Alluxio
• Normalized security models and data access: Alluxio, Custom Broker, Privacera
• Managing data dependent workloads: Alluxio, Atlas
• Cloud Export Tax: Alluxio
• Storage protocol mismatch: Alluxio
• Hiding the storage environment: Alluxio, Presto