
Accelerate Analytics and ML in the Hybrid Cloud Era

Alluxio Webinar
September 22, 2020

For more Alluxio events: https://www.alluxio.io/events/

Speakers:
Alex Ma, Alluxio
Peter Behrakis, Alluxio

Many companies we talk to have on-premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, analytics engines such as Hive, Spark, and Presto, as well as machine learning workloads, experience sluggish response times because data and compute sit in multiple locations. We also know there is an immense and growing data management burden to support these workflows.

In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.

In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results

Accelerate Analytics and ML in the Hybrid Cloud Era

  1. Data Orchestration for the Hybrid Cloud Era. Peter Behrakis and Alex Ma, Alluxio
  2. Agenda: Market; Alluxio Vision; What is Data Orchestration; How can Alluxio help you?
  3. Data Lakes and Silos Abound. Enterprises have organically created a legacy of data silos through short-term projects and mergers and acquisitions. Data lakes and critical data often sit in silos and are challenging to access. Consolidating data lakes and silos is expensive and slow to complete. Compute is everywhere. Diagram labels: Teradata, POSIX, DB, Intern apps, Public Clouds, S3, Object, HDFS 1, HDFS 2.
  4. Four Big Trends Driving the Need for a New Architecture: separation of compute and storage; hybrid and multi-cloud environments; self-service data across the enterprise; rise of the object store.
  5. Market Summary. Data volume, velocity, and variety are avalanching: data doubles every two years*. The business knows that data analytics and ML models allow it to compete effectively*. The Hadoop investment is being replaced by object storage (on-prem and cloud). The enterprise is a multi-cloud world and will remain so for some time. Technical leadership wants the agility to run applications anywhere to sustain operations, offering users a transparent self-service experience. Technical organizations struggle to keep up with data ingest and business demands. Data is still not fully optimized, yet there are many copies costing $$$$. (* “The Fourth Industrial Revolution”, by Klaus Schwab)
  6. Alluxio’s Vision: accelerate analytics and machine learning to enable companies to grow and remain relevant regardless of where their data and compute are located. What can 2X to 5X analytics acceleration do for fraud protection, for research into treatments for diseases like COVID-19, and for uptime of all the industrial and digital technologies we depend on?
  7. What is Data Orchestration? A platform that brings your data closer to compute across clusters, regions, clouds, and countries to accelerate results.
  8. Companies Using Alluxio, by sector: Consumer; Travel & Transportation; Telco & Media; Technology; Financial Services; Retail & Entertainment; Data & Analytics Services.
  9. Companies use Alluxio to: gain faster results that matter to the business, through advanced caching technology; dramatically lower OpEx by eliminating data management and cloud egress costs, through a unified namespace and API translations; drop into existing on-prem and cloud environments with zero programming.
  10. Data Orchestration for the Cloud. Data Accessibility: translate access to optimal storage APIs over a slow network. Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface. Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver.
  11. Hybrid Data Lake with Alluxio: A Data Orchestration Approach
  12. Approaches to Hybrid Cloud. (a) Lift and Shift: migration may seem easier as no application re-architecture is needed. Issues: if workloads are not made cloud-native and elastic, infrastructure cost can skyrocket; if an on-prem data copy needs to be maintained, syncing cloud and on-prem data can be hard. (b) Data copy by workload: simple tools such as distCP are available; works for workloads with easily identifiable datasets. Issues: datasets for many workloads cannot always be identified easily; significantly more data is transferred than the workload requires; additional copies are very hard to sync back with master data. Performance can be dramatically impacted due to cloud storage limitations. (c) Compute-driven Data Caching: data is pulled into the cloud based on compute requests; data is cached locally to reduce I/O on remote clusters and is automatically synced. Issues: less helpful for workloads that don’t read a data set more than once.
  13. Our Solution: “Zero-Copy Burst”. Problem: the HDFS cluster is compute-bound and complex to maintain. Barrier 1: prohibitive network latency and bandwidth limits, which make hybrid analytics unfeasible. Barrier 2: copying data to the cloud, which brings copies that are difficult to maintain, data security and governance concerns, and the costs of another silo. Step 1: Hybrid Cloud for Burst Compute Capacity. Offload the on-prem cluster (both compute and I/O); manage the working set, not the FULL set of data; local performance; automatic synchronization with on-prem changes. Step 2: Online Migration of Data Per Policy. Flexible timing to migrate, with fewer dependencies; instead of a hard switchover, migrate at your own pace; moves the data per policy, e.g. last 7 days. Diagram: Spark, Presto, Hive, and TensorFlow sit on top of the Alluxio Data Orchestration and Control Service both on Google Cloud Platform (with GCS) and in the on-premises datacenter, with connectivity between the two.
  14. Case Studies
  15. Alluxio at Walmart: Architectural Components. Alluxio is co-located with Presto, for data locality. Automatic metadata synchronization, to create Hive tables with Alluxio mount points. Auto-scaling, to maintain a minimum number of Alluxio workers. Pin frequently used data, to avoid cache evictions.
  16. Alluxio at Walmart: Takeaways. 2x performance for range queries. High concurrency with Alluxio. Cost reduction: half the compute costs, or 2x compute capacity for the same environment.
  17. Alluxio at Adobe: Cross Data Center Access. Problem: the primary DC with a large Hadoop cluster is out of space, and ad hoc SQL workloads are growing exponentially as analyst headcount has reached 1800 people. Solution: leverage compute resources outside of the primary on-prem DC for multiple analytical frameworks, reading remote data. Results: 80% less network usage; more stable infrastructure; lower costs; results come in faster; easier to scale; ability to handle new analysts with no impact and to improve response times; self-service for end users.
  18. Alluxio at Electronic Arts (EA): Single Cloud with AWS. Up to 6x performance when handling a large number of small files. Elastic compute to reduce infrastructure costs. Reduce S3 costs by eliminating S3 access operations.
  19. Core Features Enable a Hybrid Data Lake
  20. Data Locality with Intelligent Multi-tiering: local performance from remote data using multi-tier storage. Tiers: Hot (RAM), Warm (SSD), Cold (HDD). Read and write buffering, transparent to the app. Policies for pinning, promotion/demotion, and TTL. The diagram spans on-premises and public cloud.
  21. Metadata Locality with “Active Sync”: detect on-prem changes and synchronize metadata. Diagram: a mutation replaces the old file at path /file1 with a new file at path /file1, and HDFS iNotify-based metadata synchronization carries the change to the Alluxio Master. Policies for pinning, promotion/demotion, and TTL. The diagram spans on-premises and public cloud.
  22. Policy Driven Data Migration: migrate data to cloud storage based on access policies. A single Alluxio path is backed by multiple storage systems. Example policy: migrate data older than 7 days from HDFS to S3. Diagram labels: hdfs://host:port/directory/, Reports, Sales.
  23. Reference Architecture. Components: Alluxio Master and Standby Master (Zookeeper / RAFT); Alluxio Clients; Alluxio Workers (RAM / SSD / HDD); Under Store 1 and Under Store 2, reached over the WAN. The diagram distinguishes the control path from the data path.
  24. Alluxio Catalog Service. Functionality: manages metadata for structured data; abstracts other database catalogs as an Under Database (UDB). Benefits: schema-aware optimizations; simple deployment. Diagram labels: Hive Metastore, Hive Under Database.
  25. Transformation Service: transform data to be compute-optimized independent of the storage format. Transformations: coalesce; format conversion (CSV to Parquet).
  26. Example Results. Attached an existing Hive database into the Alluxio Catalog. The Alluxio Catalog served table metadata for Presto. Transformed store_sales by coalescing and converting CSV to Parquet. Query times: Presto without Alluxio, 20s; Alluxio transformations, 7s; Alluxio transformations with caching, 3s.
  27. Questions?
  28. How can Alluxio help you? Did you learn what Alluxio Data Orchestration is? Do you have a use case Alluxio can accelerate? For follow-up questions and to discuss your situation, please contact Peter at peter@alluxio.com
  29. Additional Resources. I. Burst data lake processing to Dataproc using on-prem Hadoop data: https://cloud.google.com/blog/products/data-analytics/burst-data-lake-processing-dataproc-using-prem-hadoop-data II. Tutorial: Hybrid Cloud Bursting with GCP and Alluxio: https://docs.alluxio.io/ee/user/stable/en/tutorials/GCP-Tutorial.html III. “Zero-Copy” Hybrid Cloud for Data Analytics: https://www.alluxio.io/resources/whitepapers/zero-copy-hybrid-cloud-for-data-analytics-strategy-architecture-and-benchmark-report/ IV. Getting Started with Dataproc and Alluxio: https://docs.alluxio.io/ee/user/stable/en/cloud/Google-Dataproc.html V. Using Transparent URI: https://docs.alluxio.io/ee/user/stable/en/operation/Transparent-Uri.html
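
Slide 22 gives an example migration policy: move data older than 7 days from HDFS to S3. The selection step of such a policy could be sketched roughly as below. This is an illustration only and does not use any Alluxio API: the `FileInfo` record, the `select_for_migration` helper, the file paths, and the choice of last-modified time as the measure of age are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical file record for this sketch; not an Alluxio type."""
    path: str
    last_modified: datetime


def select_for_migration(files, now, max_age=timedelta(days=7)):
    """Return the paths whose last modification is older than max_age."""
    cutoff = now - max_age
    return [f.path for f in files if f.last_modified < cutoff]


# Illustrative data only: one file is three weeks old, the other one day old.
now = datetime(2020, 9, 22)
files = [
    FileInfo("/directory/Reports/q1.parquet", datetime(2020, 9, 1)),
    FileInfo("/directory/Sales/today.parquet", datetime(2020, 9, 21)),
]
to_migrate = select_for_migration(files, now)
print(to_migrate)
```

Only the path under Reports is selected here, because it is the only file whose assumed modification time falls before the 7-day cutoff. A real deployment would leave both selection and data movement to the policy engine instead of application code.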

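Slide 12 describes compute-driven data caching: data is pulled in when compute asks for it, then served locally on later reads, which is also why the approach helps less when a data set is read only once. A minimal sketch of that read-through behaviour follows. The `ReadThroughCache` class, the in-memory dictionary standing in for the remote store, and the path are hypothetical; they only show the idea and are not Alluxio's implementation.

```python
class ReadThroughCache:
    """Hypothetical read-through cache; not Alluxio's implementation."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # callable: path -> bytes
        self._local = {}                   # locally cached copies
        self.remote_reads = 0              # how often the remote store was hit

    def read(self, path):
        # Fetch from the remote store only on the first request for a path.
        if path not in self._local:
            self._local[path] = self._fetch_remote(path)
            self.remote_reads += 1
        return self._local[path]


# A dictionary stands in for the remote (on-prem) store in this sketch.
remote_store = {"/sales/part-0": b"rows"}
cache = ReadThroughCache(remote_store.__getitem__)

first = cache.read("/sales/part-0")   # goes to the remote store
second = cache.read("/sales/part-0")  # served from the local copy
print(cache.remote_reads)
```

The two reads return the same bytes while the remote store is contacted once. A workload that reads each path a single time would see no benefit from the cache, matching the limitation noted on the slide.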