Alluxio Global Online Meetup
November 10, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speaker(s):
Leonardo Militano, ZHAW
In most distributed storage systems, the data nodes are decoupled from the compute nodes. This is motivated by improved cost efficiency, better storage utilization, and the ability to scale computation and storage independently. While these advantages are indisputable, there are several situations where moving computation close to the data brings important benefits. Whenever the stored data is to be processed for analytics purposes, all of it has to be moved repeatedly from the storage cluster to the compute cluster, which reduces performance.
In this talk, we will present how, using Alluxio, the computation and storage ecosystems can interact better by following the "bringing the data close to the code" approach. Moving away from the complete disaggregation of computation and storage, data locality can improve computation performance. We will present our observations and test results, which show important gains in accelerating Spark data analytics on Ceph object storage using Alluxio.
2. Agenda
● Introduction to Cloud Storage
● Solutions for data analytics based on data locality
● Alluxio based solution for data analytics
● Performance evaluation
● Conclusions
3. Service Engineering group
● The SE group at InIT, Zurich University of Applied Sciences
(ZHAW), Switzerland
● Core expertise: IaaS, PaaS, SaaS, virtualization
● Focus is on scalable and reliable implementation of
IT-based services
● Research Initiatives:
○ Cloud (infrastructure, platform, CI/CD, DevOps, CNA)
○ Robotics (cloud robotics, ROS)
● Blog: https://blog.zhaw.ch/icclab/
5. Storage in the Cloud
● The global storage market is growing by 25.8% annually and is predicted to reach a value of $74.94 billion in 2021
● Increasing demand for data storage:
○ IDC expects data to grow 61% to 163 ZB by 2025
○ By 2025, 49 percent of data will be stored in public cloud environments
● At the same time, there is a paradigm shift, with more data created, stored and processed at the edge
● Data is the new oil!
6. Data analytics
● If data is the new oil, it needs to be processed into higher-order
products to benefit from its value
● Disaggregation of storage and compute for cost efficiency and
manageability is the common approach
○ Data is remote to the compute nodes
● Bringing the code to the data (e.g., computational storage) or
bringing the data close to the code (e.g., in-memory
computation)?
○ Data locality benefits bandwidth, power consumption, cost, latency, and security
7. Ceph storage
● Ceph is a unified, distributed storage system with self-managing and self-healing features, providing object storage, block storage and file storage
● We ran experiments on Ceph object classes for active storage, which showed substantial time savings
8. Alluxio for Memory Speed Computation
● Alluxio on the compute nodes allows for in-memory computation and fast data
analysis
Source: alluxio.io
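The idea behind this slide can be illustrated with a toy read-through cache. This is only a minimal sketch of the general principle (keep a local in-memory copy after the first remote read), not Alluxio's actual implementation; the class `ReadThroughCache`, the function `slow_remote_fetch`, the key `bucket/file.txt` and the simulated latency are all hypothetical.

```python
import time


class ReadThroughCache:
    """Toy in-memory read-through cache in front of a slow backing store."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # hypothetical callable: key -> bytes
        self._memory = {}                  # key -> bytes held in local memory
        self.remote_reads = 0              # how often the backing store was hit

    def read(self, key):
        if key not in self._memory:
            # Cold access: read from the remote store once and keep a local copy.
            self._memory[key] = self._fetch_remote(key)
            self.remote_reads += 1
        # Warm access: served from local memory, no remote read.
        return self._memory[key]


def slow_remote_fetch(key):
    """Stand-in for a remote object read; the sleep simulates network latency."""
    time.sleep(0.05)
    return b"line\n" * 1000


cache = ReadThroughCache(slow_remote_fetch)
first = cache.read("bucket/file.txt")   # goes to the (simulated) remote store
second = cache.read("bucket/file.txt")  # served from local memory
```

Only the first `read` pays the remote cost; every repeated access to the same key is answered from memory, which is why the comparisons in this talk focus on the second access to a file.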
9. The framework used for testing
● Ceph (Mimic release) storage cluster
○ 6 OpenStack VMs: 1 Ceph monitor, 3 OSDs,
1 RGW, 1 Admin node
● Total storage size of 420GiB over 7
OSD volumes
● Alluxio cluster (v2.3 and v2.4)
● Spark (v3.0.0)
● Scala application on Spark
● Find more details in our blog post
10. Two compute cluster configurations
● Single-node:
○ One VM (16 vCPUs) running both Alluxio and Spark, with 40GB of memory for the worker node
● Cluster-mode:
○ Two Spark/Alluxio worker nodes (16vCPUs, 40GB memory)
● Scala application over Spark
○ repeated access to a text file
○ count operation over the lines in the file
● Overall execution time was compared for different file sizes:
○ Alluxio-based vs. direct Ceph access
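The test procedure above (repeated access to a text file, a line count, and a comparison of execution times) can be sketched as a small timing harness. The talk used a Scala application on Spark; the plain-Python version below is a stand-in under stated assumptions, with a small local temporary file instead of the 1GB, 5GB and 10GB files, and with hypothetical helper names `count_lines` and `time_repeated_access`.

```python
import os
import tempfile
import time


def count_lines(path):
    """Count the lines of a text file by streaming over it."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def time_repeated_access(count_fn, path, accesses=2):
    """Call count_fn(path) several times; return (line_count, seconds) per access."""
    results = []
    for _ in range(accesses):
        start = time.perf_counter()
        lines = count_fn(path)
        results.append((lines, time.perf_counter() - start))
    return results


# Small local stand-in file (assumption: the real tests read much larger files
# through Alluxio or directly from Ceph).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("some text\n" * 10_000)
    sample_path = tmp.name

timings = time_repeated_access(count_lines, sample_path, accesses=2)
for access, (lines, seconds) in enumerate(timings, start=1):
    print(f"access {access}: {lines} lines in {seconds:.4f}s")

os.remove(sample_path)  # clean up the stand-in file
```

In a Spark setting, `count_fn` would instead wrap a Spark line count on the file, run once against an Alluxio path and once against the Ceph object store path, so that the per-access times of the two variants can be compared.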
13. Summary of results
● Single-node setup:
○ On the second access, reading the file directly from Ceph takes 75 times longer than reading it through Alluxio for the 1GB file, and 111 and 107 times longer for the 5GB and 10GB files
● Cluster-mode setup:
○ On the second access, reading the file directly from Ceph takes 35 times longer than reading it through Alluxio for the 1GB file, and 57 and 65 times longer for the 5GB and 10GB files
● NB! These results were obtained using Java 8 (a prerequisite of Alluxio v2.3)
○ Direct Ceph file access with Spark performs much better with Java 11 than with Java 8!
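To put the factors above in perspective, the short sketch below expresses them as orders of magnitude. The factors are the ones reported on this slide (Java 8, Alluxio 2.3); the helper names `speedup` and `orders_of_magnitude` are hypothetical, and no raw timings are assumed.

```python
import math

# Second-access slowdown of direct Ceph reads relative to Alluxio,
# as reported on the summary slide (Java 8, Alluxio 2.3).
slowdown = {
    "single-node": {"1GB": 75, "5GB": 111, "10GB": 107},
    "cluster-mode": {"1GB": 35, "5GB": 57, "10GB": 65},
}


def speedup(t_direct, t_cached):
    """Speedup factor of the cached path over the direct path."""
    return t_direct / t_cached


def orders_of_magnitude(factor):
    """Base-10 logarithm of a speedup factor."""
    return math.log10(factor)


for setup, by_size in slowdown.items():
    for size, factor in by_size.items():
        exponent = orders_of_magnitude(factor)
        print(f"{setup:12s} {size:>4s}: {factor:4d}x  (~10^{exponent:.1f})")
```

All reported factors lie between 35x and 111x, that is, between roughly one and a half and two orders of magnitude.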
15. Testing Alluxio 2.4
● The benefit is smaller because execution times are generally lower with Java 11
● Even so, the second access to a 10GB file is still 6 times faster than with direct Ceph access
● Alluxio 2.4 thus removes an important limitation of previous versions
16. Conclusions
● Alluxio enables memory-speed data access by eliminating
remote data reads for repeated accesses
● Our results show that both the single-node and the cluster-mode setups improve second-access times by one to two orders of magnitude (35 to 111 times) with Java 8
● Alluxio 2.3 required Java 8 (while the default Java version is Java 11), which was a limiting factor
● Java 11 support in Alluxio 2.4 is essential to retain the performance improvements over direct backend storage access