
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon): An Open Source Memory Speed Virtual Distributed Storage - Gene Pang, Software Engineer, Alluxio

Alluxio, formerly Tachyon, is a memory speed virtual distributed storage system. The Alluxio open source community is one of the fastest growing open source communities in big data history with more than 300 developers from over 100 organizations around the world. In the past year, the Alluxio project experienced a tremendous improvement in performance and scalability and was extended with key new features including tiered storage, transparent naming, and unified namespace. Alluxio now supports a wide range of under storage systems, including Amazon S3, Google Cloud Storage, Gluster, Ceph, HDFS, NFS, and OpenStack Swift. This year, our goal is to make Alluxio accessible to an even wider set of users, through our focus on security, new language bindings, and further increased stability.

Published in: Technology

  1. Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage System. Gene Pang @ Alluxio, Inc. July 9, 2016 @ Big Data Day LA
  2. About Me: • Software Engineer @ Alluxio, Inc. • One of the core maintainers of Alluxio Open Source Project • Ph.D. @ AMPLab, UC Berkeley • Worked at Google before UC Berkeley • Twitter: @unityxx
  3. About Alluxio, Inc.: • Founded by creators and top committers of Alluxio open source project (formerly named Tachyon) • Series A by Andreessen Horowitz • http://www.alluxio.com • We are hiring!
  4. What I’ll be Covering: • Brief overview of Alluxio • Motivation for Alluxio • Alluxio Use Cases
  5. What Is Alluxio?
  6. Alluxio: Open Source, Memory Speed, Virtual, Distributed Storage System
  7. What Does That Mean?: • Open Source. One of the fastest growing project communities • Memory Speed. Memory-centric architecture designed for memory I/O • Virtual. Unified namespace abstracts storage from applications • Distributed. Designed to scale out with commodity hardware
  8. Alluxio Ecosystem
  9. Alluxio Benefits: • Flexibility. Unified namespace enables new workloads across storage systems • Agility. Quickly adapt to frameworks and storage systems of your choice • Performance. Architecture supports fast, memory-speed access to data • Cost. Grow storage and compute resources independently. Any application can access any data from any storage at memory speed.
  10. Alluxio is Open Source
  11. The Beginnings: • Started at UC Berkeley AMPLab, Summer 2012 – The same lab that produced Apache Mesos and Apache Spark • Open sourced as Tachyon, April 2013 – Apache License 2.0 – Renamed to Alluxio in February 2016 – Latest Release: Version 1.1.1 (July 2016)
  12. Contributor Growth: • Over 250 Contributors • 3x growth over the last year
  13. Alluxio Open Source Community: Over 3x increase from 1 year ago!
  14. Contributors and Users
  15. Alluxio is Memory Speed
  16. Why Use Memory for Storage?
  17. Why Memory? Performance Trend: • RAM throughput increasing exponentially • Disk throughput increasing slowly • Memory-locality key to interactive response times
  18. Why Memory? Price Trend: • DRAM is becoming inexpensive (source: jcmit.com)
  19. What if memory capacity is still not enough?
  20. Alluxio Manages Tiered Storage. Diagram: MEM, SSD, HDD (faster toward MEM, higher capacity toward HDD)
  21. Configurable Storage Tiers. Examples: MEM only; MEM + HDD; SSD only
  22. Pluggable Tier Management Policies: evict stale data to slower tier; promote hot data to faster tier
  23. Alluxio is a Virtual Distributed Storage System
  24. The Big Data Ecosystem Today
  25. This is Problematic
  26. What are the Problems?: • Costly Ecosystem Integrations • Costly ETL and Data Duplication • Data Silos • Long Cycle from Data to Value
  27. Alluxio Unifies Access to Data
  28. How to use Alluxio?
  29. Alluxio Common Use Cases: • Accelerate access to remote storage • Share data across jobs/applications at memory speed • Transparently manage data across different storage systems
  30. Accelerating Access to Remote Storage
  31. Remote I/O to Data. Diagram: Spark, Amazon S3. Every data operation requires data transfer, sometimes over the WAN. High latency, network throughput.
  32. Local I/O with Alluxio. Diagram: Spark, Alluxio, Amazon S3. Low latency, memory throughput; high latency, network throughput. Keeping data in Alluxio accelerates data access.
  33. Sharing Data at Memory Speed
  34. Sharing Data Slowly. Diagram: Spark, MapReduce, Flink, Amazon S3. Network I/O, disk I/O. I/O slows down sharing.
  35. Sharing Data at Memory Speed with Alluxio. Diagram: Spark, MapReduce, Flink, Alluxio, Amazon S3. Share data via memory.
  36. Managing Data Across Different Storage Systems
  37. Simple World. Diagram: Application 1, HDFS
  38. Adding a Storage System. Diagram: Application 1, HDFS, Amazon S3
  39. Adding a Storage System. Diagram: Application 1, Google GCS, HDFS, Amazon S3
  40. Adding an Application. Diagram: Application 1, Application 2, Google GCS, HDFS, Amazon S3
  41. Adding an Application. Diagram: Application 1, Application 2, Application 3, Google GCS, HDFS, Amazon S3. Complex, inflexible.
  42. With Alluxio. Diagram: Application 1, Alluxio, HDFS
  43. New Storage Systems and Applications. Diagram: Application 1, Application 2, Application 3, Alluxio, Google GCS, HDFS, Amazon S3. Flexible, simple: no application changes, new mount point.
  44. Alluxio in the Wild!
  45. Use Case
  46. at [logo]: • Framework: Spark SQL • Under Storage: Baidu’s File System • Storage Media: MEM + HDD • 200+ nodes deployment • 2PB+ managed space
  47. Use Case
  48. at [logo]: • Framework: Spark • Storage Media: MEM • Improvement from Hours to Seconds
  49. Use Case
  50. at [logo]: • Framework: Spark Streaming + Flink Streaming + Spark + Flink • Under Storage: Multiple HDFS clusters • Storage Media: MEM + HDD • 200+ nodes deployment • Alluxio enables previously impossible jobs to finish • 300x Performance Improvement during peak load
  51. [no text]
  52. To Get More Information: • Alluxio Project: www.alluxio.org • Alluxio, Inc: www.alluxio.com • Development: www.github.com/Alluxio/alluxio • Meet Friends: www.meetup.com/Alluxio • Email: gene@alluxio.com
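Slides 20 to 22 describe tiered storage with policies that evict stale data to a slower tier and promote hot data to a faster one. The following is a minimal Python sketch of that idea, not Alluxio's implementation: the `TieredStore` class, its method names, and the choice of least-recently-used ordering as the definition of "stale" are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredStore:
    """Toy block store with ordered tiers; tier 0 is the fastest (e.g. MEM).

    Hypothetical illustration only: "stale" is approximated by
    least-recently-used order, and capacities are counted in blocks.
    """

    def __init__(self, tiers):
        # tiers: list of (alias, capacity_in_blocks), fastest tier first.
        self.aliases = [alias for alias, _ in tiers]
        self.capacity = [cap for _, cap in tiers]
        # One OrderedDict per tier, ordered from least to most recently used.
        self.blocks = [OrderedDict() for _ in tiers]

    def _insert(self, level, block_id, data):
        tier = self.blocks[level]
        tier[block_id] = data
        tier.move_to_end(block_id)
        if len(tier) > self.capacity[level]:
            # Evict the stalest block to the next slower tier.
            stale_id, stale_data = tier.popitem(last=False)
            if level + 1 < len(self.blocks):
                self._insert(level + 1, stale_id, stale_data)
            # Past the slowest tier the block simply leaves the store.

    def _remove(self, block_id):
        for tier in self.blocks:
            if block_id in tier:
                return tier.pop(block_id)
        return None

    def write(self, block_id, data):
        # New or rewritten blocks always land in the fastest tier.
        self._remove(block_id)
        self._insert(0, block_id, data)

    def read(self, block_id):
        data = self._remove(block_id)
        if data is not None:
            # Promote hot data back to the fastest tier.
            self._insert(0, block_id, data)
        return data

    def tier_of(self, block_id):
        for alias, tier in zip(self.aliases, self.blocks):
            if block_id in tier:
                return alias
        return None
```

With `TieredStore([("MEM", 2), ("HDD", 2)])`, writing three blocks pushes the oldest one down to HDD, and reading it again moves it back to MEM while demoting another block. Swapping the LRU ordering for a different rule is what a pluggable policy would amount to in this sketch.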

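Slides 31 and 32 contrast remote I/O on every operation with local I/O once data is kept in Alluxio. A read-through cache captures the gist of that contrast. The sketch below is an assumption-laden illustration: `ReadThroughCache` and the `remote_read` callable are hypothetical names, and a plain dictionary stands in for both the memory layer and the remote store.

```python
class ReadThroughCache:
    """Keeps a copy of remote data in local memory after the first read."""

    def __init__(self, remote_read):
        # remote_read: callable taking a path and returning bytes
        # (hypothetical stand-in for a slow, remote storage client).
        self._remote_read = remote_read
        self._memory = {}
        self.remote_reads = 0  # counts how often the slow path was taken

    def read(self, path):
        if path not in self._memory:
            # Cache miss: pay the network cost once and keep the result.
            self.remote_reads += 1
            self._memory[path] = self._remote_read(path)
        # Cache hit (or freshly filled): served from local memory.
        return self._memory[path]


# Usage: a dict plays the role of the remote object store.
fake_remote = {"bucket/events.csv": b"id,value\n1,42\n"}
cache = ReadThroughCache(fake_remote.__getitem__)
first = cache.read("bucket/events.csv")   # goes to the "remote" store
second = cache.read("bucket/events.csv")  # served from memory
```

Only the first read reaches the remote store; repeated reads by the same or another job are served from memory, which is the effect the slides attribute to keeping data in Alluxio.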
×
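Slides 37 to 43 argue that a unified namespace lets a new storage system be added as a new mount point with no application changes. One way to picture this is a mount table that maps a path prefix to an under storage location and resolves each path by its longest matching prefix. The sketch below is illustrative only: `MountTable`, its methods, and the example URIs (`hdfs://namenode:9000/data`, `s3a://my-bucket/logs`) are hypothetical and are not taken from the talk.

```python
class MountTable:
    """Maps paths in one namespace onto several under storage systems."""

    def __init__(self):
        self._mounts = {}  # namespace path prefix -> under storage URI

    def mount(self, path, ufs_uri):
        # Normalize so "/s3/" and "/s3" are the same mount point.
        self._mounts[path.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, path):
        best = None
        for prefix in self._mounts:
            # Match whole path components only, so "/s3x" is not under "/s3".
            if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        suffix = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{suffix}" if suffix else base


# Usage: applications keep using namespace paths; only mounts change.
table = MountTable()
table.mount("/", "hdfs://namenode:9000/data")
table.mount("/s3", "s3a://my-bucket/logs")
hdfs_target = table.resolve("/tables/t1")
s3_target = table.resolve("/s3/2016/07/a.log")
```

Adding a third storage system in this sketch is one more `mount` call; code that reads `/tables/t1` is untouched, which mirrors the "no application changes, new mount point" point on slide 43.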