


RubiX: A caching framework for big data engines in the cloud. It provides data caching to engines such as Presto, Spark, and Hadoop transparently, without user intervention.

Published in: Engineering


  1. RubiX: A caching framework for big data engines in the cloud. Strata + Hadoop World, March 2017. Shubham Tagra (
  2. Agenda ● Intro ● Why Caching? ● Path to Rubix ● Rubix Architecture ● Future of Rubix ● Q&A
  3. Built for Anyone Who Uses Data: Data Analysts | Data Scientists | Data Engineers | Data Admins. Optimize performance, cost, and scale through automation, control, and orchestration of big data workloads. A Single Platform for Any Use Case: ETL & Reporting | Ad Hoc Queries | Machine Learning | Streaming | Vertical Apps. Open Source Engines, Optimized for the Cloud. Native Integration with multiple cloud providers.
  4. Qubole operates at Cloud Scale ● 500 PB of data processed in the cloud monthly (chart values: 6 PB, 80 PB, 150 PB, 500 PB) ● 500 nodes in the largest Spark cluster in the cloud ● 2000 clusters started per month
  5. Why Caching ● Popularity of cloud stores like S3 ○ Pros: near-infinite capacity, inexpensive, ease of use ○ Cons: network latencies, back-offs
  6. Rubix ancestors ● File cache
  7. Rubix ancestors ● File cache ○ Benefits: as much as 10x performance improvement ○ Problems ■ Huge warm-ups ■ Cache size ■ Tied to Presto ■ Required Presto scheduler changes
  8. Requirements for new cache ● Improve performance ● Abstracted from user ○ Ease of use ● Support columnar formats ○ Improves speed ● Work well with autoscaling ○ Saves cost ● Ease of extension to clouds and engines
  9. Alternatives Considered: FUSE FileSystem ● Mount S3 paths on EC2 ● Rely on the OS for page caching, read-ahead, etc. ● Problems ○ Requires exclusive control over the bucket ○ Data corruption on external updates ○ Not production ready
  10. Alternatives Considered: HTTP Caching
  11. Alternatives Considered: HTTP Caching ● Worked fine with TXT data ● Problems ○ Columnar formats and byte-range-based Varnish keys ■ Poor hit ratio ■ Redundant copies
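The hit-ratio problem with byte-range keys can be illustrated with a small simulation. This is only a sketch under stated assumptions: the key formats, the 1 MB block size, and the read pattern are hypothetical and are not taken from Varnish or from Rubix. It compares a cache keyed by the exact requested byte range with one keyed by fixed-size blocks, for overlapping reads of the kind a columnar reader might issue.

```python
BLOCK_SIZE = 1024 * 1024  # assumed block size for the block-keyed variant


def range_key(url, start, end):
    """One cache key per exact byte range (hypothetical key format)."""
    return [f"{url}|bytes={start}-{end}"]


def block_keys(url, start, end):
    """One cache key per fixed-size block touched by the inclusive range."""
    return [f"{url}|block={b}" for b in range(start // BLOCK_SIZE, end // BLOCK_SIZE + 1)]


def hit_ratio(reads, key_fn):
    """Replay the reads against an unbounded cache and return hits / lookups."""
    cache, hits, lookups = set(), 0, 0
    for url, start, end in reads:
        for key in key_fn(url, start, end):
            lookups += 1
            if key in cache:
                hits += 1
            else:
                cache.add(key)
    return hits / lookups


url = "s3://bucket/table/part-0.orc"  # hypothetical object
# Overlapping reads of the same region with slightly different boundaries.
reads = [(url, 0, 299_999), (url, 100_000, 399_999), (url, 0, 399_999), (url, 200_000, 499_999)]

range_ratio = hit_ratio(reads, range_key)
block_ratio = hit_ratio(reads, block_keys)
print(range_ratio, block_ratio)
```

In this toy trace every exact-range key is distinct, so the range-keyed cache never hits and would store the overlapping bytes several times, while the block-keyed cache serves three of the four reads from the one block it already holds.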
  12. Tachyon/Alluxio ● More than just a caching system ● We required a lightweight system ● SQL first
  13. Rubix ● Extensible to many engines ● Columnar format friendly ● Works well with autoscaling ● Shareable across engines/instances
  14. Architecture ● Split ownership assignment system ● Data Caching System ● Plugins
  15. Architecture ● Split ownership assignment system ○ Used on the master node during split computation ○ Calculates which node owns a particular split of a file ○ Uses consistent hashing to work well with autoscaling
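To make the autoscaling point concrete, here is a minimal consistent-hashing sketch. It is not Rubix's implementation: the `HashRing` class, the use of MD5, the virtual-node count, and the node and file names are all assumptions chosen for illustration. The property it demonstrates is that when a node joins, the only splits that change owner are the ones the new node takes over.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (MD5 is used only to spread keys)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper, not Rubix code)."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted((_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def owner(self, file_path: str, split_start: int) -> str:
        """Return the node owning the split of `file_path` that starts at `split_start`."""
        h = _hash(f"{file_path}:{split_start}")
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, h) % len(self._ring)
        return self._ring[idx][1]


# 1000 hypothetical 64 MB splits of one file.
splits = [("s3://bucket/table/part-0", i * 64 * 1024 * 1024) for i in range(1000)]
before = HashRing(["node-1", "node-2", "node-3"])
after = HashRing(["node-1", "node-2", "node-3", "node-4"])  # cluster scaled up by one node

changed = [(p, s) for p, s in splits if before.owner(p, s) != after.owner(p, s)]
moved = len(changed)
moved_to_new = sum(1 for p, s in changed if after.owner(p, s) == "node-4")
print(f"{moved} of {len(splits)} splits changed owner; {moved_to_new} of them moved to node-4")
```

With a plain `hash(split) % node_count` scheme, most splits would change owner after a resize and their cached data would have to be warmed up again, which is the cost consistent hashing is meant to avoid.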
  16. Architecture ● Data Caching System ○ Used on worker nodes when data is read ○ Reads from disk or remote as per the metadata ○ Metadata stored in units of blocks (1 MB each) ○ BookKeeper provides metadata for the block ○ Metadata is also checkpointed to local disk
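The block-level read path described on this slide can be sketched as follows. This is an illustration under assumptions, not the BookKeeper API: `BlockMetadata` and `plan_read` are hypothetical names, and the in-memory dictionary stands in for whatever metadata store the real service uses. Only the 1 MB block unit comes from the slide.

```python
BLOCK_SIZE = 1024 * 1024  # 1 MB, the metadata unit named on the slide


class BlockMetadata:
    """Tracks which 1 MB blocks of each file are cached on local disk (hypothetical stand-in)."""

    def __init__(self):
        self._cached = {}  # file path -> set of cached block indices

    def mark_cached(self, path, block):
        self._cached.setdefault(path, set()).add(block)

    def is_cached(self, path, block):
        return block in self._cached.get(path, set())


def plan_read(meta, path, offset, length):
    """Split a byte-range read into per-block steps labelled 'local' or 'remote'."""
    if length <= 0:
        return []
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return [
        (block, "local" if meta.is_cached(path, block) else "remote")
        for block in range(first, last + 1)
    ]


meta = BlockMetadata()
meta.mark_cached("s3://bucket/data.orc", 0)
meta.mark_cached("s3://bucket/data.orc", 1)

# A 2 MB read starting at 512 KB touches blocks 0, 1 and 2.
plan = plan_read(meta, "s3://bucket/data.orc", offset=512 * 1024, length=2 * BLOCK_SIZE)
print(plan)
```

Here blocks 0 and 1 would be served from local disk and block 2 fetched from the remote store, after which a real cache would record block 2 as cached too.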
  17. Architecture ● Plugin ○ Provides two types of information ■ How to get the list of nodes in the system ■ FileSystem for remote reads ○ E.g. Presto plugin, hadoop1 plugin, hadoop2 plugin
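The two responsibilities of a plugin can be expressed as a small interface. The sketch is hypothetical throughout: `EnginePlugin`, `cluster_nodes`, `open_remote`, and the in-memory implementation are illustrative names, not the actual Rubix plugin classes, which live in the engine-specific modules.

```python
import io
from abc import ABC, abstractmethod


class EnginePlugin(ABC):
    """What the cache needs from an engine integration (hypothetical interface)."""

    @abstractmethod
    def cluster_nodes(self):
        """Return the current list of worker nodes, in a stable order."""

    @abstractmethod
    def open_remote(self, path):
        """Return a seekable binary stream over the object in the remote store."""


class InMemoryPlugin(EnginePlugin):
    """Toy plugin that serves 'remote' objects from a dict, for illustration only."""

    def __init__(self, nodes, objects):
        self._nodes = list(nodes)
        self._objects = dict(objects)

    def cluster_nodes(self):
        return sorted(self._nodes)

    def open_remote(self, path):
        return io.BytesIO(self._objects[path])


plugin = InMemoryPlugin(["worker-2", "worker-1"], {"s3://bucket/a.txt": b"hello rubix"})
nodes = plugin.cluster_nodes()
with plugin.open_remote("s3://bucket/a.txt") as stream:
    stream.seek(6)          # ranged read: skip "hello "
    data = stream.read(5)
print(nodes, data)
```

Keeping the interface this narrow is what would let the ownership and caching layers stay engine-agnostic: they only need a node list to assign splits and a remote stream to fill cache misses.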
  18. Plugins: Presto ● Presto provides tight control over scheduling local splits ● This ensures that splits are always scheduled locally ● Worked well for our customers
  19. Plugins: Hadoop ● Strict local scheduling was not possible with Hadoop ● This meant a lot of warm-ups and redundant copies of data ● Options: ○ Read directly from remote for non-local reads ○ Figure out the correct owner and read from it ○ Implement non-local reads for Hadoop support ● Learnings ○ 100% strict location-based scheduling is not possible in Hadoop 2
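The two non-local read options listed for Hadoop can be contrasted in a few lines. This is a sketch of the decision only, with hypothetical names (`choose_read_source`, the `policy` values, and the returned labels); it does not show how Rubix actually transfers data between nodes.

```python
def choose_read_source(reader_node, owner_node, policy="remote"):
    """Pick where a task reads a block from.

    policy="remote": non-local reads bypass the cache and go to the cloud store.
    policy="owner":  non-local reads are served by the node that owns the block.
    """
    if policy not in ("remote", "owner"):
        raise ValueError(f"unknown policy: {policy}")
    if reader_node == owner_node:
        # The task was scheduled on the owning node, so its local cache is used.
        return ("local-cache", owner_node)
    if policy == "owner":
        return ("peer-cache", owner_node)
    return ("cloud-store", None)


local = choose_read_source("node-1", "node-1")
direct = choose_read_source("node-2", "node-1", policy="remote")
via_owner = choose_read_source("node-2", "node-1", policy="owner")
print(local, direct, via_owner)
```

Either policy avoids writing a second copy of the block on the non-owning node, which is the redundancy the slide attributes to non-strict scheduling.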
  20. Using Rubix with Presto ● Configure disk mount point ○ Assumes disks are mounted on /media/ephemeral0, /media/ephemeral1, etc., by default ● Start BookKeeper ● Place Rubix jars in the hive-hadoop2 plugin of Presto ● Configure Presto to use the Rubix FileSystem for the cloud store
  21. Using Rubix with Presto in Qubole
  22. Using Rubix with Hadoop ● Configure disk mount point ○ Assumes disks are mounted on /media/ephemeral0, /media/ephemeral1, etc., by default ● Start BookKeeper ● Place Rubix jars with the Hadoop libraries ● Configure Hadoop to use the Rubix FileSystem for the cloud store
  23. Extending to other Engines and Clouds
  24. Performance gains
  25. Future Work ● Extend to other clouds and engines ● Table-aware objects in Rubix ● Caching policies for Hive partitions ● Subquery caching
  26. Questions?