This document proposes an in-memory file system to provide a unified abstraction for distributed computing across different devices and networks. It aims to simplify application integration and allow data to be directly placed on devices like GPUs. A prototype would use distributed mmap over Ethernet or InfiniBand to test the approach. The goal is to apply this to build a new distributed version of Caffe that could better leverage devices and frameworks like H2O through simplified pipelines rather than separate integrations.
2. Context - me
● Distributed systems - trading, air traffic control, neural nets
● Multi-GPU Caffe
● Caffe over InfiniBand in Spark
Now at UCB
● Caffe: python, help merge forks
● Project: how to generalize work above?
○ Help leverage devices, e.g. in H2O
○ New distributed Caffe, meta graph
6. A single abstraction?
● Intra-machine (device bus) vs inter-machine (networks)
○ E.g. CUDA copy and sockets
○ RDMA blurs local and remote devices
● Communication vs persistence
○ Sockets vs files is orthogonal to location
○ NVMe over Fabrics allows storage on remote disks
● Ephemeral vs durable
○ 3D XPoint & ReRAM are in-between RAM and SSD
○ Intel’s pmem exposes device directly as memory
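The orthogonality above can be seen in a few lines of stock Python: the same payload moves once over a socket pair (pure communication) and once through an mmap'd file (persistence that doubles as a memory interface). This is a purely local sketch; nothing in it is distributed.

```python
import mmap
import os
import socket
import tempfile

payload = b"gradient update"

# Communication path: a socket pair moves the bytes between endpoints.
a, b = socket.socketpair()
a.sendall(payload)
received = b.recv(1024)
a.close()
b.close()

# Persistence path: the same bytes land in a file and are mapped as memory.
path = os.path.join(tempfile.mkdtemp(), "update.bin")
with open(path, "wb") as f:
    f.write(payload)
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    stored = mapped[:]
    mapped.close()
```

Either way the consumer sees identical bytes; only the transport and lifetime differ, which is the point of folding both under one abstraction.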
8. Example - GPU kernel on data in storage
Today
● Client reads HDFS path
● HDFS client resolves worker
● Establishes connection
● Server accepts connection
● Authentication, authorization
● File system operation
● Network transfer
● CUDA transfer
BFS
data = mmap("/path")
gpu_kernel(data)
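A minimal runnable version of the proposed two-line path, with a CPU summation standing in for gpu_kernel (the GPU launch and the remote mapping are the hypothetical parts):

```python
import mmap
import os
import tempfile

# Stand-in for data some remote writer placed at a shared path.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(bytes(range(10)))

def kernel(buf):
    # CPU stand-in for gpu_kernel(); with GPUDirect the same mapping
    # could back device-visible memory instead.
    return sum(buf[:])

with open(path, "rb") as f:
    data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    result = kernel(data)
    data.close()
```

The application code never names hosts, sockets, or transfer steps; placement is the file system's problem.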
9. Example - Compute graph in hardware
/app/jpgs/*
/layers/*
/vars/*
/redis // Access DB: db = redis.open("./redis")
● Everything is a file
○ Using mmap, named pipes, unix sockets
○ E.g. inputs jpgs, weights, activations, counters
● All state and coordination in fs
○ Minimal code, e.g. persistent GPU kernels
○ Location independent → dynamic placement
○ Arbitrary graph splitting, e.g. data & model parallel ML
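A sketch of the "all coordination in the fs" idea using only POSIX primitives: a producer thread writes a hypothetical `weights` file, then signals readiness through a named pipe; the consumer blocks on the pipe before reading the file. All names are illustrative, and only standard local facilities are used.

```python
import os
import tempfile
import threading

# Hypothetical layout: files carry state, a FIFO carries the signal.
root = tempfile.mkdtemp()
fifo = os.path.join(root, "ready")
os.mkfifo(fifo)

def producer():
    # Write a "weights" file, then signal through the named pipe.
    with open(os.path.join(root, "weights"), "wb") as f:
        f.write(b"\x01\x02\x03")
    with open(fifo, "wb") as p:
        p.write(b"go")

t = threading.Thread(target=producer)
t.start()

# Consumer blocks on the pipe open/read, then reads the state file.
with open(fifo, "rb") as p:
    signal = p.read()
with open(os.path.join(root, "weights"), "rb") as f:
    weights = f.read()
t.join()
```

In the proposed system the producer and consumer could sit on different machines, with the fs providing the same rendezvous semantics.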
10. Example - Caffe & H2O
● H2O can write to Caffe input layers
○ Data placed directly on GPUs
○ RDMA atomic ops to count dependencies
● Can form pipelines
○ No need for pairwise integrations
○ Uniform monitoring, logging, etc.
○ Leverage best device for each step
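The dependency counting could look like the following CPU stand-in, where a lock-protected decrement plays the role of the RDMA fetch-and-add the slide mentions (class and method names are invented for illustration):

```python
import threading

# CPU stand-in for an RDMA atomic dependency counter: each upstream
# producer decrements once; the downstream stage fires at zero.
class DepCounter:
    def __init__(self, n):
        self._n = n
        self._lock = threading.Lock()
        self.done = threading.Event()

    def arrive(self):
        # Stand-in for a remote atomic fetch-and-add of -1.
        with self._lock:
            self._n -= 1
            if self._n == 0:
                self.done.set()

# Three producers (e.g. H2O writers filling Caffe input layers) arrive.
deps = DepCounter(3)
for _ in range(3):
    threading.Thread(target=deps.arrive).start()
deps.done.wait(timeout=5)
```

With RDMA atomics the decrement would execute on the remote NIC without involving the consumer's CPU, which is what makes this cheap enough for per-batch coordination.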
11. Benefits
● Performance
○ mmap gives the lowest-overhead data path
○ Leverages hardware, e.g. GPUDirect, RDMA, NVMe, atomic ops
● Complexity
○ Unified naming, permissioning, distributed state management
○ Hierarchical naming & location transparency → HA, placement
● Security
○ File permissions are familiar & kernel-enforced; other networking disabled
○ Mounting a folder grants access to well-defined resources / capabilities
12. Prototype
● Single master with metadata
● Distributed mmap (CPU)
● Embedded platform (X1)
● Ethernet, InfiniBand
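One way the single metadata master might be shaped, as a toy in-process sketch: it only maps paths to placements, and data would move peer-to-peer over Ethernet or InfiniBand. `Master`, `place`, `lookup`, and the example path are invented names, not from the slides.

```python
# Toy sketch of a single metadata master: it tracks where each path's
# bytes live; clients then mmap the region directly from that host.
class Master:
    def __init__(self):
        self._placements = {}

    def place(self, path, host, offset, length):
        # Record that `path` is backed by a region on `host`.
        self._placements[path] = (host, offset, length)

    def lookup(self, path):
        # Clients resolve a path once, then talk to the host directly.
        return self._placements[path]

m = Master()
m.place("/app/jpgs/0.jpg", "node-3", 0, 4096)
host, offset, length = m.lookup("/app/jpgs/0.jpg")
```

Keeping the master on the metadata path only (never the data path) is what makes a single master viable for a first prototype.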
13. Summary
● Caffe progress - multi-GPU in Python, merging NVIDIA work
● Working on new programming model
○ “Unix philosophy for modern apps”
○ Helps leverage devices, e.g. in H2O
○ Simplifies app integration & pipelines
○ Distributed version of Caffe first use case