1. Expectations for Optical Networks from the Viewpoint of System Software Research
Ryousei Takano
National Institute of Advanced Industrial Science and Technology (AIST)
Special session on challenges and opportunities of integrated photonics in future datacenters
ACSI 2015@Tsukuba, 27 Jan. 2015
2. Outline
• Trends in datacenter research and development
• AIST IMPULSE Project
• Workload analysis
• Proposed architecture: Dataflow-centric computing
3. Introduction
• Big data is the killer application in datacenters; it calls for clean-slate architectures such as "disaggregation" and the "datacenter in a box".
• Optical networks are key to realizing them.
• An optical path network (all-optical paths end-to-end) in a datacenter
  – Pros: huge bandwidth, energy efficiency
  – Cons: path switching latency, utilization
• To take advantage of an optical path network, a new datacenter OS is essential.
  – Key idea: control/data plane separation
4. Optical Network in DCs
• Projects with a similar concept ("disaggregation" or "datacenter in a box") have recently been launched.
  – Open Compute Project (Facebook)
  – Rack Scale Computing (Intel)
  – Extremely Shrinking Computing (IBM)
  – The Machine (HP)
  – FireBox (UCB)
  – CTR Consortium (MIT)
• Optical networking, including photonic-electronic convergence and short-reach (<1 km) interconnection, is key to driving innovation in future datacenters.
5. Facebook's Five Standard Servers
[Figure: Facebook cluster architecture, from "Flash at Facebook", Flash Summit 2013: Front-End Cluster (Web: 250 racks, Ads: 30 racks, Cache: ~144 TB, Multifeed: 9 racks), Service Cluster (Search, Photos, Msg, other small services), Back-End Cluster (UDB, ADS-DB, Tao Leader)]
• Type I (Web): CPU high (2x E5-2670), memory low, disk low. Services: Web, Chat
• Type III (Database): CPU high (2x E5-2660), memory high (144 GB), disk high IOPS (3.2 TB Flash). Services: Database
• Type IV (Hadoop): CPU high (2x E5-2660), memory medium (64 GB), disk high (15x 4 TB SATA). Services: Hadoop (big data)
• Type V (Photos): CPU low, memory low, disk high (15x 4 TB SATA). Services: Photos, Video
• Type VI (Feed): CPU high (2x E5-2660), memory high (144 GB), disk medium. Services: Multifeed, Search, Ads
6. Open Compute Project
• OCP was founded by Facebook in April 2011 to openly share designs of datacenter products.
• Shift from commodity products to user-driven design to improve the energy efficiency of large-scale datacenters
  – Industry standard: PUE 1.9
  – Open Compute Project: PUE 1.07
• Specifications: server, storage, rack, network switch, etc.
• Products: Quanta Rackgo X, GIGABYTE DataCenter Solution
[Photo: Open Compute Rack v2]
8. HP "The Machine"
"The Machine could be six times more powerful than an equivalent conventional design, while using just 1.25 percent of the energy and being around 1/100 the size."
http://www.hpl.hp.com/research/systems-research/themachine/
9. Datacenter in a Box
• The Machine is six times faster with 1.25 percent of the energy compared with the K computer.
  (HPC Challenge RandomAccess benchmark)
10. The Machine: Architecture
[Figure: compute, memory, NV memory, and storage elements connected by a photonic interconnect; on-demand composition, Ensemble OS management, and Ensemble programming layered on top]
• Architecture evolution/revolution: a "Computing Ensemble", bigger than a server, smaller than a datacenter, with built-in system software
  – Disaggregated pools of uncommitted compute, memory, and storage elements
  – Optical interconnects enable dynamic, on-demand composition
  – Ensemble OS software using virtualization for composition and management
  – Management and programming virtual appliances add value for IT and application developers
11. Machine OS
• Linux++: a Linux-based OS for The Machine
  – A new concept of memory management
  – An emulator that makes a conventional computer behave like The Machine
  – A developer's preview to be released in June 2015(?)
• Carbon
  – HP plans to replace Linux++ with Carbon.
12. UC Berkeley FireBox
[Figure: FireBox overview, with SoCs and NVM modules connected by 1 Terabit/sec optical fibers through high-radix switches]
• Up to 1,000 SoCs + high-bandwidth memory (100,000 cores in total)
• Up to 1,000 non-volatile memory modules (100 PB in total)
• Inter-box network: many short paths through high-radix switches
• A concept similar to The Machine
14. IMPULSE: Initiative for Most Power-efficient Ultra-Large-Scale data Exploration
[Figure: roadmap from separated packages (2014) through 2.5D stacked packages (2020) to 3D stacked packages (2030) integrating logic, I/O, and NVRAM, connected by an optical network in the future datacenter]
• Non-Volatile Memory
  – Voltage-controlled magnetic RAM, mainly for cache and working memory
• High-Performance Logic Architecture
  – 3D build-up integration of the front-end circuits, including high-mobility Ge-on-insulator FinFETs / AIST-original TCAD
• Optical Network
  – Silicon photonics cluster switches
  – Optical interconnect technologies
• Future datacenter architecture design / Dataflow-centric warehouse-scale computing
15. AIST's IMPULSE Program
IMPULSE (Initiative for Most Power-efficient Ultra-Large-Scale data Exploration), a STrategic AIST integrated R&D (STAR) program*
[Figure: architecture for concentrated data processing (HPC and big data): high-performance server modules built by 3D installation of non-volatile memory and energy-saving logic with optical paths between chips, storage-class memory (non-volatile memory), HDD storage, an energy-saving high-speed network, and energy-saving large-capacity storage, toward creating a rich and eco-friendly society]
*A STAR program is AIST research that is expected to produce a large outcome in the future.
16. Voltage-controlled Nonvolatile Magnetic RAM
[Figure: nonvolatile CPU, nonvolatile cache, nonvolatile display, and power-saving storage (NAND Flash); a thin-film ferromagnetic layer on an insulation layer keeps its memory without power]
• Voltage-Controlled Spin RAM
  – Voltage-induced magnetic anisotropy change
  – Less than 1/100 rewriting power
• Voltage-Controlled Topological RAM
  – Resistance change by Ge displacement
  – Loss by entropy: < 1/100
17. Low-Power High-Performance Logic
[Figure: front-end 3D integration, with a wiring layer above stacked front-end layers; Ge-fin nMOS/pMOS devices with source/drain regions on an insulating film]
• Ge Fin CMOS technology
  – Low power and high speed enabled by Ge
  – Toward 0.4 V Ge Fin CMOS
• Dense integration without miniaturization
• Reduction of wiring length for power saving
• Introduction of Ge and III-V channels by a simple stacking process
• Innovative circuits by using the Z direction
19. Optical Network Technology for Future Datacenters
[Figure: a wavelength bank (optical comb) feeding modulators and a MUX/DEMUX onto fiber via silicon photonics integration; a 2.5D CPU card with comb source, Tx/Rx, DSP, memory cube, and CPU/GPU; silicon photonics cluster switches connecting datacenter server racks with DWDM, multi-level modulation optical interconnects]
• Large-scale silicon-photonics-based cluster switches
• DWDM, multi-level modulation, highly integrated "elastic" optical interconnects
• Ultra-low-energy-consumption network by making use of optical switches
  – Ultra-compact switches based on silicon photonics
  – 3D integration by amorphous silicon
  – A new server architecture
• Transmitter roadmap (current state-of-the-art Tx ~100 Gbps → ~5.12 Tbps):
  No. of λs / order of modulation / bit rate: 1 / 1 / 20 Gbps; 4 / 8 / 640 Gbps; 32 / 8 / 5.12 Tbps
• Switch capacity: current electrical switches ~130 Tbps → ~500 Pbps
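The bit rates in the transmitter roadmap above follow directly from the number of wavelengths and the modulation order. The following minimal sketch checks that arithmetic, under our assumption (not stated on the slide) that "order of modulation" means bits per symbol and each wavelength runs at a 20 Gbaud symbol rate:

```python
# Consistency check of the roadmap table, under the assumption (ours, not
# the slide's) that "order of modulation" = bits per symbol and that each
# wavelength carries a 20 Gbaud symbol stream (from the 1-λ, order-1 row).

SYMBOL_RATE = 20e9   # 20 Gbaud per wavelength

for n_lambda, bits_per_symbol in [(1, 1), (4, 8), (32, 8)]:
    rate = n_lambda * bits_per_symbol * SYMBOL_RATE
    print(f"{n_lambda:2d} λ x {bits_per_symbol} b/sym -> {rate / 1e12:.2f} Tbps")
# Prints 0.02, 0.64, and 5.12 Tbps, matching the 20 Gbps, 640 Gbps, and
# 5.12 Tbps entries in the table above.
```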
20. Architecture for Big Data and Extreme-scale Computing
[Figure: dataflow-centric warehouse-scale computing: real-time big data flows from input through conversion and analysis to output, while the datacenter OS performs optimal arrangement of the data flow and resource management/monitoring across server modules and storage]
• Dataflow-centric warehouse-scale computing
  – (1) A single OS controls the entire datacenter.
  – (2) The datacenter OS is split into a data plane and a control plane to guarantee real-time data processing.
• Universal processors/hardware and storage are connected by the optical network.
21. Performance Estimation
• Estimate the performance of typical HPC and big data workloads on a future datacenter system
• SimGrid simulator
  – A simulator of large-scale distributed systems such as Grids, Clouds, HPC, and P2P
  – http://simgrid.gforge.inria.fr
[Figure: SimGrid overview: user code targets the MSG (simple application-level simulator), SimDag (framework for DAGs of parallel tasks), or SMPI (library to run MPI applications on top of a virtual environment) APIs, built on the SURF virtual platform simulator and the XBT base layer; the simulation kernel takes a platform topology, application deployment, and applicative workload as input and produces logs, statistics, and visualization. If the application is a DAG of (parallel) tasks, use SimDag; to study an existing MPI code, use SMPI.]
22. Workload 1: Simple Message Passing
• Iterations of neighbor communication between adjacent nodes
• Increasing the link bandwidth has a big impact when the application is network intensive (up to roughly 1/100 of the execution time at large data sizes).
[Plot: relative execution time vs. data size (bytes) and compute power (FLO); #nodes: 10,000, link bandwidth: 0.1, 1, 10 Tbps, link latency: 100 ns, CPU power: 10 TFLOPS, data size: 10^12 to 10^24 B]
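To make the claim concrete, here is a minimal analytical sketch, in the spirit of (but much simpler than) the SimGrid model used for the experiment: per-iteration time is modeled as compute time plus neighbor-exchange time, so a 100x bandwidth increase only pays off when the exchange term dominates. The per-iteration compute load and exchanged data volume are hypothetical.

```python
# Minimal sketch: iteration time = compute + neighbor exchange.
# Shows why 0.1 -> 10 Tbps (100x) cuts the time to roughly 1/100
# only when the workload is communication-bound.

CPU_FLOPS = 10e12          # 10 TFLOPS per node (slide parameter)
LATENCY = 100e-9           # 100 ns link latency (slide parameter)

def iteration_time(flops_per_node, bytes_exchanged, link_bps):
    compute = flops_per_node / CPU_FLOPS
    comm = LATENCY + 8 * bytes_exchanged / link_bps   # bits over the link
    return compute + comm

flops = 1e12               # hypothetical per-iteration compute load
data = 1e12                # hypothetical 1 TB exchanged with neighbors

base = iteration_time(flops, data, 0.1e12)            # 0.1 Tbps baseline
for bw in (0.1e12, 1e12, 10e12):
    t = iteration_time(flops, data, bw)
    print(f"{bw / 1e12:5.1f} Tbps: relative execution time = {t / base:.3f}")
```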
23. Workload 2: HPC Application
• NAS Parallel Benchmarks (256 processes, class C)
  – Low latency is more important than huge bandwidth.
  – The problem size is too small to utilize huge bandwidth.
[Plots: relative execution time, showing the effect of reducing the link latency (CPU power: 1 TFLOPS, link bandwidth: 1 Tbps) and the effect of increasing the link bandwidth (CPU power: 1 TFLOPS, link latency: 0.1 us)]
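A simple latency-bandwidth (alpha-beta) cost model illustrates why these NPB results favor latency reduction: at this problem size the messages are small, so the latency term dominates. The message sizes below are hypothetical, not measured NPB traffic.

```python
# Hedged sketch: the standard alpha-beta model T = alpha + size/beta.
# Small messages gain from lower latency; only large messages gain
# from higher bandwidth. Message sizes are illustrative only.

def msg_time(size_bytes, latency_s, bandwidth_bps):
    return latency_s + 8 * size_bytes / bandwidth_bps

for size in (1e3, 1e5, 1e7):                   # 1 KB, 100 KB, 10 MB
    base = msg_time(size, 1e-6, 100e9)         # 1 us latency, 100 Gbps
    low_lat = msg_time(size, 0.1e-6, 100e9)    # 10x lower latency
    high_bw = msg_time(size, 1e-6, 1e12)       # 10x higher bandwidth
    print(f"{size:>8.0e} B: 10x lower latency -> {base / low_lat:4.1f}x, "
          f"10x higher bandwidth -> {base / high_bw:4.1f}x speedup")
```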
24. Workload 3: MapReduce
• KDD Cup 2012, Track 2: predicting the click-through rate of ads (using Hadoop and Hivemall)
• Machine learning is CPU intensive.
• The effect of huge bandwidth is limited, because
  – the concurrency of the model used is not sufficient, and
  – Hadoop is optimized to make jobs run fast on current I/O devices.
[Plots: execution time (seconds) vs. disk I/O bandwidth (Mbps) and vs. relative CPU power (base 172 GFLOPS); CPU power: 17.2 TFLOPS, disk bandwidth: 200 Mbps, network bandwidth: 10 Gbps]
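A rough bottleneck model makes the same point: if a MapReduce stage is CPU-bound, raising the network (or disk) bandwidth barely changes its execution time. All numbers below are illustrative, not the Hivemall/KDD Cup measurements from the talk.

```python
# Hedged sketch of a bottleneck model for one MapReduce stage:
# stage time ~= max(CPU time, disk time, network time).

def stage_time(work_flops, data_bytes, cpu_flops, disk_bps, net_bps):
    cpu = work_flops / cpu_flops
    disk = 8 * data_bytes / disk_bps
    net = 8 * data_bytes / net_bps
    return max(cpu, disk, net)

work, data = 1e15, 1e9                    # hypothetical per-stage load
base = stage_time(work, data, 17.2e12, 200e6, 10e9)
fast_net = stage_time(work, data, 17.2e12, 200e6, 1e12)   # 100x network
fast_cpu = stage_time(work, data, 10 * 17.2e12, 200e6, 10e9)
# 100x network leaves the CPU-bound stage unchanged; 10x CPU helps only
# until the disk becomes the next bottleneck.
print(f"baseline {base:.1f}s, 100x network {fast_net:.1f}s, 10x CPU {fast_cpu:.1f}s")
```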
28. In-storage, Network Processing
• Hierarchical, partial reduce processing in each network node avoids network congestion and a serialized reduce.
• Compute modules are attached to storage to maximize the read throughput from storage.
[Figure: mappers feeding a shuffle-and-reduce stage, with partial reduction performed at each network node]
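A minimal sketch of the hierarchical, partial reduce idea follows: each node combines the partial results of a few children before forwarding them upward, so no single node performs the whole (serialized) reduce. The fan-out and the combine operator are illustrative assumptions, not IMPULSE design values.

```python
# Hedged sketch of hierarchical, partial in-network reduction.
from functools import reduce

def partial_reduce(values, combine, fanout=4):
    """Reduce `values` level by level, combining `fanout` items at a time,
    as a tree of network/storage nodes would."""
    level = list(values)
    while len(level) > 1:
        level = [reduce(combine, level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]

mapper_outputs = range(1, 65)        # 64 hypothetical partial sums
total = partial_reduce(mapper_outputs, lambda a, b: a + b)
print(total)                         # 2080, identical to a flat reduce
```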
30. In-storage, Network Processing: Hardware Design
[Figure: processing units (PU), each with local memory (MEM), distributed on a chip; direct optical I/O connections to non-volatile memory modules; communication over DWDM]
31. Direct Memory Copy over DWDM
• Assumes a processor-memory embedded package with a WDM interconnect.
• Goal: fully utilize the huge I/O bandwidth realized by DWDM.
• Multiple memory blocks can be sent/received simultaneously using multiple wavelengths.
• A memory-centric network is a similar idea [PACT13].
[Figure: a single-package compute node with processor cores, cache/MMU, and memory banks connected through a WDM interconnect fed from the wavelength bank; memory blocks travel in parallel on different wavelengths]
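The sketch below illustrates the transfer pattern only: memory blocks are striped across several wavelengths and copied concurrently, so aggregate bandwidth scales with the number of wavelengths. Threads stand in for the optical channels; this is not the IMPULSE hardware interface.

```python
# Hedged sketch: stripe a memory region across NUM_WAVELENGTHS channels
# and transfer the stripes concurrently.
from concurrent.futures import ThreadPoolExecutor

NUM_WAVELENGTHS = 8

def copy_block(block: bytes) -> bytes:
    # Stand-in for a DMA transfer carried on one wavelength.
    return bytes(block)

def dwdm_memory_copy(src: bytes) -> bytes:
    stripe = (len(src) + NUM_WAVELENGTHS - 1) // NUM_WAVELENGTHS
    blocks = [src[i:i + stripe] for i in range(0, len(src), stripe)]
    with ThreadPoolExecutor(max_workers=NUM_WAVELENGTHS) as pool:
        copied = list(pool.map(copy_block, blocks))
    return b"".join(copied)

data = bytes(range(256)) * 1024      # 256 KB of test data
assert dwdm_memory_copy(data) == data
```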
32. Our Vision of the Future Datacenter
[Figure: roadmap from separated packages (2014) through 2.5D stacked packages (2020) to 3D stacked packages (2030) integrating logic, I/O, and NVRAM, connected by an optical network in the future datacenter]
• Goal: 100x energy efficiency of data processing
33. Dataflow Processing System
• An application is expressed as a data flow of DPFs (Data Processing Functions).
• The datacenter OS performs data flow planning and resource monitoring, and co-allocates DPCs (Data Processing Components) and the network paths between them as a slice.
[Figure: a data flow of DPFs mapped by the datacenter OS onto a slice of DPCs spanning server modules and storage connected by the optical network]
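A minimal sketch of the co-allocation step follows: the control plane places each DPF onto a DPC with spare capacity and reserves a path between consecutive stages, returning the result as a slice. The class names and the greedy placement policy are illustrative, not the IMPULSE implementation.

```python
# Hedged sketch of data flow planning and DPC/path co-allocation.
from dataclasses import dataclass, field

@dataclass
class Dpc:                       # Data Processing Component (a resource)
    name: str
    free_cores: int

@dataclass
class Slice:                     # co-allocated resources for one data flow
    placements: dict = field(default_factory=dict)   # DPF -> DPC name
    paths: list = field(default_factory=list)        # reserved optical paths

def plan(dataflow, dpcs, cores_per_dpf=4):
    """Greedy placement of each DPF onto the first DPC with free cores,
    reserving a path between consecutive stages."""
    s, prev = Slice(), None
    for dpf in dataflow:
        dpc = next(d for d in dpcs if d.free_cores >= cores_per_dpf)
        dpc.free_cores -= cores_per_dpf
        s.placements[dpf] = dpc.name
        if prev is not None:
            s.paths.append((prev, dpc.name))
        prev = dpc.name
    return s

pool = [Dpc("server-0", 8), Dpc("server-1", 8), Dpc("storage-0", 4)]
print(plan(["input", "convert", "analyze", "output"], pool).placements)
```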
34. IMPULSE Datacenter OS
• A single OS for datacenter-wide optimization of energy efficiency and performance
• Separation of the data plane and the control plane:
  – The data plane is an application-specific library OS.
  – The control plane manages resources (servers, network, etc.) and deploys, launches, destroys, and monitors data planes.
[Figure: applications running on per-application data planes, managed by control planes within the datacenter OS]
35. IMPULSE Datacenter OS
• Data plane
  – An application-specific library OS (e.g., machine learning, data store, etc.)
  – Mitigates OS overhead to fully utilize high-performance devices
• Control plane
  – Resource management
  – Logical and secure resource partitioning for data planes
  – Runs on the firmware
[Figure: applications with their data planes on partitioned CPU/GPU, memory, and I/O resources, with the control plane underneath]
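A minimal sketch of the control-plane / data-plane split follows: the control plane partitions raw resources and hands a partition to an application-specific library-OS data plane, which then drives its devices directly on the data path. The interfaces shown are illustrative assumptions, not the IMPULSE APIs.

```python
# Hedged sketch: a control plane that partitions resources and
# library-OS data planes that own their partitions exclusively.
from dataclasses import dataclass

@dataclass
class Partition:
    cpus: list
    memory_gb: int
    devices: list             # e.g. NICs, NVM modules given to the data plane

class ControlPlane:
    def __init__(self, total_cpus, total_memory_gb, devices):
        self.free_cpus = list(range(total_cpus))
        self.free_memory_gb = total_memory_gb
        self.devices = list(devices)

    def create_partition(self, cpus, memory_gb, ndev):
        part = Partition([self.free_cpus.pop() for _ in range(cpus)],
                         memory_gb,
                         [self.devices.pop() for _ in range(ndev)])
        self.free_memory_gb -= memory_gb
        return part

class LibraryOSDataPlane:
    """Application-specific data plane: uses its partition without
    re-entering the control plane on the data path."""
    def __init__(self, app_name, partition):
        self.app_name, self.partition = app_name, partition

    def run(self):
        return f"{self.app_name} running on cpus {self.partition.cpus}"

cp = ControlPlane(total_cpus=16, total_memory_gb=256,
                  devices=["nic0", "nvm0", "nvm1"])
dp = LibraryOSDataPlane("ml-training",
                        cp.create_partition(cpus=4, memory_gb=64, ndev=1))
print(dp.run())
```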
36. Related Work
• Datacenter-wide resource management
  – OpenStack, Apache CloudStack, Kubernetes
  – Hadoop YARN, Apache Mesos
• Dataflow processing engines
  – Google Cloud Dataflow
  – Lambda architecture
• OS designs with separated control and data planes
  – Arrakis (U. Washington)
  – IX (Stanford)
37. Summary
• New visions of future datacenters: "disaggregation" and the "datacenter in a box"
• Optical networks are key to realizing them.
• Hardware and software co-design is critical.
• An optical path network encourages control/data plane separation in a datacenter OS.
  – The control plane manages resources and establishes paths between data processing components.
  – The data plane fully utilizes the huge bandwidth.
39. References
• Rack scale architecture for Cloud, IDC 2013
  – https://intel.activeevents.com/sf13/connect/fileDownload/session/6DE5FDFBF0D0854E73D2A3908D58E1E2/SF13_CLDS001_100.pdf
• Intel rack scale architecture overview, Interop 2013
  – http://presentations.interop.com/events/las-vegas/2013/free-sessions---keynote-presentations/download/463
• New technologies that disrupt our complete ecosystem and their limits in the race to Zettascale, HPC 2014
  – http://www.hpcc.unical.it/hpc2014/pdfs/demichel.pdf
• "The future server technology HP showed at Tech Power Club" (in Japanese), ASCII.jp
  – http://ascii.jp/elem/000/000/915/915508/
40. ISSCC 2014 Trends (excerpts)
Optical Interconnect:
As the bandwidth demand for traditionally electrical wireline interconnects has accelerated, optics has become an increasingly attractive alternative for interconnects within computing systems. Optical communication offers clear benefits for high-speed and [...] interconnects. Relative to electrical interconnects, optics provides lower channel loss. Circuit design and packaging techniques that have traditionally been used for electrical wireline are being adapted to enable integrated optics with extremely low power [...], which has resulted in rapid progress in optical ICs for Ethernet, backplane, and chip-to-chip optical communication. ISSCC 2014 [features a two-]dimensional (12×5) optical array achieving an aggregate data rate of 600 Gb/s [8.2]. Pre-emphasis using group-delay [...] extends the useful data rate of a 25 Gb/s VCSEL to 40 Gb/s [8.9]. Additional examples of low-power linear and non-linear [equalization] for electronic dispersion compensation in multi-mode and long-haul cables [8.1, 8.3].
Concluding Remarks:
Continuing to aggressively scale I/O bandwidth is both essential for the industry and extremely challenging. Innovations [enabling] higher performance and lower power will continue to be made in order to sustain this trend. Advances in circuit architectures, [...] topologies, and transistor scaling are together changing how I/O will be done over the next decade. The most exciting and [...] of these emerging technologies for wireline I/O will be highlighted at ISSCC 2014.
[Plot: per-pin data rate (Gbps) vs. year (2000 to 2016) for a variety of common I/O standards: HyperTransport, QPI, PCIe, S-ATA, SAS, OIF/CEI, PON, Fibre Channel, DDR, GDDR]
[Plot: DRAM data bandwidth trends]
Non-Volatile Memories (NVMs):
Over the past decade, significant investment has been put into emerging memories to find an alternative to floating-gate based non-volatile memory. The emerging NVMs, such as phase-change memory (PRAM), ferroelectric RAM (FeRAM), magnetic spin-torque-transfer (STT-RAM), and resistive memory (ReRAM), are showing potential to achieve high cycling capability and lower power per bit in read/write operations. Some commercial applications, such as cellular phones, have recently started to use PRAM, demonstrating that reliability and cost competitiveness in emerging memories is becoming a reality. Fast write speed and low read-access time are the potential benefits of these emerging memories. At ISSCC 2014, a high-density ReRAM with a buried WL access device is introduced to improve the write performance and area. The next figure highlights how MLC NAND Flash write throughput continues to improve. However, while the following figure shows no increase in NAND Flash density over the past year, recent devices are built with finer dimensions and more sophisticated 3-dimensional vertical bit cells.
[Figure captions: per-pin data rate of common I/O standards; High Bandwidth Memory; processor scaling trends (roughly 2x per 1.5 years; data source: http://cpudb.stanford.edu/)]
Source: ISSCC 2014 Trends