SlideShare una empresa de Scribd logo
1 de 40
Descargar para leer sin conexión
Library Operating
System with Mainline
Linux Network Stack
!
Hajime Tazaki, Ryo Nakamura, Yuji Sekiya	
netdev0.1, Feb. 2015
Motivation
Why kernel space ?	
Packets were expensive in 1970’	
Why not userspace ?	
well grown in decades, costs degrades	
obtain network stack personalization	
controllable by userspace utilities
2
Userspace network stacks
A lot of userspace network stack	
full scratch: mTCP, Mirage, lwIP	
Porting: OSv, Sandstorm, libuinet (FreeBSD),
Arrakis (lwIP), OpenOnload (lwIP?) 	
Motivated by their own problems (specialized NIC,
cloud, high-speed Apps)	
Writing a network stack is 1-week DIY,	
but writing opera-table network stack is decades
DIY (which is not DIY)
3
Questions
How to benefit matured network stack
in userspace ?	
How to trivially introduce your idea
on network stack ?	
xxTCP, IPvX, etc..	
How to flexibly test your code with a
complex scenario ?
4
The answers
Using Linux network stack as-is	
!
as a userspace Library (library
operating system)
5
This talk is about
an introduction of a library
operating system for Linux	
and its implementation	
with a couple of useful use cases
6
Outlook (design)
hardware-independent arch (arch/lib)	
3 components	
Host backend layer	
Kernel layer	
POSIX layer
7
https://github.com/libos-nuse/net-next-nuse
Outlook (cont’d)
8
ARP
Qdisc
TCP UDP DCCP SCTP
ICMP IPv4IPv6
Netlink
BridgingNetfilter
IPSec Tunneling
Kernel layer
Host backend layer
bottom halves/
rcu/timer/
interrupt
struct
net_device
scheduler
netdev
clock
source
POSIX glue layer
Application
1) Build Linux srctree
w/ glues as a library
2) put backend!
(vNIC, clock source,!
scheduler) and bind
3) add POSIX glue code
4) applications
magically runs
Kernel glue code
9
https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/lib/sched.c
void schedule(void)!
{!
! lib_task_wait();!
}!
signed long schedule_timeout(signed long timeout)!
{!
! u64 ns;!
! struct SimTask *self;!
!
! if (timeout == MAX_SCHEDULE_TIMEOUT) {!
! ! lib_task_wait();!
! ! return MAX_SCHEDULE_TIMEOUT;!
! }!
! lib_assert(timeout >= 0);!
! ns = ((__u64)timeout) * (1000000000 / HZ);!
! self = lib_task_current();!
! lib_event_schedule_ns(ns, &trampoline, self);!
! lib_task_wait();!
! /* we know that we are always perfectly on time. */!
! return 0;!
}
POSIX glue code
10
https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/lib/nuse-glue.c
int nuse_socket(int domain, int type, int protocol)!
{!
! lib_update_jiffies();!
! struct socket *kernel_socket = malloc(sizeof(struct socket));!
! int ret, real_fd;!
!
! memset(kernel_socket, 0, sizeof(struct socket));!
! ret = lib_sock_socket(domain, type, protocol, &kernel_socket);!
! if (ret < 0)!
! ! errno = -ret;!
(snip)!
! lib_softirq_wakeup();!
! return real_fd;!
}!
weak_alias(nuse_socket, socket);
Implementations
(Instances)
Direct Code Execution (DCE)	
network simulator integration (ns-3)	
for more testing	
Network Stack in Userspace (NUSE)	
gives new platform of Linux network stack	
for ad-hoc network stack
11
Direct Code Execution
ns-3 integration	
deterministic scheduler	
single-process model virtualization	
dlmopen(3)-like virtualization	
full control over multiple network stacks
12
Execution (DCE)
main() => dlmopen(ping,liblinux.so)

=> main()=>socket(2)=>dce_socket()

=> (do whatever)
13
14
15
Network Stack in
Userspace
Userspace network stack running on
Linux (POSIX) platform	
Network stack personalization	
Full features by design (full stack)	
ARP/ND, UDP/TCP (all cc algorithm), SCTP,
DCCP, QDISC, XFRM, netfilter, etc.
16
17
Application
ARP
Qdisc
TCP UDP DCCP SCTP
ICMP IPv4IPv6
Netlink
BridgingNetfilter
IPSec Tunneling
Kernel layer
Host backend layer (NUSE)
POSIX glue layer
bottom halves/
rcu/timer/
interrupt
struct
net_device
RAW DPDK netmap ...
NIC
scheduler
netdev
clock
source
system call hijack
Application
master process slave processes
rump
syscall
proxy
rump
server
Execution (NUSE)
LD_PRELOAD=libnuse-linux.so 

ping www.google.com	
ping(8) => socket(2) => nuse_socket()
=> raw(7) => (network)
18
When it’s useful?
ad-hoc network stack (network stack
personalization)	
LD_PRELOAD=liblinux-mptcp.so firefox	
Bundle with kernel bypasses	
Intel DPDK / netmap / PF_RING / etc.	
debugging/testing with ns-3
19
Testing workflow
1.Write/modify code (patches)	
2.Write a test code (incl. packet
exchanges)	
3.if PASS; accept pull-request

else; rejects
20
continuous integration
(CI)
21
http://ns-3-dce.cloud.wide.ad.jp/jenkins/job/daily-net-next-sim/
T1) write a patch
22
Fixes: de3b7a06dfe1 ("xfrm6: Fix transport header offset in _decode_session6.")!
Signed-off-by: Hajime Tazaki <tazaki@sfc.wide.ad.jp>!
---!
net/ipv6/xfrm6_policy.c | 1 +!
1 file changed, 1 insertion(+)!
!
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c!
index 48bf5a0..8d2d01b 100644!
--- a/net/ipv6/xfrm6_policy.c!
+++ b/net/ipv6/xfrm6_policy.c!
@@ -200,6 +200,7 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int
reverse)!
!
#if IS_ENABLED(CONFIG_IPV6_MIP6)!
! ! case IPPROTO_MH:!
+! ! ! offset += ipv6_optlen(exthdr);!
! ! ! if (!onlyproto && pskb_may_pull(skb, nh + offset + 3 - skb->data)) {!
! ! ! ! struct ip6_mh *mh;!
http://patchwork.ozlabs.org/patch/436351/
T2) write a test
As ns-3 scenario	
C++ or python	
create a topology	
config nodes	
run/check results
(e.g., ping6)
23
+-----------+!
| HA |!
+-----------+!
|sim0!
+----------+------------+!
|sim0 |sim0!
sim2+----+---+ +----+---+!
- - -| AR1 | | AR2 |!
+---+----+ +----+---+!
|sim1 |sim1!
| |!
sim0 sim0!
+----+------+ (Movement) +----+-----+!
| MR | <=====> | MR |!
+-----------+ +----------+!
|sim1 |sim1!
+---------+ +---------+!
| MNN | | MNN |!
+---------+ +---------+!
http://code.nsnam.org/thehajime/ns-3-dce-umip/file/tip/test/dce-umip-test.cc
24
#!/usr/bin/python!
!
from ns.dce import *!
from ns.core import *!
!
nodes = NodeContainer()!
nodes.Create (100)!
dce = DceManagerHelper()!
dce.SetNetworkStack ("liblinux.so")!
dce.Install (nodes)!
!
app = DceApplicationHelper()!
app.SetBinary ("ping6")!
app.Install (nodes)!
(snip)!
!
NS_TEST_ASSERT_MSG_EQ (m_pingStatus, true, "Umip test " << m_testname!
<< " did not return successfully: " << g_testError)!
!
Simulator.Stop (Seconds(1000.0))!
Simulator.Run ()
Performance of NUSE
10G Ethernet back-to-back	
transmission	
IP forwarding	
native Linux, raw socket, tap, dpdk,
netmap
25
Performance: setup
26
10G10G
NUSE node Tx/Rx nodes
CPU
Xeon E5-2650v2 @ 2.60GHz 	
(16 core)
Xeon L3426 @ 1.87GHz 	
(8 core)
Memory 32GB 4GB
NIC Intel X520 Intel X520
OS
host:3.13.0-32	
nuse: 3.17.0-rc1
host:3.13.0-32
ping!
flowgen
vnstat!
(packet count)
Tx NUSE Rx
ping!
flowgen
Host Tx
27
RxNUSE
ping (RTT)
throughput	
(1024byte,UDP)
0
1000
2000
3000
4000
5000
6000
dpdk native netmap raw tap
Throughput(Mbps)
0
0.2
0.4
0.6
0.8
1
dpdk native netmap raw tap
RTT(ms)
native: ping A.B.C.D!
others: ./nuse ping A.B.C.D
L3 Routing	
Sender->NUSE->Receiver
28
Tx RxNUSE
ping (RTT)
throughput	
(1024byte,UDP)
0
1000
2000
3000
4000
5000
6000
dpdk native netmap raw tap
Throughput(Mbps)
0
0.2
0.4
0.6
0.8
1
dpdk native netmap raw tap
RTT(ms)
Alternatives
UML/LKL (1proc/1vm, no POSIX i/f)	
Containers (can’t change kernel)	
scratch-based (mTCP,Mirage)	
rumpkernel (in NetBSD)
29
Limitations
ad-hoc kernel glues required	
when we changed a member of a struct,
LibOS needs to follow it	
Performance drawbacks on NUSE	
adapt known techniques (mTCP)
30
(not) Conclusions
An abstraction for multiple benefits	
Conservative	
Use past decades effort as much	
with a small amount of effort	
Planing to RFC for upstreaming
31
github: https://github.com/libos-nuse/net-
next-nuse	
DCE: http://bit.ly/ns-3-dce	
twitter: @thehajime
32
Backups
Bug reproducibility
34
Wi-Fi Wi-Fi
Home Agent
AP1 AP2
handoff
ping6
mobile node
correspondent
node
(gdb) b mip6_mh_filter if dce_debug_nodeid()==0

Breakpoint 1 at 0x7ffff287c569: file net/ipv6/mip6.c, line 88.	

<continue>	

(gdb) bt 4	

#0  mip6_mh_filter	

(sk=0x7ffff7f69e10, skb=0x7ffff7cde8b0)	

at net/ipv6/mip6.c:109 	

#1  0x00007ffff2831418 in ipv6_raw_deliver	

(skb=0x7ffff7cde8b0, nexthdr=135) 

at net/ipv6/raw.c:199 	

#2  0x00007ffff2831697 in raw6_local_deliver	

(skb=0x7ffff7cde8b0, nexthdr=135) 

at net/ipv6/raw.c:232 	

#3  0x00007ffff27e6068 in ip6_input_finish	

(skb=0x7ffff7cde8b0) 	

at net/ipv6/ip6_input.c:197
Debugging
Memory error detection	
among distributed nodes	
in a single process	
using Valgrind	
!
!
35
==5864== Memcheck, a memory error detector	

==5864== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.	

==5864== UsingValgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright in
==5864== Command: ../build/bin/ns3test-dce-vdl --verbose	

==5864== 	

==5864== Conditional jump or move depends on uninitialised value(s)
==5864== at 0x7D5AE32: tcp_parse_options (tcp_input.c:3782)
==5864== by 0x7D65DCB: tcp_check_req (tcp_minisocks.c:532)	

==5864== by 0x7D63B09: tcp_v4_hnd_req (tcp_ipv4.c:1496)	

==5864== by 0x7D63CB4: tcp_v4_do_rcv (tcp_ipv4.c:1576)	

==5864== by 0x7D6439C: tcp_v4_rcv (tcp_ipv4.c:1696)	

==5864== by 0x7D447CC: ip_local_deliver_finish (ip_input.c:226)	

==5864== by 0x7D442E4: ip_rcv_finish (dst.h:318)	

==5864== by 0x7D2313F: process_backlog (dev.c:3368)	

==5864== by 0x7D23455: net_rx_action (dev.c:3526)	

==5864== by 0x7CF2477: do_softirq (softirq.c:65)	

==5864== by 0x7CF2544: softirq_task_function (softirq.c:21)	

==5864== by 0x4FA2BE1: ns3::TaskManager::Trampoline(void*) (task-manage
==5864== Uninitialised value was created by a stack allocation
==5864== at 0x7D65B30: tcp_check_req (tcp_minisocks.c:522)	

==5864==
Fine-grained parameter coverage
36
Code coverage measurement with DCE	
With fine-grained network, node, protocol parameters
1) kernel build
build kernel source tree w/ the patch	
make menuconfig ARCH=sim	
make library ARCH=sim	
➔ libnuse-linux-3.17-rc1.so
37
Example: How timer
works
38
add_timer()
TIMER_SOFTIRQ
timer_list
run_timer_softirq ()
timer handler
timer thread
(timer_create (2))
Tx callgraph
39
sendmsg () (socket API)
lib_sock_sendmsg () (NUSE)
sock_sendmsg ()
ip_send_skb ()
ip_finish_output2 ()
dst_neigh_output () (existing
neigh_resolve_output () -kernel)
arp_solicit ()
dev_queue_xmit ()
lib_dev_xmit () (NUSE)
nuse_vif_raw_write ()
start_thread () (pthread)
nuse_netdev_rx_trampoline ()
nuse_vif_raw_read () (NUSE)
lib_dev_rx ()
netif_rx () (ex-kernel)
Rx callgraph
40
start_thread () (pthread)
do_softirq () (NUSE)
net_rx_action ()
process_backlog () (ex-kernel)
__netif_receive_skb_core ()
ip_rcv ()
vNIC!
rx
softirq!
rx

Más contenido relacionado

La actualidad más candente

Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
Brendan Gregg
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCC
Kernel TLV
 

La actualidad más candente (20)

Accelerating Envoy and Istio with Cilium and the Linux Kernel
Accelerating Envoy and Istio with Cilium and the Linux KernelAccelerating Envoy and Istio with Cilium and the Linux Kernel
Accelerating Envoy and Istio with Cilium and the Linux Kernel
 
Linux kernel tracing
Linux kernel tracingLinux kernel tracing
Linux kernel tracing
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPF
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF Superpowers
 
コンテナ時代のOpenStack
コンテナ時代のOpenStackコンテナ時代のOpenStack
コンテナ時代のOpenStack
 
TripleOの光と闇
TripleOの光と闇TripleOの光と闇
TripleOの光と闇
 
Scale Kubernetes to support 50000 services
Scale Kubernetes to support 50000 servicesScale Kubernetes to support 50000 services
Scale Kubernetes to support 50000 services
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
 
BPF & Cilium - Turning Linux into a Microservices-aware Operating System
BPF  & Cilium - Turning Linux into a Microservices-aware Operating SystemBPF  & Cilium - Turning Linux into a Microservices-aware Operating System
BPF & Cilium - Turning Linux into a Microservices-aware Operating System
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
Java Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame Graphs
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
PostgreSQL on EXT4, XFS, BTRFS and ZFS
PostgreSQL on EXT4, XFS, BTRFS and ZFSPostgreSQL on EXT4, XFS, BTRFS and ZFS
PostgreSQL on EXT4, XFS, BTRFS and ZFS
 
Meet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingMeet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracing
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCC
 
eBPF Workshop
eBPF WorkshopeBPF Workshop
eBPF Workshop
 
Faster packet processing in Linux: XDP
Faster packet processing in Linux: XDPFaster packet processing in Linux: XDP
Faster packet processing in Linux: XDP
 
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machine
 

Similar a Library Operating System for Linux #netdev01

May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
Jeff Larkin
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
Hajime Tazaki
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
Jeff Larkin
 
Picobgp - A simple deamon for routing advertising
Picobgp - A simple deamon for routing advertisingPicobgp - A simple deamon for routing advertising
Picobgp - A simple deamon for routing advertising
Claudio Mignanti
 
Share the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development BoardShare the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development Board
Jian-Hong Pan
 

Similar a Library Operating System for Linux #netdev01 (20)

LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
 
[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
Network Stack in Userspace (NUSE)
Network Stack in Userspace (NUSE)Network Stack in Userspace (NUSE)
Network Stack in Userspace (NUSE)
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 
05 - Bypassing DEP, or why ASLR matters
05 - Bypassing DEP, or why ASLR matters05 - Bypassing DEP, or why ASLR matters
05 - Bypassing DEP, or why ASLR matters
 
Picobgp - A simple deamon for routing advertising
Picobgp - A simple deamon for routing advertisingPicobgp - A simple deamon for routing advertising
Picobgp - A simple deamon for routing advertising
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
 
The true story_of_hello_world
The true story_of_hello_worldThe true story_of_hello_world
The true story_of_hello_world
 
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
 
SFO15-500: VIXL
SFO15-500: VIXLSFO15-500: VIXL
SFO15-500: VIXL
 
The Fuzzing Project - 32C3
The Fuzzing Project - 32C3The Fuzzing Project - 32C3
The Fuzzing Project - 32C3
 
Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!
 
BUD17-300: Journey of a packet
BUD17-300: Journey of a packetBUD17-300: Journey of a packet
BUD17-300: Journey of a packet
 
Share the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development BoardShare the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development Board
 
Golang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war storyGolang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war story
 
[Kiwicon 2011] Post Memory Corruption Memory Analysis
[Kiwicon 2011] Post Memory Corruption Memory Analysis[Kiwicon 2011] Post Memory Corruption Memory Analysis
[Kiwicon 2011] Post Memory Corruption Memory Analysis
 
Joblib: Lightweight pipelining for parallel jobs (v2)
Joblib:  Lightweight pipelining for parallel jobs (v2)Joblib:  Lightweight pipelining for parallel jobs (v2)
Joblib: Lightweight pipelining for parallel jobs (v2)
 
Introduction to eBPF and XDP
Introduction to eBPF and XDPIntroduction to eBPF and XDP
Introduction to eBPF and XDP
 

Más de Hajime Tazaki (8)

Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)
Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)
Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)
 
Network stack personality in Android phone - netdev 2.2
Network stack personality in Android phone - netdev 2.2Network stack personality in Android phone - netdev 2.2
Network stack personality in Android phone - netdev 2.2
 
Playing BBR with a userspace network stack
Playing BBR with a userspace network stackPlaying BBR with a userspace network stack
Playing BBR with a userspace network stack
 
Linux Kernel Library - Reusing Monolithic Kernel
Linux Kernel Library - Reusing Monolithic KernelLinux Kernel Library - Reusing Monolithic Kernel
Linux Kernel Library - Reusing Monolithic Kernel
 
IIJlab seminar - Linux Kernel Library: Reusable monolithic kernel (in Japanese)
IIJlab seminar - Linux Kernel Library: Reusable monolithic kernel (in Japanese)IIJlab seminar - Linux Kernel Library: Reusable monolithic kernel (in Japanese)
IIJlab seminar - Linux Kernel Library: Reusable monolithic kernel (in Japanese)
 
mTCP使ってみた
mTCP使ってみたmTCP使ってみた
mTCP使ってみた
 
Direct Code Execution @ CoNEXT 2013
Direct Code Execution @ CoNEXT 2013Direct Code Execution @ CoNEXT 2013
Direct Code Execution @ CoNEXT 2013
 
Kernelvm 201312-dlmopen
Kernelvm 201312-dlmopenKernelvm 201312-dlmopen
Kernelvm 201312-dlmopen
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 

Library Operating System for Linux #netdev01

  • 1. Library Operating System with Mainline Linux Network Stack ! Hajime Tazaki, Ryo Nakamura, Yuji Sekiya netdev0.1, Feb. 2015
  • 2. Motivation Why kernel space ? Packets were expensive in 1970’ Why not userspace ? well grown in decades, costs degrades obtain network stack personalization controllable by userspace utilities 2
  • 3. Userspace network stacks A lot of userspace network stack full scratch: mTCP, Mirage, lwIP Porting: OSv, Sandstorm, libuinet (FreeBSD), Arrakis (lwIP), OpenOnload (lwIP?) Motivated by their own problems (specialized NIC, cloud, high-speed Apps) Writing a network stack is 1-week DIY, but writing opera-table network stack is decades DIY (which is not DIY) 3
  • 4. Questions How to benefit matured network stack in userspace ? How to trivially introduce your idea on network stack ? xxTCP, IPvX, etc.. How to flexibly test your code with a complex scenario ? 4
  • 5. The answers Using Linux network stack as-is ! as a userspace Library (library operating system) 5
  • 6. This talk is about an introduction of a library operating system for Linux and its implementation with a couple of useful use cases 6
  • 7. Outlook (design) hardware-independent arch (arch/lib) 3 components Host backend layer Kernel layer POSIX layer 7 https://github.com/libos-nuse/net-next-nuse
  • 8. Outlook (cont’d) 8 ARP Qdisc TCP UDP DCCP SCTP ICMP IPv4IPv6 Netlink BridgingNetfilter IPSec Tunneling Kernel layer Host backend layer bottom halves/ rcu/timer/ interrupt struct net_device scheduler netdev clock source POSIX glue layer Application 1) Build Linux srctree w/ glues as a library 2) put backend! (vNIC, clock source,! scheduler) and bind 3) add POSIX glue code 4) applications magically runs
  • 9. Kernel glue code 9 https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/lib/sched.c void schedule(void)! {! ! lib_task_wait();! }! signed long schedule_timeout(signed long timeout)! {! ! u64 ns;! ! struct SimTask *self;! ! ! if (timeout == MAX_SCHEDULE_TIMEOUT) {! ! ! lib_task_wait();! ! ! return MAX_SCHEDULE_TIMEOUT;! ! }! ! lib_assert(timeout >= 0);! ! ns = ((__u64)timeout) * (1000000000 / HZ);! ! self = lib_task_current();! ! lib_event_schedule_ns(ns, &trampoline, self);! ! lib_task_wait();! ! /* we know that we are always perfectly on time. */! ! return 0;! }
  • 10. POSIX glue code 10 https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/lib/nuse-glue.c int nuse_socket(int domain, int type, int protocol)! {! ! lib_update_jiffies();! ! struct socket *kernel_socket = malloc(sizeof(struct socket));! ! int ret, real_fd;! ! ! memset(kernel_socket, 0, sizeof(struct socket));! ! ret = lib_sock_socket(domain, type, protocol, &kernel_socket);! ! if (ret < 0)! ! ! errno = -ret;! (snip)! ! lib_softirq_wakeup();! ! return real_fd;! }! weak_alias(nuse_socket, socket);
  • 11. Implementations (Instances) Direct Code Execution (DCE) network simulator integration (ns-3) for more testing Network Stack in Userspace (NUSE) gives new platform of Linux network stack for ad-hoc network stack 11
  • 12. Direct Code Execution ns-3 integration deterministic scheduler single-process model virtualization dlmopen(3)-like virtualization full control over multiple network stacks 12
  • 13. Execution (DCE) main() => dlmopen(ping,liblinux.so)
 => main()=>socket(2)=>dce_socket()
 => (do whatever) 13
  • 14. 14
  • 15. 15
  • 16. Network Stack in Userspace Userspace network stack running on Linux (POSIX) platform Network stack personalization Full features by design (full stack) ARP/ND, UDP/TCP (all cc algorithm), SCTP, DCCP, QDISC, XFRM, netfilter, etc. 16
  • 17. 17 Application ARP Qdisc TCP UDP DCCP SCTP ICMP IPv4IPv6 Netlink BridgingNetfilter IPSec Tunneling Kernel layer Host backend layer (NUSE) POSIX glue layer bottom halves/ rcu/timer/ interrupt struct net_device RAW DPDK netmap ... NIC scheduler netdev clock source system call hijack Application master process slave processes rump syscall proxy rump server
  • 18. Execution (NUSE) LD_PRELOAD=libnuse-linux.so 
 ping www.google.com ping(8) => socket(2) => nuse_socket() => raw(7) => (network) 18
  • 19. When it’s useful? ad-hoc network stack (network stack personalization) LD_PRELOAD=liblinux-mptcp.so firefox Bundle with kernel bypasses Intel DPDK / netmap / PF_RING / etc. debugging/testing with ns-3 19
  • 20. Testing workflow 1.Write/modify code (patches) 2.Write a test code (incl. packet exchanges) 3.if PASS; accept pull-request
 else; rejects 20
  • 22. T1) write a patch 22 Fixes: de3b7a06dfe1 ("xfrm6: Fix transport header offset in _decode_session6.")! Signed-off-by: Hajime Tazaki <tazaki@sfc.wide.ad.jp>! ---! net/ipv6/xfrm6_policy.c | 1 +! 1 file changed, 1 insertion(+)! ! diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c! index 48bf5a0..8d2d01b 100644! --- a/net/ipv6/xfrm6_policy.c! +++ b/net/ipv6/xfrm6_policy.c! @@ -200,6 +200,7 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse)! ! #if IS_ENABLED(CONFIG_IPV6_MIP6)! ! ! case IPPROTO_MH:! +! ! ! offset += ipv6_optlen(exthdr);! ! ! ! if (!onlyproto && pskb_may_pull(skb, nh + offset + 3 - skb->data)) {! ! ! ! ! struct ip6_mh *mh;! http://patchwork.ozlabs.org/patch/436351/
  • 23. T2) write a test As ns-3 scenario C++ or python create a topology config nodes run/check results (e.g., ping6) 23 +-----------+! | HA |! +-----------+! |sim0! +----------+------------+! |sim0 |sim0! sim2+----+---+ +----+---+! - - -| AR1 | | AR2 |! +---+----+ +----+---+! |sim1 |sim1! | |! sim0 sim0! +----+------+ (Movement) +----+-----+! | MR | <=====> | MR |! +-----------+ +----------+! |sim1 |sim1! +---------+ +---------+! | MNN | | MNN |! +---------+ +---------+! http://code.nsnam.org/thehajime/ns-3-dce-umip/file/tip/test/dce-umip-test.cc
  • 24. 24 #!/usr/bin/python! ! from ns.dce import *! from ns.core import *! ! nodes = NodeContainer()! nodes.Create (100)! dce = DceManagerHelper()! dce.SetNetworkStack ("liblinux.so")! dce.Install (nodes)! ! app = DceApplicationHelper()! app.SetBinary ("ping6")! app.Install (nodes)! (snip)! ! NS_TEST_ASSERT_MSG_EQ (m_pingStatus, true, "Umip test " << m_testname! << " did not return successfully: " << g_testError)! ! Simulator.Stop (Seconds(1000.0))! Simulator.Run ()
  • 25. Performance of NUSE 10G Ethernet back-to-back transmission IP forwarding native Linux, raw socket, tap, dpdk, netmap 25
  • 26. Performance: setup 26 10G10G NUSE node Tx/Rx nodes CPU Xeon E5-2650v2 @ 2.60GHz (16 core) Xeon L3426 @ 1.87GHz (8 core) Memory 32GB 4GB NIC Intel X520 Intel X520 OS host:3.13.0-32 nuse: 3.17.0-rc1 host:3.13.0-32 ping! flowgen vnstat! (packet count) Tx NUSE Rx ping! flowgen
  • 27. Host Tx 27 RxNUSE ping (RTT) throughput (1024byte,UDP) 0 1000 2000 3000 4000 5000 6000 dpdk native netmap raw tap Throughput(Mbps) 0 0.2 0.4 0.6 0.8 1 dpdk native netmap raw tap RTT(ms) native: ping A.B.C.D! others: ./nuse ping A.B.C.D
  • 28. L3 Routing Sender->NUSE->Receiver 28 Tx RxNUSE ping (RTT) throughput (1024byte,UDP) 0 1000 2000 3000 4000 5000 6000 dpdk native netmap raw tap Throughput(Mbps) 0 0.2 0.4 0.6 0.8 1 dpdk native netmap raw tap RTT(ms)
  • 29. Alternatives UML/LKL (1proc/1vm, no POSIX i/f) Containers (can’t change kernel) scratch-based (mTCP,Mirage) rumpkernel (in NetBSD) 29
  • 30. Limitations ad-hoc kernel glues required when we changed a member of a struct, LibOS needs to follow it Performance drawbacks on NUSE adapt known techniques (mTCP) 30
  • 31. (not) Conclusions An abstraction for multiple benefits Conservative Use past decades effort as much with a small amount of effort Planing to RFC for upstreaming 31
  • 34. Bug reproducibility 34 Wi-Fi Wi-Fi Home Agent AP1 AP2 handoff ping6 mobile node correspondent node (gdb) b mip6_mh_filter if dce_debug_nodeid()==0
 Breakpoint 1 at 0x7ffff287c569: file net/ipv6/mip6.c, line 88. <continue> (gdb) bt 4 #0  mip6_mh_filter (sk=0x7ffff7f69e10, skb=0x7ffff7cde8b0) at net/ipv6/mip6.c:109 #1  0x00007ffff2831418 in ipv6_raw_deliver (skb=0x7ffff7cde8b0, nexthdr=135) 
 at net/ipv6/raw.c:199 #2  0x00007ffff2831697 in raw6_local_deliver (skb=0x7ffff7cde8b0, nexthdr=135) 
 at net/ipv6/raw.c:232 #3  0x00007ffff27e6068 in ip6_input_finish (skb=0x7ffff7cde8b0) at net/ipv6/ip6_input.c:197
  • 35. Debugging Memory error detection among distributed nodes in a single process using Valgrind ! ! 35 ==5864== Memcheck, a memory error detector ==5864== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al. ==5864== UsingValgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright in ==5864== Command: ../build/bin/ns3test-dce-vdl --verbose ==5864== ==5864== Conditional jump or move depends on uninitialised value(s) ==5864== at 0x7D5AE32: tcp_parse_options (tcp_input.c:3782) ==5864== by 0x7D65DCB: tcp_check_req (tcp_minisocks.c:532) ==5864== by 0x7D63B09: tcp_v4_hnd_req (tcp_ipv4.c:1496) ==5864== by 0x7D63CB4: tcp_v4_do_rcv (tcp_ipv4.c:1576) ==5864== by 0x7D6439C: tcp_v4_rcv (tcp_ipv4.c:1696) ==5864== by 0x7D447CC: ip_local_deliver_finish (ip_input.c:226) ==5864== by 0x7D442E4: ip_rcv_finish (dst.h:318) ==5864== by 0x7D2313F: process_backlog (dev.c:3368) ==5864== by 0x7D23455: net_rx_action (dev.c:3526) ==5864== by 0x7CF2477: do_softirq (softirq.c:65) ==5864== by 0x7CF2544: softirq_task_function (softirq.c:21) ==5864== by 0x4FA2BE1: ns3::TaskManager::Trampoline(void*) (task-manage ==5864== Uninitialised value was created by a stack allocation ==5864== at 0x7D65B30: tcp_check_req (tcp_minisocks.c:522) ==5864==
  • 36. Fine-grained parameter coverage 36 Code coverage measurement with DCE With fine-grained network, node, protocol parameters
  • 37. 1) kernel build build kernel source tree w/ the patch make menuconfig ARCH=sim make library ARCH=sim ➔ libnuse-linux-3.17-rc1.so 37
  • 39. Tx callgraph 39 sendmsg () (socket API) lib_sock_sendmsg () (NUSE) sock_sendmsg () ip_send_skb () ip_finish_output2 () dst_neigh_output () (existing neigh_resolve_output () -kernel) arp_solicit () dev_queue_xmit () lib_dev_xmit () (NUSE) nuse_vif_raw_write ()
  • 40. start_thread () (pthread) nuse_netdev_rx_trampoline () nuse_vif_raw_read () (NUSE) lib_dev_rx () netif_rx () (ex-kernel) Rx callgraph 40 start_thread () (pthread) do_softirq () (NUSE) net_rx_action () process_backlog () (ex-kernel) __netif_receive_skb_core () ip_rcv () vNIC! rx softirq! rx