PFQ @ 9th Italian Networking Workshop (Courmayeur)
1. PFQ: a Novel Architecture for Packet Capture on Parallel Commodity Hardware
Nicola Bonelli, Andrea Di Pietro,
Stefano Giordano, Gregorio Procissi
CNIT e Dip. di Ingegneria dell’Informazione - Università di Pisa
2. Outline
• Introduction and motivation
• Multi-core programming guidelines
• PFQ architecture
• Performance evaluation
• Conclusion and future work
3. Introduction and Motivations
• Running monitoring applications on fast links with commodity hardware is a very
challenging task
– The hardware has evolved: 10 Gbit/s links, multi-core architectures and multi-queue
network devices…
• The present software for packet capturing, including some parts of the Linux
kernel, is not suitable for the new hardware.
– (+) kernel support for multi-queue network adapters is now implemented
– (-) PF_PACKET is extremely slow, even when used in memory-map mode (pcap)
• The Linux networking subsystem is slow and largely unnecessary for monitoring applications
– (-) PF_RING is designed for single-processor systems
• Traffic monitoring is not limited to packet capturing; a capture engine should also:
– Exploit the current hardware, scaling as close to linearly as possible with the number of cores
– Decouple hardware parallelism from software parallelism
– Use a divide-and-conquer approach to steer packets to applications
4. Multi-thread on Multi-core (1)
• What’s wrong with the current software?
– Previous multi-threading paradigms used for single-processor systems
are still valid, but prevent the software from scaling with the number
of cores.
• For software on a multi-core system to be effective…
– Semaphores, mutexes, R/W mutexes and spinlocks are out of the
question!
– Atomic operations are required, but must be used with moderation
• software design determines the use of atomic operations
– Sharing (writes to shared data) must be used with moderation too
– False-sharing must and can always be avoided
• Wait-free algorithms, as well as cache-oblivious algorithms, are
our friends
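The false-sharing guideline can be illustrated with per-core counters. This is a minimal sketch in user-space C11, not PFQ code: the 64-byte cache line, the core count and all names are assumptions made for the example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */
#define MAX_CORES  12   /* hypothetical number of cores */

/* One counter per core, each alone on its own cache line, so a write by
 * one core never invalidates the line holding another core's counter. */
struct percore_counter {
    alignas(CACHE_LINE) _Atomic uint64_t packets;
};

static struct percore_counter counters[MAX_CORES];

/* Called only by the owning core: a relaxed load and store, with no
 * atomic read-modify-write and no lock. */
static inline void count_packet(unsigned core)
{
    assert(core < MAX_CORES);
    uint64_t v = atomic_load_explicit(&counters[core].packets,
                                      memory_order_relaxed);
    atomic_store_explicit(&counters[core].packets, v + 1,
                          memory_order_relaxed);
}

/* Any thread may sum the private counters; the total can be slightly stale. */
static inline uint64_t total_packets(void)
{
    uint64_t sum = 0;
    for (unsigned i = 0; i < MAX_CORES; i++)
        sum += atomic_load_explicit(&counters[i].packets,
                                    memory_order_relaxed);
    return sum;
}
```

The writer never shares a cache line with another writer, and the only cross-core traffic comes from the occasional reader.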
5. PFQ preamble
• PFQ is a novel capture system natively supporting 64-bit multi-core architectures,
designed according to all the guidelines presented above to provide the best possible
performance
• PFQ does not memory map packet descriptors of the device driver to user-space
(like most commercial vendor products do)
• PFQ is not a custom driver (such as NetMap or PF_RING DNA): it is an architecture
running on top of standard Ethernet drivers, as well as slightly modified “PFQ-aware
drivers” (an idea inherited from PF_RING)
• PFQ enables packet capturing, filtering, aggregation of hardware queues and devices,
packet classification, packet steering and so forth…
• PFQ pre-processing is ideal for bidirectional connection balancing, VoIP and different
kinds of tunnels, tasks otherwise left to the user-space applications.
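Bidirectional connection balancing needs a hash that gives the same value for both directions of a flow. The sketch below shows one common way to get that property; the `flow_key` layout and function names are hypothetical and not taken from PFQ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key; field names are illustrative only. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* 32-bit mixing step (the finalizer of MurmurHash3). */
static inline uint32_t mix32(uint32_t h)
{
    h ^= h >> 16; h *= 0x85ebca6bu;
    h ^= h >> 13; h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* XOR is commutative, so swapping source and destination leaves the
 * input of mix32() unchanged: both directions hash to the same value. */
static inline uint32_t symmetric_hash(const struct flow_key *k)
{
    assert(k != NULL);
    uint32_t ips   = k->src_ip ^ k->dst_ip;
    uint32_t ports = (uint32_t)(k->src_port ^ k->dst_port);
    return mix32(ips ^ ((ports << 16) | ports));
}
```

With such a hash, requests and replies of the same connection reach the same user-space socket, at the price of also colliding flows that differ only by a swap of endpoints.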
6. PFQ architecture
Built on top of the following components…
• DB-MPSC queue: double-buffered, multiple-producer single-consumer queue (for the
communication to user-space):
– allows concurrent NAPI contexts to enqueue packets
– reduces sharing and eliminates false sharing between user-space and NAPI contexts
– enables user-space copies from the queue to a private buffer in a batch fashion
• De-multiplexing Matrix:
– fully concurrent data structure (only benign race conditions)
– no serialization is required to steer/copy packets
• SPSC queue:
– enables batching for sk_buff and increases locality for fast packet handlers
• Driver aware:
– an effective idea inherited from PF_RING
8. Prefetching queue
• Memory allocation in kernels prior to 2.6.39 had a spinlock
on the fast path that serialized threads of execution
• Allocation/deallocation of sk_buffs was not completely
parallelized, even when running on different physical cores
• Batch processing is a well-known and efficient technique:
– Optimizes cache effectiveness through temporal locality of
reference
– Reduces the probability of contention on the alloc/dealloc
structures
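The batching idea can be sketched with a single-producer single-consumer ring that the consumer drains several items at a time. This is a user-space C11 illustration under assumed names and sizes; the PFQ queue itself holds sk_buffs inside the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 256u   /* hypothetical capacity, must be a power of two */

struct spsc_ring {
    void *slot[RING_SIZE];
    _Atomic size_t head;   /* written by the consumer only */
    _Atomic size_t tail;   /* written by the producer only */
};

/* Producer side: returns 0 when the ring is full, 1 on success. */
static inline int spsc_push(struct spsc_ring *r, void *item)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE)
        return 0;
    r->slot[tail & (RING_SIZE - 1)] = item;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}

/* Consumer side: take up to max items in one pass, then publish the new
 * head once, so the shared index is touched once per batch. */
static inline size_t spsc_pop_batch(struct spsc_ring *r, void **out, size_t max)
{
    assert(out != NULL);
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    size_t n = tail - head;
    if (n > max)
        n = max;
    for (size_t i = 0; i < n; i++)
        out[i] = r->slot[(head + i) & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, head + n, memory_order_release);
    return n;
}
```

Processing or freeing the drained items back to back keeps the relevant data hot in cache and touches the allocator in bursts rather than once per packet.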
9. Packet steering
• Per socket filtering is a common paradigm in capture
engines
– Linearly scanning the socket list to check which sockets are
interested in each packet is O(n)!
• In a multi-core environment we need a new paradigm:
packet steering
• Completely concurrent block (wait-free):
– Shared state is mostly read-only
– Bitmap-based state that can be updated through atomics (supports up
to 64 sockets)
– Socket selection is ~O(1)
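A bitmap-based steering table can be sketched as follows. The layout is an assumption (one 64-bit word per hardware queue, one bit per socket), the names are hypothetical, and `__builtin_ctzll` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_QUEUES 64   /* hypothetical number of hardware queues */

/* steer[q] has bit s set when socket s wants traffic from queue q. */
static _Atomic uint64_t steer[MAX_QUEUES];

/* Slow path: subscriptions change rarely and use one atomic OR/AND. */
static inline void subscribe(unsigned queue, unsigned sock)
{
    assert(queue < MAX_QUEUES && sock < 64);
    atomic_fetch_or_explicit(&steer[queue], UINT64_C(1) << sock,
                             memory_order_release);
}

static inline void unsubscribe(unsigned queue, unsigned sock)
{
    assert(queue < MAX_QUEUES && sock < 64);
    atomic_fetch_and_explicit(&steer[queue], ~(UINT64_C(1) << sock),
                              memory_order_release);
}

/* Fast path: a single atomic load yields the whole destination set. */
static inline uint64_t destinations(unsigned queue)
{
    assert(queue < MAX_QUEUES);
    return atomic_load_explicit(&steer[queue], memory_order_acquire);
}

/* Expand a mask into socket indexes; the loop runs once per set bit,
 * not once per existing socket. Returns the number of indexes written. */
static inline unsigned expand(uint64_t mask, unsigned *out)
{
    unsigned n = 0;
    while (mask) {
        out[n++] = (unsigned)__builtin_ctzll(mask);
        mask &= mask - 1;   /* clear the lowest set bit */
    }
    return n;
}
```

Readers never block writers and vice versa; a packet that races with a subscription change is simply steered with either the old or the new mask.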
10. Packet steering
• Given a packet and a set of sockets, which socket needs to receive it?
– Filtering (possibly no socket needs to receive the packet)
– Load balancing (balance across multiple sockets based on a hash function)
• Load balancing groups:
– A socket can subscribe to a load balancing group
– It will receive a fraction of the overall traffic
• Simple subscription:
– A socket can subscribe to all of the traffic coming from one or more hardware
queues
• Both modes can be supported concurrently:
– Copy and balancing are handled by PFQ
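The two modes can be combined in a few lines. In this sketch a load balancing group and the set of simple subscribers are both 64-bit masks of socket indexes; this representation and the function names are assumptions, and the `__builtin_*` calls are GCC/Clang builtins.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the n-th set bit of mask, counting from 0.
 * Requires n < popcount(mask). */
static inline unsigned nth_set_bit(uint64_t mask, unsigned n)
{
    assert(n < (unsigned)__builtin_popcountll(mask));
    while (n--)
        mask &= mask - 1;   /* drop the lowest set bit */
    return (unsigned)__builtin_ctzll(mask);
}

/* Load balancing: the hash picks one member of the group.
 * Returns -1 when the group is empty. */
static inline int balance(uint64_t group, uint32_t hash)
{
    unsigned members = (unsigned)__builtin_popcountll(group);
    if (members == 0)
        return -1;
    return (int)nth_set_bit(group, hash % members);
}

/* Final destination set: every simple subscriber gets a copy, plus the
 * single group member selected by the hash. An empty result means the
 * packet is filtered out. */
static inline uint64_t steer_packet(uint64_t subscribers, uint64_t group,
                                    uint32_t hash)
{
    uint64_t dest = subscribers;
    int s = balance(group, hash);
    if (s >= 0)
        dest |= UINT64_C(1) << s;
    return dest;
}
```

For example, with group members {1, 4, 9} the hashes 0, 1 and 2 select sockets 1, 4 and 9 respectively, and hash 3 wraps around to socket 1.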
11. Socket queue: DB-MPSC
• This is an unavoidable contention point:
– Load balancing shuffles packets across sockets
• How to handle contention without impacting performance?
– Use a wait-free algorithm: DB-MPSC queues (double-buffered multi-producer
single-consumer)
– Supports copies/balancing
– Reduces coherence traffic among cores: a single atomic operation per packet,
to be amortized in future implementations
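One possible shape for a double-buffered MPSC queue is sketched below. It is not the PFQ implementation: the control-word layout, the NULL "not committed yet" marker and all names are assumptions, and items are stored as pointers rather than copied packets. Producers pay one atomic fetch-add per item and never wait; the single consumer flips the buffers with one atomic exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS     1024u         /* slots per buffer, hypothetical */
#define INDEX_BIT 0x80000000u   /* top bit of ctrl selects the active buffer */

struct dbmpsc {
    _Atomic uint32_t ctrl;             /* [active buffer | slots reserved] */
    _Atomic(void *)  slot[2][SLOTS];   /* NULL means "not committed yet" */
};

/* Producer (any number of threads): one fetch-add reserves a private slot.
 * Returns 0 when the active buffer is full and the item is dropped. */
static inline int dbmpsc_push(struct dbmpsc *q, void *item)
{
    assert(item != NULL);
    /* Pre-check so that a full buffer does not keep bumping the counter. */
    if ((atomic_load_explicit(&q->ctrl, memory_order_relaxed) & ~INDEX_BIT) >= SLOTS)
        return 0;
    uint32_t c   = atomic_fetch_add_explicit(&q->ctrl, 1, memory_order_acq_rel);
    uint32_t pos = c & ~INDEX_BIT;
    if (pos >= SLOTS)
        return 0;
    atomic_store_explicit(&q->slot[c >> 31][pos], item, memory_order_release);
    return 1;
}

/* Consumer (single thread): flip the buffers, then drain the retired one.
 * out must have room for SLOTS pointers. Returns the number of items. */
static inline size_t dbmpsc_swap(struct dbmpsc *q, void **out)
{
    uint32_t cur  = atomic_load_explicit(&q->ctrl, memory_order_relaxed);
    uint32_t next = ~cur & INDEX_BIT;   /* other buffer, counter reset to 0 */
    uint32_t old  = atomic_exchange_explicit(&q->ctrl, next, memory_order_acq_rel);
    uint32_t n    = old & ~INDEX_BIT;
    if (n > SLOTS)
        n = SLOTS;
    for (uint32_t i = 0; i < n; i++) {
        _Atomic(void *) *s = &q->slot[old >> 31][i];
        void *item;
        /* A producer may have reserved the slot but not stored into it yet. */
        while ((item = atomic_load_explicit(s, memory_order_acquire)) == NULL)
            ;
        out[i] = item;
        atomic_store_explicit(s, NULL, memory_order_relaxed);
    }
    return n;
}
```

In this sketch only the producers are wait-free: the consumer may spin briefly on a slot that was reserved just before the flip. Draining a whole buffer per flip gives the batch-style copy to user space described earlier.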
12. Testbed: Mascara & Monsters
• Mascara (traffic generation):
– Dual Xeon 6-core L5640 @ 2.27 GHz, 24 GB RAM
– Intel 82599 multi-queue 10G Ethernet adapter
– New socket PF_DIRECT for generation
– By deploying 3-4 cores, it is possible to generate up to 13 Mpps of 64-byte packets
• 10 Gb link between the two machines
• Monsters (traffic capture):
– Xeon 6-core X5650 @ 2.57 GHz, 12 GB RAM
– Intel 82599 multi-queue 10G Ethernet adapter
– PFQ on board for traffic capture
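To put the 13 Mpps figure in context, the theoretical packet rate of an Ethernet link can be computed from the frame size plus the fixed per-frame overhead on the wire (8 bytes of preamble and start-of-frame delimiter, 12 bytes of inter-frame gap). The helper name below is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame overhead on the wire: preamble + SFD (8 bytes) plus the
 * minimum inter-frame gap (12 bytes). */
#define ETH_OVERHEAD 20u

/* Maximum frames per second on a link of link_bps bits per second, for
 * frames of frame_bytes bytes (FCS included, minimum 64). */
static inline uint64_t line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
    assert(frame_bytes >= 64);
    return link_bps / ((uint64_t)(frame_bytes + ETH_OVERHEAD) * 8u);
}
```

For a 10 Gbit/s link and 64-byte frames this gives 10,000,000,000 / 672 = 14,880,952 packets per second, so 13 Mpps is roughly 87% of the line rate.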
15. Load balancing across user space sockets
• Keep the number of capturing NAPI contexts fixed (12 with Intel
hyper-threading)
• Change the number of user space threads
• Result: all of the traffic is captured with just 3 user space threads
16. Packet copy
• Copying the same traffic to a variable number of user space threads
• Still 12 NAPI contexts within the kernel
17. Future directions
• Work on a new packet steering framework:
– How can we distribute packets according to application-specific
semantics?
• Implement balancing groups
• Each group is associated with an “application specific hash function”
• Bind a set of sockets to each group
• Use case: VoIP analysis
– Steer control traffic to a specific core
– Load balance candidate RTP flows across a variable number of
sockets
• Easy (but inaccurate): stateless heuristic
• Hard: implement a distributed stateful heuristic, where each core
works on a private state that is then synchronized with those of other
cores periodically…
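The "easy but inaccurate" option can be sketched as a stateless check on each UDP payload. The tests below follow the fixed RTP header layout of RFC 3550; the function name and the even-port rule (a convention, not a guarantee) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stateless guess: does this UDP payload look like an RTP packet?
 * False positives and false negatives are both possible. */
static inline int looks_like_rtp(const uint8_t *payload, size_t len,
                                 uint16_t dst_port)
{
    assert(payload != NULL);
    if (len < 12)
        return 0;               /* the fixed RTP header is 12 bytes */
    if ((payload[0] >> 6) != 2)
        return 0;               /* version field must be 2 */
    uint8_t pt = payload[1] & 0x7f;
    if (pt >= 72 && pt <= 76)
        return 0;               /* these collide with RTCP types 200..204 */
    if (dst_port & 1)
        return 0;               /* RTP conventionally uses even ports */
    return 1;
}
```

A steering function could send packets that pass this test to the balancing group of the RTP analyzers and everything else on the signalling ports to the control core. The stateful variant would instead learn the media ports from the control traffic.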
18. Conclusions
• Modern commodity architectures are increasingly parallel
• Huge potential for software-based network devices
• Strict adherence to coding and design rules is required
• PFQ
– A novel packet capturing engine
– Better scalability with respect to competitors
– Flexible packet steering
– Decouples kernel space and user space parallelism
• PFQ webpage and download:
– netgroup.iet.unipi.it/software/pfq