IRQs: the Hard, the Soft, the Threaded and the Preemptible
Alison Chaiken
Latest version of these slides
alison@she-devel.com
Embedded Linux Conference Europe
Oct 11, 2016
Example code
Version 2, actually presented live
2
Thursday October 13, 2016 15:30:
Debugging Methodologies for Realtime Issues
Joel Fernandes, Google
this same room
Knocking at Your Back Door (or How Dealing with
Modern Interrupt Architectures can Affect Your Sanity)
Marc Zyngier, ARM Ltd
Hall Berlin A
3
Agenda
● Why do IRQs exist?
● About kinds of hard-IRQ handlers
● About softirqs and tasklets
● Differences in IRQ handling between RT and non-RT kernels
● Studying IRQ behavior via kprobes, event tracing, mpstat and eBPF
● Detailed example: when does NAPI take over for eth IRQs?
“Art cannot be taught. They must once again be absorbed into the workshop.” -- Walter Gropius
4
Sample questions to be answered
● What's all that stuff in /proc/interrupts anyway?
● What are IPIs and NMIs?
● Why are atomic operations expensive for ARM?
● What are the differences between mainline and RT for softirqs?
● What is the 'current' task while in softirq?
● What function is running inside the threaded IRQs?
● When do we switch from individual hard IRQ processing to NAPI?
5
Interrupt handling: a brief pictorial summary
Top half: the hard IRQ
Bottom half: the soft IRQ
Image credits: Dennis Jarvis, http://tinyurl.com/jmkw23h; onefulllife, http://tinyurl.com/j25lal5
6
Why do we need interrupts at all?
● IRQs allow devices to notify the kernel that they require maintenance.
● Alternatives include
  - polling (servicing devices at a pre-configured interval);
  - traditional IPC to user-space drivers.
● Even a single-threaded RTOS or a bootloader needs a system timer.
7
Interrupts in Das U-Boot
● For ARM, minimal IRQ support:
  - clear exceptions and reset timer (e.g., arch/arm/lib/interrupts_64.c or arch/arm/cpu/armv8/exceptions.S)
● For x86, interrupts are serviced via a stack-push followed by a jump (arch/x86/cpu/interrupts.c)
  - PCI has full-service interrupt handling (arch/x86/cpu/irq.c)
8
Interrupts in RTOS: Xenomai/ADEOS IPIPE
From Adeos website, covered by GFDL
9
Zoology of IRQs
● Hard versus soft
● Level- vs. edge-triggered, simple, fast EOI or per-CPU
● Local vs. global; System vs. device
● Maskable vs. non-maskable
● Shared or not; chained or not
● Multiple interrupt controllers per SoC
'cat /proc/interrupts' or 'mpstat -A'
Image credit: By BirdBeaksA.svg: L. Shyamal, derivative work: Leptictidium (talk) - BirdBeaksA.svg, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=6626434
10
ARM IPIs, from arch/arm/kernel/smp.c
$ # cat /proc/interrupts, look at bottom
void handle_IPI(int ipinr, struct pt_regs *regs)
{
    unsigned int cpu = smp_processor_id();

    /* Abridged: IRQ entry/exit accounting around each handler is omitted. */
    switch (ipinr) {
    case IPI_TIMER:
        tick_receive_broadcast();
        break;
    case IPI_RESCHEDULE:
        scheduler_ipi();
        break;
    case IPI_CALL_FUNC:
        generic_smp_call_function_interrupt();
        break;
    case IPI_CPU_STOP:
        ipi_cpu_stop(cpu);
        break;
    case IPI_IRQ_WORK:
        irq_work_run();
        break;
    case IPI_COMPLETION:
        ipi_complete(cpu);
        break;
    }
}
Handlers are in
kernel/sched/core.c
11
What is an NMI?
● A 'non-maskable' interrupt is related to:
  - HW problem: parity error, bus error, watchdog timer expiration . . .
  - also used by perf
/* non-maskable interrupt control */
#define NMICR_NMIF 0x0001 /* NMI pin interrupt flag */
#define NMICR_WDIF 0x0002 /* watchdog timer overflow */
#define NMICR_ABUSERR 0x0008 /* async bus error flag */
From arch/mn10300/include/asm/intctl-regs.h
Image credit: By John Jewell - Fenix, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=49332041
12
How IRQ masking works
arch/arm/include/asm/irqflags.h:
#define arch_local_irq_enable arch_local_irq_enable
static inline void arch_local_irq_enable(void)
{ asm volatile(
"cpsie i @ arch_local_irq_enable"
::: "memory", "cc"); }
arch/arm64/include/asm/irqflags.h:
static inline void arch_local_irq_enable(void)
{ asm volatile(
"msr daifclr, #2 // arch_local_irq_enable"
::: "memory"); }
arch/x86/include/asm/irqflags.h:
static inline notrace void arch_local_irq_enable(void)
{ native_irq_enable(); }
static inline void native_irq_enable(void)
{ asm volatile("sti": : :"memory"); }
Annotations: 'cpsie' stands for “change processor state”, interrupt enable; these instructions affect only the current core.
13
x86's Infamous System Management Interrupt
● SMI jumps out of kernel into System Management Mode
  - controlled by System Management Engine (Skochinsky)
● Identified as security vulnerability by Invisible Things Lab
● Not directly visible to Linux
● Traceable via hw_lat detector (sort of)
[RFC][PATCH 1/3] tracing: Added hardware latency tracer, Aug 4
From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
The hardware latency tracer has been in the PREEMPT_RT patch for some
time. It is used to detect possible SMIs or any other hardware interruptions that
the kernel is unaware of. Note, NMIs may also be detected, but that may be
good to note as well.
14
ARM's Fast Interrupt reQuest
● An NMI with optimized handling due to dedicated registers.
● Underutilized by Linux drivers.
● Serves as the basis for Android's fiq_debugger.
15
IRQ 'Domains' Correspond to Different INTCs
CONFIG_IRQ_DOMAIN_DEBUG:
This option will show the mapping relationship between hardware irq
numbers and Linux irq numbers. The mapping is exposed via debugfs
in the file "irq_domain_mapping".
Note:
● There are a lot more IRQs than in /proc/interrupts.
● There are more IRQs in /proc/interrupts than in 'ps axl | grep irq'.
● Some IRQs are not used.
● Some are processor-reserved and not kernel-managed.
Example: i.MX6 General Power Controller
Unmasked IRQs can wake up sleeping power domains.
Threaded IRQs in RT kernel
Run 'ps axl | grep irq' with both RT and non-RT kernels.
Handling IRQs as kernel threads allows priority and CPU affinity to be managed individually.
IRQ handlers running in threads can themselves be interrupted.
18
Quiz: What will we see with 'ps axl | grep irq' for non-RT kernels? Why?
What function do threaded IRQs run?
/* request_threaded_irq - allocate an interrupt line
* @handler: Function to be called when the IRQ occurs.
* Primary handler for threaded interrupts
* If NULL and thread_fn != NULL the default
* primary handler is installed
*
* @thread_fn: Function called from the irq handler thread
* If NULL, no irq thread is created
*/
Even in mainline, request_irq() = request_threaded_irq() with a NULL thread_fn.
EXAMPLE
20
CASE 0: indirect invocation of request_threaded_irq()
request_irq(handler) -> request_threaded_irq(handler, NULL) -> irq_setup_forced_threading()
Result:
-- irq_default_primary_handler() runs in interrupt context.
-- All it does is wake up the thread.
-- Then handler runs in irq/<name> thread.
CASE 1: direct invocation of request_threaded_irq()
Result:
-- handler runs in interrupt context.
-- thread_fn runs in irq/<name> thread.
21
Threaded IRQs in RT, mainline and mainline with “threadirqs” boot param
● RT: all hard-IRQ handlers that don't set IRQF_NO_THREAD run in threads.
● Mainline: only those hard-IRQ handlers registered explicitly via request_threaded_irq() run in threads.
● Mainline with threadirqs kernel cmdline: like RT, but CPU affinity of IRQ threads cannot be set.
genirq: Force interrupt thread on RT
genirq: Do not invoke the affinity callback via a workqueue on RT
22
Shared interrupts: mmc driver
● Check 'ps axl | grep irq | grep mmc':
1 0 122 2 -51 0 - S ? 0:00 [irq/16-mmc0]
1 0 123 2 -50 0 - S ? 0:00 [irq/16-s-mmc0]
● 'cat /proc/interrupts': mmc and ehci-hcd share an IRQ line
16: 204 IR-IO-APIC 16-fasteoi mmc0,ehci_hcd:usb3
● drivers/mmc/host/sdhci.c:
ret = request_threaded_irq(host->irq, sdhci_irq, sdhci_thread_irq,
                           IRQF_SHARED, mmc_hostname(mmc), host);
Here sdhci_irq is the handler and sdhci_thread_irq is the thread_fn.
Why are atomic operations more expensive (ARM)?
arch/arm/include/asm/atomic.h (pre-ARMv6 fallback; ARMv6 and later use ldrex/strex instead):
#define ATOMIC_OP(op, c_op, asm_op) \
static inline void atomic_##op(int i, atomic_t *v) \
{ unsigned long flags; \
  raw_local_irq_save(flags); \
  v->counter c_op i; \
  raw_local_irq_restore(flags); }
include/linux/irqflags.h:
#define raw_local_irq_save(flags) \
  do { flags = arch_local_irq_save(); } while (0)
arch/arm/include/asm/irqflags.h:
/* Save the current interrupt enable state & disable IRQs */
static inline unsigned long arch_local_irq_save(void) { . . . }
24
Introduction to softirqs
In kernel/softirq.c:
const char * const softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
"TASKLET", "SCHED", "HRTIMER", "RCU"
};
Slide annotations: the entries fall into three groups: tasklet interface, raised by devices, and kernel housekeeping.
BLOCK_IOPOLL is named IRQ_POLL since 4.4; HRTIMER is gone since 4.1.
In ksoftirqd, softirqs are serviced in the listed order.
25
What are tasklets?
● Tasklets perform deferred work not handled by other softirqs.
● Examples: crypto, USB, DMA, keyboard . . .
● More latency-sensitive drivers (sound, PCI) are part of tasklet_hi_vec.
● Any driver can create a tasklet.
● tasklet_hi_schedule() or tasklet_schedule() is called directly by the ISR.
const char * const softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
"TASKLET", "SCHED", "HRTIMER", "RCU"
};
26
[alison@sid ~]$ sudo mpstat -I SCPU
Linux 4.1.0-rt17+ (sid) 05/29/2016 _x86_64_(4 CPU)
CPU HI/s TIMER/s NET_TX/s NET_RX/s BLOCK/s TASKLET/s SCHED/s HRTIMER/s RCU/s
0 0.03 249.84 0.00 0.11 19.96 0.43 238.75 0.68 0.00
1 0.01 249.81 0.38 1.00 38.25 1.98 236.69 0.53 0.00
2 0.02 249.72 0.19 0.11 53.34 3.83 233.94 1.44 0.00
3 0.59 249.72 0.01 2.05 19.34 2.63 234.04 1.72 0.00
Linux 4.6.0+ (sid) 05/29/2016 _x86_64_(4 CPU)
CPU HI/s TIMER/s NET_TX/s NET_RX/s BLOCK/s TASKLET/s SCHED/s HRTIMER/s RCU/s
0 0.26 16.13 0.20 0.33 40.90 0.73 9.18 0.00 19.04
1 0.00 9.45 0.00 1.31 14.38 0.61 7.85 0.00 17.88
2 0.01 15.38 0.00 0.20 0.08 0.29 13.21 0.00 16.24
3 0.00 9.77 0.00 0.05 0.15 0.00 8.50 0.00 15.32
Linux 4.1.18-rt17-00028-g8da2a20 (vpc23) 06/04/16 _armv7l_ (2 CPU)
CPU HI/s TIMER/s NET_TX/s NET_RX/s BLOCK/s TASKLET/s SCHED/s HRTIMER/s RCU/s
0 0.00 999.72 0.18 9.54 0.00 89.29 191.69 261.06 0.00
1 0.00 999.35 0.00 16.81 0.00 15.13 126.75 260.89 0.00
Linux 4.7.0 (nitrogen6x) 07/31/16 _armv7l_ (4 CPU)
CPU HI/s TIMER/s NET_TX/s NET_RX/s BLOCK/s TASKLET/s SCHED/s HRTIMER/s RCU/s
0 0.00 2.84 0.50 40.69 0.00 0.38 2.78 0.00 3.03
1 0.00 89.00 0.00 0.00 0.00 0.00 0.64 0.00 46.22
2 0.00 16.59 0.00 0.00 0.00 0.00 0.23 0.00 3.05
3 0.00 10.22 0.00 0.00 0.00 0.00 0.25 0.00 1.45
27
Two paths by which softirqs run
Related demo and sample code
[Flowchart, two columns]
CASE 0 (left): a hard-IRQ handler raises a softirq; it runs via local_bh_enable() in do_current_softirqs() (RT) or __do_softirq().
CASE 1 (right): a system management thread raises a softirq; it runs via run_ksoftirqd() in __do_softirq().
Decision node in the chart: exhausts timeslice?
28
Case 0: Run softirqs at exit of a hard-IRQ handler
RT (4.6.2-rt5):
local_bh_enable();
  __local_bh_enable();
    do_current_softirqs();
      while (current->softirqs_raised) {
        i = __ffs(current->softirqs_raised);
        do_single_softirq(i);
      }
        handle_softirq();
Run softirqs raised in the current context.
non-RT (4.6.2):
local_bh_enable();
  do_softirq();
    __do_softirq();
      handle_pending_softirqs();
        while ((softirq_bit = ffs(pending)))
          handle_softirq();
Run all pending softirqs up to MAX_SOFTIRQ_RESTART.
EXAMPLE
29
Case 1: Scheduler runs the rest from ksoftirqd
RT (4.6.2-rt5):
run_ksoftirqd();
  do_current_softirqs()
  [ where current == ksoftirqd ]
non-RT (4.6.2):
run_ksoftirqd();
  do_softirq();
    __do_softirq();
      h = softirq_vec;
      while ((softirq_bit = ffs(pending)))
      {
        h += softirq_bit - 1;
        h->action(h);
        h++;
        pending >>= softirq_bit;
      }
30
RT vs Mainline: entering softirq handler
4.6.2-rt5:
[ 6937.393805] e1000e_poll+0x126/0xa70 [e1000e]
[ 6937.393808] check_preemption_disabled+0xab/0x240
[ 6937.393815] net_rx_action+0x53e/0xc90
[ 6937.393824] do_current_softirqs+0x488/0xc30
[ 6937.393831] do_current_softirqs+0x5/0xc30
[ 6937.393836] __local_bh_enable+0xf2/0x1a0
[ 6937.393840] irq_forced_thread_fn+0x91/0x140
[ 6937.393845] irq_thread+0x170/0x310
[ 6937.393848] irq_finalize_oneshot.part.6+0x4f0/0x4f0
[ 6937.393853] irq_forced_thread_fn+0x140/0x140
[ 6937.393857] irq_thread_check_affinity+0xa0/0xa0
[ 6937.393862] kthread+0x12b/0x1b0
Annotations: the lower frames (irq_thread and below) are the hard-IRQ handler; the upper frames kick off the softirq.
4.7 mainline:
[11661.191187] e1000e_poll+0x126/0xa70 [e1000e]
[11661.191197] net_rx_action+0x52e/0xcd0
[11661.191206] __do_softirq+0x15c/0x5ce
[11661.191215] irq_exit+0xa3/0xd0
[11661.191222] do_IRQ+0x62/0x110
[11661.191230] common_interrupt+0x82/0x82
Annotations: the lower frames (do_IRQ and below) are the hard-IRQ handler; the upper frames kick off the softirq.
31
Summary of softirq execution paths
Case 0: Behavior of local_bh_enable() differs
significantly between RT and mainline kernel.
Case 1: Behavior of ksoftirqd itself is mostly the
same (note discussion of ktimersoftd below).
32
What is 'current'?
include/asm-generic/current.h:
#define get_current() (current_thread_info()->task)
#define current get_current()
arch/arm/include/asm/thread_info.h:
static inline struct thread_info *current_thread_info(void)
{ return (struct thread_info *) (current_stack_pointer &
~(THREAD_SIZE - 1));
}
arch/x86/include/asm/thread_info.h:
static inline struct thread_info *current_thread_info(void)
{ return (struct thread_info *)(current_top_of_stack() -
THREAD_SIZE);}
In do_current_softirqs(), current is the threaded IRQ task.
33
What is 'current'? part 2
arch/arm/include/asm/thread_info.h:
/*
* how to get the current stack pointer in C
*/
register unsigned long current_stack_pointer asm ("sp");
arch/x86/include/asm/thread_info.h:
static inline unsigned long current_stack_pointer(void)
{
unsigned long sp;
#ifdef CONFIG_X86_64
asm("mov %%rsp,%0" : "=g" (sp));
#else
asm("mov %%esp,%0" : "=g" (sp));
#endif
return sp;
}
34
Q.: When do system-management softirqs get to run?
35
Introducing systemd-irqd!!†
† As suggested by Dave Anders
36
Do timers, scheduler, RCU ever run as part of
do_current_softirqs?
Examples:
-- every jiffy,
raise_softirq_irqoff(HRTIMER_SOFTIRQ);
-- scheduler_ipi() for NOHZ calls
raise_softirq_irqoff(SCHED_SOFTIRQ);
-- rcu_bh_qs() calls
raise_softirq(RCU_SOFTIRQ);
These run when ksoftirqd is current.
37
Demo: kprobe on do_current_softirqs() for RT kernel
● At Github
● Counts calls to do_current_softirqs() from ksoftirqd and from a hard-IRQ handler.
● Tested on 4.4.4-rt11 with Boundary Devices' Nitrogen i.MX6.
Output showing what the task of 'current_thread' is:
[ 52.841425] task->comm is ksoftirqd/1
[ 70.051424] task->comm is ksoftirqd/1
[ 70.171421] task->comm is ksoftirqd/1
[ 105.981424] task->comm is ksoftirqd/1
[ 165.260476] task->comm is irq/43-2188000.
[ 165.261406] task->comm is ksoftirqd/1
[ 225.321529] task->comm is irq/43-2188000.
explanation
38
Softirqs can be pre-empted with PREEMPT_RT
include/linux/sched.h:
struct task_struct {
[ . . . ]
#ifdef CONFIG_PREEMPT_RT_BASE
    struct rcu_head put_rcu;
    int softirq_nestcnt;
    unsigned int softirqs_raised;
#endif
[ . . . ]
};
39
RT-Linux headache: 'softirq starvation'
● ksoftirqd scarcely gets to run.
● Events that are triggered by timer interrupt won't happen.
● Example: main event loop in userspace did not run due to missed timer ticks.
Reference: “Understanding a Real-Time System” by Rostedt, slides and video
Symptoms in the kernel log: “sched: RT throttling activated” or “INFO: rcu_sched detected stalls on CPUs”
40
(partial) RT solution: ktimersoftd
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Wed Jan 20 2016 +0100
softirq: split timer softirqs out of ksoftirqd
With enough networking load it is possible that the system
never goes idle and schedules ksoftirqd and everything else
with a higher priority. One of the tasks left behind is one of
RCU's threads and so we see stalls and eventually run out of
memory. This patch moves the TIMER and HRTIMER
softirqs out of the `ksoftirqd` thread into its own `ktimersoftd`.
The former can now run SCHED_OTHER (same as mainline)
and the latter at SCHED_FIFO due to the wakeups. [ . . . ]
41
42
ftrace produces a copious amount of output
43
Investigating IRQs with eBPF: bcc
● BCC - Tools for BPF-based Linux analysis
● tools/ and examples/ illustrate interfaces to kprobes and uprobes.
● BCC tools are:
  - a convenient way to study arbitrary infrequent events dynamically;
  - based on dynamic code insertion using Clang Rewriter JIT;
  - lightweight due to in-kernel data storage.
44
eBPF, IOvisor and IRQs: limitations
● JIT compiler is currently available for the x86-64, arm64, and s390 architectures.
● No stack traces unless CONFIG_FRAME_POINTER=y
● Requires recent kernel, LLVM and Clang
● bcc/src/cc/export/helpers.h:
#ifdef __powerpc__
[ . . . ]
#elif defined(__x86_64__)
[ . . . ]
#else
#error "bcc does not support this platform yet"
#endif
45
bcc tips
● Kernel source must be present on the host where the probe runs.
● /lib/modules/$(uname -r)/build/include/generated must exist.
● To switch between kernel branches and continue quickly using bcc:
  - run 'make mrproper; make config; make'
  - 'make' need only populate include/generated in kernel source before bcc again becomes available.
  - 'make headers_install' as non-root user
46
Get latest version of clang by compiling from source
(or from Debian Sid)
$ git clone http://llvm.org/git/llvm.git
$ cd llvm/tools
$ git clone --depth 1 http://llvm.org/git/clang.git
$ cd ..; mkdir build; cd build
$ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86"
$ make -j $(getconf _NPROCESSORS_ONLN)
from samples/bpf/README.rst
47
Example: NAPI: changing the bottom half
Image credits: By O. Quincel - Own work, CC BY-SA 4.0; By McSmit - Own work, CC BY-SA 3.0
48
Quick NAPI refresher
The problem:
“High-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process.”
The solution:
“Interrupt mitigation . . . NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.”
The implementation:
Poll the driver and drop packets without processing in the NIC if the polling frequency necessitates.
net/core/dev.c in RT
49
Example: i.MX6 FEC RGMII NAPI turn-on
static irqreturn_t fec_enet_interrupt(int irq, void *dev_id)
[ . . . ]
if ((fep->work_tx || fep->work_rx) && fep->link) {
if (napi_schedule_prep(&fep->napi)) {
/* Disable the NAPI interrupts */
writel(FEC_ENET_MII, fep->hwp + FEC_IMASK);
__napi_schedule(&fep->napi);
}
}
== irq_forced_thread_fn() for irq/43
Back to threaded IRQs
50
Example: i.MX6 FEC RGMII NAPI turn-off
static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
{
    [ . . . ]
    pkts = fec_enet_rx(ndev, budget);
    if (pkts < budget) {
        napi_complete(napi);
        /* Re-enable the interrupts that were masked when polling started */
        writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
    }
    return pkts;
}
netif_napi_add(ndev, &fep->napi, fec_enet_rx_napi, NAPI_POLL_WEIGHT);
Interrupts are re-enabled when budget is not consumed.
Using existing tracepoints
● function_graph tracing causes a lot of overhead.
● How about the napi_poll tracepoint in /sys/kernel/debug/tracing/events/napi?
  - Fires constantly with any network traffic.
  - Displays no obvious change in behavior when eth IRQ is disabled and polling starts.
52
The Much Easier Way:
BCC on x86_64 with
4.6.2-rt5 and Clang-3.8
53
Handling Eth IRQs in ksoftirqd on x86_64, but NAPI?
4.6.2-rt5
root $ ./stackcount.py e1000_receive_skb
Tracing 1 functions for "e1000_receive_skb"
^C
e1000_receive_skb
e1000e_poll
net_rx_action
do_current_softirqs
run_ksoftirqd
smpboot_thread_fn
kthread
ret_from_fork
1
(Count 1: running from ksoftirqd, not from hard IRQ handler.)
e1000_receive_skb
e1000e_poll
net_rx_action
do_current_softirqs
__local_bh_enable
irq_forced_thread_fn
irq_thread
kthread
ret_from_fork
26469
(Count 26469: normal behavior, packet handler runs immediately after eth IRQ, in its context.)
54
Switch to NAPI on x86_64
[alison@sid]$ sudo modprobe kp_ksoft eth_irq_procid=1
[ ] __raise_softirq_irqoff_ksoft: 582 hits
[ ] kprobe at ffffffff81100920 unregistered
[alison@sid]$ sudo ./stacksnoop.py __raise_softirq_irqoff_ksoft
144.803096056 __raise_softirq_irqoff_ksoft
ffffffff81100921 __raise_softirq_irqoff_ksoft
ffffffff810feda9 do_current_softirqs
ffffffff810ffeae run_ksoftirqd
ffffffff8114d255 smpboot_thread_fn
ffffffff81144a99 kthread
ffffffff8205ed82 ret_from_fork
55
Same Experiment, but non-RT 4.6.2
Most frequent:
e1000_receive_skb
e1000e_poll
net_rx_action
__softirqentry_text_start
irq_exit
do_IRQ
ret_from_intr
cpuidle_enter
call_cpuidle
cpu_startup_entry
start_secondary
1016045
Run in ksoftirqd:
e1000_receive_skb
e1000e_poll
net_rx_action
__softirqentry_text_start
run_ksoftirqd
smpboot_thread_fn
kthread
ret_from_fork
1162
At least 70 other call stacks observed in a few seconds.
56
Due to handle_pending_softirqs(), any hard IRQ can run before a
given softirq (non-RT 4.6.2)
e1000_receive_skb
e1000e_poll
net_rx_action
__softirqentry_text_start
irq_exit
do_IRQ
ret_from_intr
pipe_write
__vfs_write
vfs_write
sys_write
entry_SYSCALL_64_fastpath
357
e1000_receive_skb
e1000e_poll
net_rx_action
__softirqentry_text_start
irq_exit
do_IRQ
ret_from_intr
__alloc_pages_nodemask
alloc_pages_vma
handle_pte_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
366
57
Same Experiment, but 4.6.2 with 'threadirqs' boot param
e1000_receive_skb
e1000e_poll
net_rx_action
__softirqentry_text_start
do_softirq_own_stack
do_softirq.part.16
__local_bh_enable_ip
irq_forced_thread_fn
irq_thread
kthread
ret_from_fork
569174
With 'threadirqs' cmdline parameter at boot.
Note: no do_current_softirqs()
58
Investigation on ARM:
kprobe with 4.6.2-rt5
59
Documentation/kprobes.txt
“In general, you can install a probe anywhere in the kernel. In particular, you can probe interrupt handlers.”
Takeaway: not limited to existing tracepoints!
60
root@nitrogen6x:~# insmod 4.6.2/kp_raise_softirq_irqoff.ko
[ 1749.935955] Planted kprobe at 8012c1b4
[ 1749.936088] Internal error: Oops - undefined instruction: 0 [#1]
PREEMPT SMP ARM
[ 1749.936109] Modules linked in: kp_raise_softirq_irqoff(+)
[ 1749.936116] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6.2
[ 1749.936119] Hardware name: Freescale i.MX6 Quad/DualLite
[ 1749.936131] PC is at __raise_softirq_irqoff+0x0/0xf0
[ 1749.936144] LR is at __napi_schedule+0x5c/0x7c
[ 1749.936766] Kernel panic - not syncing: Fatal exception in
interrupt
Not quite anywhere
Mainline stable 4.6.2
61
Adapt samples/kprobes/kprobe_example.c
/* For each probe you need to allocate a kprobe structure */
static struct kprobe kp = {
.symbol_name= "__raise_softirq_irqoff_ksoft",
};
/* kprobe post_handler: called after the probed instruction is executed */
static void handler_post(struct kprobe *p, struct pt_regs *regs,
                         unsigned long flags)
{
    unsigned id = smp_processor_id();

    /* change id to that where the eth IRQ is pinned */
    if (id == 0) {
        pr_info("Switched to ethernet NAPI.\n");
        pr_info("post_handler: p->addr = 0x%p, pc = 0x%lx,"
                " lr = 0x%lx, cpsr = 0x%lx\n",
                p->addr, regs->ARM_pc, regs->ARM_lr, regs->ARM_cpsr);
    }
}
code at Github
in net/core/dev.c
62
Watching net_rx_action() switch to NAPI
alison@laptop:~# make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- samples/kprobes/ modules
root@nitrogen6x:~# modprobe kp_ksoft.ko eth_proc_id=1
root@nitrogen6x:~# dmesg | tail
[ 6548.644584] Planted kprobe at 80033440
root@nitrogen6x:~# dmesg | grep post_handler
root@nitrogen6x:~#
. . . . . Start DoS attack . . . Wait 15 seconds . . . .
root@nitrogen6x:~# dmesg | tail
[ 6548.644584] Planted kprobe at 80033440
[ 6617.858101] pre_handler: p->addr = 0x80033440, pc = 0x80033444, lr = 0x80605ff0, cpsr = 0x20070193
[ 6617.858104] Switched to ethernet NAPI.
63
Another example of output
Insert/remove two probes during packet storm:
root@nitrogen6x:~# modprobe -r kp_ksoft
[ 232.471922] __raise_softirq_irqoff_ksoft: 14 hits
[ 232.471922] kprobe at 80033440 unregistered
root@nitrogen6x:~# modprobe -r kp_napi_complete
[ 287.225318] napi_complete_done: 1893005 hits
[ 287.262011] kprobe at 80605cc0 unregistered
64
Counting activation of two softirq execution paths
show you the codez
static struct kprobe kp = {
.symbol_name= "do_current_softirqs",
};
if (raised == NET_RX_SOFTIRQ) {
    ti = current_thread_info();
    task = ti->task;
    if (chatty)
        pr_debug("task->comm is %s\n", task->comm);
    if (strstr(task->comm, "ksoftirq"))
        p->ksoftirqd_count++;
    if (strstr(task->comm, "irq/"))
        p->local_bh_enable_count++;
}
Annotations:
previously included results
modprobe kp_do_current_softirqs chatty=1
store counters in struct kprobe{}
65
Summary
● IRQ handling involves a 'hard', fast part or 'top half' and a 'soft', slower part or 'bottom half.'
● Hard IRQs include arch-dependent system features plus software-generated IPIs.
● Soft IRQs may run directly after the hard IRQ that raises them, or at a later time in ksoftirqd.
● Threaded, preemptible IRQs are a salient feature of RT Linux.
● The management of IRQs, as illustrated by NAPI's response to DoS, remains challenging.
● If you can use bcc and eBPF, you should!
66
Acknowledgements
Thanks to Sebastian Siewior, Brenden Blanco, Brendan Gregg,
Steven Rostedt and Dave Anders for advice and inspiration.
Special thanks to Joel Fernandes and Sarah Newman for detailed
feedback on an earlier version.
67
Useful Resources
● NAPI docs
● Documentation/kernel-per-CPU-kthreads
● Documentation/DocBook/genericirq.pdf
● Brendan Gregg's blog
● Tasklets and softirqs discussion at KLDP wiki
● #iovisor at OFTC IRC
● Alexei Starovoitov's 2015 LLVM Microconf slides
68
ARMv7 Core Registers
69
Softirqs that don't run in context of hard-IRQ handlers run “on behalf of ksoftirqd”
static inline void ksoftirqd_set_sched_params(unsigned int cpu)
{
/* Take over all but timer pending softirqs when starting */
local_irq_disable();
current->softirqs_raised = local_softirq_pending() & ~TIMER_SOFTIRQS;
local_irq_enable();
}
static struct smp_hotplug_thread softirq_threads = {
.store = &ksoftirqd,
.setup = ksoftirqd_set_sched_params,
.thread_should_run = ksoftirqd_should_run,
.thread_fn = run_ksoftirqd,
.thread_comm = "ksoftirqd/%u",
};
70
Compare output to source with GDB
[alison@hildesheim linux-4.4.4 (trace_napi)]$ arm-linux-gnueabihf-gdb vmlinux
(gdb) p *(__raise_softirq_irqoff_ksoft)
$1 = {void (unsigned int)} 0x80033440 <__raise_softirq_irqoff_ksoft>
(gdb) l *(0x80605ff0)
0x80605ff0 is in net_rx_action (net/core/dev.c:4968).
4963 list_splice_tail(&repoll, &list);
4964 list_splice(&list, &sd->poll_list);
4965 if (!list_empty(&sd->poll_list))
4966 __raise_softirq_irqoff_ksoft(NET_RX_SOFTIRQ);
4967
4968 net_rps_action_and_irq_enable(sd);
4969 }

More Related Content

What's hot

Porting a new architecture (NDS32) to open wrt project
Porting a new architecture (NDS32) to open wrt projectPorting a new architecture (NDS32) to open wrt project
Porting a new architecture (NDS32) to open wrt project
Macpaul Lin
ย 

What's hot (20)

qemu + gdb + sample_code: Run sample code in QEMU OS and observe Linux Kernel...
qemu + gdb + sample_code: Run sample code in QEMU OS and observe Linux Kernel...qemu + gdb + sample_code: Run sample code in QEMU OS and observe Linux Kernel...
qemu + gdb + sample_code: Run sample code in QEMU OS and observe Linux Kernel...
ย 
Making Linux do Hard Real-time
Making Linux do Hard Real-timeMaking Linux do Hard Real-time
Making Linux do Hard Real-time
ย 
Block I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBlock I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktrace
ย 
Intel TSX ใซใคใ„ใฆ x86opti
Intel TSX ใซใคใ„ใฆ x86optiIntel TSX ใซใคใ„ใฆ x86opti
Intel TSX ใซใคใ„ใฆ x86opti
ย 
Porting a new architecture (NDS32) to open wrt project
Porting a new architecture (NDS32) to open wrt projectPorting a new architecture (NDS32) to open wrt project
Porting a new architecture (NDS32) to open wrt project
ย 
Linux power management: are you doing it right?
Linux power management: are you doing it right?Linux power management: are you doing it right?
Linux power management: are you doing it right?
ย 
Linux SD/MMC device driver
Linux SD/MMC device driverLinux SD/MMC device driver
Linux SD/MMC device driver
ย 
Memory Bandwidth QoS
Memory Bandwidth QoSMemory Bandwidth QoS
Memory Bandwidth QoS
ย 
Linux Performance Tools 2014
Linux Performance Tools 2014Linux Performance Tools 2014
Linux Performance Tools 2014
ย 
Linux Memory
Linux MemoryLinux Memory
Linux Memory
ย 
Logging kernel oops and panic
Logging kernel oops and panicLogging kernel oops and panic
Logging kernel oops and panic
ย 
Block Drivers
Block DriversBlock Drivers
Block Drivers
ย 
Video Drivers
Video DriversVideo Drivers
Video Drivers
ย 
Fast Boot Times with InsydeH2O
Fast Boot Times with InsydeH2OFast Boot Times with InsydeH2O
Fast Boot Times with InsydeH2O
ย 
Virtualization Support in ARMv8+
Virtualization Support in ARMv8+Virtualization Support in ARMv8+
Virtualization Support in ARMv8+
ย 
Kdump and the kernel crash dump analysis
Kdump and the kernel crash dump analysisKdump and the kernel crash dump analysis
Kdump and the kernel crash dump analysis
ย 
KASan in a Bare-Metal Hypervisor
 KASan in a Bare-Metal Hypervisor  KASan in a Bare-Metal Hypervisor
KASan in a Bare-Metal Hypervisor
ย 
BeagleBone Black Booting Process
BeagleBone Black Booting ProcessBeagleBone Black Booting Process
BeagleBone Black Booting Process
ย 
Linux device drivers
Linux device drivers Linux device drivers
Linux device drivers
ย 
Linux Preempt-RT Internals
Linux Preempt-RT InternalsLinux Preempt-RT Internals
Linux Preempt-RT Internals
ย 

Viewers also liked

Oracle Performance On Linux X86 systems
Oracle  Performance On Linux  X86 systems Oracle  Performance On Linux  X86 systems
Oracle Performance On Linux X86 systems
Baruch Osoveskiy
ย 
2 Linux Container and Docker
2 Linux Container and Docker2 Linux Container and Docker
2 Linux Container and Docker
Fabio Fumarola
ย 
Linux architecture
Linux architectureLinux architecture
Linux architecture
mcganesh
ย 

Viewers also liked (20)

Tuning systemd for embedded
Tuning systemd for embeddedTuning systemd for embedded
Tuning systemd for embedded
ย 
LISA15: systemd, the Next-Generation Linux System Manager
LISA15: systemd, the Next-Generation Linux System Manager LISA15: systemd, the Next-Generation Linux System Manager
IRQs: the Hard, the Soft, the Threaded and the Preemptible

  • 1. IRQs: the Hard, the Soft, the Threaded and the Preemptible Alison Chaiken Latest version of these slides alison@she-devel.com Embedded Linux Conference Europe Oct 11, 2016 Example code Version 2, actually presented live
  • 2. Thursday October 13, 2016 15:30:
    Debugging Methodologies for Realtime Issues, Joel Fernandes, Google (this same room)
    Knocking at Your Back Door (or How Dealing with Modern Interrupt Architectures can Affect Your Sanity), Marc Zyngier, ARM Ltd (Hall Berlin A)
  • 3. Agenda
    ● Why do IRQs exist?
    ● About kinds of hard-IRQ handlers
    ● About softirqs and tasklets
    ● Differences in IRQ handling between RT and non-RT kernels
    ● Studying IRQ behavior via kprobes, event tracing, mpstat and eBPF
    ● Detailed example: when does NAPI take over for eth IRQs?
    “Art cannot be taught. They must be absorbed into the workshop again.” -- Walter Gropius
  • 4. Sample questions to be answered
    ● What's all that stuff in /proc/interrupts anyway?
    ● What are IPIs and NMIs?
    ● Why are atomic operations expensive for ARM?
    ● What are the differences between mainline and RT for softirqs?
    ● What is the 'current' task while in softirq?
    ● What function is running inside the threaded IRQs?
    ● When do we switch from individual hard-IRQ processing to NAPI?
  • 5. Interrupt handling: a brief pictorial summary
    Top half: the hard IRQ. Bottom half: the soft IRQ.
    [Images: Dennis Jarvis, http://tinyurl.com/jmkw23h; onefulllife, http://tinyurl.com/j25lal5]
  • 6. Why do we need interrupts at all?
    ● IRQs allow devices to notify the kernel that they require maintenance.
    ● Alternatives include
      – polling (servicing devices at a pre-configured interval);
      – traditional IPC to user-space drivers.
    ● Even a single-threaded RTOS or a bootloader needs a system timer.
  • 7. Interrupts in Das U-boot
    ● For ARM, minimal IRQ support:
      – clear exceptions and reset timer (e.g., arch/arm/lib/interrupts_64.c or arch/arm/cpu/armv8/exceptions.S)
    ● For x86, interrupts are serviced via a stack-push followed by a jump (arch/x86/cpu/interrupts.c)
      – PCI has full-service interrupt handling (arch/x86/cpu/irq.c)
  • 8. Interrupts in RTOS: Xenomai/ADEOS IPIPE
    [Figure from the Adeos website, covered by GFDL]
  • 9. Zoology of IRQs
    ● Hard versus soft
    ● Level- vs. edge-triggered, simple, fast EOI or per-CPU
    ● Local vs. global; system vs. device
    ● Maskable vs. non-maskable
    ● Shared or not; chained or not
    ● Multiple interrupt controllers per SoC
    See 'cat /proc/interrupts' or 'mpstat -A'.
    [Image: By BirdBeaksA.svg: L. Shyamal, derivative work: Leptictidium (talk) - BirdBeaksA.svg, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=6626434]
  • 10. ARM IPIs, from arch/arm/kernel/smp.c
    $ # cat /proc/interrupts, look at bottom
        /* abridged from the kernel source */
        void handle_IPI(int ipinr, struct pt_regs *regs)
        {
            unsigned int cpu = smp_processor_id();

            switch (ipinr) {
            case IPI_TIMER:      tick_receive_broadcast(); break;
            case IPI_RESCHEDULE: scheduler_ipi(); break;
            case IPI_CALL_FUNC:  generic_smp_call_function_interrupt(); break;
            case IPI_CPU_STOP:   ipi_cpu_stop(cpu); break;
            case IPI_IRQ_WORK:   irq_work_run(); break;
            case IPI_COMPLETION: ipi_complete(cpu); break;
            }
        }
    Handlers are in kernel/sched/core.c
  • 11. What is an NMI?
    ● A 'non-maskable' interrupt is related to:
      – HW problem: parity error, bus error, watchdog timer expiration . . .
      – also used by perf
        /* non-maskable interrupt control */
        #define NMICR_NMIF    0x0001 /* NMI pin interrupt flag */
        #define NMICR_WDIF    0x0002 /* watchdog timer overflow */
        #define NMICR_ABUSERR 0x0008 /* async bus error flag */
    From arch/mn10300/include/asm/intctl-regs.h
    [Image: By John Jewell - Fenix, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=49332041]
  • 12. How IRQ masking works
    arch/arm/include/asm/irqflags.h:
        #define arch_local_irq_enable arch_local_irq_enable
        static inline void arch_local_irq_enable(void)
        {
            asm volatile("cpsie i @ arch_local_irq_enable" ::: "memory", "cc");
        }
    arch/arm64/include/asm/irqflags.h:
        static inline void arch_local_irq_enable(void)
        {
            asm volatile("msr daifclr, #2 // arch_local_irq_enable" ::: "memory");
        }
    arch/x86/include/asm/irqflags.h:
        static inline notrace void arch_local_irq_enable(void)
        {
            native_irq_enable();
        }
        static inline void native_irq_enable(void)
        {
            asm volatile("sti": : :"memory");
        }
    Slide callouts: “change processor state”; only current core.
  • 13. x86's Infamous System Management Interrupt
    ● SMI jumps out of kernel into System Management Mode
      – controlled by System Management Engine (Skochinsky)
    ● Identified as security vulnerability by Invisible Things Lab
    ● Not directly visible to Linux
    ● Traceable via hw_lat detector (sort of)
    [RFC][PATCH 1/3] tracing: Added hardware latency tracer, Aug 4
    From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
    The hardware latency tracer has been in the PREEMPT_RT patch for some time. It is used to detect possible SMIs or any other hardware interruptions that the kernel is unaware of. Note, NMIs may also be detected, but that may be good to note as well.
  • 14. ARM's Fast Interrupt reQuest
    ● An NMI with optimized handling due to dedicated registers.
    ● Underutilized by Linux drivers.
    ● Serves as the basis for Android's fiq_debugger.
  • 15. IRQ 'Domains' Correspond to Different INTCs
    CONFIG_IRQ_DOMAIN_DEBUG: This option will show the mapping relationship between hardware irq numbers and Linux irq numbers. The mapping is exposed via debugfs in the file "irq_domain_mapping".
    Note:
    ● There are a lot more IRQs than in /proc/interrupts.
    ● There are more IRQs in /proc/interrupts than in 'ps axl | grep irq'.
    ● Some IRQs are not used.
    ● Some are processor-reserved and not kernel-managed.
  • 16. Example: i.MX6 General Power Controller
    Unmasked IRQs can wake up sleeping power domains.
  • 17. Threaded IRQs in RT kernel
    Compare 'ps axl | grep irq' with both RT and non-RT kernels.
    Handling IRQs as kernel threads allows priority and CPU affinity to be managed individually.
    IRQ handlers running in threads can themselves be interrupted.
  • 19. What function do threaded IRQs run?
        /* request_threaded_irq - allocate an interrupt line
         * @handler:   Function to be called when the IRQ occurs.
         *             Primary handler for threaded interrupts
         *             If NULL and thread_fn != NULL the default
         *             primary handler is installed
         *
         * @thread_fn: Function called from the irq handler thread
         *             If NULL, no irq thread is created
         */
    Even in mainline, request_irq() = request_threaded_irq() with NULL thread_fn.
    EXAMPLE
  • 20. CASE 0: indirect invocation of request_threaded_irq()
      request_irq(handler) -> request_threaded_irq(handler, NULL) -> irq_setup_forced_threading()
      Result:
      -- irq_default_primary_handler() runs in interrupt context.
      -- All it does is wake up the thread.
      -- Then handler runs in irq/<name> thread.
    CASE 1: direct invocation of request_threaded_irq()
      Result:
      -- handler runs in interrupt context.
      -- thread_fn runs in irq/<name> thread.
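
The two cases on this slide can be modeled in user space. What follows is a minimal sketch, not kernel code: the `irq_action_model` struct, the `model_*` functions and the sample handlers are hypothetical names introduced for illustration, and the sketch covers only the behavior stated on the slides (a forced-threaded IRQ whose caller supplied no thread_fn).

```c
#include <assert.h>
#include <stddef.h>

typedef int (*irq_fn)(void);

/* Hypothetical model: which function runs in which context. */
struct irq_action_model {
    irq_fn hardirq_fn; /* runs in interrupt context */
    irq_fn thread_fn;  /* runs in the irq/<name> thread, NULL if no thread */
};

/* Stand-in for irq_default_primary_handler(): it only wakes the thread. */
static int default_primary_model(void) { return 1; }

/* Hypothetical driver functions used for illustration. */
static int sample_handler(void) { return 0; }
static int sample_thread_fn(void) { return 0; }

static struct irq_action_model
model_request_threaded_irq(irq_fn handler, irq_fn thread_fn, int force_threading)
{
    struct irq_action_model a = { handler, thread_fn };

    /* Both NULL is invalid; return the empty action unchanged. */
    if (handler == NULL && thread_fn == NULL)
        return a;

    /* NULL handler with a thread_fn: the default primary handler is used. */
    if (a.hardirq_fn == NULL)
        a.hardirq_fn = default_primary_model;

    /* CASE 0: forced threading moves the handler into the thread. */
    if (force_threading && a.thread_fn == NULL) {
        a.thread_fn = a.hardirq_fn;
        a.hardirq_fn = default_primary_model;
    }
    return a;
}

/* request_irq() is request_threaded_irq() with a NULL thread_fn. */
static struct irq_action_model model_request_irq(irq_fn handler, int force_threading)
{
    return model_request_threaded_irq(handler, NULL, force_threading);
}
```

With `force_threading` set, `model_request_irq(sample_handler, 1)` places `sample_handler` in the thread slot, which is CASE 0; passing both functions without forcing leaves each where the caller put it, which is CASE 1.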
  • 21. Threaded IRQs in RT, mainline and mainline with “threadirqs” boot param
    ● RT: all hard-IRQ handlers that don't set IRQF_NOTHREAD run in threads.
    ● Mainline: only those hard-IRQ handlers whose registration requests explicitly call request_threaded_irq() run in threads.
    ● Mainline with threadirqs kernel cmdline: like RT, but CPU affinity of IRQ threads cannot be set.
    Related patches:
      genirq: Force interrupt thread on RT
      genirq: Do not invoke the affinity callback via a workqueue on RT
  • 22. Shared interrupts: mmc driver
    ● Check 'ps axl | grep irq | grep mmc':
        1 0 122 2 -51 0 - S ? 0:00 [irq/16-mmc0]
        1 0 123 2 -50 0 - S ? 0:00 [irq/16-s-mmc0]
    ● 'cat /proc/interrupts': mmc and ehci-hcd share an IRQ line
        16: 204 IR-IO-APIC 16-fasteoi mmc0,ehci_hcd:usb3
    ● drivers/mmc/host/sdhci.c:
        ret = request_threaded_irq(host->irq, sdhci_irq, sdhci_thread_irq,
                                   IRQF_SHARED, mmc_hostname(mmc), host);
      Here sdhci_irq is the handler and sdhci_thread_irq is the thread_fn.
  • 23. Why are atomic operations more expensive (ARM)?
    arch/arm/include/asm/atomic.h:
        static inline void atomic_##op(int i, atomic_t *v)
        {
            unsigned long flags;

            raw_local_irq_save(flags);
            v->counter c_op i;
            raw_local_irq_restore(flags);
        }
    include/linux/irqflags.h:
        #define raw_local_irq_save(flags) do { flags = arch_local_irq_save(); } while (0)
    arch/arm/include/asm/irqflags.h:
        /* Save the current interrupt enable state & disable IRQs */
        static inline unsigned long arch_local_irq_save(void) { . . . }
  • 24. Introduction to softirqs
    In kernel/softirq.c:
        const char * const softirq_to_name[NR_SOFTIRQS] = {
            "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
            "TASKLET", "SCHED", "HRTIMER", "RCU"
        };
    In ksoftirqd, softirqs are serviced in the listed order.
    Slide callouts: Tasklet interface; Raised by devices; Kernel housekeeping; IRQ_POLL since 4.4; Gone since 4.1.
  • 25. What are tasklets?
    ● Tasklets perform deferred work not handled by other softirqs.
    ● Examples: crypto, USB, DMA, keyboard . . .
    ● More latency-sensitive drivers (sound, PCI) are part of tasklet_hi_vec.
    ● Any driver can create a tasklet.
    ● tasklet_hi_schedule() or tasklet_schedule() are called directly by ISR.
        const char * const softirq_to_name[NR_SOFTIRQS] = {
            "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
            "TASKLET", "SCHED", "HRTIMER", "RCU"
        };
  • 26. [alison@sid ~]$ sudo mpstat -I SCPU

    Linux 4.1.0-rt17+ (sid) 05/29/2016 _x86_64_ (4 CPU)
    CPU   HI/s  TIMER/s  NET_TX/s  NET_RX/s  BLOCK/s  TASKLET/s  SCHED/s  HRTIMER/s  RCU/s
      0   0.03   249.84      0.00      0.11    19.96       0.43   238.75       0.68   0.00
      1   0.01   249.81      0.38      1.00    38.25       1.98   236.69       0.53   0.00
      2   0.02   249.72      0.19      0.11    53.34       3.83   233.94       1.44   0.00
      3   0.59   249.72      0.01      2.05    19.34       2.63   234.04       1.72   0.00

    Linux 4.6.0+ (sid) 05/29/2016 _x86_64_ (4 CPU)
    CPU   HI/s  TIMER/s  NET_TX/s  NET_RX/s  BLOCK/s  TASKLET/s  SCHED/s  HRTIMER/s  RCU/s
      0   0.26    16.13      0.20      0.33    40.90       0.73     9.18       0.00  19.04
      1   0.00     9.45      0.00      1.31    14.38       0.61     7.85       0.00  17.88
      2   0.01    15.38      0.00      0.20     0.08       0.29    13.21       0.00  16.24
      3   0.00     9.77      0.00      0.05     0.15       0.00     8.50       0.00  15.32

    Linux 4.1.18-rt17-00028-g8da2a20 (vpc23) 06/04/16 _armv7l_ (2 CPU)
    CPU   HI/s  TIMER/s  NET_TX/s  NET_RX/s  BLOCK/s  TASKLET/s  SCHED/s  HRTIMER/s  RCU/s
      0   0.00   999.72      0.18      9.54     0.00      89.29   191.69     261.06   0.00
      1   0.00   999.35      0.00     16.81     0.00      15.13   126.75     260.89   0.00

    Linux 4.7.0 (nitrogen6x) 07/31/16 _armv7l_ (4 CPU)
    CPU   HI/s  TIMER/s  NET_TX/s  NET_RX/s  BLOCK/s  TASKLET/s  SCHED/s  HRTIMER/s  RCU/s
      0   0.00     2.84      0.50     40.69     0.00       0.38     2.78       0.00   3.03
      1   0.00    89.00      0.00      0.00     0.00       0.00     0.64       0.00  46.22
      2   0.00    16.59      0.00      0.00     0.00       0.00     0.23       0.00   3.05
      3   0.00    10.22      0.00      0.00     0.00       0.00     0.25       0.00   1.45
  • 27. Two paths by which softirqs run
    Related demo and sample code
    [Diagram: CASE 0 (left) and CASE 1 (right). Labels: Hard-IRQ handler; system management thread; raises softirq; exhausts timeslice?; local_bh_enable(); run_ksoftirqd(); do_current_softirqs() (RT) or __do_softirq(); __do_softirq()]
  • 28. Case 0: Run softirqs at exit of a hard-IRQ handler
    RT (4.6.2-rt5):
        local_bh_enable(); -> __local_bh_enable(); -> do_current_softirqs();
        Run softirqs raised in the current context.
        while (current->softirqs_raised) {
            i = __ffs(current->softirqs_raised);
            do_single_softirq(i);
        }
    non-RT (4.6.2):
        local_bh_enable(); -> do_softirq(); -> __do_softirq();
        Run all pending softirqs up to MAX_IRQ_RESTART.
    Further calls shown on the slide: handle_pending_softirqs(); handle_softirq(); while ((softirq_bit = ffs(pending))) handle_softirq();
    EXAMPLE
  • 29. Case 1: Scheduler runs the rest from ksoftirqd
    RT (4.6.2-rt5):
        run_ksoftirqd(); -> do_current_softirqs() [ where current == ksoftirqd ]
    non-RT (4.6.2):
        run_ksoftirqd(); -> do_softirq(); -> __do_softirq();
        h = softirq_vec;
        while ((softirq_bit = ffs(pending))) {
            h += softirq_bit - 1;
            h->action(h);
            h++;
            pending >>= softirq_bit;
        }
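
The non-RT loop above services pending softirqs from the lowest bit upward, which is why they run in the order listed in softirq_to_name. A minimal user-space sketch of that bit-walk follows; the enum, the `run_pending_model()` name and the `order` output array are illustrative assumptions, and the kernel's restart limit and per-softirq actions are not modeled.

```c
#include <assert.h>
#include <strings.h> /* ffs() */

/* Hypothetical indices mirroring the order of softirq_to_name on the slides. */
enum {
    SIRQ_HI, SIRQ_TIMER, SIRQ_NET_TX, SIRQ_NET_RX, SIRQ_BLOCK,
    SIRQ_IRQ_POLL, SIRQ_TASKLET, SIRQ_SCHED, SIRQ_HRTIMER, SIRQ_RCU,
    NR_MODEL_SOFTIRQS
};

/*
 * Walk the pending bitmask from bit 0 upward and record the index of each
 * pending softirq in order[]. Returns the number of entries written.
 */
static int run_pending_model(unsigned int pending, int *order, int max)
{
    int n = 0;
    int idx = 0; /* plays the role of the 'h' pointer into softirq_vec */
    int bit;

    /* Ignore bits that do not correspond to a modeled softirq. */
    pending &= (1u << NR_MODEL_SOFTIRQS) - 1;

    while (n < max && (bit = ffs((int)pending)) != 0) {
        idx += bit - 1;   /* skip to the next pending softirq */
        order[n++] = idx; /* the kernel would call h->action(h) here */
        idx++;            /* step past the serviced entry */
        pending >>= bit;  /* drop the bits already examined */
    }
    return n;
}
```

For a mask with NET_RX and TIMER both pending, the walk records TIMER before NET_RX, because TIMER occupies the lower bit.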
  • 30. RT vs Mainline: entering softirq handler
    4.6.2-rt5:
        [ 6937.393805] e1000e_poll+0x126/0xa70 [e1000e]
        [ 6937.393808] check_preemption_disabled+0xab/0x240
        [ 6937.393815] net_rx_action+0x53e/0xc90
        [ 6937.393824] do_current_softirqs+0x488/0xc30
        [ 6937.393831] do_current_softirqs+0x5/0xc30
        [ 6937.393836] __local_bh_enable+0xf2/0x1a0
        [ 6937.393840] irq_forced_thread_fn+0x91/0x140
        [ 6937.393845] irq_thread+0x170/0x310
        [ 6937.393848] irq_finalize_oneshot.part.6+0x4f0/0x4f0
        [ 6937.393853] irq_forced_thread_fn+0x140/0x140
        [ 6937.393857] irq_thread_check_affinity+0xa0/0xa0
        [ 6937.393862] kthread+0x12b/0x1b0
    4.7 mainline:
        [11661.191187] e1000e_poll+0x126/0xa70 [e1000e]
        [11661.191197] net_rx_action+0x52e/0xcd0
        [11661.191206] __do_softirq+0x15c/0x5ce
        [11661.191215] irq_exit+0xa3/0xd0
        [11661.191222] do_IRQ+0x62/0x110
        [11661.191230] common_interrupt+0x82/0x82
    Slide brackets on each trace: hard-IRQ handler; kick off softIRQ.
  • 31. Summary of softirq execution paths
    Case 0: Behavior of local_bh_enable() differs significantly between RT and mainline kernel.
    Case 1: Behavior of ksoftirqd itself is mostly the same (note discussion of ktimersoftd below).
  • 32. What is 'current'?
    include/asm-generic/current.h:
        #define get_current() (current_thread_info()->task)
        #define current get_current()
    arch/arm/include/asm/thread_info.h:
        static inline struct thread_info *current_thread_info(void)
        {
            return (struct thread_info *)
                (current_stack_pointer & ~(THREAD_SIZE - 1));
        }
    arch/x86/include/asm/thread_info.h:
        static inline struct thread_info *current_thread_info(void)
        {
            return (struct thread_info *)(current_top_of_stack() - THREAD_SIZE);
        }
    In do_current_softirqs(), current is the threaded IRQ task.
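
The ARM version above finds thread_info by rounding the stack pointer down to a THREAD_SIZE boundary. The arithmetic can be checked in user space with a minimal sketch; `MODEL_THREAD_SIZE` (8 KiB, assumed here for illustration) and `model_thread_info_base()` are hypothetical names, and the addresses used are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stack size for illustration; it must be a power of two. */
#define MODEL_THREAD_SIZE ((uintptr_t)8192)

/*
 * Round a stack pointer down to the start of its THREAD_SIZE-aligned stack
 * area, which is where the ARM code on the slide expects thread_info to live.
 */
static uintptr_t model_thread_info_base(uintptr_t sp)
{
    return sp & ~(MODEL_THREAD_SIZE - 1);
}
```

Any two stack pointers within the same aligned 8 KiB region map to the same base, so every function running on a given kernel stack resolves to the same 'current'.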
  • 33. What is 'current'? part 2
    arch/arm/include/asm/thread_info.h:
        /*
         * how to get the current stack pointer in C
         */
        register unsigned long current_stack_pointer asm ("sp");
    arch/x86/include/asm/thread_info.h:
        static inline unsigned long current_stack_pointer(void)
        {
            unsigned long sp;
        #ifdef CONFIG_X86_64
            asm("mov %%rsp,%0" : "=g" (sp));
        #else
            asm("mov %%esp,%0" : "=g" (sp));
        #endif
            return sp;
        }
  • 36. Do timers, scheduler, RCU ever run as part of do_current_softirqs?
    Examples:
    -- every jiffy, raise_softirq_irqoff(HRTIMER_SOFTIRQ);
    -- scheduler_ipi() for NOHZ calls raise_softirq_irqoff(SCHED_SOFTIRQ);
    -- rcu_bh_qs() calls raise_softirq(RCU_SOFTIRQ);
    These run when ksoftirqd is current.
  • 37. Demo: kprobe on do_current_softirqs() for RT kernel
    ● At Github
    ● Counts calls to do_current_softirqs() from ksoftirqd and from a hard-IRQ handler.
    ● Tested on 4.4.4-rt11 with Boundary Devices' Nitrogen i.MX6.
    Output showing what the task of 'current_thread' is:
        [ 52.841425] task->comm is ksoftirqd/1
        [ 70.051424] task->comm is ksoftirqd/1
        [ 70.171421] task->comm is ksoftirqd/1
        [ 105.981424] task->comm is ksoftirqd/1
        [ 165.260476] task->comm is irq/43-2188000.
        [ 165.261406] task->comm is ksoftirqd/1
        [ 225.321529] task->comm is irq/43-2188000.
    explanation
  • 38. Softirqs can be pre-empted with PREEMPT_RT
    include/linux/sched.h:
        struct task_struct {
        #ifdef CONFIG_PREEMPT_RT_BASE
            struct rcu_head put_rcu;
            int softirq_nestcnt;
            unsigned int softirqs_raised;
        #endif
        };
  • 39. RT-Linux headache: 'softirq starvation'
    ● ksoftirqd scarcely gets to run.
    ● Events that are triggered by timer interrupt won't happen.
    ● Example: main event loop in userspace did not run due to missed timer ticks.
    Typical kernel messages: “sched: RT throttling activated” or “INFO: rcu_sched detected stalls on CPUs”
    Reference: “Understanding a Real-Time System” by Rostedt, slides and video
  • 40. (partial) RT solution: ktimersoftd
    Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Date: Wed Jan 20 2016 +0100
    softirq: split timer softirqs out of ksoftirqd
    With enough networking load it is possible that the system never goes idle and schedules ksoftirqd and everything else with a higher priority. One of the tasks left behind is one of RCU's threads and so we see stalls and eventually run out of memory. This patch moves the TIMER and HRTIMER softirqs out of the `ksoftirqd` thread into its own `ktimersoftd`. The former can now run SCHED_OTHER (same as mainline) and the latter at SCHED_FIFO due to the wakeups. [ . . . ]
  • 42. 42 ftrace produces a copious amount of output
  • 43. 43 Investigating IRQs with eBPF: bcc ● BCC - Tools for BPF-based Linux analysis ● tools/ and examples/ illustrate interfaces to kprobes and uprobes. ● BCC tools are: – a convenient way to study arbitrary infrequent events dynamically; – based on dynamic code insertion using Clang Rewriter JIT; – lightweight due to in-kernel data storage.
  • 44. 44 eBPF, IOvisor and IRQs: limitations ● JIT compiler is currently available for the x86-64, arm64, and s390 architectures. ● No stack traces unless CONFIG_FRAME_POINTER=y ● Requires recent kernel, LLVM and Clang ● bcc/src/cc/export/helpers.h: #ifdef __powerpc__ [ . . . ] #elif defined(__x86_64__) [ . . . ] #else #error "bcc does not support this platform yet" #endif
  • 45. 45 bcc tips ● Kernel source must be present on the host where the probe runs. ● /lib/modules/$(uname -r)/build/include/generated must exist. ● To switch between kernel branches and continue quickly using bcc: – run 'make mrproper; make config; make' – 'make' need only populate include/generated in the kernel source before bcc again becomes available. – 'make headers_install' as non-root user SKIP
  • 46. 46 Get latest version of clang by compiling from source (or from Debian Sid) $ git clone http://llvm.org/git/llvm.git $ cd llvm/tools $ git clone --depth 1 http://llvm.org/git/clang.git $ cd ..; mkdir build; cd build $ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" $ make -j $(getconf _NPROCESSORS_ONLN) SKIP from samples/bpf/README.rst
  • 47. 47 Example: NAPI: changing the bottom half Image credits: O. Quincel, own work, CC BY-SA 4.0; McSmit, own work, CC BY-SA 3.0
  • 48. 48 Quick NAPI refresher The problem: “High-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process.” The solution: “Interrupt mitigation . . . NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.” The implementation: Poll the driver and drop packets without processing in the NIC if the polling frequency necessitates. net/core/dev.c in RT
  • 49. 49 Example: i.MX6 FEC RGMII NAPI turn-on static irqreturn_t fec_enet_interrupt(int irq, void *dev_id) [ . . . ] if ((fep->work_tx || fep->work_rx) && fep->link) { if (napi_schedule_prep(&fep->napi)) { /* Disable the NAPI interrupts */ writel(FEC_ENET_MII, fep->hwp + FEC_IMASK); __napi_schedule(&fep->napi); } } == irq_forced_thread_fn() for irq/43 Back to threaded IRQs
  • 50. 50 Example: i.MX6 FEC RGMII NAPI turn-off static int fec_enet_rx_napi(struct napi_struct *napi, int budget){ [ . . . ] pkts = fec_enet_rx(ndev, budget); if (pkts < budget) { napi_complete(napi); writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK); } } netif_napi_add(ndev, &fep->napi, fec_enet_rx_napi, NAPI_POLL_WEIGHT); Interrupts are re-enabled when budget is not consumed.
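The turn-on and turn-off conditions from the two FEC slides can be modelled in user space as a small state machine: the interrupt handler masks RX interrupts and schedules polling, and the poll routine unmasks them only when it processes fewer packets than its budget. This is a simplified sketch, not the driver's code; struct model_nic and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the NIC plus its NAPI state. */
struct model_nic {
    unsigned int backlog;    /* packets waiting in the RX ring */
    bool rx_irq_enabled;     /* false while polling is active  */
    bool poll_scheduled;     /* true between schedule/complete */
};

/* Hard-IRQ side: if there is work and polling is not yet scheduled,
 * mask the RX interrupt and hand the work over to the poll routine. */
static void model_rx_interrupt(struct model_nic *nic)
{
    if (nic->backlog > 0 && !nic->poll_scheduled) {
        nic->rx_irq_enabled = false;
        nic->poll_scheduled = true;
    }
}

/* Softirq side: process at most 'budget' packets. Interrupts are only
 * re-enabled when the budget was not consumed, i.e. the ring is drained. */
static unsigned int model_napi_poll(struct model_nic *nic, unsigned int budget)
{
    assert(nic->poll_scheduled);

    unsigned int pkts = nic->backlog < budget ? nic->backlog : budget;

    nic->backlog -= pkts;
    if (pkts < budget) {
        nic->poll_scheduled = false;
        nic->rx_irq_enabled = true;
    }
    return pkts;
}
```

Under sustained load the backlog never drops below the budget, so in this model the device stays in polled mode with its RX interrupt masked, which is the behavior the following slides try to observe.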
  • 51. Using existing tracepoints ● function_graph tracing causes a lot of overhead. ● How about the napi_poll tracepoint in /sys/kernel/debug/tracing/events/napi? – Fires constantly with any network traffic. – Displays no obvious change in behavior when eth IRQ is disabled and polling starts.
  • 52. 52 The Much Easier Way: BCC on x86_64 with 4.6.2-rt5 and Clang-3.8
  • 53. 53 Handling Eth IRQs in ksoftirqd on x86_64, but NAPI? (4.6.2-rt5) root $ ./stackcount.py e1000_receive_skb Tracing 1 functions for "e1000_receive_skb" ^C Stack (count 1; running from ksoftirqd, not from the hard-IRQ handler): e1000_receive_skb e1000e_poll net_rx_action do_current_softirqs run_ksoftirqd smpboot_thread_fn kthread ret_from_fork. Stack (count 26469; normal behavior: the packet handler runs immediately after the eth IRQ, in its context): e1000_receive_skb e1000e_poll net_rx_action do_current_softirqs __local_bh_enable irq_forced_thread_fn irq_thread kthread ret_from_fork.
  • 54. 54 Switch to NAPI on x86_64 [alison@sid]$ sudo modprobe kp_ksoft eth_irq_procid=1 [ ] __raise_softirq_irqoff_ksoft: 582 hits [ ] kprobe at ffffffff81100920 unregistered [alison@sid]$ sudo ./stacksnoop.py __raise_softirq_irqoff_ksoft 144.803096056 __raise_softirq_irqoff_ksoft ffffffff81100921 __raise_softirq_irqoff_ksoft ffffffff810feda9 do_current_softirqs ffffffff810ffeae run_ksoftirqd ffffffff8114d255 smpboot_thread_fn ffffffff81144a99 kthread ffffffff8205ed82 ret_from_fork
  • 55. 55 Same Experiment, but non-RT 4.6.2 Most frequent (count 1016045): e1000_receive_skb e1000e_poll net_rx_action __softirqentry_text_start irq_exit do_IRQ ret_from_intr cpuidle_enter call_cpuidle cpu_startup_entry start_secondary. Run in ksoftirqd (count 1162): e1000_receive_skb e1000e_poll net_rx_action __softirqentry_text_start run_ksoftirqd smpboot_thread_fn kthread ret_from_fork. At least 70 other call stacks observed in a few seconds. SKIP
  • 56. 56 Due to handle_pending_softirqs(), any hard IRQ can run before a given softirq (non-RT 4.6.2) Stack (count 357): e1000_receive_skb e1000e_poll net_rx_action __softirqentry_text_start irq_exit do_IRQ ret_from_intr pipe_write __vfs_write vfs_write sys_write entry_SYSCALL_64_fastpath. Stack (count 366): e1000_receive_skb e1000e_poll net_rx_action __softirqentry_text_start irq_exit do_IRQ ret_from_intr __alloc_pages_nodemask alloc_pages_vma handle_pte_fault handle_mm_fault __do_page_fault do_page_fault page_fault.
  • 57. 57 Same Experiment, but 4.6.2 with 'threadirqs' boot param Stack (count 569174): e1000_receive_skb e1000e_poll net_rx_action __softirqentry_text_start do_softirq_own_stack do_softirq.part.16 __local_bh_enable_ip irq_forced_thread_fn irq_thread kthread ret_from_fork. Note: no do_current_softirqs()
  • 59. 59 Documentation/kprobes.txt “In general, you can install a probe anywhere in the kernel. In particular, you can probe interrupt handlers.” Takeaway: not limited to existing tracepoints!
  • 60. 60 Not quite anywhere (mainline stable 4.6.2) root@nitrogen6x:~# insmod 4.6.2/kp_raise_softirq_irqoff.ko [ 1749.935955] Planted kprobe at 8012c1b4 [ 1749.936088] Internal error: Oops - undefined instruction: 0 [#1] PREEMPT SMP ARM [ 1749.936109] Modules linked in: kp_raise_softirq_irqoff(+) [ 1749.936116] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6.2 [ 1749.936119] Hardware name: Freescale i.MX6 Quad/DualLite [ 1749.936131] PC is at __raise_softirq_irqoff+0x0/0xf0 [ 1749.936144] LR is at __napi_schedule+0x5c/0x7c [ 1749.936766] Kernel panic - not syncing: Fatal exception in interrupt
  • 61. 61 Adapt samples/kprobes/kprobe_example.c /* For each probe you need to allocate a kprobe structure */ static struct kprobe kp = { .symbol_name = "__raise_softirq_irqoff_ksoft", }; /* kprobe post_handler: called after the probed instruction is executed */ static void handler_post(struct kprobe *p, struct pt_regs *regs, unsigned long flags) { unsigned int id = smp_processor_id(); /* change id to that where the eth IRQ is pinned */ if (id == 0) { pr_info("Switched to ethernet NAPI.\n"); pr_info("post_handler: p->addr = 0x%p, pc = 0x%lx," " lr = 0x%lx, cpsr = 0x%lx\n", p->addr, regs->ARM_pc, regs->ARM_lr, regs->ARM_cpsr); } } code at GitHub in net/core/dev.c
  • 62. 62 Watching net_rx_action() switch to NAPI alison@laptop:~# make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- samples/kprobes/ modules root@nitrogen6x:~# modprobe kp_ksoft.ko eth_proc_id=1 root@nitrogen6x:~# dmesg | tail [ 6548.644584] Planted kprobe at 80033440 root@nitrogen6x:~# dmesg | grep post_handler root@nitrogen6x:~# . . . . . Start DOS attack . . . Wait 15 seconds . . . . root@nitrogen6x:~# dmesg | tail [ 6548.644584] Planted kprobe at 80033440 [ 6617.858101] pre_handler: p->addr = 0x80033440, pc = 0x80033444, lr = 0x80605ff0, cpsr = 0x20070193 [ 6617.858104] Switched to ethernet NAPI.
  • 63. 63 Another example of output Insert/remove two probes during packet storm: root@nitrogen6x:~# modprobe -r kp_ksoft [ 232.471922] __raise_softirq_irqoff_ksoft: 14 hits [ 232.471922] kprobe at 80033440 unregistered root@nitrogen6x:~# modprobe -r kp_napi_complete [ 287.225318] napi_complete_done: 1893005 hits [ 287.262011] kprobe at 80605cc0 unregistered
  • 64. 64 Counting activation of two softirq execution paths show you the codez static struct kprobe kp = { .symbol_name = "do_current_softirqs", }; if (raised == NET_RX_SOFTIRQ) { ti = current_thread_info(); task = ti->task; if (chatty) pr_debug("task->comm is %s\n", task->comm); if (strstr(task->comm, "ksoftirq")) p->ksoftirqd_count++; if (strstr(task->comm, "irq/")) p->local_bh_enable_count++; } Callouts: previously included results; modprobe kp_do_current_softirqs chatty=1; store counters in struct kprobe{}
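The classification step on this slide can be tried out in user space without loading a module. The sketch below is an assumption-laden stand-in: struct model_counts and model_count_context() are hypothetical names, and it matches the "irq/" prefix with strncmp() rather than the slide's strstr(), which is a slightly stricter test.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical counters mirroring the two paths counted on the slide. */
struct model_counts {
    unsigned long ksoftirqd_count;        /* softirq ran in ksoftirqd/N      */
    unsigned long local_bh_enable_count;  /* softirq ran in an irq/... thread */
};

/* Classify one invocation by the name of the task that was current. */
static void model_count_context(struct model_counts *c, const char *comm)
{
    assert(c != NULL && comm != NULL);

    if (strstr(comm, "ksoftirq"))
        c->ksoftirqd_count++;
    else if (strncmp(comm, "irq/", 4) == 0)
        c->local_bh_enable_count++;
    /* any other task name is ignored in this sketch */
}
```

Feeding it the task names from the earlier demo output, such as "ksoftirqd/1" and "irq/43-2188000.", increments one counter each.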
  • 65. 65 Summary ● IRQ handling involves a 'hard', fast part or 'top half' and a 'soft', slower part or 'bottom half.' ● Hard IRQs include arch-dependent system features plus software-generated IPIs. ● Soft IRQs may run directly after the hard IRQ that raises them, or at a later time in ksoftirqd. ● Threaded, preemptible IRQs are a salient feature of RT Linux. ● The management of IRQs, as illustrated by NAPI's response to DOS, remains challenging. ● If you can use bcc and eBPF, you should be!
  • 66. 66 Acknowledgements Thanks to Sebastian Siewior, Brenden Blanco, Brendan Gregg, Steven Rostedt and Dave Anders for advice and inspiration. Special thanks to Joel Fernandes and Sarah Newman for detailed feedback on an earlier version.
  • 67. 67 Useful Resources ● NAPI docs ● Documentation/kernel-per-CPU-kthreads ● Documentation/DocBook/genericirq.pdf ● Brendan Gregg's blog ● Tasklets and softirqs discussion at KLDP wiki ● #iovisor at OFTC IRC ● Alexei Starovoitov's 2015 LLVM Microconf slides
  • 69. 69 Softirqs that don't run in context of hard-IRQ handlers run “on behalf of ksoftirqd” static inline void ksoftirqd_set_sched_params(unsigned int cpu) { /* Take over all but timer pending softirqs when starting */ local_irq_disable(); current->softirqs_raised = local_softirq_pending() & ~TIMER_SOFTIRQS; local_irq_enable(); } static struct smp_hotplug_thread softirq_threads = { .store = &ksoftirqd, .setup = ksoftirqd_set_sched_params, .thread_should_run = ksoftirqd_should_run, .thread_fn = run_ksoftirqd, .thread_comm = "ksoftirqd/%u", };
  • 70. 70 Compare output to source with GDB [alison@hildesheim linux-4.4.4 (trace_napi)]$ arm-linux-gnueabihf-gdb vmlinux (gdb) p *(__raise_softirq_irqoff_ksoft) $1 = {void (unsigned int)} 0x80033440 <__raise_softirq_irqoff_ksoft> (gdb) l *(0x80605ff0) 0x80605ff0 is in net_rx_action (net/core/dev.c:4968). 4963 list_splice_tail(&repoll, &list); 4964 list_splice(&list, &sd->poll_list); 4965 if (!list_empty(&sd->poll_list)) 4966 __raise_softirq_irqoff_ksoft(NET_RX_SOFTIRQ); 4967 4968 net_rps_action_and_irq_enable(sd); 4969 }