Running short of memory is a common problem, and finding what is consuming it can be tricky. In this session we look at the Linux kernel memory system: where memory is consumed, why, and what to do about it. We'll explore memory metrics through utilities like top, ps, vmstat, pmap, and slabinfo. We'll start with the basics of memory in the Linux kernel, reviewing the relevant fields in top and the per-process statistics in ps, then quickly work up to more complex matters. Topics will include paging, swapping, caches, buffers, the Linux VFS, and shared memory. Throughout the presentation we'll look at sample cases that each highlight a particular component and the circumstances in which it might come to use a significant portion of a system's memory, and we'll discuss how and whether tunables should be used to control how the kernel manages its resources.
12. Overcommit
top - 14:57:44 up 137 days, 7:02, 6 users, load average: 0.03, 0.02, 0.00
Tasks: 141 total,   1 running, 140 sleeping,   0 stopped,   0 zombie
Cpu(s): 0.0%us, 0.2%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   3858692k total,  3075728k used,   782964k free,   283648k buffers
Swap:        0k total,        0k used,        0k free,  1073320k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
22385 jmiller   20   0 18.6g  572  308 S  0.0  0.0   0:00.00 bloat

4G of physical memory and no swap, so how can “bloat” have 18.6g virtual?
● Virtual memory is not “physical memory plus swap”
● A process can request huge amounts of memory, but it isn't mapped to “real memory” until actually referenced
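The gap between requested and referenced memory can be observed from inside a process. The following is a minimal Python sketch, assuming a Linux host with /proc mounted; the helper names and the sizes are illustrative and not from the talk. It maps a large anonymous region, touches only part of it, and reads VmSize and VmRSS from /proc/self/status at each step. On a host configured for strict overcommit (vm.overcommit_memory=2) the large mapping may simply fail.

```python
import mmap


def parse_status_kb(text, field):
    """Return a kB-valued field such as VmSize or VmRSS from /proc/<pid>/status text."""
    for line in text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1])
    raise KeyError(field)


def snapshot():
    """Read (VmSize, VmRSS) in kB for the current process."""
    with open("/proc/self/status") as f:
        text = f.read()
    return parse_status_kb(text, "VmSize"), parse_status_kb(text, "VmRSS")


def overcommit_demo(map_bytes=1 << 30, touch_bytes=64 << 20):
    """Map map_bytes of anonymous memory, then write to only touch_bytes of it."""
    before = snapshot()
    region = mmap.mmap(-1, map_bytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    after_map = snapshot()      # VmSize grows here, VmRSS barely moves
    for offset in range(0, touch_bytes, mmap.PAGESIZE):
        region[offset] = 1      # first write to a page makes the kernel back it
    after_touch = snapshot()    # VmRSS grows by roughly touch_bytes
    region.close()
    return before, after_map, after_touch
```

Calling `overcommit_demo()` and printing the three (VmSize, VmRSS) pairs should show virtual size jumping at map time and resident size following only after the pages are written.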
14. Linux filesystem caching
Free memory is used to cache filesystem contents.
Over time systems can appear to be out of memory
because all of the free memory is used for cache.
15. top is awesome
top - 15:57:33 up 131 days, 8:02, 3 users, load average: 0.00, 0.00, 0.00
Tasks: 129 total,   1 running, 128 sleeping,   0 stopped,   0 zombie
Cpu(s): 0.2%us, 0.3%sy, 0.3%ni, 99.0%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   3858692k total,  3149296k used,   709396k free,   261556k buffers
Swap:        0k total,        0k used,        0k free,  1081832k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8131 root      30  10  243m  50m 3748 S  0.0  1.3   0:51.97 chef-client
 8153 root      30  10  238m  19m 7840 S  0.0  0.5   1:35.48 sssd_be
 8154 root      30  10  208m  15m  14m S  0.0  0.4   0:08.03 sssd_nss
 7767 root      30  10 50704 8748 1328 S  1.0  0.2   1559:39 munin-asyncd
 7511 root      30  10  140m 7344  580 S  0.0  0.2  13:06.29 munin-node
 3379 root      20   0  192m 4116  652 S  0.0  0.1  48:20.28 snmpd
 7026 root      20   0  113m 3992 3032 S  0.0  0.1   0:00.02 sshd

About 25% of this system's memory is used by the page cache
16. Linux filesystem caching
Additions and removals from the cache are transparent to applications
Tunable through swappiness
Can be dropped - echo 1 > /proc/sys/vm/drop_caches
Under memory pressure, memory is freed automatically*
*usually
21. Where'd my memory go?
top - 16:40:53 up 137 days, 8:45, 5 users, load average: 0.88, 0.82, 0.46
Tasks: 138 total,   1 running, 137 sleeping,   0 stopped,   0 zombie
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   3858692k total,  1549480k used,  2309212k free,    25804k buffers
Swap:        0k total,        0k used,        0k free,   344280k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28285 root      30  10  238m  17m 6128 S  0.0  0.5   1:39.42 sssd_be
 7767 root      30  10 50704 8732 1312 S  0.0  0.2   1659:37 munin-asyncd
 7511 root      30  10  140m 7344  580 S  0.0  0.2  13:56.68 munin-node
 3379 root      20   0  192m 4116  652 S  0.0  0.1  50:31.44 snmpd
...

1.5G used - 106MB RSS - 345MB cache - 25MB buffer = ~1GB mystery
What is consuming a GB of memory?
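The subtraction on this slide can be written out as a short Python sketch. The used, cached, and buffers values are taken from the top output; the 106 MB RSS total is the slide's own figure, and the function name is illustrative.

```python
def unexplained_kb(used_kb, rss_kb, cached_kb, buffers_kb):
    """Memory counted as used that is neither process RSS, page cache, nor buffers."""
    return used_kb - rss_kb - cached_kb - buffers_kb


# Values from the top output; RSS is the slide's 106 MB converted to kB.
mystery = unexplained_kb(
    used_kb=1549480,
    rss_kb=106 * 1024,
    cached_kb=344280,
    buffers_kb=25804,
)
print(f"{mystery} kB unexplained (~{mystery / 1024 / 1024:.2f} GB)")
```

Whatever remains after this subtraction has to be memory the kernel holds for itself, which is where the next slides look.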
22. kernel slab cache
● The kernel uses free memory for its own caches.
● Some include:
  – dentries (directory cache)
  – inodes
  – buffers
23. kernel slab cache
[jmiller@mem-mystery ~]$ slabtop -o -s c
 Active / Total Objects (% used)    : 2461101 / 2468646 (99.7%)
 Active / Total Slabs (% used)      : 259584 / 259586 (100.0%)
 Active / Total Caches (% used)     : 104 / 187 (55.6%)
 Active / Total Size (% used)       : 835570.40K / 836494.74K (99.9%)
 Minimum / Average / Maximum Object : 0.02K / 0.34K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
624114 624112  99%    1.02K 208038        3    832152K nfs_inode_cache
631680 631656  99%    0.19K  31584       20    126336K dentry
649826 649744  99%    0.06K  11014       59     44056K size-64
494816 494803  99%    0.03K   4418      112     17672K size-32
   186    186 100%   32.12K    186        1     11904K kmem_cache
  4206   4193  99%    0.58K    701        6      2804K inode_cache
  6707   6163  91%    0.20K    353       19      1412K vm_area_struct
  2296   2290  99%    0.55K    328        7      1312K radix_tree_node
1057MB of kernel slab cache
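As a rough cross-check, the CACHE SIZE column of the slabtop rows can be summed directly. The sketch below is a hypothetical helper, not part of the talk; it assumes the eight-field row layout that `slabtop -o` printed on this system. The eight listed caches come to about 1013 MB, so the 1057MB total presumably also includes the smaller caches that the slide does not show.

```python
def total_cache_kb(slabtop_rows):
    """Sum the CACHE SIZE column (7th field, e.g. '832152K') of slabtop -o rows."""
    total = 0
    for line in slabtop_rows.splitlines():
        fields = line.split()
        # Data rows have exactly eight fields and start with an object count.
        if len(fields) == 8 and fields[0].isdigit() and fields[6].endswith("K"):
            total += int(fields[6][:-1])
    return total


ROWS = """\
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
624114 624112 99% 1.02K 208038 3 832152K nfs_inode_cache
631680 631656 99% 0.19K 31584 20 126336K dentry
649826 649744 99% 0.06K 11014 59 44056K size-64
494816 494803 99% 0.03K 4418 112 17672K size-32
186 186 100% 32.12K 186 1 11904K kmem_cache
4206 4193 99% 0.58K 701 6 2804K inode_cache
6707 6163 91% 0.20K 353 19 1412K vm_area_struct
2296 2290 99% 0.55K 328 7 1312K radix_tree_node
"""

listed_kb = total_cache_kb(ROWS)
print(f"listed caches: {listed_kb} kB (~{listed_kb / 1024:.0f} MB)")
```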
26. Where'd my memory go?
1.5G used - 106MB RSS - 345MB cache - 25MB buffer = ~1GB mystery
What is consuming a GB of memory?
Answer: kernel slab cache → 1057MB
27. kernel slab cache
Additions and removals from the cache are transparent to applications
Tunable through /proc/sys/vm/vfs_cache_pressure
Under memory pressure, memory is freed automatically*
*usually
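Before changing a tunable it helps to know its current value. This is a small Python sketch, assuming the usual /proc/sys/vm layout on Linux; the function names are illustrative. It reads the two tunables named in these slides, swappiness and vfs_cache_pressure, and takes the directory as a parameter so it can be pointed at a test fixture.

```python
from pathlib import Path


def read_vm_tunable(name, root="/proc/sys/vm"):
    """Return the integer value of a vm sysctl such as swappiness."""
    return int(Path(root, name).read_text().strip())


def show_cache_tunables(root="/proc/sys/vm"):
    """Collect the cache-related tunables mentioned in the talk."""
    names = ("swappiness", "vfs_cache_pressure")
    return {name: read_vm_tunable(name, root) for name in names}
```

On a live system `show_cache_tunables()` needs no privileges, since these files are world-readable; only writing new values requires root.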
28. kernel slab cache
network buffers example
[jmiller@mem-mystery2 ~]$ slabtop -s c -o
 Active / Total Objects (% used)    : 2953761 / 2971022 (99.4%)
 Active / Total Slabs (% used)      : 413496 / 413496 (100.0%)
 Active / Total Caches (% used)     : 106 / 188 (56.4%)
 Active / Total Size (% used)       : 1633033.85K / 1635633.87K (99.8%)
 Minimum / Average / Maximum Object : 0.02K / 0.55K / 4096.00K

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1270200 1270170  99%    1.00K 317550        4   1270200K size-1024
1269480 1269406  99%    0.25K  84632       15    338528K skbuff_head_cache
 325857  325746  99%    0.06K   5523       59     22092K size-64
~1.5G used, this time for in-use network buffers (SO_RCVBUF)
31. Unreclaimable slab
[jmiller@mem-mystery2 ~]$ grep -A 2 ^Slab /proc/meminfo
Slab:            1663820 kB
SReclaimable:       9900 kB
SUnreclaim:      1653920 kB
Some slab objects can't be reclaimed, and memory pressure won't automatically free the resources
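The same three fields can be turned into a single ratio. The sketch below is a minimal Python example with illustrative function names; it parses /proc/meminfo-style text and reports what share of slab memory is unreclaimable, using the values from this slide as sample input.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to their numeric values (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def unreclaimable_share(meminfo):
    """Fraction of slab memory that memory pressure cannot free."""
    return meminfo["SUnreclaim"] / meminfo["Slab"]


SAMPLE = """\
Slab:            1663820 kB
SReclaimable:       9900 kB
SUnreclaim:      1653920 kB
"""

info = parse_meminfo(SAMPLE)
print(f"{unreclaimable_share(info):.1%} of slab is unreclaimable")
```

To run it against a live system, pass the contents of /proc/meminfo to `parse_meminfo` instead of `SAMPLE`.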
32. Nitpick Accounting
Now we can account for all memory utilization:
[jmiller@postgres ~]$ ./memory_explain.sh
"free" buffers (MB)                : 277
"free" caches (MB)                 : 4650
"slabtop" memory (MB)              : 109.699
"ps" resident process memory (MB)  : 366.508
"free" used memory (MB)            : 5291
buffers+caches+slab+rss (MB)       : 5403.207
difference (MB)                    : -112.207
But sometimes we're using more memory than we're using?!
34. And a cache complication...
top - 12:37:01 up 66 days, 23:38, 3 users, load average: 0.08, 0.02, 0.01
Tasks: 188 total,   1 running, 187 sleeping,   0 stopped,   0 zombie
Cpu(s): 0.3%us, 0.6%sy, 0.0%ni, 98.9%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st
Mem:   7673860k total,  6895008k used,   778852k free,   300388k buffers
Swap:        0k total,        0k used,        0k free,  6179780k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2189 postgres  20   0 5313m 2.8g 2.8g S  0.0 38.5   7:09.20 postgres

~7G used, ~6G cached, so how can postgres have 2.8G resident?
38. Shared memory
● Pages that multiple processes can access
● Resident, shared, and in the page cache
● Not subject to cache flush
● shmget()
● mmap()
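The mmap() route can be illustrated with a short Python sketch, assuming a Unix host where os.fork is available; the function name is illustrative. A parent creates a shared anonymous mapping, a forked child writes into it, and the parent reads the child's bytes back from the same pages. On Linux such mappings are backed by shmem, which is why they are counted as both resident and cached.

```python
import mmap
import os


def share_across_fork(message=b"hello from child"):
    """Child writes into a shared anonymous mapping; parent reads it back."""
    region = mmap.mmap(-1, mmap.PAGESIZE, flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS)
    pid = os.fork()
    if pid == 0:
        # Child: the write lands in pages the parent also maps.
        region[:len(message)] = message
        os._exit(0)
    os.waitpid(pid, 0)
    data = region[:len(message)]
    region.close()
    return data
```

With MAP_PRIVATE instead of MAP_SHARED the child's write would go to its own copy and the parent would read back zero bytes.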
50. Locked shared memory
● Memory from shmget() must be explicitly released by a shmctl(..., IPC_RMID, …) call
● Process termination doesn't free the memory
● Not the case for mmap()
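The shmget() lifecycle can be sketched from Python through ctypes. This is an illustration under stated assumptions rather than a recommended way to use System V shared memory: it assumes Linux, hard-codes the Linux values of IPC_PRIVATE, IPC_CREAT, and IPC_RMID, and uses hypothetical helper names. The point is the `finally` clause, which makes the explicit removal call that process exit would not make.

```python
import ctypes
import os

IPC_PRIVATE = 0      # Linux value (assumption: not portable)
IPC_CREAT = 0o1000   # Linux value
IPC_RMID = 0         # Linux value

libc = ctypes.CDLL(None, use_errno=True)
libc.shmget.argtypes = [ctypes.c_int, ctypes.c_size_t, ctypes.c_int]
libc.shmget.restype = ctypes.c_int
libc.shmat.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_int]
libc.shmat.restype = ctypes.c_void_p
libc.shmdt.argtypes = [ctypes.c_void_p]
libc.shmdt.restype = ctypes.c_int
libc.shmctl.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_void_p]
libc.shmctl.restype = ctypes.c_int

SHMAT_FAILED = ctypes.c_void_p(-1).value  # shmat() returns (void *) -1 on error


def _raise_errno(what):
    err = ctypes.get_errno()
    raise OSError(err, f"{what}: {os.strerror(err)}")


def segment_roundtrip(payload=b"still here", size=4096):
    """Create a segment, write and read it, and always remove it afterwards."""
    shmid = libc.shmget(IPC_PRIVATE, size, IPC_CREAT | 0o600)
    if shmid == -1:
        _raise_errno("shmget")
    try:
        addr = libc.shmat(shmid, None, 0)
        if addr is None or addr == SHMAT_FAILED:
            _raise_errno("shmat")
        ctypes.memmove(addr, payload, len(payload))
        data = ctypes.string_at(addr, len(payload))
        if libc.shmdt(addr) == -1:
            _raise_errno("shmdt")
        return data
    finally:
        # Without this call the segment outlives the process (see `ipcs -m`).
        libc.shmctl(shmid, IPC_RMID, None)
```

Segments that were leaked by processes that skipped this step show up in `ipcs -m` and can be removed with `ipcrm`.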
57. Accounting for shared memory is difficult
● top reports memory that can be shared – but might not be
● ps doesn't account for shared
● pmap splits mapped vs shared, reports allocated vs used
● mmap'd files are shared, until modified → at which point they're private
59. Linux filesystem cache
We know shared memory is in the page cache, which we can largely understand through /proc
From /proc/meminfo:
Cached:   367924 kB
...
Mapped:    31752 kB
Shmem:       196 kB
But what about the rest of what's in the cache?
61. Linux filesystem cache
Bad news:
We can't just ask “What's in the cache?”
Good news:
We can ask “Is this file in the cache?”
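The per-file question can be answered with mincore(), which a later slide names. The following Python sketch calls mmap() and mincore() through ctypes; it assumes 64-bit Linux (where off_t fits a C long), and the function name is illustrative. Mapping the file does not read it in, so the result reflects what was already in the page cache.

```python
import ctypes
import mmap
import os

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.mmap.restype = ctypes.c_void_p
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
libc.munmap.restype = ctypes.c_int
libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                         ctypes.POINTER(ctypes.c_ubyte)]
libc.mincore.restype = ctypes.c_int

MAP_FAILED = ctypes.c_void_p(-1).value  # mmap() returns (void *) -1 on error


def cached_pages(path):
    """Return (resident_pages, total_pages) for a file, as reported by mincore()."""
    size = os.path.getsize(path)
    if size == 0:
        return 0, 0
    pages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    fd = os.open(path, os.O_RDONLY)
    try:
        addr = libc.mmap(None, size, mmap.PROT_READ, mmap.MAP_SHARED, fd, 0)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed
    if addr is None or addr == MAP_FAILED:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    try:
        vec = (ctypes.c_ubyte * pages)()
        if libc.mincore(addr, size, vec) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
        # The low bit of each byte says whether that page is resident.
        resident = sum(byte & 1 for byte in vec)
    finally:
        libc.munmap(addr, size)
    return resident, pages
```

Walking a directory tree and calling `cached_pages` on every file is the "check every file" cost the talk refers to.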
67. SystemTap – cache hits
https://sourceware.org/systemtap/wiki/WSCacheHitRate
[jmiller@stap ~]$ sudo stap /tmp/cachehit.stap
Cache Reads (KB)   Disk Reads (KB)   Miss Rate   Hit Rate
          508236             24056       4.51%     95.48%
               0             43600     100.00%      0.00%
               0             59512     100.00%      0.00%
          686012             30624       4.27%     95.72%
          468788                 0       0.00%    100.00%
           17000             63256      78.81%     21.18%
               0             67232     100.00%      0.00%
               0             19992     100.00%      0.00%
Track reads against VFS, reads against disk, then infer cache hits
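The inference itself is simple arithmetic on the two columns. This Python sketch, with an illustrative function name, treats the hit rate as cache reads divided by cache plus disk reads, which matches the figures in the sample output.

```python
def hit_rate(cache_kb, disk_kb):
    """Percent of read traffic served from cache, or None when nothing was read."""
    total = cache_kb + disk_kb
    if total == 0:
        return None
    return 100.0 * cache_kb / total


# A few (cache KB, disk KB) intervals taken from the sample output.
samples = [(508236, 24056), (0, 43600), (686012, 30624), (468788, 0), (17000, 63256)]
for cache_kb, disk_kb in samples:
    rate = hit_rate(cache_kb, disk_kb)
    print(f"cache={cache_kb:>7} disk={disk_kb:>6} hit={rate:6.2f}%")
```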
But – have to account for LVM, device mapper, remote disk devices (NFS, iSCSI), ...
69. Easy mode - drop_caches
echo 1 | sudo tee /proc/sys/vm/drop_caches
● frees clean cache pages immediately
● frequently accessed files should be re-cached quickly
● performance impact while caches repopulated
70. Filesystem cache contents
● No ability to easily see full contents of cache
● mincore() - but have to check every file
● Hard - SystemTap / DTrace inference
● Easy – drop_caches and observe impact