Transparent Huge Pages (THP) - A Rough Guide to Memory Management

[Official kernel docs](https://docs.kernel.org/admin-guide/mm/transhuge.html)

## What is THP?

THP is a feature that automatically adjusts page tables to make use of *huge folios* (order > 0) which are directly mappable by hardware. We say we **collapse** these ranges.

These are ranges larger than the base page size (typically 4 KiB) which can be mapped by hardware such that a larger range occupies a single slot in the TLB. By doing so, we reduce TLB contention, which can have a significant performance impact.

What makes THP special (and different from hugetlb) is that it handles this behind the scenes, and also converts these ranges back to 'ordinary' mappings (i.e. at base page size) when necessary (we say we **split** these ranges).

THPs are typically PMD sized (2 MiB for 4 KiB base page size).

THPs of PUD size (1 GiB for 4 KiB base page size) are permitted for DAX and VFIO mappings (in theory other `VM_PFNMAP`/`VM_MIXEDMAP` mappings too, but no others implement `vm_ops->huge_fault`). We (horribly, shamefully) term these **special huge pages**. These are only established upon page fault.

For architectures that support it (really mostly arm64) **mTHP** (multi-size THP) allows huge pages at sizes less than PMD (note that **order-1**, typically 8 KiB page size, is **not** supported by anonymous mTHP but **is** by shmem mTHP).

## Which kinds of mappings support THP?

* Anonymous mappings. We even try to get THP on fault for these if possible, see [[#How do I get THP?]] below.
* shmem mappings (e.g. tmpfs, `MAP_SHARED | MAP_ANON`, memfd mappings, etc.).
    * This behaves differently depending on whether it's tmpfs or an 'anonymous' shmem mapping (defined as being one which had its inode unlinked at time of creation, e.g. `memfd` or `MAP_ANON | MAP_SHARED` mapping etc. - see `vma_is_anon_shmem()`). See [[#How to configure THP?]] for details.
* Device memory mappings
    * DAX - If it happens to have PMD or PUD alignment (DAX specifies alignment), **only** on **fault**.
      See `dax_get_unmapped_area()` and `dev_dax_huge_fault()`.
    * VFIO - If the mapping is explicitly PMD or PUD aligned (via hint) or you are lucky enough to get it, **only** on **fault**. See `vfio_pci_mmap_huge_fault()`.
* Mappings of files on file systems which explicitly support it (e.g. implement `vm_ops->huge_fault`). At the time of writing that is - fuse dax, ext2, ext4, xfs, erofs.
    * Note that these **only** give you THP on **fault**, and do not support khugepaged or `MADV_COLLAPSE` by default.
    * To get khugepaged and `MADV_COLLAPSE` support you have to utilise `CONFIG_READ_ONLY_THP_FOR_FS`.
* Mappings of files which are not open for R/W anywhere if `CONFIG_READ_ONLY_THP_FOR_FS` is specified. This is only for khugepaged and `MADV_COLLAPSE`.

## How do I get THP?

There are three means of getting huge pages from THP. You can observe how we determine what orders are supported in `thp_vma_allowable_orders()` and `__thp_vma_allowable_orders()`.

1. **Page fault** - supports **PMD, PUD, mTHP** - Upon fault, if the faulting address aligned to the huge folio order spans an empty range and THP is permitted, we allocate a THP folio and map it as THP. Considering different types of mapping:
    * **Anonymous**
        * Upon `mmap()`, if the mapping:
            * Is anonymous,
            * Has no address hint specified,
            * And has a **size** aligned to PMD (see 'famous' commit d4148aeab4 for details as to why we have this requirement).
        * It will be mapped PMD-aligned, meaning it will, upon fault, automatically be mapped as THP. See `__get_unmapped_area()`, `__handle_mm_fault()`.
        * If **mTHP** is available (as determined by sysfs settings):
            * Upon fault we check to determine the largest mTHP size we can map, and do so depending on whether the mTHP-aligned address is currently unmapped. See `alloc_anon_folio()`.
    * **shmem**
        * We align shmem mappings for THP if viable (see `shmem_get_unmapped_area()`).
        * We treat mappings differently depending on whether they are a tmpfs mount or an 'anonymous' shmem mapping (e.g.
          `memfd` or `mmap()` with `MAP_ANON | MAP_SHARED`).
            * tmpfs behaviour is controlled via global mount controls, e.g. mount options `huge=xxx`.
            * Anonymous shmem is controlled via sysfs settings.
        * Supports both PMD size and **mTHP** if available. See `shmem_fault()`, `shmem_get_folio_gfp()`, `shmem_allowable_huge_orders()`, `shmem_suitable_orders()`.
        * See [[#shmem THP configuration]] for details on how shmem THP is configured.
    * **File system**
        * Upon `mmap()`, if the mapping is of a file on one of the following (each of which specifies `thp_get_unmapped_area()`):
            * ext2
            * ext4
            * erofs
            * xfs
        * It will be mapped PMD-aligned, meaning it will, upon fault, automatically be mapped as THP (each of these file systems supports `vm_ops->huge_fault`). See `thp_get_unmapped_area()`, `__handle_mm_fault()`.
        * Note that `CONFIG_READ_ONLY_THP_FOR_FS` does **not** result in THP on page fault.
        * btrfs is a 'special snowflake' that uses `thp_get_unmapped_area()` but doesn't implement `vm_ops->huge_fault`, so gets mapped THP-aligned, but relies on `CONFIG_READ_ONLY_THP_FOR_FS` being set and khugepaged or `MADV_COLLAPSE` doing the actual work.
    * **Device**
        * Upon `mmap()` if the mapping is **DAX**, it will be mapped according to DAX alignment, which may align to PMD or PUD. See `dax_get_unmapped_area()`.
            * Upon fault, since it implements `vm_ops->huge_fault`, it will automatically be mapped as THP if possible. See `__handle_mm_fault()` and `dev_dax_huge_fault()`.
        * If the mapping is of a PCI **VFIO** area, there is no special mapping logic applied, so to get THP you either have to get lucky with alignment or use a hint.
            * Upon fault, since it implements `vm_ops->huge_fault`, it will automatically be mapped as THP if possible. See `__handle_mm_fault()` and `vfio_pci_mmap_huge_fault()`.
2. **khugepaged** - supports **PMD** - Kernel process that works in the background collapsing base pages (typically 4 KiB) into huge pages. *(mTHP support is coming soon)*.
    * Kernel thread that runs in the background.
    * We "**scan**" memory for existing base page mappings we can collapse into huge pages. This is determined by sysfs settings, see [[#How to configure THP?]] below for details.
    * We support **anonymous**, **shmem**, and **file-backed** (only if `CONFIG_READ_ONLY_THP_FOR_FS` is set and the file-backed mapping is not mapped writable anywhere) mappings in khugepaged.
    * Anonymous/file-backed and shmem are configured differently. See [[#shmem THP configuration]] for details.
    * khugepaged **only** collapses to **PMD** size mappings (though mTHP support is coming soon).
3. **madvise(..., MADV_COLLAPSE)** - supports **PMD** - Means by which ranges can be unconditionally manually collapsed into huge pages.
    * **Ignores** sysfs settings.
    * We support **anonymous**, **shmem**, and **file-backed** (only if `CONFIG_READ_ONLY_THP_FOR_FS` is set and the file-backed mapping is not mapped writable anywhere) mappings for collapse.
    * **mTHP** is unsupported for anything on collapse. See `madvise_collapse()`, where we invoke `thp_vma_allowable_order()` against `PMD_ORDER`.
    * At least one page within the range must be faulted in for collapse to succeed.
    * The specified address and size must be PMD-aligned.

## How to configure THP?

**IMPORTANT:** Any VMA that has `VM_NOHUGEPAGE` set, which can be achieved via `madvise(..., MADV_NOHUGEPAGE)`, will have THP **disabled** in all cases.

**sysfs** can only configure khugepaged and page faults, and is ignored by `madvise(..., MADV_COLLAPSE)`.

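
To check whether the `VM_NOHUGEPAGE` rule applies to a running process, one option is to look at the `VmFlags:` lines in `/proc/<pid>/smaps`, where `nh` marks a VMA with the "no huge page" advice flag and `hg` one with the "huge page" advice flag. The following is a minimal sketch, not something taken from the notes above: the helper name `count_vmflag` is made up for illustration, and it assumes a kernel that exposes `VmFlags` in smaps.

```shell
# count_vmflag FLAG: read smaps-formatted text on stdin and print how many
# VMAs carry FLAG (e.g. "nh" for VM_NOHUGEPAGE, "hg" for VM_HUGEPAGE).
# Hypothetical helper for illustration, not a kernel-provided tool.
count_vmflag() {
    awk -v flag="$1" '
        $1 == "VmFlags:" {
            # Fields 2..NF are the two-letter flag mnemonics for this VMA.
            for (i = 2; i <= NF; i++)
                if ($i == flag) { n++; break }
        }
        END { print n + 0 }
    '
}

# Usage against the current shell. smaps may be missing or unreadable in
# restricted environments, in which case we just say so.
if [ -r "/proc/$$/smaps" ]; then
    echo "VMAs with VM_NOHUGEPAGE: $(count_vmflag nh < "/proc/$$/smaps")"
    echo "VMAs with VM_HUGEPAGE:   $(count_vmflag hg < "/proc/$$/smaps")"
else
    echo "/proc/$$/smaps not readable here"
fi
```

A non-zero `nh` count only says that some VMA opted out. It does not say which sysfs settings are in force.
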
The primary means of controlling **khugepaged** is via `sysfs` tunables in `/sys/kernel/mm/transparent_hugepage`:

```shell
[~]$ ls /sys/kernel/mm/transparent_hugepage
defrag            hugepages-128kB   hugepages-32kB   khugepaged
enabled           hugepages-16kB    hugepages-512kB  shmem_enabled
hpage_pmd_size    hugepages-2048kB  hugepages-64kB   shrink_underused
hugepages-1024kB  hugepages-256kB   hugepages-8kB    use_zero_page
```

These settings impact **page fault** and **khugepaged** collapse of memory ranges, but do **not** impact `madvise(..., MADV_COLLAPSE)` of memory ranges.

### anon/file-backed THP configuration

The so-called *global* controls are:

```shell
[~]$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
```

**IMPORTANT:** These are misleading - setting `never` in `/sys/kernel/mm/transparent_hugepage/enabled` does **not** disable THP globally for anon/file-backed mappings. It simply disables THP for khugepaged/page faults for those THP sizes set to **inherit**.

E.g.:

```shell
$ ls /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
enabled  shmem_enabled  stats
$ cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
always [inherit] madvise never
```

Any specific THP size can be configured differently from the 'global' setting.

The settings for anonymous/file-backed memory, as specified in the `enabled` file, are:

* **always** - Enable regardless of whether `madvise(..., MADV_HUGEPAGE)` is specified on a memory range.
* **inherit** (obviously not present in the global setting) - Inherit from the global setting (the default for PMD-sized THP; other sizes default to `never`).
* **madvise** - Enable only if `madvise(..., MADV_HUGEPAGE)` is specified on a memory range.
* **never** - Do not allow **khugepaged** or **page faults** to establish THP of this size.

### shmem THP configuration

For shmem, things differ depending on whether the mapping is a tmpfs mount or not.

shmem mappings are either tmpfs mounted or 'anonymous shmem', that is memfd, `mmap()` with `MAP_ANON | MAP_SHARED`, etc.

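
Because tmpfs behaviour depends on how each mount was created rather than on sysfs, it can be handy to list the `huge=` option of every tmpfs mount on a system. This is a rough sketch under stated assumptions: it expects the usual `/proc/mounts` layout (device, mount point, type, options, ...), and it assumes tmpfs leaves `huge=` out of the options when the mount is at `never` or was given no such option. `tmpfs_huge_opts` is an illustrative name, not an existing tool.

```shell
# tmpfs_huge_opts: read /proc/mounts-formatted lines on stdin and print
# "<mount point> <huge= value>" for each tmpfs mount, or "(not shown)" when
# the options field carries no huge= entry. Illustrative helper only.
tmpfs_huge_opts() {
    awk '
        $3 == "tmpfs" {
            huge = "(not shown)"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^huge=/)
                    huge = substr(opts[i], 6)   # text after "huge="
            print $2, huge
        }
    '
}

# Usage: inspect the live mount table when it is available.
if [ -r /proc/mounts ]; then
    tmpfs_huge_opts < /proc/mounts
fi
```

Anonymous shmem (memfd, `MAP_ANON | MAP_SHARED`) never shows up in this output, since it is not a user-visible mount.
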
For tmpfs, we determine how to proceed based on the mount option `huge=xxx`. We **ignore sysfs** for tmpfs, with only a couple of exceptions:

**IMPORTANT:** The `/sys/kernel/mm/transparent_hugepage/shmem_enabled` tunable is only applicable for **anonymous shmem mappings** **except** if `deny` or `force` are specified. All specific page size settings in `/sys/kernel/mm/transparent_hugepage/hugepages-xxxkB/shmem_enabled` are **ignored** for tmpfs. See `shmem_allowable_huge_orders()` and `shmem_huge_global_enabled()` for details.

We observe **anonymous shmem** global settings in `/sys/kernel/mm/transparent_hugepage/shmem_enabled`:

```shell
$ cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
always within_size advise [never] deny force
```

And individual **anonymous shmem** per-size settings in `/sys/kernel/mm/transparent_hugepage/hugepages-xxxkB`:

```shell
$ cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/shmem_enabled
always [inherit] within_size advise never
```

For **tmpfs** the available `huge=xxx` options are:

* **always** - Allow THP at PMD and all **mTHP** sizes on page fault (PMD only for `MADV_COLLAPSE` or khugepaged).
* **within_size** - Allow collapse only at sizes contained entirely within the size of the inode.
* **advise** - Same as `always` if `VM_HUGEPAGE` is set (e.g. via `madvise(..., MADV_HUGEPAGE)`), otherwise disallow.
* **never** - Do not collapse THP.

tmpfs also honours the following global `/sys/kernel/mm/transparent_hugepage/shmem_enabled` options:

* **force** - Force THP on if the page cache supports PMD-sized folios, otherwise do not (the same applies to `MADV_COLLAPSE`).
* **deny** - Simply disable.

For **anonymous shmem**, global options are in `/sys/kernel/mm/transparent_hugepage/shmem_enabled`:

* **always** - Enables all `inherit` sizes.
* **within_size** - Enables all `inherit` sizes strictly contained within the size of the inode.
* **advise** - Enables all `inherit` sizes if `VM_HUGEPAGE` is set on the VMA (the region had `MADV_HUGEPAGE` applied).
* **deny** (only present at global level, for testing purposes only) - Disable globally regardless of per-size settings. **Also applicable to tmpfs**.
* **force** (only present at global level, for testing purposes only) - Force-enable all `inherit` values. You cannot enable this if `/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/shmem_enabled` is not set to `inherit`, so this will give a max size of PMD.

Note that the implementation of this is truly horrible. See `shmem_huge_global_enabled()`, which doubles up as checking tmpfs settings as well as, for anonymous shmem, checking inherit state. It uses `SHMEM_SB(shm_mnt->mnt_sb)` to store global state and `inode->i_sb` to store individual tmpfs mount state for `huge=xxx` settings. The `shmem_allowable_huge_orders()` function calls it for both purposes.

## How to disable THP??

It seems as if:

```shell
$ echo never > /sys/kernel/mm/transparent_hugepage/enabled
$ echo never > /sys/kernel/mm/transparent_hugepage/shmem_enabled
```

Should disable THP globally, right? Wrong. This only disables THP for khugepaged and page faults for those THP sizes set to `inherit`. It also doesn't impact tmpfs mounts.

How about:

```shell
$ echo never | tee /sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled
$ echo never | tee /sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled
```

Again, tmpfs will ignore this, and `madvise(..., MADV_COLLAPSE)` will ignore everything.

OK so:

```shell
$ echo never | tee /sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled
$ echo deny > /sys/kernel/mm/transparent_hugepage/shmem_enabled
```

Will handle khugepaged and page faults *including* tmpfs.

But what about `madvise(..., MADV_COLLAPSE)`? In this case, we can only disable per-process.
To do so, we use a `prctl()` option (`prctl()` is a graveyard of stuff we couldn't figure out where to put sensibly):

```c
prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
```

This disables THP **for this process and all forked children of the process**.

We now have further control over how this behaves, with an additional flags parameter:

```c
prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0);
```

Which, as the flag name suggests, means we are able to disable all THP except in instances where VMAs are explicitly `madvise(..., MADV_HUGEPAGE)`'d.
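
For the sysfs half of this section, a quick audit can confirm nothing was missed. The sketch below walks a `transparent_hugepage` directory and reports every per-size `enabled` file whose effective value is not `never` (resolving `inherit` against the global `enabled` file), plus the global `shmem_enabled` file if it is not `deny`. It takes the directory as an argument so it can be pointed at a copy. `thp_selected` and `thp_audit` are illustrative names, the logic only mirrors the recipe described above, and it cannot say anything about `MADV_COLLAPSE` or per-process `prctl()` state.

```shell
# thp_selected FILE: print the currently selected option of a THP sysfs
# file, i.e. the word inside [brackets].
thp_selected() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# thp_audit DIR: report controls under DIR that still permit THP via
# khugepaged or page fault. Returns 0 if nothing is reported, 1 if
# something is, 2 if DIR does not exist. Uses plain (global) shell
# variables to stay within POSIX sh.
thp_audit() {
    dir=$1
    rc=0
    if [ ! -d "$dir" ]; then
        echo "no THP sysfs directory at $dir"
        return 2
    fi
    global=$(thp_selected "$dir/enabled" 2>/dev/null)
    for f in "$dir"/hugepages-*kB/enabled; do
        [ -r "$f" ] || continue
        sel=$(thp_selected "$f")
        # 'inherit' defers to the global 'enabled' file.
        if [ "$sel" = inherit ]; then
            sel=$global
        fi
        if [ "$sel" != never ]; then
            echo "$f: $sel"
            rc=1
        fi
    done
    if [ -r "$dir/shmem_enabled" ]; then
        sel=$(thp_selected "$dir/shmem_enabled")
        if [ "$sel" != deny ]; then
            echo "$dir/shmem_enabled: $sel"
            rc=1
        fi
    fi
    return $rc
}

# Usage: audit the live settings; reading these files does not need root.
thp_audit /sys/kernel/mm/transparent_hugepage ||
    echo "THP is not fully disabled via sysfs (or sysfs is unavailable)"
```

Taking the directory as a parameter is a deliberate choice: it keeps the check read-only and makes it easy to try against a scratch directory before trusting it on a real system.
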