Dispelling the Kernel Compiling Myth

By Jean Francois Martinez

"Thou hast to recompile thee kernel".

This ancient curse has been cast on every Linux newcomer since the birth of Linux. Unfortunately, as long as kernel recompiling is deemed a necessary part of a Linux installation, it will be impossible to spread Linux among non-nerds. In this article we will analyze in detail the performance increases one can expect from kernel compiling.

Memory savings

"Thanks to kernel recompiling you can free your installation kernel of much unneeded bloat. You also should compile permanently used modules in the kernel for additional savings. A leaner kernel will make your computer faster thanks to reducing paging".
Let's quantify this.

To begin with, we will look at compiling modules into the kernel. Compiling a module into the kernel saves a little more than 2K per module: 2K due to page alignment plus a small bit of code for loading and unloading the module. Now, despite being a module fanatic, I have never managed to be in a situation with more than ten modules loaded, but let's imagine you have 20 modules loaded and all of them are needed permanently, so you compile them into the kernel. You would save 40K of memory, that is, 0.5% of the memory of an 8 Meg computer.

Now we will look at the benefits of a lean kernel. When Matt Welsh wrote his books, kernel recompiling was undoubtedly necessary. It was not uncommon to save more than 1.5 Megs of memory, and your average computer had 8 Megs of RAM. Thus recompiling would increase available memory from 5.5 to 7 Megs, that is, a 27% increase.
But people failed to notice that Linux has gone modular and computers have got more memory. Today most distributions ship modular kernels, so recompiling brings much smaller benefits than in 1995. As an example, I tested recompiling the kernel shipped in RedHat 5.2 with everything unneeded thrown out and everything else modularized where possible. The boot messages (that is, before the loading of any module) showed I had saved a mere 400K. In addition, today even low end computers have 32 Megs of RAM, which means that recompiling your kernel will increase your available memory by only 1.25%.
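
The memory figures quoted so far can be rechecked with a few lines of Python. This is only a back-of-the-envelope sketch: the inputs are the numbers given in the text, and the helper name percent is made up for the example.

```python
# Back-of-the-envelope check of the memory figures quoted in the text.
# All inputs come from the article; the helper name is illustrative.

KB = 1024
MB = 1024 * KB

def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole

# 20 permanently loaded modules, about 2K saved per module, 8 Meg machine.
module_saving = 20 * 2 * KB
print("modules built in: %.2f%% of 8 Megs" % percent(module_saving, 8 * MB))

# 1995: about 1.5 Megs saved, available memory goes from 5.5 to 7 Megs.
print("1995 lean kernel: %.1f%% more available memory"
      % percent(1.5 * MB, 5.5 * MB))

# RedHat 5.2 test: about 400K saved on a 32 Meg machine.
print("RedHat 5.2 test: %.2f%% of 32 Megs" % percent(400 * KB, 32 * MB))
```

The results come out at roughly 0.49%, 27.3% and 1.22%. The last one is slightly below the 1.25% quoted because the sketch counts a Meg as 1024K.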

It is possible to write a specially designed program that will not take a single page fault with N Megs of memory and will thrash horribly if you reduce it by a single page. However, in normal situations a 1.25% increase in available memory will make little difference. There ARE still a couple of distributions that ship kernels good for little besides installation: huge kernels lacking essential features, so recompiling is not a performance issue but a requirement. Now consider what happens if a small company without a full-time guru needs a firewall. Its expert can do little more than start Word. If he stumbles upon a distribution with one of those broken kernels he will fail and will end up recommending NT.

Most modern distribs (Caldera, Suse, RedHat and their clones) ship fully-featured kernels, and in addition kernel recompiling will produce no appreciable speed increase due to memory savings: they are good enough out of the box. Only a couple of "hackeristic" distribs will force you to recompile the kernel. But for the good of Linux you should ask the maintainers to fix them instead of making up for their deficiencies yourself. YOU can recompile but your neighbour cannot, and he will choose NT.

Evaluating CPU speedups due to recompiling

"Recompiling will allow you to build a faster kernel because you will be able to compile for the right CPU".

Again, let's quantify this. Linux performs a number of optimizations for CPU type, but most of them are performed at execution time and don't depend on compiling options. On the one hand we will quantify the influence of alternative portions of code being compiled, and on the other we will look at the influence of compilation options on the code generated by GCC.

Effect of the ifdefs

If you take a look at the source code of the 2.0 kernel you will notice only two portions of code whose inclusion depends on CPU type. The first is related to selective invalidation of TLB entries and the second to the method used for swapping bytes. In both cases the choice is 386 versus everything else. There was a third portion of code that depended on CPU type: the way blocks of memory were copied. The fastest way for 386s, PPros and Pentium IIs is slightly sub-optimal on 486s and much slower on plain Pentiums. However, this optimization has been disabled, and now, whatever CPU you have, blocks of memory are copied the 386-PPro-PII way.

Effect of byte swapping

Byte swapping takes place in two cases: header info when exchanging packets over a network with a machine of different endianness, and addressing info for SCSI peripherals. In both cases the content (e.g. what you write to a SCSI disk) is not changed. The only effect is on headers and control info, and that is only a minimal part of the CPU time spent on networking or SCSI activity, so it has no noticeable effect on performance.

Effects of selective invalidation of TLB

We will explain some basics about VM and address translation. When given an address, the CPU first looks into a page directory and then into a page table in order to translate the virtual address into a real address before it can access the data. That means a threefold slowdown, because there are three memory accesses instead of one. In fact it could be much worse if the page table entries are in slow regular RAM while the real data is in the much faster cache. To avoid this, the CPU keeps a list of the most recently accessed pages and their translations in an internal ultra-fast memory called the TLB (translation lookaside buffer). Now suppose the kernel wants to unmap a page belonging to a process. It will modify the page tables, but they are then no longer in sync with the TLB, so if the CPU finds the address in the TLB it will not look at the page tables and will use the wrong data. Therefore the kernel needs to tell the CPU to stop using that TLB entry, but 386s don't support selective invalidation of TLB entries, so on them the kernel invalidates the whole TLB. The kernel you get with your distribution has to work with 386s as well as newer processors, so it is compiled to use total TLB invalidation, and that means that if you are using a newer processor you lose the benefits of selective invalidation.
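
The difference between the two invalidation strategies can be sketched with a toy model. This is only an illustration, not kernel code: the ToyTLB class, its LRU replacement policy and the sizes used (a 64 entry TLB, a 32 page working set) are all assumptions made for the example.

```python
from collections import OrderedDict

class ToyTLB:
    """A tiny LRU cache of page translations (illustrative model only)."""

    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()
        self.misses = 0

    def translate(self, page):
        if page in self.entries:
            self.entries.move_to_end(page)    # hit: refresh LRU position
            return
        self.misses += 1                      # miss: page tables must be read
        if len(self.entries) >= self.size:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[page] = True

    def flush_all(self):
        """Total invalidation: every cached translation is dropped."""
        self.entries.clear()

    def invalidate(self, page):
        """Selective invalidation: only one translation is dropped."""
        self.entries.pop(page, None)


def misses_after_unmap(selective, working_set=32, tlb_size=64):
    """Count TLB misses when a process resumes after one page is unmapped."""
    tlb = ToyTLB(tlb_size)
    unmapped_page = working_set               # one extra page, later unmapped
    for page in range(working_set + 1):       # warm up: touch every page once
        tlb.translate(page)
    tlb.misses = 0

    if selective:
        tlb.invalidate(unmapped_page)
    else:
        tlb.flush_all()

    for page in range(working_set):           # process touches its pages again
        tlb.translate(page)
    return tlb.misses


print("misses with total invalidation:    ", misses_after_unmap(False))
print("misses with selective invalidation:", misses_after_unmap(True))
```

With these assumed sizes the total invalidation costs 32 misses when the process resumes and the selective one costs none. How much time those misses actually cost is a separate question.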

Let's look now at the circumstances where selective TLB invalidation has a significant effect and let's quantify the slowdown.
First of all, if the kernel unmaps a page and then hands control to another process, it will reload CR3 and that will cause a total TLB invalidation (different processes have entirely different mappings), so you get a benefit only if control is handed back to the same process, either immediately or after some time in kernel mode. Also consider that the time wasted due to an entire TLB invalidation is a few microseconds, while disk IO takes 10 milliseconds in the best case, that is, about a thousand times more. That means that if disk IO follows the unmapping (due to a swap out), the benefits would be insignificant.

In fact, about the only case where selective TLB invalidation would be meaningful is the following scenario: the process frees memory, so the kernel invalidates the TLB; it hands control back to the same process, and the process scans a large array doing only a single access for every entry; then, just when the TLB is fully reloaded, the process unmaps memory again, causing a new TLB invalidation; the kernel gives back control again, and the process scans the same array entries. Highly theoretical, and don't forget that during the second pass the page table entries will be in cache, so address translation will be much faster, which reduces the benefit of selective TLB invalidation.

Let's evaluate what happens in a normal process. We will arbitrarily assume this process runs for one tick (10 ms) after the unmapping. For everything else we will take the worst case. The slower the memory, the more costly translation is, so we will assume this computer uses 60 ns DRAM instead of SDRAM. The larger the TLB, the bigger the benefits of selective invalidation, so we will choose a CPU with a big TLB; in our case it will be an AMD K6 model 7, which has a 64 entry TLB for code pages and a 128 entry TLB for data pages. We will also assume that we never find either page table entries or page directory entries in cache (the latter is very unrealistic, because a single directory entry covers 4 Megs of address space), so every translation will need 2 x 60 = 120 ns and the complete refilling of the TLB needs 120 ns * 192 TLB entries = 23 microseconds. Because we assumed the process would run for a whole tick, the slowdown due to address translation is only 0.2 per cent.
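
The worst-case estimate above is simple enough to recompute. The sketch below uses only the figures given in the text; the constant names are made up for the example.

```python
# Worst-case cost of refilling the whole TLB, using the figures in the text.
DRAM_ACCESS_NS = 60           # one access to slow DRAM
ACCESSES_PER_TRANSLATION = 2  # page directory entry + page table entry
TLB_ENTRIES = 64 + 128        # AMD K6 model 7: code TLB + data TLB
TICK_NS = 10 * 1000 * 1000    # one tick of 10 ms, expressed in nanoseconds

refill_ns = DRAM_ACCESS_NS * ACCESSES_PER_TRANSLATION * TLB_ENTRIES
slowdown_percent = 100.0 * refill_ns / TICK_NS

print("refill time: %.1f microseconds" % (refill_ns / 1000.0))
print("slowdown over one tick: %.2f%%" % slowdown_percent)
```

This gives 23.04 microseconds and a slowdown of about 0.23%, which the text rounds to 0.2 per cent.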

Effects of tuning GCC options

Precise measurement of kernel timing is quite difficult, and in addition the kernel is a mix of C and assembler. What we will do instead is recompile the Byte benchmarks using GCC 2.7.2.3 with the same flags used in 2.0 kernels, both for 386s (the setting used for the stock kernels in distributions) and for Pentiums and above (486 is an intermediate case). Those benchmarks will give us a good idea, with perhaps a bias towards overestimation: the Byte benchmarks are pure C, so the compiler gains will be felt in full, while the kernel is a mix of C and assembler, the latter being unaffected by compiler optimizations.
The benchmarks were run on two computers: a Pentium 75 and an AMD K6-300. The Pentium tuned test was indeed faster than the 386 tuned test ... by a mere 1.8% on the P75, and about the same on the AMD. The conclusion to be drawn is that GCC 2.7 for the x86 family has few model-dependent optimizations, and that the alignment optimizations are not particularly effective. Those paltry TWO percent (rounded UP) are all you get when you listen to the words of wisdom dispensed in magazines.

If you are an expert and have a spare machine for experimenting, then you could try recompiling with more aggressive optimizations than the standard -O2, or with a better compiler than gcc 2.7, like egcs or pgcc. However, be warned that all 2.0 kernels until 2.0.35, and possibly 2.0.36, have some bugs that will break the kernel with any compiler other than gcc 2.7 (they work due to gcc 2.7 bugs). Also be wary of some optimizations like loop unrolling which, according to the egcs and pgcc docs, were never thoroughly tested, be it in gcc, egcs or pgcc, and remember that egcs and pgcc are not as well tested as gcc (egcs 1.0 was notorious for its FP bugs). With these warnings in mind, there is a 7% speed difference between the Byte benchmarks compiled with -O6 and loop unrolling and those compiled with plain -O2. So playing with the compiler and compiler flags is an interesting possibility if you are an expert: it could help the kernel developers determine which of the more aggressive optimizations don't break the kernel. If you are not an expert, then don't lose sleep over this. The problem is that only a small part of the time spent by your program will be spent executing the parts of kernel code affected:

If your program spends 90% of its CPU time in user mode then kernel optimizations will be hardly felt.
Compiler optimizations will have no effect whenever the kernel runs parts written in assembler.
Many kernel-intensive processes are in fact IO-bound: the CPU waits for the peripheral. That means that if there is only one active process, the kernel will finish its job earlier and will wait a bit longer until the disk is ready. In that case you will get a benefit only if you have two active processes: the speed increase in the kernel will allow the other process to run until the peripheral answers.
Consider also that there are some peripherals (notably some broken IDE disks) that force the kernel to busy-loop until it gets the answer from the peripheral. That means that recompiling your kernel will only affect the number of times the kernel executes the loop.
Two cases where the kernel spends its time doing pure CPU work are pipe data transfers and disk reads when the data is found in cache. These should benefit from tuning the compiler flags, were it not that the data transfer is done in assembler and will not be affected by compiler magic.

Now remember that if your process spends only 10% of its time in kernel parts written in C, then recompiling the kernel with a compiler generating 30% faster code will provide at most a 3% speed increase in overall performance.
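
This is the reasoning usually known as Amdahl's law, and it can be sketched in a few lines. The function and variable names are made up for the example, and reading "30% faster code" as code that runs 1.3 times as fast is an assumption.

```python
def overall_speedup(fraction_affected, local_speedup):
    """Amdahl's law: whole-program speedup when only a fraction gets faster."""
    return 1.0 / ((1.0 - fraction_affected) + fraction_affected / local_speedup)

kernel_c_fraction = 0.10  # share of run time spent in kernel code written in C
compiler_gain = 1.30      # assumption: "30% faster code" means 1.3x as fast

# The simple product used in the text: 10% of the time, 30% faster.
simple_estimate = kernel_c_fraction * (compiler_gain - 1.0)

# The same situation worked out with Amdahl's law.
amdahl_gain = overall_speedup(kernel_c_fraction, compiler_gain) - 1.0

print("simple product: %.1f%%" % (100 * simple_estimate))
print("Amdahl's law:   %.1f%%" % (100 * amdahl_gain))
```

Under that assumption the simple product gives 3% and Amdahl's law gives about 2.4%, so the 3% figure is if anything a little generous.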

Kernel recompiling for your specific processor gives only a minimal CPU boost when the kernel version is 2.0 and the processor is a 1998 or earlier model of the i386 architecture. This could change in future versions of Linux or when using newer processors.

Advice and conclusions

Kernel compiling is not presently an effective way to optimize a Linux box. Don't do it if it frightens you. At most, because it is easy and relatively safe, prepare a rescue floppy, ensure you can boot from it and then recompile changing only two things: set the processor type and disable FPU emulation if your CPU has an FPU (do a cat /proc/cpuinfo if you don't know). With most distributions you will get exactly the same drivers your distribution kernel was compiled with (keep a backup of the original modules just in case).
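
If you would rather have a script look at /proc/cpuinfo for you, a minimal sketch follows. The sample text is hypothetical, the function names are made up for the example, and the field names (in particular "fpu" and "model name") vary between kernel versions, so check what your own /proc/cpuinfo actually contains.

```python
def parse_cpuinfo(text):
    """Parse the 'key : value' lines of the first processor block."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            if info:
                break              # a blank line ends the first block
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info

def has_fpu(info):
    """True if the parsed block reports an FPU (assumes an 'fpu' field)."""
    return info.get("fpu", "").lower() == "yes"

# Hypothetical sample. On a real machine read the file instead:
#   with open("/proc/cpuinfo") as f:
#       info = parse_cpuinfo(f.read())
SAMPLE = """processor\t: 0
vendor_id\t: AuthenticAMD
model name\t: AMD-K6(tm) 3D processor
fpu\t\t: yes
"""

info = parse_cpuinfo(SAMPLE)
print(info.get("model name"), "- FPU present:", has_fpu(info))
```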

Kernel compiling has been seen as the panacea for Linux optimization. Unfortunately this does not stand up to serious analysis. It also has two serious drawbacks. First, it is poor public relations for spreading Linux among normal people. Second, it has sterilized investigation of more effective optimizations.

Some broken IDE disks absorb 90% of CPU time while data transfer is taking place; tuning them with hdparm can reduce this to 20%. But tuning with hdparm is very dangerous, and everyone who has used it has suffered massive data corruption at least once. Never use it unless you can back up your disks, or perform your tests with a single partition mounted, and that one expendable. But if half the energy that has been spent on kernel compiling had been spent on hdparm, we would have a database specifying which settings can be safely used according to disk and chipset model.
Little has been written about the placement of swap partitions, yet smart placement can shorten the movements of the disk arm. In addition, if you have two or more disks you can play with swap partition priorities in order to spread your pages evenly between two disks, thus doubling the transfer rate. You can also try placing your swap partition on a different disk than Linux itself.
Your kernel can be tuned by writing to files under /proc/sys. The problem is that there has been little experimentation to find the right values. In fact, few people know about this. Again, the emphasis on kernel compiling has precluded serious investigation of it.
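
As an illustration of how simple the /proc/sys interface is, here is a minimal sketch of two helpers for reading and writing a tunable. The helper names are made up, and the file name fs/file-max is only an example: which files exist, and what values they accept, depends on your kernel version. The demonstration runs against a temporary directory so that nothing on a real system is changed.

```python
import os
import tempfile

def read_tunable(name, root="/proc/sys"):
    """Return the current value of a tunable (e.g. 'fs/file-max') as text."""
    with open(os.path.join(root, name)) as f:
        return f.read().strip()

def write_tunable(name, value, root="/proc/sys"):
    """Write a new value; on a real /proc/sys this needs root privileges."""
    with open(os.path.join(root, name), "w") as f:
        f.write(str(value) + "\n")

# Demonstration against a scratch directory standing in for /proc/sys.
demo_root = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_root, "fs"))
write_tunable("fs/file-max", 4096, root=demo_root)
print("fs/file-max =", read_tunable("fs/file-max", root=demo_root))
```

On a real machine you would drop the root argument, and you should note the old value before writing a new one so that you can restore it.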

The people advocating other solutions will use kernel compiling as an argument against Linux. Let's kill this myth.

"Linux Gazette...making Linux just a little more fun!"