The least common words in the Linux kernel

2017-??-?? ??:??:??

I was recently shown a blog post about the oldest temporary hack in the Linux kernel, which led to an interesting question:

How easy is it to grep the whole kernel?

Firstly I needed to get hold of a copy of the Linux kernel from kernel.org. I chose 4.10 because it was the latest stable kernel. Now the only thing left to do was to find a command capable of recursively searching a directory for a given string.

After some initial stumbling around with the find command I happened upon the command I was obviously looking for:

$ grep -r string dir

which looks to be the simplest way. Having found this command, the first thing I did was to find the other uses of "temporary hack":

$ grep -r "temporary hack" linux-4.10
linux-4.10/drivers/uwb/beacon.c:
	/* temporary hack until we do something with this message... */
linux-4.10/drivers/ide/ide-xfer-mode.c:
	* TODO: temporary hack for some legacy host drivers that didn't
linux-4.10/drivers/net/ethernet/atheros/atlx/atlx.c:
	/* Including this file like a header is a temporary hack, I promise. -- CHS */
linux-4.10/sound/usb/format.c:
	fp->formats = SNDRV_PCM_FMTBIT_U8; /* temporary hack to receive byte streams */
linux-4.10/sound/usb/format.c:
	/* FIXME: temporary hack for extigy/audigy 2 nx/zs */
linux-4.10/sound/hda/hdac_device.c:
	/* temporary hack: we have still no proper support
linux-4.10/include/net/irda/irda_device.h:
	* The default_qdisc_pad field is a temporary hack.
linux-4.10/arch/parisc/kernel/vmlinux.lds.S:
	/* temporary hack until binutils is fixed to not emit these
linux-4.10/arch/m68k/kernel/ptrace.c:
	int regno = addr >> 2; /* temporary hack. */
linux-4.10/arch/m68k/atari/config.c:
	/* This is a temporary hack: If there is Falcon video
linux-4.10/kernel/exit.c:
	* FIXME: this is the temporary hack, we should teach
At this point I was pleasantly surprised by how fast grep is. It took only around 2-3 seconds to search the entire Linux kernel source, which is much less time than I expected.

The second thing that I noticed was that the word frequency is fairly unusual. For example, "temporary" has fewer hits than "hack" (which is used an unsettling number of times).

$ grep -r "hack" linux-4.10 | wc -l
1476

$ grep -r "temporary" linux-4.10 | wc -l
1270
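
One caveat on those numbers: `grep ... | wc -l` counts matching lines rather than occurrences, and it matches substrings, so "hack" also counts lines containing "hacked" or "shack". A minimal sketch of a stricter count, assuming GNU grep; the `count_word` helper is a hypothetical name and was not used for the figures above.

```shell
#!/bin/bash
# count_word WORD DIR: count whole-word occurrences of WORD under DIR.
#   -r  recurse into DIR
#   -h  suppress file names, so only the matches are printed
#   -o  print each match on its own line (two hits on one line count twice)
#   -w  match whole words only, so "hack" does not match "hacked"
#   -F  treat WORD as a fixed string, not a regular expression
count_word() {
	grep -rhowF -- "$1" "$2" | wc -l
}

# Small self-contained demo on a throwaway directory.
demo=$(mktemp -d)
printf 'a hack and another hack\nhacked together\na shack\n' > "$demo/a.c"
printf 'temporary hack\n' > "$demo/b.c"

count_word hack "$demo"          # whole-word occurrences
grep -r "hack" "$demo" | wc -l   # matching lines, substrings included
rm -r "$demo"
```

On the demo files the first count is 3 and the second is 4. On the real tree the equivalent call would be `count_word hack linux-4.10`.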

Unusual words

It was at this point that I started wondering which, if any, words are used only once in the kernel source.

The first thing I needed to find this out was a list of candidate words. Fortunately most Linux distros have a package that installs a dictionary:

#Ubuntu
$ sudo apt-get install wbritish
$ cat /usr/share/dict/british-english | tail -n 2
tude
tudes


#Antergos
$ sudo pacman -S words
$ cat /usr/share/dict/british | tail -n 2
Zyuganov's
Zzz

As you can see, the two lists contain different words. I ended up using the Antergos one, so that is what the results are based on.
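
To see how far two word lists diverge, `comm` can compare them once both are sorted. A sketch, assuming both files have been copied onto the same machine; the `dict_diff` helper name is hypothetical.

```shell
#!/bin/bash
# dict_diff A B: print how many words are only in A, only in B, and in both.
dict_diff() {
	local a b
	a=$(mktemp)
	b=$(mktemp)
	# comm needs sorted input; LC_ALL=C keeps the ordering consistent
	# between sort and comm.
	LC_ALL=C sort -u "$1" > "$a"
	LC_ALL=C sort -u "$2" > "$b"
	echo "only in first:  $(LC_ALL=C comm -23 "$a" "$b" | wc -l)"
	echo "only in second: $(LC_ALL=C comm -13 "$a" "$b" | wc -l)"
	echo "in both:        $(LC_ALL=C comm -12 "$a" "$b" | wc -l)"
	rm -f "$a" "$b"
}
```

Usage would look like `dict_diff /usr/share/dict/british-english /usr/share/dict/british`.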

All that needs to be done now is to iterate through all the words in this list, and count the number of occurrences of each.

Grep is too slow

Here I hit a snag. Remember I mentioned that it *only* took 2-3 seconds to grep the kernel? Well, a quick check shows that we have a significant problem.

First, time to find out how long it actually takes. It turns out that I am terrible at estimating time.

$ time grep -r test linux-4.10 | wc -l
49588

real	0m0.865s
user	0m0.644s
sys	0m0.220s
Then do some maths.
$ cat /usr/share/dict/british-english | wc -l
99156
$ echo '99156 * 0.865 / 60 / 60' | bc
23
While 23 hours is feasible, I'd rather not have to wait that long. Time to find a faster way to search.
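
Note that `bc` truncates to whole numbers unless `scale` is set, so the 23 above is rounded down. A sketch of the same estimate to two decimal places, using `awk` rather than `bc`; the `estimate_hours` function name is hypothetical.

```shell
#!/bin/bash
# estimate_hours WORDS SECONDS_PER_SEARCH: total runtime in hours if every
# word needs one full search of the tree.
estimate_hours() {
	awk -v words="$1" -v secs="$2" \
		'BEGIN { printf "%.2f\n", words * secs / 3600 }'
}

# 99156 dictionary words at 0.865 seconds per search.
estimate_hours 99156 0.865
```

With the numbers measured above this prints 23.82, so the real wait is closer to a full day than the truncated figure suggests.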

First things first, I switched to my laptop, which has a faster CPU and disk than my desktop. Unfortunately this also meant switching to Antergos, but as seen above the only difference is a minor change to the dictionary file location.

Sadly, it barely helped:

$ time grep -r test linux-4.10 | wc -l
49605

real	0m0.811s
user	0m0.640s
sys	0m0.183s

$ echo '99156 * 0.811 / 60 / 60' | bc
22
Still nowhere near fast enough. Time to replace grep.

A better alternative

So first I thought I'd try those more complicated alternatives using find that I mentioned at the beginning.

$ time find linux-4.10 -type f -exec grep 'test' {} + | wc -l
49605

real	0m0.838s
user	0m0.680s
sys	0m0.143s

$ echo '99156 * 0.838 / 60 / 60' | bc
23
Even worse.

Maybe parallelism can help?

Since everything up to this point has been running on only one core, a safe bet is that using more cores would help here.

A quick google found this post, whose top answer uses the parallel program.

After fiddling with some flags to remove the citation warning that was throwing off my line counts, I got to a workable version.

$ time find linux-4.10 -type f | parallel --will-cite -k -j150% -n 1000 -m grep -H -n STRING {} | wc -l
3664

real	0m1.465s
user	0m1.947s
sys	0m0.320s

$ echo '99156 * 1.465 / 60 / 60' | bc
40

Just awful. Also, from a glance at my CPU usage it looked like this was still only using one core. My guess is that this was caused by some combination of misconfiguration and the program being aggressively terrible, but whatever the reason I decided to just give up on it.
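
For comparison, the same fan-out can be sketched with `xargs -P`, which GNU and BSD `xargs` both provide. This is an untimed alternative, not the command measured above. `nproc` comes from GNU coreutils, the batch size of 1000 simply mirrors the `-n 1000` passed to parallel, and `parallel_grep` is a hypothetical name.

```shell
#!/bin/bash
# parallel_grep PATTERN DIR: search DIR with one grep process per batch of
# up to 1000 files, running as many batches at once as there are CPU cores.
parallel_grep() {
	# -print0 / -0 keep file names with spaces or newlines intact.
	find "$2" -type f -print0 |
		xargs -0 -n 1000 -P "$(nproc)" grep -H -n -- "$1"
}
```

It would be called as `parallel_grep test linux-4.10 | wc -l`. Output from concurrent grep processes arrives in no particular order, which is fine for counting lines but not for reading the results in sequence.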

Further down I found a recommendation for ripgrep. After a quick glance at the GitHub page it looked like I'd found what I was looking for, and the tests came back with the evidence.

$ time rg -j 8 -B 0 -A 0 test linux-4.10 | wc -l
49588

real	0m0.182s
user	0m0.917s
sys	0m0.350s

$ echo '99156 * 0.182 / 60 / 60' | bc
5

That page is pretty long, and contains myriad ways of (possibly) reducing the runtime of grep. I suspect that some of them might improve on ripgrep, but I decided that enough was enough and left it there.

The last bits

Now all I needed to do was to sort out some formatting for the output and other niceties. I ended up writing a quick bash script to iterate through all the words and sanitise the output.

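#!/bin/bash
# Takes the kernel source directory as its first argument.
# Prints every dictionary word that matches exactly one line of the source,
# along with one line of context either side.
while read -r line
do
	# ripgrep recurses into directories by default; its -r flag means
	# --replace, so it is not passed here.
	num=$(rg -j 8 -- " $line " "$1" | wc -l)
	if [ "$num" -eq 1 ]
	then
		echo "$line : $num"
		grep -r -n -B 1 -A 1 -- " $line " "$1"
		echo ""
	fi
done < /usr/share/dict/british

There are a couple of things to note here. Firstly, I require spaces before and after each word to try to filter out variable names.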

Secondly I'm not using ripgrep for the second grep. This is because I'm an idiot. I forgot to change it and only noticed when it was halfway through, but it didn't matter too much because that line isn't run that often.
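
Running one full search per dictionary word is what makes the script slow. A different approach, which is not how the list in this post was produced, is to read the tree once, count every word, and then keep the dictionary words that were seen exactly once. A rough sketch, assuming GNU grep; `words_used_once` is a hypothetical name. It splits on anything that is not a letter instead of requiring surrounding spaces, and it is case-sensitive, so its output would differ from the list described here.

```shell
#!/bin/bash
# words_used_once DICT DIR: print words from DICT that occur exactly once
# anywhere under DIR.
words_used_once() {
	# 1. grep -o prints every run of letters on its own line; -h drops
	#    file names and -a treats binary files as text.
	# 2. sort | uniq -c turns that into "count word" pairs.
	# 3. awk keeps the words whose count is 1.
	# 4. grep -Fxf keeps only words that are whole lines in DICT.
	grep -rhoaE '[A-Za-z]+' "$2" |
		LC_ALL=C sort |
		LC_ALL=C uniq -c |
		awk '$1 == 1 { print $2 }' |
		grep -Fxf "$1"
}
```

It would be called as `words_used_once /usr/share/dict/british linux-4.10`. Because words are split on non-letters, possessives such as "Zyuganov's" in the dictionary can never match.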

The final product

After several hours of my laptop's fan whining at maximum volume I had the list that I originally wanted.

It's available on gist because it's pretty huge. It turns out there are ~4000 words that are only used once.

It looks like most of these are in humorous comments, and the list is quite good fun to read. Unfortunately some are missing context due to the small number of lines I took.

Appendix A: A note on inconsistency

So you may have noticed that the numbers for how many of each word we've been finding have been varying wildly over the course of this post.

This is due to a combination of switching OSes and word lists halfway through, and also because I think some of the grep alternatives were matching filenames as well. As far as I can tell this hasn't affected the final result, just some of the timing data.