Some notes on recent random numbers
By now people may have seen complaints of boot slowdown on newer kernels. I want to explain a little more about what’s going on and why Fedora seems to be particularly hard hit.
The Linux kernel has a random number generator in drivers/char/random.c. This
provides several interfaces for random numbers to the system. There are two
main interfaces for random numbers: /dev/random and /dev/urandom.
/dev/random is designed to be “secure”, meaning it is sufficiently random
that it can be used for things like cryptographic keys. /dev/urandom is
“random” in the sense that most humans won’t detect a pattern but sufficient
mathematical analysis might find a weakness.
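
As a rough illustration of the two device interfaces, here is a minimal Python sketch that reads a few bytes from either device. The helper name read_random_device is made up for this example, and opening the device with O_NONBLOCK is a choice made for the sketch so that it returns None rather than hanging if the device has nothing to give.

```python
import os

def read_random_device(path, nbytes):
    """Read up to nbytes from a kernel RNG device without blocking.

    Returns the bytes read (possibly fewer than nbytes), or None if
    the read would have blocked.
    """
    # O_NONBLOCK makes the kernel fail with EAGAIN instead of sleeping.
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    try:
        return os.read(fd, nbytes)
    except BlockingIOError:
        return None
    finally:
        os.close(fd)

if __name__ == "__main__":
    for device in ("/dev/urandom", "/dev/random"):
        data = read_random_device(device, 16)
        print(device, data.hex() if data is not None else "would block")
```
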
Random number generators rely on entropy to work properly. You can’t just
make up entropy; the system has to get it from somewhere. At boot the kernel
assumes it has no entropy and relies on various parts of the system
(interrupts, timer ticks, etc.) to give it entropy.
/dev/random is supposed to block if there is not sufficient entropy in the
system (entropy is a finite resource and it can be drained). Google Project
Zero recently discovered several flaws in the Linux RNG, among them that the
RNG was marked as being available for cryptographically secure generation
earlier than it should have been. They provided patches to fix this which
were applied by the RNG maintainer.
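
The kernel's own entropy estimate can be inspected from userspace through procfs. The following is a minimal sketch, assuming the standard /proc/sys/kernel/random/entropy_avail file; the helper name read_entropy_avail is hypothetical, and how the number behaves differs between kernel versions.

```python
def read_entropy_avail(path="/proc/sys/kernel/random/entropy_avail"):
    """Return the kernel's current entropy estimate, in bits."""
    with open(path) as f:
        return int(f.read().strip())

if __name__ == "__main__":
    print("entropy estimate (bits):", read_entropy_avail())
```

Running this early in boot and again later is one way to watch the pool fill up.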
And then people started seeing issues, mostly a lot of messages about
crng_init. It turns out there were a lot of places in the kernel that were
trying to get random numbers early in the kernel boot process, and those
numbers weren’t as random as the callers might expect. Fedora had a
particularly nasty problem where the compose machines were getting stuck.
Trying to get more logs from the systemd journal didn’t help. Eventually,
after some debugging with the infrastructure team (and the help of sendkey
alt-sysrq-t in the qemu monitor window), we were able to see that init was
blocked on the getrandom system call for secure entropy. Interestingly
enough, systemd only made non-blocking (insecure) random calls in its code.
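
The difference between the blocking and non-blocking calls can be sketched from userspace with Python's os.getrandom wrapper (Linux only, Python 3.6 or later). With flags left at 0 the call waits until the kernel considers the RNG initialized, which is the state init was stuck in; with GRND_NONBLOCK it fails straight away instead. The helper name try_getrandom is hypothetical.

```python
import os

def try_getrandom(nbytes):
    """Ask getrandom(2) for nbytes without blocking.

    Returns the random bytes, or None if the kernel RNG is not yet
    initialized and the call would otherwise have blocked.
    """
    try:
        return os.getrandom(nbytes, os.GRND_NONBLOCK)
    except BlockingIOError:
        return None

if __name__ == "__main__":
    data = try_getrandom(16)
    if data is None:
        print("RNG not initialized yet; a blocking call would hang here")
    else:
        print("got", len(data), "bytes")
```

A caller that cannot tolerate the None case has to wait, and that wait is exactly what stalls boot when entropy is scarce.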
I was lucky I could re-build kernels to reproduce the issue, so I decided to
experiment a bit and return something unexpected from getrandom (-ENOMEM).
This gave me an error message that (luckily) uniquely mapped to gcrypt.
systemd links against gcrypt for some features, such as calculating an HMAC
for the journal entries. None of that involved random numbers at bootup though
so it didn’t explain why things were getting stuck. After some more back and
forth, Patrick Uiterwijk found a patch that gcrypt was carrying. With that
patch, if FIPS mode is enabled, the cryptographic system is initialized at
constructor time (i.e. when it gets loaded by systemd). It turns out the
default images ship with dracut-fips, which will put gcrypt into FIPS mode. So
the very first time systemd went to open the journal to write something, it
would load gcrypt, which would attempt to initialize the random number system.
(Fun fact: it also looks like the default in systemd is to do a write to a
journal before the command-line options are parsed. So even adding an option
to not write to the journal didn’t help this case. I might be wrong here
though?)
Despite the fact that these patches have some side-effects, they do fix
a real issue and can’t exactly just be permanently reverted. So what to do? One
easy answer is to give the system more entropy. On virtualized systems, this
can be provided by the CONFIG_HW_RANDOM_VIRTIO option (the virtio-rng driver,
which lets a guest draw randomness from its host). Part of the fix also
involves making sure userspace isn’t actually trying to rely on secure random
number generation too early, since randomness is hard to come by early in
boot. At least for now in Fedora, we’ve temporarily reverted the random series
on stable (F27/F28) releases. The plan is to continue working with upstream
and userspace developers to find a workable solution and bring back the
patches when things are fixed.
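
To check from inside a guest whether a hardware RNG source is actually in use, the hw_random core exposes the selected device through sysfs. The following is a minimal sketch that assumes the usual /sys/class/misc/hw_random location; the helper name current_hwrng is hypothetical, and the exact device name reported for virtio-rng may vary between systems.

```python
import os

def current_hwrng(sysfs_dir="/sys/class/misc/hw_random"):
    """Return the name of the active hardware RNG, or None if there is none."""
    path = os.path.join(sysfs_dir, "rng_current")
    try:
        with open(path) as f:
            name = f.read().strip()
    except OSError:
        return None  # hw_random support missing or file not readable
    # Assumption: the kernel reports "none" when no device is selected.
    if name in ("", "none"):
        return None
    return name

if __name__ == "__main__":
    print("active hardware RNG:", current_hwrng())
```

If this prints None on a virtual machine, the guest is probably relying on interrupts and timers alone for its early entropy.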