"Illegal Instruction" the desktop hardware upgrade that took almost a week
On my birthday list I added "new motherboard, CPU + RAM" for my computer, which was getting into the "struggling with running 3 browsers and 300 tabs" range (yes, I know...). I picked them out after discussions with various friends who pay more attention to this sort of thing than I do: a Mini-ITX motherboard (ASRock B550M AM4), an AMD APU (Ryzen 5 5600G) and 32GB RAM (Kingston, 3600).
The old hardware was similar, just almost 10 years old: an AMD A10 APU, an ASRock ITX FM2+ board and 16GB RAM. Seems like it ought to Just Work (tm)?
"Computer says 'No'" was what I got, or rather combinations of "Illegal instruction" and "traps: mount[11719] trap invalid opcode ip:7f5e6de26036 sp:7ffdca259d88 error in libmount.so.1.1.0". Did I forget to mention I run Gentoo Linux on my system?
Searching the internet for either of these provided only semi-helpful results, which was frustrating as I'm usually good at getting Google to answer the question I meant to ask. Coupled with our insistence that this couldn't possibly be an issue with the change of CPU (it's an upgrade from one AMD APU to another, so surely that couldn't be it), it took rather a while to realise that yes, in fact, it was.
It helps to realise that a Gentoo install, installed using the default instructions, not only builds all its software from source, but also builds it using the GCC flag -march=native. For the uninitiated, this means "build it for the exact instruction set of the CPU you are running on". Light dawned yet? Yes, somehow in their infinite wisdom, the makers of my new CPU decided to retire an instruction (don't ask me which, I still haven't checked, but it must be an important one), making all the existing compiled software useless. Yes, my entire operating system!
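For anyone who does want to check which instructions went missing, here is a minimal sketch of one way to do it: save the "flags" line from /proc/cpuinfo on the old CPU, then compare it with the new one. The function name missing_flags and the file old-flags.txt are made up for illustration, and the flag names in the example call are placeholders, not a statement of what my two CPUs actually support.

```shell
# Sketch: list CPU feature flags that the old CPU had and the new one lacks.
# missing_flags is a hypothetical helper, not part of any Gentoo tool.
missing_flags() {
    # $1: space-separated flags of the old CPU
    # $2: space-separated flags of the new CPU
    for flag in $1; do
        case " $2 " in
            *" $flag "*) ;;                 # still supported on the new CPU
            *) printf '%s\n' "$flag" ;;     # present before, gone now
        esac
    done
}

# Possible usage (old-flags.txt is an assumed file saved on the old hardware):
#   missing_flags "$(cat old-flags.txt)" \
#       "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
#
# To see what -march=native resolves to on the current machine:
#   gcc -march=native -Q --help=target | grep -- '-march='
```

Anything this prints is an instruction set extension that binaries built with -march=native on the old machine may have used and that the new CPU will trap on.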
Having understood the problem, now what? Go backwards, or forwards? We had already at this point fetched a copy of the Gentoo stage 3 tarball and verified that things improved ever so slightly if we replaced my on-disc "ld-linux-x86-64.so" file with the one from the tarball. Next followed another frustrating search: had anyone else come across this problem and written down a solution? Again the answer seemed to be "no".
The closest I could find, via some forum posts, was the "Fix my Gentoo" wiki page, which starts off by claiming that no Gentoo system could easily get broken enough to require its application. Little do they know. The essence of the suggestion is: fetch the stage 3 tarball (a minimal install built for generic x86-64 rather than with -march=native), chroot into it, mount your own Gentoo repository and config directories into it, and start building "bin" (pre-built) packages for your current hardware.
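Roughly, and as a sketch only, those steps look like the following. The function just prints the commands rather than running them, so nothing gets mounted by accident. Every path here is an assumption (older installs keep the repository in /usr/portage rather than /var/db/repos/gentoo), rescue_plan is a made-up name, and this is my reading of the idea, not a copy of the wiki page.

```shell
# Dry-run sketch: print the rescue steps instead of executing them.
# rescue_plan is a hypothetical helper; all paths are assumptions.
rescue_plan() {
    broken=$1            # where the broken Gentoo root is mounted
    stage3=$2            # where the stage 3 tarball was unpacked
    target=${3:-@system} # what to rebuild first (assumed default)
    cat <<EOF
mkdir -p $stage3/var/db/repos/gentoo
mount --bind $broken/var/db/repos/gentoo $stage3/var/db/repos/gentoo
mount --bind $broken/etc/portage $stage3/etc/portage
mount --types proc /proc $stage3/proc
mount --rbind /sys $stage3/sys
mount --rbind /dev $stage3/dev
chroot $stage3 /bin/bash
# then, inside the chroot:
emerge --buildpkg --oneshot --emptytree $target
EOF
}

# Example: rescue_plan /mnt/gentoo /mnt/stage3
```

With --buildpkg, emerge writes a binary package for everything it builds into its package directory, and those are the tarballs that then get unpacked over the broken system.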
The alternative was to put the old hardware back, rebuild everything (or just a base install?) for a generic x86-64 target, switch hardware again, then rebuild it all another time for native...
Did I mention this rescuing was all going on using a Debian bootable USB stick? We happen to have several lying around, as you do.
I didn't want to go backwards, just forwards, so cue 2 days (!) of recompiling and carefully unpacking the resulting tarballs over the top of the broken files on disc (carefully, because the pre-built packages also contain the default /etc files, which I didn't want to overwrite). This went along with various attempts to chroot into the system being fixed, to see if it was working yet, plus occasionally rebooting it and attempting to log in for the same reason.
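The "careful unpacking" can be sketched like this. It assumes GNU tar and that the package's members are stored as ./etc/..., ./usr/... and so on; it is worth checking with tar -tf first. The name unpack_without_etc is made up for illustration.

```shell
# Sketch: unpack one package tarball over a root directory, skipping /etc
# so that existing config files are left alone.
# Assumes GNU tar and member names of the form ./etc/... (check with tar -tf).
unpack_without_etc() {
    pkg=$1      # path to the package tarball
    root=$2     # root of the system being repaired, e.g. /mnt/gentoo
    tar -xpf "$pkg" -C "$root" --exclude='./etc' --exclude='./etc/*'
}

# Example: unpack_without_etc /path/to/some-package.tbz2 /mnt/gentoo
```

Note that unpacking this way bypasses the package manager entirely, so it only makes sense as a stopgap until the system can run emerge on its own again.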
By the end of the first day, I was annoyed enough to order yet more hardware. Since the new motherboard supports NVMe SSD drives (aka teeny, even faster flash-based storage), and those were no longer super expensive, I ordered one: 1TB, twice as big as the SSD I was running. I paid extra for next-day delivery (but not pre-midday, which would have been an extra fiver). Typically, the delivery was attempted in the one hour or so we left the house that day, for fresh air and groceries! It didn't turn up until late afternoon on day 2.
When the new drive appeared on our doorstep, I was (alllmosst) done fixing the existing one. I could at that point compile packages in the chroot of the real system, instead of the stage3 tarball (you need to build the build system before you can build other stuff, turtles all the way down). At the end of the evening that day, I could log into the rebuilt system (by way of editing grub cos all those drives were in different orders), and.. not-quite start X. Unsurprisingly the X issue turned out to be a library still needing to be rebuilt, though that was tricky to figure out (X was fine, the openbox window manager wasn't).
At this point I could get back to some paid work. Alongside that I ran a script that James came up with: compare all the binary/ELF files on disc to see if they were older than a file I fixed yesterday, pipe the list into Gentoo's equery to see which packages were involved, rebuild those, rinse, repeat.
# /usr/bin/ncdu is the reference file: anything not changed since it was rebuilt is still suspect
LC_ALL=C find /{usr,}/{s,}{lib,bin}{64,} -maxdepth 1 -type f ! -cnewer /usr/bin/ncdu |
  xargs ls --sort=time -r | xargs file | tee /dev/stderr |
  grep ELF | cut -f1 -d: | head -n200 | tee /dev/stderr |
  xargs equery --quiet belongs --name-only | sort | uniq |
  ack -v 'tasksh|mpack|libfm|pcmanfm|mono|/python$|XML-Twig|wings|dev-cpp/.*mm' | tee /dev/stderr |
  xargs sudo emerge -j -v --oneshot --keep-going --fail-clean=y
Note the "ack -v" (fancy grep), which filters out items that simply refused to build, or no longer exist in the current Gentoo repositories.
After this I *still* needed to manually cope with some packages which I have compiled for multiple slots (eg I have python2.7, python3.8, python3.9, python3.10 all installed), as well as sort out some issues, eg python3.7 has been dropped completely, nothing depends on it, so uninstall that.
In comparison to that saga, putting in the new NVMe drive, rsyncing the (now working!) system onto it, and getting it to boot was.. quite uneventful.
Remind me not to do that again.