bank18_my_old_nemesis.png
:hacker_f::hacker_s::hacker_e: One damn thing or another edition!
VM keeps crashing. This is way, way better than the box crashing, because I can just restart the VM running on the box.
The box has plenty of RAM; we could lose a stick without noticing. The problem is figuring out which bank is bank 18.
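The guessing game, spelled out: the kernel gives a bank number, the board labels slots A1..A12, then B, C, and D, and the unknowns are whether the kernel counts from 0 or 1 and whether B1 comes after A12 or right after A1. A minimal Python sketch that enumerates every label bank 18 could be under those guesses. The 12-slots-per-letter layout comes from the board labels; the interleaved order (A1, B1, C1, D1, A2, ...) is just my assumed alternative, not anything from a manual.

```python
from itertools import product

LETTERS = "ABCD"  # slot letters printed on the board
PER_LETTER = 12   # A1..A12, B1..B12, ...


def label(bank: int, zero_indexed: bool, letter_major: bool) -> str:
    """Slot label for a kernel bank number under one numbering guess."""
    i = bank if zero_indexed else bank - 1
    if letter_major:
        # Guess 1: B1 follows A12 (A1..A12, B1..B12, ...)
        letter, number = divmod(i, PER_LETTER)
    else:
        # Guess 2: B1 follows A1 (A1, B1, C1, D1, A2, ...)
        number, letter = divmod(i, len(LETTERS))
    return f"{LETTERS[letter]}{number + 1}"


def candidates(bank: int) -> set[str]:
    """Every label the bank could have across all four combinations of guesses."""
    return {label(bank, z, m) for z, m in product((True, False), repeat=2)}
```

`candidates(18)` comes out as B5, B6, B7, and C5: four sticks to pop at most, and the 0-indexed, B1-after-A12 guess is the one that lands on B7.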
DISREGARD THAT, I lshw(1) COCKS: It's "B7".
[Elided: a paragraph about me guessing at numbering schemes. Linux calls that bank "Bank 18", but neither the manual nor the labels do: the labels run A1 through A12, and so on with B, C, and D. So how does that map? Does B1 follow A1 or A12? Then I decided to check whether the banks were 0-indexed or 1-indexed, which would leave me a maximum of 4 sticks to pop if I went nuclear on the RAM. So I fired up lshw, and it turns out that they are 0-indexed, and also that the stupid $letter$number scheme can be queried somehow and lshw is clever enough to do that. Now that I check dmidecode, it's in there, too. I think the slots work in pairs, so popping bank 18 probably makes it ignore another bank.]
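Presumably lshw gets the labels from the same SMBIOS tables dmidecode reads. Running `sudo lshw -class memory` typically lists each bank as a `*-bank:N` node with a `slot:` line underneath. A rough Python sketch that maps bank numbers to slot labels from that text output; the exact layout is assumed from typical lshw output and may differ between versions, so treat the parser as a guess:

```python
import re
import subprocess

BANK = re.compile(r"\*-bank:(\d+)")


def bank_slots(lshw_text: str) -> dict[int, str]:
    """Map lshw bank numbers to the label on their 'slot:' line."""
    slots: dict[int, str] = {}
    current = None  # bank number of the node we're inside, if any
    for line in lshw_text.splitlines():
        s = line.strip()
        m = BANK.match(s)
        if m:
            current = int(m.group(1))
        elif s.startswith("*-"):
            current = None  # some other node (cache, firmware, ...)
        elif current is not None and s.startswith("slot:"):
            slots[current] = s.split(":", 1)[1].strip()
    return slots


def read_lshw() -> str:
    # Run as root, otherwise lshw can't see the DMI data.
    result = subprocess.run(
        ["lshw", "-class", "memory"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Usage: `bank_slots(read_lshw()).get(18)` gives whatever label the firmware reports for bank 18.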
It's also entirely possible that RAM has nothing to do with this. memtest86+ ran for almost an entire day without finding anything.
From the host machine's perspective, qemu is one process: when that one process gets the bus error, the entire VM tanks, and that's all of FSE. If BEAM or nginx or whatever were running directly on the host machine, the bus error would kill one process, and as long as that process wasn't responsible for supervising other processes, it could be bounced. So maybe it'd be better to proceed with the previous plan of moving that shit off the VM. More resilient overall.
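The "bounce one process" half of that argument, as a toy: a bus error is delivered to the process that took it, and whatever started that process can see it died by a signal and start it again. A minimal Python restart loop illustrating the idea. This is not how BEAM supervisors or nginx's master process actually work, just the shape of it:

```python
import signal
import subprocess


def supervise(cmd: list[str], max_restarts: int = 3) -> list[int]:
    """Run cmd, restarting it after abnormal exits; return every exit status seen."""
    statuses: list[int] = []
    for _ in range(max_restarts + 1):
        rc = subprocess.run(cmd).returncode
        statuses.append(rc)
        if rc == 0:
            break  # clean exit, nothing to bounce
        if rc < 0:
            # A negative return code means the child was killed by that signal,
            # e.g. -7 for SIGBUS on x86 Linux.
            print(f"child killed by {signal.Signals(-rc).name}")
        else:
            print(f"child exited with status {rc}")
    return statuses
```

Only the crashed child gets restarted; the loop itself (and anything else it runs) keeps going. Inside the VM, the equivalent failure takes out qemu and everything with it.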
According to /usr/src/linux-`uname -r`/Documentation/admin-guide/mm/memory-hotplug.rst, it looks like the memory hotplug system can be abused to take that memory offline, so rather than driving down there to pop the stick, I can just pop it from here. It'll require a reboot (physically removing the stick would require one anyway), so maybe this weekend.
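The sysfs side of that document works in memory blocks: `/sys/devices/system/memory/block_size_bytes` holds the block size as a hex string, and each block has a `memoryN/state` file you can write `offline` to. A sketch of the arithmetic for finding which blocks cover a physical address range. Big assumption: that you know the stick's physical range at all. dmidecode's type 20 "Memory Device Mapped Address" entries are one possible source, but with channel interleaving a single stick may not own one contiguous range, in which case this doesn't apply. The kernel can also refuse to offline a block holding unmovable pages.

```python
from pathlib import Path

SYSFS = Path("/sys/devices/system/memory")


def block_size(sysfs: Path = SYSFS) -> int:
    # block_size_bytes is hex without a 0x prefix, e.g. "8000000" = 128 MiB
    return int((sysfs / "block_size_bytes").read_text().strip(), 16)


def blocks_for_range(start: int, end: int, bsize: int) -> list[int]:
    """Indices of the memory blocks overlapping physical range [start, end)."""
    return list(range(start // bsize, (end - 1) // bsize + 1))


def state_files(start: int, end: int, bsize: int, sysfs: Path = SYSFS) -> list[Path]:
    # Blocks in holes may not exist in sysfs; check before writing.
    return [sysfs / f"memory{i}" / "state" for i in blocks_for_range(start, end, bsize)]


def offline(paths: list[Path]) -> None:
    """Ask the kernel to offline each block. Needs root, and may fail per block."""
    for p in paths:
        p.write_text("offline")
```

Check the list `state_files` returns before calling `offline` on it.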
That might fix The Last Problem.