So I ran memtest86+ on my desktop.
It didn't error!
or succeed. It just... hung, 25 minutes in.
that's probably not a good sign
I'm gonna have to start pulling parts out and get it down to a minimal hardware setup and see if it still crashes.
like, I probably could disconnect the floppy drives. those shouldn't be needed just to run memtest
even if they are clearly VITAL for day-to-day work
yanked the odd hardware like GPU and hard drives, as well as the common stuff like DVI video capture device, both floppy drives, optical drive, and serial port
I had to go find another keyboard though. my spare keyboard on hand is an AT keyboard, and while of course I have a (hand made!) AT-to-PS/2 adapter, this computer for some reason only has USB ports.
I can't understand these companies just expecting us to have obscure hardware on hand to diagnose their failures.
I normally have it plugged into a KVM but I'm bypassing that on the extreme off-chance that there's some weird interaction between the KVM and the PC
@mmu_man I've had that before, but it was mainly because I was needing to run it on 386s and the like. This is a modern machine so there should be no problems.
@foone might want to try another (an older?) version, I recall seeing various hangs on various machines…
@jonoabroad look 640kb ought to be enough for anyone
@foone because you have SOOOOOOOOO much memory it didn't finish?
WE'VE GOT ERRORS! hitting a bunch of RAM problems around 31-39gb.
nastily that's divided between two banks. So... it's probably not just one pair of DIMMs going bad.
(and of course it's also possible it's some other problem that's causing RAM issues)
yanked half the RAM to see if it still errors
with half ram, I did two successful passes of memtest86+, no errors. time to check THE OTHER HALF (to confirm if it was just a needs-reseating/too much ram for the mobo problem)
and it passes too.
So, either my motherboard has stopped being able to support 64gb, or it was just a reseating problem. Let's stick it back in and see.
that's... probably not a good sign
I really don't like that it corrupted outside of the text window.
That shouldn't be happening if this is just a RAM problem. That's like... a GPU corruption problem.
Which is very bad because I'm currently using the CPU's on-die GPU
@manawyrm No, I think it was interleaved. I thought I had corrected this when I upgraded it. Thanks for catching that!
@foone it's a bit hard to tell from the DMI info, but is the RAM installed the right way round (each vendor together per memory channel)?
I would've expected Slots 0,1 to be A, Slots 2,3 to be B, not interleaved like this.
@foone ... actually: with this behaviour: Clean the contacts on the RAM with IPA and the ones on the CPU as well.
@foone that's a fail :P
@xabean good point
@foone but remember your RAM is your VRAM right
@manawyrm pointed out that my ram sticks weren't properly interleaved. So I'm trying memtest again with that fixed
NOPE! it ran for 3 hours, threw a ton of memory errors, then crashed in the same matrixy way.
So, back to just one pair of sticks, and I'm gonna let memtest run for a while and see if that seems stable.
@fuchsiii @foone did #memtest86 crash or did it just freeze?
OK so ran the half-ram test for nearly 8 hours with zero issues.
Now I've reattached the GPU, usb ports, serial port, floppy drives, and rust drives and optical drives (but not the m.2 ssds yet because they're boot drives) and I'm running it again, to confirm it's not a PSU problem
@lewiscowles1986 I've been getting crashes in memtest and the BIOS, so definitely not a bad driver for me!
@foone
I had errors like this on an AMD Radeon card after a year, and it turned out to be shitty software. It cost me a heap (new RAM, power) to find out that a stupid driver update was the cause, and that by editing some files I could have had the kernel work around it.
If the RAM tests say it's fine, it might be crap software.
No issues after running for 4 hours!
I'll leave it on overnight but this seems to be functional
The only thing that's not back in is the two m.2 drives, which I doubt will make much difference.
We'll see!
It turns out the GPU wasn't in use so I can't be sure it's not still a power problem.
So I fixed that, put in the m.2, and started again
Memtest didn't complete, but for a very silly reason: I have my computer up on the desk and plugged into a different outlet for this testing, which means there's a power cable crossing the doorway.
When I left the room, moving the cable so I could go under it slid it out of the power supply and it shut off
memtest completed and now I'm going full on stress testing: playing fullscreen video, 3D games, processing a bunch of shit in the background, some VMs. you know, the usual stuff I do on an average day
I probably shouldn't try to run the VM that has 16gb of RAM allocated to it, though
Well that lasted all of 30 minutes before the system hung.
Fuck. It's not just the RAM.
So, it's probably one of motherboard, cpu, or PSU. At a stretch, it could be the GPU.
I have another spare GPU I could swap in. I have a near-identical CPU that I could swap in (it's in use, but I can temporarily borrow it).
PSU and mobo are trickier.
So, I'll have to try the easy ones first. Swap the GPU and see if windows still hard crashes like that, then the cpu, then start working on the others.
If you'd like to help me get back online (and gay cats, of course), donations would help. I'm kinda broke and not having a working computer is not going to help.
@Pibble it's a core i7-8700k
@foone intel or amd? what gen? this looks legit like a cache instability issue (not cache per se, but ringbus/infinity fabric instability)
@foone PSU tester? They're fairly cheap.
@baishen I've got one somewhere but this seems too subtle a problem for it to detect. It's not failing to boot, it's just falling over after running for 20-30 minutes
@foone I was hoping it would show if one of the rails was marginal.
Does it start right back up or need a cool off?
@baishen starts up just fine
@Pibble that could definitely be it. this motherboard is about a year old
@foone default uncore/ring ratio I assume. What do vccsa and vccio read (voltages, system agent and IO)? also how much memory, dual rank sticks, or single rank? 2 dimm or 4 dimm, what mobo?
@foone the only reason I ask is because 8th gen Intel memory controller likes to fall over when dealing with 4 sticks of dual rank memory aside from micron e die, and z1/4xx (6th-9th) boards were notorious for shoving a ton of sa and io voltage when all dimms were populated or xmp was enabled. And by a ton, I mean voltage that will degrade or kill the imc fairly quickly, like 6 months to a year.
okay today's first test: I yanked out my GPU and I'm running on just the internal GPU. I'm gonna load up some videos, VMs, 3D games, and a bunch of browser tabs. See if this falls over too
I am not getting a large number of frames, and I only have one game running at the moment.
somehow I got my youtube video playing over my actual speakers but one of the games playing out the HDMI and the little speakers on the monitor. that's weird.
I AM STRESS TESTING THIS MACHINE AT NOT A LOT OF FRAMES A SECOND
@ChlorideCull I think my minecraft skin actually predates my floppy disk hyperfixation, if that's even believable
@foone Don't know why I expected your Minecraft skin to be a floppy.
okay I've made it an hour running sans-GPU. That doesn't mean it's the GPU though. This machine is using way less power without the GPU... so it could still be a PSU related problem.
so after running fine for about 3 hours with no GPU, I've gone out and bought a new... power supply.
yeah I don't think it's the GPU. And a flakey PSU could easily fail with the GPU and not without, since the power usage is way lower without a GPU in there
okay new PSU is in. That took way longer than it should.
Apparently between the RM650x and the RM850x, Corsair redesigned their modular cables, so I couldn't just swap the PSU and reuse the cables. So now I have a cable management nightmare, but it's running. Let's put the stress-testing pants on
@foone if it's not DNS it's PSU
that could have been a disaster
the worst part is that I forgot to double-check that the new PSU would come with the right cables to let me hook up my floppy drives. Thankfully, it did.
@foone for a moment I thought about how I need to make sure I still have adapters in my collection to power 3.5" drives with the 0.1" center 4 pin JST header off of a full size power connector. aaahhhh!
well my 3.5" floppy drive is working. That's a good test
changing my PSU seems to have confused Satisfactory into thinking I'm a different person and now I'm sitting on the floor of my own base. WHO ARE YOU?
CRASHHANG.
sticking in a different GPU.
Swapping out my Asus GeForce RTX 3070 for an EVGA GeForce GT 1030
ran an hour and 30 minutes on the other GPU with no crashes.
god damn it, IS it my GPU?
@scottmichaud that was using the onboard GPU, not the suspect one!
@foone The visual artifacts in Memtest strongly suggest that something's wrong with the GPU.
Could be multiple failures, too.
taking a work meeting from the system under test
this is known as "living dangerously"
@cw I don't have the temps on the crashy setup, but I've not seen any temps outside of reasonable ranges. No cooling problems. RAM was matched pair, nothing should be overclocked or undervolted. I've not tried a new PCIe slot yet, that's definitely something I should try out. I think it'd cause my GPU to run at lower speed but at least I can confirm it works/doesn't work there
@foone what are your core temps and clocks like in hwinfo? Heatsinks all snug? All internals unobstructed? Checked the board for any spicy caps? When you tested RAM, was it a matched pair? Anything overclocked or undervolted? Tried another PCIe slot for the GPU?
Worst case the motherboard is just dying, I had an X99 Asus board which was notorious for eating CPUs (which mine did, while simultaneously FUBARing itself). That's probably a niche edge case though 😅
okay. on the new PSU, with old GPU back in, but in a different PCIe slot. Let's see how this goes
no crash in an hour with it in a different slot? weird.
19 hours in the different slot, no crashes. Very strange.
So, theories:
1. that slot was just bad/dirty. Possible, I guess? The other GPU worked fine in that slot, though.
2. The GPU might be running at 8x PCIe instead of 16x PCIe. Maybe that pushes it over some timing/temperature threshold and makes it not crash?
okay, GPU-z says it is indeed running at x8 3.0 speed, when it's capable of x16 4.0.
So, how much you wanna bet that if I fix that, the system will start crashing again?
@tallawk not that I've seen but I'm going to try and check
@foone Are you by chance collecting a snowstorm of pcie errors in your system logs?
So it turns out I can't get it to do 16x in the other slot. My motherboard has 4 16x slots, but it does them in sequential order: if the first one is full, it gets 16x. if the first and second are full, they get 16x. if the second one is full and the first isn't, the second gets 8x.
yeah.
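A minimal sketch of that slot rule, written purely from the description in the post above and not from the motherboard manual. The function name and the slot numbering are made up for illustration, and only the three cases the post actually mentions are covered; everything else returns `None`.

```python
from typing import Dict, Optional


def described_lane_widths(slot1_full: bool, slot2_full: bool) -> Optional[Dict[int, int]]:
    """Map slot number -> PCIe link width (x8 or x16) for the cases
    described in the post. Hypothetical helper, not a real board API.
    """
    if slot1_full and not slot2_full:
        # "if the first one is full, it gets 16x"
        return {1: 16}
    if slot1_full and slot2_full:
        # "if the first and second are full, they get 16x"
        return {1: 16, 2: 16}
    if slot2_full and not slot1_full:
        # "if the second one is full and the first isn't, the second gets 8x"
        return {2: 8}
    # Nothing installed in either slot: the post doesn't say anything here.
    return None


# The GPU alone in the second slot, which is the setup that ran without crashing:
print(described_lane_widths(slot1_full=False, slot2_full=True))
```

That last case is why GPU-z reported x8 in the second slot: with the first slot empty, x16 there isn't an option.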
So I swapped the card back to slot 1.
Interestingly, GPU-z says it's at 2.0 now instead of 3.0. Not sure why that is.
wait no it doesn't. I can't read suddenly
I'm gonna run my stress test with some performance logging on to see if it's overheating.
I did realize my card has a physical switch for "high fan" vs "quiet fan" and switched it to "high fan".
I can't imagine that'd be why it was crashing but maybe.
also while looking around my BIOS, I realized I could clock my ram faster. It's running at 2133mhz and could go up to 3200mhz, supposedly.
I didn't test that out for obvious reasons.
over an hour with the GPU back in the Crashy Slot and no crashes.
huh. Maybe it was temperature based?
My GPU isn't getting THAT hot, my fans aren't even maxing out.
GPU temp hit a max of 65C with a hot spot of 76C.
Those aren't out of range for a GPU under load, and they're not trending upward at all, it's stable.
or it's still just a memory corruption problem and it's just VERY random and I need to test for longer
it's now been nearly 5 hours with no crashes.
what the fuck?
(temps are about the same)
@foone that certainly sounds like computers
@OtterMatic it's possible, but I've pulled and re-inserted it like 5-6 times through this whole saga, and it was still crashing up until this last time
@foone could be re-seating the card made a difference as well.
@StompyRobot BUT HOW
@foone you fixed it!
So it's now run for 23 hours... no system-crash.
minecraft did crash, but it's modded minecraft. it might have just done that on its own
I also stuck the "bad" set of ram sticks into a spare computer and ran it on memtest.
16 hours, 9 full passes of memtest86+, zero errors.
starting to think this is a motherboard problem
so the remaining Questionable Hardware is:
1. Motherboard
2. CPU
3. GPU
4. Half the RAM
So that's, like, $1200 if I wanted to replace it all. That's definitely not going to happen. So for now I think I'm just gonna have to wait to see if more shit breaks, but ordering more RAM is next on the list, since I'd like to be back up to full RAM and it'd be useful to know if adding RAM back in causes it to crash again
actually, I can use the RAM in one of my other machines if it turns out to not be useful to fix my main PC. So I hit order on that
@Pibble there is a BIOS update, it looks like. I'll have to try installing
@foone Yeah, like I was saying, a lot of the boards up until 10th gen were very not memory focused, so they didn't have great shielding or trace layout. You often even got really bad T-topology layouts or extremely long daisy chains that would fall over at anything above 2133 with 4 dimms.
Have you looked at the BIOS updates for the board? There may be a few that include "improved memory stability", and usually those are extremely helpful for running dimms made well after the board.
New RAM installed. I had a fun moment where the system seemed to be completely dead... Turns out one of the modular cables slipped out while I was installing the RAM.
But no crashes so far!
I also updated the BIOS, to a version that has "dram stability improvements" listed in the changelog.
@foone so you changed more than one thing at once?..
@dougbarry no, but I summarized for mastodon rather than specify each change separately in multiple posts