@mint I take it you had another freeze/crash even with the latest changes in the branch? Well, let's see how long you can go on Oban 2.18. Nothing in that changelog looks relevant but hey, stranger things have happened 🤪
@feld Not yet, but I guess it's better to be prepared than to simply wait. Also updating containers to a fresh Alpine version (with OTP 26 and Elixir 1.16).
Older Oban used db triggers. They have a very slight overhead, so Oban switched to the new method to squeeze out more performance for the most demanding use cases.
If Postgres can't keep up with work and queries start timing out, Ecto/Postgrex (db driver and connection pooler) crashes and restarts. This would cascade up to Oban. And I think in some edge case it can cause Oban to come back online but not properly start the queue processing.
Now you stopped processing jobs. Super weird.
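If anyone is poking at a hung instance, here's a minimal IEx sketch (assuming the default Oban instance name and Pleroma's federator queue names) to check whether the queues actually came back after a Postgrex restart:

```elixir
# returns a map with :paused, :running, :limit, etc. for the producer on this node
Oban.check_queue(queue: :federator_incoming)

# if a queue silently never came back, it can be started again by hand
Oban.start_queue(queue: :federator_outgoing, limit: 5)
```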
I have a lower resource test server I'm running now and following some relays -- feld@friedcheese.us. Feel free to flood me with follow requests from a giant bot network, I'll need more followers to stress this further 🤭
@i@mint It's still a passion project and I really love Elixir, so it's not going away. Lain wants to start cutting out complexity and unused functionality. There's low-hanging fruit for performance improvements. e.g., I have plans to completely refactor and simplify the caching too with a new, better approach (Nebulex). I hope to be running a Pleroma instance across multiple redundant tiny computers at home soon as a proof of concept that we can scale horizontally and scale down just fine (database and media on another server, but that's good enough; serious people can cluster / load balance those too)
It's also possible to run Pleroma with no frontend webserver. I've done this in another project to experiment. Works great! It can get its own certificate with LetsEncrypt and bind on real 80/443.
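To give an idea of the Nebulex direction, here's a minimal sketch (module name, adapter choice, and options are just assumptions, not the actual plan):

```elixir
# hypothetical cache module; the Nebulex 2.x replicated adapter keeps a full copy
# of the cache on every connected node (start it in the supervision tree)
defmodule Pleroma.DistributedCache do
  use Nebulex.Cache,
    otp_app: :pleroma,
    adapter: Nebulex.Adapters.Replicated
end

# usage is the usual key/value API:
#   Pleroma.DistributedCache.put("user:#{id}", user, ttl: :timer.minutes(5))
#   Pleroma.DistributedCache.get("user:#{id}")
```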
@i@mint It's true, but the dude made like the world's best modern job queue software so I can't be too mad at him for wanting to run a business and make a living.
Back in ~2018 I almost recruited him to work on Pleroma. We had funding for him, but he was too involved in contract work so it didn't go anywhere beyond initial discussion. I was ready to drive down to Chicago and bring him into our office too 😢
@i@mint Lifeline will recover stuck / orphaned jobs but will not retry them if it was their "last attempt". Oban Pro has a better Lifeline plugin that can rescue these too based on custom rules.
I just forked theirs to detect if it was the last job and to reset it so it will be tried again.
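The gist of it is roughly this (a sketch of the idea, not the actual fork; the 30-minute cutoff is arbitrary):

```elixir
import Ecto.Query

# jobs that look orphaned: still marked executing long after their last attempt,
# and already out of attempts, so the stock Lifeline plugin would only discard them
orphaned =
  from(j in Oban.Job,
    where: j.state == "executing",
    where: j.attempt >= j.max_attempts,
    where: j.attempted_at < ago(30, "minute")
  )

# give them one more shot: bump max_attempts and put them back in the queue
Pleroma.Repo.update_all(orphaned, set: [state: "available"], inc: [max_attempts: 1])
```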
@feld@mint good on you for keeping it going, definitely an appreciated effort, even if i'm not looking forward to rebasing all my changes on the recent pleroma develop changes any time soon
and lain is right, a leaner pleroma is a better pleroma for everyone involved
running a 15 node nebulex project right now, playing around with AP vaporware from time to time, it's still magical that libcluster can just gossip up a distributed KV cache with so little effort
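for reference, the libcluster side of that is about this much config (topology name and port are placeholders):

```elixir
# config.exs -- the Gossip strategy multicasts heartbeats on the LAN and connects
# whichever nodes answer
config :libcluster,
  topologies: [
    nebulex_cluster: [
      strategy: Cluster.Strategy.Gossip,
      config: [port: 45_892]
    ]
  ]

# plus one child in the application supervisor:
#   {Cluster.Supervisor, [Application.get_env(:libcluster, :topologies), [name: MyApp.ClusterSupervisor]]}
```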
@i@mint Mind you we had full time devs on it from late 2017-2020, but the focus was almost exclusively on the backend. There were two frontend projects exploring the space as well. If I told you how much money was spent exploring the viability of the fediverse you'd never believe me. It has a lot of zeroes. Unfortunately the timing was just off and something else filled the void for my employer
@hj@i@mint We cannot rely on PleromaFE to be the one true FE for Pleroma forever. Soapbox isn't bad, and someone could take it over now that Gleason has mostly abandoned it for the fediverse. Phanpy is super cool. Something else could come out tomorrow, you could get hit by a bus, Lain could make an entirely new UI with his AI experiments, etc. Just can't predict the future.
I really wish I had time to do more work in this area but when I have time I gravitate towards problems I already know I can solve, and that's mostly backend work 😓
>...someone could take it over now that Gleason has mostly abandoned it for the fediverse.
Has that been confirmed? I know he's working more on nostr, but not sure if it's a temporary job or not.
I think there's a lot of neat things in AdminFE and Soapbox. Hell, if it's confirmed abandonware I wouldn't mind tweaking it in my spare time, though I've never worked much in Elixir.
@i@mint Biggest problem with AdminFE is the difficulty of parsing the config because it's not already JSON or something more approachable. It's got some pretty gross hardcoded stuff in it. I personally want to replace it with one written in LiveView where it can directly work with the native config data structures, no problem. If we could just get an initial PoC up that controls even one settings group, the rest of the work is easy, if a little tedious, and it could happen fast.
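A PoC like that really is tiny; something along these lines (module name is made up, it only reads one group for now, and it assumes phoenix_live_view is available in the deps):

```elixir
# hypothetical sketch: render one settings group straight from the native config,
# no JSON round-trip required
defmodule Pleroma.Web.AdminConfigLive do
  use Phoenix.LiveView

  @impl true
  def mount(_params, _session, socket) do
    {:ok, assign(socket, group: :instance, settings: Pleroma.Config.get(:instance))}
  end

  @impl true
  def render(assigns) do
    ~H"""
    <h2>Settings group: <%= @group %></h2>
    <pre><%= inspect(@settings, pretty: true) %></pre>
    """
  end
end
```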
@mint@NonPlayableClown@i@hj I just ran into some very serious usability bugs on mobile and when I asked him about it he was like "sorry, working on the Nostr stuff for now" so I just stopped using it
@i@feld@NonPlayableClown@hj Readme still mentions how it's effective enough to run on a Raspberry Pi, meanwhile I'm struggling to run my instance on a somewhat decently specced machine from ~2013 with a SATA SSD. Regarding the jsonb schema, it sure feels wasteful; something like a separate table with likes/repeats/reacts that only references the post, type and user would've probably done the job just as well if not better.
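Roughly something like this, as an Ecto migration sketch (table and column names are just illustrative):

```elixir
defmodule Pleroma.Repo.Migrations.AddInteractions do
  use Ecto.Migration

  def change do
    # one row per like/repeat/react instead of rewriting the object's jsonb blob
    create table(:interactions) do
      add :object_id, references(:objects, on_delete: :delete_all), null: false
      add :user_id, references(:users, type: :uuid, on_delete: :delete_all), null: false
      add :type, :string, null: false   # "like" | "announce" | "react"
      add :emoji, :string               # only used for reacts
      timestamps()
    end

    create unique_index(:interactions, [:object_id, :user_id, :type, :emoji])
    create index(:interactions, [:object_id, :type])
  end
end
```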
@feld@NonPlayableClown@mint@hj just ignore it eventually exploding into a death spiral because your VPS/SBC isn't running on a modern local NVMe RAID and the PostgreSQL autovacuum, having to rewrite all the deep indexes from the constant delete spam, cripples the instance for a few minutes, then hours, then days
it's fine, but just letting it crash loop doesn't leave a nice impression
lain is also right that people should know better and buy better specced servers, but we also lied to them for years about that not being the case
@i@NonPlayableClown@mint@hj I don't get why everyone's so mad about Pleroma's schema. Where's the core issue? Are people just upset that there's jsonb instead of a billion columns? Is there a pervasive impression that those jsonb columns are slow?
@feld@NonPlayableClown@mint@hj glussy's schema is currently running into more issues than even pleroma's buckets of slop; if he doesn't figure it out in time, the project is basically dead on arrival
@i@NonPlayableClown@mint@hj I think we could have a dedicated tombstones table so we can check for deleted there first, would probably help. Needs deeper investigation.
Some of the delete work cascading is annoying and I think we could do better there. I've already helped slow the pain by making deletes lower priority and with a narrower queue.
We have some triggers I am suspicious of and need to investigate deeper, but my memory is that these are totally Postgres FTS related. On that note I think even if you choose to use another search backend we still waste energy on the GIN/RUM indexing...
I think also there is an expensive operation for things like number of likes on a post. I swear I saw this count being embedded directly in the Object JSON, and that's crazy to me if that's how it was implemented. So deletes require rewriting that row? Holy shit, that's gonna bring pain and make your table a Swiss cheese mess over time.
Those post stats (likes, replies, quotes) should probably be in their own table, use FKs, etc. Counts could be cached, and a new activity or a delete should trigger a job to update the count async instead of blocking the commit
Stuff like that could help lower resource instances tremendously
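The async counter piece could be a single small Oban worker, something like this sketch (the interactions and object_stats tables are hypothetical, and the queue choice is arbitrary):

```elixir
defmodule Pleroma.Workers.RefreshStats do
  # recount interactions for one object and cache the result, outside the
  # transaction that inserted or deleted the activity
  use Oban.Worker, queue: :background, max_attempts: 3

  import Ecto.Query
  alias Pleroma.Repo

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"object_id" => object_id}}) do
    likes =
      Repo.one(
        from(i in "interactions",
          where: i.object_id == ^object_id and i.type == "like",
          select: count(i.id)
        )
      )

    Repo.query!(
      """
      INSERT INTO object_stats (object_id, like_count)
      VALUES ($1, $2)
      ON CONFLICT (object_id) DO UPDATE SET like_count = EXCLUDED.like_count
      """,
      [object_id, likes]
    )

    :ok
  end
end

# enqueued from wherever the like/delete is handled:
#   %{object_id: object.id} |> Pleroma.Workers.RefreshStats.new() |> Oban.insert()
```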
Note I just fixed a bug where we appeared to have been updating the "unreachable_since" field for every instance on every successful activity published due to the test not matching reality. Imagine how many wasted queries that is😭😭😭
@feld@mint I think that this theory is the very likely cause. I've updated FluffyTail to the same Oban version bump commit (dbf29cbae4) on Sunday and it finally broke under its own weight. I've seen an increase in CPU iowait and disk backlog alerts that must be associated with the Oban v12 jobs migration, because nothing else has changed. I've also noticed a big increase in DB timeouts in the logs. Granted, it's running on a garbage tier Frantech VPS with the DB being hosted on a slab with dmcache on the OS drive.
I'll leave the instance in its current failed state if you want more information.
My test instance (@phnt@oban.borked.technology) running on the cheapest OVH Cloud VPS is still working perfectly fine after a week and a half of 4x the traffic FluffyTail receives.
@phnt@feld Looks like the queue clogged up on salon again, this time I managed to catch it just five minutes in. The theory might be true; worth noting that postgrex crashes used to cascade into crashing the whole pleroma (#2469), which stopped at around the same time the freezes began. Some (but not all) of my instances were affected, so I had to make a script that curls the API every 30 seconds and restarts Pleroma when receiving no response.
@phnt@feld Honestly crashes are preferable to that, since it can be mitigated with a trivial script. With freezes, you'd have to check the date of the last post in TWKN, and there's no guarantee that it was caused by the bug and not some external factors (e.g. broken VPN tunnel like in my setup).
@mint@phnt okay, my feld/debugging branch has a clean rollback of Oban to the 2.13.6 version. There is a migration that needs to run. The Oban Live Dashboard is not compatible if you were using it, so sorry about that. But let's see if this gives you stability again
@feld@phnt Seems a bit radical as a solution, but okay. I do wonder, though, if moving the Oban queue into a separate database might help with the I/O issues. Maybe even keep it on a ramdisk.
@mint@phnt the I/O should really be minimal which is what's baffling
it is possible to make Oban specifically use a dedicated SQLite database in newer releases (a dedicated postgres is possible too, of course)
but let's roll back to the Oban version you never had issues with and work from there. If it still has problems we know it's something else (Postgrex?)
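For the record, the dedicated-database setup is roughly this (repo module name is made up, and it needs the ecto_sqlite3 dependency plus the Oban migrations run against that repo):

```elixir
# hypothetical repo that holds only the job queue
defmodule Pleroma.ObanRepo do
  use Ecto.Repo,
    otp_app: :pleroma,
    adapter: Ecto.Adapters.SQLite3
end

# config.exs -- point Oban at it and switch to the SQLite engine:
#   config :pleroma, Pleroma.ObanRepo, database: "/var/lib/pleroma/oban.db"
#   config :pleroma, Oban,
#     repo: Pleroma.ObanRepo,
#     engine: Oban.Engines.Lite,
#     queues: [federator_incoming: 5, federator_outgoing: 5]
```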
> The theory might be true; worth noting that postgrex crashes used to cascade into crashing the whole pleroma
Similar thing happened to me: stalled federation, and an hour later all connections on localhost were refused (Pleroma did not listen on its port). When I used IEx to get access to it, IEx had no idea what Oban and Ecto were.
> the I/O should really be minimal which is what's baffling
I've looked through the 2.17.0 release notes and sorentwo mentioned disabling the insert trigger functionality completely in config if "sub-second job execution isn't important." That should disable the insert triggers I suspected of causing the increased I/O and revert to polling only. After changing that with config :pleroma, Oban, insert_trigger: false the I/O did not change in any way. Still the same behavior. At this point I'm kinda lost as to what the issue might be.
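For anyone following along, the change was just this; with the trigger off, Oban only notices new jobs on its periodic staging poll:

```elixir
# rest of the Oban section left at Pleroma's defaults
config :pleroma, Oban,
  insert_trigger: false
```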
@feld@phnt@mint Updated instance to the newest commit on feld/debugging and downgraded Oban. There's no point in me trying to find it on my own as I have no other clues.
@feld@mint@phnt Pleroma crashed again ~1 minute after I made a post. The federator_incoming queue had 0 available jobs and a few retryable ones. federator_outgoing had 7 failed jobs and zero available/executing.
Same thing as last time. Out of nowhere there was a jump in disk backlog for a minute, along with disk busy time and Pleroma DB locks. I had almost zero DB timeouts before that.
Before the crash, a lot of "(DBConnection.ConnectionError) connection not available and request was dropped from queue after <some number>ms. This means requests are coming in and your connection pool cannot serve them fast enough." errors showed up in the logs. Pleroma used at most 12 DB connections. The number of connections and pool size are from the default config; only :pleroma, :connections_pool, connect_timeout was increased to 10s from the default 5s, and :pleroma, Pleroma.Repo, timeout was also increased to 30s.
The Netdata screenshots are from the same time. Ignore the time difference. Server is UTC-4 (US ET) and Netdata is UTC+2 (CEST).
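For context, these are the knobs involved; only the two timeouts were actually changed here, and the queue_* values are the DBConnection defaults shown for reference:

```elixir
config :pleroma, Pleroma.Repo,
  timeout: 30_000,       # per-query timeout, raised to 30s
  queue_target: 50,      # how long (ms) checkouts may queue before DBConnection
  queue_interval: 1_000  # starts dropping requests within this interval

config :pleroma, :connections_pool,
  connect_timeout: 10_000  # outbound connection pool, raised from the 5s default
```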
@phnt@phnt@mint I'm a little confused about why there are duplicate/retry jobs firing so quickly, as there are some things in the logs that shouldn't be happening in succession like that. They may not be the root cause, but they're definitely adding to the pressure and we can address it