Conversation
-
@phnt @NonPlayableClown @Owl @dj @ins0mniak @transgrammaractivist
Breaking follows is basic userland, ask me how I know he doesn't test shit before pushing updates.
-
@cvnt @phnt @NonPlayableClown @Owl @dj @ins0mniak @transgrammaractivist
> ask me how I know he doesn't test shit before pushing updates.
"there isn't a Pleroma instance that exists which cannot handle the load on available hardware" still irks me because it means that the person that wrote it was ignoring the performance issues that had been reported in the bug tracker as well as in messages on fedi. (I don't think most people could have convinced Pleroma to stay up when subjected to FSE's load.) I guess nothing has changed since FSE's last merge with upstream.
-
@p @phnt @cvnt @ins0mniak @transgrammaractivist @Owl @dj @NonPlayableClown In his defense, he's been a big help with identifying the root cause of a performance issue (a bug in Oban that made it crash its queue-processing tasks) that until that point only affected my instances and one other guy who made a bug report.
-
@p @NonPlayableClown @Owl @dj @ins0mniak @phnt @transgrammaractivist
Imagine thinking an rpi4 was underpowered for your website, and the issue isn't your fucking website.
-
@cvnt @NonPlayableClown @Owl @dj @ins0mniak @phnt
> Imagine thinking an rpi4 was underpowered for your website, and the issue isn't your fucking website.
In this case, he was talking about maximum-size Frantech instances at the time: https://freespeechextremist.com/objects/508dc765-46de-4e28-adde-a8a6fd1d8ee0 . Granted, it's gotten better sometimes (worse sometimes) but breaking follows is a thing that shouldn't happen at any point on a release version.
The bae.st media import is actually running on a CM-4, and the real bottleneck is the disk; it's chugging, but I just sort of took it as a given that this is something I have to fix. There was an instance running on a Switch for a while, and I'm sure you're aware of mint's antics. lain cares about that kind of platform, so it's nice that lain is around.
-
@mint @NonPlayableClown @Owl @cvnt @dj @ins0mniak @phnt Yeah? That's good to hear, maybe he cares about craftsmanship.
-
@p @phnt @cvnt @ins0mniak @Owl @dj @NonPlayableClown it would also be nice if lain was around to actually make a 2.7.1 release like they said they would half a month ago
-
@feld @NonPlayableClown @Owl @cvnt @ins0mniak @phnt @transgrammaractivist
> Nobody proved there was an *Oban* bottleneck and still haven't.
Well, this was a remark from years back. (It does still irk me.) Everything I know about the current Oban bug is second-hand; I am running what might be the only live Pleroma instance with no Gleason commits (happy coincidence: I was actually dodging another extremely expensive migration and then kicked off the other project, which meant I didn't want to be chasing a moving target if I could avoid it, so I stopped pulling). At present, I backport a security fix (or just blacklist an endpoint) once in a while.
Unless you mean the following thing, but I haven't run 2.7.0. I don't know what that bug is.
> If I could reproduce reported issues it would be much easier to solve them but things generally just work for me.
I mean, like I mentioned, the Prometheus endpoints were public at the time. You could see my bottlenecks. (I think that would be cool to re-enable by default; they'd just need to stop having 1MB of data in them if people are gonna fetch them every 30s, because enough people doing that can saturate your pipe: a few hundred instances each pulling 1MB every 30 seconds already works out to tens of Mbit/s of sustained traffic.)
> A ton of work has been put into correctness (hundreds of Dialyzer fixes) and tracking down elusive bugs and looking for optimizations like reducing JSON encode/decode work when we don't need to, avoiding excess queries, etc.
I'm not sure what the Dialyzer is (old codebase), but improvements are good to hear about. That kind of thing gets you a 5%, 10% bump on a single endpoint, though. The main bottleneck is the DB; some cleverness around refetching/expiration would get you much larger performance gains, I think, and using an entire URL for an index is costing a lot in disk I/O. There's a lot of stuff to do, just not much of it is low-hanging, I think.
> It's actually been going really great :shrug:
:bigbosssalute: That is awesome to hear.
-
@p
> I mean, like I mentioned, the Prometheus endpoints were public at the time.
Problem is that this data is useful for monitoring the overall health of an instance, but it doesn't give enough granular information to track down a lot of issues. With the metrics/telemetry work I have in progress, we'll be able to export more granular Pleroma-specific metrics, which will help a lot.
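As a rough sketch of what that looks like (event names and measurements here are made up for illustration, not the actual Pleroma telemetry schema):

# attach a handler that forwards the event to whatever metrics backend you use
:telemetry.attach(
  "demo-federator-publish",
  [:pleroma, :federator, :publish],
  fn _event, measurements, metadata, _config ->
    IO.inspect({measurements, metadata}, label: "federator publish")
  end,
  nil
)

# emit a granular event from a hot path, with measurements plus metadata for context
:telemetry.execute(
  [:pleroma, :federator, :publish],
  %{duration_ms: 42, payload_bytes: 17_000},
  %{target_host: "example.com"}
)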
> The main bottleneck is the DB
So often it's just badly configured Postgres. If your server has 4 cores and 4GB of RAM, you can't go use pgtune and tell it you want to run Postgres with all 4 cores and 4GB; there's nothing left over for the BEAM. You want at least 500MB-1GB dedicated to the BEAM, more if your server has a lot of local users, so it can handle memory allocation spikes.
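Purely as an illustration, a pgtune-style result for that kind of box if you budget roughly 1GB away from Postgres for the BEAM and the OS (invented numbers; run pgtune against the reduced figures yourself rather than copying these):

# postgresql.conf, assuming ~3GB RAM and ~3 cores actually available to Postgres
shared_buffers = 768MB
effective_cache_size = 2GB
work_mem = 16MB
maintenance_work_mem = 128MB
max_connections = 20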
And then what else is running on your OS? That needs resources too. There isn't a good way to predict the right values for everyone. 😭 Like I said, it's running *great* on my little shitty thin client PC with old, slow Intel J5005 cores and 4GB RAM. But I have an SSD for storage and almost nothing else runs on the OS (FreeBSD); I'm counting a total of 65 processes before Pleroma, Postgres, and Nginx are running. Most Linux servers have way more services running by default, and that really sucks when you're trying to make things run well on lower-specced hardware.
You also have to remember that the BEAM is greedy and will intentionally hold the CPU longer than it needs to, because it wants to deliver soft-realtime performance. This needs to be tuned down on low-resource servers, because otherwise the BEAM itself will be preventing Postgres from doing productive work; it's just punching itself in the face at that point. Set these vm.args on any server that isn't massively overpowered:
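# disable scheduler busy-waiting for the normal, dirty CPU, and dirty IO schedulers, so idle BEAM schedulers actually yield the CPU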
+sbwt none
+sbwtdcpu none
+sbwtdio none
> using an entire URL for an index is costing a lot in disk I/O
For the new Rich Media cache (link previews stored in the db so they're not constantly refetched) I hashed the URLs for the index for that same reason. Research showed a hash and the chosen index type were super optimal.
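The shape of it, roughly (table, column, and module names here are simplified for illustration, not the exact migration):

defmodule MyApp.Repo.Migrations.AddRichMediaCache do
  use Ecto.Migration

  def change do
    create table(:rich_media_cache) do
      add(:url, :text, null: false)
      add(:url_hash, :binary, null: false)    # sha256 of the URL: a small, fixed-size key
      add(:card, :map, null: false)            # the cached link-preview data
      timestamps()
    end

    # index the 32-byte hash instead of the full (potentially very long) URL
    create(unique_index(:rich_media_cache, [:url_hash]))
  end
end

# the lookup side hashes the same way before querying:
url_hash = :crypto.hash(:sha256, "https://example.com/some/very/long/article-url")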
Another thing I did: I noticed we were storing *way* too much data in Oban jobs. Like, when you federated an activity, we were taking the entire activity's JSON and storing it in the job args. Imagine making a post with 100KB of content that needs to go to 1,000 servers: each delivery job in the table was HUGE. Now it's just the ID of the post, and we do the JSON serialization at delivery time. Much better: lower resource usage overall, lower IO.
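Roughly the difference (module and helper names here are made up, not the actual Pleroma workers):

defmodule MyApp.Workers.DeliveryWorker do
  use Oban.Worker, queue: :federator_outgoing

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"activity_id" => id, "inbox" => inbox}}) do
    # fetch and serialize at delivery time instead of storing the whole JSON in the job
    activity = MyApp.get_activity!(id)    # hypothetical lookup by ID
    json = Jason.encode!(activity.data)
    MyApp.deliver(inbox, json)            # hypothetical signed HTTP delivery
  end
end

# enqueueing: one tiny row per destination instead of one huge JSON blob per destination
%{activity_id: activity.id, inbox: inbox}
|> MyApp.Workers.DeliveryWorker.new()
|> Oban.insert()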
Even better would be if we could serialize the JSON *once* for all deliveries, but it's tricky because we gotta change the addressing for each delivery. The Jason library has some features we might be able to leverage for this, but it doesn't seem important to chase yet. Even easier might be to put placeholders in the JSON text, store it in memory, and then just use a regex or cheaper string replacement to fill those fields in at delivery time. That saves all the repeated JSON serialization work.
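Very hand-wavy sketch of the placeholder idea (activity and recipient_list stand in for whatever the delivery code actually has on hand):

# serialize once with a placeholder in the addressing field
base_json =
  activity.data
  |> Map.put("to", "__TO_PLACEHOLDER__")
  |> Jason.encode!()

# per delivery, cheap string replacement instead of a full re-encode
per_recipient_json =
  String.replace(base_json, ~s("__TO_PLACEHOLDER__"), Jason.encode!(recipient_list))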
Other things I've been doing:
- making sure Oban jobs that hit an error we should really treat as permanent are caught and not allowed to retry. Retrying is wasteful for us and rude to remote servers when we're fetching things (rough sketch after this list)
- finding every possible blocker for rendering activities/timelines and making those things asynchronous. One of the most recent ones I found was with polls. They could stall rendering a page of the timeline if the poll wasn't refreshed in the last 5 mins or whatever. (and also... I'm pretty sure polls were still being refreshed AFTER the poll was closed 🤬)
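For that first item, the shape of it is basically this (the fetch helper and error shapes are invented for illustration); returning {:cancel, reason} tells Oban not to retry:

defmodule MyApp.Workers.RemoteFetcherWorker do
  use Oban.Worker, queue: :background

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"url" => url}}) do
    case MyApp.fetch(url) do    # hypothetical HTTP fetch
      {:ok, body} ->
        {:ok, body}

      # permanent failures: cancel so we never hammer the remote again for this
      {:error, status} when status in [401, 403, 404, 410] ->
        {:cancel, "remote returned #{status}"}

      # transient failures fall through and get retried with backoff as usual
      {:error, reason} ->
        {:error, reason}
    end
  end
end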
I want Pleroma to be the most polite Fedi server on the network. There are still some situations where it's far too chatty and sends requests to other servers that could be avoided, so I'm trying to plug them all. Each of these improvements lowers the resource usage on each server. Just gotta keep striving to make Pleroma do *less* work.
I do have my own complaints about the whole Pleroma releases situation. I wish we were cutting releases like ... every couple weeks if not every month. But I don't make that call.
-
@p @cvnt @phnt @NonPlayableClown @Owl @dj @ins0mniak @transgrammaractivist Nobody proved there was an *Oban* bottleneck and still haven't.
I'm always running my changes live on my instances. They were massively overpowered. Now I have a severely underpowered server and it's still fine.
If I could reproduce reported issues it would be much easier to solve them but things generally just work for me.
A ton of work has been put into correctness (hundreds of Dialyzer fixes) and tracking down elusive bugs and looking for optimizations like reducing JSON encode/decode work when we don't need to, avoiding excess queries, etc.
I'm halfway done with an entire logging rewrite and telemetry integration which will make it even easier to identify bottlenecks.
It's actually been going really great :shrug: