@feld Don't know if it's because of what you did to Oban, but lately (after updating from 03024318 to 9953b0da) some shit's been happening where the instance essentially gets constipated and stops properly receiving remote activities (and likely also delivering local ones) until you restart it; immediately after the restart they start coming through from the queue somewhere in the DB. Any idea what could be happening?
@mint@feld@p It's because of a few posts in the federation tab; you need to delete some of them to get it going again after a restart (I have had this issue twice).
@dcc@feld@p@mint I don't see how this is related. If there were some post preventing the rest of the posts from coming through, I imagine it'd be in the queue itself rather than in the table with already processed posts.
@jesu As lain said, posts from your subscriptions are pushed by remote instances to yours, not fetched from them. Local accounts take up as many resources as remote ones, that being a single row in the users table, so if an account lies dormant, it doesn't take up any resources aside from whatever its subscriptions push in.
@jesu What you've been proposing might be doable, possibly even with an MRF. Activities that are shared to followers have the actor's /followers address either in `to` (unlisted posts, lockposts) or `cc` (public posts). When receiving such an activity, one could check the list of local followers the actor has, then check each follower's last active time, and if none were recently active, reject it. The problem? This would break thread contexts: when fetching a post someone replied to, Pleroma would generate a Create activity with the same structure and then promptly reject it, since it fits the same criteria. Essentially turning your instance into a whitelist-federated one with extra steps.
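For illustration, a minimal sketch of what such a policy could look like (not working code; the follower/last-activity lookup is a hypothetical helper and would need to be written against Pleroma's actual User/FollowingRelationship APIs):
```elixir
# Sketch only: `recently_active_local_follower?/1` is a hypothetical helper.
defmodule Pleroma.Web.ActivityPub.MRF.InactiveFollowerPolicy do
  @behaviour Pleroma.Web.ActivityPub.MRF.Policy

  @impl true
  def filter(%{"type" => "Create", "actor" => actor} = activity) do
    if recently_active_local_follower?(actor) do
      {:ok, activity}
    else
      # This is exactly where thread contexts break: Creates that Pleroma
      # synthesizes while fetching a reply's parent hit the same branch.
      {:reject, "no recently active local followers"}
    end
  end

  def filter(activity), do: {:ok, activity}

  @impl true
  def describe, do: {:ok, %{}}

  # Hypothetical: look up the remote actor's local followers and compare
  # their last-activity timestamps against some cutoff.
  defp recently_active_local_follower?(_actor_ap_id), do: true
end
```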
@mint I don't know why you aren't on the Pleroma dev team with how much you know, but something tells me it has to do with time and/or money and/or not wanting to deal with that.
@jesu This also seems to be a case where non-shared, user-specific inboxes would've been something more than a gnusoc-era rudiment (since you could filter by recipient more easily), but all modern fedi software I've seen sends stuff to the shared inbox exclusively, aside from a few edge cases like DMs. And even then it's inconsistent: Pleroma sends the first DM to the user inbox, but if you reply to your own DM, it gets sent to the shared one. Not that it matters, since scopes/recipients have always been defined in the activity itself.
@jesu
>how to stop those inactive accounts from, not only receiving the activities, but also how to stop the sending servers from sending them
Nothing much can be done aside from deleting accounts then, which should send out unfollow activities to each followed user, but that often doesn't work as intended (as stated below). Making some mix task that would mass-unfollow on demand is entirely possible, though.
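Roughly along these lines (a sketch only; `get_friends/1` and `unfollow/2` are written from memory and should be checked against the current Pleroma.User module before use):
```elixir
defmodule Mix.Tasks.Pleroma.MassUnfollow do
  use Mix.Task
  import Mix.Pleroma

  @shortdoc "Unfollows everyone a given local account follows"
  def run([nickname]) do
    start_pleroma()

    case Pleroma.User.get_cached_by_nickname(nickname) do
      %Pleroma.User{local: true} = user ->
        user
        # Assumed to return the list of users this account follows; verify.
        |> Pleroma.User.get_friends()
        |> Enum.each(fn followed ->
          # NOTE: whichever helper actually federates the Undo/Follow should
          # be used here; User.unfollow/2 may only touch the local relationship.
          Pleroma.User.unfollow(user, followed)
        end)

        shell_info("Done.")

      _ ->
        shell_error("No local user #{nickname}")
    end
  end
end
```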
@mint hmm.. I feel like we are both misunderstanding each other.. I'm not 100% sure about both of us, but I'm 100% sure I'm misunderstanding... I also have a tendency to practically speak in riddles, because so much ends up assumed that I don't speak clearly.
I feel like I'm misspeaking, because "When receiving such an activity" is already not what I mean. I didn't realize activities are sent, not received. I will try my best to explain; sorry to repeat myself.
If your server had 1000 inactive accounts, and those accounts each followed an average of 50 remote accounts, that's 50k accounts that are sending activities to your server, right? So, my idea was kind of a question: how to stop those inactive accounts from, not only receiving the activities, but also how to stop the sending servers from sending them. It's simply wasted resources on both ends.
You propose an MRF, but that would probably DDoS the sending server, since its queue would build up and those activities would keep being sent and never received (I suppose the receiving server could lie?).
@jesu@mint To clarify. On your local Pleroma instance, it would seem that I no longer follow you, but you still follow me (the remote deactivated account).
@jesu@mint No, Pleroma does not send any deactivation activities (it's local only). Misskey unfollows everyone before deactivating the account, but the followers still remain.
Imagine we have two separate instances and we both follow each other. You are on Pleroma and I use Misskey. If I delete my own account, Misskey would unfollow you from my perspective, but your follow would still remain even though the account behind it is deactivated even on your local instance.
Mastodon might do something similar with follows, but nobody who follows me from Mastodon has deleted their account yet. The same remote account deactivation still happens, though. Both Mastodon and Misskey also send Delete objects for all posts of that account before deactivating it.
@jesu@mint Both Misskey and Mastodon send account deactivation activities to remote instances. Misskey and maybe Mastodon also unfollow every user the deactivated account followed.
@i@mint@phnt that's why I'm thinking "can I just lie and say this account that has been inactive is deactivated when it's not? would *actually* deactivating it work?"
@jesu It already wastes its own resources by mass-spamming Delete activities for each post of a deleted user to the entire network it's aware of. But yeah, like everything else in AP this isn't enforceable and is likely a request on the level of restricting who can reply to a post.
@mint well, now I know this is a pipe dream, because mastodon will never do anything good and this heavily relies on cooperation.
It'd be very very very nice to say "hey, this user has been inactive for 3 months, stop sending statuses to them." I can't imagine how many resources are wasted on this with mastodon.social alone
@jesu@phnt The queue issue cropped up only recently after the update, and I haven't experienced anything like it in the two years of hosting the instance, so I'm fairly confident it's a bug. Most of my inbound federation comes from relays anyway.
@phnt@mint I suppose my next question would be "would this be beneficial?" and mint and lain thought not, but I keep seeing queues getting bogged down and I'm curious what it could be.
@phnt@jesu Not sure, it happened once on salon as well, where I did no maintenance. And there it affected outbound federation as well; pernia's posts were coming hours late after the restart.
@mint@jesu Also, if it's the same thing that occurred during the DB pruning, then only incoming federation was affected. I saw your posts almost in real time while it was happening on Ryona Agency.
@feld I've truncated the table just in case after experiencing it. It was around 50k when I woke up and saw no new posts in TWKN, and 41k after I restarted pleromer and the federation queue caught up. Screenshot_20240702_164321.png
@feld Something similar, however, happened on salon and I did not flush its queue after restart. There's a shitload of user refresh workers. Screenshot_20240702_164551.png
@i@mint I think it could be reasonably common for there to be stale "executing" jobs in the table that are left there indefinitely after a crash/failure/unclean shutdown and should be recycled. I check mine occasionally, but so far I haven't found any.
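If you'd rather check from IEx than psql, something like this (a rough sketch against the standard Oban.Job schema) lists whatever is sitting in the executing state:
```elixir
# Rough IEx sketch: list jobs currently marked as executing, oldest first.
import Ecto.Query

Pleroma.Repo.all(
  from j in Oban.Job,
    where: j.state == "executing",
    order_by: [asc: j.attempted_at],
    select: {j.queue, j.worker, j.attempt, j.attempted_at}
)
```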
@feld All from regular servers (FSE, DRC, varis, etc.); fetching a couple of them either timed out due to the remote instance being down or brought back the already fetched object.
@feld I run my own fork with a few minor changes (MRFs, DMs about registration requests, stuff like that), but none of them touch Oban or the HTTP adapters except for a workaround for Hackney not being able to follow redirects when using an HTTP proxy.
@feld Got notified by a friend that it happened to his instance as well (though I have SSH access to it and run the same fork with the same proxy setup there). Checked the issues, and someone else with a supposedly vanilla setup reported it too, so it isn't just me: https://git.pleroma.social/pleroma/pleroma/-/issues/3286 I think the HTTP adapters are unrelated, since the incoming activities get stuck in the queue before ever going through the pipeline that would fetch context, apply MRFs, etc. Plus, force-fetching works fine in this condition.
@mint I am actively investigating this, trying to find any possible reason this is happening.
My best guess so far is orphaned jobs making Oban think it can't run more jobs because they're dead / stuck in "executing" state.
This should really never happen because Oban itself doesn't crash, but I guess if you restarted Pleroma and it didn't clean itself up gracefully this could happen.
Any chance some of these are Docker deployments or the service could have crashed and restarted automatically due to low resources (OOM, etc)?
@feld
>Any chance some of these are Docker deployments or the service could have crashed and restarted automatically due to low resources (OOM, etc)?
No, two instances are in an LXC container with Postgres in a separate container, connected via a virtual bridge; I haven't set up any quotas. The third one is running without any containers on a cheapish VPS.
@mint I found it's possible for the background queue to get stuck because of a super long timeout (15 mins) and some other jobs that were missing timeouts entirely (the default is infinity), so I've fixed these issues. Some other tweaks in here too.
These changes don't have anything directly to do with the ReceiverWorker, but it may be possible that Oban is not scheduling those jobs because existing running jobs are stuck. This is unclear to me and doesn't feel like it should work that way on the BEAM, so it could be Oban-specific behavior in how it chooses to execute available work.
Investigation is still ongoing until I am certain nothing else could be causing this.
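For context, per-worker timeouts in Oban are just the `timeout/1` callback, so the timeout part of the fix amounts to something like this (illustrative values only, not the ones actually committed):
```elixir
defmodule Pleroma.Workers.ExampleWorker do
  # Illustrative sketch; the real workers and chosen timeout values differ.
  use Oban.Worker, queue: :background, max_attempts: 3

  @impl Oban.Worker
  def perform(%Oban.Job{args: _args}) do
    # ... the actual work ...
    :ok
  end

  # Without this callback Oban's default timeout is :infinity, so one hung
  # job can occupy a queue slot indefinitely.
  @impl Oban.Worker
  def timeout(_job), do: :timer.minutes(5)
end
```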
@mint if it's interesting to you, I forked the Oban Lifeline plugin and made a new one called Lazarus which is configured to revive a stuck/dead/orphaned job even if it was at its last attempt due to max_attempts: 1
if the job has failed multiple times though, it lets it go
the original Lifeline plugin would throw away a bunch of our jobs just because they're max_attempts: 1. It's not strictly necessary to have them tried again, but it's just dumb that if they failed hard for a totally unhandled reason they wouldn't get another shot.
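For reference, the stock plugin just gets enabled in the Oban config, roughly like this (a sketch; Lazarus drops in the same way but handles max_attempts: 1 jobs differently):
```elixir
# config/config.exs sketch: the stock Lifeline plugin rescues jobs stuck in
# the executing state after `rescue_after`, but anything already on its final
# attempt gets discarded rather than retried.
import Config

config :pleroma, Oban,
  plugins: [
    {Oban.Plugins.Lifeline, rescue_after: :timer.minutes(30)}
  ]
```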
@feld Yeah, it's subbed to 63 Pleroma internal relays. It also has this thingy enabled that fetches announced notes separately while rejecting the Announce activity itself to reduce DB bloat (like, more than 30% of activities on agency were those announces when I decided to write it). https://gitgud.io/ryonagency/pleroma/-/blob/ryona-dev/lib/pleroma/web/activity_pub/mrf/relay_optimization.ex I don't think it should have any effect on federation, since it basically just skips the transmogrifier and other steps which eventually still call the same fetch_object_from_id.
Either way, now that the draft for 2.7.0 is in the works, we'll soon see if the frozen federation issue happens on a larger scale than just me and the other guy.
@i@mint I'm running out of ideas, but I did merge in the Oban Live Dashboard, so you can go to /phoenix/live_dashboard on your instance as an admin and watch the Oban jobs live (just change the page refresh time to like 2 seconds)
at least this way you should be able to spot jobs stuck executing, click on them, and see exactly what they were working on
@feld@mint that's what I did in March after getting tired of pulling up psql, really handy even if it's a bit lackluster in that you can only sort the table instead of filtering on the state/queue/worker you'd want to focus on
@mint@i I left myself that note because I swear I saw some returning `{:reject, _}` without being wrapped in `{:error, _}` and wanted to investigate further
@mint@feld yeah, big pain of values-as-errors in the land of untyped pattern matching; too many things accidentally shaped wrong in the spaghetti of case fallthroughs
@mint@i I wanted to make better decisions on the Rich Media jobs so we don't retry jobs that will just fail again; I identified some deficiencies there, so I've got some refactoring of the helper/parser/backfill functions to help with this. About to start testing it live.
I noticed a couple harmless bugs in the process, but making computers do less work is important :)