@mint I take it you had another freeze/crash even with the latest changes in the branch? Well, let's see how long you can go on Oban 2.18. Nothing in that changelog looks relevant but hey, stranger things have happened 🤪
@feld Not yet, but I guess it's better to be prepared than to simply wait. Also updating containers to a fresh Alpine version (with OTP 26 and Elixir 1.16).
Older Oban used db triggers. They have a very slight overhead, so Oban switched to the new method to squeeze out more performance for the most demanding use cases.
If Postgres can't keep up with work and queries start timing out, Ecto/Postgrex (db driver and connection pooler) crashes and restarts. This would cascade up to Oban. And I think in some edge case it can cause Oban to come back online but not properly start the queue processing.
Now you stopped processing jobs. Super weird.
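If anyone is poking at a hung instance, here's a minimal IEx sketch (assuming the default Oban instance name and Pleroma's federator queue names) to check whether the queues actually came back after a Postgrex restart:

```elixir
# returns a map with :paused, :running, :limit, etc. for the producer on this node
Oban.check_queue(queue: :federator_incoming)

# if a queue silently never came back, it can be started again by hand
Oban.start_queue(queue: :federator_outgoing, limit: 5)
```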
I have a lower resource test server I'm running now and following some relays -- feld@friedcheese.us. Feel free to flood me with follow requests from a giant bot network, I'll need more followers to stress this further 🤭
@i@mint It's still a passion project and I really love Elixir, so it's not going away. Lain wants to start cutting out complexity and unused functionality. There's low-hanging fruit for performance improvements. e.g., I have plans to completely refactor and simplify the caching too with a new, better approach (Nebulex). I hope to be running a Pleroma instance across multiple redundant tiny computers at home soon as a proof of concept that we can scale horizontally and scale down just fine (database and media on another server, but that's good enough; serious people can cluster / load balance those too)
It's also possible to run Pleroma with no frontend webserver. I've done this in another project to experiment. Works great! It can get its own certificate with LetsEncrypt and bind on real 80/443.
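To give an idea of the Nebulex direction, here's a minimal sketch (module name, adapter choice, and options are just assumptions, not the actual plan):

```elixir
# hypothetical cache module; the Nebulex 2.x replicated adapter keeps a full copy
# of the cache on every connected node (start it in the supervision tree)
defmodule Pleroma.DistributedCache do
  use Nebulex.Cache,
    otp_app: :pleroma,
    adapter: Nebulex.Adapters.Replicated
end

# usage is the usual key/value API:
#   Pleroma.DistributedCache.put("user:#{id}", user, ttl: :timer.minutes(5))
#   Pleroma.DistributedCache.get("user:#{id}")
```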
@i@mint It's true, but the dude made like the world's best modern job queue software so I can't be too mad at him for wanting to run a business and make a living.
Back in ~2018 I almost recruited him to work on Pleroma. We had funding for him, but he was too involved in contract work so it didn't go anywhere beyond initial discussion. I was ready to drive down to Chicago and bring him into our office too 😢
@i@mint Lifeline will recover stuck / orphaned jobs but will not retry them if it was their "last attempt". Oban Pro has a better Lifeline plugin that can rescue these too based on custom rules.
I just forked theirs to detect if it was the last job and to reset it so it will be tried again.
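The gist of it is roughly this (a sketch of the idea, not the actual fork; the 30-minute cutoff is arbitrary):

```elixir
import Ecto.Query

# jobs that look orphaned: still marked executing long after their last attempt,
# and already out of attempts, so the stock Lifeline plugin would only discard them
orphaned =
  from(j in Oban.Job,
    where: j.state == "executing",
    where: j.attempt >= j.max_attempts,
    where: j.attempted_at < ago(30, "minute")
  )

# give them one more shot: bump max_attempts and put them back in the queue
Pleroma.Repo.update_all(orphaned, set: [state: "available"], inc: [max_attempts: 1])
```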
@feld@mint good on you for keeping it going, definitely an appreciated effort, even if i'm not looking forward to rebasing all my changes on the recent pleroma develop changes any time soon
and lain is right, a leaner pleroma is a better pleroma for everyone involved
running a 15 node nebulex project right now, playing around with AP vaporware from time to time, it's still magical that libcluster can just gossip up a distributed KV cache with so little effort
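for reference, the libcluster side of that is about this much config (topology name and port are placeholders):

```elixir
# config.exs -- the Gossip strategy multicasts heartbeats on the LAN and connects
# whichever nodes answer
config :libcluster,
  topologies: [
    nebulex_cluster: [
      strategy: Cluster.Strategy.Gossip,
      config: [port: 45_892]
    ]
  ]

# plus one child in the application supervisor:
#   {Cluster.Supervisor, [Application.get_env(:libcluster, :topologies), [name: MyApp.ClusterSupervisor]]}
```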
@i@mint Mind you we had full time devs on it from late 2017-2020, but the focus was almost exclusively on the backend. There were two frontend projects exploring the space as well. If I told you how much money was spent exploring the viability of the fediverse you'd never believe me. It has a lot of zeroes. Unfortunately the timing was just off and something else filled the void for my employer
@hj@i@mint We cannot rely on PleromaFE to be the one true FE for Pleroma forever. Soapbox isn't bad, and someone could take it over now that Gleason has mostly abandoned it for the fediverse. Phanpy is super cool. Something else could come out tomorrow, you could get hit by a bus, Lain could make an entirely new UI with his AI experiments, etc. Just can't predict the future.
I really wish I had time to do more work in this area but when I have time I gravitate towards problems I already know I can solve, and that's mostly backend work 😓
>...someone could take it over now that Gleason has mostly abandoned it for the fediverse.
Has that been confirmed? I know he's working more on nostr, but not sure if it's a temporary job or not.
I think there's a lot of neat things in AdminFE and Soapbox. Hell, if it's confirmed abandonware I wouldn't mind tweaking it in my spare time, though I've never worked much in Elixir.
@i@mint Biggest problem with AdminFE is the difficulty of parsing the config because it's not already JSON or something more approachable. It's got some pretty gross hardcoded stuff in it. I personally want to replace it with one written in LiveView where it can directly work with the native config data structures, no problem. If we could just get an initial PoC up that controls even one settings group, the rest of the work is easy, if a little tedious, and it could happen fast.
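A PoC like that really is tiny; something along these lines (module name is made up, it only reads one group for now, and it assumes phoenix_live_view is available in the deps):

```elixir
# hypothetical sketch: render one settings group straight from the native config,
# no JSON round-trip required
defmodule Pleroma.Web.AdminConfigLive do
  use Phoenix.LiveView

  @impl true
  def mount(_params, _session, socket) do
    {:ok, assign(socket, group: :instance, settings: Pleroma.Config.get(:instance))}
  end

  @impl true
  def render(assigns) do
    ~H"""
    <h2>Settings group: <%= @group %></h2>
    <pre><%= inspect(@settings, pretty: true) %></pre>
    """
  end
end
```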
@mint@NonPlayableClown@i@hj I just ran into some very serious usability bugs on mobile and when I asked him about it he was like "sorry, working on the Nostr stuff for now" so I just stopped using it
@i@feld@NonPlayableClown@hj Readme still mentions how it's effective enough to run on a Raspberry Pi, meanwhile I'm struggling to run my instance on a somewhat decently specced machine from ~2013 with a SATA SSD. Regarding the jsonb schema, it sure feels wasteful; something like a separate table with likes/repeats/reacts that only references the post, type and user would've probably done the job just as well if not better.
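Roughly something like this, as an Ecto migration sketch (table and column names are just illustrative):

```elixir
defmodule Pleroma.Repo.Migrations.AddInteractions do
  use Ecto.Migration

  def change do
    # one row per like/repeat/react instead of rewriting the object's jsonb blob
    create table(:interactions) do
      add :object_id, references(:objects, on_delete: :delete_all), null: false
      add :user_id, references(:users, type: :uuid, on_delete: :delete_all), null: false
      add :type, :string, null: false   # "like" | "announce" | "react"
      add :emoji, :string               # only used for reacts
      timestamps()
    end

    create unique_index(:interactions, [:object_id, :user_id, :type, :emoji])
    create index(:interactions, [:object_id, :type])
  end
end
```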
@feld@NonPlayableClown@mint@hj just ignore it eventually exploding into a death spiral because your VPS/SBC isn't running on a modern local NVMe RAID and the PostgreSQL autovacuum, having to rewrite all the deep indexes from the constant delete spam, cripples the instance for a few minutes, then hours, then days
it's fine, but just letting it crash loop doesn't leave a nice impression
lain is also right that people should know better and buy better specced servers, but we also lied to them for years about that not being the case
@i@NonPlayableClown@mint@hj I don't get why everyone's so mad about Pleroma's schema. Where's the core issue? Are people just upset that there's jsonb instead of a billion columns? Is there a pervasive impression that those jsonb columns are slow?
@feld@NonPlayableClown@mint@hj glussy's schema is currently running into more issues than even pleroma's buckets of slop; if he doesn't figure it out in time, the project is basically dead on arrival
@i@NonPlayableClown@mint@hj I think we could have a dedicated tombstones table so we can check for deleted there first, would probably help. Needs deeper investigation.
Some of the delete work cascading is annoying and I think we could do better there. I've already helped slow the pain by making deletes lower priority and with a narrower queue.
We have some triggers I am suspicious of and need to investigate deeper, but my memory is that these are totally Postgres FTS related. On that note I think even if you choose to use another search backend we still waste energy on the GIN/RUM indexing...
I think also there is an expensive operation for things like number of likes on a post. I swear I saw this count being embedded directly in the Object JSON, and that's crazy to me if that's how it was implemented. So deletes require rewriting that row? Holy shit, that's gonna bring pain and make your table a Swiss cheese mess over time.
Those post stats (likes, replies, quotes) should probably be in their own table, use FKs, etc. Counts could be cached, and a new activity or a delete should trigger a job to update the count async instead of blocking the commit
Stuff like that could help lower resource instances tremendously
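The async counter piece could be a single small Oban worker, something like this sketch (the interactions and object_stats tables are hypothetical, and the queue choice is arbitrary):

```elixir
defmodule Pleroma.Workers.RefreshStats do
  # recount interactions for one object and cache the result, outside the
  # transaction that inserted or deleted the activity
  use Oban.Worker, queue: :background, max_attempts: 3

  import Ecto.Query
  alias Pleroma.Repo

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"object_id" => object_id}}) do
    likes =
      Repo.one(
        from(i in "interactions",
          where: i.object_id == ^object_id and i.type == "like",
          select: count(i.id)
        )
      )

    Repo.query!(
      """
      INSERT INTO object_stats (object_id, like_count)
      VALUES ($1, $2)
      ON CONFLICT (object_id) DO UPDATE SET like_count = EXCLUDED.like_count
      """,
      [object_id, likes]
    )

    :ok
  end
end

# enqueued from wherever the like/delete is handled:
#   %{object_id: object.id} |> Pleroma.Workers.RefreshStats.new() |> Oban.insert()
```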
Note I just fixed a bug where we appeared to have been updating the "unreachable_since" field for every instance on every successful activity published due to the test not matching reality. Imagine how many wasted queries that is😭😭😭
@feld@mint I think that this theory is the very likely cause. I've updated FluffyTail to the same Oban version bump commit (dbf29cbae4) on Sunday and it finally broke under its own weight. I've seen an increase in CPU iowait and disk backlog alerts that must be associated with the Oban v12 jobs migration, because nothing else has changed. I've also noticed a big increase in DB timeouts in the logs. Granted, it's running on a garbage tier Frantech VPS with the DB being hosted on a slab with dmcache on the OS drive.
I'll leave the instance in its current failed state if you want more information.
My test instance (@phnt@oban.borked.technology) running on the cheapest OVH Cloud VPS is still working perfectly fine after a week and a half of 4x the traffic FluffyTail receives.
@phnt@feld Looks like the queue clogged up on salon again, this time I managed to catch it just five minutes in. The theory might be true; worth noting that postgrex crashes used to cascade into crashing the whole pleroma (#2469), which stopped at around the same time the freezes began. Some (but not all) of my instances were affected, so I had to make a script that curls the API every 30 seconds and restarts Pleroma when receiving no response.
@phnt@feld Honestly crashes are preferable to that, since it can be mitigated with a trivial script. With freezes, you'd have to check the date of the last post in TWKN, and there's no guarantee that it was caused by the bug and not some external factors (e.g. broken VPN tunnel like in my setup).
@mint@phnt okay, my feld/debugging branch has a clean rollback of Oban to the 2.13.6 version. There is a migration that needs to run. The Oban Live Dashboard is not compatible if you were using it, so sorry about that. But let's see if this gives you stability again
@feld@phnt Seems a bit radical as a solution, but okay. I do wonder, though, if moving the Oban queue into a separate database might help with the I/O issues. Maybe even keep it on a ramdisk.
@mint@phnt the I/O should really be minimal which is what's baffling
it is possible to make Oban specifically use a dedicated SQLite database in newer releases (a dedicated postgres is possible too, of course)
but let's roll back to the Oban version you never had issues with and work from there. If it still has problems we know it's something else (Postgrex?)
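For the record, the dedicated-database setup is roughly this (repo module name is made up, and it needs the ecto_sqlite3 dependency plus the Oban migrations run against that repo):

```elixir
# hypothetical repo that holds only the job queue
defmodule Pleroma.ObanRepo do
  use Ecto.Repo,
    otp_app: :pleroma,
    adapter: Ecto.Adapters.SQLite3
end

# config.exs -- point Oban at it and switch to the SQLite engine:
#   config :pleroma, Pleroma.ObanRepo, database: "/var/lib/pleroma/oban.db"
#   config :pleroma, Oban,
#     repo: Pleroma.ObanRepo,
#     engine: Oban.Engines.Lite,
#     queues: [federator_incoming: 5, federator_outgoing: 5]
```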
> The theory might be true; worth noting that postgrex crashes used to cascade into crashing the whole pleroma
Similar thing happened to me: stalled federation, and an hour later all connections on localhost were refused (Pleroma did not listen on its port). When I used IEx to get access to it, IEx had no idea what Oban and Ecto were.
> the I/O should really be minimal which is what's baffling
I've looked through the 2.17.0 release notes and sorentwo mentioned disabling the insert trigger functionality completely in config if "sub-second job execution isn't important." That should disable the insert triggers I suspected of causing the increased I/O and revert to polling only. After changing that with config :pleroma, Oban, insert_trigger: false the I/O did not change in any way. Still the same behavior. At this point I'm kinda lost as to what the issue might be.
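For anyone following along, the change was just this; with the trigger off, Oban only notices new jobs on its periodic staging poll:

```elixir
# rest of the Oban section left at Pleroma's defaults
config :pleroma, Oban,
  insert_trigger: false
```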
@feld@phnt@mint Updated instance to the newest commit on feld/debugging and downgraded Oban. There's no point in me trying to find it on my own as I have no other clues.
@feld@mint@phnt Pleroma crashed again ~1 minute after I made a post. The federator_incoming queue had 0 available jobs and a few retryable ones. federator_outgoing had 7 failed jobs and zero available/executing.
Same thing as last time. Out of nowhere there was a jump in disk backlog for a minute, along with disk busy time and Pleroma DB locks. I had almost zero DB timeouts before that.
Before the crash, a lot of "(DBConnection.ConnectionError) connection not available and request was dropped from queue after <some number>ms. This means requests are coming in and your connection pool cannot serve them fast enough." errors showed up in the logs. Pleroma used at most 12 DB connections. The number of connections and pool size are from the default config; only :pleroma, :connections_pool, connect_timeout was increased to 10s from the default 5s, and :pleroma, Pleroma.Repo, timeout was also increased to 30s.
The Netdata screenshots are from the same time. Ignore the time difference. Server is UTC-4 (US ET) and Netdata is UTC+2 (CEST).
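For context, these are the knobs involved; only the two timeouts were actually changed here, and the queue_* values are the DBConnection defaults shown for reference:

```elixir
config :pleroma, Pleroma.Repo,
  timeout: 30_000,       # per-query timeout, raised to 30s
  queue_target: 50,      # how long (ms) checkouts may queue before DBConnection
  queue_interval: 1_000  # starts dropping requests within this interval

config :pleroma, :connections_pool,
  connect_timeout: 10_000  # outbound connection pool, raised from the 5s default
```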
@phnt@phnt@mint I'm a little confused about why there are duplicate/retry jobs firing so quickly, as there are some things in the logs that shouldn't be happening in succession like that. They may not be the root cause, but they're definitely adding to the pressure and we can address it