Fediverse

A community to talk about the Fediverse and all its related services using ActivityPub (Mastodon, Lemmy, KBin, etc.).

If you want help with moderating your own community, head over to [email protected]!

Rules

Posts must be on topic.
Be respectful of others.
Cite the sources used for graphs and other statistics.
Follow the general Lemmy.world rules.

Learn more at these websites: Join The Fediverse Wiki, Fediverse.info, Wikipedia Page, The Federation Info (Stats), FediDB (Stats), Sub Rehab (Reddit Migration), Search Lemmy

founded 2 years ago

Why are several instances down simultaneously? Is it just me? (lemm.ee)

submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/[email protected]

Can't post images because they're too big so here's imgur: https://imgur.com/a/Fm52ZTB

Edit: lemmy.ml and lemmy.world seem to have come back. I'm just a bit worried that it's another one of those hacks.

Edit 2: Most of those I've tried came back. reddthat.com and sh.itjust.works seem to still be down.

[–] [email protected] 10 points 1 year ago (2 children)

There is a GitHub issue on it, and I experienced the exact same thing with my instance. A timeout occurs, and it seems the only way to fix it is to restart. Like everyone else, I find it strange that it all happened at the same time.

[–] [email protected] 5 points 1 year ago* (last edited 1 year ago) (2 children)

It's not that strange. A timeout occurs on several servers overnight, and maybe a bunch of Lemmy instances are all run in the same timezone, so all their admins wake up around the same time and fix it.

Well, it's a timeout, so by fixing it at the same time the admins have "synchronized" when timeouts across their servers are likely to occur again, since the failure is tangentially related to time. They're likely to all fail again around the same moment.

It's kind of similar to the thundering herd problem, where a bunch of clients that hit errors synchronize their retries into one giant herd and strain the server. It's why good clients add exponential backoff AND jitter (a little bit of randomness in when the retry happens, not just a fixed wait of 2^x seconds after the x-th failure). That way, if a million clients all got an error at the same moment your server failed, it's much less likely that all 1,000,000 of them will retry at the exact same time.
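For anyone who hasn't seen backoff with jitter in practice, here's a rough sketch of the idea. This is just one common variant (sometimes called "full jitter"), not what Lemmy or any particular client actually does. The names `backoff_delay` and `retry`, and the defaults for `base`, `cap`, and `max_attempts`, are made up for illustration.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds.

    The ceiling grows exponentially with the attempt number, and the random
    draw below that ceiling is the jitter that spreads clients apart.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry(operation, max_attempts=5, base=1.0, cap=60.0,
          sleep=time.sleep, rng=random):
    """Call operation() until it succeeds or max_attempts is reached.

    Hypothetical helper: sleep and rng are injectable so the behavior can be
    checked without actually waiting.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap, rng))
```

Without the random draw, every client that failed at the same moment would wait exactly 1, 2, 4, 8... seconds and hit the server again in lockstep. With it, the retries get smeared across the whole window.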

Edit: looked at the ticket and it's not exactly the kind of timeout I was thinking of.

This timeout might be caused by something that's loosely a function of time or resource usage. If it's resource usage, then because the servers are federated, those spikes might happen across servers at once as everything pushes events to subscribers. So the failure gets synchronized.

Or it could just be a coincidence. We as humans like to look for patterns in random events.

[–] [email protected] 3 points 1 year ago

Interesting. Never thought of it that way.

[–] [email protected] 1 points 1 year ago

Interesting

[–] [email protected] 3 points 1 year ago (1 children)

wrong issue lol

[–] [email protected] 1 points 1 year ago (1 children)

This probably makes more sense, although the issue I was experiencing earlier had similar logs to the issue I linked, and others commented on it around the same time too. I'm guessing they're related.

[–] [email protected] 2 points 1 year ago (1 children)

The original issue is just a symptom of all database threads being tied up. People just don't know how to follow an error message to the root cause.

The real source of the issue is database locking from triggers and cascading deletes on a major user change.

My report in https://github.com/LemmyNet/lemmy/issues/3649 has the offending query.

[–] [email protected] 2 points 1 year ago

Thanks for clarifying.