[Maintenance] Feb 7 - Mastodon Data Migration

Crashdoom@pawb.social · edit-2 9 months ago

Kovukono@pawb.social · 9 months ago

I wish you guys had posted that Ko-Fi link in the sidebar. I’d been looking for a way to donate for a bit, but I’ve been too lazy to contact you guys to find out.

Crashdoom@pawb.social · 9 months ago

Didn’t even think to do that! Added to the sidebar :3

Crashdoom@pawb.social · 9 months ago

Status update

Both instances are back online again! We’re currently transferring cached media from remote instances to the local storage, so avatars, emojis, and older attachments may currently appear as broken images.

As of Feb 7th at 10:45 PM Mountain Time, pawb.fun has re-generated all feeds, while furry.engineer is continuing with an estimated 25 minutes to go. We’re also re-generating the Elasticsearch indices which power the full-text search system and expect that to continue through the night.

proto_phantom@pawb.social · 9 months ago

It appears I’m still having some issues with pictures and media. It doesn’t seem to be limited to a specific instance or time, though.

Might this be due to either the migration, today’s issues or something else?

Draconic NEO@pawb.social · edit-2 9 months ago

I’m having issues with emojis on the pawb.fun instance: all external ones from other users appear broken. They show up as text and glitch when hovered over. I reached out on Mastodon earlier about this and thought I’d also post here.

LiquidParasyte@pawb.fun · 9 months ago

@Draconic_NEO @crashdoom ditto here, but it seems to be more of an issue with servers we regularly interact with, like tech.lgbt, and less so for uncertain others.

huxley@pawb.social · 9 months ago

Looks like furry.engineer is down?

Stefen Auris@pawb.social · 9 months ago

I’m seeing the same here, something about an Argo tunnel error. @crashdoom@pawb.social

Crashdoom@pawb.social · 9 months ago

Aware and investigating!

Stefen Auris@pawb.social · 9 months ago

and that’s why you’re the best <3

liquidparasyte@pawb.social · edit-2 9 months ago

pawb.fun as well. Something got fucky wucky during the migration, it seems.

natebluehooves@pawb.social · 9 months ago

Correct! To give a bit of background while I wait for backups…

Last night we had what appears to be an out-of-memory error. Our Cloudflare tunnels broke around the same time that the internet went out (probably related), and we also didn’t have our nodes configured to keep some RAM reserved to allow Kubernetes to keep running. Additionally, we still only had 1 replica of the data for furry.engineer and pawb.fun that we were still building/downloading from other instances (mostly cached images).

So it was the perfect storm. Node 1 runs out of memory and basically crashes, node 2 then tries to pick up the services that are suddenly offline, immediately causing it to run out of memory and crash. There’s only one copy of the data, so nothing offline to check for corruption against. All the storage with 2 replicas was unaffected.
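To make the cascade above concrete, here is a toy sketch in Python. It is not how Kubernetes actually schedules pods or how the real cluster was set up. The node sizes, the service names (`service-a`, `service-b`), and the `enforce_limits` switch are all made-up assumptions for illustration. It only shows why rescheduling onto a node with no headroom takes that node down too, and why holding workloads back when they don't fit stops the chain.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity_gb: float
    reserved_gb: float = 0.0  # memory held back for the node's own system services
    workloads: dict = field(default_factory=dict)  # workload name -> memory in GB
    alive: bool = True

    def used_gb(self):
        return sum(self.workloads.values())

    def allocatable_gb(self):
        return self.capacity_gb - self.reserved_gb


def build_cluster(reserved_gb):
    # Hypothetical two-node cluster; the numbers are illustrative only.
    return [
        Node("node1", 16.0, reserved_gb, {"service-a": 10.0}),
        Node("node2", 16.0, reserved_gb, {"service-b": 10.0}),
    ]


def fail_node(nodes, failed_name, enforce_limits):
    """Fail one node, reschedule its workloads, and report the fallout.

    Returns (names of crashed nodes in order, sorted names of workloads
    that could not be placed anywhere).
    """
    by_name = {n.name: n for n in nodes}
    crashed = []
    pending = {}
    unschedulable = {}

    def take_down(node):
        node.alive = False
        crashed.append(node.name)
        pending.update(node.workloads)
        node.workloads = {}

    take_down(by_name[failed_name])
    while pending:
        name, mem = pending.popitem()
        live = [n for n in nodes if n.alive]
        if enforce_limits:
            # Only place the workload where it fits in allocatable memory.
            target = next(
                (n for n in live if n.used_gb() + mem <= n.allocatable_gb()),
                None,
            )
        else:
            # Naive behaviour: pile everything onto the first live node.
            target = live[0] if live else None
        if target is None:
            unschedulable[name] = mem
            continue
        target.workloads[name] = mem
        if target.used_gb() > target.capacity_gb:
            take_down(target)  # the overloaded node runs out of memory too
    return crashed, sorted(unschedulable)


if __name__ == "__main__":
    print(fail_node(build_cluster(0.0), "node1", enforce_limits=False))
    print(fail_node(build_cluster(2.0), "node1", enforce_limits=True))
```

In the second call one service stays offline until capacity comes back, but the surviving node keeps running, which is the trade-off that reserving memory buys.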

I’ve done an announcement post on the telegram channel to try and keep people apprised, but this restore is going to take another couple hours probably because I’m trying not to repeat my mistakes by setting things to 1 replica or skipping backups for expediency. My impatience pretty directly caused this issue.

Vincent Hayes@pawb.social · 9 months ago

SysAdmin lesson learned, always make the backups :3

natebluehooves@pawb.social · 9 months ago

Lessons do stick around when you have to learn the hard way!

Exec@pawb.social · 9 months ago

node 1 runs out of memory and basically crashes, node 2 then tries to pick up the services that are suddenly offline, immediately causing it to run out of memory and crash

Oof, that’s pretty much a cascading failure

natebluehooves@pawb.social · 9 months ago

Actually yes. Recovery was slow and painful, but I have policies in place to handle these failures now. I’m sure we will find another failure mode as we go forward!

Spitfire@pawb.social · 9 months ago

Moar storage!

Frosty@pawb.social · 9 months ago

Will the data be synched or backed up off site?

Crashdoom@pawb.social · 9 months ago

Yes, we’ll be maintaining:

Multiple replicas across different disks (local)
Hourly and daily snapshots (local)
Regular off-site backups for disaster recovery

natebluehooves@pawb.social · 9 months ago

Local hardware horse here!

To elaborate a bit, the storage replicas will span three physical servers in real time, all of which get snapshots hourly in case we need a rollback, and full backups weekly to a fourth system on mechanical drives with 2-disk failure tolerance. This should mean that data loss requires 4 simultaneous system failures.
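As a rough sanity check on the "4 simultaneous failures" figure, the layout described above can be modelled in a few lines of Python. This is only a sketch: the system names are hypothetical, and it treats each server and the backup box as a single unit that either works or doesn't (ignoring the 2-disk tolerance inside the backup system and any partial corruption).

```python
from itertools import combinations

# Hypothetical names for the three replica servers and the backup system.
REPLICA_SERVERS = {"server-1", "server-2", "server-3"}
BACKUP_SYSTEM = "backup-1"


def recovery_state(failed):
    """Classify what is left after the given set of systems has failed."""
    failed = set(failed)
    if REPLICA_SERVERS - failed:
        return "live"  # at least one real-time replica is still up
    if BACKUP_SYSTEM not in failed:
        return "weekly-backup"  # recoverable, but up to a week stale
    return "lost"


def min_failures_for_loss():
    """Smallest number of simultaneous failures that loses the data."""
    systems = sorted(REPLICA_SERVERS | {BACKUP_SYSTEM})
    for k in range(1, len(systems) + 1):
        for combo in combinations(systems, k):
            if recovery_state(combo) == "lost":
                return k
    return None


if __name__ == "__main__":
    print(min_failures_for_loss())
    print(recovery_state({"server-1", "server-2", "server-3"}))
```

Under these assumptions, losing all three replica servers at once is still recoverable, just from a backup that may be up to a week old.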

We have a tape library for automated tape backups, but can’t afford a drive upgrade just yet to make it make sense. The drives are often several thousand dollars, but the tape media is cheap.

Offsite backups are currently in the works, though if anyone has recommendations I would love to add them to our list for consideration.

If anyone has additional questions or suggestions I would be happy to answer tomorrow!

liquidparasyte@pawb.social · 9 months ago

Sorry to necro this, but a few content issues are still lingering. A lot of previously fetched posts from the last 30 days still don’t load, and on our side the emoji from some instances refuse to load (most notably tech.lgbt).