tl;dr summary
furry.engineer and pawb.fun will be down for several hours this evening (5 PM Mountain Time onward) as we migrate data from the cloud to local storage. We’ll post updates via our announcements channel at https://t.me/pawbsocial.
In order to reduce costs and expand our storage pool, we’ll be migrating data from our existing Cloudflare R2 buckets to local replicated network storage, and from Proxmox-based LXC containers to Kubernetes pods.
Currently, according to Mastodon, we’re using about 1 TB of media storage, but according to Cloudflare, we’re using nearly 6 TB. This appears to be due to Cloudflare R2’s implementation of the underlying S3 protocol that Mastodon uses for cloud-based media storage, which is preventing Mastodon from properly cleaning up files that are no longer used.
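To illustrate the kind of mismatch described above, here is a minimal Python sketch of how one might spot objects that exist in a bucket but are no longer referenced by the database. The key names and sizes are made-up sample data, and both helper functions are hypothetical; this is not how Mastodon or R2 does its own accounting.

```python
def find_orphaned_keys(bucket_keys, referenced_keys):
    """Return bucket keys that no database record points at, sorted."""
    return sorted(set(bucket_keys) - set(referenced_keys))


def orphaned_bytes(bucket_sizes, referenced_keys):
    """Sum the sizes (in bytes) of objects that are not referenced."""
    referenced = set(referenced_keys)
    return sum(size for key, size in bucket_sizes.items() if key not in referenced)


# Hypothetical sample data: object key -> size in bytes.
bucket_sizes = {
    "media_attachments/1.png": 100,
    "media_attachments/2.png": 250,
    "cache/accounts/avatars/3.png": 50,
}
# Hypothetical list of keys the database still knows about.
referenced = ["media_attachments/1.png"]

orphans = find_orphaned_keys(bucket_sizes, referenced)
wasted = orphaned_bytes(bucket_sizes, referenced)
print(orphans, wasted)
```

In practice the two inputs would come from a bucket listing and a database query; the set difference is only the comparison step.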
As part of the move, we’ll be creating / using new Docker-based images for Glitch-SOC (the fork of Mastodon we use) and hooking that up to a dedicated set of database nodes and replicated storage through Longhorn. This should allow us to seamlessly move the instances from one Kubernetes node to another for performing routine hardware and system maintenance without taking the instances offline.
We’re planning to roll out the changes in several stages:
- Taking furry.engineer and pawb.fun down for maintenance to prevent additional media being created.
- Initiating a transfer from R2 to the new local replicated network storage for locally generated user content first, then remote media. (This will happen in parallel to the other stages, so some media may be unavailable until the transfer fully completes.)
- Exporting and re-importing the databases from their LXC containers to the new dedicated database servers.
- Creating and deploying the new Kubernetes pods, and bringing one of the two instances back online, pointing at the new database and storage.
- Monitoring for any media-related issues, and bringing the second instance back online.
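The "local content first, then remote media" ordering in the transfer stage can be sketched roughly as below. This is only an illustration: it assumes remote (cached) media lives under a `cache/` key prefix, the function name `transfer_order` is hypothetical, and the real transfer tooling is not described in this post.

```python
def transfer_order(keys, remote_prefix="cache/"):
    """Order object keys so locally generated content is copied first.

    Assumption: anything under `remote_prefix` is cached remote media,
    and everything else was uploaded by local users.
    """
    local = sorted(k for k in keys if not k.startswith(remote_prefix))
    remote = sorted(k for k in keys if k.startswith(remote_prefix))
    return local + remote


# Hypothetical sample keys.
sample_keys = [
    "cache/custom_emojis/blobcat.png",
    "media_attachments/files/post.png",
    "cache/accounts/avatars/remote.png",
    "accounts/avatars/local.png",
]

for key in transfer_order(sample_keys):
    print(key)
```

Copying in this order means a user's own uploads become available before cached copies of other instances' media, which can be re-fetched from their origin if needed.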
We’ll be beginning the maintenance window at 5 PM Mountain Time (4 PM Pacific Time) and have no ETA at this time. We’ll provide updates through our existing Telegram announcements channel at https://t.me/pawbsocial.
During this maintenance window, furry.engineer and pawb.fun will be unavailable until the maintenance has concluded. Our Lemmy instance at pawb.social will remain online, though you may experience longer than normal load times due to high network traffic.
Finally and most importantly, I want to thank those who have been donating through our Ko-Fi page as this has allowed us to build up a small war chest to make this transfer possible through both new hardware and the inevitable data export fees we’ll face bringing content down from Cloudflare R2.
Going forward, we’re looking into providing additional fediverse services (such as Pixelfed) and extending our data retention length to allow us to maintain more content for longer, but none of this would be possible if it weren’t for your generous donations.
I wish you guys had posted that Ko-Fi link in the sidebar. I’d been looking for a way to donate for a bit, but been too lazy to contact you guys to find out.
Didn’t even think to do that! Added to the sidebar :3
Status update
Both instances are back online again! We’re currently transferring cached media from remote instances to the local storage, so avatars, emojis, and older attachments may currently appear as broken images.
As of Feb 7th at 10:45 PM Mountain Time, pawb.fun has re-generated all feeds, while furry.engineer is continuing with an estimated 25 minutes to go. We’re also re-generating the Elasticsearch indices which power the full-text search system and expect that to continue through the night.
It appears I’m still having some issues with pictures and media, doesn’t seem like it’s limited to a specific instance or time though.
Might this be due to either the migration, today’s issues or something else?
Having issues with Emojis on the (Pawb.fun) instance, all external ones, from other users appear broken, they show up as the text and glitch when hovered over. Reached out on Mastodon earlier about this, thought I’d also message here too.
@Draconic_NEO @crashdoom ditto here, but it seems to be more of an issue with servers we regularly interact with, like tech.lgbt, and less so for uncertain others.
Looks like furry.engineer is down?
I’m seeing the same here, something about an Argo tunnel error. @crashdoom@pawb.social
Aware and investigating!
and that’s why you’re the best <3
pawb.fun as well. Something got fucky wucky during the migration, it seems.
Correct! to give a bit of background while I wait for backups…
last night we had what appears to be an out of memory error. Our cloudflare tunnels broke around the same time that the internet went out (probably related), and we also didn’t have our nodes configured to keep some ram reserved to allow kubernetes to keep running. Additionally, we still only had 1 replica of the data for furry.engineer and pawb.fun that we were still building/downloading from other instances (mostly cached images).
so it was the perfect storm. node 1 runs out of memory and basically crashes, node 2 then tries to pick up the services that are suddenly offline, immediately causing it to run out of memory and crash. There’s only one copy of the data, so nothing offline to check for corruption against. all the storage with 2 replicas was unaffected.
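For anyone curious, the failure pattern described above can be reproduced with a toy model. This is a rough Python sketch, not our actual setup: the node names, memory figures, and the `simulate_cascade` function are all hypothetical, and a real Kubernetes scheduler makes its decisions differently.

```python
def simulate_cascade(capacity, loads):
    """Toy failover model.

    Any node whose load exceeds its capacity crashes, and its whole load
    is rescheduled onto the surviving node with the most free memory.
    Returns the list of crashed nodes in the order they failed.
    """
    loads = dict(loads)
    alive = set(loads)
    crashed = []
    while True:
        over = [n for n in sorted(alive) if loads[n] > capacity[n]]
        if not over:
            break
        node = over[0]
        alive.remove(node)
        crashed.append(node)
        orphaned = loads.pop(node)
        if not alive:
            break
        # Reschedule onto the survivor with the most headroom.
        target = max(sorted(alive), key=lambda n: capacity[n] - loads[n])
        loads[target] += orphaned
    return crashed


# Hypothetical numbers in GiB: two nodes, no spare headroom.
two_nodes = simulate_cascade(
    {"node1": 64, "node2": 64},
    {"node1": 66, "node2": 40},
)
# Same incident with a third, larger node available.
three_nodes = simulate_cascade(
    {"node1": 64, "node2": 64, "node3": 128},
    {"node1": 66, "node2": 40, "node3": 20},
)
print(two_nodes, three_nodes)
```

With only two nodes, the second one inherits more than it can hold and goes down too. With enough spare capacity somewhere in the cluster, the failure stops at the first node.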
I’ve done an announcement post on the telegram channel to try and keep people apprised, but this restore is going to take another couple hours probably because I’m trying not to repeat my mistakes by setting things to 1 replica or skipping backups for expediency. My impatience pretty directly caused this issue.
SysAdmin lesson learned, always make the backups :3
Lessons do stick around when you have to learn the hard way!
node 1 runs out of memory and basically crashes, node 2 then tries to pick up the services that are suddenly offline, immediately causing it to run out of memory and crash
Oof, that’s pretty much a cascading failure
Actually yes. Recovery was slow and painful, but I have policies in place to handle these failures now. I’m sure we will find another failure mode as we go forward!
Moar storage!
Will the data be synched or backed up off site?
Yes, we’ll be maintaining:
- Multiple replicas across different disks (local)
- Hourly and daily snapshots (local)
- Regular off-site backups for disaster recovery
Local hardware horse here!
To elaborate a bit, the storage replicas will span three physical servers in realtime, all of which get snapshots hourly in case we need a rollback, and full backups weekly to a fourth system on mechanical drives with 2-disk failure tolerance. This should mean that data loss requires 4 simultaneous system failures.
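As a quick sanity check on the "4 simultaneous failures" figure, here is a small Python sketch. The system names are placeholders, and the probability helper assumes failures are independent with a made-up per-system probability, which real hardware sharing a rack and a power feed will not strictly satisfy.

```python
def data_survives(failed_systems,
                  replicas=("storage-1", "storage-2", "storage-3"),
                  backups=("backup-1",)):
    """True if at least one system holding a copy is still up."""
    copies = set(replicas) | set(backups)
    return bool(copies - set(failed_systems))


def all_fail_probability(p_single, systems=4):
    """Chance that every system fails at once, assuming independence."""
    return p_single ** systems


# Losing all three storage servers still leaves the backup system.
print(data_survives(["storage-1", "storage-2", "storage-3"]))
# Only losing all four systems at once loses every copy.
print(data_survives(["storage-1", "storage-2", "storage-3", "backup-1"]))
# Hypothetical 1% chance per system in a given window.
print(all_fail_probability(0.01))
```

Note that this counts copies only. Since the full backups are weekly, the copy on the fourth system can be up to a week behind the live replicas.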
We have a tape library for automated tape backups, but can’t afford a drive upgrade just yet to make it make sense. The drives are often several thousand dollars, but the tape media is cheap.
Offsite backups are currently in the works, though if anyone has recommendations I would love to add them to our list for consideration.
If anyone has additional questions or suggestions I would be happy to answer tomorrow!
Sorry to necro this, but a few content issues are still lingering. A lot of previously fetched posts from the last 30 days still don’t load, and on our side, emoji from some instances refuse to load (most notably tech.lgbt).