A mild warning about backups

Discourse has a built-in backup process that takes all of the posts, (optionally) images, and so on, and bundles them into an archive. This process runs once per day and currently generates a file that is ~14GiB. I take the current set of four files and back them up off-site using a tool called restic, which encrypts and de-duplicates them for security and storage efficiency.
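For context, the off-site script boils down to a single restic invocation, something like the sketch below. The repository URL, password file, and backup path here are illustrative placeholders, not my actual configuration:

    # Hypothetical off-site backup of the Discourse archive directory.
    # Repository URL, password file, and path are placeholders.
    export RESTIC_REPOSITORY="s3:s3.example.com/forum-backups"
    export RESTIC_PASSWORD_FILE="/root/.restic-password"
    restic backup /var/discourse/shared/standalone/backups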

Do you see the problem yet?

So I went to check on how much space backups were taking in our offsite storage, and was surprised to discover that it had grown to over 1TiB. Luckily that only costs me ~$6/month in storage fees at the offsite location, but it’s nearly two orders of magnitude more than a single 14GiB archive; at ~14GiB of new data per day, that works out to only about two and a half months of accumulation.

How did this happen? It’s pretty simple, but a little nuanced. Imagine that the current contents of the backup directory on the server hosting this site look like this:

  • elsewhere-cafe-2024-05-27.tar.gz (14GiB)
  • elsewhere-cafe-2024-05-28.tar.gz (14GiB)
  • elsewhere-cafe-2024-05-29.tar.gz (14GiB)
  • elsewhere-cafe-2024-05-30.tar.gz (14GiB)

Running my offsite backup script for the first time will back up approximately 56GiB worth of data.

The next day, the backup directory on the server looks like this:

  • elsewhere-cafe-2024-05-28.tar.gz (14GiB)
  • elsewhere-cafe-2024-05-29.tar.gz (14GiB)
  • elsewhere-cafe-2024-05-30.tar.gz (14GiB)
  • elsewhere-cafe-2024-05-31.tar.gz (14GiB)

Discourse created a new backup and automatically deleted the oldest one, keeping only the four most recent. Running my offsite backup script again will back up approximately 14GiB of additional data. Assuming this process runs daily, it will add that much again each day, and the backup repository will just keep growing.

Why is that? Each time the backup script runs, it creates a snapshot of the data as it exists at that moment. Because the data changes such that 25% of the files are deleted and 25% are new, two adjacent snapshots share 75% of their data, and the 25% that differs is retained in every snapshot. Unless some of the snapshots are forgotten and pruned, the size of the backup repository will just keep growing. This is by design, of course, but most backup scenarios involve data that grows incrementally rather than being completely replaced every four days, so you can probably see how this ends up being contrary to initial expectations.
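You can see this effect in your own repository by comparing restic’s logical and physical size reports:

    # Logical size: what restoring every snapshot in full would require.
    restic stats --mode restore-size

    # Physical size: the de-duplicated data actually stored in the repository.
    restic stats --mode raw-data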

In any case, I’ve manually run a command to forget and prune all but a certain number of snapshots. This has brought the storage requirements down by a significant amount. Now I need to figure out what strategy to use for ongoing maintenance so that I don’t end up with a 1TiB repository, while also keeping a reasonable number of snapshots for peace of mind.
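For the curious, the cleanup is a single restic command along these lines; the retention count is a placeholder, since I haven’t settled on a number yet:

    # Keep only the most recent snapshots and delete unreferenced data.
    # --keep-last 10 is illustrative, not a recommendation.
    restic forget --keep-last 10 --prune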

9 Likes

… maybe then use this “snapshot” tool for the domain it was designed for, the actual website that grows incrementally, not the tar.gz backups :thinking:

1 Like

The website itself doesn’t grow; just the database and uploaded assets do. It’s generally bad form to make copies of the live files, and since I would need the backup that Discourse creates in order to restore the site, backing up those archives is really the best course of action. I agree that I wasn’t using the snapshots correctly, which is the main lesson I learned, but more generally I think you don’t want an ever-growing set of snapshots anyway. My strategy going forward is likely to be something like “Keep the last 5, plus 1 from each of the last several months.” That way I should have several very recent backups, plus some older ones to guard against recently corrupted backups. I should be able to automate that policy so that I take a snapshot, then forget any snapshots that don’t meet the policy, and prune them to reduce storage usage.
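That policy maps fairly directly onto restic’s retention flags. A minimal sketch of the nightly job, assuming “several months” means six and reusing the same placeholder path as before:

    # Nightly backup followed by retention enforcement.
    # The path and the monthly count are illustrative assumptions.
    restic backup /var/discourse/shared/standalone/backups
    restic forget --keep-last 5 --keep-monthly 6 --prune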

2 Likes

… so would that be 5 gzipped tars then, or 5 × 4 gzipped tars, for a total of 20 copies of 8 things :thinking:

It wouldn’t be 20 copies, due to the de-duplication. Suppose the four archives on the server are files A through D, and one is replaced each day:

Snapshot 1:

  • A
  • B
  • C
  • D

Snapshot 2:

  • B
  • C
  • D
  • E

B, C, and D would be additional references to the same files in snapshot 1.
E would be newly added.

Snapshot 3:

  • C
  • D
  • E
  • F

C and D would be additional references to the same files in snapshot 1.
E would be an additional reference to the same file in snapshot 2.
F would be newly added.

Snapshot 4:

  • D
  • E
  • F
  • G

D would be an additional reference to the same file in snapshot 1.
E would be an additional reference to the same file in snapshot 2.
F would be an additional reference to the same file in snapshot 3.
G would be newly added.

Snapshot 5:

  • E
  • F
  • G
  • H

E would be an additional reference to the same file in snapshot 2.
F would be an additional reference to the same file in snapshot 3.
G would be an additional reference to the same file in snapshot 4.
H would be newly added.

So across 5 snapshots, only 8 files would actually be stored. Once snapshot 6 is created, snapshot 1 is forgotten and file A is pruned, while file I is added. There would still be 5 snapshots and 8 files. That should keep the overall backup size relatively consistent, increasing only as the individual backup files themselves grow in size.
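restic can confirm this directly: diffing two adjacent snapshots should show exactly one archive added and one removed. A sketch, with placeholder snapshot IDs:

    # Compare adjacent snapshots; expect one file added, one removed.
    # The snapshot IDs are placeholders.
    restic diff 1a2b3c4d 5e6f7a8b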

4 Likes