Discourse has a built-in backup process, which takes all of the posts, (optionally) images, etc. and bundles them into an archive. This process runs once per day and currently generates a file that is ~14GiB. I take the current set of four files and back them up off-site using a tool called restic, which encrypts and de-duplicates them for security and storage efficiency.
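For context, the offsite script boils down to a single `restic backup` invocation run on a schedule. A minimal sketch in Python, where the backup path and repository target are placeholders rather than the exact setup:

```python
#!/usr/bin/env python3
"""Minimal sketch of the offsite backup step (paths and repository are placeholders)."""
import os
import subprocess

# Assumed location of the Discourse-generated backup archives.
BACKUP_DIR = "/var/discourse/shared/standalone/backups/default"

# restic reads the repository location and password from environment variables.
env = dict(os.environ)
env["RESTIC_REPOSITORY"] = "b2:my-offsite-bucket:discourse"   # assumed offsite target
env["RESTIC_PASSWORD_FILE"] = "/root/.restic-password"        # assumed password file

# Each run creates a new snapshot; restic encrypts and de-duplicates the data it uploads.
subprocess.run(["restic", "backup", BACKUP_DIR], env=env, check=True)
```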
Do you see the problem yet?
So I went to check on how much space backups were taking in our offsite storage, and was surprised to discover that it had grown to over 1TiB. Luckily that only costs me ~$6/month in storage fees at the offsite location, but it's roughly 70 times the size of a single 14GiB backup.
How did this happen? It’s pretty simple, but a little nuanced. Imagine that the current contents of the backup directory on the server hosting this site look like this:
- elsewhere-cafe-2024-05-27.tar.gz (14GiB)
- elsewhere-cafe-2024-05-28.tar.gz (14GiB)
- elsewhere-cafe-2024-05-29.tar.gz (14GiB)
- elsewhere-cafe-2024-05-30.tar.gz (14GiB)
Running my offsite backup script for the first time will back up approximately 56GiB worth of data.
The next day, the backup directory on the server looks like this:
- elsewhere-cafe-2024-05-28.tar.gz (14GiB)
- elsewhere-cafe-2024-05-29.tar.gz (14GiB)
- elsewhere-cafe-2024-05-30.tar.gz (14GiB)
- elsewhere-cafe-2024-05-31.tar.gz (14GiB)
Discourse created a new backup and automatically deleted the oldest one, keeping only the four most recent. Running my offsite backup script again will back up approximately 14GiB of additional data: the three older archives are unchanged, so only the new one needs to be uploaded. Assuming this process is run daily, it will add that same amount each day, and the backup archive will just keep growing.
Why is that? Each time the backup script is run, it creates a snapshot of the data as it exists at that moment. Since 25% of the files are deleted and 25% are new each day, two adjacent snapshots share 75% of the same data, which is why each run only adds ~14GiB. But unless some of the snapshots are forgotten and pruned, the backup repository keeps every chunk of data that any snapshot still references, so its size just keeps growing. This is by design, of course, but most backup scenarios involve data that grows incrementally rather than being completely replaced every four days, so you can probably see how this ends up being contrary to initial expectations.
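To put numbers on "just keep growing": starting from the ~56GiB first run and adding ~14GiB per day, the repository passes 1TiB in a little over two months.

```python
# Back-of-the-envelope growth, using the sizes from the example above:
# the first run stores ~56GiB (4 x 14GiB), and every later run adds one new
# ~14GiB archive, because the three older archives de-duplicate away.
initial_gib = 4 * 14          # first snapshot: four 14GiB archives
daily_growth_gib = 14         # each later snapshot adds one new archive
target_gib = 1024             # 1TiB

days = (target_gib - initial_gib) / daily_growth_gib
print(f"~{days:.0f} days of daily snapshots to reach 1TiB")  # ~69 days
```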
In any case, I've manually run a command to forget and prune all but a certain number of snapshots. This has brought the storage requirements down by a significant amount. Now I need to figure out what strategy to use for ongoing maintenance so that I don't end up with a 1TiB archive, while also keeping a reasonable number of snapshots around for peace of mind.
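For reference, the cleanup amounts to restic's `forget` and `prune`. A sketch of what an ongoing policy might look like, with placeholder retention numbers since I haven't settled on the real policy yet:

```python
#!/usr/bin/env python3
"""Sketch of an ongoing retention step (retention numbers and repository are placeholders)."""
import os
import subprocess

env = dict(os.environ)
env["RESTIC_REPOSITORY"] = "b2:my-offsite-bucket:discourse"   # assumed offsite target
env["RESTIC_PASSWORD_FILE"] = "/root/.restic-password"        # assumed password file

# Keep the last week of daily snapshots plus a few monthly ones, and prune the
# no-longer-referenced data so the repository actually shrinks.
subprocess.run(
    ["restic", "forget", "--keep-daily", "7", "--keep-monthly", "3", "--prune"],
    env=env,
    check=True,
)
```

Run right after the daily backup, something like this would hold the repository at a roughly steady size instead of letting it grow without bound.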