INLS 183 Project 3: Backup with tar, gzip, find, date, and cron

Introduction

Now that I had Samba running on a computer at work, I urgently needed an automated backup utility to protect the 150+MB of data that my users are sharing. After installing Samba, Apache, and PHP, I had the impression that there would also be some super-tool to perform automated backups. Though I did find some sketchy utilities, I realized that I already had all the tools I needed---I just had to get them to work together.

My task was to create a script that would recursively tar and gzip all of the files under /export/samba/fileshare/ every night. I knew it would be wasteful and time-consuming to back up all 150+MB every night, so I decided to back up the entire fileshare weekly, and then daily back up only those files that had been modified in the last day. The fileshare isn't exactly what you'd call 'mission-critical'; its sole purpose is to make those 150+MB of data available to 10 users, so I wasn't planning on backing the system up to tape---I just wanted to store the tar.gz files in another directory. [The purpose of backing up, of course, being more to protect against accidental file/folder deletion by humans than against machine failure.] Since multiple backup archives would be saved in the same directory, I wanted the filename to indicate the date. And lastly, I only wanted to store up to a month of backups, so I knew I'd also need a purge script that would delete files in the backup directory that were more than a month old. The steps appear pretty clear-cut, but getting everything to work together was a challenge. Below I chart my manual investigation of each step (utility) and then slowly approach the unified and automated backup process.

Date

Date is one of those quintessential Unix utilities that perform a single simple task, and perform it well. I was familiar with date, having used it before to find the time:

$ date
Tue Sep 26 22:14:34 EDT 2000

and after having skimmed O'Reilly's Linux in a Nutshell, I knew date's output could be modified and embellished using format options (%A %b %d) much like the dynamic web-scripting work I had done previously (SSI, PHP). My challenge was to get date's customized output into a filename. I decided the following date output was explicit enough for my backup files:

$ date '+%d%b'
26Sep

In this example, the '+' (plus sign) introduces the formatting options, and %d and %b represent the day of the month and the abbreviated month name. Next, I brushed up on the BASH shell to research redirections and pipes. I thought I'd have to use "" (quotes) or > (redirection) to get date's output into a filename, but after some guessing and searching, I discovered that I needed to use backticks (`) for command substitution. In other words, I could embed the date command, surrounded by backticks, directly in the filename. I tested it with vi:

$ vi test.`date +'%d%b'`.attempt1
"test.26Sep.attempt1" [New File]

And it worked!
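The same substitution can also be sanity-checked with a plain echo. A minimal sketch (the filename here is just an illustration; note that the POSIX $(...) form is equivalent to backticks and nests more cleanly):

```shell
#!/bin/sh
# Command substitution splices date's formatted output into a string.
# Backticks are what the text above uses; $(...) is the POSIX
# equivalent and is easier to nest.
NAME="backup.`date '+%d%b'`.tar"
SAME="backup.$(date '+%d%b').tar"
echo "$NAME"    # e.g. backup.26Sep.tar
```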

Tar

Tar is the essential backup utility. It takes a group of files and 'glues' them together as one archive. This was important in tar's early days, when it was used primarily to back systems up to a Tape ARchive. A magnetic tape can really only hold one long linear stream; it's not like a hard drive or a filesystem. Tar was therefore able to stream the contents of a hard drive, containing multiple files, onto a tape as one file---and could then read that single linear file off of the tape and recreate the filesystem, directories and all. I was familiar with using tar to extract files (tar xf); this time, however, I'd be creating an archive for the weekly backup. After some practice and research, I arrived at this command:

tar cf - /export/samba/fileshare/ > /export/samba/fb/fbweekly.`date '+%d%b'`.tar

The last part of the line above (after the '>') incorporates the date command substitution that I learned above. The redirection symbol '>' takes the output of tar and puts it in that fancily named file. The /export/samba/fileshare/ is the root of the fileshare directory that's being archived using "tar cf -". The 'c' tells tar to create an archive, the 'f' tells tar which file to use for that archive, and the '-' (hyphen) makes that file standard output, which is then redirected to "/export/samba/fb/fbweekly.`date '+%d%b'`.tar".
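The archive can be round-tripped to confirm tar really does recreate the tree. A self-contained sketch, with temp directories standing in for the real /export paths (GNU tar strips the leading '/' from member names when archiving, warning on stderr, which is what lets the tree be re-rooted on extraction):

```shell
#!/bin/sh
# Round trip: create a dated archive the same way as above, then list
# and re-extract it. Temp directories stand in for the real paths.
SRC=`mktemp -d`; DEST=`mktemp -d`; RESTORE=`mktemp -d`
mkdir "$SRC/CMayo"
echo "minutes" > "$SRC/CMayo/Agenda.doc"

# Same shape as the command above: archive to stdout, redirect to a
# dated filename.
tar cf - "$SRC" > "$DEST/fbweekly.`date '+%d%b'`.tar"

# 't' lists an archive's contents; 'x' extracts it. -C moves into the
# restore directory first, so the tree is recreated there.
tar tf "$DEST"/fbweekly.*.tar > /dev/null
tar xf "$DEST"/fbweekly.*.tar -C "$RESTORE"
```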

Gzip

I wanted desperately to gzip (compress) the archive immediately after I tarred it. I knew that I could use tar's 'z' option, both when compressing and uncompressing, but using it gave me an error about garbage at the end of the tar file (which bothered me), and I really wanted to compress the archive as much as possible using "gzip -9". For whatever reason, I couldn't pipe or redirect the line above to gzip, no matter what I tried---so I took the simple way out, which I'd use later in my script: after creating the tar-ball, I'd just use a separate command to gzip that same custom filename, which several minutes later (after tar finished tarring) would presumably be the same. (In other words, the date wouldn't have changed between tarring and gzipping.)

$ tar cf - /export/samba/fileshare/ > /export/samba/fb/fbweekly.`date '+%d%b'`.tar
$ gzip -9 /export/samba/fb/fbweekly.`date '+%d%b'`.tar

[Note: To be more error-proof, I could also assign that day's filename, "fbweekly.`date '+%d%b'`.tar", to a variable that I would then use in both the tar and gzip lines. Then, even if the date changed between the tar line and the gzip line, gzip would still find the tar file.]
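That note can be sketched in a few lines. Temp directories stand in for /export/samba/fileshare/ and /export/samba/fb/ so the sketch is self-contained:

```shell
#!/bin/sh
# Filename-variable sketch: evaluate the date once and reuse it, so
# tar and gzip always agree on the filename even if midnight passes
# between the two commands.
SRC=`mktemp -d`; DEST=`mktemp -d`
echo "sample data" > "$SRC/report.doc"

FILE="$DEST/fbweekly.`date '+%d%b'`.tar"
tar cf - "$SRC" > "$FILE"
gzip -9 "$FILE"
```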

Find

Now that I figured out the steps necessary for creating a weekly backup of the entire fileshare, I needed to figure out how to backup only those files that had changed in the last day, on a daily basis. Just from perusing a few books and webpages, I knew I'd have to use some combination of find and tar, so I started exploring find. I discovered that it's a really excellent tool for finding files based on any number of criteria, be it their name, size, permissions, last modified date, etc. For generating a list of files modified in the last day, I tried:

$ find /export/samba/fileshare/ -mtime -1
/export/samba/fileshare/MBrinson
/export/samba/fileshare/MBrinson/Agenda.doc
/export/samba/fileshare/CMayo
/export/samba/fileshare/CMayo/Bollenbacher Ltr.WHISEProj.doc
/export/samba/fileshare/CMayo/Battle Itinerary.doc
/export/samba/fileshare/JWatt
/export/samba/fileshare/JWatt/ScanJet 5p
/export/samba/fileshare/JWatt/ScanJet 5p/sj215en.exe
...

As you can see, this listing shows both the files that were modified and the directories that contain them. I figured out that in order to limit the list to files only, I could use:

$ find /export/samba/fileshare/ -mtime -1 \! -type d

where "\! -type d" means "don't include directories" (the backslash keeps the shell from interpreting the '!'). Then I combined the find command with tar, using the same command-substitution syntax that I had used with date---except this time I was going to add every modified file, found by find, to the tar-ball:

$ tar cf - `find /export/samba/fileshare/ -mtime -1 \! -type d` > backup.tar

But the line above ended up vomiting out screenfuls worth of this gorp:

tar: /export/samba/fileshare/CMayo/Bollenbacher: Cannot stat: No such file or directory
tar: Ltr.WHISEProj.doc: Cannot stat: No such file or directory
tar: /export/samba/fileshare/CMayo/Battle: Cannot stat: No such file or directory
tar: Itinerary.doc: Cannot stat: No such file or directory
tar: /export/samba/fileshare/JWatt/ScanJet: Cannot stat: No such file or directory
tar: 5p/sj215en.exe: Cannot stat: No such file or directory

I quickly realized that tar was interpreting each space in a filename as the end of that filename, which rendered almost all of the files and paths unintelligible. [Realize that the files and directories had been named from Windows, and thus were riddled with spaces.] I assumed that I needed find to surround each "path/file" with quotes, but I couldn't figure out how to do that, so I tried something I saw in another backup script, which separated the find and tar steps using a text file as an intermediary:

$ find /export/samba/fileshare/ -mtime -1 \! -type d > /tmp/modified.files
$ tar cT /tmp/modified.files > /export/samba/fb/fbdaily.`date '+%d%b'`.tar

This time, find redirected its output to a file in /tmp/, and tar took this file, using the T option, and successfully created the daily archive of modified files. The lines in the text file still had spaces in them, but tar treats each line as a separate filename, rather than each space-separated chunk of text. [Note: This separation of find and tar could also prove advantageous with an if statement that checks whether modified.files contains anything. If it did, tar would be invoked; if not, the script would end, preventing tar from trying to create an empty archive.]
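The guard mentioned in that note might look like this (temp paths again stand in for the real ones):

```shell
#!/bin/sh
# Guard sketch: run tar only when find actually listed something.
SRC=`mktemp -d`; DEST=`mktemp -d`; LIST=`mktemp`
echo "draft" > "$SRC/Battle Itinerary.doc"    # a name with a space

find "$SRC" -mtime -1 \! -type d > "$LIST"

# -s is true if the file exists and is not empty.
if [ -s "$LIST" ]; then
    tar cT "$LIST" > "$DEST/fbdaily.tar"
    gzip -9 "$DEST/fbdaily.tar"
fi
```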

I then finished the process with a gzip line:

$ gzip -9 /export/samba/fb/fbdaily.`date '+%d%b'`.tar
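For what it's worth, GNU find and GNU tar also offer a way around the spaces problem without the intermediate file: -print0 ends each name with a NUL byte instead of a newline, and tar can read that format from a pipe. A sketch (temp paths again):

```shell
#!/bin/sh
# Alternative to the /tmp/modified.files intermediary, assuming GNU
# find and GNU tar: NUL-terminated names survive any spaces.
SRC=`mktemp -d`; DEST=`mktemp -d`
echo "x" > "$SRC/ScanJet 5p notes.doc"

# -print0 writes NUL-terminated names; --null -T - makes tar read
# that format from standard input.
find "$SRC" -mtime -1 \! -type d -print0 \
    | tar -c --null -T - -f "$DEST/fbdaily.tar"
```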

Cron/crontab

Now that I had figured out how to get find, tar, and gzip to work together, I needed crontab to automate the process. It turned out that crontab was very easy to use. The crontab file is created using the crontab command (crontab -e), which opens up the default editor (vi in my case) and accepts entries in this format:

<minute: 0-59> <hour: 0-23> <dayofmonth: 1-31> <month: 1-12> <dayofweek: 0-6> command

The cron daemon, which is always running, continually checks the crontab file for entries. If the current date and time match the settings in a crontab line, then cron executes the given command.

Since it would be easiest to have cron run a single script as the command, I encapsulated the previous steps in two different files [this would also let me easily add the filename variable that I mentioned above]:

# fbweekly:
#!/bin/bash
# consists of only two lines that tar the entire fileshare and then gzip it
tar cf - /export/samba/fileshare/ > /export/samba/fb/fbweekly.`date '+%d%b'`.tar
gzip -9 /export/samba/fb/fbweekly.`date '+%d%b'`.tar

# fbdaily:
#!/bin/bash
# consists of three lines: one to find the modified files, and then two to tar and gzip them
find /export/samba/fileshare/ -mtime -1 \! -type d > /tmp/modified.files
tar cT /tmp/modified.files > /export/samba/fb/fbdaily.`date '+%d%b'`.tar
gzip -9 /export/samba/fb/fbdaily.`date '+%d%b'`.tar

With two self-contained scripts, I was able to put these lines into crontab:

30 23 * * 1,2,3,4,5 fbdaily
30 23 * * 0 fbweekly

The first line runs fbdaily at 11:30pm every weekday of every month. The second line runs fbweekly at 11:30pm every Sunday (day 0). At this point, I had accomplished everything I wanted to, but I remembered that if I didn't automatically purge the backup files, the hard drive would eventually fill up. So I created a script to find files in the backup directory that are more than a month old and remove them:

# fbclean:
#!/bin/bash
# finds all backup files more than 31 days old and deletes them
find /export/samba/fb/ -mtime +31 \! -type d -exec rm -f {} \;

I was able to accomplish the cleanup process with only one line, which I could have put directly in the crontab file, but I decided to make it a script like the others so that I could add to it later or use it independently of crontab if I so needed. I also discovered the -exec option of find:

-exec rm -f {} \;

The -exec option tells find to execute the following command, "rm -f", for every file that's found. The braces {} mark the place in the rm command where the file to be removed would normally appear, and the escaped semicolon \; marks the end of the -exec argument. With this new script, I modified the crontab file to look like the following, which completed my goal of a completely automated backup process:

30 23 * * 1,2,3,4,5 fbclean; fbdaily
30 23 * * 0 fbweekly

Now fbclean checks for and deletes old backup files every weekday, and then fbdaily creates the archive of modified files for that day.
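The -exec behavior can be watched in action on a throwaway directory. In this sketch, touch -d (a GNU extension) artificially backdates one file so that -mtime +31 has something to match:

```shell
#!/bin/sh
# Demonstration of find's -exec on a throwaway directory.
FB=`mktemp -d`
touch "$FB/fresh.tar.gz"
touch -d "40 days ago" "$FB/stale.tar.gz"    # backdate (GNU touch)

# For every match, find substitutes the filename for {} and runs rm.
find "$FB" -mtime +31 \! -type d -exec rm -f {} \;

ls "$FB"    # only fresh.tar.gz should remain
```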

INLS 183 Project 3: Backup with tar, gzip, find, date and cron script file

Resources

Linux in a Nutshell, 2nd Edition, Ellen Siever, O'Reilly, 2000.
Running Linux, 3rd edition, Matt Welsh, Matthias Kalle Dalheimer, Lar Kaufman, O'Reilly, 2000.
Linux Administration: A Beginner's Guide, Steve Shah, McGraw-Hill, 2000.