Regardless of the operating system you are using, data loss is inevitable. Sooner or later, it will happen to you; the only question is how much data you will lose. Although RAID can act as an insurance policy against hardware failures, it was never designed to serve as a backup and does that job poorly. Human error is always the greatest concern, since important files can be accidentally overwritten or deleted in a careless moment. It is easy to fall behind on your backups or get complacent, and without recent backups you have no recovery strategy. This guide will help you automate backups on your Linux rig so that your backup copies are always up to date.
Before you can back up your data, you need an acceptable storage location to copy it to. Optical media like CD-R/RW or DVD-R/RW discs were once a popular (but not necessarily the best) medium to back up to, since they held a lot of data for the time and were fairly cheap. Cheap optical media is suitable for short-term storage, but should not be relied upon for the long term because of the possibility of scratches, oxidation, or organic dye breakdown (CD rot). Optical media is now even less practical than it used to be, since most personal data greatly exceeds what most disc formats can hold. It would take many discs (or one or more discs in a still-expensive format like Blu-ray) to conduct a single backup session. It used to be common practice to include multiple redundant copies of a file on a disc or spread across several discs to improve the chances of recovery in case of damage, and this would inflate the disc count even more. Ultimately, it just isn't worth using optical discs for backup anymore.
Today, the practical options for backup are an external hard drive (or several of them, if you want maximum protection) and a separate server. It is best to combine these methods rather than rely on just one, since that increases redundancy. In any case, the hard drive(s) should be large enough to hold your existing data plus any foreseeable growth. As for servers, you should definitely use a remote one if you have access to it (if you buy web hosting and have plenty of space left on your account, that would be ideal for backups). Regardless of the storage mechanisms you use, the actual file transfers should be done with a program called Rsync.
Rsync is a program that copies data from one location to another. Although another program, cp, exists for this purpose, Rsync is far more advanced and efficient. While cp always copies entire files, Rsync compares each source file to its counterpart at the destination (if one exists), skips files that have not changed, and can send only the changed parts of those that have. (This partial transfer matters most over a network; between two local paths Rsync copies changed files whole by default.) In this way, Rsync can synchronize data between two locations much like the Windows Briefcase tool does. This saves an immense amount of time and bandwidth on backup procedures.
In addition to that, Rsync can sync files on both local and remote systems, whereas cp can only work with local systems. (There is a remote version of cp called scp, but even it can only work with whole files.) Rsync will be as slow as cp the first time you use it, since every file must be copied in full to the new backup location, but subsequent sessions will be much faster. You should know that it may take anywhere from several hours to several days to complete the first Rsync session with a remote server, depending on the speed of your connection and the amount of data you need to transfer. Furthermore, when Rsync runs over SSH, the transfer is encrypted, which keeps your data from being sniffed in transit.
Rsync is fairly straightforward. The basic syntax is “rsync -a [source dir] [destination dir]”. The -a switch tells Rsync to work in “archive” mode, which recurses into directories and preserves permissions, timestamps, and symbolic links, making it ideal for backups. Although this basic command will work once you specify the source and destination locations, there are many other options available to tweak Rsync. These can be discovered by reading the Rsync manual page (run “man rsync”).
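The basic command can be wrapped in a small shell function for experimenting. What follows is only a minimal sketch, not a definitive setup: the function name backup_local, the RSYNC variable, and the paths in the usage line are all made up for illustration. Only the -a switch comes from the text above.

```shell
# RSYNC is a hypothetical override hook: set RSYNC=echo to print the
# arguments instead of copying anything.
RSYNC="${RSYNC:-rsync}"

# backup_local SRC DEST (hypothetical helper)
# Copies the contents of SRC into DEST in archive mode.
backup_local() {
    local src="${1%/}"
    # With a trailing slash on the source, rsync copies what is inside the
    # directory; without it, rsync would create DEST/<name of SRC> instead.
    "$RSYNC" -a "$src/" "$2"
}
```

A call such as “backup_local /home/user/Documents /mnt/backup/Documents” would then mirror one directory into the other. Before trusting a new command with real data, it may be worth adding rsync's --dry-run and -v switches, which list what would be transferred without changing anything.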
Although the command-line implementation of Rsync allows for easier automation (more on that next), using Rsync in this way can be difficult for new users who are not used to the command line. In such cases there is a graphical frontend called Grsync that can vastly simplify the backup process. Grsync presents the various switches as easy-to-understand checkboxes that can be set to the desired combination.
Grsync offers a degree of automation by letting you define a session and run it automatically (grsync -e [session_name]), but standard Rsync is still much more versatile, since you can specify commands directly instead of having to rely on predefined sessions.
Once you have configured Rsync to back up your files, you are only halfway to having a viable backup plan. A decent backup solution must run regularly instead of intermittently, and by itself Rsync will not update your files unless you invoke it manually. While you could try to remember to run Rsync by hand every day, there is a far easier way to do it.
Linux and similar systems have a utility called Cron, which is essentially a scheduling tool for running other programs. Each user has a crontab file, which is a list of instructions for Cron to execute and the times each instruction should run. In this way, everyone (not just root) can use Cron. Cron works with the system clock; when the correct time for a planned event rolls around, Cron will automatically execute the command.
There are several ways to edit your crontab. The easiest way for new users to configure Cron is to use a frontend like gcrontab or kcron. More advanced users can edit the crontab manually in a text editor like Vim or Kate. To edit the crontab manually, open a terminal and type “crontab -e”, which opens your crontab in the editor named by the VISUAL or EDITOR environment variable. After that, you should check your system process list for a “cron” or “crond” process (root should own it) to make sure that the Cron daemon is running. You do not normally need to restart Cron after editing: “crontab -e” installs the new file when you save and exit, and the daemon picks up the change on its own. If your jobs still do not run, restarting the daemon (“sudo /etc/init.d/cron restart” on systems that use init scripts) is a reasonable troubleshooting step.
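For scripted setups, the crontab can also be changed without opening an editor: “crontab -l” prints the current table and “crontab -” installs whatever it reads on standard input. The sketch below builds on that under stated assumptions. The helper name merge_job is hypothetical, and the job line in the usage note is a placeholder.

```shell
# merge_job JOB (hypothetical helper)
# Reads an existing crontab on standard input and prints it back with JOB
# appended, unless an identical line is already present.
merge_job() {
    local job="$1" existing
    existing=$(cat)
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    # -F: fixed string, -x: whole line must match, -q: no output
    if ! printf '%s\n' "$existing" | grep -Fxq -- "$job"; then
        printf '%s\n' "$job"
    fi
}
```

One possible use is “crontab -l 2>/dev/null | merge_job '30 2 * * * /bin/bash /home/user/backup.sh' | crontab -”. Since “crontab -” replaces the entire table, it is safer to run the pipeline without the last stage first and read the result.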
Manual crontab editing looks daunting at first but is simple once you get the hang of it. Each row in the crontab is treated as a separate job. Each row has several columns that must be specified: minute (abbreviated “m”), hour (abbreviated “h”), day of month (abbreviated “dom”), month (abbreviated “mon”), day of week (abbreviated “dow”), and the command. Columns are separated by whitespace (spaces or tabs) with no other marks, and it doesn't matter whether each row lines up perfectly with the others. You can enter times and dates as plain numbers (Cron uses a 24-hour clock, so noon is hour 12 and midnight is hour 0), as abbreviated names for days of the week (Sun, Thu, etc.), and as wildcards (*). Anything defined with a wildcard is interpreted by Cron as “all”, meaning that if the hour on a job is set to “*”, Cron will run it every hour at the minute you specify. To make something run repeatedly at a certain interval, you can add a step value to the wildcard; for example, setting */2 in the hour field causes the job to run every two hours on the days you define. Remember to give the minute field a fixed value such as 0 in that case, because a “*” there would run the job every minute of each matching hour.
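To make the column layout concrete, here is a short sketch that labels the fields of one crontab row. The sample rows, the script path, and the helper name explain_cron_line are illustrative assumptions and not part of Cron itself.

```shell
# Sample crontab rows (the script path is a placeholder):
#   m   h    dom mon dow  command
#   30  2    *   *   *    /bin/bash /home/user/backup.sh   -> 02:30 every day
#   0   */2  *   *   *    /bin/bash /home/user/backup.sh   -> on the hour, every two hours
#   0   12   *   *   Sun  /bin/bash /home/user/backup.sh   -> noon every Sunday

# explain_cron_line LINE (hypothetical helper)
# Prints each of the five time fields and the command on its own line.
explain_cron_line() {
    printf '%s\n' "$1" | {
        # read splits on spaces and tabs; the last variable (cmd) receives
        # the rest of the line, so the command may contain spaces.
        read -r m h dom mon dow cmd
        printf 'minute: %s\nhour: %s\ndom: %s\nmon: %s\ndow: %s\ncommand: %s\n' \
            "$m" "$h" "$dom" "$mon" "$dow" "$cmd"
    }
}
```

Calling explain_cron_line with the second sample row reports “hour: */2” and keeps the whole “/bin/bash /home/user/backup.sh” string together as the command.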
Cron is quite flexible. Ranges covering everything between two values are written with a hyphen (-), while multiple nonconsecutive values are separated by commas. For instance, if you wanted to run a command every day from the first of the month to the 10th, you would specify “1-10” in the “dom” field. Likewise, if you wanted a command to run every Monday, Wednesday, and Friday, you would put “Mon,Wed,Fri” in the “dow” field.
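One way to check your reading of a field is to expand it into the numbers it stands for. The function below is a rough sketch of that idea. It is not how Cron itself is implemented, the name expand_field is made up, and it handles numeric fields only (no Mon or Jan style names).

```shell
# expand_field FIELD MIN MAX (hypothetical helper)
# Prints the numbers a cron field would match, given the field's legal range.
# Handles "*", "*/N", "A-B", "A-B/N", single numbers, and comma lists of those.
expand_field() {
    local rest="$1" min="$2" max="$3"
    local part range step lo hi i out=""
    while [ -n "$rest" ]; do
        # Peel off the next comma-separated item.
        part="${rest%%,*}"
        if [ "$part" = "$rest" ]; then rest=""; else rest="${rest#*,}"; fi

        # Split off an optional /step suffix.
        case "$part" in
            */*) range="${part%%/*}"; step="${part#*/}" ;;
            *)   range="$part"; step=1 ;;
        esac

        # Work out the start and end of the range.
        case "$range" in
            '*') lo="$min"; hi="$max" ;;
            *-*) lo="${range%-*}"; hi="${range#*-}" ;;
            *)   lo="$range"; hi="$range" ;;
        esac

        i="$lo"
        while [ "$i" -le "$hi" ]; do
            out="$out$i "
            i=$((i + step))
        done
    done
    printf '%s\n' "${out% }"
}
```

For example, “expand_field '1-10' 1 31” prints 1 2 3 4 5 6 7 8 9 10 for the day-of-month field, and “expand_field '*/2' 0 23” prints 0 2 4 6 8 10 12 14 16 18 20 22 for the hour field.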
Now that you know about Rsync and Cron, you can probably already see how they can be used together to automate backups. Since good backup practice means backing up to multiple destinations, you would seemingly have to create multiple crontab entries, each with a different rsync command. Fortunately, there is a far better way that can be handled with a single Cron job.
The various command shells on Linux (like Bash) have excellent support for scripting. Shell scripts are the Linux equivalent of Windows batch files and offer a way to quickly run multiple commands in a specific predefined order and with a preset configuration. If you have much experience at all on the command line, you should not find basic scripting very difficult (there are plenty of online guides to help you write scripts for Bash and other shells if you run into trouble). Creating a shell script to hold the necessary Rsync commands is trivial; from that point, you can invoke the shell script in your crontab, and each Rsync command will run at the time you set in Cron just as though it were being invoked directly. Cron does not run jobs from the script's directory and gives them only a minimal environment, so it helps to spell out the full path of both the shell and the script (“/bin/bash /full/path/to/$scriptname”) in the crontab instead of the shortcut “./$scriptname”.
If you have sensitive data, you should definitely consider encrypting it before you place it on a shared server (like a web host). Encryption can be done with GPG in the shell script prior to transmission.
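A backup script along these lines might look like the sketch below. Treat it as an illustration under stated assumptions rather than a finished tool: every path, the host backup.example.com, the GPG recipient, the function name run_backups, the RSYNC and GPG override variables, and the --run argument are placeholders invented for this example.

```shell
#!/bin/bash
# backup.sh: hypothetical example. Every path, host, and key ID is a placeholder.
RSYNC="${RSYNC:-rsync}"    # set RSYNC=echo to preview the rsync commands
GPG="${GPG:-gpg}"
SRC="/home/user/Documents"
LOCAL_DEST="/mnt/backup/Documents"
STAGING="/home/user/.backup-staging"
REMOTE_DEST="user@backup.example.com:backups"
RECIPIENT="user@example.com"

run_backups() {
    local status=0

    # 1. Plain copy to the external drive.
    "$RSYNC" -a "$SRC/" "$LOCAL_DEST/" || status=1

    # 2. Encrypt an archive with GPG, then send only the encrypted file
    #    to the shared server.
    mkdir -p "$STAGING" &&
    tar -czf - -C "$SRC" . |
        "$GPG" --batch --yes --encrypt --recipient "$RECIPIENT" \
            --output "$STAGING/documents.tar.gz.gpg" &&
    "$RSYNC" -a -e ssh "$STAGING/" "$REMOTE_DEST/" || status=1

    return "$status"
}

# Only act when asked to, so the file can be sourced and inspected safely.
if [ "${1:-}" = "--run" ]; then
    run_backups
fi
```

With this layout, the matching crontab row could be “30 2 * * * /bin/bash /home/user/backup.sh --run”. One trade-off of this particular sketch is that the archive is encrypted afresh on every run, so Rsync will likely have to resend the whole encrypted file each time instead of only the changes.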
The only foreseeable problem with automated backup is that system configurations tend to change over time. Mount points and IP addresses can be reassigned without notice, and your script will not automatically update itself to include these changes. As long as your Rsync commands are out of date, your files will not be backed up properly and you will have no idea of the problem until it is too late. Therefore, it pays to manually run your backup commands often to check for problems and update your script as necessary.
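One way to notice such a change sooner is to have the script refuse to run when its destination does not look right. The sketch below assumes that you once created an empty marker file on the backup drive. The file name .backup-target and the function name destination_ready are both hypothetical choices for this example.

```shell
# destination_ready DIR [MARKER] (hypothetical helper)
# Succeeds only if the marker file exists inside DIR. If the drive is not
# mounted where the script expects it, the marker will be missing.
destination_ready() {
    local dest="$1" marker="${2:-.backup-target}"
    if [ ! -f "$dest/$marker" ]; then
        echo "backup: $dest/$marker not found; is the drive mounted?" >&2
        return 1
    fi
}
```

A script could then begin with a line like “destination_ready /mnt/backup || exit 1”. Cron normally mails a job's output to the owning user when local mail delivery is set up, so the error message has a chance of reaching you.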
Although Rsync can sync between locations on the local system without user input, it usually requires a password before it can sync to a remote system (it relies on SSH for this). Since automated Cron jobs cannot accept user input, there is no way to provide the password when it is needed. Because of this, the default SSH behavior will not work for automated backups. You can get around this by setting up public/private key authentication for SSH; as long as the server accepts your key and the key has no passphrase, no password is required. Be forewarned that an automated connection can also fail if the remote host's address or host key changes: SSH checks the server against the entries in your known_hosts file and stops to warn you when they no longer match (this is a deliberate feature to help prevent man-in-the-middle attacks). Security is also compromised slightly, because a key that Cron can use unattended is stored without a passphrase, and anyone who obtains that file can log in as you. Ultimately, it comes down to a trade-off between security and convenience, so choose wisely based on your situation and needs.
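The key-based setup described above could be sketched as follows. The key file name backup_key, the account user@backup.example.com, the helper name backup_remote, and the RSYNC and BACKUP_KEY variables are all assumptions made for this example. The ssh-keygen and ssh-copy-id steps are shown as comments because they are run once by hand.

```shell
# One-time, interactive setup (run by hand, not from Cron):
#   ssh-keygen -t ed25519 -N "" -f ~/.ssh/backup_key
#   ssh-copy-id -i ~/.ssh/backup_key.pub user@backup.example.com

RSYNC="${RSYNC:-rsync}"    # hypothetical override hook (RSYNC=echo previews)
BACKUP_KEY="${BACKUP_KEY:-$HOME/.ssh/backup_key}"

# backup_remote SRC USER@HOST:DEST (hypothetical helper)
backup_remote() {
    local src="${1%/}"
    # BatchMode=yes makes ssh fail at once instead of waiting for a password
    # or host-key answer that nobody is there to type.
    "$RSYNC" -a -z -e "ssh -i $BACKUP_KEY -o BatchMode=yes" "$src/" "$2"
}
```

A call such as “backup_remote /home/user/Documents user@backup.example.com:backups” can then run from Cron without a prompt. Using a dedicated key for backups, instead of your everyday key, limits the damage if the unprotected key file is ever copied.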