It is important to ensure that Google does not index sites whilst they are still on a staging environment, but you cannot lock staging down completely - how would your clients proof it? So I run a simple global rewrite rule in Apache that redirects all requests for robots.txt to a central disallow-all response. This works great and Google appears to honour the rule as one would hope.

What happens, though, when that central file changes? One fateful night exactly that happened on an old server I manage. Someone had altered the file and replaced its contents with an allow-all rule. A site from the server started to appear in Google’s listings; thankfully it was picked up quickly, banned through Google Webmaster Tools, and the original robots.txt was put back in place to protect against future indexing.

This left me needing a quick and dirty little monitoring script to keep an eye on the file. It really didn’t need to be anything crazy - just email me when the file changes so I can investigate what or who changed it, tell them to desist, and revert its contents.

To do this I employed sha512sum and mail inside a simple cron job that would regularly compare the file’s hash against the known good hash. If the hashes do not match then the script will email a short message to let me know to check into it.

Now of course you could just use the cron job to revert the contents automatically, but I wanted to look into why it was happening first. If you’re really worried you could of course replace the contents of the file and then email yourself. In this case it wasn’t so important.

There are plenty of command line tools to help you get a hash for a file - handy when you’ve downloaded something and you want to verify the integrity of the file. It used to be common for open source projects to list hashes beside their downloads before GitHub. Anyway, there are a number of choices, listed here in order of increasing hash length and therefore decreasing likelihood of a collision (two different files producing the same hash):

  • md5sum
  • sha1sum
  • sha256sum
  • sha512sum
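To make the difference in length concrete, here is a small sketch that hashes the same input with each tool and prints how many hex characters come back. It assumes the GNU coreutils versions of these commands are installed; the digest_length helper and the sample input are made up for illustration.

```shell
#!/bin/sh
# digest_length is a hypothetical helper: it hashes a fixed sample input with
# the given tool and prints the length of the resulting hex digest.
digest_length() {
    # Each tool prints "<hash>  <name>"; keep only the hash field.
    hash=$(printf 'User-agent: *\nDisallow: /\n' | "$1" | cut -d ' ' -f 1)
    echo "${#hash}"
}

for tool in md5sum sha1sum sha256sum sha512sum; do
    echo "$tool: $(digest_length "$tool") hex characters"
done
```

The longer the digest, the more room there is between any two files, which is why I reach for sha512sum here.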

By default they’ll spit out the hash(es) onto the command line (STDOUT), but we’re going to redirect them to a file so we can refer to them later.

sha512sum robots.txt index.html > cron_sums.txt

This will create a text file containing two hash values that we can later use to verify the files in question. If you later take another hash of a file and it doesn’t match the one in cron_sums.txt, then that file has changed. Handily, sha512sum can do the comparison for us: the -c switch checks each file listed in cron_sums.txt against its recorded hash, and --status suppresses the output so that only the exit status reports the result.

sha512sum --status -c cron_sums.txt
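If you want to see the check in action before trusting it, the following sketch runs it in a scratch directory, first with untouched files and then after tampering with robots.txt. It assumes GNU coreutils; the temporary directory and the file contents are invented for the demonstration.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample files standing in for the real ones.
printf 'User-agent: *\nDisallow: /\n' > robots.txt
echo '<h1>Staging</h1>' > index.html

# Record the known good hashes.
sha512sum robots.txt index.html > cron_sums.txt

# Check while nothing has changed.
status_unchanged=0
sha512sum --status -c cron_sums.txt || status_unchanged=$?
echo "unchanged files: exit status $status_unchanged"

# Tamper with robots.txt and check again.
printf 'User-agent: *\nAllow: /\n' > robots.txt
status_changed=0
sha512sum --status -c cron_sums.txt || status_changed=$?
echo "modified robots.txt: exit status $status_changed"
```

The first check should report zero and the second a non-zero status, which is all the one-liners that follow rely on.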

The exit status of that command (zero when every hash matches, non-zero otherwise) can be turned into a human readable message using the && (and) and || (or) operators on the command line.

sha512sum --status -c cron_sums.txt && echo "Success" || echo "Failed"

The above command is pretty self explanatory so I won’t bother working through it and I’ll move on to sending the email instead. This will be done by using the venerable mail command, which needs a recipient address (swap the you@example.com placeholder for your own).

sha512sum --status -c cron_sums.txt && echo "Success" || echo "Failed" | mail -s "File hashes didn't match" you@example.com

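Here only the "Failed" message is piped to mail and dutifully sent through to your inbox, because a pipe binds more tightly than && and ||; "Success" is still printed to STDOUT. Under cron, anything a job prints is normally mailed to the job’s owner, and I’d like to only be disturbed when it goes wrong - I don’t care if it succeeds. To do this we’ll redirect the success output to the /dev/null black hole.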

sha512sum --status -c cron_sums.txt && echo "Success" > /dev/null || echo "Failed" | mail -s "File hashes didn't match" you@example.com
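If the way &&, || and | combine feels slippery, this little sketch shows which message actually reaches the pipe. It uses true and false as stand-ins for a passing and failing hash check, and a made-up fake_mail function that upper-cases its input so you can tell what was piped into it; nothing is really emailed.

```shell
#!/bin/sh
# fake_mail is a hypothetical stand-in for mail: it upper-cases whatever it
# receives on STDIN, so piped text is easy to tell apart from unpiped text.
fake_mail() { tr 'a-z' 'A-Z'; }

# "true" plays the part of a passing check, "false" a failing one.
check_passes=$(true && echo "Success" || echo "Failed" | fake_mail)
check_fails=$(false && echo "Success" || echo "Failed" | fake_mail)

echo "when the check passes: $check_passes"
echo "when the check fails:  $check_fails"
```

Only the failure message comes back upper-cased, so only the failure message ever goes near the mail stand-in.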

So we’ve worked out the bash command we want cron to run for us every minute of the day. Let’s tell cron about it! Execute crontab -e on the command line to open the crontab in your default editor.

Now add the following cron job to the file.

*   *   *   *   *    /usr/bin/sha512sum --status -c cron_sums.txt && echo "Success" > /dev/null || echo "Failed" | mail -s "File hashes didn't match" you@example.com

It is worth noting that the paths in the cron_sums.txt file are all relative, so you may need to change into the directory containing the files you want to check before running the sha512sum command from cron. By default, cron runs jobs from the user’s home directory.

*   *   *   *   *    cd /var/www; /usr/bin/sha512sum --status -c cron_sums.txt && echo "Success" > /dev/null || echo "Failed" | mail -s "File hashes didn't match" you@example.com
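The relative path issue is easy to reproduce outside cron. In this sketch two temporary directories stand in for /var/www and for the home directory cron starts in (both are invented for the demonstration, and GNU coreutils is assumed); the same check is run from each.

```shell
#!/bin/sh
site_dir=$(mktemp -d)    # stands in for /var/www
other_dir=$(mktemp -d)   # stands in for the directory cron starts in

# Create the file and record its hash using a relative path.
printf 'User-agent: *\nDisallow: /\n' > "$site_dir/robots.txt"
(cd "$site_dir" && sha512sum robots.txt > cron_sums.txt)

# From the wrong directory the relative name robots.txt cannot be found.
from_elsewhere=0
(cd "$other_dir" && sha512sum --status -c "$site_dir/cron_sums.txt" 2>/dev/null) || from_elsewhere=$?

# From the directory holding the files the check can find them.
from_site=0
(cd "$site_dir" && sha512sum --status -c cron_sums.txt) || from_site=$?

echo "run from elsewhere: exit status $from_elsewhere"
echo "run from site dir:  exit status $from_site"
```

Without the cd you would be emailed every minute about files that have not changed at all.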

It isn’t pretty and it certainly doesn’t scale (although you could email a list/forwarding group), but it does serve as a quick and dirty fix to warn you of file inconsistency.

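As a bonus, to automatically revert the file as well, you could swap the crontab entry for the following. The braces group the email and the rewrite so that both run only when the check fails, and the restored contents must be identical to the file the hash was taken from, otherwise the check will simply fail again a minute later.

*   *   *   *   *    cd /var/www; /usr/bin/sha512sum --status -c cron_sums.txt && echo "Success" > /dev/null || { echo "Failed" | mail -s "File hashes didn't match" you@example.com; printf 'User-agent: *\nDisallow: /\n' > robots.txt; }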

Whilst this is a very simple one-liner example you could of course use the same principles to write a simple little bash script that would be triggered by failure instead.

*   *   *   *   *    cd /var/www; /usr/bin/sha512sum --status -c cron_sums.txt && echo "Success" > /dev/null || /usr/scripts/
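A failure script along those lines might look something like the sketch below. Treat it as a starting point rather than what I actually run: the function names, the ALERT_TO and MAIL_CMD variables and the placeholder address are all made up, and it assumes the known good robots.txt is the usual two-line disallow-all file.

```shell
#!/bin/sh
# Hypothetical failure handler: send the alert, then restore robots.txt.
# ALERT_TO and MAIL_CMD are illustrative names, not part of the original setup.
ALERT_TO=${ALERT_TO:-you@example.com}
MAIL_CMD=${MAIL_CMD:-mail}

restore_robots() {
    # The restored file must be byte-for-byte identical to the one that
    # cron_sums.txt was built from, or the next check will fail again.
    printf 'User-agent: *\nDisallow: /\n' > "$1/robots.txt"
}

handle_failure() {
    # $1 is the directory containing robots.txt.
    echo "A monitored file changed in $1" |
        "$MAIL_CMD" -s "File hashes didn't match" "$ALERT_TO"
    restore_robots "$1"
}

# Only act when a directory is passed, e.g. "<this script> /var/www".
if [ "$#" -gt 0 ]; then
    handle_failure "$1"
fi
```

Keeping the alert and the restore in separate functions makes it easy to drop the restore while you are still investigating who is changing the file.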