Use AWStats with Amazon S3 / CloudFront
My final post on the subject, wrapping everything up: how to automate the processing of Amazon S3 and CloudFront logs with AWStats.
I've recently written two posts about AWS and log processing on an external server, so here comes a final post to wrap it all up and tie everything together.
Goal
To have an automatic cron process that pulls the Amazon S3 and/or CloudFront log files down to your own server and then uses logresolvemerge to combine them for processing with AWStats.
Requirements
An Amazon Web Services account where you collect your log files in a bucket. Your own dedicated server, VPS or host where you have shell access to install and configure your own solutions, so you can have Python and boto installed as well as add your own scripts. You should also have AWStats installed and up and running. I've made this setup on Ubuntu 10.04, which I run on a VPS over at Linode where I currently run a couple of projects.
Setup
I'm currently collecting logs from one CloudFront distribution and one S3 bucket
which I have mapped my own CNAMEs
to. Let's call them s3.example.com and
cdn.example.com. Then I have another bucket which is not public but where I
store other things. So in this bucket I've made a folder named logs and in that
folder I have a folder for each domain.
So the log prefixes for me in this case become:
/logs/s3.example.com
/logs/cdn.example.com
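If you haven't turned on logging for the public bucket yet, here is a minimal boto sketch of pointing S3 server access logging at this layout. It's not from my earlier posts, and the bucket names are hypothetical; CloudFront access logging is enabled on the distribution itself, aimed at the same bucket and prefix.
# Hypothetical sketch: route S3 server access logs into the private bucket.
from boto.s3.connection import S3Connection

conn = S3Connection()                        # assumes AWS keys in the environment
logs = conn.get_bucket('my-private-bucket')  # the non-public bucket
logs.set_as_logging_target()                 # grant the S3 log delivery group access
site = conn.get_bucket('s3.example.com')
site.enable_logging(logs, target_prefix='logs/s3.example.com/')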
Download AWS Logs
To be able to download the log files from Amazon to a local directory, check out my earlier post about downloading AWS logs with boto and Python.
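For reference, below is a condensed sketch of what such a script can look like; the full version lives in that post. The hard-coded bucket name is hypothetical, the credentials are assumed to be picked up from the environment by boto, and deleting each key after download is an assumption on my part so that the next cron run doesn't append the same entries twice.
#!/usr/bin/env python
# Condensed sketch of a get-aws-logs.py style downloader (see the earlier
# post for the real script). Assumes boto finds the AWS credentials in the
# environment and that downloaded keys can be deleted afterwards.
import os
from optparse import OptionParser
from boto.s3.connection import S3Connection

BUCKET_NAME = 'my-private-bucket'  # hypothetical; the bucket holding /logs/

parser = OptionParser()
parser.add_option('--prefix', dest='prefix', help='key prefix to download')
parser.add_option('--local', dest='local', help='local target directory')
(opts, args) = parser.parse_args()

if not os.path.isdir(opts.local):
    os.makedirs(opts.local)

bucket = S3Connection().get_bucket(BUCKET_NAME)
for key in bucket.list(prefix=opts.prefix):
    if key.name.endswith('/'):
        continue  # skip the 'folder' placeholder key itself
    key.get_contents_to_filename(os.path.join(opts.local, os.path.basename(key.name)))
    key.delete()  # assumed: remove processed logs from the bucket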
Configure AWStats
Now we need to set up some AWStats configuration to prepare it for handling the AWS logs. I'll create two configuration files for this example:
/etc/awstats/awstats.cdn.example.com.conf
/etc/awstats/awstats.s3.example.com.conf
I assume that you already know your way around configuring AWStats, so I'll focus on the specifics for AWS compatibility. The final log files that will be created later on for AWStats to use will be stored in /var/log/apache2/, so I point the LogFile option to that location. Then we just have to set up the LogFormat correctly. Below is the setup for S3 log files, followed by the one for CloudFront log files.
S3 AWStats LogFormat
LogFile="/var/log/apache2/s3.example.com.log"
LogFormat="%other %extra1 %time1 %host %logname %other %method %url %otherquot %code %extra2 %bytesd %other %extra3 %extra4 %refererquot %uaquot %other"
CloudFront AWStats LogFormat
LogFile="/var/log/apache2/cdn.example.com.log"
LogFormat="%time2 %cluster %bytesd %host %method %virtualname %url %code %referer %ua %query"
If you also want the CloudFront statistics to display information about the edges you can check out my post about CloudFront Edges in AWStats.
Automate everything
Now that we have all the components in place, we just need to automate them so the whole thing can later be added to cron. I've made a bash script which takes care of the automation. The script is not very complicated, but I'll make a quick walk-through of it so it can be modified to specific needs and setups. I've numbered the comment for each section in the script and use those numbers as references in the list below.
- A few variables used in the script. The date variable just holds the current date. I don't really use this information at the moment, other than appending it to the temporary directory names, but it could come in handy if I ever expand the script to keep the downloaded archives around. Then I create a variable for each log I want to process; in this example I process two logs, one from S3 and one from CloudFront, so there are two variables here containing the paths to the temp directories where the log files will be downloaded.
- Here we use the boto Python script I created earlier to download all log files from Amazon to our local temp directories.
- Now that all the log files have been downloaded, we need to combine them into a format that AWStats can understand. The first line combines the CloudFront logs; they are very straightforward, so they just need to be merged into one large file and AWStats is ready to process it. The second line merges the S3 log files into one large file. S3 is a bit more tricky, as its logs contain a few things AWStats doesn't understand, so I use a couple of regular expressions to remove the parts that would give AWStats a headache (illustrated in Python right after this list). I store the final AWS log files in /var/log/apache2/, which is the path I defined in the LogFile option for AWStats earlier.
- Our log files are now downloaded and combined into the final log files in /var/log/apache2/, so I simply delete the temporary downloaded files, as I don't need to keep them around anymore.
- And finally we execute AWStats to update the statistics with the log files we have just processed.
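Before the script itself, here's the promised illustration of the S3 cleanup in step 3, written out in Python since the sed one-liners can be hard to read (the sample value is made up):
import re

# What the sed expressions in step 3 do: S3 logs an operation field such as
# REST.GET.OBJECT where AWStats expects a plain HTTP method, so the
# REST./SOAP. wrapping is stripped from each line.
op = 'REST.GET.OBJECT'                             # made-up sample field
op = re.sub(r'SOAP\.([A-Z]*)', r'\1', op)          # SOAP.GET -> GET
op = re.sub(r'REST\.([A-Z]*)\.[A-Z]*', r'\1', op)  # REST.GET.OBJECT -> GET
print op                                           # prints: GET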
get-aws-logs.sh
#!/bin/bash
# Initial, cron script to download and merge AWS logs
# 29/11 - 2010, Johan Steen
# 1. Setup variables
date=`date +%Y-%m-%d`
cdn_folder="/tmp/log_cdn_$date/"
static_folder="/tmp/log_static_$date/"
# 2. Call the python script to download log folders from Amazon to local folders
python /home/johan/get-aws-logs.py --prefix=logs/cdn.example.com/ --local=$cdn_folder
python /home/johan/get-aws-logs.py --prefix=logs/s3.example.com/ --local=$static_folder
# 3. Merge and add the downloaded log files to the local log file
/usr/local/bin/logresolvemerge.pl ${cdn_folder}* >> /var/log/apache2/cdn.example.com.log
/usr/local/bin/logresolvemerge.pl ${static_folder}* | sed -e 's/SOAP\.\([A-Z]*\)/\1/' -e 's/REST\.\([A-Z]*\)\.[A-Z]*/\1/' >> /var/log/apache2/s3.example.com.log
# 4. Delete the downloaded log files
rm -rf $cdn_folder
rm -rf $static_folder
# 5. Update the AWStats Logs
/usr/lib/cgi-bin/awstats.pl -config=cdn.example.com -update
/usr/lib/cgi-bin/awstats.pl -config=s3.example.com -update
Cron it
And finally, add the bash script to cron, to be run as often as you feel is appropriate for your setup. Note that the entry below uses the six-field format of /etc/crontab (or a file in /etc/cron.d/), where the third field is the user to run the script as.
# Process the AWS Logs at 4:43 every night
43 4 * * * root /home/johan/get-aws-logs.sh >/dev/null
And that's it. Feel free to leave a comment if you have any questions or suggestions for improvements.