Download & Process Amazon CloudFront Logs with AWStats

These days we use Amazon CloudFront for content delivery. Amazon has made it very easy to serve files from an Amazon Simple Storage Service (S3) bucket through a CloudFront distribution. If you are using CloudFront as your Content Delivery Network (CDN), the next task is monitoring its usage. For this, CloudFront can store its access logs in an S3 bucket; my hurdle was processing the log files it writes there. For sites hosted with Apache I use AWStats to read the logs, so my vote went to AWStats here as well. Please follow the steps one by one 😉

1. First we need to download the log files stored in the S3 bucket. For this I used a Python script from wpstorm.net, with a few modifications so that it worked for me. Please follow that blog post if you need any help setting up the required libraries.
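The only non-standard dependency is the boto library mentioned in the script's docstring; assuming pip is available on your system, installing it is a one-liner (easy_install boto does the same job on older setups):

pip install boto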

get-aws-logs.py

#! /usr/bin/env python
"""Download and delete log files for AWS S3 / CloudFront
 
Usage: python get-aws-logs.py [options]
 
Options:
  -b ..., --bucket=...    AWS Bucket
  -p ..., --prefix=...    AWS Key Prefix
  -a ..., --access=...    AWS Access Key ID
  -s ..., --secret=...    AWS Secret Access Key
  -l ..., --local=...     Local Download Path
  -h, --help              Show this help
  -d                      Show debugging information while parsing
 
Examples:
  get-aws-logs.py -b eqxlogs
  get-aws-logs.py --bucket=eqxlogs
  get-aws-logs.py -p logs/cdn.example.com/
  get-aws-logs.py --prefix=logs/cdn.example.com/
 
This program requires the boto module for Python to be installed.
"""
 
__author__ = "Johan Steen (http://www.artstorm.net/)"
__version__ = "0.5.0"
__date__ = "28 Nov 2010"
 
import boto
import getopt
import sys, os
from boto.s3.key import Key
 
_debug = 0
 
class get_logs:
    """Download log files from the specified bucket and path and then delete them from the bucket.
    Uses: http://boto.s3.amazonaws.com/index.html
    """
    # Set default values
    AWS_BUCKET_NAME = '{AWS_BUCKET_NAME}'
    AWS_KEY_PREFIX = ''
    AWS_ACCESS_KEY_ID = '{AWS_ACCESS_KEY_ID}'
    AWS_SECRET_ACCESS_KEY = '{AWS_SECRET_ACCESS_KEY}'
    LOCAL_PATH = '/tmp'
    # Don't change below here
    s3_conn = None
    bucket = None
    bucket_list = None
 
    def __init__(self):
        self.s3_conn = None
        self.bucket_list = None
        self.bucket = None
 
    def start(self):
        """Connect, get file list, copy and delete the logs"""
        self.s3Connect()
        self.getList()
        self.copyFiles()
 
    def s3Connect(self):
        """Creates a S3 Connection Object"""
        self.s3_conn = boto.connect_s3(self.AWS_ACCESS_KEY_ID, self.AWS_SECRET_ACCESS_KEY)
 
    def getList(self):
        """Connects to the bucket and then gets a list of all keys available with the chosen prefix"""
        self.bucket = self.s3_conn.get_bucket(self.AWS_BUCKET_NAME)
        self.bucket_list = self.bucket.list(self.AWS_KEY_PREFIX)
 
    def copyFiles(self):
        """Creates the local folder if it does not already exist, then downloads, archives and deletes all keys from the bucket"""
        # Using makedirs as it's recursive
        if not os.path.exists(self.LOCAL_PATH):
            os.makedirs(self.LOCAL_PATH)
        for key_list in self.bucket_list:
            key = str(key_list.key)
            # Get the log filename ([-1] accesses the last item in a list).
            filename = key.split('/')[-1]
            # os.path.join works whether or not LOCAL_PATH has a trailing slash
            local_file = os.path.join(self.LOCAL_PATH, filename)
            # Check if the file exists locally, if not: download it
            if not os.path.exists(local_file):
                key_list.get_contents_to_filename(local_file)
                print "Downloaded\t" + filename
            # Check that the file was downloaded, if so: archive it in the bucket and delete the original key
            if os.path.exists(local_file):
                key_list.copy(self.bucket, 'archive/' + key_list.key)
                print "Moved\t\t" + filename
                key_list.delete()
                print "Deleted\t\t" + filename
 
def usage():
    print __doc__
 
def main(argv):
    try:
        opts, args = getopt.getopt(argv, "hb:p:l:a:s:d", ["help", "bucket=", "prefix=", "local=", "access=", "secret="])
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    logs = get_logs()
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            usage()
            sys.exit()
        elif opt == '-d':
            global _debug
            _debug = 1
        elif opt in ("-b", "--bucket"):
            logs.AWS_BUCKET_NAME = arg
        elif opt in ("-p", "--prefix"):
            logs.AWS_KEY_PREFIX = arg
        elif opt in ("-a", "--access"):
            logs.AWS_ACCESS_KEY_ID = arg
        elif opt in ("-s", "--secret"):
            logs.AWS_SECRET_ACCESS_KEY = arg
        elif opt in ("-l", "--local"):
            logs.LOCAL_PATH = arg
    logs.start()
 
if __name__ == "__main__":
    main(sys.argv[1:])

Note: The above script downloads the S3 logs to the specified folder, archives them under an archive/ prefix in the bucket and then deletes the originals. Make sure you fill in your Amazon access keys (or pass them with the -a and -s options).
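For example, a one-off run that pulls the logs of a distribution into a local folder would look something like this (the bucket name, keys and local path are placeholders; the prefix follows the example from the script's usage text):

python get-aws-logs.py --bucket=my-log-bucket \
    --prefix=logs/cdn.example.com/ \
    --access=YOUR_ACCESS_KEY_ID \
    --secret=YOUR_SECRET_ACCESS_KEY \
    --local=/tmp/cf-logs/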

2. Next, a Bash script uses the above Python script to download the log files, combines them into a single log file and then has AWStats analyze it.

Warning: Please read through the script files and make the changes necessary for your setup.
Note: You should have AWStats installed on your system; the script below calls it directly.
Note: You can download the script files at the end of this blog post, where an AWStats configuration with a custom setup for the CloudFront log format is also provided.
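For reference, the AWStats directives that matter for the CloudFront download-distribution format look roughly like this (an excerpt only; the LogFile path and SiteDomain are from my setup, and the complete, working configuration is part of the download mentioned above):

LogFile="/var/www/logs/www.imthi.com.log"
SiteDomain="www.imthi.com"
LogFormat="%time2 %cluster %bytesd %host %method %virtualname %url %code %referer %ua %query"

The custom LogFormat simply maps the download-distribution fields (date/time, edge location, bytes, client IP, method, host, URI, status, referer, user agent, query string) onto the corresponding AWStats placeholders.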

get-aws-logs.sh

#!/bin/bash
# Cron script to download and merge AWS CloudFront logs
# 29/11 - 2010, Johan Steen

# 1. Set up a dated scratch folder for this run
date=$(date +%Y-%m-%d)
static_folder="/tmp/log_static_$date/"
mkdir -pv "$static_folder"

# 2. Download the log files from S3 and unpack them
python /var/www/scripts/get-aws-logs.py --prefix=logs/www.imthi.com --local="$static_folder"
gunzip --quiet "${static_folder}"*

# 3. Merge the logs in chronological order, join the date and time columns
#    with a space (the format AWStats' %time2 expects) and append the result
#    to the site's combined log file
/usr/local/awstats/tools/logresolvemerge.pl "${static_folder}"* | sed -r -e 's/([0-9]{4}-[0-9]{2}-[0-9]{2})\t([0-9]{2}:[0-9]{2}:[0-9]{2})/\1 \2/g' >> /var/www/logs/www.imthi.com.log

# 4. Clean up the scratch folder and let AWStats update its statistics
rm -vrf "$static_folder"
/usr/local/awstats/wwwroot/cgi-bin/awstats.pl -config=imthi -update

I would suggest test-running the above scripts in a staging / testing environment before moving them to production. Again, please update the scripts with your own domain details and Amazon access keys.
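The shell script is written as a cron job; a nightly crontab entry along these lines keeps the stats up to date (the schedule, script location and log file below are only examples, adjust them to your setup):

# m h dom mon dow  command
15 4 * * * /bin/bash /var/www/scripts/get-aws-logs.sh >> /var/log/get-aws-logs.log 2>&1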

Download the scripts to download and process Amazon CloudFront logs with AWStats.

Have a nice journey exploring the cloud 😉

By Imthiaz

Programmer; SaaS, CMS & CRM framework designer; loves Linux & Apple products; currently addicted to mobile development and working @bluebeetle

5 comments

  1. Hello there and thanks for the information posted on your blog.

    It seems we do not have the same cloudfront access logs’ format.
    What I have is what is described here: http://docs.amazonwebservices.com/AmazonCloudFront/latest/DeveloperGuide/index.html?AccessLogs.html

    Fields are in that order (in my case and in Amazon’s docs):
    #Fields: date time x-edge-location c-ip x-event sc-bytes x-cf-status x-cf-client-id cs-uri-stem cs-uri-query c-referrer x-page-url c-user-agent x-sname x-sname-query x-file-ext x-sid

    Whereas you propose to use the following LogFormat:
    LogFormat=”%time2 %cluster %bytesd %host %method %virtualname %url %code %referer %ua %query”

    Can you please check the order of your fields?

    Best regards,

    Kmon

    1. Dear Kmon

      What you are looking at is the log format for a streaming distribution.

      The following is an example of a log file for a streaming distribution.
      #Version: 1.0
      #Fields: date time x-edge-location c-ip x-event sc-bytes x-cf-status x-cf-client-id cs-uri-stem cs-uri-query c-referrer x-page-url c-user-agent x-sname x-sname-query x-file-ext x-sid

      The following is an example log file for a download distribution.
      #Version: 1.0
      #Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query

      My script only parses the download distribution format, not streaming. Please change the fields in your AWStats config to match the streaming distribution format. It would be great if you could share that format once you have it working.

      Cheers
      Imthiaz

      1. You are absolutely right.

        Naively adapting awstats to eat cloudfront’s streaming format did not help much.
        In fact, there are many events that can happen while streaming (connect, play, seek, stop, disconnect, etc) and the sc-bytes is unfortunately only the total bytes transferred to the client up to that event (one line of log per event).
        The naive solution that I applied is to concentrate on “stop” events and thus count the amount of data transferred.
        As you can imagine, a client pressing on stop (e.g., to pause the video) and then play again will be counted “twice”, say XKB for the first stop and then YKB (where Y includes X) for the second stop, resulting in stats that did not make any sense, unfortunately.

        If anyone has found a simple and neat solution to this problem, please post a comment!

        Cheers,

        Kmon

  2. This is the code I use for parsing streaming logs, if it helps at all.

    First, reading all the logfiles in a directory…

    public function process_logs($x)
    {
        // First we'll create an array of all viable log files in directory $x
        $log_files = array();
        $handle = opendir($x);

        // For EACH file in this directory
        while ($file = readdir($handle)) {
            // echo "Checking file: $file \n";
            // If the filename has the log stem in it then add it to the list
            if (strstr($file, "E9PBX")) {
                echo "Adding file: $file\n";
                $log_files[] = $file;
            }
        }

        foreach ($log_files as $logfile) {
            // Open the file
            $fh = fopen($logfile, 'r');
            $i = 0;

            while (!feof($fh)) {
                $i++;
                // Read the line
                $content = fgets($fh, 4096);
                // echo "line $i: " . $content;
                // echo "length $i: " . strlen($content);
                if ($i > 2) {
                    // Data lines contain a URL; the first two lines are headers
                    if (strpos($content, "http://")) {
                        $rEvent = new rtmpEvent();
                        $fields = explode("\t", $content);
                        $rEvent->init($fields);
                    }
                }
            }
        }
    }

  3. And here’s the basic row that you’ll get back in $fields. I don’t use 9,14, or 16, but I think they are defined in the AWS documentation.

    Side note, I am not a real programmer, I just futz around. I’m sure there are better ways to do this.

    /* [0] => [date]
    [1] => [time]
    [2] => [edge location]
    [3] => [ip address]
    [4] => [event]
    [5] => [leading bytes]
    [6] => OK
    [7] => [unique connection id]
    [8] => [streaming server]
    [9] =>
    [10] => [actual connector]
    [11] => [URL that made the call]
    [12] => [user’s machine]
    [13] => [path to file]
    [14] => –
    [15] => [filetype]
    [16] => 1*/
