Download & Process Amazon Cloudfront Logs with Awstats

These days we use Amazon CloudFront for content delivery. Amazon has made it very easy to deliver files stored in an Amazon Simple Storage Service (S3) bucket through a CloudFront distribution. If you are using CloudFront as a Content Delivery Network (CDN), your next task will be monitoring its usage. For this, CloudFront has a provision to store access logs in an S3 bucket. My hurdle was processing the log files stored by CloudFront. For sites hosted with Apache I use Awstats to read the logs, so my vote went to Awstats. Please follow the steps one by one 😉

1. First, download the log files stored in the S3 bucket. For this I used a Python script from wpstorm.net, with some modifications to make it work for me. Please follow that blog post if you need any help setting up the required libraries.

get-aws-logs.py

Note: The above script downloads the S3 logs to the specified folder. Please make sure you fill in your Amazon access keys.
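The original script linked above predates today's AWS SDK, so as a rough illustration only, here is what the download step might look like with boto3. The bucket name, log prefix, and destination folder are placeholders, not values from the post.

```python
import os

def local_path_for_key(key, dest_dir):
    """Map an S3 key like 'cf-logs/E123.2010-01-01.gz' to a local file path."""
    return os.path.join(dest_dir, os.path.basename(key))

def download_new_logs(bucket, prefix, dest_dir):
    """Download any CloudFront log objects we do not already have locally."""
    import boto3  # assumes credentials are configured via the usual AWS mechanisms
    s3 = boto3.client("s3")
    os.makedirs(dest_dir, exist_ok=True)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            target = local_path_for_key(obj["Key"], dest_dir)
            # skip files we already fetched, so this is cheap to run from cron
            if not os.path.exists(target):
                s3.download_file(bucket, obj["Key"], target)
```

Skipping files that already exist locally keeps repeated runs cheap, which matters if you schedule this from cron.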

2. Next, a bash script uses the above Python script to download the log files, combines them all into a single log file, and then runs Awstats against it.

Warning: Please read through the script files and make the necessary changes.
Note: You should have Awstats installed on your system; the script below uses it.
Note: You can download the script files at the end of this blog post, where an Awstats configuration with a custom setup for the CloudFront log format is also provided.

get-aws-logs.sh

I would suggest you test run the above scripts in a staging/testing environment before moving to production. Again, please update the scripts with your domain details and Amazon access keys.
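As a rough sketch of what the combine step does (assuming gzipped download-distribution logs; the file layout and the awstats invocation mentioned in the comment are illustrative, not taken from the bundled script):

```python
import glob
import gzip
import os

def combine_logs(log_dir, combined_path):
    """Concatenate gzipped CloudFront logs into one plain-text log file,
    dropping the #Version/#Fields header lines that Awstats should not see.
    Returns the number of data lines written."""
    count = 0
    with open(combined_path, "w") as out:
        for path in sorted(glob.glob(os.path.join(log_dir, "*.gz"))):
            with gzip.open(path, "rt") as f:
                for line in f:
                    if not line.startswith("#"):
                        out.write(line)
                        count += 1
    return count

# The bash script then runs Awstats against the combined file, along the
# lines of:  awstats.pl -config=yourdomain -update
# (the exact invocation depends on your Awstats setup).
```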

Download the scripts to fetch and process Amazon CloudFront logs with Awstats.

Have a nice journey exploring the cloud 😉

5 Comments

  1. Hello there and thanks for the information posted on your blog.

    It seems we do not have the same CloudFront access log format.
    What I have is what is described here: http://docs.amazonwebservices.com/AmazonCloudFront/latest/DeveloperGuide/index.html?AccessLogs.html

    Fields are in that order (in my case and in Amazon’s docs):
    #Fields: date time x-edge-location c-ip x-event sc-bytes x-cf-status x-cf-client-id cs-uri-stem cs-uri-query c-referrer x-page-url c-user-agent x-sname x-sname-query x-file-ext x-sid

    Whereas you propose to use the following LogFormat:
    LogFormat="%time2 %cluster %bytesd %host %method %virtualname %url %code %referer %ua %query"

    Can you please check the order of your fields?

    Best regards,

    Kmon

    • Dear Kmon

      What you are looking at is the streaming distribution log format.

      The following is an example of a log file for a streaming distribution.
      #Version: 1.0
      #Fields: date time x-edge-location c-ip x-event sc-bytes x-cf-status x-cf-client-id cs-uri-stem cs-uri-query c-referrer x-page-url c-user-agent x-sname x-sname-query x-file-ext x-sid

      The following is an example log file for a download distribution.
      #Version: 1.0
      #Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query

      My script only parses download distribution logs, not streaming ones. Please change the fields in the Awstats config to match the streaming distribution format. It would be great if you could share a working config for it.

      Cheers
      Imthiaz

      • You are absolutely right.

        Naively adapting awstats to eat CloudFront's streaming format did not help much.
        In fact, many events can happen while streaming (connect, play, seek, stop, disconnect, etc.), and sc-bytes is unfortunately only the total bytes transferred to the client up to that event (one log line per event).
        The naive solution I applied was to concentrate on "stop" events and count the amount of data transferred at those points.
        As you can imagine, a client pressing stop (e.g., to pause the video) and then play again is counted "twice": say X KB at the first stop and then Y KB (where Y includes X) at the second, resulting in stats that unfortunately did not make any sense.

        If anyone has found a simple and neat solution to this problem, please post a comment!

        Cheers,

        Kmon
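Since sc-bytes is cumulative per connection, one way to avoid the double counting described in the comment above is to keep only the largest sc-bytes value seen for each x-cf-client-id. This is a sketch of that idea, not a solution from the thread; the field positions follow the streaming #Fields line quoted earlier.

```python
def bytes_per_connection(lines):
    """Return total bytes transferred, counting each connection once.

    sc-bytes (field 5) is cumulative for a connection, so the largest
    value per x-cf-client-id (field 7) is that connection's true total.
    """
    totals = {}
    for line in lines:
        if line.startswith("#"):  # skip #Version / #Fields headers
            continue
        fields = line.rstrip("\n").split("\t")
        conn_id = fields[7]
        sc_bytes = int(fields[5])
        totals[conn_id] = max(totals.get(conn_id, 0), sc_bytes)
    return sum(totals.values())
```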

  2. This is the code I use for parsing streaming logs, if it helps at all.

    First, reading all the logfiles in a directory…

    public function process_logs($x)
    {
        // First we'll create an array of all viable log files in directory $x
        $log_files = array();
        $handle = opendir($x);

        // for EACH file in this directory
        while (($file = readdir($handle)) !== false) {
            // if the filename has the log stem in it then add it to the list
            if (strstr($file, "E9PBX")) {
                echo "Adding file: $file\n";
                $log_files[] = $file;
            }
        }
        closedir($handle);

        foreach ($log_files as $logfile) {
            // Open the file (readdir() returns bare names, so prepend the directory)
            $fh = fopen($x . '/' . $logfile, 'r');

            $i = 0;
            while (!feof($fh)) {
                $i++;
                // READ the LINE
                $content = fgets($fh, 4096);
                // skip the two header lines (#Version and #Fields)
                if ($i > 2) {
                    if (strpos($content, "http://")) {
                        $rEvent = new rtmpEvent();
                        $fields = explode("\t", $content);
                        $rEvent->init($fields);
                    }
                }
            }
            fclose($fh);
        }
    }

  3. And here's the basic row that you'll get back in $fields. I don't use 9, 14, or 16, but I think they are defined in the AWS documentation.

    Side note: I am not a real programmer, I just futz around. I'm sure there are better ways to do this.

    /* [0] => [date]
    [1] => [time]
    [2] => [edge location]
    [3] => [ip address]
    [4] => [event]
    [5] => [leading bytes]
    [6] => OK
    [7] => [unique connection id]
    [8] => [streaming server]
    [9] =>
    [10] => [actual connector]
    [11] => [URL that made the call]
    [12] => [user's machine]
    [13] => [path to file]
    [14] => –
    [15] => [filetype]
    [16] => 1*/