
Efficient logging using Django and Apache

Django 1.3 added support for the Python logging module, which is great if you want to audit the state of your application over time, but possibly overkill. Why? Because your web server already has built-in logging capability which you can extend for your own purposes!

Say you want to implement the simplest of all web applications - a hit counter. This involves recording the fact of an HTTP request. One option is to log hits into a database using Django's ORM. The advantage is that you will always have an up-to-the-second accurate hit count, as featured on Stack Overflow. The major disadvantage is the overhead of writing to the database on every hit. Another option is to use Django's new logging feature to log hits to some form of handler. This handler might be an interface to a database, in which case this is very much the same solution as the first. Alternatively, the handler might log to a file, in which case you have to handle concurrency yourself, as FileHandlers are not multi-process safe. This is where Apache comes to the rescue.

Apache already has an efficient, multi-process-safe file-writing log handler - the access log! It already logs all HTTP requests, so for the purposes of a hit counter, that may be all you need. More likely, though, parsing the log file and mapping request URLs back to Django objects is more trouble than you fancy, so it would help if Apache could log the object id and its ContentType id directly. So locate your Apache server configuration file and tell it to do exactly that:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" %{X-JL6-CID}o %{X-JL6-OID}o" jl6combined
CustomLog /var/log/apache2/access.log jl6combined

This defines a new log format, based on the popular "combined" log format, and then tells Apache to use it. The two extra entries on the end refer to two custom headers in the outgoing HTTP response. You can set these within Django. The following is a paraphrase of a view of a photo object on this website:

from django.contrib.contenttypes.models import ContentType
from django.shortcuts import render_to_response
from django.template import RequestContext

def showphoto(request, ...):
    photo = ...  # Retrieve the object
    content_type_pk = ContentType.objects.get(app_label="core", model="photo").id
    object_pk = photo.id
    response = render_to_response('photo.html', locals(), RequestContext(request))
    response['X-JL6-CID'] = content_type_pk
    response['X-JL6-OID'] = object_pk
    return response

To add the headers to the outgoing HTTP response, you treat the response object as a dictionary, then return the response as you normally would in your view.

If you use any other web analytics software, you might like to check whether it can cope with additional unknown fields on the end of a log line. If not, there's nothing stopping you from having two access logs - one for traditional analytics, and another containing your enhanced logging information. Remember, this is just logging a couple of IDs for the purposes of a hit counter. An HTTP header can contain anything you like, so you could equally be logging timing information, server load, custom error codes, or a random Shakespearean sonnet - as long as you can encode it in under 8KiB of ASCII.
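For instance, a small middleware is one place to set such a header for every view at once. The following is a hypothetical sketch, not code from this site, written in the old-style middleware shape that Django 1.3 uses; the X-JL6-TIME header name is made up to match the naming above:

```python
import time

class TimingMiddleware:
    """Hypothetical middleware: expose per-request elapsed time as a
    response header that Apache can log with %{X-JL6-TIME}o."""

    def process_request(self, request):
        # Record when the request arrived.
        request.jl6_start = time.time()

    def process_response(self, request, response):
        # Elapsed milliseconds; falls back to 0 if process_request never ran.
        start = getattr(request, 'jl6_start', time.time())
        response['X-JL6-TIME'] = str(int((time.time() - start) * 1000))
        return response
```

Because the middleware only treats the request and response as attribute/item stores, the idea carries over unchanged to newer middleware styles.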

There are just two minor drawbacks to this scheme.

Firstly, the additional headers are transmitted to the client. If you are logging any sensitive information about your application's internal operations, consider this a huge information leak. Use a reverse-proxying load balancer to strip out anything incriminating. Chances are that the information you want to log is perfectly benign, and something that might well appear in a URL anyway.
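As a sketch of that stripping, assuming an nginx reverse proxy sits in front of Apache (the upstream name here is invented; the header names match the examples above):

```nginx
location / {
    proxy_pass http://django_backend;
    # Hide the logging headers from clients. Apache has already written
    # its access log entry by the time the response reaches the proxy,
    # so the logged values are unaffected.
    proxy_hide_header X-JL6-CID;
    proxy_hide_header X-JL6-OID;
}
```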

Secondly, you will have to process your logs to read the log entries back into an analytics database before you can generate your funky charts and top-content indices. If you absolutely need up-to-the-second log-based analytics, forget it and stick to logging directly into a database. However, it's much more likely that your statistics can happily be an hour, or a day, behind, and then you get to incur the cost of database writes in one summarizing process that can run outside peak hours or off your production system, rather than millions of tiny writes throughout the day to a live database.

Here's a paraphrase of some PostgreSQL that I run in a cron job immediately after the logrotate script has archived off the day's access logs. We pick up after the point where the log has been passed through logresolve, and tee'd into a pipe which feeds psql:

create temporary table tmp_apache(record TEXT);

\copy tmp_apache from pstdin

create temporary table tmp_accesslog as
select 'jl6'::text as site,
       extract(year from accesstime) as yyyy,
       extract(month from accesstime) as mm,
       extract(day from accesstime) as dd,
       remotehost, identuser, http_user, accesstime,
       http_method, http_request_uri, http_version,
       http_response_code, http_response_size,
       http_referrer, http_user_agent, cid, oid
from (
    select m[1] as remotehost,
           m[2] as identuser,
           m[3] as http_user,
           (to_char(to_timestamp(m[4], 'DD/Mon/YYYY:HH24:MI:SS'),
                    'YYYY-MM-DD HH24:MI:SS ')
            || split_part(m[4], ' ', 2))::timestamp with time zone as accesstime,
           m[5] as http_method,
           m[6] as http_request_uri,
           m[7] as http_version,
           m[8]::smallint as http_response_code,
           case m[9] when '-' then 0
                     else m[9]::integer end as http_response_size,
           m[10] as http_referrer,
           m[11] as http_user_agent,
           case when m[12] is null then 0
                when m[12] = '-'   then 0
                when m[12] = ''    then 0
                else m[12]::integer end as cid,
           case when m[13] is null then 0
                when m[13] = '-'   then 0
                when m[13] = ''    then 0
                else m[13]::integer end as oid
    from (
        select regexp_matches(record,
            E'(.*) (.*) (.*) \\[(.*)\\] "(\\w+) (.*) HTTP/(\\d\\.\\d)" (\\d+) (.*) "(.*)" "(.*)"(?: (\\d+) (\\d+))?') as m
        from tmp_apache
    ) as subquery1
) as subquery2;

-- Count up hits by cid and oid:
create temporary table accesslog_jl6_by_cidoid_new as
select cid, oid, count(*) as hit_count
from tmp_accesslog
group by 1, 2;

-- Update the running totals:
update accesslog_jl6_by_cidoid a
set hit_count = a.hit_count + b.hit_count
from accesslog_jl6_by_cidoid_new b
where a.cid = b.cid and a.oid = b.oid;

-- Insert new running totals if cidoid doesn't exist:
insert into accesslog_jl6_by_cidoid
select * from accesslog_jl6_by_cidoid_new a
where not exists (
    select null from accesslog_jl6_by_cidoid b
    where a.cid = b.cid and a.oid = b.oid);

-- Update the rank of the running totals:
update accesslog_jl6_by_cidoid
set hit_count_rank = v1.rn
from (
    select rank() over (partition by cid order by hit_count desc) as rn,
           cid, oid
    from accesslog_jl6_by_cidoid
) as v1
where accesslog_jl6_by_cidoid.cid = v1.cid
  and accesslog_jl6_by_cidoid.oid = v1.oid;
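In crontab form, the driving pipeline might look something like this (the schedule, paths, database name, and SQL filename here are invented for illustration):

```shell
# Hypothetical crontab entry, run after logrotate: resolve IPs, keep an
# archived copy, and feed the records into the SQL above via psql.
30 4 * * * logresolve < /var/log/apache2/access.log.1 | tee /var/log/archive/access-jl6.log | psql -d analytics -f summarize_hits.sql
```

The `\copy tmp_apache from pstdin` at the top of the SQL reads from psql's standard input, which is what makes this pipeline arrangement work.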

Damned-cool voodoo! Regular expressions are subtle and hairy, so I make no claims as to how optimal the parsing phase is here.
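For comparison, the same regular expression can be exercised in plain Python, which is a convenient way to sanity-check the parsing phase. The sample log line below is invented:

```python
import re

# The same pattern used in the SQL above, against an invented sample
# line in the jl6combined format (cid=9, oid=42 at the end).
LOG_RE = re.compile(
    r'(.*) (.*) (.*) \[(.*)\] "(\w+) (.*) HTTP/(\d\.\d)" '
    r'(\d+) (.*) "(.*)" "(.*)"(?: (\d+) (\d+))?'
)

line = ('203.0.113.7 - - [14/Apr/2011:12:00:00 +0000] '
        '"GET /photo/42/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0" 9 42')

m = LOG_RE.match(line)
# Groups 12 and 13 are the optional trailing cid/oid fields; treat
# missing values as 0, mirroring the CASE expressions in the SQL.
cid = int(m.group(12)) if m.group(12) else 0
oid = int(m.group(13)) if m.group(13) else 0
```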

I won't bore you with the details of writing queries against this summary table; the aim of this post is to document a creative use of the web server log, not really to tell you how to implement a hit counter. For that, check out 1995.

Comments

  1. Sid Mitra, on Thursday 14th April 2011, said:

    One can write a handler that uses celery to write to DB, to make it async from the normal request/response process.

    You can write a custom task to save it to a different DB too, if you don't want to incur the cost on the production DB.

