Querying Webtrends analytics via ODBC with SQLAlchemy

Webtrends is a web traffic analytics package, similar to Google Analytics. Recently we needed to pull data out of reports on a Webtrends instance.
Luckily enough, Webtrends has a nice RESTful data extraction API; not so luckily, it is only available for Webtrends Analytics 9 instances, while our requirements limited us to Webtrends 8.

Prior to Webtrends 9, the official data extraction method is a Windows-only ODBC driver, primarily used for connecting Excel spreadsheets and Microsoft ADO applications. The driver provides a pseudo-relational-database interface to pre-made reports on a Webtrends instance, which you can query using a simple subset of SQL.
Notice I use qualifiers like “pseudo” and “interface”. That’s because what really happens in the background is this: the driver makes an HTTP call (carrying details such as the SQL) to a web service on the Webtrends instance, and the web service returns binary data representing the contents of the queried report, which the driver then presents as a table. The reports themselves aren’t real tables in a real database, although I’m sure they’re represented that way somewhere inside the Webtrends system; what we get back is the set of aggregated data normally used to draw the pretty graphs and bar charts in the web interface.
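To make the “pseudo” part concrete, here’s roughly what talking to the driver directly over plain PyODBC might look like, before SQLAlchemy enters the picture. This isn’t from SQLAWebtrends; the DSN, report name and column names are illustrative assumptions.

import pyodbc

# Connect via the Windows system DSN configured for the driver
# (DSN name here is an assumption)
conn = pyodbc.connect("DSN=Webtrends")
cursor = conn.cursor()

# The driver accepts a simple subset of SQL against report "tables"
cursor.execute("SELECT User, PageHits FROM UsersByPages")
for row in cursor.fetchall():
    print row.User, row.PageHits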

To make life easier for ourselves, since we were already using SQLAlchemy for querying PostgreSQL tables and would need to mock our Webtrends data at some point, it made sense to use SQLAlchemy for all the data objects; with that in mind I wrote the SQLAWebtrends dialect.
After installing SQLAWebtrends you can create ORM classes matching the Webtrends reports you want to query, and then, using a specially formatted DSN, run queries against them as you would any other DB.

Considerations in making a dialect for Webtrends:

  • Nearly all of SQLAlchemy’s special features, from full unicode support to field binding and the various row-counting hacks, need to be disabled, as the feature-set provided by the Windows ODBC driver is extremely limited (a rough sketch of these overrides follows this list).
  • The method for getting metadata such as table names and columns needed to be overridden to use a Microsoft ADO-compatible approach.
  • Some combination of PyODBC’s column pre-binding and the Webtrends driver’s complete lack of features means any attempt at binding will fail, so before executing queries I needed to “unbind” them, replacing ? placeholders with the actual data and relying on the built-in filtering methods to provide any sort of field escaping/filtering.
  • LIMIT clauses are also extremely, erm, limited, for want of a better word, so that needed overriding too.
  • Finally, SQLAlchemy likes to wrap every value, including names, in quotes; Webtrends doesn’t like this, and quite frankly borks, so we disable this functionality too.
  • Bonus point: PyODBC rocks.
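
What do these overrides look like in practice? Below is a minimal sketch, not the actual SQLAWebtrends source: the class layout and the supports_* flags are SQLAlchemy’s real dialect API, but the unbinding logic is simplified for illustration.

from sqlalchemy.connectors.pyodbc import PyODBCConnector
from sqlalchemy.engine import default

class WebtrendsDialect(PyODBCConnector, default.DefaultDialect):
    name = "webtrends"
    supports_unicode_statements = False   # no unicode support
    supports_unicode_binds = False
    supports_sane_rowcount = False        # disable row-counting hacks
    supports_sane_multi_rowcount = False

    def do_execute(self, cursor, statement, parameters, context=None):
        # "Unbind" the query: splice escaped values into the SQL in
        # place of ? placeholders, since the driver can't handle
        # bound parameters at all.
        for value in parameters or ():
            quoted = "'%s'" % str(value).replace("'", "''")
            statement = statement.replace("?", quoted, 1)
        cursor.execute(statement)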

Caveats for use:

  • Your SQLAlchemy models for Webtrends reports shouldn’t include primary-key columns, because frankly there probably aren’t any unique primary keys in the reports. This doesn’t matter: SQLAlchemy won’t care as long as Webtrends doesn’t complain (which it won’t).
  • As with any other SQLAlchemy model you can call properties in the ORM class anything you like, but the underlying table column names need to match up to the column names in the Webtrends report.
  • The ODBC driver and web service don’t support JOINs, so you can’t use these with your ORM models either.
  • The iterator wrapper around the PyODBC cursor instance returned by queries will only ever return one row unless you call .yield_per(1) on the query-set. I haven’t had time to figure out why this is the case, but I suspect it’s something to do with row pre-buffering, which yield_per disables.
  • Every now and again you’ll see rows with lots of blank values in them, except that any number values (“measures” in Webtrends) will be higher. Look closely and you’ll see each one is the sum of the values in the rows that follow it, up to the next such mostly-blank row. These are aggregated summary rows, giving totals for a particular subset of the data (depending on which field is the “dimension” for the report, a sort of primary key used in creating reports). Unless you’re after this data as well, I find the best thing is to do a quick check for that many blank fields and skip these rows; a rough sketch follows this list.
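
Here’s the sort of check I mean for those summary rows. The column names and the all-blank heuristic are my own illustration (matching the example model below), not anything provided by SQLAWebtrends.

# Columns that come back blank on an aggregated summary row
# (illustrative; pick your report's dimension/string columns)
DIMENSION_COLUMNS = ("User", "PagesURLs", "TimePeriod")

def is_summary_row(row):
    # Treat a row as a roll-up if every dimension column is blank
    return all(not getattr(row, name) for name in DIMENSION_COLUMNS)

# ...then, while iterating over a query-set:
# for r in query:
#     if is_summary_row(r):
#         continue  # skip the summary row, keep the detail rows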

Example using models and running a query:

from sqlalchemy.orm import mapper, sessionmaker
from sqlalchemy import create_engine
from sqlalchemy import String, MetaData, Column, Table

# You would probably setup your username, password
# host etc. here, including the profile and template
# you want to query.

# Create the DB connection
engine = create_engine(
    "webtrends+pyodbc://%s:%s@%s:80/%s?dsn=Webtrends&profile_guid=%s" %
        (user, password, host, template, profile)
)
metadata = MetaData(bind=engine)
Session = sessionmaker(bind=engine)
session = Session()

# Table schema
wt_user_report = Table('UsersByPages', metadata,
    Column('User', String, nullable=True),
    Column('PagesURLs', String, nullable=True),
    Column('PageHits', String, nullable=True),
    Column('TimePeriod', String, nullable=True),
    Column('StartDate', String, nullable=True),
    Column('EndDate', String, nullable=True)
)

# ORM class
class WTUserReport(object):
    pass
mapper(WTUserReport, wt_user_report)

# Create a query
query = session.query(WTUserReport).filter_by(
    TimePeriod="2010.m06.d22"
).yield_per(1) # Remember we need this for the iterator to work

# Iterate over the query-set and print some columns
for r in query:
    print "User %s hit %s %s times" % (
        r.User,
        r.PagesURLs,
        r.PageHits,
    )