pseudofish

powered by text files

python

Downloading Google Analytics data from a Python script

If you want to regularly extract data from Google Analytics, it is worthwhile scripting the process. However, it was more complex than I expected.

I started with jaco’s post, and then added some detail from there.

Setting up your Google account

Since Google has moved to OAuth2, you need a service account to get a script to work.

There are a few steps to do from the Google Developers Console:

  1. Create a project (eg. GAStatistics). This step may take a minute or two.
  2. From the project page, go to APIs & auth -> APIs, and add permissions for the Analytics API (accepting the T&Cs)
  3. From the Credentials section of APIs & auth, create a new client ID, and pick Service Account when prompted. You should now automatically download a .p12 file that includes your client ids and secret keys.
  4. For your service account, save the email address that is generated. You’ll use in a few places.

Next you need grant your service account access to your Google Analytics account.

  1. From Google Analytics, pick the web property you want to gather data from.
  2. Navigate to the Admin tab at the top, and then to the User Management section (you need to have permissions, otherwise ask your site admin)
  3. Add your service account email to Add permissions for:, and choose the type of permissions. I usually pick Read & Analyze, as I’m paranoid about messing data up with a malformed script.

From Google’s side of things, you’re ready to go.

Starting a Python project

First, you need to install the Google API Python client. I use virtualenv for scripts so I can package dependencies, so my requirements.txt looks like:

pyopenssl
google-api-python-client

For service accounts, you need to use SignedJwtAssertionCredentials, which needs pyopenssl to work. For details, see this, or just include it.

My script isn’t too different from jaco’s example, however I added a few more things to get you started.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
#!/usr/bin/env python
# -*- coding: utf-8; mode: python -*-
"""
Extract a snapshot of Google Analytics data.
"""

from apiclient.discovery import build

from oauth2client.client import SignedJwtAssertionCredentials
from oauth2client.file import Storage

import httplib2

PRIVATE_KEY = '<...>.p12'  # where you store your private key
GSERVICE_EMAIL = '<...>@developer.gserviceaccount.com'
SITE = 'ga:<...>'  # eg. ga:1234

SCOPE_FEEDS = 'https://www.google.com/analytics/feeds/'
CREDENTIALS_STORE = 'credentials.dat'  # credentials cache file


def authorize(private_key_filename,
              service_email,
              scope,
              credentials_store=CREDENTIALS_STORE):
    """Authorise a service account against Google APIs"""
    http = httplib2.Http()
    storage = Storage(credentials_store)
    credentials = storage.get()

    if credentials is None or credentials.invalid:
        key = None
        with open(private_key_filename, 'rb') as f:
            key = f.read()

            credentials = SignedJwtAssertionCredentials(service_email,
                                                        key,
                                                        scope=scope)
            storage.put(credentials)
    else:
        credentials.refresh(http)

    http = credentials.authorize(http)
    return http


def print_data_table(results, trim=True):
    """Given the results from a query, print to screen."""
    if trim:
        format_str = '%25.24s'
    else:
        format_str = '%25s'

    if results.get('rows', []):
        # Print headers.
        output = []
        for header in results.get('columnHeaders'):
            output.append(format_str % header.get('name'))
        print ''.join(output)

        # Print rows.
        for row in results.get('rows'):
            output = []
            for cell in row:
                output.append(format_str % cell)
            print ''.join(output)
    else:
        print 'No Results Found'

def print_top_pages(service, site, start_date, end_date, max_results=10):
    """Print out top X pages for a given site."""
    query = service.data().ga().get(
        ids=site,
        start_date=start_date,
        end_date=end_date,
        dimensions='ga:hostname,ga:pagePath',
        metrics='ga:pageviews',
        sort='-ga:pageviews',
        filters='ga:hostname!=localhost',
        max_results=max_results).execute()

    print_data_table(query)


if __name__ == '__main__':
    http = authorize(PRIVATE_KEY, GSERVICE_EMAIL, SCOPE_FEEDS)
    service = build(serviceName='analytics', version='v3', http=http)

    # TODO(gmwils): check if service is valid before continuing
    start = '31daysAgo'
    end = 'yesterday'

    print 'Top 10 pages'
    print_top_pages(service, SITE, start, end, 10)

This structure gives a good starting point for playing around with different queries.

I did run into an issue where credentials wouldn’t refresh from the cache. This has been fixed, but may need to patch manually.

Calling Google Analytics from the API

There is a lot of different things you can extract from GA Core Reporting API. I found it very helpful to start with the common queries provided as examples.

There are also some API explorers that help with trying queries out.

For extracting data out of the results, there is example code, and details on the response fields.

I found that I was often using filters to get at the data I wanted, and the filter syntax is documented.

Some examples I found useful:

filters='ga:hostname!=localhost'  # exclude development data
filters='ga:source=~email'  # only email campaigns
filters='ga:source=~email;ga:campaign=~^\w'  # email campaigns, with a non-blank campaign tag

To avoid date arithmetic in Python, the API also allows relative dates, so you can write things like:

start = '31daysAgo'
end = 'yesterday'

You can use the dimensions tag to split by additional fields, such as ga:campaign or ga:date.

There are various selectors that can be used in queries, including ones for custom variables, goals and dimensions. Our conversion team has several custom goals setup, so I can query for the results using:

metrics='ga:goal07Completions,ga:goal07Starts,ga:goal07ConversionRate'

I feel like I’m only scratching the surface with Google Analytics every time I use it. Figuring out scripting allows me to dig into further integrations for our normal workflows.