Tim Nash "stuff" Blog

Centralising your Analytics in a decentralised way


I am working on timnash.co.uk v3.0, oh yeah!! As part of the new site I will be using it as a place to run more experimental behaviour modelling and analytical bits. More importantly, I want to make it easy for people to see what I’m gathering and what I’m doing with it. I haven’t entirely worked out how I’m going to do that yet, so stay tuned. However, one of the things I have been pondering is how I am going to combine my disparate stats-gathering systems.

Currently I run:

  • Google Analytics
  • Google Website Optimizer and other A/B testing
  • GetClicky
  • Heatmap software
  • Occasional CSS History profiler
  • Surveys

If I want to track a user across all of these, currently I can’t. For example, if I want to see the clickmap of a user, I can’t compare it with their Google Analytics data. This is fine, but the more work I do, the more I want to be able to follow a user through the entire experience. At this stage many people may start to think, OK, reduce the number of third-party services, and this thought has occurred to me. The reason I use GetClicky and Google Analytics is that I can’t do better myself; it’s that simple.

Privacy Concerns

The biggest issue when linking multiple systems together is the inevitable extra privacy concern. While these systems are separate they are pseudo-anonymous; combining them makes it much easier to identify a user, especially when linked with a login or commenting system where they have to give their email and other information such as their name. However, in many ways I think centralising your data makes these concerns easier to deal with. For example, you can set up a single “remove me from your tracking” service (you can also track how many people have opted out! Oh wait, is that wrong?). So centralising my data will not only make things easier for me, it will make things easier for visitors who have privacy concerns.

Central storage area

The obvious way to centralise all the data is to create a central storage repository and put data in it. Of course, this immediately presents several obvious problems.

  • Replication
  • Single Point of Failure
  • Reliability

Replication – There is rarely a good reason in life to have two working copies of something, your analytics data included. Apart from having to maintain both copies, you also have to check data integrity, and the second copy takes up space and therefore costs more to store.

Single Point of Failure – While not normally a problem, something that is continually being read from and written to has a limited life expectancy. This is made worse by the fact that several parts of the site will rely on the system to make choices; if the system falls over, or worse is just slow, it will cause issues throughout the site.

Reliability – One of the reasons to use third-party services is so I don’t have to handle such things as uptime and reliability. Any benefit in getting someone else to do the work is lost if I then redo it.

The advantage is speed: as long as it’s up, we should be able to access everything instantly.

Decentralised with key link

The second approach to look at is linking all the various services with a common key. Most third-party services worth anything will allow you to store a custom value against a visitor. If the same custom value is used per visitor for all the services, then the visitor can be tracked through calls to each service’s API. This is easier said than done…
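As a rough sketch of the common-key idea, the snippet below builds the custom data that would go to each service for one visitor. The adapter functions and the field names (`custom_var`, `session_tag`, `respondent_ref`) are hypothetical placeholders, not the real parameter names of any of the services listed above; each real service has its own way of accepting a custom value.

```python
from typing import Callable, Dict

# Hypothetical adapters: each one builds the payload a single service would
# receive. The field names are illustrative assumptions only.
def analytics_payload(visitor_key: str) -> Dict[str, str]:
    return {"custom_var": visitor_key}

def heatmap_payload(visitor_key: str) -> Dict[str, str]:
    return {"session_tag": visitor_key}

def survey_payload(visitor_key: str) -> Dict[str, str]:
    return {"respondent_ref": visitor_key}

ADAPTERS: Dict[str, Callable[[str], Dict[str, str]]] = {
    "analytics": analytics_payload,
    "heatmap": heatmap_payload,
    "survey": survey_payload,
}

def tag_all_services(visitor_key: str) -> Dict[str, Dict[str, str]]:
    """Return the custom data to send to every service for one visitor."""
    return {name: build(visitor_key) for name, build in ADAPTERS.items()}
```

Because every payload carries the same key, a later lookup can ask each service for that key and stitch the results together, without any service needing to know about the others.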

A few problems immediately come to mind:

  • Identifying the unique visitor
  • Linking a visitor after the fact
  • What controls the initial identification

It also has the potential for the same single point of failure as the totally centralised solution: if the service that tags visitors is down, the data is lost. This, however, seems a much smaller risk. At worst some visitors are not tagged correctly, and it probably means the site has far worse problems!

Identifying the unique visitor – At first glance this seems easy, but doing it accurately is actually more difficult and is a post in its own right. Once the visitor is identified, the next problem is choosing a naming strategy for the visitor ID. If we had a centralised relational database this would be easy, as it would be the ID of the row, but we don’t. Some ideas I played with were timestamp, IP and profile type, or some combination of these.

Once the ID of the unique user is set and stored on their machine, whether through a session, a cookie or some more hardy persistent storage, it can simply be picked up in the future.

Linking a visitor after the fact – There can be times when a user is identified only after a service has already stored data about them. Some systems will automatically tie the old data in with the new; others won’t, and unfortunately there is not much you can do barring a recursive check and additions. For example, let’s assume a user visits a site on a laptop from home, then visits at work. We treat his work visit as a different instance until he logs in, at which point we can identify this new visitor as the same user. However, we have already sent a pile of custom keys to all our analytical packages.
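One way to picture the laptop-and-work example is a small lookup that records which keys turned out to belong to the same person. This is only a sketch under my own assumptions: the `KeyLinker` class and its method names are made up for illustration, and it keeps the links in memory rather than anywhere durable.

```python
from typing import Dict, List, Optional

class KeyLinker:
    """Records which browser keys belong to a user once they are identified."""

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}  # browser key -> username

    def link(self, browser_key: str, username: str) -> None:
        """Attach a key to a user, e.g. at the moment they log in."""
        self._owner[browser_key] = username

    def keys_for(self, username: str) -> List[str]:
        """All keys known for a user, for querying each service in turn."""
        return sorted(k for k, u in self._owner.items() if u == username)

    def owner_of(self, browser_key: str) -> Optional[str]:
        """The user a key belongs to, or None if it was never linked."""
        return self._owner.get(browser_key)

linker = KeyLinker()
linker.link("home-laptop-key", "tim")
linker.link("work-desktop-key", "tim")
print(linker.keys_for("tim"))  # ['home-laptop-key', 'work-desktop-key']
```

The keys already sent to the analytics packages stay as they are; the link only tells you which keys to ask for when you want the whole picture for one person.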

What controls the initial identification – This is a trickier issue. In the scenario of my blog, it is a simple WordPress plugin that checks whether persistent storage or a cookie is present on the user’s machine, determines the ID and adds a cookie as needed.
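The check-then-tag logic could look something like the sketch below. It is written in Python purely for illustration and is not the WordPress plugin itself; the cookie name `visitor_key`, the one-year lifetime and the use of a random UUID as the ID are all assumptions of mine.

```python
import uuid
from http.cookies import SimpleCookie
from typing import Optional, Tuple

COOKIE_NAME = "visitor_key"  # hypothetical cookie name

def identify(cookie_header: str) -> Tuple[str, Optional[str]]:
    """Return (visitor key, Set-Cookie value), or (key, None) if already tagged."""
    cookies = SimpleCookie()
    cookies.load(cookie_header)
    if COOKIE_NAME in cookies:
        # Returning visitor: reuse the key, nothing new to set.
        return cookies[COOKIE_NAME].value, None

    # New visitor: placeholder ID; any naming strategy could be swapped in.
    key = uuid.uuid4().hex
    fresh = SimpleCookie()
    fresh[COOKIE_NAME] = key
    fresh[COOKIE_NAME]["path"] = "/"
    fresh[COOKIE_NAME]["max-age"] = 60 * 60 * 24 * 365  # assumed: one year
    return key, fresh[COOKIE_NAME].OutputString()
```

Calling `identify("")` simulates a first visit and returns a new key plus the cookie to set; passing that cookie back in on the next request returns the same key and no new cookie.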

So, two competing systems, both with problems. The solution seems to be a blend of the two.

Decentralised in a Centralised way

I’m going to run through two examples of the way I’m going to centralise my data: one for here on timnash.co.uk and the other for a membership site.

On timnash.co.uk I’m going for a totally decentralised approach. A WordPress plugin will identify users based on whether they have been tagged before. As I have no easy way to tell whether they are a previous user on a different browser or machine, except if they comment, there is no major advantage in maintaining any form of database control. Users will be tagged with a combination of timestamp+profileid+random number. This is then included as custom data in all the stats-gathering packages and stored on the user’s machine using the browser’s persistent storage. If a user wishes me not to collate individual data, they can opt out via the privacy page. This will place a persistent storage cookie telling the system not to attach the key to their pages; to opt out entirely they will still need to individually opt out of each service.
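A minimal sketch of the tagging and opt-out logic follows. The timestamp+profileid+random number combination is from the plan above, but the hyphen separators, the six-digit random part, and the names `tracking_opt_out` and `visitor_key` are assumptions for illustration. The `stored` mapping stands in for whatever the browser’s persistent storage holds.

```python
import secrets
import time
from typing import Mapping, Optional

OPT_OUT_FLAG = "tracking_opt_out"  # hypothetical storage name

def make_visitor_key(profile_id: str, now: Optional[float] = None) -> str:
    """Build a key as timestamp + profile id + random number."""
    ts = int(time.time() if now is None else now)
    rand = secrets.randbelow(1_000_000)  # reduces clashes within one second
    return f"{ts}-{profile_id}-{rand:06d}"

def key_for_page(stored: Mapping[str, str], profile_id: str) -> Optional[str]:
    """Return the key to attach to this page, or None if the visitor opted out."""
    if stored.get(OPT_OUT_FLAG) == "1":
        return None
    # Reuse an existing key if the visitor was tagged before.
    return stored.get("visitor_key") or make_visitor_key(profile_id)
```

Returning `None` for opted-out visitors means the page simply sends no custom value, so the third-party services still see the visit but cannot be joined up through the key.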

For a membership site I run, I plan a similar system. However, as it has a login system, individual browser profiles (unique keys) will be stored against a logged-in user. This will allow these profiles to be linked via the username, and has the advantage of spotting password sharers if there are a large number of browser combinations (it should be able to detect this even if users use proxies or are on a corporate network).
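Spotting likely password sharers could be as simple as counting distinct browser keys per username, as in this sketch. The function name and the default threshold of five profiles are arbitrary assumptions; a real cut-off would need tuning against how many browsers a genuine member typically uses.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

def flag_possible_sharers(
    logins: Iterable[Tuple[str, str]], max_profiles: int = 5
) -> List[str]:
    """Return usernames seen with more than max_profiles distinct browser keys.

    Each login is a (username, browser_key) pair.
    """
    profiles: Dict[str, Set[str]] = defaultdict(set)
    for username, browser_key in logins:
        profiles[username].add(browser_key)
    return sorted(u for u, keys in profiles.items() if len(keys) > max_profiles)
```

Because the count is based on the key stored in the browser rather than on the IP address, several members behind the same proxy or corporate network would not be confused with one another.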

So that’s the plan. Anyone see any major issues with it? Let me know, ideally before I fully build it! How are you managing your various data services?

Consulting

Looking to develop a similar system or interested in doing detailed tracking and profiling of users? Why not come and have a chat and see what I can do for you! For more details please contact me or look on my consulting services.

While I no longer offer personal consultancy, if you are interested in going further then please let us know at Coding Futures.

