How bots mess up clickthrough stats and can be used for evil

on June 29th, 2010 by Richard Orelup in Web tech
Jump down to the “After the technical stuff” section if you don’t want a greater understanding of the technical side of how URL-lookup Twitter bots and traffic tracking on the internet work.

Within the first couple of seconds of posting a shortened URL on Twitter, I had all this bot activity hit my site.

From access.log of

That’s 22 bots. More could have come later, but this was just the initial blast. This is from an account with few followers, so it has nothing to do with follower count.

But here is the thing. On every single one of those hits on my server, PHP is run to generate that page. This can be quite an annoyance for people whose pages take a long time to load, since that many concurrent hits could really screw you up (it shouldn’t, but that’s another article).

For most of us, though, this means little to nothing. If you use they won’t count as a clickthrough because they have tracked things for a long time and know which ones are bots. This is one of the biggest issues with, they don’t seem to do any filtering (just checked, they still don’t filter, and I got 28 bots), so your numbers all include bot lookups. For people like myself who don’t have the vast data does, I just have to kind of work with the user agents and track other IPs that seem strange. As well, I can track whether they later loaded some JS on the destination pages (that’s a whole other product :) ).
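Here is a rough sketch of the "work with the user agents" approach. It assumes the access log uses Apache's combined format, where the user agent is the last quoted field on each line. The marker list, the function names, and the sample log lines are all made up for illustration; a real filter would need a list that is maintained over time.

```python
import re

# Hypothetical marker list: lowercase substrings that often show up in
# automated clients' user agents. Illustrative only, not exhaustive.
BOT_MARKERS = ("bot", "crawler", "spider", "curl", "wget",
               "python-requests", "java/")

# In the combined log format the user agent is the last quoted field.
_UA_AT_END = re.compile(r'"([^"]*)"\s*$')


def user_agent_from_log_line(line: str) -> str:
    """Return the trailing quoted field of a log line, or '' if absent."""
    match = _UA_AT_END.search(line)
    return match.group(1) if match else ""


def looks_like_bot(user_agent: str) -> bool:
    """Heuristic: empty or '-' user agents and known markers count as bots."""
    ua = user_agent.strip().lower()
    if not ua or ua == "-":
        return True
    return any(marker in ua for marker in BOT_MARKERS)


def count_hits(lines):
    """Return (human_hits, bot_hits) for an iterable of log lines."""
    humans = bots = 0
    for line in lines:
        if looks_like_bot(user_agent_from_log_line(line)):
            bots += 1
        else:
            humans += 1
    return humans, bots


# Invented sample lines (documentation-range IPs), not real log data.
SAMPLE_LINES = [
    '192.0.2.1 - - [29/Jun/2010:10:00:00 -0500] "GET /post HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.2 - - [29/Jun/2010:10:00:01 -0500] "GET /post HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"',
    '192.0.2.3 - - [29/Jun/2010:10:00:01 -0500] "GET /post HTTP/1.1" 200 512 "-" "-"',
    '192.0.2.4 - - [29/Jun/2010:10:00:02 -0500] "GET /post HTTP/1.1" 200 512 "-" "Java/1.6.0_20"',
]

if __name__ == "__main__":
    humans, bots = count_hits(SAMPLE_LINES)
    print(f"humans={humans} bots={bots}")
```

Substring matching like this is cheap but easy to fool, since a bot can send any user agent it likes. That is why it only works as one signal alongside odd-looking IPs and the JS check.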

When it comes to most website tracking, bots don’t run JavaScript (well, most of the ones you will deal with; some do, but again not the ones we are dealing with here) or even grab the JS files, as you can see in the logs. So the Google Analytics code is never run, and you never even see that they were there.
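The "did they grab the JS files" check can be sketched from the server logs alone. This is a minimal illustration, assuming the log has already been parsed into (client IP, request path) pairs; the function name and the sample requests are hypothetical. It treats an IP that loaded the page but never requested any .js file as a likely bot.

```python
def split_by_js_fetch(requests, page_path):
    """Split clients that loaded page_path by whether they also fetched JS.

    requests: iterable of (client_ip, path) tuples taken from the log.
    Returns (likely_browsers, likely_bots) as two sets of IPs.
    """
    loaded_page = set()
    fetched_js = set()
    for ip, path in requests:
        # Drop any query string before comparing paths.
        clean = path.split("?", 1)[0]
        if clean == page_path:
            loaded_page.add(ip)
        elif clean.endswith(".js"):
            fetched_js.add(ip)
    likely_browsers = loaded_page & fetched_js
    likely_bots = loaded_page - fetched_js
    return likely_browsers, likely_bots


# Invented sample requests (documentation-range IPs), not real log data.
SAMPLE_REQUESTS = [
    ("192.0.2.10", "/post"),
    ("192.0.2.10", "/static/site.js?v=2"),
    ("192.0.2.20", "/post"),
    ("192.0.2.30", "/post?utm=x"),
    ("192.0.2.40", "/static/site.js"),
]

if __name__ == "__main__":
    browsers, bots = split_by_js_fetch(SAMPLE_REQUESTS, "/post")
    print("likely browsers:", sorted(browsers))
    print("likely bots:", sorted(bots))
```

This is only a heuristic. A real browser with the script already cached would look like a bot here, and several people behind one shared IP would be lumped together.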

Okay, now that we have a bit of the technical stuff behind us, let’s look at the issue I brought up related to this. First you need to see this neat report from that shows 29 bots who RT’d this link. Only 4 actual people RT’d it. Now, it is not uncommon to get a random RT or two from a bot because you hit some keyword it is just repeating, but 29 is not very common (if it were, tons of VL articles would have 1000 hits). What this points to is that someone fed this address to the bot network to be RT’d.

After the technical stuff

What isn’t going to be obvious to most is what real advantage there is to this. In most cases this is more of a search or @mention type of spam, just hoping to pick up a little traffic that way. But this is why I had to put the technical stuff out there first, because that’s where you start to see how this would be even more advantageous to someone like VL, who is doing things a little differently.

If you go to any VL article, at the bottom you will see a hit counter. That hit counter is run through the PHP portion of the page. So unlike the other tracking methods that use JS and are unaffected by bots, this counter is affected. The second an article is tweeted, the hit counter is going to go up. If you would like to see for yourself, grab a random article from a few days ago that you think no one will be reading right now. Note the original hits. Then create an link for it. Tweet that link from an account that you expect no one to see, and ask people not to click it. If you add a “-” to the end of the url from you will get the stats. Notice how you had around 20+ “accessed directly” hits, and if you go back to the VL article you originally picked you will see that its count went up by that number plus 1 (because you just went to it again). Here are some numbers for one I just did on a VL article:

Original VL article hits: 26
link: 21 “clickthroughs”
Post-tweet hits: 49
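To make the arithmetic behind those numbers explicit, here is a small worked check. It assumes, as argued above, that all 21 reported "clickthroughs" were bot lookups and that I revisited the article exactly once. The function and variable names are just for illustration.

```python
def expected_counter(before: int, shortener_hits: int, own_revisits: int = 1) -> int:
    """Counter value predicted if every shortener hit also bumped the page counter."""
    return before + shortener_hits + own_revisits


# Numbers reported in the post.
hits_before = 26
shortener_clickthroughs = 21
hits_after = 49

expected = expected_counter(hits_before, shortener_clickthroughs)  # 26 + 21 + 1
new_hits = hits_after - hits_before
unexplained = hits_after - expected
bot_share = shortener_clickthroughs / new_hits

if __name__ == "__main__":
    print(f"expected={expected} new_hits={new_hits} unexplained={unexplained}")
    print(f"bot share of new hits: {bot_share:.0%}")
```

The prediction comes out to 48 against an observed 49, so one hit is left over. It could be a bot the shortener never saw or a real visitor; the numbers alone can't say. Either way, on these assumptions roughly nine out of ten of the new hits came from bots.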

Now okay, who cares? Why would anyone trust these numbers? Well, that’s where the issue comes in: they make their money from advertising. When they are trying to get someone to pay for ads, they will point them at articles and say, see how many hits they get. The problem is that those numbers are now all out of whack and completely inaccurate. Does this mean that other tracking methods are perfect? Hell no. It’s easy to trick your own analytics software into believing whatever you want it to. But when it comes to tricking Google Analytics on your own blog, who cares, and it takes more effort than it’s really worth. Something as simple as this hit counter, though, is going to be WAY overblown, especially when you are using your Twitter account as an RSS feed.

As well, it is not only loading their hit counter every time a bot hits the page, it’s also loading the ads, as they are hard coded into the HTML. I don’t know what they offer their advertisers to see or whether they are using other methods of logging to check that the ads were actually displayed, but I really doubt it, since they don’t seem to be savvy enough to understand why that would be needed. I would bet their advertisers are not aware of these facts. I’ll try to get more information on this whole situation tomorrow at the Social Media Day event.

So this leaves a lot of unanswered questions. In the marketing material, are they talking about page views and uniques from Google Analytics, or from whatever they are using to track the hit counters? Why have the hit counter there when it is obviously flawed, unless you are trying to give off a feeling that more people are in fact coming to the site? Why would you use these numbers as selling points to potential advertisers in meetings? Were these things done intentionally, or was it just a complete lack of understanding of what any of this actually means? Does ignorance remove you from being ethically responsible for your actions towards a client? Can you really plead ignorance when you are the one hyping and selling it as being real and a great accomplishment? If you are going to sell a product based on your word, shouldn’t you have an understanding of the product you are selling and not be completely ignorant of the nature of it? Is ignorance just a go-to response when something is pointed out?

I know, lots of ethical things to debate in there, and I have a feeling that tomorrow will be a fun day going over it. I have a lot of things to say on those questions, but that’s enough for now. The Synergizer might show up with my view on ethics in technology and how marketing people abuse it (well, I would more argue you can’t abuse ethics when you have none :) ) with no desire to ever really understand it.

I am also not saying that all these things happened here or that there is any directly intentional unethical behavior on VL’s side. That is nearly impossible to know for certain without someone actually coming forward and admitting to it. Probably tomorrow both VL and I will have a better understanding of their tracking and the numbers they are selling.

So hopefully those who read the tech side have a greater understanding of what’s going on behind the scenes and the battles we developers have with tracking. For those who only read the latter part, I hope you have a better understanding of where there are issues with the methods of tracking used in this case and can see how someone would gain financial benefits from doing something like this. I know that was hard for some people to comprehend when they aren’t really tech people and have not put the research into this that I have on my own projects.

