ExpressionEngine CMS
Open, Free, Amazing

Thread

This is an archived forum and the content is probably no longer relevant, but is provided here for posterity.

The active forums are here.

Web analytics discussion

August 15, 2007 10:24am

  • #1 / Aug 15, 2007 10:24am

    Geof Harries

    109 posts

    If you want stats, and real stats, log parsers are the only way to get real numbers.

    I completely disagree with this statement. Perhaps that 30% difference is Google being more accurate and your log files being the opposite.

    With log files you face problems with proxy caching (how fresh is your ISP’s cache?) and browser caching (when a user clicks the back button, the page is likely served from cache rather than requested from the server, so you miss that traffic).

    Log files also make it difficult to track individual users by IP address, because a single user will be lumped in with other users behind the same IP address. Think of the valuable data you are missing there. Finally, if your website is hosted on multiple servers for multiple components, it’s a slow and complex ordeal to thread the different log sources together into a cohesive report.

    Now, using log files as a data source does have some benefits: you own the data (it never has to be exposed to a third party) and it is easy to implement, since the logs are already on your server. But, for me, that’s it. Compare these “benefits” to Google, Mint and others that use client-side JavaScript.

    With JavaScript you can track much more data, and that data is more precise because it’s coming from individual computers, not from a single server. You can also define very specific variables, such as a campaign ID, product, category, price and brand. For marketers, that makes their reporting task much faster and more precise.

    In my experience, JavaScript-based reports are generated much faster than logs and they are more up-to-date; in some cases, instantaneous. Logs have to be regularly rotated… and patiently waited for.
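    To make the campaign-tracking point concrete: server-side, the fields a JavaScript tracker sends usually arrive as query-string parameters on a beacon request. A minimal sketch of unpacking them in Python (the beacon URL and parameter names here are hypothetical, not any real tracker’s API):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical campaign fields a tracker might append to its beacon URL.
FIELDS = ("campaign_id", "product", "category", "price", "brand")

def parse_beacon(url):
    """Extract campaign fields from a tracking-beacon URL; missing ones become None."""
    qs = parse_qs(urlparse(url).query)
    return {field: qs.get(field, [None])[0] for field in FIELDS}

hit = "https://stats.example.com/t.gif?campaign_id=fall07&product=widget&price=9.99"
print(parse_beacon(hit))
```

    A log parser only sees whatever the request line carried; a client-side script can attach these fields explicitly, which is the precision advantage being described.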

  • #2 / Aug 15, 2007 10:40am

    JT Thompson

    745 posts

    It’s impossible to be more accurate than the logs on the site. That’s the only way to get 100% true data. The ability to manipulate the information is only limited by the person and the stats program they’re using.
    There is nothing you can do in Google that you can’t in a log parser and they aren’t even comparable in accuracy. Not even remotely.

    If your info is cached, it’s cached; there’s no difference with the Google script. AWStats has a small bit of code you put in your pages that tracks anything Google can and more, and it doesn’t rely on remote connections to do it (that fact alone puts local parsing and stats far above any remote script service). There are many modules you can add to it. The only reason I’m even talking about AWStats is that it’s free. It does exactly the same thing Google does and builds the data from it. ALL of the big stats programs allow campaign tracking and they’re very good. That’s primarily what people want in a stats program.

    Google can’t do anything about a proxy either. The things you’re talking about are nothing more than a simple cookie that’s placed, or a bit of code sent to Google; that’s the only way they can get the info. That is exactly what the code from AWStats puts in, and doing so gives just as much information.

    I’m not sure what you mean about grouping? Log files have much more information than what’s even available in Google’s stats.

    Not to mention, some people turn JavaScript off completely.

    If someone can’t get the statistical data from a log file it’s because either the software they’re parsing it with is weak, or they don’t know how to manage the data.

    30% was generous with the site I tracked that had the most traffic.

    A full log analysis enables AWStats to show you the following information:
    * Number of visits, and number of unique visitors,
    * Visits duration and last visits,
    * Authenticated users, and last authenticated visits,
    * Days of week and rush hours (pages, hits, KB for each hour and day of week),
    * Domains/countries of hosts visitors (pages, hits, KB, 269 domains/countries detected, GeoIp detection),
    * Hosts list, last visits and unresolved IP addresses list,
    * Most viewed, entry and exit pages,
    * Files type,
    * Web compression statistics (for mod_gzip or mod_deflate),
    * OS used (pages, hits, KB for each OS, 35 OS detected),
    * Browsers used (pages, hits, KB for each browser, each version (Web, Wap, Media browsers: 97 browsers, more than 450 if using browsers_phone.pm library file),
    * Visits of robots (319 robots detected),
    * Worm attacks (5 worm families),
    * Search engines, keyphrases and keywords used to find your site (The 115 most famous search engines are detected like yahoo, google, altavista, etc…),
    * HTTP errors (Page Not Found with last referrer, ...),
    * Other personalized reports based on url, url parameters, referer field for miscellaneous/marketing purposes,
    * Number of times your site is “added to favourites bookmarks”.
    * Screen size (need to add some HTML tags in index page).
    * Ratio of Browsers with support of: Java, Flash, RealG2 reader, Quicktime reader, WMA reader, PDF reader (need to add some HTML tags in index page).
    * Cluster report for load balanced servers ratio.

    AWStats also supports the following features:
    * Can analyze a lot of log formats: Apache NCSA combined log files (XLF/ELF) or common (CLF), IIS log files (W3C), WebStar native log files and other web, proxy, wap or streaming servers log files (but also ftp or mail log files). See AWStats F.A.Q. for examples.
    * Works from command line and from a browser as a CGI (with dynamic filters capabilities for some charts),
    * Update of statistics can be made from a web browser and not only from your scheduler,
    * Unlimited log file size, support split log files (load balancing system),
    * Support ‘not correctly sorted’ log files even for entry and exit pages,
    * Reverse DNS lookup before or during analysis, support DNS cache files,
    * Country detection from IP location (geoip) or domain name,
    * Plugins for US/Canadian Region , Cities, ISP and/or Organizations reports (require non free third product geoipregion, geoipcity, geoipisp and/or geoiporg database)
    * WhoIS links,
    * A lot of options/filters and plugins can be used,
    * Multi-named web sites supported (virtual servers, great for web-hosting providers),
    * Cross Site Scripting Attacks protection,
    * Several languages. See AWStats F.A.Q. for full list.
    * No need of rare perl libraries. All basic perl interpreters can make AWStats working,
    * Dynamic reports as CGI output.
    * Static reports in one or framed HTML/XHTML pages, experimental PDF export,
    * Look and colors can match your site design, can use CSS,
    * Help and tooltips on HTML reported pages,
    * Easy to use (Just one configuration file to edit),
    * Analysis database can be stored in XML format for XSLT processing,
    * A Webmin module,
    * Absolutely free (even for web hosting providers), with sources (GNU General Public License),
    * Available on all platforms,
    * AWStats has a XML Portable Application Description.
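    To illustrate the “can analyze a lot of log formats” point, here is a minimal sketch (not AWStats itself, just an assumption-laden illustration) of parsing one line of the Apache NCSA combined log format in Python:

```python
import re

# Apache NCSA combined format: host ident user [time] "request" status bytes "referer" "agent"
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the named fields of one combined-format log line, or None if it doesn't match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

line = ('203.0.113.7 - - [15/Aug/2007:10:24:00 -0700] "GET /index.html HTTP/1.1" '
        '200 5120 "http://example.com/" "Mozilla/4.0"')
hit = parse_line(line)
print(hit["host"], hit["status"], hit["agent"])
```

    Every report in the feature list above (browsers, referrers, status codes, bandwidth) is an aggregation over fields like these.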

    The AWStats ExtraSection features are powerful setup options that allow you to add your own reports not provided by default with AWStats. You can use it to build special reports, like number of sales for a particular product, marketing reports, counting for a particular user or agent, etc…

    If we went to a program that really got involved, it could cost thousands, so I’m only referring to the free one.

  • #3 / Aug 15, 2007 9:37pm

    Nevin Lyne

    370 posts

    Ok, going to write a book here; as always, this is just my opinion after years of dealing with stats for everything from small web sites to really large corporate sites.

    Something to consider is that, at least in the Windows world, most of the “system/internet protection” tools, like Norton and others, have privacy features turned on, at times by default. They will block things like 3rd party cookies, so there goes the Google Analytics cookie being set to track the user properly. Many also flush cookies each time the browser is closed, which will throw off first time vs. repeat visitors; all of those users will almost always show up as “new” visitors.

    Generic log analysis tools, without the aid of JavaScript components, will track a “session”, which is the visitation of a single IP address over a defined period of time; most default to 20 minutes. Some do this based on IP + browser type & version/build # to get more accuracy even with multiple people on one IP. So if a user visits from an IP address and wanders your site for 5 minutes, and an hour later someone else comes to visit from that same IP address, they are counted as a new session; two people can even be separated at the same time, if the log tool processes IP + browser agent info.

    This works well too if, like me, you have a static IP address at your home office and at work. If I visit your site at 9am, 10:30am, 1pm, 3pm, and 9pm, your logs will show that, and your stats package will track that as 5 separate visitations/sessions. If the page is still in my browser cache, in most cases the browser is going to hit the server to check whether the page has been modified since the last visit, tripping the fact that I hit the site, and it will chalk up another “session”.
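    The IP + browser-agent + timeout heuristic described above can be sketched roughly like this (the 20-minute window and the data shapes are assumptions, not any particular tool’s behavior):

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=20)  # assumed default window

def count_sessions(hits):
    """Count sessions from (ip, agent, timestamp) hits.

    Hits share a session when they have the same (ip, agent) key and
    arrive within SESSION_GAP of the previous hit for that key.
    """
    last_seen = {}   # (ip, agent) -> timestamp of that key's previous hit
    sessions = 0
    for ip, agent, ts in sorted(hits, key=lambda h: h[2]):
        key = (ip, agent)
        if key not in last_seen or ts - last_seen[key] > SESSION_GAP:
            sessions += 1          # first hit, or gap too long: new session
        last_seen[key] = ts
    return sessions

t = datetime(2007, 8, 15, 9, 0)
hits = [
    ("1.2.3.4", "Firefox/2.0", t),                         # session 1
    ("1.2.3.4", "Firefox/2.0", t + timedelta(minutes=5)),  # same session
    ("1.2.3.4", "MSIE 6.0",    t + timedelta(minutes=6)),  # different agent: session 2
    ("1.2.3.4", "Firefox/2.0", t + timedelta(hours=1)),    # gap > 20 min: session 3
]
print(count_sessions(hits))
```

    Note how the shared IP address yields three sessions rather than one, exactly the disambiguation being described.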

    To touch on the part about combining logs when your site spans multiple servers: if your site is hosted on load balanced servers, or serves dynamic content from one server and all static content from another, any decent log analysis tool will process the logs and collate the data based on date/time and the fact that they were separate logs. It will not duplicate data, but rather combine it, without you needing to merge the log files yourself.
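    A minimal sketch of that collation step: given per-server log entries already sorted by time, a timestamp merge yields one combined stream without hand-merging the files (the entry shapes here are invented for illustration):

```python
import heapq

# Two per-server logs, each already sorted by timestamp (first tuple element).
web1 = [(1, "GET /index.html"), (4, "GET /about.html")]
web2 = [(2, "GET /style.css"), (3, "GET /logo.png")]

# heapq.merge lazily interleaves already-sorted streams by the key,
# so entries come out in global time order with nothing duplicated.
merged = list(heapq.merge(web1, web2, key=lambda entry: entry[0]))
print([path for _, path in merged])
```

    Real tools do the same thing at scale, keyed on the parsed date/time field of each log line.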

    There simply is not a perfect solution; you cannot get a 100% (not even close) picture of who or what is visiting your site. Log files have weaknesses in tracking “true” individual visitors, but have a good workaround to track general sessions based on IP address, time limits, and possibly browser agent string uniqueness. JavaScript + cookies can both be blocked completely by some personal security software, so those visitors will not show up at all, or will show up as conflicting/skewed data. Using them together presents a different challenge: which do you trust more? The data directly from the logs when the JavaScript/cookies were not accepted, or the cookie/JavaScript data when the IP address is the same? The cookie could be reset each time, so a person who reads your site 10 times a day shows up as 10 “new” visitors rather than 1 “repeat” visitor.

    The only thing I recommend, and have explained time and again, is to select an option and stick to it. You should be tracking trends, i.e. more traffic, less traffic, time of day, certain types of articles or content. If you get stuck on the actual accuracy of the “numbers”, you are missing the point of long term statistical analysis. As long as you are tracking your overall trends in a consistent manner, you will have a better understanding of your site, your traffic, and your visitors’ likes and dislikes.

    If you are going to do both logs and something like Mint/Google/etc., don’t compare the two directly; rather, track trends within each tool separately. They are always going to produce different numbers because they track data differently; you can’t compare apples and oranges. Even comparing apples to apples has to be done with the same log processing tool, as each log processing application defines a “session” differently almost 100% of the time.

    Watch the overall trends, and focus more on your sites content, rather than focusing massive effort to get ‘accurate’ visitor information, and wasting that time when you could be providing more or improved content for your site.

  • #4 / Aug 16, 2007 11:23am

    Leslie Camacho

    1340 posts

    Hi guys,

    This is such a great discussion, but it really has little to do with the original thread, so I split it off.

  • #5 / Aug 21, 2007 11:29pm

    allgood2

    427 posts

    What was the original thread? Not that this conversation isn’t good by itself; I’m just curious to see what initiated it.

  • #6 / Aug 22, 2007 12:52am

    zwenthe

    7 posts

    Something to consider is that at least in the Windows world most of the “system/internet protection” tools, like Norton, and others, have features turned on, at times by default, for privacy.  So they will block things like 3rd party cookies, so there went Google Analytics cookie being set to track the user properly.

    Google Analytics actually sets 1st party cookies, so unless the browser doesn’t allow cookies at all, you’re ok.

    As for users clearing cookies… that’s ok, in my opinion. Visit sessions are better to track and are accurate.

  • #7 / Oct 25, 2007 1:36pm

    Geof Harries

    109 posts

    allgood2 - The original thread was here.

    And to further strengthen my position against logs, Boxes and Arrows just posted a great, though lengthy, article on the subject.

