@anonymous-matomo-user opened this Issue on November 5th 2012

Count of unique visitors by Referer plugin don't agree with count of that by Visits plugin.

For example: On one site "Visits Over Time" report has a 28339 visitors and 48697 visits. But in Referrers->Overview tab in "Details by Referrer Type" report we can see that total count of Visits is 48697 (36499 by direct, 6295 by websites, 3099 by search engines and 2804 by campaigns). But total count of Unique visitors is 33684 (22684 by direct entry, 5488 by websites, 11332 by search engines and 7293 as campaigns). That has a difference from Visits plugin by 5345 visitors.

Looked for this bug I finded problem in Referer plugins method Referers::archiveDayAggregateVisits. It uses ArchiveProcessing_Day::queryVisitsByDimension method to find and aggregate visits and visitors. But if visitor has a several visits in one day and with different types of referrers, it calculates him more than one time.

I wrote a patch for fix this behavior. When I tested it, as addition to fix the problem turn out that the method Referers::archiveDayAggregateVisits with my patch works more fast than original version about two times on my laptop (~1.4 sec opposite ~3.1 sec of original execution time).

@anonymous-matomo-user commented on November 5th 2012

Attachment: Patch to fix it
wan-51.piwik.patch

@anonymous-matomo-user commented on November 6th 2012

Attachment:
3517-1.patch

@anonymous-matomo-user commented on November 6th 2012

Attachment:
3517-2.patch

@anonymous-matomo-user commented on November 6th 2012

Attachment:
3517-3.patch

@anonymous-matomo-user commented on November 6th 2012

Attachment:
3517-4.patch

@anonymous-matomo-user commented on November 6th 2012

Attachment:
3517-5.patch

@anonymous-matomo-user commented on November 6th 2012

Attachment:
3517-total.patch

@anonymous-matomo-user commented on November 5th 2012

I made pull request on github with this patch: https://github.com/piwik/piwik/pull/13

@mattab commented on November 6th 2012 Member

Patch review:

Can you please in the patch only submit the code lines that change (and not change the style of other lines or add line breaks) and also, please don't rename the function with ByRef since it's the same thing (it will make patch clearer)

why not modify queryVisitsByDimension() instead of duplicating code in getPlainVisitsCursor() ?

@anonymous-matomo-user commented on November 6th 2012

Thank you for reviewing my patch!

Ok, I'll try to describe all of it as good as I can =)

As I already said, issue begans from incorrect count of unique visitors in reports related with referrer type. I found incorrect data in database table piwik_archive_blob_* in rows by name 'Referer_types'. All of those rows (with that name) were inserted in database by Referers plugin, as I saw. After that I watched code of that plugin, and found invocation of method Piwik_ArchiveProcessing_Day::queryVisitsByDimension, which builds sql query. That query calculates indices by groups of dimensions. In that I found a little bug - each visitor, who visited site a few times in query period, was calculated a several times - once for each visit, unique by referer type.
As we can't correctly calculate unique visitors on database side (in MySQL, for exam. By some others it can be possible), I decided to write a new method, for building clear sql query, which will be fetch all visits, without grouping of them (plain visits). I called it getPlainVisitsCursor, as it returns opened query cursor to list of plain selected visits. This method doesn't loads all of data in memory, then it may be called as cursor. Using all of that we're can calculate all visits correctly on php side. A new method (getPlainVisitsCursor) do another things as opposed to queryVisitsbyDimension method - it's not makes groups in sql, and doesn't aggregate anything data on database side. Sure, we already have a metod called queryVisitsSimple, but it also doesn't suited for us, as it loads all selected data in memory immediately, and doesn't have definitions of indices fields - that shouldn't be defined on outer side.

The first patch appends new method with name Piwik_ArchiveProcessing_Day::getPlainVisitsCursor. File with it called as 3517-1.patch

By next step, we're can calculate all selected data on php side. It can be done in method
Piwik_Referers::archiveDayAggregateVisits - where that calculated data is being needed.
After that patch, time of method execution (on my laptop) increased from 3.1 to 3.6 seconds.

The second patch makes some aggregation logic in method Piwik_Referers::archiveDayAggregateVisits. Hi's called as 3517-2.patch

Of course we want to get the most better performance of our code as we can. For that I analized a source code of our method, and found that than invokes other method - updateInterestStats by Piwik_ArchiveDayProcessing_Day class. It takes two arguments. First of them is array with new statistic indices, and second is old array with our aggregated data. Count of its invocations can amount to hundreds of thousands. And first argument of it is scalar array, which being fully copied each time when this method is invoked. For avoid this behaviour I wrote new method, called as
updateInterestStatsByRef, and it passes both of first array arguments by reference. He has the same logic as in original method updateInterestStats, but it has much better performance becouse thats doesn't copies an array from first argument. Therefore I rewrote original method, so it's just delegates call to updateInterestStatsByRef and doesn't duplicates the same logic implementation. New method is need as I don't want to brake a backward compatibility. If anybody call method updateInterestStats with first argument as array literal, and I'll try to take it by reference,
that's be an php error.

The third patch appends a new method to Piwik_ArchiveProcessing_Day by name updateInterestStatsByRef. File with it called as 3517-3.patch

By the next step we applying our new method to aggregation logic in method Piwik_Referers::archiveDayAggregateVisits.
After that patch execution time of the method Piwik_Referers::archiveDayAggregateVisits on my laptop was 1.47 second.

That's 3517-4.patch

Finally I want to fix up a code style for all those steps. Along with it I'd give an insignificant performance improvements, as save some values from array in variables, because making a new variable is more fast than redoing many searches in associative arrays by string key.
Execution time after that patch is 1.41 (on my laptop)

That's 3517-5.patch

All steps by one patch in file called as 3517-total.patch

Thank you for your time, and sorry for my bad english =)

@mattab commented on December 14th 2012 Member

Thanks for the patch. unfortunately we cannot apply because it has performance improvements impact.

@anonymous-matomo-user commented on December 15th 2012

Yep. I saw, we're haven't unique visitors in Referrer reports now. Nice resolve of the problem. Thanks.

This Issue was closed on December 15th 2012
Powered by GitHub Issue Mirror