@tsteur opened this Pull Request on August 13th 2019 Member

Seeing roughly 2-6 times faster segment archiving using this patch.
Generally: The less visitors a segment matches, the much faster the archiving query will become (say even 100 times faster)... if all visitors match the segment, there's a chance it takes slightly longer. The more complex the segment be, the faster this is as well.

fix https://github.com/matomo-org/matomo/issues/11900
refs https://github.com/matomo-org/matomo/issues/14119#issuecomment-519675939

Current problems:

  • In the log aggregator we are creating a temporary table. In the log aggregator, we can however not know if all the plugins are using the same DB connection. When a DB reader is configured, then this temporary table needs to be created on the reader. However, if a plugin archiver does Db::get()->query() instead of $this->logAgreggator->getDb()->query() then the archiving will fail. I reckon this is edge case though since the DB reader is a new feature and not yet wildly used. All our plugins including premium features are compatible
  • Another DB reader thing: I don't know if all MySQL replications are happy with creating temporary tables... the reader needs to be able to create a temporary table there
  • create temporary table: The syntax is like create temporary table logtemp select idvisit from log_visit ... where the select is generated depending on the configured segment. I'm not sure if this works with older MySQL versions. I reckon so but not sure.
  • I will need someone to check the test fixture changes
  • Will then need to disable the segment config and create new tests with the segment enabled

In general it works like this when a segment is defined for archivers (not live query as there is no benefit to it):

create temporary table logtemp$segmentHash select idvisit from log_visit where idsite = 1 and visit_last_action_time >= ? and visit_last_action_time <= ? AND ... // segment query appended... sometimes also more joins etc

-- now each archiver query uses that temp table

select $archiveSelect from logtemp$segmentHash left join log_link_visit_action where id_action_event_category is not null  -- just an example... We do not exuecute the ` idsite = 1 and visit_last_action_time >= ? and visit_last_action_time <= ?` again in this query

-- drop table (will also automatically deleted on connection end)

In other words:

// traditionally be 
$segment->getSelectQuery($select, $from, $where, ...);

// now we do
$segmentTable = 'logtemp445454545';
'create temporary table ' . $segmentTable . $segment->getSelectQuery('idvisit', 'log_visit', 'idsite=1 and time > ? and time < ?', ...);
array_unshift($segmentTable, $from);
$noSegment->getSelectQuery($select, $from);
// because we execute the segment query in advance, we can always execute basically the raw archiver query (inner joined with the segment table)

From what I see it should pretty much get rid of subqueries etc. and we basically always run the original archiving query and no longer join random tables.

@tsteur commented on August 15th 2019 Member

FYI: For now not creating temporary tables as if table be larger than the memory then we would need to execute the select another time and try to create the table on disk... if table result is > allowed memory the create table simply fails

@tsteur commented on August 15th 2019 Member

Just did a test btw where a query took 177s, and when using engine=memory the same archiver query was actually not much faster (171s).

@tsteur commented on August 15th 2019 Member

All the tests now succeed. There's one failing test in the bandwidth plugin related to this change, all the other test failures are unrelated. I will fix the bandwidth test after merging as otherwise things get complicated re the submodule etc.

@tsteur commented on August 15th 2019 Member

BTW: @mattab just fixed another issue while testing with a DB reader where the ranking query was before not using the Reader DB

@tsteur commented on August 18th 2019 Member

btw @mattab forgot to mention one thing as a side note...

eg page urls we archive like log_link_visit_action.server_time < ... > ...... and goals we archive on log_conversion.server_time < ...>... but this is not 100% correct as we don't respect the visit time and whether the visit for this was also in the same day or not...

The segments now include log_visit.last_visit_action_time which is correct so there might be slightly different number when having segment Continent=EUR and segment Continuent!=EUR where the total of those 2 segments might not be 100% the number without the segment.

This Pull Request was closed on August 16th 2019
Powered by GitHub Issue Mirror