
Faster segment archiving #14761

Merged
merged 22 commits into from Aug 16, 2019

Conversation

tsteur
Member

@tsteur tsteur commented Aug 13, 2019

Seeing roughly 2-6 times faster segment archiving using this patch.
Generally: the fewer visitors a segment matches, the faster the archiving query becomes (possibly even 100 times faster)... if all visitors match the segment, there's a chance it takes slightly longer. The more complex the segment, the bigger the speed-up as well.

fix #11900
refs #14119 (comment)

Current problems:

  • In the log aggregator we are creating a temporary table. In the log aggregator, however, we cannot know whether all plugins use the same DB connection. When a DB reader is configured, this temporary table needs to be created on the reader. However, if a plugin archiver does Db::get()->query() instead of $this->logAggregator->getDb()->query(), the archiving will fail. I reckon this is an edge case though, since the DB reader is a new feature and not yet widely used. All our plugins including premium features are compatible.
  • Another DB reader thing: I don't know whether all MySQL replication setups are happy with creating temporary tables... the reader needs to be able to create a temporary table there.
  • create temporary table: The syntax is like create temporary table logtemp select idvisit from log_visit ... where the select is generated depending on the configured segment. I'm not sure whether this works with older MySQL versions. I reckon so, but am not sure.
  • I will need someone to check the test fixture changes
  • Will then need to disable the segment config and create new tests with the segment enabled

In general, when a segment is defined, it works like this for archivers (not for Live queries, as there is no benefit there):

create temporary table logtemp$segmentHash select idvisit from log_visit where idsite = 1 and visit_last_action_time >= ? and visit_last_action_time <= ? AND ... -- segment query appended... sometimes also more joins etc.

-- now each archiver query uses that temp table

select $archiveSelect from logtemp$segmentHash left join log_link_visit_action ... where idaction_event_category is not null -- just an example... we do not execute the `idsite = 1 and visit_last_action_time >= ? and visit_last_action_time <= ?` part again in this query

-- drop table (will also be deleted automatically when the connection ends)
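Not part of the PR: a minimal runnable sketch of the pattern above, using SQLite instead of the MySQL Matomo targets. Table and column names follow the PR description; the data and the segment condition are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE log_visit (idvisit INTEGER, idsite INTEGER, continent TEXT);
    CREATE TABLE log_link_visit_action (idvisit INTEGER, idaction_url INTEGER);
    INSERT INTO log_visit VALUES (1, 1, 'EUR'), (2, 1, 'ASI'), (3, 2, 'EUR');
    INSERT INTO log_link_visit_action VALUES (1, 10), (1, 11), (2, 12), (3, 13);
""")

# 1) Resolve the segment once into a temporary table of matching idvisits
#    (the hash suffix in the real table name is generated from the segment).
conn.execute("""
    CREATE TEMPORARY TABLE logtemp_abc123 AS
    SELECT idvisit FROM log_visit WHERE idsite = 1 AND continent = 'EUR'
""")

# 2) Each archiver query then joins the (usually small) temp table
#    instead of re-evaluating the segment WHERE clause every time.
(actions,) = conn.execute("""
    SELECT COUNT(*) FROM logtemp_abc123 t
    INNER JOIN log_link_visit_action a ON a.idvisit = t.idvisit
""").fetchone()
print(actions)  # 2 actions belong to the one visit matching the segment

# 3) The temporary table is dropped automatically when the connection closes.
conn.close()
```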

In other words:

// traditionally we did
$segment->getSelectQuery($select, $from, $where, ...);

// now we do
$segmentTable = 'logtemp445454545';
'create temporary table ' . $segmentTable . ' ' . $segment->getSelectQuery('idvisit', 'log_visit', 'idsite = 1 and time > ? and time < ?', ...);
array_unshift($from, $segmentTable); // the segment table becomes the first table in the from array
$noSegment->getSelectQuery($select, $from);
// because we execute the segment query in advance, we can always execute basically the raw archiver query (inner joined with the segment table)

From what I see it should pretty much get rid of subqueries etc. and we basically always run the original archiving query and no longer join random tables.

@tsteur tsteur added the Pull Request WIP Indicates the current pull request is still work in progress and not ready yet for a review. label Aug 13, 2019
@tsteur tsteur added this to the 3.12.0 milestone Aug 13, 2019
; Note: if you use any plugins, these need to be compatible with Matomo, and
; * depending on the segment you create you may need a newer MySQL version (eg 5.7 or newer)
; * use a reader database for archiving in case you have configured a database reader
enable_segments_cache = 1
Member Author

we would disable this before the release

Member Author

and we'll need to write tests with this setting enabled and then check the segmentation works

@tsteur
Member Author

tsteur commented Aug 15, 2019

FYI: for now we are not creating the temporary tables in memory: if the table were larger than the allowed memory, the create table would simply fail, and we would need to execute the select another time to create the table on disk.

@tsteur
Member Author

tsteur commented Aug 15, 2019

Just did a test btw where a query took 177s, and when using engine=memory the same archiver query was actually not much faster (171s).

@tsteur
Member Author

tsteur commented Aug 15, 2019

All the tests now succeed. There's one failing test in the bandwidth plugin related to this change, all the other test failures are unrelated. I will fix the bandwidth test after merging as otherwise things get complicated re the submodule etc.

@tsteur
Member Author

tsteur commented Aug 15, 2019

BTW @mattab: just fixed another issue while testing with a DB reader, where the ranking query was previously not using the reader DB.

@tsteur
Member Author

tsteur commented Aug 18, 2019

btw @mattab forgot to mention one thing as a side note...

E.g. page URLs we archive on log_link_visit_action.server_time, and goals we archive on log_conversion.server_time, but this is not 100% correct as we don't respect the visit time, i.e. whether the visit these rows belong to was also in the same day or not...

The segments now include log_visit.visit_last_action_time, which is correct, so there might be slightly different numbers: with segment Continent=EUR and segment Continent!=EUR, the total of those 2 segments might not be 100% the number without any segment.
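To make the mismatch above concrete, here is a hypothetical sketch (made-up data, simplified to day granularity): unsegmented goal archiving filters only on log_conversion.server_time, while segmented archiving first restricts visits by visit_last_action_time, so a conversion whose visit's last action fell on a different day drops out of every segment.

```python
# A visit whose last action is on Aug 1 but whose conversion is logged
# on Aug 2 (e.g. a visit spanning midnight).
visits = [
    {"idvisit": 1, "last_action_day": "2019-08-01", "continent": "EUR"},
    {"idvisit": 2, "last_action_day": "2019-08-02", "continent": "ASI"},
]
conversions = [
    {"idvisit": 1, "server_day": "2019-08-02"},
    {"idvisit": 2, "server_day": "2019-08-02"},
]

day = "2019-08-02"

# Unsegmented archiving: filter on log_conversion.server_time only.
unsegmented = sum(1 for c in conversions if c["server_day"] == day)

# Segmented archiving: first restrict visits by visit_last_action_time,
# so visit 1 (last action on Aug 1) is excluded from the Aug 2 archive.
def segmented(pred):
    ids = {v["idvisit"] for v in visits
           if v["last_action_day"] == day and pred(v)}
    return sum(1 for c in conversions
               if c["server_day"] == day and c["idvisit"] in ids)

total_of_segments = (segmented(lambda v: v["continent"] == "EUR")
                     + segmented(lambda v: v["continent"] != "EUR"))
print(unsegmented, total_of_segments)  # 2 1
```

So Continent=EUR plus Continent!=EUR yields 1 conversion for Aug 2, while the unsegmented archive reports 2.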

Labels
c: Performance For when we could improve the performance / speed of Matomo. Pull Request WIP Indicates the current pull request is still work in progress and not ready yet for a review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

When segmenting by page URL, Events/Downloads/Outlinks/Site search reports always show no data
3 participants