@bx80 opened this Pull Request on November 8th 2022 Contributor


This is a rework of #18221 to improve performance and decrease memory usage when archiving.


  • Aggregation and attribution is no longer done in code and is performed as part of the database query.
  • Queries to retrieve goal metric data for pages are now run individually for each goal to decrease peak load on the database server and to reduce memory usage when enriching the actions array in code.
  • Page URL and page titles unit tests have been expanded to cover multiple days

Refs: L3-313


@justinvelluppillai commented on November 29th 2022 Member

@bx80 is this still ready for review or are you addressing @sgiehl 's previous comment still?

@bx80 commented on November 29th 2022 Contributor

@justinvelluppillai I'm currently looking at a solution for supporting segments with the custom query. We definitely don't want to duplicate the segment query building logic, so it's probably a case of either improving the generateQuery method so it can build the complicated custom goals pages query or at least support having a pre-built full query passed to it and then apply the segment wrapping.

@justinvelluppillai commented on November 29th 2022 Member

ok thanks, I've removed the Needs review label meanwhile 👍🏽

@justinvelluppillai commented on November 30th 2022 Member

@bx80 @tsteur mentioned a few times that he's done something similar with multi attribution so maybe worth talking to him about this

@bx80 commented on December 2nd 2022 Contributor

To allow segment support for the goals/pages query without duplicating any of the segment logic or reworking the goals/pages query again, I've taken the following approach:

  • LogAggregator::generateQuery() now has two new optional parameters which allow a prebuilt query and custom bind order to be provided.
  • If a prebuilt query is provided then generateQuery() will ignore the normal $select, $from and $where parameters and just use the query directly. Query hints and parameter binding will take place as normal.
  • If a prebuilt query is provided and there is also a segment then the segment temporary table will be created as usual, but a simple wrapper will be used instead of attempting to modify the prebuilt query joins and where clause (which isn't viable for complex queries). The simple wrapper joins the prebuilt query to the segment table as a derived table and only selects records from the prebuilt query derived table which have an idvisit in the segment temporary table. This simple approach means generateQuery() doesn't need to know the structure of the prebuilt query.

This should be reusable for any other complex archiving queries in the future as long as they select the idvisit column.

The SomePageGoalVisitsWithConversion fixture has been modified to set a country code on visits and the new TrackGoalsPagesSegmentTest integration test tests the goals/pages API calls with a simple segment.

If the test builds complete without issues then this should ready for review.

@sgiehl commented on December 2nd 2022 Member

@bx80 The approach looks good for me. But could you maybe try to refactor that in a way, that we end up having two methods. So the generateQuery like we had before and a second method that only handles prebuilt queries? One method for both looks a way to complex for me. Guess it should be easily possible to move various parts from generateQuery to private methods, that can then be used by both methods. Maybe you could even simplify the generateQuery afterwards to build a query that can then be passed to the new method maybe...

@bx80 commented on December 4th 2022 Contributor

@sgiehl I've reworked the generateQuery() method, the segment temp table creation and origin hints are now in private methods. All custom query related code has been moved from generateQuery() to a new generateQueryCustom() method which only needs two parameters. It's much cleaner this way :+1:

I've also made the segment generation for custom queries ignore enable_segment_cache = 0 and always create a temporary table. The goals/pages query is the only custom query at the moment and it is unlikely to work without use of a temp table.

@bx80 commented on December 6th 2022 Contributor

Thanks for the feedback @sgiehl :slightly_smiling_face:

I've applied all the suggested improvements. For large dataset performance the query was tested on a real world database where a single goal had 500k+ conversions in a single day. The new query now runs for each goal individually and it took just under 5minutes to complete archiving for this goal, whereas previously it failed to complete goal actions archiving at all. There are more details about the performance testing in the comments for L3-313

Powered by GitHub Issue Mirror