@bx80 opened this Pull Request on October 26th 2021 Contributor

Description:

Fixes #2030

Adds new reports for goal conversions by page URL, page title and entry page to the goals overview; see the detailed specification in #2030.

System tests have now been added.

Review

@github-actions[bot] commented on November 12th 2021 Contributor

This issue is in "needs review" but there has been no activity for 7 days. ping @matomo-org/core-reviewers

@justinvelluppillai commented on November 22nd 2021 Contributor

@bx80 anything needed here to progress this?

@bx80 commented on November 22nd 2021 Contributor

@justinvelluppillai Just needs reviewing.

@bx80 commented on November 23rd 2021 Contributor

@sgiehl Looks like I managed to come up with a query in LogAggregator::queryConversionsByPageView() that worked in MariaDB but not in MySQL with strict mode. :facepalm:
I've committed a fix and tested the query on both MariaDB and MySQL, hopefully that sorts the error you encountered.

@github-actions[bot] commented on December 4th 2021 Contributor

This issue is in "needs review" but there has been no activity for 7 days. ping @matomo-org/core-reviewers

@tsteur commented on December 9th 2021 Member

@sgiehl @bx80 overall no big preference on that.

If we can do the same as for other reports and merge it into the regular pages report, that might be beneficial, and maybe a few more things will work out of the box and be more consistent, e.g. for API users. Queries would still be separate though, indeed.

If I understand this right, there will be only one attribution model (linear attribution, aka each page gets the same weight), and then it should be fine. If there were different attribution models it might be a bit different again, not sure; although then it would just be about finding different metric names for each attribution model.

I would recommend giving it a try for, say, 4-6 hours and checking for blockers or any big issues.

@github-actions[bot] commented on December 28th 2021 Contributor

If you don't want this PR to be closed automatically in 28 days then you need to assign the label 'Do not close'.

@github-actions[bot] commented on February 16th 2022 Contributor

This issue is in "needs review" but there has been no activity for 7 days. ping @matomo-org/core-reviewers

@bx80 commented on March 3rd 2022 Contributor

@sgiehl It would definitely be nicer to set the show_goals / has_goals options on the standard actions reports and let the visualization system take care of the rest.

I've tried this and it mostly works but I did run into a couple of issues, so before I get stuck into fixing them it'd be great to get your input and check I'm on the right track :slightly_smiling_face:

1) The $view->config->show_goals setting makes assumptions about which goals visualization is to be used; the new page goal metrics need to be shown using the GoalPages visualization instead of the usual Goals visualization. We could work around this by adding an alternative $view->config->show_goals_pages setting to use the different visualization and columns. Does that seem like a good approach?

2) The new page goal metrics cannot be entirely calculated within the archive and require filters to be run on the dataset to populate some of the metric columns. This could be done in the Actions.getPageX API methods, and we could remove the Goals.getPageX API methods which always apply the filters, but I'm concerned about performance if these filters are run on every Actions.getPageX call, even when goal data isn't required.
So I'm thinking we could either:
a) Run the filters for every API call and hope it doesn't hurt performance too much, or
b) Perhaps there is a way to set the new Actions API parameter includeGoals to true just for calls made for the Goals visualization? If this could be done then we could run the goals filters only when they are needed, which would minimize any performance hit. I'm not sure how to easily adjust the API call parameters based on the chosen visualization, but I can take a look, or maybe you have some pointers on how this can be done?

@sgiehl commented on March 3rd 2022 Member

@bx80 I would have tried it this way:
Don't use any extra visualization. Instead, change the Goals viz and add all the required stuff there. You could e.g. check with something like $this->requestConfig->getApiModuleToRequest() and $this->requestConfig->getApiMethodToRequest() whether it is an actions report, in order to set different parameters.
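
A minimal sketch of that check, assuming it sits in one of the Goals visualization's load hooks; the beforeLoadDataTable hook, the list of methods and the includeGoals parameter name are illustrative here, not the final implementation:

public function beforeLoadDataTable()
{
    parent::beforeLoadDataTable();

    // Is the Goals viz rendering one of the Actions page reports?
    $isActionsPageReport = $this->requestConfig->getApiModuleToRequest() === 'Actions'
        && in_array($this->requestConfig->getApiMethodToRequest(),
                    ['getPageUrls', 'getPageTitles', 'getEntryPageUrls']);

    if ($isActionsPageReport) {
        // ask the API for the goal metrics as well ('includeGoals' is a
        // hypothetical parameter name used only for this sketch)
        $this->requestConfig->request_parameters_to_modify['includeGoals'] = 1;
    }
}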

If you need extra filters to be applied for the Viz only, you can actually do this with something like this in the visualization:

public function beforeGenericFiltersAreAppliedToLoadedDataTable() {
    parent::beforeGenericFiltersAreAppliedToLoadedDataTable();
    // apply a visualization-specific filter before the generic filters run
    $this->dataTable->filter('FilterToApply');
}

Let me know if you need any further details.

@bx80 commented on March 10th 2022 Contributor

@sgiehl Thanks for the pointers, they were really helpful :+1:

I've now removed the new reports and API methods previously added to the Goals plugin. The two additional visualizations have also been removed and their functionality added to the standard goals visualization.

The actions page reports now have the proper flags set to indicate that they have goal metrics, so the Goals visualization is automatically shown as a data table option and the reports appear in the goals overview screen menu. The visualization then checks whether one of the actions page API calls is being made and sets the includeGoals parameter on the API call, which lets the API apply the additional filters instead of the visualization. This should allow external API calls to query page goals and still have the goal metrics populated by the filters.
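
For context, roughly what the report-side flag looks like, assuming it is set in each Actions page report's configureView(); the exact flags the merged code sets may differ:

public function configureView(ViewDataTable $view)
{
    // ... existing Actions report view configuration ...

    // Mark the report as having goal metrics so the Goals visualization is
    // offered as a data table option and the report is listed on the Goals
    // overview (property name taken from the discussion above).
    $view->config->show_goals = true;
}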

This is ready for the next stage of reviewing and hopefully is getting close to where it needs to be :slightly_smiling_face:

@sgiehl commented on March 23rd 2022 Member

@bx80 I had another look at the changes and was thinking about possible improvements for quite a while. As I wasn't that deep into the code, and thus not sure what would work and what wouldn't, I decided to make the improvements locally to ensure they would work.
I don't think it makes much sense to add another few dozen comments and let you redo all the work.
I have now pushed the changes to a separate branch. See https://github.com/matomo-org/matomo/compare/m-2030-goals-per-page...goalpagesrefactor?expand=1
Maybe we can have a quick call/chat and go through the changes so I can explain the reasons to you, for a better understanding.

@bx80 commented on March 27th 2022 Contributor

@sgiehl Thanks, much appreciated! :slightly_smiling_face: This will save a lot of time, I've read through the changes and they all make total sense to me :+1:

  • The includeGoals parameter and RemoveGoals filter probably weren't worth the extra complexity if goals are to be included by default.
  • I'd missed a few places where constants should have been used instead of metric column names.
  • I hadn't realized $table->deleteColumns() would work from a filter, so that's a learning point and a lot more efficient than passing metadata around.
  • array_filter() with no callback removes negative value array elements - I'll definitely be using this one! :writing_hand:

Happy to have a chat if there's anything else I should make note of.

It'd probably be best if I merge your branch goalpagesrefactor into this one m-2030-goals-per-page to preserve this PR comment history and then recheck all build tests. Is that ok?

@justinvelluppillai commented on March 28th 2022 Contributor

array_filter() with no callback removes negative value array elements

Does it? I thought it removed anything falsey

@bx80 commented on March 28th 2022 Contributor

array_filter() with no callback removes negative value array elements

Does it? I thought it removed anything falsey

Yes, sorry, that's what I meant by 'negative': anything that's not a positive value (including zero). I should have been more specific. Falsey is a useful word for that; I'll add it to my vocabulary along with truthy :smile:
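
For reference, a quick plain-PHP illustration of that default behaviour:

$values = [3, 0, -5, null, '', false, 1, '0'];

// With no callback, array_filter() drops every falsey element:
// 0, null, '', false and '0' are removed; keys are preserved.
// Note that -5 is kept, since negative numbers are truthy.
var_dump(array_filter($values));
// => [0 => 3, 2 => -5, 6 => 1]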

@sgiehl commented on March 28th 2022 Member

@bx80 Yes, I guess those were probably the basics. You also commonly used queueFilter, which I guess caused various issues. It's good to look up the difference between that and a direct filter: a direct filter is applied immediately, while a queued filter runs after the post-processor has finished. I guess that also caused your issue with the nb_conversions_page_rate metric, as it was calculated too late and so missing in some cases. Let me know if you have any further questions.
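
For anyone following along, a minimal sketch of the difference (the filter name is a placeholder):

// Applied immediately, so any columns it adds are available to the generic
// filters and the post-processor:
$dataTable->filter('SomeFilter');

// Only queued: it runs after the post-processor has finished (shortly before
// rendering), so columns added here can arrive too late for other filters
// that expect them -- as happened with nb_conversions_page_rate above.
$dataTable->queueFilter('SomeFilter');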

Feel free to simply merge over the changes from my branch, or cherry-pick that commit here...

@bx80 commented on March 29th 2022 Contributor

@sgiehl @justinvelluppillai Changes merged in and all related tests now pass. Final check that we're all good to merge this into 4.x-dev?

@sgiehl commented on March 29th 2022 Member

@bx80 I can have a final look at that tomorrow. Not sure whether we should merge it before 4.9 or wait for 4.10, though.

@sgiehl commented on March 31st 2022 Member

@bx80 I had a final look at the code and applied some more fixes based on the requirements mentioned in the original issue. (Hiding the unused revenue columns in the UI didn't work correctly anymore)

What I did not do yet is any performance testing. I don't have a big database locally, but it might be good to test the archiving with a large amount of data to check that it runs smoothly. Not sure if we could maybe run the archiving queries on our demo or another big account to avoid possible problems when releasing the feature.

@bx80 commented on April 6th 2022 Contributor

Performance testing results:

The two new archive queries were both tested against a large database.

:heavy_check_mark: The query from LogAggregator::queryConversionsByEntryPageView() completed in 8 seconds for ~6,000 entry pages.
:x: The query from LogAggregator::queryConversionsByPageView() ran for 1,500 seconds before being terminated; the query plan showed ~200,000 conversions in the archived time period. Poor performance is likely due to the dependent sub-query; the EXPLAIN output has been logged. I will do some more query optimization and then we can re-test.

@bx80 commented on April 7th 2022 Contributor

@sgiehl I've updated the tracker insert testing tool so it can generate goal conversions as part of generating random tracker data; this allowed me to create 1,000,000 visits, actions and conversions in a local database within a few minutes. I've then used this to test the slow LogAggregator::queryConversionsByPageView() query.

Moving WHERE clauses into the JOINs and using ORDER BY NULL on the subquery to eliminate the filesort has improved the run time for me from 2 min 45 s to 1 min 15 s.

The main outstanding reason for poor performance is the need to group the result by idgoal, idvisit and idaction. idgoal and idvisit are in the log_conversion table, but idaction is in the joined log_action table. Grouping by columns in different tables causes temporary table usage for the main result. Switching the grouping from log_action.idaction to log_conversion.idaction_url improves the query time from 1 min 15 s to <15 s and avoids temporary table creation; however, this will not work for page titles, as log_conversion.idaction_url is only populated for page URL hits.
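
To make the discussion easier to follow, here is an illustrative, heavily simplified sketch of the query shape being tuned; it is not the actual LogAggregator::queryConversionsByPageView() SQL, and the join and where conditions are placeholders:

$sql = "
    SELECT log_conversion.idgoal,
           log_conversion.idvisit,
           log_action.idaction,
           COUNT(*) AS nb_conversions
    FROM log_conversion
    /* conditions kept in the JOIN rather than a later WHERE clause */
    LEFT JOIN log_link_visit_action
           ON log_link_visit_action.idvisit = log_conversion.idvisit
    LEFT JOIN log_action
           ON log_action.idaction = log_link_visit_action.idaction_url
    WHERE log_conversion.server_time BETWEEN ? AND ?
    /* grouping on columns from two tables (log_conversion + log_action)
       is what forces the temporary table described above */
    GROUP BY log_conversion.idgoal, log_conversion.idvisit, log_action.idaction
    /* ORDER BY NULL suppresses the implicit GROUP BY sort (filesort)
       that older MySQL/MariaDB versions perform by default */
    ORDER BY NULL";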

The log_conversion table has an idaction_url column but no idaction_name or idaction column. It seems inconsistent to store the idaction_url but not the idaction_name.

Could we add an idaction_name column to the log_conversion table, get new conversions to populate it and add a migration to fill it for existing data?

This would dramatically improve query performance for this query and possibly help future queries which need to aggregate conversions by page title action.

@sgiehl commented on April 7th 2022 Member

@bx80 Thanks for the update and the improvements.

Adding new columns to log tables isn't something we should do in a normal release. As it can take ages for bigger instances to add new columns to such tables, we are only doing that in major releases.

If we decide to add a new column, filling in data for old conversions shouldn't be done automatically. But we could provide a console command to achieve that.

Are there other possibilities to improve that? Did you check if an index would improve the runtime? Or could we not group by all columns and instead aggregate the results in PHP 🤔

@tsteur commented on April 7th 2022 Member

Just FYI, index changes to log tables can also only be made in major releases; it's the same problem, where adding indexes can take many hours or days and would cause unexpected issues when updating. Also, we'd want to avoid any new indexes on log tables wherever possible, as they add "cpu, disk and financial cost" to tracking requests.

@bx80 commented on April 8th 2022 Contributor

Thanks for the info :+1: Makes sense that time consuming db updates shouldn't happen in normal releases :computer::hourglass::sleeping:

I can't see any way to improve the query performance further with the current schema, so if we can't easily add indices or new columns then the only solution is to do the aggregation in PHP.

I've adjusted the queryConversionsByPageView query to remove all grouping and also removed the sub-query. I've slightly reworked the code structure so the ungrouped data returned by the query is aggregated before being passed to a shared method for addition to the actions table. The simplified query now executes in ~250ms for 2,000,000 conversion records. There might be a lot of data for the archiving process to work through, but there are only ~13 numeric columns per row so it shouldn't use too much memory.
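
Roughly, the PHP-side aggregation now looks something like the sketch below; the variable names, grouping key and the two summed metrics are made up for illustration, the real code handles all the numeric columns:

$aggregated = [];
foreach ($ungroupedRows as $row) {
    // one bucket per goal / visit / page action combination
    $key = $row['idgoal'] . '_' . $row['idvisit'] . '_' . $row['idaction'];

    if (!isset($aggregated[$key])) {
        $aggregated[$key] = ['nb_conversions' => 0, 'revenue' => 0];
    }

    // each ungrouped row is one conversion / page view pair
    $aggregated[$key]['nb_conversions'] += 1;
    $aggregated[$key]['revenue']        += $row['revenue'];
}
// $aggregated is then handed to the shared method that adds the
// page goal metrics to the actions data table.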

@sgiehl commented on April 11th 2022 Member

@bx80 Are the number changes in the test files in your last commit expected?
Imho that would either mean that the numbers were incorrect before, or that the new calculation doesn't produce the same result for some reason and might need to be adjusted.

Also, why did you remove all of the group by columns? Couldn't you keep the ones that were covered by an index and therefore fast? That way we might not need to process too many records in PHP.

@bx80 commented on April 11th 2022 Contributor

@sgiehl The group by column that wasn't covered by the index was the last one, log_action.idaction. If I leave the other group by columns in, the data would already be partly aggregated by goal and visit, and then it wouldn't be possible to break it down by page.

There was a difference in INDEX_GOAL_NB_VISITS_CONVERTED values for some tests after moving the grouping to PHP. I suspect the numbers were incorrect before and the TrackGoalsPagesTest is still passing, but it would be good to be sure.

I'll extend the SomePageGoalVisitsWithConversions fixture to record multiple conversions in a single visit. It's very time consuming to debug log replay tests for this sort of issue; a known pattern of visits, page views and conversions will be better.

@bx80 commented on April 13th 2022 Contributor

@sgiehl The TrackGoalsPagesTest now covers multiple goal conversions in a single visit. I've manually calculated what the page goal metrics should be (screenshot of calculations attached) and then checked the test results against those values. From what I can see the correct figures are being returned in all cases.

[screenshot: manual calculation of the expected page goal metrics]

@sgiehl commented on April 14th 2022 Member

@bx80 awesome. Thanks for adjusting & checking it in that much detail! I looked through the changes and everything looked fine to me. But while clicking through the UI I then realized that there is a problem with calculating the "Viewed before rate". That one is done using a filter. It works without problems if the table only has one level. But when URLs or titles are tracked that are displayed in a multi-level table, the rate for the subtables is calculated incorrectly. The reason for that is easy to explain, but I guess a lot harder to fix. When a subtable is loaded, the filter gets applied to the subtable only, so the filter uses the total number of goal conversions based on that subtable only. That can then result in numbers like these:

[screenshot: 'Viewed before rate' values calculated incorrectly for a subtable]
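
To put made-up numbers on it: suppose a goal converted 20 times in total, but the subtable for one page group only contains rows involved in 5 of those conversions. Applied to the subtable alone, the filter divides by 5 instead of 20, so a page viewed before 4 of the conversions shows a 'viewed before rate' of 80% instead of the correct 20%, and rates above 100% become possible.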

Besides that obviously visible problem, there are also table limits that might make the numbers incorrect. Let's assume a quite big site with thousands of pages. Each page might trigger goals. But the pages report is limited to e.g. 500 records (by config). The current calculation would only count the goal conversions that happened on a page that is included in the report.

I'm not sure what the best solution would be to solve those issues. One option would be to let the filter query Goals.get to receive the exact number of conversions per goal. Another would be to do that while archiving and include the total conversion metric in each record; that total metric could then be removed again by the calculation filter.

Regarding your test case, I guess it would be good to change it so that some URLs sharing the same paths are tracked, in order to get page reports with multiple levels.
Also it might be good to add tracking for a second (or third) day and request a week's report in the test as well, so we can ensure the aggregated numbers are correct too.

@github-actions[bot] commented on May 4th 2022 Contributor

If you don't want this PR to be closed automatically in 28 days then you need to assign the label 'Do not close'.

@bx80 commented on May 22nd 2022 Contributor

@sgiehl Thanks for the pointers, really helpful :+1:

I've taken the approach of having the CalculateConversionPageRate filter make an API call to Goals.get to find the total conversions for the parent data table. I briefly tried adding the conversion totals to each row when archiving, but this resulted in inflated conversion numbers (probably because the rows were automatically summed again; I didn't go any further with it). The API call seems cleaner and easier to understand.

As part of testing this I've improved the test case to include sub-pages, more conversions, more visits and visits across two days. A screenshot of the sanity check calculations is here:

[screenshot: sanity check calculations for the extended test case]

The new Goals_action_goals_visualization_page_urls_subtable.png UI data table screenshot values now match this spreadsheet exactly.

For pages with sub-pages, the 'viewed before rate' can end up calculating at more than 100%. This is going to be confusing, so I've tweaked the filter to cap the 'viewed before rate' at 100%: if at least one of the sub-pages was viewed before every conversion, then a 100% 'viewed before rate' has been achieved.
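
A sketch of the resulting calculation inside the filter; the exact parameters passed to Goals.get and the column handling are simplified assumptions, not the merged code:

// look up the true total conversions for a goal via the Goals API
$goalData = \Piwik\API\Request::processRequest('Goals.get', [
    'idSite' => $idSite,
    'period' => $period,
    'date'   => $date,
    'idGoal' => $idGoal,
]);
$totalConversions = $goalData->getFirstRow()->getColumn('nb_conversions');

// 'viewed before rate' per row, capped at 100% so pages with sub-pages
// cannot exceed it
$rate = min(1.0, $viewedBeforeConversions / max(1, $totalConversions));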

@bx80 commented on May 25th 2022 Contributor

Did you check how the archiving behaves if no goals are configured and ecommerce is disabled? Are any queries fired nevertheless and could maybe be avoided?

An attempt to look up goal conversion totals is only made for goals that are contained in the row metadata. So if there isn't any goal metadata then no queries will be made and the filter will not run at all.

This does mean looping over the table, checking the metadata on each row and putting the goal IDs into an array, but this should be fast with in-memory data. The alternative would be to make another API call to get all the available goals for the site and then check totals for each of these goals even if they are not present in the table; I think this would be much slower, especially if the site has dozens of goals but the data table is only being viewed for a single goal.
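
Something along these lines (a sketch; the metadata key and structure are assumptions based on the description above):

// collect the goal ids that actually appear in the table's row metadata;
// if none are found, no Goals.get lookups are made at all
$goalIds = [];
foreach ($dataTable->getRows() as $row) {
    $goals = $row->getMetadata('goals'); // metadata key is an assumption here
    if (!empty($goals)) {
        $goalIds = array_merge($goalIds, array_keys($goals));
    }
}
$goalIds = array_unique($goalIds);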

@bx80 commented on May 25th 2022 Contributor

There are some failed tests I'll need to check.

@bx80 commented on May 31st 2022 Contributor

The missing conversion rate formatting on the GoalsPages_individual_goal_updated.png UI test should be fixed once #19295 is merged into 4.x-dev

This Pull Request was closed on May 31st 2022