
Make archiving faster when thousands of websites with low or no traffic #5922

Closed
mattab opened this issue Aug 4, 2014 · 15 comments
Labels
answered: For when a question was asked and we referred to forum or answered it.
c: Performance: For when we could improve the performance / speed of Matomo.
c: Platform: For Matomo platform changes that aren't impacting any of our APIs but improve the core itself.
Enhancement: For new feature suggestions that enhance Matomo's capabilities or add a new report, new API etc.
Major: Indicates the severity or impact or benefit of an issue is much higher than normal but not critical.

Comments

@mattab
Member

mattab commented Aug 4, 2014

Let's discuss ideas to make archiving faster when there are thousands of websites with low or no traffic.

@mattab mattab added this to the Short term milestone Aug 4, 2014
@mattab
Member Author

mattab commented Aug 4, 2014

  • Instead of archiving each website separately and issuing SQL queries for each plugin of each website, each plugin would archive all websites with one set of SQL queries.
    • we could change the Archiver to issue a GROUP BY idsite statement to process all websites at once
    • it will only work if the dataset fits in memory (it may not work well for the biggest numbers of websites). Maybe we could process in batches of 1,000 websites?
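For illustration, the batched approach could be sketched as a single aggregated query (a sketch only, based on the Piwik log_visit schema; the exact metrics each plugin computes would differ):

```sql
-- Count visits for a whole batch of sites in one query, instead of
-- issuing one query per site. The IN (...) list would hold one chunk
-- of e.g. 1,000 site ids, so the result set stays small enough to
-- fit in memory.
SELECT idsite,
       COUNT(*)                  AS nb_visits,
       COUNT(DISTINCT idvisitor) AS nb_uniq_visitors
FROM log_visit
WHERE visit_last_action_time >= '2014-08-04 00:00:00'
  AND visit_last_action_time <  '2014-08-05 00:00:00'
  AND idsite IN (/* one batch of up to 1,000 site ids */)
GROUP BY idsite;
```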

@mattab
Member Author

mattab commented Aug 4, 2014

  • archive the websites with the most traffic first, to provide a better user experience
  • avoid re-processing the period archive (which is slow, as we select all sub-periods, sum the data and write it out). To avoid re-processing, we could update the existing archive for this period, setting its date2 column to today. This avoids a lot of writing, and also avoids having to delete these outdated "current period" archives. From Archiving performance for thousands of websites often without traffic #4940
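The second idea could be sketched like this (a sketch only; it assumes the archive_numeric_* table layout where date1/date2 are the period bounds, and idarchive 123 is a hypothetical id of the existing "current period" archive):

```sql
-- Instead of deleting the outdated "current period" archive and
-- re-computing it from all sub-periods, extend the existing row so
-- its period now ends today. Numeric/blob values that changed would
-- still need to be updated separately.
UPDATE archive_numeric_2014_08
SET date2 = CURDATE()
WHERE idarchive = 123;
```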

@czolnowski
Contributor

  • use the tracker cache/db to store each site's last visit, so archiving can check whether a site has had any visits

@mattab
Member Author

mattab commented Aug 4, 2014

If anyone has an archive.log for this use case of thousands of websites with low traffic, please attach it to this ticket. Thanks!

@bjornhij

The log file of archiving about 8,000 small websites is available at the following URL (I can only attach images here, so I have to link to an external URL):

http://stats.exto.nl/5922/archive.log

Hope this helps. Since archiving was moved to a CLI process, it became too slow to archive every day.

@mattab mattab added the c: Platform For Matomo platform changes that aren't impacting any of our APIs but improve the core itself. label Sep 9, 2014
@mattab
Member Author

mattab commented Dec 15, 2014

Right now I'm archiving some empty websites, and it takes a long time:

INFO CoreConsole[2014-12-15 23:18:14] [9369d] Archived website id = 14, period = day, 0 visits in last last52 days, 0 visits today, Time elapsed: 1.235s
INFO CoreConsole[2014-12-15 23:18:29] [9369d] Archived website id = 14, period = week, 0 visits in last last260 weeks, 0 visits this week, Time elapsed: 15.551s
INFO CoreConsole[2014-12-15 23:18:35] [9369d] Archived website id = 14, period = month, 0 visits in last last52 months, 0 visits this month, Time elapsed: 5.111s
INFO CoreConsole[2014-12-15 23:18:35] [9369d] Archived website id = 14, period = year, 0 visits in last last7 years, 0 visits this year, Time elapsed: 0.738s
INFO CoreConsole[2014-12-15 23:18:35] [9369d] Archived website id = 14, 4 API requests, Time elapsed: 22.638s [1/1365 done]

Note that it was the first time those websites were being archived, which explains some of the slowness, but still: something should be done so that archiving an empty website takes much less than 22 seconds.

@tsteur
Member

tsteur commented Apr 7, 2015

As @czolnowski suggested, something like a per-site flag should help, I reckon. Basically, we want to know whether there has been at least one tracking request since the last archiving. Or to put it differently: we only want to trigger archiving for sites that have had at least one tracking request since the last archiving run. This will only help if one has many sites with 0 visits; it does not help for many sites with low traffic.

At first I thought we could just query the log_visit table or the log_link_visit_action table to check whether there has been any record since the last archiving, but that's probably not doable, since one might track something for a previous day, which invalidates the archives. I'm not sure if we update the last archiving date in this case; maybe it is doable.

It is probably not worth storing a flag in the option table or similar, as it would make tracking slower. A cache file etc. also cannot be used, as we might clear the cache. If there's a solution, it should probably be based on the log_ tables.
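A log_visit-based check along these lines could be sketched as (a sketch only; as noted above, it does not cover requests tracked for a past date, which invalidate older archives):

```sql
-- For each site, find the time of the most recent tracking request.
-- The archiver would then skip any site whose last visit predates
-- the previous archiving run.
SELECT idsite, MAX(visit_last_action_time) AS last_tracking_request
FROM log_visit
GROUP BY idsite;
```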

@mattab
Member Author

mattab commented Apr 9, 2015

We only want to trigger archiving for sites that have had at least one tracking request since the last archiving run.

If you don't trigger archiving when there are no new visits, then we would have missing daily archives and missing week/month/year archives, leading to "no data" in some reports. We'd need to change more code (eg. change the code that deletes out-of-date archives).

Maybe there is instead some room to decrease the CPU wall time of archiving requests for "no traffic" websites, to make them very fast?

@tsteur
Member

tsteur commented Apr 9, 2015

When there is already an archive and there were no visits, we don't need to re-archive, right? Of course we might need to change some code, but that's normal, right?

@mattab
Member Author

mattab commented Apr 9, 2015

When there is already an archive and there were no visits, we don't need to re-archive, right?
Of course we might need to change some code, but that's normal, right?

Yes, we'd need to change code (the archive selector would need to allow reading old archives, and we would have to stop purging outdated archives, as we may read them if we don't re-archive every day, e.g. 5-day-old archives)... it's possible, but maybe error-prone.

I was hoping we could make the "pre-archiving a site when there is no visit" request so fast that we wouldn't need to be clever about reading old archives, etc. I'm not sure if archiving those "low / no traffic" days very fast is really possible, though?

@tsteur
Member

tsteur commented Apr 10, 2015

FYI: a quick profile of archiving one day for one site shows 60-80% of the time is spent outside archiving, on bootstrapping Piwik, loading all reports, all segments, ... It might be possible to make a faster version for the CLI that doesn't bootstrap as much and directly calls something like (new Piwik\ArchiveProcessor\Loader())->prepareArchive().

@mattab mattab added the Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical. label Apr 10, 2015
@diosmosis
Member

Would it be possible to publish these profiles? It would be interesting to examine them.

@tsteur
Member

tsteur commented Apr 12, 2015

I didn't have it anymore, but I quickly started the archiver again. Attached is a screenshot:
(screenshot: archiver_profile)

I'll send you the actual profile via message, as I cannot attach it here.

I wrote another version of the archiver where I just did the following in a simple command:

$pluginNames = Plugin\Manager::getInstance()->getAllPluginsNames(); // we should only get plugins that have an Archiver
$params = new Parameters(new Site($idSite), Factory::build($period, $date), new Segment('', array($idSite)));
$loader = new Loader($params);
foreach ($pluginNames as $pluginName) {
    $loader->prepareArchive($pluginName);
}

This way I imported more than 300-400 sites per minute (period = day), each with 2-6 visits, instead of only 50 sites per minute.

@mattab
Member Author

mattab commented Apr 27, 2015

Running core:archive for a website that has no data is still slow. It takes about 20 seconds on my laptop to process the day/week/month/year/range reports (5 API requests). Here is a typical output:

INFO [2015-04-27 05:11:16] Starting Piwik reports archiving...
INFO [2015-04-27 05:11:16] Will pre-process for website id = 27, day period
INFO [2015-04-27 05:11:16] - pre-processing all visits
INFO [2015-04-27 05:11:17] Archived website id = 27, period = day, 0 segments, 0 visits in last last9 days, 0 visits today, Time elapsed: 0.023s
INFO [2015-04-27 05:11:17] Will pre-process for website id = 27, week period
INFO [2015-04-27 05:11:17] - pre-processing all visits
INFO [2015-04-27 05:11:19] Archived website id = 27, period = week, 0 segments, 0 visits in last last9 weeks, 0 visits this week, Time elapsed: 1.569s
INFO [2015-04-27 05:11:19] Will pre-process for website id = 27, month period
INFO [2015-04-27 05:11:19] - pre-processing all visits
INFO [2015-04-27 05:11:23] Archived website id = 27, period = month, 0 segments, 0 visits in last last9 months, 0 visits this month, Time elapsed: 3.811s
INFO [2015-04-27 05:11:23] Will pre-process for website id = 27, year period
INFO [2015-04-27 05:11:23] - pre-processing all visits
INFO [2015-04-27 05:11:37] Archived website id = 27, period = year, 0 segments, 0 visits in last last7 years, 0 visits this year, Time elapsed: 14.471s
INFO [2015-04-27 05:11:37] Will pre-process for website id = 27, range period
INFO [2015-04-27 05:11:37] - pre-processing all visits
INFO [2015-04-27 05:11:38] Archived website id = 27, period = range, 0 segments, 0 visits in last previous30 ranges, 0 visits this range, Time elapsed: 0.497s
INFO [2015-04-27 05:11:38] Archived website id = 27, 5 API requests, Time elapsed: 21.300s [1/74 done]

The problem gets N times worse when you add N segments...

If we can improve this performance in future Piwik versions, it would for sure help a lot!

@mattab
Member Author

mattab commented Nov 25, 2015

Closing this issue, as we have made heaps of progress recently and this issue's scope is too wide.

Well done to the team for all the improvements made in the last few months!
