Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite cron archiving process for easier maintenance and performance #15499

Merged
merged 69 commits into from May 7, 2020

Conversation

diosmosis
Copy link
Member

@diosmosis diosmosis commented Feb 1, 2020

New workflow

The current workflow for core:archive is to check if there are visits for a site, then launching archiving w/ lastN days/weeks/months/years. The new workflow is: invalidate archive, run core:archive and let it process the queue.

Relevant changes

  • Archive invalidation now creates archive rows w/ NULL ts_archived if there isn't an existing archive to set the done value for. This allows core:archive to notice and process the archive, even though one never existed beforehand.
  • core:archive now selects the next invalidated archive from the archive tables and marks it as in progress so we don't have to look for commands that are already running to know if an archive is being processed.

Notes

  • there are some bug fixes to the archive invalidation code here, some archives were not invalidated that should be, and some archives were invalidated that did not need to be.
  • there is an optimization in Loader.php that vastly improves the performance for archiving sites with no visits (as demonstration, all the system test jobs are much faster now; the one w/ ArchiveCronTest eg has improved from 21 mins to 6 mins).

Fixes #15117

@diosmosis diosmosis added the Pull Request WIP Indicates the current pull request is still work in progress and not ready yet for a review. label Feb 1, 2020
@diosmosis diosmosis added this to the 4.0.0 milestone Feb 1, 2020
@diosmosis
Copy link
Member Author

diosmosis commented Apr 23, 2020

@tsteur it's needed to know when an archiving job was killed or was stopped for some other reason. in that case the archive row is set to DONE_IN_PROGRESS, but never gets finalized. So we use the lock + it's expiration time to know if we can restart it.

Edit: it might be possible to remove the need if we have a start or expiration time in the archive and update that instead. Though that might just move the I/O from the lock table to the archive table.

Edit 2: actually, maybe we can fix this by adding the invalidation table, can you see my comment in the other issue?

@diosmosis
Copy link
Member Author

@tsteur ready for another review, there is one test w/ a random failure (fails due to day change), and a couple minor TODO, but can be reviewed.

Copy link
Member

@tsteur tsteur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good... left only a few comments @diosmosis

@@ -299,6 +299,21 @@ public function getTablesCreateSql()
) ENGINE=$engine DEFAULT CHARSET=utf8
",

'archive_invalidations' => "CREATE TABLE `{$prefixTables}archive_invalidations` (
idinvalidation BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
idarchive INTEGER UNSIGNED NULL,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw is it supposed to be nullable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is in case it's for an archive that does not exist. Eg, if we invalidate a day archive, but there is no week, month or year (because here are no visits for them or it hasn't been processed), we still want to process them. In this case we have to set idarchive to NULL.

ts_invalidated DATETIME NULL,
status TINYINT(1) UNSIGNED DEFAULT 0,
PRIMARY KEY(idinvalidation),
INDEX index_idsite_dates_period_name(idsite, date1, date2, period, name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw be the index maybe better idsite,period,date1 eg as used in the getNextInvalidatedArchive () query?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that might work, will have to check it's ok for ranges in case I query for individual invalidations... I don't think I do...

// + 1 Done flag per Plugin, for each "Last N" date
'archive_numeric_2010_01' => 138,
// for each "Last N" date that has data (just one date)
'archive_numeric_2010_01' => 22,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just double checking @diosmosis do you maybe know why it were before so many entries? is it less now because we delete some archives?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR has a big optimization for sites/periods that have no visits (see PR description). If we see there will be no visits for an archive, we don't even initiate archiving, so no archive is created.

}

$fields = ['idarchive', 'name', 'idsite', 'date1', 'date2', 'period', 'ts_invalidated'];
Db\BatchInsert::tableInsertBatch('archive_invalidations', $fields, $dummyArchives);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi @diosmosis seems we need to pass the table with a prefixed name here... got this error:

  Try #1: LOAD DATA INFILE : SQLSTATE[42S02]: Base table or view not found: 1146 Table 'piwik3.archive_invalidations' doesn't exist[42],
  Try #2: LOAD DATA LOCAL INFILE : SQLSTATE[42000]: Syntax error or access violation: 1148 The used command is not allowed with this MySQL version[42000]
ERROR [2020-05-06 04:20:56] 5905   Uncaught exception: libs/Zend/Db/Statement/Pdo.php(234): SQLSTATE[HY093]: Invalid parameter number: parameter was not defined
SQLSTATE[HY093]: Invalid parameter number: parameter was not defined
#0 libs/Zend/Db/Statement.php(300): Zend_Db_Statement_Pdo->_execute(Array)
#1 libs/Zend/Db/Adapter/Abstract.php(479): Zend_Db_Statement->execute(Array)
#2 libs/Zend/Db/Adapter/Pdo/Abstract.php(238): Zend_Db_Adapter_Abstract->query('INSERT IGNORE I...', Array)
#3 core/Db/Adapter/Pdo/Mysql.php(309): Zend_Db_Adapter_Pdo_Abstract->query('INSERT IGNORE I...', Array)
#4 core/Db.php(276): Piwik\Db\Adapter\Pdo\Mysql->query('INSERT IGNORE I...', Array)
#5 core/Db/BatchInsert.php(41): Piwik\Db::query('INSERT IGNORE I...', Array)
#6 core/Db/BatchInsert.php(133): Piwik\Db\BatchInsert::tableInsertBatchIterate('archive_invalid...', Array, Array)
#7 core/DataAccess/Model.php(202): Piwik\Db\BatchInsert::tableInsertBatch('archive_invalid...', Array, Array)
#8 core/Archive/ArchiveInvalidator.php(433): Piwik\DataAccess\Model->updateArchiveAsInvalidated('piwik_archive_n...', Array, Array, NULL, false)
#9 core/Archive/ArchiveInvalidator.php(285): Piwik\Archive\ArchiveInvalidator->markArchivesInvalidated(Array, Array, NULL, true, false)
#10 plugins/CoreAdminHome/API.php(163): Piwik\Archive\ArchiveInvalidator->markArchivesAsInvalidated(Array, Array, 'day', NULL, false, false)
#11 core/CronArchive.php(955): Piwik\Plugins\CoreAdminHome\API->invalidateArchivedReports(Array, Array, 'day')
#12 core/CronArchive.php(875): Piwik\CronArchive->invalidateRecentDate('yesterday')
#13 core/CronArchive.php(341): Piwik\CronArchive->invalidateArchivedReportsForSitesThatNeedToBeArchivedAgain()
#14 core/CronArchive.php(258): Piwik\CronArchive->run()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just the prefix locally in my test but then still got the bound parameter wrong error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the content of $dummyArchives which actually looked good:

array (
  0 => 
  array (
    'idarchive' => '22',
    'name' => 'done',
    'idsite' => 1,
    'date1' => '2020-05-04 00:00:00',
    'date2' => '2020-05-04 00:00:00',
    'period' => 1,
    'ts_invalidated' => '2020-05-06 04:25:23',
  ),
  1 => 
  array (
    'idarchive' => '21',
    'name' => 'done',
    'idsite' => 1,
    'date1' => '2020-05-04 00:00:00',
    'date2' => '2020-05-10 00:00:00',
    'period' => 2,
    'ts_invalidated' => '2020-05-06 04:25:23',
  ),
  2 => 
  array (
    'idarchive' => '19',
    'name' => 'done',
    'idsite' => 1,
    'date1' => '2020-05-01 00:00:00',
    'date2' => '2020-05-31 00:00:00',
    'period' => 3,
    'ts_invalidated' => '2020-05-06 04:25:23',
  ),
)

resulting in Uncaught exception: parameter number: parameter was not defined

It generated this query:

INSERT IGNORE INTO piwik_archive_invalidations
					  (idarchive,name,idsite,date1,date2,period,ts_invalidated)
					  VALUES (?,?,?,?,?,?,?)

The bind value looks good so not quite sure what causes it
image
image
Not quite understanding it... Maybe you can reproduce when eg disabling [General] enable_load_data_infile = 1. Otherwise must be something with my install maybe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, weird, will look into it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this in another PR: #15919

@diosmosis
Copy link
Member Author

@tsteur applied review feedback + answered questions + created another PR for the error you encountered: #15919

@tsteur
Copy link
Member

tsteur commented May 6, 2020

Works for me now @diosmosis and reckon we can merge if tests pass (currently they are all failing).

BTW... I was thinking with this new table if we should change the logic in https://github.com/matomo-org/matomo/blob/3.13.5/core/Archive/ArchiveInvalidator.php#L76-L91 ?

In Tracker/Visit.php we mark an archive to be invalidated if the tracking request is for a previous day. In tracking request it then checks if the tracker cache has already registered an invalidation or not by querying the tracker cache. This tracker cache is populated from the getAllRememberToInvalidateArchivedReportsLater method and wondering if this can be all replaced now to also directly query the invalidation table to have one "source of truth" only and not have a similar queue again in the options table etc. rememberToInvalidateArchivedReportsLater could then also directly insert an entry into the archive invalidation table instead of the options table?

The only thing I can think of, is that when many archive invalidations are queued eg because of a changed segment, then the tracker cache might be getting a bit big but this shouldn't be the case normally.

Not sure it's clear what I mean? Any thoughts?

@diosmosis
Copy link
Member Author

Not sure it's clear what I mean? Any thoughts?

If you mean query archive_invalidations in the tracker for rememberToInvalidateArchivedReportsLater, then it means it might make archive_invalidations a bottleneck? I guess it's hard to say how it would behave... the invalidations table will generally be smaller though (except when core:archive isn't called), so maybe it would be better... could try it in another PR?

@tsteur
Copy link
Member

tsteur commented May 7, 2020

If you mean query archive_invalidations in the tracker for rememberToInvalidateArchivedReportsLater, then it means it might make archive_invalidations a bottleneck

I meant calling it when populating the tracker cache... and in tracking mode we would check the tracker cache (not the DB directly). Pretty much like it works now just we don't look at options table anymore when populating the cache or when remembering to invalidate an archive. Bottleneck wise it should in the end be the same no matter if we set options table or add an entry to invalidate archived reports table.

@tsteur
Copy link
Member

tsteur commented May 7, 2020

Actually @diosmosis I realise this might not be best to do since we'd need to invalidate directly also the week, month, year and per tracking request create several invalidation entries... so maybe we keep the option table for now... and PR should be good to merge once tests pass

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Review PRs that need a code review not-in-changelog For issues or pull requests that should not be included in our release changelog on matomo.org.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor cron archiving for simplicity
3 participants