New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tweak behaviour of orphaned segment archive purge #14857
Conversation
…L rather than infile for writing numeric spool
# Conflicts: # core/DataAccess/ArchiveWriter.php
# Conflicts: # core/DataAccess/ArchiveWriter.php
…ecently soft-deleted segments
# Conflicts: # core/DataAccess/ArchiveWriter.php # core/Db/BatchInsert.php # tests/PHPUnit/Integration/DataAccess/ArchiveWriterTest.php
# Conflicts: # core/Archive/ArchivePurger.php
core/DataAccess/ArchiveWriter.php
Outdated
@@ -171,6 +171,7 @@ private function getModel() | |||
protected function logArchiveStatusAsIncomplete() | |||
{ | |||
$this->insertRecord($this->doneFlag, self::DONE_ERROR); | |||
$this->flushSpool('numeric'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw I had removed that one before merging the PR re the spools as after some more investigating found it wasn't needed.
plugins/SegmentEditor/Model.php
Outdated
*/ | ||
public function getSegmentsDeletedSince(Date $date) | ||
{ | ||
$dateStr = $date->toString('Y-m-d'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use $date->getDatetime()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually seeing it seems to be expecting a timestamp so might have to use getTimestamp()
alternatively
plugins/SegmentEditor/Model.php
Outdated
{ | ||
$dateStr = $date->toString('Y-m-d'); | ||
$sql = "SELECT DISTINCT definition, enable_only_idsite FROM " . Common::prefixTable('segment') | ||
. " WHERE deleted = 1 AND ts_last_edit >= '$dateStr'"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we want to use bound parameters... meaning AND ts_last_edit >= ?
and then below we use Db::fetchAll($sql, array($dateStr))
. In this case it wouldn't be a security issue to not use bound parameters but it's still better in general and makes sure it can't become a security issue in the future
core/DataAccess/Model.php
Outdated
$idSite = (int)$segment['enable_only_idsite']; | ||
$segmentHash = Segment::getSegmentHash($segment['definition']); | ||
// Valid segment hashes are md5 strings - just confirm that it is so it's safe for SQL injection | ||
if (!preg_match('/^[a-z0-9A-Z]+$/', $segmentHash)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just btw here we can use ctype_xdigit($segmentHash)
core/DataAccess/Model.php
Outdated
$idSiteClause = ' AND idsite = ' . $idSite; | ||
} elseif (! empty($segment['idsites_to_preserve'])) { | ||
// A segment for all sites was deleted, but there are segments for a single site with the same definition | ||
$idSiteClause = ' AND idsite NOT IN (' . implode(',', $segment['idsites_to_preserve']) . ')'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be safe re possible SQL injection we should do $segment['idsites_to_preserve'] = array_map('intval', $segment['idsites_to_preserve'])
core/DataAccess/Model.php
Outdated
return array_column($rows, 'idarchive'); | ||
} | ||
|
||
private static function getDeletedSegmentWhereClause(array $segment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw any reason this method is static? Looks like it's not needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@katebutler be good to not have the method static if possible
core/DataAccess/Model.php
Outdated
$validSegmentClauses[] = $sql; | ||
$segmentClauses = []; | ||
foreach ($segments as $segment) { | ||
$segmentClauses[] = self::getDeletedSegmentWhereClause($segment); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw be good to add the segmentClause only if
if(!empty($segment['definition'])) {
$segmentClauses[] = self::getDeletedSegmentWhereClause($segment);
}
because otherwise it would delete archives where done%
. It shouldn't ever happen but could to have a safety net here just in case in the future somewhere definition
suddenly gets lost or someone is not using the method correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this will be really important. I just had triggered a case where definition
wasn't set for some reason. So we need to make sure to not accidentally delete regular archives
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@katebutler be good to add the definition check just to be 100% safe.
Found one bug @katebutler I had these segments (among some others):
and it generated this output
Causing the deletion of regular segments. The problem is that it first was doing |
BTW Might be good to add a test for this particular case where the logic was failing |
core/DataAccess/Model.php
Outdated
return array_column($rows, 'idarchive'); | ||
} | ||
|
||
private static function getDeletedSegmentWhereClause(array $segment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@katebutler be good to not have the method static if possible
core/DataAccess/Model.php
Outdated
$validSegmentClauses[] = $sql; | ||
$segmentClauses = []; | ||
foreach ($segments as $segment) { | ||
$segmentClauses[] = self::getDeletedSegmentWhereClause($segment); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@katebutler be good to add the definition check just to be 100% safe.
plugins/SegmentEditor/Model.php
Outdated
return array(); | ||
} | ||
|
||
$existingSegments = self::getExistingSegmnetsLike($deletedSegments); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just btw there is small typo getExistingSegmnetsLike
=> getExistingSegmentsLike
|
||
private static function getExistingSegmnetsLike(array $segments) | ||
{ | ||
$whereClauses = array(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
be good to add a little check if (empty($segments)) { return array(); }
. I know this is done in the method that is calling this method already, be just good to add this here as well in case the method will be used in the future somewhere else too and then it prevents the case of having a broken sql..
also be good to remove the static
and make it a regular method
$bind = array(); | ||
foreach ($segments as $segment) { | ||
$bind[] = $segment['definition']; | ||
if ($segment['enable_only_idsite'] == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@katebutler but concerned about this code here: https://github.com/matomo-org/matomo/blob/3.12.0-b3/core/Segment.php#L104-L108
Not sure if this can have any effect on the condition. I wonder if there are cases where we would store actionType==sitesearches
as definition in the DB, and if passed differently in the API if we would also accept / store "actionType%3D%3Dsitesearches
in case it is double encoded and not just once... according to the logic both segment definitions would be accepted in that class and have the same hash. The segment editor would probably store the segments always in the same way, but if someone uses the API and double encodes it, this might still work by the looks and we would store it urlencoded...
actually I just tested it and this would be the case:
See https://example.com/index.php?module=API&method=SegmentEditor.add&format=JSON&token_auth=88d86212d1967a8565d05e51b1bfaa19&idSite=1&name=foo&definition=actionType%253D%253Dsitesearches
vs using actionType%3D%3Dsitesearches
as definition.
So it seems we also need to do (definition = ? or definition = ? or definition = ?)
with a bind of $definition, urlencode($definition), urldecode($definition)
for each segment :( because we can't know if the given definition is urlencoded or urldecoded etc.
*/ | ||
public function getSegmentsDeletedSince(Date $date) | ||
{ | ||
$dateStr = $date->getDatetime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@katebutler not 100% sure why but for my MySQL version it works only when using getTimestamp()
instead of getDatetime()
. Makes kind of sense since the data is stored as timestamp but not sure how it works somewhere else.
Can you test to use a getTimestamp()
and see what it does for you if tests still pass? Then this seems to be the safer option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an example $date->getTimestamp()
is 1567934627 for 2019-09-08. On my localhost when I use that value in the query it matches any record with ts_last_edit
not null, even if I set it right back as close to epoch as mysql will let me go (1970-01-01 12:01). This causes the relevant test to fail as the code picks up records that have been deleted more than a week ago.
I am on quite an old version of mysql (5.7.27).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good! The date in my virtual machine was outdated :( works 👍
Fixes #14828