Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tweak behaviour of orphaned segment archive purge #14857

Merged
merged 18 commits into from Sep 16, 2019
Merged

Tweak behaviour of orphaned segment archive purge #14857

merged 18 commits into from Sep 16, 2019

Conversation

katebutler
Copy link

Fixes #14828

@katebutler katebutler added the Pull Request WIP Indicates the current pull request is still work in progress and not ready yet for a review. label Sep 4, 2019
@katebutler katebutler added this to the 3.12.0 milestone Sep 4, 2019
@@ -171,6 +171,7 @@ private function getModel()
protected function logArchiveStatusAsIncomplete()
{
$this->insertRecord($this->doneFlag, self::DONE_ERROR);
$this->flushSpool('numeric');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw I had removed that one before merging the PR re the spools as after some more investigating found it wasn't needed.

*/
public function getSegmentsDeletedSince(Date $date)
{
$dateStr = $date->toString('Y-m-d');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use $date->getDatetime()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually seeing it seems to be expecting a timestamp so might have to use getTimestamp() alternatively

{
$dateStr = $date->toString('Y-m-d');
$sql = "SELECT DISTINCT definition, enable_only_idsite FROM " . Common::prefixTable('segment')
. " WHERE deleted = 1 AND ts_last_edit >= '$dateStr'";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we want to use bound parameters... meaning AND ts_last_edit >= ? and then below we use Db::fetchAll($sql, array($dateStr)). In this case it wouldn't be a security issue to not use bound parameters but it's still better in general and makes sure it can't become a security issue in the future

$idSite = (int)$segment['enable_only_idsite'];
$segmentHash = Segment::getSegmentHash($segment['definition']);
// Valid segment hashes are md5 strings - just confirm that it is so it's safe for SQL injection
if (!preg_match('/^[a-z0-9A-Z]+$/', $segmentHash)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just btw here we can use ctype_xdigit($segmentHash)

$idSiteClause = ' AND idsite = ' . $idSite;
} elseif (! empty($segment['idsites_to_preserve'])) {
// A segment for all sites was deleted, but there are segments for a single site with the same definition
$idSiteClause = ' AND idsite NOT IN (' . implode(',', $segment['idsites_to_preserve']) . ')';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be safe re possible SQL injection we should do $segment['idsites_to_preserve'] = array_map('intval', $segment['idsites_to_preserve'])

return array_column($rows, 'idarchive');
}

private static function getDeletedSegmentWhereClause(array $segment)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw any reason this method is static? Looks like it's not needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@katebutler be good to not have the method static if possible

$validSegmentClauses[] = $sql;
$segmentClauses = [];
foreach ($segments as $segment) {
$segmentClauses[] = self::getDeletedSegmentWhereClause($segment);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw be good to add the segmentClause only if

if(!empty($segment['definition'])) {
     $segmentClauses[] = self::getDeletedSegmentWhereClause($segment);
 }

because otherwise it would delete archives where done%. It shouldn't ever happen but could to have a safety net here just in case in the future somewhere definition suddenly gets lost or someone is not using the method correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this will be really important. I just had triggered a case where definition wasn't set for some reason. So we need to make sure to not accidentally delete regular archives

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@katebutler be good to add the definition check just to be 100% safe.

@tsteur
Copy link
Member

tsteur commented Sep 6, 2019

Found one bug @katebutler

I had these segments (among some others):

INSERT INTO `piwik_segment` (`idsegment`, `name`, `definition`, `login`, `enable_all_users`, `enable_only_idsite`, `auto_archive`, `ts_created`, `ts_last_edit`, `deleted`)
VALUES
	(48, 'exclude funds library', 'customVariableValue1!@HTTP-Grabber%2520www.fundslibrary.co.uk', 'root', 0, 1, 1, '2019-04-08 02:43:54', '2019-08-30 22:48:49', 1),
	(49, 'tets3', 'actions>=1', 'root', 0, 0, 1, '2019-08-29 02:14:09', '2019-08-30 22:48:49', 0),
	(50, 'test23434', 'actions>=1', 'root', 0, 0, 1, '2019-08-29 02:14:54', '2019-08-30 22:48:49', 1),
	(51, 'tererer', 'actions>=1', 'root', 0, 3, 1, '2019-08-29 02:15:33', '2019-08-30 22:48:49', 0),
	(52, 'tererer', 'actions>=1', 'root', 0, 1, 1, '2019-08-29 02:15:42', '2019-08-30 22:48:49', 1);

and it generated this output

array (
  0 => 
  array (
    'definition' => 'customVariableValue1!@HTTP-Grabber%2520www.fundslibrary.co.uk',
    'enable_only_idsite' => '1',
    'idsites_to_preserve' => 
    array (
    ),
  ),
  1 => 
  array (
    'idsites_to_preserve' => 
    array (
      0 => '3',
    ),
  ),
)

Causing the deletion of regular segments. The problem is that it first was doing unset($deletedSegments[$i]); and then in the next iteration it was doing $deletedSegments[$i]['idsites_to_preserve'][] = $existing['enable_only_idsite'];. I reckon it might help to break the loop after calling unset($deletedSegments[$i]);.

@tsteur
Copy link
Member

tsteur commented Sep 6, 2019

BTW Might be good to add a test for this particular case where the logic was failing

@katebutler katebutler added Needs Review PRs that need a code review and removed Pull Request WIP Indicates the current pull request is still work in progress and not ready yet for a review. labels Sep 10, 2019
return array_column($rows, 'idarchive');
}

private static function getDeletedSegmentWhereClause(array $segment)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@katebutler be good to not have the method static if possible

$validSegmentClauses[] = $sql;
$segmentClauses = [];
foreach ($segments as $segment) {
$segmentClauses[] = self::getDeletedSegmentWhereClause($segment);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@katebutler be good to add the definition check just to be 100% safe.

return array();
}

$existingSegments = self::getExistingSegmnetsLike($deletedSegments);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just btw there is small typo getExistingSegmnetsLike => getExistingSegmentsLike


private static function getExistingSegmnetsLike(array $segments)
{
$whereClauses = array();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be good to add a little check if (empty($segments)) { return array(); }. I know this is done in the method that is calling this method already, be just good to add this here as well in case the method will be used in the future somewhere else too and then it prevents the case of having a broken sql..
also be good to remove the static and make it a regular method

$bind = array();
foreach ($segments as $segment) {
$bind[] = $segment['definition'];
if ($segment['enable_only_idsite'] == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@katebutler but concerned about this code here: https://github.com/matomo-org/matomo/blob/3.12.0-b3/core/Segment.php#L104-L108

Not sure if this can have any effect on the condition. I wonder if there are cases where we would store actionType==sitesearches as definition in the DB, and if passed differently in the API if we would also accept / store "actionType%3D%3Dsitesearches in case it is double encoded and not just once... according to the logic both segment definitions would be accepted in that class and have the same hash. The segment editor would probably store the segments always in the same way, but if someone uses the API and double encodes it, this might still work by the looks and we would store it urlencoded...

actually I just tested it and this would be the case:
image

See https://example.com/index.php?module=API&method=SegmentEditor.add&format=JSON&token_auth=88d86212d1967a8565d05e51b1bfaa19&idSite=1&name=foo&definition=actionType%253D%253Dsitesearches vs using actionType%3D%3Dsitesearches as definition.

So it seems we also need to do (definition = ? or definition = ? or definition = ?) with a bind of $definition, urlencode($definition), urldecode($definition) for each segment :( because we can't know if the given definition is urlencoded or urldecoded etc.

*/
public function getSegmentsDeletedSince(Date $date)
{
$dateStr = $date->getDatetime();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@katebutler not 100% sure why but for my MySQL version it works only when using getTimestamp() instead of getDatetime(). Makes kind of sense since the data is stored as timestamp but not sure how it works somewhere else.

Can you test to use a getTimestamp() and see what it does for you if tests still pass? Then this seems to be the safer option

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an example $date->getTimestamp() is 1567934627 for 2019-09-08. On my localhost when I use that value in the query it matches any record with ts_last_edit not null, even if I set it right back as close to epoch as mysql will let me go (1970-01-01 12:01). This causes the relevant test to fail as the code picks up records that have been deleted more than a week ago.
I am on quite an old version of mysql (5.7.27).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good! The date in my virtual machine was outdated :( works 👍

@tsteur tsteur merged commit 9f767b4 into 3.x-dev Sep 16, 2019
@tsteur tsteur deleted the 14828 branch September 16, 2019 20:50
@mattab mattab added the not-in-changelog For issues or pull requests that should not be included in our release changelog on matomo.org. label Oct 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Review PRs that need a code review not-in-changelog For issues or pull requests that should not be included in our release changelog on matomo.org.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tweak behaviour of orphaned segments
3 participants