Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor new action insertion code so duplicate actions will not exist + provide means to fix duplicates #7112

Merged
merged 17 commits into from Feb 9, 2015

Conversation

diosmosis
Copy link
Member

This pull request modifies the tracking logic so duplicate actions will not exist in the log_action table and so even if duplicate actions exist in the log_action table, only one action ID will be used at any time.

Fixes #6436

Changes include:

  • new code when inserting an action, instead of just executing an INSERT and returning the last insert ID, the tracker will now:
    • perform an insert
    • select the smallest idaction that matches the name (another query)
    • check if the last insert ID != selected ID, and if true, delete the inserted action
  • changes to SQL that selects idactions from the log_action table so they will always select the lowest idaction. Sometimes involves group bys.
  • there's an update + console command for fixing duplicates. Neither one invalidates existing archives, since I couldn't find a good way to do this in the Update w/o running possibly big queries when creating the Update SQL.
    • I'm not sure if having an update is the best idea. It will automatically remove duplicates for all users, but the SQL it uses to do so is dependent on the data in the tables. So users that want to run the SQL statements themselves might be a bit confused/put-off. Also, w/o the update, I can add archive invalidation to the command.

@mnapoli @tsteur @mattab What are your thoughts?

@mnapoli
Copy link
Contributor

mnapoli commented Jan 31, 2015

I remember a discussion about a UNIQUE index, the reason it wasn't implemented that way was because of the potential index size right?

Regarding the patch I don't have a lot of feedback, I'm not familiar at all with the tracker. Is the additional query a problem? I guess it's searching against the whole table, so I wonder how it performs.

@diosmosis
Copy link
Member Author

Do you mean the query after an INSERT? There's an index on hash + type, so it should be fast. I'm not positive about the group by's but I think the index will be used there too, since it should group by type + hash then name.

EDIT: Yeah, I was going to put a UNIQUE index, but there's no size limit on the action name, and unique indexes would require one. Since URLs can be large, a realistic limit would result in a very large index (I think).

@tsteur
Copy link
Member

tsteur commented Jan 31, 2015

Do you mind writing tests for the Tracker/Model part?

I'm not positive about the group by's but I think the index will be used there too, since it should group by type + hash then name.

Maybe we can ask PRO to run such a query on a big instance with EXPLAINS to be sure an index is used?

Also not sure re the Update. I would probably just remove it and if someone wants to run it, they can use the command. Should be announced in the release blog post and mentioned in FAQ

@diosmosis
Copy link
Member Author

@tsteur Added tests, still waiting for EXPLAINS results. Mind performing another review in the meantime?

@tsteur
Copy link
Member

tsteur commented Feb 3, 2015

Looking for duplicated action ids this way looks like a pretty good solution to me 👍 It shouldn't cause much overhead as it is only done once on each action creation anyway.

The DuplicateActionRemover class contains many private methods which sometimes indicate that the class could be split into multiple classes but I'm not so deep in it as you are. Maybe It's worth it, maybe not. Might make more methods separately/easier testable when they are public afterwards. Maybe the SQL parts of that class can be also moved into the Tracker\Model part so it can be reused and easier tested when needed?

Otherwise 👍

diosmosis added 9 commits February 4, 2015 14:18
…ays using least idaction and deleting duplicates when they are found to be inserted. Since tracker process can potentially fail before duplicates are removed, added test to make sure reports work when duplicate actions exist in the DB.
…group bys in Tracker/Model.php and TableLogAction.
@diosmosis
Copy link
Member Author

Note: I removed some private methods, but I don't think the SQL is actually useful enough to move to another class (in a plugin or in core). The queries could only be used for detecting and fixing duplicate actions.

@tsteur
Copy link
Member

tsteur commented Feb 4, 2015

It's not only whether it can be reused by core or plugins. It is also that the SQL statements just rather belong to a "model" and it makes them better testable. Also you never know whether it can be maybe reused in the future. If anyone has a use case he/she would probably not find it in that class but maybe in the model... Eg those methods could be maybe useful for other cases in the future one never knows: https://github.com/piwik/piwik/pull/7112/files#diff-23c6df91cba3becc700f91001c4e39bbR203 or https://github.com/piwik/piwik/pull/7112/files#diff-23c6df91cba3becc700f91001c4e39bbR172 . At least the SQL could be extracted in another class in that plugin. Doesn't have to be core Tracker\Model I agree. This will also make the code more readable since the queries are only a detail and there would be maybe less private methods ;) There is still a lot of code in private methods which indicates that a lot of that code actually belongs somewhere else (it "indicates" it only, it doesn't mean this is always actually the case. it's up to you). It's maybe not so important here.

Can you maybe use bind params here? https://github.com/piwik/piwik/pull/7112/files#diff-23c6df91cba3becc700f91001c4e39bbR179 and here https://github.com/piwik/piwik/pull/7112/files#diff-23c6df91cba3becc700f91001c4e39bbR268 or maybe convert Ids to integer? I know we only get numbers from the database but it will prevent any possible injections in the future in case someone wants to reuse it somewhere else or so and it doesn't hurt.

diosmosis added 6 commits February 6, 2015 09:52
…ed DuplicateActionRemover into a DAO that only deals w/ duplicate actions, created two new DAO classes (one for log_action manipulation and one for getting relational table metadata), deprecated SHOW COLUMNS method in Db in favor of new table metadata DAO, and added & fixed tests.
@diosmosis
Copy link
Member Author

@tsteur @mnapoli I've performed a pretty big refactor on this branch, I think it merits another review. Let me know your thoughts.

$result[$table] = $columns;
}
return $result;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the $tablesWithIdActionColumns is hardcoded, couldn't it be the same for the columns? Or can they be dynamically added by plugins maybe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They can be added via dimension classes (eg, ContentName). I don't know if there are any non-core ones that add idaction columns, but many core ones do so I guess it's conceivable that they exist. Also, I guess it's possible for some plugins to be disabled in which case the columns wouldn't exist.

if ($logger === null) {
$logger = StaticContainer::get('Psr\Log\LoggerInterface');
}
$this->logger = $logger;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you prefer those if statements could be eg:

$this->archiveInvalidator = $invalidator ?: new ArchiveInvalidator();

You could also get all those dependencies from the container so that you can use dependency injection without having null in constructors, e.g.:

class DuplicateActionRemover
{
    // ...
    public function __construct(TableMetadata $tableMetadataAccess, LoggerInterface $logger)

instead of:

    public function __construct($tableMetadataAccess = null, $logger = null)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also get all those dependencies from the container so that you can use dependency injection without having null in constructors

Would I have to access StaticContainer directly? Maybe it's possible to do it once for the console command? That would be nice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that would definitely be nice!

The problem lies with how Symfony's Console work: we need to register command objects, so that means creating every single command just to init the console (see symfony/symfony#12063).

So if we start doing dependency injection in commands, that means every dependency will be created and injected, with dependencies of dependencies, etc... So 1 or 2 command isn't a big deal but we risk running into a case where a huge load of objects (maybe most of the application) are created by the container just to end up using a few of them…

PHP-DI offers to do lazy injection but:

  1. we can't use that in Piwik since it requires a lib that requires PHP > 5.3.20 or something
  2. it would mean marking most services as lazy, which doesn't make any sense
  3. this is not a clean solution anyway…

So in the meantime we are stuck with doing service location instead of dependency injection in commands…

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way we could do something a tiny bit cleaner by introducing a get() method on commands that would proxy that to the container, e.g.:

$logger = $this->get('Psr\Log\LoggerInterface');

(like in Symfony's controllers)

That's still service location, it's just a few characters shorter (and no static calls to StaticContainer)… I don't have a strong opinion on this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mnapoli Is it possible to inject dependencies in-place? Ie, after an object has been created? We could maybe create commands w/o setting dependencies and then before executing a command, we resolve the dependencies and call Command::execute. The dependencies would have to be specified via the @Inject annotation I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's possible ($container->injectOn($obj)) but we decided not to use annotations last time it was discussed.

Another option would be to keep commands as simple as possible and move everything into classes, though that's very far from the current reality in the codebase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a proxy command class can be used in core\Console.php and the actual command instance wouldn't be created until the proxy's execute() method is called?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commands are instantiated because Symfony's Console needs to read the options/configuration. Even if we had a proxy commands would still be instantiated, unless we change the way commands are configured (i.e. we diverge from Symfony's official way) which would be fine by me (given this problem is pretty important)…

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about it that sounds a good idea, I think I'll have a look at it (having DI only in controllers is really limiting…)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be awesome if it works :)

@mattab mattab added this to the Piwik 2.11.0 milestone Feb 9, 2015
@mattab mattab added the Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical. label Feb 9, 2015
@mattab mattab changed the title Fixes #6436, refacor new action insertion code so duplicate actions will not exist + provide means to fix duplicates Refactor new action insertion code so duplicate actions will not exist + provide means to fix duplicates Feb 9, 2015
@mattab
Copy link
Member

mattab commented Feb 9, 2015

Looks good to me!

ready to merge?

mattab pushed a commit that referenced this pull request Feb 9, 2015
Refactor new action insertion code so duplicate actions will not exist + provide means to fix duplicates
@mattab mattab merged commit 369e939 into master Feb 9, 2015
@mattab
Copy link
Member

mattab commented Feb 9, 2015

Well done @diosmosis

It's a real relief that this bug will be fixed! 👍

@diosmosis diosmosis deleted the 6436_action_dupes branch February 9, 2015 21:20
diosmosis pushed a commit that referenced this pull request Feb 9, 2015
diosmosis pushed a commit that referenced this pull request Feb 9, 2015
diosmosis pushed a commit that referenced this pull request Feb 13, 2015
…plicates in DB when doing segment query foraction since group by in sub-query is slow.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants