@tsteur opened this Issue on March 14th 2022 Member

Getting recently below error regularly. I assume this started with https://github.com/matomo-org/matomo/pull/18797

[2022-03-13 19:01:13] piwik.ERROR: Uncaught exception in API: PDOException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '127824106' for key 'PRIMARY' in libs/Zend/Db/Statement/Pdo.php:233 Stack trace: #0 libs/Zend/Db/Statement/Pdo.php(233): PDOStatement->execute() #1 libs/Zend/Db/Statement.php(300): Zend_Db_Statement_Pdo->_execute(Array) #2 libs/Zend/Db/Adapter/Abstract.php(479): Zend_Db_Statement->execute(Array) #3 libs/Zend/Db/Adapter/Pdo/Abstract.php(238): Zend_Db_Adapter_Abstract->query('INSERT INTO log...', Array) #4 core/Db/Adapter/Pdo/Mysql.php(314): Zend_Db_Adapter_Pdo_Abstract->query('INSERT INTO log...', Array) #5 core/DataAccess/LogAggregator.php(325): Piwik\Db\Adapter\Pdo\Mysql->query('INSERT INTO log...', Array) #6 core/DataAccess/LogAggregator.php(351): Piwik\DataAccess\LogAggregator->createTemporaryTable('logtmpsegmented...', 'SELECT
/* trigg...', Array) #7 core/DataAccess/LogAggregator.php(561): Piwik\DataAccess\LogAggregator->generateQuery('count(distinct ...', Array, 'log_visit.visit...', '', '') #8 core/ArchiveProcessor.php(546): Piwik\DataAccess\LogAggregator->queryVisitsByDimension(Array, 'log_visit.visit...', Array, Array) #9

Goal of this issue to avoid these problems. The segment query already seems to do a distinct so maybe we execute the query twice when the data actually already exists and there is no need to execute it again?

@tsteur commented on March 15th 2022 Member

In case it helps, it seems to mostly happen with high traffic instances. It could be though that it's unrelated to high traffic and maybe they use certain features or so

@mikkeschiren commented on March 30th 2022 Contributor

The site we have the issue on would be regarded as a high traffic site. For low traffic instances, we don't see the issue at the moment.

@bx80 commented on March 31st 2022 Contributor

I think this is going to be hard to track down without a full segment query. The exception was thrown by code inside the createTemporaryTable method just after a check to make sure the table doesn't already exist, so that would seem to rule out the query being executed twice?

Since it's likely that prior to the temporary index being added in #18797 duplicate records were being inserted, maybe it would be an acceptable workaround to use INSERT IGNORE when inserting the records into the temporary table?

@sgiehl commented on April 5th 2022 Member

@tsteur I guess the error happens during archiving, right? Are you able to look up which segment is archived when it fails?

@tsteur commented on April 5th 2022 Member

@sgiehl I'm seeing in one of them around visitorType and there are some nested calls which I can see in the full backtrace.
fyi we're seeing this error around 20+ times a day. I'll ping you in chat re full details.

@sgiehl commented on April 12th 2022 Member

@tsteur I'm not able to reproduce that in any way. Which makes it hard to find the problem.
Based on the error messages it seems it happens unrelated to the type of segment, period or date that is archived.
I guess the error could basically have two reasons:

  • The segment query returns an id twice, which should be unlikely due to the DISTINCT select. We could maybe try if using a GROUP BY would solve the issue
  • The method tries to insert the data "twice". That would mean the checks if the table already exists won't work properly for some reason. But looking at the code that shouldn't happen.
@tsteur commented on April 12th 2022 Member

@sgiehl what would be the next steps to resolve this issue and make progress?

If the distinct was the problem, I would assume then it be maybe more reproducible but maybe not.

@sgiehl commented on April 12th 2022 Member

That's a good question. Simply using a Insert Ignore would for sure solve the issue. But it might only hide the origin problem.

Not sure if cloud team could maybe try to debug that a bit more to see if it is the query that contains duplicates or if the table already exists before, but records are inserted again nevertheless.

@tsteur commented on April 12th 2022 Member

I had a quick search for "when does mysql distinct not return distinct values" and "mysql distinct vs group by" as we can pretty much rule out the table exists check not working.

Finding few posts around like https://www.experts-exchange.com/articles/12282/Select-Distinct-is-returning-duplicates.html where it does mention the group by you mentioned earlier as well. Although maybe it's more when other columns are included. It's not quite clear.
Also it might hard to say if a group by was faster or slower (seems to depend on indexes etc, because we sort anyway it might not be a big performance difference). Otherwise potentially the ignore could be a solution.

Can you get in touch with the cloud team so they know what to log?

@sgiehl commented on April 25th 2022 Member

@tsteur Pushed some code that could maybe help to identify the queries that produce duplicate records:
I haven't tested the changes, and also I'm not sure if logging an error might still let the archiving fail. So maybe we should log an info instead 🤔 What do you think?

@tsteur commented on April 26th 2022 Member

Be good to test it if the log

  • makes the archiving fail if using cron archiving
  • if the error could show up in the UI when this happens assuming browser archiving is enabled
  • If there's anything printed, the echoed JSON may not be able to be read

If only the cloud team applies this patch, then the cloud team can use a special logger though that won't cause any issues and only logs to file.

Is the purpose of the second query to check if the insert works when using INSERT IGNORE? It might be good to still throw the exception in that case just to make sure we won't ignore this error as it otherwise could cause performance issues from running the same query twice (these are often very slow).

@sgiehl commented on April 28th 2022 Member

@tsteur I would suggest we only let the cloud team apply the patch, so we can gather the queries that maybe produce the problem. I've added the INSERT IGNORE so the archiving won't fail any longer. But we can also leave that away in a first step, so we only log the queries.

@tsteur commented on April 28th 2022 Member

@JasonMortonNZ @samjf could we log the query mentioned in https://github.com/matomo-org/matomo/compare/m18797?expand=1#diff-4140e491c64a9dcbf541f11e1db846adc3c27a4000810494ba8688e5feeee776R332 using our cloud file logger whenever the error occurs?

Because the temp table insert eventually works at some point when it tries to archive again, we would need to re-run the select part of that query fairly soon after this error happens at some point and see if we get duplicate data.

@tsteur commented on June 9th 2022 Member

We've been logging the queries that produced this error and then executed the select part. It's not clear why they would produce any duplicate visitIDs. Often they don't even have any joins. Also created temporary tables and tested the identical insert query but no error either.

Example query:

SELECT /* trigger = CronArchive, idSegments = [90] */
                distinct log_visit.idvisit as idvisit
                log_visit AS log_visit
                ( log_visit.visit_last_action_time >= "2022-06-05 22:00:00"
                AND log_visit.visit_last_action_time <= "2022-06-12 21:59:59"
                AND log_visit.idsite IN (1) )
                ( log_visit.config_device_type = 0 )
                group by idvisit having count(*) > 1
            ORDER BY
                log_visit.idvisit ASC

I've executed a few different queries and never received any duplicate visit ID (when testing the select) or error (when testing the insert).

This was tested from a MySQL client directly, not through application code.

What else could we log or do to resolve this issue @sgiehl ?

cc @JasonMortonNZ @samjf

@tsteur commented on June 9th 2022 Member

Verified that the table is empty when this error happens, then the insert is executed and the duplicate record error appears.

@tsteur commented on June 9th 2022 Member

FYI I've also applied below patch but the error seems to still happen meaning it doesn't look like the read uncommitted is causing this.

diff --git a/core/DataAccess/LogAggregator.php b/core/DataAccess/LogAggregator.php
index cada4b1571..eb2a94611b 100644
--- a/core/DataAccess/LogAggregator.php
+++ b/core/DataAccess/LogAggregator.php
@@ -305,7 +305,7 @@ class LogAggregator
             // set uncommitted is easily noticeable in the code as it could be missed quite easily otherwise
             // we set uncommitted so we don't make the INSERT INTO... SELECT... locking ... we do not want to lock
             // eg the visits table
-            if (!$transactionLevel->setUncommitted()) {
+            if (0 && !$transactionLevel->setUncommitted()) {
                 $canSetTransactionLevel = false;
@tsteur commented on June 9th 2022 Member

Is there anything else we should log or try?

@EreMaijala commented on June 9th 2022 Contributor

@tsteur This may be a silly question since I'm not sure what's actually being done here, but is it by any change possible that this happens when there are concurrent archivers and they happen to write same id's to the same temporary table at the same time?

@tsteur commented on June 10th 2022 Member

@EreMaijala I was wondering this as well but it shouldn't be unless the DB connection was shared.

Temporary tables only exist for the DB connection and aren't visible to other DB connections. As soon as the DB connection is closed the temporary table is also removed.

The archiving basically works like this:

  • the core:archive script is running and launches sub-commands for each archive that needs to be processed
    • it launches a sub-command (eg using shell_exec in php) to archive a report. These sub-commands could technically share the DB connection from the parent but it's not the case from what I can see on our platform.
    • it may launch multiple sub-commands in parallel (eg using shell_exec in php) to trigger the archiving of reports. This is not the case though on our platform as we always only launch one command. Meaning we can be sure there's no DB connection reuse happening
  • another core:archive script may be running in parallel. This has guaranteed a new DB connection and wouldn't cause a problem
    • it launches another command (eg using shell_exec in php) to trigger the archiving
    • it may launch multiple commands in parallel (eg using shell_exec in php) to trigger the archiving

Sorry it's probably not clear what I'm describing above.

There are cases where sub-processes launched from one PHP command can share the DB connection from the parent command. From testing this is not the case though on our platform and we're also not launching multiple commands in parallel so this shouldn't be the case here.

I believe we can rule out that parallel commands are trying to write into the same table.

See also https://dba.stackexchange.com/a/22103

@justinvelluppillai commented on August 9th 2022 Member

@tsteur is there anything we can do here yet to move this one along or should we move it out of the milestone?

@tsteur commented on August 9th 2022 Member

@justinvelluppillai that would be up to the core team but we definitely need to resolve this issue as it affects quite a few users and breaks the archiving.

@bx80 commented on August 10th 2022 Contributor

Would it be worth writing a test script to build a suitable random dataset and then repeating the select / temporary table creation until the error is encountered? If we can automate recreating the issue reliably, even if only once every x attempts, then we could narrow down the possible causes.

@tsteur commented on August 10th 2022 Member

I did run these queries many times after we encountered the error on production and couldn't reproduce it. It was using the very same query just minutes later and running it many times. It may be really hard to reproduce. Since we know we don't execute the query twice we could potentially also think about simply adding an ignore to the insert and the problem be gone? Just a random thought.

@bx80 commented on August 10th 2022 Contributor

That seems like a pragmatic solution to me :+1: We've established that it only happens occasionally, it's hard to reproduce, we're not running the query multiple times and since the example queries are returning a simple list of primary keys then ignoring a duplicate insert value won't affect the intended outcome.

@EreMaijala commented on August 10th 2022 Contributor

While I appreciate the pragmatic approach and hope for a quick resolution to be able to upgrade Matomo, I'm still a bit worried that the root cause is unknown. Not that I have any better idea, so maybe ignore is the best solution anyway.

@SteVio89 commented on August 10th 2022

We also had the problem on our instance.

Additionally, we had another issue related to MariaDB, similar to #18864.

By adding enable_segments_cache = 0 to the config file, we were able to resolve both issues.

It may not be the best solution for the issue, but it may be a temporary solution until a fix is found.

@tsteur commented on August 10th 2022 Member

enable_segments_cache = 0

When this is set, the query will be no longer executed but it also makes archiving the reports in the background slower.

While I appreciate the pragmatic approach and hope for a quick resolution to be able to upgrade Matomo, I'm still a bit worried that the root cause is unknown

I would be too. However, because we ruled out that the same query might be executed twice and is writing data twice into it I'm not as worried. It's neither executed in parallel by another process nor after another. This would have been my biggest worry.

@justinvelluppillai commented on August 23rd 2022 Member

Let's go with the INSERT IGNORE for this and progress the issue.

This Issue was closed on August 24th 2022
Powered by GitHub Issue Mirror