@Littlericket opened this Issue on January 28th 2021

Hi,

we've recently updated to 4.1.1 (from 3.x) and we're unable to view archived data for 2018, even though the archive data is present (including calculated nb_visits in archive_numeric). It seems like the archived data is still there but somehow not visible in the UI, and we can't explain why this happens. 2019 is working fine.

We also can't just invalidate the archive tables because raw logs are being removed after 365 days.

archive_numeric (2018_12; the query isn't exactly right, it's only meant to show that the data should be there):
[screenshot]

UI:
[screenshot]

Any clue?

@tsteur commented on January 28th 2021 Member

@Littlericket it's hard to say without access to the database. Have you configured report deletion as well, besides raw data deletion after 365 days? AFAIK the data would then be removed, so I'm assuming that's not the case. I can't think of how this would be related to the Matomo 4 upgrade, so it might be caused by something else, but you never know (maybe someone invalidated reports, etc.).

We could maybe have a quick look if there's a chance for us to get access to the database.

For these archives that you're showing there, can you select the values for the related "done" flags? Basically, this is what you'd want to check for each idarchive:

select * from matomo_archive_numeric_2020_09 where idarchive IN (33,40,59) and name like "%done%"

You'd need to replace the table name and the idarchive values with the idarchives you selected above.
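For example, if your table prefix is matomo_ and the idarchives you saw in the 2018_12 table were 33 and 40 (placeholder values here; use the ones from your own result set), the adapted query would look something like:

select * from matomo_archive_numeric_2018_12 where idarchive IN (33, 40) and name like "%done%" -- placeholder idarchives, substitute your own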

If you're familiar with MySQL, you could also check whether there are any duplicates for the same dates. It would look like this:

set sql_mode='';

select count(*), idsite, date1, period, value, `name`, ts_archived from matomo_archive_numeric_2020_09 where period = 3 and date1 = '2020-09-01' and name like "%done%" group by idsite, name, date1, period, value

You would need to replace the table name and the value for date1. If you could post the output for this, that would be great.
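If you only want to see the rows that actually are duplicated, one option (assuming the same table name and date1 substitution as above) is to add a HAVING clause:

select count(*), idsite, date1, period, value, `name`, ts_archived from matomo_archive_numeric_2020_09 where period = 3 and date1 = '2020-09-01' and name like "%done%" group by idsite, name, date1, period, value having count(*) > 1 -- keeps only rows that occur more than once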

@Littlericket commented on January 29th 2021

Hi @tsteur,

thanks for the quick reply! Unfortunately I cannot give access to the database, but I'll try to provide the results of the queries.
We haven't set up archive deletion; only the raw logs are being removed. We enabled this because the raw log database reached a few hundred gigabytes and we don't have enough RAM to keep all the data sets in memory.

Result of your first query (large result set): https://s.littlericket.me/2b8oaeqhly
Result of the second duplicates query: https://s.littlericket.me/syrfo5qhcv

As far as I can tell there are no rows with count > 1, so no duplicates?

@tsteur commented on January 31st 2021 Member

Thanks @Littlericket. Are you seeing the problem for "all visits" or maybe only for some specific segments?

@Littlericket commented on February 1st 2021

@tsteur this is the case for "all visits" and also all defined segments.

@tsteur commented on February 1st 2021 Member

@Littlericket thanks for this. Can you execute the query

select count(*), idsite, date1, period, value, `name`, ts_archived
from piwik_archive_numeric_2018_12
where period = 3 and date1 = '2018-12-01' and name like "%done%"
group by idsite, name, date1, period, VALUE
ORDER BY 1 desc

but this time only for a specific idsite where you experience the issue? This might help us understand things better, e.g. add AND idsite=1 to the WHERE clause.
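That is, with the filter added, the full query would look like this:

select count(*), idsite, date1, period, value, `name`, ts_archived
from piwik_archive_numeric_2018_12
where period = 3 and date1 = '2018-12-01' and name like "%done%" AND idsite = 1
group by idsite, name, date1, period, VALUE
ORDER BY 1 desc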

@Littlericket commented on February 2nd 2021

@tsteur sure! Limited to idsite=1
Result set is at https://s.littlericket.me/lu8tryt918

@tsteur commented on February 2nd 2021 Member

Thanks @Littlericket. This looks good to me, e.g. I see a done archive for that idSite & month, meaning "all visits" should show data. I'm assuming you're seeing data for December 2018?

Could you also query the data for the year like this?

select count(*), idsite, date1, period, value, `name`, ts_archived
from piwik_archive_numeric_2018_01
where period = 4 and date1 = '2018-01-01' and name like "%done%" AND idsite=1
group by idsite, name, date1, period, VALUE
ORDER BY 1 desc

@Littlericket commented on February 3rd 2021

Hi @tsteur, thanks again for the support!
There is no visible data in the UI when viewing December:
[screenshot]

Here's the result set of the query you gave: https://s.littlericket.me/syminfdbdr
And "almost" the same query (without the value column, since it's binary) for the blob archive: https://s.littlericket.me/ml80njslta

@tsteur commented on February 4th 2021 Member

@diosmosis do you maybe have any idea?

@Littlericket I know you already said it's not possible to give us access to the DB. If there's otherwise a chance you could send us a dump of an entire archive table, that would be great. We would likely not need any other data.

@diosmosis commented on February 4th 2021 Member

@Littlericket some ideas to help diagnose:

  • could you try loading an API request for the data like: index.php?module=API&method=VisitsSummary.get&period=day&date=2018-01-01&idSite=1&token_auth=...

  • if that doesn't return data, could you add the following script to the root of your Matomo install and run it from the command line (then remove it)? (You'll have to set the username/password in the script below, preferably to a view user.)
<?php
use Piwik\ErrorHandler;
use Piwik\ExceptionHandler;
use Piwik\FrontController;

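// Bootstrap Matomo (document root, error handlers, safe mode) before running the checks below.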
if (!defined('PIWIK_DOCUMENT_ROOT')) {
    define('PIWIK_DOCUMENT_ROOT', dirname(__FILE__) == '/' ? '' : dirname(__FILE__));
}
if (file_exists(PIWIK_DOCUMENT_ROOT . '/bootstrap.php')) {
    require_once PIWIK_DOCUMENT_ROOT . '/bootstrap.php';
}
if (!defined('PIWIK_INCLUDE_PATH')) {
    define('PIWIK_INCLUDE_PATH', PIWIK_DOCUMENT_ROOT);
}

require_once PIWIK_INCLUDE_PATH . '/core/bootstrap.php';

if (!defined('PIWIK_PRINT_ERROR_BACKTRACE')) {
    define('PIWIK_PRINT_ERROR_BACKTRACE', false);
}

if (!defined('PIWIK_ENABLE_ERROR_HANDLER') || PIWIK_ENABLE_ERROR_HANDLER) {
    ErrorHandler::registerErrorHandler();
    ExceptionHandler::setUp();
}

FrontController::setUpSafeMode();

if (!defined('PIWIK_ENABLE_DISPATCH')) {
    define('PIWIK_ENABLE_DISPATCH', true);
}

$environment = new \Piwik\Application\Environment(null);
$environment->init();

if (!\Piwik\Common::isPhpCliMode()) {
    exit;
}

$controller = FrontController::getInstance();
$controller->init();

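// Adjust these values for the site and date being tested; use credentials for a view user.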
$idSite = 1;
$date = '2018-12-01';
$username = 'USERNAME';
$password = 'PASSWORD';

$auth = new \Piwik\Plugins\Login\Auth();
$auth->setLogin($username);
$auth->setPassword($password);
\Piwik\Access::getInstance()->reloadAccess($auth);

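// Fetch the VisitsSummary report for that day via the API and print the resulting rows.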
$data = \Piwik\Plugins\VisitsSummary\API::getInstance()->get($idSite, 'day', $date);
print_r($data->getRows());

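// Also look up which idarchives the ArchiveSelector finds for this day and whether they are considered usable.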
$params = new \Piwik\ArchiveProcessor\Parameters(new \Piwik\Site($idSite), \Piwik\Period\Factory::build('day', $date), new \Piwik\Segment('', [$idSite]));
$result = \Piwik\DataAccess\ArchiveSelector::getArchiveIdAndVisits($params);

print "idarchives: " . print_r($result[0], true);
print "visits: " . $result[1] . "\n";
print "usable: " . ($result[3] ? 'true' : 'false') . "\n";

  • also, is browser-triggered archiving enabled for your system?

@Littlericket commented on February 4th 2021

@tsteur I can give you the archive table as a download, but I need to contact you privately instead of posting the dump here.

@diosmosis

  1. The API request returns zeros for all XML keys:

    <result>
    <nb_uniq_visitors>0</nb_uniq_visitors>
    <nb_users>0</nb_users>
    <nb_visits>0</nb_visits>
    <nb_actions>0</nb_actions>
    <nb_visits_converted>0</nb_visits_converted>
    <bounce_count>0</bounce_count>
    <sum_visit_length>0</sum_visit_length>
    <max_actions>0</max_actions>
    <bounce_rate>0 %</bounce_rate>
    <nb_actions_per_visit>0</nb_actions_per_visit>
    <avg_time_on_site>0</avg_time_on_site>
    </result>
  2. 
    $ php 17160.php
    Array
    (
    [0] => Piwik\DataTable\Row Object
        (
            [maxVisitsSummed] => 0
            [metadata:Piwik\DataTable\Row:private] => Array
                (
                )
    
            [isSubtableLoaded:Piwik\DataTable\Row:private] =>
            [subtableId] =>
            [storage:ArrayObject:private] => Array
                (
                    [nb_uniq_visitors] => 0
                    [nb_users] => 0
                    [nb_visits] => 0
                    [nb_actions] => 0
                    [nb_visits_converted] => 0
                    [bounce_count] => 0
                    [sum_visit_length] => 0
                    [max_actions] => 0
                )
    
        )

)
idarchives: Array
(
[0] => 836463
)
visits: 0
usable: true


  3. browser-triggered archiving is disabled:

browser_archiving_disabled_enforce = 1
enable_create_realtime_segments = 0
enable_browser_archiving_triggering = 0

@diosmosis commented on February 5th 2021 Member

@Littlericket can you run the following SQL:

SELECT * FROM piwik_archive_numeric_2018_01 WHERE idarchive = 836463;

If you'd like to email the data, you can email it to dizzy@innocraft.com.

@diosmosis commented on February 10th 2021 Member

@Littlericket looking at the data you sent, it looks like there is just no data in the archives for those dates. The only thing I can think of that might have caused this is deleting old log data and then having the old archives re-archived, which would create new archives with zero visits. Do you have backups with the old archive data?

EDIT: actually that doesn't seem to be the case if the data was present before updating, since the ts_archived is 2019...

@tsteur perhaps we should prevent invalidating or re-archiving archives that have no log visits and are older than the delete_log_data_older_than value? though this might still be a problem if the value is changed.

@Littlericket commented on February 10th 2021

@diosmosis thanks for the investigation! No, we don't have any backup of the old archive data. If there are visits (visible in the screenshots above), can't we just somehow adjust the archive parameters so that the UI displays the archives which do have data?

If I understand correctly, there was data at some point, but it got refreshed by an archive cron run that calculated zero visits after the raw data had been removed?

@tsteur commented on February 10th 2021 Member

@diosmosis I've been looking through the data, and indeed there is only a done flag but no other report data. I've also been looking through the code (e.g. the DataAccess model) and I can't find anything that would delete archives but keep the done flag.
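One way to spot such archives (a sketch against the numeric table only, using the piwik_ prefix from the earlier queries) would be to list the idarchives whose only rows are done flags:

select idarchive, count(*) as total_rows
from piwik_archive_numeric_2018_12
group by idarchive
having count(*) = sum(case when name like "%done%" then 1 else 0 end) -- archives that contain nothing but done flags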

perhaps we should prevent invalidating or re-archiving archives that have no log visits and are older than the delete_log_data_older_than value? though this might still be a problem if the value is changed.

I'm now thinking we might not be able to add this, because a user might want to invalidate/remove data manually? Maybe it could still be done: we could throw an exception when someone tries to invalidate data for such a range and require a force parameter or so. Generally I don't think that's the problem here, though, since Matomo would still keep and show invalidated data until it has been re-archived, which is not the case here.

The done flag is from 2019-09-23 18:07:52, so it couldn't have been created recently.

@Littlericket I suppose no data was deleted manually in the database somehow? Do you know if the data was still there before updating to Matomo 4? Maybe it only appeared to be lost after the Matomo 4 update, and since nobody had looked at the data for a while, it actually went missing before that?

Generally I see a lot of done flags in that table without any data, also for various segments that were archived over 2020. Are you using any custom or third-party Matomo plugins, @Littlericket?

I'm trying to understand what could delete archive data in Matomo but keep the done flag. I can't see anything yet in the code where this happens.

@Littlericket commented on March 23rd 2021

@tsteur very sorry for the late response!
We are not using any third-party plugins besides:

  • InvalidateReports
  • MarketingCampaignsReporting
  • QueuedTracking

None of these have ever caused any problems.
I am not sure whether the data was already "gone" or just not visible before the upgrade, but after the upgrade we checked several things. And of course we have not deleted any data manually.

What we did have about a year ago was an invalidation for all time ranges that crashed due to low memory, but that should not delete any data.
