@tsteur opened this Issue on August 30th 2021 Member

refs L3-141

Problem 1 - If task fails, it won't be executed / retried again.

Generally I can see reschedules the task here before executing it matomo/Scheduler.php meaning if eg a scheduled report fails to send the report (or execute any task) it won’t be executed again. Meaning if it's a weekly scheduled report that week no report will be sent. Same applies to any other task that if there is an exception they won't be completed until they run again according to their schedule. They won't be retried the next time the scheduler runs as they were already rescheduled according to their schedule.

Problem 2 - Schedule reports specific

Maybe at the same time a cache:clear happened while we were trying to send the report? I’m thinking the execution stopped here matomo/API.php.

Error logs are for example

[2021-08-16 08:39:05] piwik.WARNING: /vendor/tecnickcom/tcpdf/tcpdf.php(7370): Warning - imagepng(/tmp/tcpdf/__tcpdf_12c3eb6b7d6f9bd4a51494b31b58fb98_imgmask_plain_e40746f50d0211c25b80d87ea0473683): failed to open stream: No such file or directory - Matomo 4.4.1 -

[2021-08-16 08:39:06] piwik.WARNING: /vendor/tecnickcom/tcpdf/include/tcpdf_static.php(296): Notice - tempnam(): file created in the system's temporary directory - Matomo 4.4.1 -

[2021-08-16 08:39:08] piwik.WARNING: /vendor/tecnickcom/tcpdf/include/tcpdf_static.php(1825): Warning - fopen(file:///tmp/assets/PDF Email Report - 3.2021-08-09.2.fr - rJ4AgwmQKD_bZwrLMtR7kvAkfZCFr8Vo9092Ud3c.pdf): failed to open stream: No such file or directory - Matomo 4.4.1 -

[2021-08-16 08:39:05] piwik.WARNING: /vendor/tecnickcom/tcpdf/tcpdf.php(7365): Warning - imagepng(/tmp/tcpdf/__tcpdf_12c3eb6b7d6f9bd4a51494b31b58fb98_imgmask_alpha_e40706f40d5211c25b80d87ea0473683): failed to open stream: No such file or directory - Matomo 4.4.1

Problem 3 - Exceptions are never logged or not visible

It seems we are catching such exceptions but never maybe throw them again or log them? See matomo/Scheduler.php

Some thoughts

I don't think we need to fix all the problems I suppose if we fix problem 1 then won't need to fix problem 2 and the other way around.

We could probably ignore problem 2 if we fix problem 1 as it means we would retry the task next time the task runner runs. Not sure what potential problems this could cause especially when multiple task runners run at the same time. There could be small edge cases where some reschedule time gets lost but could be likely ignored. It likely would be fine if we were to set the original time again for this task (as it was before rescheduling the task). However, depending on where a task failed this may cause issues: For example let’s say the error happens while sending custom alerts after sending an email. If then an exception happens we would next time retry the same custom alert maybe and send the same email again. It’s like a task would need to define on which exception a task should be retried and when not. For example if there is an error in writing the Geo IP to disk we would run this command on every archiving run and download the DB again and again etc. In any case if there are exceptions during a failed task then we would want to be notified. For example custom alerts currently sends all alerts as part of one task. If one of them plays up then all the other alerts will be resend hourly and cause people to be paged/alerted at night (happened in the past for users). We would maybe need to refactor the task to have one task per alert similar to scheduled reports.

Problem 3 would need to be solved for better debugging I suppose. Meaning logging information so it can be fixed. Or we could let the entire archive run fail so we get an email meaning we stop executing any exceptions after that and stop executing things. This can have downsides though that important tasks after this task wouldn’t be executed. Another one could be to rethrow all thrown exceptions after executing all the tasks. There could be downsides to this too.

We could also decide to fix problem 2 and keep the existing behaviour in scheduled tasks and instead make sure that the tasks itself make sure of retrying things when there is an exception. This might be easiest way to fix things without side effects but requires maybe to pay more attention in each task. Like ScheduledReports could schedule the tasks smarter depending on when they were sent last based on report.ts_last_sent column. We would also need to check other tasks if we would need to change them too.

Maybe we could fix problem 3 first to see how often this happens and then assess priority. If we were to receive emails on cloud when this happens then we'd get probably a good understanding how often this possibly happens. Nonetheless it needs to be fixed eventually either way as it does seem to happen and people rely on the reports.

Any change here might also affect our cloud/account tasks so we would need to make sure things don't cause side effects there too.

Powered by GitHub Issue Mirror