@bx80 opened this Pull Request on November 17th 2021 Contributor

Description:

Fixes #17938 (partial)

Catches and logs exceptions that occur when running scheduled tasks

Review

@bx80 commented on November 17th 2021 Contributor

This PR currently just addresses problem 3 of the three problems listed in #17938 by logging any exceptions.

@tsteur Do we want to just add this logging for now to help determine priority or look for a solution for rescheduling/retrying failed tasks?

As previously noted in the issue, it seems like there would need to be some indication passed from the failed task as to whether to reschedule or not (eg. if an email has been sent then don't reschedule again) and perhaps additionally some sort of retry counter option to avoid getting stuck in a pointless reschedule/fail loop. (reschedule x times, then give up)

So for example:
GeoIP download, no diskspace. Task indicates that it can be rescheduled, task is run three times but disk is still full. Done.
Notification task, error after sending some notification. Task indicates that it should not be rescheduled. Done.
Report fails because of cache clear. Task indicates that it can be rescheduled, second run works. Done.

Perhaps we could start by adding a mechanism to return a reschedule=true/false indication when tasks encounter an exception and then have the scheduler exception handler use this to reschedule if required. Initially we could just have the email reports task check for exceptions that are worth retrying (eg. the 'No such file or directory' error) and then pass back reschedule=true. If this works well then other scheduled tasks could also use the same mechanism as needed.

What do you think?

@tsteur commented on November 17th 2021 Member

Perhaps we could start by adding a mechanism to return a reschedule=true/false indication when tasks encounter an exception and then have the scheduler exception handler use this to reschedule if required.

@bx80 I think that makes sense and maybe it's not too complicated to implement? Then we could decide on a per task level whether we want to reschedule or not if it fails.

@bx80 commented on November 19th 2021 Contributor

I've added a fairly simple rescheduling mechanism, for it to work the task must catch an exception and re-throw it with the error code 1. The scheduler execute task exception handler will check if the exception has error code 1 and if so will set a flag to reschedule the task for a retry. The task will then be rescheduled for one hour into the future by the main schedule loop.

To prevent infinite retry loops a new option 'TaskScheduler.retryList' has been added, it is a simple list of task names and the number of times they have been scheduled for a retry, when this count reaches 3 then the failing task will no longer be rescheduled one hour into the future and should fall back to it's normal next scheduled time.

The scheduled reports 'sendReport' method has been adjusted to catch any exceptions thrown on the generateReport() method and re-throw with error code = 1. So failures to generate the report should now be retried, but failures in the send part of the process will not be retried.

I'm not sure if a one hour bump and three retries are the best values to use, happy to change them to whatever makes the most sense.

@github-actions[bot] commented on November 27th 2021 Contributor

This issue is in "needs review" but there has been no activity for 7 days. ping @matomo-org/core-reviewers

@github-actions[bot] commented on January 8th 2022 Contributor

This PR was last updated more than one month ago, maybe it's time to close it. Please check if there is anything we still can do or close this PR. ping @matomo-org/core-reviewers

@bx80 commented on January 25th 2022 Contributor

I've added a new integration test for the various timetable retry methods and improved the option loading/reloading code as suggested.

There is a new paragraph submitted for the scheduled task developer guide https://github.com/matomo-org/developer-documentation/pull/605 and a new issue added to implement this in the CustomAlerts plugin.

@bx80 commented on January 27th 2022 Contributor

All tests are fixed, this should now be ready for a final review.

@sgiehl commented on January 31st 2022 Member

@bx80 you could also try adding some integration tests for the retry mechanism. Like creating some kind of test task that does nothing but throwing a retry exception and test that the system behaves correctly with rescheduling and aborting if retry limit was reached. Otherwise this PR already looks fine.

This Pull Request was closed on February 1st 2022
Powered by GitHub Issue Mirror