Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle scheduled task failures #18335

Merged
merged 10 commits into from Feb 1, 2022
Merged

Conversation

bx80
Copy link
Contributor

@bx80 bx80 commented Nov 17, 2021

Description:

Fixes #17938 (partial)

Catches and logs exceptions that occur when running scheduled tasks

Review

@bx80 bx80 added the Easier debugging For issues that make troubleshooting issues for developers easier. label Nov 17, 2021
@bx80 bx80 added this to the 4.7.0 milestone Nov 17, 2021
@bx80 bx80 self-assigned this Nov 17, 2021
@bx80
Copy link
Contributor Author

bx80 commented Nov 17, 2021

This PR currently just addresses problem 3 of the three problems listed in #17938 by logging any exceptions.

@tsteur Do we want to just add this logging for now to help determine priority or look for a solution for rescheduling/retrying failed tasks?

As previously noted in the issue, it seems like there would need to be some indication passed from the failed task as to whether to reschedule or not (eg. if an email has been sent then don't reschedule again) and perhaps additionally some sort of retry counter option to avoid getting stuck in a pointless reschedule/fail loop. (reschedule x times, then give up)

So for example:
GeoIP download, no diskspace. Task indicates that it can be rescheduled, task is run three times but disk is still full. Done.
Notification task, error after sending some notification. Task indicates that it should not be rescheduled. Done.
Report fails because of cache clear. Task indicates that it can be rescheduled, second run works. Done.

Perhaps we could start by adding a mechanism to return a reschedule=true/false indication when tasks encounter an exception and then have the scheduler exception handler use this to reschedule if required. Initially we could just have the email reports task check for exceptions that are worth retrying (eg. the 'No such file or directory' error) and then pass back reschedule=true. If this works well then other scheduled tasks could also use the same mechanism as needed.

What do you think?

@tsteur
Copy link
Member

tsteur commented Nov 17, 2021

Perhaps we could start by adding a mechanism to return a reschedule=true/false indication when tasks encounter an exception and then have the scheduler exception handler use this to reschedule if required.

@bx80 I think that makes sense and maybe it's not too complicated to implement? Then we could decide on a per task level whether we want to reschedule or not if it fails.

@bx80
Copy link
Contributor Author

bx80 commented Nov 19, 2021

I've added a fairly simple rescheduling mechanism, for it to work the task must catch an exception and re-throw it with the error code 1. The scheduler execute task exception handler will check if the exception has error code 1 and if so will set a flag to reschedule the task for a retry. The task will then be rescheduled for one hour into the future by the main schedule loop.

To prevent infinite retry loops a new option 'TaskScheduler.retryList' has been added, it is a simple list of task names and the number of times they have been scheduled for a retry, when this count reaches 3 then the failing task will no longer be rescheduled one hour into the future and should fall back to it's normal next scheduled time.

The scheduled reports 'sendReport' method has been adjusted to catch any exceptions thrown on the generateReport() method and re-throw with error code = 1. So failures to generate the report should now be retried, but failures in the send part of the process will not be retried.

I'm not sure if a one hour bump and three retries are the best values to use, happy to change them to whatever makes the most sense.

@bx80 bx80 added Needs Review PRs that need a code review not-in-changelog For issues or pull requests that should not be included in our release changelog on matomo.org. labels Nov 19, 2021
@github-actions
Copy link
Contributor

This issue is in "needs review" but there has been no activity for 7 days. ping @matomo-org/core-reviewers

@github-actions github-actions bot added the Stale The label used by the Close Stale Issues action label Nov 27, 2021
@github-actions
Copy link
Contributor

github-actions bot commented Jan 8, 2022

This PR was last updated more than one month ago, maybe it's time to close it. Please check if there is anything we still can do or close this PR. ping @matomo-org/core-reviewers

@github-actions github-actions bot added the Stale for long The label used by the Close Stale Issues action label Jan 8, 2022
@bx80 bx80 removed Needs Review PRs that need a code review Stale The label used by the Close Stale Issues action Stale for long The label used by the Close Stale Issues action labels Jan 11, 2022
@sgiehl sgiehl modified the milestones: 4.7.0, 4.8.0 Jan 18, 2022
Copy link
Member

@sgiehl sgiehl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments.

@tsteur would you mind having a quick look if that solution fits. I think it should work fine, but we would need to go through some our tasks afterwards and check if we would need to throw the RetryableException somewhere else as well.

core/Scheduler/Scheduler.php Outdated Show resolved Hide resolved
plugins/ScheduledReports/API.php Show resolved Hide resolved
@sgiehl sgiehl requested a review from tsteur January 19, 2022 16:23
@bx80 bx80 added the Needs Review PRs that need a code review label Jan 19, 2022
Copy link
Member

@tsteur tsteur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't tested it but looked through the code and left a few comments.

Also few things here:

core/Scheduler/Scheduler.php Show resolved Hide resolved
core/Scheduler/Scheduler.php Show resolved Hide resolved
core/Scheduler/Timetable.php Show resolved Hide resolved
core/Scheduler/Timetable.php Show resolved Hide resolved
core/Scheduler/Timetable.php Outdated Show resolved Hide resolved
@bx80
Copy link
Contributor Author

bx80 commented Jan 25, 2022

I've added a new integration test for the various timetable retry methods and improved the option loading/reloading code as suggested.

There is a new paragraph submitted for the scheduled task developer guide matomo-org/developer-documentation#605 and a new issue added to implement this in the CustomAlerts plugin.

@bx80
Copy link
Contributor Author

bx80 commented Jan 27, 2022

All tests are fixed, this should now be ready for a final review.

@sgiehl
Copy link
Member

sgiehl commented Jan 31, 2022

@bx80 you could also try adding some integration tests for the retry mechanism. Like creating some kind of test task that does nothing but throwing a retry exception and test that the system behaves correctly with rescheduling and aborting if retry limit was reached. Otherwise this PR already looks fine.

@bx80 bx80 merged commit f20c4cb into 4.x-dev Feb 1, 2022
@bx80 bx80 deleted the m-17938-scheduled-task-exceptions branch February 1, 2022 23:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Easier debugging For issues that make troubleshooting issues for developers easier. Needs Review PRs that need a code review not-in-changelog For issues or pull requests that should not be included in our release changelog on matomo.org.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Some scheduled tasks might be skipped on error - meaning eg a scheduled report might not be sent
3 participants