Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduled tasks: Always read timetable from the database and not from memory #17849

Merged
merged 3 commits into from Aug 2, 2021

Conversation

tsteur
Copy link
Member

@tsteur tsteur commented Aug 2, 2021

… memory

I was reviewing another issue and then saw that we actually read always the cached option entry for the scheduled tasks timetable by the looks. This is executed in https://github.com/matomo-org/matomo/blob/4.4.1/core/Scheduler/Scheduler.php#L105-L115

Because it is normal to have 2 or many more archivers running in parallel it's not uncommon that multiple archivers might execute the task runner at the same time. They would all fetch the timetable (the entries of what scheduled tasks to execute when) and they would all have a different version of it and work on this version constantly. However, because it can take a long time (from seconds up to hours) to execute all tasks, there's a high risk that some tasks may be executed multiple times if we don't always read the timetable from the database. It will cause quite a few additional queries but should reduce some concurrency issues.

Currently, there was already code to always read the DB value again. However, Option::get would always first return a cached result from memory and not fetch the DB again.

Basically this is how it currently looks like:

job 1: load tasks, returns [a,b,c,d]
job 1: work task a
job 2: load tasks, returns [b,c,d]
job 1: work task b 
job 2: work task b
job 1: work task c
job 2: work task c
...

The task runner logic is still far from being thread safe but this should improve it quite a bit.

Consequence of all this is a lot of added load as several tasks may be executed multiple times, potentially some scheduled reports or custom alerts may be sent multiple times (I remember seeing such reports), etc.

Description:

Please include a description of this change and which issue it fixes. If no issue exists yet please include context and what problem it solves.

Review

  • Functional review done
  • Potential edge cases thought about (behavior of the code with strange input, with strange internal state or possible interactions with other Matomo subsystems)
  • Usability review done (is anything maybe unclear or think about anything that would cause people to reach out to support)
  • Security review done see checklist
  • Code review done
  • Tests were added if useful/possible
  • Reviewed for breaking changes
  • Developer changelog updated if needed
  • Documentation added if needed
  • Existing documentation updated if needed

… memory

I was reviewing another issue and then saw that we actually read always the cached option entry for the scheduled tasks timetable by the looks. This is executed in https://github.com/matomo-org/matomo/blob/4.4.1/core/Scheduler/Scheduler.php#L105-L115

Because it is normal to have 2 or many more archivers running in parallel it's not uncommon that multiple archivers might execute the task runner at the same time.  They would all fetch the timetable (the entries of what scheduled tasks to execute when) and they would all have a different version of it and work on this version constantly. However, because it can take a long time (from seconds up to hours) to execute all tasks, there's a high risk that some tasks may be executed multiple times if we don't always read the timetable from the database. It will cause quite a few additional queries but should reduce some concurrency issues.

Currently, there was already code to always read the DB value again. However, `Option::get` would always first return a cached result from memory and not fetch the DB again. 

Basically this is how it currently looks like:

```
job 1: load tasks, returns [a,b,c,d]
job 1: work task a
job 2: load tasks, returns [b,c,d]
job 1: work task b 
job 2: work task b
job 1: work task c
job 2: work task c
...
```

The task runner logic is still far from being thread safe but this should improve it quite a bit.

Consequence of all this is a lot of added load as several tasks may be executed multiple times, potentially some scheduled reports or custom alerts may be sent multiple times (I remember seeing such reports), etc.
@tsteur tsteur added the Needs Review PRs that need a code review label Aug 2, 2021
@tsteur tsteur added this to the 4.5.0 milestone Aug 2, 2021
@tsteur tsteur merged commit e646570 into 4.x-dev Aug 2, 2021
@tsteur tsteur deleted the concutask-1 branch August 2, 2021 21:53
@justinvelluppillai justinvelluppillai changed the title Scheduled tasks: Always read timetable from the database and not from… Scheduled tasks: Always read timetable from the database and not from memory Oct 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Review PRs that need a code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants