https://github.com/matomo-org/matomo/pull/17379 - Mouve

Site d'origine

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initiate range archiving if an archive is invalidated, the request is from the browser, and browser archiving is authorized #17379

Merged
merged 26 commits into from Apr 12, 2021

Conversation

@diosmosis
Copy link
Member

@diosmosis diosmosis commented Mar 23, 2021

Description:

Currently, if a range archive is invalid, and the user views it, it will not re-archive. This is because ranges are generally archived well after the end date, so the ts_archived will always be later than the minimum processed time: https://github.com/matomo-org/matomo/blob/3.x-dev/core/ArchiveProcessor/Loader.php#L223-L236

I think, previously, if going through the "trigger archiving" code path (https://github.com/matomo-org/matomo/blob/4.x-dev/core/Archive.php#L571), we never looked for invalidated archives, so we'd always initiate archiving. This change happened in 3c2b5f7 to make sure we don't initiate archiving of invalidated archives under certain circumstances if we don't have to.

Also for some reason a test was removed in a commit (but the test data was still there)... I put it back.

FYI @tsteur

Review

  • Functional review done
  • Potential edge cases thought about (behavior of the code with strange input, with strange internal state or possible interactions with other Matomo subsystems)
  • Usability review done (is anything maybe unclear or think about anything that would cause people to reach out to support)
  • Security review done see checklist
  • Code review done
  • Tests were added if useful/possible
  • Reviewed for breaking changes
  • Developer changelog updated if needed
  • Documentation added if needed
  • Existing documentation updated if needed
diosmosis added 2 commits Mar 23, 2021
…the browser, and browser archiving for the current request is authorized.
@tsteur
Copy link
Member

@tsteur tsteur commented Mar 23, 2021

makes sense 👍

@diosmosis
Copy link
Member Author

@diosmosis diosmosis commented Mar 24, 2021

@tsteur added two more tests: one where we track a visit in the past, run core:archive and make sure it incremented the data, and another where we track a visit in the past after archiving a range, then rearchive the range via a direct API request, and check that it updated.

Caught a bug in the first test. In

$minArchiveProcessedTime = $isPeriodIncludesToday ? Date::now()->subSeconds(Rules::getPeriodArchiveTimeToLiveDefault($params->getPeriod()->getLabel())) : null;
we check if there is an existing archive that we can use, and if so, we don't invalidate. This was for today archives since they would be invalidated often and we didn't want to initiate archiving if there was already one that was created within the TTL value.

Unfortunately, we default to null for other periods, so if we find any archive regardless of when it was created, we don't invalidate, and don't insert anything into the archive_invalidations table. The bug manifests like this:

  • data is archived (time 2020-03-23 12:00:00)
  • user tracks data in the past for a day (2012-08-09) (time 2020-03-23 22:30:00)
  • next core:archive, we see the day in the list to invalidate
  • we check whether there is currently an archive there. there is w/ DONE_OK, created 2020-03-23 12:00:00.
  • we don't invalidate because the archive is there, even though there's been a new visit
  • the data is not rearchived
@diosmosis
Copy link
Member Author

@diosmosis diosmosis commented Mar 24, 2021

@tsteur actually just decided it was better to not perform the check if the date didn't include today. I'm thinking this would avoid edge cases when rearchiving data in the past (eg, report specific or segment archives). For periods that include today, they will eventually get archived again, so for (eg) a Funnel that is rearchived, we'd archive the months and days not including today, then a little while later the 'today' included archives (ie, day, week, month year) will get archived, including the old data.

This makes me think of some other edge cases w/ periods that include today as well, will think about them.

@@ -904,7 +904,11 @@ public function isThereExistingValidPeriod(Parameters $params)
$isYesterday = $params->getPeriod()->getLabel() == 'day' && $params->getPeriod()->getDateStart()->toString() == Date::factory('yesterday')->toString();

$isPeriodIncludesToday = $params->getPeriod()->isDateInPeriod($today);
$minArchiveProcessedTime = $isPeriodIncludesToday ? Date::now()->subSeconds(Rules::getPeriodArchiveTimeToLiveDefault($params->getPeriod()->getLabel())) : null;
if (!$isPeriodIncludesToday) {

This comment has been minimized.

@tsteur

tsteur Mar 25, 2021
Member

@diosmosis can we maybe add a comment why we skip here?

This comment has been minimized.

@tsteur

tsteur Mar 25, 2021
Member

technically we could still keep the logic and it would just work the next time it runs after minArchiveProcessedTime has passed?

This comment has been minimized.

@diosmosis

diosmosis Mar 25, 2021
Author Member

@tsteur I left a comment w/ my reasoning here: #17379 (comment)

I was thinking there could be weird edge cases where we rearchive say a month of data in the past, but only some of it gets archived. Eg, let's say we try to archive a specific funnel for the last month:

  • user invalidates a day from a week ago (or a visit is tracked in the past for that day)
  • archiver runs, the day archive from a week ago is rearchived (ts_archived = '2020-03-24 19:00:00')
  • user changes a funnel, last month is scheduled to rearchive
  • archiver runs, every day is invalidated except the day from a week ago. the data is archived but incorrect.

This isn't an issue I think for tracking in the past so we could keep it in that case, because if we don't invalidate, it's not removed from the "remember to invalidate" option values. So we'll do it again. But other types of invalidation it probably shouldn't be done.

Though only issue i can think of w/ this change for tracking in the past is if visits keep coming in, we keep invalidating, except we only do this once per site before archiving, so I don't think that will happen...

This comment has been minimized.

@tsteur

tsteur Mar 25, 2021
Member

user changes a funnel, last month is scheduled to rearchive

do I get it right it wouldn't invalidate the day from a week ago because of a race condition if this happens shortly after the archiver ran? Technically, when scheduling to rearchive last month then it would also invalidate each day and week so they would be archived again?

This comment has been minimized.

@diosmosis

diosmosis Mar 25, 2021
Author Member

do I get it right it wouldn't invalidate the day from a week ago because of a race condition if this happens shortly after the archiver ran?

In this specific example, yes. I suspect there are many strange edge cases like this that could happen.

Technically, when scheduling to rearchive last month then it would also invalidate each day and week so they would be archived again?

I guess it's better to use an example of last2 months, where there's no period that contains today, and the new visit is tracked in the month before last.

It would invalidate the adjacent days, and the week. When the week gets archived, it would see there is a usable day archive and just use it since value = DONE_OK and ts_archived is recent and date does not include today (if my reasoning is correct).

This comment has been minimized.

@tsteur

tsteur Mar 26, 2021
Member

OK thanks 👍

This comment has been minimized.

@diosmosis

diosmosis Mar 28, 2021
Author Member

@tsteur took another look, and my reasoning for this was off, so removed the code

This comment has been minimized.

This comment has been minimized.

@tsteur

tsteur Mar 29, 2021
Member

we could merge the remaining changes then once tests pass?

This comment has been minimized.

@diosmosis

diosmosis Mar 29, 2021
Author Member

👍 will do, working on test fixes now

@tsteur
Copy link
Member

@tsteur tsteur commented Mar 26, 2021

@diosmosis there are some tests failing. Is it maybe due to this PR?

diosmosis added 4 commits Mar 28, 2021
@diosmosis
Copy link
Member Author

@diosmosis diosmosis commented Mar 29, 2021

@tsteur fixed the test that was failing, the ArchiveInvalidationTest was kind of broken, fixed a bunch of things there and found an issue in Loader.php. W/ browser archiving, the invalidate doesn't happen early enough, so moved it. Will merge if tests pass on travis.

diosmosis added 4 commits Mar 29, 2021
…idateRecentDate("yesterday").
@diosmosis
Copy link
Member Author

@diosmosis diosmosis commented Mar 30, 2021

@tsteur fixed the tests and modified invalidation in CronArchive.php again to do so. Here is how the invalidation uses the ttl:

  • if data was tracked to a past date, then we check that if there was a recent archive within the TTL, if so we use it. next core:archive run we'd check again.
  • if doing scheduled rearchiving, we don't perform the check and just make sure the archives are invalidated so they will be processed
  • if invalidating 'today' (or the period includes today), we perform the TTL check to make sure it is archived again
  • if invalidating 'yesterday', we don't perform the TTL check and look for any archive (this is ONLY for the invalidateRecentDate function which is there to check if the date has changed since yesterday)

Does this sound right to you?

@tsteur
Copy link
Member

@tsteur tsteur commented Mar 30, 2021

@diosmosis sounds right. I checked the documentation in https://github.com/matomo-org/matomo/blob/4.2.0/config/global.ini.php#L323-L340 again and it is indeed only applied to periods including today.

if data was tracked to a past date, then we check that if there was a recent archive within the TTL, if so we use it. next core:archive run we'd check again.

so maybe here we would actually indeed not check the TTL which I think you had in the code initially and I was asking for more information if I remember correctly.

does that make sense?

@diosmosis
Copy link
Member Author

@diosmosis diosmosis commented Mar 30, 2021

@tsteur

so maybe here we would actually indeed not check the TTL which I think you had in the code initially and I was asking for more information if I remember correctly.

does that make sense?

It makes sense. I left it, since in this case, it should be fine to have the TTL check (we will always try and invalidate again). It might also be useful if a user tracks lots of data in the past continuously? Like the cloud user that tracks data to yesterday instead of today. Then if the ttl is 6hrs, but we run core:archive every hour, maybe we'd still want to wait until the archives are out of date before invalidating?

@tsteur
Copy link
Member

@tsteur tsteur commented Mar 30, 2021

Yes that's good 👍

diosmosis added 2 commits Mar 31, 2021
…child archive or when all child archives are usable while still respecting ttl for periods that include today
@diosmosis
Copy link
Member Author

@diosmosis diosmosis commented Apr 1, 2021

@tsteur last commit will make sure ranges are only force archived in the following situations:

  • range is invalid and we are allowed to archive day archives in the same request
  • range is invalid and we are not allowed to archive day archives in same request, but every child archive is usable (not invalid)

If the range contains today, then we rely only on the archive becoming too old (so we respect the TTL).

I also made sure the same logic worked when browser triggered archiving of segments was allowed.

I tried to make the logic as clear as possible and the added query is only executed if all other checks fail and only for range archives. Added several new tests as well. Need to add a couple more to ArchiveSelectorTest if this all makes sense to you.

When reviewing the pr I noticed the new test I wrote previously in CronArchiveTest was incorrect, so found these edge cases this way.

core/ArchiveProcessor/Rules.php Outdated Show resolved Hide resolved
core/ArchiveProcessor/Rules.php Outdated Show resolved Hide resolved
core/ArchiveProcessor/Rules.php Outdated Show resolved Hide resolved
diosmosis added 2 commits Apr 7, 2021
…iod is used and archiving is enabled for the current request/period
// made invalid, we will correctly re-archive below.
if ($this->invalidateBeforeArchiving) {
$this->invalidatedReportsIfNeeded();
}

This comment has been minimized.

@diosmosis

diosmosis Apr 7, 2021
Author Member

Note: Invalidating before we look for an archive, otherwise, the invalidation won't happen.

This comment has been minimized.

@tsteur

tsteur Apr 8, 2021
Member

@diosmosis is this maybe only needed for certain periods or so? I see CoreAdminHome.archiveReports sets this to true when archive php is not triggered meaning this is the case for most requests even when browser archiving is disabled. It then shouldn't invalidate any reports. Maybe need to use something like $this->invalidateBeforeArchiving && Rules::isBrowserArchivingEnabled or maybe only add it for some periods like ($this->invalidateBeforeArchiving && Rules::isBrowserArchivingEnabled) || ($this->invalidateBeforeArchiving && period === range && Rules::isRangeArchivingEnabled).

This comment has been minimized.

@diosmosis

diosmosis Apr 8, 2021
Author Member

@tsteur I think it's required for browser archiving to support the "tracking data in the past" feature. Otherwise, the invalidations that get queued in the option table never get applied. We could add isBrowserArchivingEnabled as a safety. I don't think we can restrict to period because it's meant to execute on every query (basically it just does this: "are there queued invalidations? if so, let's invalidate before we launch archiving"). Since it's just for browser archiving I don't think we have to care too much about it.

This comment has been minimized.

@tsteur

tsteur Apr 9, 2021
Member

@diosmosis this could be still executed though when browser archiving is disabled right? And then because archive php is not triggered it might invalidate them? Might be good to test this

This comment has been minimized.

@diosmosis

diosmosis Apr 9, 2021
Author Member

Oh I see, I think in the old code we didn't invalidate in core:archive, so maybe it was all done here... I'll add the check for browser archiving being enabled.

if ($endDateTimestamp) {
// past archive
return $endDateTimestamp;
}

This comment has been minimized.

@diosmosis

diosmosis Apr 7, 2021
Author Member

Note: if we see a range period and we are allowed to archive and we're in a browser request, we check the ttl no matter what.

So if range archiving via browser is forced, we always archive in the browser every time the ttl is no longer valid.

If range archiving is not allowed via browser, we use the default logic (if the range does not include today, use latest archive that was archived after end date). If it includes today, we check the TTL using the existing code.

@@ -857,10 +857,11 @@ public function invalidateRecentDate($dateStr, $idSite)
'date' => $date->getDatetime(),
]);

$this->invalidateWithSegments([$idSite], $date->toString(), 'day');
$isYesterday = $dateStr == 'yesterday';
$this->invalidateWithSegments([$idSite], $date->toString(), 'day', false, $doNotIncludeTtlInExistingArchiveCheck = $isYesterday);

This comment has been minimized.

@diosmosis

diosmosis Apr 7, 2021
Author Member

Note: this and below code skips a particular check if we are invalidating yesterday (not a date including yesterday, just yesterday; we do this like we're invalidating today in case the day has changed and we want to initiate archiving for yesterday).

Since we only care if the day changed since the archive was created, we want to check if there's any archive available that we can use, so we don't want to do a ttl check.

This comment has been minimized.

@tsteur

tsteur Apr 8, 2021
Member

could we maybe also add inline a comment for this explaining this behaviour?

This comment has been minimized.

@diosmosis

diosmosis Apr 8, 2021
Author Member

👍

diosmosis added 2 commits Apr 7, 2021
…rchiveIdAndVisits is used in multiple code paths now
if (!empty($idArchives)
&& !$this->params->getArchiveOnlyReport()
&& !Rules::isForceArchivingSinglePlugin()
&& !$this->shouldForceInvalidatedArchive($value, $tsArchived)

This comment has been minimized.

@diosmosis

diosmosis Apr 7, 2021
Author Member

Note: if an archive is found to be invalidated, we check if we should force archiving. This check was originally in ArchiveSelector, but I moved it since ArchiveSelector is used in multiple places now.

@diosmosis
Copy link
Member Author

@diosmosis diosmosis commented Apr 7, 2021

@tsteur made some changes ready for another review

Copy link
Member

@tsteur tsteur left a comment

@diosmosis looks good so far. Added a few comments that may need looking at.

Also I'm thinking it may be useful to add a new method isArchivingEnabledFor like below since there are a lot of ! and it makes the code maybe more easily understandable. I'm sure we could use that method eventually in a lot of places and change a lot of isArchivingDisabledFor calls

diff --git a/core/ArchiveProcessor/Loader.php b/core/ArchiveProcessor/Loader.php
index 4e792338cc..f79714e882 100644
--- a/core/ArchiveProcessor/Loader.php
+++ b/core/ArchiveProcessor/Loader.php
@@ -275,7 +275,7 @@ class Loader
     {
         // for range periods we can archive in a browser request request, make sure to check for the ttl no matter what
         $isRangeArchiveAndArchivingEnabled = $this->params->getPeriod()->getLabel() == 'range'
-            && !Rules::isArchivingDisabledFor([$this->params->getSite()->getId()], $this->params->getSegment(), $this->params->getPeriod()->getLabel());
+            && Rules::isArchivingEnabledFor([$this->params->getSite()->getId()], $this->params->getSegment(), $this->params->getPeriod()->getLabel());
 
         if (!$isRangeArchiveAndArchivingEnabled) {
             $endDateTimestamp = self::determineIfArchivePermanent($this->params->getDateEnd());
@@ -444,7 +444,7 @@ class Loader
 
         // the archive is invalidated and we are in a browser request that is allowed archive it
         if ($value == ArchiveWriter::DONE_INVALIDATED
-            && !Rules::isArchivingDisabledFor([$params->getSite()->getId()], $params->getSegment(), $params->getPeriod()->getLabel())
+            && Rules::isArchivingEnabledFor([$params->getSite()->getId()], $params->getSegment(), $params->getPeriod()->getLabel())
         ) {
             // if coming from core:archive, force rearchiving, since if we don't the entry will be removed from archive_invalidations
             // w/o being rearchived
diff --git a/core/ArchiveProcessor/Rules.php b/core/ArchiveProcessor/Rules.php
index 947611e5c5..27d505a7ea 100644
--- a/core/ArchiveProcessor/Rules.php
+++ b/core/ArchiveProcessor/Rules.php
@@ -197,6 +197,11 @@ class Rules
         return !$generalConfig['browser_archiving_disabled_enforce'];
     }
 
+    public static function isArchivingEnabledFor(array $idSites, Segment $segment, $periodLabel)
+    {
+        return !self::isArchivingDisabledFor($idSites, $segment, $periodLabel);
+    }
+
     public static function isArchivingDisabledFor(array $idSites, Segment $segment, $periodLabel)
     {
         $generalConfig = Config::getInstance()->General;
// made invalid, we will correctly re-archive below.
if ($this->invalidateBeforeArchiving) {
$this->invalidatedReportsIfNeeded();
}

This comment has been minimized.

@tsteur

tsteur Apr 8, 2021
Member

@diosmosis is this maybe only needed for certain periods or so? I see CoreAdminHome.archiveReports sets this to true when archive php is not triggered meaning this is the case for most requests even when browser archiving is disabled. It then shouldn't invalidate any reports. Maybe need to use something like $this->invalidateBeforeArchiving && Rules::isBrowserArchivingEnabled or maybe only add it for some periods like ($this->invalidateBeforeArchiving && Rules::isBrowserArchivingEnabled) || ($this->invalidateBeforeArchiving && period === range && Rules::isRangeArchivingEnabled).

}

// if coming from a browser request, and period does not contain today, force rearchiving
if (!$params->getPeriod()->isDateInPeriod(Date::factory('today'))) {

This comment has been minimized.

@tsteur

tsteur Apr 8, 2021
Member

do we need to add the timezone to today depending on the idsite maybe?

This comment has been minimized.

@diosmosis

diosmosis Apr 8, 2021
Author Member

👍

@@ -857,10 +857,11 @@ public function invalidateRecentDate($dateStr, $idSite)
'date' => $date->getDatetime(),
]);

$this->invalidateWithSegments([$idSite], $date->toString(), 'day');
$isYesterday = $dateStr == 'yesterday';
$this->invalidateWithSegments([$idSite], $date->toString(), 'day', false, $doNotIncludeTtlInExistingArchiveCheck = $isYesterday);

This comment has been minimized.

@tsteur

tsteur Apr 8, 2021
Member

could we maybe also add inline a comment for this explaining this behaviour?

@diosmosis
Copy link
Member Author

@diosmosis diosmosis commented Apr 10, 2021

@tsteur applied review feedback

@tsteur
tsteur approved these changes Apr 12, 2021
Copy link
Member

@tsteur tsteur left a comment

LGTM if tests pass 👍

…ile methods used in next test were loaded)
@diosmosis diosmosis merged commit 5573d22 into 4.x-dev Apr 12, 2021
1 of 3 checks passed
1 of 3 checks passed
@github-actions
PHPCS
Details
@travis-ci
Travis CI - Branch Build Failed
Details
@travis-ci
Travis CI - Pull Request Build Failed
Details
@diosmosis diosmosis deleted the range-archive-invalidate-fix branch Apr 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants

level de log mis à 1

Récupération du document https://github.com/matomo-org/matomo/pull/17379.

Récupération de l'URL https://github.com/matomo-org/matomo/pull/17379...

Raccourcis

Commandes

Fermer