https://github.com/matomo-org/matomo/pull/16795


Create robots.txt to prevent bots from indexing Matomo app #16795

Merged
merged 4 commits into 4.x-dev from robots.txt on Nov 30, 2020
Conversation

@mattab (Member) commented Nov 25, 2020

According to one user, this has helped work around the issues with the Google Ads submission.

Similar to #6552 and #15273, but as a robots.txt file in addition to the meta tags.

Review

  • Functional review done
  • Usability review done (is anything unclear, or is there anything that would cause people to reach out to support?)
  • Security review done (see checklist)
  • Code review done
  • Tests were added if useful/possible
  • Reviewed for breaking changes
  • Developer changelog updated if needed
  • Documentation added if needed
  • Existing documentation updated if needed
@mattab mattab added this to the 4.0.1 milestone Nov 25, 2020
@sgiehl (Member) approved these changes Nov 25, 2020 and left a comment

Shouldn't hurt to add that I guess

@Findus23 (Member) commented Nov 25, 2020

Just keep in mind that this means if people use wget to e.g. download CSV reports, their scripts will break.

And Google shows an annoying warning in the search console that the website accessed can't be checked completely as some resources (Matomo) are blocked (at least it did years ago).

@tsteur tsteur modified the milestones: 4.0.1, 4.1.0 Nov 25, 2020
@tsteur (Member) commented Nov 25, 2020

I was going to move this to 4.1, as it can break things and such a change should not be in a patch release where we're trying to keep the Matomo 4 upgrade stable, but then moved it back to 4.0.1 as we have only rolled out Matomo 4 to a few users so far. Nonetheless, it's a bit risky to put it into 4.0.1 without any notice. @mattab it would be great to mention this in the Matomo 4 changelog right away, in the initial list of things.

Can you also add a developer changelog entry?

@tsteur (Member) commented Nov 25, 2020

@mattab

maybe below would do as well (not sure we can define multiple rules)?

User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /

User-agent: *
Disallow: /matomo.php
Disallow: /piwik.php

see https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers

and https://developers.google.com/search/docs/advanced/robots/create-robots-txt

and http://www.robotstxt.org/db.html

@tsteur tsteur modified the milestones: 4.0.1, 4.0.2, 4.0.3 Nov 26, 2020
@mattab (Member, Author) commented Nov 29, 2020

I didn't realise it could potentially break BC. Then maybe we could instead try to set the robots.txt to only:

User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /

User-agent: *
Disallow: /matomo.php
Disallow: /piwik.php

We'd have no guarantee it helps with the Google Ads malware mis-identification issue, but at least it wouldn't break BC?

@tsteur (Member) commented Nov 29, 2020

I have no big preference. I suppose we could always try and see if it helps? We could also add more Google bots if needed see https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers

@mattab (Member, Author) commented Nov 29, 2020

I asked the 6 people who had experienced the issue and will see if adding the simple robots.txt in this PR helps them.

If they confirm this workaround works for them, we could

  1. consider breaking BC (risky / not great)
  2. or instead try to list in robots.txt all the Google bots from the link below (so wget and any other user agents can still fetch reports), and hope it works then (or ask them again to test it, if they're willing); see the sketch below

We could also add more Google bots if needed see https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers
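
For illustration, option 2 might look something like the robots.txt below. This is only a sketch extending the earlier proposal; the user-agent tokens are taken from Google's crawler overview page linked above, and the exact list would need to be checked against that page.

User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Googlebot-News
User-agent: Googlebot-Video
User-agent: AdsBot-Google
User-agent: AdsBot-Google-Mobile
User-agent: Mediapartners-Google
User-agent: APIs-Google
Disallow: /

User-agent: *
Disallow: /matomo.php
Disallow: /piwik.php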

@tsteur (Member) commented Nov 30, 2020

FYI, I tested this on the demo using https://technicalseo.com/tools/robots-txt/ and things should work like that.

Added more crawlers to the list @mattab

tsteur added 2 commits Nov 30, 2020
@mattab (Member, Author) commented Nov 30, 2020

LGTM

@tsteur tsteur merged commit 23739ca into 4.x-dev Nov 30, 2020
0 of 2 checks passed:
Travis CI - Branch: Build Failed
Travis CI - Pull Request: Build Failed
@tsteur tsteur deleted the robots.txt branch Nov 30, 2020
@MichaIng commented Jan 8, 2021

And Google shows an annoying warning in the search console that the website accessed can't be checked completely as some resources (Matomo) are blocked (at least it did years ago).

That is true; such a warning shows up for each web page in Google Search Console (https://support.google.com/webmasters/answer/6352293#blocked-resources), since matomo.php/piwik.php cannot be accessed. And it then breaks tracking of bots in general, I guess, which in turn renders the BotTracker plugin obsolete: https://plugins.matomo.org/BotTracker

It is probably reasonable to not index the Matomo web UI, but it could also be interesting to track bots 🤔.

@tsteur (Member) commented Jan 10, 2021

@MichaIng I suppose an easy fix would be for the BotTracker plugin to delete the robots.txt regularly. E.g. it could do this on plugin activation and on plugin and core updates; it could even do it in a scheduled task, say every hour or daily. This would pretty much ensure the file never exists and bots can be tracked, except if the file is not writable (deletable). The plugin could probably also mark this file to be ignored by the "file integrity check" so it won't complain if it doesn't exist. Happy to give more hints if someone is keen on implementing this.
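
A minimal sketch of what such a scheduled task could look like, assuming Matomo's Piwik\Plugin\Tasks API; the file placement, the hourly interval and the BotTracker namespace are illustrative, and the activation/update hooks plus the file-integrity exclusion would still need to be wired up separately.

<?php
// Hypothetical Plugins/BotTracker/Tasks.php sketch, not the actual plugin code.
namespace Piwik\Plugins\BotTracker;

class Tasks extends \Piwik\Plugin\Tasks
{
    public function schedule()
    {
        // Run hourly so a robots.txt recreated by a core or plugin update disappears again quickly.
        $this->hourly('removeRobotsTxt');
    }

    public function removeRobotsTxt()
    {
        // Delete the robots.txt in the Matomo root so crawlers are not told to stay away
        // and can therefore still be tracked by the plugin.
        $robotsTxt = PIWIK_DOCUMENT_ROOT . '/robots.txt';

        // Deleting the file requires write access to the containing directory.
        if (file_exists($robotsTxt) && is_writable(dirname($robotsTxt))) {
            @unlink($robotsTxt);
        }
    }
}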

@MichaIng commented Jan 10, 2021

Yes, this is what I did, but it sounds more like a workaround than a good solution. It would probably be cleaner to solve it via an X-Robots-Tag header set within PHP, plus an option in Matomo (allow/disallow bot crawling, which implies allowing/disallowing crawlers being tracked by Matomo) that can then be switched when installing the BotTracker plugin. I guess making it more fine-grained and blocking by default only files that are not required for loading the tracking JS wouldn't block much that isn't blocked by .htaccess/webserver rules or authentication anyway, right?
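
As a rough illustration of that header-based idea (not how Matomo currently behaves), an endpoint could send the header depending on such a setting; the $allowBotCrawling flag below is a made-up stand-in for the suggested option.

<?php
// Hypothetical sketch: signal "do not index" via an HTTP header instead of a robots.txt file.
// $allowBotCrawling stands in for a (made-up) Matomo option that e.g. the BotTracker plugin
// could switch on so that crawlers may request the tracker and therefore be tracked.
$allowBotCrawling = false;

if (!$allowBotCrawling) {
    // Ask crawlers not to index or follow anything served by this endpoint.
    header('X-Robots-Tag: noindex, nofollow');
}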

But even though I removed the robots.txt and made sure the Google crawler is able to check and index piwik.php and matomo.php, it still fails to load them with a tracking query string when crawling other files.

Calling the exact same URL + query string manually works and is successfully tracked in Matomo. The cases where Google fails are a directory index and a simple HTML page without any CSS or JavaScript, aside from what Cloudflare injects to load the Piwik app into all pages.

I'm not sure how to debug this; we'd probably need a few other cases to confirm it is a general issue and not limited to our and/or similar setups, or e.g. to the Cloudflare app (although the query string is perfectly fine, so I'm not sure how it could have any effect from that point on). And if it is a general issue, we'd need to ask the Google community, I think.

Btw, good to know that wget respects robots.txt, I would have never guessed that 😄!

@tsteur (Member) commented Jan 11, 2021

I guess making it more fine-grained and blocking by default only files that are not required for loading the tracking JS wouldn't block much that isn't blocked by .htaccess/webserver rules or authentication anyway, right?

We could maybe only block index.php, matomo.php and piwik.php, but that would be basically the same as it is now, pretty much.

But even though I removed the robots.txt and made sure the Google crawler is able to check and index piwik.php and matomo.php, it still fails to load them with a tracking query string when crawling other files.

Sorry, I'm not quite understanding this part. So you removed robots.txt, but the Google crawler still fails to access the tracker when there is a tracking query string? As you are using Cloudflare, it might be good to check whether caching for this endpoint is disabled. I'm not too much into Cloudflare, unfortunately. Maybe it has the robots.txt cached?

@MichaIng commented Jan 11, 2021

We could maybe only block index.php, matomo.php and piwik.php, but that would be basically the same as it is now, pretty much.

Since matomo.php and piwik.php are required for tracking, those would need to be allowed. index.php could be blocked.
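
Sketching that suggestion, the robots.txt would shrink to something like the following; note this only covers the /index.php path itself, so whether it would still help with the Google Ads issue is an open question.

User-agent: *
Disallow: /index.php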

Sorry, I'm not quite understanding this part. So you removed robots.txt, but the Google crawler still fails to access the tracker when there is a tracking query string?

Exactly: the crawler loads piwik.js successfully but fails to call piwik.php, in exactly the same way as when it was blocked by robots.txt. Caching was my first idea as well, but live-testing piwik.php directly succeeds, which would fail as well if robots.txt or X-Robots-Tag blocked it. But probably I've overlooked something, or Google uses multiple robots.txt caches depending on how a resource is accessed; I'll keep an eye on it.


EDIT: The issue persists, and neither Matomo nor the webserver nor PHP reports any error. Probably the mobile-friendly test tool or the Google crawler itself simply refuses to access resources with such a long query string; I'm not sure, as it doesn't give any information anywhere. I tried to enter the whole URL with query string into the Search Console URL inspection, but it won't run the test and doesn't show any reason why, nor do the related help/doc pages give any hint.
