Better handling of the spammers list #7674

Closed
mnapoli opened this issue Apr 14, 2015 · 26 comments

@mnapoli
Contributor

mnapoli commented Apr 14, 2015

In #5099, many people regularly report new spammers (which spam Piwik with fake visits). We need a more scalable solution, as this is becoming a real problem.

Goals:

  • make it easier for users to report new spammers
  • Piwik should auto-update the spammers list (every day or something)
  • the list should be kept up to date in future releases (for the Piwik installs that are set up to avoid any external network call)
  • optional: share that spammers list with the world as open data?

Ideas:

  • store the spammers list in a new GitHub repository in a JSON file (or YAML or whatever)
  • users can report new spammers with issues and pull requests (later we can create a better UI/website for that)
  • register that package on Packagist:
    • Piwik requires that package, which means the list is bundled in Piwik's releases (no first download required)
    • any other project can use the list by cloning the git repository or requiring it in Packagist
  • Piwik would download new versions of that list in tmp/ every day or week -> the version in tmp/ would override the one installed in vendor/
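
A minimal sketch of the override idea in the last bullet, written in Python rather than Piwik's PHP. The file name `spammers.txt`, the one-domain-per-line format, and the function `load_spammer_list` are assumptions made for illustration; nothing in this issue fixes them.

```python
import tempfile
from pathlib import Path
from typing import List


def load_spammer_list(bundled: Path, downloaded: Path) -> List[str]:
    """Return the downloaded list if it exists, otherwise the bundled one."""
    source = downloaded if downloaded.is_file() else bundled
    lines = source.read_text(encoding="utf-8").splitlines()
    # Skip blank lines and strip surrounding whitespace.
    return [line.strip() for line in lines if line.strip()]


# Throwaway directories stand in for Piwik's vendor/ and tmp/ (hypothetical layout).
with tempfile.TemporaryDirectory() as root:
    vendor_file = Path(root, "vendor", "spammers.txt")
    tmp_file = Path(root, "tmp", "spammers.txt")
    vendor_file.parent.mkdir(parents=True)
    tmp_file.parent.mkdir(parents=True)

    # Only the bundled list exists: it is the one that gets used.
    vendor_file.write_text("old-spammer.example\n", encoding="utf-8")
    before_download = load_spammer_list(vendor_file, tmp_file)

    # Once a newer list has been downloaded, it takes precedence.
    tmp_file.write_text(
        "old-spammer.example\nnew-spammer.example\n", encoding="utf-8"
    )
    after_download = load_spammer_list(vendor_file, tmp_file)
```

Falling back to the bundled file means an install with no network access, or one whose tmp/ was emptied, still filters with the list shipped in its release.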

I'm not too sure yet about the Packagist part (it's not a PHP package, and it would require running composer update before releases) but using submodules is definitely a no-go…

@mnapoli mnapoli added the Enhancement label Apr 14, 2015
@Globulopolis
Contributor

We need some UI in Piwik to:

  1. Manage the list of URLs.
  2. Update list via one click.

That's a more technical step, but Piwik must support regex in these URLs.

At the very least, whenever the list is updated, Piwik should check for a user list and merge it in if it contains different entries.
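
One way the merge and the regex matching could look, sketched in Python. The function names, the example domains, and the choice to treat every entry as a case-insensitive regular expression are all assumptions for illustration, not Piwik behaviour.

```python
import re
from typing import Iterable, List


def merge_lists(global_list: Iterable[str], user_list: Iterable[str]) -> List[str]:
    """Union of both lists, keeping first-seen order and dropping duplicates."""
    return list(dict.fromkeys([*global_list, *user_list]))


def compile_patterns(entries: Iterable[str]) -> list:
    """Compile every entry as a case-insensitive regular expression."""
    return [re.compile(entry, re.IGNORECASE) for entry in entries]


def is_spam(referrer_url: str, patterns: Iterable) -> bool:
    """True if any pattern matches somewhere in the referrer URL."""
    return any(pattern.search(referrer_url) for pattern in patterns)


# Hypothetical entries: the global list is refreshed, the user list is local.
global_list = [r"spam-one\.example", r"spam-two\.example"]
user_list = [r"spam-two\.example", r"local-spam-\d+\.example"]

merged = merge_lists(global_list, user_list)
patterns = compile_patterns(merged)

blocked = is_spam("http://www.Local-Spam-42.example/offer", patterns)
allowed = not is_spam("https://piwik.org/", patterns)
```

Because the merge is recomputed from both inputs, an update of the global list never discards the entries a user added locally.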

@gaumondp

And please, make it "low tech" so people whose servers can't reach the internet can "copy-paste & save" the content of a file to upgrade the list...

@AgentGod

So it seems the easiest way from a user standpoint would be to keep the spammer list in an external file that is updated on a daily basis.
In the Piwik admin panel, there would be a "report as spam" button beside each referrer URL. Each report would be sent to a verification list, and once a URL has several reports it would be moved from the verification list to the main list.
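
The verification step could be sketched roughly as follows, in Python. The threshold of three reports, the `SpamReports` class, and the idea of counting distinct reporters are assumptions; the comment above only says "several reports".

```python
from collections import defaultdict
from typing import Dict, Set

REPORTS_NEEDED = 3  # assumed threshold, not specified in this thread


class SpamReports:
    """Holds reported domains until enough distinct reporters confirm them."""

    def __init__(self, threshold: int = REPORTS_NEEDED) -> None:
        self.threshold = threshold
        self.pending: Dict[str, Set[str]] = defaultdict(set)  # verification list
        self.confirmed: Set[str] = set()                      # main list

    def report(self, domain: str, reporter_id: str) -> bool:
        """Record one report; return True once the domain is on the main list."""
        domain = domain.strip().lower()
        if domain in self.confirmed:
            return True
        # A set ignores repeated reports from the same reporter.
        self.pending[domain].add(reporter_id)
        if len(self.pending[domain]) >= self.threshold:
            self.confirmed.add(domain)
            del self.pending[domain]
            return True
        return False


reports = SpamReports()
results = [
    reports.report("Spam.example", "install-a"),
    reports.report("spam.example", "install-a"),  # duplicate, not counted twice
    reports.report("spam.example", "install-b"),
    reports.report("spam.example", "install-c"),  # third distinct reporter
]
```

Counting distinct reporters rather than raw clicks keeps a single install from pushing a domain onto the shared list by itself.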

@futureweb
Contributor

I guess it would make sense to have an (auto-updated) "general list" and something user/installation based ...
European users often see different referral spam than US, Chinese, ... users.
I use Piwik for about 500-600 different sites, and referral spam often differs from site to site (of course some spammers show up on nearly all sites).

So one could add specific entries only for one's own instance of Piwik and not have to wait for the general list to be updated.

@mnapoli
Contributor Author

mnapoli commented Apr 14, 2015

A custom user list seems like a good idea, but I don't think it's beneficial for everybody in the long run: spammers are spammers for everybody. If people do not report them because they can flag them in their user list, then the interest in the global list is gone.

Maybe we should approach the problem the other way: when admins report spammers, they are added to the custom list. That way they don't have to wait for the spammer to be added to the official list, but it still means that users will report spammers and not simply create a custom list.

However we may want to start with a simpler goal at first (one where there is no UI to report spammers, and no way to have a custom list).

@gaumondp The list should be updated on each Piwik update, like now. I'm not sure letting users manually update the list is that necessary. That should be enough for a start for those installs that don't have internet access, especially since those might not be the target of spammers (since they don't have internet access).

@futureweb
Contributor

I like the idea that reporting spammers to the global list also adds them to the custom list ... it prevents bad reporting behaviour, as you say.
But updating the global list only on Piwik updates is not frequent enough, as spammers will change their domains faster once they know they are banned ...
I would suggest updating the global list on a regular basis, like the GeoCity database, as AgentGod already posted
(and as already stated in your initial post).

@gaumondp

@mnapoli I don't want to make my case the rule, but I know a few people with big installations and very rigid environments/infrastructure who can't keep up with Piwik's fast release cycle.

In fact, I usually update 4 times a year, so we're often 3 releases behind at update time. I don't think I'm alone though.

@mnapoli
Contributor Author

mnapoli commented Apr 15, 2015

@gaumondp and those setups cannot use auto-update of the list?

@gaumondp

Exactly: no auto-update of the spammer list is possible, no one-click GeoIP updating, no easy install for stuff at http://plugins.piwik.org/ ...

And considering size (DB is at 22 GB here right now), no Web interface Piwik update possible. We use the CLI for that.

@AgentGod

@gaumondp that's why it should be a simple external txt file containing the spammer list, which can be updated easily through the CLI.
Some people will report through the user interface; some with big installs will not.
The idea is that before a link ends up in the generally distributed spam list, it is automatically verified from several sources.

@mnapoli
Contributor Author

mnapoli commented Apr 15, 2015

@gaumondp OK then we can document how to manually update the latest list, i.e. there will be 2 files:

  • built-in list, updated with every Piwik update (this is the one installed by Composer)
  • latest list, updated either through auto-update or manually (will probably be in the tmp/ directory)

That doesn't require any additional effort and should address all use cases. Then once that is done we can discuss how to let users update manually through the UI if that's really necessary.

@gaumondp

@mnapoli, I'm just giving information and use cases about a few environments and scenarios I know about that maybe you don't see often. I'm not "requiring" stuff. :) I'm just good at being devil's advocate.

I'm not sure about saving the list in the /tmp/ directory though. In my view, everything in /tmp/ is "auto-generated" again once it has been emptied. Tell me if I'm wrong about this! But you know Piwik internals better than me for sure, so I trust you about where to store such a file.

@mnapoli
Contributor Author

mnapoli commented Apr 16, 2015

I appreciate you listing the different use cases; I for sure don't have a clear overview of all of them. In that case there is no additional effort so I don't see any issue ;)

Regarding the directory, maybe somebody else can chime in on this but I'm afraid we need a folder with write access.

@gaumondp

Maybe a new table in Piwik, but "feedable" from the text file or from a future Web interface with a simple "each line is a spammer" format so it's copy-paste enabled?

Or, if the GeoIP database is in /misc/, maybe it makes sense to use that directory instead of /tmp/ if you don't want an additional table in the DB?

@mnapoli mnapoli added this to the Short term milestone Apr 17, 2015
@mnapoli
Contributor Author

mnapoli commented Apr 19, 2015

For the record the new list is here: https://github.com/piwik/referrer-spam-blacklist

@openjck

openjck commented Apr 29, 2015

Should the improved handling also discount spam visits retroactively?

@mattab
Member

mattab commented Apr 29, 2015

@openjck no it will not remove referrer spammers from historical data

@pedrosanchezpernia

Is there a way/command to remove referrer spam from historical data? (Maybe a "rebuild"?)

@mattab
Member

mattab commented Jun 12, 2015

Since we will have a long dev cycle for 3.0.0, I reckon we need to give users a solution that keeps the spammer list constantly auto-updated and really leverages our referrer spammer list. Moving to 2.14.0

@mattab mattab modified the milestones: 2.14.0, Short term Jun 12, 2015
@mnapoli
Contributor Author

mnapoli commented Jun 12, 2015

👍 makes sense

@mattab
Member

mattab commented Jun 20, 2015

Note: we can't easily store the file on disk (it's not ideal to store it in tmp/ as that can be flushed). So I suggest caching the spammers.txt file in the DB option table.

@mnapoli
Contributor Author

mnapoli commented Jun 21, 2015

@mattab doing so prevents users from updating manually (in environments without internet access)

@futureweb
Contributor

@mnapoli I guess one possibility for environments without internet access would be to update the DB from a temporary file on disk (tmp/)? Then flushing tmp/ wouldn't be an issue.

  • Update File
  • run Update Script (CLI or GUI triggered)
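
A rough Python sketch of those two steps, with SQLite standing in for Piwik's database. The table layout, the option key `referrer_spam_blacklist`, and the JSON encoding are assumptions for illustration; the actual storage format is not described in this thread.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

OPTION_NAME = "referrer_spam_blacklist"  # hypothetical option key


def import_list(db: sqlite3.Connection, list_file: Path) -> int:
    """Read one domain per line and store the whole list in one option row."""
    lines = list_file.read_text(encoding="utf-8").splitlines()
    domains = [line.strip() for line in lines if line.strip()]
    db.execute(
        "INSERT OR REPLACE INTO piwik_option (option_name, option_value) "
        "VALUES (?, ?)",
        (OPTION_NAME, json.dumps(domains)),
    )
    db.commit()
    return len(domains)


def read_list(db: sqlite3.Connection) -> list:
    """Return the stored list, or an empty list if nothing was imported yet."""
    row = db.execute(
        "SELECT option_value FROM piwik_option WHERE option_name = ?",
        (OPTION_NAME,),
    ).fetchone()
    return json.loads(row[0]) if row else []


# In-memory database with an assumed two-column option table.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE piwik_option (option_name TEXT PRIMARY KEY, option_value TEXT)"
)

# Step 1: the admin drops an updated file; step 2: the update script imports it.
with tempfile.TemporaryDirectory() as tmp_dir:
    list_file = Path(tmp_dir, "spammers.txt")
    list_file.write_text("spam-one.example\n\nspam-two.example\n", encoding="utf-8")
    imported = import_list(db, list_file)

# The temporary directory is gone now, but the list survives in the database.
stored = read_list(db)
```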

@mnapoli
Contributor Author

mnapoli commented Jun 21, 2015

@futureweb it would require more effort to implement, and would be less practical to use (it requires SSH access, or logging in to Piwik instead of just dropping a file through FTP), but that's still a better solution than nothing so I guess we could do that.

@mattab
Member

mattab commented Jun 22, 2015

  • when users upgrade Piwik to the latest version, they will get the latest version of the referrer spammer list.
    • on the release checklist, before releasing a new stable version, we will tag a new version of the spammer list and update composer.lock to use it
  • additionally, Piwik users who have internet access will automatically receive the latest file (proposed: update once a week)
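
The weekly refresh with an offline fallback could look something like this Python sketch. The `refresh` function, the injected `fetch` callable, and treating `OSError` as "no network" are assumptions made for illustration, not the behaviour of the eventual implementation.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional, Tuple

UPDATE_INTERVAL = timedelta(weeks=1)  # the proposed once-a-week schedule


def update_is_due(last_update: Optional[datetime], now: datetime) -> bool:
    """An update is due if none has happened yet or a week has passed."""
    return last_update is None or now - last_update >= UPDATE_INTERVAL


def refresh(
    current: List[str],
    last_update: Optional[datetime],
    now: datetime,
    fetch: Callable[[], List[str]],
) -> Tuple[List[str], Optional[datetime]]:
    """Return the list to use and the timestamp of the last successful update."""
    if not update_is_due(last_update, now):
        return current, last_update
    try:
        return fetch(), now
    except OSError:
        # No network access: keep the list that shipped with the release.
        return current, last_update


def fetch_online() -> List[str]:
    # Stand-in for downloading the latest list.
    return ["old-spammer.example", "new-spammer.example"]


def fetch_offline() -> List[str]:
    raise OSError("network unreachable")


now = datetime(2015, 6, 22, tzinfo=timezone.utc)
bundled = ["old-spammer.example"]

recent = refresh(bundled, now - timedelta(days=2), now, fetch_online)
stale = refresh(bundled, now - timedelta(days=8), now, fetch_online)
offline = refresh(bundled, None, now, fetch_offline)
```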

@mnapoli mnapoli self-assigned this Jun 23, 2015
mnapoli added a commit that referenced this issue Jun 23, 2015
The blacklist is updated weekly from github (stored in the option table).
mnapoli added a commit that referenced this issue Jun 23, 2015
@mnapoli
Contributor Author

mnapoli commented Jun 23, 2015

PR: #8186

mnapoli added a commit that referenced this issue Jun 24, 2015
The blacklist is updated weekly from github (stored in the option table).
mnapoli added a commit that referenced this issue Jun 24, 2015