Referrer spam is a menace today, but simply blacklisting URLs or IP addresses is bound to fail. How about a dynamic blacklist?
Here is how I envision it working. Essentially, there is a "score table" that keeps a score for the domain of each referring URL. The better behaved the domain, the higher its score; the worse behaved, the lower.
- The score table is initialized with several friendly search engines, custom URLs (specified by the admin), and the host URL, each with some high score (which the admin may customize). URLs in this whitelist are considered safe.
- Anti-flood: If the same referrer URL shows up X times within Y seconds, penalize the domain's score by Z. Allow X, Y, and Z to be customized.
- For each referrer URL that is not in the score table OR whose domain score is < X, retrieve the HTML page at the referring URL using curl (after some delay) and parse it to see whether the page really contains a visible link to the host's webpage. If it does, the referring URL is legit, so raise the domain's score by Y. Otherwise the URL is spam, so penalize the domain by Z. Allow X, Y, and Z to be customized.
- Look up the referrer URL or IP address in Project Honey Pot or other anti-spam websites to see whether it is on a known blacklist. If so, penalize the domain by X.
- For each incoming referrer URL, if the domain score is low enough (< X), ban it right away. If the domain score is high enough (> Y), record it right away.
- Also, over time (after Z weeks), purge the records or multiply the scores by a number between 0 and 1, so that even high-scoring domains get rechecked from time to time.
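
To make the bookkeeping concrete, here is a minimal Python sketch of the score table, the anti-flood rule, the ban/allow thresholds, and the periodic decay. Everything in it is my own assumption for illustration: the class name `ScoreTable`, the default numbers, and the choice to start unknown domains at 0 are all hypothetical and meant to be tuned by the admin.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse


class ScoreTable:
    """Hypothetical per-domain score table; names and defaults are illustrative."""

    def __init__(self, whitelist, trusted_score=100.0, ban_below=-10.0,
                 allow_above=10.0, flood_hits=5, flood_window=60.0,
                 flood_penalty=5.0):
        # Seed the whitelist (search engines, admin URLs, the host) with a high score.
        self.scores = {domain: trusted_score for domain in whitelist}
        self.ban_below = ban_below
        self.allow_above = allow_above
        self.flood_hits = flood_hits        # X hits ...
        self.flood_window = flood_window    # ... within Y seconds ...
        self.flood_penalty = flood_penalty  # ... costs the domain Z points
        self.hits = defaultdict(deque)      # referrer URL -> recent hit timestamps

    @staticmethod
    def domain_of(url):
        return (urlparse(url).hostname or "").lower()

    def adjust(self, domain, delta):
        # Assumption: a domain seen for the first time starts at 0.
        self.scores[domain] = self.scores.get(domain, 0.0) + delta

    def record_hit(self, url, now=None):
        """Anti-flood: penalize the domain once the URL reaches the hit limit."""
        now = time.time() if now is None else now
        recent = self.hits[url]
        recent.append(now)
        # Drop timestamps that have fallen out of the time window.
        while recent and now - recent[0] > self.flood_window:
            recent.popleft()
        if len(recent) >= self.flood_hits:
            self.adjust(self.domain_of(url), -self.flood_penalty)
            recent.clear()

    def verdict(self, url):
        """Return 'ban', 'allow', or 'verify' (unknown or in-between score)."""
        score = self.scores.get(self.domain_of(url))
        if score is None:
            return "verify"
        if score < self.ban_below:
            return "ban"
        if score > self.allow_above:
            return "allow"
        return "verify"

    def decay(self, factor=0.5):
        """Multiply every score by a factor between 0 and 1."""
        for domain in self.scores:
            self.scores[domain] *= factor
```

Note that in this sketch the decay pulls negative scores toward zero as well, so a banned domain eventually gets a second look, which may or may not be what you want.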
I think #2, #3, and #4 (the anti-flood rule, the link verification, and the blacklist lookup) will catch most of the referrer spam.
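
The link verification step is probably the trickiest part, so here is a rough Python sketch of the parsing half, using only the standard library. It assumes the page has already been fetched (with curl or anything else) and is passed in as a string. The names `LinkFinder` and `links_back` and the host `myblog.example` are made up for illustration, and "visible" is only approximated: an anchor counts if it has some text and is not marked `hidden` or `display:none` inline. Links hidden through external CSS or scripts would slip through, and image-only links would be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkFinder(HTMLParser):
    """Looks for an <a> with text whose href points at target_host."""

    def __init__(self, base_url, target_host):
        super().__init__()
        self.base_url = base_url
        self.target_host = target_host.lower()
        self.in_target_link = False
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        self.in_target_link = False
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # Crude visibility check: skip anchors hidden by inline markup.
        style = (attr.get("style") or "").replace(" ", "").lower()
        if "hidden" in attr or "display:none" in style:
            return
        # Resolve relative links against the referring page before comparing hosts.
        host = (urlparse(urljoin(self.base_url, href)).hostname or "").lower()
        self.in_target_link = host == self.target_host

    def handle_data(self, data):
        # Only count the link if it wraps some actual text.
        if self.in_target_link and data.strip():
            self.found = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_target_link = False


def links_back(html, referrer_url, my_host):
    """True if the fetched page appears to contain a visible link to my_host."""
    finder = LinkFinder(referrer_url, my_host)
    finder.feed(html)
    finder.close()
    return finder.found
```

If `links_back(...)` returns true, the domain would gain Y points; otherwise it would lose Z.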