referrer spam detection

by Hannes Wallnoefer

Hi list,

here is a very rough referrer spam detection and blocking script I
wrote for antville.org. I think it may be useful for other big
antville installations. It's very rough in its current state, and not
at all integrated into the antville app infrastructure. It needs to be
polished and probably should be integrated into the antville
SysMgr.

Attached is the file refspam.js, which provides a global object
containing two functions: Refspam.track(), which should be called
first thing in HopObject.onRequest(), and Refspam.dump(), which should
be called from Root.refspam_action() and outputs the current
referrer blocking state and the blocked requests.

The way referrer detection and blocking works is very simple; it's
described here <http://www.henso.com/log/2006.05.28/1154/>. We keep a
least-recently-used Hashtable of size 128 in app.data.refspam, keyed
by the host names of the referrer headers we get. As soon as we
see more than 20 requests with a given referrer host, we check whether
the number of IP addresses the requests came from, relative to the
number of requests, is below a given ratio, and whether the number of
referrer path names is above a certain ratio (this prevents valid
intranet links from being classified as spam). If so, requests are
redirected to the /refspam action, which displays a message and
provides a link to continue to the original target.
Referrer bots won't follow the redirect, so it's a good safety net.
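
The heuristic described above can be sketched roughly as follows. This is a minimal illustration, not the attached script: it uses a plain JavaScript Map as the least-recently-used table instead of a Java Hashtable, and the names refHosts, track, MAX_IP_RATIO and MIN_PATH_RATIO as well as both ratio values are assumptions. Only the table size of 128 and the threshold of 20 requests come from the message.

```javascript
// Sketch only: names and the two ratio values are assumptions, not taken from refspam.js.
const MAX_HOSTS = 128;      // table size, as stated in the message
const MIN_REQUESTS = 20;    // checks start after more than 20 requests, as stated
const MAX_IP_RATIO = 0.2;   // assumed: distinct IPs / requests must be below this
const MIN_PATH_RATIO = 0.5; // assumed: distinct referrer paths / requests must be above this

// A Map iterates in insertion order, so re-inserting on access gives a simple LRU.
const refHosts = new Map();

// Returns true if the request looks like referrer spam and should be redirected.
function track(referrer, ip) {
  let url;
  try {
    url = new URL(referrer);
  } catch (e) {
    return false; // missing or malformed referrer: nothing to track
  }
  const host = url.hostname;
  let entry = refHosts.get(host);
  if (entry) {
    refHosts.delete(host); // removed here, re-inserted below as most recently used
  } else {
    entry = { requests: 0, ips: new Set(), paths: new Set() };
    if (refHosts.size >= MAX_HOSTS) {
      refHosts.delete(refHosts.keys().next().value); // evict least recently used host
    }
  }
  refHosts.set(host, entry);
  entry.requests += 1;
  entry.ips.add(ip);
  entry.paths.add(url.pathname);
  if (entry.requests <= MIN_REQUESTS) {
    return false; // not enough data yet
  }
  // Few distinct senders but many distinct referrer pages suggests a spam bot.
  return entry.ips.size / entry.requests < MAX_IP_RATIO &&
         entry.paths.size / entry.requests > MIN_PATH_RATIO;
}
```

A caller would run track(referrerHeader, clientIp) at the start of request handling and redirect to the /refspam action whenever it returns true.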

The script also contains a hardcoded whitelist for host names, which
currently contains ".antville.org" and ".google.". The parameters and
the whitelist should probably be configurable through antville's
management interface, and there should probably also be a configurable
blacklist.
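
As an illustration of how such a whitelist might be matched: the message does not say how the entries are compared against a host, so the substring rule and the names WHITELIST and isWhitelisted below are assumptions.

```javascript
// Sketch only: the matching rule (substring on the dot-prefixed host) is an assumption.
const WHITELIST = [".antville.org", ".google."];

function isWhitelisted(host, patterns = WHITELIST) {
  // Prefixing a dot lets ".antville.org" match "antville.org" as well as its subdomains.
  const dotted = "." + host.toLowerCase();
  return patterns.some((pattern) => dotted.includes(pattern));
}
```

Substring matching is deliberately loose: an entry such as ".google." covers every country domain, but it would also match a host like google.spam.example, so a stricter suffix match may be preferable for entries that name a full domain.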

I hope this will be useful for somebody, and that somebody is going to
integrate this into the antville code base.

hannes


_______________________________________________
Antville-dev mailing list
Antville-dev@...
http://helma.org/mailman/listinfo/antville-dev

Attachment: refspam.js (4K)

Re: referrer spam detection

by Franz Philipp Moser


Hannes Wallnoefer wrote:
> Hi list,

Hi,

> here is a very rough referrer spam detection and blocking script I
> wrote for antville.org. I think it may be useful for other big
> antville installations. It's very rough in its current state, and not
> at all integrated into the antville app infrastructure. It needs to be
> polished and probably should be integrated into the antville
> SysMgr.
<snip />

Thanks for that. I had thought of writing such a "thing" (global spam
detection) myself, but didn't have the time.

I will test-drive your script and let you know how it works.

Thanks again

cu Philipp
--
XML is the ASCII for the new millenium
(Cocoon Documentation)

Re: referrer spam detection

by Hannes Wallnoefer

I just found there's an error in the original script that makes the
spam redirect go into an infinite loop. I'm attaching the new script
with a fix. Also, the detection check thresholds changed a little bit
since the last version.

Actually, I think the approach I chose is not the ideal one. For
instance, one person clicking through a (large) blogroll on his/her
weblog looks very much like referrer spam to this script. From what I
know now, I would suggest the following approach:

* Track referrer hosts like my script currently does
* Whenever one host is *definitely* referrer spam, automatically add
it to a permanent blacklist and send the site admin a mail about it
* Offer the site admin a list of referrer hosts sorted by requests/ip
address ratio and let him/her manually add sites to the permanent
blacklist.
* A request with a referrer host that is blacklisted is redirected to
the refspam page so if it's a valid request, users can still click
through.
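
The admin list and the blacklist check from the steps above could look roughly like this. It is a sketch under assumptions: the names permanentBlacklist, rankByRequestsPerIp and checkReferrerHost, the shape of the statistics records, and the target query parameter are all hypothetical. Only the /refspam page itself is named in this thread.

```javascript
// Sketch only: data shapes, function names and the "target" parameter are assumptions.
const permanentBlacklist = new Set();

// stats: array of { host, requests, ips }, where ips is the number of distinct addresses.
// Returns the hosts sorted by requests per IP address, most suspicious first.
function rankByRequestsPerIp(stats) {
  return stats
    .map((s) => ({ host: s.host, ratio: s.requests / Math.max(s.ips, 1) }))
    .sort((a, b) => b.ratio - a.ratio);
}

// Blacklisted referrer hosts are sent to the refspam page, which can link on to the target.
function checkReferrerHost(host, target) {
  if (permanentBlacklist.has(host)) {
    return { redirect: "/refspam?target=" + encodeURIComponent(target) };
  }
  return { redirect: null };
}
```

Sorting in descending order puts hosts with many requests from few addresses at the top, which is where a site admin would look first when deciding what to blacklist.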

If anybody is interested in implementing this you're welcome. I'm
available for any questions you may have.

hannes

Attachment: refspam.js (5K)

Re: referrer spam detection

by NightHawk-3

Hi,

I am not much of a JavaScript developer, but would it be possible to
also add a check against the referrer spam filter list that already
exists in Antville? I have often thought that this would be a nice
add-on, because I am suffering from one particular spammer that keeps
"referring" links from only a few domains, usually with obvious
words like 'viagra' or 'casino' in the URL. It is actually pretty easy
to filter those, but they're still being tracked, which I would like
to avoid as well.


Re: referrer spam detection

by Franz Philipp Moser

Hi,

Hannes Wallnoefer wrote:
> I just found there's an error in the original script that makes the
> spam redirect go into an infinite loop. I'm attaching the new script
> with a fix. Also, the detection check thresholds changed a little bit
> since the last version.

I started to implement some things like skins, and so on, and found
another problem:

*) What should we do with weblogs that have their own domain? req.path
doesn't work here, or am I wrong?

> Actually, I think the approach I chose is not the ideal one. For
> instance, one person clicking through a (large) blogroll on his/her
> weblog looks very much like referrer spam to this script. From what I
> know now, I would suggest the following approach:
>
> * Track referrer hosts like my script currently does
> * Whenever one host is *definitely* referrer spam, automatically add
> it to a permanent blacklist and send the site admin a mail about it
> * Offer the site admin a list of referrer hosts sorted by requests/ip
> address ratio and let him/her manually add sites to the permanent
> blacklist.
> * A request with a referrer host that is blacklisted is redirected to
> the refspam page so if it's a valid request, users can still click
> through.
I'm working on it, and I am also trying to implement a whitelist. I will
integrate your suggestions.

We also should not add hosts to the cache that are already on the
blacklist or whitelist; that doesn't make sense.
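
One way to keep listed hosts out of the cache is to consult both lists before touching it. The sketch below is an assumption about how that could be arranged; classifyHost and the shapes of its arguments are hypothetical and not taken from the attached module.

```javascript
// Sketch only: function name and argument shapes are assumptions.
// lists: { white: Set, black: Set }; cache: Map of host -> request count.
function classifyHost(host, lists, cache) {
  if (lists.white.has(host)) return "allow"; // known good: never tracked
  if (lists.black.has(host)) return "block"; // known bad: never tracked
  // Only hosts that are on neither list take up a slot in the cache.
  cache.set(host, (cache.get(host) || 0) + 1);
  return "track";
}
```

Checking the lists first also means the limited cache slots are spent only on hosts whose status is still unknown.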

> If anybody is interested in implementing this you're welcome. I'm
> available for any questions you may have.

I am, and I have started on some things, as you can see in the attached
file. I got this implementation working on our weblogs.

logAccess may not be the right place for the check, but it worked well
for me.

> hannes
<snip />

cu Philipp

Attachment: antirefspam_module.zip (5K)

Re: referrer spam detection

by Franz Philipp Moser



Hannes Wallnoefer wrote:
<snip />
> * Track referrer hosts like my script currently does

Done ;)

> * Whenever one host is *definitely* referrer spam, automatically add
> it to a permanent blacklist and send the site admin a mail about it

Done

> * Offer the site admin a list of referrer hosts sorted by requests/ip
> address ratio and let him/her manually add sites to the permanent
> blacklist.

Done

> * A request with a referrer host that is blacklisted is redirected to
> the refspam page so if it's a valid request, users can still click
> through.

Done. I added a security feature so that not every URL is accepted.
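
The message does not say what this check does. One common approach for a "continue to the original target" link is to accept only paths on the same site, which might be sketched as follows; isSafeTarget is a hypothetical name and the rule itself is an assumption, not the module's actual behaviour.

```javascript
// Sketch only: an assumed rule that accepts same-site paths and nothing else.
function isSafeTarget(target) {
  return typeof target === "string" &&
    /^\/(?![\/\\])/.test(target) && // must start with a single "/", not "//" or "/\"
    !/[\r\n]/.test(target);         // no line breaks that could split a header
}
```

Rejecting absolute and protocol-relative URLs keeps the refspam page from being used to bounce visitors to an arbitrary external site.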

> If anybody is interested in implementing this you're welcome. I'm
> available for any questions you may have.

I hope this looks the way you want it. I also implemented a whitelist
and manual adding and removing of hosts. As I said, I added the track
function to the Global/logAccess() function. The whole thing is
encapsulated in an AntiSpamRefMgr mounted on root.

I added support for domains, which may not be needed for antville.org.

Please take a look. I tested it on our weblogs and it worked out of the box.

> hannes
<snip />

I wonder whether app.data objects are stored (serialized) when the app is
restarted? Otherwise we would lose the whitelist and blacklist.
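
If the lists do turn out to be lost on restart, one option is to turn them into a string that can be written somewhere durable and read back at startup. The sketch below shows only that round trip; serializeLists and restoreLists are hypothetical names, JSON is an assumed format, and where the string is stored is left open.

```javascript
// Sketch only: helper names and the JSON format are assumptions.
// lists: { white: Set, black: Set }
function serializeLists(lists) {
  return JSON.stringify({
    white: Array.from(lists.white).sort(),
    black: Array.from(lists.black).sort(),
  });
}

function restoreLists(text) {
  const data = JSON.parse(text);
  return { white: new Set(data.white), black: new Set(data.black) };
}
```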

cu Philipp

Attachment: antirefspam_module.zip (10K)

Re: referrer spam detection

by Franz Philipp Moser

One last question: why the redirect? Couldn't we just skip the Access entry?

cu Philipp

Re: referrer spam detection

by NightHawk-3

I guess it saves the load of generating the HTML source. Those spammers
in particular often send a lot of requests at once and don't even bother
to read the answer from the web server, so it's really just a waste of
CPU power that could be avoided.


Re: referrer spam detection

by Franz Philipp Moser



nighthawk wrote:
> I guess it saves the load of generating the HTML source. Those spammers
> in particular often send a lot of requests at once and don't even bother
> to read the answer from the web server, so it's really just a waste of
> CPU power that could be avoided.
<snip />

Oh, sorry, yes, I missed that ;) You are right of course. So logAccess is
not the right place to redirect; it should then be onRequest() instead.

Where is it integrated in antville?

cu Philipp

Re: referrer spam detection

by Franz Philipp Moser

I added it now to HopObject/onRequest() just after res.handlers.site is
filled and it works fine.

We need res.handlers.site on our blogs, but for antville.org I think you
can just add "root.refspam.track()" as the first instruction in
HopObject/onRequest().

It works well. Thanks to hns for thinking about this and finding a quick solution.

I put the whole thing under GPL on my blog if anybody needs it:

http://weblogs.brandnews.at/pm/stories/3808/

cu Philipp

Re: referrer spam detection

by Franz Philipp Moser



Franz Philipp Moser wrote:
<snip />
> I put the whole thing under GPL on my blog if anybody needs it:
>
> http://weblogs.brandnews.at/pm/stories/3808/
<snip />

Sorry about that, but I have now released it under the antville licence, so
everybody can use it.

cu Philipp

Re: referrer spam detection

by Franz Philipp Moser

Hi list,

I don't know why, but the implementation is causing trouble. As I mentioned on
the helma-user list:

http://helma.org/pipermail/helma-user/2006-May/006533.html

I keep getting strange "too many open files" errors from Java, so
I think there is a bug somewhere.

Another strange thing is that I tried to add the blacklist and whitelist to
the root object so that they would survive a restart. First of all, they don't
get stored, and this morning the blacklist and whitelist on the root
object had disappeared. They were simply null. Maybe I should use another
Java object to store the lists, or simply a HopObject.

Can somebody help? The current version can be downloaded from my weblog:

http://weblogs.brandnews.at/pm/stories/3808/

cu Philipp

Re: referrer spam detection

by Hannes Wallnoefer

For the record, here's the current and probably final (as far as I'm
concerned) version of my refspam.js file. It's been working very well
for over a week on antville.org, most of the blocking being done by
the blacklist, with occasional attacks from new spammers being
detected quite reliably.

hannes

Attachment: refspam.js (10K)