Lemonodor: Wiki Spam Fighter

November 23, 2005

Wiki Spam Fighter

I mocked up an interface to a possible anti-wiki abuse tool.

The idea was to make it easier to delete the kind of spam the ALU Wiki gets, which seems to come from a person using a single IP address, laboriously editing pages and satisfying captcha challenges.

There's some more discussion of the issues at the ALU Wiki Meta-discussion page.

I wanted to use logs of actual spamming efforts to make sure that grouping changes by IP address would in fact be useful, so I scraped the recent changes page. I decided to try Manuel Odendahl's html-match library to assist with the scraping, and despite it taking some time to figure out how to do what I wanted (there isn't much in the way of documentation, at least in English), it ended up saving me a lot of effort.

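(html-match:html-pattern
    ;; Match either a date header div or a block of changes.
    (html-match:+or ((:div :id "recentchangesdateheader") ?thing)
                    ((:div :id "recentchangesblock") ?thing))
    (let ((day (day-p ?thing)))
      (if day
          ;; A date header: remember it for the changes that follow.
          (setf date-string day)
          ;; Otherwise, try to parse the block as a change on that date.
          (let ((change (parse-changes ?thing date-string)))
            (when change
              (push change changes))))))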
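
Once the changes have been collected, grouping them by IP address is straightforward. The following is only a minimal sketch of that step, not the code behind the mockup: the plist layout of a change (the :ip and :page keys) and the function names group-changes-by-ip and busiest-ips are assumptions made for illustration, since the post does not show what parse-changes returns.

```lisp
;; Hypothetical record layout: each change is a plist such as
;; (:ip "192.0.2.1" :page "HomePage" :date "November 22, 2005").
;; The real output of parse-changes is not shown in the post.

(defun group-changes-by-ip (changes)
  "Return a hash table mapping each IP string to the list of its changes,
in the order they appear in CHANGES."
  (let ((groups (make-hash-table :test #'equal)))
    (dolist (change changes)
      (push change (gethash (getf change :ip) groups)))
    ;; PUSH built each list in reverse, so restore the original order.
    (maphash (lambda (ip group)
               (setf (gethash ip groups) (reverse group)))
             groups)
    groups))

(defun busiest-ips (changes)
  "Return a list of (ip . count) pairs, most active IP first."
  (let ((counts '()))
    (maphash (lambda (ip group)
               (push (cons ip (length group)) counts))
             (group-changes-by-ip changes))
    (sort counts #'> :key #'cdr)))
```

With something like this, a single address responsible for an unusual number of edits stands out at the top of the busiest-ips list, and all of its changes can be reviewed or reverted together.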

While working on this I did find a few minor bugs in the Kiwi code running the ALU Wiki, along with one bug that is inexcusable for a web-based application: incorrect HTML entity handling.

Posted by jjwiseman at November 23, 2005 06:54 PM

Comments

Hi John, I'm curious why Bayesian approaches and/or LSA techniques aren't being used more to combat Wiki Spam. I've only done a tiny bit of reading so far but haven't found a good answer. What do you think?

Posted by: Gary King on November 26, 2005 12:37 PM

I doubt Bayesian approaches would be very effective against many spammers. Many do use spammy keywords, but some just post links or hide their links in normal-looking text. And there would likely be a lot of false positives, since legitimate edits to wikis are often minor changes or just a single new line. There is just not enough text to classify reliably.

What currently seems to be pretty effective and does not require a blacklist is the BadBehavior plugin. It can be used with a number of different wiki and blog packages. It works by looking for unusual characteristics in the HTTP requests.

Posted by: JoeChongq on December 12, 2005 12:21 AM