Hung Truong: The Blog!

How To Stop RSS Scrapers From Stealing Your Content. Plus Revenge!

June 22, 2006 | 3 Minute Read

I noticed a while back in my Technorati blog backlinks thing that a certain blog was reposting all of the articles from my anime blog. It was an RSS scraper that ran on some software called “Autoblogger Pro.” Basically, the site scraped posts from a number of anime blogs and reposted the content with AdSense ads. While I didn’t mind the link back, it still bugged me that the person who set this up was simply profiting from other people’s work. I figured I’d do something about it.

The first thing I noticed was that all comments on the blog were turned off. This makes sense, since every comment would have probably been “stop stealing my content, you jackass!” There wasn’t any contact info or even an “about” page. I decided that contacting the owner of the website wouldn’t have helped anyway. It was time to take matters into my own hands.

As a computer scientist, I’ve learned to think like an adversary. For most automated systems, there’s usually a way to exploit some sort of flaw. I just had to figure out the weakness in Autoblogger Pro. In this case, the weakness is that everything is automated via RSS scraping. If they can’t get my RSS feed, they can’t scrape it.

[Screenshot: dnslookup.jpg]

The first thing I had to do was find their IP address. I used an online DNS lookup tool to find the IP of the website. Note that the website’s IP is not necessarily the one that’s pulling the feed. It was in my case, though.
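If you'd rather not use a web-based tool, the same lookup can be sketched in a few lines of Python. This is only an illustration: `resolve_ipv4` is a name I made up, and `scraper.example` in the usage comment is a placeholder, not the actual scraper's hostname.

```python
import socket


def resolve_ipv4(hostname, resolver=socket.gethostbyname_ex):
    """Return the sorted, de-duplicated IPv4 addresses for a hostname.

    The resolver argument exists only so the lookup can be swapped out
    (for example in a test); by default it uses the system resolver.
    """
    _canonical_name, _aliases, addresses = resolver(hostname)
    return sorted(set(addresses))


# Usage (hypothetical hostname, needs network access):
#   print(resolve_ipv4("scraper.example"))
```

As noted above, the address this returns belongs to the web server, which may or may not be the machine that actually fetches your feed.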

[Screenshot: log.jpg]

Just to be safe, I checked out my logs to see if that IP was pulling my feed. Bingo! They were pulling the feed every few hours.
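I just eyeballed the logs, but here is a rough Python sketch of the same check. It assumes your server writes the usual Apache common/combined log format and that the feed lives at `/feed`. The sample lines are made up and use documentation-only addresses, so swap in your own log file, IP prefix, and feed path.

```python
import re

# Start of a common/combined log line: IP, identd, user, [time], "METHOD path ..."
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*"')

# Made-up sample data (addresses from the reserved documentation ranges).
SAMPLE_LOG = [
    '192.0.2.10 - - [20/Jun/2006:03:00:01 -0400] "GET /feed HTTP/1.1" 200 5120',
    '198.51.100.7 - - [20/Jun/2006:03:05:12 -0400] "GET /feed HTTP/1.1" 200 5120',
    '192.0.2.10 - - [20/Jun/2006:07:00:03 -0400] "GET /feed HTTP/1.1" 200 5120',
    '192.0.2.10 - - [20/Jun/2006:07:00:04 -0400] "GET /about HTTP/1.1" 200 812',
    'not a log line',
]


def feed_hits(lines, ip_prefix, feed_path):
    """Return (ip, timestamp) for each request to feed_path from ip_prefix."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip anything that is not a recognizable log line
        ip, timestamp, _method, path = match.groups()
        # Ignore any query string when comparing the requested path.
        if ip.startswith(ip_prefix) and path.split("?", 1)[0] == feed_path:
            hits.append((ip, timestamp))
    return hits


for ip, timestamp in feed_hits(SAMPLE_LOG, "192.0.2.", "/feed"):
    print(ip, timestamp)
```

Keep the trailing dot in the prefix (`"192.0.2."`), otherwise it would also match addresses like `192.0.20.x`.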

In order to stop them from accessing the feed, all I’d have to do is deny that IP. It’s pretty simple with Apache’s rewrite engine in .htaccess. Now, you could simply throw a 403 Forbidden, but where’s the fun in that? Sure, your site isn’t being scraped anymore, but what about all the other sites whose content is being stolen? Someone has to stand up to these bullies!

I decided to make a fake RSS feed to redirect to. This one would have 1000 entries in it, each 1 minute apart from the last. This would muddy up their site and keep anyone from actually seeing the stolen content. The fake feed would be auto-generated at the time of request, so each time they pulled the feed, the entries would look recent. I wrote this in my Rails application, but it would probably be just as easy in PHP. Then I made my .htaccess redirect requests coming from that IP’s address range to the fake RSS feed.

RewriteEngine On

# Match any client in 209.200.12.x (dots escaped so they match literally)
RewriteCond %{REMOTE_ADDR} ^209\.200\.12\.

# Placeholder URL, not really the fake feed
RewriteRule .* http://www.thebadrss.com/feed [R,L]

Voila! After waiting a while for the scraper to pick up the feed, I was happy to see the results. You’ll notice that I inserted random numbers into each entry, just to make sure they were all unique. I thought that maybe the software could detect duplicate entries.
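My actual generator lives in my Rails app, so this is not that code. It's just a minimal Python sketch of the same idea: 1000 entries, one minute apart, stamped relative to the time of the request, each with a random number so no two entries look alike. The function name, the `example.com` links, and the titles are all placeholders I made up, and listing the newest entry first is an assumption.

```python
import random
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def build_fake_feed(entries=1000, now=None, rng=None):
    """Build an RSS 2.0 document of junk entries, newest first.

    Each entry is dated one minute before the previous one, counting
    back from `now`, and carries a random number to keep it unique.
    """
    now = now or datetime.now(timezone.utc)
    rng = rng or random.Random()
    items = []
    for i in range(entries):
        published = now - timedelta(minutes=i)
        token = rng.randrange(10**9)  # random number to defeat duplicate checks
        items.append(
            "<item>"
            f"<title>Fake entry {token}</title>"
            f"<link>http://example.com/fake/{token}</link>"
            # The index in the guid guarantees uniqueness even if tokens collide.
            f'<guid isPermaLink="false">fake-{i}-{token}</guid>'
            f"<pubDate>{format_datetime(published)}</pubDate>"
            f"<description>Fake entry {token}</description>"
            "</item>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<rss version="2.0"><channel>'
        "<title>Fake feed</title>"
        "<link>http://example.com/</link>"
        "<description>Placeholder channel</description>"
        + "".join(items)
        + "</channel></rss>"
    )
```

Serve the returned string from whatever URL your rewrite rule points at, with an XML content type. Since it's rebuilt on every request, the scraper always sees a fresh batch of "new" posts.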

This might not work for all content stealers, but if enough people start spamming the spammers, maybe they’ll stop. I’d really love to see the look on the face of the guy running that site when he sees what’s become of it!

Full Disclosure: I actually run an RSS aggregator of sorts, myself. It’s called Anime Nano. There are a few differences, though. First off, it’s opt-in. Second, it’s a community of readers and bloggers who like anime, not a one-man profit machine. Third, visit it at animenano.com!