How To Stop RSS Scrapers From Stealing Your Content. Plus Revenge!

I noticed a while back in my Technorati blog backlinks that a certain blog was reposting all of the articles from my anime blog. It was an RSS scraper that ran on some software called “Autoblogger Pro.” Basically, the site scraped posts from a number of anime blogs and reposted the content with AdSense ads. While I didn’t mind the link back, it still bugged me that the person who set this up was simply profiting from other people’s work. I figured I’d do something about it.

The first thing I noticed was that all comments on the blog were turned off. This makes sense, since every comment would have probably been “stop stealing my content, you jackass!” There wasn’t any contact info or even an “about” page. I decided that contacting the owner of the website wouldn’t have helped anyway. It was time to take matters into my own hands.

As a Computer Scientist, I’ve learned to think as an adversary. For most automated systems, there’s usually a way to exploit some sort of flaw. I just had to figure out the weakness in Autoblogger Pro. In this case, the weakness is that everything is automated via RSS scraping. If they can’t get my RSS feed, they can’t scrape it.

[Image: dnslookup.jpg]

The first thing I had to do was find their IP address. I used an online DNS lookup tool to find the IP of the website. Note that the website’s IP is not necessarily the one that’s pulling the feed. It was in my case, though.
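If you would rather do the lookup yourself than use a web tool, something like the following Python sketch should work. The function name `resolve_host` and the `resolver` parameter are my own inventions for illustration, and the hostname in the usage note is a placeholder, not the scraper's real domain.

```python
import socket


def resolve_host(hostname, resolver=socket.gethostbyname_ex):
    """Return the IPv4 addresses that `hostname` resolves to.

    `resolver` is a hypothetical hook so the lookup can be swapped out
    (for example in tests); by default it uses the system resolver.
    """
    # gethostbyname_ex returns (canonical_name, alias_list, ip_address_list).
    _name, _aliases, addresses = resolver(hostname)
    return list(addresses)
```

Calling `resolve_host("scraper-site.example")` with a real hostname returns a list of address strings, or raises an `OSError` subclass if the name does not resolve. As noted above, the address you get back is the web server's, which may or may not be the machine that fetches your feed.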

[Image: log.jpg]

Just to be safe, I checked out my logs to see if that IP was pulling my feed. Bingo! They were pulling the feed every few hours.
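Checking the logs by eye works, but a few lines of Python can do the counting. This is only a sketch: it assumes Apache's common/combined log layout and a feed path of `/feed`, and the sample lines (including the 203.0.113.x and 198.51.100.x addresses, which come from the ranges reserved for documentation) are invented for illustration, not taken from my real logs.

```python
import re
from collections import Counter

# Assumed log layout: IP identity user [timestamp] "METHOD path protocol" ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)')


def feed_hits(lines, feed_path="/feed"):
    """Count requests per client IP for paths starting with `feed_path`."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2).startswith(feed_path):
            hits[match.group(1)] += 1
    return hits


# Invented sample lines, for illustration only.
sample = [
    '203.0.113.5 - - [01/Jan/2020:02:00:01 +0000] "GET /feed HTTP/1.1" 200 5120',
    '203.0.113.5 - - [01/Jan/2020:05:00:03 +0000] "GET /feed HTTP/1.1" 200 5120',
    '198.51.100.9 - - [01/Jan/2020:05:10:00 +0000] "GET /about HTTP/1.1" 200 900',
]
print(feed_hits(sample))
```

An address that shows up every few hours against the feed path, and matches the IP from the DNS lookup, is a good candidate for the scraper.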

In order to stop them from accessing the feed, all I’d have to do is deny that IP. It’s pretty simple with the rewrite engine in .htaccess. Now, you could simply throw a 403 Forbidden, but where’s the fun in that? Sure, your site isn’t being scraped anymore, but what about all the other sites whose content is being stolen? Someone has to stand up to these bullies!

I decided to make a fake RSS feed to redirect to. This one would have 1000 entries in it, each 1 minute apart from the last. This would muddy up their site and keep anyone from actually seeing the stolen content. The fake feed would be auto-generated at the time of request, so each time they pulled the feed, the entries would be recent. I wrote this in my Rails application, but it would probably be just as easy in PHP. Then I made my .htaccess redirect requests coming from that IP’s address range to the fake RSS feed.

RewriteEngine On
# Match any client address in 209.200.12.x
RewriteCond %{REMOTE_ADDR} ^209\.200\.12\.
# Placeholder URL (not really the fake feed)
RewriteRule .* http://www.thebadrss.com/feed [R,L]
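If you can't edit .htaccess, the same check could in principle be done inside the application before serving the feed. Here is a minimal Python sketch of that idea using the standard `ipaddress` module; the names `SCRAPER_NETWORK` and `is_scraper` are hypothetical, and this is not what I actually ran.

```python
import ipaddress

# 209.200.12.0/24 covers the same 209.200.12.x addresses as the RewriteCond.
SCRAPER_NETWORK = ipaddress.ip_network("209.200.12.0/24")


def is_scraper(remote_addr):
    """Return True if `remote_addr` falls inside the blocked /24 range."""
    try:
        return ipaddress.ip_address(remote_addr) in SCRAPER_NETWORK
    except ValueError:
        # Not a valid IP address string, so it cannot be in the range.
        return False
```

A request handler would call `is_scraper()` with the client address and hand back the fake feed (or a redirect to it) when the result is true.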

Voila! After waiting a while for the scraper to pick up the feed, I was happy to see the results. You’ll notice that I inserted random numbers into each entry, just to make sure they were all unique. I thought that maybe the software could detect duplicate entries.
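For anyone wondering what a generator like this might look like, here is a rough sketch in Python. To be clear, the original was part of a Rails application, so this is not that code: the function name `build_fake_feed`, the `example.com` URLs, and the entry text are all placeholders I made up. It follows the description above: a configurable number of entries (1000 by default), one minute apart, dated relative to the time of the request, with a random number in each.

```python
import random
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
from xml.etree import ElementTree as ET


def build_fake_feed(count=1000, now=None):
    """Return an RSS 2.0 document (bytes) with `count` junk items, one minute apart."""
    now = now or datetime.now(timezone.utc)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # Channel title, link and description are placeholders.
    ET.SubElement(channel, "title").text = "Fake feed"
    ET.SubElement(channel, "link").text = "http://example.com/"
    ET.SubElement(channel, "description").text = "Generated on request"
    for i in range(count):
        token = random.randint(0, 10**9)  # random number so entries look unique
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"Entry {token}"
        ET.SubElement(item, "link").text = f"http://example.com/{i}-{token}"
        ET.SubElement(item, "guid").text = f"http://example.com/{i}-{token}"
        ET.SubElement(item, "description").text = f"Nothing to see here ({token})."
        # The first entry is stamped with the request time; each later one is a minute older.
        stamp = now - timedelta(minutes=i)
        ET.SubElement(item, "pubDate").text = format_datetime(stamp)
    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)
```

Because the dates are computed when the function is called, every pull of the feed gets a fresh batch of "recent" entries, which is what pushes them onto the scraper's front page instead of into its archives.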

This might not work for all content stealers, but if enough people start spamming the spammers, maybe they’ll stop. I’d really love to see the look on the face of the guy running this site when he sees what’s become of it!

Full Disclosure: I actually run an RSS aggregator of sorts, myself. It’s called Anime Nano. There are a few differences, though. First off, it’s opt-in. Second, it’s a community of readers and bloggers who like anime, not a one-man profit machine. Third, visit it at animenano.com!

13 Responses to “How To Stop RSS Scrapers From Stealing Your Content. Plus Revenge!”


  • Good article. It’s good to do this with the .htaccess file.

    If you don’t have this file you can use a WordPress plugin
    http://www.anieto2k.com/2006/06/15/autor-feed-elige-quien-te-sindica/

    Bye.

  • Yay,revenge!

    All you need is a programme that lets you see through someone else’s computer screen to take a snapshot of their face when they see it. LOL.

  • LOL, Congrats, way better than just blocking them …

  • Fantastic post, thank you very much!

    I’ve been looking for a way to shut down RSS scrapers and I love the revenge angle you take.

    Soultrance

  • I think it would be great if someone addressed the problem of identifying the scraping addresses with one-time content poisoning, in case they pull from a different IP address. There must be a module that lets you watermark feeds on the fly. This, of course, puts load on your CPU, but it’s all in a good cause.

  • Hi, thanks for the tip for using network-tools.com to get the scraper’s IP address. I’m using the Antileech plugin which is supposed to do the same thing (feed fake RSS content based on IP address), however it didn’t work for one of the scrapers and I had to block their IP in my htaccess file after reading your post.

    Have you written about how you created your own fake RSS feed? I’d love to learn how you did that.

  • Ugh, I was trying to do this but I don’t get the .htaccess thing T_T These people are stealing, like, everyone’s content:
    http://direct-anime.org/

  • Fantastic article!

    But the true irony of the post is that your Google ads on this page are now displaying sites such as “Easy web data scraping” etc…

    How did you write the auto-generated feed?
    Thanks.

  • Hmm, a few people have asked how I did the fake RSS feed. So I should really write a post about it, huh? I’ll put it on my to-do list.

    Basically, I generated xml and used dates that were close to the current date so that the fake content would show up on the front page instead of the archives. I’ll leave it as an exercise for the reader to try this until I write a post about it.

  • I keep noticing these in my trackbacks. They’re a real pain. One guy was just wholesale reposting my entire content – not just a snippet – and to cap it off he was hotlinking my images. I sent him a strongly worded email (I whois’d his domain name and found his email address), and now the site is a blank page. Guess he got the message.

  • What’s the best RSS scraper service? an online one, please :)

  • ha ha! that’s genius. unfortunately, i don’t know how to do any of that stuff, so i have to foil my scraper with lame methods. i just put a line in my posts linking back to my blog as the source of the material. and i set my feed to “short” when it used to be “full”. but i can’t contact the douchebag, and i can’t leave comments on the stolen posts :(

    on the bright side, i’m getting back links everyday :\

  • Good stuff,
    The reason I came here is, of course, that I was looking for a solution to a similar problem. Currently I’m just going with an IP block, but I may eventually rename the feeds and resubmit them. It’s a pain, but it’s a radically reliable solution.
    Thank you very much for the article
