Stopping scripters from slamming your website hundreds of times a second
Stopping scripters from slamming your website hundreds of times a second
[update] I've accepted an answer, as lc deserves the bounty due to the well thought-out answer, but sadly, I believe we're stuck with our original worst case scenario: CAPTCHA everyone on purchase attempts of the crap. Short explanation: caching / web farms make it impossible for us to actually track hits, and any workaround (sending a non-cached web-beacon, writing to a unified table, etc.) slows the site down worse than the bots would. There is likely some pricey bit of hardware from Cisco or the like that can help at a high level, but it's hard to justify the cost if CAPTCHAing everyone is an alternative. I'll attempt to do a more full explanation in here later, as well as cleaning this up for future searchers (though others are welcome to try, as it's community wiki).
I've added bounty to this question and attempted to explain why the current answers don't fit our needs. First, though, thanks to all of you who have thought about this, it's amazing to have this collective intelligence to help work through seemingly impossible problems.
I'll be a little more clear than I was before: This is about the bag o' crap sales on woot.com. I'm the president of Woot Workshop, the subsidiary of Woot that does the design, writes the product descriptions, podcasts, blog posts, and moderates the forums. I work in the css/html world and am only barely familiar with the rest of the developer world. I work closely with the developers and have talked through all of the answers here (and many other ideas we've had).
Usability of the site is a massive part of my job, and making the site exciting and fun is most of the rest of it. That's where the three goals below derive. CAPTCHA harms usability, and bots steal the fun and excitement out of our crap sales.
To set up the scenario a little more, bots are slamming our front page tens of times a second screenscraping (and/or scanning our rss) for the Random Crap sale. The moment they see that, it triggers a second stage of the program that logs in, clicks I want One, fills out the form, and buys the crap.
In current (2/6/2009) order of votes:
lc: On stackoverflow and other sites that use this method, they're almost always dealing with authenticated (logged in) users, because the task being attempted requires that.
On Woot, anonymous (non-logged) users can view our home page. In other words, the slamming bots can be non-authenticated (and essentially non-trackable except by IP address). So we're back to scanning for IPs, which a) is fairly useless in this age of cloud networking and spambot zombies and b) catches too many innocents given the number of businesses that come from one IP address (not to mention the issues with non-static IP ISPs and potential performance hits to trying to track this).
Oh, and having people call us would be the worst possible scenario. Can we have them call you?
BradC Ned Batchelder's methods look pretty cool, but they're pretty firmly designed to defeat bots built for a network of sites. Our problem is bots are built specifically to defeat our site. Some of these methods could likely work for a short time until the scripters evolved their bots to ignore the honeypot, screenscrape for nearby label names instead of form ids, and use a javascript-capable browser control.
lc again "Unless, of course, the hype is part of you
Answer by Matt Boehm for Stopping scripters from slamming your website hundreds of times a second
You could try to make the price harder for scripts to read. This is achieved most simply by converting it to an image, but a text recognition algorithm could still get around this. If enough scripters get around it, you could try applying captcha-like things to this image, but obviously at the cost of user experience. Instead of an image, the price could go in a flash app.
Alternately, you could try to devise a way to "shuffle" the HTML pf a page in some way that doesn't affect the rendering. I can't think of a good example off the top of my head, but I'm sure it's somehow doable.
Answer by lc. for Stopping scripters from slamming your website hundreds of times a second
How about implementing something like SO does with the CAPTCHAs?
If you're using the site normally, you'll probably never see one. If you happen to reload the same page too often, post successive comments too quickly, or something else that triggers an alarm, make them prove they're human. In your case, this would probably be constant reloads of the same page, following every link on a page quickly, or filling in an order form too fast to be human.
If they fail the check x times in a row (say, 2 or 3), give that IP a timeout or other such measure. Then at the end of the timeout, dump them back to the check again.
Since you have unregistered users accessing the site, you do have only IPs to go on. You can issue sessions to each browser and track that way if you wish. And, of course, throw up a human-check if too many sessions are being (re-)created in succession (in case a bot keeps deleting the cookie).
As far as catching too many innocents, you can put up a disclaimer on the human-check page: "This page may also appear if too many anonymous users are viewing our site from the same location. We encourage you to register or login to avoid this." (Adjust the wording appropriately.)
Besides, what are the odds that X people are loading the same page(s) at the same time from one IP? If they're high, maybe you need a different trigger mechanism for your bot alarm.
Edit: Another option is if they fail too many times, and you're confident about the product's demand, to block them and make them personally CALL you to remove the block.
Having people call does seem like an asinine measure, but it makes sure there's a human somewhere behind the computer. The key is to have the block only be in place for a condition which should almost never happen unless it's a bot (e.g. fail the check multiple times in a row). Then it FORCES human interaction - to pick up the phone.
In response to the comment of having them call me, there's obviously that tradeoff here. Are you worried enough about ensuring your users are human to accept a couple phone calls when they go on sale? If I were so concerned about a product getting to human users, I'd have to make this decision, perhaps sacrificing a (small) bit of my time in the process.
Since it seems like you're determined to not let bots get the upper hand/slam your site, I believe the phone may be a good option. Since I don't make a profit off your product, I have no interest in receiving these calls. Were you to share some of that profit, however, I may become interested. As this is your product, you have to decide how much you care and implement accordingly.
The other ways of releasing the block just aren't as effective: a timeout (but they'd get to slam your site again after, rinse-repeat), a long timeout (if it was really a human trying to buy your product, they'd be SOL and punished for failing the check), email (easily done by bots), fax (same), or snail mail (takes too long).
You could, of course, instead have the timeout period increase per IP for each time they get a timeout. Just make sure you're not punishing true humans inadvertently.
Answer by Robert Venables for Stopping scripters from slamming your website hundreds of times a second
I say expose the price information using an API. This is the unintuitive solution but it does work to give you control over the situation. Add some limitations to the API to make it slightly less functional than the website.
You could do the same for ordering. You could experiment with small changes to the API functionality/performance until you get the desired effect.
There are proxies and botnets to defeat IP checks. There are captcha reading scripts that are extremely good. There are even teams of workers in India who defeat captchas for a small price. Any solution you can come up with can be reasonably defeated. Even Ned Batchelder's solutions can be stepped past by using a WebBrowser control or other simulated browser combined with a botnet or proxy list.
Answer by BradC for Stopping scripters from slamming your website hundreds of times a second
Take a look at this article by ned Batchelder here. His article is about stopping spambots, but the same techniques could easily apply to your site.
Rather than stopping bots by having people identify themselves, we can stop the bots by making it difficult for them to make a successful post, or by having them inadvertently identify themselves as bots. This removes the burden from people, and leaves the comment form free of visible anti-spam measures.
This technique is how I prevent spambots on this site. It works. The method described here doesn't look at the content at all.
Some other ideas:
- Create an official auto-notify mechanism (RSS feed? Twitter?) that people can subscribe to when your product goes on sale. This reduces the need for people to make scripts.
- Change your obfuscation technique right before a new item goes on sale. So even if the scripters can escalate the arms race, they are always a day behind.
EDIT: To be totally clear, Ned's article above describe methods to prevent the automated PURCHASE of items by preventing a BOT from going through the forms to submit an order. His techniques wouldn't be useful for preventing bots from screen-scraping the home page to determine when a Bandoleer of Carrots comes up for sale. I'm not sure preventing THAT is really possible.
With regard to your comments about the effectiveness of Ned's strategies: Yes, he discusses honeypots, but I don't think that's his strongest strategy. His discussion of the SPINNER is the original reason I mentioned his article. Sorry I didn't make that clearer in my original post:
The spinner is a hidden field used for a few things: it hashes together a number of values that prevent tampering and replays, and is used to obscure field names. The spinner is an MD5 hash of:
- The timestamp,
- The client's IP address,
- The entry id of the blog entry being commented on, and
- A secret.
Here is how you could implement that at WOOT.com:
Change the "secret" value that is used as part of the hash each time a new item goes on sale. This means that if someone is going to design a BOT to auto-purchase items, it would only work until the next item comes on sale!!
Even if someone is able to quickly re-build their bot, all the other actual users will have already purchased a BOC, and your problem is solved!
The other strategy he discusses is to change the honeypot technique from time to time (again, change it when a new item goes on sale):
- Use CSS classes (randomized of course) to set the fields or a containing element to display:none.
- Color the fields the same (or very similar to) the background of the page.
- Use positioning to move a field off of the visible area of the page.
- Make an element too small to show the contained honeypot field.
- Leave the fields visible, but use positioning to cover them with an obscuring element.
- Use Javascript to effect any of these changes, requiring a bot to have a full Javascript engine.
- Leave the honeypots displayed like the other fields, but tell people not to enter anything into them.
I guess my overall idea is to CHANGE THE FORM DESIGN when each new item goes on sale. Or at LEAST, change it when a new BOC goes on sale.
Which is what, a couple times/month?
If you accept this answer, will you give me a heads-up on when the next one is due? :)
Answer by lc. for Stopping scripters from slamming your website hundreds of times a second
Disclaimer: This answer is completely non-programming-related. It does, however, try to attack the reason for scripts in the first place.
Another idea is if you truly have a limited quantity to sell, why don't you change it from a first-come-first-served methodology? Unless, of course, the hype is part of your marketing scheme.
There are many other options, and I'm sure others can think of some different ones:
an ordering queue (pre-order system) - Some scripts might still end up at the front of the queue, but it's probably faster to just manually enter the info.
a raffle system (everyone who tries to order one is entered into the system) - This way the people with the scripts have just the same chances as those without.
a rush priority queue - If there is truly a high perceived value, people may be willing to pay more. Implement an ordering queue, but allow people to pay more to be placed higher in the queue.
auction (credit goes to David Schmitt for this one, comments are my own) - People can still use scripts to snipe in at the last minute, but not only does it change the pricing structure, people are expecting to be fighting it out with others. You can also do things to restrict the number of bids in a given time period, make people phone in ahead of time for an authorization code, etc.
Answer by Oli for Stopping scripters from slamming your website hundreds of times a second
Time-block user agents that make so-many requests per minute. Eg if you've got somebody requesting a page exactly every 5 seconds for 10 minutes, they're probably not a user... But it could be tricky to get this right.
If they trigger an alert, redirect every request to a static page with as little DB-IO as possible with a message letting them know they'll be allowed back on in X minutes.
It's important to add that you should probably only apply this on requests for pages and ignore all the requests for media (js, images, etc).
Answer by Paul Dixon for Stopping scripters from slamming your website hundreds of times a second
How about introducing a delay which requires human interaction, like a sort of "CAPTCHA game". For example, it could be a little Flash game where during 30 seconds they have to burst checkered balls and avoid bursting solid balls (avoiding colour blindness issues!). The game would be given a random number seed and what the game transmits back to the server would be the coordinates and timestamps of the clicked points, along with the seed used.
On the server you simulate the game mechanics using that seed to see if the clicks would indeed have burst the balls. If they did, not only were they human, but they took 30 seconds to validate themselves. Give them a session id.
You let that session id do what it likes, but if makes too many requests, they can't continue without playing again.
Answer by falstro for Stopping scripters from slamming your website hundreds of times a second
There are a few other / better solutions already posted, but for completeness, I figured I'd mention this:
If your main concern is performance degradation, and you're looking at true hammering, then you're actually dealing with a DoS attack, and you should probably try to handle it accordingly. One common approach is to simply drop packets from an IP in the firewall after a number of connections per second/minute/etc. For example, the standard Linux firewall, iptables, has a standard operation matching function 'hashlimit', which could be used to correlate connection requests per time unit to an IP-address.
Although, this question would probably be more apt for the next SO-derivate mentioned on the last SO-podcast, it hasn't launched yet, so I guess it's ok to answer :)
EDIT:
As pointed out by novatrust, there are still ISPs actually NOT assigning IPs to their customers, so effectively, a script-customer of such an ISP would disable all-customers from that ISP.
Answer by Joe Philllips for Stopping scripters from slamming your website hundreds of times a second
- Provide an RSS feed so they don't eat up your bandwidth.
- When buying, make everyone wait a random amount of time of up to 45 seconds or something, depending on what you're looking for exactly. Exactly what are your timing constraints?
- Give everyone 1 minute to put their name in for the drawing and then randomly select people. I think this is the fairest way.
- Monitor the accounts (include some times in the session and store it?) and add delays to accounts that seem like they're below the human speed threshold. That will at least make the bots be programmed to slow down and compete with humans.
Answer by Dave Sherohman for Stopping scripters from slamming your website hundreds of times a second
I'm not seeing the great burden that you claim from checking incoming IPs. On the contrary, I've done a project for one of my clients which analyzes the HTTP access logs every five minutes (it could have been real-time, but he didn't want that for some reason that I never fully understood) and creates firewall rules to block connections from any IP addresses that generate an excessive number of requests unless the address can be confirmed as belonging to a legitimate search engine (google, yahoo, etc.).
This client runs a web hosting service and is running this application on three servers which handle a total of 800-900 domains. Peak activity is in the thousand-hits-per-second range and there has never been a performance issue - firewalls are very efficient at dropping packets from blacklisted addresses.
And, yes, DDOS technology definitely does exist which would defeat this scheme, but he's not seeing that happen in the real world. On the contrary, he says it's vastly reduced the load on his servers.
Answer by Shawn Miller for Stopping scripters from slamming your website hundreds of times a second
Preventing DoS would defeat #2 of @davebug's goals he outlined above, "Keep the site at a speed not slowed by bots" but wouldn't necessary solve #1, "Sell the item to non-scripting humans"
I'm sure a scripter could write something to skate just under the excessive limit that would still be faster than a human could go through the ordering forms.
Answer by Steven A. Lowe for Stopping scripters from slamming your website hundreds of times a second
Q: How would you stop scripters from slamming your site hundreds of times a second?
A: You don't. There is no way to prevent this behavior by external agents.
You could employ a vast array of technology to analyze incoming requests and heuristically attempt to determine who is and isn't human...but it would fail. Eventually, if not immediately.
The only viable long-term solution is to change the game so that the site is not bot-friendly, or is less attractive to scripters.
How do you do that? Well, that's a different question! ;-)
...
OK, some options have been given (and rejected) above. I am not intimately familiar with your site, having looked at it only once, but since people can read text in images and bots cannot easily do this, change the announcement to be an image. Not a CAPTCHA, just an image -
- generate the image (cached of course) when the page is requested
- keep the image source name the same, so that doesn't give the game away
- most of the time the image will have ordinary text in it, and be aligned to appear to be part of the inline HTML page
- when the game is 'on', the image changes to the announcement text
- the announcement text reveals a url and/or code that must be manually entered to acquire the prize. CAPTCHA the code if you like, but that's probably not necessary.
- for additional security, the code can be a one-time token generated specifically for the request/IP/agent, so that repeated requests generate different codes. Or you can pre-generate a bunch of random codes (a one-time pad) if on-demand generation is too taxing.
Run time-trials of real people responding to this, and ignore ('oops, an error occurred, sorry! please try again') responses faster than (say) half of this time. This event should also trigger an alert to the developers that at least one bot has figured out the code/game, so it's time to change the code/game.
Continue to change the game periodically anyway, even if no bots trigger it, just to waste the scripters' time. Eventually the scripters should tire of the game and go elsewhere...we hope ;-)
One final suggestion: when a request for your main page comes in, put it in a queue and respond to the requests in order in a separate process (you may have to hack/extend the web server to do this, but it will likely be worthwhile). If another request from the same IP/agent comes in while the first request is in the queue, ignore it. This should automatically shed the load from the bots.
EDIT: another option, aside from use of images, is to use javascript to fill in the buy/no-buy text; bots rarely interpret javascript, so they wouldn't see it
Answer by Phil for Stopping scripters from slamming your website hundreds of times a second
Instead of blocking suspected IPs it may be effective to reduce the amount of data you give to an address as its hits/min goes up. So if the bot hits you up more than a secret randomly changing threshold it will not see the data. Logged in users would always see the data. Logged in users that hit the server too often would be forced to re-authenticate, or be given a captcha.
Answer by Zac Thompson for Stopping scripters from slamming your website hundreds of times a second
I don't know how feasible this is: ... go on the offensive.
Figure out what data the bots are scanning for. Feed them the data that they're looking for when you're NOT selling the crap. Do this in a way that won't bother or confuse human users. When the bots trigger phase two, they'll log in and fill out the form to buy $100 roombas instead of BOC. Of course, this assumes that the bots are not particularly robust.
Another idea is to implement random price drops over the course of the bag o crap sale period. Who would buy a random bag o crap for $150 when you CLEARLY STATE that it's only worth $20? Nobody but overzealous bots. But then 9 minutes later it's $35 dollars ... then 17 minutes later it's $9. Or whatever.
Sure, the zombie kings would be able to react. The point is to make their mistakes become very costly for them (and to make them pay you to fight them).
All of this assumes you want to piss off some bot lords, which may not be 100% advisable.
Answer by 1800 INFORMATION for Stopping scripters from slamming your website hundreds of times a second
All right so the spammers are out competing regular people to win the "bog of crap" auction? Why not make the next auction be a literal "bag of crap"? The spammers get to pay good money for a bag full of doggy do, and we all laugh at them.
Answer by DaveC for Stopping scripters from slamming your website hundreds of times a second
The solution to this may be to attach a little bit of client side processing to actions of logging in and buying. The processing can be a negligible amount so that individuals are not affected but bots attempting to do the tasks many times will be hampered by the extra work load.
The processing can be a simple equation to solve done in javascript, unless you don't want to have to require javascript on your site.
Answer by Friedrich for Stopping scripters from slamming your website hundreds of times a second
Hm I remember having read "Linux Firewalls" Attack Detection and Response with ... The situations there seem to be very comparable. And someone else has suggested that also. Just block a client temporarily or in progressive steps to throttle them down. If it's realyl from a few sites this must be quite efficient
Regards
Answer by Christopher Mahan for Stopping scripters from slamming your website hundreds of times a second
You need to figure a way to make the bots buy stuff that is massively overpriced: 12mm wingnut: $20. See how many the bots snap up before the script-writers decide you're gaming them.
Use the profits to buy more servers and pay for bandwidth.
Answer by Adam Davis for Stopping scripters from slamming your website hundreds of times a second
The method Woot uses to combat this issue is changing the game - literally. When they present an extraordinarily desirable item for sale, they make users play a video game in order to order it.
Not only does that successfully combat bots (they can easily make minor changes to the game to avoid automatic players, or even provide a new game for each sale) but it also gives the impression to users of "winning" the desired item while slowing down the ordering process.
It still sells out very quickly, but I think that the solution is good - re-evaluating the problem and changing the parameters led to a successful strategy where strictly technical solutions simply didn't exist.
Your entire business model is based on "first come, first served." You can't do what the radio stations did (they no longer make the first caller the winner, they make the 5th or 20th or 13th caller the winner) - it doesn't match your primary feature.
No, there is no way to do this without changing the ordering experience for the real users.
Let's say you implement all these tactics. If I decide that this is important, I'll simply get 100 people to work with me, we'll build software to work on our 100 separate computers, and hit your site 20 times a second (5 seconds between accesses for each user/cookie/account/IP address).
You have two stages:
- Watching front page
- Ordering
You can't put a captcha blocking #1 - that's going to lose real customers ("What? I have to solve a captcha each time I want to see the latest woot?!?").
So my little group watches, timed together so we get about 20 checks per second, and whoever sees the change first alerts all the others (automatically), who will load the front page once again, follow the order link, and perform the transaction (which may also happen automatically, unless you implement captcha and change it for every wootoff/boc).
You can put a captcha in front of #2, and while you're loathe to do it, that may be the only way to make sure that even if bots watch the front page, real users are getting the products.
But even with captcha my little band of 100 would still have a significant first mover advantage - and there's no way you can tell that we aren't humans. If you start timing our accesses, we'd just add some jitter. We could randomly select which computer was to refresh so the order of accesses changes constantly - but still looks enough like a human.
First, get rid of the simple bots
You need to have an adaptive firewall that will watch requests and if someone is doing the obvious stupid thing - refreshing more than once a second at the same IP then employ tactics to slow them down (drop packets, send back refused or 500 errors, etc).
This should significantly drop your traffic and alter the tactics the bot users employ.
Second, make the server blazingly fast.
You really don't want to hear this... but...
I think what you need is a fully custom solution from the bottom up.
You don't need to mess with TCP/IP stack, but you may need to develop a very, very, very fast custom server that is purpose built to correlate user connections and react appropriately to various attacks.
Apache, lighthttpd, etc are all great for being flexible, but you run a single purpose website, and you really need to be able to both do more than the current servers are capable of doing (both in handling traffic, and in appropriately combating bots).
By serving a largely static webpage (updates every 30 seconds or so) on a custom server you should not only be able to handle 10x the number of requests and traffic (because the server isn't doing anything other than getting the request, and reading the page from memory into the TCP/IP buffer) but it will also give you access to metrics that might help you slow down bots. For instance, by correlating IP addresses you can simply block more than one connection per second per IP. Humans can't go faster than that, and even people using the same NATed IP address will only infrequently be blocked. You'd want to do a slow block - leave the connection alone for a full second before officially terminating the session. This can feed into a firewall to give longer term blocks to especially egregious offenders.
But the reality is that no matter what you do, there's no way to tell a human apart from a bot when the bot is custom built by a human for a single purpose. The bot is merely a proxy for the human.
Conclusion
At the end of the day, you can't tell a human and a computer apart for watching the front page. You can stop bots at the ordering step, but the bot users still have a first mover advantage, and you still have a huge load to manage.
You can add blocks for the simple bots, which will raise the bar and fewer people with bother with it. That may be enough.
But without changing your basic model, you're out of luck. The best you can do is take care of the simple cases, make the server so fast regular users don't notice, and sell so many items that even if you have a few million bots, as many regular users as want them will get them.
You might consider setting up a honeypot and marking user accounts as bot users, but that will have a huge negative community backlash.
Every time I think of a "well, what about doing this..." I can always counter it with a suitable bot strategy.
Even if you make the front page a captcha to get to the ordering page ("This item's ordering button is blue with pink sparkles, somewhere on this page") the bots will simply open all the links on the page, and use whichever one comes back with an ordering page. That's just no way to win this.
Make the servers fast, put in a reCaptcha (the only one I've found that can't be easily fooled, but it's probably way too slow for your application) on the ordering page, and think about ways to change the model slightly so regular users have as good a chance as the bot users.
-Adam
Answer by abelenky for Stopping scripters from slamming your website hundreds of times a second
My solution would be to make screen-scraping worthless by putting in a roughly 10 minute delay for 'bots and scripts.
Here's how I'd do it:
- Log and identify any repeat hitters.
You don't need to log every IP address on every hit. Only track one out of every 20 hits or so. A repeat offender will still show up in a randomized occassional tracking.
Keep a cache of your page from about 10-minutes earlier.
When a repeat-hitter/bot hits your site, give them the 10-minute old cached page.
They won't immediately know they're getting an old site. They'll be able to scrape it, and everything, but they won't win any races anymore, because "real people" will have a 10 minute head-start.
Benefits:
- No hassle or problems for users (like CAPTCHAs).
- Implemented fully on server-side. (no reliance on Javascript/Flash)
- Serving up an older, cached page should be less performance intensive than a live page. You may actually decrease the load on your servers this way!
Drawbacks
- Requires tracking some IP addresses
- Requires keeping and maintaining a cache of older pages.
What do you think?
0 comments:
Post a Comment