By Tom Le, Director of Research and Development, BT Managed Security Solutions Group
Organizations store an increasing amount of copyrighted and company-sensitive information online. Everything from pricing to stock levels can be derived from websites that, though aimed at customers, can also be mined by other organizations for competitive information.
In the past, this kind of competitive analysis meant sending staff into competitors' premises to check everything from prices to merchandise. The process was labor-intensive, slow and arduous. In the last decade, however, these covert activities have become much easier: automated programs such as robots and spiders can repeat queries to obtain quotes or gauge stock levels. High volumes of automated queries can also increase infrastructure and bandwidth costs and degrade performance for legitimate users.
The automation technology is widely available, and all an attacker needs to lease a botnet and run that automation on hundreds or thousands of autonomous nodes is a credit card and web access to the leased command-and-control system.
Detecting Automated Bot Activity
On the face of it, companies are powerless to stop the competitive “spying” that goes on over the web, but analyzing logs lets them build a picture of who and what is gathering information from corporate websites. Unusually high volumes of activity and other anomalous behavior can be identified by analyzing web server logs for patterns of automated usage that are clearly “not human” activity.
Some bot masters are aware that their targets may be monitoring for anomalous behavior, such as a high volume of queries. The bot client can try to defeat such monitoring by deliberately limiting the number of requests per time interval. This works especially well with a distributed bot attack where the bot master has control of hundreds or thousands of nodes. For example, instead of querying as fast as possible, the bot may be programmed to query once every 5 seconds, or with any other arbitrary delay. Some bots even include random sleep timeouts to evade detection.
Fortunately, most automated, distributed attacks – even those attempting evasion by masking their volume, frequency or pattern of access – can be discovered utilizing statistical analysis of web access. For example, the volume and frequency of HTTP GETs and POSTs can be analyzed on a statistical basis to determine the probability that any set of access activity is a human or automated user.
The key advantage of a statistical model is that it requires no knowledge of the underlying web application, URLs or form parameters. The underlying assumption is that a normal human user will submit only a predictable volume and frequency of HTTP GETs or POSTs. By combining statistical tests that establish average user activity from observed behavior with common-sense parameters for behavior that is clearly non-human (e.g., a human user would never perform a particular query X times in Y seconds), non-human activity can be identified automatically from web server logs.
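The two ideas above, a hard "X times in Y seconds" rule and a statistical comparison against typical user activity, can be sketched in Python as follows. This is an illustrative sketch only, not the author's method: it assumes the web server log has already been parsed into (client_id, unix_timestamp) pairs, the function names are hypothetical, and the modified z-score based on the median absolute deviation is just one common robust outlier test chosen for the example.

```python
from collections import defaultdict, deque
from statistics import median


def burst_clients(events, max_requests, window_seconds):
    """Return clients that made more than max_requests within any
    sliding window of window_seconds (the "X times in Y seconds" rule).

    events: iterable of (client_id, unix_timestamp) pairs (assumed format).
    """
    recent = defaultdict(deque)
    flagged = set()
    for client, ts in sorted(events, key=lambda e: e[1]):
        window = recent[client]
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while ts - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(client)
    return flagged


def volume_outliers(events, threshold=3.5):
    """Return clients whose total request count is far above the typical
    client, using a modified z-score (median and MAD)."""
    counts = defaultdict(int)
    for client, _ in events:
        counts[client] += 1
    if not counts:
        return set()
    values = list(counts.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No measurable spread between clients, so this test cannot
        # rank anyone as an outlier; rely on the hard rule instead.
        return set()
    return {c for c, n in counts.items()
            if 0.6745 * (n - med) / mad > threshold}
```

Neither function needs to know anything about the URLs or form parameters being requested, which is the property the statistical approach relies on. The thresholds (max_requests, window_seconds, threshold) are placeholders that would need tuning against real traffic.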
There are still several methods to detect the non-human web user when the bot attacker is attempting to mimic human behavior to evade detection, or when the nature of the attack does not lend itself to volume or frequency analysis. Since bots crawl the website in order to mimic a human user, a “honeypot” can be created. To deploy a honeypot effectively, the values that a bot is meant to access need to look normal to the bot but be invisible to the human user. This can include hidden form values, CSS layout schemes, challenge/response values, and JavaScript functions. All of these are invisible to a normal human user and are handled by the browser automatically, whereas an automated bot or parser would submit incorrect or missing values.
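A minimal server-side check for two of these honeypot techniques might look like the Python sketch below. The field names ("website" as a trap field hidden with CSS, "js_token" as a value that page JavaScript is assumed to copy into the form), the secret key and the function names are all hypothetical choices for illustration, not part of any particular framework.

```python
import hashlib
import hmac

# Hypothetical server-side secret; a real deployment would load this
# from configuration rather than hard-coding it.
SECRET_KEY = b"replace-with-a-real-secret"


def expected_token(session_id: str) -> str:
    """Value the page's JavaScript is assumed to place in the form."""
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()


def looks_automated(form: dict, session_id: str) -> bool:
    """Return True if a submitted form shows signs of a non-browser client."""
    # Trap field: hidden from people with CSS, so a human leaves it empty.
    # A parser that fills in every field it finds will give itself away.
    if form.get("website", "") != "":
        return True
    # Token field: set by JavaScript in a normal browser. A client that
    # does not run scripts submits a missing or incorrect value.
    token = form.get("js_token", "")
    return not hmac.compare_digest(token.encode(), expected_token(session_id).encode())
```

A browser-based user never sees either field, so the check adds no friction for legitimate visitors. A result of True is best treated as one signal among several rather than as proof of automation.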
Deploying Countermeasures to Stop Automated Bot Activity
Once automated bot activity can be detected, various countermeasures can be deployed to prevent or mitigate its impact. Each type of countermeasure carries some cost and risk, and each differs in effectiveness and duration. Effectiveness is how much the countermeasure reduces or eliminates the impact of the bot activity; duration is how long the countermeasure can be expected to last before the attack behavior changes.
There are six general types of countermeasures that can be deployed:
Block – A blocking countermeasure prevents the bot client from connecting. This is the most effective countermeasure as it eliminates all activity. However, it is usually immediately noticed by the bot master and can lead to an on-going cat and mouse game.
Delay – A delay countermeasure deliberately slows the response provided to a query from a bot. This has the advantage of giving the bot what it is looking for, but at a slow pace that minimizes impact on the web server environment and does not alert the bot master.
Confuse – A confusion countermeasure provides the bot client incorrect data. This does not have to be purposefully bad data; for example, it could simply be delayed or cached data that minimizes impact on the web server environment. In other cases, a confusion countermeasure could provide a bot an “infinite link loop” where it will continue to crawl to what it thinks is a new URL but is just an endless loop that keeps the bot busy with little impact on the web server.
Reduce – A reduction countermeasure gives the bot the kind of data it is asking for, but only a limited amount of it. For example, if a bot initiates a session by querying all items in inventory, a reduction countermeasure could return just a few items rather than the complete list. Another example would be to limit the number of inquiries per unique user, e.g., only X transactions per hour are allowed.
Deflect – A deflection countermeasure redirects a bot’s queries to an environment separate from production to minimize impact on production resources, which results in behavior similar to a “delay” countermeasure. In addition, the separate environment could contain staged data, similar to a “confusion” countermeasure. The key benefit of deflection over delay or confusion is that once the attacker has been redirected to a separate environment, tests can be conducted against the attacking client to see how it reacts to changes in the web application. For example, how well can the attacking application parse new JavaScript functionality?
Verification – A verification countermeasure verifies that either the human user or the web client is legitimate. To verify the human user, a Turing test such as CAPTCHA can be used. To verify the web client, HTML and JavaScript can be used such that a normal web browser can successfully (and automatically, with no user action required) interact with the web application, but a bot client will often (if not always) fail. This acts as a “block” countermeasure in that illegitimate bot users who are not using a normal web browser client will be unable to navigate the website.
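To make the differences between the six countermeasures concrete, the Python sketch below maps each one to a simple response plan that a web application could act on. Everything here is an assumption made for illustration: the Plan fields, the "decoy" backend name, the default delay and the item limit are hypothetical, and a real deployment would implement each branch inside its own web server or proxy.

```python
from dataclasses import dataclass, field
from enum import Enum


class Countermeasure(Enum):
    BLOCK = "block"
    DELAY = "delay"
    CONFUSE = "confuse"
    REDUCE = "reduce"
    DEFLECT = "deflect"
    VERIFY = "verify"


@dataclass
class Plan:
    """How to answer one request from a suspected bot (hypothetical shape)."""
    status: int = 200
    delay_seconds: float = 0.0
    backend: str = "production"
    items: list = field(default_factory=list)
    challenge: bool = False


def plan_response(measure, items, stale_items=(), max_items=3, delay_seconds=5.0):
    """Translate a chosen countermeasure into a response plan."""
    if measure is Countermeasure.BLOCK:
        return Plan(status=403)                      # refuse the client outright
    if measure is Countermeasure.DELAY:
        return Plan(items=list(items), delay_seconds=delay_seconds)
    if measure is Countermeasure.CONFUSE:
        return Plan(items=list(stale_items))         # cached or stale data
    if measure is Countermeasure.REDUCE:
        return Plan(items=list(items)[:max_items])   # only part of the result
    if measure is Countermeasure.DEFLECT:
        return Plan(backend="decoy")                 # serve from a separate environment
    if measure is Countermeasure.VERIFY:
        return Plan(challenge=True)                  # e.g. present a CAPTCHA first
    raise ValueError(f"unknown countermeasure: {measure!r}")
```

Keeping the decision (which countermeasure) separate from the mechanics (how the response is produced) makes it easier to switch countermeasures as the attack behavior changes.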
Mitigating competitive spying is far more feasible than most organizations realize. While it does take specialized skills, knowledge and monitoring that goes beyond the conventional and applies real analysis, the six countermeasures listed here offer options for taking the first step.