Let us paint you a picture:
You’ve realized that the Internet is full of valuable data that can help your business, so you’ve decided to leverage it. You’ve learned about data extraction and built your own scraper in Python. All is set: you’ve chosen a web page and sent the bot to work. Then, out of the blue, the website blocks your scraper and won’t let you extract information.
Tough luck but don’t fret, the solution could not be easier.
Scraping data is a popular occurrence for companies today because the gathered information can be used in a variety of ways to improve profitability. One of the most common problems is being blocked during the scraping process. We use a variety of methods to prevent this issue, including IP rotation, the star of today’s article.
But here’s a rather common question: why do websites try to block your bots if you’re extracting data lawfully and ethically? Simple: they don’t know your intentions, and they stand to lose too much by not acting.
Bots have gotten a pretty rotten reputation with site owners because of the many ways in which they have been used as saboteurs, invaders, or general nuisances. The problem with this view is that bots are simply tools. No one is complaining about the bots Google uses to find and index pages. The point is — bots can be both good and bad, depending on how they’re being used.
With that in mind, website owners are somewhat justified in mistrusting bots. There are plenty of ways in which bots cause problems, either intentionally or not:
- They can mess with the analytics of the site. The analytics software doesn’t generally detect visitors that are bots, so it counts them, resulting in skewed reports.
- They can send so many requests that it ends up slowing down the host server, maybe even making the website unavailable to other visitors. This is usually intentional and goes by the name of DDoS attack.
- For websites that rely on ad revenue on their pages, bots can seem like a boon at first, since they generate more money for the site. The problem is that advertising networks are no fools — they’ll notice that some of the ads are being viewed by bots, which is a form of click fraud. Suffice to say, websites don’t want to be accused of that.
- eCommerce websites can have a lot of headaches due to bots. Some scripts buy new products the second they become available so that their creator can resell them at a profit, creating artificial scarcity. Alternatively, bots can mess with the inventory, adding items to a shopping cart and then abandoning it, effectively blocking real shoppers’ access to those products.
In brief, you can’t really blame a website for being wary of bots. Next question, how did they identify you in the first place?
Websites are built for humans (generally speaking), and if one detects a foreign bot, such as a web scraper, it will most likely block it. So the question is: how did the website trace your robot?
For a site to block you, it first has to identify the bot, and it does that by monitoring for unusual surfing behaviour.
Web scrapers are faster than any human, that’s their appeal, but it’s most often also the smoking gun. If you task the bot with scraping ten pages from a website, it will finish the job in less time than it took you to issue it. All the website has to do is see that a single IP sent ten requests faster than any human could and it will identify the bot.
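To make that concrete, here’s a rough, hypothetical sketch of the kind of rate check a website might run on its side. The window size and request limit are made-up numbers for illustration, not anything a specific site actually uses:

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds only: flag any IP that sends more than
# MAX_REQUESTS_PER_WINDOW requests within WINDOW_SECONDS.
WINDOW_SECONDS = 10
MAX_REQUESTS_PER_WINDOW = 5

_history = defaultdict(deque)  # ip -> timestamps of recent requests

def looks_like_a_bot(ip, now=None):
    """Record a request from `ip` and report whether its rate looks inhuman."""
    now = time.monotonic() if now is None else now
    timestamps = _history[ip]
    timestamps.append(now)
    # Forget requests that fell outside the sliding window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    return len(timestamps) > MAX_REQUESTS_PER_WINDOW
```

A single IP firing ten requests in under a second trips a check like this immediately, which is exactly why spreading those requests across many IPs works.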
There are also other ways, the most well known being:
- Browser fingerprinting
- TLS fingerprinting
- Checking the IP on lists of known proxies
There are also other countermeasures to web scrapers, like CAPTCHAs, but those are meant more to stop suspicious behaviour than to detect it.
How to Avoid the IP Excommunicado
The funny thing about avoiding IP blocks is that the more IPs you have, the less likely it is that any of them will get spotted. And, of course, if some of them still get the banhammer, you’ll still have plenty.
So, your first stop is a strong proxy pool. For that, you’ll need to get a reliable proxy pool provider, since it’s the most cost-effective option. Instead of buying IPs, you just pay a monthly fee and get access to hundreds of thousands or even millions of IPs.
Besides the sheer volume of proxies, you’ll also have to take a look at the composition of the proxy pool. Some IPs are more conspicuous than others while some websites are more perceptive. You could use premium proxies for all your scraping, but that would be wasteful, since better proxies cost more money.
What matters is that you have access to all the tools you might need and the knowledge to choose the right one for each situation.
The last piece of the puzzle is rotating the IPs you use. Sticking to the same proxy leads to the problem we presented earlier: a single IP making requests too fast to be human. But, with your proxy pool on hand, you can send each request from a different source. The website no longer sees one hyperactive user, but ten different users surfing at almost the same time.
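As a minimal sketch of that idea (the proxy URLs below are placeholders, assuming a plain `requests`-style workflow), per-request rotation can be as simple as cycling through the pool:

```python
import itertools

# Placeholder proxy addresses -- substitute whatever your provider gives you.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

_rotation = itertools.cycle(PROXY_POOL)

def next_proxy():
    """Return the proxy for the next request, cycling through the pool."""
    return next(_rotation)

# With the `requests` library, each scraping call would then look like:
#   proxy = next_proxy()
#   requests.get(url, proxies={"http": proxy, "https": proxy})
```

Each request goes out through a different address, so no single IP ever shows an inhuman request rate.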
This has been a quick overview of what you’ll have to consider. Now, let’s go into more details on what kind of proxies to get and how to best use them.
Find the Right Disguise
There are plenty of different proxies to choose from and many criteria to consider. At first, the subject can seem very complicated and you might want to throw in the towel, but hang in there! You’ll get the basics down just by reading a cool, informative and humble article, like this one!
First off, let’s talk about anonymity, the primary draw of proxy IPs. It’s not a given: some proxies don’t try to hide your real IP at all. They act as middlemen and nothing more, and these are called transparent proxies. When a request is made through such an IP, one of the headers will notify the website that it is in fact a proxy, while another will send your actual address.
Next, just because you’re using a disguise doesn’t immediately mean that you’re fooling anyone. Anonymous proxies hide your real address, but not the fact that they’re proxies. The request header is what gives you away again. The site won’t know who or where you are, but they’ll know that someone is visiting through a proxy.
Finally, there are high anonymity proxies, also called elite. These are the real deal, as they not only keep your identity secret, but they also refrain from announcing themselves as proxies. Don’t get us wrong, a determined webmaster will identify all proxies, no matter how good the disguise is, but elite proxies still give the best chance to go unnoticed.
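A rough sketch of how a website might tell these three levels apart, assuming the common convention that proxies announce themselves in a header like `Via` and that transparent ones leak the client address in `X-Forwarded-For` (real proxies vary in exactly which headers they set):

```python
def classify_proxy(headers):
    """Guess a request's anonymity level from its headers (simplified)."""
    announces_proxy = "Via" in headers or "Proxy-Connection" in headers
    real_ip_leaked = "X-Forwarded-For" in headers
    if announces_proxy and real_ip_leaked:
        return "transparent"   # says it's a proxy AND forwards your address
    if announces_proxy:
        return "anonymous"     # says it's a proxy but hides your address
    return "elite"             # no obvious proxy hints at all
```

An elite proxy simply keeps both hints out of the request, which is why it gives you the best odds of going unnoticed.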
For web scraping, there are generally two types of advertised proxies: datacenter and residential. Both types of IPs mask your actual address, the difference lies more in their nature.
Datacenter proxies are cloud-based IPs with no real location. Built on modern infrastructure, these proxies are fairly inexpensive and you can get access to a few thousand without breaking the bank. Additionally, datacenter IPs use a good internet connection, so you’ll be able to extract data faster than with other types of proxies. The downside is that their lack of a real location and their shared subnets (part of the IP is the same for all proxies from the same “family”) make datacenter IPs easier to detect and subsequently block.
Residential proxies can be considered the high-quality option because they’re real IPs, provided by real Internet service providers and with real physical locations. In short, they’re nigh-indistinguishable from regular visitors. A proxy pool should have residential IPs from as many different locations as possible to ensure good speeds and access to geo-restricted content. Since they deliver the best results, it’s no surprise that residential proxies also come with higher prices.
Cover Your Tracks
If a proxy does its job well, it will look like your bot’s IP is its genuine address. That’s all well and good, but a proxy can’t hide the way bots work: very fast. So with a single, high-quality proxy, your bot will just get the proxy’s IP blocked and you’ll be back at square one.
If you have several proxies, you can switch to a different one with each request so that the activity of one zealous bot looks like a swarm of different users. If all goes well, none of the IPs get blocked and the web scraper does its job.
You can manually switch proxies but the process is lengthy and frustrating, the opposite of what using robots should be. As such, most web scraping tools worth their salt have automatic proxy rotation features.
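Here’s a hedged sketch of that switch-per-request idea, with a simple retry when a proxy looks blocked. The `fetch` argument is a hypothetical stand-in for whatever HTTP call your scraper actually makes, and the status codes checked are just the usual suspects for blocking:

```python
import random

def scrape_with_rotation(url, proxies, fetch, max_attempts=5):
    """Try to fetch `url`, picking a random proxy each time and
    retiring any proxy that appears blocked (HTTP 403/429)."""
    pool = list(proxies)
    for _ in range(max_attempts):
        if not pool:
            break
        proxy = random.choice(pool)
        status, body = fetch(url, proxy)
        if status in (403, 429):   # proxy looks blocked: drop it and retry
            pool.remove(proxy)
            continue
        return body
    raise RuntimeError("all proxies exhausted or blocked")
```

Tools with built-in rotation do essentially this for you, just with bigger pools and smarter retirement rules.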
For WebscrapingAPI, it goes like this: every request for every web page you make is automatically made through a different IP. Even if you scrape the same page one hundred times, the website will register it as one hundred different visitors accessing the page.
In some instances, you might actually want the website to recognize you. In that situation, you just have to modify a parameter in your request and you’ll use the same IP when revisiting a page.
Rotating your proxies is absolutely necessary if you want to extract data from several pages on the same website. Automatic proxy rotation is meant to make the process easy and painless.
Words of Reassurance
There is no need to panic when a web scraper gets blocked by a website. As long as you’re not infringing any copyright, bypassing the restriction doesn’t mean you’re doing something illegal. Thankfully, IP rotation is a quick and efficient fix for the blocked scrapers of the world.
To get started with happy scraping, try out our free plan and get 1000 no-strings-attached API calls.