The website banned me when I used Mechanize to grab its content





turbowolf
I keep grabbing photos from one website.

At first I showed them the default user agent (WWW-Mechanize), but obviously they didn't like this behaviour and banned that user agent, so I had to forge an IE user agent. That worked for some time, but the website administrator seems to have detected my behaviour and banned me again. This time I cannot bypass it by simply modifying the user agent.

I have already set Mechanize's user agent to the same string my IE browser uses, but it doesn't work. I can browse the website with IE, but I only get an empty page when I use Mechanize. All the decisions must be made on the server side, because my client never receives any script or redirect header; all I get is an empty page! So how can they tell I am not using a standard browser, and how can I get around this obstacle again? Any suggestions are appreciated.
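
For reference, forging the user agent with Ruby Mechanize looks roughly like this; the IE string and URL below are only examples, and exact method names can vary between Mechanize versions.

Code:
require 'mechanize'

agent = Mechanize.new

# either pick one of Mechanize's built-in browser aliases...
agent.user_agent_alias = 'Windows IE 7'

# ...or copy the exact string your own IE installation sends
agent.user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'

page = agent.get('http://example.com/photos')   # placeholder URL
puts page.body.length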
Arnie
Has it occurred to you that the webmaster does not want you to do this for a reason? ... If I noticed in my traffic logs that somebody was messing around and producing a huge load, I'd IP-ban him.
hawkwing8
Arnie wrote:
Has it occurred to you that the webmaster does not want you to do this for a reason? ... If I noticed in my traffic logs that somebody was messing around and producing a huge load, I'd IP-ban him.


Yes, if you produce a large load you will get banned; any respectable webmaster would do the same.
infobankr
It's fair and completely legitimate for webmasters to block users who use their site in ways it was not intended to be used. Depending on what types of photographs you are trying to obtain, there are better sources and ways to get photos than bulk-downloading them from a website.

You can buy CD-ROMs of stock photography, for example.
Stubru Freak
Try using Wireshark and compare the headers sent.
MrBlueSky
Are you sure WWW::Mechanize really gets an empty page? Most of the time when web scrapers stop working, it's because the layout of the page has changed. Even a simple added newline or an extra tag can break your scraping script.

Also, can you give the URL of the site?
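
One quick way to answer that is to dump the raw response the script receives; a minimal sketch with Ruby Mechanize, using a placeholder URL:

Code:
require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com/gallery')  # placeholder URL

puts "Status: #{page.code}"          # HTTP status code, e.g. "200"
puts "Length: #{page.body.length}"   # 0 means a genuinely empty response
puts page.body                       # otherwise, inspect what actually came back

If the body is non-empty but your selectors find nothing, a changed layout rather than a ban is the likelier culprit.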
turbowolf
Thanks a lot. The problem is that the webmaster not only bans access from WWW::Mechanize, but also bans any browser other than IE. Any browser other than IE accessing their website from my IP only gets an empty page. I am unhappy because I am a paying customer of their website and I only view a small number of pages every day. (I don't have time to keep browsing their website, which is why I hoped I could do it automatically.)

Now the problem is that I can't even use Firefox to browse their website. In fact I can browse their website with Firefox from a different IP and with a different user name. Any suggestions?

MrBlueSky wrote:
Are you sure WWW::Mechanize really gets an empty page? Most of the time when web scrapers stop working, it's because the layout of the page has changed. Even a simple added newline or an extra tag can break your scraping script.

Also, can you give the URL of the site?
MrBlueSky
WWW::Mechanize supports the use of a proxy. Maybe you can make it use a proxy, and change the user agent string to IE before the first request you send through that proxy, so the webmaster doesn't notice it's a script?
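
In Ruby Mechanize that idea looks roughly like this; the proxy host, port and URL below are placeholders:

Code:
require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = 'Windows IE 7'       # present an IE user agent from the very first request
agent.set_proxy('proxy.example.net', 8080)    # placeholder proxy host and port

page = agent.get('http://example.com/photos') # placeholder URL

From the server's point of view the requests then appear to come from the proxy's IP address rather than the banned one.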
BlueVD
First of all, emulating the browser's user agent doesn't fully do it.
They might run a JavaScript or AJAX check that performs further tests...
If you know programming, I'd recommend coding your own crawler based on the WebBrowser component from MS...
Stubru Freak
BlueVD wrote:
First of all, emulating the browser's user agent doesn't fully do it.
They might run a JavaScript or AJAX check that performs further tests...
If you know programming, I'd recommend coding your own crawler based on the WebBrowser component from MS...


JavaScript won't be executed by a web crawler anyway, so just emulating the user agent will work just fine.
turbowolf
I already forge a user agent in my script. Furthermore, I use the "User Agent Switcher" plugin for Firefox to pretend to be an IE browser. But they can still detect that I am using a browser other than IE and ban me. This is why I am unhappy.

I can't use a proxy, because the webmaster is so aggressive that he threatens to ban users who frequently change their client IP address.

So I am curious: how can he detect my script (or my Firefox browser) as soon as I start to log in to the system? I have already compared the cookies my script gets with the ones the browser gets, and they are exactly the same.
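
For reference, that cookie comparison can be scripted by dumping Mechanize's cookie jar right after logging in; a rough sketch in which the login URL, form selector and field names are only placeholders:

Code:
require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = 'Windows IE 7'

# log in first -- URL, form selector and field names below are placeholders
login_page = agent.get('http://example.com/login')
login_page.form_with(action: '/login') do |form|
  form.username = 'me'
  form.password = 'secret'
end.submit

# dump every cookie Mechanize stored, to compare with what the browser shows
agent.cookies.each do |cookie|
  puts "#{cookie.name}=#{cookie.value} (domain: #{cookie.domain}, path: #{cookie.path})"
end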

MrBlueSky wrote:
WWW::Mechanize supports the use of a proxy. Maybe you can make it use a proxy, and change the user agent string to IE before the first request you send through that proxy, so the webmaster doesn't notice it's a script?
turbowolf
I am using a script based on Ruby Mechanize.

I was banned just as I started crawling. When I request a page, they just send me back an empty one, so I think they judge based only on my user agent. But I have already set the user agent to the same one my IE browser uses. So the problem is: what clue am I leaving for them to detect?

BlueVD wrote:
First of all, emulating the browser's user agent doesn't fully do it.
They might run a JavaScript or AJAX check that performs further tests...
If you know programming, I'd recommend coding your own crawler based on the WebBrowser component from MS...
Stubru Freak
turbowolf wrote:
I am using a script based on Ruby Mechanize.

I was banned just as I started crawling. When I request a page, they just send me back an empty one, so I think they judge based only on my user agent. But I have already set the user agent to the same one my IE browser uses. So the problem is: what clue am I leaving for them to detect?


Use Wireshark to see exactly how your requests differ between the different browsers. They could be detecting your client from other parts of the request, besides the User-Agent header itself.

Also, check whether you really get an empty page, and not just a page whose body is empty, as the latter could mean JavaScript is used to load the real content.
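
Once the capture shows which headers differ, recent versions of Ruby Mechanize can be told to send the same ones via the request_headers hash; a sketch, where every header value is an example to be replaced with whatever the real IE capture shows:

Code:
require 'mechanize'

agent = Mechanize.new
agent.user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'  # copy the exact string from the capture

# extra headers IE sends -- the values below are examples only
agent.request_headers = {
  'Accept'          => 'image/gif, image/jpeg, image/pjpeg, */*',
  'Accept-Language' => 'en-us',
  'Referer'         => 'http://example.com/'   # placeholder referer
}

page = agent.get('http://example.com/photos')  # placeholder URL
puts page.body.length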
turbowolf
Aha, thank you very much. I will try Wireshark when I have time; I am too busy these days.

Stubru Freak wrote:
turbowolf wrote:
I am using a script based on Ruby Mechanize.

I was banned just as I started crawling. When I request a page, they just send me back an empty one, so I think they judge based only on my user agent. But I have already set the user agent to the same one my IE browser uses. So the problem is: what clue am I leaving for them to detect?


Use Wireshark to see exactly how your requests differ between the different browsers. They could be detecting your client from other parts of the request, besides the User-Agent header itself.

Also, check whether you really get an empty page, and not just a page whose body is empty, as the latter could mean JavaScript is used to load the real content.