This is the mail archive of the cygwin@cygwin.com mailing list for the Cygwin project.



Re: Wget ignores robot.txt entry


Max,

Right.

How can I have read the wget man page so many times and not have seen that? I guess it's 'cause I'm always looking for something specific, like the difference between "-o" and "-O".

The only thing I hate worse than being wrong is not knowing it (plus showing it).

Wget is orphaned? That's bad news, since it seems to have it all over cURL. (Sure. Go ahead and prove me wrong. I might as well get it over with... for now.)

Randall Schulz


At 18:41 2003-02-13, Max Bowsher wrote:
Randall R Schulz wrote:
> Lowell,
>
> What's in your "~/.wgetrc" file? If it contains this:
>
> robots = off
>
> Then wget will not respect a "robots.txt" file on the host from which
> it is retrieving files.
>
> Before I learned of this option (accessible _only_ via this directive
> in the .wgetrc file)

Or, on the command line -erobots=off :-)

Whilst this does control whether wget downloads robots.txt, a quick test
confirms that even when it does get robots.txt, it still wanders into
cgi-bin.

I'd suggest taking this to the wget list, except wget is currently
maintainer-less and, it appears, bitrotted.

Max.
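
For anyone wanting to check independently what a given robots.txt actually forbids, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the example.com URLs, and the is_allowed helper are all assumptions made for illustration; the thread does not show the real file or host. It only reports what the rules say and tells you nothing about how wget itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real file is not shown in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the rules above.
    for url in (
        "http://example.com/index.html",
        "http://example.com/cgi-bin/search",
    ):
        print(url, is_allowed(ROBOTS_TXT, "Wget", url))
```

If a crawler that claims to honor robots.txt fetches a URL for which this check returns False, that points at the crawler rather than at the file.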


