This is the mail archive of the cygwin@cygwin.com mailing list for the
Cygwin project.
Re: Wget ignores robot.txt entry
Lowell,
Max Bowsher reported:
Or, on the command line -erobots=off :-)
Whilst this does control whether wget downloads robots.txt, a quick
test confirms that even when it does get robots.txt, it still wanders
into cgi-bin.
I'd suggest taking this to the wget list, except wget is currently
maintainer-less and, it appears, bitrotted.
Max.
As for this:
Perhaps there is a counterpart to the above, i.e., <meta name="robots"
content="follow">, that's being invoked, and someone from Red Hat could
look into this and rule it out.
You should realize that for open source programs like wget, the
recommended practice is to examine the source yourself.
Randall Schulz
At 17:43 2003-02-14, L Anderson wrote:
Randall R Schulz wrote:
Lowell,
What's in your "~/.wgetrc" file? If it contains this:
robots = off
Then wget will not respect a "robots.txt" file on the host from which
it is retrieving files.
Before I learned of this option (accessible _only_ via this directive
in the .wgetrc file), I did something too clever by half to get
robots.txt ignored, so I know that wget does respect it.
I have only two wgetrc related files as follows:
/etc/wgetrc
/usr/doc/wget-1.8.2/sample.wgetrc
NB: I use win98 and these are under my cygwin directory i:\cygwin
(i.e. /cygdrive/i).
I have never changed either file; I just accepted the defaults installed
by setup. The two files do differ by a few lines, but those are just
comments anyway. Running:
$ diff /etc/wgetrc /usr/doc/wget-1.8.2/sample.wgetrc
73,74c73,74
< # You can set the default proxy for Wget to use. It will override the
< # value in the environment.
---
> # You can set the default proxies for Wget to use for http and ftp.
> # They will override the value in the environment.
75a76
> #ftp_proxy = http://proxy.yoyodyne.com:18023/
shows this. Moreover,
$ grep robot /etc/wgetrc
# Setting this to off makes Wget not download /robots.txt. Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
#robots = on
shows the only references to "robot" are also comments.
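A stricter version of the grep above would match only uncommented
"robots" lines, and could also be pointed at ~/.wgetrc if one exists.
A minimal sketch, assuming a POSIX shell and a grep that supports -H;
the function name active_robots_setting is made up for illustration:

```shell
# Print every active (uncommented) "robots = ..." line in the given
# wgetrc files, prefixed with the file name. Unreadable or missing
# files are skipped silently.
active_robots_setting() {
    for f in "$@"; do
        [ -r "$f" ] || continue
        # Match "robots" at the start of the line, allowing leading
        # whitespace, so lines beginning with "#" never match.
        grep -H -i -E '^[[:space:]]*robots[[:space:]]*=' "$f"
    done
}
```

It could be called as: active_robots_setting /etc/wgetrc ~/.wgetrc
No output would suggest that none of the listed files overrides the
default.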
The stated default for wget is "robots=on", which I have seen honored
in quite a number of other downloads, and since I didn't use "-e
robots=off", that can't explain it. The only other thing I have found
that might be related is not under my control, and I haven't yet
figured out how to check it. The wget documentation states:
"
The second, less known mechanism, enables the author of an individual
document to specify whether they want the links from the file to be
followed by a robot. This is achieved using the META tag, like this:
<meta name="robots" content="nofollow">
This is explained in some detail at
<http://www.robotstxt.org/wc/meta-user.html>. Wget supports this
method of robot exclusion in addition to the usual /robots.txt exclusion.
"
Perhaps there is a counterpart to the above, i.e., <meta name="robots"
content="follow">, that's being invoked, and someone from Red Hat could
look into this and rule it out.
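One way to rule this in or out would be to save the page in question
and look for a robots META tag in it. A rough sketch, assuming the page
has already been saved locally (page.html is a hypothetical name) and
that the tag sits on a single line; find_robots_meta is an illustrative
name, not part of wget:

```shell
# Print any robots META tags found in a saved HTML file.
# Matching is case-insensitive, and the quotes around the attribute
# value are optional. Exits non-zero when no such tag is found.
find_robots_meta() {
    grep -i -o -E '<meta[^>]*name="?robots"?[^>]*>' "$1"
}
```

It could be called as: find_robots_meta page.html
If nothing is printed, the page carries no single-line robots META tag
and this mechanism is probably not the explanation.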
Thanks (and still puzzled)!
Lowell Anderson