The robots exclusion standard is considered proper netiquette, so any script that exhibits
crawling-like behavior is expected to abide by it.
The intended use of this class is to feed it a URL before you visit it. The class will
automatically attempt to read the site's robots.txt file and return a boolean indicating
whether you are allowed to visit that URL.
Crawl-delay and Request-rate values are capped at a maximum of 60 seconds.
The class will block until the detected Crawl-delay (or Request-rate) allows visiting the URL.
For instance, if Crawl-delay is set to 3, the Robots_txt::urlAllowed() method will block for up
to 3 seconds when called a second time. An internal clock records the time of the last visit,
so if the delay has already expired, the method will not block.
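For illustration, here is a minimal sketch of how such per-host delay bookkeeping could work.
The class name CrawlDelayClock and its waitFor() method are hypothetical and only illustrate
the idea; they are not the actual internals of Robots_txt.

// Hypothetical sketch of the delay bookkeeping described above;
// not the real internals of Robots_txt.
class CrawlDelayClock
{
    /** @var array Last visit timestamp (float seconds) keyed by host. */
    private static $lastVisit = array();

    /**
     * Block until the crawl-delay for $host has elapsed, then record the visit.
     * The delay is capped at 60 seconds, mirroring the limit described above.
     */
    public static function waitFor($host, $crawlDelay)
    {
        $crawlDelay = min((float) $crawlDelay, 60.0);
        if (isset(self::$lastVisit[$host])) {
            $elapsed = microtime(true) - self::$lastVisit[$host];
            if ($elapsed < $crawlDelay) {
                // Sleep only for the remaining part of the delay.
                usleep((int) (($crawlDelay - $elapsed) * 1000000));
            }
        }
        self::$lastVisit[$host] = microtime(true);
    }
}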
Example usage
foreach ($arrUrlsToVisit as $strUrlToVisit) {
    if (Robots_txt::urlAllowed($strUrlToVisit, $strUserAgent)) {
        // visit the URL, do processing...
    }
}
The simple example above will ensure you abide by the wishes of the site owners.
Note: an unofficial, non-standard extension exists that limits the times at which crawlers
are allowed to visit a site. I chose to ignore this extension because I feel it is
unreasonable.
Note: you are only *required* to specify your userAgent the first time you call the
urlAllowed() method, and only the first value is ever used.
Example Usage
var_dump(Robots_txt::urlAllowed('http://slashdot.org/', 'Slurp'));
var_dump(Robots_txt::urlAllowed('http://slashdot.org/test', 'Slurp'));
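If the userAgent parameter really is optional after the first call, as the note above implies
(an assumption; the method signature is not shown here), a subsequent call could omit it:

// Assumes urlAllowed() accepts a call without a userAgent once one has been registered.
var_dump(Robots_txt::urlAllowed('http://slashdot.org/test'));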