The robots exclusion standard is considered proper netiquette, so any script that exhibits
crawling-like behavior is expected to abide by it.
The intended use of this class is to feed it a URL before you visit it. The class will
automatically attempt to read the site's robots.txt file and return a boolean indicating
whether you are allowed to visit that URL.
Crawl-delay and Request-rate values are capped at a maximum of 60 seconds.
The class will block until the detected Crawl-delay (or Request-rate) allows visiting the URL.
For instance, if Crawl-delay is set to 3, the Robots_txt::urlAllowed() method will block for up
to 3 seconds when called a second time. An internal clock records the time of the last visit,
so if the delay has already expired, the method will not block.
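For illustration, here is a minimal sketch of how such per-host delay bookkeeping could work.
The class name CrawlDelayClock and its waitFor() method are hypothetical and only illustrate
the idea; they are not the actual internals of Robots_txt.

// Hypothetical sketch of the delay bookkeeping described above;
// not the real internals of Robots_txt.
class CrawlDelayClock
{
    /** @var array Last visit timestamp (float seconds) keyed by host. */
    private static $lastVisit = array();

    /**
     * Block until the crawl-delay for $host has elapsed, then record the visit.
     * The delay is capped at 60 seconds, mirroring the limit described above.
     */
    public static function waitFor($host, $crawlDelay)
    {
        $crawlDelay = min((float) $crawlDelay, 60.0);
        if (isset(self::$lastVisit[$host])) {
            $elapsed = microtime(true) - self::$lastVisit[$host];
            if ($elapsed < $crawlDelay) {
                // Sleep only for the remaining part of the delay.
                usleep((int) (($crawlDelay - $elapsed) * 1000000));
            }
        }
        self::$lastVisit[$host] = microtime(true);
    }
}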
Example usage
foreach ($arrUrlsToVisit as $strUrlToVisit) {
    if (Robots_txt::urlAllowed($strUrlToVisit, $strUserAgent)) {
        // visit the URL, do processing...
    }
}
The simple example above will ensure you abide by the wishes of the site owners.
Note: an unofficial, non-standard extension exists that limits the times at which crawlers
are allowed to visit a site. I chose to ignore this extension because I feel it is
unreasonable.
Note: you are only *required* to specify your userAgent the first time you call the
urlAllowed() method, and only the first value is ever used.
Example Usage
var_dump(Robots_txt::urlAllowed('http://slashdot.org/', 'Slurp'));
var_dump(Robots_txt::urlAllowed('http://slashdot.org/test', 'Slurp'));
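If the userAgent parameter really is optional after the first call, as the note above implies
(an assumption; the method signature is not shown here), a subsequent call could omit it:

// Assumes urlAllowed() accepts a call without a userAgent once one has been registered.
var_dump(Robots_txt::urlAllowed('http://slashdot.org/test'));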