Spider Engine: Understanding patterns
Kristiensen - 2007-04-09 19:38:09
me again
I'm very new to programming. Your script looks very interesting. I'm playing with example5_google_results.php, but I would like to return only the links of the websites that Google indexes: not the cached links, not the related links, nothing but a list of links, something like
link 1 = http://www.link1.com
link 2 = http://www.link2.com
But I really don't understand how to do it. Thank you in advance for your help.
Steffy
Radu Topala - 2007-04-10 12:53:04 - In reply to message 1 from Kristiensen
it's very simple ... in $obj->pattern, replace the "title", "content", "cache" and "similar" placeholders with "dummy". each result will then have just two variables, "dummy" and "link", and you can use the "link" variable as you need.
enjoy.
Kristiensen - 2007-04-14 05:30:52 - In reply to message 2 from Radu Topala
thank you for your time and tips, but I'm blonde !!
Here is how I configured your class:
=============================================================================
$obj=new MySpider();
$obj->url="http://www.google.com/search?q=kitesurf&start={range[0]}";
$obj->range=array(0=>array("start"=>0,"end"=>1,"step"=>1));
//$obj->pattern_definition=array("dummy","link","title","content","cache","similar"); //dummy is used for content that changes between pages and we are not interested in it
$obj->pattern_definition=array("dummy","link"); //dummy is used for content that changes between pages and we are not interested in it
$obj->start='<div>';
$obj->end=array("to_process"=>array('</body>'),"not_to_process"=>array());
//$obj->pattern='<div class=g{p[dummy]}><h2 class=r><a{p[dummy]}href="{p[link]}"{p[dummy]}class=l{p[dummy]}>{p[title]}</a></h2><table border=0 cellpadding=0 cellspacing=0><tr><td class=j><font size=-1>{p[content]}<br><span{p[dummy]}>{p[dummy]}</span><nobr><a class=fl href="{p[cache]}">{p[dummy]}</a> - <a class=fl href="{p[similar]}">{p[dummy]}</a></nobr></font></td></tr></table></div>';
$obj->pattern='<h2 class=r><a{p[dummy]}href="{p[link]}"{p[dummy]}class=l{p[dummy]}><br />';
$obj->fetchData();
=============================================================================
And here is what it outputs:
=============================================================================
Url: http://www.google.com/search?q=kitesurf&start=0 processing!
Array (
[0] => Array ( [dummy] => [link] => http://www.kite-surf.com/ )
[1] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:7zy9asTytgAJ:www.kite-surf.com/+kitesurf&hl=en&ct=clnk&cd=1&ie=UTF-8 )
[2] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kite-surf.com/ )
[3] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:siM2eG3yXtAJ:www.prokitesurf.com/+kitesurf&hl=en&ct=clnk&cd=2&ie=UTF-8 )
[4] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.prokitesurf.com/ )
[5] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:tKJCI76umJAJ:www.kitesurf.com/+kitesurf&hl=en&ct=clnk&cd=3&ie=UTF-8 )
[6] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kitesurf.com/ )
[7] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:Ws_E5o4bE70J:kitesurfingschool.org/howto.htm+kitesurf&hl=en&ct=clnk&cd=4&ie=UTF-8 )
[8] => Array ( [dummy] => [link] => http://kitesurfingschool.org/ )
[9] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:ClpdkWWJKKIJ:kitesurfingschool.org/+kitesurf&hl=en&ct=clnk&cd=5&ie=UTF-8 )
[10] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:kitesurfingschool.org/ )
[11] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:1dhkGBb_ufYJ:en.wikipedia.org/wiki/Kitesurfing+kitesurf&hl=en&ct=clnk&cd=6&ie=UTF-8 )
[12] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:en.wikipedia.org/wiki/Kitesurfing )
[13] => Array ( [dummy] => [link] => http://www.planetkitesurf.com/ )
[14] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:3Ie_N0I2siMJ:www.planetkitesurf.com/+kitesurf&hl=en&ct=clnk&cd=7&ie=UTF-8 )
[15] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.planetkitesurf.com/ )
[16] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:iiJiIYaUVU4J:www.kitesurf.ie/+kitesurf&hl=en&ct=clnk&cd=8&ie=UTF-8 )
[17] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kitesurf.ie/ )
[18] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:74vvf5HhPDAJ:www.kitesurfusa.com/+kitesurf&hl=en&ct=clnk&cd=9&ie=UTF-8 )
[19] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kitesurfusa.com/ )
)
Url: http://www.google.com/search?q=kitesurf&start=0 has been processed in 1.46 sec !
Url: http://www.google.com/search?q=kitesurf&start=1 processing!
Array (
[0] => Array ( [dummy] => [link] => http://www.prokitesurf.com/ )
[1] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:siM2eG3yXtAJ:www.prokitesurf.com/+kitesurf&hl=en&ct=clnk&cd=2&ie=UTF-8 )
[2] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.prokitesurf.com/ )
[3] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:tKJCI76umJAJ:www.kitesurf.com/+kitesurf&hl=en&ct=clnk&cd=3&ie=UTF-8 )
[4] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kitesurf.com/ )
[5] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:Ws_E5o4bE70J:kitesurfingschool.org/howto.htm+kitesurf&hl=en&ct=clnk&cd=4&ie=UTF-8 )
[6] => Array ( [dummy] => [link] => http://kitesurfingschool.org/ )
[7] => Array ( [dummy] => [link] => http://kitesurfingschool.org/ )
[8] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:ClpdkWWJKKIJ:kitesurfingschool.org/+kitesurf&hl=en&ct=clnk&cd=5&ie=UTF-8 )
[9] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:kitesurfingschool.org/ )
[10] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:1dhkGBb_ufYJ:en.wikipedia.org/wiki/Kitesurfing+kitesurf&hl=en&ct=clnk&cd=6&ie=UTF-8 )
[11] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:en.wikipedia.org/wiki/Kitesurfing )
[12] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:3Ie_N0I2siMJ:www.planetkitesurf.com/+kitesurf&hl=en&ct=clnk&cd=7&ie=UTF-8 )
[13] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.planetkitesurf.com/ )
[14] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:iiJiIYaUVU4J:www.kitesurf.ie/+kitesurf&hl=en&ct=clnk&cd=8&ie=UTF-8 )
[15] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kitesurf.ie/ )
[16] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:74vvf5HhPDAJ:www.kitesurfusa.com/+kitesurf&hl=en&ct=clnk&cd=9&ie=UTF-8 )
[17] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kitesurfusa.com/ )
[18] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:q2t4h4LRp5oJ:www.kiteboardingholidays.com/+kitesurf&hl=en&ct=clnk&cd=10&ie=UTF-8 )
[19] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kiteboardingholidays.com/ )
)
Url: http://www.google.com/search?q=kitesurf&start=1 has been processed in 1.47 sec !
=============================================================================
But what I want to achieve would be simply
[1] => http://.....
[2] => http://.....
[3] => http://.....
[4] => http://.....
[5] => http://.....
and so on. Is that possible, and if so, how should I proceed? Thank you in advance.
Steffy
Radu Topala - 2007-04-14 05:48:11 - In reply to message 3 from Kristiensen
use this pattern instead:
$obj->pattern='<div class=g{p[dummy]}><h2 class=r><a{p[dummy]}href="{p[link]}"{p[dummy]}class=l{p[dummy]}>{p[dummy]}</a></h2><table border=0 cellpadding=0 cellspacing=0><tr><td class=j><font size=-1>{p[dummy]}<br><span{p[dummy]}>{p[dummy]}</span><nobr><a class=fl href="{p[dummy]}">{p[dummy]}</a> - <a class=fl href="{p[dummy]}">{p[dummy]}</a></nobr></font></td></tr></table></div>';
you can then delete all the dummies with unset(). good luck.
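For readers following along, here is a minimal sketch of how the matches could be reduced to the plain numbered list of links asked for earlier in the thread. It is not part of the Spider Engine class: the helper name extractLinks is hypothetical, and treating any link that contains "/search?" as a Google cache or related link is an assumption based only on the output posted above.

```php
<?php
// Hypothetical helper, not part of the Spider Engine class.
// Turns an array of pattern matches into a list of links numbered from 1.
function extractLinks(array $patternMatches): array
{
    $links = [];
    foreach ($patternMatches as $match) {
        if (empty($match['link'])) {
            continue; // skip rows where no link was captured
        }
        $link = $match['link'];
        // Assumption: Google's cache and "related" links both contain
        // "/search?", as in the output posted in this thread.
        if (strpos($link, '/search?') !== false) {
            continue;
        }
        $links[count($links) + 1] = $link; // keys start at 1
    }
    return $links;
}

// Sample rows copied from the output posted above.
$sample = [
    ['dummy' => '', 'link' => 'http://www.kite-surf.com/'],
    ['dummy' => '', 'link' => 'http://66.102.9.104/search?q=cache:7zy9asTytgAJ:www.kite-surf.com/+kitesurf&hl=en&ct=clnk&cd=1&ie=UTF-8'],
    ['dummy' => '', 'link' => '/search?hl=en&ie=UTF-8&q=related:www.kite-surf.com/'],
    ['dummy' => '', 'link' => 'http://kitesurfingschool.org/'],
];

print_r(extractLinks($sample));
```

With the stricter pattern above the filter should rarely trigger, since only the result link is captured; it is kept here as a safety net for looser patterns.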
Kristiensen - 2007-04-14 06:11:03 - In reply to message 4 from Radu Topala
YES..
I'm getting somewhere. How would you use unset() to remove the dummy?
Steffy
Radu Topala - 2007-04-14 06:29:20 - In reply to message 5 from Kristiensen
Kristiensen - 2007-04-14 06:44:49 - In reply to message 6 from Radu Topala
Yes, yes, I know the unset function, but where would you put it in that script?
Also, let's say I want the links on my result page to be clickable: what should I modify in the pattern? Thank you for your time, and sorry to be dumb.
Steffy
Radu Topala - 2007-04-14 06:58:36 - In reply to message 7 from Kristiensen
in my google example you already have a good unset example:
function processData($pattern_matches) //you can do whatever you want here with the pattern matches, insert in a database etc.
{
    foreach ($pattern_matches as $k=>$v)
    {
        if($v['dummy'])
        {
            unset($pattern_matches[$k]['dummy']);
        }
    }
    print_r($pattern_matches);
}
you shouldn't modify the pattern !! you can wrap each link in an <a href=""> tag after processing, in the processData function. have a nice day.
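As a rough illustration of that last suggestion, the sketch below builds clickable links from the matches after processing, leaving the pattern untouched. The function name renderLinks is hypothetical and not part of the Spider Engine class; it assumes each match is an array with a "link" key, as in the output posted earlier in the thread.

```php
<?php
// Hypothetical helper, not part of the Spider Engine class.
// Builds one clickable HTML anchor per captured link.
function renderLinks(array $patternMatches): string
{
    $html = '';
    foreach ($patternMatches as $match) {
        if (empty($match['link'])) {
            continue; // nothing to render for this row
        }
        // Escape the URL so characters such as & and " are safe in HTML.
        $safe = htmlspecialchars($match['link'], ENT_QUOTES, 'UTF-8');
        $html .= '<a href="' . $safe . '">' . $safe . "</a><br>\n";
    }
    return $html;
}

// Sample rows in the same shape as the matches shown in this thread.
echo renderLinks([
    ['dummy' => '', 'link' => 'http://www.kite-surf.com/'],
    ['dummy' => '', 'link' => 'http://www.planetkitesurf.com/'],
]);
```

A call like this could be placed inside processData in place of print_r. Note that the relative "/search?..." links from the earlier output would need the Google host prepended before they became usable.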