PHP Classes

Understanding patterns

Subject:Understanding patterns
Summary:Understanding patterns
Messages:8
Author:Kristiensen
Date:2007-04-09 19:38:09
Update:2007-04-14 06:58:36
 

  1. Understanding patterns
Kristiensen - 2007-04-09 19:38:09
me again

I'm very new to programming

your script looks very interesting

I'm playing with example5_google_results.php

but I would like to return, for example, only the links of the websites that Google indexes... not the cached links, not the related links, nothing other than a list of links

something like
link 1 = http://www.link1.com
link 2 = http://www.link2.com

but I really don't understand how to do it

thank you in advance for your help

Steffy

  2. Re: Understanding patterns
Radu Topala - 2007-04-10 12:53:04 - In reply to message 1 from Kristiensen
it's very simple: in $obj->pattern, replace "title", "content", "cache" and "similar" with "dummy". then every result has just two variables, "dummy" and "link", and you can use the "link" variable as you need.
enjoy.
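
A minimal sketch of that substitution, assuming the starting point is the full Google pattern quoted (commented out) in message 3 of this thread. It uses only PHP's built-in str_replace() and does not touch the MySpider class; the variable names are illustrative.

```php
<?php
// Full pattern from the Google example, copied from message 3 of this thread.
$full_pattern = '<div class=g{p[dummy]}><h2 class=r><a{p[dummy]}href="{p[link]}"{p[dummy]}class=l{p[dummy]}>{p[title]}</a></h2><table border=0 cellpadding=0 cellspacing=0><tr><td class=j><font size=-1>{p[content]}<br><span{p[dummy]}>{p[dummy]}</span><nobr><a class=fl href="{p[cache]}">{p[dummy]}</a> - <a class=fl href="{p[similar]}">{p[dummy]}</a></nobr></font></td></tr></table></div>';

// Placeholders we are not interested in; each one becomes a throwaway "dummy".
$unwanted = array('{p[title]}', '{p[content]}', '{p[cache]}', '{p[similar]}');

// The HTML skeleton stays untouched, only the placeholder names change.
$link_only_pattern = str_replace($unwanted, '{p[dummy]}', $full_pattern);

echo $link_only_pattern, "\n";
```

Only the placeholder names change, so the pattern still has to match a whole result block; that is why the cached and related links end up in "dummy" rather than in "link".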

  3. Re: Understanding patterns
Kristiensen - 2007-04-14 05:30:52 - In reply to message 2 from Radu Topala
thank you for your time and tips, but I'm blonde !!

here is how I configure your class

=============================================================================
$obj=new MySpider();
$obj->url="http://www.google.com/search?q=kitesurf&start={range[0]}";
$obj->range=array(0=>array("start"=>0,"end"=>1,"step"=>1));
//$obj->pattern_definition=array("dummy","link","title","content","cache","similar"); //dummy is used for content that changes between pages and we are not interested in it
$obj->pattern_definition=array("dummy","link"); //dummy is used for content that changes between pages and we are not interested in it

$obj->start='<div>';
$obj->end=array("to_process"=>array('</body>'),"not_to_process"=>array());
//$obj->pattern='<div class=g{p[dummy]}><h2 class=r><a{p[dummy]}href="{p[link]}"{p[dummy]}class=l{p[dummy]}>{p[title]}</a></h2><table border=0 cellpadding=0 cellspacing=0><tr><td class=j><font size=-1>{p[content]}<br><span{p[dummy]}>{p[dummy]}</span><nobr><a class=fl href="{p[cache]}">{p[dummy]}</a> - <a class=fl href="{p[similar]}">{p[dummy]}</a></nobr></font></td></tr></table></div>';
$obj->pattern='<h2 class=r><a{p[dummy]}href="{p[link]}"{p[dummy]}class=l{p[dummy]}><br />';

$obj->fetchData();
================================================================

and here is what it outputs
===============================================================
Url: http://www.google.com/search?q=kitesurf&start=0 processing!
Array (
[0] => Array ( [dummy] => [link] => http://www.kite-surf.com/ )
[1] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:7zy9asTytgAJ:www.kite-surf.com/+kitesurf&hl=en&ct=clnk&cd=1&ie=UTF-8 )
[2] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kite-surf.com/ )
[3] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:siM2eG3yXtAJ:www.prokitesurf.com/+kitesurf&hl=en&ct=clnk&cd=2&ie=UTF-8 )
[4] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.prokitesurf.com/ )
[5] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:tKJCI76umJAJ:www.kitesurf.com/+kitesurf&hl=en&ct=clnk&cd=3&ie=UTF-8 )
[6] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kitesurf.com/ )
[7] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:Ws_E5o4bE70J:kitesurfingschool.org/howto.htm+kitesurf&hl=en&ct=clnk&cd=4&ie=UTF-8 )
[8] => Array ( [dummy] => [link] => http://kitesurfingschool.org/ )
[9] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:ClpdkWWJKKIJ:kitesurfingschool.org/+kitesurf&hl=en&ct=clnk&cd=5&ie=UTF-8 )
[10] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:kitesurfingschool.org/ )
[11] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:1dhkGBb_ufYJ:en.wikipedia.org/wiki/Kitesurfing+kitesurf&hl=en&ct=clnk&cd=6&ie=UTF-8 )
[12] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:en.wikipedia.org/wiki/Kitesurfing )
[13] => Array ( [dummy] => [link] => http://www.planetkitesurf.com/ )
[14] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:3Ie_N0I2siMJ:www.planetkitesurf.com/+kitesurf&hl=en&ct=clnk&cd=7&ie=UTF-8 )
[15] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.planetkitesurf.com/ )
[16] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:iiJiIYaUVU4J:www.kitesurf.ie/+kitesurf&hl=en&ct=clnk&cd=8&ie=UTF-8 )
[17] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kitesurf.ie/ )
[18] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:74vvf5HhPDAJ:www.kitesurfusa.com/+kitesurf&hl=en&ct=clnk&cd=9&ie=UTF-8 )
[19] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kitesurfusa.com/ )
)
Url: http://www.google.com/search?q=kitesurf&start=0 has been processed in 1.46 sec !
Url: http://www.google.com/search?q=kitesurf&start=1 processing!
Array (
[0] => Array ( [dummy] => [link] => http://www.prokitesurf.com/ )
[1] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:siM2eG3yXtAJ:www.prokitesurf.com/+kitesurf&hl=en&ct=clnk&cd=2&ie=UTF-8 )
[2] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.prokitesurf.com/ )
[3] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:tKJCI76umJAJ:www.kitesurf.com/+kitesurf&hl=en&ct=clnk&cd=3&ie=UTF-8 )
[4] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kitesurf.com/ )
[5] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:Ws_E5o4bE70J:kitesurfingschool.org/howto.htm+kitesurf&hl=en&ct=clnk&cd=4&ie=UTF-8 )
[6] => Array ( [dummy] => [link] => http://kitesurfingschool.org/ )
[7] => Array ( [dummy] => [link] => http://kitesurfingschool.org/ )
[8] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:ClpdkWWJKKIJ:kitesurfingschool.org/+kitesurf&hl=en&ct=clnk&cd=5&ie=UTF-8 )
[9] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:kitesurfingschool.org/ )
[10] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:1dhkGBb_ufYJ:en.wikipedia.org/wiki/Kitesurfing+kitesurf&hl=en&ct=clnk&cd=6&ie=UTF-8 )
[11] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:en.wikipedia.org/wiki/Kitesurfing )
[12] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:3Ie_N0I2siMJ:www.planetkitesurf.com/+kitesurf&hl=en&ct=clnk&cd=7&ie=UTF-8 )
[13] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.planetkitesurf.com/ )
[14] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:iiJiIYaUVU4J:www.kitesurf.ie/+kitesurf&hl=en&ct=clnk&cd=8&ie=UTF-8 )
[15] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kitesurf.ie/ )
[16] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:74vvf5HhPDAJ:www.kitesurfusa.com/+kitesurf&hl=en&ct=clnk&cd=9&ie=UTF-8 )
[17] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kitesurfusa.com/ )
[18] => Array ( [dummy] => [link] => http://66.102.9.104/search?q=cache:q2t4h4LRp5oJ:www.kiteboardingholidays.com/+kitesurf&hl=en&ct=clnk&cd=10&ie=UTF-8 )
[19] => Array ( [dummy] => [link] => /search?hl=en&ie=UTF-8&q=related:www.kiteboardingholidays.com/ )
)
Url: http://www.google.com/search?q=kitesurf&start=1 has been processed in 1.47 sec !
====================================================================

but what I want to achieve would be simply
[1] => http://.....
[2] => http://.....
[3] => http://.....
[4] => http://.....
[5] => http://.....

and so on

is that possible, and if so, how should I proceed?

thank you in advance

Steffy

  4. Re: Understanding patterns
Radu Topala - 2007-04-14 05:48:11 - In reply to message 3 from Kristiensen
use this pattern instead:
$obj->pattern='<div class=g{p[dummy]}><h2 class=r><a{p[dummy]}href="{p[link]}"{p[dummy]}class=l{p[dummy]}>{p[dummy]}</a></h2><table border=0 cellpadding=0 cellspacing=0><tr><td class=j><font size=-1>{p[dummy]}<br><span{p[dummy]}>{p[dummy]}</span><nobr><a class=fl href="{p[dummy]}">{p[dummy]}</a> - <a class=fl href="{p[dummy]}">{p[dummy]}</a></nobr></font></td></tr></table></div>';

you can then delete all the dummies with unset().
good luck.
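
One possible way to get from the spider's result rows to the plain numbered list asked for in message 3 is sketched below. Instead of unset(), it uses PHP's built-in array_column() (PHP 5.5 or later) to keep only the "link" values. It assumes the rows have the shape shown in the print_r output in message 3; the three sample links are taken from that output, and the variable names are illustrative.

```php
<?php
// Sample rows in the shape print_r showed in message 3 of this thread.
$pattern_matches = array(
    array('dummy' => '', 'link' => 'http://www.kite-surf.com/'),
    array('dummy' => '', 'link' => 'http://www.prokitesurf.com/'),
    array('dummy' => '', 'link' => 'http://www.kitesurf.com/'),
);

// Keep only the "link" value of every row; "dummy" is simply left behind.
$links = array_column($pattern_matches, 'link');

// Renumber from 1 instead of 0 to match the list in the question.
// The guard avoids calling array_combine() with mismatched sizes on empty input.
$links = $links ? array_combine(range(1, count($links)), $links) : array();

print_r($links);
```

With this approach there is nothing left to unset afterwards, because the "dummy" values never make it into the new array.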

  5. Re: Understanding patterns
Kristiensen - 2007-04-14 06:11:03 - In reply to message 4 from Radu Topala
YES..

I'm getting somewhere

how would you use unset to remove the dummy?

Steffy

  6. Re: Understanding patterns
Radu Topala - 2007-04-14 06:29:20 - In reply to message 5 from Kristiensen

  7. Re: Understanding patterns
Kristiensen - 2007-04-14 06:44:49 - In reply to message 6 from Radu Topala
yes yes, I know the unset function, but where would you put it in that script?

also, let's say I want the links on my result page to be clickable: what should I modify in the pattern?

thank you for your time, and sorry for being dumb

Steffy

  8. Re: Understanding patterns
Radu Topala - 2007-04-14 06:58:36 - In reply to message 7 from Kristiensen
my google example already has a good unset example:
function processData($pattern_matches) // you can do whatever you want here with the pattern matches: insert them in a database, etc.
{
    foreach ($pattern_matches as $k => $v)
    {
        // array_key_exists() also catches an empty dummy; a plain if ($v['dummy']) would skip it
        if (array_key_exists('dummy', $v))
        {
            unset($pattern_matches[$k]['dummy']);
        }
    }
    print_r($pattern_matches);
}

you shouldn't modify the pattern !!
you can wrap each link in <a href="..."> after processing, in the processData function.
have a nice day.
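
A rough sketch of that last step, under the assumption that each result row carries its URL in a "link" key as in the output in message 3. linksToHtml() is a hypothetical helper, not part of the MySpider class; it could be called from inside processData in place of print_r.

```php
<?php
// Hypothetical helper (not part of MySpider): turns result rows into HTML anchors.
function linksToHtml(array $pattern_matches)
{
    $html = '';
    foreach ($pattern_matches as $row) {
        if (empty($row['link'])) {
            continue; // skip rows without a captured link
        }
        // Escape the URL so characters such as & and " cannot break the markup.
        $url = htmlspecialchars($row['link'], ENT_QUOTES, 'UTF-8');
        $html .= '<a href="' . $url . '">' . $url . "</a><br />\n";
    }
    return $html;
}

// Two sample rows taken from the output posted in message 3.
echo linksToHtml(array(
    array('dummy' => '', 'link' => 'http://www.kite-surf.com/'),
    array('dummy' => '', 'link' => '/search?hl=en&ie=UTF-8&q=related:www.kite-surf.com/'),
));
```

Building the anchors after the spider has finished keeps the pattern responsible for matching only, which is in line with the advice not to modify it.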