phpQuery, the jQuery of PHP: Building a simple PHP crawler with it

by Vinícius Krolow on August 16th, 2010

The power of CSS3 selectors in PHP: fantastic, isn't it? Until now I always used XPath with DOMDocument whenever I needed to filter some HTML, but the idea of a class that understands CSS3 patterns and returns the matching DOM objects was always in the back of my mind. Some days ago I decided to start writing such a class, but first I thought, let's look for one that already exists. I found exactly what I needed, a great piece of work by Tobiasz Cudnik. The project is called phpQuery, a library written in PHP 5 that also provides a command-line interface.

To show the power of phpQuery, I'll build a kind of crawler, though not a very smart one. To keep the code flexible, I'll create a simple class structure:

Crawler.php, a class that requests the URLs we pass to it:

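<?php
 
require 'phpQuery.php';
 
class Crawler {
 
	// Raw body of the last response
	public $content;
 
	public function __construct() {
	}
 
	// Fetch $url and load the response as the current phpQuery document,
	// so that later pq() calls run against it.
	// Note: $contentType is accepted but not used yet.
	protected function request($url, $contentType = 'utf-8') {
		$this->content = $this->get_content($url);
		phpQuery::newDocument($this->content);
		return $this->content;
	}
 
	// Download $url with cURL and return the response body as a string
	private function get_content($url) {
		$ch = curl_init();
		$timeout = 5; // connection timeout, in seconds
		curl_setopt($ch, CURLOPT_URL, $url);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
		curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
		$file_contents = curl_exec($ch);
		if ($file_contents === false) {
			// curl_exec() returns false on failure; stop here instead of parsing it
			$error = curl_error($ch);
			curl_close($ch);
			throw new Exception('Request failed: ' . $error);
		}
		curl_close($ch);
 
		return $file_contents;
	}
 
}
?>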

CrawlerInterface.php, an interface that defines the methods each crawler class must have:

<?php
interface CrawlerInterface {
 
	public function getData();
 
}
?>

CrawlerFactory.php, the class that instantiates the implemented classes:

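<?php
require_once 'Crawler.php';
 
abstract class CrawlerFactory {
 
	// Load Crawler<ClassName>.php from this directory and return an instance of <ClassName>
	public static function factory($className, $args = array()) {
		$file = dirname(__FILE__) . '/Crawler' . $className . '.php';
		if (is_file($file)) {
			// require the same path that was just checked
			require_once $file;
 
			// the crawler constructors in this post take no arguments,
			// so the joined $args string is simply ignored by them
			$class = new $className(implode(',', $args));
 
			return $class;
		} else {
			throw new Exception($className . ' is not a valid class');
		}
	}
}
?>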

So far we have the Crawler class, which makes the request and fetches the content, a common interface for every site we want to read, and the factory class that instantiates the class we implement for each site.
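To make the role of the interface and the factory a bit more concrete, here is a minimal sketch that runs in a single file, with no phpQuery and no network. All the names in it (DemoCrawlerInterface, DemoFeed, DemoFactory) are hypothetical stand-ins, not the classes from this post, and the interface check is an extra that my factory above does not perform:

```php
<?php
// Hypothetical names throughout: these are illustrative stand-ins,
// not the Crawler/CrawlerFactory classes from this post.

interface DemoCrawlerInterface {
	public function getData();
}

class DemoFeed implements DemoCrawlerInterface {
	public function getData() {
		// a real crawler would request and parse a page here
		return array('source' => 'demo');
	}
}

class DemoFactory {
	public static function factory($className) {
		// refuse names that do not resolve to a known class
		if (!class_exists($className)) {
			throw new InvalidArgumentException($className . ' is not a valid class');
		}
		// refuse classes that do not honour the shared contract
		if (!in_array('DemoCrawlerInterface', class_implements($className), true)) {
			throw new InvalidArgumentException($className . ' does not implement DemoCrawlerInterface');
		}
		return new $className();
	}
}

$crawler = DemoFactory::factory('DemoFeed');
$data = $crawler->getData();
var_dump($data);
```

The point of the check is that the calling code can rely on getData() being there, whatever site-specific class the factory returns.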

Let's build a simple feed class that reads the items from a blog's RSS feed.

CrawlerCobaia.php

<?php
require_once 'CrawlerInterface.php';
 
class Cobaia extends Crawler implements CrawlerInterface {
 
	const URL = 'http://feeds.feedburner.com/cobaia/';
 
	public function __construct() {
		parent::__construct();
	}
 
	/**
	 * Implements the getData method
	 */
	public function getData() {
		// request the URL
		$this->request(self::URL);
		// select every item tag
		$news = pq('item');
		$result = array();
		// iterate over the results
		foreach ($news as $new) {
			// get the pubDate tag inside this item and return its text
			$date = pq('pubDate', $new)->text();
			$date = date('Y-m-d', strtotime($date));
 
			$result[] = array(
				'url' => pq('link', $new)->text(), // text of the link tag inside this item
				'title' => pq('title', $new)->text(), // text of the title tag inside this item
				'date' => $date,
			);
		}
 
		return $result;
	}
 
}
?>

We implemented getData, which requests the content, uses phpQuery to find the item tags and, inside each item, reads pubDate, link and title.

Let's test it:

<?php
	// require CrawlerFactory.php
	require 'CrawlerFactory.php';
 
	// create the Cobaia crawler and retrieve the instance
	$crawler = CrawlerFactory::factory('Cobaia');
	// get the result
	$result = $crawler->getData();
	// show the result
	var_dump($result);
?>
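The test above depends on a live feed, so here is an offline sketch of the same extraction that anyone can run. Note that it swaps phpQuery for PHP's built-in SimpleXML, and that the feed content (titles, example.com URLs, dates) is made up for illustration. I also use gmdate() instead of date() so the result does not depend on the server timezone:

```php
<?php
// Inline sample feed; titles, URLs and dates are invented for this sketch.
$rss = <<<XML
<rss version="2.0">
	<channel>
		<title>Sample feed</title>
		<item>
			<title>First post</title>
			<link>http://example.com/first</link>
			<pubDate>Mon, 16 Aug 2010 10:00:00 +0000</pubDate>
		</item>
		<item>
			<title>Second post</title>
			<link>http://example.com/second</link>
			<pubDate>Tue, 17 Aug 2010 09:30:00 +0000</pubDate>
		</item>
	</channel>
</rss>
XML;

// SimpleXML stands in for phpQuery here
$feed = simplexml_load_string($rss);

$result = array();
foreach ($feed->channel->item as $item) {
	$result[] = array(
		'url'   => (string) $item->link,
		'title' => (string) $item->title,
		// normalise the RFC 822 date to Y-m-d, in UTC
		'date'  => gmdate('Y-m-d', strtotime((string) $item->pubDate)),
	);
}

var_dump($result);
```

The resulting array has the same shape as the one returned by getData(): one entry per item, each with url, title and date keys.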

Another example using the phpQuery selectors:

CrawlerPhpnet.php

<?php
require_once 'CrawlerInterface.php';
 
class Phpnet extends Crawler implements CrawlerInterface {
 
	const URL = 'http://php.net';
 
	public function __construct() {
		parent::__construct();
	}
 
	/**
	 * Implements the getData method
	 */
	public function getData() {
		// request the URL
		$this->request(self::URL);
		// first h1 element with the class "summary"
		$title = pq('h1.summary:first');
		// text of the link(s) inside that heading
		return pq('a', $title)->text();
	}
 
}
?>

In this example we get the title of the latest entry on the php.net website, using phpQuery's CSS3 selectors.

Let's test it:

<?php
	// require CrawlerFactory.php
	require 'CrawlerFactory.php';
 
	// create the Phpnet crawler and retrieve the instance
	$crawler = CrawlerFactory::factory('Phpnet');
	// get the result
	$result = $crawler->getData();
	// show the result
	var_dump($result);
?>
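For comparison, this is roughly what the selector h1.summary:first followed by a looks like with the XPath and DOMDocument approach I used before. It is only a sketch: the markup is made up (it is not the real php.net HTML), and unlike text() in phpQuery it reads only the first matching link:

```php
<?php
// Made-up markup standing in for a news page; not the real php.net HTML.
$html = '<html><body>'
	. '<h1 class="summary entry"><a href="/one">First entry</a></h1>'
	. '<h1 class="summary"><a href="/two">Second entry</a></h1>'
	. '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// XPath counterpart of the CSS selector "h1.summary:first" followed by "a".
// The concat/normalize-space idiom matches "summary" as a whole class token,
// so class="summary entry" matches but class="summaryfoo" would not.
$query = '(//h1[contains(concat(" ", normalize-space(@class), " "), " summary ")])[1]//a';
$links = $xpath->query($query);

// take the text of the first link, or null when nothing matched
$title = $links->length > 0 ? $links->item(0)->textContent : null;

var_dump($title);
```

Compare that query string with 'h1.summary:first' and it is easy to see why CSS selectors are so much nicer to write.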

This was one example of what is possible with phpQuery: we built a simple crawler that gets data from a couple of sites. There is much more you can do with it; phpQuery is very powerful and will help a lot in your development.

I put the code on GitHub; feel free to change it.
