phpQuery, jQuery for PHP: building a simple PHP crawler with it
Using the power of CSS3 selectors from PHP is fantastic, isn't it? Until now I always used XPath with DOMDocument whenever I needed to filter some HTML, but the idea of writing a class that understands CSS3 selectors and returns the matching DOM objects kept crossing my mind. Some days ago I decided to start writing that class, but first I looked around to see whether something like it already existed, and I found exactly what I needed: phpQuery, a great piece of work by Tobiasz Cudnik. It is a library written in PHP 5 that also provides a command-line interface.
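For comparison, here is a minimal sketch of the XPath and DOMDocument approach mentioned above. The HTML fragment and variable names are made up for illustration. It selects the links inside every h1 with the class summary, which a CSS selector would express as simply `h1.summary a`.

```php
<?php
// Hypothetical HTML fragment standing in for a downloaded page.
$html = '<div><h1 class="summary entry-title"><a href="/news/1">First headline</a></h1>'
      . '<h1 class="summary"><a href="/news/2">Second headline</a></h1></div>';

$doc = new DOMDocument();
// Silence warnings about imperfect markup, which real pages usually have.
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// XPath 1.0 has no class selector, so matching class="summary" needs this idiom.
$query = "//h1[contains(concat(' ', normalize-space(@class), ' '), ' summary ')]//a";
$links = $xpath->query($query);

// Text of the first matching link.
$first = $links->item(0)->textContent;
echo $first, "\n";
```

The query works, but the class-matching expression is much harder to read than the CSS version, which is the main reason a selector-based library is attractive.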
Let's see what phpQuery can do. I'll build a crawler, not a very smart one, just enough to show the library in action. To keep the code flexible, I'll use a simple class structure:
Crawler.php, a class that requests the URLs we pass to it:
```php
<?php
require_once 'phpQuery.php';

class Crawler
{
    public $content;

    public function __construct()
    {
    }

    /**
     * Downloads $url and loads the response into phpQuery,
     * so that later pq() calls query this document.
     * Note: $contentType is currently unused.
     */
    protected function request($url, $contentType = 'utf-8')
    {
        $this->content = $this->get_content($url);
        phpQuery::newDocument($this->content);
        return $this->content;
    }

    private function get_content($url)
    {
        $ch = curl_init();
        $timeout = 5;
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        $file_contents = curl_exec($ch);
        // curl_exec() returns false on failure; read the error before closing.
        $error = ($file_contents === false) ? curl_error($ch) : null;
        curl_close($ch);
        if ($error !== null) {
            throw new Exception('Request failed: ' . $error);
        }
        return $file_contents;
    }
}
```
CrawlerInterface.php, an interface that defines the methods each crawler class must have:
```php
<?php
interface CrawlerInterface
{
    public function getData();
}
```
CrawlerFactory.php, the class that instantiates the implemented classes:
```php
<?php
require_once 'Crawler.php';

abstract class CrawlerFactory
{
    public static function factory($className, $args = array())
    {
        // The class Foo is expected to live in CrawlerFoo.php next to this file.
        $file = dirname(__FILE__) . '/Crawler' . $className . '.php';
        if (!is_file($file)) {
            throw new Exception('Is not a valid class');
        }
        require_once $file;

        // Pass each element of $args as a separate constructor argument.
        $reflection = new ReflectionClass($className);
        return $reflection->newInstanceArgs($args);
    }
}
```
So far I have created the Crawler class, which makes the request and fetches the content, a common interface for every site we want to crawl, and the factory class that loads the class we implement for each site.
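The factory turns the class name it receives into a file name, so 'Cobaia' is looked up in CrawlerCobaia.php. Here is a minimal sketch of that convention. The helper name `crawlerFileFor` and the `/app` directory are hypothetical, and the identifier check is an extra precaution I am assuming would be useful, not something the factory above does.

```php
<?php
// Hypothetical helper mirroring the factory's naming convention.
function crawlerFileFor($className, $baseDir)
{
    // Accept only plain identifiers so a name like "../x" cannot escape $baseDir.
    if (!preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $className)) {
        throw new InvalidArgumentException('Invalid crawler name: ' . $className);
    }
    return $baseDir . '/Crawler' . $className . '.php';
}

$path = crawlerFileFor('Cobaia', '/app');
echo $path, "\n";

// A name containing path separators is rejected before any file is touched.
$rejected = false;
try {
    crawlerFileFor('../secret', '/app');
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
```

Because the class name ends up inside a file path, validating it matters as soon as the name can come from user input.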
Let's build a simple feed class that reads the items from a blog's RSS feed.
CrawlerCobaia.php
```php
<?php
// require_once avoids redeclaring the interface when several crawlers are loaded.
require_once 'CrawlerInterface.php';

class Cobaia extends Crawler implements CrawlerInterface
{
    const URL = 'http://feeds.feedburner.com/cobaia/';

    public function __construct()
    {
        parent::__construct();
    }

    /**
     * Implements the getData method
     */
    public function getData()
    {
        // request the URL
        $this->request(self::URL);

        // filter by the item tag
        $news = pq('item');
        $result = array();

        // iterate over the results
        foreach ($news as $new) {
            // get the pubDate tag inside this item and return its text
            $date = pq('pubDate', $new)->text();
            $date = date('Y-m-d', strtotime($date));

            $result[] = array(
                // text of the link tag inside this item
                'url'   => pq('link', $new)->text(),
                // text of the title tag inside this item
                'title' => pq('title', $new)->text(),
                'date'  => $date,
            );
        }

        return $result;
    }
}
```
We implemented getData(), which requests the content, uses phpQuery to find the item tags and, inside each item, looks up pubDate, link and title.
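To see the shape of the array that getData() builds without touching the network, here is a minimal sketch that runs the same loop over an inline feed. Note that it uses SimpleXML instead of phpQuery so that it runs with a stock PHP install, and that the feed content and URLs are invented for illustration.

```php
<?php
// Hypothetical inline feed standing in for the real RSS response.
$rss = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <pubDate>Mon, 06 Sep 2010 16:20:00 +0000</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <pubDate>Tue, 07 Sep 2010 09:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
XML;

// Pin the timezone so date() gives the same answer on every machine.
date_default_timezone_set('UTC');

$feed = simplexml_load_string($rss);
$result = array();
foreach ($feed->channel->item as $item) {
    $result[] = array(
        'url'   => (string) $item->link,
        'title' => (string) $item->title,
        // Same normalization as in getData(): RSS date to Y-m-d.
        'date'  => date('Y-m-d', strtotime((string) $item->pubDate)),
    );
}

print_r($result);
```

Each entry is a plain associative array with the keys url, title and date, which is easy to store or to encode as JSON later.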
Let's test it:
```php
<?php
// require the CrawlerFactory.php
require 'CrawlerFactory.php';

// call the Cobaia class and retrieve the instance
$crawler = CrawlerFactory::factory('Cobaia');

// get the result
$result = $crawler->getData();

// show the result
var_dump($result);
```
Another example using the phpQuery selectors:
CrawlerPhpnet.php
```php
<?php
// require_once avoids redeclaring the interface when several crawlers are loaded.
require_once 'CrawlerInterface.php';

class Phpnet extends Crawler implements CrawlerInterface
{
    const URL = 'http://php.net';

    public function __construct()
    {
        parent::__construct();
    }

    /**
     * Implements the getData method
     */
    public function getData()
    {
        // request the URL
        $this->request(self::URL);

        // first h1 with the class "summary"
        $title = pq('h1.summary:first');

        // text of the link inside that heading
        return pq('a', $title)->text();
    }
}
```
In this example we get the title of the latest entry on the php.net website using a phpQuery CSS3 selector: h1.summary:first picks the first h1 with the class summary, and we then read the text of the link inside it.
Let's test it:
```php
<?php
// require the CrawlerFactory.php
require 'CrawlerFactory.php';

// call the Phpnet class and retrieve the instance
$crawler = CrawlerFactory::factory('Phpnet');

// get the result
$result = $crawler->getData();

// show the result
var_dump($result);
```
This was an example of what we can do with phpQuery: we built a simple crawler that gets data from a couple of sites. There is a lot more that can be done with it; phpQuery is very powerful and will help a lot in your development.
I put the code on GitHub; feel free to change it.