Extract website data using PHP

Web programmers often need to fetch data from other websites. Extracting specific data from another website is known as web scraping (or web harvesting). In this tutorial I will explain how to extract data from a website using PHP.

First, extract the complete HTML source of the webpage

PHP has a built-in function, file_get_contents, to do this:

$html = file_get_contents("http://www.somewebsite.com");

However, I do not recommend it for scraping: by default there is no timeout control (a timeout can be set via a stream context, but cURL gives much finer control over the request). So instead I will use the cURL library.
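For completeness, here is a minimal sketch of setting a timeout on file_get_contents with a stream context (the URL and the 10-second value are placeholders):

```php
<?php
// Build a stream context with a 10-second timeout (assumed value)
$context = stream_context_create([
    "http" => [
        "timeout" => 10, // seconds before the request gives up
    ],
]);

// Placeholder URL; replace with the site you want to fetch
$html = file_get_contents("http://www.somewebsite.com", false, $context);
```

Even so, cURL remains the better choice for scraping, since it also handles redirects, user agents, and error reporting in one place.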

Here is a PHP function to get the HTML source of a webpage using cURL:

function getHTML($url, $timeout)
{
    $ch = curl_init($url); // initialize cURL with the given URL
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // pass along the visitor's user agent
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects, if any
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to wait while connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout); // max. seconds for the whole request
    curl_setopt($ch, CURLOPT_FAILONERROR, true); // fail on HTTP error responses
    $html = @curl_exec($ch);
    curl_close($ch);
    return $html;
}

 

Now use a regular expression to extract the particular data from the source.

Here is PHP code to extract the title of a website:

$html = getHTML("http://www.website.com", 10);
preg_match("/<title>(.*?)<\/title>/si", $html, $match);
$title = $match[1];
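If you prefer to avoid regular expressions entirely, PHP's built-in DOM extension can do the same job with no external library; here is a minimal sketch on an inline HTML string (the sample markup stands in for a fetched page):

```php
<?php
// Sample HTML standing in for a page fetched with getHTML()
$html = "<html><head><title>Example Page</title></head><body></body></html>";

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ silences warnings about imperfect real-world markup

// getElementsByTagName() returns a DOMNodeList; item(0) is the first <title>
$title = $doc->getElementsByTagName("title")->item(0)->nodeValue;

echo $title; // prints "Example Page"
```

This needs the DOM extension (enabled by default) and tolerates the messy HTML found on real pages better than a regex does.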

If you are not comfortable with regular expressions, there is a PHP class, Simple HTML DOM, that can parse HTML for you.

include_once("simple_html_dom.php");

// use cURL to get the HTML content
function getHTML($url, $timeout)
{
    $ch = curl_init($url); // initialize cURL with the given URL
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // pass along the visitor's user agent
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects, if any
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to wait while connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout); // max. seconds for the whole request
    curl_setopt($ch, CURLOPT_FAILONERROR, true); // fail on HTTP error responses
    $html = @curl_exec($ch);
    curl_close($ch);
    return $html;
}

// str_get_html() converts the raw HTML string into a Simple HTML DOM object;
// calling find() on the plain string would cause a fatal error
$html = str_get_html(getHTML("http://www.website.com", 10));

// Find all images on the webpage
foreach ($html->find("img") as $element) {
    echo $element->src . '<br>';
}

// Find all links on the webpage
foreach ($html->find("a") as $element) {
    echo $element->href . '<br>';
}

Simple HTML DOM supports CSS-style selectors just like jQuery, so it is very easy to use.
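To illustrate those jQuery-style selectors, here is a small sketch; the class name "post" and the markup are hypothetical, standing in for a fetched page:

```php
<?php
include_once("simple_html_dom.php");

// Hypothetical markup; in practice this would come from getHTML()
$html = str_get_html('<div class="post"><a href="/first">First</a></div><a href="/other">Other</a>');

// jQuery-style selector: only anchors inside elements with class "post"
foreach ($html->find("div.post a") as $element) {
    echo $element->href . "\n"; // matches only /first
}
```

Combining a tag, class, and descendant selector in one string like this saves the nested loops you would otherwise write by hand.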

