Web programmers often need to get data from another website. Extracting particular data from another website is also known as web scraping or web harvesting. In this tutorial I will explain how to extract data from a website using PHP.
First, extract the complete HTML source of the webpage.
PHP has a built-in function, file_get_contents(), to do this:
$html = file_get_contents("http://www.somewebsite.com");
But I would not recommend it, because the function itself has no timeout parameter. So instead I will use the cURL library.
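(For the record, a read timeout can be approximated by passing a stream context to file_get_contents(). Here is a minimal sketch of that workaround, using the same placeholder URL as above:)

$context = stream_context_create(array(
    "http" => array("timeout" => 10) // give up reading after 10 seconds
));
$html = @file_get_contents("http://www.somewebsite.com", false, $context);

Even so, cURL handles redirects, error reporting, and the user agent more gracefully, which is why I prefer it here.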
Here is a PHP function to get the HTML source of a webpage using cURL:
function getHTML($url, $timeout)
{
    $ch = curl_init($url); // initialize cURL with the given URL
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // forward the visitor's user agent
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects, if any
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to wait for a connection
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout); // max. seconds the whole request may take
    curl_setopt($ch, CURLOPT_FAILONERROR, true); // stop on HTTP error codes (400 and above)
    $html = @curl_exec($ch);
    curl_close($ch);
    return $html;
}
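A quick usage sketch: since curl_exec() returns false on failure (and the @ suppresses its warnings), it is worth checking the result before parsing.

$html = getHTML("http://www.somewebsite.com", 10); // placeholder URL
if ($html === false) {
    die("Could not fetch the page");
}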
Now use a regular expression match to extract the particular data from the source.
Here is PHP code to extract the title of a website:
$html = getHTML("http://www.website.com", 10);
preg_match("/<title>(.*)<\/title>/i", $html, $match);
$title = $match[1];
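The same approach extends to other tags. For instance, here is a rough sketch that collects every link with preg_match_all(); keep in mind that regular expressions are a blunt tool for HTML, so treat this as illustrative rather than robust.

preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $matches);
$links = $matches[1]; // array of every href value found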
If you don't know how to use regular expressions, there is a PHP class, simplehtmldom, that can parse the HTML for you.
include_once("simple_html_dom.php");

// getHTML() is the cURL helper defined above
$html = str_get_html(getHTML("http://www.website.com", 10));

// Find all images on the webpage
foreach ($html->find("img") as $element)
    echo $element->src . ' ';

// Find all links on the webpage
foreach ($html->find("a") as $element)
    echo $element->href . ' ';
SimpleHtmlDom supports CSS selectors just like jQuery, so it is very easy to use.
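For example, assuming $html is the object returned by str_get_html() above, the title extraction from the regex section reduces to a single find() call:

// Grab the first <title> element and read its text
$title = $html->find("title", 0)->plaintext;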
Is there any way to prevent someone from scraping my own website?
The only way is to encode the URL.
What do you do if the data doesn't load until a link is clicked? Currently I am using your method to return the raw data of this chart ( http://bitcoincharts.com/charts/btceUSD#rg1zig1-minztgCzm1g3zm2g4 ), but it's blank, because it takes an action before it populates.
I understand this is an old sample/post, but I tried your code and I get an error on the "find"...
What is this?
foreach($html->find("img")
Fatal error: Call to a member function find() on array
Thank you
You forgot to include the simplehtmldom library: http://simplehtmldom.sourceforge.net/
Use it like this:
<?php
include_once("simple_html_dom.php");
$html = str_get_html(getHTML("http://www.website.com", 10));
foreach($html->find("a") as $element)
    echo $element->href . ' ';
?>
Could you tell us how to send data from one website to another directly using PHP?
Why include simplehtmldom when the source only uses getHTML()?