<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Perplexed Labs &#187; parallel</title>
	<atom:link href="http://blog.perplexedlabs.com/tag/parallel/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.perplexedlabs.com</link>
	<description>web development war stories from the frontlines to the backend</description>
	<lastBuildDate>Sat, 24 Jul 2010 16:27:40 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>PHP Forking to Concurrency with pcntl_fork()</title>
		<link>http://blog.perplexedlabs.com/2010/03/02/php-forking-to-concurrency/</link>
		<comments>http://blog.perplexedlabs.com/2010/03/02/php-forking-to-concurrency/#comments</comments>
		<pubDate>Tue, 02 Mar 2010 13:00:10 +0000</pubDate>
		<dc:creator>Matt</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[concurrency]]></category>
		<category><![CDATA[fork]]></category>
		<category><![CDATA[parallel]]></category>
		<category><![CDATA[pcntl]]></category>
		<category><![CDATA[pcntl_fork]]></category>
		<category><![CDATA[pcntl_wait]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[process]]></category>
		<category><![CDATA[thread]]></category>
		<category><![CDATA[unix]]></category>

		<guid isPermaLink="false">http://blog.perplexedlabs.com/?p=376</guid>
		<description><![CDATA[I find it interesting and challenging to bend PHP in ways it probably shouldn't be bent. Almost always I walk away pleasantly surprised at its ability to solve a variety of problems. Consider this example. Let's say you want to take advantage of more than one core for a given process. Perhaps it performs many [...]


Related posts:<ol><li><a href='http://blog.perplexedlabs.com/2009/05/04/php-libmemcached-via-memcached-and-igbinary/' rel='bookmark' title='Permanent Link: PHP libmemcached via memcached and igbinary'>PHP libmemcached via memcached and igbinary</a></li>
<li><a href='http://blog.perplexedlabs.com/2009/05/04/php-jquery-ajax-javascript-long-polling/' rel='bookmark' title='Permanent Link: PHP jQuery AJAX Javascript Long Polling'>PHP jQuery AJAX Javascript Long Polling</a></li>
<li><a href='http://blog.perplexedlabs.com/2008/04/09/php-daisy-chain-class-method-calls/' rel='bookmark' title='Permanent Link: PHP Daisy Chain Class Method Calls'>PHP Daisy Chain Class Method Calls</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>I find it interesting and challenging to bend PHP in ways it probably shouldn't be bent.  Almost always I walk away pleasantly surprised at its ability to solve a variety of problems.</p>
<p>Consider this example.  Let's say you want to take advantage of more than one core for a given process.  Perhaps it performs many intensive computations and on a single core would take an hour to run.  Since a PHP process is single-threaded, it can only keep one core busy, no matter how many cores you have available.</p>
<p>Fortunately, via the Process Control (<a href="http://php.net/manual/en/book.pcntl.php">PCNTL</a>) extension, PHP provides a way to fork new child processes.  Forking duplicates the running process: the child starts out as a copy of the parent and continues executing from the point of the fork.  <a href="http://www.php.net/manual/en/function.pcntl-fork.php">pcntl_fork()</a> is the function that does this.</p>
<p>The framework for using this extension is as follows:</p>
<pre class="brush: php;">
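$maxChildren = 4;
$numChildren = 0;
foreach($unitsOfWork as $unit) {
	$pid = pcntl_fork();
	if($pid === -1) {
		// fork failed, no child was created
		die('could not fork');
	} elseif($pid === 0) {
		// child: do the work, then terminate so it never continues the parent's loop
		doWork($unit);
		posix_kill(getmypid(), SIGKILL);
	} else {
		// parent: once $maxChildren are running, block until one of them exits
		$numChildren++;
		if($numChildren == $maxChildren) {
			pcntl_wait($status);
			$numChildren--;
		}
	}
}
// reap the children that are still running once all work has been handed out
while($numChildren &gt; 0) {
	pcntl_wait($status);
	$numChildren--;
}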
</pre>
<p>When a new child is forked via pcntl_fork(), the call returns in both processes: the parent gets the child's pid, the child gets 0, and on failure the parent gets -1.  The if statement following the fork allows the child and parent to split their flow of execution based on who they are (i.e. the child does the work and kills itself; the parent checks whether it has hit the max number of children and waits, otherwise it creates another child).  The pcntl_wait() function is called when we hit $maxChildren; it blocks until a child exits.</p>
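<p>As a rough sketch of what the parent can learn from pcntl_wait(), the snippet below has each child end with exit() and reads the exit code back in the parent.  The exit codes 3 and 7 are made up for illustration, and this assumes the CLI with the pcntl extension loaded.  Note that it only applies when a child ends via exit(); a child killed by a signal has no exit code to report.</p>
<pre class="brush: php;">
$exitCodes = array();
foreach(array(3, 7) as $code) {      // hypothetical exit codes, one per child
	$pid = pcntl_fork();
	if($pid === -1) {
		die('could not fork');
	}
	if($pid === 0) {
		exit($code);                 // child: report a result through its exit status
	}
}
// parent: pcntl_wait() returns the pid of a child that exited, or -1 when none are left
while(($pid = pcntl_wait($status)) &gt; 0) {
	if(pcntl_wifexited($status)) {
		$exitCodes[$pid] = pcntl_wexitstatus($status);
	}
}
sort($exitCodes);                    // children may finish in any order
print_r($exitCodes);
</pre>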
<p>Remember, if you want to use database connections in your children, each child needs to open its own connection after the fork.  Resources such as database connections are inherited by the child and can't safely be shared between processes.</p>


<p>Related posts:<ol><li><a href='http://blog.perplexedlabs.com/2009/05/04/php-libmemcached-via-memcached-and-igbinary/' rel='bookmark' title='Permanent Link: PHP libmemcached via memcached and igbinary'>PHP libmemcached via memcached and igbinary</a></li>
<li><a href='http://blog.perplexedlabs.com/2009/05/04/php-jquery-ajax-javascript-long-polling/' rel='bookmark' title='Permanent Link: PHP jQuery AJAX Javascript Long Polling'>PHP jQuery AJAX Javascript Long Polling</a></li>
<li><a href='http://blog.perplexedlabs.com/2008/04/09/php-daisy-chain-class-method-calls/' rel='bookmark' title='Permanent Link: PHP Daisy Chain Class Method Calls'>PHP Daisy Chain Class Method Calls</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://blog.perplexedlabs.com/2010/03/02/php-forking-to-concurrency/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>PHP Parallel Web Scraper</title>
		<link>http://blog.perplexedlabs.com/2008/12/17/php-parallel-web-scraper/</link>
		<comments>http://blog.perplexedlabs.com/2008/12/17/php-parallel-web-scraper/#comments</comments>
		<pubDate>Wed, 17 Dec 2008 20:56:00 +0000</pubDate>
		<dc:creator>Matt</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[curl]]></category>
		<category><![CDATA[parallel]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[scrape]]></category>

		<guid isPermaLink="false">http://www.perplexedlabs.com/?p=90</guid>
		<description><![CDATA[Data is the most fundamental component of today's web applications. Scraping and combining data from multiple sources to enhance, re-calculate, and re-display is an everyday occurrence. Scraping a list of URLs one at a time is just about the slowest possible way to do it. Fortunately, PHP 5.2+, via its curl_multi_* functions, gives us a way of [...]


Related posts:<ol><li><a href='http://blog.perplexedlabs.com/2009/04/22/php-named-parameters/' rel='bookmark' title='Permanent Link: PHP Named Parameters'>PHP Named Parameters</a></li>
<li><a href='http://blog.perplexedlabs.com/2008/02/04/php-array-to-string/' rel='bookmark' title='Permanent Link: PHP Array to String'>PHP Array to String</a></li>
<li><a href='http://blog.perplexedlabs.com/2008/02/12/php-fast-large-megabyte-data-transfer-between-sessions/' rel='bookmark' title='Permanent Link: PHP fast, large (megabyte), data transfer between sessions'>PHP fast, large (megabyte), data transfer between sessions</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Data is the most fundamental component of today's web applications.  Scraping and combining data from multiple sources to enhance, re-calculate, and re-display is an everyday occurrence.</p>
<p>Scraping a list of URLs one at a time is just about the slowest possible way to do it.  Fortunately, PHP 5.2+, via its curl_multi_* functions, gives us a way of downloading the data in parallel.  These functions are poorly documented compared to much of the PHP standard library; however, the process is still fairly straightforward.</p>
<p>What I attempted to do was abstract away the fundamental aspects of grouping and retrieving a large list of URLs to scrape.  The usage is simple:</p>
<pre class="brush: php;">
$list = array('MSFT', 'GOOG', 'YHOO', 'INTC', 'AAPL',
    'CSCO', 'C', 'CEDC', 'IBM', 'ORCL', 'SAP', 'CA');

function urlFunc($data) {
   return 'http://finance.yahoo.com/q/ks?s='.$data;
}

function processFunc($k, $data) {
   echo 'processing html for '.$k.&quot;\n&quot;;
   return $data; // the return value is what the next function receives
}

function dbFunc($k, $data) {
   echo 'storing scraped data to db for '.$k.&quot;\n&quot;;
}

Scraper::scrape($list, 4, 'urlFunc', 'processFunc', 'dbFunc');
</pre>
<p>Note: the argument list after the 3rd parameter is dynamic, i.e. you can add any number of functions to otherwise process, manipulate, or store the data.  They are called in order for each URL as its response arrives.  Each one receives the item's key and the return value of the previous call; the first receives the raw response body.</p>
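<p>To make the chaining concrete without touching the network, here is a minimal sketch of how a response body flows through the callbacks.  runChain(), upperFunc() and labelFunc() are hypothetical names used only for this illustration; they are not part of the class.</p>
<pre class="brush: php;">
// hypothetical helper: feeds $body through each function in turn
function runChain($key, $body, $funcs) {
	$return = $body;
	foreach($funcs as $func) {
		// every function gets the item's key plus the previous return value
		$return = call_user_func_array($func, array($key, $return));
	}
	return $return;
}

function upperFunc($k, $data) {      // hypothetical processing step
	return strtoupper(trim($data));
}

function labelFunc($k, $data) {      // hypothetical final step
	return $k.': '.$data;
}

echo runChain('MSFT', '  hello  ', array('upperFunc', 'labelFunc')).PHP_EOL;
</pre>
<p>If a function in the middle of the chain returns nothing, the next one receives null, so each step should return whatever the following step needs.</p>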
<p>Here is the class listing:</p>
<pre class="brush: php;">
class Scraper
{
	// must be static: it is read as self::$curlOptions from the static methods below
	private static $curlOptions = null;

	/**
	 * Wrapper to scrape a generic list of items, in groups, in parallel
	 *
	 * Argument list is dynamic, functions are called sequentially...
	 * i.e. $list is divided into groups of $groupSize items, and each item is passed to
	 * $urlFunc (if given) to build its URL.  The data fetched from each URL is passed to the
	 * first function specified in the dynamic arguments, its return value is then
	 * passed to the next function specified, and so on...
	 *
	 * @access public
	 * @param array $list array of data
	 * @param int $groupSize size of chunk to process in parallel
	 * @param callback $urlFunc optional function that turns a list item into a URL
	 * @return bool whether the operation was successful
	 */
	static public function scrape($list, $groupSize = 10, $urlFunc = null)
	{
		$args = func_get_args();
		$funcs = array_slice($args, 3);

		if(!is_array($list) || !count($list) || empty($groupSize)) {
			return false;
		}

		$group = array();
		$c = 0;
		$i = 0;
		$total = count($list);
		foreach($list as $k =&gt; $v) {
			if(!empty($urlFunc) &amp;&amp; is_callable($urlFunc)) {
				$v = call_user_func_array($urlFunc, array($v));
			}
			$group[$k] = $v;
			$c++;
			$i++;
			if(($c == $groupSize) || ($i == $total)) {
				self::getMulti($group, $funcs);
				$c = 0;
				$group = array();
			}
		}

		return true;
	}

	/**
	 * Performs the parallel retrieval of an arbitrary list of urls
	 *
	 * Passed funcs are called sequentially, as requests complete and
	 * data is available, with the return value of the previous
	 * function call...
	 *
	 * @access private
	 * @param array $urls array of URLs
	 * @param array $funcs array of functions to call as data returns
	 */
	static private function getMulti($urls, $funcs = array())
	{
		$curl = array();
		$multi = curl_multi_init();
		foreach($urls as $k =&gt; $v) {
			$curl[$k] = curl_init();

			curl_setopt($curl[$k], CURLOPT_URL, $v);
			curl_setopt($curl[$k], CURLOPT_RETURNTRANSFER, true);

			if(!empty(self::$curlOptions)) {
				curl_setopt_array($curl[$k], self::$curlOptions);
			}

			curl_multi_add_handle($multi, $curl[$k]);
		}

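		$running = null;
		do {
			// drive all transfers; $running receives the number still active
			curl_multi_exec($multi, $running);
			// hand every finished transfer to the callback chain
			while(($info = curl_multi_info_read($multi)) !== false) {
				$key = array_search($info['handle'], $curl, true);
				$return = curl_multi_getcontent($info['handle']);
				curl_multi_remove_handle($multi, $info['handle']);
				foreach($funcs as $func) {
					$return = call_user_func_array($func, array($key, $return));
				}
			}
			// wait for socket activity instead of spinning the CPU;
			// some builds return -1 immediately, so sleep briefly in that case
			if($running &gt; 0 &amp;&amp; curl_multi_select($multi, 1.0) === -1) {
				usleep(1000);
			}
		} while($running &gt; 0);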

		curl_multi_close($multi);
	}
}
</pre>
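<p>As an aside, the grouping that scrape() does by hand with its $c and $i counters could probably also be expressed with array_chunk().  The following is only a sketch of that idea under the assumption that the keys need to be preserved (they are passed to the callbacks); the keys and values in $list are made up for illustration.</p>
<pre class="brush: php;">
$list = array('a' =&gt; 'MSFT', 'b' =&gt; 'GOOG', 'c' =&gt; 'YHOO', 'd' =&gt; 'INTC', 'e' =&gt; 'AAPL');
$groupSize = 2;

// third argument true keeps the original keys instead of renumbering each chunk
$groups = array_chunk($list, $groupSize, true);

foreach($groups as $group) {
	// each $group is what would be handed to getMulti() in one go
	echo implode(',', array_keys($group)).PHP_EOL;
}
</pre>
<p>The last chunk simply holds whatever is left over, which is what the ($i == $total) check handles in the class.</p>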


<p>Related posts:<ol><li><a href='http://blog.perplexedlabs.com/2009/04/22/php-named-parameters/' rel='bookmark' title='Permanent Link: PHP Named Parameters'>PHP Named Parameters</a></li>
<li><a href='http://blog.perplexedlabs.com/2008/02/04/php-array-to-string/' rel='bookmark' title='Permanent Link: PHP Array to String'>PHP Array to String</a></li>
<li><a href='http://blog.perplexedlabs.com/2008/02/12/php-fast-large-megabyte-data-transfer-between-sessions/' rel='bookmark' title='Permanent Link: PHP fast, large (megabyte), data transfer between sessions'>PHP fast, large (megabyte), data transfer between sessions</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://blog.perplexedlabs.com/2008/12/17/php-parallel-web-scraper/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
