How to create a crawler

From How2s

You can create a crawler that performs an operation on your database, e.g. updating database entries one ID after another. To do this, create a PHP script, e.g. update.php, and pass it the ID to process with every request (update.php?id=...). Each request processes one ID and then moves on, always advancing to the next ID you wish to process.

You need to do two things:

1.) Create the script which does the update.
2.) Have a server which will allow you to run a browser in a CLI window, which will "crawl" from one ID to another.

Moving from one ID to the other can be achieved with this PHP snippet, which outputs a simple HTML page containing a meta refresh to the next ID:

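<?php
// ID of the data set to process in this request, taken from the URL (?id=...).
$this_id = isset($_GET['id']) ? (int) $_GET['id'] : 0;

// something you want to do with the data set of $this_id.

// ID of the next data set to process (placeholder, replace with your own query).
$next_id = "SOME DB QUERY";

// If there is no further ID (the next ID equals the current one), wait 10000 seconds
// before reloading; otherwise move on to the next ID after 10 seconds.
$refresh = ($this_id == $next_id ? 10000 : 10);

		echo '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
			"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

		<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
		<head>
			<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

			<title>geotag</title>
			<meta http-equiv="refresh" content="' . $refresh . ';url=http://the-url-to-your-script.com/update.php?id=' . $next_id . '"/>
		</head>

		<body>
		Next ID: ' . $next_id . '
		</body>
		</html>';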

Then you need to install a program called "screen" on your server. It lets you keep a CLI session running even when you are not connected. To install it on Debian:

apt-get install screen
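
The browser used further down, links, has to be present on the server as well. A small check like the following sketch can report which of the two tools is still missing. The function name missing_tools is only an illustrative assumption, not part of screen or links.

```shell
# Print the name of every given command that is not found in PATH.
# missing_tools is a hypothetical helper name chosen for this sketch.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}
```

Running `missing_tools screen links` prints nothing when both are installed. Anything it does print still needs to be installed with apt-get (on Debian the browser package is assumed here to be called links).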

Start screen:

screen

Start the browser *links* with the parameter *-html-auto-refresh 1*. This forces auto-refresh. Normally auto-refresh is off, because the CLI browser would require input from the user to execute the auto-refresh. This is mainly used for blind people.

For example:

links "http://the-url-to-your-script.com/update.php?id=13665" -html-auto-refresh 1

You can then detach from screen by pressing CTRL+A at the same time, then D (for "detach").

You can resume a screen by typing:

screen -r
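
When several screen sessions are running, a plain screen -r cannot tell which one to resume. The same workflow with a named session could be sketched as follows. The session name "crawler" and the helper function names are assumptions made for illustration, while -dmS, -ls and -r are standard screen options.

```shell
# Session name and helper names are illustrative assumptions.
SESSION="crawler"

# Build the start URL for a given ID (same URL pattern as the links example).
crawl_url() {
    printf 'http://the-url-to-your-script.com/update.php?id=%s\n' "$1"
}

# Start links inside a detached, named screen session,
# so no manual CTRL+A, D is needed afterwards.
start_crawler() {
    screen -dmS "$SESSION" links "$(crawl_url "$1")" -html-auto-refresh 1
}

# Reattach to the named session to watch the crawler.
resume_crawler() {
    screen -r "$SESSION"
}
```

With these definitions, `start_crawler 13665` starts the crawl at ID 13665, `screen -ls` lists the running sessions, and `resume_crawler` reattaches to the crawler.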