How to create a crawler
You can create a crawler that works through your database, e.g. updating one database entry after another by ID. To do this, create a PHP script, e.g. update.php, that receives the ID with every request. Each request processes one ID and then moves on, always incrementing to the next ID you wish to process.
You need two things:
1.) A script which does the update.
2.) A server which will allow you to run a browser in a CLI window, which will "crawl" from one ID to the next.
Moving from one ID to the next can be achieved by having the script output a simple HTML page with a meta refresh tag:
<?php
// Assumption: the current ID arrives as the "id" request parameter.
$this_id = (int) $_GET['id'];

// something you want to do with the data set of $this_id.

$next_id = "SOME DB QUERY"; // placeholder: look up the next ID to process

// Wait 10 seconds before the next request; if there is no new ID, wait 10000 seconds.
$refresh = ($this_id == $next_id ? 10000 : 10);

echo '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>geotag</title>
<meta http-equiv="refresh" content="' . $refresh . ';url=http://the-url-to-your-script.com/update.php?id=' . $next_id . '"/>
</head>
<body>
Next ID: ' . $next_id . '
</body>
</html>';
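Before leaving the crawler to run unattended, it can help to check that the script really emits a refresh tag pointing at the next ID. The following is a minimal sketch, assuming the tag has the form content="SECONDS;url=URL". The function name next_url_from_html is made up for illustration, and using curl to fetch the page is an assumption (any tool that prints the page would do).

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and print the URL found in the
# meta refresh tag, assuming the form content="SECONDS;url=URL".
next_url_from_html() {
    sed -n 's/.*http-equiv="refresh" content="[0-9]*;url=\([^"]*\)".*/\1/p'
}

# Example (assumes curl is installed):
#   curl -s "http://the-url-to-your-script.com/update.php?id=13665" | next_url_from_html
```

If nothing is printed, the refresh tag is missing or malformed, and the browser would have nowhere to go next.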
Then you need to install a program called "screen" on your server, which keeps a CLI session running even when you're not connected. To install it on Debian:
apt-get install screen
Start screen:
screen
Inside screen, start the browser *links* with the parameter *-html-auto-refresh 1*, which forces auto refresh. Auto refresh is normally off, so the CLI browser would otherwise require input from the user before following the refresh. (CLI browsers of this kind are mainly used by blind people.)
For example:
links "http://the-url-to-your-script.com/update.php?id=13665" -html-auto-refresh 1
You can then detach from screen by pressing CTRL+A at the same time, then D (for "detach").
You can resume a detached screen session by typing:
screen -r
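The steps above can also be combined so that screen starts links directly in a detached session, without attaching and detaching by hand. This is only a sketch under assumptions: the session name crawler and the function names start_url and start_crawler are made up for illustration, and the URL is the placeholder used on this page. The -dmS option of screen starts a named session without attaching to it.

```shell
#!/bin/sh
BASE_URL="http://the-url-to-your-script.com/update.php"  # placeholder URL from this page
SESSION="crawler"                                         # assumed session name

# Build the start URL for a given ID.
start_url() {
    printf '%s?id=%s\n' "$BASE_URL" "$1"
}

# Start links in a detached, named screen session.
start_crawler() {
    screen -dmS "$SESSION" links "$(start_url "$1")" -html-auto-refresh 1
}

# Example:
#   start_crawler 13665    # begin crawling at ID 13665
#   screen -r crawler      # reattach later to watch progress
```

Naming the session makes it easier to find again with screen -r if several screen sessions are running on the same server.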

