Book Review: Webbots, Spiders, and Screen Scrapers
The Internet began as a mechanism for sharing information over long distances. First we had email and collaborative newsgroups. Gopher later provided a series of links between libraries, and then came the sharing of information across different platforms using common code: HTML. This development at CERN by Tim Berners-Lee was fairly useful, but was initially text-based.
Work at the National Center for Supercomputing Applications (NCSA) on the Mosaic browser led to what would become Netscape, the graphical browser made famous by Marc Andreessen and Jim Clark. Internet use has expanded from a few thousand users to over a billion, with an estimated 125 million websites or more as of March 2007.
Most of us are familiar with search engines like Google or Yahoo!, which seek out the information available online so that, when we enter search parameters, up comes a list of possible links, like magic.
When I have looked at statistics for my own eXtensions website I see that several of the hits each day are not from users but from automated processes that track down information -- words, ideas, phrases -- on websites then file them away. A search engine like Byte Shark will also use logos, photographs, seals, pictures, diagrams, graphics, charts and more.
These processes are known as webbots and spiders. Michael Schrenk has produced a complex work on the subject that has the advantage that it is quite easy to read and work with: not always the case with books that need us to work with code or scripting. The screen scraper of the title is the part of the process that extracts data from the display output of another program: integral to the practical working of the webbot or spider.
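To make the idea of a screen scraper concrete, here is a minimal sketch of my own. It is not code from the book, and Python is used purely for illustration. It assumes the page has already been downloaded into a string (SAMPLE_PAGE is an invented stand-in, and the names LinkScraper and scrape are hypothetical) and simply pulls out the title and the links, which is the kind of raw material a spider would then follow.

```python
from html.parser import HTMLParser

# Invented page content standing in for a page a webbot has downloaded.
SAMPLE_PAGE = """
<html><head><title>Example Store</title></head>
<body>
<a href="/widgets">Widgets</a>
<a href="/gadgets">Gadgets</a>
</body></html>
"""


class LinkScraper(HTMLParser):
    """Collects the page title and every link target from the markup."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears inside the <title> element
        if self._in_title:
            self.title += data


def scrape(page):
    """Return (title, links) extracted from an HTML string."""
    scraper = LinkScraper()
    scraper.feed(page)
    return scraper.title.strip(), scraper.links


title, links = scrape(SAMPLE_PAGE)
print(title)
print(links)
```

A real webbot would fetch the page over the network first and would need to cope with malformed markup, both of which this sketch leaves out.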
The code here is, in the main, PHP (originally Personal Home Page Tools) and HTML, along with cURL, a command-line tool for transferring files with URL syntax (this is installed with OS X and accessible using Terminal). Schrenk is dealing with scripting that is easy to comprehend, and his writing style makes the whole exercise far less of a chore than some authors' efforts. I found myself racing through several chapters, taking in the concepts presented with none of the eye-drooping that such technical books can sometimes promote.
Perhaps the best part of the style is the way that Schrenk shares his experiences with us, letting us know that he too has made mistakes and that this is a learning process. He is also quite firm on the ethics and legality of some of the scripts that we might produce. It is up to those who write the scripts to ensure that they are not crossing the bounds of acceptable practice.
While many of us may only think of the webbot or spider as a part of the search engine's armoury, this book details several ways in which we can make these tools work for our own sites or can use them in specific ways to seek out data that we might find useful. This could be for academic purposes but, perhaps more valuable, it can also be for commercial reasons. Business depends on up-to-the-minute information: knowing who is visiting your site or information about competitors' site traffic will allow better analysis of trends.
The book is in four main parts: Fundamental Concepts and Techniques; Projects, with several chapters on different types of webbots; Advanced Technical Considerations; and Larger Considerations, including Appendices.
There are two Contents sections: the first has the basic outlines of sections and chapters, while the second, of some ten pages, breaks down the chapters and the three appendices into their component parts. There is also a detailed Index, of some fourteen small-print pages.
As is increasingly common these days, the author also uses a website to complement the printed text. Several scripts that he deals with in the book are available for download, along with a considerable quantity of other useful and up-to-date information.
While most of the screen shots are of Windows applications, Schrenk makes it clear that the processes are multi-platform and points out where any major differences occur. For example, in the chapter concerned with Scheduling, he discusses the Windows task scheduler but opens the chapter with, "In Unix, Linux and Mac OS X environments, you can always use the cron command. . . ."
He also suggests several possibilities for delivery of the data over and above the basic browser page, for example email or SMS messages, demonstrating the wide range of technologies that the basic search can now take advantage of.
This brief review fails to do justice to how useful web developers and analysts will find this rich work, or to its value for students who are interested in this area of the web.