DESIGN AND IMPLEMENTATION OF A HIGH PERFORMANCE WEB CRAWLER FOR INFORMATION EXTRACTION
*ILO Somtoochukwu F., Victor Onuchi, Akuma Uche and Okah, Paul-Kingsley
ABSTRACT
Broad web search engines, as well as many more specialized search tools, rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, so robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and operating system limits must be taken into account in order to achieve high performance at a reasonable cost. In this study, we describe the design and implementation of a high performance web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss its performance bottlenecks, and describe efficient techniques for achieving high performance. An algorithm was developed for the web crawler to download the web pages that are to be indexed by the search engine, and, based on this algorithm, the source code was written in PHP, a scripting language suited to web development that can be embedded into HTML. The source code was then integrated into an Apache server environment for automatic search engine fetch sequencing. The program's workability was tested at layer 7 of the OSI model using the Hypertext Transfer Protocol (HTTP).
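To illustrate the fetch-and-extract loop the abstract summarizes, the following minimal PHP sketch downloads pages over HTTP and enqueues the links it finds. It is an assumption-laden illustration, not the authors' source code: the seed URL, page limit, and user-agent string are placeholders, and it relies only on PHP's standard cURL and DOM extensions.

<?php
// Illustrative sketch of a breadth-first crawl loop (not the authors' code).

// Fetch a single page over HTTP (OSI layer 7) and return its body, or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_USERAGENT      => 'ExampleCrawler/1.0', // placeholder identifier
    ]);
    $body = curl_exec($ch);
    $ok   = ($body !== false && curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200);
    curl_close($ch);
    return $ok ? $body : null;
}

// Extract absolute http(s) links from an HTML body.
// Relative URLs are skipped here for brevity; a real crawler would resolve them.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);               // suppress warnings from malformed HTML
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (preg_match('#^https?://#i', $href)) {
            $links[$href] = true;         // array keys deduplicate URLs
        }
    }
    return array_keys($links);
}

// Breadth-first crawl: a queue of frontier URLs and a visited set.
$queue   = ['https://example.com/'];      // placeholder seed URL
$visited = [];
$limit   = 100;                           // stop after 100 pages

while ($queue && count($visited) < $limit) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue;
    }
    $visited[$url] = true;
    $html = fetchPage($url);
    if ($html === null) {
        continue;
    }
    // Here the page body would be handed to the indexer; we only enqueue links.
    foreach (extractLinks($html) as $link) {
        if (!isset($visited[$link])) {
            $queue[] = $link;
        }
    }
}
?>

A crawler at the scale the abstract describes would additionally need per-host politeness delays, robots.txt handling, and a persistent frontier queue so that a system crash does not lose crawl state.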