
World Journal of Engineering
Research and Technology

An International Peer Reviewed Journal for Engineering Research and Technology

ISSN 2454-695X

Impact Factor : 4.326

ICV : 79.45

DESIGN AND IMPLEMENTATION OF A HIGH PERFORMANCE WEB CRAWLER FOR INFORMATION EXTRACTION

*ILO Somtoochukwu F., Victor Onuchi, Akuma Uche and Okah, Paul-Kingsley

ABSTRACT

Broad web search engines, as well as many more specialized search tools, rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a crawler may interact with millions of hosts over a period of weeks or months, so robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and operating system limits must be taken into account to achieve high performance at a reasonable cost. In this study, we describe the design and implementation of a high-performance web crawler that runs on a network of workstations. The crawler scales to at least several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the performance bottlenecks, and describe efficient techniques for achieving high performance. An algorithm was developed for the crawler to download the web pages that are to be indexed by the search engine. Based on this algorithm, the source code was written in PHP, a scripting language that is suited to web development and can be embedded in HTML. The source code was then integrated into the Apache server environment for automatic web search engine fetch sequencing. The program was test-run at layer 7 (the application layer) of the OSI model using the Hypertext Transfer Protocol (HTTP).
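
The abstract does not reproduce the authors' algorithm or their PHP source. As a rough illustration of the fetch-and-follow loop that a crawler of this kind is typically built around, the following is a minimal single-process sketch in Python rather than PHP. The names `crawl`, `fetch_http`, and `LinkParser`, and the `max_pages` limit, are assumptions made for this example and do not come from the paper. The sketch leaves out robots.txt handling, politeness delays, crash recovery, and distribution across workstations.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_http(url, timeout=10):
    """Download a page over HTTP and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed, fetch=fetch_http, max_pages=50):
    """Breadth-first crawl from `seed`; returns {url: html} for fetched pages."""
    frontier = deque([seed])   # URLs waiting to be downloaded
    seen = {seed}              # URLs already queued, to avoid repeats
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue           # skip pages that fail to download; keep crawling
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Passing the download function in as `fetch` keeps the traversal logic separate from the network layer, so the loop can be exercised against an in-memory set of pages before it is pointed at real hosts. The downloaded pages returned by `crawl` would then be handed to whatever indexing step the search engine uses.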
