Web Crawler On Client Machine

The World Wide Web is a rapidly growing and changing information source. Due to the dynamic nature of the Web, it becomes harder to find relevant and recent information. We present a new model and architecture for a Web crawler that uses multiple HTTP connections to the WWW. The multiple HTTP connections are implemented using multiple threads and an asynchronous downloader module so that the overall downloading process is optimized. The user specifies the start URL from the GUI provided, and the crawler starts with that URL to visit. As the crawler visits the URL, it identifies all the hyperlinks in the web page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are visited recursively, and crawling stops once it reaches more than five levels from the home page of each website visited; it is concluded that it is not necessary to go deeper than five levels from the home page to capture most of the pages actually visited by people trying to retrieve information from the internet. The web crawler system is designed to be deployed on a client computer, rather than on mainframe servers that require complex management of resources, while still providing the same information to a search engine as other crawlers do.

Keywords: multiple HTTP connections, multi-threading, asynchronous downloader.

A web crawler is a program or an automated script which browses the World Wide Web in a methodical, automated manner. Web crawlers, also known as web spiders, web robots, worms, walkers and wanderers, are almost as old as the web itself. The first crawler, Matthew Gray's Wanderer, was written in the spring of 1993, roughly coinciding with the first release of NCSA Mosaic. Due to the explosion of the web, web crawlers have become an essential component of all search engines and are increasingly important in data mining and other indexing applications.

Many legitimate sites, in particular search engines, use crawling as a means of providing up-to-date data. Web crawlers are mainly used to index the links of all the visited pages for later processing by a search engine.

Such search engines rely on massive collections of web pages that are acquired with the help of web crawlers, which traverse the web by following hyperlinks and store the downloaded pages in a large database that is later indexed for efficient execution of user queries. Despite the numerous applications for Web crawlers, at their core they are all fundamentally the same. The following is the process by which Web crawlers work (a minimal sketch of this loop is given after the list):

1. Download the Web page.
2. Parse through the downloaded page and retrieve all the links.
3. For each link retrieved, repeat the process.
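
To make the three steps concrete, a minimal single-threaded sketch in Python is given below. It is only an illustration of the generic loop, not the system proposed in this paper; the LinkParser class, the crawl function and the commented-out start URL are names introduced here for the example.

    # Minimal illustration of the download / parse / repeat loop (steps 1-3).
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects the href target of every <a> tag on a page."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    def crawl(url, visited):
        if url in visited:
            return
        visited.add(url)
        page = urlopen(url, timeout=10).read().decode("utf-8", "replace")  # 1. download
        parser = LinkParser(url)
        parser.feed(page)                                                  # 2. parse links
        for link in parser.links:                                          # 3. repeat
            crawl(link, visited)

    # crawl("http://example.com/", set())   # hypothetical start URL
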
The Web crawler can be used for crawling through a whole site on the Internet or an intranet. You specify a start URL and the crawler follows all links found in that HTML page. This usually leads to more links, which will be followed again, and so on. A site can thus be seen as a tree structure: the root is the start URL, all links in that root HTML page are direct sons of the root, and subsequent links are then sons of the previous sons.
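
The abstract's five-level depth limit can be expressed directly on this tree view: links found on the root are at level one, their links are at level two, and pages more than five levels below the start URL are ignored. The sketch below is only an illustration under those assumptions; extract_links is a crude stand-in for a real HTML parser, and MAX_DEPTH simply encodes the limit stated above.

    # Follow links only up to five levels below the start URL (the tree root).
    import re
    from urllib.parse import urljoin
    from urllib.request import urlopen

    MAX_DEPTH = 5                     # depth limit taken from the text above
    HREF = re.compile(r'href="([^"#]+)"')

    def extract_links(html, base_url):
        # Crude stand-in for a real HTML parser.
        return [urljoin(base_url, href) for href in HREF.findall(html)]

    def crawl_site(url, depth=0, visited=None):
        visited = set() if visited is None else visited
        if depth > MAX_DEPTH or url in visited:
            return visited            # deeper levels are ignored
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            return visited
        for link in extract_links(html, url):
            crawl_site(link, depth + 1, visited)   # a child is one level deeper
        return visited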

A single URL server serves lists of URLs to a number of crawlers. A Web crawler starts by parsing a specified web page, noting any hypertext links on that page that point to other pages. They then parse those pages for new links, and so on, recursively. Web-crawler software does not actually move around to different computers on the Internet, as viruses or intelligent agents do. Each crawler keeps roughly 300 connections open at once; this is necessary to retrieve web pages at a fast enough pace. A crawler resides on a single machine and simply sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on links. All the crawler really does is automate the process of following links.
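
One common way to keep on the order of 300 connections open at once from a single machine is a fixed pool of worker threads, each issuing ordinary HTTP requests. The sketch below illustrates that idea only; the pool size follows the figure quoted above, and the seed URLs are hypothetical.

    # Keep many HTTP connections open at once by fetching pages from a thread pool.
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    MAX_CONNECTIONS = 300             # roughly 300 simultaneous connections, as above

    def fetch(url):
        # An ordinary HTTP request, just like a browser following a link.
        try:
            with urlopen(url, timeout=10) as response:
                return url, response.read()
        except OSError as exc:
            return url, exc

    seed_urls = ["http://example.com/", "http://example.org/"]   # hypothetical seeds

    with ThreadPoolExecutor(max_workers=MAX_CONNECTIONS) as pool:
        for url, result in pool.map(fetch, seed_urls):
            size = len(result) if isinstance(result, bytes) else result
            print(url, "->", size)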

Web crawling can be regarded as processing items in a queue. When the crawler visits a web page, it extracts links to other web pages. The crawler then puts these URLs at the end of a queue and continues crawling with a URL that it removes from the front of the queue.
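
Treating crawling as queue processing amounts to a breadth-first traversal of the link graph. The sketch below shows that discipline; fetch_links is a hypothetical helper that downloads a page and returns its hyperlinks, and max_pages is an added bound that keeps the example finite.

    # New URLs are appended to the back of the queue; the next page to crawl
    # is removed from the front.
    from collections import deque

    def crawl_queue(start_url, fetch_links, max_pages=100):
        frontier = deque([start_url])          # the crawl frontier (FIFO queue)
        visited = set()
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()           # take a URL from the front
            if url in visited:
                continue
            visited.add(url)
            for link in fetch_links(url):      # links extracted from the page
                if link not in visited:
                    frontier.append(link)      # put new URLs at the end
        return visited
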
Crawlers consume resources: network bandwidth to download pages, memory to maintain private data structures in support of their algorithms, CPU to evaluate and select URLs, and disk storage to store the text and links of fetched pages as well as other persistent data.