Web search engines and some other sites use Web crawling or spidering software to update their own web content or their indices of other sites' web content. Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so that users can search them more efficiently. Crawlers consume resources on the systems they visit and often visit sites without approval. Issues of schedule, load, and "politeness" come into play when large collections of pages are accessed, and mechanisms exist for public sites not wishing to be crawled to make this known to the crawling agent. Partly for these reasons, search engines struggled to give relevant search results in the early years of the World Wide Web, before 2000.
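The most common mechanism for signalling that a site should not be crawled is the robots.txt file, which a polite crawler fetches and obeys before requesting any pages. As a minimal sketch, Python's standard-library `urllib.robotparser` can evaluate such rules; the robots.txt content and the crawler name "MyCrawler" below are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
# In practice a crawler would fetch https://example.com/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler checks each URL before fetching it...
print(rp.can_fetch("MyCrawler", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/public/page.html"))   # True

# ...and honours the requested delay between requests (seconds).
print(rp.crawl_delay("MyCrawler"))  # 10
```

Respecting `Crawl-delay` between requests is one concrete form of the "politeness" and load concerns mentioned above.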
Today, relevant results are returned almost instantly. Crawlers can also be used to validate hyperlinks and HTML code. A web crawler is also known as a spider. A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the retrieved pages and adds them to the list of URLs to visit, called the crawl frontier. Pages downloaded by the crawler are stored in an archive known as the repository, which is designed to store and manage the collection of web pages.
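The seed/frontier/repository cycle described above can be sketched as a simple breadth-first traversal. The sketch below is an assumption-laden toy: the `link_graph` dictionary stands in for actually downloading and parsing pages, and the URLs are invented for illustration.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for fetched pages;
# a real crawler would download each URL and parse its HTML for links.
link_graph = {
    "https://example.com/":  ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/"],
}

def crawl(seeds):
    frontier = deque(seeds)   # the crawl frontier: URLs waiting to be visited
    visited = set(seeds)
    repository = []           # stands in for the page repository
    while frontier:
        url = frontier.popleft()
        repository.append(url)            # "store" the downloaded page
        for link in link_graph.get(url, []):
            if link not in visited:       # avoid re-crawling the same URL
                visited.add(link)
                frontier.append(link)
    return repository

print(crawl(["https://example.com/"]))
# → ['https://example.com/', 'https://example.com/a',
#    'https://example.com/b', 'https://example.com/c']
```

Using a queue gives breadth-first order; a real crawler would typically prioritise the frontier (by page importance, freshness, or per-host politeness limits) rather than visiting URLs strictly first-in, first-out.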