A web crawler, also referred to as a spider, spider bot, or simply crawler, is an internet bot that systematically browses the internet in order to perform web indexing. Web search engines use crawling (spidering) software to keep their content up to date. To make users' searches more efficient, web crawlers copy pages, which search engines then index.
Crawlers can visit systems without approval and consume their resources, so when large numbers of web pages are accessed, issues of server load, scheduling, and "politeness" come into play. Public sites can set parameters that keep a crawler away: for instance, a robots.txt file can ask bots to index only certain portions of a site, or nothing at all.
Complete indexing is impossible even for the largest crawlers, because the number of web pages on the internet is effectively infinite. In the early days of the World Wide Web, before 2000, search engines struggled to return relevant results; today they display relevant results almost instantly.
A crawler can also perform web scraping and validate HTML code and hyperlinks. Web crawlers start from a list of URLs called seeds. When the crawler visits a URL, it identifies all the hyperlinks on the page and adds them to the crawl frontier, a list of URLs still to be visited. URLs on the frontier are visited in turn by the crawler, according to a set of policies. As the crawler archives websites, it saves copies of the information it finds. The archive used for managing and storing the collection of web pages is called the repository; it stores HTML pages as distinct files. A repository is similar to a database in that it stores data, but it lacks the additional functionality of a modern database. The repository holds the most recent version of each page retrieved by the crawler.
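The seed-and-frontier mechanics described above can be sketched as a simple breadth-first loop. Here `fetch_links` is a hypothetical stand-in for downloading a page and extracting its hyperlinks; a real crawler would use an HTTP client and an HTML parser instead.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Minimal crawl sketch: visit URLs from the frontier in FIFO order,
    adding newly discovered links back onto the frontier."""
    frontier = deque(seeds)          # the crawl frontier: URLs waiting to be visited
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)             # archiving to the repository would happen here
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return visited

# Usage with an in-memory link graph standing in for the web:
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
pages = crawl(["a"], lambda u: graph.get(u, []))
```

The `deque` makes this a breadth-first visit; swapping it for a priority queue keyed on page importance gives the prioritized selection policies discussed below.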
Due to the sheer volume of pages on the internet, a crawler can download only a fraction of them in a given time frame, so it must prioritize its downloads. The large number of URLs generated by server-side software can also cause the crawler to retrieve duplicate content, since only a few of the possible HTTP GET parameter combinations return unique content. For instance, suppose a photo gallery offers options that are specified as HTTP GET parameters in the URL: four options for sorting images, three choices of thumbnail size, two file formats, and a switch for disabling user-provided content. This yields 4 × 3 × 2 × 2 = 48 different URLs, all of which access the same content. Such combinations multiply endlessly, and the crawler must sort through them to retrieve the desired content only once.
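The combinatorial blow-up in the gallery example can be made concrete. The parameter names and values below are invented for illustration; only the counts (4, 3, 2, 2) come from the example above.

```python
from itertools import product

# Hypothetical gallery parameters: 4 sort orders, 3 thumbnail sizes,
# 2 file formats, and an on/off switch for user-provided content.
sort_orders = ["date", "name", "size", "rating"]
thumb_sizes = ["small", "medium", "large"]
formats = ["jpg", "png"]
user_content = ["on", "off"]

urls = ["/gallery?sort=%s&thumb=%s&fmt=%s&uc=%s" % combo
        for combo in product(sort_orders, thumb_sizes, formats, user_content)]
print(len(urls))  # 4 * 3 * 2 * 2 = 48 distinct URLs for the same gallery
```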
The behaviour of a crawler is dictated by a combination of policies:
- a selection policy, which determines which pages to download;
- a re-visit policy, which dictates when to check for changes to a page;
- a politeness policy, which prevents websites from being overloaded;
- a parallelization policy, which coordinates distributed crawlers.
Because of its gigantic size, only a small portion of the web can be covered by any search engine; as of 2009, large search engines indexed only 40-70% of the indexable web. Since a crawler can download only a limited number of pages, a good strategy is to download not random pages but pages relevant to the search, which is why web pages must be prioritized. The priority of a page depends on its intrinsic quality, as measured by its popularity in terms of visits, links, and URL. Designing a good selection policy is made harder by the fact that the crawler works with partial information: not all web pages are known to it during crawling.
Junghoo Cho et al. performed a crawling simulation on 180,000 pages to compare crawl-ordering policies. The orderings tested were breadth-first, backlink count, and partial PageRank calculations. One result showed that if the crawler is to download high-PageRank pages early in the crawl, the partial-PageRank strategy is best, followed by breadth-first and backlink count. Najork and Wiener performed an actual crawl, in breadth-first order, on 328 million pages. They found that a breadth-first crawl also captures high-PageRank pages early. The explanation is that an important page has links from many hosts, so those links are found early in the crawl regardless of which host or page the crawl starts from.
Abiteboul designed a crawling strategy based on an algorithm called OPIC (On-line Page Importance Computation). In OPIC, each page is given an initial sum of "cash", which it distributes equally among the pages it points to. It is a single-step process and much faster than a PageRank computation; an OPIC-driven crawler downloads first the pages in the crawl frontier with the highest accumulated "cash". The experiments were carried out on a 100,000-page synthetic graph, and there was no comparison with other strategies or experiments on the real web. Boldi et al. used simulations on 100 million pages from a WebBase crawl and 40 million pages from the .it domain, testing breadth-first against depth-first, random ordering, and an omniscient strategy. The comparison was based on how well the PageRank computed on a partial crawl approximates the true PageRank value, and some strategies that accumulate PageRank quickly were found to provide very poor progressive approximations.
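The cash-distribution idea behind OPIC can be sketched in a few lines. This is a toy illustration of a single distribution round over an in-memory graph, not Abiteboul's full algorithm (which interleaves distribution with crawling and keeps per-page history):

```python
def opic_step(graph, cash):
    """One OPIC round: each page distributes its 'cash' equally among
    the pages it links to; dangling pages spread theirs over all pages."""
    new_cash = {page: 0.0 for page in graph}
    for page, links in graph.items():
        targets = links if links else list(graph)   # dangling-page fallback
        share = cash[page] / len(targets)
        for t in targets:
            new_cash[t] += share
    return new_cash

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
cash = {p: 1.0 for p in graph}          # every page starts with equal cash
cash = opic_step(graph, cash)
next_page = max(cash, key=cash.get)     # the crawler downloads the richest page first
```

Note that the total cash is conserved across rounds, which is what makes the per-page totals usable as importance estimates without a full fixed-point computation.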
Baeza-Yates et al. tested multiple crawling strategies on 3 million pages belonging to two subsets of the web, from the .gr and .cl domains. The results showed that strategies which use per-site queues and OPIC are more effective than breadth-first crawling, and that it is better to guide the current crawl by a previous crawl, when one is available.
Daneshpajouh et al. designed an algorithm for finding good seeds. Their crawler begins from high-PageRank pages rather than from random seeds. Through this method, good seeds can be discovered from a Web graph that had been crawled previously, and these new seeds make a crawl very effective.
RESTRICTING FOLLOWED LINKS:-
Sometimes a crawler wants only HTML pages and avoids all other MIME types. One way to do this is to make an HTTP HEAD request to determine the MIME type of a Web resource before requesting it with a GET request. To avoid making numerous HEAD requests, a crawler may instead examine the URL and request a resource only if its URL ends with certain characters such as .html, .htm, .asp, .aspx, .php, .jsp, .jspx, or a slash. Note that this strategy may cause numerous HTML Web resources to be unintentionally skipped.
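The suffix-based filter described above is a one-liner. The function name and suffix tuple below are illustrative; the suffix list follows the strategy in the text:

```python
# Request a URL only if it looks like an HTML page, judging by its suffix.
HTML_SUFFIXES = (".html", ".htm", ".asp", ".aspx", ".php", ".jsp", ".jspx", "/")

def looks_like_html(url):
    """Hypothetical filter: strip query string and fragment, then check
    whether the remaining path ends with a known HTML-like suffix."""
    path = url.split("?", 1)[0].split("#", 1)[0]
    return path.lower().endswith(HTML_SUFFIXES)
```

A URL such as `http://example.com/page` (no extension at all) would be skipped by this filter even if it serves HTML, which is exactly the unintentional-skip caveat noted above.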
To avoid spider traps, some crawlers do not request any resource whose URL contains a "?", so that they cannot be led into downloading an infinite number of URLs from a site. This strategy is unreliable if the site uses URL rewriting to simplify its URLs.
Crawling the same Web resource more than once is a problem. For this reason, crawlers perform URL normalization, also referred to as URL canonicalization, a process in which a URL is standardized through modifications such as removing "." and ".." path segments, adding trailing slashes, and converting the scheme and host to lowercase.
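A minimal normalization routine covering the modifications just listed might look like this (a sketch; real crawlers also handle default ports, percent-encoding case, and other rules):

```python
from urllib.parse import urlsplit, urlunsplit
import posixpath

def normalize_url(url):
    """URL canonicalization sketch: lowercase the scheme and host,
    resolve '.' and '..' path segments, drop the fragment, and keep a
    trailing slash on directory-like paths."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    had_slash = path.endswith("/")
    path = posixpath.normpath(path) if path else "/"
    if had_slash and not path.endswith("/"):
        path += "/"    # normpath strips the trailing slash; restore it
    return urlunsplit((scheme.lower(), netloc.lower(), path, query, ""))
```

With this routine, `HTTP://Example.COM/a/b/../c/` and `http://example.com/a/c/` normalize to the same string, so the frontier's duplicate check treats them as one resource.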
Path Ascending Crawling:-
A path-ascending crawler tries to download as many resources as possible from a site by ascending every path in each URL it crawls. According to Cothey, a path-ascending crawler is very effective at finding isolated resources, or resources for which no inbound link exists.
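Path ascending is easy to illustrate: from one crawled URL, generate every ancestor directory and visit those too. The function below is an illustrative sketch:

```python
from urllib.parse import urlsplit

def ascending_paths(url):
    """Yield the ancestor paths of a URL, from the deepest directory up
    to the site root, as a path-ascending crawler would visit them."""
    scheme, netloc, path, _, _ = urlsplit(url)
    parts = path.split("/")
    urls = []
    for i in range(len(parts) - 1, 0, -1):
        prefix = "/".join(parts[:i]) + "/"   # rebuild the ancestor directory
        urls.append("%s://%s%s" % (scheme, netloc, prefix))
    return urls
```

For `http://example.com/a/b/page.html` this yields `/a/b/`, `/a/`, and `/`, so a directory listing with no inbound links at `/a/` would still be discovered.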
Topical or focused crawlers try to download only pages that are similar to each other. The concepts of topical and focused crawling were first introduced by Filippo Menczer and by Soumen Chakrabarti et al.
The main issue in focused crawling is that the similarity of a page's text to the query must be predicted before the page is downloaded. One possible predictor is the anchor text of links; this approach was adopted by Pinkerton, in the early days of the web, in the first web crawler. Diligenti et al. proposed an approach that infers the resemblance between the driving query and pages not yet visited. A focused crawl usually relies on a general web search engine to provide starting points, and its performance depends on the richness of links within the topic being searched.
Academic crawlers, which crawl free-access academic documents, are an example of focused crawlers. The crawler of the CiteSeer search engine, citeseerxbot, is an academic crawler; Google Scholar and Microsoft Academic Search are examples of other academic search engines. Because academic papers are mostly found in PDF format, academic crawlers prefer crawling PDF, PostScript, and Microsoft Word files, along with their zipped formats. For this reason, crawlers like Heritrix are customized to filter out other MIME types, or middleware is used to extract the documents and import them into the crawl repository and database. Determining whether a document is academic or not is a challenging task, so regular-expression algorithms and machine learning are applied after a crawl. Academic documents are usually found on the publication pages of research institutes or on the home pages of students and faculty. Because academic documents occupy only a minor portion of all web pages, good seeds are important for boosting the efficiency of an academic web crawler. Some academic crawlers also download plain-text and HTML files containing the metadata of academic papers, such as titles and abstracts. This increases the overall volume of papers covered, but in many cases no free download is provided.
Semantic focused crawler:
The semantic focused crawler is another type of focused crawler; it uses domain ontologies to represent topical maps and to link web pages with relevant ontological concepts for categorization and selection purposes. In addition, the ontologies can be automatically updated during the crawling process. Dong et al. introduced such an ontology-learning-based crawler, which uses a support vector machine to update the content of ontological concepts while crawling web pages.
Crawling is a time-consuming process, and a crawler may take months to crawl even a small portion of the web. By the time a crawl has finished, many events such as page creations, updates, and deletions may have happened. If these events are not detected, the search engine is left with an outdated copy of the resource. The most important cost functions for measuring this are freshness and age.
Freshness is a function that indicates whether the local copy of a page is accurate, i.e. up to date. Age indicates how outdated the local copy is: how much time has passed since the page changed without the copy being refreshed. A crawler's objective is to keep the average age of pages low, or the average freshness of pages high.
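These two measures are commonly formalized as follows (a standard formulation in the spirit of Cho and Garcia-Molina's definitions; the notation here is illustrative):

```latex
F_p(t) =
\begin{cases}
1 & \text{if the local copy of } p \text{ is up to date at time } t \\
0 & \text{otherwise}
\end{cases}
\qquad
A_p(t) =
\begin{cases}
0 & \text{if } p \text{ has not changed since the last download} \\
t - \text{modification time of } p & \text{otherwise}
\end{cases}
```

Freshness is thus binary, while age grows linearly from the moment the page changes until the crawler re-downloads it.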
Garcia-Molina and Cho studied two re-visiting policies:
- Proportional policy: re-visit a page with a frequency proportional to how often the page changes.
- Uniform policy: re-visit all pages with the same frequency, regardless of how often they change.
In both cases, pages are re-crawled in either a fixed or a random order.
The results showed that, in both a simulated and an actual web crawl, the uniform policy outperforms the proportional policy in terms of average freshness. Intuitively, this is because a crawler can only crawl a fixed number of pages in a given time frame, so the proportional policy allocates crawls to rapidly changing pages at the cost of less frequently changing ones. Pages that change rapidly lose their freshness almost immediately after being refreshed, while pages that change rarely stay fresh for a long time; the proportional policy therefore spends more crawling resources on exactly the pages whose freshness lasts the shortest.
To improve freshness, the crawler should in fact penalize pages that change too frequently. Neither of the two re-visiting policies (proportional and uniform) is optimal: to keep average freshness high, the optimal policy ignores the pages that change too often, while to keep average age low it uses access frequencies that increase with the rate of page change. Explicit formulas for the optimal re-visit policy can be obtained numerically. Cho and Garcia-Molina also showed that page changes are well described by an exponential distribution. Note that these re-visit policies treat the quality of all pages as homogeneous, which is not realistic; to design a more effective crawling policy, additional information about page quality should be used.
Crawlers can seriously degrade a site's performance because they retrieve data much faster than human users. It is hard for a server to keep up with multiple crawlers, especially if they are downloading large files and making multiple requests per second.
A web crawler can perform numerous tasks, but this comes at a cost:
- Network resources: crawlers require considerable bandwidth and operate with a high degree of parallelism.
- Server overload, especially if a server is accessed too frequently.
- Poorly written crawlers, which can crash servers or routers, or download pages they cannot handle.
- Personal crawlers that, if deployed by too many users, can disrupt networks and web servers.
The above problems are partially addressed by the robots exclusion protocol, also called the robots.txt protocol, a standard that lets administrators instruct crawlers not to access certain areas of their web servers. The standard does not suggest an interval between visits to the same server, even though such an interval is the most effective way to prevent server overload. Google, MSN, Yahoo! Search, and Ask Jeeves support an extra "Crawl-delay:" parameter in the robots.txt file to indicate the number of seconds to wait between requests.
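Python's standard library can parse robots.txt rules, including the non-standard Crawl-delay extension. The robots.txt body and user-agent name below are invented for illustration; a real crawler would fetch the file from the target host first:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for some site.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

allowed = rp.can_fetch("MyCrawler", "http://example.com/public/page.html")
blocked = rp.can_fetch("MyCrawler", "http://example.com/private/data.html")
delay = rp.crawl_delay("MyCrawler")   # the Crawl-delay value, in seconds
```

A polite crawler checks `can_fetch` before every request and sleeps for `delay` seconds between requests to the same server.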
The first proposed interval between successive page loads was 60 seconds. However, at that rate, downloading an entire website of more than 100,000 pages would take more than two months (100,000 pages × 60 s ≈ 69 days), even over a connection with zero latency and infinite bandwidth, while using only a small fraction of the server's resources. This is not an acceptable strategy.
The WIRE crawler uses 15 seconds as the default interval between accesses, while Cho used a 10-second interval. The Mercator web crawler follows an adaptive politeness policy: if it took t seconds to download a document from a server, the crawler waits 10t seconds before downloading the next page from that server. Access-log evidence shows that crawlers' access intervals typically range from 20 seconds to 3-4 minutes. Even so, web administrators have filed complaints about crawlers that were very polite and followed every protocol intended to prevent web servers from being overloaded.
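The Mercator-style adaptive policy described above is simple to sketch. The function and parameter names are illustrative; `download` stands in for a real HTTP fetch:

```python
import time

def polite_fetch(url, download, wait_factor=10):
    """Adaptive politeness sketch: after a download that took t seconds,
    wait wait_factor * t seconds before the next request to the same
    server (wait_factor=10 matches the 10t rule described above)."""
    start = time.monotonic()
    page = download(url)                 # placeholder for the real fetch
    elapsed = time.monotonic() - start
    time.sleep(wait_factor * elapsed)    # slow servers get longer pauses
    return page
```

The appeal of the 10t rule is that it is self-tuning: a slow, overloaded server automatically earns longer pauses, while a fast server is revisited sooner.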
Programmatic vs Visual Crawler:
Various "visual crawler/web scraper" products can be found on the web. They can structure data into rows and columns according to the user's requirements. The main difference between a classic and a visual crawler is the level of programming expertise required to set one up. The latest generation of "visual scrapers", such as OutWit Hub, import.io, and Diffbot, can crawl and scrape the web without requiring advanced programming skills.
In the visual crawling/scraping method, the user "teaches" the crawler to follow patterns in semi-structured data sources. The most common way of teaching a crawler is by highlighting data in a browser and training rows and columns.