Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. The idea is to spread the required computation and bandwidth across many computers and networks.
As of 2003, most modern commercial search engines use this technique. Google uses thousands of individual computers in multiple locations to crawl the Web.
Newer projects are attempting to use a less structured, more ad hoc form of collaboration by enlisting volunteers to join the effort using, in many cases, their home or personal computers. LookSmart is the largest search engine to use this technique, through its Grub distributed web-crawling project.
This solution uses computers that are connected to the Internet to crawl Internet addresses in the background. After downloading crawled web pages, the clients compress them and send them back, together with a status flag (e.g. changed, new, down, redirected), to powerful central servers. The servers, which manage a large database, send out new URLs to clients for testing.
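A client in such a system might behave roughly as in the sketch below: fetch a URL, classify the outcome, and upload a compressed copy with a status flag. This is a minimal illustration under assumed conventions, not Grub's actual protocol; the server endpoint, headers, and status values are hypothetical.

```python
import gzip
import urllib.error
import urllib.request

CENTRAL_SERVER = "https://crawl.example.org"  # hypothetical coordinator endpoint


def fetch_and_report(url, last_hash=None):
    """Download one URL, classify the result, and send a compressed copy
    plus a status flag (new / changed / unchanged / redirected / down)
    back to the central server."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            body = resp.read()
            if resp.geturl() != url:
                status = "redirected"
            elif last_hash is not None and hash(body) == last_hash:
                status = "unchanged"
            else:
                status = "changed" if last_hash else "new"
    except (urllib.error.URLError, OSError):
        body, status = b"", "down"

    payload = gzip.compress(body)  # compress before uploading to save bandwidth
    report = urllib.request.Request(
        CENTRAL_SERVER + "/report",          # hypothetical reporting endpoint
        data=payload,
        headers={"X-Crawled-Url": url, "X-Status": status,
                 "Content-Encoding": "gzip"},
    )
    urllib.request.urlopen(report)  # the server stores the page and later hands out new URLs
```

In a real deployment the server would respond with the next batch of URLs for the client to test, closing the loop described above.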
Many of the people behind Grub, including founding members, appear to have left the project. As a result, bugs are not being fixed, and even after four years the project does not offer the option of searching the crawled results.
According to the FAQ of Nutch, an open-source search engine, the bandwidth savings from distributed web crawling are not significant, since "A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages...".
A Web crawler (also known as a Web spider or Web robot) is a program or automated script that browses the World Wide Web in a methodical, automated manner. Other, less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms (Kobayashi and Takeda, 2000).
This process is called Web crawling or spidering. Many legitimate sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a Web site, such as checking links or validating HTML code. In addition, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).
A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
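As a minimal illustration of the seed-and-frontier loop just described (a sketch of the general technique, not any particular engine's implementation), a breadth-first crawler could look like this:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # URLs still to visit: the crawl frontier
    visited = set()           # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue          # unreachable pages are simply skipped
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)          # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)          # grow the frontier
    return visited
```

Real crawlers layer selection, re-visit, politeness, and parallelization policies on top of this basic loop, which is what the "set of policies" above refers to.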
The deep web (or invisible web or hidden web) is the name given to pages on the World Wide Web that are not part of the surface web that is indexed by common search engines. It consists of pages which are not linked to by other pages (e.g., dynamic pages which are returned in response to a submitted query). The deep web also includes sites that require registration or otherwise limit access to their pages (e.g., using the Robots Exclusion Standard), prohibiting search engines from browsing them and creating cached copies. Pages that are only accessible through links produced by JavaScript and Flash also often reside in the deep web since most search engines are unable to properly follow these links.
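For example, a crawler that honors the Robots Exclusion Standard will skip disallowed paths entirely, which is one way a page ends up in the deep web. Below is a minimal check using Python's standard robotparser module; the site and path are made up for illustration.

```python
from urllib import robotparser

# Hypothetical site whose robots.txt disallows /private/ for all user agents
rp = robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

url = "https://www.example.com/private/report.html"
if rp.can_fetch("*", url):
    print("Allowed to crawl:", url)
else:
    # A compliant search engine never fetches or caches this page,
    # so it stays out of the index and remains part of the deep web.
    print("Blocked by robots.txt:", url)
```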
A digital library is a library in which a significant proportion of the resources are available in machine-readable format (as opposed to print or microform), accessible by means of computers. The digital content may be locally held or accessed remotely via computer networks. In libraries, the process of digitization began with the catalog, moved to periodical indexes and abstracting services, then to periodicals and large reference works, and finally to book publishing. Some of the largest digital libraries are purely digital, having few if any physical holdings.
The term Digital Library is diffuse enough to be applicable to a wide range of digital entities. Divisions can be made between libraries that have some physical presence, where patrons are able to access physical holdings as well as digital holdings, and libraries whose collections are almost completely digital. Project Gutenberg, ibiblio, the International Children's Digital Library, and the Internet Archive can serve as examples of this latter case.
The Internet has brought about advanced communications everywhere, access at any moment, and a simple tool in the form of Web browsers. Traditional organizations, by adopting new and innovative methods, have entered the arena of electronic commerce in order to take advantage of all the benefits and capabilities of the Internet. Through the Internet, companies are in instant contact with their customers, vendors, and partners. The Internet has changed the flow of information within organizations and the way business information and communications are exchanged. These new conditions have created new values in the economic and social arenas. BPR plays a greater role in electronic commerce than the capabilities of the Web alone: BPR involves the redesign of processes across intra-organizational and inter-organizational communication loops.