Web spiders these days, it seems, are a dime a dozen. Not to minimize the tremendous value that Google and other search engines provide, but the technology that gathers up, or "spiders," web pages is pretty straightforward. Spidering the surface web, which consists mostly of static content that doesn't change frequently, is mostly a matter of throwing lots of network bandwidth, compute power, storage, and time at a huge number of web sites.

Merely throwing lots of resources at the deep web, the vast set of content that lives inside databases and is typically accessed by filling out and submitting search forms, doesn't work well. Different strategies and a new kind of "deep web explorer" are needed to mine the deep web; the second sketch below shows the form-driven access pattern that defeats plain link-following.

Surface web spiders work from a large list, or catalog, of known and discovered web sites. They load each web site's home page and note its links to other web pages. They then follow these new links and all subsequent links recursively; a toy version of this loop is sketched just below. Successful web crawling relies on the fact that site owners want their content to be found, and that most of a site's content can be accessed directly or by following links from the home page. Surface web content, in other words, is organized by an association of links, or in HTML jargon, an association of `<a>` tags. We should note that spidering is not without its hazards.
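To make the catalog-and-follow idea concrete, here is a minimal sketch of a surface web spider in Python, using only the standard library. The seed list, the `max_pages` cap, and the breadth-first strategy are illustrative choices, not a description of any particular search engine's crawler.

```python
"""A toy surface-web spider: start from a seed catalog, fetch each page,
harvest its <a> tags, and follow the new links it discovers (here,
breadth-first with an explicit queue rather than literal recursion)."""

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Records the href of every <a> tag seen while parsing a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=50):
    """Breadth-first crawl starting from a catalog of seed URLs."""
    queue = deque(seeds)
    seen = set(seeds)
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # unreachable or malformed pages: one of spidering's hazards
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the current page; drop #fragments.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        fetched += 1
        yield url


if __name__ == "__main__":
    for page in crawl(["https://example.com/"]):
        print("visited:", page)
```

A production spider would layer on exactly the things that make crawling hazardous in practice: robots.txt handling, politeness delays, redirect and encoding quirks, and traps like infinitely generated calendar pages.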
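For contrast, here is the shape of the deep web problem. There is no `<a>` tag for a spider to follow; each result page is generated on demand only when a search form is submitted. The endpoint URL and field names below are hypothetical placeholders, not a real service.

```python
"""Why link-following stalls at the deep web: the content sits behind a
search form, so a "deep web explorer" has to fill in fields and submit
them. The form URL and field names are hypothetical placeholders."""

from urllib.parse import urlencode
from urllib.request import urlopen

# A spider following links would never see this result page; the query
# conjures it out of a database on demand.
form_data = urlencode({"q": "lunar geology", "max_results": "20"}).encode()

with urlopen("https://example.com/search", data=form_data, timeout=10) as resp:
    print(resp.read()[:500])  # first bytes of a dynamically generated result page
```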