The Ultimate Guide to the Invisible Web
Published on Monday 18th of December, 2006 from OEDb.org
When you use a search engine on the Internet and can't find what you're looking for, what do you do?
Maybe you're seeking to learn something, which means you're probably going to keep trying until you
find it. Or give up in frustration. Don't give up that easily. There's information out there that is actually
not indexed in the big search engines. Such Web pages are part of what's called the Dark, Deep,
Hidden or Invisible Web. Those pages that are actually indexed are known by some as the surface Web.
Fortunately, the invisible Web is getting easier to search, with tools that go beyond the big three
search engines: Google, Yahoo, and MSN.
In the early days of the Web, computing power and storage space were at such a premium that the few
search engines that were around often indexed only a tiny fraction of Web pages and not even full
pages at that. But eventually space became relatively cheap and engines started indexing pages in full
(full text), as well as more pages. Still, engines miss a lot of pages. Here's a guide to those "invisible"
pages.
Background of the Invisible Web
1. The term. "Invisible" is purely search engine-centric, indicating any Web page that can be
accessed by at least one person but which is not indexed in a search engine. Many people
prefer the term "Deep Web" instead.
2. Its size. No one knows for sure. Danny Sullivan, a search engine expert and formerly of Search
Engine Watch, wrote in 2000 that the invisible Web was about 500 times Google's index of one
billion pages. More recent estimates [NY Times' free registration may be required] put Google's
index at over 8 billion pages at the time of this writing. (Claims by its archrival Yahoo! of 19+
billion pages were considered questionable.) Search engines are said to crawl only 16-20% of
the Internet.
3. Its real size. The most likely entity to be able to make any sort of "accurate" estimate is Google,
though if they've made a recent estimate of either the current size or growth rate of the
invisible Web, that information itself appears to be invisible. (They would have a list
somewhere of never-crawled URLs, which would be a mere starting point, as there would also
be all those countless URLs even they cannot get to. Without this, how could an estimate be
calculated?)
4. A guesstimate. Any astute mathematician with an understanding of Web content management
systems, content databases, and dynamically-served Web pages would probably say between 1
and 4 trillion pages, then conclude the near impossibility of an accurate estimate, especially
because of the rapidly increasing number of invisible sites. It's easier to compare search engine
index size.
5. Example of futility of estimating. A library or museum is given a collection of one
million digital images and decides to create a Web-accessible database. Each image will have its
own dynamically-served page, accessible via a query form. Just like that, one million new pages
have been added to the invisible Web.
6. How many invisible sites. In the same article by Danny Sullivan (above), he indicates
BrightPlanet's estimate of 100,000 as being the number of "significant invisible websites" out of
about 200,000. That was in 2000, so it's a hopelessly outdated estimate. Since then, weblogs
have been added to the mix, and many of them go uncrawled, increasing the number of
invisible sites.
7. Rate of growth of invisible sites. Technorati's David Sifry said that 100,000 new blogs were
created daily as of October 2006, but he also said 175,000 daily as of July 2006. Even at the
lower figure, if at least one page on each new blog is never indexed, the size of the invisible
Web is growing at around 36.5 million new pages per year. That doesn't even include other
types of invisible content (described elsewhere in this article).
8. Will this change? Google recently filed a patent application related to searching content
through Web-based forms. SEO by the SEA speculates that they are planning to index more of
the invisible Web and goes on to explain a possible methodology. Google's Eric Schmidt (or
possibly founders Larry Page and Sergey Brin) has said Google is dedicated to indexing the
world's content, however long it takes. Also, more previously invisible pages are getting
indexed because of manually-added links to them from visible pages.
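The growth figure in point 7 is simple arithmetic, and a short sketch makes the assumptions explicit. It assumes the lower daily figure is 100,000 new blogs (the value that yields 36.5 million per year) and that exactly one page per new blog is never indexed. The function and constant names are illustrative only.

```python
# Back-of-the-envelope estimate of invisible-Web growth from new blogs.
# Assumption: each new blog contributes a fixed number of never-indexed pages.
DAYS_PER_YEAR = 365


def invisible_pages_per_year(new_blogs_per_day, unindexed_pages_per_blog=1):
    """Return the estimated number of new invisible pages added per year."""
    return new_blogs_per_day * unindexed_pages_per_blog * DAYS_PER_YEAR


if __name__ == "__main__":
    # The two daily figures quoted in point 7.
    for daily in (100_000, 175_000):
        yearly = invisible_pages_per_year(daily)
        print(f"{daily:,} blogs/day -> {yearly:,} invisible pages/year")
```

At 100,000 blogs per day this gives 36,500,000 pages per year; at 175,000 per day it gives 63,875,000. Either way, this is a floor rather than an estimate of the whole invisible Web.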
9 Reasons a Web Page is Invisible
"Invisible" does not necessarily mean a Web page is inaccessible. It simply means the page is not
indexed by a search engine and is thus "invisible" to a searcher who does not know of its existence.
There are several reasons why a page may be invisible. Keep in mind that some pages are only
temporarily invisible and may be indexed at a later date. The general rule of thumb is that a search
engine returning no results does not mean the content isn't there. The list below also includes
examples of content types gleaned from Internet Tutorials.
1. Dynamic URLs. Engines have traditionally ignored any Web pages whose URLs have a long
string of parameters and equal signs and question marks, on the off chance that they'll
duplicate what's in their database — or worse — the spider will somehow go around in circles.
Danny Sullivan refers to such pages as part of the "shallow web".
2. Form-controlled entry, non-passworded. In this case, page content only gets displayed when a
human applies a set of actions, mostly entering data into a form (specific query information,
such as job criteria for a job search engine). This typically includes databases that generate
pages on demand and hence cannot be indexed by a spider. Applicable content includes travel
industry data (flight info, hotel availability), job listings, product databases, patents,
publicly-accessible government information, dictionary definitions, laws, stock market data,
phone books and professional directories.
3. Passworded access, subscription or non-subscription. This includes VPNs (virtual private
networks) and any Web site where some pages require username and password information.
Access may or may not be by paid subscription. However, BrightPlanet found in 2001 that 95%
of the invisible Web is publicly accessible without fees or subscriptions. Applicable content
includes academic and corporate databases, newspaper or journal content, and academic
library subscriptions.
4. Time-limited access. On some sites, such as the New York Times or Marketing Profs, content
becomes inaccessible after a certain time without a password. Search engines retain the URL,
but the page generates a sign-up form, and the content is moved to a new URL that requires a
password. Note that the content is sometimes cached by an engine. The NY Times also has
alternate URLs to some time-dated content that show the original content without a password.
You just have to know how to get to it.
5. Too new. If a site is relatively new, it's likely that few or none of its Web pages will be indexed
by any engine. This results in the site's pages being mostly invisible for a short period of time
(2-6 months).
6. Robots exclusion. The robots.txt file, which usually lives in the main directory of a Web site,
tells search robots which files and directories should not be indexed, hence its name, "robots
exclusion file." If this file is set up, it will block certain pages from being indexed, and those
pages will be invisible to searchers.
7. Flash presentation. Text content in Flash presentations is not indexed, though additional meta-
information might be.
8. Geo-tagged. A site's Web server can check the supposed geographic location of a visitor's
computer via its IP address. Computers from certain regions can be blocked, and that may include
some search engines. For example, several American TV broadcasters are now showing video
online, but the pages are only accessible to visitors in the US, sometimes only in certain regions
or certain states.
9. Hidden pages. One of the simplest and most common reasons for invisible Web pages is that
they are hidden. That is, there is simply no sequence of hyperlink clicks that could take you to
such a page. The pages are accessible, but only people who know of their existence know how
to view them.
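Two of the reasons above, dynamic URLs (1) and robots exclusion (6), come down to decisions a spider makes before it fetches a page. The sketch below is a rough illustration in Python, not a description of how any real engine works. The parameter threshold, the sample robots.txt content and the example.com URLs are all assumptions made up for the example. It uses only the standard library's urllib.parse and urllib.robotparser modules.

```python
from urllib.parse import parse_qsl, urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical threshold: URLs with more query parameters than this are skipped.
MAX_QUERY_PARAMS = 2

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def looks_dynamic(url, max_params=MAX_QUERY_PARAMS):
    """Return True if the URL has more query parameters than the threshold."""
    query = urlsplit(url).query
    return len(parse_qsl(query, keep_blank_values=True)) > max_params


def blocked_by_robots(robots_txt, url, agent="*"):
    """Return True if the given robots.txt text disallows this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


def would_index(url, robots_txt=ROBOTS_TXT):
    """A cautious spider indexes only static-looking, non-excluded URLs."""
    return not looks_dynamic(url) and not blocked_by_robots(robots_txt, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/about.html",
        "http://example.com/item?id=1&cat=2&sort=asc&page=4",
        "http://example.com/private/report.html",
    ):
        status = "index" if would_index(url) else "skip (stays invisible)"
        print(url, "->", status)
```

Any page that fails either check is never fetched, so it stays out of the index even though a person with the URL can open it normally.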
10 Ways to Make Invisible Content Visible
We have discussed what type of content is invisible and where we might find such information. Now
imagine if there were some way to make some of that invisible content more visible. That's possible for
some Web pages.
1. Do a static dump. If you have a small database of content, you may want to simply dump it out
to one static HTML page, with relevant formatting and necessary hyperlinks, then link to this
static page from an already "visible" (indexed) page.
2. Do categorized database publishing. If you have a database of, say, products, you could publish
select information to static category and overview pages, thereby making content available
without form-based or query-generated access. Of course, this works best for information that
does not become outdated. Job listings, for example, may not suit this method.
3. Convert formats. Word processor documents, spreadsheets, slideshows, PDFs, audio and video all
used to be part of the invisible Web. However, Google and other text search engines started
indexing their contents a few years ago, adding to the available pages of the visible Web. The
benefit to librarians, researchers and others is that it's now easier to find a particular piece of
text. But if you have a format such as Flash, which isn't indexed, you could publish a static
version of the text content to supplement the rich media.
4. Transcribe information. Have audio or video content such as a podcast? Transcribe the
information and publish it as supplementary text.
5. Build links. Link to your own pages from other related pages. If you write about, say, trees on
page A, then write about trees again on page B, link from page B to page A to give A more
relevance. If page A hasn't been indexed, it will be after B is indexed. Points 6-9 are alternate
ways to build links, hence helping make content visible.
6. Publish a sitemap. Not the new XML kind that the Big 3 search engines agreed to a standard on,
but an HTML page that maps out the main sections of your site. This is essentially a way to build
links (#5). Each main section will in turn link to specific pages. The result is that a spider has a
relevance map with which to decide what to index. Then again, you can also use the new type
of sitemap to achieve deep indexing. Chris Pearson offers a sitemap generator and template.
7. Build a topic pyramid. This is a specialized form of sitemap that actually spans many pages. The
apex (top-most) page has general topics and links to the next layer of pages, which have more
specific topics and links to the next layer. The bottom-most layer of the topic pyramid consists of your
original Web pages or blog posts, which have the most specific content. This method builds
page relevance via the serial linking, which induces spiders to want to visit and index.
8. Write about it elsewhere. This is a form of link-building. When someone writes about an
invisible page and links to it, it becomes visible by proxy, once an engine follows through and
indexes it.
9. Socially bookmark it. If you find something you like, say a book at Project Gutenberg,
bookmark the URL at a social bookmarking site such as Del.icio.us and add a brief description.
10. Remove access restrictions. Get rid of the need to log in, or don't apply time limits.
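Points 1, 2 and 6 (static dump, categorized publishing, HTML sitemap) can be combined in one small script. The sketch below assumes the records have already been pulled out of the database into a list of dictionaries. The sample records, field names and file names are hypothetical. It builds one static page per category plus an index page that links to each of them, so a spider that reaches the index can follow plain links to the rest.

```python
from collections import defaultdict
from html import escape

# Hypothetical sample records standing in for rows from a product database.
RECORDS = [
    {"category": "Trees", "name": "Red Maple", "summary": "Fast-growing shade tree."},
    {"category": "Trees", "name": "White Oak", "summary": "Long-lived hardwood."},
    {"category": "Shrubs", "name": "Boxwood", "summary": "Evergreen hedge plant."},
]


def slug(text):
    """Turn a category name into a simple lowercase file name."""
    return "".join(c if c.isalnum() else "-" for c in text.lower()) + ".html"


def build_pages(records):
    """Return {filename: html}: one page per category plus index.html."""
    by_category = defaultdict(list)
    for rec in records:
        by_category[rec["category"]].append(rec)

    pages = {}
    index_items = []
    for category in sorted(by_category):
        filename = slug(category)
        items = "\n".join(
            f"<li><strong>{escape(r['name'])}</strong>: {escape(r['summary'])}</li>"
            for r in by_category[category]
        )
        pages[filename] = (
            f"<html><head><title>{escape(category)}</title></head>\n"
            f"<body><h1>{escape(category)}</h1>\n<ul>\n{items}\n</ul></body></html>"
        )
        # The index doubles as a small HTML sitemap for the category pages.
        index_items.append(f'<li><a href="{filename}">{escape(category)}</a></li>')

    pages["index.html"] = (
        "<html><head><title>Catalog</title></head>\n"
        "<body><h1>Catalog</h1>\n<ul>\n" + "\n".join(index_items) + "\n</ul></body></html>"
    )
    return pages


if __name__ == "__main__":
    for filename, html in build_pages(RECORDS).items():
        with open(filename, "w", encoding="utf-8") as fh:
            fh.write(html)
```

Running the script writes the files to the current directory. Linking to index.html from a page that is already indexed gives spiders a path to everything else.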
How to Access and Search for Invisible Content
If a site publisher does none of the above to make their content more accessible, there are still ways to
make the content available, if not the actual pages.
Imagine if there was a search engine that could help you access some of the invisible Web. It would
have an advantage over traditional engines. Well there's more than one such engine, and even
traditional engines are making a move in that direction. The larger engines already index rich media
such as PDF files, word processor documents, spreadsheets, etc.
Invisible Web engines have taken a different approach, collaborating with Web site publishers to index
the otherwise invisible content. But for invisible content that cannot and/or should not be made
visible, there are still a number of ways to get access:
Be a student, alumnus, or professor to gain access to university records and library journals.
Be an employee of a company with a VPN over the Internet.
Request access. This might be as simple as signing up for free.
Pay for a subscription.
Request a "dump" page of a database. Sometimes a request to the right person will gain you
this data.
Use a Deep Web engine, portal, or directory.
To actually search for effectively invisible content:
1. Use a site's search engine. These tend not to be as robust for complex query terms, and usually
are quite literal about the search string, but they are more likely to show you where invisible
content is than a regular engine.
2. Use site archive navigation. On weblogs in particular, you can use the archive links to find info,
albeit through manual searching.
3. Use the word "database". Using the word "database" in your regular search engine query will
often find you information that is otherwise nearly impossible to find. For example, if you are
looking for a database of images, you can type the search string images database into Google
or one of the other engines. Somewhere down the results list in Google, you'll find Full-Text
Database Images from the USPTO (US Patent and Trademark Office). You can then use the
Quick or Advanced search forms to find patents relating to one or more terms. If there are
images to be seen, there will be links to them.
4. Use a suitable resource. Use an "invisible Web" directory, portal or specialized search engine
such as Google Book Search, Google Scholar, Librarian's Internet Index, or BrightPlanet's
Complete Planet (70,000 searchable databases and specialty search engines).
15 Invisible Web Search Tools
BrightPlanet estimated in 2001 that in excess of 200,000 "Deep Web" sites existed. They found that 60
of the largest of these sites collectively contained 40 times the pages in the surface Web (at the time),
and that despite being invisible in the engines, these sites receive a significant amount of traffic.
Here is a small sampling of invisible Web search tools (directories, portals, engines) to help you
find some invisible content. To see more like these, please look at our Research Beyond Google article.
1. Deep Web Search Engine — Clusty.
2. Art — Musée du Louvre.
3. Books Online — The Online Books Page.
4. Business — Explorit Now!.
5. Consumer — US Consumer Products Safety Commission Recalled Products.
6. Economic and Job Data — FreeLunch.com — A searchable directory of free economic data.
7. Finance and Investing — Bankrate.com.
8. General Research — GPO's Catalog of US Government Publications.
9. Government Data — Copyright Records (LOCIS).
10. International — International Data Base (IDB).
11. Law and Politics — THOMAS (Library of Congress).
12. Library of Congress — Library of Congress.
13. Medical and Health — PubMed.
14. Science — ScienceResearch.com.
15. Transportation — FAA Flight Delay Information.
References and Resources
These are relevant references that are not linked to above, which may be of interest to writers and
researchers. There's a strong leaning to research papers here, some of which have dozens of links to
PDF documents on the technical aspects of accessing, indexing and retrieving deep Web content. A few
references below are to companies offering "internet intelligence" tools and software.
1. About WebSearch — Christmas 2006 web search guide.
2. About Websearch — The deep web — find out more about the deep web — deep web search.
3. ALA — American Library Association.
4. BrightPlanet — FAQ.
5. Deep Web Research — A gigantic list of resources.
6. Deep Web Technologies.
7. Ellipsis — Metadata, Google, and the Invisible Web.
8. Envisional.
9. Google Librarian Center.
10. Google Library Project.
11. Lifehacker — How to search the invisible web.
12. MediaBistro — Some resources for freelancers.
13. MetaQuerier — Exploring and integrating the deep web.
14. QProber — Classifying and searching hidden-web text databases.
15. The Invisible Web Weblog.
16. University of California, Berkeley — Invisible or deep web.