The internet is an iceberg. And, as you might guess, most of us only reckon with the tip. While the pages and media found via simple searches may seem unendingly huge at times, what is submerged and largely unseen – often referred to as the invisible web or deep web – is in fact far, far bigger.
The Surface Web
What we access every day through popular search engines like Google, Yahoo or Bing is referred to as the Surface Web. These familiar search engines crawl through tens of trillions of pages of available content (Google alone is said to have indexed more than 30 trillion web pages) and bring that content to us on demand. As big as this trove of information is, however, this represents only the tip of the iceberg.
Eric Schmidt, the former CEO of Google, was once asked to estimate the size of the World Wide Web. He estimated that of roughly 5 million terabytes of data, Google had indexed roughly 200 terabytes, or only 0.004% of the total internet.
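As a quick check of that figure, the percentage follows directly from the two estimates quoted above. The short Python sketch below is illustrative only; the variable names are ours, and the inputs are simply the rough numbers from the quote.

```python
# Rough estimates quoted above, both in terabytes.
TOTAL_TB = 5_000_000   # estimated size of the whole web
INDEXED_TB = 200       # estimated portion indexed by Google

# Share of the web that has been indexed, expressed as a percentage.
indexed_percent = INDEXED_TB / TOTAL_TB * 100

print(f"{indexed_percent:.3f}%")  # prints 0.004%
```

In other words, for every terabyte indexed, roughly 25,000 terabytes were not.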
The Invisible Web
Beneath the Surface Web is what is referred to as the Deep or Invisible Web. It is composed of:
- Private websites, such as VPNs (virtual private networks) and sites that require passwords and logins
- Limited access content sites (which limit access in a technical way, such as using CAPTCHAs, the Robots Exclusion Standard or no-cache HTTP headers that prevent search engines from browsing or caching them)
- Unlinked content, without hyperlinks to other pages, which prevents web crawlers from accessing information
- Textual content, often encoded in image or video files or in specific file formats not handled by search engines
- Dynamic content created for a single purpose and not part of a larger collection of items
- Scripted content, pages only accessible using JavaScript, as well as content downloaded using Flash and Ajax solutions
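To make one of these mechanisms concrete, here is a minimal sketch of how a well-behaved crawler might honor the Robots Exclusion Standard, using Python's standard urllib.robotparser module. The robots.txt rules, the example.com URLs and the ExampleBot name are all hypothetical, chosen for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the given rules allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


public_ok = is_crawlable(ROBOTS_TXT, "https://example.com/public/index.html")
private_ok = is_crawlable(ROBOTS_TXT, "https://example.com/private/report.html")
print(public_ok, private_ok)
```

A page under /private/ is skipped by any crawler that respects these rules, so it never reaches a search index even though it is reachable by anyone who has the address.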
There are many high-value collections to be found within the invisible web. Some of the material found there that most people would recognize and, potentially, find useful includes:
- Academic studies and papers
- Blog platforms
- Pages created but not yet published
- Scientific research
- Academic and corporate databases
- Government publications
- Electronic books
- Bulletin boards
- Mailing lists
- Online card catalogs
- Many subscription journals
- Archived videos
But knowing all these materials are out there, buried deep within the web, doesn’t really help the average user. What tools can we turn to in order to make sense of the invisible web? There really is no easy answer. Sure, the means to search and sort through massive amounts of invisible web information are out there, but many of these tools have a steep learning curve. This can mean sophisticated software that requires no small amount of computer savvy; it can mean energy-sucking search tools that require souped-up computers to handle the task of combing through millions of pages of data; or it can require the searching party to be unusually persistent – something most of us, with our expectations of instantaneous Google search success, won’t be accustomed to.
All that being said, we can become acquainted with the invisible web by degrees. The many tools considered below will help you access a sizable slice of the invisible web’s offerings. You will find we’ve identified a number of subject-specific databases and engines; tools with an established filter, making their searches much more narrow.
Open Access Journal Databases
Open access journal databases (OAJD) are compilations of free scholarly journals maintained in a manner that facilitates access by researchers and others who are seeking specific information or knowledge. Because these databases are composed of unlinked content, they are located in the invisible web.
The vast majority of these journals are of the highest quality, with peer reviews and extensive vetting of the content before publication. However, there has been a trend of journals accepting scholarship without adequate quality controls, under arrangements designed to make money for the publishers rather than to further scholarship. It is important to be careful and review the standards of the database and journals chosen. “This helpful guide” explains what to look for.
Below is a sample list of well-regarded and reputable databases.
- “AGRIS” (International Information System for Agricultural Science and Technology) is a global, public domain database maintained in multiple languages by the Food and Agriculture Organization of the United Nations. They provide free access to agricultural research and information.
- “BioMed Central” is the UK-based publisher of 258 peer-reviewed open access journals. Their published works span science, technology and medicine and include many well-regarded titles.
- “Copernicus Publications” has been an open-access scientific publisher in Germany since 2001. They are strong supporters of the researchers who create these articles, providing top-level peer review and promotion for their work.
- “DeGruyter Open” (formerly Versita Open) is one of Germany’s leading publishers of open access content. Today DeGruyter Open (DGO) publishes about 400 owned and third-party scholarly journals and books across all major disciplines.
- “Directory of Open Access Journals” is focused on providing access only to those journals that employ the highest quality standards to guarantee content. They are presently a repository of 9,740 journals with more than 1.5 million articles from 133 countries.
- “EDP Sciences” (Édition Diffusion Presse Sciences) is a France-based scientific publisher with an international mission. They publish more than 50 scientific journals, with some 60,000 published pages annually.
- “Elsevier” of Amsterdam is a world leader in advancing knowledge in the science, technology and health fields. They publish nearly 2,200 journals, including The Lancet and Cell, and over 25,000 book titles, including Gray’s Anatomy and Nelson’s Pediatrics.
- “Hindawi Publishing Corporation”, based in Egypt, publishes 434 peer-reviewed, open access journals covering all areas of Science, Technology and Medicine, as well as a variety of Social Sciences.
- “Journal Seek” (Genamics) touts itself as “the largest completely categorized database of freely available journal information available on the internet,” with more than 100,000 titles currently. Categories range from Arts and Literature, through both hard- and soft-sciences, to Sports and Recreation.
- “The Multidisciplinary Digital Publishing Institute” (MDPI), based in Switzerland, is a publisher of more than 110 peer-reviewed, open access journals covering arts, sciences, technology and medicine.
- “Open Access Journals Search Engine” (OAJSE), based in India, is a search engine for open access journals from throughout the world, except for India. It has an extremely simple interface. Note: the site was last updated June 21, 2013.
- “Open J-Gate” is an India-based e-journal database of millions of journal articles in the open access domain. With a worldwide reach, Open J-Gate is updated every day with new academic, research and industry articles.
- “Open Science Directory” contains about 13,000 scientific journals, with another 7,000 special programs titles.
- “Springer Open” offers a roster of more than 160 peer-reviewed, open access journals, as well as their more recent addition of free access books, covering all scientific disciplines.
- “Wiley Open Access”, a subsidiary of New Jersey-based global publisher John Wiley & Sons, Inc., publishes peer-reviewed open access journals specific to biological, chemical and health sciences.
Invisible Web Search Engines
Your typical search engine’s primary job is to locate the surface sites and downloads that make up much of the web as we know it. These searches are able to find an array of HTML documents, video and audio files and, essentially, any content that is heavily linked to or shared online. And often, these engines, Google chief among them, will find and organize this diversity of content every time you search.
The search engines that deliver results from the invisible web are distinctly different. Narrower in scope, these deep web engines tend to access only a single type of data. This is due to the fact that each type of data has the potential to offer up an outrageous number of results. An inexact deep web search would quickly turn into a needle in a haystack. That’s why deep web searches tend to be more thoughtful in their initial query requirements.
Below is a list of popular invisible web search engines:
- “Clusty” is a meta search engine that not only combines data from a variety of different source documents, but also creates “clustered” responses, automatically sorting by category.
- “CompletePlanet” searches more than 70,000 databases and specialty search engines found only in the invisible web. It is a search engine as well suited to casual searchers as it is to researchers.
- “DigitalLibrarian”: A Librarian’s Choice of the Best of the Web is maintained by a real librarian. With an eclectic mix of some 45 broad categories, Digital Librarian offers data from categories as diverse as Activism/Non Profits and Railroads and Waterways.
- “InfoMine” is another librarian-developed internet resource collection, this time from The Regents of the University of California.
- “InternetArchive” has an eclectic array of categories, starting with the ‘Wayback Machine,’ which allows the searcher to locate archived documents, and including an archive of Grateful Dead audience and soundboard recordings. They offer 6 million texts, 1.5 million videos, 1.9 million audio recordings and 126K live music concerts.
- “The Internet Public Library” (ipl and ipl2) is a non-profit, student-run website at Drexel University. Students volunteer to act as librarians and respond to questions from visitors. Categories of data include those directed to Children and Teens.
- “SurfWax” is a metasearch engine that offers “practical tools for Dynamic Search Navigation.” It offers the option of grabbing results from multiple search engines at the same time, or even designing “SearchSets,” which are individualized groups of sources that can be used over and over in searches.
- “UC Santa Barbara Library” offers access to a diverse group of research databases useful to students, researchers and the casual searcher. It should be noted that many of these resources are password protected. Those that do not display a lock icon are publicly accessible.
- “USA.gov” offers access to a huge volume of information, including all types of forms, databases, and information sites representing most government agencies.
- “Voice of the Shuttle” (VoS) offers access to a diverse assortment of sites, including literature, literary theory, philosophy, history and cultural studies, and includes the daily update of all things “cool.”
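For a rough sense of how a metasearch engine can combine results from several sources, consider the sketch below. It is not how Clusty, SurfWax or any engine listed here actually works; the result lists and example.org URLs are made up, and the scoring (summing 1/rank across engines) is just one simple assumption.

```python
from collections import defaultdict


def merge_results(result_lists):
    """Merge ranked URL lists from several engines into one deduplicated ranking."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            # Crude normalization so trailing-slash variants count as one page.
            key = url.rstrip("/")
            # A URL ranked highly by several engines accumulates a higher score.
            scores[key] += 1.0 / rank
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical ranked results from two engines for the same query.
engine_a = ["https://example.org/a", "https://example.org/b"]
engine_b = ["https://example.org/b/", "https://example.org/c"]

merged = merge_results([engine_a, engine_b])
print(merged)
```

Here the page returned by both engines rises to the top of the merged list, and the duplicate entry appears only once.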
Subject-Specific Databases
The following lists pool together some mainstream and not so mainstream databases dedicated to particular fields and areas of interest. While only a handful of these tools are able to surface deep web materials, all of the search engines and collections we have highlighted are powerful, extensive bodies of work. Many of the resources these tools surface would likely be overlooked if the same query were made on one of the mainstream engines most users fall back on, like Bing, Yahoo and even Google.
Art & Design
- “ArtNet” deals with pricing and sourcing work in the art market. They also keep track of the latest news and artists in the industry.
- “The Metropolitan Museum of Art” site hosts an impressively interactive body of information on their collections, exhibitions, events and research.
- “Musée du Louvre”, the renowned museum, maintains a site filled with navigable sections covering its collections.
- “The National Gallery of Art”, the premier museum of arts in our nation’s capital, also maintains a site detailing the highlights, exhibitions and education efforts the institution oversees.
- “Public Art Online” is a resource detailing sources, creators, prices, projects, legal issues, success stories, resources, education and all other aspects of the creation of public art.
- “Smithsonian Art Inventories Catalog” is a subset of the Smithsonian Institution Research Information System (SIRIS). It is a browsable database of over 400,000 art inventory items held in public and private collections.
- “Web Gallery of Art” is a searchable database of European art, containing nearly 34,000 reproductions. Additional database information includes artist biographies, period music and commentaries.
Business
- “Better Business Bureau” (BBB) Information System Search allows consumers to locate the details of ratings, consumer experience, governmental action and more of both BBB accredited and non-accredited businesses.
- “BPubs.com” is the business publications search engine. They offer more than 200 free subscriptions to business and trade publications.
- “BusinessUSA” is an excellent and complete database of everything a new or experienced business owner or employer should know.
- “EDGAR: U.S. Securities and Exchange Commission” is the Securities and Exchange Commission’s database. It posts copies of corporate filings from US businesses, press releases and public statements.
- “Global Edge” delivers a comprehensive research tool for academics, students and businesspeople to seek out answers to international business questions.
- “Hoover’s”, a subsidiary of Dun & Bradstreet, is one of the best known databases of American and international business. It is a complete source of company and industry information, especially useful for investors.
- “The National Bureau of Economic Research” is perhaps the leading private, non-partisan research organization dedicated to unbiased analysis of economic policy. This database maintains archives of research data, meetings, activities, working papers and publications.
- “U.S. Department of Commerce”, Bureau of Economic Analysis is the source of many of the economic statistics we hear in the news, including national income and product accounts (NIPAs), gross domestic product, consumer spending, balance of payments and much more.
Legal & Social Services
- “U.S. Department of Justice Resources” is a comprehensive database for the Department of Justice, including archives, initiatives, news, publications and resources.
- “Federal Bureau of Investigation (FBI) Stats & Services” organizes crime statistics, criminal history checks and a sex offender registry, along with resources for businesses, communities, crime victims, law enforcement, job seekers, researchers and students.
- “Homeland Security Digital Library” (HSDL) maintains databases, policy and strategy statements, special collections and research tools.
- “National Criminal Justice Reference Service” (NCJRS) is a federally funded resource offering extensive databases detailing issues of justice, substance abuse, and victim assistance information to victims of crime, among other topics.
- “Social Work Policy Institute” supports research in social work with databases, publications, archives, foundation news and events.
Science & Technology
- “Environmental Protection Agency” organizes the agency’s laws and regulations, science and technology, and the many issues affecting the agency and its policies.
- “National Science Digital Library” (NSDL) is a source for science, technology, engineering and mathematics educational data. It is funded by the National Science Foundation.
- “Networked Computer Science Technical Reports Library” (NCSTRL) was developed as a collaborative effort between NASA Langley, Virginia Tech, Old Dominion University and University of Virginia. It serves as an archive for submitted scientific abstracts and other research products.
- “Science.gov” is a compendium of more than 60 US government scientific databases and more than 200 websites. Governed by the interagency Science.gov Alliance, this site provides access to a range of government scientific research data.
- “Science Research” is a free, publicly available deep web search engine that purports to use a sophisticated technology that permits queries to more than 300 science and technology sites simultaneously, with the results collated, ranked and stripped of duplications.
- “WebCASPAR” provides access to science and engineering data from a variety of US educational institutions. It incorporates a table builder, allowing a combined result from various National Science Foundation and National Center for Education Statistics data sources.
- “World Wide Science” is a global scientific gateway, composed of US and international scientific databases. Because it is multilingual, it allows real-time search and translation of reporting from an extensive group of databases.
Health & Medicine
- “Cases Database” is a searchable database of more than 32,000 peer-reviewed medical case reports from 270 journals covering a variety of medical conditions.
- “Centers for Disease Control and Prevention” (CDC) WONDER’s online databases permit access to the substantial public health data resources held by the CDC.
- “HCUPnet” is an online query system for those seeking access to statistical data from the Agency for Healthcare Research and Quality.
- “Healthy People” provides rolling 10-year national objectives and programs for improving the health of Americans. They currently operate under the Healthy People 2020 decennial agenda.
- “National Center for Biotechnology Information” (NCBI) is an offshoot of the National Institutes of Health (NIH). This site provides access to some 65 databases from the various project categories currently being researched.
- “OMIM” offers access to the combined research of many decades into genetics and genetic disorders. With daily updates, it represents perhaps the most complete single database of this sort of data.
- “PubMed” is a database of more than 23 million citations from the US National Library of Medicine and National Institutes of Health.
- “TOXNET” is the access portal to the US Toxicology Data Network, an offshoot of the National Library of Medicine.
- “U.S. National Library of Medicine” is a database of medical research, available grants and available resources. The site is maintained by the National Institutes of Health.
- “World Health Organization” (WHO) is a comprehensive site covering the many initiatives the WHO is engaged in around the world.