The Web is often treated as the entire known universe of content. Indeed, it appears that today everything that could be digital has become digital and can be found on the Web: articles, images, music, movies, and games. Since nearly every search query on Google returns millions of hits, it is tempting to conclude that all digital content in the world is available through Google search. After all, Google has indexed and cached the entire World Wide Web, hasn’t it?
Not so. Google only exposes content that wants to be found: content on the public Web that is not protected by any access control. The vast majority of content, however, resides behind a login challenge that prevents Google from indexing it. In fact, the 2003 study “How Much Information?” conducted by the University of California, Berkeley, estimates that less than 1% of content is exposed on the public Web (the “Surface Web”), while the rest resides in the “Deep Web”. While the Internet has evolved since 2003, the share of content not indexed by Google and other search engines remains huge.
Searching secure content is not trivial, which is why Google has only a modest presence in the enterprise. Access restrictions apply not only to the content itself but also to the index and the search results. Not only should you be prevented from opening a document you are not authorized to see, you should also not see a link with the document’s name in the search results. And so Google takes the easy way out by searching only unprotected content. Search with security, an absolute must in the enterprise, is a much harder nut to crack.
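To make the requirement concrete, here is a minimal sketch of what is often called security trimming: the permission check runs inside the search itself, so a restricted document never shows up in the results, not even as a title. Everything in it is hypothetical and for illustration only (the `Document` class, the `allowed_groups` field, the `search` function), and it assumes a simple group-based permission model; real enterprise search products are far more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical indexed document with a group-based access list."""
    title: str
    text: str
    allowed_groups: frozenset


def search(index, query, user_groups):
    """Return titles of matching documents that the user is allowed to see."""
    terms = query.lower().split()
    hits = []
    for doc in index:
        # Security trimming: a document the user cannot access is skipped
        # before matching, so not even its title leaks into the results.
        if not (doc.allowed_groups & user_groups):
            continue
        if all(term in doc.text.lower() for term in terms):
            hits.append(doc.title)
    return hits


# Illustrative data, not taken from any real system.
index = [
    Document("Cafeteria menu", "lunch budget for the week", frozenset({"staff"})),
    Document("Merger plan", "confidential budget for the merger", frozenset({"executives"})),
]

staff_hits = search(index, "budget", frozenset({"staff"}))
executive_hits = search(index, "budget", frozenset({"staff", "executives"}))
```

Both documents match the query "budget", yet a user who belongs only to the staff group gets a single hit. The cost is that every query must be evaluated against the permissions of the person asking, which is part of what makes secure search harder than searching the public Web.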
There is another factor that makes Google’s job easier. It searches content that actively wants to be found: content that has been optimized for search. An entire industry, Search Engine Optimization (SEO), spends millions of dollars making sure that content employs all kinds of tricks so that Google can find it easily and rank it as relevant. That is of course not the case for Deep Web content, not to mention documents and other types of content in the enterprise. And PageRank, Google’s famed relevance algorithm, is based on hyperlinks between Web pages, and such links barely exist in the world of enterprise content.
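The dependence on hyperlinks is easy to see in a toy version of the algorithm. The sketch below is the simplified textbook formulation of PageRank computed by repeated iteration, not Google's production ranking; the function name, the sample graphs, and the iteration count are assumptions made for illustration, and 0.85 is the damping factor commonly cited from the original PageRank paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Illustrative graphs: a small linked Web site and a set of unlinked files.
web_ranks = pagerank({"home": ["about"], "about": ["home"], "blog": ["home"]})
enterprise_ranks = pagerank({"contract.doc": [], "budget.xls": [], "memo.pdf": []})
```

In the linked graph, the page that collects the most incoming links ends up with the highest score. In the second graph, which stands in for a folder of enterprise documents that do not link to each other, every document receives the same score, so the link structure says nothing about relevance.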
The public Web searchable via Google represents only the tip of the iceberg, and searching secure content on the Web or content in the enterprise is much more difficult.