Infovell's 'research engine' finds deep Web pages that Google, Yahoo miss

September 8th, 2008 by Lisa Zyga

With Infovell, users search with key phrases up to 25,000 words long, rather than keywords. Image credit: Infovell.

According to a study by the University of California at Berkeley, traditional search engines such as Google and Yahoo index only about 0.2% of the Internet. The remaining 99.8%, known as the "deep Web," is a vast body of public and subscription-based information that traditional search engines can't access.

To dig into this "invisible" information, scientists have developed a new search engine called Infovell, geared toward helping researchers find often obscure data in the deep Web. As scientists working on the Human Genome Project, Infovell's founders designed the new searching technology based on methods in genomics research. Instead of using keywords, Infovell accepts much longer search terms in any language.

"There are no 'keywords' in genetics," explains Infovell's Web site. "New unique and powerful techniques have been developed to extract knowledge from genes. Now, through Infovell, these techniques have, for the first time, been applied to language and other symbol systems, shattering long-held barriers in search and leapfrogging the capabilities of current search providers to deliver the World's Research Engine."

While keywords may work fine for the general public looking for popular and accessible content, they often don't meet the needs of researchers looking for specific data. As information in the deep Web continues to grow, Infovell explains that a one-size-fits-all approach to searching will make academic searching even more challenging.

One reason is the nature of deep Web sites themselves. While many popular Web sites are specifically designed to be search-engine friendly, a lot of deep Web content is unstructured, making it difficult for keyword-based search engines to index. Further, the deep Web does not receive much traffic, meaning these pages don't have many incoming links and therefore aren't ranked highly by systems such as Google's PageRank. And for private sites, barriers such as registration and subscription requirements also make it difficult for search engines to access them.
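To illustrate why a page with few incoming links ends up with a low link-based score, here is a minimal sketch of the textbook PageRank power iteration. The four-page graph, the page names, and the damping factor of 0.85 are assumptions chosen for illustration; this is not a description of Google's production ranking.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank evenly among its link targets.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "deep_page" links out, but no other page links to it.
links = {
    "popular": ["hub"],
    "hub": ["popular", "blog"],
    "blog": ["popular", "hub"],
    "deep_page": ["popular"],
}

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{page:10s} {score:.4f}")
```

In this toy graph the page with no incoming links never rises above the baseline share, however relevant its content might be to a particular researcher.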

Searching with keywords also presents a trade-off: a query that is too general returns millions of irrelevant results, while one that is too specific may return no results at all. After getting results, users then have to sift through many pages looking for what they need.

But with Infovell, users search with "KeyPhrases," from paragraphs to whole documents or even sets of documents up to 25,000 words. Because it's born out of the world of genomics, Infovell is also language-independent. Users can search in English, Chinese, Arabic, or even mathematical symbols, chemical formulas, or musical notes. "The key requirement is that the information is in digital format, and it can be stored in a linear, sequential and segregated manner," according to Infovell's site.
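The article does not describe Infovell's algorithm, so the following is only a sketch of one generic way to compare a long query against documents treated as raw symbol sequences: overlapping character n-grams ("shingles") scored with Jaccard similarity. This is a standard stand-in technique, not Infovell's method, and the function names and sample documents are hypothetical.

```python
def shingles(text, n=3):
    """Return the set of overlapping n-symbol windows in `text`.

    The text is treated as a plain sequence of symbols, so the same code
    works for prose in any language, formulas, or other notation.
    """
    symbols = " ".join(text.lower().split())  # normalize case and whitespace
    if len(symbols) < n:
        return {symbols} if symbols else set()
    return {symbols[i:i + n] for i in range(len(symbols) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_documents(query, documents, n=3):
    """Score each document against the query, highest similarity first."""
    query_shingles = shingles(query, n)
    scored = [
        (jaccard(query_shingles, shingles(text, n)), name)
        for name, text in documents.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical mini-corpus; a real query could be a whole document.
documents = {
    "genomics": "Sequence alignment techniques extract knowledge from genes without keywords.",
    "cars": "The auto maker hopes to turn out its first all-electric car.",
    "chemistry": "C6H12O6 + 6 O2 -> 6 CO2 + 6 H2O",
}
query = "Techniques developed to extract knowledge from genes can be applied to language."

results = rank_documents(query, documents)
for score, name in results:
    print(f"{name:10s} {score:.3f}")
```

Because the query is compared as a whole sequence rather than as a handful of chosen terms, a longer query simply contributes more shingles, and no special syntax is needed.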

Infovell's technology allows users to locate the most current and comprehensive documents and published articles from billions of pages, with topics including life sciences, medicine, patents, industry news, and other reference content.

Currently, some researchers use advanced search options provided by individual sites to try to get around keyword search engines. However, these site-specific search tools require users to learn special syntax, and they work only for the site that hosts them. The advantage of Infovell is that it doesn't require special training (and it doesn't use Boolean operators, taxonomies or clustering); rather, it is easy to use and can search everything at once.

Although Infovell is not the first attempt at a search engine for crawling the deep Web, its developers hope that researchers will benefit increasingly from Infovell's advantages, especially as the deep Web continues to grow.

Infovell is being demonstrated at DEMOfall08, a conference for emerging technologies taking place in San Diego on September 7-9. Infovell is initially available on a subscription basis, and users can sign up for a 30-day risk-free trial at Infovell's Web site. Later this year, Infovell will release a free beta version on a limited basis without some of the advanced features in the premium version.

More information: www.infovell.com

Via: www.networkworld.com



  • HarshMistress - Sep 08, 2008
    • Rank: 4 / 5 (3)
    My employer bought a Google Search Appliance box and gave it to me to play with. The first thing I found out is that GSA cannot index the deep Web or hidden Web, which is, in our case, URLs (absolute addresses) saved into drop-down lists. Thousands upon thousands of web pages wouldn't be indexed and delivered in the result set on a visitor's request just because the stupid Google search engine doesn't crawl web forms! On top of it, the guy from the GSA/Google Mini support group didn't know anything about the problem, so I had to reinvent the wheel on my own.

    It's high time for a better search engine. If true to their word, Infovell people just got themselves a big $$$ generator.
  • Arikin - Sep 08, 2008
    • Rank: 4 / 5 (2)
    How does it access the subscription or password protected pages??? Did they sign up for everything? :-)
  • earls - Sep 09, 2008
    • Rank: not rated yet
    What we really need is a way to bridge the gap between every site's individual database(s) and make them searchable.

    I'm not positive, but I believe the only way to keep the information published and available is to generate a "hard" HTML copy of the page.
  • paulthebassguy - Sep 09, 2008
    • Rank: 4 / 5 (1)
    To access subscription and password protected pages is just a matter of caching! Just like Google caches pages at the moment. So what would happen is that the engine would cache subscription pages for people that have actually subscribed, which would then appear in the search results of any public user. Then, if the public user wanted to actually access the page, he/she would have to subscribe first.
  • Fred12345 - Sep 12, 2008
    • Rank: not rated yet
    I believe, from the video I just saw, that the limit is 25,000 characters, not words.
  • DoctorKnowledge - Sep 12, 2008
    • Rank: not rated yet
    There are a lot of things Google doesn't search. It's tuned so that the masses searching on "Spears", "Palin", "sex" or "download" get exciting results. This article in a sense is being polite about Google's weaknesses. (We can't call them failings, any more than the Romans called bread and circuses failings.)
