|analog||new technology in text processing|
In everyday life we often have to deal with large textual data sources, for example when we browse the Internet or a database in search of a particular piece of information. In such cases a typing error, whether in the query or in the stored text, can cause a serious problem. The ANAL0G associative text searching system makes it possible to perform complex searching and data-collecting tasks even when the piece of information sought is not known exactly or is not stored correctly.
Take, for example, a multilingual document registration system, and suppose that the task is to find all the documents related to orthopaedics. If the search is carried out in the right set of documents, then while looking for the word 'orthopaedy' we will surely find some related words as well. Today's registration and retrieval systems generally allow searching by substring. So when searching with the word stem 'orthopaed' we will also find documents containing the words 'orthopaedy', 'orthopaedist' and 'orthopaedik'. Of course, it is possible to run more and more searches to find the Hungarian ('ortopéd'), Latin ('orthopaed') or German ('orthopäd') equivalents. If we are interested in French, Dutch or Spanish documents, we can naturally find those with the help of the proper dictionaries and our new searching method.
But what about errors in the searched text itself? Whoever typed the words containing the word stem 'orthoped' may not have looked them up in the Orthographical Dictionary, and the person who typed 'arthoped', 'otoped', 'ortoned' or 'atoped' must simply have been tired. If the document is very important to us, we can try searching with other keywords (again and again, which is very laborious) or read all the documents (which is hopeless for a large database or the Internet, for example). For very big systems it is often claimed that there is no such thing as faultily stored data. This claim is, of course, very difficult to accept. We have not accepted it either.
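
The limits of substring search described above can be illustrated with a minimal sketch. The function name `substring_search` and the word list are hypothetical and chosen only for illustration; the list simply reuses the example words from the text. This is a plain literal substring match, not a part of ANAL0G.

```python
def substring_search(stem, words):
    """Return the words that contain `stem` as a literal substring."""
    return [w for w in words if stem in w.lower()]


# Example words taken from the text: correct forms, other languages, typos.
words = [
    "orthopaedy", "orthopaedist", "orthopaedik",  # contain the stem
    "ortopéd", "orthopäd",                        # Hungarian, German
    "arthoped", "otoped", "ortoned", "atoped",    # typing errors
]

hits = substring_search("orthopaed", words)
missed = [w for w in words if w not in hits]

print("found: ", hits)
print("missed:", missed)
```

Only the three words that literally contain 'orthopaed' are found; the foreign-language forms and all the mistyped variants are missed, so each would need its own additional query.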
The complexity of this task is illustrated by the fact that, although this type of text search is a fundamental requirement, even the largest database management systems and Internet search engines do not provide a tool of this kind. The main reason is that the general solution of the task is very inefficient (combinatorially, it is a so-called 'exponential' problem). Built on new theoretical foundations, our system, ANAL0G, provides an efficient solution.
Our method, based on the comparison of character sequences, offers solutions not only to the problems mentioned above but also, for example, to phoneme-based speech recognition, associative dictionaries, and the search of DNA structures for mutant gene sequences.
The ANAL0G web search engine has the following major advantage over other search engines (such as Google, Yahoo!, Live Search, etc.): the comparison of the keywords with the words in a document is based on an abstract definition of similarity, which is a function of the number of matching characters and of how well their order agrees. A word is accepted if its similarity to the keywords given by the user exceeds a similarity limit that is also set by the user. Accepting a word requires neither a matching substring nor any kind of error model.
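
The acceptance rule can be sketched as follows. The text does not give ANAL0G's actual similarity function, so this example substitutes a simple stand-in as an assumption: the length of the longest common subsequence (which counts matching characters that appear in the same order), normalised by the lengths of the two words. The names `lcs_length`, `similarity` and `associative_search` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Stand-in similarity in [0, 1]; an assumption, not ANAL0G's definition."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def associative_search(keyword, words, limit):
    """Accept every word whose similarity to the keyword exceeds the limit."""
    return [w for w in words if similarity(keyword, w) > limit]


words = ["ortopéd", "arthoped", "otoped", "ortoned", "atoped", "orange"]
accepted = associative_search("orthoped", words, limit=0.7)
print(accepted)
```

With a user-chosen limit of 0.7, this stand-in accepts all five misspelled or foreign variants from the earlier example and rejects the unrelated word 'orange', without any substring having to match. Raising the limit makes the search stricter; lowering it makes it more tolerant.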