News


Better methods for similarity searches


Rasmus Pagh will head a comprehensive new research project backed by a €1.9 million ‘Consolidator Grant’ from the European Research Council.

“It’s a bit of a high-risk gamble, because we’re trying to do something that a number of pretty clever people have attempted before. But we believe we may have the additional clues that can lead us towards some kind of a breakthrough,” he says.

The project aims to find new and improved methods for so-called ‘similarity searches’. These are searches where the user has one piece of information and is looking for something that is similar to it, but without knowing exactly what the searched-for object looks like.
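The idea can be illustrated with a minimal, exhaustive version of such a search: compare the query against every item in the collection and keep the close ones. The feature bits, names, and data below are hypothetical, purely for illustration:

```python
# Minimal sketch of an exact similarity search over binary feature vectors
# (hypothetical data and names). A linear scan is slow but guaranteed to
# find every item within the chosen distance of the query.

def hamming(a, b):
    """Number of positions where two bit-vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def similar_items(query, collection, max_distance):
    """Scan everything; never misses an item within max_distance."""
    return [item for item in collection if hamming(query, item) <= max_distance]

lion = (1, 0, 1, 1, 0, 1)        # hypothetical feature bits for the query photo
pictures = [
    (1, 0, 1, 1, 0, 0),          # another big cat: differs in one bit
    (0, 1, 0, 0, 1, 0),          # a hat: differs in six bits
    (1, 0, 1, 0, 0, 1),          # a painting of a lion: differs in one bit
]

print(similar_items(lion, pictures, max_distance=1))
# finds the two lion-like pictures, not the hat
```

The guarantee comes from looking at every item, which is exactly what stops scaling as data grows; the faster methods discussed below trade that guarantee away.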

He will also examine mathematically whether the methods that many search engines use in practice actually work. The methods are based on algorithms and ad-hoc solutions but have no theoretical basis.


No guarantees


As an example of a similarity search, Pagh uses a photo of a lion. You want to find other pictures that are similar to the one you have, but you don’t know whether the photo collection contains other lions, other cats or perhaps a painting of a lion.

It is far from certain that the search finds all results that are similar to the lion in your picture, since the methods used today are based on sampling, and the effect of the sampling on the quality of the search results is unknown. So even though you get some results, you cannot be certain that you have found all the relevant data.

“Imagine a doctor searching for cases of a disease similar to the one he wants to know about, and the search does not find a treatment that saved a patient somewhere else. Here you would want to know for certain that the software finds it if it’s there to be found,” he says.


You are likely to be unlucky



With the existing methods, the computer first converts our lion picture into electronic 1s and 0s, or yeses and nos. The software samples a variety of parameters: Is it a hat? No. Is it an animal? Yes. And so on.

The computer then finds other pictures that give the same answers to the same questions. However, it selects the parameters to examine at random, which is why the search may end up looking either for other pictures of animals or merely for other pictures without hats.

“If you’re lucky, the right parameters are selected, but there is a fairly high likelihood that you’re unlucky. These are the theoretically best algorithms available today – and they are only slightly faster than looking through all the data.”
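The random-parameter idea described above can be sketched as bit-sampling hashing: each hash function looks only at a few randomly chosen parameters, and only items that agree on those parameters end up in the same bucket. Everything below, the feature bits, the collection, the function names, is a hypothetical illustration, not Pagh's method; real systems use far longer vectors and many hash tables:

```python
import random

# Sketch of bit-sampling hashing over binary feature vectors (hypothetical
# data). Similar items usually, but not always, agree on the randomly
# sampled bits; when they do not, the search misses them.

def make_hash(num_bits, sample_size, rng):
    """Pick random bit positions; the hash is the item's values there."""
    positions = rng.sample(range(num_bits), sample_size)
    return lambda item: tuple(item[p] for p in positions)

def build_index(collection, h):
    """Group items into buckets by their sampled bits."""
    buckets = {}
    for item in collection:
        buckets.setdefault(h(item), []).append(item)
    return buckets

def query(index, h, q):
    """Scan only the bucket matching the query's sampled bits; near
    neighbours are missed whenever the 'wrong' bits were sampled."""
    return index.get(h(q), [])

rng = random.Random(0)
lion = (1, 0, 1, 1, 0, 1)
photos = [(1, 0, 1, 1, 0, 0), (0, 1, 0, 0, 1, 0), (1, 0, 1, 0, 0, 1)]

h = make_hash(num_bits=6, sample_size=3, rng=rng)
index = build_index(photos, h)
print(query(index, h, lion))  # may or may not contain the lion-like photos
```

Whether the lion-like photos are returned depends entirely on which bits the random sample happened to pick, which is the "luck" Pagh describes; practical schemes repeat the experiment with many independent hash functions to push the failure probability down, at a cost in computing resources.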


Getting harder to use good methods


However, since this method – known as locality-sensitive hashing – also places great demands on computing resources, many systems use methods where the mathematical quality of the search result is unknown.

In the future, it will become increasingly difficult to use the good methods.


Facts


The amount of data is growing at an explosive rate – doubling every two years, according to market intelligence firm IDC. Although computing power is also growing, the ability to carry out complex computations is not growing anywhere near as fast as the amount of data.
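A back-of-the-envelope projection shows how quickly that doubling compounds (this simply extrapolates the IDC figure as an assumption):

```python
# If data doubles every two years, how much more data is there after n years?
def growth_factor(years, doubling_period=2):
    return 2 ** (years / doubling_period)

print(growth_factor(10))  # 32.0 times as much data after a single decade
```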

Searches that today can still go through all the data will then be limited to the search methods that – for now – are statistically uncertain.

“Part of the motivation for the research project is to ensure that we will continue to be able to use similarity searches in the future. Another part is to describe the existing systems, which have some algorithms that work most of the time. But sometimes they do not work, and it is not always entirely clear when they do not work,” says Pagh.