Breaking into the magic of search engines

We have been doing some really neat R&D work for clients recently, focusing on WordPress search functionality. This is the first in a series of articles I intend to write on the subject of search engines.


In this article I will describe why search engines are so special and how they work. You don’t need to be a developer or an engineer to understand this article! At the end of this post I will finish with a real life example of a search facility we built and implemented for a client site.

Introduction

We are all familiar with search engines; we use them every day. They’re something I think most of us cannot live without any more.

Google is the best-known search engine today, but did you know it does more than simply search for exactly what you type in?

If we make a typo in our search terms, it suggests the correct word or term. We may also get results that contain synonyms of our terms, and it nearly always returns some results, even if they only use words similar to those we typed.

How do we get there? How can a machine understand what we type and return results? Lots of computer science is involved, and decades of research in algorithms have produced what now seems normal for everyday use.

In this article we explore some common search engine features and we discuss how and why WordPress cannot provide them.

How WordPress performs searches

The search function in WordPress is pretty naive; when you type your terms, the system looks for posts whose text contains exactly those sequences of letters.

Let’s assume that we have some posts that contain the following:

The quick brown fox jumps over the lazy dog.

If you search for fox, WordPress picks posts that contain the word fox or foxes, but not vixen. It would also return posts containing foxtrot: basically anything that contains the letters fox.

What it does is a database scan; it reads all the posts looking for the keywords. This approach is really expensive in terms of computational power and it can lead to a large website going offline, due to the heavy load of multiple simultaneous searches.

A database scan is rather like reading through a book, word for word, searching for content.
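To make the scan concrete, here is a minimal Python sketch of a substring search over a handful of posts. It is an illustration only, not WordPress's actual code; the `POSTS` data and the `naive_search` function are hypothetical names invented for this example.

```python
# Hypothetical in-memory stand-in for the posts table: post ID -> post text.
POSTS = {
    1: "The quick brown fox jumps over the lazy dog.",
    2: "Two foxes danced the foxtrot.",
    3: "A vixen watched from the hedge.",
}


def naive_search(term, posts):
    """Return the IDs of posts whose text contains `term` as a substring."""
    needle = term.lower()
    # Every post is read in full on every search: this is the "scan".
    return [post_id for post_id, text in posts.items() if needle in text.lower()]


print(naive_search("fox", POSTS))    # [1, 2]  (matches "foxes" and "foxtrot" too)
print(naive_search("vixen", POSTS))  # [3]     (no link to "fox" at all)
```

The cost grows with the total amount of text, because nothing is prepared in advance: each search starts reading from the first post again.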

If you are a magazine or a newspaper with an extensive archive of content, perhaps hundreds of thousands of posts, then good search is crucial to giving your readers ready access to that information.

Dedicated search engines

When you get big enough, the standard WordPress search reaches its limits. You could have a Google powered search in your website, as well as some other options and tools that are available, but if you want to customise the experience, you have to look for something else. Specific customer needs will make you realise pretty soon that the Google-box is not enough.

The above-mentioned systems work by analysing what is public and they are not tailored to your business. You cannot have an advanced search form based on categories or a complex taxonomy, for example, or look behind a private paywall.

When you need full control over searches, you need a dedicated search server.

A dedicated search engine will store the posts (including those behind the paywall) in a particularly search-friendly way, so that it can retrieve the text and the post ID without scanning the entire database.

For example, in a good search engine, when we search for vixen we actually retrieve results for vixens and foxes too. The system is aware of synonymous language use; if we simply search the word fox, we receive results for every alternative term for fox known to the English dictionary.

You can build your own list of synonyms and acronyms. If your website specialises in a particular sector, we can retrieve the same results for terms like MS and Microsoft, UK and United Kingdom, Brexit and What were they thinking?, and so on.
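As a rough illustration of such a list, the sketch below expands a search term into every term grouped with it. The `SYNONYM_GROUPS` data and the `expand_term` function are hypothetical; real search servers have their own formats for synonym configuration.

```python
# Hypothetical synonym groups; a real list would be curated for the site's sector.
SYNONYM_GROUPS = [
    {"ms", "microsoft"},
    {"uk", "united kingdom"},
    {"fox", "vixen"},
]


def expand_term(term):
    """Return the term plus every synonym listed alongside it, sorted."""
    key = term.lower()
    for group in SYNONYM_GROUPS:
        if key in group:
            return sorted(group)
    # Unknown terms are searched as typed.
    return [key]


print(expand_term("MS"))      # ['microsoft', 'ms']
print(expand_term("Brexit"))  # ['brexit']
```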

The benefits of a dedicated search engine don’t end here either.

Dedicated search engines are useful for e-commerce, document management software, CRM and everything that has a more complex structure.

Last, but definitely not least, is performance. A search engine works very differently from a regular database system; it uses special algorithms and mathematical formulas to store and retrieve information using keywords.

Database and search engines

The purpose of the database is to store our data exactly how we inserted it. We want our database to take care of our posts and retrieve them quickly using the SQL language. We don’t want our data to be altered in any way; we need data integrity and reliability.

The purpose of a search engine is to process the data in order to retrieve items quickly, by altering the text in some special way. It also knows when we misspell some words, or when we want more results that are similar to a given post.

Complex search servers cannot replace databases, nor vice versa.

A practical example

We know that fox and vixen are the same animal, and their plurals are foxes and vixens. Humans easily grasp that these terms refer to the same concept, but a machine needs a bit of help.

Before storing the text in our search engine, we could convert all synonyms and plurals to a base form: for example, vixen is converted to fox, and foxes and vixens are converted to fox too.

But we can go further. Every long text is filled with words that don’t matter much for our desired search result.

Those words are too generic and can pollute our search results. They are called stop words and include terms like the, over, also, go and and.

So, our text:

The quick brown fox jumps over the lazy dog.

becomes:

quick brown fox jump lazy dog.
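The filtering step can be sketched in a few lines of Python. The word lists here are hypothetical and deliberately tiny, chosen to match the example sentence; real search engines typically use stemming algorithms and much longer stop-word lists rather than a hand-written lookup table.

```python
import re

# Hypothetical, deliberately tiny word lists chosen to match the example sentence.
STOP_WORDS = {"the", "over", "also", "go", "and"}
BASE_FORMS = {"foxes": "fox", "vixen": "fox", "vixens": "fox", "jumps": "jump"}


def analyse(text):
    """Lowercase the text, split it into words, drop stop words, map to base forms."""
    words = re.findall(r"[a-z]+", text.lower())
    return [BASE_FORMS.get(word, word) for word in words if word not in STOP_WORDS]


print(analyse("The quick brown fox jumps over the lazy dog."))
# ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
```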

Now we need to store the text and associate it to a post ID in order to retrieve it quickly, by just looking at the words.

We actually already use a system that solves this problem…in books!

Just as a book’s index stores a list of words and associates each with a page number, we associate each word with a post ID. A search server uses what is called an inverted index.
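Here is a minimal sketch of an inverted index in Python, assuming the posts have already been filtered into lists of words. The `DOCS` data and both function names are hypothetical; a real search server stores far more per word (positions, frequencies and so on).

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of post IDs whose filtered text contains it."""
    index = defaultdict(set)
    for post_id, words in docs.items():
        for word in words:
            index[word].add(post_id)
    return index


def search(index, *terms):
    """Return the sorted IDs of posts that contain every one of the terms."""
    hits = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*hits)) if hits else []


# Hypothetical posts, already filtered into base-form words.
DOCS = {
    1: ["quick", "brown", "fox", "jump", "lazy", "dog"],
    2: ["lazy", "cat", "sleep"],
    3: ["fox", "hunt", "night"],
}
index = build_inverted_index(DOCS)
print(search(index, "fox"))          # [1, 3]
print(search(index, "lazy", "fox"))  # [1]
```

The work of reading the posts happens once, when the index is built. A search is then a lookup by word, just like turning to the back of a book.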

Misspelling

If some or all of our search terms are not found, the search engine assumes that those words are misspelled.

If we read a text and we see a typo, we usually spot it because we know what the correct word looks like; we pick the right sequence of letters that forms the word from the dictionary stored in our brain!

In our search server we have a dictionary built from the indexed text, and we need to pick the word that looks closest to the one that was typed.

Have you ever misspelled a word in a word processor? The program spots the error because it cannot find the word in its dictionary, and it then picks the words spelt most like it, using a particular algorithm.

This algorithm belongs to a family known as string metrics; we measure the distance between the misspelled word and each word in the dictionary, and we pick the closest.

So, if the search engine cannot find the terms in the index, it starts to measure the distance between the terms and the words in the dictionary and picks the closest one. That’s how the did you mean functionality is implemented.
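One well-known string metric is the Levenshtein (edit) distance, which counts the single-letter insertions, deletions and substitutions needed to turn one word into another. This article does not say which metric a given search server uses, so treat the sketch below as one possible choice; `did_you_mean` and `DICTIONARY` are hypothetical names.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-letter inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def did_you_mean(term, dictionary):
    """Return the dictionary word with the smallest distance to `term`."""
    return min(dictionary, key=lambda word: edit_distance(term, word))


# Hypothetical dictionary built from the indexed text.
DICTIONARY = ["quick", "brown", "fox", "jump", "lazy", "dog"]
print(edit_distance("kitten", "sitting"))  # 3
print(did_you_mean("quikc", DICTIONARY))   # quick
```

Comparing the term against every dictionary word is fine for a sketch; production systems use cleverer data structures to avoid measuring every distance.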

Most similar content

Finding similar content is not a trivial problem for search engines. Deciding whether one text is similar to another involves so many factors that solving this kind of problem is never simple.

For the purposes of this post, we will stick with the simple approach to this problem.

When we read the same news in two different newspapers, how do we tell that we are reading the same story? We probably see the same words in both headlines; if it’s sport news, we would read the same athletes’ names. Alternatively we could work on the actual meaning of the text, but this is a complex matter and we are going to keep things simple here, so we concentrate on the number of common words across the two articles.

In our search server, to recognise similar content we count the occurrences of each word in the post (this is where the filtered text comes in handy, as it prevents false positives) and build a table of these counts.

We then compare that table with the others stored in our index. The comparison is really a measurement of the distance between tables; in practice we treat each table as a vector in a mathematical space. Lots of algebra is involved in getting the most similar posts!
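A common way to compare two word-count tables is cosine similarity, which treats each table as a vector and measures the angle between them. The article does not name the exact measure used, so this is a sketch under that assumption; `DOCS` and `most_similar` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two word-count vectors, from 0.0 to 1.0."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(post_id, docs):
    """Return the ID of the other post whose word counts are closest."""
    others = (other for other in docs if other != post_id)
    return max(others, key=lambda other: cosine_similarity(docs[post_id], docs[other]))


# Hypothetical posts, already filtered into base-form words.
DOCS = {
    1: ["fox", "jump", "lazy", "dog"],
    2: ["fox", "hunt", "lazy", "dog"],
    3: ["bank", "raise", "interest", "rate"],
}
print(cosine_similarity(DOCS[1], DOCS[2]))  # 0.75
print(most_similar(1, DOCS))                # 2
```

Posts 1 and 2 share three of their four words, so they score 0.75; posts 1 and 3 share none, so they score 0.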

Nice, where can I see an example?

We built and implemented a search solution for The Lawyer. We indexed over 110,000 articles and the system returns results in a matter of milliseconds. We even indexed the premium content that is normally hidden from public search engines. The system corrects misspellings, so results are still returned even when a term is typed incorrectly.

Summary

In this article I showed how the most popular search engine features work. In the real world things are more complex than that, but I tried to keep this article simple and I hope you enjoyed reading it.

If you’re keen to use our search facility on large WordPress sites, please feel free to drop us a line at [email protected] and we’d be happy to talk to you.

