Search engines are everywhere nowadays. Google is the most popular web search engine. Facebook has a search engine that lets you search for people, places, posts, and other content you are interested in. Almost every website has a way to help you quickly find the specific information you are looking for, usually powered by an underlying search engine.
The general idea of a search engine is very simple. You have a repository of documents, like web pages or Facebook posts. You build a software system on top of this repository that answers keyword queries, like “X Y Z”, by returning a sorted set of documents that contain some or all of the keywords in the query.
The algorithm underneath a search engine looks very straightforward as well: the engine can simply go through all the documents and find those that contain the keywords being searched.
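This naive approach can be sketched in a few lines of Python. The document texts and the query here are made-up examples, and the matching is deliberately simplistic (whitespace tokenization, exact word match):

```python
def naive_search(documents, query):
    """Scan every document; return ids of those containing all query keywords."""
    keywords = query.lower().split()
    results = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(kw in words for kw in keywords):
            results.append(doc_id)
    return results

documents = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "a quick brown dog",
}
print(naive_search(documents, "brown dog"))  # → [2, 3]
```

Note that every query touches every document, which is exactly why this breaks down at scale.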
This idea works for a small repository. However, it won’t work at all for a huge one, say a billion-document repository, because it could take forever for your software to go through every document before returning the results of a query.
This is a common phenomenon in computer engineering. A problem can be solved one way at a small scale, say a search over a few thousand documents. But when you try to extend it to a much larger scale, say billions of documents, the problem suddenly looks very different.
An example is the Facebook website, which now has almost 1.5 billion monthly active users and is completely different from when it had only tens of thousands of users. Software engineers love solving problems at scale, which is one of the biggest challenges in the field. There is even an annual industry conference, @scale, dedicated to discussing how to solve problems at scale in computer engineering.
Back to building a search engine at scale. The smart idea is to first build an index, or an inverted index as it is called in technical terms. This lets the search engine quickly retrieve all documents that contain the keywords in a query instead of going through every document on demand. An inverted index is very similar to the index of a thick book: it tells you which documents contain a specific keyword. The main difference is that an inverted index for a search engine is usually so huge that it can only be built by computers, and it may even take thousands of computers to build.
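A minimal sketch of the idea, again with made-up toy documents: build the index once, mapping each word to the set of documents containing it, then answer queries by intersecting those sets instead of scanning everything.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Intersect the posting sets of the keywords to answer the query."""
    keywords = query.lower().split()
    postings = [index.get(kw, set()) for kw in keywords]
    return set.intersection(*postings) if postings else set()

documents = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "a quick brown dog",
}
index = build_index(documents)
print(sorted(search(index, "brown dog")))  # → [2, 3]
```

The expensive scan now happens once at index-build time; each query only looks up a handful of posting sets, which is what makes the approach viable for huge repositories.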
So you can build a search engine for a huge repository using an inverted index, which takes a large amount of computing power to build. The index can give you all documents that contain the keywords of a query.
However, that alone is far from a useful search engine. A huge part you are missing is document ranking. It’s very likely the inverted index gives you tens of thousands or even millions of documents for a query, because queries are usually short and documents are relatively much longer. It’s useless for a human to get so many documents; finding what you want in them is like searching for a needle in a haystack. Imagine if you had to look at the first 100 pages of Google’s search results before you found the web page you were looking for. Instead, Google sorts all results according to some complicated algorithms so that people usually find what they want on the first search result page. That’s why Google is so popular and useful.
Document ranking is another big and challenging area of search engines. People have invented tons of smart algorithms to rank documents using many collected signals extracted from the documents, the queries, and even the people who submit those queries.
Overall, two general ideas lie behind a search engine: the inverted index and document ranking. Both involve many techniques and have many subareas that attract enthusiastic engineers and researchers, who put endless effort into improving them, little by little.