Creating a Search Engine For America’s Most Expansive Food Database

After building one of America’s largest food databases, Seek Solutions was tasked with creating a fast, intuitive, and user-friendly way to search through it.

Building a search engine can be overwhelming. While third-party libraries handle some of the necessary grunt work (stemming, lemmatization, tokenization, ranking algorithms, and so on), there are still hundreds of factors to consider when building one. In this article, we describe in detail the key considerations needed to design and implement an intuitive and accurate search engine.

Start with the user in mind.

Before deciding on any technical detail, think: how is a user going to “think up” their query? Is there any other information we can use to give context to our text search engine? The answers to these questions give us a solid foundation and help us home in on a group of viable approaches to the problem.

How a user “thinks up” a query

The process of transforming intent in a user’s mind into words in a search box is crucial for us to understand. In the context of Track Change Thrive:

Users want to log their food using precise keywords. They will usually have a branded food item with its label in front of them while searching.

Users vary word order only slightly. When a user logs a branded food, they may search for the brand along with the food, but the brand could come before or after the food name. Beyond this, the word order of the item itself is fairly predictable, and exact matches are also relevant.

Autocomplete is important. When the final word(s) of a user’s query is incomplete, we still need to be responsive and show the user relevant results.

Typo tolerance is moderately important. Because search terms are often right in the user’s sight on a label, typo tolerance is a factor but not one to optimize for. Autocomplete also reduces the need for some typo tolerance, because shorter queries are less likely to be misspelled.

With these three things in mind, we can home in on an implementation strategy for text search. This use case lends itself to more traditional relevance-ranking strategies rather than a (trendier) vector search strategy. If users tended to write longer queries with a more varied vocabulary and sentence structure, a vector approach might be useful. However, we know that users are going to enter relatively short queries built from a similar set of keywords.

This also informs how we tokenize the text. We can rule out large n-gram tokenizers because word order is mostly consistent and queries are short, so many of the tokens an n-gram tokenizer generates won’t yield useful information. A blended approach between word-order-specific and word-order-agnostic tokenization works best here.
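To make the distinction concrete, here is a minimal plain-Python sketch (not tied to any particular search engine, and using a made-up record and queries) contrasting word-order-specific phrase matching with word-order-agnostic token-set matching:

import re

def tokens(text):
    # Lowercase and split on non-alphanumeric characters (a crude word tokenizer).
    return set(t for t in re.split(r"[^a-z0-9']+", text.lower()) if t)

record = "Kellogg's Apple Jacks Cereal"
queries = ["kellogg's apple jacks", "apple jacks kellogg's"]

for q in queries:
    phrase_match = q.lower() in record.lower()   # word-order-specific
    token_match = tokens(q) <= tokens(record)    # word-order-agnostic
    print(f"{q!r}: phrase={phrase_match}, tokens={token_match}")

A blended index scores both: exact or near-exact phrase matches rank highest, while the token-set match keeps brand-before and brand-after queries from missing entirely.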

Personalizing results

One of the best ways to make a search engine more accurate is to use user data to boost relevance. Research shows that people eat habitually: we tend to eat the same things every day, around the same time, and in similar portions. Recency, frequency of consumption, and time of day are three of the biggest predictive factors. Naturally, we decided to boost foods that score high on these metrics so they rank above results with a higher raw search-relevance score. For example, searching “apple” may boost “Apple Jacks Cereal” for a user who enjoys their cereal every morning, “Honeycrisp apple” for the fruit lover, and “apple pie” for someone with a sweet tooth (like me). In our case, personalized logging can often eliminate the need to search at all.
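As an illustration, here is a minimal sketch of this idea (the field names, caps, and weights are hypothetical, not our production scoring) in which recency, frequency, and time-of-day signals from a user’s logging history are combined into a boost that is added to the raw text-relevance score:

from datetime import datetime

def personal_boost(history, food_id, now=None):
    """history: list of (food_id, logged_at datetime) tuples from the user's log."""
    now = now or datetime.now()
    logged = [t for f, t in history if f == food_id]
    if not logged:
        return 0.0

    # Frequency: how often this food has been logged, capped at 20 entries.
    frequency = min(len(logged), 20) / 20

    # Recency: linear decay over the 30 days since the most recent log.
    days_since = (now - max(logged)).days
    recency = max(0.0, 1 - days_since / 30)

    # Time of day: how close the current hour is to the user's usual logging hour.
    usual_hour = sum(t.hour for t in logged) / len(logged)
    gap = abs(now.hour - usual_hour)
    time_of_day = 1 - min(gap, 24 - gap) / 12

    # Illustrative weights; in practice these would be tuned against real logs.
    return 0.5 * frequency + 0.3 * recency + 0.2 * time_of_day

# final_score = raw_relevance_score + personal_boost(user_history, candidate_food_id)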

Analyze the text data

One of the challenges of building a text search index is to partition and tokenize the text in a way that creates informative pattern matches without any “bloat”: tokens that provide no information and needlessly increase the size of the index. For example, if the database record “apple pie” were only indexed as the entire string, the queries “apple”, “apple p”, and “apple pi” wouldn’t have a chance to match. On the other hand, a bigram tokenizer would generate a lot of useless tokens such as “pl”, “e ”, “le”, and so on.

Text Length

Many tokenizers depend on a minimum and maximum length: edge n-gram, n-gram, trailing edge n-gram, and so on. A good heuristic for choosing those lengths is to look at the distribution of text lengths in your search domain. The average is a reasonable starting point, but it should be tuned based on where the information resides in the string.

If using a word tokenizer such as Lucene Standard or whitespace, how long are those tokens on average? This can help decide a min and max gram length for a token filter that will generate smaller tokens that start at each word.
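For instance, a quick dependency-free script along these lines (food_names is a stand-in for however you load your records) surfaces the token-length distribution that guides those min and max gram choices:

from statistics import mean, median

# Stand-in sample; in practice, load the full set of food names from the database.
food_names = ["Honeycrisp Apple", "Apple Jacks Cereal", "Classic Apple Pie"]

token_lengths = sorted(len(tok) for name in food_names for tok in name.split())

print("tokens:", len(token_lengths))
print("mean length:", round(mean(token_lengths), 1))
print("median length:", median(token_lengths))
print("95th percentile:", token_lengths[int(0.95 * (len(token_lengths) - 1))])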

For example, breaking apart the string “The Shawshank Redemption” with a whitespace tokenizer and an edgeGram token filter with a minimum gram length of 2 and a maximum gram length of 4, the process will look like this (a plain-Python sketch follows the list):

  1. The whitespace tokenizer will produce: "The", "Shawshank", and "Redemption".
  2. The edgeGram token filter will then process each token:
    • "The" -> "Th", "The"
    • "Shawshank" -> "Sh", "Sha", "Shaw"
    • "Redemption" -> "Re", "Red", "Rede"
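A minimal plain-Python sketch of that pipeline (not any particular engine’s analyzer API) reproduces the walkthrough:

def whitespace_tokenize(text):
    return text.split()

def edge_ngrams(token, min_gram=2, max_gram=4):
    # Leading substrings from min_gram up to max_gram characters (or the token length).
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

for token in whitespace_tokenize("The Shawshank Redemption"):
    print(token, "->", edge_ngrams(token))

# The -> ['Th', 'The']
# Shawshank -> ['Sh', 'Sha', 'Shaw']
# Redemption -> ['Re', 'Red', 'Rede']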

In an autocomplete setting, it might help to find partial matches up to the first 4 characters. But if you have to autocomplete against a lexicon with words like pneumonoultramicroscopicsilicovolcanoconiosis (yes, it’s real), a larger max-gram value might be necessary.

Information location within the text

Sometimes information is located at the beginning of a string (if you’re looking for area codes in a database of phone numbers). Sometimes it’s at the end (as when looking for the last four digits of a credit card). Sometimes it’s in between.
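As a rough sketch with made-up sample values, leading edge n-grams capture the area-code case, while reversing the string first yields the trailing-edge variant that captures the last four digits:

def edge_ngrams(text, min_gram, max_gram):
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]

phone = "6175551234"          # area code lives at the start
card = "4242424242424242"     # the useful digits live at the end

print(edge_ngrams(phone, 3, 3))                          # ['617']
print([g[::-1] for g in edge_ngrams(card[::-1], 4, 4)])  # ['4242']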

Grammatical structure

Does your text contain a lot of verbs, adjectives, stop words, nouns, proper nouns, full sentences, titles, and phrases? Do those words provide any information relevant to the search? The answers help clarify what kinds of words can be dropped from the index. They also elucidate what sorts of stemming and lemmatization strategies to use.

By the way, if you’re curious to analyze the grammatical structure of your data, you can use a library like spaCy to get on your way:

import spacy

# Load spaCy's small English pipeline (install with: python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Print each token's text, lemma, part of speech, fine-grained tag,
# dependency label, shape, and whether it is alphabetic or a stop word.
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)


Language support

What languages does the search support? This can influence almost every part of the engine, from stemming and lemmatization to tokenization, folding diacritics and other characters, and more.
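For example, here is a minimal sketch of diacritic folding using only Python’s standard library, the kind of character folding an analyzer applies so that “açaí” and “acai” resolve to the same term:

import unicodedata

def fold_diacritics(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_diacritics("açaí bowl"))     # acai bowl
print(fold_diacritics("crème brûlée"))  # creme brulee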

In Summary

Creating a powerful search engine for America’s most expansive food database involves a comprehensive understanding of user intent and behavior, thoughtful analysis of the text data, and personalization strategies that enhance relevance. By focusing on how users formulate their queries, emphasizing exact matches and autocomplete, and balancing typo tolerance, Seek Solutions has developed a search engine that effectively meets user needs. Personalized results based on eating habits and patterns further refine the search experience. Finally, analyzing the text data itself, its length, where the information resides, its grammatical structure, and the languages it must support, ensures that the search engine is both efficient and accurate. This approach combines traditional relevance-ranking strategies with personalized data and domain-specific knowledge to provide a seamless and intuitive search experience for users logging their meals.