Finding a record in a database with Python

A practical tutorial on searching with FuzzyWuzzy

Photo by K8 on Unsplash

Finding a record within a database can be automated. You can use such a script to automate many different routine tasks. It can for example be used to get information out of the database with a voice-assistant or do multiple searches in parallel for completing financial administrative tasks.

In this tutorial we are gonna keep it simple and make a basic script that takes as an input a name and birthdate and as an output gives you a record. We are gonna start with choosing from a list through searching on the basis of the birthdate. After confirming the name is correctly identified we are going to choose the person from the list. Here is the workflow:

The reason we search for the birthdate is that the birthdate is often a more reliable way of communicating information about a person. The spoken name can often be spelt in many different ways, while a birthdate can not. Of course you have the issue of twins, but that’s not something we are going to look into right now. The first step we are going to focus on here is applying a for loop to finding a birthdate in a list of birthdates. We first load the dummy database csv file with pandas:

We make a basic input prompt to ask for the input name and birthdate and we search the second column of the dummy database for a matching birthdate.

When we get to a matching birthdate we check the name. We can do this with the Levenshtein distance. In short, the Levenshtein distance is calculated by finding the easiest way to transform one string into another. The Levenshtein distance can be calculated with different protocols. For our purpose, we need the partial ratio. To illustrate why, let’s take names from three different cultures (Dutch/German, Spanish/Portuguese and Japanese-English, see table). Often you will find yourself only able to write down only part of the names accurately in a multicultural community. At other times, names consist of two combined last names of which only one is used for verbally communicating information about a person. If we calculate the outcome of the different protocols, complete string ratio, partial string ratio, sorted token ratio and token set ratio, we get the following outcomes:

Focusing on only a part of the string with the partial ratio protocol seems to eliminate the parts of names that can be confusing, such as second names or other additional words. So the partial ratio seems to be the most helpful. We import fuzzy-wuzzy as fuzz and incorporate the partial ratio of the fuzzy-wuzzy package to do name matching:

Finally, we use a cutoff score to minimise the chance of getting the wrong person. Depending on the number of people in your registry you should adjust the cut-off for your application. We incorporate the cut-off in an if statement and return the personal information:

Testing provides the following conversation:

You can find the full code in a notebook on my Github page. Fun things you can try to do are trying to give a more elaborate summary of person his/her characteristics (calculated age or country) to the user or provide an option to the user to choose from matches that have a partial_ratio > 75%.

MD and PhD. Let’s bring health and tech together.