Commonly-used distance calculation in Data Science : Levenshtein Distance, Haversine Distance
There are multiple distance calculations commonly used in data science projects depending on the use cases. Usually, these distance calculations will act as input features for Machine Learning or Deep Learning models. Sometimes, people will relate the distance calculations to similarity with the assumption that 2 similar objects will “sit” close to each other. We will see this assumption in the following examples and the reason for using these distance calculation.
Levenshtein Distance
Reason for Using : When Each character in a word carries important information
Use cases: Location/Hotel/Building/Brand Name matching.
use case 1 : When we have 2 hotels that are located at the same location but the top floor with the hotel name “ABC A1” and the bottom floor with the hotel name “ABC A2”.
Use case 2 : When we have 2 brand that are similar like “Nike” and “Niki”.
Implementation : When we have the scenarios as described in the use cases, Levenshtein distance will be a good measurement to determine the similarity of the 2 words. For simplicity, we could think that Levenshtein distance calculations is the calculation of minimum steps to transform 1 word to another words. Let’s go through this algo with the following example.
word 1 = SUNNY
word 2 = SAND
To transform SUNNY to SAND, we will need to take 3 steps.
- step 1 : substitute character 2 which is U to A
- step 2 : substitute character 4 which is N to D
- step 3 : delete character Y
As we can see, the Levenshtein of SUNNY and SAND is 3.
However, there is a case where we can transform 1 word to another word faster by deleting characters instead of substituting characters. Let’s go for another Example
word 1 = SUUNNY
word 2 = SUNNY
Option 1 : we will take 1 step by deleting third character U to transform SUUNNY to SUNNY. the Levenshtein distance of SUUNNY and SUNNY is 1.
Option 2 : we will take 3 steps to transform SUUNNY to SUNNY.
- step 1 : substitute character 3 which is U to N
- step 2 : substitute character 5 which is N to Y
- step 3 : delete character 6 which is Y.
As we can see, we have 2 options to transform SUUNNY to SUNNY. In the Levenshtein algorithm, we will chose the shortest distance.
Now, we know that logic behind this algorithm. For the next step, we need to understand the implementation of this logic. The implementation steps are displayed in the following chart.
For the first step, we will create a metric and initialize the first row and first column with the character position.
Next step, we need to fill in the blue box one-by-one with the following rule.
- if the character of word 1 is same as the character of word 2, we will assign 0 + surrounding minimum scores (red color scores) to the blue box. This is shown in the second matrix on the top row.
- else if the character of word 1 is not the same as the character of word 2, we will assign 1 + surrounding minimum scores (red color scores) to the blue box. this is shown in the third matrix on the top row.
We could implement this algorithm using the following python code.
def levenshtein_distance(s1, s2):
# Create a matrix to store the distances
rows = len(s1) + 1
cols = len(s2) + 1
matrix = [[0] * cols for _ in range(rows)]
# Initialize the first row and column
for i in range(rows):
matrix[i][0] = i
for j in range(cols):
matrix[0][j] = j
# Compute the distances for the rest of the matrix
for i in range(1, rows):
for j in range(1, cols):
cost = 1
if s1[i-1] == s2[j-1]:
cost = 0
matrix[i][j] = min(
matrix[i - 1][j] + 1, # Deletion
matrix[i][j - 1] + 1, # Insertion
matrix[i - 1][j - 1] + cost # Substitution
)
displayDistances(matrix, len(s1), len(s2))
return matrix[rows - 1][cols - 1]
The last matrix (bottom left) will give us the Levenshtein score shown in the Green color. For more details on the algorithm, we can refer to this wikipedia page.
Haversine Distance
Reason for using : Get the distance of 2 locations in earth surface.
Use cases : Match same shops/hotels/houses from different platform.
Implementation : The implementation of this distance calculation is rather simple. We have the formula documented in this wikipedia page. First, we need to convert the latitude and longitude of 2 locations to radian. Next, we apply the following formula to calculate the Haversine Distance.
def haversine_distance(lat1, lon1, lat2, lon2):
# Convert latitude and longitude from degrees to radians
lat1_rad = math.radians(lat1)
lon1_rad = math.radians(lon1)
lat2_rad = math.radians(lat2)
lon2_rad = math.radians(lon2)
# Haversine formula
dlat = lat2_rad - lat1_rad
dlon = lon2_rad - lon1_rad
a = math.sin(dlat/2) ** 2 + math.cos(lat1_rad) * math.cos(lat2_rad) * math.sin(dlon/2) ** 2
# Radius of Earth in kilometers
radius_earth_km = 6371.0
# Calculate the distance
distance = 2 * radius_earth_km * ( math.asin( math.sqrt(a) ) )
print(distance)
return distance
latitude1, longitude1 = 37.7749, -122.4194 # San Francisco, CA
latitude2, longitude2 = 34.0522, -118.2437 # Los Angeles, CA
distance = haversine_distance(latitude1, longitude1, latitude2, longitude2)
print(f"The Haversine distance between the two locations is: {distance:.2f} kilometers")
We have introduced 2 distance calculations for different use cases. There is another famous distance calculation which is called cosine distance. However, this distance has been introduce earlier. Feel free to check out the explanation in the link below. Hopefully, this blog can enhance your understandings on different distance calculation and open up the possibilities to improve your data science projects.
Cosine Similarity(Distance)
Reason for using : compare multiple features at once.
Use cases : Comparing Apple to Orange, comparing father to mother.
Implementation : https://alex-yeo.medium.com/cosine-similarity-introduction-and-applications-in-nlp-92519407df1