Clustering similar images with pHash
Image hashing is a technique for generating distinct "fingerprints" of images which can be used to identify and group together similar images. "phash" is one of the most popular and effective hashing algorithms. We tried it on 10k images from our archive and had promising results.
This blog is a walkthrough of how we constructed the phashes with the Imagehash library, created easily navigable clusters (groups) of images whose fingerprints (hashes) are identical, and found images that are similar to a query image. An elegant feature of phashes is that similar images will have similar hashes. To know how the hashing algorithm works, check out this other blog
# Import necessary libraries
import imagehash
import os
import pandas as pd
import numpy as np
from PIL import Image
from datetime import date, datetime, timedelta
from time import perf_counter
import wget
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline
def get_hashes(path):
hashfunc = imagehash.phash
failed = []
total = 0
success = 0
image_filenames = []
image_filenames += [file for file in os.listdir(path)]
images = {}
for img in image_filenames:
if img.split(".")[-1] in ["jpg", "jpeg", "png"]:
try:
phash = hashfunc(Image.open(os.path.join(path, img)).convert("RGBA"))
images[img] = str(phash)
success +=1
except Exception:
failed.append(img)
continue
total +=1
print("phash generated for {} of {} images".format(success, total))
return images
Calling the function and timing its execution in seconds:
start = perf_counter()
hashes = get_hashes(os.getcwd()+"/images")
delta = perf_counter() - start
print(delta)
2. Cluster creation¶
Next, we load the hashes into a Pandas dataframe for easier viewing and data manipulation.
df = pd.DataFrame.from_dict(hashes, orient="index", columns = ["phash"])
df.reset_index(inplace=True)
df.rename(columns={"index":"filename"}, inplace=True)
df.head(3)
The next two cells convert the image filenames into their public URLs which can be opened for quick image viewing.
def add_url(filename):
url = "https://s3.ap-south-1.amazonaws.com/sharechat-scraper.tattle.co.in/" + filename
return url
df["img_urls"] = df["filename"].map(add_url)
Then we apply some grouping, aggregation and sorting operations to the data and see the image clusters emerge -
grouped = df.groupby(by="phash").agg({"filename":"size", "img_urls":list})
grouped.rename(columns={'filename':'count'}, inplace=True)
sorted = grouped.sort_values("count", ascending=False)
sorted.reset_index(inplace=True)
sorted.head(5)
The output of the cell above shows the first five rows of the transformed data. The "phash" column contains distinct image hashes. The "count" column shows us how many images are clustered together, i.e. they share the same phash and therefore will have similar visual content. The "img_urls" column will let us look at the images inside each cluster by clicking on their URLs.
len(sorted)
We've created 9665 image clusters out of the initial 10,800 images. This looks like an unhelpful result, and it indeed is, because we haven't specified how many images a cluster should contain for it to be considered a cluster. Let's fix that now. Since we're just exploring, we'll be lenient and keep the minimum number of images per cluster low. We'll compare results with n=2 and n=5.
print("Number of clusters when minimum number of images is 2:", len(sorted[sorted["count"] >= 2]))
print("Number of clusters when minimum number of images is 5:", len(sorted[sorted["count"] >= 5]))
Insightful as that is, we'll stick to 2 for the rest of the analysis. That is, we'll exclude all phashes that aren't shared by at least 2 images.
identical_clusters = sorted[sorted["count"] >= 2]
identical_clusters.reset_index(inplace=True,drop = True)
len(identical_clusters)
3. Querying¶
We now have a dataframe of 655 clustered images where each cluster contains at least two images that share the same phash.
What if we get a new image, and want to check if there are any images similar to it in our data? We can use the phashes to compare the images. When we defined the get_hashes() function above, we converted the hashes into hex strings to make them more compact and readable. To compare the hashes, we first convert them back into their original form (binary arrays). Then we calculate the Hamming distance between the binary arrays. A Hamming distance of up to 10 is a decent indicator for similarity, so we can use that as a threshold for returning similar images.
Let's use the phash of the biggest cluster (the one with 34 images) as a query and see what we get:
# Find similar images
query = "81d07fffe0611a85"
matches = {}
for i, row in sorted.iterrows():
# Convert hash hex string to original hash (binary array)
diff = imagehash.hex_to_hash(query) - imagehash.hex_to_hash(row["phash"])
# phashes with Hamming distance less than 10 are likely to be similar images
if diff <= 10:
print("Found match: ", row["phash"])
print("Hamming distance from query:", diff)
matches[row["phash"]] = row["img_urls"]
print("")
The query returns three phashed clusters whose Hamming distance from the query is less than or equal to 10. The first cluster, as we can see from its phash, is our query itself.
The second and third clusters contain images that are near-duplicates of the images in the query cluster. Let's open an image from the query cluster, plus one image each from the second and third clusters, to see what makes them similar and yet different:
# Add the first image from each matched cluster to a list of images along with its local path
images = []
for i in matches.values():
img_path = "images/" + i[0].split("/")[-1]
images.append(img_path)
# Load and display the images
img_arrays = []
for img in images:
img_arrays.append(mpimg.imread(img))
plt.figure(figsize=(15,15))
columns = 3
for i, image in enumerate(img_arrays):
plt.subplot(len(images) / columns + 1, columns, i + 1)
plt.xticks([])
plt.yticks([])
plt.imshow(image)
4. Conclusion¶
The images from the matched clusters are very similar to the query image - only the borders are different.
This demonstrates that phashes are useful not only for finding identical images (first cluster returned by the query) but also near-duplicates / similar ones with minor variations (second and third clusters returned by the query)
More about Hamming distance:¶
Hamming distance is a metric for comparing two binary data strings. While comparing two binary strings of equal length, Hamming distance is the number of bit positions in which the two bits are different. The image hashes we have generated have 64 bits each.
Hashes generated with the Imagehash library have a method that allows us to get the Hamming distance simply by subtracting one hash from another as in the Querying section above. If we're feeling curious and want to do this calculation ourselves, we can modify the library's hex_to_hash() function so that it returns ordinary binary arrays that don't have this special method. Then we follow the following steps to get the Hamming distance -
- XOR the two hashes we want to compare
- Count the number of 1s (or Trues) in the result
# Modify Imagehash library func to return array instead of ImageHash obj
def hex_to_hash(hexstr):
"""
Convert a stored hash (hex, as retrieved from str(Imagehash))
back to a Imagehash object.
Notes:
1. This algorithm assumes all hashes are either
bidimensional arrays with dimensions hash_size * hash_size,
or onedimensional arrays with dimensions binbits * 14.
2. This algorithm does not work for hash_size < 2.
"""
hash_size = int(np.sqrt(len(hexstr)*4))
#assert hash_size == numpy.sqrt(len(hexstr)*4)
binary_array = '{:0>{width}b}'.format(int(hexstr, 16), width = hash_size * hash_size)
bit_rows = [binary_array[i:i+hash_size] for i in range(0, len(binary_array), hash_size)]
hash_array = np.array([[bool(int(d)) for d in row] for row in bit_rows])
return hash_array
xor = hex_to_hash("87d03e7ee0259e85") ^ hex_to_hash("81d07fffe0611a85")
xor
# Count the number of 1s
np.count_nonzero(xor != 0)
Verifying that we get the same distance using the library method:
imagehash.hex_to_hash("87d03e7ee0259e85") - imagehash.hex_to_hash("81d07fffe0611a85")