Tattle's Blog

Updates from our Team

Clustering similar images with pHash

phash-blog

Image hashing is a technique for generating distinct "fingerprints" of images that can be used to identify and group together similar images. pHash (perceptual hash) is one of the most popular and effective hashing algorithms. We tried it on 10k images from our archive, with promising results.

This blog is a walkthrough of how we constructed the phashes with the Imagehash library, created easily navigable clusters (groups) of images whose fingerprints (hashes) are identical, and found images that are similar to a query image. An elegant feature of phashes is that similar images will have similar hashes. To learn how the hashing algorithm itself works, check out this other blog.

In [1]:
# Import necessary libraries
import imagehash
import os
import pandas as pd
import numpy as np
from PIL import Image
from datetime import date, datetime, timedelta
from time import perf_counter
import wget
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline

Steps

1. Loading and hashing images

We define a function to load images from a local folder called "images" in a loop, generate their phashes and store them in a dictionary:

In [2]:
def get_hashes(path):
    hashfunc = imagehash.phash
    failed = []
    total = 0
    success = 0
    image_filenames = os.listdir(path)
    images = {}
    for img in image_filenames:
        if img.split(".")[-1].lower() in ["jpg", "jpeg", "png"]:
            total += 1
            try:
                phash = hashfunc(Image.open(os.path.join(path, img)).convert("RGBA"))
                images[img] = str(phash)
                success += 1
            except Exception:
                failed.append(img)
    print("phash generated for {} of {} images".format(success, total))
    return images

Calling the function and timing its execution in seconds:

In [3]:
start = perf_counter()
hashes = get_hashes(os.getcwd()+"/images")
delta = perf_counter() - start
print(delta)
phash generated for 10800 of 10800 images
147.40672431000002

2. Cluster creation

Next, we load the hashes into a Pandas dataframe for easier viewing and data manipulation.

In [4]:
df = pd.DataFrame.from_dict(hashes, orient="index", columns=["phash"])
df.reset_index(inplace=True)
df.rename(columns={"index": "filename"}, inplace=True)
df.head(3)
Out[4]:
filename phash
0 1ee8da5f-326d-427d-85db-d54bea136ab0.jpg 989d1cded6d69610
1 75d01f9b-31c9-4c06-9ab0-c7968f3617da.jpg f0a59e5e254b42bc
2 89c5d57c-c3c5-4359-bf68-ae830fceedcf.jpg 94971ef8694161b7

The next two cells convert the image filenames into their public URLs which can be opened for quick image viewing.

In [5]:
def add_url(filename):
    url = "https://s3.ap-south-1.amazonaws.com/sharechat-scraper.tattle.co.in/" + filename
    return url
In [6]:
df["img_urls"] = df["filename"].map(add_url)

Then we apply some grouping, aggregation and sorting operations to the data and watch the image clusters emerge:

In [7]:
grouped = df.groupby(by="phash").agg({"filename": "size", "img_urls": list})
grouped.rename(columns={"filename": "count"}, inplace=True)
# note: "sorted" shadows the Python builtin of the same name
sorted = grouped.sort_values("count", ascending=False)
sorted.reset_index(inplace=True)
sorted.head(5)
Out[7]:
phash count img_urls
0 81d07fffe0611a85 34 [https://s3.ap-south-1.amazonaws.com/sharechat...
1 e3b24096fc0d3cad 20 [https://s3.ap-south-1.amazonaws.com/sharechat...
2 f8076bbc53817cc1 13 [https://s3.ap-south-1.amazonaws.com/sharechat...
3 85d3ba7ca5359a12 13 [https://s3.ap-south-1.amazonaws.com/sharechat...
4 aad2d00ed3392cb7 12 [https://s3.ap-south-1.amazonaws.com/sharechat...

The output of the cell above shows the first five rows of the transformed data. The "phash" column contains distinct image hashes. The "count" column shows us how many images are clustered together, i.e. they share the same phash and therefore will have similar visual content. The "img_urls" column will let us look at the images inside each cluster by clicking on their URLs.
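The grouping logic can be sketched on a toy dataframe (the filenames and URLs below are made up for illustration; the two phash values are reused from the earlier output):

```python
import pandas as pd

# Hypothetical data: a.jpg and b.jpg share a phash, c.jpg stands alone
toy = pd.DataFrame({
    "filename": ["a.jpg", "b.jpg", "c.jpg"],
    "phash": ["989d1cded6d69610", "989d1cded6d69610", "f0a59e5e254b42bc"],
    "img_urls": ["https://example.com/a.jpg",
                 "https://example.com/b.jpg",
                 "https://example.com/c.jpg"],
})

# Same transform as above: one row per distinct phash,
# with the number of images and the list of their URLs
clusters = toy.groupby(by="phash").agg({"filename": "size", "img_urls": list})
clusters.rename(columns={"filename": "count"}, inplace=True)
clusters = clusters.sort_values("count", ascending=False).reset_index()
print(clusters)
```

The shared phash forms a cluster of two images; the lone image becomes a "cluster" of one, which is why we filter on a minimum count next.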

In [8]:
len(sorted)
Out[8]:
9665

We've created 9665 image clusters out of the initial 10,800 images. That's nearly as many clusters as images, which isn't a helpful result, because we haven't yet specified how many images a group must contain to count as a cluster. Let's fix that now. Since we're just exploring, we'll be lenient and keep the minimum number of images per cluster low. We'll compare results with n=2 and n=5.

In [9]:
print("Number of clusters when minimum number of images is 2:", len(sorted[sorted["count"] >= 2]))
print("Number of clusters when minimum number of images is 5:", len(sorted[sorted["count"] >= 5]))
Number of clusters when minimum number of images is 2: 655
Number of clusters when minimum number of images is 5: 67

Insightful as that is, we'll stick to 2 for the rest of the analysis. That is, we'll exclude all phashes that aren't shared by at least 2 images.

In [10]:
identical_clusters = sorted[sorted["count"] >= 2].copy()  # copy to avoid SettingWithCopyWarning
identical_clusters.reset_index(inplace=True, drop=True)
len(identical_clusters)
Out[10]:
655

3. Querying

We now have a dataframe of 655 image clusters, where each cluster contains at least two images that share the same phash.

What if we get a new image, and want to check if there are any images similar to it in our data? We can use the phashes to compare the images. When we defined the get_hashes() function above, we converted the hashes into hex strings to make them more compact and readable. To compare the hashes, we first convert them back into their original form (binary arrays). Then we calculate the Hamming distance between the binary arrays. A Hamming distance of up to 10 is a decent indicator for similarity, so we can use that as a threshold for returning similar images.
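Under the hood, this comparison boils down to integer bit arithmetic. Here is a minimal stand-alone sketch (not the Imagehash implementation, just the same arithmetic) using two phashes from our data:

```python
def hamming(hex_a, hex_b):
    # XOR the two 64-bit hashes; every differing bit position becomes a 1
    diff_bits = int(hex_a, 16) ^ int(hex_b, 16)
    # count the 1s to get the Hamming distance
    return bin(diff_bits).count("1")

print(hamming("81d07fffe0611a85", "81d07fffe0611a85"))  # 0: identical hashes
print(hamming("81d07fffe0611a85", "87d03e7ee0259e85"))  # 10: within our threshold
```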

Let's use the phash of the biggest cluster (the one with 34 images) as a query and see what we get:

In [11]:
# Find similar images
query = "81d07fffe0611a85"
matches = {}
for i, row in sorted.iterrows():
    # Convert each hex string back to an ImageHash; subtraction gives the Hamming distance
    diff = imagehash.hex_to_hash(query) - imagehash.hex_to_hash(row["phash"])
    # phashes within a Hamming distance of 10 are likely to be similar images
    if diff <= 10:
        print("Found match: ", row["phash"])
        print("Hamming distance from query:", diff)
        matches[row["phash"]] = row["img_urls"]
        print("")
Found match:  81d07fffe0611a85
Hamming distance from query: 0

Found match:  87d03e7ee0259e85
Hamming distance from query: 10

Found match:  c1967e7ee0211e8d
Hamming distance from query: 10

The query returns three phashed clusters whose Hamming distance from the query is less than or equal to 10. The first cluster, as we can see from its phash, is our query itself.

The second and third clusters contain images that are near-duplicates of the images in the query cluster. Let's open an image from the query cluster, plus one image each from the second and third clusters, to see what makes them similar and yet different:

In [12]:
# Map each matched cluster's first image URL back to its local path
images = []
for i in matches.values():
    img_path = "images/" + i[0].split("/")[-1]
    images.append(img_path)

# Load and display the images side by side
img_arrays = []
for img in images:
    img_arrays.append(mpimg.imread(img))

plt.figure(figsize=(15, 15))
columns = 3
for i, image in enumerate(img_arrays):
    # integer division: subplot() requires an int row count
    plt.subplot(len(images) // columns + 1, columns, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.imshow(image)

4. Conclusion

The images from the matched clusters are very similar to the query image - only the borders are different.

This demonstrates that phashes are useful not only for finding identical images (the first cluster returned by the query) but also near-duplicates with minor variations (the second and third clusters returned by the query).

More about Hamming distance:

Hamming distance is a metric for comparing two binary data strings. For two binary strings of equal length, the Hamming distance is the number of bit positions at which they differ. The image hashes we have generated have 64 bits each.

Hashes generated with the Imagehash library are ImageHash objects that overload subtraction, so we can get the Hamming distance simply by subtracting one hash from another, as in the Querying section above. If we're feeling curious and want to do the calculation ourselves, we can modify the library's hex_to_hash() function so that it returns plain boolean arrays without that special behaviour. Two steps then give us the Hamming distance:

  1. XOR the two hashes we want to compare
  2. Count the number of 1s (or Trues) in the result
In [13]:
# Modify Imagehash library func to return array instead of ImageHash obj

def hex_to_hash(hexstr):
    """
    Convert a stored hash (hex, as retrieved from str(ImageHash))
    back to a plain boolean numpy array instead of an ImageHash object.
    Notes:
    1. This assumes the hash is a bidimensional array
       with dimensions hash_size * hash_size.
    2. This does not work for hash_size < 2.
    """
    hash_size = int(np.sqrt(len(hexstr) * 4))
    binary_array = "{:0>{width}b}".format(int(hexstr, 16), width=hash_size * hash_size)
    bit_rows = [binary_array[i:i + hash_size] for i in range(0, len(binary_array), hash_size)]
    hash_array = np.array([[bool(int(d)) for d in row] for row in bit_rows])
    return hash_array
In [14]:
xor = hex_to_hash("87d03e7ee0259e85") ^ hex_to_hash("81d07fffe0611a85")
xor 
Out[14]:
array([[False, False, False, False, False,  True,  True, False],
       [False, False, False, False, False, False, False, False],
       [False,  True, False, False, False, False, False,  True],
       [ True, False, False, False, False, False, False,  True],
       [False, False, False, False, False, False, False, False],
       [False,  True, False, False, False,  True, False, False],
       [ True, False, False, False, False,  True, False, False],
       [False, False, False, False, False, False, False, False]])
In [15]:
# Count the number of 1s (Trues)
np.count_nonzero(xor)
Out[15]:
10

Verifying that we get the same distance using the library method:

In [16]:
imagehash.hex_to_hash("87d03e7ee0259e85") - imagehash.hex_to_hash("81d07fffe0611a85")
Out[16]:
10