Near-duplicate image detection using Locality Sensitive Hashing
This proof-of-concept uses Locality Sensitive Hashing for near-duplicate image detection
and was inspired by Adrian Rosebrock’s article
Fingerprinting Images for Near-Duplicate Detection.
At the end of the article,
the author proposes to use K-d trees or VP trees to achieve real near-duplicate detection in a
scalable way. This page explains how to achieve that with Locality Sensitive Hashing instead.
Note: Instructions on how to run the proof-of-concept code can be found at the bottom of this page.
In this pre-processing step the idea is to compress images into smaller signatures while preserving
similarity: if two images are similar, so should their signatures be. It makes sense to
use the same image hashing function as the source article for this: dhash (difference hash) from
the Python image hashing library imagehash.
This dhash function does the following for a given k (k = 8 for example):
1. convert the image to greyscale
2. scale it down to (k+1) × k pixels
3. compare the light intensity of each pixel with that of its horizontal neighbour, yielding a k × k grid of difference bits
For the scan in this proof-of-concept k = 16 was used, so the downscaled greyscale picture is turned into a
256-bit (16^2) signature based on the differences in light intensity between adjacent pixels, yielding the
bitstring below. With a bit of effort, you can see the outline of the person captured by the 1's. A short
code sketch follows the bitstring.
1 1 1 0 1 1 1 0 1 1 0 0 0 0 0 0
1 1 1 0 1 1 0 0 1 1 1 0 0 0 0 0
1 1 1 1 1 0 0 0 0 1 1 0 0 0 0 0
1 1 1 1 1 0 0 1 0 1 1 0 0 0 0 0
1 1 1 1 1 1 0 1 0 0 1 0 0 0 0 0
1 1 1 1 1 0 1 1 0 0 1 1 0 0 0 0
1 1 0 1 1 0 1 1 0 0 1 1 0 0 0 0
1 1 0 1 1 0 0 1 0 0 1 1 0 0 0 0
1 0 1 1 1 0 1 1 0 0 0 1 1 0 0 0
1 0 1 1 0 1 1 1 1 0 0 0 1 1 0 0
1 0 1 0 0 1 1 1 1 0 0 1 1 1 0 1
0 1 1 1 0 1 1 1 0 0 0 1 1 0 1 1
1 0 1 0 0 1 1 1 0 0 0 1 0 1 0 1
1 0 1 1 0 0 1 1 1 0 0 0 0 0 1 1
0 0 1 1 0 1 1 0 1 1 0 1 0 0 1 0
1 1 1 0 0 1 0 0 1 1 0 0 0 1 0 1
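Such a signature can be produced with a few lines of Python. This is a minimal sketch rather than the code
of the proof-of-concept, applied to one of the input files from the repository (input/girl_lights.jpg):

import imagehash
from PIL import Image

k = 16  # hash size: gives a k x k bit signature, 256 bits for k = 16
image = Image.open("input/girl_lights.jpg")
image_hash = imagehash.dhash(image, hash_size=k)

# imagehash stores the bits as a k x k boolean array; flatten it into a bitstring.
signature = "".join("1" if bit else "0" for bit in image_hash.hash.flatten())
print(len(signature))  # 256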
Now instead of comparing the signatures to each other, we will break up each signature into b bands of r
rows (bits). This of course means that b · r = k^2, where k^2 is our signature length. We will then apply b
hash functions - one for every band - to all signature substrings (pieces of r bits) in their band. If any of
those signature substrings end up in the same bucket for any of the hash functions, we consider those
signatures - and their related files - a candidate (near-duplicate) pair.
We can influence the sensitivity of candidate pair detection by tuning the b and r parameters: if we
take fewer but wider bands, it becomes less "easy" for signature substrings (bands) to hash to the same
bucket, and vice versa.
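To make the banding step concrete, here is a minimal sketch in Python. It is not the code from detect.py;
the candidate_pairs helper, its arguments and the dictionary-based buckets are illustrative, assuming
signatures is a mapping from file name to a bitstring of length b · r:

from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, b=32, r=8):
    # signatures: dict mapping file name -> bitstring of length b * r
    buckets = defaultdict(set)
    for name, signature in signatures.items():
        for band in range(b):
            chunk = signature[band * r:(band + 1) * r]
            # One bucket space per band: keying on (band, chunk) keeps
            # collisions confined to substrings of the same band.
            buckets[(band, chunk)].add(name)
    # Every pair of files sharing at least one bucket becomes a candidate pair.
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs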
Since it's likely that this process creates some false positives, a post-processing step is necessary where
we actually compare the signatures of our candidate pairs to each other. This allows us to check whether
they are really near-duplicates and to what extent. We do this by calculating the Hamming distance
between two signatures, i.e. the number of bits where both signatures differ. If the resulting similarity,
1 - (Hamming distance / k^2), is higher than a self-defined threshold t, we can consider them
near-duplicates. We can choose t to be 0.8, for example, to consider images that are at least 80% similar
as near-duplicates.
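Continuing the sketch above, the post-processing filter could look like the following; the similarity
helper and the signatures/pairs variables are again illustrative, not taken from detect.py:

def similarity(sig_a, sig_b):
    # 1 - Hamming distance / k^2: the fraction of bits on which both signatures agree.
    hamming = sum(a != b for a, b in zip(sig_a, sig_b))
    return 1 - hamming / len(sig_a)

# Keep only the candidate pairs that clear the threshold t, e.g. t = 0.8.
t = 0.8
near_duplicates = [(a, b) for a, b in pairs
                   if similarity(signatures[a], signatures[b]) >= t]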
Below is the result of a near-duplicate scan using the Locality Sensitive Hashing method described above
with k = 16 (hash size), b = 32 (bands) and r = 8 (rows). The original picture is on the left, the one with
Waldo in the middle was matched with 97.27% similarity and the cropped one on the right with 77.73%
similarity. The Hamming distance was used to calculate the similarity after these pictures hashed to the
same bucket for one or more bands.
The 256-bit signatures of these 3 files are listed below, separated into 32 bands that can be read from
left to right, top to bottom to get the full signature. Bands that differ from the original have been
prefixed with >:
Original:
11101110 11000000 11101100 11100000 11111000 01100000 11111001 01100000
11111101 00100000 11111011 00110000 11011011 00110000 11011001 00110000
10111011 00011000 10110111 10001100 10100111 10011101 01110111 00011011
10100111 00010101 10110011 10000011 00110110 11010010 11100100 11000101
Waldo interference:
11101110 11000000>01101100>11100000>10011000 01100000>10011001 01100000
>11011101 00100000>11011011 00110000 11011011 00110000 11011001 00110000
10111011 00011000 10110111 10001100 10100111 10011101 01110111 00011011
10100111 00010101 10110011 10000011 00110110 11010010 11100100 11000101
Bottom cropped:
11101110 11000000 11101100>11000000>11101100 01100000 11111001 01100000
11111101 00100000>11111101 00110000>11111011 00110000>11011011 00110000
>11011001>00110000>10011011>00111000>10110111>00001000>10100101>10001100
>00100111>00011011>01110111>00011011>10100011>10010111>10110010>10000011
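As a check on the similarity figures, counting the differing bits in the Waldo signature above gives a
Hamming distance of 7 out of 256 bits, which matches the reported similarity: 1 - 7/256 ≈ 97.27%.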
Clone this repository and from within it run:
$ pip install -r requirements.txt # installs imagehash library and dependencies
$ python3 detect.py -i input
Found 3 near-duplicate images in input/ (threshold 90.00%)
99.61% similarity: file 1: input/girl_lights.jpg - file 2: input/girl_lights_shrunk_to_1334x889.jpg
96.88% similarity: file 1: input/girl_lights_shrunk_to_1334x889.jpg - file 2: input/girl_lights_waldo.jpg
96.48% similarity: file 1: input/girl_lights.jpg - file 2: input/girl_lights_waldo.jpg
For other parameters like threshold, hash size, … run python3 detect.py --help