Kind of Like That
Monday, 21 January 2013
Over at FotoForensics, we're about to enable another porn filter. Basically, there are a small number of prohibited images that have been repeatedly uploaded. The current filters automatically handle the case where the picture has already been seen. When this happens, users are shown a prompt that basically says "This is a family friendly site. The picture you have uploaded will result in a three-month ban. Are you sure you want to continue? Yes/No" Selecting "No" puts them back on the upload page. About 20% of the users select "Yes" and are immediately banned. (I can't make this up...)
The problem is that images may vary a little. Since the hash used for indexing images on FotoForensics differs, the auto-ban filter never triggers. So I need a way to tell if the incoming image looks like a known-prohibited image.
My current solution is to generate a hash of each known prohibited content and then test every new picture against the hashes. If it matches, then the user will be prompted to confirm that the new picture isn't prohibited content before continuing.
While leaving the choice to the user may seem too trustworthy, the same type of prompting has significantly reduced the amount of porn from 4chan. Uploads from 4chan used to be 75% porn. With prompting, it has dropped to 15%. Prompting users really does work.
A Different Approach
About 8 months ago I wrote a blog entry on algorithms for comparing pictures. Basically, if you have a large database of pictures and want to find similar images, then you need an algorithm that generates a weighted comparison. In that blog entry, I described how two of the algorithms work:
- aHash (also called Average Hash or Mean Hash). This approach crushes the image into a grayscale 8x8 image and sets the 64 bits in the hash based on whether the pixel's value is greater than the average color for the image.
- pHash (also called "Perceptive Hash"). This algorithm is similar to aHash but uses a discrete cosine transform (DCT) and compares based on frequencies rather than color values.
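For anyone who wants to experiment, here's a minimal aHash sketch in Python. The use of the Pillow library and LANCZOS resampling are my own choices for the sketch, not part of the algorithm; the bit order is arbitrary as long as it is consistent.

from PIL import Image

def ahash(path):
    # Shrink to 8x8 and convert to grayscale: 64 values total.
    img = Image.open(path).convert("L").resize((8, 8), Image.LANCZOS)
    pixels = list(img.getdata())
    avg = sum(pixels) / len(pixels)
    # One bit per pixel: 1 if the pixel is brighter than the mean,
    # set left to right, top to bottom.
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > avg else 0)
    return bits  # 64-bit integer hash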
dHash
Like aHash and pHash, dHash is pretty simple to implement and is far more accurate than it has any right to be. As an implementation, dHash is nearly identical to aHash but it performs much better. While aHash focuses on average values and pHash evaluates frequency patterns, dHash tracks gradients. Here's how the algorithm works, using the same Alyson Hannigan image as last time:
- Reduce size. The fastest way to remove high frequencies and detail is to shrink the image. In this case, shrink it to 9x8 so that there are 72 total pixels. (I'll get to the "why" for the odd 9x8 size in a moment.) By ignoring the size and aspect ratio, this hash will match any similar picture regardless of how it is stretched.
- Reduce color. Convert the image to a grayscale picture. This changes the 72 RGB pixels into a total of 72 grayscale values. (For optimal performance, either reduce color before scaling or perform the scaling and color reduction at the same time.)
- Compute the difference. The dHash algorithm works on the difference between adjacent pixels. This identifies the relative gradient direction. In this case, the 9 pixels per row yields 8 differences between adjacent pixels. Eight rows of eight differences becomes 64 bits.
- Assign bits. Each bit is simply set based on whether the left pixel is brighter than the right pixel. The order does not matter, just as long as you are consistent. (I use a "1" to indicate that P[x] < P[x+1] and set the bits from left to right, top to bottom using big-endian.)
[Alyson Hannigan test image, reduced to 9x8 grayscale] = 3a6c6565498da525
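Here's a minimal sketch of those four steps in Python. Pillow and LANCZOS resampling are my choices for the sketch; the bit convention follows step 4 (a "1" when P[x] < P[x+1], set left to right, top to bottom).

from PIL import Image

def dhash(path):
    # Steps 1 and 2: shrink to 9x8 and convert to grayscale.
    img = Image.open(path).convert("L").resize((9, 8), Image.LANCZOS)
    pixels = list(img.getdata())  # 72 values, row-major
    # Steps 3 and 4: compare adjacent pixels in each row (8 bits per row).
    bits = 0
    for row in range(8):
        for col in range(8):
            left = pixels[row * 9 + col]
            right = pixels[row * 9 + col + 1]
            bits = (bits << 1) | (1 if left < right else 0)
    return bits  # 64-bit integer; print with format(bits, "016x")

Depending on the scaling algorithm, a few bits may differ from the 3a6c6565498da525 shown above; that is exactly why comparisons use a distance rather than exact equality.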
The hash values represent the relative change in brightness intensity. To compare two hashes, just count the number of bits that are different. (This is the Hamming distance.) A value of 0 indicates the same hash and likely a similar picture. A value greater than 10 is likely a different image, and a value between 1 and 10 is potentially a variation.
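Counting the differing bits is a one-liner in most languages; for example, in Python:

def hamming_distance(hash1, hash2):
    # XOR leaves a 1 wherever the two hashes disagree; count those bits.
    return bin(hash1 ^ hash2).count("1")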
Speed and Accuracy
From FotoForensics, we now have a testbed of over 150,000 images. I have a couple of test images that occur a known number of times. For example, one picture (needle) appears exactly once in the 150,000 image repository (haystack). Another picture occurs twice. A third test picture currently occurs 32 times.

I've used aHash, pHash, and dHash to search for the various needles in the haystack. For comparisons, I did not pre-cache any of the repository hash values. I also consider a cutoff value of 10 to denote a match or a miss. (If the haystack image differs from the needle image by more than 10 bits, then it is assumed to not match.) Here are the results so far:
- No hash. This is a baseline for comparison. It loads each image into memory, and then unloads the image. This tells me how much time is spent just on the file access and loading. (And all images are located on an NFS-mounted directory -- so this includes network transfer times.) The total time is 16 minutes. Without any image comparisons, there is a minimum of 16 minutes needed just to load each image.
- No hash, Scale. Every one of these hash algorithms begins by scaling the image smaller. Small pictures scale very quickly, but large pictures can take 10 seconds or more. Just loading and scaling the 150,000 images takes 3.75 hours. (I really need to look into possible methods for optimizing my bilinear scaling algorithm.)
- aHash. This algorithm takes about 3.75 hours to run. In other words, it takes more time to load and scale the image than to run the algorithm. Unfortunately, aHash generates a huge number of false positives. It matched all of the expected images, but also matched about 10x more false-positives. For example, the test picture that should have matched 32 times actually matched over 400 images. Worse: some of the misses had a difference of less than 2 bits. In general, aHash is fast but not very accurate.
- pHash. This algorithm definitely performed the best with regards to accuracy. No false positives, no false negatives, and every match has a score of 2 or less. I'm sure that a bigger data set (or alternate needle image) will generate false matches, but the number of misses will likely be substantially less than aHash.
The problem with pHash is the performance. It took over 7 hours to complete. This is because the DCT computation uses lots of operations including cosine and sine. If I pre-compute the DCT constants, then this will drop 1-2 hours from the overall runtime. But applying the DCT coefficients still takes time. pHash is accurate, but not very fast.
- dHash. Absolutely amazing... Very few false positives. For example, the image with two known matches ended up matching 6 pictures total (4 false positives). The scores were: 10, 0, 8, 10, 0, and 10. The two zeros were the correct matches; all of the false-positive matches had higher scores. As speed goes, dHash is as fast as aHash. Well, technically it is faster since it doesn't need to compute the mean color value. The dHash algorithm has all the speed of aHash with very few false-positives.
Algorithm Variations
I have tried a few variations of the dHash algorithm. For example, David's initial proposal used an 8x8 image and wrapped the last comparison (computing the pixel difference between P[0] and P[7] for the 8th comparison). This actually performs a little worse than my 9x8 variation (more false-positives), but only by a little.

Storing values by row or column really doesn't make a difference. Computing both row and column hashes significantly reduces the number of false-positives and is comparable to pHash (almost as accurate). So it maintains speed and gains accuracy at the cost of requiring 128 bits for the hash.
I've also combined pHash with dHash. Basically, I use the really fast dHash as a fast filter. If dHash matches, then I compute the more expensive pHash value. This gives me all the speed of dHash with all the accuracy of pHash.
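A sketch of that two-stage check, assuming the dhash and hamming_distance helpers from above plus a phash function implemented elsewhere, and using the cutoff of 10 discussed earlier:

def is_match(candidate, known, cutoff=10):
    # Stage 1: the cheap dHash comparison rejects most images outright.
    if hamming_distance(dhash(candidate), dhash(known)) > cutoff:
        return False
    # Stage 2: only the survivors pay for the more expensive pHash check.
    return hamming_distance(phash(candidate), phash(known)) <= cutoff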
Finally, I realized that using dHash as a fast filter is good, but I don't need 64-bits for this computation. My 16-bit dHash variant uses a 6x4 reduced image. This gives me 20 difference values. Ignoring the four corners yields a 16-bit hash -- and has the benefit of ignoring the impact from Instagram-like vignetting (corner darkening). If I have a million different images, then I should expect about 15 images per 16-bit dHash. pHash can compare 15 images really quickly. At a billion images, I'm looking at about 15,258 image collisions and that still is a relatively small number.
I can even permit my 16-bit dHash to be sloppy; permitting any 1-bit change to match. Any computed 16-bit dHash would yield 17 possible dHash values to match. A million images should yield about 260 collisions, and a billion becomes about 260,000 collisions. At a billion images, it would be worth storing the 16-bit dHash, 64-bit dHash, and 64-bit pHash values. But a million images? I'm still ahead just by pre-computing the 16-bit dHash and 64-bit pHash values.
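To make the 1-bit sloppiness concrete, here is a sketch of the lookup expansion: each incoming 16-bit dHash is checked against 17 index buckets (itself plus every single-bit flip). The storage layer is left out; this only shows the key generation.

def sloppy_keys(dhash16):
    # The hash itself plus the 16 variants that differ by exactly one bit.
    keys = [dhash16]
    for bit in range(16):
        keys.append(dhash16 ^ (1 << bit))
    return keys  # 17 candidate buckets to check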
Applied Comparisons
There are two things we want out of a perceptual hash algorithm: speed and accuracy. By combining dHash with pHash, we get both. But even without pHash, dHash is still a significant improvement over aHash and without any notable speed difference.

Given a solid method for searching for images, we can start exploring other research options. For example, we can begin doing searches for the most commonly uploaded image variants (I'm expecting memes and viral images to top the list) and better segmenting data for test cases. We might even be able to enhance some of our other research projects by targeting specific types of images.
In the meantime, I'm really looking forward to getting the new porn filter online.


1) Reduce size. The fastest way to remove high frequencies and detail is to shrink the image. In this case, shrink it to 9x8 so that there are 72 total pixels...
2) Reduce color. Convert the image to a grayscale picture.
Why this order? It seems to me it'd be quicker to reduce the size of a grayscale image, so if dropping to grayscale on the large photo isn't all that expensive time-wise and the accuracy is similar, you might reverse the order of these two steps.
I'm not an image analysis expert, of course.
In fact, you can do the scaling and grayscale conversion in the same step if you want. (Great for systems that are memory restrictive.)
As far as speed goes... Interesting question. Gotta go benchmark it!
Update: Just checked. My existing code does grayscale first and scaling second. Swapping the order does run slower because there is 3x scaling for RGB images. However, the speed difference isn't noticed until you hit really large images. I'll update the blog entry to reflect the optimal order of events.
I wonder if we could see the code anywhere?
I have a question: could perceptual hashing also be used for text/plagiarism detection? If yes, I will try it. If not, what algorithm could be used for that?
Looking forward to receiving an answer to my question.
Student
pHash's DCT is really just an (8x32) (32x32) (32x8) matrix multiplication chain, a median step, and 64 comparisons, which is very fast to compute (it could probably be reduced to all 8x8 matrices too without significantly affecting the output).
On your 150k test image set, it should take about 27 seconds, a negligible amount of time compared to the 3.75 hours it took to scale the images. I suspect something is terribly sub-optimal about your implementation if it's 430 times slower than my Python one!
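In NumPy terms, the idea is roughly the sketch below. This is only an illustration of the matrix form: normalization constants and any special handling of the DC term are glossed over, and the input is assumed to already be a 32x32 grayscale array; the script linked below is the real thing.

import numpy as np

# First 8 rows of the 32-point DCT-II basis, precomputed once.
N = 32
k = np.arange(8).reshape(-1, 1)
n = np.arange(N).reshape(1, -1)
DCT_8x32 = np.cos(np.pi / N * (n + 0.5) * k)   # shape (8, 32)

def phash_from_gray32(gray32):
    # (8x32)(32x32)(32x8) keeps only the low-frequency 8x8 block.
    low = DCT_8x32 @ gray32 @ DCT_8x32.T
    med = np.median(low)
    # 64 comparisons against the median become the 64 hash bits.
    return sum(1 << i for i, v in enumerate(low.flatten()) if v > med)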
I use pHash to cross-match about 2000 frames of "dirty" video (possibly subtitled, downscaled, with compression artifacts) against 300k-1m frames of clean HD video. I pHash every frame of both halves (which takes about an hour or two in total - mostly the HD half) and store the hashes - this is a very small and simple bit of code.
Then I compare every one of the 2000 frames against every one of the 300k-1m frames (over a billion comparisons) and look for runs of frames with similar hashes (by counting the number of bits that differ and comparing it with two thresholds). This takes a few minutes with some optimized Python + Cython code.
If you're interested, the video hashing script is here:
http://www.marcansoft.com/paste/czgaGc7a.txt
I had to triple-check my code. I'm using a very efficient DCT implementation. (All constants precomputed once. Just a set of double loops for multiplication and summation. And I only compute the constants once regardless of the number of pictures being hashed.)
I think the Python code you're using might be taking advantage of MMX, SSE, SSE2, FPGA, or other hardware optimizations. My code doesn't use this (yet). With SIMD operations, the double loops could be reduced to a single loop. And with SSE2 or FPGA, it might even be no loops if the matrix is small enough to completely fit into the SIMD registers. This would easily become a 200%-400% speed improvement for the forward-DCT operation.
Am I misunderstanding the way you ran your benchmark? I'm considering the time it takes to hash 150k images. Obviously if you are re-hashing the entire haystack for every needle and reporting the total time for all needles then that's a factor that I'm not taking into account (you didn't specify how many needles you were searching for).
I just changed the Python code to implement matrix multiplication naively in python, and also perform the median by sorting and taking the middle element, removing numpy (due to the way Python floats work, this also means it is now using slower double-precision floats instead of single-precision floats; no way around that unless I switch to ints, which I could...). It is, of course, much slower, but still benchmarks at 677.69 hashes per second, or a total time of 221 seconds (3.6 minutes) for the 150k-image test set. Keep in mind that this is pure Python code now running inside the Python VM, with tons of overhead. A pure C implementation without using SIMD would be much faster.
The FastDCT computes all of the expensive math upfront and then uses matrix multiplication to compute the DCT, making it extremely quick to run.
My C# FastDCT implementation was able to compute 3000 hashes in 800ms (so roughly 40 seconds on your image set), whereas the standard method was so slow I couldn't even finish it once!
Anyway, thanks again for the great algorithm! It's helped me immensely!
I'm also experiencing worse results with dHash implemented in C# and I don't really know why, yet...
I posted my Python implementation in my comment above. That's the fast row-column method. You should be able to reimplement it in whatever language you choose.
http://home.no/rounin/programming.html
PHP version:
http://jax-work-archive.blogspot.com/2013/05/php-ahash-phash-dhash.html
Scaled: yes.
Rotated: no.
For one project I need to prevent users from adding duplicate images.
This is an implementation of dhash, phash and ahash in PHP:
https://github.com/valbok/leie/blob/master/classes/image/leieImage.php
Also in Python (uses OpenCV for DCT and SURF):
https://github.com/valbok/img.chk/blob/master/core/image.py
Unfortunately, these hashes do not work with cropped images.
But they can be used to detect duplicates like
https://github.com/valbok/leie/blob/master/tests/image/img/madonna.jpg
and
https://github.com/valbok/leie/blob/master/tests/image/img/madonna-a.jpg
The main feature of these hashes is that the Hamming distance can be calculated directly in MySQL by using BIT_COUNT.
For example:
SELECT image_id FROM image_hash WHERE BIT_COUNT(YOUR_HASH ^ hash) < 10
Now I'm looking at how to index SIFT, SURF, or other descriptors in order to find cropped images. It is still an open question for me how to search through a list of keypoints or features.
Also, this turns out to be a pretty good way to handle images that are big patterns - e.g. an image of the ocean and sky. The normal dhash for such images tends to be values like 0xffffFFFFffffFFFF or 0x0000ffffFFFFffff.
There is another way to handle near-zero and near-0xfff... dhash values: if an image hashes to near zero or near all F's, first rotate the image; then, if it still hashes badly, use just the red band, the green band, or the blue band. In my code, I've also done this sort of thing when (hash ^ (hash >> 32)) & 0xffffFFFF is near a zero hash. But in around 12,000 images, these fallbacks are only needed once or twice if the dhash is done in a space-filling-curve order.
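A rough Python sketch of that kind of degeneracy test (the 3-bit threshold is an arbitrary placeholder):

def looks_degenerate(hash64, max_bits=3):
    # Near-zero or near-all-ones hashes come from large smooth gradients
    # (sky over ocean, etc.) and collide with each other constantly.
    ones = bin(hash64).count("1")
    if ones <= max_bits or ones >= 64 - max_bits:
        return True
    # The top half being nearly identical to the bottom half is another
    # low-information case.
    folded = (hash64 ^ (hash64 >> 32)) & 0xFFFFFFFF
    return bin(folded).count("1") <= max_bits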
So far as orientation is concerned, rotating the image so that it's wider than high can be done to fix a lot of images.
Also, when looking images up by hash, a Python app I wrote looks up all orientations of the image and picks the best one - the one closest to the haystack image in question.
However my findings are not great. I'm actually seeing worse tolerance to brightness/contrast/color and size changes in DHash than in AHash. I'm using Photoshop to make small changes.
Similarity results (100% == perfect match):
Brightness increase 5%: AHash (98%), DHash (90%)
Contrast decrease 10%: AHash(98%), DHash(90%)
Resized 25% smaller: AHash(100%), DHash(96%)
Resized 50% larger: AHash(98%), DHash(95%)
Some other important things to note:
The similarity is expressed 0-100% via the following: return ((64 - BitCount(hash1 ^ hash2)) * 100) / 64.0;
Grayscaling is done via:
uint gray = (pixel & 0x00ff0000) >> 16;
gray += (pixel & 0x0000ff00) >> 8;
gray += (pixel & 0x000000ff);
gray /= 12;
return (byte)gray;
The DHash implementation is:
ulong hash = 0;
for (int i = 0, j = 1; i < 72; i++, j++)
{
    if (j > 1 && j % 9 == 0) // don't calculate the current end of row to the beginning of the next row
        continue;
    if (grayscale[i] > grayscale[i + 1]) // greater than the next pixel? then 1 else 0
        hash |= (1UL << (63 - i));
}
I believe you guys are far wiser, so I believe your findings, however I cannot find the problem in my implementation. Does anything jump out at you? Thanks in advance!
A subsample scale algorithm could cause this. But something like bilinear should not react this way.
// Squeeze the image into an 9x8 canvas
Bitmap squeezed = new Bitmap(9, 8, PixelFormat.Format32bppRgb);
Graphics canvas = Graphics.FromImage(squeezed);
canvas.CompositingQuality = CompositingQuality.HighQuality;
canvas.InterpolationMode = InterpolationMode.HighQualityBilinear;
canvas.SmoothingMode = SmoothingMode.HighQuality;
canvas.DrawImage(image, 0, 0, 9, 8);
In this case, the "image" variable in the last line is the high-res version, and it's drawn onto the canvas using the properties/qualities as outlined above. I would think the InterpolationMode being HighQualityBilinear would suffice, perhaps not?
For my application, I need the images to perform well with slight brightness/contrast differences, I won't run into resizing but ran tests for that anyway. The winner is .Net HighQualityBicubic resampling. The similarity isn't as high as I would like (was hoping for a 99-100% match) but maybe I can live with this.
.Net Interpolation Mode: HighQualityBilinear
Brightness increase 5%: AHash (98%), DHash (90%)
Contrast decrease 10%: AHash(98%), DHash(90%)
Resized 25% smaller: AHash(100%), DHash(96%)
Resized 50% larger: AHash(98%), DHash(95%)
.Net Interpolation Mode: HighQualityBicubic
Brightness increase 5%: DHash (96%)
Contrast decrease 10%: DHash(92%)
Resized 25% smaller: DHash(98%)
Resized 50% larger: DHash(95%)
AForge Lib Bilinear Resizing:
Brightness increase 5%: DHash (89%)
Contrast decrease 10%: DHash(85%)
Resized 25% smaller: DHash(93%)
Resized 50% larger: DHash(96%)
AForge Lib Bicubic Resizing:
Brightness increase 5%: DHash (92%)
Contrast decrease 10%: DHash(92%)
Resized 25% smaller: DHash(92%)
Resized 50% larger: DHash(89%)
LinearInterpolationScale C# Resizing:
Brightness increase 5%: DHash (90%)
Contrast decrease 10%: DHash(87%)
Resized 25% smaller: DHash(100%)
Resized 50% larger: DHash(100%)
Grayscale conversion is usually (R + G + B) / 3.
In your post you state you are dividing by 12, so all your colors have been reduced significantly.
http://www.johndcook.com/blog/2009/08/24/algorithms-convert-color-grayscale/
You might see what happens if you let .NET do the work for you by using:
Bitmap squeezed = new Bitmap(9, 8, PixelFormat.Format16bppGrayScale);
Kevin
I tried the approach you mentioned with the 16bpp Bitmap, but unfortunately GDI+ doesn't support Graphics.FromImage() for that pixel format (it's documented). I tried an alternate RGB to Grayscale conversion but again, AHash is giving me better numbers for some reason. That algorithm is here: http://stackoverflow.com/questions/2265910/convert-an-image-to-grayscale
Scratching my head over this, but happy to stick with AHash if it gives me the results I need. Strange though...
If we apply this to a large set of pictures, aren't 8 pixels too small, so that there will be a lot of collisions and therefore false positives?
Wouldn't the fingerprint be more accurate if the hash_size value were larger?
Thank you
It would be worth-while investigating feature/edge detection (e.g. SIFT), as scale invariance - and other factors, like stretching, flipping, rotating - is arguably one of the most challenging problems when it comes to this sort of thing.
Defeating detection is easy. Just alter about 20% of the image or change the dimensions by about 10% via cropping or adding.
However, the result is no longer a visually similar picture.
I have a coworker who calls these perceptual hashes "the 20 foot test" (or "6 meter test" for metric measurements). That is, if you print out both pictures and hang them on a wall 20 feet away, would you still think it's the same picture?
Different hash systems serve different purposes. These hashes look for visually similar images. As long as you don't want your picture to be visually similar, you can alter an image so that it won't match the hash.
Regarding SIFT: SIFT is a different type of algorithm and solves a different set of problems. (Unfortunately, SIFT is also patent restricted, and that dramatically limits the usefulness.)
http://01101001.net/averagehash.php
http://01101001.net/AverageHash.py
http://01101001.net/Imghash.zip
http://01101001.net/differencehash.php
http://01101001.net/DifferenceHash.py
https://docs.opencv.org/3.3.1/d4/d93/group__img__hash.html
If any of you need PHP code for dHash- here's my implementation in 50 lines (including comments):
https://github.com/Tom64b/dHash/blob/master/dhash.php
https://github.com/KilianB/JImageHash
In 2017, to make the fingerprinting faster, I looked at existing dHash implementations in Ruby and made my own, powered by vips instead of ImageMagick.
To increase the quality of the algorithm, I improved it in several aspects and called the result "IDHash".
In 2019, to make fingerprint comparison faster, I implemented the comparison function as a native C extension.
I've kept benchmarks throughout, so I currently assume my gem is the best Ruby implementation, if anyone is interested: https://github.com/Nakilon/dhash-vips
I'm considering adding a secondary hash - like pHash - to cut down on repeats. I'm wondering if anyone else has run into similar issues, and if so, how did you get around it?
Great article by the way!