PhotoDNA and Limitations
Friday, 27 August 2021
I've been holding off writing this blog entry for 8 years. It isn't that I'm a fan of PhotoDNA; in fact, NCMEC never granted me access to it. Rather, even with it being treated by NCMEC and Microsoft as a big secret, I had been told that it was the only real automated solution that NCMEC had in their toolbox. Since it was helping a few organizations identify child pornography (CP) and child sexual abuse material (CSAM), I didn't want to hinder their efforts.
But like I said, that was years ago (2012-2015). The recent news coverage of Apple's CSAM announcement and updates has turned attention back to PhotoDNA. (Apple's new solution isn't using PhotoDNA, but that hasn't stopped the revised interest.) In the last two weeks, three different groups have contacted me about helping them reverse PhotoDNA. Each group made it very clear that they want to identify and publicize the algorithm's limitations. Why ask me? Well, as far as I know, I'm the only person who figured out how it works without being under an NDA. Back in 2014, I wrote a confidential whitepaper describing how it works and identifying specific limitations.
Last week, one of the groups learned where they could get "photodna.dll" -- the compiled library. As I understand it, this is supposed to be protected by an NDA or license agreement. However, a forensic vendor leaked it by inadvertently permitting access to their commercial software without a password. (I reported this to the vendor and then the github project found an alternate source.) On Tuesday (Aug 24), another research group went hunting for the same DLL. While I believe it is legal to download the available software and extract the DLL for reverse-engineering, the ethics seem questionable to me. In any case, I expect them to figure out how PhotoDNA works and identify the problems within weeks or months, not years. (One way or another, all of the secrets around PhotoDNA are about to come out.)
I suggested to some of the research groups that they contact NCMEC. (This would clear up any ethical issues.) However, each group made it clear that they were not interested in this approach. So I contacted NCMEC. I was informed that, in the last few years, NCMEC has added additional solutions beyond PhotoDNA. This includes Google's CSAI and Facebook's open-source video/image matching tools. What this means: my excuse of not wanting to take away NCMEC's only tool is no longer an issue. NCMEC now has other options. Moreover, NCMEC did not voice any concerns about this information being made public. (Poof! Ethical dilemma is gone!)
Rather than giving personal assistance to a few research groups, I'm making it all public now. In this blog entry, I'm going to cover how PhotoDNA works and the algorithm's limitations. (To the vendors who are still using PhotoDNA: I strongly suggest moving to something else.)
References and Disclaimer
A few years ago, getting the PhotoDNA code from NCMEC required signing an NDA and then going through multiple layers of approvals. While I signed the NDA twice, NCMEC never countersigned it, never provided me with code, and stopped responding to my requests for status updates. Since they were treating it as a secret, I decided to figure it out for myself. (Update 2021-08-30: Today NCMEC informed me that Microsoft ended NCMEC's ability to sublicense the code a few years ago. Now it must come directly from Microsoft.) Fortunately, Microsoft had released a few documents that contained enough details for me to piece it together. (It's been 9 years; most of the docs are no longer online and the Internet Archive doesn't have copies of everything.) Here are some of the references I used (hyperlinks go to documents that still exist online):
- http://www.microsoft.com/en-us/news/download/presskits/photodna/docs/photodnafs.pdf (hyperlink goes to the Internet Archive)
- http://www.itu.int/en/cop/case-studies/Documents/ICMEC_PhotoDNA.PDF
- flowchart_photodna_Web.jpg (I originally found the picture at Microsoft, but it's no longer there. SmartPrix still has a copy of it.)
- http://www.coe.int/t/dghl/cooperation/economiccrime/cybercrime/Documents/Reports-Presentations/Octopus2011/WS4_PhotoDNA.pdf (link goes to the Internet Archive)
- http://conferencescco.files.wordpress.com/2012/11/1-ms-valerie-tan.pptx
After re-creating the algorithm with values that were compatible (not identical, but close enough), I identified the algorithm's limitations. I wrote my confidential whitepaper in 2014 and quietly distributed it at non-public conferences and to people who had a need to know (NCMEC, Microsoft, some ICACs, and specific vendors who used the algorithm). Through these contacts, I began to receive hints on how to improve my implementation's results. Today I have an algorithm that generates the same values as the real PhotoDNA software, but it includes so many hints that I don't feel comfortable calling it "my code". For this reason, I will not be distributing code or detailing the whispered hints. None of these hints changed the limitations identified in the 2014 whitepaper; they only improved the compatibility of the generated hashes.
This blog entry describes how the algorithm works and the limitations, but not my thought process for reverse-engineering it. Much of this blog includes text and figures as-is from my 2014 whitepaper.
The Basic Algorithm
Every perceptual hash (PhotoDNA, pHash, aHash, etc.) has a few common steps. First, the image is normalized. This usually means reducing the size and converting to grayscale. Then the normalized image is converted to a hash. The specific conversion is dependent on the hash algorithm.

The Microsoft WS4_PhotoDNA.pdf presentation includes a high-level how-to for creating the hash. First, they convert the image to grayscale. Next, they scale it to a normalized size. Finally, they compute the hash by segmenting the small image into 6x6 grids and computing the sum of Sobel gradients.
| This image shows the scaled down 36x36 "pixels" divided into 6x6 grids. Each grid has a color. The diagonal 6 grids are highlighted with a thick border. |
The problem with the 36x36 pixel solution is that it really isn't resistant to minor cropping. Removing as little as 1% off any edge is equivalent to half a pixel when normalized to 36x36. That causes the gradients to shift and the hash values to change.
To solve this problem, PhotoDNA uses overlapping grids. Each 6x6 pixel grid overlaps the next 6x6 pixel grid by 2 pixels. The normalizing step initially reduces the image to 26x26 pixels that are divided into 6x6 grids that overlap by 2 pixels. Now you can crop about 2% off any side and still have nearly the same hash. (A half-pixel shift is 1÷(26×2) = 1.9%; due to the overlap, this is not enough to shift any gradients by one pixel.)
| This image shows 26x26 "pixels" divided into 6x6 overlapping grids. The overlap regions are shaded. The diagonal 6 grids are still highlighted with a thick border. |
The pixels within each grid are used to compute a gradient. A gradient is simply the difference between pixel values. For example, if two adjacent pixels have the values 3 and 5, then the gradient is 2 since the value differs by two. The PhotoDNA hash consists of four summations per grid: the total sum of the horizontally increasing values, the total sum of the horizontally decreasing values, the sum of the vertically increasing values, and the sum of the vertically decreasing values. A single grid with the hash value “1,12,8,2” (left, right, up, down) means that the total sum of horizontally decreasing values (left) is 1, the sum of horizontally increasing values (right) is 12, the sum of the up values is 8, and the sum of the down values is 2.
Using this approach (26x26 normalized image, 6x6 overlapping grids, sum of Sobel gradients per grid) won't generate the exact same values. However, your values will be in the ballpark and have the correct relative magnitudes. At this point, the only difference is the scaling algorithm, blurring, and an equalization to convert extreme values into something in the range of 0 to 255.
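To make those steps concrete, here is a minimal sketch of the per-grid summation under the assumptions described above: a 26x26 grayscale input, 6x6 grids starting every 4 pixels, a 1x3 [-1 0 +1] kernel, and a hard cap at 255. The function name, the sign-to-direction mapping, and the omission of the blur and final equalization are my own simplifications, so its output will only be in the ballpark of real PhotoDNA values.

```python
def photodna_like_hash(img26):
    """Sketch only. img26: 26x26 list-of-lists of grayscale values (0-255).
    Returns 144 values: (left, right, up, down) for each of the 36 grids."""
    values = []
    for gy in range(0, 24, 4):                 # grid rows start at 0, 4, 8, 12, 16, 20
        for gx in range(0, 24, 4):             # grid columns start at 0, 4, 8, 12, 16, 20
            grid = [row[gx:gx + 6] for row in img26[gy:gy + 6]]   # one 6x6 window
            left = right = up = down = 0
            for y in range(6):                 # horizontal gradients: middle 4 columns only
                for x in range(1, 5):
                    g = grid[y][x + 1] - grid[y][x - 1]           # 1x3 [-1 0 +1] kernel
                    if g > 0: right += g
                    else:     left  -= g
            for x in range(6):                 # vertical gradients: middle 4 rows only
                for y in range(1, 5):
                    g = grid[y + 1][x] - grid[y - 1][x]
                    if g > 0: down += g        # which sign maps to "up" vs "down" is my guess
                    else:     up   -= g
            values.extend(min(v, 255) for v in (left, right, up, down))  # "255 or more"
    return values
```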
Comparing Hashes
Each PhotoDNA hash contains 144 values. The values are in the range 0 to 255 and represent the four gradients from each of the 6x6 grids.

I don't know the comparison algorithm or threshold that NCMEC uses. However, there are only a few ways to compare a vector of hashes:
- MSE. The mean square error is the average of the squared differences between values. In pseudo code:
  function MSE(hash1[144],hash2[144]) {
    sum=0;
    for(n=0; n < 144; n++) {
      sum += (hash1[n]-hash2[n])*(hash1[n]-hash2[n]);
    }
    return(sum/144);
  }
- Linear Distance. This is almost the same as the MSE. Rather than returning sum/144, you return sqrt(sum).
- Raw squared difference. If you care about speed, then drop the division or square root and just return the raw squared difference.
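Expressed as a short sketch (NCMEC's actual comparison function and threshold are unknown to me; these are just direct implementations of the three options above):

```python
import math

def mse(h1, h2):
    """Mean of the squared differences across the 144 hash values."""
    return sum((a - b) ** 2 for a, b in zip(h1, h2)) / len(h1)

def linear_distance(h1, h2):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def raw_squared_difference(h1, h2):
    """Fastest option: skip the division and the square root."""
    return sum((a - b) ** 2 for a, b in zip(h1, h2))

# A match decision is then a simple threshold test, e.g.:
#   if mse(known_hash, candidate_hash) < THRESHOLD: report a match
# where THRESHOLD is whatever value the matching service has tuned.
```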
| Image | PhotoDNA |
|---|---|
| Microsoft's windmill | 2,9,2,13, 4,10,6,12, 13,14,60,10, 16,6,29,11, 6,0,3,11, 5,0,0,12, 7,5,133,16, 42,3,172,14, 182,63,255,99, 35,180,255,27, 14,3,209,20, 20,0,207,13, 14,20,255,1, 26,9,255,6, 131,29,255,36, 68,155,255,11, 31,5,255,4, 37,2,255,5, 0,43,45,47, 7,32,54,57, 115,23,49,36, 58,122,59,30, 18,4,57,34, 24,1,48,30, 3,16,133,14, 9,14,164,8, 100,16,148,10, 47,105,127,19, 20,14,137,26, 31,5,132,45, 6,13,36,9, 6,9,32,10, 28,8,25,20, 17,29,34,24, 14,7,63,10, 20,4,71,10 You might notice that my PhotoDNA hash is a little different from Microsoft's published hash. This is because I'm working off of the scaled and recolored image that I extracted from their PDF; this is not the original source that they used. For clarity: I am generating the correct PhotoDNA hash for this specific version of this file. |
| Slight cropping: | A slight crop changes the hash values a little. The numbers in bold are different from the uncropped source and red denotes values that changed by more than 5. The differences appear as mostly small numerical changes. 2,9,2,12, 4,10,6,11, 13,14,63,9, 16,7,32,10, 6,0,3,11, 5,0,0,12, 7,5,131,16, 41,3,169,14, 183,60,255,106, 39,186,255,30, 13,4,206,20, 20,0,204,12, 14,20,255,1, 25,8,255,5, 128,27,255,36, 73,157,255,12, 30,6,255,4, 39,1,255,5, 0,43,42,46, 6,31,48,56, 111,21,44,37, 63,126,54,29, 17,5,52,34, 22,1,41,29, 3,17,134,16, 9,14,165,10, 99,15,150,9, 54,112,125,16, 20,13,133,23, 29,6,128,40, 7,12,32,9, 6,10,32,10, 29,7,26,23, 17,33,34,27, 14,9,66,12, 20,4,74,13 One of the hints that I received applied an equalization across the sum-of-gradients in order to scale the values to the range 0 to 255. This has the unfortunate impact of changing lots of values when only a few sum-of-gradients changed. This equalization dramatically increases the number of false-negative matches. (This is the only additional limitation that I've found since writing the original 2014 paper.) |
| Image from 'thispersondoesnotexist.com': | 113,92,33,67, 158,14,72,33, 14,35,1,126, 2,40,0,145, 61,6,1,103, 111,1,11,60, 163,78,10,25, 131,31,28,21, 0,53,21,14, 3,39,20,15, 80,1,10,30, 143,0,19,32, 132,48,115,9, 174,18,79,43, 16,97,86,47, 48,55,100,51, 42,65,61,44, 141,2,4,66, 113,55,246,0, 47,79,126,38, 19,88,60,55, 41,55,49,63, 19,54,29,42, 146,13,35,44, 130,18,69,20, 15,162,138,17, 17,113,78,31, 28,70,56,46, 62,48,135,13, 208,36,222,14, 93,12,27,35, 50,99,86,59, 20,161,210,21, 87,45,255,21, 142,8,255,7, 46,139,92,78 Nearly every value is significantly different. This (artificially generated) child is not a windmill. |
Visually, the first two pictures may appear the same. However, the second one is cropped. (Original is 622x414, cropped is 615x406.) The PhotoDNA hashes are only a little different. The MSE is 5.3 and the linear distance is 27.7. These low numbers denote similar pictures.
The source windmill and child's picture are visually different. The hashes have a MSE of 9189.9 and a linear distance of 1086 -- these are much larger numbers. This difference allows tuning a threshold for an acceptable match.
Now that we know how the algorithm works, we can start addressing the limitations.
Limitation: Reversible Gradients
Because PhotoDNA stores the sum of opposite directions, the general texture of each grid can be identified. For example, a grid with the values "1,1,1,1" indicates a smooth grid (virtually no texture). The gradient to the left of the grid changes just as much as the gradient to the right. This can only happen if there is a single 1-value impulse near the middle of the 6x6 pixel grid. If the pixel change were along the edge of the 6x6 pixel grid, then one of the directions would have a value of zero (0).

A grid with the values "1,1,8,1" indicates a smooth upward gradient with a single impulse. The smooth gradient may be a horizontal line that spans the full six-pixel width with a gradient of 7, or multiple full-width lines that have a cumulative sum of 7. The single impulse is not located on or adjacent to any of the horizontal gradient lines, nor along the left, right, or bottom grid edges. There are a few dozen combinations that define the "1,1,8,1" pattern (not millions of combinations). Moreover, all of the combinations are visually similar - the strong upward gradient dominates the grid coloring.
In the windmill example, there is a grid with the values "5,0,0,12". This indicates no right or upward gradients, and only a slight left and downward gradient. This suggests a near-uniformly colored grid. This grid comes from the top-right corner of the photo (position [1,0] within the 6x6 overlapping grids) - where the sky is a nearly uniform color.
The windmill example also has a grid with hash values "20,0,207,13". These values define a rough slope that brightens upward sharply. The grid contains dirt and weeds from the bottom of the picture (position [1,5] within the 6x6 overlapping grid).
In general, lower values are the easiest to reverse into a pattern. Higher values have more potential combinations, but usually define a bumpier surface.
Limitation: Capped Values and Ambiguous Patterns
The only complex situation when reversing a PhotoDNA hash comes from the value 255. If the sum of the gradients in a direction is greater than 255 (e.g., 300), then the value is capped at 255. In effect, the value of 255 means "255 or more".

The capped 255 value leads to an ambiguity: completely different looking patterns can generate grids with multiple 255 values. This also means that non-linear color adjustments, such as gamma correction, histogram equalization, or white balancing, are likely to generate more 255 values since these color operations increase the gradient slopes.
However, there is also the final equalization that tries to scale the values into the range 0 to 255. This results in significant changes to all other hash values as the algorithm tries to reduce the number of 255 values.
This ambiguity also increases the amount of error when comparing hashes. The false-positive rate for incorrectly matching a picture to a known PhotoDNA hash increases with the number of 255 values in the known hash. Unfortunately, 255 values will usually appear wherever there is an extreme bumpy texture or color-equalized image. A grid that contains a bumpy rock will look like a grid that is filled with really curly hair (high-contrast texture). But at 6x6 pixels, it all looks similarly "bumpy".
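As a toy illustration of the cap (my own made-up values), once a directional sum exceeds 255 the actual amount of texture is lost:

```python
def cap(values, limit=255):
    # "255 means 255 or more": anything above the limit is clipped.
    return [min(v, limit) for v in values]

print(cap([12, 300, 40, 900]))   # [12, 255, 40, 255] -- 300 and 900 become indistinguishable
```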
Limitation: Constrained Values by Pixel Alignment
By strictly evaluating the four gradients associated with each grid, the texture and relative layout of the grid can be determined. Reversing the image with just the per-grid values will result in a blocky/blurry 36x36 image.

Unfortunately, PhotoDNA's 6x6 grids do not use a 36x36 pixel image. Instead, the algorithm uses a 26x26 pixel square and each grid overlaps with its neighboring grids by two pixels.
The two-pixel overlap between adjacent grids enforces a strong dependency. If one grid's gradients are mostly flat (e.g., [1,1,1,1]) and its adjacent neighbor is bumpy (e.g., [100,100,100,100]), then the bumpiness must be in the non-overlapping pixels and away from the flat neighbor.
Without the overlap, reversing a single grid's pattern from the four PhotoDNA hash values may yield a few thousand possible combinations. With the overlap enforcing constraints, the number of possible combinations is greatly reduced.
The middle four grids in the 6x6 grid segmentation are generally more important with pictures because photographers typically try to center subjects in the photos. The four middle grids within the 6x6 grid segmentation have overlapping regions with all of their neighbors. Even if the middle grid and its neighbors are bumpy, the constraints ensure that there are likely only a few dozen combinations that will generate the same gradients - and all of the combinations will appear visually similar.
Limitation: Constrained Values by Gradients
There are a few different approaches for computing a gradient. The direct approach is to simply subtract adjacent pixel values; a value of 2 next to a value of 5 will have a gradient of 3. However, another method uses a kernel, a small vector or matrix, for computing a smoother gradient.

For example, the 1x3 Sobel gradient [-1 0 +1] ignores the value of the middle pixel. Instead, it subtracts the pixel on the left from the pixel on the right. A single impulse, such as [5 12 5], has a gradient of zero.
The sample workflow provided by Microsoft (WS4_PhotoDNA.pdf, slide 5) includes a Sobel gradient. With this gradient, the outer edge of each grid cannot be included in the gradient computation because one of the 3 positions in the 1x3 Sobel gradient will be outside of the grid. For example, each grid is 6 pixels wide, but the sum of horizontal gradients only uses the middle 4 pixel gradients per column.
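As a small worked example (arbitrary values of my own), sliding the 1x3 kernel across one six-pixel row of a grid yields only four gradient values:

```python
row = [1, 2, 5, 3, 4, 10]                                # one 6-pixel row from a grid
grads = [row[x + 1] - row[x - 1] for x in range(1, 5)]   # edge pixels cannot be kernel centers
print(grads)                                             # [4, 1, -1, 7]
print(sum(g for g in grads if g > 0))                    # 12 -> contributes to the "right" sum
print(sum(-g for g in grads if g < 0))                   # 1  -> contributes to the "left" sum
```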
The use of the Sobel gradient adds a constraint between the adjacent grids. For the middle four grids, the center 2x2 pixels are the only portion not in any direct overlap with neighboring grids. However, every gradient that uses these non-overlapping 2x2 pixels must include at least one pixel value from a constrained region. The 16 hash values from the middle four grids are all directly dependent on the texture and pixel values seen in the surrounding 12 grids.
This additional constraint further limits the possible values when reversing a PhotoDNA hash back into an image. This permits a significantly sharpened image from the hash projection.
Limitation: Weighted Regions
The two-pixel overlap permits PhotoDNA to identify similar pictures that differ by a minor cropping or image shift. A picture will need to have more than 4% of the image cropped from one side before the hashes stop matching; cropping both sides of an image can usually match hashes with a nearly 8% size reduction across one dimension (horizontal or vertical).

However, the overlap regions enforce a weighted importance. Pixel values in the overlapped areas are more important for hash generation than pixels in the non-overlapped areas. They are more important because they influence hash values in multiple grids.
| This image shows the 26x26 normalized image with colorized 6x6 grids; overlapped regions are in shades. The thick borders enclose the regions of the image that are not involved in any overlap. These are regions that carry the least amount of influence on the overall PhotoDNA hash since they only impact one set of grid gradients. |
| Within the 6x6 grid are a total of twenty-five 2x2 pixel regions that provide the most influence on the PhotoDNA hash values. Each of the 2x2 regions represent 0.6% of the total image (4 pixels out of 26x26 pixels), yet each 2x2 grid influences 11% of the picture's hash values. The four overlapping grid regions that share one 2x2 max-overlap area account for 10x10 of the 26x26 normalized pixels (15% of the image). |
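Under the grid layout assumed earlier (6x6 grids starting every 4 pixels), those maximum-overlap areas fall wherever a two-pixel column overlap crosses a two-pixel row overlap. A small sketch of that geometry, using my assumed coordinates:

```python
starts = [0, 4, 8, 12, 16, 20]        # assumed grid origins along each axis
bands = starts[1:]                    # overlap bands at columns/rows 4-5, 8-9, ..., 20-21
regions = [(x, y) for y in bands for x in bands]   # top-left corner of each 2x2 region
print(len(regions))                   # 25 maximum-overlap regions
print(4 / (26 * 26))                  # ~0.006 -> each region is about 0.6% of the image
print(4 * 4 / 144)                    # ~0.11  -> each region touches 4 grids, 11% of the hash
```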
Modifying just one of these 2x2 maximum-overlap regions will alter less than 1% (0.6%) of the total image but can result in a significant change to 16 of the 144 PhotoDNA hash values, or 11% of the total hash. A minor change to the picture in one of the 25 specially selected areas can result in a significant change to the PhotoDNA hash value.
Modifying two of the 2x2 maximum-overlap regions, where the two regions share no common grids (such as the top-left and bottom-right maximum overlap regions) requires changing less than 1.2% of the picture but will result in a change to 8 of the 36 grids, or 22% of the PhotoDNA hash values. This is enough to generate a PhotoDNA hash that appears significantly different to the unaltered photo's hash, without significantly altering the visual content.
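For anyone who wants to check the arithmetic behind those percentages:

```python
pixels_changed = 2 * 4            # two 2x2 regions that share no grids
print(pixels_changed / (26 * 26)) # 0.0118... -> roughly 1.2% of the normalized image
grids_touched = 2 * 4             # each maximum-overlap region sits inside four grids
print(grids_touched / 36)         # 0.222... -> 8 of the 36 grids
print(grids_touched * 4 / 144)    # 0.222... -> 22% of the 144 hash values
```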
A hostile attacker who wishes to defeat PhotoDNA detection only needs to modify two of these 2x2 maximum-overlap regions. The attacker can either force a visually similar image to have a significantly different PhotoDNA hash value (forcing a false-negative result) or can make a visually dissimilar image appear to match a known PhotoDNA hash value. For example, the following two pictures have inverted two sets of 2x2 from the 26x26 positions. In both pictures, the edits account for 1.18% of the total image.
| Edits to two minimum-weighted regions | 1,14,2,18, 4,10,6,12, 13,14,60,10, 16,6,29,11, 6,0,3,11, 5,0,0,12, 7,5,133,16, 42,3,172,14, 182,63,255,99, 35,180,255,27, 14,3,208,20, 20,0,207,13, 14,20,255,1, 26,9,255,6, 131,29,255,36, 68,155,255,11, 31,5,255,4, 37,2,255,5, 0,43,45,47, 7,32,54,57, 115,23,49,36, 58,121,59,30, 18,4,57,34, 24,1,48,30, 3,16,133,14, 9,14,164,8, 100,16,148,10, 47,105,127,19, 20,14,137,26, 31,5,132,45, 6,13,36,9, 6,9,32,10, 28,8,25,20, 17,29,34,24, 14,7,63,10, 19,7,69,12 The top-left and bottom-right corners only impact the first and last gradient sets (in gray). (The other two spurious changes are due to the equalization. Both values differ by 1.) |
| Edits to two maximum-weighted regions | 197,84,201,95, 20,155,112,55, 10,11,48,8, 13,5,23,8, 5,0,2,9, 4,0,0,9, 104,42,123,156, 42,76,146,88, 145,50,255,79, 28,143,220,21, 11,3,166,16, 16,0,165,10, 11,16,255,1, 20,7,255,4, 104,23,255,29, 54,123,255,8, 25,4,255,3, 30,1,255,4, 0,34,36,37, 5,26,43,45, 91,18,39,29, 46,97,47,24, 14,3,46,27, 19,1,38,24, 2,12,105,11, 7,11,130,6, 79,13,118,8, 37,84,101,15, 18,27,108,36, 46,12,103,65, 5,11,28,7, 5,7,26,8, 22,7,19,16, 13,23,27,19, 14,43,73,16, 67,23,101,24 Inverting two of the maximum overlap regions directly impacts four gradient sets (in gray) and changes nearly all hash values. Why didn't it only change 8 of the 36 grids or 32 of the 144 values? The algorithm applies an equalization after calculating the gradient sums. This causes big changes to propagate across all of the hash values. Most of the red values are clustered around the 8 grids that are directly impacted by the edits. |
Both of these altered images contain the same number of edits; the only difference is the locations. These images can be compared against the source windmill picture.
- In the picture with edits to the minimum-weighted regions, the MSE is 0.49 and the linear distance is 8.42. Both of these metrics result in very low numbers, indicating very similar pictures.
- The picture with edits to the maximum-weighted regions has an MSE of 1335.1 and a linear distance of 438.5. Those are large differences and imply that these should be different pictures. I don't know what thresholds NCMEC uses, but I was told (by an organization that has access to PhotoDNA) that these differences are large enough to be a "miss".
PhotoDNA was designed to match similar pictures with minor cropping. It can correctly match visually similar pictures that have edits on the non-overlapping regions. However, it can have significant problems when there are even minor edits in the maximum overlap regions.
Summarizing PhotoDNA Technical Limitations
PhotoDNA has some significant design weaknesses:
- The four sum-of-gradient values from each grid define each grid's texture. By providing a matched set of opposite directions (up and down, left and right), the surface within the grid can be reversed to a set of a few hundred possible values.
- The overlapping two-pixel region between grids reduces the set of possible values in each grid to a few dozen possibilities that are all visually similar.
- The multi-pixel Sobel gradient further reduces the possible set of values and permits sharpening any hash projection.
- The use of an equalization for scaling the sum-of-gradients increases the likelihood of a false-negative for any minor edit.
PhotoDNA does not detect flips, mirroring, 90-degree rotations, or inverting. However, it is supposed to detect visually similar pictures. Digitally altering less than 2% of the picture in very specific locations can effectively avoid detection. Moreover, these edits can be applied to non-salient regions of the picture.
Someone who wants to generate false-positive results only needs to modify a few selective portions of the picture. Forcing false-positive results can be used to justify plausible deniability in a court of law. (If you're involved in a CP/CSAM case, make sure your attorney receives the picture and not just a claim that the hash matched. If the evidence doesn't have the same SHA1, SHA256, or other strong cryptographic checksum, or isn't visually similar as identified by a human, then have the evidence excluded. It's not that I'm pro-CP, it's that I've heard of cases where people were accused based only on hash matches.)
Pseudo-Code
Back in 2014, I wrote some pseudo-code and test cases for reversing a PhotoDNA hash. (I'm sure my approach is inefficient and crypto gurus can do a better job.) My basic approach treated it like a 26x26 Sudoku puzzle. The approach brute forced every combination while obeying the constraints imposed by the gradients and overlapped grid positions. A brute-force approach should not take more than a few days to reverse one hash. I confirmed this approach by altering the prior normalized 26x26 image and observing the changes to the hash values. (That was in 2014. Today I would use a neural network since AI should be able to easily learn the mapping from hash to image.)

I have made no attempt to automate this reverse-PhotoDNA solution. It is my belief that, upon creating an automated solution, everyone who distributes NCMEC's PhotoDNA hashes will be distributing child pornography, and everyone in possession of NCMEC's PhotoDNA hashes will be in possession of child pornography.
Reversing a PhotoDNA hash should result in a 26x26 version of the picture that is grayscaled and slightly blurry. (A 26x26 pixel image is about the same resolution as a small desktop icon.) The final result is unlikely to be identical to the source image, but should be visually similar and permit identifying specific people (when the face is the main picture's focus), the number of people in the photo, relative positioning, and other large elements (doors, beds, etc.).
I hand-delivered my whitepaper to NCMEC and Microsoft representatives in 2014. Additional copies were delivered to them between 2014 and 2016. There has been no feedback from them. Meanwhile, in the statements repeatedly made by Microsoft and NCMEC, they have been keen to point out that "the PhotoDNA hash is not reversible, and therefore cannot be used to recreate an image." I believe they are wrong.
Update: It only took a few months for Anish Athalye to implement an AI system that reverses PhotoDNA hashes back into pictures. He has a detailed writeup (with working code) and I have my own followup blog entry.
Comments (14)
I would like to confirm my understanding with an example:
If the values in a row are for example [1 2 3 4 3] the corresponding gradients are [ 1 1 1 0 -1]. If I have understood correctly, the two horizontal hash values would then be 3 and 1? where 3 is the sum of positive values and -1 is the sum of the absolute values of negative ones?
Great question.
Each row is 6 pixels, so let's use a test set of values for one row: [1 2 5 3 4 10].
The gradient kernel is [-1 0 +1].
This means there are 4 gradients (because it would be out of bounds to be centered on the 1, also you can't be centered on the 10).
1st gradient: (-1)x1 + 0x2 + (+1)x5 = 4
2nd gradient: (-1)x2 + 0x5 + (+1)x3 = 1
3rd gradient: (-1)x5 + 0x3 + (+1)x4 = -1
4th gradient: (-1)x3 + 0x4 + (+1)x10 = 7
The positive values are the 'right' gradients: 4+1+7 = 12
The negative values are the 'left' gradients: -1
The code is literally an 'if' statement. If the gradient for any position is positive, then add it to the right or down gradients. Otherwise, add it to the left or up gradients.
As soon as I saw the Microsoft document that listed [-dx,+dx,-dy,+dy], I knew they were separating out the positive and negative gradients.
As an aside: this means that you could flip the gradient order AND swap the gradient set positions and be able to have hashes for every rotation/mirror/flip combination. However, PhotoDNA does not do this.
I was supposing the use of a simpler kernel such as [-1 +1] but of course the [-1 0 1] is more similar to the original Sobel operator.
I'm wondering (maybe I could try some tests) if, by considering a somewhat wider kernel, let's say of five elements, and applying the convolution to all the pixels of the image (with zero padding outside the image), we could avoid the superimposition of the 6x6 pixel squares.
I already noticed the possibility of easily comparing the hash against all the possible rotations and flips of the image... it is trivial... why don't they consider this option?
- “[…] downsized to a lower and fixed resolution of 400 × 400 pixels […]”
So, we know now the exact resolution for step1.
- “Finally, we compute the similarity of two hashes as the Euclidean distance between two feature vectors, with distances below a specified threshold qualifying as a match.”
So, they seem to use Euclidean distance instead of raw distance or mean square error.
- “In 2016, with an NCMEC-supplied database of approximately 80,000 images […]”
Now we know that the NCMEC database had about 80000 entries back in 2016. This might be interesting for calculation of the probability of a false-positive. (If all the images you have are compared to 80000 hashes, the risk of a false positive is much higher than if you compare all your images against only one hash.) I also found another source which states that the database had around 300,000 entries in 2019, but I didn't find the link to this page right now.
Also, the document states that “Misclassify an image as CP at a rate of no more than one in 50 billion” was a requirement for PhotoDNA.
What's your thought on this statement? Isn't the false-positive rate realistically much higher than one in 50 billion?
I hadn't seen that paper. I had heard that Hany was trying to come up with a variation of PhotoDNA that avoids any conflict of ownership. (As I was told, he effectively sold PhotoDNA to Microsoft.) Given how this algorithm works, I can't see how this would not be a conflicting derivative work. But I'm not a lawyer.
The PDF is an article covering pages 593-599. Page 594 of the PDF lists 4 requirements regarding speed and accuracy. I've worked with NCMEC for years and they even evaluated my own image hashing algorithm (2013-2014). In all of those years, I never heard about these four requirements. Perhaps those were specifically between NCMEC and Microsoft, and not for any algorithm in general.
I can say with absolute certainty that PhotoDNA fails all four requirements. For example, analyze in under two milliseconds? The typical JPEG image from a camera takes more than 2 milliseconds to decode BEFORE it can be analyzed. (JPEG is not built for speed, and the initial scaling reduction step takes time.) You can try to get the same throughput rate (500 images per second) by running multiple analyzers on multiple CPUs, but that's not the same thing as analyzing one image in under two milliseconds.
Page 596 mentions a corpus of 80,000 images that he received in 2016. I doubt that they contained actual CP since that becomes an issue with Federal law. (NCMEC often uses analogy images for testing.) When I worked with NCMEC on evaluating a separate algorithm (2013-2018), I was explicitly told that they did not have any images they could share. I think this paper may be confusing a one-time privilege with something generally available to developers. The same page mentions adoption by Bing (Microsoft), Facebook, Twitter, and Google but seems to misrepresent the circumstances. As I understand it, Google had not "waited until 2016 to deploy." Instead, both Google and Twitter repeatedly asked for PhotoDNA for years (because it was the only solution offered by NCMEC) and were repeatedly denied. It took years before they were finally granted permission to use the algorithm. (My information comes directly from employees at Twitter and Google who were involved in the integration effort.) In both cases, I don't know if this delay was due to NCMEC or Microsoft, but this is the same lack of access that I encountered.
The algorithm discussed in this paper looks like a variation of PhotoDNA. However, the description of PhotoDNA does not appear to be accurate. Page 596 says, "Although I will not go into too much detail on the algorithmic specifics, I will provide a broad overview". The review takes a lot of liberty in the description of the algorithm.
There are other parts of this paper that are equally concerning. For example, page 598 describes a 3-minute video at 24fps as having "4,320 still images". I sincerely hope that this paper is just poorly written, because videos typically use B-frames and P-frames to store partial images. The I-frame (still image) only appears periodically. (If you only use I-frames for a 640x480 video, then you won't get 24fps. See MJPEG for an I-frame-only video format; you'll be lucky to get 12fps.) A three minute video at 24fps does not contain "4,320 still images". It usually contains 4,320 frames -- 'frames' are not the same as 'still images'. Among other things, different video rendering engines (ffmpeg, Apple, Microsoft, etc.) will apply I/B/P frames differently, resulting in minor variations for any specific frame. Also, frames have a time code -- a single unchanging frame on the screen may be encoded once and the next frame does not appear until much later in time. (I can make a 3 minute video at 24fps that only encodes one frame, if the same frame is seen for the entire 3 minutes.) Any video analysis solution that relies on raw frame-to-image extraction at explicit time intervals -- without accounting for the frame's time code or I/B/P status -- sounds like a brute-force approximation by someone who doesn't understand video.
I could write a lot more about this paper, but that's getting really off-topic. I would not rely on the information in Hany's paper to help you understand how PhotoDNA works.
With your CV, I'm a tad concerned with your hyperbole concerning the hash lists used for matching.
"It is my belief that, upon creating an automated solution, everyone who distributes NCMEC's PhotoDNA hashes will be distributing child pornography, and everyone in possession of NCMEC's PhotoDNA hashes will be in possession of child pornography."
You and I both know that one cannot generate data out of nothing which is what a reverse algorithm would entail.
I by no means intend for this to imply my support of PhotoDNA or Apple's proposed CSAM detection. I completely object to Apple's implementation, but can understand the use of the software to scan private galleries which are on their own servers. Data on one's person should be just as protected as any other property from an unreasonable search, but the moment it enters the possession of another, it's their call. Plus, if you want a cloud backup, encrypt it first.
AFAICT, my CV isn't online.
Re: "You and I both know that one cannot generate data out of nothing which is what a reverse algorithm would entail."
What are you talking about?
An algorithm is not the same as the data used by the algorithm.
Moreover, I had sample PhotoDNA hashes for testing -- courtesy of Microsoft. (Like the windmill example. I have the picture and the hash.) I didn't reverse the algorithm "out of nothing".
After I reversed the algorithm, I checked it with people who had access to the actual algorithm. They confirmed that the same input pictures generated the same PhotoDNA hashes. This validated my implementation.
In addition, a few of those people knew how PhotoDNA works and confirmed that my write-up correctly described how it works. Those people did not tell me how it works before I reversed the algorithm. (I didn't meet them until after I had written the white paper.) Before I distributed the paper, I wasn't given any hints by people under any kind of NDA.
As for the risk that I describe, it only applies to people who are in possession of NCMEC's PhotoDNA hashes. I'm not in possession of those hashes. However, NCMEC, ICMEC, ProjectVIC, law enforcement, ICACs, a handful of vendors who create tools for ICACs, and some very large companies (Microsoft, Twitter, Google, etc.) are in possession of NCMEC's hashes. They are the ones who would be in trouble.
Re: unreasonable search
"Unreasonable search" only applies to law enforcement or someone acting on behalf of law enforcement.
Apple is not law enforcement, and it doesn't sound like Apple is doing this on behalf of law enforcement.
Apple can search your device just as your parents can search your bedroom or your landlord can search your rented home. While there are laws against trespassing (and digital trespassing), if they find and report anything illegal while trespassing, then it can be used against you.
In this case, you own the device (iPhone). Apple licenses the OS to you. Apple wants to change the license terms so that they can scan your device for potentially illegal content. It doesn't matter if they want to limit the scope of the scan to content intended for the iCloud. (By default, the iPhone is configured to use the iCloud for backups, so that means scanning everything.) If they want to do this legally, then they should scan on their iCloud servers and not on the user's device.
reverse hash with machine learning
https://www.anishathalye.com/2021/12/20/inverting-photodna/
I want to express my gratitude for your insightful work and for sharing it on your blog. Your posts have not only piqued my interest but have also proven to be immensely beneficial to me.
I am currently attempting to replicate the algorithm you've described, with the aim of obtaining the same values you've obtained for the windmill image. However, after several trials, I've come to realise that the hash values are heavily reliant on the initial normalisation algorithm. I am curious to know the specific normalisation steps you've employed to derive the values published in this article. I've experimented with histogram equalisation and linear normalisation both before and after resizing, using various methods such as nearest neighbours, bicubic, bilinear, and even Lanczos, among other less conventional methods. Despite all these combinations, I've been unable to come close to the results you've showcased here. I've also explored different options in the final step, such as capping the hash values to 255 or equalising using the maximum value as you've mentioned in the "Limitation: Capped Values" section. I've attempted using both a global maximum and local maxima per grid and direction, but to no avail. I would greatly appreciate any guidance you could provide on this matter. Another question I have is about the size of the Sobel output per grid, is it 6x4 and 4x6? Or 4x4 and 4x4 losing data around the borders? I've tried both of them as well.
On a separate note, I would like to discuss my specific context. I have signed an NDA with Microsoft and have been provided with two executable libraries, one in C# and one in Java, referred to as PhotoDNA Edge Hash algorithms. According to them, I am required to hash an image and connect to their servers for matching, confirming what you've mentioned in your blog post. However, upon analysing the hash sizes being sent to Microsoft, I've noticed that it is significantly large, implying that the image could potentially be reconstructed in detail. In some instances, the hash is even larger than the image itself. To prevent any potential leakage of my media to Microsoft, I am keen on implementing my own in-house verification solution. However, after signing an NDA with IWF, I was introduced to an API that only provides MD5 and the standard PhotoDNA hashes. Your blog has raised serious concerns for me about using these, as they can be reconstructed and are also quite slow to process. Unfortunately, IWF does not offer any alternatives. In the interim, I've been utilising AI Interrogators such as CLIP, but they lack the scalability to support the size of my database.
I look forward to your response and guidance.
Best regards
This is excellent news!
Unfortunately, I'm not able to distribute source code. (Basic policy here. Nothing personal.)
When I originally reversed the algorithm, my values were "in the ballpark" but not exact. As you pointed out, the difference was in the normalization function. Even without the correct normalization, the algorithm uses the computed angle between the two hash vectors. Your results should be close enough.
As with you, I realized that the difference was in the normalization function. That's when I noticed that Hany Farid had created the algorithm. He used to teach at Dartmouth College. Some of his students had posted their homework solutions online and those used a really funky normalization function that they had learned in class. I replicated the algorithm and got almost the exact same numbers. (I just went back to find those references and I can't find them online anywhere.) Farid uses a fast box filter algorithm (see http://nghiaho.com/?p=1159, http://6.869.csail.mit.edu/fa16/lecture/lecture3linearfilters.pdf, and H. Farid and E. P. Simoncelli, Differentiation of discrete multi-dimensional signals, IEEE Trans Image Processing, vol.13(4), pp. 496–508, Apr 2004).
A few years later, after I shared my initial whitepaper with Microsoft, I received an anonymous tip that shared Microsoft's code with me. (I've since figured out that it likely came from one of Microsoft's internal developers who worked closely on PhotoDNA. His name is all over the source code. Unfortunately, he died a few years after sharing the code with me.) This code allowed me to identify the final difference and generate the exact same values.
What I found: I definitely reversed the basic algorithm! I was about 95% of the way there!
The key differences:
- I didn't do the bilinear interpolation for the histogram. (blurring)
- I didn't do the weird/stupid scaling and equalization at the end.
None of this changed the limitations that I documented in the whitepaper. It also introduced a new limitation: The stoopid equalization loop means big changes to one gradient set can cause value differences in all other gradient sets. This increases the likelihood of a false-negative for any given change.
If you have the source code from Microsoft, then you have the algorithm. However, as you mentioned, you're under the NDA. (I never signed any NDA for this, and while I have seen Microsoft's code, my code was built independently and is not owned by Microsoft.)
Thank you for your detailed response. It has been immensely helpful in my quest to reverse engineer the PhotoDNA algorithm.
I would like to clarify that my NDA covers PhotoDNA Edge, a different algorithm that includes more information and apparently adds randomness to the hashes. Unfortunately, I do not have access to the source code or the algorithm itself.
With your valuable hints, I am pleased to tell you that I have finally managed to reverse engineer PhotoDNA and reproduce the results. While there are minor differences in a few values compared to the ones posted on your blog, they match 100% with the ones produced by pyPhotoDNA using a DLL I found on Exterro FTK distribution (albeit hard to find, it is still there and available). My guess is that PIL may be generating some small artefacts.
I believe I have found the works you referred to on Hany Farid's homepage. In the code section, there is a project called Wild-ID. Inside it, there is a sift-test folder with a file named MyMatch.java. This file contained the clues I needed: The Difference of Gaussians from Lowe's work and the starting sigma to be 1.6, which is then divided by 2. I applied the Gaussians starting with 1.6, and then the differences and the coefficients I observed (.196967309..., .2780373004..., and .4481117...) seemed promising as they were present in the DLL binary data.
Regarding the equalisation, I completely agree with you. It's astonishing how it degrades the values, and it's very expensive as it runs in a loop. What I found was that you have to calculate the Euclidean norm and literally normalise the vector, then clip anything below 0.25 to 0.25 and continue until the vector stabilises. Finally, an equalisation to 0 to 256 is performed and the value 256 gets replaced by 255.
My next goal is to port the algorithm to CUDA to allow fast parallel hashing of multiple files. This will significantly speed up the process and allow for more efficient handling of large datasets.
Once again, thank you for your guidance and support.