The Hacker Factor Blog: Tools, Techniques, and Tangents
Ten Little Endians
Monday, July 27, 2009
I like to think that I support the open source community. I find open source to be very beneficial; I like the ability to review code and see how things work. Although I use open source tools (everything from gcc and nmap to Gimp and Open Office), I seldom incorporate open source code into my own code, and I rarely contribute my own code to open source projects.
I have a few big reasons for not incorporating free and open source software (FOSS) in my closed source world. First, FOSS usually needs so much tweaking to fit into my code that it is easier to implement it from scratch. In many cases, my code has much more demanding requirements -- every assumption and error condition must be identified and handled correctly. In my experience, most FOSS is designed to get the job done, and not test for corner-case conditions.
And then there is licensing... I hate GPL. Just as I don't like door-to-door solicitors who try to push their religion on me, I also don't like licenses that try to push their philosophies onto me. GPL is a virus. If you use GPL code in your code, then your code becomes GPL. Give me the MIT, BSD, or WTFPL licenses -- these allow anyone to freely use the code. (BSD and MIT have the additional requirement to cite their work, but that does not impede my own license.) GPL requires that your own code becomes GPL. It may be an open source license, but it comes with strings; GPL isn't "free".
On one project, I worked with some pretty big names in the open source world. However, I felt the need to constantly remind them that "Open source is a fad" -- just like the pet rock and parachute pants. (But like the mini-skirt, not all fads are bad. I also believe that the Internet is a fad based on a bad joke, yet everyone takes it so seriously.)
I actually have a list of reasons why open source is bad. This isn't to say that closed source is good -- closed source is bad for other reasons.
One Little...
One of my FOSS dislikes comes from the community. Engineers are stereotyped as geeky people with no social graces. Some are prima donnas, others are outright rude. In the closed-source corporate world, there are layers upon layers of support staff for keeping the engineers away from the customers. The last thing any project needs is an engineer on the phone with a customer saying, "That's stupid! Only a moron would do that!"
(Business 101: Never call a customer stupid, regardless of how stupid they are.)
With the open source community, you get to meet these people head-on. Let's say you have a question about an open source project -- anything from usage to technical programming queries. Most mid- to large-size projects want you to post your message to a forum or send it to a mailing list. In my experience, you have a 50% chance of getting the posting answered, and an independent 50% chance of getting an insult or a rude reply.
There are some notable exceptions to this. For example, the Linux kernel mailing list seems very professional and polite. However, most crypto forums and truly open security mailing lists are crawling with trolls and rudeness.
Two Little...
About two months ago I had a need to use FFmpeg on a Mac. FFmpeg uses an LGPL license. (I'm not a lawyer, but...) This means that you can link to the shared library, but you cannot incorporate or statically link your code without altering your own license.
Since my Mac code is compiled as a universal binary (runs on both Intel and PowerPC platforms), I needed FFmpeg compiled as a library with universal binaries. The universal libraries are only needed during compile-time, not run-time. At run-time, the "correct" platform libraries are used.
In order to make a universal library out of FFmpeg, I needed to make some code changes. One of the big problems with supporting multiple platforms comes from byte order -- the endian issue. The FFmpeg library hard-codes the endian conditions during the configure stage (before compiling). This is a problem since a universal binary must support both endian conditions.
Abiding by their license (it's part of the LGPL), I shared my list of changes with the FFmpeg community. Boy, this was a mistake. Talk about a rude bunch of people. Some of the responses:
As expected, there were a few useful replies. (Rude and unhelpful outnumbered useful by a 3-to-1 margin.) A few users suggested that I use lipo for combining libraries into a universal binary. (Good idea -- I ended up doing this.) And Graham was very helpful, offering a code suggestion and explaining to others the need for a universal binary in a professional manner.
My posting was not an isolated incident. Many threads initiated by people other than the FFmpeg developers contain multiple rude or hostile comments. If you don't like hostile replies, then you can follow Dark Shikari's advice and "Get the hell out."
Three Little Endians!
One of the replies to my endian suggestion really irked me. Vitor wrote:
This is not possible for speed reasons. FFmpeg is full of very speed
There are two issues here. First, I could not find this code example in the actual code. ('find ffmpeg-0.5 -type f | xargs grep -i bytestream_read_le32' returns nothing.)
More importantly... my background includes many years as a C code optimizer. My job was to identify areas in code written by non-computer science people and make it run faster. (Compiler optimization flags can only do so much. If you write high-performance code, then you can usually do better than any tweaks that the -O2 or -O3 flags do to slower code.) Since Vitor's sample code uses three indirections within the loop, I wanted to see where it was used. Removing even one indirection would speed up the loop. And if this code was really critical, then perhaps inline ASM would be faster...
Four Little, Five Little, Six Little Endians...
What I did notice is that functions like "get_le32()" are called in critical loops. This function is defined in libavformat/aviobuf.c and calls get_le16() twice. And what does this do? It gets (get) 32 bits (32) in little endian (le) format. The family of FFmpeg endian functions (get_le16, get_le32, get_be16, etc.) can be seriously optimized -- much more than any "-O2" flag can do.
Moreover, these functions can be combined to take advantage of the local architecture. For example, if you are on a big endian architecture and calling get_be16(), then you currently run:
unsigned int get_be16(ByteIOContext *s)
Here's some faster code for big endian architectures:
unsigned int get_be16(ByteIOContext *s)
Now, if you define this as a macro in libavformat/avio.h instead of a library function in libavformat/aviobuf.c, then the function call even becomes inlined. (With gcc -O2, inline functions don't cross library boundaries. Since FFmpeg compiles each .c file independently, no simple inlining happens.)
Seven Little, Eight Little, Nine Little Endians...
So Vitor's whole argument about "FFmpeg is full of very speed critical code" is bogus. If they were worried about speed, then they would optimize their endian conversion system. With any kind of video format, endian conversion happens everywhere.
Of course, my code suggestion was to not even hard-code the endian-ness. Instead, determine the system's endian, then choose the right conversion method. The determination is only made once, when the program first starts (so there is no ongoing performance loss), and it is very fast:
#ifndef BIG_ENDIAN
Now, each of the get-functions can test the endian. We can even pass big/little as a flag:
get_e16( (ByteIOContext *)s, (int) endian) \
Granted, this code does introduce an "if" condition. However, if this is defined as a macro in a header file rather than a C function (notice the backslashes at the end of each line), then you drop two push calls (parameters for the function), a function call, and associated stack pops. The one additional "if" is much faster. In the worst case, there is an "if", two assignments, and a return (which gets optimized to a single assignment by the compiler) -- much faster than the original two gets, two assignments, a bit shift, a logical OR, and a real return that cannot be optimized.
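A minimal sketch of the idea described above, under stated assumptions: it is not FFmpeg's code, and the names ByteReader, host_order, and read_e16 are hypothetical stand-ins (ByteReader replaces FFmpeg's ByteIOContext with a plain in-memory buffer). It detects the host byte order at run time, reads 16 bits with a single load, and swaps only when the requested order differs from the host's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for FFmpeg's ByteIOContext: an in-memory buffer. */
typedef struct {
    const uint8_t *buf;  /* start of the data */
    size_t pos;          /* index of the next byte to read */
    size_t len;          /* total number of bytes in buf */
} ByteReader;

enum { ORDER_LITTLE = 0, ORDER_BIG = 1 };

/* Detect the host byte order at run time. Intended to be called once at
 * startup, with the result cached by the caller. */
static int host_order(void)
{
    const uint16_t probe = 0x0102;
    uint8_t first;
    memcpy(&first, &probe, 1);          /* look at the lowest-addressed byte */
    return first == 0x01 ? ORDER_BIG : ORDER_LITTLE;
}

/* Read a 16-bit value stored in byte order `order`, given the cached host
 * order. When the two match, this is a single load with no shifting. */
static inline unsigned int read_e16(ByteReader *r, int order, int host)
{
    uint16_t v;
    assert(r->pos <= r->len && r->len - r->pos >= 2);  /* caller's contract */
    memcpy(&v, r->buf + r->pos, 2);     /* alignment-safe 16-bit load */
    r->pos += 2;
    if (order != host)
        v = (uint16_t)((v << 8) | (v >> 8));  /* swap the two bytes */
    return v;
}
```

Declaring the reader as static inline in a header gives the compiler the same chance to inline it as a macro would, while keeping type checking. Whether the extra "if" actually beats two byte reads and a shift depends on the compiler and the target, so it is worth measuring rather than assuming.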
All in all, this is a very good performance improvement. And since the endian code is called all over the place -- including in critical loops -- this should be a significant performance increase.
The original FFmpeg code actually does something a little more complicated for getting a single byte (the get_byte() function). However, even this can be optimized. The FFmpeg source code actually includes the comment "XXX: put an inline version".
... Ten Little Endian Boys
FFmpeg is one of the best libraries for playing video formats. Regardless of the format (AVI, WMV, MP3, etc.), this library supports it. While they have taken some good steps to make their code fast, their code is not as efficient as it could be. Moreover, they are stereotypical of the open source community: while a few people are helpful and friendly, a very vocal group are rude and hostile. This does not make me want to contribute.
PS. For people who don't know the children's song, "Ten Little Indians", you can find the lyrics at Wikipedia.
Dear Wired Magazine
Thursday, July 23, 2009
Dear Wired Magazine,
I am reading the August 2009 issue and I came across Bryan Gardiner's "Burning Question: How Do I Future-Proof My Digital Media" article on page 43. The article starts off with an important premise: As file formats change and systems are updated, many formats will become obsolete. Eventually, some files will not only become unsupported, we may actually lose the ability to access data.
What Gardiner does not mention is that this problem is already happening. For example, a few years ago I was asked to help recover data from some old floppy disks. These were not just any floppy disks, these were 5.25" disks created on an Apple ][. Apple did not use standard floppy drives, so having old PC hardware would not help. Fortunately, I had a friend with a working Apple ][e. Unfortunately, he didn't have the necessary software for reading the files. Eventually we solved the problem by printing the binary data file to the parallel port, and capturing the data on an old PC. But even then, decoding the file took significant time. (We were very surprised that the floppies still held data!)
I've come across the same problem with old Mac "Classic" drawing programs. Even with the bits, some files are unrecoverable due to undocumented and unsupported formats. With hardware, there are issues with old parallel devices; most new PCs lack parallel ports. And don't even get me started on storage devices... Can anyone still get data off of an old RLL or MFM hard drive? Even SCSI, IDE, and ATA are being replaced by SATA and other drive formats. Apple is already phasing out Firewire, and it is only a matter of time before USB devices drop USB-1 support.
More disconcerting is the state of compilers. You cannot compile GCC without... GCC. I recently tried to update an old Mac 10.3 system from GCC 3.x to GCC 4.0. I couldn't do it. GCC 4.0 depends on some other software that cannot be compiled without a newer compiler.
I tried to slowly update through multiple old compiler versions, but I couldn't find all the versions of all the old dependencies.
While the first half of Gardiner's article is well written, the second half (specifically paragraph 7) is outright incorrect. Here is the itemized list of mistakes.
While the article's premise is well-formed and the central arguments are strong, the conclusion is poorly researched and inaccurate.
Posted by Dr. Neal Krawetz in Forensics, Image Analysis, Mass Media at 08:10
Creepy TIFF
Saturday, July 18, 2009
It's been a while since I wrote about image formats themselves. This time, it's TIFF -- the Tagged Image File Format. If JPEG is the shallow, stuck up, spoiled cheerleader who everyone thinks "why is she so popular?", then TIFF is her creepy older brother.
On the positive side, TIFF is a loss-less image format with some compression capabilities. It also can store multiple images per file. But those are really the only positive aspects. TIFF does support a few well-defined meta data fields (similar to PNG), but they are all associated with predefined two-byte tags (similar to JPEG). This means that TIFF lacks extendable meta types. If your meta data is anything other than date/time, artist, copyright, software, host computer, make, model, document name, or the image description, then you're out of luck. The format doesn't even have a generic "comment" meta type.
TIFF Format Details
The TIFF file format resembles a file system. There are a series of Image File Directory (IFD) records. The beginning of the TIFF has a pointer to the first IFD record. From there, every IFD points to the next IFD. The IFD structures are strictly a linked-list. Each IFD defines one image. If the TIFF only contains one picture, then there is only one IFD record.
Within the IFD are a bunch of Directory Entry (DE) records. Each IFD defines the number of DE records, and each DE record has a tag that defines the purpose, type of data, number of data records, and an offset to the data itself. There is one DE per image attribute. Image width? One DE. Image height? One DE. Bits per pixel, color space, meta data, etc. -- one DE each. This makes parsing a breeze.
However, while JPEG's creepy older brother may seem to know everything about the image, his room is still a mess. The offset to the DE's data is 4 bytes. If the data length is 4 bytes or less, then the data is stored in the offset field itself. If the data is longer, then it is stored wherever the offset points. (Sound familiar? JPEG does this too.)
A TIFF is really a messy file format. With JPEG, PNG, BMP, GIF, and other image formats, all data is stored sequentially. However, TIFF is based on pointers to offsets in the file.
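The DE layout described above can be sketched in a few lines of C. This is only an illustration based on the 12-byte entry layout and type codes in the TIFF 6.0 specification; the names DirEntry, parse_de, de_is_inline, rd16, rd32, and type_size are hypothetical, and a real parser would also need bounds checks against the file length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One 12-byte Directory Entry (DE). */
typedef struct {
    uint16_t tag;       /* what the entry describes (width, height, ...) */
    uint16_t type;      /* data type code, 1..12 in TIFF 6.0 */
    uint32_t count;     /* number of values of that type */
    uint8_t  field[4];  /* raw bytes: the data itself, or a file offset */
} DirEntry;

/* Read 16 or 32 bits in the file's byte order (little != 0 for "II"). */
static uint16_t rd16(const uint8_t *p, int little)
{
    return little ? (uint16_t)(p[0] | (p[1] << 8))
                  : (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t rd32(const uint8_t *p, int little)
{
    return little
        ? ((uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24))
        : (((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3]);
}

/* Bytes per value for type codes 1..12; 0 marks an unknown type. */
static size_t type_size(uint16_t type)
{
    static const uint8_t sizes[13] = {0, 1, 1, 2, 4, 8, 1, 1, 2, 4, 8, 4, 8};
    return type <= 12 ? sizes[type] : 0;
}

/* Decode the 12 bytes at p into a DirEntry. */
static DirEntry parse_de(const uint8_t *p, int little)
{
    DirEntry e;
    e.tag   = rd16(p, little);
    e.type  = rd16(p + 2, little);
    e.count = rd32(p + 4, little);
    memcpy(e.field, p + 8, 4);
    return e;
}

/* Returns 1 when the data fits inside the 4-byte field, 0 when the field
 * holds an offset to data stored elsewhere (or the type is unknown). */
static int de_is_inline(const DirEntry *e)
{
    uint64_t bytes = (uint64_t)type_size(e->type) * e->count;
    return type_size(e->type) != 0 && bytes <= 4;
}
```

The raw field is kept as bytes on purpose: inline values shorter than 4 bytes are left-justified in the field, so a 16-bit value is read with rd16(e.field, little), while an offset is read with rd32(e.field, little).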
So the first 8 bytes in a TIFF are well-defined -- they identify the TIFF and point to the first IFD. After that, everything goes to hell. You jump forward to the first IFD which has you jump forward or back for each DE record's data, then you jump again (forward or backwards) to the next IFD and repeat the jumping around. Because of all of the jumping, TIFF cannot be used as a streaming file format. TIFF's random access destroys most performance gains from pipelining.
Really Creepy
TIFF predates JPEG. JPEG actually uses some of TIFF's features. For example, the first two bytes in the TIFF define the byte ordering. It uses "II" or "MM" (for Intel or Motorola) to define little-endian or big-endian formatting, respectively. If this sounds familiar, it is because JPEG does this too. (Creepy older brother. See? They're related!) JPEG also uses the two-byte tags; JPEG has more predefined tags, but this comes from TIFF. And JPEG inherited the predefined data types and count for data size (but JPEG counts are off by two).
However, there is an incestuous relationship between TIFF and JPEG. I call this the "creepy factor". You see, JPEG introduced a novel method for storing lossy images: quantization tables. With TIFF revision 6.0 (June 1992), they added support for lossy JPEG-style compression. So now you have a horrible image format (JPEG) influencing its older brother, who was a mess to start with.
Common Usage
Between the complexity of the JPEG algorithm, and TIFF's random-access file format, this is really an ugly implementation. Fortunately, few systems use TIFF for lossy compression. In fact, I have yet to see a system that defaults to anything other than TIFF loss-less compression; TIFF supports a run-length compression scheme, modified Huffman compression, and LZW (GIF-style) compression. In general, TIFF is usually uncompressed and used in place of bitmaps.
If most TIFF encoders don't compress, then is TIFF better than a BMP? Well, kind of...
BMPs have variable, platform-dependent header formats. In contrast, TIFF is a well-defined format. TIFF also supports RGB, YCbCr, CMYK, and other color space models; BMP only supports RGB (stored in BGR). However, BMP uses a flat data stream, while TIFF makes you jump all over the place. Then again, if you're just storing an uncompressed RGB image, then BMP is more efficient than TIFF.
Blurring The Truth
Monday, July 13, 2009
In photography, the depth of field (DoF) determines what is in focus. This is commonly controlled by the f-stop ("f" stands for focal ratio).
A low f-stop number (e.g., f/1.4) has a very narrow DoF. Items close to the camera or far from the camera will be blurry. In contrast, a large f-stop number (e.g., f/18 or f/22) has a wide DoF; few things will appear blurry.
How To Blur
Digg recently featured a tutorial on two ways to create a realistic depth of field. This tutorial is very good and shows both before and after images. One method uses a gradient blur, and the other uses a Gaussian blur.
Detecting Realistic DoF
I find the timing of the tutorial to be serendipitous for me. I recently finished developing a cool algorithm that can identify artificial blurs and estimate the blur radius. Gaussian or gradient or motion blur... it does not matter. Most artificial blurs have distinct properties that are very different from natural blurs. The detection algorithm is used to distinguish real from computer generated images and to identify digital manipulation. If the picture should be real, then it should have a real blur.
My algorithm basically looks for a subtle edge around objects. With real photos, this halo is adjacent to edges. But with artificial images, the halo follows the edge of the blur. In effect, artificial blurs replicate the edge at a distance equal to the radius. This "echo" edge is strictly caused by an artificial blur that is computed using a radius. If you see the echo edge, then the image was digitally modified. (The math is kind of fun, but basically the echo lines are separated by a distance equal to the radius, not the diameter. This is also how I know that Photoshop does blurs wrong. A Gaussian blur with a radius of 8 pixels under Gimp, Corel Paint, and any other tool generates the same results as a Photoshop blur with a radius of 4 pixels.)
Why does this echo edge happen? Artificial algorithms use a for-loop that stops at the specified radius. This creates a subtle edge. In real life, God does not use a for-loop.
Since there is no edge at a specific radius, the only visible artifact is the line itself. Of course, digital cameras usually do some kind of adjacent pixel blending (the demosaicing algorithm), and rendered images use supersampling. This means that an edge usually appears as a single line or as a pair of double lines one pixel apart. (That's a blur with a radius of 1 pixel; that's the pixel itself.) However, scaling a picture does not alter or obscure this subtle edge; real pictures look real and artificial blurs look artificial, regardless of scale.
Applying Blurs
So here's an example application. I took a real photo of some kids at a playground. I used a long shutter duration in order to get a natural blur. Mouse over the image to see the edges. Each edge has either a 1-pixel single line, or a double line that is 1 pixel apart. In some cases (like the blurred kids), there is no edge at all. This looks perfectly natural, because it is.
[Image]
Now let's look at the two tutorial pictures with artificial blurs.
[Image]
Notice how the buildings in the back have a thick gap between edges. That's the echo edge. The edges measure about 13 pixels apart, so the radius is about 13 pixels. (If you follow the tutorial, then you can see that he used a radius of 12 pixels. But he used Photoshop, so that should really be about 24 pixels, right? Well, he used a gradient and the buildings fall about half-way through the gradient coloring. So half-way to 12 is about 6 pixels and the Photoshop error makes 6 pixels look like 12. So the measured 13 is close if not dead on.)
[Image]
The light post, clouds, and roof in the background all have wide echo spacing. The tree also shows a wide spacing. Thus, the artificial blur is detected.
Real Life
I always find it problematic when artists represent modified pictures as if they were real. If you're going to say it is real, then it had better be real. Consider the case of Aizar Raldes. He's a photographer for Getty Images and AFP.
I saw some of his work on a collection of solstice photos that were featured on Digg. Mouse over the image to see the echo lines...
[Image]
The echo spacing shows that the hands in focus appear real. The foreground hands have a natural blur (no blur radius). However, the hands in the background all have artificial blurs. The background hands on the left and right show a blur with a radius of 6-7 pixels, while the center-back hands have a blur radius of 7-9 pixels.
This picture is available from the AFP (search for Aizar Raldes) and Getty Images (Getty says they got it from the AFP). Both Getty and AFP say that they don't accept image modifications. According to the AFP:
Date: Thu, 25 Jun 2009 12:27:28 -0400
According to the AFP, the picture is real. Sorry AFP, but I respectfully disagree. Now keep in mind, I'm not saying that a similar picture could not have been done with a camera. I'm saying that in this case, the blur in the background appears artificial and not camera original.
But there is more... You see, Principal Component Analysis also identifies some irregularities.
[Image]
The white halo around the hands in the background is an irregularity. It means that the color along the blurred region is the wrong color for a natural picture. Specifically, this is the first principal component (measured from a vector normal to the plane that passes through the center of mass). According to the PCA, the blurred region is further from PC1 than either edge of the blur. With a natural blur, the coloring forms a gradient between the colors on either side of the blur. In this case, the background blur does not follow any kind of gradient. In fact, if I graph the color space and plot the color on the fingertip (center rear hand, middle finger), middle of the blur, and sky, then it is clear that it is not a gradient. Something is the wrong color, and it is either the sky or the gradient.
[Image]
The PCA also identifies one other thing: there are dots in the sky.
They are uniform in shape and uniform in color. Those aren't dust particles on the lens. I believe that is a paintbrush touching up the sky. If they were dust, then they shouldn't all be round, the same size, same color, and only in the sky.
Focusing on Blurs
Blur detection is just one of many methods for identifying image manipulation. If the image is supposed to be real, then it should have a real blur. The presence of an artificial blur is a giveaway that the image is not realistic.
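The "for-loop that stops at the specified radius" argument from this post can be illustrated with a toy 1-D example. This is only a sketch under stated assumptions, not the detection algorithm described above: it uses a simple box blur (not a Gaussian), a perfect step edge, and hypothetical function names box_blur and find_kinks. Because the averaging window ends abruptly, the blurred ramp bends sharply at two places, each roughly one radius away from where the original edge was.

```c
/* Box-blur a 1-D signal: each output sample is the mean of the
 * 2*radius+1 input samples centred on it (indices are clamped at the
 * ends). The inner loop stops dead at `radius`. */
static void box_blur(const double *in, double *out, int n, int radius)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = -radius; k <= radius; k++) {
            int j = i + k;
            if (j < 0) j = 0;
            if (j >= n) j = n - 1;
            sum += in[j];
        }
        out[i] = sum / (2 * radius + 1);
    }
}

/* Record the positions where the signal bends (non-zero second
 * difference) in `pos`; returns how many were found, at most `max`. */
static int find_kinks(const double *s, int n, int *pos, int max)
{
    int found = 0;
    for (int i = 1; i + 1 < n && found < max; i++) {
        double d2 = s[i - 1] - 2.0 * s[i] + s[i + 1];
        if (d2 > 1e-9 || d2 < -1e-9)
            pos[found++] = i;
    }
    return found;
}
```

For a 64-sample step that jumps between samples 31 and 32, blurred with a radius of 8, the ramp is perfectly straight except for bends at samples 23 and 40, about one radius on either side of the original edge. A real image blur is 2-D and usually Gaussian, so the effect there is subtler than in this toy.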
Posted by Dr. Neal Krawetz in Forensics, Image Analysis, Mass Media at 17:49
Less Confusing
Saturday, July 11, 2009
I periodically get emails from Comcast. Usually they are advertisements that they claim are "service related" emails. However, I just received one that really is a service related email:
From: online.communications@comcast.net
To summarize: Comcast is starting to alter DNS results to protect brands and deter network attacks. Phishers, virus writers, and cybersquatters all register mistyped domain names in order to hijack users. While there is a significant problem with domain squatters and phishers using similar (typo) domain names, there are also some legitimate companies with similar names. For example, bmi.com is really Broadcast Music Inc. and not a typo of ibm.com.
Is this a good thing?
The domain detection is done by a company called Corporate Services Company. I find it ironic that their "secure, promote, and protect brands" offerings don't block them. I mean, they call themselves "CSC" -- the same as Computer Sciences Corporation (CSC). So let's compare:
Non-Compete
Trademarks and copyrights are designed to protect brands and deter consumer confusion. You can have two companies called "McDonalds" as long as they are not in competing fields. (McDonalds disagrees with this, claiming to be "everywhere", but the courts generally look for the competition aspect.) While Corporate Services Company has been around longer than Computer Sciences Corporation, they have never been in competing fields: one does legal and the other does computers. When the legal company expanded into the computer field, they began competing with an established company. Thus, the legal company is causing consumer confusion. (I'm ignoring the Canadian CSC Global Technologies who has been online since 2001 and does video compression. And there are other "CSC" companies, but none appear to be direct competitors to CSC or CSC.)
And this legal company, who is treading on CSC's trademark and consumer space? Yeah, they're the ones offering domain name trademark protections. Is that really more ethically acceptable than phishing?
What is also interesting is their limited use. Aside from the one example domain name that they provided ("http://www.comtcas.com" instead of "http://www.comcast.com"), I cannot find a single mis-typed domain name that they are intercepting. bnakofamerica.com, cmcast.com, ciitbank.com and other domains that I tried all go to cybersquatters, phishers, or other domain protection services. CSC (the lawyers, not the computer folks) does not appear to be stopping anything except their one example.
Host not found: 3(NXDOMAIN)
Then again, Comcast's email says that they only intercept domains when there is no registered recipient. Since phishers, virus writers, and cybersquatters all register domains, I don't see how this service "protects" any brands.
Back in 2003, there was a huge issue with VeriSign. As a domain registrar, they would redirect all unassigned domains to their own site.
This broke some applications, and directly violated how DNS was supposed to work. By October 2003, ICANN ruled on this issue and VeriSign suspended their Site-Finder service. ICANN has said that intercepting unregistered domains is a no-no. I guess Comcast and CSC (the lawyers, not the computer company) didn't get the memo.
