A few words about steganography
I wrote a longer and more detailed document popularizing steganography, which you can download there. In addition, some Python scripts that illustrate the points explained in the pdf are available on GitHub.
Cryptography and steganography
Cryptography is the cornerstone of much of the internet. It makes it possible to encrypt a message in such a way that, if performed correctly, only ‘legitimate’ actors can decipher it in a reasonable amount of time. However, while cryptography hides the content of the message, it does not hide that an exchange of information is taking place. This proves critical in many real-world applications, where the simple knowledge that encrypted communication is taking place between actors is enough to map networks, or to gain information about which kinds of services are exchanged or which kinds of protocols are used, therefore dramatically increasing the attack surface. It can also be a motivation for exerting constraints, legal or physical, in order to obtain the decrypted message.
As a consequence, hiding that communication is taking place is also a capability of strategic importance. This is the aim of steganography, whose goal is to hide messages inside seemingly innocuous covers. Of course, the best approach is to use both cryptography and steganography together so that, even if the steganographic cover gets discovered, the message cannot be deciphered. It is also probably quite difficult to prove, on a legal level, that a message is actually hidden inside an apparently legitimate cover without deciphering it, as most steganography detection methods are only statistical and can produce false positives and false negatives. In addition, well-behaved cryptography turns any structured message into a seemingly random sequence, therefore breaking some of the structure that could be the target of attacks against steganography. The specific task of identifying whether or not a message is hidden in apparently innocuous data is called steganalysis.
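The claim that encryption removes structure from a message can be illustrated with a small sketch. It is only an illustration under stated assumptions: a one-time pad (XOR with random bytes from the standard `secrets` module) stands in for a real cipher, and the helper names `xor_bytes` and `top_bit_fraction` are made up for this example. Plain ASCII text never sets the top bit of a byte, which is exactly the kind of regularity a steganalyst could look for, whereas the encrypted bytes set it about half of the time.

```python
import secrets


def xor_bytes(data, pad):
    """XOR two equal-length byte strings (one-time pad encryption and decryption)."""
    if len(pad) != len(data):
        raise ValueError("pad must be exactly as long as the data")
    return bytes(d ^ p for d, p in zip(data, pad))


def top_bit_fraction(data):
    """Fraction of bytes whose most significant bit is set."""
    return sum(byte >> 7 for byte in data) / len(data)


# Structured plaintext: ASCII text, so the top bit of every byte is 0.
message = b"meet me at the usual place at noon " * 200

# Assumption: a one-time pad replaces a real cipher to keep the sketch
# dependency-free. In practice, use a vetted cryptographic library.
pad = secrets.token_bytes(len(message))
ciphertext = xor_bytes(message, pad)

print("top-bit fraction, plaintext: ", top_bit_fraction(message))
print("top-bit fraction, ciphertext:", top_bit_fraction(ciphertext))
```

Applying `xor_bytes` again with the same pad recovers the original message, which is why the same function serves for both directions.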
Steganography on digital media: real-world use and difficulties
Both cryptography and steganography remain subjects of open research, and probably admit no definitive solutions but rather are the theater of a forever-raging war between the sword and the shield. A good illustration of this last point in the case of cryptography is the fact that encryption methods (and related tools, such as hash functions) come in and out of fashion as vulnerabilities are found. Even mathematically sound methods need to take into account the exponentially increasing power of modern computers, so new recommendations are regularly issued regarding the length of the encryption keys to use. Due to the importance of cryptography, those challenges are well known even to the educated non-specialist, and information on this matter, as well as guidelines, is easy to find. By contrast, this issue is less highlighted in the case of steganography, as its use is less widespread, and the non-specialist may be less aware of steganography methods becoming vulnerable to attacks. Moreover, a lot of the literature about steganalysis is hard to read, and one could argue that little popularization is available.
Due to their large size and inherent noise content, digital images are an appealing support for performing steganography. Some of the simplest steganography techniques are based on modifying the least significant bit (LSB) of bitmap images. Two simple methods are particularly popular: LSB substitution, where the LSBs are replaced by the data to hide, and LSB matching, where the individual pixel (or pixel channel) values are randomly increased or decreased by 1 (both with probability 0.5) if the LSB does not match the bit to hide. The interesting point is that, while both techniques sound very similar, the first one (LSB substitution) is very easy to detect even at very low embedding rates (the proportion of pixels in the image used to hide data): some sources in the literature assert that LSB substitution is detectable as soon as a proportion of as little as 0.05 of the LSBs are modified. By contrast, LSB matching is much more difficult to detect (see the pdf mentioned earlier for some explanation of why there is such a difference in detectability between the two methods). Of course, the more pixels are modified, the more easily steganography can be detected. Some sources in the literature recommend modifying no more than the square root of the total number of LSBs to minimize the risk of detection. So, to summarize, one should be careful about which steganography algorithm is used, and how it is applied, in order to have a good chance that the hidden message goes unnoticed.
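The two embedding methods described above, together with the square-root rule of thumb, can be sketched in a few lines of Python. This is a minimal illustration rather than the scripts mentioned at the top of this page. The function names are made up for the example, the image is modelled as a flat list of 8-bit channel values, the message bits are written to the first positions instead of being scattered with a key, and the handling of the values 0 and 255 in LSB matching is an assumption added to keep values in range.

```python
import math
import random


def bytes_to_bits(data):
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def embed_lsb_substitution(pixels, bits):
    """Overwrite the LSB of the first len(bits) values with the message bits."""
    if len(bits) > len(pixels):
        raise ValueError("message does not fit in the cover")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit
    return stego


def embed_lsb_matching(pixels, bits, rng):
    """Add or subtract 1, each with probability 0.5, where the LSB differs."""
    if len(bits) > len(pixels):
        raise ValueError("message does not fit in the cover")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        if (stego[i] & 1) != bit:
            step = rng.choice((-1, 1))
            # Assumption: at the ends of the range, step inward so that the
            # value stays within 0..255.
            if stego[i] == 0:
                step = 1
            elif stego[i] == 255:
                step = -1
            stego[i] += step
    return stego


def extract_lsb(pixels, n_bits):
    """Read the message back; identical for both embedding methods."""
    return [value & 1 for value in pixels[:n_bits]]


def sqrt_rule_budget(n_lsbs):
    """Largest number of LSBs to modify under the square-root rule of thumb."""
    return math.isqrt(n_lsbs)


rng = random.Random(0)  # fixed seed only to make the demo repeatable
cover = [rng.randrange(256) for _ in range(64)]
bits = bytes_to_bits(b"hi")
stego_substitution = embed_lsb_substitution(cover, bits)
stego_matching = embed_lsb_matching(cover, bits, rng)
```

Both variants change a value by at most 1 and are read back in the same way. The difference is that substitution can only move a value within its pair (2k, 2k+1), while matching can also move it to a neighbouring pair. As an order of magnitude for the square-root rule, a 1000 x 1000 RGB image has 3,000,000 LSBs, so `sqrt_rule_budget(3_000_000)` gives 1732 bits, roughly 216 bytes.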
The choice of the image format on which steganography is performed is also critical to successful hiding of data. Most often in real-life applications one does not need the full level of detail of a multi-megapixel image, and therefore one is willing to use lossy compression to reduce the size of images. The very popular jpg (or jpeg) format, and many video formats, are such lossy compression formats: they take advantage of the characteristics of the human perception system to compress images significantly, with only a moderate visual impression of quality loss (at least, as long as a reasonable compression factor is used). However, performing steganography on such supports requires extra caution, as the compression algorithm introduces new structures in the image, and one must be careful not to disturb them (since the compression and decompression algorithms are accessible to everybody, disturbing those structures could be detected by the steganalyst). In particular, a technique as simple as LSB steganography cannot be applied to jpeg images (or images that have previously been stored as jpeg), as this would create ‘nearly jpeg’ images, i.e. images where the existence of jpeg artifacts is visible while the image is no longer the exact result of jpeg decompression. As a consequence, to perform steganography on jpeg one should alter the compressed data representation used for generating the decompressed image, rather than the decompressed bitmap. This proves difficult, and while a number of methods have been presented to do this, vulnerabilities to steganalysis have been described. It is therefore probably wise to avoid using as a support jpg, jpeg, or other lossy compression images, or any image that has been saved in such a format in its past, or more generally processed in any other way.
If you want to perform steganography and are not a specialist in the field, you may want to stick to .png, .tiff or any other raw data or lossless compression format. While slightly less common than .jpg, those formats are common enough that they do not attract attention (especially if the user pretends to play with photography as a hobby), and the good point is that, as the corresponding images are much larger, you will be able to hide more information per image anyway.
Of course, ‘vanilla’ steganography can be cranked up into more ‘exotic’ uses. One such possibility is to hide several messages inside the same media: for example, one ‘big’ message with non-confidential content can be hidden first, and a much smaller confidential message can be hidden afterwards. If steganography is detected, it will be easy to give up the key to the big message, and if the second message is small enough compared to the big one, statistical methods used to reveal steganography will probably not be able to show with reliable confidence that a bit more data is hidden.
Performing steganography that resists steganalysis comes with a number of challenges. In addition, even if perfectly safe steganography is performed, there are several other issues that should be taken care of to avoid detection. The first, and most obvious one, is that the original supports used (for example, copies of original images before steganography is performed) should be reliably destroyed from all storage media. Otherwise it is trivial for an attacker to simply compare all copies of each possible support, and track changes. Some image formats may also contain several copies of the same object, possibly with different resolutions, and one should be careful not to end up in a situation where two such representations of an image conflict (for example, some .tiff files contain both the full resolution bitmap and a low resolution, downsampled copy for quick display). Some arguably more subtle indications that steganography was performed could involve timestamps on the files, or logs kept by the OS, by the software used or, for example, by the command line if command line tools were used. Compromise of the computers involved could also lead to steganography being revealed.
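To see why a surviving copy of the original cover is so dangerous, here is a minimal sketch of the comparison an attacker could run. It assumes both images are already decoded into flat lists of 8-bit values of the same length; the function names `changed_positions` and `looks_like_lsb_embedding` are hypothetical and chosen for this example only.

```python
def changed_positions(original, suspect):
    """Return the indices at which the two decoded images differ."""
    if len(original) != len(suspect):
        raise ValueError("images must have the same number of values")
    return [i for i, (a, b) in enumerate(zip(original, suspect)) if a != b]


def looks_like_lsb_embedding(original, suspect):
    """True if the copies differ and every difference is exactly +1 or -1."""
    diffs = changed_positions(original, suspect)
    return bool(diffs) and all(abs(original[i] - suspect[i]) == 1 for i in diffs)


original = [10, 200, 37, 54, 128]
suspect = [11, 200, 36, 54, 128]
print(changed_positions(original, suspect))
print(looks_like_lsb_embedding(original, suspect))
```

No statistics are needed here: with the original at hand, the attacker learns not only that something was embedded but also exactly which values carry it, whatever the embedding method.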