Zip Files
The first compound file format that we will look at is the Zip file, as specified in the document “APPNOTE.TXT – .ZIP File Format Specification”, revision date January 6, 2006, from PKWARE, Inc. For complete details of the file format specification, please refer to the hyperlink to the document, listed on page 1. The information described below applies to most common Zip files created with current versions of Zip archive utilities, such as WinZip.
A Zip file is broken into specific parts that can be searched for and identified by their separate signatures. The first part of a Zip file holds the individual compressed files within the archive.
These individual files are known as “local files”. Each one starts with a local file header, identified by the signature “50 4B 03 04”, followed by the file data for the compressed local file, and then optionally by a data descriptor, which can usually be identified by the signature “50 4B 07 08”. This sequence of local file header, file data, and data descriptor repeats for each local file within the archive. The local file header contains the local file’s compressed size, which counts only the compressed file data and not the bytes of the header itself, unless bit 3 of the 2-byte general purpose flag located at offset 0x06 in the header is set. If this bit is set, the compressed size is stored in the “data descriptor” that immediately follows the local file’s data, and is also stored in the central directory record for the local file, as part of the central directory that is located after all individual local files in the archive.
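As a minimal sketch of reading these fields, the Python function below decodes the fixed 30-byte portion of the header that begins with “50 4B 03 04” (the PKWARE specification calls it the local file header). The function name parse_local_header and the dictionary keys are illustrative assumptions, not part of any tool described here.

```python
import struct

LOCAL_SIG = b"PK\x03\x04"  # the bytes 50 4B 03 04


def parse_local_header(buf, pos=0):
    """Decode the fixed 30-byte part of a local file header at buf[pos:].

    Hypothetical helper for illustration; field layout follows APPNOTE.TXT.
    """
    if buf[pos:pos + 4] != LOCAL_SIG:
        raise ValueError("no local file header signature at this offset")
    # Little-endian fields that follow the 4-byte signature (26 bytes).
    (version, flags, method, mtime, mdate, crc32,
     comp_size, uncomp_size, name_len, extra_len) = struct.unpack_from(
        "<HHHHHIIIHH", buf, pos + 4)
    return {
        "flags": flags,                                # 2 bytes at offset 0x06
        "has_data_descriptor": bool(flags & 0x0008),   # bit 3 of the flags
        "compressed_size": comp_size,                  # 4 bytes at offset 0x12
        "header_length": 30 + name_len + extra_len,    # fixed part + name + extra
        "name": buf[pos + 30:pos + 30 + name_len],
    }
```

When has_data_descriptor is true, the compressed_size read here is normally zero and the real value has to come from the data descriptor or the central directory.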
The central directory at the end of each Zip archive can be identified by searching for the signature “50 4B 01 02”, which marks the beginning of each central directory record contained within the central directory. Lastly, the signature “50 4B 05 06” identifies the “End of Central Directory Record”, which gives the size in bytes of the central directory and its starting offset relative to the beginning of the first local file header in the archive.
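The “End of Central Directory Record” can be read with a few lines of Python. This is only a sketch: the name find_eocd is an assumption, it takes the last occurrence of the signature in the buffer, and it ignores the possibility that the same four bytes appear inside an archive comment or that the archive uses Zip64 extensions.

```python
import struct

EOCD_SIG = b"PK\x05\x06"  # the bytes 50 4B 05 06


def find_eocd(buf):
    """Return (eocd_pos, cd_size, cd_offset, entry_count), or None if not found.

    Hypothetical helper for illustration; field layout follows APPNOTE.TXT.
    """
    pos = buf.rfind(EOCD_SIG)
    if pos < 0 or pos + 22 > len(buf):
        return None  # no signature, or the 22-byte fixed record is truncated
    (disk_no, cd_disk, entries_here, entries_total,
     cd_size, cd_offset, comment_len) = struct.unpack_from(
        "<HHHHIIH", buf, pos + 4)
    # cd_size is stored at offset 0x0C and cd_offset at offset 0x10 of the record.
    return pos, cd_size, cd_offset, entries_total
```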
Upon identifying the signature “50 4B 05 06”, use the size and starting offset information in the “End of Central Directory Record” to search backwards from the beginning of the “50 4B 05 06” signature by the correct number of bytes (directory size + starting offset), and determine whether that leaves you at the signature “50 4B 03 04”, which is the beginning of the first local file and the start of the archive.
The same search can also be performed in a forward manner, starting at the first “50 4B 03 04” you find, searching forward to the first “50 4B 05 06” you find, and comparing the distance between the two with the sum of the directory size and the starting offset, which are stored as 4-byte values at offsets 0x0C and 0x10 of the “End of Central Directory Record”.
If the location of the “End of Central Directory Record” is at a further offset than your calculation, then you have a fragmented archive file. The difference between the actual location and your calculation is the size of the fragmented block of data that doesn’t belong to the archive file. The next step is determining where the fragment occurs and distinguishing between the archive data and the fragment(s) that don’t belong to the file.
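The comparison described above comes down to a small piece of arithmetic. The sketch below, with the assumed name fragment_gap, takes the first “50 4B 03 04” and the last “50 4B 05 06” in a raw image and reports how many bytes further away the end record is than the directory size plus starting offset predict. It assumes the image contains a single archive and that neither signature occurs by chance in unrelated data.

```python
import struct


def fragment_gap(image):
    """Return the number of foreign bytes between archive start and EOCD.

    Hypothetical helper: 0 means the archive looks contiguous, a positive
    value is the total size of the data that does not belong to it.
    """
    start = image.find(b"PK\x03\x04")   # first local file signature
    eocd = image.rfind(b"PK\x05\x06")   # End of Central Directory Record
    if start < 0 or eocd < 0 or eocd + 22 > len(image):
        return None
    # Directory size at offset 0x0C, starting offset at 0x10 of the record.
    cd_size, cd_offset = struct.unpack_from("<II", image, eocd + 12)
    expected_eocd = start + cd_offset + cd_size
    return eocd - expected_eocd
```

For example, if the first local file starts at byte 4096, the record reports a starting offset of 10000 and a directory size of 200, and the “50 4B 05 06” signature is actually found at byte 22488, the gap is 22488 - (4096 + 10000 + 200) = 8192 bytes.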
To do this, we next look at the data descriptor, if present, at the end of each local file in the archive, or at the individual central directory record for each local file in the central directory. The compressed size of the local file, which counts only the compressed data and not the local file header, is located at offset 0x14 of each individual central directory record, which starts with the signature “50 4B 01 02”.
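A rough Python sketch of collecting these values from the central directory follows. The name read_central_directory and the dictionary keys are assumptions for illustration; the field positions are those given in the PKWARE specification, and Zip64 archives are not handled.

```python
import struct

CD_SIG = b"PK\x01\x02"  # the bytes 50 4B 01 02


def read_central_directory(buf, cd_pos, count):
    """Return one dict per central directory record, starting at buf[cd_pos].

    Hypothetical helper; stops early if a record signature is missing.
    """
    entries = []
    pos = cd_pos
    for _ in range(count):
        if buf[pos:pos + 4] != CD_SIG:
            break
        (flags,) = struct.unpack_from("<H", buf, pos + 8)
        # Compressed size at offset 0x14, uncompressed size at 0x18.
        comp_size, uncomp_size = struct.unpack_from("<II", buf, pos + 20)
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", buf, pos + 28)
        # Offset of the matching local file header, relative to archive start.
        (local_offset,) = struct.unpack_from("<I", buf, pos + 42)
        entries.append({
            "name": buf[pos + 46:pos + 46 + name_len],
            "flags": flags,
            "compressed_size": comp_size,
            "local_offset": local_offset,
        })
        # Each record is 46 fixed bytes plus three variable-length fields.
        pos += 46 + name_len + extra_len + comment_len
    return entries
```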
Once you have determined the starting point of each local file in the archive from its signature “50 4B 03 04”, and the length of the local file from either the data descriptor at the end of the local file or its central directory record at the end of the archive, you can determine which individual local file(s) contain the fragmented portion of the overall archive.
Starting from the first local file header, go forward by the length of that header (30 bytes plus the lengths of its file name and extra field), then by the “size of compressed file” found in either of the two locations above, and then by the length of the data descriptor if one is present. This should bring you to the start of the next local file header. If it does, the first local file is not fragmented. Continue with this method until there is a difference between the expected start of the next local file header and the ACTUAL start of that header. The size of the difference is the amount of fragmentation that has occurred. Compare this difference with the overall difference noted earlier, between the calculated and the actual location of the “End of Central Directory Record”, to determine whether this is the entire amount of fragmentation within the archive or whether more instances of fragmentation exist in other local files in the archive.
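The walk described above can be sketched in Python as follows. Instead of adding up header and data lengths by hand, this version assumes you already have the local header offsets recorded in the central directory (relative to the start of the archive, in ascending order); in a contiguous archive the difference between consecutive offsets is the full on-disk length of each local file. The name locate_fragments and its arguments are assumptions, and the forward search can be fooled if the foreign data itself contains “50 4B 03 04”.

```python
LOCAL_SIG = b"PK\x03\x04"  # the bytes 50 4B 03 04


def locate_fragments(image, archive_start, local_offsets):
    """Return a list of (local_file_index, gap_in_bytes) for displaced headers.

    Hypothetical helper. archive_start is the position of the first local
    file in the image; local_offsets come from the central directory.
    A fragment inside the last local file is not detected here, because
    no local file header follows it.
    """
    shift = 0          # total foreign bytes found so far
    findings = []
    for index, offset in enumerate(local_offsets):
        expected = archive_start + offset + shift
        if image[expected:expected + 4] == LOCAL_SIG:
            continue   # header is where it should be; previous file is intact
        actual = image.find(LOCAL_SIG, expected)
        if actual < 0:
            break      # no further local file found in this image
        gap = actual - expected
        # The displaced header means the PREVIOUS local file holds the fragment.
        findings.append((index - 1, gap))
        shift += gap
    return findings
```

The sum of the reported gaps can then be compared with the overall difference measured at the “End of Central Directory Record”; any remainder points to further fragmentation, for instance inside the last local file or the central directory.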
Once all individual local files in the archive that contain fragmentation are identified, and the size of the fragmentation is noted, you review the sectors of the fragmented local files for a block of data, the size of the identified fragment, that doesn’t belong. How difficult this is to determine depends on the type of data in the fragment.