GCF Format

From Tommunism

ToDo: Synopsis

This document only covers version 6 of the GCF format (and version 1 of the NCF format). This is to keep the documentation simple for new readers of the file format. The order of these sections is the same order that they appear in the GCF/NCF files.

Notes that are in blue are important for parsing NCF files. (Most notably which structures are not in the NCF format.)


File Header

struct GCFHeader
{
    DWORD Dummy0;
    DWORD CacheType;
    DWORD FormatVersion;
    DWORD CacheID;
    DWORD CacheVersion;
    DWORD Updating;
    DWORD Dummy1;
    DWORD FileSize;
    DWORD BlockSize;
    DWORD BlockCount;
    DWORD Checksum;
}
  • Dummy0 is always 0x00000001. This could presumably be a "major" revision number for the format.
  • CacheType is always 0x00000001 for GCF files and 0x00000002 for NCF files.
  • FormatVersion is the file format version. The latest version number for GCF files is 6. The latest version number for NCF files is 1.
  • CacheID is the application ID of the cache. All the current application IDs are stored inside the ClientRegistry.blob file, within the ContentDescriptionRecord entry.
  • CacheVersion identifies which version of the application the cache file represents, so that updates can be tracked. All versions are stored inside the ClientRegistry.blob file, within the ContentDescriptionRecord entry.
  • Updating is set to 0x00000001 while the GCF is being updated; otherwise it is set to 0x00000000. (Unverified)
  • FileSize is the total size of the GCF file. NCF files set this field to 0.
  • BlockSize represents how many bytes are in each segment of a file. This is to improve cache writing performance. NCF files set this field to 0.
  • BlockCount represents how many total segments of files are stored in the cache. Note that this is not the same as a "file count." NCF files set this field to 0.
  • Checksum is used to validate the header. Currently, the checksum is calculated by adding together every byte in the header (except the bytes of the Checksum field itself, of course). See the following C code:
// This is not a safe implementation, it is merely a basic example of
// how the algorithm works.
DWORD GcfHeaderChecksum(BYTE *headerData)
{
    DWORD checksum = 0;
    int i;
    // 4 * 10: 4 is the size of a DWORD and 10 is the number of
    // DWORDs to sum.
    for (i = 0; i < (4 * 10); i++)
        checksum += headerData[i];
    return checksum;
}
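Putting the field rules above together, a basic header check might look like the following sketch. It assumes the header is stored little-endian (the format comes from x86 Windows, but this document does not state it explicitly), and the names ReadDword and GcfHeaderLooksValid are made up for this example. It only accepts the two versions this document covers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  BYTE;   /* stand-ins for the Windows types used above */
typedef uint32_t DWORD;

/* Reads one DWORD from a byte buffer, assuming little-endian storage. */
static DWORD ReadDword(const BYTE *p)
{
    return (DWORD)p[0] | ((DWORD)p[1] << 8) |
           ((DWORD)p[2] << 16) | ((DWORD)p[3] << 24);
}

/* Returns 1 if the 44-byte buffer looks like a GCF v6 or NCF v1 header. */
static int GcfHeaderLooksValid(const BYTE *header)
{
    DWORD sum = 0;
    size_t i;

    assert(header != NULL);

    /* Byte-wise sum of the first 10 DWORDs (everything before Checksum). */
    for (i = 0; i < 4 * 10; i++)
        sum += header[i];

    DWORD cacheType = ReadDword(header + 4);
    DWORD version   = ReadDword(header + 8);

    if (ReadDword(header + 0) != 1)               /* Dummy0 */
        return 0;
    if (!((cacheType == 1 && version == 6) ||     /* GCF, version 6 */
          (cacheType == 2 && version == 1)))      /* NCF, version 1 */
        return 0;
    return sum == ReadDword(header + 40);         /* Checksum */
}
```

A caller would read the first 44 bytes of the file into a buffer and reject the file if this returns 0.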

Blocks

This section is not present in NCF files and should be ignored when processing them.

Header

struct GCFBlockEntryHeader
{
    DWORD BlockCount;
    DWORD BlocksUsed;
    DWORD Dummy0;
    DWORD Dummy1;
    DWORD Dummy2;
    DWORD Dummy3;
    DWORD Dummy4;
    DWORD Checksum;
}
  • BlockCount represents the number of blocks in the cache. This should be the same as in the GCFHeader.
  • BlocksUsed represents the number of blocks that actually have data. This should be the same as BlocksUsed in the GCFDataBlockHeader. (The GCFHeader has no such field.)
  • Checksum is used to validate the header. Currently, the checksums are calculated by adding all of the previous DWORDs together.
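The DWORD-sum rule can be sketched as follows. The helper names SumDwords and BlockEntryHeaderChecksumOk are hypothetical, and the sketch assumes that the sum simply wraps around on overflow, as unsigned arithmetic does in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DWORD;

/* Adds `count` DWORDs together; overflow wraps around modulo 2^32. */
static DWORD SumDwords(const DWORD *values, size_t count)
{
    DWORD sum = 0;
    size_t i;

    assert(values != NULL || count == 0);

    for (i = 0; i < count; i++)
        sum += values[i];
    return sum;
}

/* GCFBlockEntryHeader viewed as 8 DWORDs: 7 fields followed by Checksum. */
static int BlockEntryHeaderChecksumOk(const DWORD header[8])
{
    return SumDwords(header, 7) == header[7];
}
```

The same helper should also fit the other headers that use this rule, as long as the right number of leading DWORDs is passed in.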

Entries

This is an array, with the size being GCFHeader.BlockCount.

struct GCFBlockEntry
{
    DWORD EntryFlags;
    DWORD FileDataOffset;
    DWORD FileDataSize;
    DWORD FirstDataBlockIndex;
    DWORD NextBlockEntryIndex;
    DWORD PreviousBlockEntryIndex;
    DWORD DirectoryIndex;
}
  • EntryFlags represents the type of block. These are the currently known values:
    • 0x200F8000 - Block contains data.
    • 0x200F0000 - Block contains no data or is unused.
    • 0x200FC000 - Block contains data. (Read only?)
  • FileDataOffset defines at which offset in the extracted file this block of data is located.
  • FileDataSize defines the length of the data in this block entry.
  • FirstDataBlockIndex defines the index to the first data block of this block entry's data.
  • NextBlockEntryIndex defines the next block entry in the series. If this value is equal to the BlockCount then there are no more blocks in the series.
  • PreviousBlockEntryIndex defines the previous block entry in the series. If this value is equal to the BlockCount then this is the first block in the series.
  • DirectoryIndex represents the index of the directory entry that this block entry belongs to.

Fragmentation Map

This section is not present in NCF files and should be ignored when processing them.

The fragmentation map provides a simple way to track chunks of files even though they may not be stored contiguously in the archive. For more information about this concept, look at this article on fragmentation. (Fragmentation is a side effect of storing files as blocks instead of as a whole.)

Header

struct GCFFragmentationMapHeader
{
    DWORD BlockCount;
    DWORD FirstUnusedEntry;
    DWORD Terminator;
    DWORD Checksum;
}
  • BlockCount represents the number of blocks in the cache. This should be the same as in the GCFHeader and GCFBlockEntryHeader.
  • FirstUnusedEntry represents the index of the first unused fragmentation map entry.
  • Terminator defines the fragmentation map terminator. If the value is 0, then the terminator is 0x0000FFFF; if the value is 1, then the terminator is 0xFFFFFFFF.
  • Checksum is used to validate the header. Currently, the checksums are calculated by adding all of the previous DWORDs together.

Entries

This is an array, with the size being GCFHeader.BlockCount.

struct GCFFragmentationMapEntry
{
    DWORD NextDataBlockIndex;
}
  • NextDataBlockIndex is the index of the next block of the file. If the value is equal to the terminator (defined by GCFFragmentationMapHeader.Terminator), then there are no more blocks in the file.
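As a rough illustration, the chain of data blocks can be followed like this. The function names are invented for this example, and the extra stop conditions (out-of-range index, step limit) are my own safeguards against corrupt maps rather than anything the format defines. Any non-zero Terminator value is treated like 1 here, which is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DWORD;

/* Maps GCFFragmentationMapHeader.Terminator to the end-of-chain marker. */
static DWORD FragmentationTerminator(DWORD terminatorField)
{
    return terminatorField == 0 ? 0x0000FFFFu : 0xFFFFFFFFu;
}

/*
 * Follows the chain starting at firstBlock and stores up to maxOut data
 * block indices in `out`. Returns the number stored. The walk also stops
 * at an out-of-range index or after blockCount steps, so that a corrupt
 * (cyclic) map cannot loop forever.
 */
static size_t CollectDataBlocks(const DWORD *map, DWORD blockCount,
                                DWORD terminatorField, DWORD firstBlock,
                                DWORD *out, size_t maxOut)
{
    DWORD end = FragmentationTerminator(terminatorField);
    DWORD block = firstBlock;
    size_t n = 0;

    assert(map != NULL && out != NULL);

    while (block != end && block < blockCount &&
           n < maxOut && n < blockCount) {
        out[n++] = block;
        block = map[block];
    }
    return n;
}
```

The starting index would come from GCFBlockEntry.FirstDataBlockIndex.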

Directory

This is where all of the metadata for the files is located. It is sometimes also known as the "manifest".

Header

struct GCFDirectoryHeader
{
    DWORD Dummy0;
    DWORD CacheID;
    DWORD CacheVersion;
    DWORD ItemCount;
    DWORD FileCount;
    DWORD FileChecksumSize;
    DWORD DirectorySize;
    DWORD NameSize;
    DWORD HashTableKeyCount;
    DWORD CopyCount;
    DWORD LocalCount;
    DWORD Bitmask;
    DWORD Fingerprint;
    DWORD Checksum;
}
  • Dummy0 is always 0x00000004. This is most likely some kind of version. Otherwise it is just something left over from the network protocol.
  • CacheID is the application ID of the cache.
  • CacheVersion is the version of the application.
  • ItemCount represents how many directory entries there are.
  • FileCount represents how many files are in the cache.
  • FileChecksumSize defines how many bytes are used per checksum. (Checksums are defined later in the file.)
  • DirectorySize defines how many bytes are in the rest of the directory. (This is every structure here except for this header.)
  • NameSize defines the number of bytes in the name table.
  • HashTableKeyCount represents how many hash table keys there are.
  • CopyCount represents how many files to copy.
  • LocalCount represents how many files to keep local.
  • Bitmask is, as the name says, a bit mask of various flags and values. These are the known masks:
    • 0x00000001 - Build Mode (Purpose unknown)
    • 0x00000002 - Is Purge All (Purpose unknown)
    • 0x00000004 - Is Long Roll (Purpose unknown. I think this has to do with when there are other GCF files in a "chain", especially language caches.)
    • 0xFFFFFF00 - Depot Key (I have no clue how this is generated. But if it fails, it will "purge all files". I think it has something to do with when Steam is updating a GCF rather than building a new one.)
  • Fingerprint is completely unknown. But the name was derived from "dormine's" analysis. My guess is that this is randomly generated by the client or server to uniquely identify a GCF file (and/or version). It could also be a hash of something (like the filename maybe). It seems bizarre that this field is checked against the server's version, however it is not included in the checksum, which means that this can be arbitrarily modified by the client. (I think this field has something to do with updating a GCF file. It probably postpones updating this field until the GCF has finished updating and it has been verified. So my guess is that it's randomly generated on the server.)
  • Checksum is somewhat complicated, as it's calculated using the entire directory, excluding a few fields. The following C code shows how the checksum is calculated:
// This is not a safe implementation, it is merely a basic example of
// how the checksum works. Also, you need to use the zlib library
// (http://www.zlib.net/).
 
// This version modifies the data, though it does set it back to its
// original state after the checksum has completed. If you need to
// calculate it without modifying the bytes, then use the other
// function.
DWORD GcfDirectoryChecksum(BYTE *directoryData, DWORD directorySize)
{
    DWORD *directoryDataDword = (DWORD *)directoryData;
    DWORD tempFingerprint, tempChecksum, checksum;
 
    // Store the non-calculated fields into variables temporarily
    // and replace them with zero for the checksum to properly
    // calculate.
    tempFingerprint = directoryDataDword[12]; directoryDataDword[12] = 0;
    tempChecksum    = directoryDataDword[13]; directoryDataDword[13] = 0;
 
    // Calculate checksum.
    checksum = adler32(0, directoryData, directorySize);
 
    // Fix temp fields.
    directoryDataDword[12] = tempFingerprint;
    directoryDataDword[13] = tempChecksum;
 
    // Return the checksum.
    return checksum;
}
 
// Another way to calculate the checksum (without modifying the bytes).
// C has no function overloading, so this variant needs its own name.
DWORD GcfDirectoryChecksumConst(const BYTE *directoryData, DWORD directorySize)
{
    // Temporary "replacement" bytes for non-calculated fields.
    const BYTE nullBytes[8] = {
        0, 0, 0, 0,
        0, 0, 0, 0
    };
 
    // Calculate checksum in a 3-part process.
    DWORD checksum = 0;
    checksum = adler32(checksum, directoryData, 4 * 12);
    checksum = adler32(checksum, nullBytes, 8);
    checksum = adler32(checksum, directoryData + (4 * 14),
                       directorySize - (4 * 14));
 
    // Return the checksum.
    return checksum;
}

Entries

This is an array, with the size being GCFDirectoryHeader.ItemCount.

struct GCFDirectoryEntry
{
    DWORD NameOffset;
    DWORD ItemSize;
    DWORD ChecksumIndex;
    DWORD DirectoryFlags;
    DWORD ParentIndex;
    DWORD NextIndex;
    DWORD FirstIndex;
}
  • NameOffset defines the offset in the name table where the name of the item is located. The name is a C string.
  • ItemSize defines the size of the item. If the item is a file, then it is the number of bytes in the file; otherwise, if the item is a directory, then it is the number of files in the directory.
  • ChecksumIndex defines the index of the checksums for the file in the checksum map. If the item is a folder, then this value is 0xFFFFFFFF.
  • DirectoryFlags defines various flags for the item. These are the currently known flags:
    • 0x00004000 - The item is a file.
    • 0x00000800 - The item is executable. (Unverified)
    • 0x00000400 - The item is hidden. (Unverified)
    • 0x00000200 - The item is read-only. (Unverified)
    • 0x00000100 - The item is encrypted.
    • 0x00000080 - The item is a purge file. (Unverified)
    • 0x00000040 - Backup the item before overwriting it. (Versioned Uc File (Unverified))
    • 0x00000020 - The item is a no-cache file. (Unverified)
    • 0x0000000a - The item is to be copied to the disk. (This is actually a combination of launch file and locked flags.)
    • 0x00000008 - The item is locked. (Unverified)
    • 0x00000002 - The item is a launch file. (Unverified)
    • 0x00000001 - The item is a user config file. Don't overwrite the item if copying it to the disk and the item already exists.
  • ParentIndex defines the index to the container for the item. This is always a reference to a folder. If the item is at the root, then the value is 0xFFFFFFFF.
  • NextIndex defines the next item in the current hierarchy. If there are no more items, then the value is 0x00000000.
  • FirstIndex defines the first item in the current hierarchy. If there is no first item, then the value is 0x00000000.

Name Table

This is simply a byte array of C strings. They are referenced by offset by the directory entries. The number of bytes in this table is GCFDirectoryHeader.NameSize.
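Combining the name table with ParentIndex gives the full path of an item. The sketch below is one possible way to do it; BuildPath is a hypothetical name, the backslash separator and the leading backslash are assumptions on my part, and NameOffset is not checked against NameSize here (real code should do that).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t DWORD;

struct GCFDirectoryEntry {
    DWORD NameOffset, ItemSize, ChecksumIndex, DirectoryFlags;
    DWORD ParentIndex, NextIndex, FirstIndex;
};

/*
 * Writes the path of entry `index` into `out`, e.g. "\maps\a.bsp".
 * Returns 0 on success, or -1 if the buffer is too small or the parent
 * chain is broken.
 */
static int BuildPath(const struct GCFDirectoryEntry *entries, DWORD itemCount,
                     const char *names, DWORD index, char *out, size_t outSize)
{
    size_t pos = outSize;   /* the path is assembled backwards from the end */
    DWORD steps = 0;

    assert(entries != NULL && names != NULL && out != NULL);
    if (outSize == 0 || index >= itemCount)
        return -1;
    out[--pos] = '\0';

    /* Climb towards the root; the root itself (no parent) adds no name. */
    while (entries[index].ParentIndex != 0xFFFFFFFFu) {
        const char *name = names + entries[index].NameOffset;
        size_t len = strlen(name);

        if (len + 1 > pos || ++steps > itemCount)
            return -1;
        pos -= len;
        memcpy(out + pos, name, len);
        out[--pos] = '\\';

        index = entries[index].ParentIndex;
        if (index >= itemCount)
            return -1;
    }
    memmove(out, out + pos, outSize - pos);
    return 0;
}
```

Building the string from the end avoids having to know the depth of the item in advance.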

Hash Table

The hash table is somewhat bizarre, as it actually consumes two arrays rather than just one like all the other sections. In previous documentation (before it was known what these arrays were for), these were known simply as Info1 and Info2. Both of them are simple DWORD arrays; Info1 contains HashTableKeyCount entries (written HashTableKeysCount in the examples below), while Info2 contains ItemCount entries. These have been renamed to HashTableKeys (previously Info1), HashTableKeysCount (previously Info1Count), and HashTableIndices (previously Info2) respectively.

Now, it is possible to use only one array to store the hash table. It would appear that the way the indices are referenced would lead one to assume that this is what VALVe does internally. So do not assume that this is the only way to parse/use the hash table. The reason for not documenting it this way is because it makes it too complicated for the average reader; understanding this is hard enough using two separate arrays... thus optimizing the implementation would make it even more difficult to understand.

For more information about the kind of hash table being used, please read the articles on hash tables and coalesced hashing.

The below is a modified excerpt from the original forum thread which found this information.

First are the HashTableKeys; the HashTableKeysCount is given in parentheses.

Index > Value        Value-HashTableKeysCount (index to HashTableIndices)

The reason why the value is stored with the count of the HashTableKeys is so that you can directly reference the DWORD index from the start of HashTableKeys, rather than moving the pointer to HashTableIndices and then referencing the index there. (This seems like a very small and rather confusing "optimization" at first glance.)

Then come the HashTableIndices entries; the HashTableIndices count (equal to the directory entry count) is given in parentheses.

Index (Index relative to HashTableKeys) > Value&80000000!=0        Value&7FFFFFFF        Filename using Value&7FFFFFFF as the Index

Note that the number of entries in HashTableKeys is always a power of 2. E.g., 2^x = HashTableKeysCount. (This has to do with the hash table search algorithm.)

Example: Codename Gordon

-HashTableKeys----( 4)-------------------
      0  >  4     0
      1  >  6     2
      2  > 10     6
      3  > 12     8
-HashTableIndices-(15)-------------------
  0 ( 4) >  0     1  cg.exe
  1 ( 5) >  1     3  cg_languages.xml
  2 ( 6) >  0     0  (null)
  3 ( 7) >  0     2  cg_highscore.swf
  4 ( 8) >  0    11  dialogs_korean.xml
  5 ( 9) >  1    14  dialogs_tchinese.xml
  6 (10) >  0     5  cg_victims.xml
  7 (11) >  1    10  dialogs_japanese.xml
  8 (12) >  0     4  cg_version.swf
  9 (13) >  0     6  dialogs_english.xml
 10 (14) >  0     7  dialogs_french.xml
 11 (15) >  0     8  dialogs_german.xml
 12 (16) >  0     9  dialogs_italian.xml
 13 (17) >  0    12  dialogs_schinese.xml
 14 (18) >  1    13  dialogs_spanish.xml
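The table above can be walked with a small helper like the one below, which lists the file IDs of a single bucket without doing any hashing. BucketFileIds is a name made up for this sketch; the 0xFFFFFFFF "empty bucket" value is an assumption based on how the lookup code treats it, and the bounds checks are my own additions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DWORD;

/*
 * Lists the file IDs stored in one bucket of the hash table.
 * keys/keyCount describe HashTableKeys, indices/indexCount HashTableIndices.
 * Returns the number of IDs written to `out` (at most maxOut).
 */
static size_t BucketFileIds(const DWORD *keys, DWORD keyCount,
                            const DWORD *indices, DWORD indexCount,
                            DWORD bucket, DWORD *out, size_t maxOut)
{
    size_t n = 0;
    DWORD i;

    assert(keys != NULL && indices != NULL && out != NULL);
    if (bucket >= keyCount || keys[bucket] == 0xFFFFFFFFu ||
        keys[bucket] < keyCount)
        return 0;

    /* Key values are offset by keyCount, so subtract it first. */
    for (i = keys[bucket] - keyCount; i < indexCount && n < maxOut; i++) {
        out[n++] = indices[i] & 0x7FFFFFFFu;
        if (indices[i] & 0x80000000u)   /* high bit marks the end of the chain */
            break;
    }
    return n;
}
```

With the Codename Gordon data, bucket 2 has the key value 10, which maps to HashTableIndices[6]. The chain there holds file IDs 5 (cg_victims.xml) and 10 (dialogs_japanese.xml), the latter carrying the end-of-chain bit.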

Here is some code to show how the hash table works. The GCF/NCF directory data formerly labelled as unknown (Info1 and Info2) is a hash table of file names. This data is not required to access files within a GCF, as you can simply enumerate through the directory tree, but that operation is slow on large directories. VALVe had the idea to add a way to look up file IDs by name. What I call HashTableKeyCount is the former Info1Count, HashTableKeys are Info1Entries, and HashTableIndices is Info2Entries.

The first function quickly locates all files with a given name. The second function retrieves a specific file ID from its path. This method is only interesting for very big GCFs with many files; otherwise it is surely faster to split the path into its elements and enumerate the directory tree.

Note: jenkinsLookupHash2 is the hash function from Bob Jenkins' lookup2.c.

// The maximum number of results to retrieve. (This number is probably overkill.)
#define RESULTS_SIZE 200
 
int findFileIds(GCF *gcf, char *filename, DWORD *ids, int maxResults)
{
    if (maxResults <= 0)
        return 0;
 
    int nbResults = 0;
 
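    // Copy the file name and make the copy lowercase, leaving the
    // caller's string untouched. (strlwr and stricmp are non-standard
    // functions provided by Microsoft's C runtime.)
    char *fileName = (char *)malloc(strlen(filename) + 1);
    if (fileName == NULL)
        return 0;
    strcpy(fileName, filename);
    strlwr(fileName);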
 
    // Compute HashTableKeys index.
    DWORD hash = jenkinsLookupHash2((unsigned char *)fileName, strlen(fileName), 1);
    DWORD hashIndex = hash & (gcf->directory->HashTableKeysCount - 1);
 
    // Get the HashTableIndexes index.
    DWORD hashedFileIndex = gcf->directory->HashTableKeys[hashIndex];
    if (hashedFileIndex == -1)
    {
        // Clean up.
        free(fileName);
        // File not found.
        return 0;
    }
    // Not sure why Valve implemented this way.
    hashedFileIndex -= gcf->directory->HashTableKeysCount;
 
    // Search for the file in the index chain.
    DWORD stop = 0;
    do
    {
        DWORD hashedValue = gcf->directory->HashTableIndices[hashedFileIndex];
        DWORD fileId = hashedValue & 0x7FFFFFFF;
        stop = hashedValue & 0x80000000;
 
        if (!stricmp(fileName, gcf->directory->directoryNames + gcf->directory->entry[fileId].nameOffset))
        {
            // File found, add it to the results array.
            ids[nbResults++] = fileId;
        }
        hashedFileIndex++;
    } while (!stop && nbResults < maxResults);
 
    // Clean up.
    free(fileName);
    // Return the number of entries found.
    return nbResults;
}
 
DWORD getFileIdFast(GCF *gcf, char *path)
{
    // Clean the path and get just the file name.
    // (This looks more complicated than it really is.)
    int correct = (*path != '\\' && *path != '/') ? 1 : 0;
    char *p = (char *)malloc(strlen(path) + 1 + correct);
    if (correct)
        *p = '\\';
    strcpy(p + correct, path);
 
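    // Normalize forward slashes to backslashes.
    char *slash = NULL;
    while ((slash = strchr(p, '/')) != NULL)
        *slash = '\\';
 
    // p always starts with a backslash at this point, so strrchr should
    // never return NULL; the check is kept as a safeguard.
    char *fileName = strrchr(p, '\\');
    fileName = (fileName != NULL) ? fileName + 1 : p;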
 
    // Allocate an array of results.
    DWORD results[RESULTS_SIZE];
    int nbFiles = findFileIds(gcf, fileName, results, RESULTS_SIZE);
 
    // Try to match one of the results with the wanted file/path.
    char filePath[1000];
    for (int i = 0; i < nbFiles; i++)
    {
        // Though the code for this does not exist, it simply
        // gets the full path of the file.
        getFilePath(gcf, results[i], filePath);
 
        if (!stricmp(filePath, p))
        {
            // Clean up.
            free(p);
            // File found.
            return results[i];
        }
    }
 
    // Clean up.
    free(p);
    // File not found.
    return -1;
}

Copy Entries

This is an array, with the size being GCFDirectoryHeader.CopyCount. All referenced files in this array should be extracted from the GCF to the operating system's local file system.

struct GCFDirectoryCopyEntry
{
    DWORD DirectoryIndex;
}
  • DirectoryIndex represents the index of the file to be copied.


Local Entries

This is an array, with the size being GCFDirectoryHeader.LocalCount. All referenced files in this array should be extracted from the GCF to the operating system's local file system. The local version should always have priority over the stored GCF version, regardless of any changes made to either.

struct GCFDirectoryLocalEntry
{
    DWORD DirectoryIndex;
}
  • DirectoryIndex represents the index of the file to keep local.

Directory Map

This is a mapping of the directory entries to the block entries. This way you can quickly look up the first block for a given file. This is in contrast to having to look through all the block entries to try finding the correct file (which could be very time consuming if there are thousands of blocks in a GCF). The reason this data is not stored directly in the directory is because of the checksum on it. Because the checksum is calculated on the entire directory (not just the directory header), it would be difficult to filter out all of these mappings. Thus they are stored here, separate from the directory.

Oddly enough, this section does appear in NCF files.

Header

The purpose of this header eludes me.

struct GCFDirectoryMapHeader
{
    DWORD Dummy0;
    DWORD Dummy1;
}
  • Dummy0 is always 0x00000001. I assume this is some kind of "version," but I cannot verify this.
  • Dummy1 is always 0x00000000.

Entries

This is an array, with the size being GCFDirectoryHeader.ItemCount.

struct GCFDirectoryMapEntry
{
    DWORD FirstBlockIndex;
}
  • FirstBlockIndex represents the index of the first data block for the file. If the value is equal to the BlockCount, then the item is a directory and/or is not applicable. In NCF files, this is 0x00000000 if the item is a directory, 0x00000001 if the file is empty, and 0x00000003 if the file is not empty.

Checksums

One thing to note is that there are a lot of seemingly "unused" checksums here. They are used, just not by the current version of the cache. This section actually spans every revision of the cache.

Header

This header is documented separately even though it is technically part of the header below. The only reason for the split is to simplify the ChecksumSize field, which skips these first two fields when counting the size.

struct GCFChecksumHeader
{
    DWORD Dummy0;
    DWORD ChecksumSize;
}
  • Dummy0 is always 0x00000001. This is probably left over from the network protocol and can be ignored. It could also be a version, but that is not likely.
  • ChecksumSize is the number of bytes in the checksum section (excluding this structure).

Map Header

struct GCFChecksumMapHeader
{
    DWORD Bitmask;
    DWORD Version;
    DWORD ItemCount;
    DWORD ChecksumCount;
}
  • Bitmask is, as the name says, a bit mask of various flags and values. This is always 0x14893721. Here are the currently known masks:
    • 0x00000001 - Is Signed (Unverified. Probably determines whether there is a signature at the end.)
    • 0xFFFFFFFE - Unknown
  • Version is the version of the checksum format. This is always 0x00000001.
  • ItemCount represents the number of map entries. This should also be the same as the file count in the directory header.
  • ChecksumCount represents the number of checksums.

Map Entries

This is an array, with the size being GCFChecksumMapHeader.ItemCount. The indices in this array are actually the file IDs for all versions of the given cache.

struct GCFChecksumMapEntry
{
    DWORD ChecksumCount;
    DWORD FirstChecksumIndex;
}
  • ChecksumCount defines how many checksums are used for the checksum block.
  • FirstChecksumIndex defines the index of the first checksum for the checksum block.
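Finding the checksums of a file therefore takes two steps: GCFDirectoryEntry.ChecksumIndex selects a map entry, and the map entry selects a run of checksum entries. A minimal sketch, with the invented name FileChecksums and bounds checks that are my own additions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DWORD;

struct GCFChecksumMapEntry {
    DWORD ChecksumCount;
    DWORD FirstChecksumIndex;
};

/*
 * Returns a pointer to the first checksum of the file whose directory
 * entry has the given ChecksumIndex and stores the number of checksums
 * in *countOut. Returns NULL (and a count of 0) for folders or for
 * out-of-range values.
 */
static const DWORD *FileChecksums(const struct GCFChecksumMapEntry *map,
                                  DWORD mapCount,
                                  const DWORD *checksums, DWORD checksumCount,
                                  DWORD checksumIndex, DWORD *countOut)
{
    assert(map != NULL && checksums != NULL && countOut != NULL);

    *countOut = 0;
    if (checksumIndex >= mapCount)      /* also rejects 0xFFFFFFFF (folders) */
        return NULL;

    DWORD first = map[checksumIndex].FirstChecksumIndex;
    DWORD count = map[checksumIndex].ChecksumCount;
    if (first > checksumCount || count > checksumCount - first)
        return NULL;

    *countOut = count;
    return checksums + first;
}
```

Here mapCount and checksumCount correspond to GCFChecksumMapHeader.ItemCount and GCFChecksumMapHeader.ChecksumCount.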

Checksum Entries

This is an array, with the size being GCFChecksumMapHeader.ChecksumCount.

struct GCFChecksumEntry
{
    DWORD Checksum;
}
  • Checksum is a checksum for a given segment of a file. The following C code shows how the checksums are calculated:
// You need to use to use the [http://www.zlib.net/ zlib] library.
inline DWORD FileChecksum(BYTE *data, DWORD length)
{
    return adler32(0, data, length) ^ crc32(0, data, length);
}

Signature

struct GCFChecksumSignature
{
    BYTE Signature[0x80];
}
  • Signature is an RSA signature (using SHA-1) of the checksum data. The RSA key used has yet to be determined.

Data Blocks

This section is not present in NCF files and should be ignored when processing them.

struct GCFDataBlockHeader
{
    DWORD CacheVersion;
    DWORD BlockCount;
    DWORD BlockSize;
    DWORD FirstBlockOffset;
    DWORD BlocksUsed;
    DWORD Checksum;
}
  • CacheVersion is the version of the application.
  • BlockCount represents the number of blocks in the cache.
  • BlockSize represents how many bytes are in each segment of a file.
  • FirstBlockOffset represents the offset to the first data block.
  • BlocksUsed represents the number of blocks that actually have data.
  • Checksum is used to validate the header. Currently, the checksums are calculated by adding all of the previous DWORDs together, except for the CacheVersion field.
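As a sketch of how this header might be used: the first function computes the checksum as described above, and the second computes the file offset of a data block. The offset formula assumes that data blocks are stored back to back, each BlockSize bytes long, starting at FirstBlockOffset. That seems the natural reading, but it is an assumption, and both function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DWORD;

struct GCFDataBlockHeader {
    DWORD CacheVersion, BlockCount, BlockSize;
    DWORD FirstBlockOffset, BlocksUsed, Checksum;
};

/* Sum of all fields before Checksum, leaving out CacheVersion. */
static DWORD DataBlockHeaderChecksum(const struct GCFDataBlockHeader *h)
{
    assert(h != NULL);
    return h->BlockCount + h->BlockSize + h->FirstBlockOffset + h->BlocksUsed;
}

/*
 * File offset of the data block with the given index, assuming blocks are
 * stored contiguously from FirstBlockOffset. A 64-bit result avoids
 * overflow in the multiplication.
 */
static uint64_t DataBlockOffset(const struct GCFDataBlockHeader *h,
                                DWORD blockIndex)
{
    assert(h != NULL);
    return (uint64_t)h->FirstBlockOffset +
           (uint64_t)blockIndex * h->BlockSize;
}
```

The block indices passed to DataBlockOffset would be the ones obtained by following the fragmentation map.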

Unknown NCF DWORD

I am not entirely sure what this DWORD is, but it appears in every NCF file I have looked at. One guess would be that it is the count of the number of incomplete or non-downloaded files.

Credits