# Common ZIP Spec
This document is a technical specification for the ZIP file format originally invented by PKWARE, Inc. This is an independent document, not affiliated with PKWARE, that fully specifies the file format that is in common use as of 2026. For PKWARE's original specification, called "APPNOTE", see https://support.pkware.com/pkzip/appnote . This document covers a subset of APPNOTE in addition to some third-party extensions and common conventions beyond APPNOTE.
This document is a guide for implementing software that works with ZIP files in the modern age, as of 2026. This is a guide to:
- creating ZIP files that can be read by most existing implementations,
- reading ZIP files created by most implementations,
- and guarding against surprising behavior from untrusted inputs.
For those already familiar with the ZIP file format, see ZIP File Features and Common ZIP vs APPNOTE.
The Common ZIP authors believe that while this specification is relevant in most situations, there are some use cases with specialized needs that must be non-conformant. Such specialized use cases could include environments where the writing and reading of ZIP files is performed by known, controlled software implementations, and in such situations the design goals enumerated above do not apply.
## Legal
This specification is licensed under the MIT License. The original APPNOTE has an embedded proprietary license that does not allow redistribution; see APPNOTE for details. This document is fully original and does not copy or reproduce any part of APPNOTE.
The ZIP file format itself was placed in the public domain in 1989 by its creators, Phil Katz and Gary Conway.
See Endnote `PublicDomain`.
Later additions to the ZIP file format require a license from PKWARE to utilize.
No part of the ZIP file format that requires a license is described in technical detail in this document.
This document covers a patent-free, freely-usable subset of the ZIP file format.
This document refers to the following registered and unregistered trademarks belonging to PKWARE: PKWARE®, ZIP64™.
### MIT License
Copyright (c) 2026 Josh Wolfe
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
## Informal Overview
This section is non-normative.
The following is a simplified diagram of the structure of a ZIP file:
```
for each entry {
    LocalFileHeader
    compressed {
        contents
    }
}
for each entry {
    CentralDirectoryHeader
}
EndOfCentralDirectoryRecord
```
The `contents` portions comprise the vast majority of a typical ZIP file.
The `EndOfCentralDirectoryRecord` contains information for locating the first `CentralDirectoryHeader`.
Each `CentralDirectoryHeader` contains information for locating the corresponding `LocalFileHeader`.
Each `LocalFileHeader` contains redundant metadata also found in the corresponding `CentralDirectoryHeader`.
The following is a complete diagram of a ZIP file conformant to this specification (non-normative):
```
for each entry {
    (optional unused space)
    LocalFileHeader
    optionally compressed {
        contents
    }
    if LocalFileHeader General Purpose Bit 3 is set {
        DataDescriptor
    }
}
(optional unused space)
for each entry {
    CentralDirectoryHeader
}
(optional unused space)
if optional {
    Zip64EndOfCentralDirectoryRecord
    (optional unused space)
    Zip64EndOfCentralDirectoryLocator
}
EndOfCentralDirectoryRecord
```
Note that APPNOTE specifies even more structures and variability, not pictured above.
### ZIP File Features
This specification classifies a subset of ZIP file features into the categories critical, optional, and obscure as explained below. Some of these features are documented in APPNOTE and some are conventions that have arisen independent of PKWARE and APPNOTE. Some ZIP file features, such as many extra fields, are omitted from specific discussion, and instead this spec encourages writers to omit them and readers to ignore them.
This section is a non-normative summary; the full details are explained in ZIP File Structure.
The following are **critical features** that a ZIP file MAY include. A reader SHOULD always support these, and a writer SHOULD use them as needed.
- DEFLATE compression: `compressionMethod` values `0` (stored) and `8` (deflated).
- CRC32 checksums: `crc32` fields SHALL be set correctly, and readers SHOULD verify them.
- File Sizes: `Zip64ExtendedInformation` and the two ZIP64 end structs, which support sizes larger than `4294967295` and number of entries more than `65535`.
- Deferred Lengths: General Purpose Bit 3 and `DataDescriptor`. (Note that Streaming Reading is strongly discouraged.)
- `fileName` determination: UTF-8, General Purpose Bit 11, `InfoZipUnicodePath`, validation rules.
- `fileType` determination: classifying entries as directory, symlink, regular file, or POSIX executable file. If a reader does not support symlinks, it SHOULD produce an error rather than misinterpreting a symlink entry as a regular file.
The following are **optional features** that a ZIP file MAY include. A reader SHOULD support these as needed, or MAY ignore or skip them as appropriate.
- `lastModifiedTimestamp`: `dosTimestamp`, `InfoZipUniversalTimestamp`, `NtfsTimestamp`
- Full symlink support (beyond just detecting what is a symlink): if a reader does not produce an error when encountering a symlink, it SHOULD enforce validation rules to prevent path traversal vulnerabilities.
A conformant ZIP file SHALL NOT include any of the following **obscure features**. A reader SHOULD NOT support these unless specifically required. Many of these are explained in APPNOTE.
- **Multi-Disk**: splitting an archive across multiple files or disks.
- **More Compression Methods**: `compressionMethod` other than `0` and `8`.
- **Patch Data**: general purpose bit 5. See `generalPurposeBits`.
- **Traditional Encryption**: general purpose bit 0.
- **Strong Encryption**: general purpose bits 6 and 13, and several structures documented in APPNOTE.
- **Z390 Extra Field**: some content in the `zip64 extensible data sector`.
- **Reserved General Purpose Bits**: general purpose bits 7, 8, 9, 10, 12, 14, and 15.
- **More File Systems**: `fileSystemCompatibility` other than `0` and `3`.
- **Future APPNOTE Versions**: `appnoteCompatibilityMin` greater than `45`, meaning features requiring APPNOTE versions beyond `4.5`. Note that some features, such as General Purpose Bit 11, were introduced later, but they do not require setting `appnoteCompatibilityMin` higher than `45`.
- **Base Offset Shift**: incorrectly setting offsets in the ZIP metadata suggesting that the ZIP file was concatenated to some other data. Note that this is a convention apparently pioneered by Info-ZIP and is not covered in APPNOTE. See Endnote `BaseOffsetShift`.
## Definitions and Notation
This specification uses the following linguistic conventions:
- SHALL indicates a requirement, and SHALL NOT indicates a forbiddance.
- SHOULD indicates a recommendation, and SHOULD NOT indicates a discouragement.
- MAY is used to indicate that something is permitted, and MAY NOT is not used in this document.
- CAN is used to indicate that something is possible, and CANNOT is not used in this document.
- When a term is defined, the above verb forms are not used; typically the term is written in **bold**.
- If a sentence begins with the word Note, it is a non-normative clarification.
A writer is an implementation producing a ZIP file conformant to this specification. A reader is an implementation accepting a ZIP file as an input. This specification only places normative restrictions on ZIP files, not on implementations working with them; this specification gives recommendations and warnings to implementation authors for security and compatibility purposes.
### Types and Values
A byte is 8 bits. The type **`bytes`** refers to a variable-length sequence of bytes.
There is no implicit character encoding for `bytes`.
All integers are encoded in little-endian byte order unless otherwise stated.
The unsigned integer types are: **`uint8le`**, **`uint16le`**, **`uint32le`**, **`uint64le`**.
These integers are 1, 2, 4, and 8 bytes in size respectively.
When an integer value expressed in this document is prefixed by `0x` the remainder of the value is in hexadecimal,
when prefixed by `0o` the remainder is in octal, and otherwise the value is in decimal; e.g. `0xFF` equals `0o377` equals `255`.
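Note: the following Python sketch is non-normative. It illustrates how the integer types above can be decoded with the standard `struct` module; the names `FORMATS` and `read_uint` are hypothetical helpers, not part of this specification.

```python
import struct

# struct format codes for the unsigned little-endian types named above.
FORMATS = {"uint8le": "<B", "uint16le": "<H", "uint32le": "<I", "uint64le": "<Q"}


def read_uint(type_name: str, buf: bytes, offset: int = 0) -> int:
    """Decode one unsigned little-endian integer from buf at the given offset."""
    return struct.unpack_from(FORMATS[type_name], buf, offset)[0]


# The bytes {0x50, 0x4b, 0x03, 0x04} decode to 0x04034b50 as a uint32le.
assert read_uint("uint32le", b"PK\x03\x04") == 0x04034B50
# Hexadecimal, octal, and decimal notation all denote the same value.
assert 0xFF == 0o377 == 255
```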
Timestamps are represented in this document in `YYYY-MM-DDThh:mm:ssZ` format,
where `YYYY` is the year, `MM` is the month where `01` is January, `DD` is the day where `01` is the first day of the month,
`hh` is the hour in the range `00` to `23`, `mm` is the minutes, and `ss` is the seconds.
The final `Z` indicates that the timestamp is a UTC Time in the UTC+00:00 timezone.
A struct is a sequence of fields. Each field has a name and a type. A field CAN be specified to have a variable size in bytes. There is no implicit padding before, between, or after any defined fields within a struct. The size of the struct in bytes is the sum of the sizes of its data fields.
An offset is a number of bytes since the start of a given sequence of bytes where a given struct or field starts. For example, if a struct is at an offset of 0 in a region, that means that the given struct is the first thing in the given region. An offset CAN be any integer value within the bounds of the region, meaning there is no implicit pointer alignment.
Unused space is 1 or more bytes that are not assigned any meaning by this specification.
For example, a self-extracting ZIP file CAN contain executable code before the first `LocalFileHeader`.
The bytes in unused space CAN contain arbitrary values, including values resembling meaningful structures.
This specification carefully defines how meaning is assigned to bytes that a reader might find in a file.
See Recommendations for Implementations for guidance, and see ZIP File Structure for formal definitions.
In the context of this specification, a file is defined to be a sequence of bytes.
Note that typically a file resides on a persistent storage medium, but also may be a stream of data being read from or written to a network socket, or any other implementation of the concept of a file.
A file is often in a context where it has a symbolic path or file name. TODO: do we want to get this deep in the definitions section?
A directory is not defined in this specification, but has the familiar definition. TODO: actually give it a definition? is this overkill to define directories herein? See also the `fileType=DIRECTORY` definition below.
A symlink is defined as a special type of file whose contents have meaning as the path to another file, possibly relative to the immediate parent directory that contains the symlink. Note this corresponds more closely to the POSIX concept of a symbolic link, not the Windows 10+ concept, because there is no embedded bit indicating whether the symlink points to a directory.
TODO: A reader and writer should be defined here.
TODO: is 'a reader' vs 'the reader' and 'a writer' vs 'the writer' a problem? maybe use different terminology like "a reader implementation"?
TODO: is "the `contents`" vs just "`contents`" (and also for "the `fileData`") a problem?
A ZIP file is a file that encodes a ZIP archive according to the ZIP File Structure documented below. Note the following abstract structure definitions define what this specification recommends implementations consider significant, but any in-memory representations that implementations use MAY deviate from these. A ZIP archive is defined as an abstract structure with the following fields:
- 0 or more abstract Entry objects. The order is significant.
An **Entry** is defined as an abstract structure with the following fields:
- `fileName`, a UTF-8 string. The file name of the entry relative to the archive root. E.g. `docs/README.md`.
- `fileType`, one of: `FILE`, `POSIX_EXECUTABLE`, `DIRECTORY`, or `SYMLINK`.
- `contents`, arbitrary `bytes`. For `fileType=DIRECTORY`, `contents` must be 0-length. For `fileType=SYMLINK`, `contents` represents the symlink target (sometimes called the "source"). For `fileType=FILE` or `fileType=POSIX_EXECUTABLE`, `contents` represents the file's contents on disk.
- `lastModifiedTimestamp`, an optional moment in time when the file was last modified. Although the precision and timezone information of the abstract `lastModifiedTimestamp` is unspecified, see `lastModifiedTimestamp` for information about concrete encodings and their limitations.
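Note: the following Python sketch is non-normative. It shows one possible in-memory representation of the abstract structures above. The class names `FileType`, `Entry`, and `ZipArchive`, and the choice of `datetime` for the timestamp, are assumptions made for illustration; implementations MAY use any representation.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional


class FileType(Enum):
    FILE = "FILE"
    POSIX_EXECUTABLE = "POSIX_EXECUTABLE"
    DIRECTORY = "DIRECTORY"
    SYMLINK = "SYMLINK"


@dataclass
class Entry:
    file_name: str  # UTF-8 string, relative to the archive root
    file_type: FileType
    contents: bytes = b""  # symlink target when file_type is SYMLINK
    last_modified_timestamp: Optional[datetime] = None

    def __post_init__(self) -> None:
        # A DIRECTORY entry carries 0-length contents.
        if self.file_type is FileType.DIRECTORY and len(self.contents) != 0:
            raise ValueError("DIRECTORY entries must have 0-length contents")


@dataclass
class ZipArchive:
    # The order of entries is significant, so a list is used rather than a set.
    entries: List[Entry] = field(default_factory=list)
```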
## ZIP File Structure
This section describes the ZIP file structure. See the Recommendations for Implementations section for a discussion of algorithms for working with ZIP files.
### `LocalFileHeader`, `fileData`, `DataDescriptor`
If there are 0 entries, the start of the ZIP file is described in the section End of Central Directory. Otherwise, for each entry, the ZIP file contains each of the following in order:
- Optional unused space
- `LocalFileHeader`
- `fileData`
- Conditionally-present `DataDescriptor`
Note that the very start of the ZIP file MAY be arbitrary byte values in the unused space, including byte values that resemble a meaningful `LocalFileHeader`.
For a guide to reading a ZIP file despite this ambiguity, see Recommendations for Implementations.
The following defines **`LocalFileHeader`**:
```
struct LocalFileHeader = {
    // Name:                 Type;      //Offset,Size| Comment
    signature:               uint32le;  //  0, 4 | SHALL be 0x04034b50
    appnoteCompatibilityMin: uint16le;  //  4, 2 |
    generalPurposeBits:      uint16le;  //  6, 2 |
    compressionMethod:       uint16le;  //  8, 2 | SHALL be 0 or 8
    dosTimestamp:            uint32le;  // 10, 4 |
    crc32:                   uint32le;  // 14, 4 |
    compressedSize32:        uint32le;  // 18, 4 |
    uncompressedSize32:      uint32le;  // 22, 4 |
    fileNameRawLength:       uint16le;  // 26, 2 | See fileNameRaw
    extraFieldsLength:       uint16le;  // 28, 2 | See extraFields
    fileNameRaw:             bytes;     // 30, fileNameRawLength
    extraFields:             bytes;     // 30 + fileNameRawLength, extraFieldsLength
};
```
The size of a `LocalFileHeader` is `30 + fileNameRawLength + extraFieldsLength`.
The `signature` SHALL be `0x04034b50` i.e. `{0x50, 0x4b, 0x03, 0x04}` i.e. `"PK\x03\x04"`.
The fields `appnoteCompatibilityMin`, `generalPurposeBits`, and `dosTimestamp` are documented in their own sections.
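Note: the following Python sketch is non-normative. It splits a `LocalFileHeader` into its fields using the layout above and checks only the `signature` and the buffer bounds. The names `parse_local_file_header` and the `LocalFileHeader` tuple are hypothetical; interpreting the individual fields is covered in their own sections.

```python
import struct
from typing import NamedTuple

LFH_SIGNATURE = 0x04034B50
# Fixed 30-byte prefix: uint32le, 3 x uint16le, 4 x uint32le, 2 x uint16le.
LFH_FIXED = struct.Struct("<IHHHIIIIHH")


class LocalFileHeader(NamedTuple):
    appnote_compatibility_min: int
    general_purpose_bits: int
    compression_method: int
    dos_timestamp: int
    crc32: int
    compressed_size32: int
    uncompressed_size32: int
    file_name_raw: bytes
    extra_fields: bytes
    size: int  # 30 + fileNameRawLength + extraFieldsLength


def parse_local_file_header(buf: bytes, offset: int = 0) -> LocalFileHeader:
    if offset < 0 or offset + LFH_FIXED.size > len(buf):
        raise ValueError("truncated LocalFileHeader")
    (signature, version, bits, method, dos_timestamp, crc32, compressed32,
     uncompressed32, name_length, extra_length) = LFH_FIXED.unpack_from(buf, offset)
    if signature != LFH_SIGNATURE:
        raise ValueError("bad LocalFileHeader signature")
    name_start = offset + LFH_FIXED.size
    extra_start = name_start + name_length
    end = extra_start + extra_length
    if end > len(buf):
        raise ValueError("LocalFileHeader exceeds the buffer")
    return LocalFileHeader(version, bits, method, dos_timestamp, crc32,
                           compressed32, uncompressed32,
                           buf[name_start:extra_start], buf[extra_start:end],
                           end - offset)
```

The returned `size` gives the offset, relative to the start of the header, at which the entry's `fileData` begins.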
The **`compressionMethod`** SHALL be one of the following values:
- `0` **stored**: the `fileData` is the `contents` with no compression.
- `8` **deflated**: the `fileData` is the `contents` compressed with the DEFLATE compression algorithm.
If `generalPurposeBits` bit 3 is set, then `crc32`, `compressedSize32`, and `uncompressedSize32` SHALL all be `0`,
otherwise `crc32` SHALL be the CRC32 of the `contents`,
and see the File Sizes section for the meaning of `compressedSize32` and `uncompressedSize32`.
`fileNameRawLength` is the length of `fileNameRaw`,
and see the `fileName` section for the meaning of `fileNameRaw`.
The `extraFieldsLength` is the length of the `extraFields`,
and `extraFields` is documented in its own section.
After the `LocalFileHeader` comes the entry's **`fileData`**, which encodes the entry's `contents` according to `compressionMethod`.
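Note: the following Python sketch is non-normative. It shows how `fileData` relates to `contents` for the two permitted `compressionMethod` values, assuming (as is conventional for ZIP files) that the deflated form is a raw DEFLATE stream with no zlib header or trailer, which Python's `zlib` selects with a negative `wbits`. The function names are hypothetical.

```python
import zlib


def encode_file_data(contents: bytes, compression_method: int) -> bytes:
    if compression_method == 0:  # stored
        return contents
    if compression_method == 8:  # deflated
        # wbits=-15 requests a raw DEFLATE stream (assumption stated above).
        compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
        return compressor.compress(contents) + compressor.flush()
    raise ValueError("compressionMethod SHALL be 0 or 8")


def decode_file_data(file_data: bytes, compression_method: int, expected_crc32: int) -> bytes:
    if compression_method == 0:
        contents = file_data
    elif compression_method == 8:
        # A real reader would likely bound the output size; this sketch does not.
        decompressor = zlib.decompressobj(-15)
        contents = decompressor.decompress(file_data) + decompressor.flush()
        if not decompressor.eof or decompressor.unused_data:
            raise ValueError("fileData is not exactly one complete DEFLATE stream")
    else:
        raise ValueError("compressionMethod SHALL be 0 or 8")
    # The crc32 field is the CRC32 of the contents, not of the fileData.
    if zlib.crc32(contents) != expected_crc32:
        raise ValueError("crc32 mismatch")
    return contents
```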
If `generalPurposeBits` bit 3 is set, then immediately following the `fileData` is a `DataDescriptor` struct.
A **`DataDescriptor`** has a variable structure taking one of the following forms:
```
struct DataDescriptor = one of {
    struct {
        // Name:          Type;      // Size | Comment
        signature:        uint32le;  // 4 | SHALL be 0x08074b50, when present
        crc32:            uint32le;  // 4 |
        compressedSize:   uint32le;  // 4 |
        uncompressedSize: uint32le;  // 4 |
    },
    struct {
        // Name:          Type;      // Size | Comment
        signature:        uint32le;  // 4 | SHALL be 0x08074b50, when present
        crc32:            uint32le;  // 4 |
        compressedSize:   uint64le;  // 8 |
        uncompressedSize: uint64le;  // 8 |
    },
    struct {
        // Name:          Type;      // Size | Comment
        crc32:            uint32le;  // 4 |
        compressedSize:   uint32le;  // 4 |
        uncompressedSize: uint32le;  // 4 |
    },
    struct {
        // Name:          Type;      // Size | Comment
        crc32:            uint32le;  // 4 |
        compressedSize:   uint64le;  // 8 |
        uncompressedSize: uint64le;  // 8 |
    },
}
```
The size of a `DataDescriptor` is one of `16`, or `24`, or `12`, or `20` depending on the form.
If the `signature` field is present, it SHALL be `0x08074b50` i.e. `{0x50, 0x4b, 0x07, 0x08}` i.e. `"PK\x07\x08"`.
Note that the `crc32` field CAN have this value by coincidence.
See Endnote `DataDescriptorHasAmbiguousStructure` and Endnote `DataDescriptorIsAnAmbiguousSentinel`.
The `crc32` field SHALL be the CRC32 of the `contents`.
The `uncompressedSize` field SHALL be the length of the `contents`,
and the field must be of sufficient size to encode the length using a `uint64le` variant if needed.
The `compressedSize` field SHALL be the length of the `fileData`,
which is equal to `uncompressedSize` when `compressionMethod` is 0 (stored),
and the field must also be of sufficient size as with `uncompressedSize`.
The `uncompressedSize` and `compressedSize` MAY be `uint64le` integers even if their encoded values would fit in `uint32le` integers.
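Note: the following Python sketch is non-normative. It encodes a `DataDescriptor` from a writer's point of view, choosing one of the two forms that include the `signature`. Always writing the `signature`, the `force_64` parameter, and the function name are assumptions made for illustration; how this choice interacts with the rest of an entry's metadata is outside the scope of this sketch.

```python
import struct

DD_SIGNATURE = 0x08074B50


def encode_data_descriptor(crc32: int, compressed_size: int,
                           uncompressed_size: int, force_64: bool = False) -> bytes:
    """Encode a DataDescriptor using a form that includes the signature."""
    # The uint64le form is required when either size does not fit in a uint32le,
    # and is permitted even when both sizes would fit.
    use_64 = (force_64
              or compressed_size > 0xFFFFFFFF
              or uncompressed_size > 0xFFFFFFFF)
    layout = "<IIQQ" if use_64 else "<IIII"
    return struct.pack(layout, DD_SIGNATURE, crc32, compressed_size, uncompressed_size)
```

The two results are 16 and 24 bytes long respectively; the 12-byte and 20-byte forms are the same encodings without the leading 4-byte `signature`.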
The structures described above for each entry SHALL NOT overlap with the structures for another entry; see Endnote `ZipBomb`.
### Central Directory
Before the first `CentralDirectoryHeader` is optional unused space.
Then, for each entry, there is a `CentralDirectoryHeader`.
There is no unused space between `CentralDirectoryHeader` structs.
Each `CentralDirectoryHeader` corresponds to a `LocalFileHeader`, and several fields are duplicated between the two structures.
The order of `LocalFileHeader` and corresponding `CentralDirectoryHeader` structs SHALL match; see Endnote `ZipBomb`.
The following defines **`CentralDirectoryHeader`**:
```
struct CentralDirectoryHeader = {
    // Name:                 Type;      //Offset,Size| Comment
    signature:               uint32le;  //  0, 4 | SHALL be 0x02014b50
    appnoteCompatibilityMax: uint8le;   //  4, 1 | SHOULD be 63
    fileSystemCompatibility: uint8le;   //  5, 1 | SHOULD be 3
    appnoteCompatibilityMin: uint16le;  //  6, 2 | SHOULD be 45
    generalPurposeBits:      uint16le;  //  8, 2 | SHOULD be 0x0800
    compressionMethod:       uint16le;  // 10, 2 | SHALL match LocalFileHeader
    dosTimestamp:            uint32le;  // 12, 4 | SHALL match LocalFileHeader
    crc32:                   uint32le;  // 16, 4 |
    compressedSize32:        uint32le;  // 20, 4 |
    uncompressedSize32:      uint32le;  // 24, 4 |
    fileNameRawLength:       uint16le;  // 28, 2 | See fileNameRaw
    extraFieldsLength:       uint16le;  // 30, 2 | See extraFields
    fileCommentLength:       uint16le;  // 32, 2 | See fileComment
    localFileHeaderDisk16:   uint16le;  // 34, 2 |
    internalFileAttributes:  uint16le;  // 36, 2 | SHOULD be 0
    externalFileAttributes:  uint32le;  // 38, 4 |
    localFileHeaderOffset32: uint32le;  // 42, 4 |
    fileNameRaw:             bytes;     // 46, fileNameRawLength
    extraFields:             bytes;     // 46 + fileNameRawLength, extraFieldsLength
    fileComment:             bytes;     // 46 + fileNameRawLength + extraFieldsLength, fileCommentLength
};
```
The size of a `CentralDirectoryHeader` is `46 + fileNameRawLength + extraFieldsLength + fileCommentLength`.
The `signature` SHALL be `0x02014b50` i.e. `{0x50, 0x4b, 0x01, 0x02}` i.e. `"PK\x01\x02"`.
Fields `appnoteCompatibilityMax`, `fileSystemCompatibility`, `appnoteCompatibilityMin`, `internalFileAttributes`, `externalFileAttributes`, and `generalPurposeBits` are documented in their own sections.
The fields `compressionMethod` and `dosTimestamp` SHALL match the values in the corresponding `LocalFileHeader`.
The `crc32` SHALL be the CRC32 of the corresponding `contents`.
See the section on the `Zip64ExtendedInformation` extra field for the meaning of `compressedSize32`, `uncompressedSize32`, `localFileHeaderDisk16`, and `localFileHeaderOffset32`.
`fileNameRawLength` is the length of `fileNameRaw`,
and see the `fileName` section for the meaning of `fileNameRaw`.
The `extraFieldsLength` is the length of the `extraFields`,
and `extraFields` is documented in its own section.
`fileCommentLength` is the length of `fileComment`.
This specification gives no defined meaning to `fileComment`.
TODO: actually do give defined character encoding as UTF-8 when General Purpose Bit 11 is set and implementation defined otherwise.
Note that the restrictions on bytes prior to the `EndOfCentralDirectoryRecord` described in the next section CAN affect the encoding of a `CentralDirectoryHeader`.
For more discussion see Recommendations for Implementations.
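Note: the following Python sketch is non-normative. It walks a run of contiguous `CentralDirectoryHeader` structs, relying on the rule that there is no unused space between them, and checks that the bytes consumed equal the expected central directory size. The function name, its parameters, and the use of plain dictionaries are assumptions made for illustration.

```python
import struct

CDH_SIGNATURE = 0x02014B50
CDH_FIXED = struct.Struct("<IBBHHHIIIIHHHHHII")  # fixed 46-byte prefix
CDH_FIELDS = (
    "signature", "appnoteCompatibilityMax", "fileSystemCompatibility",
    "appnoteCompatibilityMin", "generalPurposeBits", "compressionMethod",
    "dosTimestamp", "crc32", "compressedSize32", "uncompressedSize32",
    "fileNameRawLength", "extraFieldsLength", "fileCommentLength",
    "localFileHeaderDisk16", "internalFileAttributes",
    "externalFileAttributes", "localFileHeaderOffset32",
)
CDH_VARIABLE = (("fileNameRaw", "fileNameRawLength"),
                ("extraFields", "extraFieldsLength"),
                ("fileComment", "fileCommentLength"))


def parse_central_directory(buf: bytes, offset: int, entry_count: int, size: int) -> list:
    start = offset
    headers = []
    for _ in range(entry_count):
        if offset + CDH_FIXED.size > len(buf):
            raise ValueError("truncated CentralDirectoryHeader")
        header = dict(zip(CDH_FIELDS, CDH_FIXED.unpack_from(buf, offset)))
        if header["signature"] != CDH_SIGNATURE:
            raise ValueError("bad CentralDirectoryHeader signature")
        pos = offset + CDH_FIXED.size
        for name, length_field in CDH_VARIABLE:
            end = pos + header[length_field]
            if end > len(buf):
                raise ValueError("CentralDirectoryHeader exceeds the buffer")
            header[name] = buf[pos:end]
            pos = end
        headers.append(header)
        offset = pos  # the next header starts immediately after this one
    if offset - start != size:
        raise ValueError("central directory size does not match its headers")
    return headers
```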
### End of Central Directory
Note that the last structures of a ZIP file are intended to be readable starting from the end of the file and working backwards. This places awkward forbiddances on certain values in specific locations. For more discussion see Recommendations for Implementations.
Following the last `CentralDirectoryHeader`, or if there are no entries then at the start of the file, is optional unused space.
Next come **the two ZIP64 end structs**, which MAY be present. The two structs are either both present or both omitted: if one is omitted the other SHALL be omitted as well. When present, they appear in the following order:
- `Zip64EndOfCentralDirectoryRecord`
- optional unused space
- `Zip64EndOfCentralDirectoryLocator`
Then there is always an `EndOfCentralDirectoryRecord`.
If the two ZIP64 end structs are omitted, then the 4 bytes starting 20 bytes prior to the start of the `EndOfCentralDirectoryRecord` SHALL NOT be `0x07064b50` i.e. `{0x50, 0x4b, 0x06, 0x07}` i.e. `"PK\x06\x07"`.
Note that such a value would appear to a reader working backward from the `EndOfCentralDirectoryRecord` to be the `signature` of the `Zip64EndOfCentralDirectoryLocator`.
For more discussion see Recommendations for Implementations and Endnote `Zip64EndStructsAmbiguity`.
The following defines **`EndOfCentralDirectoryRecord`**:
```
struct EndOfCentralDirectoryRecord = {
    // Name:                Type;      //Offset,Size| Comment
    signature:              uint32le;  //  0, 4 | SHALL be 0x06054b50
    centralDirLastDisk16:   uint16le;  //  4, 2 |
    centralDirStartDisk16:  uint16le;  //  6, 2 |
    entryCountOnLastDisk16: uint16le;  //  8, 2 | SHALL be equal to entryCount16
    entryCount16:           uint16le;  // 10, 2 |
    centralDirSize32:       uint32le;  // 12, 4 |
    centralDirOffset32:     uint32le;  // 16, 4 |
    archiveCommentLength:   uint16le;  // 20, 2 | See archiveComment
    archiveComment:         bytes;     // 22, archiveCommentLength
};
```
The size of the `EndOfCentralDirectoryRecord` is `22 + archiveCommentLength`.
The ZIP file SHALL end immediately after the `EndOfCentralDirectoryRecord` with no unused space at the end.
The `signature` SHALL be `0x06054b50` i.e. `{0x50, 0x4b, 0x05, 0x06}` i.e. `"PK\x05\x06"`.
The sequence of 4 bytes `{0x50, 0x4b, 0x05, 0x06}` SHALL NOT appear starting after the `signature` and before 22 bytes prior to the end of the `EndOfCentralDirectoryRecord` struct.
Note that this forbiddance is to enable a reader to find the `signature` when searching backwards from the end of the file.
Note that if `archiveCommentLength` is at most `3`, then this forbiddance is trivially satisfied.
For more discussion see Recommendations for Implementations.
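Note: the following Python sketch is non-normative. It shows one way a reader might locate the `EndOfCentralDirectoryRecord` by searching backwards for the `signature` and then confirming that `archiveCommentLength` reaches exactly to the end of the file. The function name `find_eocdr` is hypothetical, and this sketch is not a substitute for the guidance in Recommendations for Implementations.

```python
import struct

EOCDR_SIGNATURE_BYTES = b"PK\x05\x06"
EOCDR_FIXED_SIZE = 22  # size of the struct when archiveComment is empty
MAX_ARCHIVE_COMMENT_LENGTH = 0xFFFF  # largest value of a uint16le


def find_eocdr(data: bytes) -> int:
    """Return the offset of the EndOfCentralDirectoryRecord within data."""
    if len(data) < EOCDR_FIXED_SIZE:
        raise ValueError("too short to contain an EndOfCentralDirectoryRecord")
    # The record starts no later than 22 bytes before the end of the file,
    # and no earlier than 22 + 0xFFFF bytes before the end of the file.
    last_start = len(data) - EOCDR_FIXED_SIZE
    first_start = max(0, last_start - MAX_ARCHIVE_COMMENT_LENGTH)
    # rfind returns the highest match that lies entirely before the end index,
    # so last_start + 4 permits a match starting exactly at last_start.
    pos = data.rfind(EOCDR_SIGNATURE_BYTES, first_start, last_start + 4)
    if pos < 0:
        raise ValueError("EndOfCentralDirectoryRecord signature not found")
    (archive_comment_length,) = struct.unpack_from("<H", data, pos + 20)
    if pos + EOCDR_FIXED_SIZE + archive_comment_length != len(data):
        raise ValueError("EndOfCentralDirectoryRecord does not end at the end of the file")
    return pos
```

Because a conformant file never contains the `signature` bytes again within the searched range after the real `signature`, the first match found when scanning backwards is the real one; the final length check rejects files with unused space at the end.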
`archiveComment` is an arbitrary informational comment.
The character encoding is implementation defined, and the recommended implementation is ASCII constrained to the byte values `0x20`-`0x7d` inclusive; see Endnote `DefaultCharacterEncoding`.
Note that APPNOTE (in APPENDIX D introduced in version 6.3.0) suggests that the character encoding is "IBM Code Page 437", however see Endnote `Cp437IsAmbiguous`.
TODO: missing discussion of `entryCountOnLastDisk16`.
See below for the meaning of `centralDirLastDisk16`, `centralDirStartDisk16`, `entryCount16`, `centralDirSize32`, and `centralDirOffset32`.
The following defines **`Zip64EndOfCentralDirectoryLocator`**:
```
struct Zip64EndOfCentralDirectoryLocator = {
    // Name:             Type;      //Offset,Size| Comment
    signature:           uint32le;  //  0, 4 | SHALL be 0x07064b50
    zip64EocdrStartDisk: uint32le;  //  4, 4 | SHALL be 0
    zip64EocdrOffset:    uint64le;  //  8, 8 |
    diskCount32:         uint32le;  // 16, 4 | SHALL be 1
};
```
The size of the `Zip64EndOfCentralDirectoryLocator` is `20`.
The `signature` SHALL be `0x07064b50` i.e. `{0x50, 0x4b, 0x06, 0x07}` i.e. `"PK\x06\x07"`.
The `zip64EocdrStartDisk` SHALL be `0` and `diskCount32` SHALL be `1` indicating that the Multi-Disk obscure feature is not used.
The `zip64EocdrOffset` is the offset of the `Zip64EndOfCentralDirectoryRecord` in the ZIP file.
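Note: the following Python sketch is non-normative. Given the offset of the `EndOfCentralDirectoryRecord`, it inspects the 20 bytes immediately before it and, if they carry the locator `signature`, returns `zip64EocdrOffset`. The function name and the choice to return `None` when the two ZIP64 end structs are omitted are assumptions made for illustration.

```python
import struct
from typing import Optional

LOCATOR = struct.Struct("<IIQI")  # 20 bytes
LOCATOR_SIGNATURE = 0x07064B50
ZIP64_EOCDR_SIZE = 56


def read_zip64_locator(data: bytes, eocdr_offset: int) -> Optional[int]:
    """Return zip64EocdrOffset, or None if no locator precedes the EOCDR."""
    start = eocdr_offset - LOCATOR.size
    if start < 0:
        return None  # not enough room for a locator
    signature, start_disk, zip64_eocdr_offset, disk_count = LOCATOR.unpack_from(data, start)
    if signature != LOCATOR_SIGNATURE:
        return None  # the two ZIP64 end structs are omitted
    if start_disk != 0 or disk_count != 1:
        raise ValueError("the Multi-Disk obscure feature is not supported")
    # The Zip64EndOfCentralDirectoryRecord has to fit entirely before the locator.
    if zip64_eocdr_offset + ZIP64_EOCDR_SIZE > start:
        raise ValueError("zip64EocdrOffset overlaps or follows the locator")
    return zip64_eocdr_offset
```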
The following defines **`Zip64EndOfCentralDirectoryRecord`**:
```
struct Zip64EndOfCentralDirectoryRecord = {
    // Name:                       Type;      //Offset,Size| Comment
    signature:                     uint32le;  //  0, 4 | SHALL be 0x06064b50
    zip64ExtensibleDataSizePlus44: uint64le;  //  4, 8 | SHALL be 44
    appnoteCompatibilityMax:       uint8le;   // 12, 1 | SHOULD be 63
    fileSystemCompatibility:       uint8le;   // 13, 1 | SHOULD be 3
    appnoteCompatibilityMin:       uint16le;  // 14, 2 | SHOULD be 45
    zip64EocdrStartDisk:           uint32le;  // 16, 4 | SHALL be 0
    centralDirStartDisk32:         uint32le;  // 20, 4 | SHALL be 0
    entryCountOnLastDisk64:        uint64le;  // 24, 8 | SHALL be equal to entryCount64
    entryCount64:                  uint64le;  // 32, 8 |
    centralDirSize64:              uint64le;  // 40, 8 |
    centralDirOffset64:            uint64le;  // 48, 8 |
};
```
The size of the `Zip64EndOfCentralDirectoryRecord` is `56`.
Note that APPNOTE includes an additional variable size field in this structure called **`zip64 extensible data sector`**,
which is used for the Z390 Extra Field obscure feature.
The `signature` SHALL be `0x06064b50` i.e. `{0x50, 0x4b, 0x06, 0x06}` i.e. `"PK\x06\x06"`.
The `zip64ExtensibleDataSizePlus44` SHALL be `44` indicating that the Z390 Extra Field obscure feature is not used.
The `zip64EocdrStartDisk` and `centralDirStartDisk32` SHALL both be `0` and `entryCountOnLastDisk64` SHALL be equal to `entryCount64` indicating that the Multi-Disk obscure feature is not used.
The fields `appnoteCompatibilityMax`, `fileSystemCompatibility`, and `appnoteCompatibilityMin` have limited meaning in the context of the `Zip64EndOfCentralDirectoryRecord`.
Writers SHOULD set these fields to `63` meaning APPNOTE version 6.3, `3` (UNIX), and `45` meaning APPNOTE version 4.5 respectively.
Readers SHOULD ignore these fields in this context.
If the two ZIP64 end structs are present:
- The `entryCount64` SHALL be the number of entries in the archive.
- The `centralDirOffset64` SHALL be the offset of the first `CentralDirectoryHeader` in the archive.
- The `centralDirSize64` SHALL be the sum of the sizes of all `CentralDirectoryHeader` structs.
- At least one of the following SHALL be true: `entryCount16` is `0xFFFF`, or `centralDirOffset32` is `0xFFFFFFFF`, or `centralDirSize32` is `0xFFFFFFFF`. These three fields otherwise have no defined meaning. See Endnote `Zip64EndStructsAmbiguity`.
- Each of `centralDirLastDisk16` and `centralDirStartDisk16` SHALL be either `0` or `0xFFFF`. See Endnote `MultiDiskZip64Compatibility`.
If the two ZIP64 end structs are omitted:
- The `entryCount16` SHALL be the number of entries in the archive, and SHALL be less than `0xFFFF`. See Endnote `Zip64EndStructsAmbiguity`.
- The `centralDirOffset32` SHALL be the offset of the first `CentralDirectoryHeader` in the archive, and SHALL be less than `0xFFFFFFFF`.
- The `centralDirSize32` SHALL be the sum of the sizes of all `CentralDirectoryHeader` structs, and SHALL be less than `0xFFFFFFFF`.
- Each of `centralDirLastDisk16` and `centralDirStartDisk16` SHALL be `0` indicating that the Multi-Disk obscure feature is not used.
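Note: the following Python sketch is non-normative. It applies the two sets of rules above to determine the entry count, central directory offset, and central directory size, depending on whether the two ZIP64 end structs are present. The function name, its parameters, and the returned tuple order are assumptions made for illustration; locating the structs is assumed to have been done already.

```python
import struct
from typing import Optional, Tuple

EOCDR = struct.Struct("<IHHHHIIH")  # 22 bytes, before archiveComment
ZIP64_EOCDR = struct.Struct("<IQBBHIIQQQQ")  # 56 bytes


def locate_central_directory(data: bytes, eocdr_offset: int,
                             zip64_eocdr_offset: Optional[int] = None) -> Tuple[int, int, int]:
    """Return (entry count, central directory offset, central directory size)."""
    (_sig, last_disk, start_disk, count_on_last_disk16, count16,
     size32, offset32, _comment_length) = EOCDR.unpack_from(data, eocdr_offset)
    if count_on_last_disk16 != count16:
        raise ValueError("entryCountOnLastDisk16 SHALL equal entryCount16")
    if zip64_eocdr_offset is None:
        # The two ZIP64 end structs are omitted.
        if last_disk != 0 or start_disk != 0:
            raise ValueError("Multi-Disk is not supported")
        if count16 == 0xFFFF or offset32 == 0xFFFFFFFF or size32 == 0xFFFFFFFF:
            raise ValueError("maximum value without the ZIP64 end structs")
        return count16, offset32, size32
    # The two ZIP64 end structs are present.
    (sig, size_plus_44, _ver_max, _file_system, _ver_min, eocdr_disk, cd_disk,
     count_on_last_disk64, count64, size64, offset64) = ZIP64_EOCDR.unpack_from(
        data, zip64_eocdr_offset)
    if sig != 0x06064B50:
        raise ValueError("bad Zip64EndOfCentralDirectoryRecord signature")
    if size_plus_44 != 44:
        raise ValueError("zip64 extensible data sector is not supported")
    if eocdr_disk != 0 or cd_disk != 0 or count_on_last_disk64 != count64:
        raise ValueError("Multi-Disk is not supported")
    if not (count16 == 0xFFFF or offset32 == 0xFFFFFFFF or size32 == 0xFFFFFFFF):
        raise ValueError("no EndOfCentralDirectoryRecord field is at its maximum")
    if last_disk not in (0, 0xFFFF) or start_disk not in (0, 0xFFFF):
        raise ValueError("disk fields SHALL be 0 or 0xFFFF")
    return count64, offset64, size64
```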
### `extraFields`
The `extraFields` bytes encode 0 or more variable-sized `ExtraField` structs.
The `ExtraField` structs are contiguous with no padding before or between them,
however there MAY be up to 3 bytes at the end of the `extraFields` bytes (after the last `ExtraField` if any).
If there are such bytes, their values SHALL be `0`.
See Endnote `AndroidAlignment`.
Note: Despite the name, some extra fields are critical to understanding the structure of a ZIP file;
parsing the `extraFields` is not optional.
The following defines **`ExtraField`**:
```
struct ExtraField = {
    // Name:  Type;      //Offset, Size | Comment
    tag:      uint16le;  // 0, 2        |
    dataSize: uint16le;  // 2, 2        | See data
    data:     bytes;     // 4, dataSize |
};
```
The size of an `ExtraField` is `4 + dataSize`.
An `ExtraField` SHALL NOT exceed the bounds of the `extraFields` that contains it.
See Endnote `ExtraFieldBufferOverflow`.
The `tag` identifies the meaning of the `data` in each `ExtraField`.
The following `tag` values are assigned meaning by this specification:
- `0x0000` - Padding - documented below.
- `0x0001` - `Zip64ExtendedInformation` - documented in File Sizes.
- `0x000a` - `NtfsTimestamp` - documented in `lastModifiedTimestamp`.
- `0x5455` - `InfoZipUniversalTimestamp` - documented in `lastModifiedTimestamp`.
- `0x7075` - `InfoZipUnicodePath` - documented in `fileName`.
Note that APPNOTE assigns meaning to many more `tag` values than this,
but the authors of Common ZIP have determined that only the above listed `tag` values are worth documenting.
A ZIP file MAY contain arbitrary `tag` values that a reader does not recognize.
A reader SHOULD ignore any `ExtraField` structs with unrecognized `tag` values.
`ExtraField` structs MAY appear in any order.
Multiple `ExtraField` structs in the same `extraFields` SHALL NOT have the same `tag` value,
where the `tag` is one of `0x0001`, `0x000a`, `0x5455`, or `0x7075`.
Note that duplicate Padding `ExtraField` structs are allowed.
If an `ExtraField` `tag` is `0x0000`, it represents **Padding**, the `dataSize` SHALL be `0`, and the extra field has no meaning.
Padding extra fields MAY appear multiple times.
Note `tag` value `0x0000` is not given a name or definition in APPNOTE,
but it is sometimes used to add alignment padding in ZIP files.
See Endnote `AndroidAlignment`.
Note that a writer MAY simply add arbitrarily many `0` bytes to the end of the `extraFields` (only limited by size encodable in `extraFieldsLength`)
without any consideration for the structure of the Padding `ExtraField` or the up to 3 bytes allowed at the end of the `extraFields`.
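Note: the following Python sketch is non-normative. It splits `extraFields` into `(tag, data)` pairs while enforcing the rules of this section: each `ExtraField` stays within bounds, Padding has a `dataSize` of `0`, the four named tags are not duplicated, and up to 3 trailing bytes are all `0`. The function name `parse_extra_fields` is hypothetical.

```python
import struct
from typing import List, Tuple

PADDING_TAG = 0x0000
UNIQUE_TAGS = {0x0001, 0x000A, 0x5455, 0x7075}


def parse_extra_fields(extra_fields: bytes) -> List[Tuple[int, bytes]]:
    fields = []
    seen = set()
    pos = 0
    # An ExtraField needs at least 4 bytes for tag and dataSize.
    while len(extra_fields) - pos >= 4:
        tag, data_size = struct.unpack_from("<HH", extra_fields, pos)
        end = pos + 4 + data_size
        if end > len(extra_fields):
            raise ValueError("ExtraField exceeds the bounds of extraFields")
        if tag == PADDING_TAG and data_size != 0:
            raise ValueError("Padding SHALL have a dataSize of 0")
        if tag in UNIQUE_TAGS:
            if tag in seen:
                raise ValueError("duplicate ExtraField tag 0x%04x" % tag)
            seen.add(tag)
        fields.append((tag, extra_fields[pos + 4:end]))
        pos = end
    # At most 3 bytes remain here, and their values SHALL be 0.
    if any(extra_fields[pos:]):
        raise ValueError("trailing bytes in extraFields SHALL be 0")
    return fields
```

With this approach a run of `0` bytes of any length is accepted: each group of 4 bytes parses as a Padding `ExtraField`, and the remaining 0 to 3 bytes are the permitted trailing bytes.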
### File Sizes
Note this section describes how the location and size of an entry's `fileData` is encoded
and how the size of an entry's `contents` is encoded.
- In the context of a `LocalFileHeader` when General Purpose Bit 3 is not set, the abstract fields `compressedSize` and `uncompressedSize` are encoded in `compressedSize32`, `uncompressedSize32`, and any `Zip64ExtendedInformation` in the `extraFields`.
- In the context of a `LocalFileHeader` when General Purpose Bit 3 is set, see Deferred Lengths.
- In the context of a `CentralDirectoryHeader`, the abstract fields `localFileHeaderOffset`, `compressedSize`, and `uncompressedSize` are encoded in `localFileHeaderOffset32`, `compressedSize32`, `uncompressedSize32`, and any `Zip64ExtendedInformation` in the `extraFields`.
Note that `compressedSize32` and `uncompressedSize32` are effectively overridden by `compressedSize64` and `uncompressedSize64` respectively, the latter being located in a `Zip64ExtendedInformation` extra field;
the exact details are documented below.
If an `ExtraField` `tag` is `0x0001`, then the `data` is a `Zip64ExtendedInformation` struct.
Note that a `Zip64ExtendedInformation` struct, if present, is critical to understanding the structure of the ZIP file;
it is not optional metadata despite being in the `extraFields`.
The meaning of fields in a `LocalFileHeader` or a `CentralDirectoryHeader` depends on the presence or absence of a `Zip64ExtendedInformation` in the respective `extraFields`.
In this section, `uncompressedSize32`, `compressedSize32`, `compressionMethod`, and `generalPurposeBits` refer to fields
in the `LocalFileHeader` or `CentralDirectoryHeader` containing the `extraFields` that contains the `Zip64ExtendedInformation`,
`localFileHeaderOffset32` and `localFileHeaderDisk16` refer to the fields in the `CentralDirectoryHeader`,
and `fileData` and `contents` refer to the `fileData` and corresponding `contents` following the `LocalFileHeader`.
Each field of a `Zip64ExtendedInformation` struct is only present if the field it extends is set to its maximum value: `0xFFFFFFFF` for `uint32le` fields or `0xFFFF` for `uint16le` fields.
If a field is not present, it effectively has 0 size.
In the context of a `LocalFileHeader`, the fields `localFileHeaderOffset64` and `localFileHeaderDisk32` are never present.
The following defines **`Zip64ExtendedInformation`**:
```
struct Zip64ExtendedInformation = {
    // Name:                 Type;     // Size   | Extends
    uncompressedSize64:      uint64le; // 8 or 0 | uncompressedSize32
    compressedSize64:        uint64le; // 8 or 0 | compressedSize32
    localFileHeaderOffset64: uint64le; // 8 or 0 | localFileHeaderOffset32
    localFileHeaderDisk32:   uint32le; // 4 or 0 | localFileHeaderDisk16
};
```
The size of a `Zip64ExtendedInformation` determined by `dataSize` SHALL match its size determined by the set of fields that are present.
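The presence rule and the size check above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `parse_zip64_extended_information` and its parameters are hypothetical, and passing `None` for the offset and disk fields is this sketch's way of modelling the `LocalFileHeader` context, where those fields are never present.

```python
import struct
from typing import Dict, Optional


def parse_zip64_extended_information(
    data: bytes,
    uncompressed_size32: int,
    compressed_size32: int,
    local_file_header_offset32: Optional[int] = None,
    local_file_header_disk16: Optional[int] = None,
) -> Dict[str, int]:
    """Decode the fields that are present, in struct order.

    Pass None for the last two arguments in the context of a LocalFileHeader.
    """
    # (name, struct format, present?) in the order the fields appear.
    layout = [
        ("uncompressedSize64", "<Q", uncompressed_size32 == 0xFFFFFFFF),
        ("compressedSize64", "<Q", compressed_size32 == 0xFFFFFFFF),
        ("localFileHeaderOffset64", "<Q", local_file_header_offset32 == 0xFFFFFFFF),
        ("localFileHeaderDisk32", "<I", local_file_header_disk16 == 0xFFFF),
    ]
    fields: Dict[str, int] = {}
    pos = 0
    for name, fmt, present in layout:
        if not present:
            continue  # an absent field effectively has 0 size
        size = struct.calcsize(fmt)
        if pos + size > len(data):
            raise ValueError("Zip64ExtendedInformation is shorter than its present fields")
        (fields[name],) = struct.unpack_from(fmt, data, pos)
        pos += size
    if pos != len(data):
        raise ValueError("dataSize does not match the set of present fields")
    return fields
```

For example, a `LocalFileHeader` whose two 32-bit size fields are both `0xFFFFFFFF` would be expected to carry exactly 16 bytes of `data`.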
The following intermediate values are referenced below:
- * Let `uncompressedSize` be the length of the `contents`.
- * Let `compressedSize` be the length of the `fileData`, which is equal to `uncompressedSize` when `compressionMethod` is 0 (stored).
- * Let `localFileHeaderOffset` be the offset of the start of the `LocalFileHeader` in the ZIP file.
If a `Zip64ExtendedInformation` is present:
- * One of the following SHALL be true: `uncompressedSize` is less than `0xFFFFFFFF` and `uncompressedSize32` is `uncompressedSize`, or `uncompressedSize32` is `0xFFFFFFFF` and `uncompressedSize64` is `uncompressedSize`, or `generalPurposeBits` bit 3 is set in the context of a `LocalFileHeader`.
- * One of the following SHALL be true: `compressedSize` is less than `0xFFFFFFFF` and `compressedSize32` is `compressedSize`, or `compressedSize32` is `0xFFFFFFFF` and `compressedSize64` is `compressedSize`, or `generalPurposeBits` bit 3 is set in the context of a `LocalFileHeader`.
- * In the context of a `CentralDirectoryHeader`, one of the following SHALL be true: `localFileHeaderOffset` is less than `0xFFFFFFFF` and `localFileHeaderOffset32` is `localFileHeaderOffset`, or `localFileHeaderOffset32` is `0xFFFFFFFF` and `localFileHeaderOffset64` is `localFileHeaderOffset`.
- * In the context of a `CentralDirectoryHeader`, one of the following SHALL be true, indicating that the Multi-Disk obscure feature is not used: `localFileHeaderDisk16` is `0`, or `localFileHeaderDisk16` is `0xFFFF` and `localFileHeaderDisk32` is `0`.
If a `Zip64ExtendedInformation` is not present:
- * One of the following SHALL be true: the `uncompressedSize32` is `uncompressedSize`, or `generalPurposeBits` bit 3 is set in the context of a `LocalFileHeader`.
- * One of the following SHALL be true: the `compressedSize32` is `compressedSize`, or `generalPurposeBits` bit 3 is set in the context of a `LocalFileHeader`.
- * In the context of a `CentralDirectoryHeader`, the `localFileHeaderOffset32` SHALL be `localFileHeaderOffset`.
- * In the context of a `CentralDirectoryHeader`, the `localFileHeaderDisk16` SHALL be `0`, indicating that the Multi-Disk obscure feature is not used.
Note that a `Zip64ExtendedInformation` MAY be present even if it has `0` size,
and a field MAY be extended even if the increased size is not required to encode the value.
Note that in the context of a `LocalFileHeader`, if `generalPurposeBits` bit 3 is set,
then any `Zip64ExtendedInformation` SHALL have `0` size, because `uncompressedSize32` and `compressedSize32` are both `0`.
### Deferred Lengths
TODO: move it all here.
### `fileName`
This section describes how an entry's `fileName` is encoded in the context of a `CentralDirectoryHeader` and `LocalFileHeader` from the following:
- * `fileNameRaw`
- * General Purpose Bit 11
- * `InfoZipUnicodePath` in `extraFields`
NOTE: an `InfoZipUnicodePath` is critical to understanding an entry's `fileName` despite being in the `extraFields`.
Also, understanding an entry's `fileName` is critical for security. See Endnote `ConsistencyForSecurity`.
NOTE: Most of this section's complexity is to support reading ZIP files created by long-obsolete ZIP implementations as of 2026.
Writing a `fileName` is much simpler than reading it.
A writer SHOULD simply set General Purpose Bit 11, SHOULD NOT include an `InfoZipUnicodePath`, and SHOULD encode the entry's `fileName` in `fileNameRaw` in UTF-8 using forward slash `/` as the path delimiter;
see Writing `LocalFileHeader`, `fileData`, and `DataDescriptor` for more discussion.
If an `ExtraField` `tag` is `0x7075` (`"up"` in ASCII), `dataSize` SHALL be at least `6`, and `data` is an `InfoZipUnicodePath` struct.
TODO: what if `dataSize` is `5` or less? Should that be a big error or just skip the `InfoZipUnicodePath` extra field?
The following defines **`InfoZipUnicodePath`**:
```
struct InfoZipUnicodePath = {
    // Name:     Type;     // Offset, Size         | Comment
    version:     uint8le;  // 0,      1            | SHALL be 1
    oldCrc32:    uint32le; // 1,      4            |
    newFileName: bytes;    // 5,      dataSize - 5 |
};
```
The size of an `InfoZipUnicodePath` is `dataSize` (in the containing `ExtraField`).
The length of `newFileName` is `dataSize - 5`.
Let the intermediate value `hasInfoZipUnicodePath` be `1` if all of the following are true, and `0` otherwise:
- * there is an `InfoZipUnicodePath` extra field,
- * `version` is `1`,
- * and `oldCrc32` is the CRC32 of `fileNameRaw`.
The intermediate value `fileNameUtf8` is defined by the following:
- * If `hasInfoZipUnicodePath` is `1`, let `fileNameUtf8` be `newFileName`.
- * Otherwise, if General Purpose Bit 11 is set, let `fileNameUtf8` be `fileNameRaw` interpreted in UTF-8.
- * Otherwise, let `fileNameUtf8` be `fileNameRaw` interpreted in an implementation-defined ASCII-based character encoding. See Endnote `DefaultCharacterEncoding`.
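The selection of `fileNameUtf8` can be sketched in Python as follows. The function name `resolve_file_name_utf8` and its parameters are hypothetical. Two choices here are assumptions of the sketch rather than requirements of this specification: the fallback encoding defaults to CP437 (the encoding is implementation-defined), and an `InfoZipUnicodePath` whose `data` is shorter than `6` bytes is treated as if it were absent (that case is an open TODO above).

```python
import zlib
from typing import Optional


def resolve_file_name_utf8(
    file_name_raw: bytes,
    bit11_set: bool,
    unicode_path_data: Optional[bytes] = None,
    fallback_encoding: str = "cp437",  # assumption: implementation-defined choice
) -> str:
    """Return fileNameUtf8 decoded to a str."""
    # hasInfoZipUnicodePath: the extra field exists, version is 1,
    # and oldCrc32 is the CRC32 of fileNameRaw.
    if unicode_path_data is not None and len(unicode_path_data) >= 6:
        version = unicode_path_data[0]
        old_crc32 = int.from_bytes(unicode_path_data[1:5], "little")
        if version == 1 and old_crc32 == zlib.crc32(file_name_raw):
            return unicode_path_data[5:].decode("utf-8")
    if bit11_set:
        return file_name_raw.decode("utf-8")
    return file_name_raw.decode(fallback_encoding)
```

A stale `oldCrc32` makes the sketch fall through to the other two rules, which mirrors the `hasInfoZipUnicodePath` definition. Invalid UTF-8 raises `UnicodeDecodeError`, which a reader would likely surface as an error.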
TODO: consider placing stricter requirements on the correspondence of `newFileName` and `fileNameRaw` to guard against malicious ambiguities;
e.g. perhaps all the printable ascii characters between the two must match, or perhaps `newFileName` must contain non-ascii characters,
or perhaps the below validation requirements must also pass on `fileNameRaw`.
Let the entry's `fileName` be the result of performing the following checks and transformations on `fileNameUtf8`.
- 1. Let `s` be the result of replacing each backslash `\` with forward slash `/` in `fileNameUtf8`. See Endnote `DotNetBackslashes`.
- 2. The length of `s` must be greater than `0`.
- 3. `s` must not be exactly a single full stop `.`. TODO: link to research where sunzip corrupts the output dir.
- 4. `s` must not begin with slash `/`. Note: This would signal an absolute path. See Endnote `PathTraversal`.
- 5. `s` must not begin with an ASCII letter (`0x41` `A` to `0x5A` `Z` inclusive or `0x61` `a` to `0x7A` `z` inclusive) followed by colon `:`. Note: This would signal an absolute path on Windows.
- 6. No slash-delimited path segment of `s` is empty or one or two consecutive full stops `.` or `..`. In other words, there must be no occurrence of `//` in `s` and every occurrence of either `.` or `..` in `s` must not be BOTH: either at the beginning of `s` or after a slash `/`; AND either at the end of `s` or before a slash `/`. Note: This would encode either a non-normalized path or a relative path that escapes the output directory. See Endnote `NonNormalizedPaths`.
- 7. Let `fileName` be `s`.
TODO: consider allowing `./` as a file name, because most reader implementations seem to accept it. Can we find a writer implementation that produces `./` by default?
A reader SHOULD produce an error when an entry's `fileName` fails the above checks.
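The checks and transformations above can be sketched in Python as follows. The function name `normalize_file_name` is hypothetical. The sketch reads step 6 as permitting a single trailing `/`, on the assumption that this is what allows `fileName` to end with `/` for directories.

```python
import re


def normalize_file_name(file_name_utf8: str) -> str:
    """Apply steps 1 to 7 and return fileName, or raise ValueError."""
    s = file_name_utf8.replace("\\", "/")  # step 1
    if len(s) == 0:  # step 2
        raise ValueError("file name is empty")
    if s == ".":  # step 3
        raise ValueError("file name is a single full stop")
    if s.startswith("/"):  # step 4
        raise ValueError("absolute path")
    if re.match(r"[A-Za-z]:", s):  # step 5
        raise ValueError("Windows absolute path")
    # Step 6: no empty, "." or ".." segments.
    segments = s.split("/")
    if s.endswith("/"):
        segments.pop()  # assumption: one trailing slash is not an empty segment
    for segment in segments:
        if segment in ("", ".", ".."):
            raise ValueError("non-normalized or escaping path segment")
    return s  # step 7
```

For example, `normalize_file_name("dir1\\dir2\\file.txt")` yields `dir1/dir2/file.txt`, while `a/../b` and `a//b` are rejected. A name such as `..a` is accepted, because only whole segments equal to `.` or `..` are forbidden.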
`fileName` is a slash-delimited file system path, e.g. `"dir1/dir2/file.txt"`.
The slash direction is always forward slash, because step 1 above canonicalizes backslashes.
Note that on some platforms, such as Windows, it may be appropriate for readers to replace slash with backslash before using the file system path.
A writer SHOULD always use a forward slash as the path delimiter, which may require replacing backslash with slash in some contexts.
NOTE: `fileName` CAN end with `/`; see `fileType` for discussion.
NOTE: Despite Windows absolute paths being valid relative paths on some systems, e.g. Linux, a reader SHOULD always produce an error when encountering a Windows absolute path regardless of the system on which the reader is running.
See Extracting to a File System for a discussion of case sensitivity and other complications.
### `lastModifiedTimestamp`
This section describes the meaning of the following in the context of a `CentralDirectoryHeader` and `LocalFileHeader`:
**`dosTimestamp`** encodes `lastModifiedTimestamp` as an MS-DOS Date Time, a bitpacked integer described below.
Note the `Shift & Mask` column is redundant with the `Bits` column and is provided for convenience.
```
Bits  | Shift & Mask  | Range | Description
0-4   | (x>>0)&0x1f   | 0-29  | Seconds divided by 2
5-10  | (x>>5)&0x3f   | 0-59  | Minutes
11-15 | (x>>11)&0x1f  | 0-23  | Hours on a 24-hour clock
16-20 | (x>>16)&0x1f  | 1-31  | Day of the month
21-24 | (x>>21)&0xf   | 1-12  | Month
25-31 | (x>>25)&0x7f  | 0-119 | Years since 1980
```
As a special case, `dosTimestamp` MAY be `0`, which means that the `dosTimestamp` field has no meaning.
See Endnote `DosTimestampZero`.
If `dosTimestamp` is not `0`, then every bit-packed value SHALL fit in the range given by the `Range` column, inclusive.
Additionally, the `Day of the month` value SHALL NOT exceed the number of days in the encoded month taking into account leap years.
Note the minimum timestamp encodable is `1980-01-01T00:00:00` encoded as `0x00210000`, and the maximum is `2099-12-31T23:59:58` encoded as `0xef9fbf7d`.
Note that every year encoded by a `Years since 1980` value that is divisible by `4` is a leap year; year 2100 would be an exception, but it is not in bounds.
Note that it is invalid to encode in the `dosTimestamp` any moment since 1972 when the clock shows `60` seconds past the minute due to Leap Seconds;
the `Seconds divided by 2` is still not permitted to be `30` even for those moments.
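Decoding and range-checking a `dosTimestamp` can be sketched in Python as follows. The function name `decode_dos_timestamp` is hypothetical, and returning `None` for the special value `0` is a choice of this sketch.

```python
import calendar
from typing import Optional, Tuple


def decode_dos_timestamp(x: int) -> Optional[Tuple[int, int, int, int, int, int]]:
    """Return (year, month, day, hour, minute, second), or None if x is 0."""
    if x == 0:
        return None  # the field has no meaning
    half_seconds = (x >> 0) & 0x1F
    minute = (x >> 5) & 0x3F
    hour = (x >> 11) & 0x1F
    day = (x >> 16) & 0x1F
    month = (x >> 21) & 0x0F
    years_since_1980 = (x >> 25) & 0x7F
    if half_seconds > 29 or minute > 59 or hour > 23 or years_since_1980 > 119:
        raise ValueError("time or year out of range")
    if not 1 <= month <= 12:
        raise ValueError("month out of range")
    year = 1980 + years_since_1980
    # monthrange returns (weekday of day 1, number of days in the month).
    if not 1 <= day <= calendar.monthrange(year, month)[1]:
        raise ValueError("day out of range for the month")
    return (year, month, day, hour, minute, half_seconds * 2)
```

The result is a naive local date and time, consistent with the format carrying no timezone information.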
The MS-DOS Date Time format does not encode any timezone information.
Readers and writers SHOULD use the system local timezone.
This means that the `dosTimestamp` does not encode a precise moment in history,
but rather a subjective date and time of day relative to whatever system is working with the ZIP file.
If it is necessary to encode a precise moment in history in UTC+00:00 timezone,
a writer SHOULD use `InfoZipUniversalTimestamp` (`0x5455`),
or if more range and/or precision is needed, the writer SHOULD use `NtfsTimestamp` (`0x000a`).
If an `ExtraField` `tag` is `0x5455` (`UT` in ASCII), the `dataSize` SHALL be at least `5`, and the `data` is an `InfoZipUniversalTimestamp` struct.
The following defines **`InfoZipUniversalTimestamp`**:
```
struct InfoZipUniversalTimestamp = {
    // Name: Type;     // Offset, Size     | Comment
    flags:   uint8le;  // 0,      1        | See below
    mtime:   uint32le; // 1,      4        | See below
    theRest: bytes;    // 5,      variable |
};
```
The `flags` field SHALL have bit 0 set `0x01`; a writer SHOULD set `flags` to `1`.
Common ZIP assigns no meaning to bits other than bit 0 in `flags` and assigns no meaning to `theRest`; a reader SHOULD ignore them.
Note: Info-ZIP's definition of this struct allows for atime and ctime to be included as well, but Common ZIP discourages readers and writers from supporting those beyond simply ignoring them.
See Endnote `ModifiedTimestampVsOtherTimestamps`.
`mtime` encodes `lastModifiedTimestamp` as a POSIX Time in the range `0` to `2147483647` inclusive, effectively a 31-bit integer,
encoding a moment in time in the range `1970-01-01T00:00:00Z` to `2038-01-19T03:14:07Z` inclusive.
Despite many implementations interpreting POSIX Time values as signed integers, POSIX Time is officially undefined when negative,
so this specification limits the range for maximum compatibility.
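Reading the `mtime` from an `InfoZipUniversalTimestamp` can be sketched in Python as follows. The function name `parse_info_zip_universal_timestamp` is hypothetical, and raising an error (rather than skipping the extra field) when a requirement is violated is an assumption of this sketch.

```python
import struct


def parse_info_zip_universal_timestamp(data: bytes) -> int:
    """Return mtime as a POSIX Time in the range 0 to 2147483647."""
    if len(data) < 5:
        raise ValueError("dataSize is less than 5")
    # "<" selects little-endian with no padding, so "<BI" is exactly 5 bytes.
    flags, mtime = struct.unpack_from("<BI", data, 0)
    if not flags & 0x01:
        raise ValueError("flags bit 0 is not set, so mtime is not present")
    if mtime > 2147483647:
        raise ValueError("mtime exceeds the 31-bit range")
    # Other flag bits and any bytes after offset 5 (theRest) are ignored.
    return mtime
```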
TODO: reconcile this with a windows timestamp field if present.
If an `ExtraField` `tag` is `0x000a`, the `dataSize` SHALL be `32`, and the `data` is an `NtfsTimestamp` struct.
The following defines **`NtfsTimestamp`**:
```
struct NtfsTimestamp = {
    // Name:   Type;     // Offset, Size | Comment
    reserved:  uint32le; // 0,      4    | SHALL be 0
    innerTag:  uint16le; // 4,      2    | SHALL be 1
    innerSize: uint16le; // 6,      2    | SHALL be 24
    mtime:     uint64le; // 8,      8    |
    atime:     uint64le; // 16,     8    |
    ctime:     uint64le; // 24,     8    |
};
```
The first three fields `reserved`, `innerTag`, and `innerSize`, SHALL be `0`, `1`, and `24` respectively.
Note: Considered as a single `uint64le`, their combined value is `0x0018000100000000` i.e. `{0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x18, 0x00}` i.e. `"\x00\x00\x00\x00\x01\x00\x18\x00"`.
If the `dataSize`, `reserved`, `innerTag`, and/or `innerSize` deviate from their expected values, a reader SHOULD ignore the extra field.
Such a situation would suggest that a future version of the `NtfsTimestamp` has been introduced with a structure unrecognized by the reader.
The fields `atime` and `ctime` are assigned no meaning by Common ZIP.
A writer SHOULD set them to `0`; a reader SHOULD ignore `atime` and `ctime`.
See Endnote `ModifiedTimestampVsOtherTimestamps`.
`mtime` encodes `lastModifiedTimestamp` as an NTFS File Time.
`mtime` SHALL be in the range `0` to `2650152384000000000` inclusive,
representing a moment in time in the range `1601-01-01T00:00:00Z` to `9999-01-01T00:00:00Z`.
Note: Microsoft gives no definitive limit to an NTFS File Time;
Common ZIP imposes a rather permissive limit as a compromise between legitimate use cases and guarding against accidental corruption.
TODO: Move this justification to an Endnote and provide evidence.
Note: To convert an NTFS File Time to a POSIX Time,
divide by `10000000` and subtract `11644473600`, which is the number of seconds between the epochs
(`369` years including `89` leap days; `134774` days with `86400` seconds each).
Note: The Gregorian Calendar has been in use since 1582.
Note: Leap Seconds were first introduced in 1972, and also are not encodable in either POSIX Time or NTFS File Time values.
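The NTFS File Time to POSIX Time conversion described above can be sketched in Python as follows. The names are hypothetical, and truncating the sub-second part with floor division is a choice of this sketch.

```python
NTFS_TICKS_PER_SECOND = 10_000_000  # NTFS File Time counts 100-nanosecond ticks
EPOCH_DIFFERENCE_SECONDS = 134_774 * 86_400  # 1601-01-01 to 1970-01-01
NTFS_MTIME_MAX = 2_650_152_384_000_000_000  # 9999-01-01T00:00:00Z


def ntfs_file_time_to_posix(mtime: int) -> int:
    """Convert an NTFS File Time to whole POSIX seconds (may be negative)."""
    if not 0 <= mtime <= NTFS_MTIME_MAX:
        raise ValueError("mtime is outside the range permitted by Common ZIP")
    return mtime // NTFS_TICKS_PER_SECOND - EPOCH_DIFFERENCE_SECONDS
```

Values before 1970 produce negative results, so a caller that also needs to write an `InfoZipUniversalTimestamp` would have to check the result against that struct's narrower range.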
### APPNOTE Compatibility
This section describes the meaning of the following in the context of a `CentralDirectoryHeader` and `LocalFileHeader`:
Note: The fields `appnoteCompatibilityMax` and `fileSystemCompatibility` are identified in APPNOTE as `version made by`,
and `appnoteCompatibilityMin` is identified in APPNOTE as `version needed to extract`.
The fields were separated and renamed in this specification to better encode their functions.
Note: These fields also appear in the `Zip64EndOfCentralDirectoryRecord`, but that usage is not discussed here.
Common ZIP assigns no meaning to `appnoteCompatibilityMax`.
Writers SHOULD set `appnoteCompatibilityMax` to `63` indicating that implementers are aware of APPNOTE version 6.3.
Readers SHOULD ignore `appnoteCompatibilityMax`.
TODO: link to research verifying that it really doesn't do anything.
A writer SHALL set `appnoteCompatibilityMin` no higher than `45` meaning APPNOTE version 4.5.
A writer SHOULD set `appnoteCompatibilityMin` to the minimum value in the below table as specified,
but MAY set the value higher than strictly needed, such as unconditionally setting the value to `45`.
- * `10` meaning APPNOTE 1.0. This value means that none of the below features are required.
- * `20` meaning APPNOTE 2.0. This value means that `compressionMethod` is 8 (deflated) and/or that `generalPurposeBits` bit 3 is set indicating deferred lengths, and that the below feature is not required.
- * `45` meaning APPNOTE 4.5. This value means that `extraFields` contains a `Zip64ExtendedInformation`.
- * Note that although `generalPurposeBits` bit 11 indicating UTF-8 file names was first introduced in APPNOTE 6.3.0, the presence of this feature is not represented in `appnoteCompatibilityMin`, and some implementations will reject archives with `appnoteCompatibilityMin` set higher than `45`, despite fully supporting UTF-8 File Names. See Endnote `AppnoteCompatibilityMinBeyond45`.
The reader SHOULD require `appnoteCompatibilityMin` to be at most `63` meaning APPNOTE version 6.3,
unless the reader supports the Future APPNOTE Versions obscure feature.
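How a writer might pick the minimum `appnoteCompatibilityMin` from the table above can be sketched in Python as follows. The function name `choose_appnote_compatibility_min` and its parameters are hypothetical.

```python
def choose_appnote_compatibility_min(
    compression_method: int,
    general_purpose_bits: int,
    has_zip64_extended_information: bool,
) -> int:
    """Return the smallest table value that covers the features in use."""
    if has_zip64_extended_information:
        return 45  # APPNOTE 4.5
    if compression_method == 8 or (general_purpose_bits & 0x0008):
        return 20  # APPNOTE 2.0: deflated and/or deferred lengths (bit 3)
    return 10  # APPNOTE 1.0


# Bit 11 (UTF-8 file names) deliberately has no effect on the result.
assert choose_appnote_compatibility_min(0, 0x0800, False) == 10
```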
### `fileType`
This section describes how the `fileType` is encoded in a `CentralDirectoryHeader`.
Note that it is not possible to determine `fileType` only from a `LocalFileHeader`.
The following fields are relevant:
- * **`fileSystemCompatibility`**
- * **`externalFileAttributes`**
- * `fileName` - see the dedicated section for how this derived field is encoded.
Note: The fields `appnoteCompatibilityMax` and `fileSystemCompatibility` are identified in APPNOTE as `version made by`.
The fields were separated and renamed in this specification to better encode their functions.
Note: `fileSystemCompatibility` also appears in the `Zip64EndOfCentralDirectoryRecord`, but that usage is not discussed here.
`fileSystemCompatibility` SHALL be one of the following values.
A reader SHOULD require `fileSystemCompatibility` be one of the following values, unless it supports the More File Systems obscure feature:
- * `0` DOS
- * `3` UNIX
This section uses these intermediate variables:
- * Let `endsWithSlash` be `1` if `fileName` ends with a `/`, and `0` otherwise.
- * Let `mode` be the upper 16 bits of `externalFileAttributes` (`externalFileAttributes >> 16`).
- * Let `ifmt` be the upper 4 bits of `mode` (`externalFileAttributes >> 28`).
- * Let `perms` be the lower 12 bits of `mode` (`(externalFileAttributes >> 16) & 0o7777`).
- * Let `isExecutable` be `1` if `(perms & 0o111) != 0`, and `0` otherwise.
One of the following file type conditions SHALL be met. A reader SHOULD use these to determine the `fileType`:
- * The `fileType` is `FILE` if and only if `endsWithSlash` is `0`, and either: `fileSystemCompatibility` is `0` (DOS); OR `fileSystemCompatibility` is `3` (UNIX), `ifmt` is `8`, and `isExecutable` is `0`. TODO: change to `ifmt` is not `10`?
- * The `fileType` is `POSIX_EXECUTABLE` if and only if `endsWithSlash` is `0`, `fileSystemCompatibility` is `3` (UNIX), `ifmt` is `8`, and `isExecutable` is `1`.
- * The `fileType` is `DIRECTORY` if and only if `endsWithSlash` is `1`.
- * The `fileType` is `SYMLINK` if and only if `endsWithSlash` is `0`, `fileSystemCompatibility` is `3` (UNIX), and `ifmt` is `10`.
Each of the file types SHOULD be encoded by setting `fileSystemCompatibility` to `3` (UNIX), and:
- * If `fileType=FILE`, set `externalFileAttributes` to `0o100644 << 16` (`0x81a40000`), and do not end `fileName` with a `/`.
- * If `fileType=POSIX_EXECUTABLE`, set `externalFileAttributes` to `0o100755 << 16` (`0x81ed0000`), and do not end `fileName` with a `/`.
- * If `fileType=DIRECTORY`, set `externalFileAttributes` to `0o040755 << 16` (`0x41ed0000`), and end `fileName` with a `/`.
- * If `fileType=SYMLINK`, set `externalFileAttributes` to `0o120777 << 16` (`0xa1ff0000`), and do not end `fileName` with a `/`.
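Determining `fileType` when reading can be sketched in Python as follows. The function name `classify_file_type` is hypothetical. The sketch assumes `ifmt` is the top 4 bits of the 32-bit `externalFileAttributes` (a shift of `28`), so that a regular file has `ifmt` `8` and a symlink has `ifmt` `10`, matching the recommended values `0o100644 << 16` and `0o120777 << 16` listed above.

```python
def classify_file_type(
    file_name: str,
    file_system_compatibility: int,
    external_file_attributes: int,
) -> str:
    """Return FILE, POSIX_EXECUTABLE, DIRECTORY, or SYMLINK, or raise ValueError."""
    ends_with_slash = file_name.endswith("/")
    mode = external_file_attributes >> 16
    ifmt = external_file_attributes >> 28  # assumption: top 4 bits of mode
    perms = mode & 0o7777
    is_executable = (perms & 0o111) != 0

    if ends_with_slash:
        return "DIRECTORY"
    if file_system_compatibility == 0:  # DOS
        return "FILE"
    if file_system_compatibility == 3:  # UNIX
        if ifmt == 10:
            return "SYMLINK"
        if ifmt == 8:
            return "POSIX_EXECUTABLE" if is_executable else "FILE"
    raise ValueError("no file type condition is met")


assert classify_file_type("run.sh", 3, 0o100755 << 16) == "POSIX_EXECUTABLE"
```

Only the file type is derived here; the permission bits themselves are not returned, in line with the guidance on untrusted inputs.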
Note: Although the `mode` expresses the intended permission bits of an extracted file, readers are encouraged to guard against untrusted inputs by only determining the `fileType`, and then setting permission bits accordingly.
See Extracting to a File System for details.
See also Symlinks.
#### Symlinks
TODO: length limit recommendation. path traversal vulnerabilities. normalization. ancestor directory vs symlink. Permission bits?
### `internalFileAttributes`
Common ZIP assigns no meaning to this field.
It SHOULD be set to `0` when writing and SHOULD be ignored when reading.
Note: APPNOTE defines two uses for this field, neither of which is relevant to modern use cases: "apparently an ASCII or text file", and "mainframe data transfer support".
### `fileComment`
TODO
### `generalPurposeBits`
TODO: reword this section to impose restrictions on zip files not on readers and writers.
The `generalPurposeBits` field appears in the `CentralDirectoryHeader` and `LocalFileHeader` structs.
Each individual bit has a distinct meaning.
This specification refers to the individual bits by number.
The `Feature` column below indicates that the bit is:
- * obscure meaning only set to `1` if an obscure feature is required to read the entry.
- * ignore meaning the bit has no effect on reading.
- * critical meaning this specification requires readers support the feature.
- * reserved meaning APPNOTE defines no meaning for the bit, and if the value is not `0`, it means a future version of APPNOTE has given it meaning which a reader might not support.
NOTE: the `Shift` and `Mask` columns are redundant with the `Bit` column and are provided for convenience.
```
Bit | Shift   | Mask   | Feature
0   | 1 << 0  | 0x0001 | obscure: Traditional Encryption
1   | 1 << 1  | 0x0002 | ignore: compression metadata
2   | 1 << 2  | 0x0004 | ignore: compression metadata
3   | 1 << 3  | 0x0008 | critical: Deferred Length
4   | 1 << 4  | 0x0010 | ignore: compression metadata
5   | 1 << 5  | 0x0020 | obscure: Patch Data
6   | 1 << 6  | 0x0040 | obscure: Strong Encryption
7   | 1 << 7  | 0x0080 | reserved
8   | 1 << 8  | 0x0100 | reserved
9   | 1 << 9  | 0x0200 | reserved
10  | 1 << 10 | 0x0400 | reserved
11  | 1 << 11 | 0x0800 | critical: UTF-8 File Name
12  | 1 << 12 | 0x1000 | reserved
13  | 1 << 13 | 0x2000 | obscure: Strong Encryption
14  | 1 << 14 | 0x4000 | reserved
15  | 1 << 15 | 0x8000 | reserved
```
For convenience:
the reserved bits all together have a mask of `0xD780`;
the reserved and obscure feature bits all together have a mask of `0xF7E1`;
the ignore bits all together have a mask of `0x0016`.
A writer SHALL set all reserved bits and bits for obscure features to `0`;
in other words, `generalPurposeBits & 0xF7E1` must be `0`.
A writer SHOULD set all bits marked `ignore: compression metadata` to `0`,
but MAY set them for some reason.
APPNOTE defines these bits to have meaning based on command line options to PKZIP,
which are not otherwise explained in APPNOTE, and most implementations simply set them to 0.
See Endnote `GeneralPurposeBits1And2`.
If a reader does not support any obscure features,
the reader SHALL require all reserved bits and bits for obscure features are `0`;
in other words, `generalPurposeBits & 0xF7E1` must be `0`.
Otherwise, a reader MAY allow bits to be set if the reader supports the feature corresponding to that bit.
The reader SHOULD ignore bits marked `ignore: compression metadata`.
These bits are related to settings used during compression; see APPNOTE for details.
The remaining bits, 3 and 11, are critical to the structure of a ZIP file.
**General Purpose Bit 3** is only meaningful in a `LocalFileHeader`; see the sections above documenting its meaning there.
A writer SHOULD set bit 3 to `0` in a `CentralDirectoryHeader`,
and a reader SHOULD ignore it there.
**General Purpose Bit 11** indicates UTF-8 encoding for the `fileNameRaw` and `fileComment` fields of the same struct. See `fileName` and `fileComment`.
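The mask checks in this section can be sketched in Python as follows, for a reader that supports no obscure features. The function name `read_general_purpose_bits` and the returned key names are hypothetical.

```python
RESERVED_MASK = 0xD780  # bits 7, 8, 9, 10, 12, 14, 15
RESERVED_AND_OBSCURE_MASK = 0xF7E1  # reserved bits plus bits 0, 5, 6, 13
IGNORE_MASK = 0x0016  # bits 1, 2, 4


def read_general_purpose_bits(general_purpose_bits: int) -> dict:
    """Reject reserved and obscure bits, then report the two critical bits."""
    if general_purpose_bits & RESERVED_AND_OBSCURE_MASK:
        raise ValueError("reserved or unsupported obscure bit is set")
    # Bits covered by IGNORE_MASK are deliberately not inspected.
    return {
        "deferredLength": bool(general_purpose_bits & 0x0008),  # bit 3
        "utf8FileName": bool(general_purpose_bits & 0x0800),  # bit 11
    }
```

A writer following the SHOULD-level guidance would only ever emit the values `0x0000`, `0x0008`, `0x0800`, or `0x0808`.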
## Recommendations for Implementations
This section is non-normative and gives advice for implementations. Any uses of "SHALL" or "SHALL NOT" in this section are redundant reminders of requirements or forbiddances in the normative sections of this document.
TODO: talk about file name collisions, including file names that are the ancestor directories of other entries, and platform-specific collision situations.
### Random-Access Reading
The most robust way to read a ZIP file is to start with the central directory at the end and jump backward to any file data of interest. Reading from a stream is discussed in a later section.
#### Starting at the end structs
Locating the `EndOfCentralDirectoryRecord` can be performed by searching backwards for the `signature`, `0x06054b50` i.e. `{0x50, 0x4b, 0x05, 0x06}` i.e. `"PK\x05\x06"`.
The signature can be located at a starting `offset` in range `fileSize - 0xffff - 22 <= offset <= fileSize - 22` (additionally clamped by the total size of the file),
and remember that the _last_ occurrence of the byte pattern is the true signature.
Typically, the signature is at offset `fileSize - 22`, because `archiveCommentLength` is typically `0`.
If the `EndOfCentralDirectoryRecord` `signature` is not found in this range, the input file is not a ZIP file.
A reader SHOULD require that `archiveCommentLength` is the expected value (`fileSize - offset - 22`) or else reject the ZIP file as corrupted.
A mismatch can occur when a writer writes an `EndOfCentralDirectoryRecord` with an `archiveCommentLength` greater than `3`,
and erroneously allows the signature bytes to appear after the intended signature, such as in the `archiveComment`, `centralDirSize32`, or some overlapping combination of adjacent fields.
Checking the `archiveCommentLength` guards against unlikely accidental corruption, but does not guarantee the reader's interpretation matches the writer's original intention;
it is up to the original writer to prevent corruption of this kind.
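The backward search and the `archiveCommentLength` check can be sketched in Python as follows. The function name `find_eocdr` is hypothetical, and holding the whole file in one `bytes` object is a simplification; a real reader would likely read only the final `0xffff + 22` bytes.

```python
import struct

EOCDR_SIGNATURE = b"PK\x05\x06"
EOCDR_FIXED_SIZE = 22


def find_eocdr(buf: bytes) -> int:
    """Return the offset of the EndOfCentralDirectoryRecord in buf."""
    file_size = len(buf)
    if file_size < EOCDR_FIXED_SIZE:
        raise ValueError("not a ZIP file: too small")
    lowest = max(0, file_size - 0xFFFF - EOCDR_FIXED_SIZE)
    highest = file_size - EOCDR_FIXED_SIZE
    # rfind returns the last match lying entirely inside [lowest, highest + 4).
    offset = buf.rfind(EOCDR_SIGNATURE, lowest, highest + 4)
    if offset < 0:
        raise ValueError("not a ZIP file: signature not found")
    # archiveCommentLength is the final 2 bytes of the fixed-size part.
    (archive_comment_length,) = struct.unpack_from("<H", buf, offset + 20)
    if offset + EOCDR_FIXED_SIZE + archive_comment_length != file_size:
        raise ValueError("corrupted: archiveCommentLength is not the expected value")
    return offset
```

As noted above, passing this check does not prove the reader found the record the writer intended; it only rejects candidates that are inconsistent with the file size.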
Immediately preceding the `EndOfCentralDirectoryRecord` CAN be a `Zip64EndOfCentralDirectoryLocator`.
If the `4` bytes starting `20` bytes prior to the start of the `EndOfCentralDirectoryRecord` are `0x07064b50` i.e. `{0x50, 0x4b, 0x06, 0x07}` i.e. `"PK\x06\x07"`,
and at least one of the following is true: `entryCount16` is `0xFFFF`, `centralDirOffset32` is `0xFFFFFFFF`, or `centralDirSize32` is `0xFFFFFFFF`,
then the two ZIP64 end structs are present, and the `Zip64EndOfCentralDirectoryLocator` is located immediately prior to the `EndOfCentralDirectoryRecord`.
See also Endnote `Zip64EndStructsAmbiguity`.
The reader SHOULD require that `zip64EocdrStartDisk` is `0` and `diskCount32` is `1`,
unless the reader supports the Multi-Disk obscure feature.
Checking both fields is better for security; see Endnote `ConsistencyForSecurity`.
The `zip64EocdrOffset` gives the offset of the `Zip64EndOfCentralDirectoryRecord`.
The typical and maximum value for this offset is the offset of the `Zip64EndOfCentralDirectoryLocator` minus `56`,
locating the `Zip64EndOfCentralDirectoryRecord` immediately prior to the `Zip64EndOfCentralDirectoryLocator`.
A reader SHOULD require the offset does not exceed the maximum, because structures SHALL NOT overlap.
A reader SHOULD require the `Zip64EndOfCentralDirectoryRecord` `signature` be `0x06064b50` i.e. `{0x50, 0x4b, 0x06, 0x06}` i.e. `"PK\x06\x06"`,
or else reject the ZIP file as corrupted.
If the signature is not the expected value, it may indicate the use of the Base Offset Shift obscure feature.
A reader SHOULD ignore the `zip64ExtensibleDataSizePlus44` field, unless the reader supports the Z390 Extra Field obscure feature.
As of APPNOTE 6.3.10, these obscure features do not interfere with the interpretation of the rest of the ZIP file.
A reader SHOULD require that `appnoteCompatibilityMin` is at most `63`, meaning APPNOTE 6.3,
unless the reader supports the Future APPNOTE Versions obscure feature.
A reader SHOULD require that `appnoteCompatibilityMin` is at least `45`, meaning APPNOTE 4.5 which introduced this structure.
A reader SHOULD ignore fields `appnoteCompatibilityMax` and `fileSystemCompatibility`.
In the context of the `Zip64EndOfCentralDirectoryRecord`, `fileSystemCompatibility` has no meaning.
A reader SHOULD require `zip64EocdrStartDisk` and `centralDirStartDisk32` are both `0` and `entryCountOnLastDisk64` is equal to `entryCount64`,
unless the reader supports the Multi-Disk obscure feature.
Checking all three fields is better for security; see Endnote `ConsistencyForSecurity`.
The following intermediate values are referred to below:
- * Let `entryCount` be the number of entries in the archive determined by either `entryCount16` or `entryCount64`.
- * Let `centralDirOffset` be the offset of the first `CentralDirectoryHeader` in the archive determined by either `centralDirOffset32` or `centralDirOffset64`.
- * Let `centralDirSize` be the sum of the sizes of all `CentralDirectoryHeader` structs as determined by either `centralDirSize32` or `centralDirSize64`.
- * Let `endOffset` be the offset of the `Zip64EndOfCentralDirectoryRecord` if present or the `EndOfCentralDirectoryRecord` otherwise.
A reader SHOULD require that either `entryCount` and `centralDirSize` are both `0`,
or that `centralDirOffset + centralDirSize <= endOffset`, because structures SHALL NOT overlap.
Note that `centralDirOffset` has no meaning when `entryCount` and `centralDirSize` are both `0`.
The intermediate values `entryCount`, `centralDirOffset`, and `centralDirSize` are referenced in the next section.
#### Listing the `CentralDirectoryHeader` entries
If `entryCount` is not `0`, then a sequence of `entryCount` number of `CentralDirectoryHeader` structs starts at `centralDirOffset` and ends after exactly `centralDirSize` bytes.
A reader SHOULD require that the sequence of `CentralDirectoryHeader` structs never exceeds size bounds dictated by `centralDirSize`,
and SHOULD require that no fewer than `centralDirSize` bytes are used to encode `entryCount` number of `CentralDirectoryHeader` structs.
See Endnote `ConsistencyForSecurity`.
(The terms `entryCount`, `centralDirOffset`, and `centralDirSize` are defined in the previous section.)
For each `CentralDirectoryHeader`, the reader SHOULD verify the `signature` is `0x02014b50` i.e. `{0x50, 0x4b, 0x01, 0x02}` i.e. `"PK\x01\x02"`.
If the first signature is not the expected value, it may indicate the use of the Base Offset Shift obscure feature.
If any subsequent signature is not the expected value, then the stream of `CentralDirectoryHeader` structs is corrupted.
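Walking the `CentralDirectoryHeader` sequence with the size and signature checks above can be sketched in Python as follows. The function name is hypothetical. The sketch assumes the conventional fixed-size layout of 46 bytes, with three `uint16le` length fields (file name, extra fields, file comment) at offsets 28, 30, and 32; those offsets come from the general ZIP layout and are not restated in this section.

```python
import struct
from typing import List, Tuple

CDH_SIGNATURE = 0x02014B50
CDH_FIXED_SIZE = 46  # assumption: conventional fixed-size part of the struct


def list_central_directory_header_ranges(
    buf: bytes, central_dir_offset: int, central_dir_size: int, entry_count: int
) -> List[Tuple[int, int]]:
    """Return (offset, size) for each CentralDirectoryHeader, or raise ValueError."""
    end = central_dir_offset + central_dir_size
    if end > len(buf):
        raise ValueError("central directory extends past the end of the file")
    ranges = []
    pos = central_dir_offset
    for _ in range(entry_count):
        if pos + CDH_FIXED_SIZE > end:
            raise ValueError("entries exceed centralDirSize")
        (signature,) = struct.unpack_from("<I", buf, pos)
        if signature != CDH_SIGNATURE:
            raise ValueError("unexpected CentralDirectoryHeader signature")
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", buf, pos + 28)
        size = CDH_FIXED_SIZE + name_len + extra_len + comment_len
        if pos + size > end:
            raise ValueError("entries exceed centralDirSize")
        ranges.append((pos, size))
        pos += size
    if pos != end:
        raise ValueError("entries use fewer than centralDirSize bytes")
    return ranges
```

Checking both bounds means `entryCount` and `centralDirSize` must agree exactly, which is the consistency property the endnote is concerned with.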
A reader SHOULD require that `appnoteCompatibilityMin` be at most `63`, meaning APPNOTE 6.3,
unless the reader supports the Future APPNOTE Versions obscure feature.
A reader SHOULD ignore `appnoteCompatibilityMax` and `internalFileAttributes`.
For more information, see the sections documenting these fields.
The fields `fileSystemCompatibility` and `externalFileAttributes` CAN indicate that a file is a symlink.
See `fileType` for more details.
TODO: not sure i want to just enumerate all the fields again here and say basically the same thing as everywhere else. Let's see what else there is.
TODO: `dosTimestamp`.
TODO: `fileComment`.
TODO: advise that the sequence of `localFileHeaderOffset` should be strictly increasing, and then not spill into the central directory.
#### Jumping to `LocalFileHeader` and `fileData`
Each `CentralDirectoryHeader` gives a `localFileHeaderOffset` (derived from `localFileHeaderOffset32` and the `Zip64ExtendedInformation` extra field) which is the offset of the corresponding `LocalFileHeader`.
A reader SHOULD require the `LocalFileHeader` `signature` is `0x04034b50` i.e. `{0x50, 0x4b, 0x03, 0x04}` i.e. `"PK\x03\x04"`.
If the signature is not the expected value, the ZIP file is corrupted.
A reader SHOULD require the `LocalFileHeader` `appnoteCompatibilityMin` be at most `63`, meaning APPNOTE 6.3,
unless the reader supports the Future APPNOTE Versions obscure feature.
A reader SHOULD require the redundant information in a `LocalFileHeader` matches the information in the corresponding `CentralDirectoryHeader`.
See Endnote `ConsistencyForSecurity`.
This information always includes:
- * `compressionMethod`
- * `fileName` (derived from `fileNameRaw`, `generalPurposeBits`, and the `InfoZipUnicodePath` extra field).
- * TODO: do we care about `dosTimestamp` and other timestamps?
If the `LocalFileHeader` `generalPurposeBits` bit 3 is not set, then the list additionally includes:
- * `compressedSize` (derived from `compressedSize32` and the `Zip64ExtendedInformation` extra field).
- * `uncompressedSize` (derived from `uncompressedSize32` and the `Zip64ExtendedInformation` extra field).
- * `crc32`
If the `LocalFileHeader` `generalPurposeBits` bit 3 is set, then the reader SHOULD instead require that `compressedSize32`, `uncompressedSize32`, and `crc32` are all `0`.
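The consistency rules above can be sketched as a single comparison function. This is a minimal illustration, not a definitive implementation: the `EntryInfo` type and `check_local_matches_central` function are hypothetical names, and the sketch assumes the derived values (`fileName`, sizes) have already been computed as described in their own sections. When bit 3 is set, the local size fields are assumed to hold the raw 32-bit values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntryInfo:
    """Hypothetical container for values derived from one header."""
    compression_method: int
    file_name: str
    compressed_size: int
    uncompressed_size: int
    crc32: int


def check_local_matches_central(local: EntryInfo, central: EntryInfo,
                                local_bit3_set: bool) -> None:
    """Raise ValueError if the LocalFileHeader disagrees with the CentralDirectoryHeader."""
    # These fields are always compared.
    if local.compression_method != central.compression_method:
        raise ValueError("compressionMethod mismatch")
    if local.file_name != central.file_name:
        raise ValueError("fileName mismatch")
    local_sizes = (local.compressed_size, local.uncompressed_size, local.crc32)
    if local_bit3_set:
        # Deferred lengths: the local fields are expected to be all zero.
        if local_sizes != (0, 0, 0):
            raise ValueError("bit 3 is set but local sizes or crc32 are nonzero")
    else:
        central_sizes = (central.compressed_size, central.uncompressed_size, central.crc32)
        if local_sizes != central_sizes:
            raise ValueError("compressedSize, uncompressedSize, or crc32 mismatch")
```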
For the `generalPurposeBits` in both the `CentralDirectoryHeader` and `LocalFileHeader`, the reader SHOULD require that all reserved bits are `0`,
and SHOULD require that all obscure bits are `0` unless the reader supports the corresponding obscure feature.
The reader SHOULD ignore the ignore bits.
See the section in `generalPurposeBits` for the classification of the bits.
A reader SHOULD require that `compressionMethod` is either `0` (stored) or `8` (deflated),
unless the reader supports the More Compression Methods obscure feature.
Immediately following the `LocalFileHeader` is the `fileData`.
Note that the size of a `LocalFileHeader` can only be determined by reading the `fileNameRawLength` and `extraFieldsLength` fields from the `LocalFileHeader`,
so it is generally not possible to jump straight from a `CentralDirectoryHeader` to the corresponding `fileData`.
Note that despite having the same name, the `extraFieldsLength` in the `CentralDirectoryHeader` and `LocalFileHeader` frequently have different values,
such as when a `Zip64ExtendedInformation` in the `CentralDirectoryHeader` includes a `localFileHeaderOffset64`.
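As a rough sketch of the point above, the following locates the `fileData` by reading the two length fields out of the `LocalFileHeader`. It assumes the whole archive is available as a `bytes` object and relies on the conventional 30-byte fixed portion of the `LocalFileHeader`; the function name `file_data_offset` is a hypothetical helper, not something defined by this document.

```python
import struct

LOCAL_FILE_HEADER_SIGNATURE = 0x04034B50
# Fixed-size portion of a LocalFileHeader: 30 bytes, little-endian.
# signature, version, flags, method, time, date, crc32,
# compressedSize32, uncompressedSize32, fileNameRawLength, extraFieldsLength
_LFH_FIXED = struct.Struct("<IHHHHHIIIHH")


def file_data_offset(archive: bytes, local_file_header_offset: int) -> int:
    """Return the offset of fileData for the LocalFileHeader at the given offset."""
    fixed_end = local_file_header_offset + _LFH_FIXED.size
    if local_file_header_offset < 0 or fixed_end > len(archive):
        raise ValueError("LocalFileHeader extends past the end of the archive")
    fields = _LFH_FIXED.unpack_from(archive, local_file_header_offset)
    if fields[0] != LOCAL_FILE_HEADER_SIGNATURE:
        raise ValueError("bad LocalFileHeader signature; the ZIP file is corrupted")
    file_name_raw_length, extra_fields_length = fields[9], fields[10]
    # The variable-length fileNameRaw and extraFields precede the fileData.
    return fixed_end + file_name_raw_length + extra_fields_length
```

A fuller reader would also check that the returned offset plus `compressedSize` does not run past the start of the central directory.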
The length of the `fileData` is given by the `CentralDirectoryHeader` `compressedSize` (derived from `compressedSize32` and the `Zip64ExtendedInformation` extra field).
If the `fileData` is compressed, the reader SHOULD require that the compression stream's built-in end-of-stream signal corresponds to the end of the `fileData` byte range.
See Endnote `ZipBomb` and Endnote `ConsistencyForSecurity`.
The length of `contents` is given by the `CentralDirectoryHeader` `uncompressedSize` (derived from `uncompressedSize32` and the `Zip64ExtendedInformation` extra field).
If the `compressionMethod` is `0` (stored), then the reader SHOULD require that `compressedSize` equals `uncompressedSize`.
See Endnote `ConsistencyForSecurity`.
A reader SHOULD require that the amount of data produced by any decompression process never exceeds the expected size.
See Endnote `ZipBomb`.
A reader SHOULD also require that the amount of data is not less than the expected size, which indicates either a corrupted ZIP file or a corrupted compression stream.
Note there is no way to encode the window size for a DEFLATE stream in the ZIP file format,
so a reader SHOULD use the maximum 15 bit window size for maximum compatibility;
for zlib, a popular DEFLATE implementation, this means `windowBits=-15`.
A reader SHOULD require that the CRC32 of the `contents` is equal to `crc32`.
If it is not, this typically indicates unintentional single-bit data corruption.
See Endnote `WhyVerifyCrc32`.
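The decompression checks above can be combined in one routine. The following is a minimal sketch using Python's `zlib` module, assuming the entire `fileData` fits in memory; `inflate_exact` is a hypothetical name. A negative `wbits` value selects a raw DEFLATE stream with the given window size.

```python
import zlib


def inflate_exact(file_data: bytes, uncompressed_size: int, expected_crc32: int) -> bytes:
    """Inflate raw DEFLATE fileData, enforcing the size and CRC32 checks."""
    decompressor = zlib.decompressobj(wbits=-15)
    # Allow one byte more than expected so that oversized output is detectable
    # without ever producing an unbounded amount of data.
    contents = decompressor.decompress(file_data, uncompressed_size + 1)
    if len(contents) > uncompressed_size:
        raise ValueError("decompressed data exceeds uncompressedSize")
    if not decompressor.eof:
        raise ValueError("compression stream did not end within fileData")
    if decompressor.unused_data:
        raise ValueError("compression stream ended before the end of fileData")
    if len(contents) != uncompressed_size:
        raise ValueError("decompressed data is shorter than uncompressedSize")
    if zlib.crc32(contents) != expected_crc32:
        raise ValueError("crc32 mismatch")
    return contents
```

A production reader would more likely decompress in chunks to a file, but the same four checks apply.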
If the `LocalFileHeader` `generalPurposeBits` bit 3 is set, then immediately following the `fileData` is a `DataDescriptor`.
The `DataDescriptor` gives no useful information and can be safely ignored.
### Extracting to a File System
TODO: file name collision. case collision. unicode normalization collision. ancestor directory vs file collision.
TODO: Symlinks. length limit recommendation. path traversal vulnerabilities. ancestor directory vs symlink.
TODO: set permission bits. e.g. for `fileType=POSIX_EXECUTABLE`, set `mode |= (mode & 0o444) >> 2` after file creation rather than simply enabling all three executable bits;
this is to respect any `umask` setting that may have limited the permissions lower than `0o644`.
TODO: `lastModifiedTimestamp`.
TODO: if Alternate Data Streams are supported, check for the Mark of the Web on the input ZIP file and propagate it to the extracted contents. TODO: move this reference somewhere: https://textslashplain.com/2016/04/04/downloads-and-the-mark-of-the-web/
### Streaming Reading
The most robust way to read a ZIP file from a stream is to buffer it on disk first and then use the Random-Access Reading strategy above. An implementation SHOULD NOT attempt to read entries from a ZIP file from a stream, because of the necessary complexity and fragility of the implementation. Read on if you'd like to truly understand what you're getting yourself into if you attempt to read a ZIP file from the beginning.
The benefit to reading a ZIP file from a stream is that it may save disk space, depending on the implementation.
Note that it is not possible to skip using disk storage by handling entries as they are encountered in the stream.
The true contents of a ZIP archive cannot be determined until reaching the very end of the stream.
It is necessary to buffer a working, optimistic interpretation of the entries on disk or in memory until the assumptions can be verified or rejected, which happens at the very earliest 19 bytes prior to the end of the stream.
And even if all assumptions turn out to be correct, it is not possible to distinguish between regular files and symlinks until the `CentralDirectoryHeader` structs near the end of the stream.
The below algorithm gives a suggestion for optimistically extracting what might be entries from the archive and saving them to disk, then validating the assumption once reaching the end of the ZIP file. If the assumptions hold, then the entries saved on disk can be used to extract the archive, but if the assumptions were not correct, the whole extraction process is a failure, and there is no way to recover or fallback to random-access reading. If an implementation attempts to support a fallback by buffering the ZIP file on disk, then it's far simpler and more robust to defer processing until the whole ZIP file is buffered anyway, using the Random-Access Reading strategy above instead of this streaming strategy.
You have been warned. Here's how to read a ZIP file from a stream.
#### Streaming what might be `LocalFileHeader` and `fileData`
The very start of a ZIP file CAN be unused space, which the reader MAY skip through looking for one of the following signatures. The reader SHOULD assume the first occurrence of one of the following patterns of 4 bytes has meaning as follows:
- * `0x04034b50` i.e. `{0x50, 0x4b, 0x03, 0x04}` i.e. `"PK\x03\x04"` - the reader SHOULD assume this to start the first `LocalFileHeader`.
- * `0x06064b50` i.e. `{0x50, 0x4b, 0x06, 0x06}` i.e. `"PK\x06\x06"` - the reader SHOULD assume this to start the `Zip64EndOfCentralDirectoryRecord`, suggesting that there are 0 entries. Skip ahead to Streaming the end structs.
- * `0x06054b50` i.e. `{0x50, 0x4b, 0x05, 0x06}` i.e. `"PK\x05\x06"` - the reader SHOULD assume this to start the `EndOfCentralDirectoryRecord`, suggesting that there are 0 entries. Skip ahead to Streaming the end structs.
See below for how to interpret the first and subsequent `LocalFileHeader` structs.
After each `LocalFileHeader` struct, `fileData`, and possibly-present `DataDescriptor`,
there CAN be unused space, which the reader MAY skip through looking for one of the following signatures.
The reader SHOULD assume the next occurrence of one of the following patterns of 4 bytes has meaning as follows:
- * `0x04034b50` i.e. `{0x50, 0x4b, 0x03, 0x04}` i.e. `"PK\x03\x04"` - the reader SHOULD assume this to start another `LocalFileHeader`.
- * `0x02014b50` i.e. `{0x50, 0x4b, 0x01, 0x02}` i.e. `"PK\x01\x02"` - the reader SHOULD assume this to start the first `CentralDirectoryHeader`, suggesting the last `LocalFileHeader` has been read. Skip ahead to Streaming what might be `CentralDirectoryHeader`.
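The signature scan described above could look roughly like the following. This sketch assumes the relevant part of the stream is already buffered in a `bytes` object; a real streaming reader would also have to handle a signature that straddles two reads. The function name `find_next_signature` is hypothetical.

```python
from typing import Optional, Tuple

LOCAL_FILE_HEADER = b"PK\x03\x04"
CENTRAL_DIRECTORY_HEADER = b"PK\x01\x02"
ZIP64_EOCDR = b"PK\x06\x06"
EOCDR = b"PK\x05\x06"


def find_next_signature(buf: bytes, start: int,
                        candidates: Tuple[bytes, ...]) -> Optional[Tuple[int, bytes]]:
    """Return (offset, signature) of the earliest candidate at or after start, or None."""
    best: Optional[Tuple[int, bytes]] = None
    for signature in candidates:
        offset = buf.find(signature, start)
        if offset != -1 and (best is None or offset < best[0]):
            best = (offset, signature)
    return best
```

At the very start of the stream the candidates would be `(LOCAL_FILE_HEADER, ZIP64_EOCDR, EOCDR)`, and after each entry they would be `(LOCAL_FILE_HEADER, CENTRAL_DIRECTORY_HEADER)`.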
For each assumed `LocalFileHeader` encountered, the first 4 bytes have already been verified.
The remainder of the fields are discussed below.
A reader SHOULD require that `appnoteCompatibilityMin` be at most `63`, meaning APPNOTE 6.3,
unless the reader supports the Future APPNOTE Versions obscure feature.
A reader SHOULD require that `compressionMethod` is either `0` (stored) or `8` (deflated),
unless the reader supports the More Compression Methods obscure feature.
For the `generalPurposeBits`, the reader SHOULD require that all reserved bits are `0`,
and SHOULD require that all obscure bits are `0` unless the reader supports the corresponding obscure feature.
The reader SHOULD ignore the ignore bits.
See the section in `generalPurposeBits` for the classification of the bits.
See the section on `fileName` for how to determine the file name of the entry.
Note that the entry MAY be a symlink, but there is no way to determine that from a `LocalFileHeader`.
See Streaming what might be `CentralDirectoryHeader` below.
See the section on `Zip64ExtendedInformation` for the meaning of `compressedSize32` and `uncompressedSize32`.
If `generalPurposeBits` bit 3 is not set, then the reader should have a value for `compressedSize` and `uncompressedSize` derived from the presence or absence of a `Zip64ExtendedInformation`.
The `compressedSize` gives the length of `fileData` in the stream following the `LocalFileHeader`.
If the reader decompresses the `fileData`, the reader SHOULD require that `compressedSize` is correct as discussed in the Jumping to `LocalFileHeader` and `fileData` section above.
See Endnote `ZipBomb`.
If `generalPurposeBits` bit 3 is set, indicating deferred lengths, and `compressionMethod` is `8` (deflated),
the reader SHOULD decompress the `fileData` following the `LocalFileHeader` until reaching the end of stream signal.
If `generalPurposeBits` bit 3 is set, indicating deferred lengths, and `compressionMethod` is `0` (stored),
then the reader MAY attempt to search the following `fileData` for a `DataDescriptor` as a sentinel.
However, this is a processing-intensive operation due to the variability of the `DataDescriptor` structure, see Endnote `DataDescriptorHasAmbiguousStructure`,
and also it is easy for a maliciously crafted ZIP file to contain what appears to be a correct `DataDescriptor` in the `fileData`, see Endnote `DataDescriptorIsAnAmbiguousSentinel`.
Despite the ambiguity, the whole streaming reading process is inherently ambiguous until reaching the very end of the file anyway,
so the security hazard due specifically to the `DataDescriptor` ambiguity is inconsequential.
If the reader is interested in the `fileData`, the reader SHOULD store the `fileData` in a temporary file.
If the `fileData` is compressed, the reader MAY choose to leave the data compressed in the temporary file until the corresponding `CentralDirectoryHeader` has been found later.
Note that this `fileData` CAN be a misinterpretation of the ZIP file, and potentially the entire archive extraction process will result in failure, so a reader SHOULD NOT trust that the `fileData` or any other aspect of this entry is correct at this point in the processing.
If `generalPurposeBits` bit 3 is set, indicating deferred lengths, then immediately following the `fileData` is a `DataDescriptor`.
Note that because the true structure of the ZIP file is impossible to determine until reaching the end,
the `DataDescriptor` contains no useful information for the reading process, except potentially signaling an early failure to parse the ZIP file.
The apparent intent of the `DataDescriptor` according to APPNOTE was to give an opportunity to a streaming reader to verify the `compressedSize`, `uncompressedSize`, and `crc32` immediately after processing the `fileData`,
however due to the variable and ambiguous structure of the `DataDescriptor`, its presence only makes the parsing process more difficult.
The reader SHOULD ignore the `DataDescriptor` and simply scan forward for the next recognized signature, either suggesting a `LocalFileHeader` or `CentralDirectoryHeader`, as described at the start of this section.
The reader MAY instead perform a more rigorous check for the different forms of the `DataDescriptor`, such as checking for its optional signature,
which CAN result in a more robust parsing process.
For example, if the `uncompressedSize` happens to be `0x04034b50`, then the naive recommendation given above would find bytes appearing to be a `LocalFileHeader` signature, and parsing would eventually result in failure.
However, the streaming reading process is inherently unreliable, and the ambiguity of the `DataDescriptor` is one of many problems,
so the recommendation given in this document is for the reader to ignore the `DataDescriptor` and simply scan for the next recognized signature.
See Endnote `DataDescriptorHasAmbiguousStructure`.
If the reader is interested in this entry, then in addition to saving the `fileData` in a temporary file on disk,
the reader SHOULD also keep track of the following information to cross check against a later-found `CentralDirectoryHeader`:
- * The offset of the `LocalFileHeader` since the start of the stream, referred to later as `localFileHeaderOffset`.
- * The `fileName`.
- * The `compressionMethod`.
See the start of this section for checking for a subsequent `LocalFileHeader` or finding the first `CentralDirectoryHeader`.
#### Streaming what might be `CentralDirectoryHeader`
At this point, the reader has found a signature `0x02014b50` i.e. `{0x50, 0x4b, 0x01, 0x02}` i.e. `"PK\x01\x02"` suggesting that the following bytes are the rest of a `CentralDirectoryHeader`.
After each `CentralDirectoryHeader`, if the next 4 bytes are `0x02014b50` i.e. `{0x50, 0x4b, 0x01, 0x02}` i.e. `"PK\x01\x02"`, the reader SHOULD assume that another `CentralDirectoryHeader` follows and those 4 bytes are the signature.
Otherwise, the reader SHOULD check the 4 bytes for one of the following byte patterns, and possibly skip over unused space if necessary:
- * `0x06064b50` i.e. `{0x50, 0x4b, 0x06, 0x06}` i.e. `"PK\x06\x06"` - the reader SHOULD assume this to start the `Zip64EndOfCentralDirectoryRecord`. Skip ahead to Streaming the end structs.
- * `0x06054b50` i.e. `{0x50, 0x4b, 0x05, 0x06}` i.e. `"PK\x05\x06"` - the reader SHOULD assume this to start the `EndOfCentralDirectoryRecord`. Skip ahead to Streaming the end structs.
For each assumed `CentralDirectoryHeader` encountered, the first 4 bytes have already been verified.
The remainder of the fields are discussed below.
Fields `appnoteCompatibilityMax`, `fileSystemCompatibility`, `appnoteCompatibilityMin`, `internalFileAttributes`, `externalFileAttributes`, and `generalPurposeBits` are documented in their own sections.
TODO: and this gives us the `isSymlink` intermediate value.
TODO: `dosTimestamp`.
TODO: `fileComment`.
See the sections on `fileName` and `Zip64ExtendedInformation` to derive the following intermediate values: `fileName`, `localFileHeaderOffset`, `compressedSize`, `uncompressedSize`.
TODO: is it `fileName` or `InfoZipUnicodePath` section?
If the reader is interested in this entry, the reader SHOULD require that the `fileName`, `localFileHeaderOffset`, and `compressionMethod` match an entry found while Streaming what might be `LocalFileHeader` and `fileData`.
If fewer than all three of the values match, it indicates that either the parsing is a failure due to some or all of the structures parsed thus far being misinterpreted,
or it indicates that this is a maliciously crafted ZIP file. See Endnote `ZipBomb`, Endnote `MaliciouslyAmbiguousStructure`, and Endnote `MaliciouslyConflictingCompressionMethod`.
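One way to organize this cross-check is to index the optimistically streamed entries by `localFileHeaderOffset` and then demand an exact match on all three values. The sketch below is illustrative only; `StreamedEntry` and `match_central_entry` are hypothetical names, and the values passed in are assumed to have been derived as described above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StreamedEntry:
    """What the reader recorded while streaming a presumed LocalFileHeader."""
    local_file_header_offset: int
    file_name: str
    compression_method: int


def match_central_entry(streamed: Dict[int, StreamedEntry], local_file_header_offset: int,
                        file_name: str, compression_method: int) -> StreamedEntry:
    """Return the streamed entry matching a CentralDirectoryHeader, or raise ValueError."""
    entry = streamed.get(local_file_header_offset)
    if entry is None:
        raise ValueError("no streamed entry at this localFileHeaderOffset")
    if entry.file_name != file_name:
        raise ValueError("fileName differs between the streamed entry and the central directory")
    if entry.compression_method != compression_method:
        raise ValueError("compressionMethod differs between the streamed entry and the central directory")
    return entry
```

Any `ValueError` here means the streaming interpretation has failed as a whole, not just for this entry.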
If the reader decompresses or otherwise uses the `fileData` in the temporary file, the reader SHOULD require that the length of the `fileData` is `compressedSize`,
and that the length of the `contents` is `uncompressedSize` as explained in the section Jumping to `LocalFileHeader` and `fileData`.
See Endnote `ZipBomb` and Endnote `ConsistencyForSecurity`.
A reader SHOULD require that the CRC32 of the `contents` is equal to `crc32`.
If it is not, this typically indicates unintentional single-bit data corruption.
See Endnote `WhyVerifyCrc32`.
If `isSymlink`, then the reader SHOULD impose restrictions on the `contents`,
which is the link target, according to the section Symlinks.
See Endnote `PathTraversal`.
Immediately following the `CentralDirectoryHeader`, the next 4 bytes CAN suggest that another `CentralDirectoryHeader` follows, as discussed at the start of this section.
If the reader found an entry during the Streaming what might be `LocalFileHeader` and `fileData` process that has no corresponding `CentralDirectoryHeader`,
then the reader SHOULD assume that no such entry exists in the ZIP archive.
Assuming that the `CentralDirectoryHeader` parsing explained in this section proves to be valid (see Streaming the end structs),
this likely means that an entry was deleted from the ZIP file without shifting all the `LocalFileHeader` and `fileData` down,
and only the `CentralDirectoryHeader` was removed.
#### Streaming the end structs
At this point the reader has found a signature suggesting either a `Zip64EndOfCentralDirectoryRecord` or `EndOfCentralDirectoryRecord`.
If the reader finds the `EndOfCentralDirectoryRecord` without finding the two ZIP64 end structs,
then the reader SHOULD require that the 4 bytes found 20 bytes prior to the `EndOfCentralDirectoryRecord` signature are not `0x07064b50` i.e. `{0x50, 0x4b, 0x06, 0x07}` i.e. `"PK\x06\x07"`.
Finding these bytes would suggest that the `Zip64EndOfCentralDirectoryLocator` preceded the `EndOfCentralDirectoryRecord`,
and parsing the ZIP file is a total failure.
If the bytes are not found, then the following discussion of the two ZIP64 end structs does not apply.
If the reader finds the `Zip64EndOfCentralDirectoryRecord`, the signature has already been verified.
The remainder of the fields are discussed below.
The same recommendations apply as in the "Starting at the end structs" section for the following fields: `appnoteCompatibilityMin`, `appnoteCompatibilityMax`, `fileSystemCompatibility`, `zip64EocdrStartDisk`, `centralDirStartDisk32`, `entryCountOnLastDisk64`.
The reader SHOULD require the following:
- * The `entryCount64` is the number of `CentralDirectoryHeader` structs found in the previous section.
- * The `centralDirSize64` is the sum of the sizes of the `CentralDirectoryHeader` structs found in the previous section.
- * If any `CentralDirectoryHeader` structs were found in the previous section, the `centralDirOffset64` is the offset of the first one.
The `zip64ExtensibleDataSizePlus44 - 44` bytes after the end of the `Zip64EndOfCentralDirectoryRecord` encodes the `zip64 extensible data sector`,
which the reader SHOULD ignore unless the reader supports the Z390 Extra Field obscure feature.
After the `zip64 extensible data sector` the reader SHOULD search for the bytes `0x07064b50` i.e. `{0x50, 0x4b, 0x06, 0x07}` i.e. `"PK\x06\x07"`,
possibly skipping over unused space.
The reader SHOULD assume the bytes are the `signature` of the `Zip64EndOfCentralDirectoryLocator`.
The remainder of the fields are discussed below.
The reader SHOULD require that `zip64EocdrStartDisk` is `0` and `diskCount32` is `1`,
unless the reader supports the Multi-Disk obscure feature.
Checking both fields is better for security; see Endnote `ConsistencyForSecurity`.
The reader SHOULD require that `zip64EocdrOffset` is the offset of the presumed `Zip64EndOfCentralDirectoryRecord`.
A different value means that parsing the ZIP file is a total failure.
Immediately following the `Zip64EndOfCentralDirectoryLocator` is the `EndOfCentralDirectoryRecord` with no unused space between.
The reader SHOULD require that the 4 bytes following the `Zip64EndOfCentralDirectoryLocator` are `0x06054b50` i.e. `{0x50, 0x4b, 0x05, 0x06}` i.e. `"PK\x05\x06"`.
A different value means that parsing the ZIP file is a total failure.
At this point, the reader has found what is presumed to be the `signature` of the `EndOfCentralDirectoryRecord`.
If the two ZIP64 end structs were not found, then the above discussion of them does not apply.
The remainder of the fields of the `EndOfCentralDirectoryRecord` are discussed below.
TODO: describe how the zip64 fields override these fields, and how the multi-disk stuff doesn't apply unless blah.
The reader SHOULD require that the ZIP file ends exactly after the `archiveComment`.
If the file does not end exactly at the expected offset, the parsing is a total failure.
If `archiveCommentLength` is `3` or less, the parsing is a success.
Otherwise, the reader SHOULD require that the bytes `0x06054b50` i.e. `{0x50, 0x4b, 0x05, 0x06}` i.e. `"PK\x05\x06"` do not appear at any offset after the `signature` up through 22 bytes prior to the end of the ZIP file.
If these bytes are found, the parsing is a total failure.
Otherwise, the parsing is a success.
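The final check can be sketched as a bounded search. This assumes the bytes from the presumed `EndOfCentralDirectoryRecord` to the end of the file have been retained and are addressable as part of one `bytes` object, and it relies on the fixed portion of the `EndOfCentralDirectoryRecord` being 22 bytes; `eocdr_is_unambiguous` is a hypothetical name.

```python
EOCDR_SIGNATURE = b"PK\x05\x06"
EOCDR_FIXED_SIZE = 22


def eocdr_is_unambiguous(archive: bytes, eocdr_offset: int) -> bool:
    """Return True if no other EOCDR signature could be found by a backward-scanning reader."""
    if archive[eocdr_offset:eocdr_offset + 4] != EOCDR_SIGNATURE:
        raise ValueError("no EndOfCentralDirectoryRecord signature at eocdr_offset")
    first_candidate = eocdr_offset + 4
    # A competing record needs at least 22 bytes before the end of the file.
    last_candidate = len(archive) - EOCDR_FIXED_SIZE
    # bytes.find only reports matches that fit entirely before the end bound.
    return archive.find(EOCDR_SIGNATURE, first_candidate, last_candidate + 4) == -1
```

When `archiveCommentLength` is `3` or less the search range is empty, which agrees with the shortcut stated above.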
### Writing
Writing a ZIP file can be done from start to finish without knowing the full list of entries up front.
The writer SHOULD write the `LocalFileHeader`, `fileData`, and optional `DataDescriptor` to the final output file or stream
while at the same time tracking sufficient information about each entry to write the `CentralDirectoryHeader` structs later.
For ZIP archives with a large number of entries, the writer CAN write the `CentralDirectoryHeader` structs to a temporary file to save memory,
although this makes it harder to check for file name collision.
The writer SHOULD refuse to create a ZIP file with multiple entries with identical `fileName`.
See Endnote `FileNameCollisions`.
#### Writing `LocalFileHeader`, `fileData`, and `DataDescriptor`
For each entry to be written to the ZIP file, the writer SHOULD write a `LocalFileHeader` with no unused space before it; see Endnote `UnusedSpaceCompatibility`.
The writer SHOULD set `generalPurposeBits` bit 11, indicating the `fileNameRaw` is UTF-8 encoded.
The writer SHOULD set `appnoteCompatibilityMin` as documented in that field's own section, or simply set it to `45` unconditionally.
The writer MAY instead leave `generalPurposeBits` bit 11 unset, and include an `InfoZipUnicodePath` extra field encoding the file name in UTF-8.
This is only recommended when the ZIP file is intended to be readable by implementations that haven't been updated since 2006 without APPNOTE 6.3 compatibility,
in which case the writer SHOULD set `fileNameRaw` to the file name encoded in the character encoding that such an implementation would use.
Note that such an implementation would not support the `InfoZipUnicodePath`, which was also invented in 2006;
the `InfoZipUnicodePath` field is intended for more modern implementations that don't know the intended encoding of the `fileNameRaw`.
Note that there is no reliable way to express the intended encoding of the `fileNameRaw` other than setting `generalPurposeBits` bit 11.
See Endnote `DefaultCharacterEncoding`.
The writer SHOULD set the `compressionMethod` to `8` (deflated),
unless the writer determines somehow that DEFLATE compression is unsuitable for this entry.
For example, if the files being archived are already compressed media,
DEFLATE compression would have a negligible effect on the size of the ZIP file.
However, DEFLATE compression will rarely have a meaningful negative impact on the archival process,
so simply using DEFLATE compression on every entry is the recommended default behavior.
TODO: link to research on worst case deflate situations; mention CPU resources, etc.
TODO: `dosTimestamp`
The writer SHOULD compute the CRC32 of the `contents` and, when using DEFLATE compression,
compress the `contents` to a temporary file before copying the compressed content to the final ZIP file.
Writing to a temporary file first instead of the final ZIP file allows the following fields to be computed and included in the `LocalFileHeader`: `crc32`, `compressedSize`, `uncompressedSize`.
If the `compressedSize` and/or `uncompressedSize` exceed `0xFFFFFFFF`, a `Zip64ExtendedInformation` extra field is needed in the `extraFields`.
The writer MAY include a `Zip64ExtendedInformation` extra field even if it is not needed.
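The values needed for the `LocalFileHeader` can be gathered in a single pass over the `contents`. The following sketch uses Python's `zlib` module and, for brevity, holds the compressed data in memory rather than in a temporary file; `PreparedEntry` and `prepare_deflated_entry` are hypothetical names, and the compression level is an arbitrary choice.

```python
import zlib
from typing import NamedTuple


class PreparedEntry(NamedTuple):
    file_data: bytes
    crc32: int
    compressed_size: int
    uncompressed_size: int
    needs_zip64: bool


def prepare_deflated_entry(contents: bytes) -> PreparedEntry:
    """Compress contents once and compute every value the LocalFileHeader needs."""
    # Negative wbits produces a raw DEFLATE stream with no zlib header or trailer.
    compressor = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=-15)
    file_data = compressor.compress(contents) + compressor.flush()
    compressed_size = len(file_data)
    uncompressed_size = len(contents)
    # A Zip64ExtendedInformation extra field is needed if either size exceeds 32 bits.
    needs_zip64 = max(compressed_size, uncompressed_size) > 0xFFFFFFFF
    return PreparedEntry(file_data, zlib.crc32(contents), compressed_size,
                         uncompressed_size, needs_zip64)
```

Because the compressed bytes are kept and reused, the output does not depend on the compressor producing identical results on a second run.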
Note that DEFLATE compression algorithms are generally not guaranteed to be deterministic, so it is not recommended to try to save the disk space by compressing the `contents` once to measure the size, and then again for the final output.
Alternatively, the writer MAY set `generalPurposeBits` bit 3, indicating deferred lengths, and set `crc32`, `compressedSize32`, and `uncompressedSize32` to `0`.
This is recommended in situations where using a temporary file is undesirable.
Alternatively, the writer MAY write placeholder values for `crc32`, `compressedSize32`, `uncompressedSize32`, and possibly a `Zip64ExtendedInformation` extra field,
and then, after computing the CRC32 of the `contents` and writing the `fileData` to the ZIP file,
seek back and overwrite these placeholder values in the `LocalFileHeader` with the correct values.
This is only possible when the ZIP file being written is seekable, and requires a heuristic to pre-allocate space for a `Zip64ExtendedInformation` before knowing the `compressedSize`.
If the `uncompressedSize` is also unknown before processing, then this strategy is not recommended.
After the `fileData`, the writer SHALL include a `DataDescriptor` if and only if `generalPurposeBits` bit 3 is set.
The writer SHOULD include the `signature` of the struct, and otherwise SHOULD use the smallest struct variant sufficient to encode the sizes.
See Endnote `DataDescriptorHasAmbiguousStructure`.
#### Writing `CentralDirectoryHeader`
The writer SHOULD NOT include any unused space before the first `CentralDirectoryHeader`; see Endnote `UnusedSpaceCompatibility`.
The order of entries in the `CentralDirectoryHeader` SHALL match the order of entries written in the previous section; see Endnote `ZipBomb`.
The writer SHOULD NOT support setting the `fileComment`, and SHOULD simply set `fileCommentLength` to `0`; see Endnote `FileCommentSupport`.
The last `CentralDirectoryHeader` has a special requirement when the two ZIP64 end structs are not going to be used,
which is that the 4 bytes starting 20 bytes from the end SHALL NOT be `0x07064b50` i.e. `{0x50, 0x4b, 0x06, 0x07}` i.e. `"PK\x06\x07"`.
The writer SHOULD buffer the last `CentralDirectoryHeader` in memory to check for this before writing it to the ZIP file.
The writer MAY do this check on all `CentralDirectoryHeader` structs for implementation simplicity.
If the forbidden byte pattern is found, the writer SHOULD increase `extraFieldsLength` by `1` and append a `0` byte to the `extraFields`,
which is guaranteed to prevent the forbidden pattern of bytes from being found at the specific offset from the end.
If `extraFieldsLength` is already `0xffff`, then the writer SHOULD fail to create the ZIP file.
See Endnote `Zip64EndStructsAmbiguity` and Padding.
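The padding workaround above can be sketched as follows. The sketch assumes the serialized `CentralDirectoryHeader` uses the conventional layout, with a 46-byte fixed portion and `extraFieldsLength` at byte offset 30, and that `fileCommentLength` is `0` as recommended, so that the `extraFields` are the last bytes of the struct. The function name `avoid_locator_ambiguity` is hypothetical.

```python
import struct

ZIP64_LOCATOR_SIGNATURE = b"PK\x06\x07"


def avoid_locator_ambiguity(cdh: bytes) -> bytes:
    """Pad the last CentralDirectoryHeader if it could be mistaken for a ZIP64 locator."""
    if cdh[-20:-16] != ZIP64_LOCATOR_SIGNATURE:
        return cdh
    # fileNameRawLength, extraFieldsLength, fileCommentLength at offsets 28, 30, 32.
    _name_len, extra_len, comment_len = struct.unpack_from("<HHH", cdh, 28)
    if comment_len != 0:
        raise ValueError("this sketch assumes fileCommentLength is 0")
    if extra_len == 0xFFFF:
        raise ValueError("extraFieldsLength is already 0xffff; cannot pad")
    padded = bytearray(cdh)
    struct.pack_into("<H", padded, 30, extra_len + 1)
    # Appending one 0 byte shifts the pattern to 21 bytes from the end.
    padded.append(0)
    return bytes(padded)
```

After padding, the 4 bytes starting 20 bytes from the end begin with `0x4b` rather than `0x50`, so they cannot match the forbidden pattern.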
#### Writing end structs
The two ZIP64 end structs SHOULD only be included when necessary, but MAY be included in any case.
TODO: either don't use a comment, or make sure the signature doesn't show up after the signature. workaround: switch to zip64 format if not in the comment. and it shouldn't be in the comment, because control codes are ambiguous in cp437.
## Common ZIP vs APPNOTE
While it is out of scope to describe the precise meaning of APPNOTE, there are some cases where Common ZIP clearly diverges. This is a non-exhaustive list of them.
### Removed Features
Several features from APPNOTE are removed in Common ZIP:
- * Multi-disk support, aka splitting and spanning, is forbidden.
- * Traditional encryption is forbidden.
- * Strong encryption including central directory encryption and certificates is forbidden.
- * Patch data is forbidden.
- * Compression methods beyond 0 (stored) and 8 (deflated) are forbidden: `1`, `2`, `3`, `4`, `5`, `6`, `7`, `9`, `10`, `11`, `12`, `13`, `14`, `15`, `16`, `17`, `18`, `19`, `20`, `93`, `94`, `95`, `96`, `97`, `98`, `99`.
- * `fileSystemCompatibility` other than `0` (DOS) and `3` (UNIX) are forbidden: `1`, `2`, `4`, `5`, `6`, `7`, `8`, `9`, `10`, `11`, `12`, `13`, `14`, `15`, `16`, `17`, `18`, `19`.
- * All `zip64 extensible data sector` extra fields are forbidden: Z390 Extra Field.
- * All bits of `internalFileAttributes` are ignored: `0x1`, `0x2`.
- * Extra fields other than `0x0000`, `0x0001`, `0x5455`, and `0x7075`: `0x0007`, `0x0008`, `0x0009`, `0x000c`, `0x000d`, `0x000e`, `0x000f`, `0x0014`, `0x0015`, `0x0016`, `0x0017`, `0x0018`, `0x0019`, `0x0020`, `0x0021`, `0x0022`, `0x0023`, `0x0065`, `0x0066`, `0x4690`, `0x07c8`, `0x1986`, `0x2605`, `0x2705`, `0x2805`, `0x334d`, `0x4154`, `0x4341`, `0x4453`, `0x4704`, `0x470f`, `0x4854`, `0x4b46`, `0x4c41`, `0x4d49`, `0x4d63`, `0x4f4c`, `0x5356`, `0x554e`, `0x5855`, `0x6375`, `0x6542`, `0x6854`, `0x7441`, `0x756e`, `0x7875`, `0x7855`, `0xa11e`, `0xa220`, `0xcafe`, `0xd935`, `0xe57a`, `0xfd4a`, `0x9901`, `0x9902`.
- * Ads for PKWARE Proprietary Technology are removed.
Several discouragements from APPNOTE are relaxed to be simply permitted in Common ZIP:
- * APPNOTE section 4.1.2 discourages ZIP files from having no entries, but Common ZIP simply permits this.
- * APPNOTE section 4.4.12 discourages the total size of a `LocalFileHeader` or `CentralDirectoryHeader` from exceeding `65535` bytes, but Common ZIP simply permits this.
### Additional Features
Many of the additions to Common ZIP relative to APPNOTE come in the form of security recommendations,
for example the recommendation to require the `fileName` matches between `CentralDirectoryHeader` and `LocalFileHeader`.
These additions are numerous and are not enumerated here.
Common ZIP requires features that improve compatibility between modern ZIP implementations. APPNOTE either describes these as optional or does not include them:
- * The Padding extra field is explicitly documented. See Endnote `AndroidAlignment`.
- * Base Offset Shift is explicitly forbidden. See Endnote `BaseOffsetShift`.
- * The `dosTimestamp` format is documented, and validation rules are provided. A value of `0` is explicitly allowed.
- * The `InfoZipUniversalTimestamp` struct is documented.
- * The `NtfsTimestamp` struct is more thoroughly documented.
This spec also includes substantial Recommendations for Implementations, which APPNOTE declares out of scope in section 1.2.1.
### Changed Features
#### Structural Ambiguities
APPNOTE describes the structure of a ZIP file ambiguously and Common ZIP declares a definitive interpretation:
- * Common ZIP defines how readers can locate the `EndOfCentralDirectoryRecord` `signature` and `Zip64EndOfCentralDirectoryLocator` `signature`, and forbids writers from creating any ambiguities there.
- * Common ZIP explicitly allows unused space in certain places in a ZIP file. APPNOTE does not make this explicit, but perhaps suggests it in section 4.1.9.
#### Entry Order
APPNOTE section 4.4.1.3 states that the order of `LocalFileHeader` entries and `CentralDirectoryHeader` entries may be different.
Common ZIP requires they are the same.
#### Character Encoding
APPNOTE describes the default character encoding as `IBM Code Page 437` in section D.1.
Common ZIP allows readers to use any ASCII-based character encoding when UTF-8 is not explicitly signaled,
and encourages writers to set `generalPurposeBits` bit 11 indicating UTF-8 file names unconditionally.
See Endnote `DefaultCharacterEncoding`.
#### `fileName` Validation
Common ZIP differs from APPNOTE regarding `fileName` validation:
- * Reader support for `InfoZipUnicodePath` is required rather than optional.
- * Readers are required to normalize `\` to `/` rather than treating `\` as illegal. See Endnote `DotNetBackslashes`.
- * Empty `fileName` is not permitted rather than being used for input that "came from standard input". (4.4.12)
#### `appnoteCompatibilityMin` aka `version needed to extract`
APPNOTE 4.4.3.1 requires that `version needed to extract` be set to the lowest value strictly necessary for the set of features used.
Common ZIP allows and encourages setting `appnoteCompatibilityMin` to `45` even if ZIP64 is not used.
TODO: delete this TODO: Endnote. That word makes vim autocomplete the singular form instead of the plural form when editing above this line.
## Endnotes
**Endnote `PublicDomain`**:
The ZIP file format was placed in the public domain in 1989 by its creators, Phil Katz of PKWARE, Inc. and Gary Conway of Infinity Design Concepts, Inc. "This next step was developed jointly by IDC and PKWARE and released into the public domain on Feb. 14,1989, via a joint press release. Why public domain? Because the ZIP file format was built on the work of many others (as were .ARC files)" - Infinity Design Concepts, Inc. (Archived) https://web.archive.org/web/20040210234346/http://www.idcnet.us/ziphistory.html
- * The referenced press release: http://cd.textfiles.com/pcmedic9310/MAIN/MISC/COMPRESS/ZIP.PRS (Archived: https://web.archive.org/web/20250524153854/http://cd.textfiles.com/pcmedic9310/MAIN/MISC/COMPRESS/ZIP.PRS )
**Endnote `DotNetBackslashes`**:
Microsoft's `System.IO.Compression.ZipFile` class in .NET versions 4.5.0 until 4.6.1 created ZIP files with backslashes in file names.
Common ZIP explicitly permits backslashes as non-canonical alternatives to forward slashes in file names to accommodate this issue.
Common ZIP specifies that this canonicalization occurs after `InfoZipUnicodePath` interpretation, even though the buggy versions of .NET never produce this extra field.
This is intentional, both for simplicity of implementation (and specification) and to mitigate the possibility of backslashes in the extra field causing implementation differences between Windows and non-Windows systems.
https://learn.microsoft.com/en-us/dotnet/framework/migration-guide/mitigation-ziparchiveentry-fullname-path-separator
**Endnote `AndroidAlignment`**:
Android's `zipalign` tool intentionally adds `0` bytes to the end of the `extraFields` buffer in `LocalFileHeader` structs to achieve 4-byte or 16KiB alignment within the ZIP file.
Between 1 and 3 bytes of `0` after the last `ExtraField` cannot be read as an `ExtraField`, because they are too short to hold the 4 bytes of tag and size.
4 bytes of `0` encode a 0-length extra field with a tag value of `0`, which has no defined meaning in APPNOTE or presumably any implementation.
https://developer.android.com/tools/zipalign
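The padding behavior described above can be sketched as a tolerant parser. This is a minimal illustration, assuming each `ExtraField` starts with a 2-byte tag followed by a 2-byte size, both little-endian. The function name `parse_extra_fields` and the choice to report the padding length are hypothetical and not defined by this specification.

```python
import struct

def parse_extra_fields(buf):
    """Split an extraFields buffer into (tag, data) pairs plus trailing padding.

    Hypothetical helper: returns (fields, padding_length).
    """
    fields = []
    offset = 0
    # A field header needs 4 bytes: 2-byte tag and 2-byte size, little-endian.
    while len(buf) - offset >= 4:
        tag, size = struct.unpack_from("<HH", buf, offset)
        offset += 4
        if size > len(buf) - offset:
            raise ValueError("extra field data overruns the buffer")
        fields.append((tag, buf[offset:offset + size]))
        offset += size
    # 1 to 3 leftover bytes cannot hold a header; accept them only if all zero.
    padding = buf[offset:]
    if padding.strip(b"\x00"):
        raise ValueError("non-zero bytes after the last extra field")
    return fields, len(padding)
```

With this sketch, 4 bytes of `0` show up as a field with tag `0` and empty data, while 1 to 3 bytes of `0` are reported as padding.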
**Endnote `MaliciouslyAmbiguousStructure`**:
It is possible to craft ambiguous ZIP files arguably compliant with APPNOTE that produce different contents depending on reader implementation.
The only known cases of such ambiguous ZIP files are security exploits, and so this specification requires that random-access readers check for and reject such ZIP files.
https://gynvael.coldwind.pl/?id=682
**Endnote `ZipBomb`**:
A zip bomb exploit involves a deceptively small input ZIP file that produces excessively large contents when extracted onto disk.
This can be used as a denial of service attack, and this specification requires readers take certain precautions to prevent this from happening.
https://www.bamsoftware.com/hacks/zipbomb/
**Endnote `LargeMetadataDenialOfService`**:
The listing of files in a ZIP file can be extremely large.
The maximum number of entries is 4294967295, and the maximum file name length is 65535 bytes.
The names alone max out around 281 TB (just under 256 TiB) of data in total.
Implementations SHOULD guard against denial-of-service attacks from untrusted inputs by limiting the amount of memory and/or disk space used for storing entry metadata.
TODO: link to research.
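The arithmetic behind the worst case, and one possible shape for the guard, can be sketched as follows. The `MetadataBudget` class and its `charge` method are hypothetical names for illustration; this specification does not prescribe how a limit is enforced or what the limit should be.

```python
MAX_ENTRIES = 0xFFFFFFFF        # 4294967295
MAX_FILE_NAME_LENGTH = 0xFFFF   # 65535 bytes

# Worst-case total size of the file names alone, in bytes.
worst_case_name_bytes = MAX_ENTRIES * MAX_FILE_NAME_LENGTH

class MetadataBudget:
    """Hypothetical guard: track how much entry metadata a reader has accepted."""

    def __init__(self, limit_bytes):
        self.remaining = limit_bytes

    def charge(self, byte_count):
        # Call this before storing each entry's name, extra fields, and comment.
        if byte_count > self.remaining:
            raise ValueError("entry metadata exceeds the configured limit")
        self.remaining -= byte_count
```

A reader would create one budget per ZIP file and call `charge` for each `CentralDirectoryHeader` before allocating storage for it.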
**Endnote `DataDescriptorHasAmbiguousStructure`**:
The `DataDescriptor` struct is ambiguously sized if the `crc32` happens to be `0x08074b50` (the optional `DataDescriptor` signature),
or if the first 4 bytes of the `uncompressedSize` happen to have the same value as the `crc32`,
or if the `crc32` happens to have the value `0x04034b50` or `0x02014b50` (the `LocalFileHeader` and `CentralDirectoryHeader` signatures).
This specification recommends that writers always include the signature so that even a naive reader check parses the struct correctly,
and that writers use 64-bit sizes only when necessary, to be less surprising.
TODO: link to research on this.
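The first case can be illustrated with a deliberately naive reader. This sketch assumes the 32-bit layout with fields in the order `crc32`, `compressedSize`, `uncompressedSize`, all little-endian; the function name `naive_parse_data_descriptor` is hypothetical.

```python
import struct

DATA_DESCRIPTOR_SIGNATURE = 0x08074b50

def naive_parse_data_descriptor(buf):
    """Naive 32-bit parse: treat a leading 0x08074b50 as the optional signature."""
    (first,) = struct.unpack_from("<I", buf, 0)
    offset = 4 if first == DATA_DESCRIPTOR_SIGNATURE else 0
    # Returns (crc32, compressedSize, uncompressedSize).
    return struct.unpack_from("<III", buf, offset)

# A file whose crc32 happens to equal the signature value.
fields = (0x08074b50, 100, 200)
next_struct = struct.pack("<I", 0x04034b50)  # the following LocalFileHeader signature

with_signature = struct.pack("<IIII", DATA_DESCRIPTOR_SIGNATURE, *fields) + next_struct
without_signature = struct.pack("<III", *fields) + next_struct

parsed_ok = naive_parse_data_descriptor(with_signature)      # the intended fields
misparsed = naive_parse_data_descriptor(without_signature)   # shifted by 4 bytes
```

When the writer includes the signature, the naive check recovers the intended fields. When the writer omits it, the same check mistakes the `crc32` for the signature and reads every field 4 bytes too late.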
**Endnote `DataDescriptorIsAnAmbiguousSentinel`**:
If a file's contents are compressed using DEFLATE, a streaming reader can determine the end of the file contents by using the end-of-stream signal inherent to the compression algorithm.
If a streaming reader cannot determine the end of a file's contents, the reader cannot read the ZIP file.
The `DataDescriptor` is poorly suited as a sentinel to user data, not only due to the complexities described in Endnote `DataDescriptorHasAmbiguousStructure`,
but also because uncompressed file contents might contain arbitrary byte sequences including a maliciously crafted `DataDescriptor` to cause ambiguity in the ZIP file.
See also Endnote `MaliciouslyAmbiguousStructure`.
TODO: link to someone else's research here.
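A minimal sketch of locating the end of DEFLATE-compressed contents without consulting a `DataDescriptor`, using Python's `zlib` module with a raw stream (`wbits=-15`). The function name `read_deflate_entry` is hypothetical, and the sketch holds the whole input in memory and applies no output limit, which a real reader would need to add.

```python
import zlib

def read_deflate_entry(buf):
    """Decompress one raw DEFLATE stream from the start of buf.

    Returns (contents, compressed_size). Bytes after the end of the stream,
    such as a DataDescriptor, are left untouched.
    """
    decompressor = zlib.decompressobj(wbits=-15)
    contents = decompressor.decompress(buf)
    if not decompressor.eof:
        raise ValueError("DEFLATE stream ended before its end-of-stream marker")
    # unused_data holds whatever followed the end of the compressed stream.
    compressed_size = len(buf) - len(decompressor.unused_data)
    return contents, compressed_size
```

The compressed size is derived from the compression algorithm's own end-of-stream signal, so the reader never has to search the data for a sentinel.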
**Endnote `DecompressionDenialOfService`**:
In order to enable readers to usefully guard against denial of service attacks, the `uncompressedSize` metadata must be accurate.
If a reader expects the uncompressed size will fit within a size limit, but then the decompression process produces much more, it can subvert the guard against the attack.
Therefore, readers are required to guard against the attack mid-decompression.
TODO: link to someone else's research here.
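One way to guard mid-decompression is to cap the decompressor's output at one byte more than the declared size. This is a sketch only, using the `max_length` argument of Python's `zlib` decompressor; the function name `decompress_with_limit` is hypothetical, and a streaming reader would apply the same check chunk by chunk rather than on a whole buffer.

```python
import zlib

def decompress_with_limit(compressed, declared_uncompressed_size):
    """Decompress a raw DEFLATE stream, refusing to exceed the declared size."""
    decompressor = zlib.decompressobj(wbits=-15)
    # Allow one extra byte of output so that an overrun is detectable
    # without ever producing more than declared size + 1 bytes.
    contents = decompressor.decompress(compressed, declared_uncompressed_size + 1)
    if len(contents) > declared_uncompressed_size:
        raise ValueError("contents are larger than the declared uncompressedSize")
    if not decompressor.eof:
        raise ValueError("DEFLATE stream is truncated")
    if len(contents) != declared_uncompressed_size:
        raise ValueError("contents are smaller than the declared uncompressedSize")
    return contents
```

Because the limit is enforced by the decompressor itself, an inaccurate `uncompressedSize` cannot cause more than one extra byte of output to be produced.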
**Endnote `Zip64EndStructsAmbiguity`**:
APPNOTE gives no definitive test for how to detect the presence of the two ZIP64 end structs.
Different ZIP reader implementations behave differently, and this specification suggests behavior that errs on the side of compatibility with varying interpretations rather than adhering strictly to one of them.
Some implementations check for the `Zip64EndOfCentralDirectoryLocator` signature unconditionally (e.g. Info-ZIP);
some implementations check for the signature only if there is a max-valued field to be extended (e.g. Go `archive/zip`);
some implementations require the signature if there is a max-valued field (e.g. yauzl prior to v3.1.1).
Some links can be found here: https://github.com/thejoshwolfe/yauzl/issues/108 .
Therefore, the recommendation in this specification is that max-valued fields should correspond to the presence of the two ZIP64 end structs.
In short: a writer needs to clue some implementations into the locator by setting at least one field to its maximum value (all bits set).
Also, if a field is genuinely max-valued without ZIP64 being in use, a reader cannot settle the question at this point by checking whether `version needed to extract` is lower than `45`,
because that field appears only in individual file entries and in the `Zip64EndOfCentralDirectoryRecord`.
Example ambiguity: an `NtfsTimestamp` at the end of the last `CentralDirectoryHeader` with an `mtime` representing `3205-01-02T11:35:25Z` (plus or minus about 3m35s).
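The two signals discussed above can be compared with a short sketch. It assumes the 22-byte fixed portion of the `EndOfCentralDirectoryRecord` (signature, two disk numbers, two entry counts, central directory size and offset, comment length) and a 20-byte `Zip64EndOfCentralDirectoryLocator` immediately before it. The function name and the three result strings are hypothetical.

```python
import struct

EOCDR_FORMAT = "<IHHHHIIH"  # fixed 22-byte portion, little-endian
ZIP64_LOCATOR_SIGNATURE = 0x07064b50
ZIP64_LOCATOR_SIZE = 20

def classify_zip64_end_structs(buf, eocdr_offset):
    """Compare the max-valued-field signal with the locator-signature signal."""
    (_signature, disk, cd_disk, entries_here, entries_total,
     cd_size, cd_offset, _comment_length) = struct.unpack_from(EOCDR_FORMAT, buf, eocdr_offset)
    max_valued = (0xFFFF in (disk, cd_disk, entries_here, entries_total)
                  or 0xFFFFFFFF in (cd_size, cd_offset))

    locator_offset = eocdr_offset - ZIP64_LOCATOR_SIZE
    locator_found = (
        locator_offset >= 0
        and struct.unpack_from("<I", buf, locator_offset)[0] == ZIP64_LOCATOR_SIGNATURE
    )

    if max_valued and locator_found:
        return "zip64"
    if not max_valued and not locator_found:
        return "not zip64"
    # The signals disagree; different implementations would behave differently.
    return "ambiguous"
```

A writer following the recommendation in this endnote only ever produces files in the first two categories.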
**Endnote `MultiDiskZip64Compatibility`**:
TODO: explain that if an implementation panics too soon on multi-disk info, they might not notice that the zip64 end structs overwrite it back to single-disk info.
**Endnote `BaseOffsetShift`**:
Info-ZIP and some other implementations allow all encoded offsets in a ZIP file to be too small by some fixed amount, which suggests that an otherwise well-formed ZIP file was concatenated to the end of some other data, often a self-extracting executable program.
The shift is detected by reading the offset and size of the central directory and, assuming the central directory is immediately followed by the end structs,
using the size to calculate a correction for the offset.
TODO: how does this work with the variable-sized extensible data sector? Needs research.
Info-ZIP also has a `-F` fixup operational mode where it will correct the offsets found to be shifted in this way,
and gives instructions (TODO: cite actual source here) for creating self-extracting ZIP files by using this concatenate-then-fixup workflow.
This specification forbids ZIP files having this Base Offset Shift, and instead requires that offsets be correct.
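The detection described above reduces to one subtraction. This sketch assumes there are no ZIP64 end structs and no unused space between the central directory and the `EndOfCentralDirectoryRecord`; the function and parameter names are hypothetical.

```python
def base_offset_shift(eocdr_position, central_directory_size, central_directory_offset):
    """Return how many bytes the encoded offsets are too small by.

    eocdr_position is where the EndOfCentralDirectoryRecord was actually found;
    the other two values are the fields encoded inside it.
    """
    # If the central directory ends exactly where the end structs begin,
    # its actual start is the end struct position minus its size.
    actual_position = eocdr_position - central_directory_size
    return actual_position - central_directory_offset

def require_no_base_offset_shift(eocdr_position, central_directory_size, central_directory_offset):
    """Reject a shifted file, as this specification requires offsets to be correct."""
    shift = base_offset_shift(eocdr_position, central_directory_size, central_directory_offset)
    if shift != 0:
        raise ValueError(f"encoded offsets are off by {shift} bytes")
```

For example, a ZIP file whose central directory is encoded at offset 500 with size 100, appended to 1000 bytes of other data, has its `EndOfCentralDirectoryRecord` at position 1600, giving a shift of 1000.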
**Endnote `ConsistencyForSecurity`**:
TODO: link to gynvael coldwind's research on how implementation differences can lead to security problems.
The idea is that if two implementations see different contents in a ZIP file, then that unexpected situation can be exploited,
such as by sneaking malicious content past a security checker.
TODO: consider merging this with Endnote `MaliciouslyAmbiguousStructure`.
**Endnote `WhyVerifyCrc32`**:
TODO: idk. is this important?
**Endnote `MaliciouslyConflictingCompressionMethod`**:
TODO: explain that if you trust the compression method in the `LocalFileHeader` vs `CentralDirectoryHeader` and don't notice a conflict between them,
then the contents of the file can take multiple forms for different readers, even while preserving the `compressedSize`, `uncompressedSize`, and `crc32`.
Also TODO: craft such an example.
**Endnote `PathTraversal`**:
TODO: explain for file paths and symlink targets.
**Endnote `NonNormalizedPaths`**:
TODO: research why non-normalized paths are harmful, e.g. `a/../b.txt`, `a/./b.txt`, `a//b.txt`.
**Endnote `ExtraFieldBufferOverflow`**:
TODO: explain.
**Endnote `Cp437IsAmbiguous`**:
There are two conflicting definitions of `IBM Code Page 437` aka CP437, one from IBM and one from Unicode.
The conflict is in the range 1-31, where IBM defines there to be various dingbats, and Unicode defines the range to be the ASCII control characters.
The common intersection of the two definitions of CP437 and UTF-8 is the set of printable ASCII characters in the range `0x20` to `0x7e`.
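The conflict can be seen with a small sketch. It relies on Python's built-in `cp437` codec, which decodes the bytes in this range as control characters, and contrasts it with a hypothetical lookup table `IBM_GLYPHS` holding only the first six symbols of the IBM PC character set. The table is deliberately incomplete and the function name is an assumption for illustration.

```python
# First few bytes in the conflicting range, as symbols in the IBM PC character set.
IBM_GLYPHS = {
    0x01: "\u263a",  # white smiling face
    0x02: "\u263b",  # black smiling face
    0x03: "\u2665",  # heart suit
    0x04: "\u2666",  # diamond suit
    0x05: "\u2663",  # club suit
    0x06: "\u2660",  # spade suit
}

def decode_cp437_ibm_style(raw):
    """Decode bytes as CP437, preferring the symbol reading for bytes in IBM_GLYPHS."""
    return "".join(IBM_GLYPHS.get(byte, bytes([byte]).decode("cp437")) for byte in raw)

control_reading = b"a\x03b".decode("cp437")        # byte 0x03 becomes U+0003
symbol_reading = decode_cp437_ibm_style(b"a\x03b")  # byte 0x03 becomes U+2665

# The printable ASCII range decodes identically under every interpretation.
printable = bytes(range(0x20, 0x7f))
```

Two readers that disagree on this point will display, and possibly extract, different file names from the same bytes.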
**Endnote `DefaultCharacterEncoding`**:
When `generalPurposeBits` bit 11 is unset (and there is no `InfoZipUnicodePath` extra field, or the `oldCrc32` doesn't match),
the reader is left to interpret the `fileNameRaw` in an implementation-defined ASCII-based character encoding.
APPNOTE reports that PKZIP has used `IBM Code Page 437` in this situation,
however, that is not a prescriptive requirement, and it is also unclear what exactly it means. See Endnote `Cp437IsAmbiguous`.
Because of this behavior from PKZIP, every reader will almost surely attempt to use some ASCII-based encoding, never an EBCDIC-based encoding.
It is generally a safe assumption that if all the bytes of `fileNameRaw` are in the printable ASCII range, then the encoding is ASCII.
One notable exception to this is the historically common Shift JIS encoding which maps ASCII `\` (`0x5c`) to `¥` (U+00A5) and ASCII `~` (`0x7e`) to `‾` (U+203E).
The recommended default character encoding for readers is ASCII limited to the range `0x20` to `0x7d` inclusive,
and any byte value outside this range is recommended to produce an error of some kind.
Note that `\` is already specially handled in a `fileName`.
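The recommended default can be sketched as a strict decoder. The function name `decode_file_name_default` and the use of `ValueError` are assumptions; the special handling of `\` happens elsewhere and is not shown here.

```python
def decode_file_name_default(file_name_raw):
    """Decode fileNameRaw when UTF-8 is not signaled, per the recommended default.

    Accepts only bytes in the range 0x20 to 0x7d inclusive; anything else is an error.
    """
    for byte in file_name_raw:
        if not 0x20 <= byte <= 0x7d:
            raise ValueError(f"byte 0x{byte:02x} is outside the range 0x20 to 0x7d")
    return file_name_raw.decode("ascii")
```

Note that the upper bound excludes `~` (`0x7e`), so a name such as `a~b` is rejected by this sketch.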
**Endnote `UnusedSpaceCompatibility`**:
TODO: talk about how sunzip and probably other streaming readers simply choke on a signature mismatch instead of scanning over it.
**Endnote `ModifiedTimestampVsOtherTimestamps`**:
TODO: talk about how Info-ZIP only includes mtime in `0x5455` despite the struct supporting more.
TODO: research how `NtfsTimestamp` `0x000a` gets encoded on windows.
TODO: talk about how atime and ctime don't really work https://learn.microsoft.com/en-us/windows/win32/api/fileapi/ns-fileapi-by_handle_file_information .
**Endnote `DosTimestampZero`**:
TODO: Some Go implementations of ZIP generate archives with `dosTimestamp` `0`. Mention them and argue why it's intuitive or whatever to assume that `0` means unspecified.
**Endnote `FileNameCollisions`**:
File name collisions will almost always cause errors when extracting.
And even when not extracting, checking for duplicates puts a complexity burden on readers.
TODO: research on the behavior of various implementations regarding duplicate file names.
TODO: include nuances about case collisions, normalization collisions, ancestor directory vs file collisions, and ancestors-are-symlinks collisions.
**Endnote `FileCommentSupport`**:
TODO: research how well file comments are supported.
**Endnote `GeneralPurposeBits1And2`**:
Although Info-ZIP sets `generalPurposeBits` bit 1 and bit 2 based on the compression level given to zlib, see https://github.com/thejoshwolfe/info-zip-zip/blob/3.0/zipup.c#L1457-L1461 ,
the Python `zipfile` module never sets these for DEFLATE compression https://github.com/python/cpython/blob/3.14/Lib/zipfile/__init__.py#L1810-L1815 .
Because the bits are not a reliable source of information, a reader SHOULD simply ignore them,
and a writer MAY unconditionally set them to `0`.
**Endnote `AppnoteCompatibilityMinBeyond45`**:
Although `generalPurposeBits` bit 11 indicating UTF-8 file names was introduced in APPNOTE 6.3, use of the feature is not encoded in `appnoteCompatibilityMin`.
APPNOTE itself does not list the feature in section 4.4.3.2, the `version needed to extract` table,
only listing compression and encryption features beyond version 4.5.
Info-ZIP `unzip` will print a warning and skip entries with `appnoteCompatibilityMin` greater than `45`,
presumably because the field encodes the use of one of the explicitly listed features in APPNOTE section 4.4.3.2, none of which `unzip` supports.
- * https://github.com/thejoshwolfe/info-zip-unzip/blob/6.0/extract.c#L933-L941
- * https://github.com/thejoshwolfe/info-zip-unzip/blob/6.0/unzpriv.h#L684
## References
**APPNOTE**: the .ZIP File Format Specification from PKWARE, the original creators. https://support.pkware.com/pkzip/appnote
**DEFLATE**: a compression algorithm invented by Phil Katz in 1990, standardized in RFC 1951 (1996) https://datatracker.ietf.org/doc/html/rfc1951 .
Any number of window bits can be used for the compression in a ZIP file, typically 15.
Note that in terms of the popular DEFLATE implementation zlib, ZIP files always use raw streams with no containers or headers; typically `windowBits=-15`.
**CRC32**: The standard cyclic redundancy check implemented in most standard libraries. The following are the standard parameters: `width=32 poly=0x04c11db7 init=0xffffffff refin=true refout=true xorout=0xffffffff check=0xcbf43926 residue=0xdebb20e3 name="CRC-32/ISO-HDLC"`. https://reveng.sourceforge.io/crc-catalogue/all.htm#crc.cat.crc-32-iso-hdlc
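As a quick sanity check, the `check` parameter above is the CRC32 of the ASCII string `123456789`. The sketch below uses Python's `zlib.crc32`, which implements this variant, and shows incremental computation over chunks; the helper name `crc32_of_chunks` is hypothetical.

```python
import zlib

def crc32_of_chunks(chunks):
    """Compute the CRC32 of a sequence of byte chunks incrementally."""
    value = 0
    for chunk in chunks:
        # The second argument carries the running value from the previous chunk.
        value = zlib.crc32(chunk, value)
    return value

check_value = zlib.crc32(b"123456789")
```

Computing the value incrementally lets a reader verify the `crc32` while streaming file contents, without buffering them.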
**UTF-8**: The most popular variable-width encoding for text as bytes. Never includes any byte order mark. https://datatracker.ietf.org/doc/html/rfc3629
**ASCII** aka US-ASCII aka C0 Controls and Basic Latin: A fixed-width encoding for text as bytes, coinciding with the first 128 codepoints of UTF-8, thereby being a strict subset of UTF-8. https://www.unicode.org/charts/PDF/U0000.pdf
**CP437** aka IBM Code Page 437: An ambiguous charset mentioned in APPNOTE as the default character encoding if UTF-8 is not used.
IBM and Unicode give conflicting definitions for this encoding. See Endnote `Cp437IsAmbiguous`.
- * IBM: https://public.dhe.ibm.com/software/globalization/gcoc/attachments/CP00437.txt
- * IBM: https://public.dhe.ibm.com/software/globalization/gcoc/attachments/CP00437.pdf
- * Unicode: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT
- * Unicode: https://github.com/unicode-org/icu/blob/d78cf74bca8dcbcbaea8cdcb1d8ac04db02b61ed/icu4c/source/data/mappings/ibm-437_P100-1995.ucm
**UTC Time**: The standard timekeeping system used by most computer systems. Includes Leap Seconds.
**MS-DOS Date Time**: The bit-packed timestamp representation used by MS-DOS on the FAT family of file systems,
usually expressed as a pair of 16-bit integers: date and time.
In Common ZIP these two numbers are combined into a single 32-bit integer.
The maximum value allowed for the year varies across Microsoft APIs, sometimes `119` representing year 2099 and sometimes `127` representing year 2107.
This specification limits the year to the more restrictive value for maximum compatibility.
Although some implementations may permit out of bounds values, such as day `0` or hour `25`, this specification does not.
Moments in time that are Leap Seconds are arguably valid in this encoding,
however Common ZIP forbids encoding 60 seconds past the minute in order to simplify validation logic.
Timezones are not possible to encode in this format; typically implementations will interpret times in the system's local timezone, which is further reason to forbid encoding leap seconds.
- * https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-dosdatetimetofiletime
- * https://learn.microsoft.com/en-us/windows/win32/api/oleauto/nf-oleauto-dosdatetimetovarianttime
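A minimal sketch of the bit packing, assuming the usual MS-DOS layout: the date holds the day in bits 0-4, the month in bits 5-8, and the year minus 1980 in bits 9-15; the time holds the seconds divided by 2 in bits 0-4, the minute in bits 5-10, and the hour in bits 11-15. Placing the date in the upper 16 bits of the combined integer is an assumption based on the time preceding the date in the file. The function name is hypothetical, and the sketch does not check that the day exists in the given month.

```python
def encode_dos_timestamp(year, month, day, hour, minute, second):
    """Pack a date and time into a single 32-bit MS-DOS Date Time value."""
    if not 1980 <= year <= 2099:  # year field 0 to 119, the more restrictive limit
        raise ValueError("year out of range")
    if not 1 <= month <= 12:
        raise ValueError("month out of range")
    if not 1 <= day <= 31:
        raise ValueError("day out of range")
    if not 0 <= hour <= 23:
        raise ValueError("hour out of range")
    if not 0 <= minute <= 59:
        raise ValueError("minute out of range")
    if not 0 <= second <= 59:  # 60 seconds past the minute is forbidden
        raise ValueError("second out of range")
    date = ((year - 1980) << 9) | (month << 5) | day
    # The format has 2-second resolution, so an odd second is rounded down.
    time = (hour << 11) | (minute << 5) | (second // 2)
    return (date << 16) | time
```

For example, `1980-01-01T00:00:00` encodes as `0x00210000`, and `2099-12-31T23:59:58` encodes as `0xef9fbf7d`.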
**POSIX Time**: A number which is an approximation of the number of seconds since `1970-01-01T00:00:00Z` encoding a moment in time.
Moments before 1970 could be represented by negative numbers, but The Open Group leaves this undefined.
The Open Group defines a conversion from year, month, day, hour, minute, and second to POSIX Time,
however the relationship between these year, month, etc. values and "actual time of day" is explicitly unspecified.
Because a POSIX day is defined by exactly `86400` seconds, POSIX Time is oblivious of Leap Seconds.
Implementations reconcile this oblivion in various ways, all of which are subtly problematic.
- * https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_16 (Archived: https://web.archive.org/web/20171211234521/http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_16 )
**NTFS File Time**: A number encoding a moment in time as the number of 100-nanosecond intervals since `1601-01-01T00:00:00Z`.
Microsoft references "Coordinated Universal Time (UTC)" and describes the number being in "UTC format",
however it appears as though Leap Seconds are not included in NTFS File Time despite being included in Coordinated Universal Time (UTC).
- * https://learn.microsoft.com/en-us/windows/win32/sysinfo/file-times
- * https://www.forensicfocus.com/articles/interpretation-of-ntfs-timestamps/ (search for "Leap seconds")
**Leap Seconds**: A leap second is a 1 second adjustment to a clock to synchronize timekeeping based on counting seconds and timekeeping based on the rotation of the Earth. Leap seconds are only used in some timekeeping systems, and are notably absent from POSIX Time which is used by most computing systems. Leap seconds are relevant in contexts such as encoding a moment in time in a digital format, converting between encodings, and measuring the interval of time between two moments. For an archive format like ZIP, only encoding and conversion are relevant, and leap seconds can be ignored for these cases. The three encodings of moments in time supported in Common ZIP lack explicit support for encoding leap seconds. This lack of support is explicit for POSIX Time, undocumented for NTFS File Time, and unclear for MS-DOS Date Time. The epoch-oriented timekeeping systems, POSIX Time and NTFS File Time, have epochs (1970 and 1601 respectively) prior to the first leap second (1972), so converting between them is as simple as a single multiplication/division and a single addition/subtraction; no knowledge of leap seconds is required.
- * Official leap second announcements: https://www.iers.org/IERS/EN/Publications/Bulletins/bulletins.html
- * Consolidated list of leap seconds by hpiers.obspm.fr: https://hpiers.obspm.fr/iers/bul/bulc/Leap_Second.dat
- * Consolidated list of leap seconds by nist.gov: https://www.nist.gov/pml/time-and-frequency-division/time-realization/leap-seconds
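The conversion between the two epoch-oriented encodings can be sketched as follows, treating both as leap-second-oblivious counts. The constant is the number of seconds from `1601-01-01T00:00:00Z` to `1970-01-01T00:00:00Z` (134774 days of 86400 seconds each); the function names are hypothetical.

```python
NTFS_TICKS_PER_SECOND = 10_000_000          # one tick is 100 nanoseconds
SECONDS_FROM_1601_TO_1970 = 11_644_473_600  # 134774 days * 86400 seconds

def posix_to_ntfs(posix_seconds):
    """Convert whole-second POSIX Time to NTFS File Time."""
    return (posix_seconds + SECONDS_FROM_1601_TO_1970) * NTFS_TICKS_PER_SECOND

def ntfs_to_posix(ntfs_ticks):
    """Convert NTFS File Time to whole-second POSIX Time, rounding down."""
    return ntfs_ticks // NTFS_TICKS_PER_SECOND - SECONDS_FROM_1601_TO_1970
```

Moments before 1970 come out as negative POSIX Time values in this sketch; whether a reader accepts them is a separate decision.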
**PKZIP**: The original implementation of the ZIP file format by PKWARE from 1989. https://www.pkware.com/products/pkzip
**Info-ZIP**: a popular implementation of the ZIP file format from 1990 that pioneered several conventions that have become commonplace. Official website: https://infozip.sourceforge.net/ . Code mirrors: https://github.com/thejoshwolfe/info-zip-zip/tree/3.0 , https://github.com/thejoshwolfe/info-zip-unzip/tree/6.0 .
**zlib**: a popular open-source implementation of the DEFLATE compression algorithm. Note that to configure zlib for ZIP file compression, set `windowBits=-15`. https://zlib.net/
## Special Thanks
The authors would like to thank the following people for their contributions to this project: Aarjav Patni, Amilia MacIntyre, Ange Albertini, Curtis Antolik, David Fifield, Gynvael Coldwind, Raunak Ramakrishnan, and everyone who's made a ZIP implementation, from the first implementation by PKWARE to the smallest hobby project and bespoke proprietary use case. This document is a passion project that would not be possible without all the love for ZIP software that came before it.