public class DedupDataIndex extends Object

Each backup partition may optionally use a central index of data chunks for all non-empty regular files. Files are split into chunks of FAILOVER_FILE_REPLICATION_CHUNK_SIZE bytes. Each chunk may be reused via hard links in order to achieve space savings. Each chunk is also optionally gzip compressed for additional space savings. File tails (partial chunks at the end less than FAILOVER_FILE_REPLICATION_CHUNK_SIZE in length) are also added to the index. A zero length chunk may never be added.
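The chunking rule above can be sketched as follows. The 1 MiB chunk size is an assumption for illustration only; the real FAILOVER_FILE_REPLICATION_CHUNK_SIZE constant is defined elsewhere.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkSplitter {
    // Assumed value for illustration only; the real constant lives elsewhere.
    static final int CHUNK_SIZE = 0x100000; // 1 MiB

    /**
     * Returns the sizes of the chunks a file of the given length would
     * contribute to the index: full chunks plus an optional shorter tail.
     * A zero-length chunk is never added.
     */
    static List<Long> chunkSizes(long fileLength) {
        List<Long> sizes = new ArrayList<>();
        long remaining = fileLength;
        while (remaining >= CHUNK_SIZE) {
            sizes.add((long) CHUNK_SIZE);
            remaining -= CHUNK_SIZE;
        }
        // The tail (partial chunk at the end) is also indexed.
        if (remaining > 0) {
            sizes.add(remaining);
        }
        return sizes;
    }
}
```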

The central index is a single layer directory hash, based on the first four characters of the content's MD5 hash. Because of this single layer hash, an individual hash directory can easily have enough entries to require a file system with support for many files in one directory, such as ext4 or xfs.

Files are named based on the lowercase hex-coded MD5 hash of the uncompressed chunk contents. However, due to both MD5 hash collisions and the per-file hard link limits, there may be more than one file per MD5 hash value. The files are named as follows:

 (directory_hash)/(remaining_hash)-(uncompressed_length)-(collision#)-(link#)[.gz]

The directory_hash is the first four characters of the MD5 sum.

The remaining_hash is the remaining 28 characters of the MD5 sum.

The uncompressed_length is the hex-coded length of the uncompressed chunk contents. When the length is a multiple of 0x100000 (1 MiB), it is represented as the number of mebibytes in hex followed by "M". When the length is a multiple of 0x400 (1 KiB), it is represented as the number of kibibytes in hex followed by "k".
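The length encoding might be sketched as below; the exact output formatting is inferred from the description above, not taken from the implementation.

```java
class LengthSuffix {
    /**
     * Encodes an uncompressed chunk length as it appears in an index
     * filename: multiples of 1 MiB as hex mebibytes plus "M", multiples
     * of 1 KiB as hex kibibytes plus "k", otherwise plain hex bytes.
     */
    static String encode(long length) {
        if (length != 0 && (length & 0xFFFFF) == 0) {
            // Multiple of 0x100000 (1 MiB)
            return Long.toHexString(length >> 20) + "M";
        }
        if (length != 0 && (length & 0x3FF) == 0) {
            // Multiple of 0x400 (1 KiB)
            return Long.toHexString(length >> 10) + "k";
        }
        return Long.toHexString(length);
    }
}
```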

The uncompressed length is added to the filename to allow the efficient location of candidate contents in the event of an MD5 collision. Chunks with a different size can be immediately excluded by filename without any additional stat (for uncompressed) or full decompression. Note that if the file does not end with ".gz", this length will always equal the actual file length.

The collision# is a zero-based hex counter for each unique set of data resulting in this MD5 hash. When locating new data from the index, matches are not done by MD5 alone; the contents are verified byte-by-byte. When a second set of content must be added for a given MD5 hash, it will be named (remaining_hash)-(uncompressed_length)-1-(link#)[.gz].

The link# is a zero-based hex counter to work around the file system limits on the number of hard links allowed to one file (65000 for ext4). Once the first file is "full", a second copy of the content is stored. The second link file will be named (remaining_hash)-(uncompressed_length)-(collision#)-1[.gz].
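Putting the naming components together, a hypothetical helper could build an index-relative path as below. The helper name and signature are illustrative only and not part of the API.

```java
class IndexFileName {
    /**
     * Builds the index-relative path for a chunk from the naming
     * components described above.  lengthSuffix is the encoded
     * uncompressed length (plain hex, or hex with "M"/"k" suffix);
     * collision and link are the zero-based hex counters.
     */
    static String build(String md5Hex, String lengthSuffix,
                        int collision, int link, boolean gzipped) {
        String directoryHash = md5Hex.substring(0, 4);   // single-layer hash dir
        String remainingHash = md5Hex.substring(4);      // remaining 28 characters
        return directoryHash + '/'
            + remainingHash
            + '-' + lengthSuffix
            + '-' + Integer.toHexString(collision)
            + '-' + Integer.toHexString(link)
            + (gzipped ? ".gz" : "");
    }
}
```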

The .gz extension is added to chunks that have been gzip compressed. Chunks smaller than FILE_SYSTEM_BLOCK_SIZE are never compressed as the space reduction will not yield any savings. For larger files, the chunk is compressed, then the compressed version is only used if it is sufficiently smaller to cross a FILE_SYSTEM_BLOCK_SIZE block boundary in size. This avoids further compression overhead when the space reduction does not yield any savings.
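The block-boundary rule can be sketched as below, assuming a hypothetical 4096-byte value for FILE_SYSTEM_BLOCK_SIZE (the real constant is defined elsewhere).

```java
class CompressionPolicy {
    // Assumed block size for illustration only.
    static final long BLOCK_SIZE = 4096;

    /** Number of file system blocks a file of the given length occupies. */
    static long blocks(long length) {
        return (length + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    /**
     * The gzipped copy is kept only when it frees at least one whole
     * block; chunks smaller than one block are never compressed, since
     * they already occupy the minimum allocation.
     */
    static boolean useCompressed(long uncompressedLength, long compressedLength) {
        if (uncompressedLength < BLOCK_SIZE) return false;
        return blocks(compressedLength) < blocks(uncompressedLength);
    }
}
```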

The .corrupt extension indicates that the background verifier detected this chunk to no longer match the expected MD5 sum or chunk length and the chunk could not be restored from another copy (see link#). This file will no longer be used for any new links, and links pointing to it will be migrated to another copy of the data (see link#). If there is no other copy of the chunk, then the client will be asked to re-upload it. During restore, an attempt will be made to locate an alternate copy of the chunk. Once all links are migrated, this corrupt chunk will be deleted as normal when its link count reaches one. TODO: Can we restore this from backup, recover in-place, and remove .corrupt from the filename?

Both collision# and link# are maintained in sequential order starting at 0. The system renumbers files as needed as things are removed in order to maintain no gaps in the sequence. During routine operations, searches are done one past the end to detect and correct any gaps in the sequence caused by unclean shutdowns.
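The one-past-the-end search can be illustrated with a small sketch; the helper below is hypothetical and stands in for probing the actual filenames.

```java
import java.util.Set;

class GapScan {
    /**
     * Returns the first missing number in a zero-based sequence.
     * Scanning continues one past the highest contiguous entry, so a
     * gap left by an unclean shutdown is noticed and can be renumbered
     * away rather than silently skipped.
     */
    static int firstMissing(Set<Integer> present) {
        int i = 0;
        while (present.contains(i)) i++;
        return i;
    }
}
```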

Once DUPLICATE_LINK_COUNT or more links are created to a data chunk, a second copy of the data is created. Once two copies exist, links will be distributed between them evenly. Only when both of the first two copies have hit FILE_SYSTEM_MAX_LINK_COUNT links will a third copy be created.
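A sketch of this copy policy follows; the constant values here are assumptions for illustration (the real DUPLICATE_LINK_COUNT and FILE_SYSTEM_MAX_LINK_COUNT are defined elsewhere).

```java
class CopyPolicy {
    // Assumed values for illustration only.
    static final int DUPLICATE_LINK_COUNT = 100;
    static final int FILE_SYSTEM_MAX_LINK_COUNT = 65000; // ext4 limit

    /** True when a second copy of the chunk data should be created. */
    static boolean needsSecondCopy(int copies, int firstCopyLinks) {
        return copies == 1 && firstCopyLinks >= DUPLICATE_LINK_COUNT;
    }

    /** True only when both of the first two copies are full of links. */
    static boolean needsThirdCopy(int copies, int links1, int links2) {
        return copies == 2
            && links1 >= FILE_SYSTEM_MAX_LINK_COUNT
            && links2 >= FILE_SYSTEM_MAX_LINK_COUNT;
    }
}
```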

Once the number of links in the first two copies reaches COALESCE_LINK_COUNT, all new links are put into the first copy, and the second copy will eventually be removed once it is no longer referenced.

Files are normally removed from the index immediately as they are removed from the backup directory trees. However, in the event of an unclean shutdown or manual administrative action, there may be orphaned index files (with a link count of 1). A cleanup job is run at startup as well as once per day to find and delete any orphaned index files. This cleanup can also be performed manually from the shell:

 /etc/init.d/aoserv-daemon stop
 find (/backup-partition)/DATA-INDEX -mindepth 2 -maxdepth 2 -type f -links 1 -print # -delete
 find (/backup-partition)/DATA-INDEX -mindepth 1 -maxdepth 1 -type d -empty -print # -delete
 # Note: -delete commented for safety, uncomment to actually delete the orphans.
 /etc/init.d/aoserv-daemon start

The backup process recreates missing index files from existing hard linked chunks, so the entire index may be discarded and it will be recreated with minimal loss of drive space. Some links might not be created from new data to old (if not yet put back in the index), but the system will still function and eventually settle to an optimal state once again as backup directories are recycled.

The background verifier uses the chunk's modified time to keep track of the last time the chunk was verified. The chunk will be re-verified approximately once every VERIFICATION_INTERVAL milliseconds.
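As a sketch of that schedule check (the interval value here is an assumption for illustration, not the real VERIFICATION_INTERVAL):

```java
class VerifySchedule {
    // Assumed interval for illustration: 7 days in milliseconds.
    static final long VERIFICATION_INTERVAL = 7L * 24 * 60 * 60 * 1000;

    /**
     * A chunk's modified time records when it was last verified; it is
     * due again once VERIFICATION_INTERVAL milliseconds have elapsed.
     */
    static boolean isDue(long modifiedTime, long now) {
        return now - modifiedTime >= VERIFICATION_INTERVAL;
    }
}
```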

Security: Client-provided MD5 values must never be trusted for what goes into the index. They can be used to link to existing data within the client's backup, but anything being added to the index must have server-side MD5 computed.
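A server-side computation of the hex MD5 might look like this, using the standard java.security.MessageDigest API:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ServerMd5 {
    /**
     * Computes the lowercase hex MD5 of uncompressed chunk contents on
     * the server side; a client-supplied hash is never trusted for
     * index placement.
     */
    static String md5Hex(byte[] chunk) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(chunk);
            StringBuilder sb = new StringBuilder(32);
            for (byte b : digest) {
                sb.append(Character.forDigit((b >> 4) & 0xF, 16));
                sb.append(Character.forDigit(b & 0xF, 16));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on all Java platforms.
            throw new AssertionError(e);
        }
    }
}
```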

Locks are maintained on a per-hash-directory basis, so the I/O can be dispatched with up to 2^16 concurrency. Locks are done within the JVM using synchronized blocks, as well as between processes using file locking. It is safe for multiple processes to use the same directory index concurrently. All locks are exclusive for simplicity as concurrency is still obtained by the sheer number of locks.
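The in-JVM side of that locking could be sketched as a simple striped-lock table, one lock object per possible hash directory; the cross-process file locking described above is omitted from this sketch.

```java
class HashDirLocks {
    // One lock object per possible hash directory: 16^4 = 2^16 entries.
    private static final Object[] LOCKS = new Object[1 << 16];
    static {
        for (int i = 0; i < LOCKS.length; i++) {
            LOCKS[i] = new Object();
        }
    }

    /**
     * Maps a directory hash (the first four hex characters of the MD5)
     * to its lock, for use in a synchronized block.
     */
    static Object lockFor(String directoryHash) {
        return LOCKS[Integer.parseInt(directoryHash, 16)];
    }
}
```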

TODO: Keep track of number of in-JVM locks, only release file lock when all JVM locks released.

Author:
  AO Industries, Inc.

  • Method Details

    • getInstance

      public static DedupDataIndex getInstance(PosixFileSystem fileSystem, Path dataIndexDir) throws IOException
      Gets the index for the given index directory. Only one instance is created per file system instance and unique path.
    • getFileSystem

      public PosixFileSystem getFileSystem()
      The file system containing this index.
    • getDataIndexDir

      public Path getDataIndexDir()
      Returns the path (within the file system) containing this index.
    • verify

      public void verify(boolean quick) throws IOException
      Cleans all orphaned index files. The lock is only held briefly one file at a time, so other I/O can be interleaved with this cleanup process. It is possible that new orphans created during the cleanup will not be cleaned-up on this pass.

      For long-lived data indexes, it is good to run this once per day during low usage times.

      Parameters:
        quick - When true, performs a quick pass to clean orphaned data only, but does not verify MD5 sums.