Each backup partition may optionally use a central index of data chunks for
all non-empty regular files. Files are split into chunks of
FAILOVER_FILE_REPLICATION_CHUNK_SIZE bytes. Each chunk may be reused via hard
links to achieve space savings, and each chunk may optionally be gzip-compressed
for additional savings. File tails (partial chunks at the end of a file,
shorter than FAILOVER_FILE_REPLICATION_CHUNK_SIZE) are also added to the index.
A zero-length chunk is never added.
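As a rough sketch of this chunking step (not the actual implementation; the
constant's value is an assumption for illustration only), a file might be split
and hashed as follows:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChunkSplitter {

    // Stand-in for FAILOVER_FILE_REPLICATION_CHUNK_SIZE; actual value assumed
    static final int CHUNK_SIZE = 0x100000; // 1 MiB

    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] buff = new byte[CHUNK_SIZE];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            while (true) {
                // Fill the buffer completely when possible; a short read at
                // end-of-file is the file tail
                int len = 0;
                while (len < CHUNK_SIZE) {
                    int count = in.read(buff, len, CHUNK_SIZE - len);
                    if (count == -1) break;
                    len += count;
                }
                if (len == 0) break; // zero-length chunks are never indexed
                md5.update(buff, 0, len);
                byte[] digest = md5.digest(); // digest() also resets for the next chunk
                // The chunk (full or tail) would now be added to the index under this MD5
                System.out.printf("%d bytes, md5=%s%n", len, toHex(digest));
                if (len < CHUNK_SIZE) break; // tail chunk: end of file
            }
        }
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}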
The central index is a single-layer directory hash based on the first four characters of the content's MD5 hash. Because of this single-layer hash, an individual hash directory can easily contain enough entries to require a file system that supports many files per directory, such as ext4 or xfs.
Files are named based on the lowercase hex-coded MD5 hash of the uncompressed chunk contents. However, due to both MD5 hash collisions and the per-file hard-link limit, there may be more than one file per MD5 hash value. The files are named as follows:
(/backup_partition)/DATA-INDEX/(directory_hash)/(remaining_hash)-(uncompressed_length)-(collision#)-(link#)[.gz][.corrupt]
The directory_hash is the first four characters of the MD5 sum. The
remaining_hash is the remaining 28 characters of the MD5 sum.
The uncompressed_length is the hex-coded length of the uncompressed chunk
contents. When the length is a multiple of 0x100000 (1 MiB), it is represented
as the number of mebibytes in hex followed by "M". When the length is a
multiple of 0x400 (1 KiB), it is represented as the number of kibibytes in hex
followed by "k".
The uncompressed length is included in the filename to allow efficient
location of candidate contents in the event of an MD5 collision: chunks of a
different size are immediately excluded by filename alone, without any
additional stat (for uncompressed chunks) or full decompression (for
compressed chunks). Note that if the file does not end with ".gz", this length
always equals the actual file length.
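A minimal sketch of this naming scheme follows; encodeLength and chunkPath are
illustrative helper names, not part of the actual implementation:

final class IndexNaming {

    // Mirrors the "M"/"k" length encoding described above. The MiB case is
    // tested first because any multiple of 0x100000 is also a multiple of 0x400.
    static String encodeLength(long uncompressedLength) {
        if (uncompressedLength != 0 && (uncompressedLength & 0xFFFFF) == 0) {
            return Long.toHexString(uncompressedLength >> 20) + "M"; // multiple of 0x100000
        }
        if (uncompressedLength != 0 && (uncompressedLength & 0x3FF) == 0) {
            return Long.toHexString(uncompressedLength >> 10) + "k"; // multiple of 0x400
        }
        return Long.toHexString(uncompressedLength);
    }

    // Builds the full chunk path (the ".corrupt" suffix is omitted for brevity)
    static String chunkPath(String backupPartition, String md5Hex,
            long uncompressedLength, int collision, int link, boolean gzipped) {
        String directoryHash = md5Hex.substring(0, 4); // first four characters
        String remainingHash = md5Hex.substring(4);    // remaining 28 characters
        return backupPartition + "/DATA-INDEX/" + directoryHash + "/"
            + remainingHash
            + "-" + encodeLength(uncompressedLength)
            + "-" + Integer.toHexString(collision)
            + "-" + Integer.toHexString(link)
            + (gzipped ? ".gz" : "");
    }
}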
The collision# is a zero-based hex counter for each unique set of data that
results in this MD5 hash. When locating new data from the index, matches are
never made by MD5 alone; the contents are verified byte-by-byte. When a second
set of content must be added for a given MD5 hash, it will be named
(remaining_hash)-(uncompressed_length)-1-(link#)[.gz].
The link# is a zero-based hex counter used to work around file system limits
on the number of hard links allowed to one file (65000 for ext4). Once the
first file is "full", a second copy of the content is stored. The second link
file will be named (remaining_hash)-(uncompressed_length)-(collision#)-1[.gz].
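The following sketch shows how both counters might be walked when locating
existing content for a given MD5. The naming is hypothetical; ".gz" variants
and renumbering are omitted for brevity:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class ChunkLookup {

    static final int FILE_SYSTEM_MAX_LINK_COUNT = 65000; // ext4 limit noted above

    // Returns an existing index file that can receive one more hard link, or
    // null when a new index file must be created (new collision# or new link#).
    static Path findLinkable(Path hashDir, String remainingHash, String lengthCode,
            byte[] chunk) throws IOException {
        for (int collision = 0; ; collision++) {
            Path first = hashDir.resolve(remainingHash + "-" + lengthCode
                + "-" + Integer.toHexString(collision) + "-0");
            if (!Files.exists(first)) return null; // content is new for this MD5
            // MD5 alone is never trusted: verify the contents byte-by-byte
            if (Arrays.equals(Files.readAllBytes(first), chunk)) {
                for (int link = 0; ; link++) {
                    Path candidate = hashDir.resolve(remainingHash + "-" + lengthCode
                        + "-" + Integer.toHexString(collision)
                        + "-" + Integer.toHexString(link));
                    if (!Files.exists(candidate)) return null; // all copies full
                    int nlink = (Integer) Files.getAttribute(candidate, "unix:nlink");
                    if (nlink < FILE_SYSTEM_MAX_LINK_COUNT) return candidate;
                }
            }
        }
    }
}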
The .gz extension is added to chunks that have been gzip-compressed. Chunks
smaller than FILE_SYSTEM_BLOCK_SIZE are never compressed, as the space
reduction would not yield any savings. For larger chunks, the chunk is
compressed, and the compressed version is used only if it is sufficiently
smaller to cross a FILE_SYSTEM_BLOCK_SIZE boundary in size. This avoids
decompression overhead in cases where the space reduction would not yield any
savings.
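A sketch of that block-boundary test, assuming a FILE_SYSTEM_BLOCK_SIZE of
4096 (the actual value is defined by the implementation):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

final class CompressionDecision {

    static final int FILE_SYSTEM_BLOCK_SIZE = 4096; // assumed value for illustration

    // Returns the gzipped form only when it frees at least one file system
    // block; returns null when the chunk should be stored uncompressed.
    static byte[] gzipIfWorthwhile(byte[] chunk) throws IOException {
        if (chunk.length < FILE_SYSTEM_BLOCK_SIZE) return null; // never compressed
        ByteArrayOutputStream bout = new ByteArrayOutputStream(chunk.length);
        try (GZIPOutputStream out = new GZIPOutputStream(bout)) {
            out.write(chunk);
        }
        byte[] gz = bout.toByteArray();
        long blocksPlain = (chunk.length + FILE_SYSTEM_BLOCK_SIZE - 1L) / FILE_SYSTEM_BLOCK_SIZE;
        long blocksGz = (gz.length + FILE_SYSTEM_BLOCK_SIZE - 1L) / FILE_SYSTEM_BLOCK_SIZE;
        return blocksGz < blocksPlain ? gz : null; // must cross a block boundary
    }
}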
The .corrupt extension indicates that the background verifier detected that
this chunk no longer matches the expected MD5 sum or chunk length, and that
the chunk could not be restored from another copy (see link#).
TODO: Can we restore this from backup, recover in-place, and remove .corrupt from the filename?
This file will no longer be used for any new links, and links pointing to it
will be migrated to another copy of the data (see link#). If there is no other
copy of the link, then the client will be asked to re-upload the chunk. During
restore, an attempt will be made to locate an alternate copy of the chunk.
Once all links are migrated, this corrupt chunk will be deleted as normal once
its link count reaches one.
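As a small illustrative sketch (hypothetical naming), the quarantine might
amount to a rename, which leaves existing hard links intact while ensuring the
file is never chosen for new links:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class CorruptQuarantine {

    // Renames a chunk that failed verification; hard links elsewhere keep
    // pointing at the same inode and remain readable during migration.
    static Path markCorrupt(Path chunk) throws IOException {
        Path corrupt = chunk.resolveSibling(chunk.getFileName() + ".corrupt");
        return Files.move(chunk, corrupt, StandardCopyOption.ATOMIC_MOVE);
    }
}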
Both collision# and link# are maintained in sequential order starting at 0.
The system renumbers files as needed when entries are removed, so that the
sequence contains no gaps. During routine operations, searches are done one
past the end of the sequence to detect and correct any gaps caused by unclean
shutdowns.
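One way to express that one-past-the-end check, as a sketch with hypothetical
naming (the renumbering repair itself is omitted):

import java.nio.file.Files;
import java.nio.file.Path;

final class SequenceCheck {

    // prefix is everything before the final hex counter, for example
    // "(remaining_hash)-(uncompressed_length)-0" when scanning link# of collision 0.
    static int firstGap(Path hashDir, String prefix) {
        int n = 0;
        while (Files.exists(hashDir.resolve(prefix + "-" + Integer.toHexString(n)))) {
            n++;
        }
        // One past the end: an entry at n+1 while n is missing reveals a gap
        // (for example, after an unclean shutdown) that renumbering must close.
        if (Files.exists(hashDir.resolve(prefix + "-" + Integer.toHexString(n + 1)))) {
            return n; // gap at position n
        }
        return -1; // dense sequence of length n, no gap
    }
}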
Once DUPLICATE_LINK_COUNT or more links are created to a data chunk, a second
copy of the data is created. Once two copies exist, links will be distributed
between them evenly. Only when both of the first two copies have hit
FILE_SYSTEM_MAX_LINK_COUNT links will a third copy be created. Once the number
of links in the first two copies reaches COALESCE_LINK_COUNT, all new links
are put into the first copy, and the second copy will eventually be removed
once it is no longer referenced.
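A sketch of one reading of that copy-selection policy. The constant values
here are assumptions, and the coalesce rule is interpreted as the combined
link count having fallen back to COALESCE_LINK_COUNT or below:

final class CopyPolicy {

    // Values are assumptions for illustration; the real constants are defined
    // by the implementation.
    static final int DUPLICATE_LINK_COUNT = 100;
    static final int FILE_SYSTEM_MAX_LINK_COUNT = 65000;
    static final int COALESCE_LINK_COUNT = 50;

    // Given the current link counts of the existing copies, returns the copy
    // index that should receive the next hard link; copies.length means a new
    // copy must be created first.
    static int pickCopy(int[] copies) {
        if (copies.length == 1) {
            // A second copy is only created once the first is heavily shared
            return copies[0] < DUPLICATE_LINK_COUNT ? 0 : 1;
        }
        // Coalesce: when usage has dropped low enough, funnel all new links to
        // the first copy so the second can drain and be removed
        if (copies[0] + copies[1] <= COALESCE_LINK_COUNT) return 0;
        // Otherwise distribute evenly between the first two copies
        int target = copies[0] <= copies[1] ? 0 : 1;
        if (copies[target] < FILE_SYSTEM_MAX_LINK_COUNT) return target;
        int other = 1 - target;
        if (copies[other] < FILE_SYSTEM_MAX_LINK_COUNT) return other;
        return copies.length; // both full: create a third copy
    }
}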
Files are normally removed from the index immediately as they are removed from the backup directory trees. However, in the event of an unclean shutdown or manual administrative action, there may be orphaned index files (with a link count of 1). A cleanup job runs at startup, and again once per day, to find and delete any orphaned index files. This cleanup may also be performed manually from the shell:
/etc/init.d/aoserv-daemon stop
find (/backup-partition)/DATA-INDEX -mindepth 2 -maxdepth 2 -type f -links 1 -print # -delete
find (/backup-partition)/DATA-INDEX -mindepth 1 -maxdepth 1 -type d -empty -print # -delete
# Note: -delete commented for safety, uncomment to actually delete the orphans.
/etc/init.d/aoserv-daemon start
The backup process recreates missing index files from existing hard-linked chunks, so the entire index may be discarded and it will be recreated with minimal loss of drive space. Some links from new data to old may not be created while chunks are not yet back in the index, but the system will still function and will eventually settle back to an optimal state as backup directories are recycled.
The background verifier uses the chunk's modified time to track the last time
the chunk was verified. Each chunk is re-verified approximately once every
VERIFICATION_INTERVAL milliseconds.
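A minimal sketch of that scheduling check, assuming an interval of one week
(the actual VERIFICATION_INTERVAL is defined by the implementation):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

final class VerificationSchedule {

    static final long VERIFICATION_INTERVAL = 7L * 24 * 60 * 60 * 1000; // assumed: one week

    // The chunk's modified time records when it was last verified
    static boolean isVerificationDue(Path chunk) throws IOException {
        FileTime lastVerified = Files.getLastModifiedTime(chunk);
        return System.currentTimeMillis() - lastVerified.toMillis() >= VERIFICATION_INTERVAL;
    }
}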
Security: Client-provided MD5 values must never be trusted for what goes into the index. They may be used to link to existing data within the client's own backup, but anything added to the index must have its MD5 computed server-side.
Locks are maintained on a per-hash-directory basis, so I/O can be dispatched with up to 2^16-way concurrency. Locking is done within the JVM using synchronized blocks, as well as between processes using file locking, so it is safe for multiple processes to use the same directory index concurrently. All locks are exclusive for simplicity; concurrency is still obtained through the sheer number of locks.
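A sketch of how such a per-hash-directory lock might combine both layers
(hypothetical naming; the real locking is internal to the implementation):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// One instance per hash directory (up to 2^16 of them), exclusive both within
// the JVM (synchronized) and between processes (FileLock).
final class HashDirLock {

    private final Path lockFile;

    HashDirLock(Path hashDir) {
        this.lockFile = hashDir.resolve("lock");
    }

    <T> T withLock(IoSupplier<T> action) throws IOException {
        synchronized (this) { // in-JVM exclusion for this hash directory
            try (FileChannel channel = FileChannel.open(lockFile,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                    FileLock lock = channel.lock()) { // inter-process exclusion
                return action.get();
            }
        }
    }

    @FunctionalInterface
    interface IoSupplier<T> {
        T get() throws IOException;
    }
}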
TODO: Keep track of number of in-JVM locks, only release file lock when all JVM locks released.
- Author:
- AO Industries, Inc.
Method Summary
Modifier and TypeMethodDescriptionReturns the path (within the file system) containing this index.The file system containing this index.static DedupDataIndex
getInstance
(PosixFileSystem fileSystem, Path dataIndexDir) Gets the index for the given index directory.void
verify
(boolean quick) Cleans all orphaned index files.
Method Details

getInstance
public static DedupDataIndex getInstance(PosixFileSystem fileSystem, Path dataIndexDir) throws IOException
Gets the index for the given index directory. Only one instance is created per file system instance and unique path.
Throws: IOException

getFileSystem
public PosixFileSystem getFileSystem()
The file system containing this index.

getDataIndexDir
public Path getDataIndexDir()
Returns the path (within the file system) containing this index.

verify
public void verify(boolean quick) throws IOException
Cleans all orphaned index files. The lock is only held briefly, one file at a time, so other I/O can be interleaved with this cleanup process. It is possible that new orphans created during the cleanup will not be cleaned up on this pass.
For long-lived data indexes, it is good to run this once per day during low-usage times.
Parameters: quick - When true, performs a quick pass to clean orphaned data only, but does not verify MD5 sums.
Throws: IOException
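A hypothetical usage example of the documented API. Import statements for
DedupDataIndex, PosixFileSystem, and Path are omitted, since their package
names depend on the library version in use:

import java.io.IOException;

final class DedupDataIndexExample {

    static void dailyMaintenance(PosixFileSystem fileSystem, Path dataIndexDir)
            throws IOException {
        // One shared instance per (file system, path) pair, per the getInstance contract
        DedupDataIndex index = DedupDataIndex.getInstance(fileSystem, dataIndexDir);
        // Quick pass: clean orphaned data only, without verifying MD5 sums
        index.verify(true);
        // Full pass, typically during low-usage times: also verifies MD5 sums
        index.verify(false);
    }
}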