Compressing Sets and Multisets of Sequences
IEEE Transactions on Information Theory
MetadataShow full item record
Steinruecken, C. (2015). Compressing Sets and Multisets of Sequences. IEEE Transactions on Information Theory, 61 (3), 1485-1490. https://doi.org/10.1109/TIT.2015.2392093
This is the accepted manuscript for a paper published in IEEE Transactions on Information Theory, Vol. 61, No. 3, March 2015, doi: 10.1109/TIT.2015.2392093. © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
This paper describes lossless compression algorithms for multisets of sequences, taking advantage of the multiset’s unordered structure. Multisets are a generalization of sets, where members are allowed to occur multiple times. A multiset can be encoded naïvely by simply storing its elements in some sequential order, but then information is wasted on the ordering. We propose a technique that transforms the multiset into an order-invariant tree representation, and derive an arithmetic code that optimally compresses the tree. Our method achieves compression even if the sequences in the multiset are individually incompressible (such as cryptographic hash sums). The algorithm is demonstrated practically by compressing collections of SHA-1 hash sums, and multisets of arbitrary, individually encodable objects.
Arithmetic coding, Bayesian methods, data compression, hash sums, multisets, tree data structures
This work was supported in part by the Engineering and Physical Sciences Research Council under Grant EP/I036575 and in part by a Google Research Award. This paper was presented at the 2014 Data Compression Conference
External DOI: https://doi.org/10.1109/TIT.2015.2392093
This record's URL: https://www.repository.cam.ac.uk/handle/1810/247209
All Rights Reserved
Licence URL: https://www.rioxx.net/licenses/all-rights-reserved/