  • Jul 3, 2026, 6:20 PM

    Are there any current-use compression formats that, asked to compress a directory, can successfully exploit commonalities between files / deduplicate files rather than treating each file as a world unto itself as zip does?

    💬 12🔄 1⭐ 0

Replies

  • Jul 3, 2026, 6:22 PM

    I think .tar.gz will technically do this to some degree because gzip doesn't know what files are, but this approach is still limited by its sliding window (e.g. each back-reference copies at most 258 bytes, and it can't deduplicate two files that sit more than 32K apart in the tar)

    💬 7🔄 0⭐ 0
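
To make the 32K window limit from the post above concrete, here is a minimal Python sketch using the standard-library zlib module, which implements the same DEFLATE algorithm gzip uses. The block and gap sizes are arbitrary values chosen for illustration. It compresses two copies of the same random block, once close together and once separated by more than the window.

```python
import os
import zlib


def deflate_size(data: bytes) -> int:
    """Return the size of `data` after zlib (DEFLATE) compression at level 9."""
    return len(zlib.compress(data, 9))


# Random bytes are incompressible on their own, so any saving must come
# from the second copy being matched against the first one.
block = os.urandom(8 * 1024)       # stand-in for a duplicated 8 KiB file
near_gap = os.urandom(4 * 1024)    # copies start 12 KiB apart: inside the window
far_gap = os.urandom(64 * 1024)    # copies start 72 KiB apart: outside the window

near = block + near_gap + block
far = block + far_gap + block

near_saved = len(near) - deflate_size(near)
far_saved = len(far) - deflate_size(far)

print(f"copies 12 KiB apart: saved {near_saved} bytes")
print(f"copies 72 KiB apart: saved {far_saved} bytes")
```

In the near case the saving should be close to the full 8 KiB of the second copy; in the far case it should be roughly zero (or slightly negative because of framing overhead).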
  • Jul 3, 2026, 6:22 PM

    @mcc wouldn’t any tar-based compression do this since the archiving to a flat file and compression are two independent stages?

    💬 1🔄 0⭐ 0
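
A rough sketch of the "two independent stages" point, assuming Python's standard-library tarfile, zipfile and lzma modules stand in for the command-line tools. The file names and the 256 KiB payload are made up for illustration. The same incompressible payload is stored twice: zip compresses each member on its own, while the tar is compressed as one stream by xz, whose default dictionary is far larger than the distance between the copies.

```python
import io
import lzma
import os
import tarfile
import zipfile

payload = os.urandom(256 * 1024)  # incompressible, so only deduplication can help
files = {"a/data.bin": payload, "b/data.bin": payload}  # hypothetical file names


def zip_size(entries: dict[str, bytes]) -> int:
    """Size of a zip archive in which each member is deflated independently."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


def tar_xz_size(entries: dict[str, bytes]) -> int:
    """Size of an uncompressed tar stream after compressing it whole with xz."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tf:
        for name, data in entries.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return len(lzma.compress(raw.getvalue()))


z = zip_size(files)
t = tar_xz_size(files)
print(f"zip:    {z} bytes")
print(f"tar.xz: {t} bytes")
```

The zip should come out a little over twice the payload size, and the tar.xz only a little over one payload.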
  • Jul 3, 2026, 6:28 PM

    @mcc unsure precisely what the relative properties of bzip2/xz/zstd are here, but they may provide better results for your needs than gzip does

    💬 0🔄 0⭐ 0
  • Jul 3, 2026, 6:26 PM

    @mcc does the ZFS stream format count? Also the brotli sliding window is much larger (up to 16M)

    💬 1🔄 0⭐ 0
  • Jul 3, 2026, 6:39 PM

    @evert @mcc Yeah, if the question is whether *any* such format exists, you could make a ZFS filesystem with dedup and compression enabled, write the data, then ‘zfs send’ it to a file. The file could then be ‘zfs receive’d or mounted. Since dedup only matters on write, you just don’t use dedup on the receiving end, and you don’t pay the RAM penalty.

    That’s more or less how I make base images for zones.

    💬 0🔄 0⭐ 0
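
For readers unfamiliar with block-level dedup, here is a simplified Python sketch of the general idea: split data into fixed-size records and keep each distinct record once, keyed by a checksum. This is an illustration only, not ZFS's actual on-disk or stream format. The function names are hypothetical, and the 128 KiB record size is chosen to mirror ZFS's default recordsize.

```python
import hashlib
import os

RECORD_SIZE = 128 * 1024  # illustrative; mirrors ZFS's default recordsize


def dedup_blocks(data: bytes, record_size: int = RECORD_SIZE):
    """Split `data` into fixed-size records and store each distinct one once."""
    store = {}    # SHA-256 digest -> record bytes
    recipe = []   # ordered digests needed to rebuild the original data
    for offset in range(0, len(data), record_size):
        record = data[offset:offset + record_size]
        digest = hashlib.sha256(record).digest()
        store.setdefault(digest, record)
        recipe.append(digest)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte string from the record store and recipe."""
    return b"".join(store[digest] for digest in recipe)


one_file = os.urandom(2 * RECORD_SIZE)
data = one_file + one_file           # two identical "files" back to back
store, recipe = dedup_blocks(data)
print(f"{len(recipe)} records referenced, {len(store)} stored")
```

Note that this only catches duplicates that line up on record boundaries, which is one reason it behaves differently from a sliding-window compressor.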
  • Jul 3, 2026, 6:42 PM

    @mcc Besides .tar.gz's limited deduplication, there are also git's packs, though I'm unsure how well those work in practice. IIRC they should deduplicate file contents.

    💬 0🔄 0⭐ 0
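
A small sketch of why identical files deduplicate in git: file contents are stored as blobs named by a hash of the content, so two files with the same bytes map to the same object. The code below assumes the SHA-1 object format (git's long-standing default); the `dedupe` helper and the file names are hypothetical, and delta compression inside packs is not modeled.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID git assigns to a blob (SHA-1 object format)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def dedupe(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group paths by blob ID; each ID corresponds to one stored object."""
    objects: dict[str, list[str]] = {}
    for path, content in files.items():
        objects.setdefault(git_blob_id(content), []).append(path)
    return objects


files = {
    "a/readme.txt": b"hello\n",
    "b/readme.txt": b"hello\n",   # same bytes as a/readme.txt
    "c/notes.txt": b"other\n",
}
objects = dedupe(files)
for blob_id, paths in objects.items():
    print(blob_id, paths)
```

Three paths end up as two objects, since the two readme files share one blob.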
  • Jul 3, 2026, 6:48 PM

    @mcc .tar with a more modern algorithm than .gz: .tar.xz or .tar.zst. There are probably dedicated file-deduplicating solutions too.

    💬 0🔄 1⭐ 0
  • Willwaffle_iron@nyan.lol
    Jul 3, 2026, 6:48 PM

    @mcc I guess squashfs's deduplication could accomplish that, but I'm sure most people would be confused by squashfs as an archive.

    💬 0🔄 0⭐ 0
  • Jul 3, 2026, 7:01 PM

    @mcc anything that's operating in "solid archive" mode should perform similarly to a tarball, I think?

    I've seen game ROMs where every region of a game is in one zip, and those compress as if they are de-duped.

    💬 0🔄 0⭐ 0
  • Jul 3, 2026, 7:12 PM

    @mcc Some people have already said "tar + something with a large compression window", but in particular tar + lrzip may be best if you go that way. It should be packaged on most distros.

    It's intended to have a compression window up to the size of your RAM (or 2 GB on a 32-bit OS), or optionally even larger than RAM, though that's slower.

    The "LR" stands for Long Range, even!

    It's also multithreaded, which not everything is.

    It can also be combined with other compression algorithms if you are a hardcore compression algorithm enthusiast.

    (The only thing that bothers me about lrzip is that it doesn't preserve the file's modification time.)

    wiki.archlinux.org/title/Lrzip
    en.wikipedia.org/wiki/Rzip

    💬 0🔄 0⭐ 0
  • Jul 3, 2026, 6:26 PM

    @mcc Not the answer to what you're asking, but in case it fits your workflow/problem domain: I use Restic for this.

    💬 1🔄 0⭐ 0
  • Jul 3, 2026, 6:26 PM

    @mcc

    7zip with solid compression? I think.

    Probably other formats with similar options available.

    In the Second Age of Middle-earth we would zip a folder with no compression at all, then zip that zip with compression as needed 🙂

    💬 0🔄 0⭐ 0
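
The zip-inside-a-zip trick from the post above can be sketched with Python's standard-library zipfile module. The file names and the 4 KiB payload are invented for illustration. The inner archive stores members uncompressed, so the outer archive's DEFLATE pass sees both copies in one stream.

```python
import io
import os
import zipfile


def make_zip(entries: dict[str, bytes], compression: int) -> bytes:
    """Build a zip archive in memory using the given compression method."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression) as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()


payload = os.urandom(4 * 1024)  # small enough to fit in DEFLATE's 32 KiB window
entries = {"one.bin": payload, "two.bin": payload}  # hypothetical file names

# Ordinary zip: each member is compressed on its own.
direct = len(make_zip(entries, zipfile.ZIP_DEFLATED))

# Store-only inner zip, then compress that single file in an outer zip.
inner = make_zip(entries, zipfile.ZIP_STORED)
nested = len(make_zip({"inner.zip": inner}, zipfile.ZIP_DEFLATED))

print(f"direct zip: {direct} bytes")
print(f"nested zip: {nested} bytes")
```

This only helps while the duplicates sit within 32 KiB of each other in the inner archive, since the outer pass is still plain DEFLATE.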
  • Jul 3, 2026, 6:36 PM

    @mcc Seems like 7zip will, but it might require some setting tuning. 7zip tries to sort files by similarity, so that its sliding window can better capture duplications across files. You can also help it along by increasing the dictionary size and making sure you're creating a "solid" archive. Some of the algorithms might also be better than others (LZMA might be better than LZMA2, strangely?)

    This is me finding information online, not my personal experience, so take it with a grain of salt, but this is where I'd start.

    💬 0🔄 0⭐ 0
  • Jul 3, 2026, 6:39 PM

    @mcc janky but might work: zip a zip file?

    💬 0🔄 0⭐ 0
  • Jul 3, 2026, 7:14 PM

    @mcc
    Cygwin might have it, WSL for sure, but I guess that I am not telling you anything new ;-)

    💬 0🔄 0⭐ 0