Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB

I built an experiment that uses an overfitted transformer and arithmetic coding to compress individual files.

Instead of training the model to generalize, I train a 900KB transformer to memorize a single file and predict the next byte. Those predictions are fed into an arithmetic coder to produce the compressed output.
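
To make the predict-then-code loop concrete, here is a toy sketch. It is not the code from the repo. It assumes an adaptive byte-frequency counter (`AdaptiveByteModel`, a hypothetical stand-in for the transformer) and uses exact fractions rather than the fixed-precision integer arithmetic a practical arithmetic coder would use, so it only suits short inputs.

```python
import math
from fractions import Fraction


class AdaptiveByteModel:
    """Hypothetical stand-in for the transformer: adaptive order-0 byte counts."""

    def __init__(self):
        self.counts = [1] * 256  # start every byte at count 1 so nothing has zero probability
        self.total = 256

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1


def encode(data: bytes):
    """Narrow [low, low + width) once per byte; return (code, nbits) naming a point inside it."""
    model = AdaptiveByteModel()
    low, width = Fraction(0), Fraction(1)
    for b in data:
        cum = sum(model.counts[:b])  # probability mass of all bytes ordered before b
        low += width * Fraction(cum, model.total)
        width *= Fraction(model.counts[b], model.total)
        model.update(b)
    # Find the shortest binary fraction code / 2**nbits that lies inside the final interval.
    nbits = 0
    while True:
        scale = 1 << nbits
        code = math.ceil(low * scale)
        if Fraction(code, scale) < low + width:
            return code, nbits
        nbits += 1


def decode(code: int, nbits: int, length: int) -> bytes:
    """Replay the same model and pick, at each step, the byte whose sub-interval holds the code."""
    model = AdaptiveByteModel()
    value = Fraction(code, 1 << nbits)
    low, width = Fraction(0), Fraction(1)
    out = bytearray()
    for _ in range(length):
        scaled = (value - low) / width * model.total  # position within the current counts
        cum = 0
        for s in range(256):
            if scaled < cum + model.counts[s]:
                break
            cum += model.counts[s]
        out.append(s)
        low += width * Fraction(cum, model.total)
        width *= Fraction(model.counts[s], model.total)
        model.update(s)
    return bytes(out)


sample = b"abab" * 50
code, nbits = encode(sample)
restored = decode(code, nbits, len(sample))
print(len(sample) * 8, "bits in,", nbits, "bits out, round trip ok:", restored == sample)
```

The decoder has to replay exactly the same predictions as the encoder, which is why decompression costs about as much as compression. The better the model predicts each byte, the less the interval shrinks per step and the fewer bits come out.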

On a 100MB NYC taxi CSV, it compresses to about 7MB (~0.5 bits/byte). On a 100MB slice of enwik9, it compresses to about 21MB (~1.68 bits/byte).
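
For reference, the bits/byte figures are just 8 times the compressed size divided by the original size. A small sketch, assuming decimal megabytes and the rounded sizes quoted above (the exact output sizes may differ slightly):

```python
def bits_per_byte(compressed_bytes: int, original_bytes: int) -> float:
    """Average number of output bits spent per input byte."""
    return 8 * compressed_bytes / original_bytes


MB = 1_000_000  # assumption: decimal megabytes

taxi_bpb = bits_per_byte(7 * MB, 100 * MB)     # NYC taxi CSV, rounded sizes
enwik_bpb = bits_per_byte(21 * MB, 100 * MB)   # enwik9 slice, rounded sizes
print(f"taxi CSV: {taxi_bpb:.2f} bits/byte, enwik9 slice: {enwik_bpb:.2f} bits/byte")
```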

It's pretty slow right now (roughly 20–30 minutes of training and 45 minutes each for compression and decompression on my AMD 7800XT).

Check out the repo: https://github.com/samyak112/pym-particles

8 points | by spidy__ 2 days ago

7 comments

  • tae0086 18 hours ago
    Neat approach. Since the 900KB model ships with the compressed file, is there a file size below which the model overhead just eats the gains? Curious where the crossover is.
    • spidy__ 12 hours ago
      For the model overhead to become significant enough to eat into the gains, the file size would need to be fairly small, right? I assumed nobody would use this for compressing anything below 100 MB.

      I tested with 100 MB files because anything larger takes a long time to evaluate. The actual target was at least 1 GB, and in that case I would use a 100 MB model (Shannon entropy rules).

      I also tried it on a 100 MB Photoshop file and was able to compress it down to 45 MB, whereas ZIP could only get it down to 60 MB. So yeah, the gains still hold there.
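
      A back-of-envelope way to estimate the crossover, as a sketch only. It assumes the 45 MB figure already includes the 900 KB model and that both compression ratios stay the same at smaller file sizes, which is a strong assumption and has not been measured. `crossover_size_mb` is a hypothetical helper, not something from the repo.

```python
def crossover_size_mb(model_mb: float, neural_ratio: float, baseline_ratio: float) -> float:
    """File size (MB) at which model + neural payload equals the baseline output.

    Solves model_mb + neural_ratio * n == baseline_ratio * n for n.
    """
    if neural_ratio >= baseline_ratio:
        return float("inf")  # the neural payload alone never beats the baseline
    return model_mb / (baseline_ratio - neural_ratio)


model_mb = 0.9                        # 900 KB transformer shipped with the file
neural_ratio = (45 - model_mb) / 100  # assumption: the 45 MB total includes the model
baseline_ratio = 60 / 100             # ZIP result on the same 100 MB Photoshop file

estimate = crossover_size_mb(model_mb, neural_ratio, baseline_ratio)
print(f"estimated break-even file size: {estimate:.1f} MB")
```

      Under those assumptions the break-even point lands somewhere in the single-digit MB range for this kind of file; below that, the model overhead would outweigh the gain over ZIP.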

  • 7373737373 2 days ago
    What does it compress the full 1GB file to? http://prize.hutter1.net/
    • spidy__ 2 days ago
      I tried it on a 100 MB slice of enwik9 and was able to compress it to 20 MB, plus the 900 KB transformer, so about 21 MB.

      I know the top submission was able to get it to 13 MB.

      Still trying some ideas to get better compression.

  • purple-leafy 1 day ago
    That’s so awesome! I want to try something similar. I’ve been going crazy with compression work. I reckon I can beat the prize at that link.
    • spidy__ 12 hours ago
      Really?? Have you published anything so far? Can I read it somewhere? Sounds like you've got some interesting ideas.
      • purple-leafy 1 hour ago
        I will be showcasing something on Hacker News soon! Basically I found a way to “compress” a multiplayer game state from ~100KB+ to ~1KB.

        But it’s only for the game I’m building, and it’s not pure compression work; I had to do some tricky things.
