You may have come across a compressed archive file (.zip, .tar.gz, .rar etc.) that infinitely contains itself and thought that looks neat.
You might even want to make your own—that is, if you have suspiciously too much time on your hands and like to jump into black holes of pointless endeavours. Let me welcome you comrade. This post will explore everything you need to know to be on your merry way creating these quines.
Much of the credit goes to folks much smarter than myself (they will be introduced); this tutorial is meant to curate previous work and literature as much as it is for myself to educate you. The goal here is to allow for any curious, technically-minded newcomer to make sense of all the concepts involved in creating compression quines.
Particular thanks goes to Russ Cox who wrote on the subject of compressed files quines. This post is essentially a more unassuming approach then his; nonetheless I would encourage reading both articles.
A word of warning for those inexperienced in Lempel-Ziv and Huffman Codes, compressed archive specs like DEFLATE, and editing files on a binary-level—there is a lot to learn, understand and work through. Although a very educational process (and I think a lot of fun), there is little direct practical achievement here. For example, I took more than a month to just create the petty 204-byte quine.gz.
A GZIP quine will be the main subject of this article. It will be the easiest format to create a quine with, as the GZIP format does not concern itself with archiving a collection of files like ZIP does. Once you’ve learnt how to make a GZIP quine, you will feel confident to explore other formats and specifications.
Also mind, GZIP is well-supported on popular Linux distributions. About Windows and Mac—I don’t know, but I’m sure there’ll be tools available if you search around.
Compressed archive files can be considered as programs. We’ll get back to how that is soon, but first you need to know that most programming languages can be used to create self-replicating programs. So when the source program executes, it will produce an exact copy of it’s own source.
Never written one yourself? I implore you to try making a quine using your programming language of choice and come back here when you’re done. It can take a while so don’t feel bad if you struggle to get there.
It can be fun and educational to discover things completely by yourself, but it’s certainly not paramount. So a few hours later when you’re banging your head against the keyboard, feel free to cheat a bit and check out this Rosetta Code page, which provides examples in many languages. Excellent further reading could be this extensive look at quines by David Madore.
GZIP file structure: Part 1
In a literal sense, a file consists of ones and zeroes; it’s up to the software that’s reading the file to translate those bytes into useful information.
That’s utter nonesense at this point, so don’t worry; the endgame of this article is to reconstruct the above. We’re going to abstract the information (metadata) and instructions (print & repeat opcodes) of GZIP files before encoding them.
Software that decompresses a GZIP file assumes a specification has been followed by the software that originally compressed the original file to create it so that the file’s bits become meaningful. RFC 1952 is the specification that should be strictly adhered to (although bear in mind that it’s not uncommon to see minor deviations).
GZIP files comprise of metadata in a header and trailer, with compressed data in-between.
The composition of the header/data/trailer would look something like this:
We’re using the short hand P for Print, and R for Repeat. Repeat arguements are given a single arguement (i), which denotes we’re repeating the i amount of output from i amount away from the end of the output—basically it’s the XandY value.
So we can write compressed data that executes (rather, decompresses) to itself—a quine! But how about expanding into a fully-formed file; how can compressed data output both itself and the metadata of a GZIP file?
Compressed files as a program
Cox notes the above example contains a 6-step sequence that prints a larger 8-step sequence:
P4 P0 P0 P0 P4 | P0 P0 P0 P4
R4 | P0 P0 P0 P4
He explains it’s critical for a sequence like the above to exist for compressed file quines to be possible. Output lags behind the input before the last instruction (the R4), only to jump ahead of the input. This demonstrates that arbitary data, such as a header and trailer of a GZIP file, could be purposely outputted at the head and tail ends of the compressed data.
The P0—“print nothing” instruction—used in Cox’s compressed data quine can actually be swapped for a header and trailer. Cox thus creates a quine-ish program that “expands” itself with an arbitary prefix aa bb cc (the header) and suffix xx yy zz (the trailer):
Code | Output
-----------------|-----------------
P4 aa bb cc P4 | aa bb cc P4
R4 | aa bb cc P4
P4 R4 P4 R4 P4 | R4 P4 R4 P4
R4 | R4 P4 R4 P4
P4 R4 xx yy zz | R4 xx yy zz
R4 | R4 xx yy zz
A framework for a self-expanding compressed file! Nearly at least, as headers and trailer tend to be of varying lengths.
A generalised program that will work for any sized header and any sized trailer would be extremely useful. It would certainly be possible with variable arguments to P and R instructions. For example, the start of the file quine could look like this:
..H.. |
Lh+1 ..H.. Lh+1 | ..H.. Lh+1
Where ..H.. is a header of variable length and h is the length of the header.
So yes, I am going to ask you to try create one yourself. “You can do it!” And if you can’t, don’t sweat it.
And there you have it. Translate something like this into GZIP-compliant binary and you’re done!
But hold on hotshot ‘coz there’s a ways to go yet. Probably a good time to make yourself a cuppa.
GZIP file structure: Part 2
We’re going to deep dive on the GZIP specification, RFC 1952, and the DEFLATE specification (we’ll get to that in a sec) it refers to, RFC 1951. You’ll want to open these up for heavy reference. I’m basically asking you to recreate quine.gz.
We’ll be hex editing files, for which there are various solutions available; using the Unix tool tweak worked well on my end. Emacs users can enjoy hexl-mode. Anything you can find for your OS should do—there’s not a great deal of features needed for a hex editing tool anyway.
I highly recommend setting up the infgen tool by compression king Mark Adler, which is a DEFLATE decompresser that outputs to a readable format (it was kindly made just for educational purposes). I found it extremely useful as I could verify my manual DEFLATE coding was working as intended, which gets important especially for the awkward method of writing Repeat instructions we’ll be employing.
Also, maybe try reading Adler’s puff.c DEFLATE-inflater. It was written to employ simple logic so folk could understand how DEFLATE works.
Header
Section 2.3 of RFC 1952 (or GZIP:2.3, which is how I’ll denote sections like so from now on) states the members available in a GZIP file header. The base header looks like this:
0 1 2 3 4 5 6 7 8 9 A B C D E F
+---+---+---+---+---+---+---+---+---+---+
000 |ID1|ID2|CM |FLG| MTIME |XFL|OS |
+---+---+---+---+---+---+---+---+---+---+
That could just be it for our quine. However, I would add a FNAME attribute so the decompressed outfile that the decompressing tool generates is of the same name as the origianl infile. My quine.gz is encoded like so:
0 1 2 3 4 5 6 7 8 9 A B C D E F
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
000 |ID1|ID2|CM |FLG| MTIME |XFL|OS | FNAME := "quine.gz" >
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
020 < |
+---+---+---+
Read up on the GZIP spec and translate the above to the respective binary!
I would suggest writing/typing in seperated 8-bit/1-byte blocks (using with the left-most bit as the MSB), and then converting to hexadecimal.
Let me give you some values for a few fields:
CM value of 8 as it denotes the DEFLATE compression method we’ll use.
MTIME value of 0 as it says no timestamp is available.
XFL value of 0 because it’s not really meaningful.
We have a header! Note that I’ll be using the above in my examples from now on.
We can’t quite determine the trailer yet, as it’s two set fields—CRC32 and ISIZE—are both determined by the contents of the “original, uncompressed file”—yes, that input file is also the output file. So essentially writing the trailer will have to come once we know everything else about the quine we’re writing (spoiler: it will get confusing).
DEFLATE-compressed data stream
As mentioned, compressed data can be boiled down to just of print and repeat instructions. But to get to that level of abstraction, we need to understand what it really is—a DEFLATE data stream .
And even them I’m fudging it a bit, because technically it can be any compression method you like (denoted by the CM block in the GZIP header). For modern GZIP compressors and decompressors (tools tend to do both), I believe most programs just expect use of DEFLATE anyway, so that’s what we are caring about today. DEFLATE is not ubiquitous in the wider world of “compressing stuff”, but it’s concepts are shared by many modern compression efforts.
Right! So before we “code” in DEFLATE, we need to understand it’s intended use—loseless-ly compressing files to save up on space. It’s beauty is in how it harmonizes two effective algorithms in “reducing redundancy” in a set of information (in DEFLATE’s case, binary files):
Dictionary-based compression (by way of Lempel-Ziv)
Entropy-based compression (by way of Huffman coding)
(You should be somewhat familiar with Lempel-Ziv, as the Print & Repeat opcodes are essentially the end result of LZ algorithms.)
I’m not going to teach you about the design choices for DEFLATE here, so if you have no idea how it works in the broad-stokes, maybe check out the following:
A little article titled “The Elegance of Deflate” by Richard Mitton, which aptly explains why DEFLATE is so effective.
Educator-extraordinaire Prof. David Brailsford explaining the compression concepts Lempel-Ziv and Huffman coding.
Block types
Blocks are the container for the instructions and arguments of a set of data, starting with a BFINAL bit that denotes if it’s the last block in the compressed data stream, and is followed by two bits that denote the BTYPE (block type) of the block, as explained in DEFLATE:3.2.3..
There are three BTYPE blocks in DEFLATE:
Stored block (BTYPE=00)
The Print (X) command. Also referred to as the “literal” block, as it really says “treat the next X-length bytes literally, not as instructions”.
Compressed blocks
For our purposes, these can basically be the Repeat (X, Y) command. The important distinction here is that the instructions are Huffman coded. There are two kinds of compressed blocks:
Fixed (BTYPE=01)
A Huffman tree pre-built into the DEFLATE specification is used.
Dynamic (BTYPE=10)
A compressor-defined Huffman tree is first defined, to be used with subsequent compressed data.
We’ll just parallel the Repeat (X, X) instructions (like used in the original compressed data quine example above) with fixed compressed blocks. We don’t want to bother with defining a Huffman tree like with a dynamic block, at least when manually writing a quine.
Print
Our first print sequence comes from the top of the Cox quine:
..H.. |
Lh+1 ..H.. Lh+1 | ..H.. Lh+1
For us that comes out to the following:
0 1 2 3 4 5 6 7 8 9 A B C D E F
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
000 |ID1|ID2|CM |FLG| MTIME |XFL|OS | FNAME := "quine.gz" >
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
020 < | Print 24 |ID1|ID2|CM |FLG| MTIME |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
040 |XFL|OS | FNAME := "quine.gz" | Print 24 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
The 24 argument for the Print instruction is the number we desire: the header is 19-bytes long, and 5-bytes long Print 24. All literal blocks start with 5 bytes:
0 1 2 3 4 5...
+---+---+---+---+---+================================+
| B | LEN | NLEN |... LEN bytes of literal data...|
+---+---+---+---+---+================================+
B: BFINAL + BTYPE + Zero padding to byte boundary
LEN: Length of literal data
NLEN: Bit flipped LEN
Read DEFLATE:3.2.4. and write the corresponding hex.
Did you mess up the byte order? DEFLATE:3.1.1. states how DEFLATE packs it’s bytes.
… but honestly it’s contents still confuse me. For literal blocks, I just think of writing an unsigned integer in bits, with padding to the left to fit the 2-byte boundary. It’s in a little-endian byte order, and if you have no idea what that is, maybe Dr. Steve Bagley can help you out?
+-- Literal codes between 256..279 are 7 bits long
||+-- All distance codes in BTYPE=01 blocks are 5 bits long
__|___|_/\/\0+01+0001110[+01]+01000[+111]+0000^^^^^^^|||||||||||||+-- Zero padding to byte boundary
|||||||||||+-- Add 7 to length base 17 => 24
|||||||||+-- Distance code 8 (length base: 17, extra bits: 3)
|||||||+-- Add 1 to length base 23 => 24
|||||+-- Literal code 270 (length base: 23, extra bits: 2)
|||+-- BTYPE=01 (block is compressed using DEFLATE's pre-built Huffman table)
|+-- BFINAL=0 (not the final block)
If you have no idea how DEFLATE’s fixed Huffman codes work, you may want to read this excellent DEFLATE reference by Calmarius. He importantly offers a clear explanation of how the DEFLATE alphabet works (needed for literal/length & distance codes), as well as the specification generally. Don’t mind the ZLIB stuff.
Take your time~
Repeat (12, 24) (12, 24)
Anywho, of course I’m not going to give you the actual solution just like that. The problem with this R (24, 24) is that only 3 bytes long, which messes up the Coxian quine that presumes the same size for all Print & Repeat instructions.
As the Print commands always start with 5 bytes, we’ll use that as the constant “unit” of our program, i.e. the 1 in Ph+1 & Rh+1 is 5 bytes.
Fortunately, we can write a different representation of R (24,24) in the way of R (12, 24) R (12, 24). Spoiler: it comes to 5 bytes. You can infact fudge all the Coxian Repeat instructions like this.
You know what your task is. I’ll give you a little template to help you out:
A translation of the Coxian Rt+1 for us will be R (13, 13), seeing that GZIP trailers are always 8 bytes and we’ve established “1” is 5 bytes. This means we can’t neatly split this into two of the same literal & distance pairs.
No fret, because something like Repeat (6, 13) (7, 13) works just as well!
Actually, at this point your quine will work on a lot of popular decompressors (if they have options to not verify the checksum). Congratulations!
… I wasn’t quite satisfied with that, so I decided to see about finding a CRC32 that, when found in both in quoted and actual part of the GZIP quine, will work. A self referential checksum, so to speak.
I won’t get into how a cyclic redundancy check works—you can read these succinct notes by Prof. Norman Matloff—but for all intents and purposes you can simply bruteforce all possible values until something works.
For the 32 bits of CRC32, there are 2^32 (over 4 billion) permutations you can test. For modern tech that ain’t too much of an effort, so I wrote a dirty Python script utilizing multi-processing. It works by embedding a given CRC32 value into defined quoted & actual positions in a provided template file, checking if the CRC32 checksum of the modified template matches the given CRC32 value.
I plonked it on a Google Cloud server to run on (they offer more than enough free credit), so as not to see my not-so-modern Pentium CPU bleed to death. If you were going to use it on a basic computer, you’ll likely want to modify the script to add appropriate sleep instructions if you don’t want to repurpose your processor as a frying pan.
Oh yeah and, I’m not a mathematician, so maybe this is overkill. Can someone tell me if there’s some algebra involving modulus or whatever that could actually solve for a self-referential CRC32 answer?
Anyway, if you haven’t been cheating all this time with the quine.gz source code, I’ll save you the hassle and tell you the CRC32 value that we want for our quine: 0x ff 79 ff a9.
Conclusion
I’ve got actually useful work to get on with, so I’m leaving the challenge of writing a programmatic solution for the creation of compressed file/archive quines with variable arguments to y’all. Have some fun with it, but make sure it’s in Python ;)
Of course, Russ Cox actually has already done this by way of rgzip.go (search his “Zip Files All The Way Down” post for it), though mind it’s in some old version of Go that is hard comprehend at this point. Take note that he utilizes dynamic compressed blocks!
Infact, check the comment section of his post to see some fascinating explorations of compression quines.
… just ignore the user Misi, who exclaims that you can create fodder bytes in your quine to manipulate the CRC value to be whatever you want—making my ridiculous brute-forcing methods a rather useless affair. At least having a self referential CRC is kind-of cool, right?
Anyway, feel free to contact me wherever about any questions you have, or something cool you’d like to share. I hope this tutorial has helped you out but it’s probably a bit strange, so if you’ve got any concept you’d explored more/differently then definitely tell me. This is a new Jekyll blog I’ve set up, and I have haphazardly written the scripting/styling/ASCII charts, so do point out any errors or accessibility issue if you could.
I really can’t thank the following folk enough for the amazing educational resources they’ve provided free of charge!