Sunday, 29 March 2015

more in less

Curiosity got the better of me again and now there's 3:30am night and early rise behind me and pretty much another whole weekend "down the drain".

I had another go trying to grok VCDIFF format but failed again: not sure if it's my encoder or decoder but I can't get them to agree on the address compression and so it fails on most data.

Putting that aside I experimented with some ideas on address compression from VCDIFF, and then ended up creating a new format based on what i learnt from it and other experiments. My work with VCDIFF demonstrated the utility of address compression and dual operation instructions.

Anyway I think it turned out quite neat and tidy so here's a summary. It might be useful if you're trying to understand VCDIFF - its very different but borrows similar mechanisms.

Opcodes

There are basically two families of opcodes: dual-operation and single operation. The MSb selects which so each has a full 7 bits of information to play with. This allows for quite a wide range of data lengths and operations to be encoded into a single byte.

Only 6 individual operations are defined together with 2 dual operations.

Integers are encoded like many formats as unsigned 7-bit big-endian with the MSb as a continue bit. Addresses are encoded independently using a separate mechanism as described below.

There are a couple of other format parameters:

smallest
The smallest copy size possible. Encoder dependent.
split
The split point for the single operation opcodes. Fixed at 100 but could be changed.

Dual operation instructions

The encoder tracks the last instruction and if possible can combine two into the same instruction. This is one of the things vcdiff uses to get more compact deltas.

0Taaabbb
  0 - selects a dual instruction
  T - the type of the first operation, 0=ADD 1=COPY
aaa - 3-bit length for first operation.  This allows a range of
      (1-8 ) byte appends or (0-7)+smallest copy.
bbb - 3-bit length for second operation.  It is always COPY.
      This allows a range of (0-7)+smallest copy.

3 bits allows for an ADD of 1-8 bytes and a COPY of (0-7)+smallest bytes. With a smallest size of 6 bytes this allows ADD(1-8)+COPY(6-13) or COPY(6-13)+COPY(6-13).

This is all bits exhausted so the only dual instructions possible. Any data or addresses required are encoded in order in the following bytes.

I also investigated a fixed function 3-bit ADD and 4-bit COPY but it wasn't as good (albeit with very limited investigation).

Single operation instructions

The single op instructions allow for longer immediate copies as well as extended instructions which encode the length as a separate parameter.

Rather than split the data in bit-selected blocks a single 128-value number is broken into several ranges which are interpreted differently.

1nnnnnnnn
  1 - selects single operation
  n - 7-bit number interpreted via inclusive ranges as follows:

000 - 099   copy (0-99)+smallest bytes
100 - 123   add (1-24) bytes
      124   read length.  copy length+100+smallest bytes.
      125   read length.  add length+24+1 bytes.
      126   read length.  run of length+3 bytes.
      127   eof/reserved or something

The split was based on some statistical analysis of a couple of files: copies cover a larger range than adds, and runs are rare. Well and 100 is easier to work with.

For a smallest of size 6, this allows a single instruction to encode a copy of 6-105 bytes or an append of 1-24 bytes. These cover the majority of cases for the data i've been testing with (not that it is a very big set) and the behaviour of the matcher.

The multi-byte encoding is such that the 6+ and 4+ bits already implied taken care of are removed from their lengths which can save the occasional `overflow' byte.

Addresses

This format generated slightly smaller deltas than the previous format but I knew from my experiments with VCDIFF that address compression could increase the gains. The problem here is now i've run out of bits to use so I had to come up with a solution which encoded addresses independently of any operation codes.

Setting aside even a few bits for each address would be too costly so after a bit of thought I came up with a solution based on the observation that most addresses will be >127 and require 2 bytes at least anyway, therefore if I just encode the rare all-7-bit addresses in 16 bits instead it leaves a full 7 bits to use for other encoding schemes whilst retaining the `natural use' of those bits for longer values.

The next problem is how to use those 7 bits. VCDIFF uses 3 bits to select from a near/same table to chose either a positive offset of the last 4 addresses or together with an octet to select a specific address from a table of 768 (using the default code table). I did some experiments to find out which aids more: 'next' is used more often but 'same' saves a lot each time it is used; but it needs 11 bits to do it. Too much here. I also found that adding a sign to the near addresses offset improved the results.

I chose a trade-off which has features of both but requires fewer bits. It combines the same and near table into the same array and adds a sign to the offsets of near addresses. Because I don't need the sign bit for 'same' addresses I can use that to increase the address space of the same table. This allows a full 6 bits to be used for the match table and 5 for the near table.

It is necessarily slower than VCDIFF because I perform a linear search over these values to find an exact match (64 elements) or the nearest match (32 elements). The tables could be overlapped: they are just the last 32 or 64 addresses encoded or decoded stored using a cyclic index. Like VCDIFF both encoder and decoder must maintain this in sync.

This is the encoding used:

00nnnnnn - This is an exact match and a complete address.
           n is an index into a 64-element table.
01Smmmmm - This is a near match.  S is a sign bit and m is
           an index into a 32-element table.  The offset follows.
1aaaaaaa* 0aaaaaaa - An absolute address.  To avoid clashing with
          the above it is forced to at least 2 bytes length.

Note that absolute addresses are just encoded as simple integers: with no `wasted bits' for the other information if their value is greater than 127; which is very likely in the common case.

The encoder is free to choose the smallest option of these for any address in the stream.

Results

These are with the same encoder settings of a 'smallest' of 6 bytes, as with the other benchmark data from the home page.

                       dez-1.2     dez-?.?     gzip -4
  GPL2 to GPL3         13 591      12 053
  jjmpeg.dll           10 809       8 770
  bible (compress)  1 731 250   1 539 501   1 550 998 

Runtime is a tiny bit longer for the shorter files due to the address table lookup, although i haven't optimised the address table scan yet. It's still 180x slower than gzip on the KJV.

This actually beats my current VCDIFF encoder but since that is still broken it's pretty much useless to compare to it. Even a single bug can radically alter the patch size.

But one (admittedly small) plus is that unlike VCDIFF this format is fully streamed and doesn't require staging. Actually another plus is that the code is quite a bit simpler due to a more orthogonal instruction set and few special cases, but it only has an implied-fixed code table so isn't as flexible.

Can it go smaller?

Because the string matcher performs an exhaustive search it may find multiple equivalent-length matches for a given target sub-string. This provides an obvious opportunity for selecting a source address that can be represented in fewer bytes than others. This could save some more bytes at the cost of encoding time.

Update: Boy, didn't even let the paint dry on this one. Tried the last idea.

                       dez-1.2     dez-?.?     addr opt    gzip -4
  GPL2 to GPL3         13 591      12 053      11 965
  jjmpeg.dll           10 809       8 770       8 725
  bible (compress)  1 731 250   1 539 501   1 507 072   1 550 998 

Well it's more than zero, so I guess that's a yes?

Saturday, 28 March 2015

dez 1.2

So I came up with a couple more ideas to try ... actually they all pretty much amounted to nothing but here is a small update anyway. See the notes on the home page.

Some things I tried and discarded: incrementally chaining the hash table (avoiding the realloc+copy); inlining the hash function; calculating all hashes of source and target at the start; doing the same in parallel. Most of them amounted to no gain although the parallel hashing helped quite a bit in some cases but not often enough to be worth the memory required.

Thus this is mostly a post about the expanded homepage where i added a bit detail and more benchmarks.

I was a bit surprised with the memory use compared to the "competition" because dez amounts to a full exhaustive search, albeit index assisted, whereas the others do not.

Friday, 27 March 2015

synthz 0.0

Decided to code-drop the audio synthesiser i hacked up a couple of weeks ago.

Probably of no use to anybody but what the hell. Maybe one day i'll make some better sfx for that space invaders I did.

It's got it's own home page up on the usual spot.

dez 1.0

I packaged up the best version of the delta generator into dez-1.0. It and some details are available on the dez home page.

A couple of nights ago I tried one of the hashtable ideas from a previous post. It made a big improvement so I kept whittling it away until there really wasn't much of a hash table left apart from a couple of arrays. Fastest version using the least memory. Open addressing just didn't handle some bad cases at all well.

I didn't see the point in keeping the others around so i only kept this one in the release.

And that should be me done with this for a while.

Tuesday, 24 March 2015

Getting the runs.

As is usually the case I kept poking last night and added a runlength detector to the internal matcher loop of the delta processor.

To start with, any sequence which begins with 3 common bytes are not added to the hash table. For my problem file this reduces the hashtable load significantly.

Then when scanning the target data any location which starts with 3 common bytes are not looked up in the hash table and just converted to a byte run instead.

I had to add runs to the byte encoding and to do this and i noticed that ADD and RUN are typically tiny so just made the command+length byte always 1 byte with 6 bits of length. But I might change that to 5 bits of length so that it can still have a continue bit since most are under 32 bytes as well and that will protect against the occasional long length.

So something like this:

   00XXXXXX            - copy + 6 bit length
   10XXXXXX CXXXXXXX*  - copy + extended length
   011XXXXX            - add + 5 bit length
   111XXXXX CXXXXXXX*  - add + extended length
   010XXXXX            - run + 5 bit count
   110XXXXX CXXXXXXX*  - run + extended count

Expansion space? Why? If need be I could let $00 be an extended command byte or something.

The delta sizes are slightly larger but the runtime is ~2 orders better on this particular problem file (jjmpeg.dll) with similar settings. And it's like ~1s vs 10ms rather than 10ms vs 100us.

(6,1)      delta        size
 rev    orig     new    
  59  155323  155323    155323
  57   15509   15626    154093
  55    8168    8227    154093
  53   10425   10490    153839
  47    9942    9958    153095
  45   15879   16000    145425
  40   16584   16701    146274
  36   14561   14632    144658
  22   10751   10832    144145
  20      38      38    144145

           jjmpeg.dll

These are the deltas for jjmpeg.dll from the trunk of jjmpeg. It's a reverse delta so the top line is the latest version with no processing and each subsequent is an editing list which turns the previous version into it.

One probably obvious thing I hadn't realised is that reverse deltas work particularly well on source code - or anything that is added to over time. Since most edits are editions, a reverse delta is mostly just deletes (actually, just no commands at all).

 rev   delta    size
 147    3977    3977
 146      31    3801
 145      26    3671
  85     101    3177
  53      28    3180
  52      39    2591
  45      49    2488
  16      48    1645
  13      15    1556
  11      42    1300
   6      25    1194
   2      20    1051

     AVNative.java

I'm not working on a revision control system for source code, it just provides easy test data.

So with that i'm pretty well done with the that* ... unless i can do something about the hash table.

With short matching patterns required to get the best results certain data gets repeated a lot which is problematic for the open addressed hashing. Most solutions I can think of just blow the memory out since as soon as you start adding links it at least doubles the memory.

After a bit of thought a couple of possibly practical ideas spring to mind ...

Overflow and keep.
Once a search exceeds a certain range drop subsequent records into an overflow record. The record in the table can be a negative index to the overflow record.

Clashes still cause 'difficulty' because the match scan will have to deal with multiple overflow records. Although they could include the hash key to help filter them out.

Overflow all.
This is similar but rather than leave the entries there all values for the same key are extracted and placed into the overflow record and the rest are marked deleted. The first available slot is set to point to the overflow record.

The problem here is that the hash keys are not stored in the table and they are required to be able to filter the entries since they could include other hash keys. But! They can be regenerated quite cheaply.

Hybrid and keep.
Have a second smaller chained (or otherwise) hash table and if any search exceeds some limit store values in that instead.

This is the simplest but would require two lookups each time.

Hybrid all.
Combine the last two. Move all the other values of that key to the new table too.

This would only require two lookups for values which are not present in the overflow table, which by definition occur less often. It's not really the lookups which are the problem though.

Just thoughts, haven't tried anything.

The journey never ends

So while thinking about the next problem: the scalability of the brute force search of all the substrings with the same keys ... I thought of an exact solution that combines the string search with the index.

It's pretty obvious so probably has someone's name Update: Well sort of, it's a longest common prefix table, or LCP table. The below is definitely a naive approach with O(N log N) running time, but it is space efficient and simple to understand.

First, the whole file is pre-processed into a common prefix tree.

  1. Sort a list of every substring in the source;
  2. Iterate over the sorted list and count the length of the common prefix;
  3. (optional) build an index of the first row of each possible byte.

Because it just uses pointers/indices it doesn't take up any memory for the data and it only needs one entry for each location.

This data structure is basically a directed graph where each offset in 'starts' is a branch and each value in 'commons' is a link to the branch below.

 0 @ 15 c=1 '  Test.'
 1 @ 16 c=1 ' Test.'
 2 @ 7  c=1 ' a test.  Test.'
 3 @ 4  c=1 ' is a test.  Test.'
 4 @ 9  c=0 ' test.  Test.'
 5 @ 21 c=1 '.'
 6 @ 14 c=0 '.  Test.'
 7 @ 17 c=1 'Test.'
 8 @ 0  c=0 'This is a test.  Test.'
 9 @ 8  c=0 'a test.  Test.'
10 @ 18 c=4 'est.'
11 @ 11 c=0 'est.  Test.'
12 @ 1  c=0 'his is a test.  Test.'
13 @ 5  c=3 'is a test.  Test.'
14 @ 2  c=0 'is is a test.  Test.'
15 @ 6  c=2 's a test.  Test.'
16 @ 3  c=1 's is a test.  Test.'
17 @ 19 c=3 'st.'
18 @ 12 c=0 'st.  Test.'
19 @ 20 c=2 't.'
20 @ 13 c=1 't.  Test.'
21 @ 10 c=0 'test.  Test.'

This is the result for 'This is a test. Test.'. '@' is the position in the original string, and 'c=' is the number of characters of common prefix with the string below.

Using this table it is a simple walk to find the longest string from these which matches an arbitrary string.

And of course ... this is exactly the problem I was solving originally a few days ago.

Sorted Array Delta

So I started from scratch and wrote another delta builder based only on this technique. I only implemented the searches of the source buffer but it does do runs. I had to use a custom quick-sort as the java Arrays sort will not let me override the comparator for primitive arrays. I use an index for the first byte to jump to the right location to speed up the first level of the tree.

So how's it go?

Run-time wise it's something like 1.5-10x slower than the best hash based version I have depending on the data - the closest end is with jjmpeg.dll. The sorting stage can dominate the processing time, for jjmpeg.dll it's 5/6ths of the total processing time but for GPL2 it's only 1/2.

But it only needs half the memory of the best of the rest, so it's not all bad.

Rubbish

I continue to be surprised at how well java copes with lots of object allocation and garbage; despite using 10x the memory using a HashMap<Integer,ArrayList<<nteger>> is pretty much on par to the open addressed hash table even when it's working well and much faster compared to the failure cases for the dll file. It's tons faster than reference counting (and threads, java is dreams for threads) and I never found explicit allocation very fast to start with. Not quite as quick as alloca().

* - i'm never done with that.

Monday, 23 March 2015

deltas and diffs again

I had a quick look at cleaning up the delta code I wrote yesterday and got sidetracked into making a more optimised version - one where the hash table and other bits are just in-lined into a single class.

Firstly, its about 20% faster so far: as i suspected simple micro-optimisations really count here.

Secondly, it's somewhat better at producing compact deltas. Oops, I guess I had a bug somewhere in the code - the original still produces working output but it doesn't find as many matches as it should.

With the new encoder the (6,1) case creates ~16K of delta compared to ~18K previously. And now using a smaller block size of (4,1) does even better at ~14K5 where previously it only hindered.

I also realised just how simple it was to add the currently-decoded target data into the search space of the encoder and copy space of the decoder. Literally a couple of lines in the decoder and under 10 in the encoder.

This drops the test case of GPL2 to GPL3 text to under 13 000 bytes for DEZ1(4,1) and about 13K5 for DEZ1(6,1).

This also allows it to work as a functional if not particularly terrific compressor. An empty buffer to GPL3 "delta" is about 17K. "gzip -9" (== GZipOutputStream defaults) is 12k with comparable execution times (gzip a bit better and scales better). Not that it is very useful in general but concatenating 10x GPL3 texts and then performing the same operation produces just an ~18K delta vs ~100K+ gzip stream.

It's pretty memory hungry on the best setting though; windows could be used.

I'm also curious as to applying this to line-by-line diffs.

Update: Ahah, not so good with some binary data. It creates pretty good deltas but the performance drops a few orders of magnitude. Any data with lots of '00 00 00' type sequences blow out the hash collision rate and break the performance of the open hashing algorithm. A straight java collections implementation (with arraylists and boxed integer's and all) scales a lot better (and surprisingly runs about the same speed in general) although it uses way more memory.

It can be mitigated by limiting the search for an empty slot or other trade-offs at a pretty hefty cost to the generated delta size. Maybe detecting runs would help some.

lambdas and finals

I just tried using -verbose:gc to get some basic memory statistics and it looks like the previous implementation uses way more temporary memory than it should. It's not allocating any differently so i can only think it's something to do with the foreach/lambda code, and this probably explains the extra runtime.

To confirm i moved a local variable i had accessed as a final to the an object field: yep, it drops ~500K of garbage. Moving the same lambda code to a field variable as well saved another ~500K too, but even then it's still ~1M more than the other implementation.

"hmm".

versioning, deltas and diffs

I've been jumping around a bit with hacking lately and this weekend i've been playing with a versioned database and related stuff. I have a fairly long article written about it but it's not quite ready yet. I came up with a few variations and some of the features included cheap copies, cheap branches, proper renaming and so on. Apart from the fun-factor I do have an end-use in mind but i'm still some way off.

One goal was also efficient storage so I worked on a system which stored changes as reverse deltas. After some searching i came across the vcdiff format and some java code to work with it - but much of each implementation is pretty undecipherable code and to be honest just not that hot and I had a niggling interest to look a bit further into the problem.

So today I basically wrote my own from scratch and surprisingly it creates smaller diffs than jvcdiff given that it is literally an afternoons hacking and a naive approach at best (at least on a single example). Runtime is comparable although in either case they are very quick. I'm still trying to get hard statistics on memory use and gc load but i'm pretty sure it's better too.

I also looked at badiff which works with very large files, but that isn't something I need and it generated very fat diffs, used a ton of memory, and wasn't very fast (putting it lightly, see below).

Anyway the problem turned out to be quite interesting along the way.

Delta Algorithm?

It's more or less based on Bentley-McIlroy's paper `Data Compression Using Long Common Strings' although that paper is so brief I didn't get a great deal from it after a skim read. It's guts I think is more or less based on Rabin-Karp's paper `Efficient randomised pattern-matching algorithms' but that was too dense for weekend reading so after my eyes glazed over I did a quick scan of the wikipedia article and then thought about what problem it was actually solving.

In more human terms I think it basically devolves into this: compare every sub-string in the target array to every sub-string in the source array to identify the longest match. Use it to do shit. The details come in trying to make this practical and efficient and what `shit' you do with `it'.

So the following algorithm is kinda probably the same thing but only based on a loose understanding thereof.

# algorithm to generate a list of editing commands
# which transforms one array into another

hsize = length of sub-string
hstep = bytes to step in input

htable = hash table mapping hash value to location

# build hash table
for each location in source in steps of hstep
  htable.addHash(source, location)
end

# find matches
for each location in target
  hash = generateHash(target, location)

  for each candidate in htable.lookup(hash)
     determine longest actual match
       of source @ candidate against target @ location
  end

  if there was a match and longest > threshold
     if any data not matched
       output an append command
     endif
     output copy command
     location += match length
  fi
end
add any left over

As one can see it's actually rather simple and mirrors the one-sentence English text description in a straightforward way. The hash table is only used as a helper to narrow the search space to possible candidates rather than having to check them all.

It requires the full source data but could trivially stream the target data. Restoring the target requires the source plus the list of editing commands.

hsize is typically 4-8 and determines the runtime and compression rate. A smaller number gives more potential candidates to match against (and more hashing clashes) but a larger hash will not match smaller strings.

hstep sets how often the source is sampled. Using a hstep of 1 gives the best result by far but results in a large hashtable and longer running time.

Rolling hashes are used so the hash length has no effect on that calculation but it does affect the runtime in other ways.

There are two interesting problems here - rolling hashes and the hash table itself.

Rolling hash

I tried to implement a Rabin-Karp hash but lost patience trying to get my head around the modulo maths. I think i just had some indexing bug rather than mathematical ones but in the end I felt more comfortable with a cyclic hash based on rotating bit arithmetic. The mathematics are still messy but even the 6510 cpu has a 'ror' instruction and so the actual code is trivial for a digital computer.

// Copyright (C) 2015 Michael Zucchi
// License is GNU GPL v3 or later
public class CyclicHash {
        int b;
        int hash;
        int fix0;
        static final int[] random = new int[256];

        static {
                Random r = new Random();
                for (int i = 0; i < random.length; i++)
                        random[i] = r.nextInt();
        }

        public CyclicHash(int b) {
                this.b = b;
                this.fix0 = ((b - 1) * 9) & 31;
        }

        public int init(byte[] data, int off) {
                hash = 0;
                for (int i = 0; i < b; i++)
                        hash = rotateLeft(hash, 9) ^ random[data[i + off] & 0xff];
                return hash;
        }

        public int update(byte leave, byte enter) {
                int leaving = rotateLeft(random[leave & 0xff], fix0);
                int entering = random[enter & 0xff];

                hash = rotateLeft(hash ^ leaving, 9) ^ entering;
                return hash;
        }
}

A random table is used to create the hash values for each possible byte. To add a new value to the hash it is first rotated and then xor'd with the random hash for the new byte. To remove an old value one only has to rotate by the length of the block and xor it again. Rotation is supposed to be modulo 32 but the java implementation in Integer is not and doesn't match the javadocs - to make sure I do it first (actually it worked before i did this so i know it works but this simplifies the call). The shift amount of 9 i plucked out of my arse.

Choosing better `random' values has some effect but this works.

Approximate Hash Table

I don't just want a simple hash table I want a hash table which lists all the potential match candidates against a given hash value. In java this could be expressed as:

        HashMap<Integer,List<Integer>> htable;

However this will (should, i imagine) have scalability and gc issues and boxed primitives are quite heavy weight to start with.

There are some good optimised java libraries for primitive hash tables but i don't think they support multiple entries off the same hash key and they tend to be gigantic libraries targeted at different problems. Anyway it's the sort of problem i find interesting (fun, actually) so I came up with my own.

I sat outside in the last of the rays of the sun with knuth and a beer and read up on open addressing hash tables. Typically i'm partial to the chained variety for their conceptual simplicity but from looking at some libraries like trove it seems open addressing has benefits in some cases and particularly for java. Armed with the knowledge and the setting sun I came up with a space efficient 'approximate' hash table which suits the problem at hand.

// Copyright (C) 2015 Michael Zucchi
// License is GNU GPL v3 or later
class MatchMap {

        int[] values;
        int mask;
        int logN;

        static final int EMPTY = Integer.MAX_VALUE;

        public MatchMap(int N) {
                int size;

                logN = 31 - Integer.numberOfLeadingZeros(N);
                size = 1 << (logN + 2);
                mask = size - 1;
                values = new int[size];

                Arrays.fill(values, EMPTY);
        }

        public void add(int hash, int value) {
                int i = hash & mask;
                int c = ((hash >> logN) | 1) & mask;

                while (values[i] != EMPTY)
                        i = (i + c) & mask;

                values[i] = value;
        }

        public void foreach(int hash, IntConsumer func) {
                int i = hash & mask;
                int c = ((hash >> logN) | 1) & mask;
                int v;

                while ((v = values[i]) != EMPTY) {
                        func.accept(v);
                        i = (i + c) & mask;
                }
        }
}

It uses open addressing with double hashing and doesn't bother with store the hash key, only the value. Knowing the size is important - but also known in this application. Deletes are a bit messy but fortunately not required. The table is sized so that the load factor is well less than 0.5.

Adding a new entry just starts probing at the modulo-size location and searches for an empty slot, stepping by the second hash (higher bits in the function or'd with 1 to make it odd). Knuth suggests using an odd number for the step which makes a lot of sense; it will eventually visit every location. This is an effective technique against clustering that occurs if linear probing is performed (I also tried that).

Retrieval just does the same thing. Any values are considered potential matches even if they were stored by a completely different hash key. Given that each location needs to be compared directly with the data arrays it doesn't make mistakes very costly and it even seems to improve the delta output in some cases.

One might note here ... this in total ... is probably less code than just the extra scaffolding required to use a java collections hashtable+list for the same purpose and probably 4-10x more space efficient. open-vcdiff has ~1KLOC for this purpose although that's about half comments explaining it's complexity - i don't even care to understand it enough to know if any of that is actually justified.

These two functions are 80% of the smarts and about 30% of the code. The rest is just to wrap it in some 'efficient' byte-streaming encoding, settle on a magic number, and write a decoder. vcdiff encoding would be nice but the encoding is a little more work than I have time left this weekend.

A result

I created a delta from the text versions of GPL 2.0 to GPL 3.0 using this and compared to jvcdiff and badiff. As I said i don't think jvcdiff is the best implementation and it appears to just be a port of open-vcdiff, but it's what I have handy. For jvcdiff i'm forcing a single window but otherwise using the defaults.

              badiff      jvcdiff    DEZ1
                                     (6,1)       (6,2)       (6,4)       (8,1)       (8,8)       (16,16)
source size   18092
target size   35147
patch size    47833       28829      18064       19053       20421       19128       23159       28877

encode        0.174870    0.002000   0.001550    0.001200    0.000950    0.001250    0.000720    0.001000
decode        0.000700    0.000170   0.000150    0.000150    0.000150    0.000120    0.000090    0.000060

hardware is kaveri cpu, jvm is oracle java 8 64-bit.

FWIW gzip is about 16K for both concatenated together, but that isn't really the same thing.

The value in brackets is (hsize,hstep). A larger hstep will reduce the size of the hash table generated and thus affect memory use, possibly significantly. hsize affects the minimum length of a match as well as hashing conflicts at the low end.

Times are in seconds and approx/rounded from the last few of 200 runs - jvcdiff has much more complex code so it takes nearly this many before it is fully optimised by the jvm, before that it is approximately 0.01s for encode. Because it's fast it might need even more loops for the jvm to notice. Due to randomisation of the hash in these test runs the DEZ1 encoding can vary between runs.

Dunno what's up with badiff, maybe this data isn't suitable for it and/or it's tuned to larger datasets.

As I said I don't have any hard data on the memory footprint but for DEZ1 the (x,1) worst-case it uses a 64K element int array for the hash table, plus the memory for the input and output buffers and half a dozen small utility objects. Decoding needs no extra memory apart from input, patch, and output buffer; and the output size is encoded in the delta so it only needs one allocation of non-garbage.

After writing most of this I tried the (16,16) option and considering the patch size is close to jvcdiff perhaps that matches the parameters it uses. Actually I kinda knew about this but forgot about it. Even then it is surprising that the delta size is about the same as jvcdiff uses much more advanced encoding techniques from VCDIFF that should reduce the size.

Format

There's probably not much point going into a lot of detail in the file format but lets just say i chose the most compact and simplest format I could.

Integers are encoded using a big-endian stream of 7-bits data + 1 bit continue indicator, much the same as vcdiff and dozens of other binary encodings.

Because I only have two opcodes as a further optimisation I encode the opcode as a single bit along with the length. This allows lengths of up to 63 bytes (hmm, 64 if i add one ...) to be encoded together with the opcode in a single byte.

I just encode the copy source address as an integer every time which means they will tend to take 3-4 bytes in typical files. This is where a lot of the smarts in vcdiff comes from but it also requires more faffing about and runtime overheads.

Improvements

So a relatively cheap and possibly significant improvement is to include the the incremental output as part of the search space: i.e. the part of the output which will have been decoded up until that time. As long as the hash table is big enough that should be relatively simple in the encoder and isn't much more work in the decoder.

I don't see the need for run-length encoding as it rarely comes up in real text (and yeah, real men use tabs).

Since the code is so simple it is open to gains from micro-optimisations. Being simple does help the compiler a lot to start with though and hotspot compiles faster code with less lead-time than the more complex implementations.

The hash table is a critical part of the performance and could be tuned further, not that there is much code to tune.

Currently any match of 4+ is encoded as a copy even if an add might be smaller. I don't know if 4 is even the best value. A look-ahead of a few characters may also yield longer matches and more compact encodings. Other address compression techniques as used in vcdiff could be used.

Scalability is unknown. My application is 'post-sized' text so it should be fine there. Increasing the block size (hsize) tends to speed up the encoder as it reduces the search space, but it misses small matches so the delta size increases. A two-tiered approach might work: a longer hash for the whole file and a smaller hash for a local region.

Having said all that, 'simplicity' and 'good enough' also have their own considerable charm and appeal so I might just clean it up and do some better testing and tuning and leave it at that.

Closing Thoughts

Well ... I guess that was a fun afternoon, ... which seems to have gone well beyond midnight, oops - these long articles take hours to write. Given I went in blind i'm pretty surprised that the results are at least comparable to existing code with such modest effort.

I'm also surprised - or maybe not surprised if i let my ego talk - at how simple the algorithms actually are compared to the complexity of any implementation i've seen. And it's a little difficult to see this case as a matter of being a subject expert thus making it appear simpler than it really is. The Bentley/McIlroy paper mentions a working implementation in ~150LOC of C which was hard to believe looking at these libraries; but now i do.

And also ... back to work (later) today. It's been an indescribable 3 month break, but not as good as that sounds. I guess I lost 5 kilos in the last 6 weeks.