Firstly, its about 20% faster so far: as i suspected simple micro-optimisations really count here.
Secondly, it's somewhat better at producing compact deltas. Oops, I guess I had a bug somewhere in the code - the original still produces working output but it doesn't find as many matches as it should.
With the new encoder the (6,1) case creates ~16K of delta compared to ~18K previously. And now using a smaller block size of (4,1) does even better at ~14K5 where previously it only hindered.
I also realised just how simple it was to add the currently-decoded target data into the search space of the encoder and copy space of the decoder. Literally a couple of lines in the decoder and under 10 in the encoder.
This drops the test case of GPL2 to GPL3 text to under 13 000 bytes for DEZ1(4,1) and about 13K5 for DEZ1(6,1).
This also allows it to work as a functional if not particularly terrific compressor. An empty buffer to GPL3 "delta" is about 17K. "gzip -9" (== GZipOutputStream defaults) is 12k with comparable execution times (gzip a bit better and scales better). Not that it is very useful in general but concatenating 10x GPL3 texts and then performing the same operation produces just an ~18K delta vs ~100K+ gzip stream.
It's pretty memory hungry on the best setting though; windows could be used.
I'm also curious as to applying this to line-by-line diffs.
Update: Ahah, not so good with some binary data. It creates pretty good deltas but the performance drops a few orders of magnitude. Any data with lots of '00 00 00' type sequences blow out the hash collision rate and break the performance of the open hashing algorithm. A straight java collections implementation (with arraylists and boxed integer's and all) scales a lot better (and surprisingly runs about the same speed in general) although it uses way more memory.
It can be mitigated by limiting the search for an empty slot or other trade-offs at a pretty hefty cost to the generated delta size. Maybe detecting runs would help some.
lambdas and finals
I just tried using -verbose:gc to get some basic memory statistics and it looks like the previous implementation uses way more temporary memory than it should. It's not allocating any differently so i can only think it's something to do with the foreach/lambda code, and this probably explains the extra runtime.
To confirm i moved a local variable i had accessed as a final to the an object field: yep, it drops ~500K of garbage. Moving the same lambda code to a field variable as well saved another ~500K too, but even then it's still ~1M more than the other implementation.