Tuesday, 24 March 2015

Getting the runs.

As is usually the case I kept poking last night and added a runlength detector to the internal matcher loop of the delta processor.

To start with, any sequence which begins with 3 common bytes are not added to the hash table. For my problem file this reduces the hashtable load significantly.

Then when scanning the target data any location which starts with 3 common bytes are not looked up in the hash table and just converted to a byte run instead.

I had to add runs to the byte encoding and to do this and i noticed that ADD and RUN are typically tiny so just made the command+length byte always 1 byte with 6 bits of length. But I might change that to 5 bits of length so that it can still have a continue bit since most are under 32 bytes as well and that will protect against the occasional long length.

So something like this:

   00XXXXXX            - copy + 6 bit length
   10XXXXXX CXXXXXXX*  - copy + extended length
   011XXXXX            - add + 5 bit length
   111XXXXX CXXXXXXX*  - add + extended length
   010XXXXX            - run + 5 bit count
   110XXXXX CXXXXXXX*  - run + extended count

Expansion space? Why? If need be I could let $00 be an extended command byte or something.

The delta sizes are slightly larger but the runtime is ~2 orders better on this particular problem file (jjmpeg.dll) with similar settings. And it's like ~1s vs 10ms rather than 10ms vs 100us.

(6,1)      delta        size
 rev    orig     new    
  59  155323  155323    155323
  57   15509   15626    154093
  55    8168    8227    154093
  53   10425   10490    153839
  47    9942    9958    153095
  45   15879   16000    145425
  40   16584   16701    146274
  36   14561   14632    144658
  22   10751   10832    144145
  20      38      38    144145


These are the deltas for jjmpeg.dll from the trunk of jjmpeg. It's a reverse delta so the top line is the latest version with no processing and each subsequent is an editing list which turns the previous version into it.

One probably obvious thing I hadn't realised is that reverse deltas work particularly well on source code - or anything that is added to over time. Since most edits are editions, a reverse delta is mostly just deletes (actually, just no commands at all).

 rev   delta    size
 147    3977    3977
 146      31    3801
 145      26    3671
  85     101    3177
  53      28    3180
  52      39    2591
  45      49    2488
  16      48    1645
  13      15    1556
  11      42    1300
   6      25    1194
   2      20    1051


I'm not working on a revision control system for source code, it just provides easy test data.

So with that i'm pretty well done with the that* ... unless i can do something about the hash table.

With short matching patterns required to get the best results certain data gets repeated a lot which is problematic for the open addressed hashing. Most solutions I can think of just blow the memory out since as soon as you start adding links it at least doubles the memory.

After a bit of thought a couple of possibly practical ideas spring to mind ...

Overflow and keep.
Once a search exceeds a certain range drop subsequent records into an overflow record. The record in the table can be a negative index to the overflow record.

Clashes still cause 'difficulty' because the match scan will have to deal with multiple overflow records. Although they could include the hash key to help filter them out.

Overflow all.
This is similar but rather than leave the entries there all values for the same key are extracted and placed into the overflow record and the rest are marked deleted. The first available slot is set to point to the overflow record.

The problem here is that the hash keys are not stored in the table and they are required to be able to filter the entries since they could include other hash keys. But! They can be regenerated quite cheaply.

Hybrid and keep.
Have a second smaller chained (or otherwise) hash table and if any search exceeds some limit store values in that instead.

This is the simplest but would require two lookups each time.

Hybrid all.
Combine the last two. Move all the other values of that key to the new table too.

This would only require two lookups for values which are not present in the overflow table, which by definition occur less often. It's not really the lookups which are the problem though.

Just thoughts, haven't tried anything.

The journey never ends

So while thinking about the next problem: the scalability of the brute force search of all the substrings with the same keys ... I thought of an exact solution that combines the string search with the index.

It's pretty obvious so probably has someone's name Update: Well sort of, it's a longest common prefix table, or LCP table. The below is definitely a naive approach with O(N log N) running time, but it is space efficient and simple to understand.

First, the whole file is pre-processed into a common prefix tree.

  1. Sort a list of every substring in the source;
  2. Iterate over the sorted list and count the length of the common prefix;
  3. (optional) build an index of the first row of each possible byte.

Because it just uses pointers/indices it doesn't take up any memory for the data and it only needs one entry for each location.

This data structure is basically a directed graph where each offset in 'starts' is a branch and each value in 'commons' is a link to the branch below.

 0 @ 15 c=1 '  Test.'
 1 @ 16 c=1 ' Test.'
 2 @ 7  c=1 ' a test.  Test.'
 3 @ 4  c=1 ' is a test.  Test.'
 4 @ 9  c=0 ' test.  Test.'
 5 @ 21 c=1 '.'
 6 @ 14 c=0 '.  Test.'
 7 @ 17 c=1 'Test.'
 8 @ 0  c=0 'This is a test.  Test.'
 9 @ 8  c=0 'a test.  Test.'
10 @ 18 c=4 'est.'
11 @ 11 c=0 'est.  Test.'
12 @ 1  c=0 'his is a test.  Test.'
13 @ 5  c=3 'is a test.  Test.'
14 @ 2  c=0 'is is a test.  Test.'
15 @ 6  c=2 's a test.  Test.'
16 @ 3  c=1 's is a test.  Test.'
17 @ 19 c=3 'st.'
18 @ 12 c=0 'st.  Test.'
19 @ 20 c=2 't.'
20 @ 13 c=1 't.  Test.'
21 @ 10 c=0 'test.  Test.'

This is the result for 'This is a test. Test.'. '@' is the position in the original string, and 'c=' is the number of characters of common prefix with the string below.

Using this table it is a simple walk to find the longest string from these which matches an arbitrary string.

And of course ... this is exactly the problem I was solving originally a few days ago.

Sorted Array Delta

So I started from scratch and wrote another delta builder based only on this technique. I only implemented the searches of the source buffer but it does do runs. I had to use a custom quick-sort as the java Arrays sort will not let me override the comparator for primitive arrays. I use an index for the first byte to jump to the right location to speed up the first level of the tree.

So how's it go?

Run-time wise it's something like 1.5-10x slower than the best hash based version I have depending on the data - the closest end is with jjmpeg.dll. The sorting stage can dominate the processing time, for jjmpeg.dll it's 5/6ths of the total processing time but for GPL2 it's only 1/2.

But it only needs half the memory of the best of the rest, so it's not all bad.


I continue to be surprised at how well java copes with lots of object allocation and garbage; despite using 10x the memory using a HashMap<Integer,ArrayList<<nteger>> is pretty much on par to the open addressed hash table even when it's working well and much faster compared to the failure cases for the dll file. It's tons faster than reference counting (and threads, java is dreams for threads) and I never found explicit allocation very fast to start with. Not quite as quick as alloca().

* - i'm never done with that.

No comments: