Wednesday 3 September 2014

simplex noise, less memory

I thought I'd look at something a bit different today: noise. Something to get the fragment shaders doing some more work.

I've looked at some of this before but it's been a while and I never had much use for it.

I started with "wavelet noise" but when I realised it needed big lookup tables I went back to looking at the simplex noise algorithm. It seems wavelets are being used to create bandwidth limited versions of existing noise so that it scales better; but this isn't something I need to worry about.

A paper and implementation by Stefan Gustavson and others pretty much had me covered but I wanted to try and remove the 512+256 element lookup tables used to hash the integer coordinates to save some memory on the epiphany.

I came up with two working solutions in the end.

The first one uses a 32-element lookup table of prime numbers to implement a 2D hash function. I just grabbed 32 primes (every 20th prime, starting with the 20th) for the table and fiddled with eor/mul and shift until I had something that seemed to work. I arbitrarily chose 32 because it was a nice round number.

// there's nothing particularly good or useful here

    static final int[] hasha = {
        71, 173, 281, 409, 541, 659, 809, 941,
        1069, 1223, 1373, 1511, 1657, 1811, 1987, 2129,
        2287, 2423, 2617, 2741, 2903, 3079, 3257, 3413,
        3571, 3727, 3907, 4057, 4231, 4409, 4583, 4751
    };

    private static int hash16(int a, int b) {
        return ((((b ^ hasha[a & 31]) * (a ^ hasha[b & 31])) >> 5) & 15);
    }

Because I was only interested in the 2D case I changed the gradient normal array to 16 elements so I didn't have to modulo the result as well. TBH it's kind of surprising it works as well as it does since hashing numbers is pretty tricky to get right and I really didn't know what I was doing.
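A 16-entry gradient table like the one described could be built roughly as follows. This is only a sketch: the class name `Gradients16`, the `dot` helper and the choice of starting at angle 0 are my assumptions for illustration, not the code used for the screenshots.

```java
// Hypothetical helper: 16 unit-length 2D gradients, evenly spaced around the circle.
final class Gradients16 {
    static final double[][] GRAD = new double[16][2];

    static {
        for (int k = 0; k < 16; k++) {
            // Assumption: the first gradient points along +x (no phase offset).
            double angle = 2.0 * Math.PI * k / 16.0;
            GRAD[k][0] = Math.cos(angle);
            GRAD[k][1] = Math.sin(angle);
        }
    }

    // Dot product of the gradient selected by a 4-bit hash with the offset (x, y).
    // The mask keeps the index in range, so no modulo is needed.
    static double dot(int h, double x, double y) {
        double[] g = GRAD[h & 15];
        return g[0] * x + g[1] * y;
    }
}
```

Since every entry has the same length, each lattice corner contributes with the same weight regardless of which gradient the hash picks.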

When I started I didn't realise exactly what the hash was for, so once I had a better understanding of why it was there I thought I'd try an existing integer hash function. In general they failed miserably but I found one that came from the H2 database which worked sufficiently well.

    private static int hash(int x) {
        x = ((x >> 16) ^ x) * 0x45d9f3b;
        x = ((x >> 16) ^ x) * 0x45d9f3b;
        x = ((x >> 16) ^ x);
        return x;
    }

    private static int hash16(int a, int b) {
        return (hash(a * b + a + b)) & 15;
    }

I used the (a*b+a+b) calculation to turn it into a 2D hash function.
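A quick way to eyeball how this behaves is to histogram the 16 outputs over a grid of lattice coordinates. The sketch below copies the two functions from above; the class name `HashCheck` and the `histogram` helper are made up for the test. One thing that follows from the algebra: a*b+a+b equals (a+1)*(b+1)-1, so the hash is symmetric in a and b, and the whole row a = -1 (or column b = -1) collapses to a single value.

```java
import java.util.Arrays;

// Hypothetical test harness around the hash functions from the post.
final class HashCheck {
    static int hash(int x) {
        x = ((x >> 16) ^ x) * 0x45d9f3b;
        x = ((x >> 16) ^ x) * 0x45d9f3b;
        x = ((x >> 16) ^ x);
        return x;
    }

    static int hash16(int a, int b) {
        return hash(a * b + a + b) & 15;
    }

    // Count how often each of the 16 outputs occurs over an n x n block
    // of non-negative lattice coordinates.
    static int[] histogram(int n) {
        int[] counts = new int[16];
        for (int a = 0; a < n; a++) {
            for (int b = 0; b < n; b++) {
                counts[hash16(a, b)]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Roughly equal counts suggest the gradients are picked evenly;
        // it says nothing about visible patterns between neighbours.
        System.out.println(Arrays.toString(histogram(256)));
    }
}
```

An even histogram is only a sanity check; looking at the rendered noise is still the real test.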

So this final version requires no lookup table for the gradient table permute at all - nice. But it requires 3 integer multiplies - not so nice for epiphany. And even the other version needs an integer multiply and thus the same costly fpu mode changes on epiphany.

Since I only need a limited number of output bits it might (should?) be possible to change this to using float multiplies to avoid the costly mode change; but this is something for further study. The first version might make this easier.

Screenshots ... the first is a simple 4-octave fractal noise generated using the 2D Simplex Noise code from Stefan. I think the 2D noise function has a small bug because it's using the 12-point 3D gradient bases which don't always evaluate to vectors of the same length in 2D but it isn't apparent once fractal noise is generated as here.
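For reference, "4-octave fractal noise" here means summing several scaled copies of the base noise. A minimal sketch, assuming the conventional choice of doubling the frequency and halving the amplitude per octave (the exact factors used for the screenshots aren't stated); `FractalNoise` and `fractal` are illustrative names, and the base noise is passed in so any of the variants can be plugged in.

```java
import java.util.function.DoubleBinaryOperator;

// Hypothetical helper: fractal sum over any 2D noise function.
final class FractalNoise {
    // Sums 'octaves' layers of 'noise'. Each layer doubles the frequency and
    // halves the amplitude (assumed factors); the result is divided by the
    // total amplitude so it stays within the range of the base noise.
    static double fractal(DoubleBinaryOperator noise, double x, double y, int octaves) {
        if (octaves <= 0) {
            throw new IllegalArgumentException("octaves must be positive");
        }
        double sum = 0.0, total = 0.0;
        double amp = 1.0, freq = 1.0;
        for (int i = 0; i < octaves; i++) {
            sum += amp * noise.applyAsDouble(x * freq, y * freq);
            total += amp;
            amp *= 0.5;
            freq *= 2.0;
        }
        return sum / total;
    }
}
```

Something like `FractalNoise.fractal((u, v) -> myNoise(u, v), x, y, 4)` would give the 4-octave case, with `myNoise` standing in for whichever simplex implementation is being tested.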

The next one is an example using the naive hash function (it may be a different scale to the others since I ran it separately). Covering 4 octaves hides some problems it might have but I've done some very basic testing at larger scales and it seems about as stable and nicely random as the others.

And the final shot is using the H2 hash function and 16 gradients evenly spaced around the unit circle rather than 12 evenly spaced around the unit sphere as in the traditional version.

Look about the same to me?

I don't know if it's useful for anything I might do or if it is even fast enough to run in a shader on the epiphany but I learnt a couple of interesting things along the way.

ezegpu

I've been doing some little bits and pieces on the ezegpu code as well.

  • Changed the way async dma works in ezecore so that you can use either dma channel, enqueue your own DMA blocks and chains, and increased the queue length for more outstanding requests. I kept the old api compatible and interoperable;
  • Fixed the rasteriser to use this new queue;
  • Added some async dma to the controller but until everything uses it it won't pay off. It got messy enough that I need to redo it with the new goal in mind; it's not much code but I haven't gotten to it yet;
  • Made the backends open the framebuffer so the same 'gles2 demo' can run against different backends by just relinking them so as to simplify comparisons;
  • Added a NEON matrix multiply - it's 3x to 5x faster than a C version;
  • Added a NEON scale+clamp RGBA float to byte for the all-ARM version. This is 3x faster than the C version although I've got the channels messed up so the colours are wrong;
  • Did a bunch of cvs + code management stuff.

Together the NEON changes amount to a 6% improvement to the total runtime of the all-ARM code for my current testing case (8x8x8 stars). Nothing major, although it goes up on simpler scenes mostly due to the faster RGBA float to byte conversion.
