Main types, the tape and a few function examples

The following should be familiar from the basics of AD. There are some differences, the main one being that for some nodes we do not want to calculate the adjoints (because they might be input or target nodes), so we need a flag for that. The rest is boilerplate.

type Df_rec = {
    P: floatType ref
    A : floatType ref
    is_constant : bool
    } with

    static member create P =
        {P=P;A=ref 0.0f;is_constant=false}
    static member createConstant P =
        {P=P;A=ref 0.0f;is_constant=true}

type DM_rec = {
    P : dMatrix
    A : dMatrix
    is_constant : bool
    } with

    static member create (P: dMatrix) =
        {P=P;A=P.zeroLike();is_constant=false}
        
    static member createConstant (P: dMatrix) =
        {P=P;A=dMatrix.createEmpty;is_constant=true}

    static member createEmpty =
        {P=dMatrix.createEmpty;A=dMatrix.createEmpty;is_constant=false}
        
    static member createEmptyConstant =
        {P=dMatrix.createEmpty;A=dMatrix.createEmpty;is_constant=true}

    /// Resizes the primal and the adjoint if they are below nr*nc in size.
    member t.Resize nr nc =
        let p = t.P
        let a = t.A

        // This is an optimization to prevent a pileup of dMatrix objects here.
        // The GC can't free memory while dMatrix instances point to the same dArray.

        // If the requested size is larger, the function replaces the underlying array;
        // otherwise it just mutably adjusts the num_rows and num_cols fields.
        p.ReplaceIf nr nc
        a.ReplaceIf nr nc

    member t.Dispose() = 
        (t.P :> IDisposable).Dispose()
        (t.A :> IDisposable).Dispose()

At the time of writing, there are three steps in the library: the build step, the forward step and the reverse step. In the forward step, the nodes dynamically resize themselves by calling the ReplaceIf functions whenever they have insufficient capacity.

let Noop() = ()
type Df = 
    {
    r: Df_rec
    ff : (unit -> unit)
    fb : (unit -> unit)
    }
    static member create(r,ff,fb) = {r=r;ff=ff;fb=fb}
and DM =
    {
    r: DM_rec
    ff : (unit -> unit)
    fb : (unit -> unit)
    }
    static member create(r,ff,fb) = {r=r;ff=ff;fb=fb}

    static member makeNode(hidden_size, input_size) =
        let p = dMatrix.create(hidden_size,input_size)
        {r=DM_rec.create p;ff=Noop;fb=Noop}
    static member makeNode(hidden_size, input_size, input: floatType[]) =
        let p = dMatrix.create(hidden_size,input_size, input)
        {r=DM_rec.create p;ff=Noop;fb=Noop}
    static member makeNode p =
        {r=DM_rec.create p;ff=Noop;fb=Noop}

    static member makeConstantNode(hidden_size, input_size, input: floatType[]) =
        let p = dMatrix.create(hidden_size,input_size, input)
        {r=DM_rec.createConstant p;ff=Noop;fb=Noop}
    static member makeConstantNode p =
        {r=DM_rec.createConstant p;ff=Noop;fb=Noop}

    static member makeUniformRandomNode(hidden_size,input_size) =
        let scale = (1.0f / sqrt(hidden_size+input_size |> floatType))
        let p = dMatrix.createRandomUniformMatrix hidden_size input_size scale 0.0f
        {r=DM_rec.create p;ff=Noop;fb=Noop}

Df and DM are the scalar and matrix types the library operates on, respectively. The hierarchy goes from top to bottom: DM -> DM_rec -> dMatrix, and likewise for Df. In addition to the records holding the primals and the adjoints, each carries two anonymous functions that take and return nothing, ff and fb. These functions are closures – classes in disguise, really – that actually execute the forward and the backward steps.
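To make the ff/fb pattern concrete, here is a minimal scalar sketch in plain F# – the names and the ScalarNode type are illustrative only, not library types. The closures capture the refs and mutate them each time they are replayed:

```fsharp
// A scalar sketch of the Df pattern. ScalarNode stands in for Df_rec.
type ScalarNode = { P: float ref; A: float ref }

let x = { P = ref 3.0; A = ref 0.0 }
let y = { P = ref 4.0; A = ref 0.0 }
let z = { P = ref 0.0; A = ref 0.0 }

// Forward: z = x * y. Backward: accumulate z's adjoint into the parents.
let ff () = z.P := !x.P * !y.P
let fb () =
    x.A := !x.A + !z.A * !y.P
    y.A := !y.A + !z.A * !x.P

ff ()          // !z.P is now 12.0
z.A := 1.0     // seed the output adjoint
fb ()          // !x.A = 4.0 and !y.A = 3.0
```

Such closures can be stored and replayed any number of times, which is exactly the role the ff and fb members play once they are put on the tape.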

open System.Collections.Generic
type tapeType() =
    let d = Dictionary<int,List<obj>*(List<DM_rec>*int ref)>()
    let mutable select = 0
    /// Instantiates a new List if none is present at the selection and adds to it, else it just adds to the selected one.
    /// The default select is 0.
    member t.Add a =
        if d.ContainsKey(select) = false then
            let tape = List()
            let memory_dm = List(), ref 0
            d.Add(select, (tape, memory_dm))
            tape.Add(a)
        else
            let tape,_ = d.[select]
            tape.Add(a)

    /// Sets the select to input.
    member t.Select i = select <- i

    /// Runs all the forward functions in the currently selected tape.
    member t.forwardpropTape select = 
        let tape,_ = d.[select]
        for i=0 to tape.Count-1 do 
            match tape.[i] with
            | :? Df as x -> x.ff()
            | :? DM as x -> x.ff()
            | _ -> failwith "Type not supported"
    /// Runs all the backward functions in the currently selected tape, starting from the top.
    member t.reversepropTape select = 
        let tape,_ = d.[select]
        for i=tape.Count-1 downto 0 do 
            match tape.[i] with
            | :? Df as x -> x.fb()
            | :? DM as x -> x.fb()
            | _ -> failwith "Type not supported"
    /// Resets the adjoints of the selected tape.
    member t.resetTapeAdjoint select = 
        let tape,_ = d.[select]
        for i=tape.Count-1 downto 0 do 
            match tape.[i] with
            | :? Df as x -> x.r.A := 0.0f
            | :? DM as x -> x.r.A.setZero()
            | _ -> failwith "Type not supported"
    /// Resets the primals of the selected tape.
    member t.resetTapePrimal select = 
        if d.ContainsKey(select) then
            let tape,_ = d.[select]
            for i=tape.Count-1 downto 0 do 
                match tape.[i] with
                | :? Df as x -> x.r.P := 0.0f
                | :? DM as x -> x.r.P.setZero()
                | _ -> failwith "Type not supported"
    /// Disposes all the elements of the selected tape and then clears it, including the memory buffer.
    member t.Dispose select =
        let tape,mp = d.[select]
        let memory,dm_pointer = mp
        for x in tape do 
            match x with
            | :? Df as x -> ()
            | :? DM as x -> x.r.Dispose()
            | _ -> failwith "Type not supported"
        for x in memory do x.Dispose()
        tape.Clear()
        memory.Clear()
        dm_pointer := 0
    /// Clears the selected tape without disposing it or the memory buffer.
    /// Also sets the pointer to zero for the selected tape.
    member t.Clear select =
        if d.ContainsKey(select) then
            let tape,mp = d.[select]
            let memory,dm_pointer = mp
            tape.Clear()
            dm_pointer := 0
    /// Disposes all the elements of all the tapes and then clears them including the memory buffer.
    member t.DisposeAll() =
        for tape,mp in d.Values do
            let memory,dm_pointer = mp
            for x in tape do 
                match x with
                | :? Df as x -> ()
                | :? DM as x -> x.r.Dispose()
                | _ -> failwith "Type not supported"
            for x in memory do x.Dispose()
            tape.Clear()
            memory.Clear()
            dm_pointer := 0
    /// Returns an empty DM_rec if none exists at the pointer and adds it to the memory buffer, else it returns the DM_rec at the pointer to be reused.
    /// Increments the pointer afterwards.
    member t.GetDMIf =
        if d.ContainsKey(select) then
            let _, mp = d.[select]
            let memory,dm_pointer = mp
            if memory.Count > !dm_pointer then
                dm_pointer := !dm_pointer+1
                memory.[!dm_pointer-1]
            else
                dm_pointer := !dm_pointer+1
                let t = DM_rec.createEmpty
                memory.Add(t)
                t
        else
            let tape = List()
            let memory = List()
            let dm_pointer = ref 1
            d.Add(select, (tape, (memory,dm_pointer)))
            let t = DM_rec.createEmpty
            memory.Add(t)
            t

/// The global tape instance.
let tape = tapeType()

The above class is the one most likely to change in the future.

Not only does it store the steps of the program in the order they are defined, it also houses the memory buffers, which are the current solution for reusing GPU-allocated memory.

During the first build step, GetDMIf is called and returns an empty DM_rec while storing it in the memory buffer. Then, during the forward step, the program resizes those records. After the reverse step, the local (not global) tape that records the steps is cleared. During subsequent passes, GetDMIf then returns the previously allocated DM_recs. It works quite nicely as an optimization and gives us that 100% speed gain from memory reuse, even for LSTMs.
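Putting the three steps together, one training pass over the global tape might look roughly like this. This is a hedged sketch: `buildGraph`, `input`, `target` and `num_iterations` are illustrative names, not library functions, and the weight update is elided.

```fsharp
// Hypothetical outline of a training loop over the global tape.
for iter = 1 to num_iterations do
    tape.Select 0
    let cost = buildGraph input target // build step: nodes and ff/fb closures go on the tape
    tape.forwardpropTape 0             // forward step: run every ff(), resizing as needed
    tape.resetTapeAdjoint 0            // the adjoints accumulate, so zero them first
    cost.r.A := 1.0f                   // seed the adjoint of the scalar cost (a Df node)
    tape.reversepropTape 0             // reverse step: run every fb() from the top down
    // ...apply the gradients to the weight nodes here...
    tape.Clear 0                       // drop the closures; the memory buffer survives for reuse
```

Because Clear keeps the memory buffer while dropping the closures, the next build step's GetDMIf calls hit the already-allocated DM_recs.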

Here are a few examples of library functions. They are similar to the ones from the basics of AD tutorial.

let hadamaradMultiplicationModule = lazy new DeviceBinaryTransformModule "x*y;"
let hadamaradMultiplicationErrorModule = lazy new DeviceTrinaryTransformModule "x*y+z;"
/// Hadamard (elementwise) multiplication function.
let hadmult (a: DM) (b: DM) =
    let va = a.r.P
    let vb = b.r.P
    let el = a.r.A
    let er = b.r.A

    let node = tape.GetDMIf
    let c = node.P
    let error = node.A

    let ff () = 
        let nr, nc = va.rc
        node.Resize nr nc
        hadamaradMultiplicationModule.Value.A(va, vb, c)
    let fb () = 
        if a.r.is_constant = false then hadamaradMultiplicationErrorModule.Value.A(vb,error,el,el)
        if b.r.is_constant = false then hadamaradMultiplicationErrorModule.Value.A(va,error,er,er)
    let t = DM.create(node,ff,fb)
    tape.Add(t)
    t

In the forward step, it resizes the node which houses the primal and the adjoint, and then calls the Hadamard multiplication module function.

The new thing here is the lazy prefix. It wraps an expression in something like a function, deferring evaluation until the value is first needed.

let a = lazy (5*5)
// *Do something else*
printfn "%i" a.Value // Only now evaluate a.

The above example is not too good, but imagine having over a dozen DeviceModule expressions like the ones for the Hadamard module. Each of them requires roughly 0.35s to evaluate even if it is never used. By making them lazy, the time it takes to load the library script drops from 9.3s to 4.1s on my machine.

By the end of the function call, the program has created a node and two closures holding the forward and backward steps, and added them to the tape.

There are really two 'tapes', so to speak, inside the global tape, and they are separate. The closure tape is detached from the memory buffer tape, so if they are run on wildly different functions the buffers might become miscalibrated. This is not really a problem with machine learning algorithms, though. Even the LSTM, whose branching factor depends on the sequence length, is a pretty benign case.

/// Matrix-matrix multiply.
let matmult (a: DM) (b:DM) =
    let va = a.r.P
    let vb = b.r.P
    let el = a.r.A
    let er = b.r.A

    let node = tape.GetDMIf
    let c = node.P
    let error = node.A
        
    let ff () = 
        let nr = (va).num_rows
        let nc = (vb).num_cols
        node.Resize nr nc
        gemm2 nT nT 1.0f va vb 0.0f c
    let fb () = 
        if a.r.is_constant = false then gemm2 nT T 1.0f error vb 1.0f el // The derivative with respect to the left argument, so the adjoint gets inserted from the left side. Usually error * input.
        if b.r.is_constant = false then gemm2 T nT 1.0f va error 1.0f er // The derivative with respect to the right argument, so the adjoint gets inserted from the right side. Usually weights * error.
    let t = DM.create(node,ff,fb)
    tape.Add(t)
    t

As in all the other functions, the forward step sets the node, while the backward step adds to the adjoints. This requires manually resetting the adjoints to zero before every reverse pass by calling tape.resetTapeAdjoint.
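The reason the backward step must add rather than assign is fan-out: a node that feeds more than one consumer receives a gradient contribution from each of them. A scalar sketch of the bookkeeping, using illustrative names:

```fsharp
// Computing y = x*x as a product of x with itself, hadmult-style.
// Each branch's fb() adds its own share to x's adjoint.
let xP = 3.0
let xA = ref 0.0
let yA = 1.0                 // seed: dy/dy = 1
xA := !xA + yA * xP          // contribution through the left argument
xA := !xA + yA * xP          // contribution through the right argument
// !xA = 6.0 = d(x*x)/dx, which is only right because the adjoint started at zero
```

Had a stale adjoint been left over from a previous pass, the same sum would silently produce a wrong gradient, hence the mandatory reset.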

The above functions are really trivial to add to the library, and with the current design it is readily extensible.

Another example would be the one I just finished recently.

let clipModule = lazy new DeviceTrinaryCoefTransformModule "((x < coef_x) ? coef_x : (x > coef_y ? coef_y : x))+coef_z;"
let clipErrorModule = lazy new DeviceTrinaryCoefTransformModule "y*((x < coef_x) ? 0.0f : (x > coef_y ? 0.0f : 1.0f))+z;"
/// o <- clip(min,max,a)+scalar
/// The clip function. Can be used as Relu by setting max to positive infinity. 
/// Can be used to make linear clipped sigmoid by setting min,max,scalar to -0.5f,0.5f,0.5f.
let clip min max a scalar =
    let va = a.r.P
    let el = a.r.A

    let node = tape.GetDMIf
    let c = node.P
    let error = node.A

    let ff () = 
        let nr,nc = (va).rc
        node.Resize nr nc
        clipModule.Value.A(min,va,max,va,scalar,va,c)
    let fb () = 
        clipErrorModule.Value.A(min,va,max,error,max,el,el)
    let t = DM.create(node,ff,fb)
    tape.Add(t)
    t

let inline clipped_sigmoid x = clip 0.0001f 0.9999f (sigmoid x) 0.0f
let inline clipped_steep_sigmoid coef x = clip 0.0001f 0.9999f (steep_sigmoid coef x) 0.0f
let inline relu x = clip 0.0f Single.PositiveInfinity x 0.0f

// The linear versions of the sigmoid and tanh.
let inline clipped_linear_sigmoid x = clip -0.4999f 0.4999f x 0.5f // Clipped linear sigmoid in the [0.0001,0.9999] range.
let inline linear_sigmoid x = clip -0.5f 0.5f x 0.0f // Linear sigmoid in the [0.0,1.0] range.
let inline linear_tanh x = clip -1.0f 1.0f x 0.0f // Linear tanh in the [-1.0,1.0] range.

A good way to avoid NaN errors in logistic regression is to clip the minimum and maximum output values to something close to 1 and 0, but not exactly 1 or 0. Otherwise, gradient descent will eventually push the outputs in the final layer to exactly 1 or 0 and the net will blow up. The library also has a steep sigmoid variant, which works better than the regular sigmoid with LSTMs – or so I've heard.
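To see why the clipping matters, consider the cross entropy cost on plain floats (just for illustration): with the output confined to [0.0001, 0.9999], both logarithms stay bounded.

```fsharp
// Worst case after clipping: the output saturates at 0.9999 while the target is 0.
let o = 0.9999f
let t = 0.0f
let ce = -(t * log o + (1.0f - t) * log (1.0f - o))
// log 0.0001f is about -9.21, so the cost is large but finite; without the clip,
// o = 1.0f would give log 0.0f = -infinity and the gradients would blow up.
```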

I've tried the clipped linear tanh and sigmoid and they really do not work well in feedforward regimes, though. On the XOR problem the relu gets stuck sometimes. It really varies from problem to problem.

Actually the clipped linear versions worked really well for me two months or so ago while I was writing everything by hand. At the time out of laziness, I did not implement the cost function properly in the last layer and just let all the gradients flow through and the clipped linear function worked better than the regular sigmoid for me.

It might be worth doing some research on mismatched clipped functions that let the gradients flow backwards in an irregular manner.

Further potential improvements to the library

The really major speed gains for the library come from fusing functions together, as I did in the clip function where the forward step combines the min, max and add operations into a single pass.

Likewise, one potential improvement is to extend the linear layer into a nonlinear layer. That would eliminate the need to have a primal and an adjoint for the activation step. At that point the library would be as efficient as my previous handwritten (and faulty) code.

As an AD library, Spiral is really specialized at this point – not so much for reverse mode as for parallel operations on large collections. As I consider the possibilities, one thing that greatly interests me now is source transformation. What if, instead of making these closures, I somehow made the functions translate the tape into native code directly?

In the AD literature there are such source transformation tools, and compared to their operator overloading brethren they are an order of magnitude or two faster. However, even though I am interested regardless, I do not think there are significant gains to be had in this direction for Spiral. It would be different if it were operating directly on scalars, but the overhead of extra function calls on large collections is insignificant compared to the scalar case.

Conclusion

There is a lot to be satisfied about with the current design of the library. It essentially combines the best of OO, functional and even imperative design to make coding machine learning algorithms on the GPU a relatively straightforward affair. Past this point, getting the library to the next level is simply a matter of adding features to it. Bug reports and new feature pull requests are welcome.

Oh and before I forget, some of the code in the last few posts is already a tad outdated. I always recommend checking out the source instead of this tutorial.
