The LSTM Reber grammar example

The Reber grammar is a simple string generator that will be used to showcase the LSTM. In this case it will be the more complex embedded version of the Reber grammar, which has long-term dependencies.
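
Every embedded Reber string has the form B T <Reber string> T E or B P <Reber string> P E: the symbol emitted right after the opening B has to be reproduced just before the closing E, with an ordinary Reber string in between, so the net has to carry that symbol across the whole sequence. The following one-liner, run against the make_reber_set generator defined below, checks that dependency (a usage sketch, not part of the example file):

make_reber_set 100 |> Seq.forall (fun (s,_,_) -> s.[1] = s.[s.Length-2]) // evaluates to true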

// The function that generates embedded Reber grammars.
// This is a more difficult kind with long term dependencies.

// http://christianherta.de/lehre/dataScience/machineLearning/neuralNetworks/reberGrammar.php

type reberNode =
    | NodeF  // Only outputs B.
    | NodeS  // Can only receive B. Outputs T or P.

    | Node0a // Can only receive T. Outputs B.
    | Node1a // Can only receive B. Outputs T or P.
    | Node2a // Can receive T or S. Outputs S or X.
    | Node3a // Can receive P or T. Outputs V or T.
    | Node4a // Can receive X or P. Outputs X or S.
    | Node5a // Can only receive V. Outputs P or V.
    | Node6a // Can receive S or V. Outputs E.
    | Node7a // Can only receive E. Outputs T.

    | Node0b // Can only receive P. Outputs B.
    | Node1b // Can only receive B. Outputs T or P.
    | Node2b // Can receive T or S. Outputs S or X.
    | Node3b // Can receive P or T. Outputs V or T.
    | Node4b // Can receive X or P. Outputs X or S.
    | Node5b // Can only receive V. Outputs P or V.
    | Node6b // Can receive S or V. Outputs E.
    | Node7b // Can only receive E. Outputs P.

    | Node8  // Can receive T or P. Outputs E.

let rng = System.Random()

// One-hot encodings of the seven symbols B, T, P, S, X, V and E.
let b_string = [|1.0f;0.0f;0.0f;0.0f;0.0f;0.0f;0.0f|]
let t_string = [|0.0f;1.0f;0.0f;0.0f;0.0f;0.0f;0.0f|]
let p_string = [|0.0f;0.0f;1.0f;0.0f;0.0f;0.0f;0.0f|]
let s_string = [|0.0f;0.0f;0.0f;1.0f;0.0f;0.0f;0.0f|]
let x_string = [|0.0f;0.0f;0.0f;0.0f;1.0f;0.0f;0.0f|]
let v_string = [|0.0f;0.0f;0.0f;0.0f;0.0f;1.0f;0.0f|]
let e_string = [|0.0f;0.0f;0.0f;0.0f;0.0f;0.0f;1.0f|]

// Two-hot prediction targets for the states where either of two symbols is a valid continuation.
let t_p_string = [|0.0f;1.0f;1.0f;0.0f;0.0f;0.0f;0.0f|]
let t_v_string = [|0.0f;1.0f;0.0f;0.0f;0.0f;1.0f;0.0f|]
let s_x_string = [|0.0f;0.0f;0.0f;1.0f;1.0f;0.0f;0.0f|]
let p_v_string = [|0.0f;0.0f;1.0f;0.0f;0.0f;1.0f;0.0f|]


let rec make_random_reber_string str list prediction node =
    match node with
        | NodeF ->
            make_random_reber_string "B" [b_string] [b_string] NodeS
        | NodeS ->
            let p = rng.NextDouble()
            if p > 0.5 then make_random_reber_string (str+"T") (t_string::list) (t_p_string::prediction) Node0a else make_random_reber_string (str+"P") (p_string::list) (t_p_string::prediction) Node0b

        | Node0a ->
            make_random_reber_string (str+"B") (b_string::list) (b_string::prediction) Node1a
        | Node1a ->
            let p = rng.NextDouble()
            if p > 0.5 then make_random_reber_string (str+"T") (t_string::list) (t_p_string::prediction) Node2a else make_random_reber_string (str+"P") (p_string::list) (t_p_string::prediction) Node3a
        | Node2a ->
            let p = rng.NextDouble()
            if p > 0.5 then make_random_reber_string (str+"S") (s_string::list) (s_x_string::prediction) Node2a else make_random_reber_string (str+"X") (x_string::list) (s_x_string::prediction) Node4a
        | Node3a ->
            let p = rng.NextDouble()
            if p > 0.5 then make_random_reber_string (str+"T") (t_string::list) (t_v_string::prediction) Node3a else make_random_reber_string (str+"V") (v_string::list) (t_v_string::prediction) Node5a
        | Node4a ->
            let p = rng.NextDouble()
            if p > 0.5 then make_random_reber_string (str+"X") (x_string::list) (s_x_string::prediction) Node3a else make_random_reber_string (str+"S") (s_string::list) (s_x_string::prediction) Node6a
        | Node5a ->
            let p = rng.NextDouble()
            if p > 0.5 then make_random_reber_string (str+"P") (p_string::list) (p_v_string::prediction) Node4a else make_random_reber_string (str+"V") (v_string::list) (p_v_string::prediction) Node6a
        | Node6a ->
            make_random_reber_string (str+"E") (e_string::list) (e_string::prediction) Node7a
        | Node7a ->
            make_random_reber_string (str+"T") (t_string::list) (t_string::prediction) Node8

        | Node0b ->
            make_random_reber_string (str+"B") (b_string::list) (b_string::prediction) Node1b
        | Node1b ->
            let p = rng.NextDouble()
            if p > 0.5 then make_random_reber_string (str+"T") (t_string::list) (t_p_string::prediction) Node2b else make_random_reber_string (str+"P") (p_string::list) (t_p_string::prediction) Node3b
        | Node2b ->
            let p = rng.NextDouble()
            if p > 0.5 then make_random_reber_string (str+"S") (s_string::list) (s_x_string::prediction) Node2b else make_random_reber_string (str+"X") (x_string::list) (s_x_string::prediction) Node4b
        | Node3b ->
            let p = rng.NextDouble()
            if p > 0.5 then make_random_reber_string (str+"T") (t_string::list) (t_v_string::prediction) Node3b else make_random_reber_string (str+"V") (v_string::list) (t_v_string::prediction) Node5b
        | Node4b ->
            let p = rng.NextDouble()
            if p > 0.5 then make_random_reber_string (str+"X") (x_string::list) (s_x_string::prediction) Node3b else make_random_reber_string (str+"S") (s_string::list) (s_x_string::prediction) Node6b
        | Node5b ->
            let p = rng.NextDouble()
            if p > 0.5 then make_random_reber_string (str+"P") (p_string::list) (p_v_string::prediction) Node4b else make_random_reber_string (str+"V") (v_string::list) (p_v_string::prediction) Node6b
        | Node6b ->
            make_random_reber_string (str+"E") (e_string::list) (e_string::prediction) Node7b
        | Node7b ->
            make_random_reber_string (str+"P") (p_string::list) (p_string::prediction) Node8

        | Node8 ->
            (str+"E"), ((e_string::list) |> List.rev), ((e_string::prediction) |> List.rev)

open System.Collections.Generic
let make_reber_set num_examples =
    let mutable c = 0
    let reber_set = new HashSet<string * float32 [] list * float32 [] list>()

    while c < num_examples do
        if reber_set.Add (make_random_reber_string "" [] [] NodeF) then c <- c+1
    reber_set

The above is the string generator. In the example that follows, make_reber_set is called to generate 3000 unique random strings, and those of length 20 are then taken as the training set. That usually comes out to around 320 or so examples.
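
To see how the string lengths are distributed in a freshly generated set, something along these lines can be run in the REPL (a usage sketch, not part of the example file):

make_reber_set 3000
|> Seq.countBy (fun (s,_,_) -> s.Length)
|> Seq.sortBy fst
|> Seq.iter (fun (len, count) -> printfn "length %i: %i examples" len count)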

Here are the functions that process that data and load it onto the GPU:

// Spiral reverse AD example. Used for testing.
// Embedded Reber grammar LSTM example.

#load "../Spiral Library/ad_utils_spiral_v1.fsx"
open Ad_utils_spiral_v1

#r "../packages/FSharp.Charting.0.90.13/lib/net40/FSharp.Charting.dll"
#r "System.Windows.Forms.DataVisualization.dll"

open FSharp.Charting

#load "embedded_reber.fsx"
open Embedded_reber

let reber_set = make_reber_set 3000

let make_data_from_set target_length =
    let twenties = reber_set |> Seq.filter (fun (a,b,c) -> a.Length = target_length) |> Seq.toArray
    let batch_size = (twenties |> Seq.length)

    let d_training_data =
        [|
        for i=0 to target_length-1 do
            let input = [|
                for k=0 to batch_size-1 do
                    let example = twenties.[k]
                    let s, input, output = example
                    yield input.[i] |] |> Array.concat
            yield DM.makeConstantNode(7,batch_size,input)|]

    let d_target_data =
        [|
        for i=1 to target_length-1 do // The targets are one less than the inputs. This has the effect of shifting them to the left.
            let output = [|
                for k=0 to batch_size-1 do
                    let example = twenties.[k]
                    let s, input, output = example
                    yield output.[i] |] |> Array.concat
            yield DM.makeConstantNode(7,batch_size,output)|]

    d_training_data, d_target_data
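
Each element of d_training_data is a 7 x batch_size constant matrix holding one time step for the whole batch, and the target sequence is one step shorter than the input sequence so that the net is always asked to predict the valid next symbols. A quick sanity check (a usage sketch; only the function defined above is assumed):

let d_in, d_tgt = make_data_from_set 20
printfn "%i input steps, %i target steps" d_in.Length d_tgt.Length // prints "20 input steps, 19 target steps"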

Here is the training function:

let lstm_embedded_reber_train num_iters learning_rate (data: DM[]) (targets: DM[]) (data_v: DM[]) (targets_v: DM[]) clip_coef (l1: LSTMLayer) (l2: FeedforwardLayer) =
    [|
    let l1 = l1
    let l2 = l2
    
    let base_nodes = [|l1.ToArray;l2.ToArray|] |> Array.concat

    let training_loop (data: DM[]) (targets: DM[]) i =
        tape.Select i
        let costs = [|
            let mutable a, c = l1.runLayerNoH data.[0]
            let b = l2.runLayer a
            let r = squared_error_cost targets.[0] b
            yield r
    
            for i=1 to data.Length-2 do
                let a',c' = l1.runLayer data.[i] a c
                a <- a'; c <- c'
                let b = l2.runLayer a
                let r = squared_error_cost targets.[i] b
                yield r
            |]
        scale (1.0f/float32 (costs.Length-1)) (sum_scalars costs)

    let ts = (data.Length-1)
    let vs = ts//(data_v.Length-1) // It is possible to put the validation set on a separate branch and run them that way. In the earlier versions of Spiral before efficient tape rebuilding that is how it was done for speed.
                                   // In such case each branch would also have its memory buffer. Keep it all on one branch for maximum reuse.

    let mutable r' = 0.0f
    let mutable i = 1
    while i <= num_iters && System.Single.IsNaN r' = false do
        
        let rv = training_loop data_v targets_v vs
        
        tape.forwardpropTape vs
        printfn "The validation cost is %f at iteration %i" !rv.r.P i
        
        tape.Clear vs

        let r = training_loop data targets ts

        tape.forwardpropTape ts
        printfn "The training cost is %f at iteration %i" !r.r.P i
        
        yield !r.r.P, !rv.r.P

        tape.resetTapeAdjoint -1 // Resets base adjoints
        tape.resetTapeAdjoint ts // Resets the adjoints for the training select
        r.r.A := 1.0f
        tape.reversepropTape ts // Runs the reverse step.
        add_gradients_to_weights base_nodes learning_rate clip_coef

        tape.Clear ts
        //tape.Clear vs

        i <- i+1
        r' <- !r.r.P
    |]

Build the tape, run it forward by calling forwardpropTape, reset the adjoints, and run it backwards; this is similar to the previous examples.
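
Condensed, one training iteration in the loop above comes down to the following calls, in the order they appear in the function:

let r = training_loop data targets ts                        // selects the ts branch and builds the graph
tape.forwardpropTape ts                                      // forward pass
tape.resetTapeAdjoint -1                                     // zeroes the adjoints of the base weights
tape.resetTapeAdjoint ts                                     // zeroes the adjoints on the training branch
r.r.A := 1.0f                                                // seeds the adjoint of the cost
tape.reversepropTape ts                                      // reverse pass
add_gradients_to_weights base_nodes learning_rate clip_coef  // gradient-clipped SGD update
tape.Clear ts                                                // frees the nodes built during this pass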

In the cost function, since there is one output for each time step, we sum the per-step squared-error costs with sum_scalars and scale the total, making the result of the function a single scalar in R^1.

Here is the last bit of code in the example file, which creates the LSTM layers and runs the training:

let d_training_data_20, d_target_data_20 = make_data_from_set 20
let d_training_data_validation, d_target_data_validation = make_data_from_set 30

let hidden_size = 64

let l1 = LSTMLayer.createRandomLSTMLayer hidden_size 7 tanh_ tanh_
let l2 = FeedforwardLayer.createRandomLayer 7 hidden_size (steep_sigmoid 3.0f)

// Add the base nodes to the tape for easier resetting and disposal.
tape.Select -1
[|l1.ToArray;l2.ToArray|] |> Array.concat |> Array.iter (fun x -> tape.Add x)

let learning_rate = 5.0f
// This iteration is to warm up the library. It compiles all the lazy Cuda modules.
lstm_embedded_reber_train 1 learning_rate d_training_data_20 d_target_data_20 d_training_data_validation d_target_data_validation 1.0f l1 l2
#time
let s = [|
        yield lstm_embedded_reber_train 99 learning_rate d_training_data_20 d_target_data_20 d_training_data_validation d_target_data_validation 1.0f l1 l2
        |] |> Array.concat
#time
// On the GTX 970, I get 3-4s depending on how hot the GPU is.
let l = [|for l,_ in s do yield l|]
let r = [|for _,r in s do yield r|]

(Chart.Combine [|Chart.Line l;Chart.Line r|]).ShowChart()
tape.DisposeAll() // Disposes all the elements of all the tapes and then clears them including the memory buffer.

The main difference from the previous example is that the base nodes are added directly into the tape on the -1 branch (the default is 0). That allows us to call resetTapeAdjoint on that branch.

I did not find it to be much of a time saver, so I omitted this feature in the feedforward net examples.

Here is the main body of the LSTM layer. There is also a GRU layer in the library:

type LSTMLayer =
    {W_z:DM  // Input weight matrix for the block input
     U_z:DM  // Recurrent weight matrix for the block input
     b_z:DM  // Bias vector for the block input

     W_i:DM  // Input weight matrix for the input gate
     U_i:DM  // Recurrent weight matrix for the input gate
     b_i:DM  // Bias vector for the input gate
     P_i:DM  // Peephole weight matrix for the input gate

     W_f:DM  // Input weight matrix for the forget gate
     U_f:DM  // Recurrent weight matrix for the forget gate
     b_f:DM  // Bias vector for the forget gate
     P_f:DM  // Peephole weight matrix for the forget gate

     W_o:DM  // Input weight matrix for the output gate
     U_o:DM  // Recurrent weight matrix for the output gate
     b_o:DM  // Bias vector for the output gate
     P_o:DM  // Peephole weight matrix for the output gate

     block_input_a : DM -> DM
     block_output_a : DM -> DM
     } with
    
    /// Returns all the weights in an array.
    member l.ToArray = [|l.W_z;l.U_z;l.b_z;l.W_i;l.U_i;l.b_i;l.P_i;l.W_f;l.U_f;l.b_f;l.P_f;l.W_o;l.U_o;l.b_o;l.P_o|]
    static member fromArray (a: DM[]) block_input_a block_output_a =
        {
         W_z = a.[0]
         U_z = a.[1]
         b_z = a.[2]

         W_i = a.[3]
         U_i = a.[4]
         b_i = a.[5]
         P_i = a.[6]

         W_f = a.[7]
         U_f = a.[8]
         b_f = a.[9]
         P_f = a.[10]

         W_o = a.[11]
         U_o = a.[12]
         b_o = a.[13]
         P_o = a.[14]

         block_input_a = block_input_a
         block_output_a = block_output_a
        }

    static member createRandomLSTMLayer hidden_size input_size block_input_a block_output_a =
        {
        W_z = DM.makeUniformRandomNode(hidden_size, input_size)
        U_z = DM.makeUniformRandomNode(hidden_size, hidden_size)
        b_z = DM.makeUniformRandomNode(hidden_size, 1)

        W_i = DM.makeUniformRandomNode(hidden_size, input_size)
        U_i = DM.makeUniformRandomNode(hidden_size, hidden_size)
        b_i = DM.makeUniformRandomNode(hidden_size, 1)
        P_i = DM.makeUniformRandomNode(hidden_size, hidden_size)

        W_f = DM.makeUniformRandomNode(hidden_size, input_size)
        U_f = DM.makeUniformRandomNode(hidden_size, hidden_size)
        b_f = DM.makeUniformRandomNode(hidden_size, 1)
        P_f = DM.makeUniformRandomNode(hidden_size, hidden_size)

        W_o = DM.makeUniformRandomNode(hidden_size, input_size)
        U_o = DM.makeUniformRandomNode(hidden_size, hidden_size)
        b_o = DM.makeUniformRandomNode(hidden_size, 1)
        P_o = DM.makeUniformRandomNode(hidden_size, hidden_size)

        block_input_a = block_input_a
        block_output_a = block_output_a
        }

    member l.runLayer (x:DM) (y:DM) (c:DM) =
        let block_input = linear_layer [|l.W_z,x;l.U_z,y|] [||] (Some l.b_z) |> l.block_input_a
        let input_gate = linear_layer [|l.W_i,x;l.U_i,y;l.P_i,c|] [||] (Some l.b_i) |> sigmoid
        let forget_gate = linear_layer [|l.W_f,x;l.U_f,y;l.P_f,c|] [||] (Some l.b_f) |> sigmoid
        let c' = linear_layer [||] [|block_input,input_gate;c,forget_gate|] None
        let output_gate = linear_layer [|l.W_o,x;l.U_o,y;l.P_o,c'|] [||] (Some l.b_o) |> sigmoid
        hadmult (l.block_output_a c') output_gate, c'

    member l.runLayerNoH (x:DM) =
        let block_input = linear_layer [|l.W_z,x|] [||] (Some l.b_z) |> l.block_input_a
        let input_gate = linear_layer [|l.W_i,x|] [||] (Some l.b_i) |> sigmoid
        let forget_gate = linear_layer [|l.W_f,x|] [||] (Some l.b_f) |> sigmoid
        let c' = hadmult block_input input_gate
        let output_gate = linear_layer [|l.W_o,x;l.P_o,c'|] [||] (Some l.b_o) |> sigmoid
        hadmult (l.block_output_a c') output_gate, c'

    member l.runLayerNoI (y:DM) (c:DM) =
        let block_input = linear_layer [|l.U_z,y|] [||] (Some l.b_z) |> l.block_input_a
        let input_gate = linear_layer [|l.U_i,y;l.P_i,c|] [||] (Some l.b_i) |> sigmoid
        let forget_gate = linear_layer [|l.U_f,y;l.P_f,c|] [||] (Some l.b_f) |> sigmoid
        let c' = linear_layer [||] [|block_input,input_gate;c,forget_gate|] None
        let output_gate = linear_layer [|l.U_o,y;l.P_o,c'|] [||] (Some l.b_o) |> sigmoid
        hadmult (l.block_output_a c') output_gate, c'
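
Written out, runLayer is the peephole LSTM step. With sigma denoting the logistic sigmoid, .* elementwise multiplication, x the input, y the previous output and c the previous cell state, it computes:

z  = block_input_a (W_z*x + U_z*y + b_z)        // block input
i  = sigma (W_i*x + U_i*y + P_i*c + b_i)        // input gate
f  = sigma (W_f*x + U_f*y + P_f*c + b_f)        // forget gate
c' = z .* i + c .* f                            // new cell state
o  = sigma (W_o*x + U_o*y + P_o*c' + b_o)       // output gate
y' = block_output_a c' .* o                     // new output

runLayerNoH is the same step for the first time step, where there is no previous output or cell state yet, and runLayerNoI is a step that takes no input x, using only the recurrent and peephole weights.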

There is a lot of boilerplate, but the sheer effort saved over deriving the gradients of the runLayer functions by hand is astounding.

Simply being able to write runLayer, and having the confidence that the program will do what the user intended it to, is the reason for the library's existence.

Conclusion of the first series

I actually tried making an LSTM and a GRU a few times and failed because the code became such a bloated mess. Then I wrote this library. Those failures are what have motivated me thus far.

I understand that this tutorial is pretty code-dump heavy and intended for a niche audience. If you found it helpful then I am glad. So far there is a dearth of good reverse-mode AD libraries for Windows that work on the GPU. I hope that with this I've filled that gap a little.

It could not have been written nearly as easily in any other language as in F#.

Let me talk about languages, the great benefit they bring to their users, and the great gap there is between ‘possible’ and ‘it will get done.’

In theory this library could have been written in assembly, but not so much in reality.

It stands to reason that if we imagine looking back at where we are now from some distant future, the tools we use, hardware certainly, but especially software, will seem lacking.

The Cuda libraries as they exist today are a straitjacket, and the kernels I have dumped into this library, while entirely necessary, are an embarrassment in the sense that they needed to be written at all. Routines of that sort should be in the Nvidia libraries, and they should be open sourced. I cannot imagine that in the future one won't simply be able to fuse such kernels together in a functional fashion.

Here at the start of 2016, the ML spring has come and I find myself wanting a good library and writing GPU select routines by hand. I am also constrained by the cuBLAS library, and no doubt other researchers are too.

Take for example the gemm function (general matrix-matrix multiply). It is fine if one wants to take the dot product (a*b) of every vector in one matrix with every vector in the other, but what if one wants to compute (a-b)^2 instead?

Wouldn't it be interesting to stack a layer of k-means on top of another k-means, like with the WTA autoencoder? What is so special about a*b anyway?
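
To make the idea concrete, here is a minimal CPU sketch (not part of the library, and with no pretense of performance) of the kind of generalized matrix product being asked for: the usual gemm reduces with x*y, while passing (x-y)*(x-y) gives the pairwise squared distances that a k-means layer would need.

let generalized_gemm (op: float32 -> float32 -> float32) (a: float32[,]) (b: float32[,]) =
    let rows, inner, cols = Array2D.length1 a, Array2D.length2 a, Array2D.length2 b
    Array2D.init rows cols (fun i j ->
        let mutable acc = 0.0f
        for k in 0..inner-1 do acc <- acc + op a.[i,k] b.[k,j]
        acc)

let standard_product a b = generalized_gemm (fun x y -> x*y) a b          // the ordinary gemm reduction
let squared_distances a b = generalized_gemm (fun x y -> (x-y)*(x-y)) a b // the (a-b)^2 variant from above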

GPU algorithms especially could benefit greatly from increased meta-programming capabilities. GPU kernels, while very simplistic, are incredibly intricate and hard to get right. Inserting strings like “(a-b)*(a-b);” into a kernel template is merely the first step. Fusing a series of gemm calls onto a single target, without the scheduler turning on the red lights, and having them switch to using atomics would be another good step.

Without the basics firmly in place nothing else will proceed.
