Basics of automatic differentiation

In the last post I gave an example of a true feedforward net, if a small one. All neural nets are really just a class of functions R^n -> R.

The XOR network rewritten in more mathy notation would be:

y = sum((targets-sigmoid(W2 * tanh(W*input+bias) + bias2))^2)

The challenge of optimizing this neural net is to find values of the W, W2, bias and bias2 variables such that the cost is as close as possible to zero. To do that one must calculate the derivatives of the cost with respect to those variables and use an algorithm such as gradient descent to push the weights closer towards their optimal values.
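To give a concrete picture of that update, here is a minimal sketch of a single gradient descent step. The learning_rate and the example gradient value are hypothetical, not taken from the library:

let learning_rate = 0.1f
// One step of gradient descent: move the weight against its gradient to lower the cost.
let descend (w: float32) (dcost_dw: float32) = w - learning_rate * dcost_dw
// For example, if dcost/dW happened to be 0.0714f:
let W' = descend 1.5f 0.0714f // = 1.49286f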

In school, I am assuming, you learned how to take the derivatives of polynomials and various complex expressions by hand. For myself, I now think that most of that was a waste of time.

It is possible to calculate the derivatives of an arbitrary expression such as the one above completely automatically in both the forward and reverse mode by simple local rules.

To do that, let me simplify the above for clarity's sake and assume that W, W2, bias and bias2 are not matrices, but single variables instead. I'll also take out the sum, as there is nothing to sum over, and change 'targets' to 'target'.

y = (target-sigmoid(W2 * tanh(W*input+bias) + bias2))^2


let sigmoid a = 1.0f/(1.0f+exp(-a))

let target = -0.5f
let input = 1.0f

let W = 1.5f
let W2 = 2.0f
let bias = 0.25f
let bias2 = 0.0f

let y = (target-sigmoid(W2 * tanh(W*input+bias) + bias2))**2.0f

Output:


val sigmoid : a:float32 -> float32
val target : float32 = -0.5f
val input : float32 = 1.0f
val W : float32 = 1.5f
val W2 : float32 = 2.0f
val bias : float32 = 0.25f
val bias2 : float32 = 0.0f
val y : float32 = 1.87122381f

A person calculating this by hand, and the compiler itself, would break the large expression down into individual pieces and evaluate them separately.

Let us go through it step by step:


// These are the original assignments. In AD literature the starting variables are usually indexed from 1-n up to 0, but here they will simply start at 0.
// This is the evaluation trace the program would have produced had I decompiled it.
let v0 = target
let v1 = input
let v2 = W
let v3 = W2
let v4 = bias
let v5 = bias2
// The first calculation is W*input = v2*v1.
let v6 = v2*v1
// Then comes v6+bias=v6+v4
let v7 = v6+v4
// Then comes tanh(v7)
let v8 = tanh(v7)
// Then comes W2*v8=v3*v8.
let v9 = v3*v8
// Then comes v9+bias2=v9+v5
let v10 = v9+v5
// Then comes sigmoid(v10)
let v11 = sigmoid(v10)
// Then comes target-v11=v0-v11
let v12 = v0-v11
// Then comes v12**2.0f
let v13 = v12**2.0f

Output:


val v0 : float32 = -0.5f
val v1 : float32 = 1.0f
val v2 : float32 = 1.5f
val v3 : float32 = 2.0f
val v4 : float32 = 0.25f
val v5 : float32 = 0.0f
val v6 : float32 = 1.5f
val v7 : float32 = 1.75f
val v8 : float32 = 0.941375554f
val v9 : float32 = 1.88275111f
val v10 : float32 = 1.88275111f
val v11 : float32 = 0.867926776f
val v12 : float32 = -1.36792684f
val v13 : float32 = 1.87122381f

v13 is exactly the same as y.

Reverse AD

Let us put aside the above evaluation trace for a moment and consider how one would differentiate a much simpler example: a*b, for starters.


module Isolated =
    let a = 3.0f
    let b = 2.0f
    let isolated_mult a b =
        let c = a*b
        let dc_a = b // dc/da = b - The derivative of c with respect to a
        let dc_b = a // dc/db = a - The derivative of c with respect to b
        c,dc_a,dc_b
    let c,dc_a,dc_b = isolated_mult a b

Output:


module Isolated = begin
  val a : float32 = 3.0f
  val b : float32 = 2.0f
  val isolated_mult : a:float32 -> b:float32 -> float32 * float32 * float32
  val dc_b : float32 = 3.0f
  val dc_a : float32 = 2.0f
  val c : float32 = 6.0f
end

Picture it as a graph with nodes a and b leading into c. In the isolated example above, since c has no nodes above it, the steps shown would indeed be correct. But had this been part of a larger expression, such as (a*a)*(b*b), it would be wrong.

What needs to be added is the error coming from above.

dc_a and dc_b need to know the error at c. In reverse mode parlance such directional gradients or errors are also called adjoints.


// The functions in here propagate the error from the nodes above to the nodes below.
module Propagating =
    let a = 3.0f
    let b = 2.0f
    let prop_mult a b =  
        let c = a*b
        // http://stackoverflow.com/questions/36636/what-is-a-closure
        let fdc_a error_c = error_c*b // dc/da = error*b - The derivative of c with respect to a. This is a function.
        let fdc_b error_c = error_c*a // dc/db = error*a - The derivative of c with respect to b. This is a function.
        c,fdc_a,fdc_b
    let c,fdc_a,fdc_b = prop_mult a b
    let dc_a = fdc_a 1.0f
    let dc_b = fdc_b 1.0f

Output:

module Propagating = begin
  val a : float32 = 3.0f
  val b : float32 = 2.0f
  val prop_mult :
    a:float32 ->
      b:float32 -> float32 * (float32 -> float32) * (float32 -> float32)
  val fdc_b : (float32 -> float32)
  val fdc_a : (float32 -> float32)
  val c : float32 = 6.0f
  val dc_a : float32 = 2.0f
  val dc_b : float32 = 3.0f
end

In the above I did a little sleight of hand and turned dc_a and dc_b into functions. Now they need to be fed the error at the node they are at.

Let us try a more complex example: (a*b)*(a*b).


module Propagating2 =
    let a = 3.0f
    let b = 2.0f
    let prop_mult a b =  
        let c = a*b
        // http://stackoverflow.com/questions/36636/what-is-a-closure
        let fdc_a error_c = error_c*b // dc/da = error*b - The derivative of c with respect to a. This is a function.
        let fdc_b error_c = error_c*a // dc/db = error*a - The derivative of c with respect to b. This is a function.
        c,fdc_a,fdc_b

    // (a*b)*(a*b)
    let c,fdc_a,fdc_b = prop_mult a b // c=a*b
    let d,fdc_a',fdc_b' = prop_mult a b // d=a*b
    let e,fde_c,fde_d = prop_mult c d // e=c*d

    let er_c, er_d = fde_c 1.0f, fde_d 1.0f // errors (or adjoints) of c and d
    let er_a', er_b' = fdc_a' er_d, fdc_b' er_d // the contributions to the adjoints of a and b that flow through d
    let er_a, er_b = fdc_a er_c, fdc_b er_c // the contributions to the adjoints of a and b that flow through c

    let adjoint_a = er_a+er_a'
    let adjoint_b = er_b+er_b'

Output:


module Propagating2 = begin
  val a : float32 = 3.0f
  val b : float32 = 2.0f
  val prop_mult :
    a:float32 ->
      b:float32 -> float32 * (float32 -> float32) * (float32 -> float32)
  val fdc_b : (float32 -> float32)
  val fdc_a : (float32 -> float32)
  val c : float32 = 6.0f
  val fdc_b' : (float32 -> float32)
  val fdc_a' : (float32 -> float32)
  val d : float32 = 6.0f
  val fde_d : (float32 -> float32)
  val fde_c : (float32 -> float32)
  val e : float32 = 36.0f
  val er_d : float32 = 6.0f
  val er_c : float32 = 6.0f
  val er_b' : float32 = 18.0f
  val er_a' : float32 = 12.0f
  val er_b : float32 = 18.0f
  val er_a : float32 = 12.0f
  val adjoint_a : float32 = 24.0f
  val adjoint_b : float32 = 36.0f
end

The adjoint of a is 24 and the adjoint of b is 36.

This one is easy to check by hand.

e = a^2*b^2
de / da = 2*a*b^2 = 2*3*2*2 = 24
de / db = 2*a^2*b = 2*3*3*2 = 36

No problem. The above is correct, but it requires us to manually push values into the returned functions. It is easy to modify the procedure so that this is not necessary.

module Propagating3 =
    type Df =
        struct // Struct is similar to record or a class except it is stack allocated.
        val P : float32 // primal
        val A : float32 ref // adjoint (reference type)
        new (p) = {P=p;A=ref 0.0f}
        end
    let a = Df 3.0f
    let b = Df 2.0f
    let mult (a: Df) (b: Df) =  
        let c = Df (a.P*b.P)
        // http://stackoverflow.com/questions/36636/what-is-a-closure
        let fb() = //The function for the backwards pass.
            a.A := !c.A*b.P + !a.A 
            b.A := !c.A*a.P + !b.A 
        c,fb

    // (a*b)*(a*b)
    let c,fc = mult a b // c=a*b
    let d,fd = mult a b // d=a*b
    let e,fe = mult c d // e=c*d

    e.A := 1.0f // Feed the 1.0f at the top.

    fe() // Crank the machine
    fd() // Crank the machine
    fc() // Crank the machine

    let adjoint_a = !a.A
    let adjoint_b = !b.A

Output:

module Propagating3 = begin
  type Df =
    struct
      new : p:float32 -> Df
      val P: float32
      val A: float32 ref
    end
  val a : Df = FSI_0005+Propagating3+Df
  val b : Df = FSI_0005+Propagating3+Df
  val mult : a:Df -> b:Df -> Df * (unit -> unit)
  val fc : (unit -> unit)
  val c : Df = FSI_0005+Propagating3+Df
  val fd : (unit -> unit)
  val d : Df = FSI_0005+Propagating3+Df
  val fe : (unit -> unit)
  val e : Df = FSI_0005+Propagating3+Df
  val adjoint_a : float32 = 24.0f
  val adjoint_b : float32 = 36.0f
end

It is kind of ugly now, but the output speaks for itself. Behind the scenes, the closures are compiled into classes. The above could also be done in a language like C++ using virtual function overloads, but with closures the pattern is unparalleled in its elegance.

The above is essentially the main design pattern of the Spiral library. It can be done better still:

module Propagating4 =
    open System.Collections.Generic
    
    type Df =
        struct // Struct is similar to record or a class except it is stack allocated.
        val P : float32 // primal
        val A : float32 ref // adjoint (reference type)
        new (p) = {P=p;A=ref 0.0f}
        end
    let a = Df 3.0f
    let b = Df 2.0f

    let tape = List<unit -> unit>() // List is not really a list, but a dynamic array. unit -> unit is the function type that takes no parameters and returns nothing.
    let mult (a: Df) (b: Df) =  
        let c = Df (a.P*b.P)
        // http://stackoverflow.com/questions/36636/what-is-a-closure
        let fb() = //The function for the backwards pass.
            a.A := !c.A*b.P + !a.A 
            b.A := !c.A*a.P + !b.A 
        tape.Add(fb)
        c

    // (a*b)*(a*b)
    let c = mult a b // c=a*b
    let d = mult a b // d=a*b
    let e = mult c d // e=c*d

    e.A := 1.0f // Feed the 1.0f at the top.

    for i=tape.Count-1 downto 0 do tape.[i]() // Let the computer crank it for you from top to bottom.

    let adjoint_a = !a.A
    let adjoint_b = !b.A

Output:

module Propagating4 = begin
  type Df =
    struct
      new : p:float32 -> Df
      val P: float32
      val A: float32 ref
    end
  val a : Df = FSI_0006+Propagating4+Df
  val b : Df = FSI_0006+Propagating4+Df
  val tape : System.Collections.Generic.List<(unit -> unit)>
  val mult : a:Df -> b:Df -> Df
  val c : Df = FSI_0006+Propagating4+Df
  val d : Df = FSI_0006+Propagating4+Df
  val e : Df = FSI_0006+Propagating4+Df
  val adjoint_a : float32 = 24.0f
  val adjoint_b : float32 = 36.0f
end

Just put all the operations onto a tape and then run it backwards. No problem. Because the tape records the operations in the order they were executed, running it in reverse guarantees that each node receives its full error from above before passing its own contribution downwards.

I think this is, in general, the difference between backpropagation and reverse AD. One buries you in calculus and the other just gives you a crank to turn. The tape is the favored approach of reverse mode AD.

In theory one should not have to study AD explicitly to figure this out, but seeing it laid out like this makes a difference.

F# does support operator overloading:

module Propagating5 =
    open System.Collections.Generic
    
    type Df =
        struct // Struct is similar to record or a class except it is stack allocated.
        val P : float32 // primal
        val A : float32 ref // adjoint (reference type)
        new (p) = {P=p;A=ref 0.0f}
        end
    let a = Df 3.0f
    let b = Df 2.0f

    let tape = List<unit -> unit>() // List is not really a list, but a dynamic array. unit -> unit is the function type that takes no parameters and returns nothing.
    let mult (a: Df) (b: Df) =  
        let c = Df (a.P*b.P)
        // http://stackoverflow.com/questions/36636/what-is-a-closure
        let fb() = //The function for the backwards pass.
            a.A := !c.A*b.P + !a.A 
            b.A := !c.A*a.P + !b.A 
        tape.Add(fb)
        c

    type Df with
        static member inline (*)(a: Df, b: Df) = mult a b // The overloaded * operator

    // (a*b)*(a*b)
    let e = (a*b)*(a*b)

    e.A := 1.0f // Feed the 1.0f at the top.

    for i=tape.Count-1 downto 0 do tape.[i]() // Let the computer crank it for you from top to bottom.

    let adjoint_a = !a.A
    let adjoint_b = !b.A

Output:

module Propagating5 = begin
  type Df =
    struct
      new : p:float32 -> Df
      val P: float32
      val A: float32 ref
      static member ( * ) : a:Df * b:Df -> Df
    end
  val a : Df = FSI_0007+Propagating5+Df
  val b : Df = FSI_0007+Propagating5+Df
  val tape : System.Collections.Generic.List<(unit -> unit)>
  val mult : a:Df -> b:Df -> Df
  val e : Df = FSI_0007+Propagating5+Df
  val adjoint_a : float32 = 24.0f
  val adjoint_b : float32 = 36.0f
end

The above overload is just an aesthetic improvement; in fact, I have not even put it into the Spiral library yet. With this the basic pattern is complete.

With it we can build the backward pass as we go forward.

Let us go back to the first example:

module ReverseADExample =
    type Df =
        struct // Struct is similar to record or a class except it is stack allocated.
        val P : float32 // primal
        val A : float32 ref // adjoint (reference type)
        new (p) = {P=p;A=ref 0.0f}
        new (p,t) = {P=p;A=ref t}
        end with

        override t.ToString() = sprintf "(%f,%f)" t.P !t.A // To make F# Interactive print out the fields
        member t.dup = Df(t.P,!t.A) // Makes a duplicate of the struct.

    open System.Collections.Generic
    let tape = List<unit -> unit>() // List is not really a list, but a dynamic array. unit -> unit is the function type that takes no parameters and returns nothing.
    let sigmoid (a: Df) =  
        let c = Df (1.0f/(1.0f+exp(-a.P)))
        let fb() = // The function for the backwards pass.
            a.A := !c.A*c.P*(1.0f-c.P) + !a.A 
        tape.Add(fb)
        c
    let tanh (a: Df) =
        let c = Df (tanh a.P)
        let fb() = // The function for the backwards pass.
            a.A := !c.A*(1.0f-c.P*c.P) + !a.A 
        tape.Add(fb)
        c
    let pow (a: Df) b =
        let c = Df (a.P**b)
        let fb() = // The function for the backwards pass.
            a.A := !c.A*b*(c.P/a.P) + !a.A 
        tape.Add(fb)
        c
    let mult (a: Df) (b: Df) =  
        let c = Df (a.P*b.P)
        let fb() = //The function for the backwards pass.
            a.A := !c.A*b.P + !a.A 
            b.A := !c.A*a.P + !b.A 
        tape.Add(fb)
        c
    let add (a: Df) (b: Df) =  
        let c = Df (a.P+b.P)
        let fb() = //The function for the backwards pass.
            a.A := !c.A + !a.A 
            b.A := !c.A + !b.A 
        tape.Add(fb)
        c
    let sub (a: Df) (b: Df) =  
        let c = Df (a.P-b.P)
        let fb() = //The function for the backwards pass.
            a.A := !c.A + !a.A 
            b.A := !b.A - !c.A
        tape.Add(fb)
        c

    type Df with
        static member inline (*)(a: Df, b: Df) = mult a b // The overloaded * operator
        static member inline (+)(a: Df, b: Df) = add a b // The overloaded + operator
        static member inline (-)(a: Df, b: Df) = sub a b // The overloaded - operator
        static member inline Pow(a: Df, b) = pow a b // The overloaded ** operator

    let target = Df -0.5f
    let input = Df 1.0f

    let W = Df 1.5f
    let W2 = Df 2.0f
    let bias = Df 0.25f
    let bias2 = Df 0.0f

    //let y = (target-sigmoid(W2 * tanh(W*input+bias) + bias2))**2.0f

    // These are the original assignments. In AD literature the starting variables are usually indexed from 1-n up to 0, but here they will simply start at 0.
    // This is the evaluation trace the program would have produced had I decompiled it.
    let v0 = target.dup // The reason for these duplicates is to emphasize that F# Interactive will output only the values of the final run.
    let v1 = input.dup  // In this case it makes no difference as they will be the same either way, but it would be improper not to copy them.
    let v2 = W.dup
    let v3 = W2.dup
    let v4 = bias.dup
    let v5 = bias2.dup
    // The first calculation is W*input = v2*v1.
    let v6 = v2*v1
    // Then comes v6+bias=v6+v4
    let v7 = v6+v4
    // Then comes tanh(v7)
    let v8 = tanh(v7)
    // Then comes W2*v8=v3*v8.
    let v9 = v3*v8
    // Then comes v9+bias2=v9+v5
    let v10 = v9+v5
    // Then comes sigmoid(v10)
    let v11 = sigmoid(v10)
    // Then comes target-v11=v0-v11
    let v12 = v0-v11
    // Then comes v12**2.0f
    let v13 = v12**2.0f

    v13.A := 1.0f // Feed the 1.0f at the top.

    for i=tape.Count-1 downto 0 do tape.[i]() // Let the computer crank it for you from top to bottom.

    let adjoint_W = !v2.A
    let adjoint_W2 = !v3.A
    let adjoint_bias = !v4.A
    let adjoint_bias2 = !v5.A

    // Once more from the top.
    tape.Clear()
    let y = (target-sigmoid(W2 * tanh(W*input+bias) + bias2))**2.0f
    y.A := 1.0f
    for i=tape.Count-1 downto 0 do tape.[i]() // Let the computer crank it for you from top to bottom.

    let adjoint_W' = !W.A
    let adjoint_W2' = !W2.A
    let adjoint_bias' = !bias.A
    let adjoint_bias2' = !bias2.A

Output:

module ReverseADExample = begin
  type Df =
    struct
      new : p:float32 -> Df
      val P: float32
      val A: float32 ref
      override ToString : unit -> string
      static member Pow : a:Df * b:float32 -> Df
      static member ( + ) : a:Df * b:Df -> Df
      static member ( * ) : a:Df * b:Df -> Df
      static member ( - ) : a:Df * b:Df -> Df
    end
  val tape : System.Collections.Generic.List<(unit -> unit)>
  val sigmoid : a:Df -> Df
  val tanh : a:Df -> Df
  val pow : a:Df -> b:float32 -> Df
  val mult : a:Df -> b:Df -> Df
  val add : a:Df -> b:Df -> Df
  val sub : a:Df -> b:Df -> Df
  val target : Df = (-0.500000,-2.735854)
  val input : Df = (1.000000,0.107078)
  val W : Df = (1.500000,0.071385)
  val W2 : Df = (2.000000,0.295225)
  val bias : Df = (0.250000,0.071385)
  val bias2 : Df = (0.000000,0.313611)
  val v0 : Df = (-0.500000,-2.735854)
  val v1 : Df = (1.000000,0.107078)
  val v2 : Df = (1.500000,0.071385)
  val v3 : Df = (2.000000,0.295225)
  val v4 : Df = (0.250000,0.071385)
  val v5 : Df = (0.000000,0.313611)
  val v6 : Df = (1.500000,0.071385)
  val v7 : Df = (1.750000,0.071385)
  val v8 : Df = (0.941376,0.627221)
  val v9 : Df = (1.882751,0.313611)
  val v10 : Df = (1.882751,0.313611)
  val v11 : Df = (0.867927,2.735854)
  val v12 : Df = (-1.367927,-2.735854)
  val v13 : Df = (1.871224,1.000000)
  val adjoint_W : float32 = 0.0713853538f
  val adjoint_W2 : float32 = 0.295225322f
  val adjoint_bias : float32 = 0.0713853538f
  val adjoint_bias2 : float32 = 0.313610584f
  val y : Df = (1.871224,1.000000)
  val adjoint_W' : float32 = 0.0713853538f
  val adjoint_W2' : float32 = 0.295225322f
  val adjoint_bias' : float32 = 0.0713853538f
  val adjoint_bias2' : float32 = 0.313610584f
end

I checked it against DiffSharp and, surprisingly, it is correct on the first try. DiffSharp has more advanced features than Spiral, but at the moment it has no Cuda support or inplace operations.
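As a sanity check independent of DiffSharp, the adjoint of bias2 is easy to verify with the chain rule by hand. A sketch, plugging in the primal values from the trace above:

// y = (v0 - sigmoid(v10))^2, so dy/dbias2 = 2*v12 * (-1) * sigmoid'(v10),
// where sigmoid'(v10) = v11*(1-v11).
let v11' = 0.867927f
let v12' = -1.367927f
let dy_dbias2 = 2.0f * v12' * -1.0f * (v11' * (1.0f - v11'))
// ≈ 0.313611f, matching adjoint_bias2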

Second order differentiation

In order for me to explain how to calculate the Hessian using AD, it is necessary to explain the forward mode first. As the mirror image of the reverse mode, it can calculate the derivative of a function R -> R^n in a single pass. By iterating it over each input, much like the finite difference approximation one might use for gradient checking, it is possible to handle R^m -> R^n.

As ML cost functions are in general R^m -> R, that means one would have to rerun the algorithm m times to get all the weight gradients.

Forward mode is pretty simple to understand.

One just attaches an extra number to the Df, called the tangent, similar to the adjoint.

module Forward1 =
    open System.Collections.Generic
    
    type Df =
        struct // Struct is similar to record or a class except it is stack allocated.
        val P : float32 // primal
        val T : float32 ref // tangent (reference type)

        new (p) = {P=p; T=ref 0.0f}
        new (p,t) = {P=p; T=ref t}
        end with
        override t.ToString() = sprintf "(%f,%f)" t.P !t.T // To make F# Interactive print out the fields
    let a = Df 3.0f
    let b = Df 2.0f

    let mult (a: Df) (b: Df) =  
        let cp = a.P*b.P // primal
        let ct = !a.T * b.P + a.P * !b.T // tangent
        let c = Df(cp,ct)
        c

    type Df with
        static member inline (*)(a: Df, b: Df) = mult a b // The overloaded * operator

    // In order to get the derivative of the cost with respect to a, set a's tangent to 1.0f and every other's to 0.0f
    a.T := 1.0f
    b.T := 0.0f
    let c = a*a*b*b
    // In order to get the derivative of the cost with respect to b, set b's tangent to 1.0f and every other's to 0.0f
    a.T := 0.0f
    b.T := 1.0f
    let c' = a*a*b*b

Output:

module Forward1 = begin
  type Df =
    struct
      new : p:float32 -> Df
      new : p:float32 * t:float32 -> Df
      val P: float32
      val T: float32 ref
      override ToString : unit -> string
      static member ( * ) : a:Df * b:Df -> Df
    end
  val a : Df = (3.000000,0.000000)
  val b : Df = (2.000000,1.000000)
  val mult : a:Df -> b:Df -> Df
  val c : Df = (36.000000,24.000000)
  val c' : Df = (36.000000,36.000000)
end

The tangent is in the second field. 24 and 36 seem about right for a and b. Tangents are really just errors from below.

c = a*b
c' = a'*b + a*b'

In the product rule above we just carry a' and b' along with the primals as dual numbers as we move up the graph. In the reverse mode we would do the same from the opposite side, on the way down.
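For example, the addition rule under this scheme is even simpler: the tangents just pass through and add. A quick sketch using the Df type from the Forward1 module above (this add is my illustration and is not part of the module):

let add (a: Df) (b: Df) =
    let cp = a.P + b.P   // primal
    let ct = !a.T + !b.T // tangent: (a+b)' = a' + b'
    Df(cp,ct)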

To get second order derivatives, we combine reverse and the forward mode:

module Hessian =
    open System.Collections.Generic
    
    type Df =
        struct // Struct is similar to record or a class except it is stack allocated.
        val P : float32 // primal
        val T : float32 ref // tangent (reference type)
        val A : float32 ref // adjoint (reference type)
        val TA : float32 ref // tangent of the adjoint (reference type)

        new (p) = {P=p; T=ref 0.0f; A=ref 0.0f; TA=ref 0.0f}
        new (p,t) = {P=p; T=ref t; A=ref 0.0f; TA=ref 0.0f}
        end with
        override t.ToString() = sprintf "(primal=%f,tangent=%f,adjoint=%f,tangent of adjoint=%f)" t.P !t.T !t.A !t.TA // To make F# Interactive print out the fields
    let a = Df 3.0f
    let b = Df 2.0f

    let tape = List<unit -> unit>() // List is not really a list, but a dynamic array. unit -> unit is the function type that takes no parameters and returns nothing.
    let mult (a: Df) (b: Df) =  
        let cp = a.P*b.P // primal
        let ct = !a.T * b.P + a.P * !b.T // tangent
        let c = Df(cp,ct)
        let fb() = 
            a.A := !c.A*b.P + !a.A 
            b.A := !c.A*a.P + !b.A 

            // The derivative of !c.A*b.P is !c.TA*b.P + !c.A* !b.T
            // We also run the forward mode during the reverse.
            // This calculates the Hessian multiplied by a vector.
            a.TA := !c.TA*b.P + !c.A* !b.T + !a.TA 
            b.TA := !c.TA*a.P + !c.A* !a.T + !b.TA 
        tape.Add(fb)
        c

    type Df with
        static member inline (*)(a: Df, b: Df) = mult a b // The overloaded * operator

    // In order to get the derivative of the cost with respect to a, set a's tangent to 1.0f and every other's to 0.0f
    a.T := 1.0f
    b.T := 0.0f
    let c = a*a*b*b
    c.A := 1.0f
    for i=tape.Count-1 downto 0 do tape.[i]() // Let the computer crank it for you from top to bottom.
    printfn "The elements of the Hessian are inside the tangent of the adjoint."
    printfn "a=%A" a
    printfn "b=%A" b


    // Once more from the top.
    [|a;b|] |> Array.iter (fun x -> x.A := 0.0f;x.T := 0.0f;x.TA := 0.0f) // Reset the adjoints and the tangents of the base variables.
    a.T := 0.0f
    b.T := 1.0f
    tape.Clear()
    let c' = a*a*b*b
    c'.A := 1.0f
    for i=tape.Count-1 downto 0 do tape.[i]() // Let the computer crank it for you from top to bottom.
    printfn "a=%A" a
    printfn "b=%A" b

Output:

The elements of the Hessian are inside the tangent of the adjoint.
a=(primal=3.000000,tangent=1.000000,adjoint=24.000000,tangent of adjoint=8.000000)
b=(primal=2.000000,tangent=0.000000,adjoint=36.000000,tangent of adjoint=24.000000)
a=(primal=3.000000,tangent=0.000000,adjoint=24.000000,tangent of adjoint=24.000000)
b=(primal=2.000000,tangent=1.000000,adjoint=36.000000,tangent of adjoint=18.000000)
// I've omitted the rest

So based on the above:
d^2c/da^2 = 8
d^2c/dadb = 24
d^2c/db^2 = 18

What I did is attach a tangent to the adjoint, so that in effect we are running the forward mode on the way down along with the reverse. Each backward pass computes the Hessian multiplied by the seeded tangent vector, so running it once per input variable recovers the full Hessian column by column, as shown in the output.
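To see that these really are the pieces of the Hessian, it can be written out by hand. This check is mine, not part of the output above; each run with a different seed tangent extracts one column:

// e = a^2*b^2 at (a,b) = (3,2)
let h_aa = 2.0f*2.0f*2.0f // d^2e/da^2 = 2*b^2 = 8
let h_ab = 4.0f*3.0f*2.0f // d^2e/dadb = 4*a*b = 24
let h_bb = 2.0f*3.0f*3.0f // d^2e/db^2 = 2*a^2 = 18
// Seeding a's tangent with 1.0f yields the column (8,24); seeding b's yields (24,18).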

Closing remarks

With the above tools, one has everything one needs to implement any machine learning algorithm imaginable. For myself, it took quite a while to figure out second order differentiation; I definitely could not find any info on how to do it online.

Now I know, and now that I have revealed the secret, you do too.

I do not actually intend to extend support for this in the library, as it would place too great a burden on me as a programmer to maintain a feature I do not need, but there is satisfaction in knowing how to do it in case it ever becomes necessary.

While the above examples were on scalars, they extend naturally to matrix operations.
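To make that concrete, here is a minimal sketch of how the mult rule lifts to matrices, using plain 2D arrays. This is my illustration, not code from the Spiral library:

let matMult (a: float32[,]) (b: float32[,]) =
    Array2D.init (a.GetLength(0)) (b.GetLength(1)) (fun i j ->
        let mutable s = 0.0f
        for k = 0 to a.GetLength(1) - 1 do s <- s + a.[i,k]*b.[k,j]
        s)
let transpose (a: float32[,]) =
    Array2D.init (a.GetLength(1)) (a.GetLength(0)) (fun i j -> a.[j,i])
// For C = A*B the backward rules mirror error*b and error*a from the scalar case:
// adjoint_A accumulates adjoint_C * B^T and adjoint_B accumulates A^T * adjoint_C.
let matMultBackward (adjC: float32[,]) (a: float32[,]) (b: float32[,]) =
    matMult adjC (transpose b), matMult (transpose a) adjC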

Having covered the basics, the next few posts will be a stroll down memory lane.

The full code for post 6 is at the usual place.
