The dMatrix type and the multiply and add wrappers.

The dMatrix type

Before anything can be done, the basic 2D matrix type has to be defined. It would be far too unsafe to call Cuda library functions on raw Cuda device arrays. Instead, it is very helpful to wrap them inside a class, or in this case a record, which is similar to a class. The differences are that it cannot be inherited from, its fields can be made mutable, and there is the ability to make updated copies.
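As a quick illustration of those record features, here is a generic sketch (the point type below is just an example, not part of the library):

type point = { x: float; y: float; mutable tag: int }
let p = { x = 1.0; y = 2.0; tag = 0 }
let p' = { p with y = 5.0 } // copy-and-update makes a new record; p is untouched
p.tag <- 3                  // mutable fields can be assigned in place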

/// The main matrix type.
type dMatrix =
    {
    mutable num_rows:int
    mutable num_cols:int
    mutable dArray: CudaDeviceVariable<floatType>
    }  

    /// The main create function. A substitute for the constructor.
    static member create(num_rows: int,num_cols,dArray: CudaDeviceVariable<floatType>) =
        {num_rows=num_rows;num_cols=num_cols;dArray=dArray}

    /// Throws an exception if it tries to allocate an array of size 0.
    static member create(num_rows: int,num_cols) =
        let q = (num_rows*num_cols) |> SizeT
        let t = new CudaDeviceVariable<floatType>(q)
        {num_rows=num_rows;num_cols=num_cols;dArray=t}

    /// Copies a host to a device array.
    /// Throws an exception if it tries to allocate an array of size 0.
    static member create(num_rows: int,num_cols,dArray: floatType[]) =
        let q = num_rows*num_cols
        if dArray.Length <> q then failwith "Invalid size in dMatrix construction."
        let t = to_dev dArray
        {num_rows=num_rows;num_cols=num_cols;dArray=t}

    /// Returns a new instance of an empty dMatrix.
    /// Unlike let bindings, member properties are reevaluated on every access.
    static member createEmpty = dMatrix.create(0,0,CudaDeviceVariable.Null)

    /// Returns num_rows, num_cols as a tuple
    member inline t.rc = t.num_rows, t.num_cols
    /// Sets the matrix to zero.
    member inline t.setZero() = t.dArray.MemsetAsync(0u,str.Stream)
    /// Set the matrix to a value.
    member inline t.set (x: floatType) = 
        let v = BitConverter.ToUInt32(BitConverter.GetBytes(x),0)
        t.dArray.MemsetAsync(v,str.Stream)
    /// Creates a copy of this matrix with all the values set to zero.
    member inline t.zeroLike() =
        let c = dMatrix.create(t.num_rows,t.num_cols)
        c.setZero()
        c
    /// Copies a matrix.
    member inline t.copy() =
        let c = dMatrix.create(t.num_rows,t.num_cols)
        c.dArray.AsyncCopyToDevice(t.dArray,str)
        c

    /// Resizes the dArray if the current one is smaller than nr*nc. Otherwise it only adjusts num_rows and num_cols.
    member inline t.ReplaceIf nr nc =
        if int t.dArray.Size < nr*nc 
        then
            (t :> IDisposable).Dispose()
            t.num_rows <- nr
            t.num_cols <- nc
            t.dArray <- new_dev (nr*nc)
        else
            t.num_rows <- nr
            t.num_cols <- nc

    /// Copies a matrix to a host array.
    member inline t.Gather() =
        let h_a = Array.zeroCreate<floatType> (int t.dArray.Size)
        t.dArray.CopyToHost(h_a)
        h_a

    member inline t.isEmpty = t.dArray.Equals(CudaDeviceVariable.Null)

    /// The unmanaged Cuda memory has to be freed explicitly or by letting go of the context by resetting  the F# Interactive.
    /// Finalizers work really poorly and can lead to unpredictable bugs when used to manage Cuda memory.
    interface IDisposable with
        member t.Dispose() = 
            if t.isEmpty = false then
                t.dArray.Dispose()

The reason why the record fields are mutable is to enable resizing. When a matrix is resized upwards it disposes its dArray and assigns a new one, while when it resizes downwards it only changes the num_rows and num_cols fields. This keeps dynamic reallocations to a minimum.
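Concretely, the resizing behaves like this (a small sketch, assuming the dMatrix type above and a live Cuda context):

let w = dMatrix.create(10,10) // dArray holds 100 elements
w.ReplaceIf 5 5               // resizing down: dArray is kept, only num_rows and num_cols change
w.ReplaceIf 20 20             // resizing up: the old dArray is disposed and a 400 element one is allocated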

To achieve optimal performance, the amount of memory allocation and deallocation needs to be kept to a minimum. Ideally one should allocate all the memory on the first pass and keep reusing those matrices on the later passes. The versions that reuse memory are roughly 3-4 times faster than their more functional equivalents. On the other hand, the functional equivalents that rely solely on dynamic allocation are much easier to write.

A useful practice I have found is to fluidly transition between the two styles: write the first version in a functional manner, then make it faster by cutting down on the dynamic memory juggling.

Gemm, Geam and AddTensor wrappers

The most essential part of an ML library is that it provides access to the basic linear algebra functions, which in turn live inside other libraries. For fully connected neural nets the functions that power them are the sgemm matrix multiply (A*x) found in the cuBLAS library and AddTensor (+ bias) from cuDNN. The cuDNN user guide can be found in the downloads section.

A recommendation for those interested in making the convolutional functions work is to use the Julia documentation instead, as it is much better than Nvidia's. It actually explains what the parameters do.

I won't be covering convolutional functions in this series, but I have written a bunch of wrappers for them in Alea (the Cuda .NET compiler library) that will have to be adapted to work with ManagedCuda. I'll do that eventually, after I finish my exploration of recurrent nets.

Here is the general matrix multiply:

let T = Operation.Transpose
let nT = Operation.NonTranspose

/// General matrix-matrix multiply from cuBLAS.
let gemm transa transb (alpha: floatType) (A:dMatrix) (B:dMatrix) =
    let a_col = if transa = nT then A.num_cols else A.num_rows
    let b_row = if transb = nT then B.num_rows else B.num_cols
    if a_col <> b_row then failwith (sprintf "a_col <> b_row in gemm! %i <> %i" a_col b_row)
    let m = if transa = nT then A.num_rows else A.num_cols
    let n = if transb = nT then B.num_cols else B.num_rows
    let k = a_col

    let lda = if transa = nT then m else k
    let ldb = if transb = nT then k else n
    let ldc = m

    let C_dArray = new CudaDeviceVariable<floatType>(m*n |> SizeT)
    cublas.Gemm(transa, transb, m, n, k, alpha, A.dArray, lda, B.dArray, ldb, 0.0f, C_dArray, ldc)
    dMatrix.create(m,n,C_dArray)

/// General matrix-matrix multiply from cuBLAS. Inplace version
let gemm2 transa transb (alpha: floatType) (A:dMatrix) (B:dMatrix) beta (C:dMatrix) =
    let a_col = if transa = nT then A.num_cols else A.num_rows
    let b_row = if transb = nT then B.num_rows else B.num_cols
    if a_col <> b_row then failwith (sprintf "a_col <> b_row in gemm! %i <> %i" a_col b_row)
    let m = if transa = nT then A.num_rows else A.num_cols
    let n = if transb = nT then B.num_cols else B.num_rows
    let k = a_col

    let lda = if transa = nT then m else k
    let ldb = if transb = nT then k else n
    let ldc = m

    let C_dArray = C.dArray
    if m <> C.num_rows || n <> C.num_cols then failwith "m <> C.num_rows || n <> C.num_cols in gemm2"
    cublas.Gemm(transa, transb, m, n, k, alpha, A.dArray, lda, B.dArray, ldb, beta, C_dArray, ldc)

The gemm (general matrix-matrix multiply) functions were difficult to figure out as they have some insane parameter rules that force the programmer to do acrobatics with the row and column variables.

At any rate, once wrapped they are not difficult to use. The boundary checks will throw an exception if the matrix dimensions are passed in incorrectly.

The difference between the two is easy enough to see from the return type. The standard gemm dynamically allocates and returns the dMatrix C, while the inplace version requires the memory to be preallocated.
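Here is a hedged usage sketch of the two calling conventions (the dimensions are arbitrary and the matrices are left uninitialized, since only the shapes matter here):

let W = dMatrix.create(5,3)
let x = dMatrix.create(3,4)
let z = gemm nT nT 1.0f W x       // allocates and returns a new 5x4 matrix
let z_pre = dMatrix.create(5,4)   // preallocated once...
gemm2 nT nT 1.0f W x 0.0f z_pre   // ...and overwritten in place on every subsequent pass
// gemm nT nT 1.0f x W would throw: "a_col <> b_row in gemm! 4 <> 5"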

/// General matrix-matrix addition.
let geam transa transb (alpha: floatType) (A:dMatrix) beta (B:dMatrix) =
    let a_row = if transa = nT then A.num_rows else A.num_cols
    let a_col = if transa = nT then A.num_cols else A.num_rows
    let b_row = if transb = nT then B.num_rows else B.num_cols
    let b_col = if transb = nT then B.num_cols else B.num_rows
        
    if a_row <> b_row then failwith (sprintf "a_row <> b_row in geam! %i <> %i" a_row b_row)
    if a_col <> b_col then failwith (sprintf "a_col <> b_col in geam! %i <> %i" a_col b_col)

    let lda = if transa = nT then a_row else a_col
    let ldb = if transb = nT then b_row else b_col
    let ldc = a_row

    let C_dArray = new CudaDeviceVariable<floatType>(a_row*a_col |> SizeT)
    cublas.Geam(transa, transb, a_row, a_col, alpha, A.dArray, lda, B.dArray, ldb, beta, C_dArray, ldc)
    dMatrix.create(a_row,a_col,C_dArray)

/// General matrix-matrix addition. Inplace version.
let geam2 transa transb (alpha: floatType) (A:dMatrix) beta (B:dMatrix) (C:dMatrix) =
    let a_row = if transa = nT then A.num_rows else A.num_cols
    let a_col = if transa = nT then A.num_cols else A.num_rows
    let b_row = if transb = nT then B.num_rows else B.num_cols
    let b_col = if transb = nT then B.num_cols else B.num_rows
        
    if a_row <> b_row then failwith (sprintf "a_row <> b_row in geam2! %i <> %i" a_row b_row)
    if a_col <> b_col then failwith (sprintf "a_col <> b_col in geam2! %i <> %i" a_col b_col)

    if a_row <> C.num_rows then failwith (sprintf "a_row <> C.num_rows in geam2! %i <> %i" a_row C.num_rows)
    if a_col <> C.num_cols then failwith (sprintf "a_col <> C.num_cols in geam2! %i <> %i" a_col C.num_cols)

    let lda = if transa = nT then a_row else a_col
    let ldb = if transb = nT then b_row else b_col
    let ldc = a_row

    cublas.Geam(transa, transb, a_row, a_col, alpha, A.dArray, lda, B.dArray, ldb, beta, C.dArray, ldc)

let inline transpose t = geam T T 1.0f t 0.0f t // Transpose function

The extended cuBLAS function geam (matrix-matrix addition) acts much like a vector-vector addition function with a few extra features. Besides adding two matrices together, it can also transpose them beforehand, which means it can double as a transpose function.
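A couple of small examples of what geam can express (m1 and m2 are hypothetical matrices of identical dimensions, assumed to be already filled):

let sum = geam nT nT 1.0f m1 1.0f m2  // elementwise sum of two equally sized matrices
let half = geam nT nT 0.5f m1 0.0f m1 // scaling by 0.5: the second operand gets a zero coefficient
let m1T = transpose m1                // the transpose helper above is just geam with both operands transposed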


let biasTensorDesc = new TensorDescriptor()
let dstTensorDesc = new TensorDescriptor()
let SpiralCuDNNDataType = 
    if typeof<floatType> = typeof<float32> then cudnnDataType.Float
    else if typeof<floatType> = typeof<float> then cudnnDataType.Double
    else failwith "cudnnDataType not supported."

///o <- beta*mat + alpha*vec (matrix-vector broadcast addition)
let broadcastAdd beta (mat: dMatrix) alpha (vec: dMatrix) =
    let TensorFormat = cudnnTensorFormat.NCHW;
    biasTensorDesc.SetTensor4dDescriptor(TensorFormat, SpiralCuDNNDataType, 1, 1, vec.num_rows, vec.num_cols)
    dstTensorDesc.SetTensor4dDescriptor(TensorFormat, SpiralCuDNNDataType, 1, mat.num_cols, mat.num_rows, 1)
    let copy_mat = mat.copy()
    cudnn.AddTensor(alpha,biasTensorDesc,vec.dArray,beta,dstTensorDesc,copy_mat.dArray)
    copy_mat
    
///mat <- beta*mat + alpha*vec (matrix-vector broadcast addition)
let broadcastAdd2 beta (mat: dMatrix) alpha (vec: dMatrix) =
    let TensorFormat = cudnnTensorFormat.NCHW;
    biasTensorDesc.SetTensor4dDescriptor(TensorFormat, SpiralCuDNNDataType, 1, 1, vec.num_rows, vec.num_cols)
    dstTensorDesc.SetTensor4dDescriptor(TensorFormat, SpiralCuDNNDataType, 1, mat.num_cols, mat.num_rows, 1)
    cudnn.AddTensor(alpha,biasTensorDesc,vec.dArray,beta,dstTensorDesc,mat.dArray)

/// o <- sum_across_channels(alpha*mat)
/// For 2D matrices, channels are the columns.
/// The function sums along the rows.
let rowSum alpha (mat: dMatrix) =
    let TensorFormat = cudnnTensorFormat.NHWC;
    dstTensorDesc.SetTensor4dDescriptor(TensorFormat, SpiralCuDNNDataType, 1, mat.num_rows, 1, mat.num_cols)
    
    let vec = dMatrix.create(mat.num_rows,1)
    biasTensorDesc.SetTensor4dDescriptor(TensorFormat, SpiralCuDNNDataType, 1, vec.num_rows, 1, vec.num_cols)
    
    cudnn.ConvolutionBackwardBias(alpha,dstTensorDesc,mat.dArray,0.0f,biasTensorDesc,vec.dArray)
    vec

/// vec <- sum_across_channels(alpha*mat)+beta*vec
/// For 2D matrices, channels are the columns.
/// The function sums along the rows.
let rowSum2 alpha (mat: dMatrix) beta (vec: dMatrix) =
    let TensorFormat = cudnnTensorFormat.NHWC;
    dstTensorDesc.SetTensor4dDescriptor(TensorFormat, SpiralCuDNNDataType, 1, mat.num_rows, 1, mat.num_cols)
    biasTensorDesc.SetTensor4dDescriptor(TensorFormat, SpiralCuDNNDataType, 1, mat.num_rows, 1, vec.num_cols)
    
    cudnn.ConvolutionBackwardBias(alpha,dstTensorDesc,mat.dArray,beta,biasTensorDesc,vec.dArray)

The above functions do broadcast matrix-vector addition and summation along the rows, used for adding the biases and calculating their adjoints (error derivatives) respectively.
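To put them in context, here is a rough sketch of where they would appear in a fully connected layer; W, x, b and the error matrix er are hypothetical and assumed to have compatible dimensions:

let z = broadcastAdd 1.0f (gemm nT nT 1.0f W x) 1.0f b // forward: W*x with the bias vector b broadcast across the columns
let db = rowSum 1.0f er                                 // backward: the adjoint of b is the row-wise sum of the error matrix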

There is no particular need to think deeply about what the parameters do here; it is enough to know roughly what the functions do and put them in a mental box somewhere. Unlike the cuBLAS wrappers, these do not have boundary checks of their own because cuDNN already takes care of that.

An interesting tidbit worth noting here is that a scalar (or a vector, in the case of a broadcast) expanded to a matrix in the forward pass does not require averaging of the errors on the backward pass, just their summation. In fact, I only just noticed that I had made this very error in the rowSum functions.
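Expressed on the CPU, the point is just the chain rule; a minimal illustration with arbitrary numbers:

// If z_j = b + x_j for j = 1..n (b broadcast to n copies in the forward pass),
// then dL/db = sum over j of dL/dz_j, that is, a plain sum, not an average.
let dz = [| 0.1f; 0.2f; 0.3f; 0.4f |] // hypothetical adjoints of the broadcast copies
let db = Array.sum dz                 // = 1.0f, the adjoint of b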

With these four basic functions, we have almost everything we need to make our first neural net, though we still need the map kernels for the activation functions and the random function to initialize the weights and the biases.

Before that, here is a demonstration of one of the more interesting features of F#: type extensions.

type dMatrix with
    /// For accessing individual elements with the .[a,b] syntax.
    member t.Item
        with get(a: int, b: int) = t.dArray.[a+b*t.num_rows |> SizeT]
        and set(a: int, b: int) (value: floatType) = t.dArray.[a+b*t.num_rows |> SizeT] <- value

    /// For displaying column major matrices inside Array2D (which is row major).
    member inline t.Gather'() =
            let h_a = Array2D.zeroCreate<floatType> t.num_rows t.num_cols
            use t' = transpose t // Transpose to row major. The use keyword ensures that it is disposed automatically as soon as it goes out of scope.
            t'.dArray.CopyToHost(h_a) // Copy directly to host array.
            h_a

What this snippet does is add two methods to the dMatrix type. The first is a function for accessing individual elements the way one would with a 2D array, and the other copies the Cuda array to a 2D host array.

Because the Array2D class uses row major ordering, a transpose needs to be done before making the transfer, and for that the geam function had to be completed first. Since geam cannot be written before the dMatrix type exists, without type extensions we would have had to put everything inside the dMatrix type. It would not have been impossible to implement that way, but type extensions are very convenient here.

A short example to prove that the above works:

let a = dMatrix.create(5,3)
a.set 1.5f
let b = dMatrix.create(3,4)
b.set 2.3f

let c = gemm nT nT 1.0f a b

let a' = a.Gather'()
let b' = b.Gather'()
let c' = c.Gather'()

let c'2 = geam nT nT 0.1f c 0.0f c // Multiplies the first array by 0.1f and the second by 0.0f
let c'2' = c'2.Gather'()

let bias = dMatrix.create(5,1)
for i=0 to 4 do bias.[i,0] <- 0.5f + floatType i

let bias' = bias.Gather'()

let d = broadcastAdd 1.0f c'2 1.0f bias
let d' = d.Gather'()
let e = rowSum 1.0f d
let e' = e.Gather'()
let e2 = rowSum 2.0f d
let e2' = e2.Gather'()

Here is the output:


val a : dMatrix = {num_rows = 5;
                   num_cols = 3;
                   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val b : dMatrix = {num_rows = 3;
                   num_cols = 4;
                   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val c : dMatrix = {num_rows = 5;
                   num_cols = 4;
                   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val a' : floatType [,] = [[1.5f; 1.5f; 1.5f]
                          [1.5f; 1.5f; 1.5f]
                          [1.5f; 1.5f; 1.5f]
                          [1.5f; 1.5f; 1.5f]
                          [1.5f; 1.5f; 1.5f]]
val b' : floatType [,] =
  [[2.29999995f; 2.29999995f; 2.29999995f; 2.29999995f]
   [2.29999995f; 2.29999995f; 2.29999995f; 2.29999995f]
   [2.29999995f; 2.29999995f; 2.29999995f; 2.29999995f]]
val c' : floatType [,] =
  [[10.3499994f; 10.3499994f; 10.3499994f; 10.3499994f]
   [10.3499994f; 10.3499994f; 10.3499994f; 10.3499994f]
   [10.3499994f; 10.3499994f; 10.3499994f; 10.3499994f]
   [10.3499994f; 10.3499994f; 10.3499994f; 10.3499994f]
   [10.3499994f; 10.3499994f; 10.3499994f; 10.3499994f]]
val c'2 : dMatrix =
  {num_rows = 5;
   num_cols = 4;
   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val c'2' : floatType [,] =
  [[1.03499997f; 1.03499997f; 1.03499997f; 1.03499997f]
   [1.03499997f; 1.03499997f; 1.03499997f; 1.03499997f]
   [1.03499997f; 1.03499997f; 1.03499997f; 1.03499997f]
   [1.03499997f; 1.03499997f; 1.03499997f; 1.03499997f]
   [1.03499997f; 1.03499997f; 1.03499997f; 1.03499997f]]
val bias : dMatrix =
  {num_rows = 5;
   num_cols = 1;
   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val bias' : floatType [,] = [[0.5f]
                             [1.5f]
                             [2.5f]
                             [3.5f]
                             [4.5f]]
val d : dMatrix = {num_rows = 5;
                   num_cols = 4;
                   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val d' : floatType [,] =
  [[1.53499997f; 1.53499997f; 1.53499997f; 1.53499997f]
   [2.53499985f; 2.53499985f; 2.53499985f; 2.53499985f]
   [3.53499985f; 3.53499985f; 3.53499985f; 3.53499985f]
   [4.53499985f; 4.53499985f; 4.53499985f; 4.53499985f]
   [5.53499985f; 5.53499985f; 5.53499985f; 5.53499985f]]
val e : dMatrix = {num_rows = 5;
                   num_cols = 1;
                   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val e' : floatType [,] = [[6.13999987f]
                          [10.1399994f]
                          [14.1399994f]
                          [18.1399994f]
                          [22.1399994f]]
val e2 : dMatrix = {num_rows = 5;
                    num_cols = 1;
                    dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val e2' : floatType [,] = [[12.2799997f]
                           [20.2799988f]
                           [28.2799988f]
                           [36.2799988f]
                           [44.2799988f]]

The above example is pretty innocuous, but through it I managed to discover a bug related to the finalizer that used to be inside the dMatrix record. It made me decide to remove the finalizer entirely.

Finalizers – A bad idea

What I am doing here is a bit of a throwback to the old times when things like memory management were done explicitly, in languages like C. Unfortunately, with device memory that will also have to be the case. Until GPU programming matures, in backwards times such as these (almost 2016), the programmer has the responsibility of handling device memory.

I would recommend not trying to take the easy path by putting in finalizers; in fact, in the above example I found an instance of memory corruption caused by one. Most likely, in larger examples the errors would be more subtle and devastating.
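In practice that means disposing temporaries explicitly, either with the use keyword as in Gather' above, or by calling Dispose by hand once a matrix is no longer needed. A sketch over the matrices from the example:

// Explicit cleanup at the end of the session instead of trusting a finalizer.
for m in [a; b; c; c'2; bias; d; e; e2] do (m :> IDisposable).Dispose()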
