Map and reduce kernels and random functions.

Cuda runtime compilation

To make any kind of machine learning algorithm work, map operations are essential. Relu, sigmoid, tanh…for such activation functions, and for more complex operations like k-selection, it is necessary to apply maps to the result of Ax+b.

Until just last month, when I started working with ManagedCuda, most of my examples were written in Alea, a full .NET Cuda compilation solution. It has various advanced features, such as the ability to take anonymous functions inside its map kernels, compilation to PTX directly from F# code, and even a garbage collection scheme for its device objects.

Almost by accident, when I first rewrote the library in ManagedCuda – which has nowhere near as many advanced features as Alea, being mostly a wrapper library – I noticed a 3-4 fold speedup in a straight-up comparison. At the time of writing, I am not sure whether that is because ManagedCuda provides access to the latest Cuda libraries (7.5) while Alea is currently stuck on 6.5, or due to something else.

At any rate, interfacing with Cuda even with access to its libraries would be impossible without some means of doing runtime compilation. Thankfully, Nvidia provides exactly such a tool – NVRTC – as a part of its SDK.

It allows one to compile C++ code as a string like so:


let inline divup a b = (a+b-1)/b // Division with rounding up.

/// o <- f(x)
type DeviceUnaryTransformModule(op: string) = 
    let block_size = 256

    let kernel_code = "
        //Kernel code:
        extern \"C\" {
            __device__ inline "+FloatTypeCpp+" op("+FloatTypeCpp+" x)
            {
                return "+op+"
            }
        
            // Device code
            __global__ void Map1Kernel(const "+FloatTypeCpp+"* A, "+FloatTypeCpp+"* O, const int N)
            {
                int i = blockDim.x * blockIdx.x + threadIdx.x;
                const int stride = blockDim.x * gridDim.x;
                while (i < N)
                {
                    O[i] = op(A[i]);
                    i += stride;
                }
            }
        }

        "
    let k = new ManagedCuda.NVRTC.CudaRuntimeCompiler(kernel_code,"Map1Kernel")
    do  
        try k.Compile([||])
        with 
        | :? NVRTCException as x ->
            printfn "%s" (k.GetLogAsString())
            reraise()

    let kernel = ctx.LoadKernelPTX(k.GetPTX(),"Map1Kernel")

    member t.A(x: CudaDeviceVariable<floatType>) =
        let n = int x.Size
        let o = new_dev<floatType> n
        let gridSize = min (2*numSm*(1024/block_size)) (divup n block_size)
        kernel.GridDimensions <- dim3(gridSize)
        kernel.BlockDimensions <- dim3(block_size)
        kernel.RunAsync(str.Stream, x.DevicePointer,o.DevicePointer,n) |> ignore
        o

    member t.A(x: CudaDeviceVariable<floatType>, o: CudaDeviceVariable<floatType>) =
        let n = int o.Size
        let gridSize = min (2*numSm*(1024/block_size)) (divup n block_size)
        kernel.GridDimensions <- dim3(gridSize)
        kernel.BlockDimensions <- dim3(block_size)
        kernel.RunAsync(str.Stream, x.DevicePointer,o.DevicePointer,n) |> ignore

    member t.A(x: dMatrix) =
        let o = dMatrix.create(x.num_rows,x.num_cols)
        t.A(x,o)
        o

    member t.A(x: dMatrix, o: dMatrix) =
        if x.rc <> o.rc then failwith "x.rc <> o.rc in DeviceUnaryTransformModule"
        t.A(x.dArray,o.dArray)

When the above class is instantiated, NVRTC takes in the C++ string and compiles it into a kernel that can then be run like a class method.

Here is an example with the most commonly used ML activations.


let rng = System.Random()

let n = 9
let h_a = Array.init n (fun _ -> (rng.NextDouble()-0.5)*6.0 |> floatType)
let a = dMatrix.create(3,3,h_a)

let sigmoidModule = DeviceUnaryTransformModule "1.0f / (1.0f + expf(-x));"
let tanhModule = DeviceUnaryTransformModule "tanhf(x);"
let reluModule = DeviceUnaryTransformModule "x > 0.0f ? x : 0.0f;"

let sig_a = sigmoidModule.A(a)
let tanh_a = tanhModule.A(a)
let relu_a = reluModule.A(a)

let a' = a.Gather'()
let sig_a' = sig_a.Gather'()
let tanh_a' = tanh_a.Gather'()
let relu_a' = relu_a.Gather'()

Output:


val rng : Random
val n : int = 9
val h_a : float32 [] =
  [|-0.665023863f; -0.248744547f; 0.32134819f; 0.774180651f; -0.298591435f;
    -2.38772154f; 1.29249966f; 0.437697798f; 1.47583818f|]
val a : dMatrix = {num_rows = 3;
                   num_cols = 3;
                   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val sigmoidModule : DeviceUnaryTransformModule
val tanhModule : DeviceUnaryTransformModule
val reluModule : DeviceUnaryTransformModule
val sig_a : dMatrix =
  {num_rows = 3;
   num_cols = 3;
   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val tanh_a : dMatrix =
  {num_rows = 3;
   num_cols = 3;
   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val relu_a : dMatrix =
  {num_rows = 3;
   num_cols = 3;
   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val a' : floatType [,] = [[-0.665023863f; 0.774180651f; 1.29249966f]
                          [-0.248744547f; -0.298591435f; 0.437697798f]
                          [0.32134819f; -2.38772154f; 1.47583818f]]
val sig_a' : floatType [,] = [[0.339611977f; 0.684424579f; 0.784569979f]
                              [0.438132524f; 0.42590186f; 0.607710302f]
                              [0.579652786f; 0.0841137916f; 0.813943148f]]
val tanh_a' : floatType [,] = [[-0.581697047f; 0.649353862f; 0.859779835f]
                               [-0.24373816f; -0.290023059f; 0.411734343f]
                               [0.310725451f; -0.983272374f; 0.90068531f]]
val relu_a' : floatType [,] = [[0.0f; 0.774180651f; 1.29249966f]
                               [0.0f; 0.0f; 0.437697798f]
                               [0.32134819f; 0.0f; 1.47583818f]]

Seems decent enough. The above kernel just iterates over the elements in parallel with a grid-stride loop, applying the specified operation, which gets spliced in as a text string – much like an anonymous function.
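
Since the op is spliced into the kernel as raw C++, any expression over x will do, as long as it ends with a semicolon to complete the generated return statement. For instance, a leaky ReLU – my own example here, not something from the library – is a one-liner:


// A hypothetical extra activation built on the same module.
let leakyReluModule = DeviceUnaryTransformModule "x > 0.0f ? x : 0.01f*x;"
let lrelu_a = leakyReluModule.A(a)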

Here are the rest of the simple map kernels:


/// o <- f(x,y)
type DeviceBinaryTransformModule(op: string) = 
    let block_size = 256

    let kernel_code = "
        //Kernel code:
        extern \"C\" {
            __device__ inline "+FloatTypeCpp+" op("+FloatTypeCpp+" x, "+FloatTypeCpp+" y)
            {
                return "+op+"
            }
        
            // Device code
            __global__ void Map2Kernel(const "+FloatTypeCpp+"* A, const "+FloatTypeCpp+"* B, "+FloatTypeCpp+"* O, const int N)
            {
                int i = blockDim.x * blockIdx.x + threadIdx.x;
                const int stride = blockDim.x * gridDim.x;
                while (i < N)
                {
                    O[i] = op(A[i],B[i]);
                    i += stride;
                }
            }
        }

        "
    let k = new ManagedCuda.NVRTC.CudaRuntimeCompiler(kernel_code,"Map2Kernel")
    do  
        try k.Compile([||])
        with 
        | :? NVRTCException as x ->
            printfn "%s" (k.GetLogAsString())
            reraise()

    let kernel = ctx.LoadKernelPTX(k.GetPTX(),"Map2Kernel")

    member t.A(x: CudaDeviceVariable<floatType>, y: CudaDeviceVariable<floatType>) =
        let n = int x.Size
        let o = new_dev<floatType> n
        let gridSize = min (2*numSm*(1024/block_size)) (divup n block_size)
        kernel.GridDimensions <- dim3(gridSize)
        kernel.BlockDimensions <- dim3(block_size)
        kernel.RunAsync(str.Stream, x.DevicePointer,y.DevicePointer,o.DevicePointer,n) |> ignore
        o

    member t.A(x: CudaDeviceVariable<floatType>, y: CudaDeviceVariable<floatType>, o: CudaDeviceVariable<floatType>) =
        let n = int o.Size
        let gridSize = min (2*numSm*(1024/block_size)) (divup n block_size)
        kernel.GridDimensions <- dim3(gridSize)
        kernel.BlockDimensions <- dim3(block_size)
        kernel.RunAsync(str.Stream, x.DevicePointer,y.DevicePointer,o.DevicePointer,n) |> ignore

    member t.A(x: dMatrix, y: dMatrix) =
        let o = dMatrix.create(x.num_rows,x.num_cols)
        t.A(x,y,o)
        o

    member t.A(x: dMatrix, y: dMatrix, o: dMatrix) =
        if x.rc <> y.rc then failwith "x.rc <> y.rc in DeviceBinaryTransformModule"
        if y.rc <> o.rc then failwith "y.rc <> o.rc in DeviceBinaryTransformModule"
        t.A(x.dArray,y.dArray,o.dArray)

/// o <- f(x,y,z)
type DeviceTrinaryTransformModule(op: string) = 
    let block_size = 256

    let kernel_code = "
        //Kernel code:
        extern \"C\" {
            __device__ inline "+FloatTypeCpp+" op("+FloatTypeCpp+" x, "+FloatTypeCpp+" y, "+FloatTypeCpp+" z)
            {
                return "+op+"
            }
        
            // Device code
            __global__ void Map3Kernel(const "+FloatTypeCpp+"* A, const "+FloatTypeCpp+"* B, const "+FloatTypeCpp+"* C, "+FloatTypeCpp+"* O, const int N)
            {
                int i = blockDim.x * blockIdx.x + threadIdx.x;
                const int stride = blockDim.x * gridDim.x;
                while (i < N)
                {
                    O[i] = op(A[i],B[i],C[i]);
                    i += stride;
                }
            }
        }

        "
    let k = new ManagedCuda.NVRTC.CudaRuntimeCompiler(kernel_code,"Map3Kernel")
    do  
        try k.Compile([||])
        with 
        | :? NVRTCException as x ->
            printfn "%s" (k.GetLogAsString())
            reraise()

    let kernel = ctx.LoadKernelPTX(k.GetPTX(),"Map3Kernel")

    member t.A(x: CudaDeviceVariable<floatType>, y: CudaDeviceVariable<floatType>, z: CudaDeviceVariable<floatType>) =
        let n = int x.Size
        let o = new_dev<floatType> n
        let gridSize = min (2*numSm*(1024/block_size)) (divup n block_size)
        kernel.GridDimensions <- dim3(gridSize)
        kernel.BlockDimensions <- dim3(block_size)
        kernel.RunAsync(str.Stream, x.DevicePointer,y.DevicePointer,z.DevicePointer,o.DevicePointer,n) |> ignore
        o

    member t.A(x: CudaDeviceVariable<floatType>, y: CudaDeviceVariable<floatType>, z: CudaDeviceVariable<floatType>, o: CudaDeviceVariable<floatType>) =
        let n = int o.Size
        let gridSize = min (2*numSm*(1024/block_size)) (divup n block_size)
        kernel.GridDimensions <- dim3(gridSize)
        kernel.BlockDimensions <- dim3(block_size)
        kernel.RunAsync(str.Stream, x.DevicePointer,y.DevicePointer,z.DevicePointer,o.DevicePointer,n) |> ignore

    member t.A(x: dMatrix, y: dMatrix, z: dMatrix) =
        let o = dMatrix.create(x.num_rows,x.num_cols)
        t.A(x,y,z,o)
        o

    member t.A(x: dMatrix, y: dMatrix, z: dMatrix, o: dMatrix) =
        if x.rc <> y.rc then failwith "x.rc <> y.rc in DeviceTrinaryTransformModule"
        if y.rc <> z.rc then failwith "y.rc <> z.rc in DeviceTrinaryTransformModule"
        if z.rc <> o.rc then failwith "z.rc <> o.rc in DeviceTrinaryTransformModule"
        t.A(x.dArray,y.dArray,z.dArray,o.dArray)

/// o <- sum(f(x))
type DeviceUnaryMapSumModule(op: string) = 
    let block_size = 256

    let kernel_code = "
        //Kernel code:
        extern \"C\" {
            __device__ inline "+FloatTypeCpp+" op("+FloatTypeCpp+" x)
            {
                return "+op+"
            }
        
            __device__ inline "+FloatTypeCpp+" warpDownReduce("+FloatTypeCpp+" value){
	            for (int i = 16; i>0; i = i / 2) value += __shfl_down(value, i);
	            return value;
            }

            // Device code
            __global__ void MapSumKernel(const "+FloatTypeCpp+"* A, "+FloatTypeCpp+"* O, const int N)
            {
	            int i = blockDim.x * blockIdx.x + threadIdx.x;
	            const int stride = blockDim.x * gridDim.x;
	            __shared__ "+FloatTypeCpp+" temp[32];
                if (threadIdx.x < 32) temp[threadIdx.x] = 0.0f; "+FloatTypeCpp+" acc = 0.0f;
	            while (i < N)
	            {
		            acc += op(A[i]);
		            i += stride;
	            }
	            __syncthreads(); "+FloatTypeCpp+" out_partial = warpDownReduce(acc);
	            if (threadIdx.x % 32 == 0) temp[threadIdx.x / 32] = out_partial;
	            __syncthreads();
	            if (threadIdx.x < 32) out_partial = warpDownReduce(temp[threadIdx.x]);
	            if (threadIdx.x == 0) atomicAdd(O, out_partial);
            }
        }

        "
    let k = new ManagedCuda.NVRTC.CudaRuntimeCompiler(kernel_code,"MapSumKernel")
    do  
        try k.Compile([|"-arch=compute_30"|])
        with 
        | :? NVRTCException as x ->
            printfn "%s" (k.GetLogAsString())
            reraise()

    let kernel = ctx.LoadKernelPTX(k.GetPTX(),"MapSumKernel")

    member t.A(x: CudaDeviceVariable<floatType>) =
        let n = int x.Size
        use o = new_dev<floatType> 1
        o.Memset(0u)
        let gridSize = min (2*numSm*(1024/block_size)) (divup n block_size)
        kernel.GridDimensions <- dim3(gridSize)
        kernel.BlockDimensions <- dim3(block_size)
        kernel.RunAsync(str.Stream, x.DevicePointer,o.DevicePointer,n) |> ignore
        o.[SizeT 0]

    member t.A(x: dMatrix) =
        t.A(x.dArray)

/// o <- sum(f(x,y))
type DeviceBinaryMapSumModule(op: string) = 
    let block_size = 256

    let kernel_code = "
        //Kernel code:
        extern \"C\" {
            __device__ inline "+FloatTypeCpp+" op("+FloatTypeCpp+" x, "+FloatTypeCpp+" y)
            {
                return "+op+"
            }
        
            __device__ inline "+FloatTypeCpp+" warpDownReduce("+FloatTypeCpp+" value){
	            for (int i = 16; i>0; i = i / 2) value += __shfl_down(value, i);
	            return value;
            }

            // Device code
            __global__ void Map2SumKernel(const "+FloatTypeCpp+"* A, const "+FloatTypeCpp+"* B, "+FloatTypeCpp+"* O, const int N)
            {
	            int i = blockDim.x * blockIdx.x + threadIdx.x;
	            const int stride = blockDim.x * gridDim.x;
	            __shared__ "+FloatTypeCpp+" temp[32]; 
                if (threadIdx.x < 32) temp[threadIdx.x] = 0.0f; "+FloatTypeCpp+" acc = 0.0f;
	            while (i < N)
	            {
		            acc += op(A[i],B[i]);
		            i += stride;
	            }
	            __syncthreads(); "+FloatTypeCpp+" out_partial = warpDownReduce(acc);
	            if (threadIdx.x % 32 == 0) temp[threadIdx.x / 32] = out_partial;
	            __syncthreads();
	            if (threadIdx.x < 32) out_partial = warpDownReduce(temp[threadIdx.x]);
	            if (threadIdx.x == 0) atomicAdd(O, out_partial);
            }
        }

        "
    let k = new ManagedCuda.NVRTC.CudaRuntimeCompiler(kernel_code,"Map2SumKernel")
    do  
        try k.Compile([|"-arch=compute_30"|])
        with 
        | :? NVRTCException as x ->
            printfn "%s" (k.GetLogAsString())
            reraise()

    let kernel = ctx.LoadKernelPTX(k.GetPTX(),"Map2SumKernel")

    member t.A(x: CudaDeviceVariable<floatType>,y: CudaDeviceVariable<floatType>) =
        let n = int x.Size
        use o = new_dev<floatType> 1
        o.Memset(0u)
        let gridSize = min (2*numSm*(1024/block_size)) (divup n block_size)
        kernel.GridDimensions <- dim3(gridSize)
        kernel.BlockDimensions <- dim3(block_size)
        kernel.RunAsync(str.Stream, x.DevicePointer,y.DevicePointer,o.DevicePointer,n) |> ignore
        o.[SizeT 0]

    member t.A(x: dMatrix,y: dMatrix) =
        if x.rc <> y.rc then failwith "x.rc <> y.rc in DeviceBinaryMapSumModule"
        t.A(x.dArray,y.dArray)

/// o <- f(coef_x,x)
type DeviceUnaryCoefTransformModule(op: string) = 
    let block_size = 256

    let kernel_code = "
        //Kernel code:
        extern \"C\" {
            __device__ inline "+FloatTypeCpp+" op("+FloatTypeCpp+" coef_x, "+FloatTypeCpp+" x)
            {
                return "+op+"
            }
        
            // Device code
            __global__ void MapCoefKernel(const "+FloatTypeCpp+" coef_A, const "+FloatTypeCpp+"* A, "+FloatTypeCpp+"* O, const int N)
            {
                int i = blockDim.x * blockIdx.x + threadIdx.x;
                const int stride = blockDim.x * gridDim.x;
                while (i < N)
                {
                    O[i] = op(coef_A,A[i]);
                    i += stride;
                }
            }
        }

        "
    let k = new ManagedCuda.NVRTC.CudaRuntimeCompiler(kernel_code,"MapCoefKernel")
    do  
        try k.Compile([||])
        with 
        | :? NVRTCException as x ->
            printfn "%s" (k.GetLogAsString())
            reraise()

    let kernel = ctx.LoadKernelPTX(k.GetPTX(),"MapCoefKernel")

    member t.A(coef_x: floatType, x: CudaDeviceVariable<floatType>) =
        let n = int x.Size
        let o = new_dev<floatType> n
        let gridSize = min (2*numSm*(1024/block_size)) (divup n block_size)
        kernel.GridDimensions <- dim3(gridSize)
        kernel.BlockDimensions <- dim3(block_size)
        kernel.RunAsync(str.Stream, coef_x,x.DevicePointer,o.DevicePointer,n) |> ignore
        o

    member t.A(coef_x: floatType, x: CudaDeviceVariable<floatType>, o: CudaDeviceVariable<floatType>) =
        let n = int o.Size
        let gridSize = min (2*numSm*(1024/block_size)) (divup n block_size)
        kernel.GridDimensions <- dim3(gridSize)
        kernel.BlockDimensions <- dim3(block_size)
        kernel.RunAsync(str.Stream, coef_x,x.DevicePointer,o.DevicePointer,n) |> ignore

    member t.A(coef_x, x: dMatrix) =
        let o = dMatrix.create(x.num_rows,x.num_cols)
        t.A(coef_x,x,o)
        o

    member t.A(coef_x, x: dMatrix, o: dMatrix) =
        if x.rc <> o.rc then failwith "x.rc <> o.rc in DeviceUnaryCoefTransformModule"
        t.A(coef_x,x.dArray,o.dArray)

/// o <- f(coef_x,x,coef_y,y)
type DeviceBinaryCoefTransformModule(op: string) = 
    let block_size = 256

    let kernel_code = "
        //Kernel code:
        extern \"C\" {
            __device__ inline "+FloatTypeCpp+" op("+FloatTypeCpp+" coef_x, "+FloatTypeCpp+" x, "+FloatTypeCpp+" coef_y, "+FloatTypeCpp+" y)
            {
                return "+op+"
            }
        
            // Device code
            __global__ void MapCoef2Kernel(const "+FloatTypeCpp+" coef_A, const "+FloatTypeCpp+"* A, const "+FloatTypeCpp+" coef_B, const "+FloatTypeCpp+"* B, "+FloatTypeCpp+"* O, const int N)
            {
                int i = blockDim.x * blockIdx.x + threadIdx.x;
                const int stride = blockDim.x * gridDim.x;
                while (i < N)
                {
                    O[i] = op(coef_A,A[i],coef_B,B[i]);
                    i += stride;
                }
            }
        }

        "
    let k = new ManagedCuda.NVRTC.CudaRuntimeCompiler(kernel_code,"MapCoef2Kernel")
    do  
        try k.Compile([||])
        with 
        | :? NVRTCException as x ->
            printfn "%s" (k.GetLogAsString())
            reraise()

    let kernel = ctx.LoadKernelPTX(k.GetPTX(),"MapCoef2Kernel")

    member t.A(coef_x: floatType, x: CudaDeviceVariable<floatType>,coef_y: floatType, y: CudaDeviceVariable<floatType>) =
        let n = int x.Size
        let o = new_dev<floatType> n
        let gridSize = min (2*numSm*(1024/block_size)) (divup n block_size)
        kernel.GridDimensions <- dim3(gridSize)
        kernel.BlockDimensions <- dim3(block_size)
        kernel.RunAsync(str.Stream, coef_x,x.DevicePointer,coef_y, y.DevicePointer,o.DevicePointer,n) |> ignore
        o

    member t.A(coef_x: floatType, x: CudaDeviceVariable<floatType>, coef_y: floatType, y: CudaDeviceVariable<floatType>, o: CudaDeviceVariable<floatType>) =
        let n = int o.Size
        let gridSize = min (2*numSm*(1024/block_size)) (divup n block_size)
        kernel.GridDimensions <- dim3(gridSize)
        kernel.BlockDimensions <- dim3(block_size)
        kernel.RunAsync(str.Stream, coef_x,x.DevicePointer,coef_y,y.DevicePointer,o.DevicePointer,n) |> ignore

    member t.A(coef_x, x: dMatrix, coef_y, y: dMatrix) =
        let o = dMatrix.create(x.num_rows,x.num_cols)
        t.A(coef_x,x,coef_y,y,o)
        o

    member t.A(coef_x, x: dMatrix, coef_y, y: dMatrix, o: dMatrix) =
        if x.rc <> y.rc then failwith "x.rc <> y.rc in DeviceBinaryCoefTransformModule"
        if y.rc <> o.rc then failwith "y.rc <> o.rc in DeviceBinaryCoefTransformModule"
        t.A(coef_x,x.dArray,coef_y,y.dArray,o.dArray)

The DeviceMapSum modules do a block sum reduction after applying a map function, then add up the per-block results using atomics. Despite how it might look, on newer Nvidia cards bank conflicts are resolved in the L2 cache, making atomics quite fast for reductions.
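
As a usage sketch – sumModule is my own example name, not part of the library – a plain sum over a matrix is just the identity op fed into the unary module:


// Summing the elements of a matrix with the identity op.
let sumModule = DeviceUnaryMapSumModule "x;"

let m = dMatrix.create(3,3,Array.init 9 (fun i -> floatType (i+1))) // elements 1..9
let total = sumModule.A(m) // evaluates to 45.0f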

It might be good to replace my custom reductions with CUB library calls, but I am afraid that it might make compilation times even slower than they already are.

Cuda has a ways to go before it catches up to native code compilation speeds.

The hidden purpose of dealing with this low level stuff is to build some mental muscle for when neuromorphic architectures arrive. Hard as Cuda is to code for, the magic brain chips fresh off the presses will be much harder still. All the libraries will have to be pretty much redone when that happens, but the design patterns will remain intact, and therein lies the value of having firm foundations.

Device random functions

The DeviceCoefTransform modules are similar to the standard map ones, except that they also take extra scalar arguments. With those one can implement things like gradient clipping, or rescale and translate matrices.


// The gradient clipping module.
let gradclipModule = DeviceUnaryCoefTransformModule "(x < -coef_x) ? -coef_x : (x > coef_x ? coef_x : x);"
   
// coef_x = scale
// coef_y = location
// y does not get used.
let randMapModule = DeviceBinaryCoefTransformModule "coef_x*(x-0.5f)+coef_y;"

type dMatrix with
    /// Generates a matrix sampled from a random uniform distribution in (-1.0f,1.0f]
    static member createRandomUniformMatrix weights_num_rows weights_num_cols (scaling_factor : floatType) location =
        let weights_total_size = weights_num_rows*weights_num_cols
        
        let cudaBuffer = new_dev<floatType> weights_total_size
        cudaRandom.GenerateUniform(cudaBuffer)

        // 2.0f*scaling_factor ensures that it is rescaled around zero if the scaling_factor is 1.0f.
        randMapModule.A(2.0f*scaling_factor,cudaBuffer,location,cudaBuffer,cudaBuffer)

        dMatrix.create(weights_num_rows,weights_num_cols,cudaBuffer)

    /// Fills the matrix by sampling from a random uniform distribution in (-1.0f,1.0f]
    member t.fillRandomUniformMatrix (scaling_factor : floatType) location =
        let weights_total_size = t.num_rows*t.num_cols

        cudaRandom.GenerateUniform(t.dArray)
        // 2.0f*scaling_factor ensures that it is rescaled around zero if the scaling_factor is 1.0f.
        randMapModule.A(2.0f*scaling_factor,t.dArray,location,t.dArray,t.dArray)

Again I extend the dMatrix type, this time giving it the ability to create random matrices. Here is a demonstration of the things built so far working in tandem:


let W = dMatrix.createRandomUniformMatrix     3 2 1.0f 0.0f
let bias = dMatrix.createRandomUniformMatrix  3 1 1.0f 0.0f
let W2 = dMatrix.createRandomUniformMatrix    1 3 1.0f 0.0f
let bias2 = dMatrix.createRandomUniformMatrix 1 1 1.0f 0.0f

let input = dMatrix.create(2,4,[|0.0f;0.0f;0.0f;1.0f;1.0f;0.0f;1.0f;1.0f|])
let targets = dMatrix.create(1,4,[|0.0f;1.0f;1.0f;0.0f|])

let inline mm a b = gemm nT nT 1.0f a b
let inline badd a b = broadcastAdd 1.0f a 1.0f b
let inline tanh_act (a: dMatrix) = tanhModule.A(a)
let inline sig_act (a: dMatrix) = sigmoidModule.A(a)

// tanh(W*input + bias)
let a1 = badd (mm W input) bias |> tanh_act
// sigmoid(W2*a1 + bias2)
let o = badd (mm W2 a1) bias2 |> sig_act

let W' = W.Gather'()
let bias' = bias.Gather'()
let W2' = W2.Gather'()
let bias2' = bias2.Gather'()
let input' = input.Gather'()
let targets' = targets.Gather'()

let a1' = a1.Gather'()
let o' = o.Gather'()

let squareDifferenceModule = new DeviceBinaryMapSumModule "(x-y)*(x-y);"

let L2_cost = squareDifferenceModule.A(targets,o)

Here is an example run:


val W : dMatrix = {num_rows = 3;
                   num_cols = 2;
                   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val bias : dMatrix =
  {num_rows = 3;
   num_cols = 1;
   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val W2 : dMatrix = {num_rows = 1;
                    num_cols = 3;
                    dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val bias2 : dMatrix =
  {num_rows = 1;
   num_cols = 1;
   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val input : dMatrix =
  {num_rows = 2;
   num_cols = 4;
   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val targets : dMatrix =
  {num_rows = 1;
   num_cols = 4;
   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val inline mm : a:dMatrix -> b:dMatrix -> dMatrix
val inline badd : a:dMatrix -> b:dMatrix -> dMatrix
val inline tanh_act : a:dMatrix -> dMatrix
val inline sig_act : a:dMatrix -> dMatrix
val a1 : dMatrix = {num_rows = 3;
                    num_cols = 4;
                    dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val o : dMatrix = {num_rows = 1;
                   num_cols = 4;
                   dArray = ManagedCuda.CudaDeviceVariable`1[System.Single];}
val W' : floatType [,] = [[-0.443403065f; -0.249918759f]
                          [0.238359213f; -0.51902771f]
                          [0.175117373f; -0.170489728f]]
val bias' : floatType [,] = [[-0.812610984f]
                             [0.265104771f]
                             [-0.430138469f]]
val W2' : floatType [,] = [[0.558640003f; -0.573411465f; -0.173779786f]]
val bias2' : floatType [,] = [[-0.934492171f]]
val input' : floatType [,] = [[0.0f; 0.0f; 1.0f; 1.0f]
                              [0.0f; 1.0f; 0.0f; 1.0f]]
val targets' : floatType [,] = [[0.0f; 1.0f; 1.0f; 0.0f]]
val a1' : floatType [,] =
  [[-0.671028078f; -0.786630154f; -0.849961519f; -0.906214595f]
   [0.259063959f; -0.248602718f; 0.464837015f; -0.0155624701f]
   [-0.405437022f; -0.537496448f; -0.249632731f; -0.40156284f]]
val o' : floatType [,] =
  [[0.199815348f; 0.242691875f; 0.163491398f; 0.203910157f]]
val squareDifferenceModule : DeviceBinaryMapSumModule
val L2_cost : floatType = 1.3547678f

The above is a single forward pass on the XOR problem, and we now have everything we need to make the backward pass.

The backward pass, as I have unfortunately found out, is especially hard to get right, and even more so for complex architectures like LSTMs.

For the simple feedforward case above one could derive it by hand, though it is hardly worth the bother.
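
Still, to show what the map modules buy us, here is a hand-derived sketch of just the first backward step for the net above, using only the modules from this post. For the L2 cost sum((x-y)^2), the derivative with respect to the output is 2*(o-targets), and chaining through the sigmoid multiplies that by o*(1-o). The module names are my own, and the weight gradients via gemm are left out:


// dL/do for the squared error cost.
let errorModule = DeviceBinaryTransformModule "2.0f*(x-y);"
// x is the incoming error, y is the activation o; o*(1-o) is the sigmoid's derivative.
let sigmoidGradModule = DeviceBinaryTransformModule "x*y*(1.0f-y);"

let er = errorModule.A(o,targets)        // 2*(o - targets)
let grad_out = sigmoidGradModule.A(er,o) // error at the output layer's preactivation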

As I couldn’t prevent WordPress from stuffing random html into my code this time, all the examples as they are written can be found in the tutorial examples directory. The current post is post 5.
