The first steps to running the library

64-bit Mode

Assuming you have F# installed, the first step is to run F# Interactive in 64-bit mode. In Visual Studio, go to Tools -> Options and type F# into the search bar, then go to F# Tools -> F# Interactive. Enable both debugging and 64-bit mode. Debugging is for later convenience and won’t slow down the program in any case, but 64-bit mode is necessary to run the Cuda libraries: starting with version 7.0, Nvidia dropped support for the 32-bit versions.

The Spiral Library

I plan to build up the library step by step in the following chapters, but if one wants to peek ahead, here is the repository with an already finished version.

It is similar to Andrej Karpathy’s Javascript library and AndyP’s Julia one (which was also inspired by Karpathy’s), except with more of a focus on getting the most out of one’s hardware.

At the time of this writing, GPU programming is nowhere near as easy as standard CPU code, and even if one just intends to call routines packaged in a library, there are difficulties one must work around.

F# Interactive and running ManagedCuda

F# is a statically typed functional language that can be compiled to an executable, but that can also be interpreted inside the IDE, much like Python.

All the examples that follow should be run from the F# Interactive.
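For example, sending a single binding to F# Interactive evaluates it on the spot and echoes the inferred type:


let x = 2 + 2
// F# Interactive responds with:
// val x : int = 4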

Let’s test it out. To do anything at all, we must first reference the ManagedCuda assemblies. That can be done like so:


// The Spiral library v1. Basic reverse mode AD on the GPU.

#r "../packages/ManagedCuda-75-x64.7.5.7/lib/net45/x64/ManagedCuda.dll"
#r "../packages/ManagedCuda-75-x64.7.5.7/lib/net45/x64/NVRTC.dll"
#r "../packages/ManagedCuda-75-x64.7.5.7/lib/net45/x64/CudaBlas.dll"
#r "../packages/ManagedCuda-75-x64.7.5.7/lib/net45/x64/CudaRand.dll"
#r "../packages/ManagedCuda-75-x64.7.5.7/lib/net45/x64/NPP.dll"
#r "../packages/ManagedCuda-CudaDNN.3.0/lib/net45/CudaDNN.dll"

To run the lines, select them, right-click, and choose Send to F# Interactive (or press Alt+Enter). If the packages are installed, something like the following should show up.


--> Referenced 'C:\Users\Marko\Documents\Visual Studio 2015\Projects\Spiral Library\Tutorial examples\../packages/ManagedCuda-75-x64.7.5.7/lib/net45/x64/ManagedCuda.dll'
--> Referenced 'C:\Users\Marko\Documents\Visual Studio 2015\Projects\Spiral Library\Tutorial examples\../packages/ManagedCuda-75-x64.7.5.7/lib/net45/x64/NVRTC.dll'
--> Referenced 'C:\Users\Marko\Documents\Visual Studio 2015\Projects\Spiral Library\Tutorial examples\../packages/ManagedCuda-75-x64.7.5.7/lib/net45/x64/CudaBlas.dll'
--> Referenced 'C:\Users\Marko\Documents\Visual Studio 2015\Projects\Spiral Library\Tutorial examples\../packages/ManagedCuda-75-x64.7.5.7/lib/net45/x64/CudaRand.dll'
--> Referenced 'C:\Users\Marko\Documents\Visual Studio 2015\Projects\Spiral Library\Tutorial examples\../packages/ManagedCuda-75-x64.7.5.7/lib/net45/x64/NPP.dll'
--> Referenced 'C:\Users\Marko\Documents\Visual Studio 2015\Projects\Spiral Library\Tutorial examples\../packages/ManagedCuda-CudaDNN.3.0/lib/net45/CudaDNN.dll'

That is roughly all the setup necessary to get running.

Setting up Cuda libraries inside the script and the importance of streams

Unlike host (CPU) code, which tends to run on a single thread, GPU kernels are launched asynchronously for performance reasons. To do that, they need to be issued on a stream so that the scheduler can manage them, and the stream must belong to a context, which is analogous to a CPU process.


// Open up the namespaces.
open ManagedCuda
open ManagedCuda.BasicTypes
open ManagedCuda.VectorTypes
open ManagedCuda.CudaBlas
open ManagedCuda.CudaRand
open ManagedCuda.NVRTC
open ManagedCuda.CudaDNN

open System
open System.IO
open System.Collections

// Initialize the context. Analogous to a CPU process. Cuda tries to offload as much as possible during context creation so there aren't
// any unexpected delays later.
let ctx = new CudaContext()
let numSm = ctx.GetDeviceInfo().MultiProcessorCount // The number of streaming multiprocessors on the device.

// Make a stream class.
let str = new CudaStream()
// Set the Cuda libraries handles to the above stream.
let cublas = new CudaBlas(str.Stream)
let cudnn = new CudaDNN.CudaDNNContext()
cudnn.SetStream(str)
let cudaRandom = new CudaRand.CudaRandDevice(GeneratorType.PseudoDefault)
cudaRandom.SetStream(str.Stream)

// Type aliasing trick to make Spiral more generic. It is incomplete at the moment though, due to Cuda math functions being non-overloadable.
type floatType = float32
let inline floatType x = float32 x
let FloatTypeCpp = "float"
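
With the handles in place, a quick sanity check can confirm that the libraries really are talking to the device. The following is only a sketch: it relies on ManagedCuda’s CudaRandDevice.GenerateUniform, which fills a buffer asynchronously on the stream, and on CudaContext.Synchronize to wait for the device before reading the result back.


// Sanity check (sketch): fill a small device buffer with uniform random
// numbers using the cudaRandom handle above, then copy it back to the host.
let sanity_check() =
    use d_a = new CudaDeviceVariable<floatType>(SizeT 4)
    cudaRandom.GenerateUniform(d_a) // Queued asynchronously on str.
    ctx.Synchronize() // Wait for the device to finish before reading.
    let h_a = Array.zeroCreate<floatType> 4
    d_a.CopyToHost(h_a)
    h_a

sanity_check() // Evaluates to four floats in [0,1).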

Simple helper functions for moving data to and from the device


/// Copies a host array to device.
let inline to_dev (host_ar: 't []) =
    let d_a = new CudaDeviceVariable<'t>(SizeT host_ar.Length)    
    d_a.CopyToDevice(host_ar)
    d_a

/// Copies a device array to host.
let inline to_host (dev_ar: CudaDeviceVariable<'t>) =
    let h_a = Array.zeroCreate<'t> (int dev_ar.Size)
    dev_ar.CopyToHost(h_a)
    h_a

/// Copies the device array to host. Extends the CudaDeviceVariable class.
type CudaDeviceVariable<'t when 't: struct and 't: (new: unit -> 't) and 't:> System.ValueType> with
    member inline this.Gather() =
        to_host this

/// Allocates a new device array without initializing it.
let inline new_dev<'t when 't: struct and 't: (new: unit -> 't) and 't:> System.ValueType> (n: int) =
    new CudaDeviceVariable<'t>(SizeT n)

A short example would be as follows:


let a = [|1.0f;2.0f;3.0f|]
let a' = to_dev a
a'.[SizeT 0] <- 5.0f // Annoyingly, it is necessary to explicitly convert ints to SizeT when accessing individual items of the CudaDeviceVariable class.
let b = to_host a'

This prints out:


val a : float32 [] = [|1.0f; 2.0f; 3.0f|]
val a' : ManagedCuda.CudaDeviceVariable<float32>
val b : float32 [] = [|5.0f; 2.0f; 3.0f|]

A word of warning: for performance reasons, you should never iterate over the elements of a Cuda array individually as in the above. It is the absolutely wrong way to do it and will kill the performance of an algorithm.

Moving data back and forth between device and host is one of the slowest operations there is even in batch mode, and doing it element by element as above is doubly so. For that reason, the library only ever uses it to gather the results of a reduction operation.
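
To make the contrast concrete, here is a sketch of the batched alternative using the helpers defined earlier: prepare everything on the host, move it over in one transfer, and bring the results back in one transfer.


let big = Array.init 1000000 (fun i -> floatType i)
let d_big = to_dev big // A single bulk host-to-device copy.
// ...device-side kernels would operate on d_big here...
let h_big = d_big.Gather() // A single bulk device-to-host copy.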
