A few months ago, I spent an enormous amount of time implementing the k-sparse autoencoder as practice for machine learning. In fact, for this new year, I wanted to make this the subject of the autoencoder tutorial, but the method in the paper suffers from some drawbacks. It was fun playing around, but to get the k-sparse autoencoder working properly, one needs to gradually reduce the k-value to its optimal point which is needlessly time consuming.

The k-sparse function is interesting given its applications to sparsity. It is pretty much interchangeable with Relu.

For example, if one actually trains a net with Relu units with a large learning rate and high momentum and uses the k-sparse units during test time, it is possible to get 98% or more which was surprising to me at the time and taught me that a high learning rate combined with momentum has natural regularization properties.

I would not seriously use it as unfortunately as it requires a select operation on the columns which requires iterating over them which in turn requires storing them in local memory. When the columns are relatively medium sized like 1024 that is fine, but anymore than that would put put strain on the available registers on the GPU. The columns being large also puts pressure on the algorithm itself. It has to be more complex in order to the accommodate to the changed nature of the problem.

When the columns are small like 128, it allows one to use a relatively simple select like iteratively finding the next lowest max starting from positive infinity.

This is one of the big themes of parallel programming, in that changing the size of the collection also changes the structure of the problem.

The WTA autoencoder is similar to the k-sparse one in the fully connected case, except there is a transpose operation applied before the select.

/// o <- max_k(tranpose_if(x)) /// Sets all except the k number of max of a column to zero. /// Unlike for the other modules, to ensure it works swiftly, the column size and the number of iterations is fixed so the compiler can unroll the loops. type DeviceMaxSelectColumnActivationModule(column_size: int) = let block_size_x = 32 // This should not be changed for this module. Small block sizes such as these are much more efficient on Maxwell. let block_size_y = 16 // This should be a mutliple of 2. On the GTX 970, 16 seems to work best, although 4,8,16,32 are quite close. let kernel_code = " //Kernel code: extern \"C\" { #define INIT_MIN __int_as_float(0xff800000) // The constant init for the reduce operations. This is the float negative infinity. #define INIT_MAX __int_as_float(0x7f800000) // The constant init for the reduce operations. This is the float positive infinity. #define NUM_VARS "+string (divup column_size 32)+" // This is to ensure that the number of variables is encoded as a constant. #define NUM_COLS_32 "+string ((divup column_size 32)*32)+" // The column size has to be static for the shared memory array. This also to ensures it is a multiple of 32. #define BLOCK_SIZE_Y "+string block_size_y+" // I am not sure whether gridDim.x is a compile time constant. It was not in Alea. #define ROW_ITERS "+string (divup 32 block_size_y)+" // The number of iterations that should be done across the rows in shared memory. typedef "+FloatTypeCpp+" floatType; // The max reduce version. __device__ inline floatType warpReduce(floatType value){ #pragma unroll for (int i=1; i<32; i*=2) { floatType tmp = __shfl_xor(value, i); value = (tmp > value) ? tmp : value; } return value; } // Device code __global__ void Kernel(const floatType* A, floatType* O, const int num_rows, const int num_cols, const int k, const int transpose) { if (transpose) { __shared__ floatType ar[32][NUM_COLS_32]; // The reason why the second dimension is a mutliple of 32 is so that in the warp reduction phase, there are no inactive threads. // Inactive threads during the warp shuffle give undefined values. One has to go an extra mile to ensure that they are defined. { const int row = threadIdx.x; const int row_idx = row+blockIdx.x*32; #pragma unroll // Unroll the loops for performance, though it probably will not work on this part as the col = threadIdx.y is non static. for (int col = threadIdx.y; col < NUM_COLS_32; col += BLOCK_SIZE_Y) { // Stores everything into shared memory first by reading from the global in a contiguous manner. ar[row][col] = (row_idx < num_rows && col < num_cols) ? A[row_idx+col*num_rows] : INIT_MIN; } } __syncthreads(); floatType vars[NUM_VARS]; // The local array size needs to be made constant so the variables there get stored into registers instead of being spilled into global memory. #pragma unroll // Unroll the loop for performance. All the variables in the loop conditional are static. for (int row_iter=0; row_iter < ROW_ITERS; row_iter++) { // This loop does max selection on the rows in shared memory. // Unlike for the global memory, not only is shared memory much faster, but it does not need to be read // contiguosly to hit peak performance. // From shared memory, I put it into the local register memory and operate on that for even further speed gains. // This transposed kernel is adapted from the one in the other branch by adding the shared memory steps. floatType upper_bound = INIT_MAX; // This is the positive infinity for floats. floatType lower_bound = INIT_MIN; // This is the negative infinity for floats. #pragma unroll // Loop unrolling for improved performance. For this to work the number of unrolls has to be defined as a constant. for (int i=0; i < NUM_VARS;i++) { const int col = threadIdx.x + i*32; const int row_idx = threadIdx.y + row_iter*BLOCK_SIZE_Y; if (row_idx < 32) vars[i] = ar[row_idx][col]; } for (int iters=1; iters <= k; iters++){ #pragma unroll for (int i=0; i < NUM_VARS;i++) { if (vars[i] < upper_bound && lower_bound < vars[i]) lower_bound = vars[i]; } upper_bound = warpReduce(lower_bound); // Lowers the upper bound. lower_bound = INIT_MIN; } #pragma unroll for (int i=0; i < NUM_VARS;i++) { const int col = threadIdx.x + i*32; const int row_idx = threadIdx.y + row_iter*BLOCK_SIZE_Y; if (row_idx < 32) ar[row_idx][col] = (vars[i] < upper_bound) ? 0.0f : vars[i]; } } __syncthreads(); { const int row = threadIdx.x; const int row_idx = row+blockIdx.x*32; #pragma unroll for (int col = threadIdx.y; col < NUM_COLS_32; col += BLOCK_SIZE_Y) { if (row_idx < num_rows && col < num_cols) O[row_idx+col*num_rows] = ar[row][col]; } } } else { // Does not need a to do a tranpose so it reads directly off global memory. //int row = threadIdx.x; //const int col = blockIdx.x; const int col_idx = blockIdx.x*num_rows; floatType upper_bound = INIT_MAX; // This is the positive infinity for floats. floatType lower_bound = INIT_MIN; // This is the negative infinity for floats. floatType vars[NUM_VARS]; // The local array size needs to be made constant so the variables there get stored into registers instead of being spilled into global memory. #pragma unroll // Loop unrolling for improved performance. For this to work the number of unrolls has to be defined as a constant. for (int i=0; i < NUM_VARS;i++) { const int row = threadIdx.x + i*32; const int idx = row+col_idx; vars[i] = (row < num_rows) ? A[idx] : INIT_MIN; } for (int iters=1; iters <= k; iters++){ #pragma unroll for (int i=0; i < NUM_VARS;i++) { const int row = threadIdx.x + i*32; if (vars[i] < upper_bound && lower_bound < vars[i]) lower_bound = vars[i]; } upper_bound = warpReduce(lower_bound); // Lowers the upper bound. lower_bound = INIT_MIN; } #pragma unroll for (int i=0; i < NUM_VARS;i++) { const int row = threadIdx.x + i*32; const int idx = row+col_idx; if (row < num_rows){ O[idx] = (vars[i] < upper_bound) ? 0.0f : vars[i]; } } } } } " let k = new ManagedCuda.NVRTC.CudaRuntimeCompiler(kernel_code,"Kernel") do try k.Compile([|"-arch=compute_30"|]) with | 😕 NVRTCException as x -> printfn "%s" (k.GetLogAsString()) reraise() let kernel = ctx.LoadKernelPTX(k.GetPTX(),"Kernel") member t.AT(x: CudaDeviceVariable<floatType>, o: CudaDeviceVariable<floatType>, m:int, n: int, k: int) = kernel.GridDimensions <- dim3(divup m 32) kernel.BlockDimensions <- dim3(block_size_x,block_size_y) kernel.RunAsync(str.Stream,x.DevicePointer,o.DevicePointer,m,n,k,1) |> ignore /// Does a transpose in shared memory first. member t.AT(x: dMatrix, k, o: dMatrix) = if x.rc <> o.rc then failwith "x.rc <> o.rc" if x.num_cols <> column_size then failwith "Wrong num_cols." t.AT(x.dArray,o.dArray,x.num_rows,x.num_cols,k) /// Does a transpose in shared memory first. member t.AT(x: dMatrix, k) = let o = dMatrix.create(x.num_rows,x.num_cols) if x.num_cols <> column_size then failwith "Wrong num_cols." t.AT(x.dArray,o.dArray,x.num_rows,x.num_cols,k) o member t.A(x: CudaDeviceVariable<floatType>, o: CudaDeviceVariable<floatType>, m:int, n: int, k: int) = kernel.GridDimensions <- dim3(n) kernel.BlockDimensions <- dim3(block_size_x) kernel.RunAsync(str.Stream,x.DevicePointer,o.DevicePointer,m,n,k,0) |> ignore member t.A(x: dMatrix, k, o: dMatrix) = if x.rc <> o.rc then failwith "x.rc <> o.rc" if divup x.num_rows 32 <> divup column_size 32 then failwith "Wrong num_rows." t.A(x.dArray,o.dArray,x.num_rows,x.num_cols,k) member t.A(x: dMatrix, k) = let o = dMatrix.create(x.num_rows,x.num_cols) if divup x.num_rows 32 <> divup column_size 32 then failwith "Wrong num_rows." t.A(x.dArray,o.dArray,x.num_rows,x.num_cols,k) o

The above seems awe inspiring, but one just needs to keep in mind what it does, which is select on rows if called with .AT() and on columns if called with .A()

For performance reasons, in the row version, it first stores the rows of the matrix in shared memory by reading contiguously from the global memory and then from that into registers. It does the select iteratively on the variables in registers.

Probably the most intricate kernel up to now, but what it does is crystal clear.

Here is the library function that uses it:

let DeviceMaxSelectDict = lazy Dictionary<int,DeviceMaxSelectColumnActivationModule>() /// o <- max_k(x) /// Sets all except the k number of max of a column to zero. /// Unlike for the other modules, to ensure it works swiftly, the column size and the number of iterations is fixed so the compiler can unroll the loops. /// This name is for a function wrapper for the Dict that holds the DeviceMaxSelectColumnActivation modules. let DeviceMaxSelectColumnActivationModule column_size = let d = DeviceMaxSelectDict.Value if d.ContainsKey(divup column_size 32) then d.[divup column_size 32] else let t = DeviceMaxSelectColumnActivationModule column_size d.Add(divup column_size 32,t) t /// The winner take all activation. Zeroes out the non top-k values along the row. let sparseActivationErrorModule = lazy new DeviceTrinaryTransformModule "y*((x == 0.0f) ? 0.0f : 1.0f)+z;" let WTA k (a:DM) = let va = a.r.P let el = a.r.A let node = tape.GetDMIf let c = node.P let error = node.A let ff () = let nr,nc = (va).rc node.Resize nr nc DeviceMaxSelectColumnActivationModule(nc).AT(va,k,c) let fb () = sparseActivationErrorModule.Value.A(c,error,el,el) let t = DM.create(node,ff,fb) tape.Add(t) t

Interesting tidbit about the module is that storing the variables into registers requires them to be indexed statically. Not only does their size need to be determined statically during the compilation step, but the variable that indexes them also needs to be static.

The flaw of CUDA C++ unlike for Alea, is that the “#pragma unroll” directives might not get triggered, but as (int i=0; i < NUM_VARS;i++) is all static, one can be decently sure that here it will go as planned.

#pragma unroll // Unroll the loops for performance, though it probably will not work on this part as the col = threadIdx.y is non static. for (int col = threadIdx.y; col < NUM_COLS_32; col += BLOCK_SIZE_Y) { // Stores everything into shared memory first by reading from the global in a contiguous manner. ar[row][col] = (row_idx < num_rows && col < num_cols) ? A[row_idx+col*num_rows] : INIT_MIN; } }

The above probably will not work and I would have to check the PTX code to if does. Here is the above rewritten so it gets properly unrolled.

#pragma unroll // Unroll the loops for performance. for (int i = 0; i < NUM_VARS*ROW_ITERS; i++) { // Stores everything into shared memory first by reading from the global in a contiguous manner. const int col = threadIdx.y + i*BLOCK_SIZE_Y; ar[row][col] = (row_idx < num_rows && col < num_cols) ? A[row_idx+col*num_rows] : INIT_MIN; } }

NUM_VARS*ROW_ITERS are all constants that the compiler can replace now. The attempted optimization does not seem to have improved performance any though.

Now about the above wrapper, as the function has to know the number of rows divided by 32 ahead of time, what it does is compile a new kernel on the fly for each **different** column size while keeping the old ones stored inside a Dictionary. Much like for memory buggers this allows for an efficient reuse of components.

With that in place, the script for the autoencoder is trivial to adapt based on the previous example:

//let l1 = FeedforwardLayer.fromArray (load_data (__SOURCE_DIRECTORY__ + @"\l1_weights.dat") false) (WTAT 6) \\ For loading the weights from file. let l1 = FeedforwardLayer.createRandomLayer 1024 784 (WTA 6) let l2 = InverseFeedforwardLayer.createRandomLayer l1 (fun x -> x) // No nonlinearity at the end. With a steep sigmoid the cost is much better, but the visualizations are less crisp. let layers = [|l1 :> IFeedforwardLayer;l2 :> IFeedforwardLayer|] // Upcasting to the base type. The correct functions will get called with dynamic dispatch. //save_data (__SOURCE_DIRECTORY__ + @"\l1_weights.dat") l1.ToArray // For saving the weights // This does not actually train it, it just initiates the tree for later training. // The autoencoder version. let training_loop (data: DM) (layers: IFeedforwardLayer[]) = let outputs = Array.fold (fun state (layer:IFeedforwardLayer) -> (layer.runLayer state)) data layers squared_error_cost data outputs let train_mnist_sgd num_iters learning_rate (layers: IFeedforwardLayer[]) = [| let mutable r' = 0.0f let base_nodes = layers |> Array.map (fun x -> x.ToArray) |> Array.concat // Stores all the base nodes of the layer so they can later be reset. for i=1 to num_iters do for x in dtrain do let data, target = x let r = training_loop data layers // Builds the tape. tape.forwardpropTape 0 // Calculates the forward values. Triggers the ff() closures. r' <- r' + (!r.r.P/ float32 dtrain.Length) // Adds the cost to the accumulator. if System.Single.IsNaN r' then failwith "Nan error" for x in base_nodes do x.r.A.setZero() // Resets the base adjoints tape.resetTapeAdjoint 0 // Resets the adjoints for the training select r.r.A := 1.0f // Pushes 1.0f from the top node tape.reversepropTape 0 // Resets the adjoints for the test select add_gradients_to_weights' base_nodes learning_rate // The optimization step tape.Clear 0 // Clears the tape without disposing it or the memory buffer. It allows reuse of memory for a 100% gain in speed for the simple recurrent and feedforward case. printfn "The training cost at iteration %i is %f" i r' let r1 = r' r' <- 0.0f for x in dtest do let data, target = x let r = training_loop data layers // Builds the tape. tape.forwardpropTape 0 // Calculates the forward values. Triggers the ff() closures. r' <- r' + (!r.r.P/ float32 dtest.Length) // Adds the cost to the accumulator. if System.Single.IsNaN r' then failwith "Nan error" tape.Clear 0 // Clears the tape without disposing it or the memory buffer. It allows reuse of memory for a 100% gain in speed. printfn "The validation cost at iteration %i is %f" i r' let r2 = r' r' <- 0.0f yield r1,r2 |] let num_iters = 20 let learning_rate = 0.1f #time let s = train_mnist_sgd num_iters learning_rate layers #time let l = [|for l,_ in s do yield l|] let r = [|for _,r in s do yield r|] (Chart.Combine [|Chart.Line l;Chart.Line r|]).ShowChart() let bitmap = make_bitmap_from_dmatrix l1.W.r.P 28 28 25 40 bitmap.Save(__SOURCE_DIRECTORY__ + @"\weights.png")

The inverse layer takes in the weight matrix of the first feedforward layer and creates a new random bias vector.

// An inverse feedforward layer of neurons made from a regular one. Used in autoencoders type InverseFeedforwardLayer = { W:DM // Input weight matrix b:DM // Bias vector a:DM->DM } with // Activation function member l.ToArray = [|l.W;l.b|] static member fromArray (a : DM[]) act = { W = a.[0] b = a.[1] a = act } static member createRandomLayer (l: FeedforwardLayer) act = { W = l.W b = DM.makeUniformRandomNode(l.W.r.P.num_cols, 1) a = act } member l.runLayer (x:DM) = addb (matmultTN l.W x) l.b |> l.a interface IFeedforwardLayer with member l.runLayer (x:DM) = l.runLayer x member l.ToArray = l.ToArray

The linear layer for transposed matrix multiplication is not implemented yet, but the above works nicely for now.

### Personal

The first time I implemented this a few months ago adding the biases was such a chore that I did not bother to add them. Currently, the library is getting a bit big at 2327 lines of code, but it is not the size that determines the complexity of it. What does is its modularity at various levels. The above kernel was written just today in about five hours and it is in a sense the distillation of all my effort in September down to those five hours. It is satisfying.

There is no need to be understand all the code fragments, but the purpose of these tutorials, is to make the library slightly more familiar to the reader and for myself to revise the material.

The most important post I would say is the basics of AD.

During September of 2015, I think I spent around three weeks or more trying and failing to go above 96% on Mnist with a pretrained net.

As it turned out, there was a bug in the backwards step. One of the map modules was missing a boundary check and when I added it, I realized that I was applying the transpose of the error matrix. Given the error it was amazing that It got as high as 96.5% in the first place.

To get good results with pretraining, it is not actually necessary to do it for a large number of iterations. Even one should be enough in fact.

In a different experiment that used dropout with standard relu units I did just that, and even though the reconstructions were blurry and indistinct compared to the crisp picture at the top of the page, the logistic regression on top of the weights liked it.

More speculatively, the selection functions as employed in the WTA autoencoder have connections to discrete optimization problems. The K-means algorithm in fact is very similar to the sparse activation function and one could potentially use it to cluster cities in the TSP problem for easier problem partitioning. For the k-means specifically, it is also actually possible to optimize it using gradient descent instead of less efficiently by the expectation maximization steps as shown in the Andrew Ng’s course.

To me there is something profound that one can calculate gradients through branches and even sorting functions.

Had I known about the link between math and the general programming, I would have paid more attention in class.

It is speculative, but figuring out the correct way of using backprop like techniques to augment evolutionary optimization techniques could bring about very powerful abilities. I am looking forward to trying it out myself once I am done with these tutorials.

Updating in the comments because WordPress will butcher my posts if I try to edit them.

As it turns out one can speed up the DeviceMaxSelect module significantly by changing the “__shared__ floatType ar[32][NUM_COLS_32];” into “__shared__ floatType ar[32][NUM_COLS_32+1];”. I made a mistake by not padding the array which caused bank conflicts.

LikeLike