Speeding up Recurrent Neural Networks
Esha Uboweja, Bojian Han
In our project, we implemented a parallel RNN that trains using backpropagation through time (BPTT) in Halide, and benchmarked our Halide implementation against our own Theano RNN and NumPy RNN implementations, which are highly optimized for the task.
We have made algorithmic improvements to the backpropagation algorithm shown in the WildML tutorial, whose BPTT implementation is O(N^2) with respect to the number of input, hidden, and output nodes in the RNN. Our version uses an O(N) algorithm for BPTT, which runs much faster. We have also implemented an RNN in NumPy that produces numerically equivalent updates and results to the original Theano RNN implementation, up to some variation in floating-point accuracy. Our NumPy implementation achieved over 2x speedup for small inputs and is slightly faster than Theano for larger inputs.
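To illustrate the idea, here is a minimal NumPy sketch of a single-backward-pass BPTT, assuming the WildML-style tanh RNN with weight matrices U (input-to-hidden), V (hidden-to-output), and W (hidden-to-hidden); the names and shapes are illustrative rather than our exact code. Each time step is visited exactly once, and the hidden-state error delta_h is carried backward instead of being recomputed per step:

    import numpy as np

    def bptt(x, y, s, o, U, V, W):
        # x, y: input/target index sequences; s: hidden states per step;
        # o: softmax outputs per step (all saved from the forward pass)
        T = len(y)
        dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
        delta_h = np.zeros(W.shape[0])        # error flowing back into s[t]
        for t in reversed(range(T)):
            delta_o = o[t].copy()
            delta_o[y[t]] -= 1.0              # d(cross-entropy)/d(logits)
            dV += np.outer(delta_o, s[t])
            # combine the output error with the error arriving from the
            # future time step, then push it through the tanh derivative
            delta_h = (V.T @ delta_o + delta_h) * (1.0 - s[t] ** 2)
            s_prev = s[t - 1] if t > 0 else np.zeros_like(s[0])
            dW += np.outer(delta_h, s_prev)
            dU[:, x[t]] += delta_h            # x[t] is a one-hot word index
            delta_h = W.T @ delta_h           # propagate to time-step t-1
        return dU, dV, dW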
In Halide, we have utilized the parallelism and scheduling techniques we learned in the first 3 assignments of the class, as well as techniques for matrix multiplication and ideas from the Spark RDD lecture. This is demonstrated through vectorization using Halide scheduling optimizations, multi-threaded parallelization via the Halide scheduler, tiling for matrix operations, and loop fusion, both through the Halide JIT and AOT pipelines and done manually. Additionally, we have explored ahead-of-time (AOT) compilation to remove the excessive JIT compilation time for functions that are repeatedly compiled and reused.
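To make these schedules concrete, the sketch below applies tiling, vectorization, and multi-threading to a matrix multiply using Halide's Python bindings. Our project used the C++ API; the binding names and signatures here assume a recent halide package, so treat this as an illustration rather than our exact code:

    import numpy as np
    import halide as hl

    K = 512
    A = hl.Buffer(np.random.rand(K, K).astype(np.float32))
    B = hl.Buffer(np.random.rand(K, K).astype(np.float32))

    x, y = hl.Var("x"), hl.Var("y")
    xo, yo, xi, yi = hl.Var("xo"), hl.Var("yo"), hl.Var("xi"), hl.Var("yi")
    r = hl.RDom([(0, K)])

    # Algorithm: a plain matrix product, reducing over the shared dimension
    matmul = hl.Func("matmul")
    matmul[x, y] = hl.sum(A[r.x, y] * B[x, r.x])

    # Schedule: compute the output in 32x32 tiles for cache locality,
    # vectorize the innermost (unit-stride) loop in chunks of 8, and
    # hand rows of tiles to different threads
    matmul.tile(x, y, xo, yo, xi, yi, 32, 32).vectorize(xi, 8).parallel(yo)

    out = matmul.realize([K, K])   # JIT-compiles, then runs the pipeline

Under an AOT pipeline, the same algorithm and schedule are compiled once up front and the compiled pipeline is reused on every iteration, which is what removes the repeated JIT cost.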
Our current Halide implementation scales well for large inputs, but runs at only 50% of the speed of the NumPy version for large inputs of 1024 by 1024 images. Note, however, that this is a huge improvement over the same Halide code without the optimizations described above: the naive serial Halide implementation is roughly 100-300 times slower than the NumPy version. For example, 1 epoch of training the RNN takes 366 seconds in the naive Halide implementation, 8 seconds in our optimized Halide implementation, and 5 seconds in our NumPy implementation.
We need the following setup on "latedays" machines so that we can test our implementation on the Xeon Phi:
We have code for training RNNs using Theano and Python with NumPy.
We have a better understanding of Backpropagation Through Time (BPTT) for RNNs, and of how some parts of it depend on our choice of non-linear activation. Our Python code presently uses tanh; a popular set of choices is [tanh, sigmoid, ReLU].
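The activation choice enters BPTT through a single line of the backward pass, which multiplies the incoming error by the activation's derivative. A small illustrative sketch (where s is the activated hidden state at some time step):

    import numpy as np

    def d_tanh(s):      # s = tanh(z)     =>  ds/dz = 1 - s^2
        return 1.0 - s ** 2

    def d_sigmoid(s):   # s = sigmoid(z)  =>  ds/dz = s * (1 - s)
        return s * (1.0 - s)

    def d_relu(s):      # s = relu(z)     =>  ds/dz = 1 where s > 0
        return (s > 0).astype(s.dtype)

    # In the backward pass, only this line changes with the activation:
    # delta_t = upstream_error * d_tanh(s_t)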
We have gone through the Halide tutorials and have decided to implement our RNN in Halide, because of the efficiency of matrix operations in Halide. Another reason is portability: while we could use OpenBLAS or similar BLAS libraries in C++ for efficient matrix algebra, with Halide we won't have to reimplement our RNN if we want to test it on a GPU.
We will compare the speedup of our parallel RNN implementation against a Theano implementation. We are familiar with Theano, and we know Theano compiles optimized code for different platforms. Our implementation will be benchmarked against a Theano implementation with the exact same network structure (this means that if our network does not use a bias, the Theano one will not use a bias either). When our implementation uses mini-batch data-parallelism for training the network, the Theano implementation will also use mini-batch training, implemented as matrix-matrix multiplication (weight matrices times batches of training vectors concatenated into a matrix) together with Theano's API for calculating gradients. A sketch of this batched formulation follows.
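A minimal NumPy sketch of one batched recurrence step (shapes and names are hypothetical): concatenating a batch of input vectors into a matrix turns the per-example matrix-vector products into a single matrix-matrix product.

    import numpy as np

    n_in, n_hidden, batch = 8, 16, 32
    U = np.random.randn(n_hidden, n_in)       # input-to-hidden weights
    W = np.random.randn(n_hidden, n_hidden)   # hidden-to-hidden weights

    X_t = np.random.randn(n_in, batch)        # batch of inputs at step t
    S_prev = np.zeros((n_hidden, batch))      # batch of hidden states at t-1

    # One batched step: each column evolves as an independent sequence
    S_t = np.tanh(U @ X_t + W @ S_prev)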
Simple, 1-hidden-layer RNNs are not widely used at this time, having been superseded by more interesting networks like multi-layer RNNs and LSTMs. This makes it difficult to find good problems and datasets to test our implementation. However, our goal is to parallelize and efficiently train a 1-hidden-layer RNN, so we have come up with 2 types of problems for testing our RNNs:
i. Learning simple convolutional filters - RNNs are designed to learn temporal dependencies in a data sequence of values at different time-steps. We can treat images as sequences of pixels, so that a row of pixels [0 0 100 200] can be thought of as a sequence of pixels observed at time-steps 0, 1, 2, 3. This is similar to "walking" along the columns of an image from left to right. Treating images in this manner allows us to learn simple filters like a 1D Sobel filter [-1 0 1], for example (we have code in Theano for this). Further, data generation and scalability fit this problem well, because we can simply generate a lot of images, or use a popular dataset like MNIST, to train and test our implementation (see the sketch after this list).
ii. Dealing with image sequences and videos - we can set up an RNN to generate sequences of spatial differences between frames.
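For problem (i), training data is cheap to generate. A hedged sketch (shapes and names are illustrative, not our exact pipeline):

    import numpy as np

    def make_example(width=32, rng=np.random.default_rng(0)):
        # Input: one row of random pixels, treated as a time sequence
        row = rng.integers(0, 256, size=width).astype(np.float32)
        # Target: the row convolved with the 1D Sobel kernel [-1 0 1]
        target = np.convolve(row, [-1.0, 0.0, 1.0], mode="same")
        return row, target

    # row[t] is the RNN input at time-step t; a trained RNN should emit
    # target[t], i.e. it should learn to behave like the filter
    row, target = make_example()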
We aim to present graphs showing performance and speedup on these two applications, comparing a baseline Theano implementation against various scheduled implementations in Halide on a multi-core CPU. Since we have chosen some interesting image-based problems, we will also try to visualize the results produced by the RNN against the expected output where possible.
We are going to implement a fast, parallel version of Recurrent Neural Networks (RNNs). We will optimize RNN training performance on multi-core CPUs.
RNNs are neural networks that use memory to store certain pieces of information, so that the output at time-step t depends on the input at time-step t as well as the inputs at time-steps t-1, t-2, etc. While the duration of this "memory" is short, i.e. the output at t depends only on inputs up to T steps back in time, such a recurrent dependency pattern requires the network to be iterative in at least 2 ways. First, as in many neural networks, an epoch of training can happen only after the previous epoch of training is complete. This iterative dependency is most likely unavoidable. Second, training the RNN's hidden units for each time-step happens iteratively, so that the update to the hidden unit at time t depends on the update to the hidden unit at time t-1. This is the core idea implemented as "Backpropagation Through Time" (BPTT) for RNNs.
Figure 1: An RNN architecture
An example of an RNN architecture is shown in Figure 1 (from (1)). As described in (1), at each time-step t the network computes a hidden state s_t and an output o_t as s_t = tanh(U x_t + W s_{t-1}) and o_t = softmax(V s_t), where U, W, and V are the input-to-hidden, hidden-to-hidden, and hidden-to-output weight matrices respectively.
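In NumPy, this forward recurrence looks roughly as follows (a minimal sketch of the network in (1); variable names are illustrative, and the loop over t is inherently sequential because s_t depends on s_{t-1}):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def forward(xs, U, V, W):
        # xs: sequence of input vectors; returns hidden states and outputs
        T, n_hidden = len(xs), W.shape[0]
        s = np.zeros((T, n_hidden))
        o = np.zeros((T, V.shape[0]))
        s_prev = np.zeros(n_hidden)
        for t in range(T):                          # sequential in t
            s[t] = np.tanh(U @ xs[t] + W @ s_prev)  # s_t needs s_{t-1}
            o[t] = softmax(V @ s[t])
            s_prev = s[t]
        return s, o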
A fast, parallel RNN implementation should handle training data of arbitrary size efficiently, and should also try to optimize across the dependencies of the gradient-descent updates to the 3 weight matrices for multiple hidden units.
We believe one way to solve this problem is to implement a data-parallel RNN training scheme that batches up subsets of the data (or the entire dataset) and feeds them directly to the network for training. We are also investigating some other approaches, such as those described in (2).
Plan:
Hope:
We will write code in C++ (or perhaps Halide). C++ is our first choice because we have used it throughout the semester for class assignments, and we know our code won't require any additional setup on the Gates machines or on the "latedays" machine cluster. We are new to Halide, a Domain Specific Language (DSL) for image processing. In Halide, the matrix computations could be faster, but we will finalize our choice of language in our first project week.
We are programming a CPU-based implementation because, while GPUs are very common, a multi-core CPU is an even more common platform, and perhaps a cheaper one. If time permits, we may look into a GPU-based implementation (and revise our "Hope to achieve" set of goals!).