# A Tutorial Introduction¶

We will begin with a quick tutorial on Knet, going over the essential tools for defining, training, and evaluating real machine learning models in 10 short lessons. The examples cover linear regression, softmax classification, multilayer perceptrons, convolutional and recurrent neural networks. We will use these models to predict housing prices in Boston, recognize handwritten digits, and teach the computer to write like Shakespeare!

The goal is to get you to the point where you can create your own models and apply machine learning to your own problems as quickly as possible. So some of the details and exceptions will be skipped for now. No prior knowledge of machine learning or Julia is necessary, but general programming experience will be assumed. It would be best if you follow along with the examples on your computer. Before we get started please complete the installation instructions if you have not done so already.

## 1. Functions and models¶

@knet, function, compile, forw, get, :colon

In this section, we will create our first Knet model, and learn how to make predictions. To start using Knet, type using Knet at the Julia prompt:

julia> using Knet
...


In Knet, a machine learning model is defined using a special function syntax with the @knet macro. It may be helpful at this point to review the Julia function syntax as the Knet syntax is based on it. The following example defines a @knet function for a simple linear regression model with 13 inputs and a single output. You can type this definition at the Julia prompt, or you can copy and paste it into a file which can be loaded into Julia using include("filename"):

@knet function linreg(x)
    w = par(dims=(1,13), init=Gaussian(0,0.1))
    b = par(dims=(1,1),  init=Constant(0))
    return w * x .+ b
end


In this definition:

• @knet indicates that linreg is a Knet function, and not a regular Julia function or variable.
• x is the only input argument. We will use a (13,1) column vector for this example.
• w and b are model parameters as indicated by the par constructor.
• dims and init are keyword arguments to par.
• dims gives the dimensions of the parameter. Julia stores arrays in column-major order, i.e. (1,13) specifies 1 row and 13 columns.
• init describes how the parameter should be initialized. It can be a user supplied Julia array or one of the supported array fillers as in this example.
• The final return statement specifies the output of the Knet function.
• The * denotes matrix product and .+ denotes elementwise broadcasting addition.
• Broadcasting operations like .+ can act on arrays with different sizes, such as adding a vector to each column of a matrix. They expand singleton dimensions in array arguments to match the corresponding dimension in the other array without using extra memory, and apply the operation elementwise.
• Unlike regular Julia functions, only a restricted set of operators such as * and .+, and statement types such as assignments and returns can be used in a @knet function definition.
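For example, broadcasting a column vector across the columns of a matrix works like this in plain Julia:

```julia
A = [1.0 2.0 3.0; 4.0 5.0 6.0]   # 2x3 matrix
b = [10.0; 20.0]                  # 2-element column vector
A .+ b                            # b is added to each column: [11 12 13; 24 25 26]
```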

In order to turn linreg into a machine learning model that can be trained with examples and used for predictions, we need to compile it:

julia> f1 = compile(:linreg)        # The colon before linreg is required
...


To test our model let’s give it some input initialized with random numbers:

julia> x1 = randn(13,1)
13x1 Array{Float64,2}:
-0.556027
-0.444383
...


To obtain the prediction of model f1 on input x1 we use the forw function, which basically calculates w * x1 .+ b:

julia> forw(f1,x1)
1x1 Array{Float64,2}:
-0.710651


We can query the model and see its parameters using get:

julia> get(f1,:w)           # The colon before w is required
1x13 Array{Float64,2}:
0.149138  0.0367563  ... -0.433747  0.0569829

julia> get(f1,:b)
1x1 Array{Float64,2}:
0.0


We can also look at the input with get(f1,:x), and reexamine the output using the special :return symbol with get(f1,:return). In fact, using get, we can confirm that our model gives us the same answer as an equivalent Julia expression:

julia> get(f1,:w) * get(f1,:x) .+ get(f1,:b)
1x1 Array{Float64,2}:
-0.710651


You can see the internals of the compiled model by looking at f1. It consists of 5 low-level operations:

julia> f1
1 Knet.Input() name=>x,dims=>(13,1),norm=>3.84375,...
2 Knet.Par() name=>w,dims=>(1,13),norm=>0.529962,...
3 Knet.Par() name=>b,dims=>(1,1),norm=>0,...
4 Knet.Dot(2,1) name=>##tmp#7298,args=>(w,x),dims=>(1,1),norm=>0.710651,...


You may have noticed the colons before Knet variable names like :linreg, :w, :x, :b, etc. Any variable introduced in a @knet macro is not a regular Julia variable so its name needs to be escaped using the colon character in ordinary Julia code. In contrast, f1 and x1 are ordinary Julia variables.

In this section, we have seen how to create a Knet model by compiling a @knet function, how to perform a prediction given an input using forw, and how to take a look at model parameters using get. Next we will see how to train models.

## 2. Training a model¶

OK, so we can define functions using Knet but why should we bother? The thing that makes a Knet function different from an ordinary function is that Knet functions are differentiable programs. This means that for a given input not only can they compute an output, but they can also compute which way their parameters should be modified to approach some desired output. If we have some input-output data that comes from an unknown function, we can train a Knet model to look like this unknown function by manipulating its parameters.

We will use the Housing dataset from the UCI Machine Learning Repository to train our linreg model. The dataset has housing-related information for 506 neighborhoods in Boston from 1978. Each neighborhood has 14 attributes; the goal is to use the first 13, such as average number of rooms per house, or distance to employment centers, to predict the 14th attribute: median dollar value of the houses. Here are the first 3 entries:

0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30 396.90   4.98  24.00
0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80 396.90   9.14  21.60
0.02729   0.00   7.070  0  0.4690  7.1850  61.10  4.9671   2  242.0  17.80 392.83   4.03  34.70
...


Let’s download the dataset and use readdlm to turn it into a Julia array.

julia> url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data";
julia> file = Pkg.dir("Knet/data/housing.data");
...
julia> data = readdlm(file)'  # Don't forget the final apostrophe to transpose data
14x506 Array{Float64,2}:
0.00632    0.02731    0.02729 ...   0.06076    0.10959    0.04741
18.0        0.0        0.0     ...   0.0        0.0        0.0
...


The resulting data matrix should have 506 columns representing neighborhoods, and 14 rows representing the attributes. The last attribute is the median house price to be predicted, so let’s separate it:

julia> x = data[1:13,:]
13x506 Array{Float64,2}:...
julia> y = data[14,:]
1x506 Array{Float64,2}:...


Here we are using Julia’s array indexing notation to split the data array into input x and output y. Inside the square brackets 1:13 means grab the rows 1 through 13, and the : character by itself means grab all the columns.
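As a toy illustration of this indexing pattern (using a small made-up array in place of the real data):

```julia
toy  = reshape(collect(1.0:28.0), 14, 2)   # a tiny 14x2 stand-in for the data matrix
xin  = toy[1:13, :]                        # rows 1 through 13: the input attributes
yout = toy[14, :]                          # row 14: the values to predict
```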

You may have noticed that the input attributes have very different ranges. It is usually a good idea to normalize them by subtracting the mean and dividing by the standard deviation:

julia> x = (x .- mean(x,2)) ./ std(x,2);


The mean() and std() functions compute the mean and standard deviation of x. Their optional second argument gives the dimension along which to compute the statistic, so mean(x) gives us the mean of the whole array, mean(x,1) gives the mean of each column, and mean(x,2) gives us the mean of each row.
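For example (using the array syntax of the Julia version this tutorial was written for; mean(x,2) became mean(x, dims=2) in later Julia releases):

```julia
julia> x = [1.0 2.0; 3.0 4.0];
julia> mean(x)     # mean of the whole array
2.5
julia> mean(x,1)   # mean of each column
1x2 Array{Float64,2}:
 2.0  3.0
julia> mean(x,2)   # mean of each row
2x1 Array{Float64,2}:
 1.5
 3.5
```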

It is also a good idea to split our dataset into training and test subsets so we can estimate how well our model will do on unseen data.

julia> n = size(x,2);
julia> r = randperm(n);
julia> xtrn=x[:,r[1:400]];
julia> ytrn=y[:,r[1:400]];
julia> xtst=x[:,r[401:end]];
julia> ytst=y[:,r[401:end]];


n is set to the number of instances (columns) and r is set to randperm(n) which gives a random permutation of integers $$1\ldots n$$. The first 400 indices in r will be used for training, and the last 106 for testing.

Let’s see how well our randomly initialized model does before training:

julia> ypred = forw(f1, xtst)
1x106 Array{Float64,2}:...
julia> quadloss(ypred, ytst)
307.9336...


The quadratic loss function quadloss() computes $$(1/2n) \sum (\hat{y} - y)^2$$, i.e. half of the mean squared difference between a predicted answer $$\hat{y}$$ and the desired answer $$y$$. Given that $$y$$ values range from 5 to 50, an RMSD of $$\sqrt{2\times 307.9}=24.8$$ is a pretty bad score.
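The loss itself is simple enough to sketch in a line of plain Julia (an illustrative version, not Knet's actual implementation):

```julia
# (1/2n) * sum of squared differences, where n = number of instances (columns)
quadloss_manual(ypred, ygold) = sum((ypred .- ygold).^2) / (2*size(ygold,2))
```

With ypred and ytst from above, quadloss_manual(ypred, ytst) should match the value reported by quadloss.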

We would like to minimize this loss which should get the predicted answers closer to the desired answers. To do this we first compute the loss gradient for the parameters of f1 – this is the direction in parameter space that maximally increases the loss. Then we move the parameters in the opposite direction. Here is a simple function that performs these steps:

function train(f, x, y)
    for i=1:size(x,2)
        forw(f, x[:,i])
        back(f, y[:,i], quadloss)
        update!(f)
    end
end

• The for loop grabs training instances one by one.
• forw computes the prediction for the i’th instance. This is required for the next step.
• back computes the loss gradient for each parameter in f for the i’th instance.
• update! moves each parameter opposite the gradient direction to reduce the loss.

Before training, it is important to set a good learning rate. The learning rate controls how large the update steps are going to be: too small and training will take a long time, too large and it may never converge. The setp() function is used to set training options like the learning rate. Let’s set the learning rate to 0.001 and train the model for 100 epochs (i.e. 100 passes over the dataset):

julia> setp(f1, lr=0.001)
julia> for i=1:100; train(f1, xtrn, ytrn); end


This should take a few seconds, and this time our RMSD should be much better:

julia> ypred = forw(f1, xtst)
1x106 Array{Float64,2}:...
julia> quadloss(ypred, ytst)
11.5989...
julia> sqrt(2*ans)
4.8164...


We can see what the model has learned by looking at the new weights:

julia> get(f1,:w)
1x13 Array{Float64,2}:
-0.560346  0.924687  0.0446596  ...  -1.89473  1.13219  -3.51418


The two weights with the most negative contributions are 13 and 8. We can find out from UCI that these are:

13. LSTAT: % lower status of the population
8. DIS: weighted distances to five Boston employment centres


And the two with the most positive contributions are 9 and 6:

9. RAD: index of accessibility to radial highways
6. RM: average number of rooms per dwelling


In this section we saw how to download data, turn it into a Julia array, normalize it, and split it into input, output, train, and test subsets. We wrote a simple training script using forw, back, and update!, set the learning rate lr using setp, and evaluated the model using the quadloss loss function. Now, there are many more efficient and elegant ways to perform and analyze a linear regression, as you can find in any decent statistics text. However, the basic method outlined in this section has the advantage of being easy to generalize to models that are much larger and more complicated.

## 3. Making models generic¶

keyword arguments, size inference

Hardcoding the dimensions of parameters in linreg makes it awfully specific to the Housing dataset. Knet allows keyword arguments in @knet function definitions to get around this problem:

@knet function linreg2(x; inputs=13, outputs=1)
    w = par(dims=(outputs,inputs), init=Gaussian(0,0.1))
    b = par(dims=(outputs,1), init=Constant(0))
    return w * x .+ b
end


Now we can use this model for another dataset that has, for example, 784 inputs and 10 outputs by passing these keyword arguments to compile:

julia> f2 = compile(:linreg2, inputs=784, outputs=10);


Knet functions borrow the syntax for keyword arguments from Julia, and we will be using them in many contexts, so a brief aside is in order: Keyword arguments are identified by name instead of position, and they can be passed in any order (or not passed at all) following regular (positional) arguments. In fact we have already seen examples: dims and init are keyword arguments for par (which has no regular arguments). Functions with keyword arguments are defined using a semicolon in the signature, e.g. function pool(x; window=2, padding=0). The semicolon is optional when the function is called, e.g. both pool(x, window=5) or pool(x; window=5) work. Unspecified keyword arguments take their default values specified in the function definition. Extra keyword arguments can be collected using three dots in the function definition: function pool(x; window=2, padding=0, o...), and passed in function calls: pool(x; o...).
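These rules are plain Julia behavior, so you can experiment with them outside Knet; here is a small sketch using a hypothetical mypool function that just reports the options it received:

```julia
mypool(x; window=2, padding=0) = (window, padding)

mypool(5)               # unspecified keywords take their defaults: (2, 0)
mypool(5, window=7)     # semicolon optional at the call site: (7, 0)
opts = Dict(:window=>3, :padding=>1)
mypool(5; opts...)      # splatting collected options: (3, 1)
```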

In addition to keyword arguments to make models more generic, Knet implements size inference: Any dimension that relies on the input size can be left as 0, which tells Knet to infer that dimension when the first input is received. Leaving input dependent dimensions as 0, and using a keyword argument to determine output size we arrive at a fully generic version of linreg:

@knet function linreg3(x; out=1)
    w = par(dims=(out,0), init=Gaussian(0,0.1))
    b = par(dims=(out,1), init=Constant(0))
    return w * x .+ b
end


In this section, we have seen how to make @knet functions more generic using keyword arguments and size inference. This will especially come in handy when we are using them as new operators as described next.

## 4. Defining new operators¶

@knet function as operator, soft

The key to controlling complexity in computer languages is abstraction. Abstraction is the ability to name compound structures built from primitive parts, so they too can be used as primitives. In Knet we do this by using @knet functions not just as models, but as new operators inside other @knet functions.

To illustrate this, we will implement a softmax classification model. Softmax classification is basically linear regression with multiple outputs followed by normalization. Here is how we can define it in Knet:

@knet function softmax(x; out=10)
    z = linreg3(x; out=out)
    return soft(z)
end


The softmax model basically computes soft(w * x .+ b) with trainable parameters w and b by calling linreg3 we defined in the previous section. The out keyword parameter determines the number of outputs and is passed from softmax to linreg3 unchanged. The number of inputs is left unspecified and is inferred when the first input is received. The soft operator normalizes its argument by exponentiating its elements and dividing each by their sum.
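For intuition, the soft normalization can be sketched in plain Julia (an illustrative per-column version, not Knet's implementation; a numerically robust version would subtract the column maximum before exponentiating):

```julia
function soft_manual(z)
    p = similar(z)
    for j = 1:size(z,2)
        e = [exp(v) for v in z[:,j]]   # exponentiate the column
        p[:,j] = e / sum(e)            # normalize so the column sums to 1
    end
    return p
end
```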

In this section we saw an example of using a @knet function as a new operator. Using the power of abstraction, not only can we avoid repetition and shorten the amount of code for larger models, we make the definitions a lot more readable and configurable, and gain a bunch of reusable operators to boot. To see some example reusable operators take a look at the Knet compound operators table and see their definitions in kfun.jl.

## 5. Training with minibatches¶

minibatch, softloss, zeroone

We will use the softmax model to classify hand-written digits from the MNIST dataset. Here are the first 8 images from MNIST, the goal is to look at the pixels and classify each image as one of the digits 0-9:

The following loads the MNIST data:

julia> include(Pkg.dir("Knet/examples/mnist.jl"))


Once loaded, the data is available as multi-dimensional Julia arrays:

julia> MNIST.xtrn
28x28x1x60000 Array{Float32,4}:...
julia> MNIST.ytrn
10x60000 Array{Float32,2}:...
julia> MNIST.xtst
28x28x1x10000 Array{Float32,4}:...
julia> MNIST.ytst
10x10000 Array{Float32,2}:...


We have 60000 training and 10000 testing examples. Each input x is a 28x28x1 array representing one image, where the first two numbers represent the width and height in pixels, the third number is the number of channels (which is 1 for grayscale images, 3 for RGB images). The softmax model will treat each image as a 28*28*1=784 dimensional vector. The pixel values have been normalized to $$[0,1]$$. Each output y is a ten-dimensional one-hot vector (a vector that has a single non-zero component) indicating the correct class (0-9) for a given image.
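For example, the one-hot vector for the digit 3 (with classes 0-9 mapped to indices 1-10) looks like this:

```julia
y = zeros(Float32, 10)
y[4] = 1    # digit 3 lives at index 4, since digit 0 is at index 1
# y == Float32[0,0,0,1,0,0,0,0,0,0]
```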

This is a much larger dataset than Housing. For computational efficiency, it is not advisable to use these examples one at a time during training like we did before. We will split the data into groups of 100 examples called minibatches, and pass data to forw and back one minibatch at a time instead of one instance at a time. On my laptop, one epoch of training softmax on MNIST takes about 0.34 seconds with a minibatch size of 100, 1.67 seconds with a minibatch size of 10, and 10.5 seconds if we do not use minibatches.

Knet provides a small minibatch function to split the data:

function minibatch(x, y, batchsize)
    data = Any[]
    for i=1:batchsize:ccount(x)
        j = min(i+batchsize-1, ccount(x))
        push!(data, (cget(x,i:j), cget(y,i:j)))
    end
    return data
end


minibatch takes batchsize columns of x and y at a time, pairs them up and pushes them into a data array. It works for arrays of any dimensionality, treating the last dimension as “columns”. Note that this type of minibatching is fine for small datasets, but it requires holding two copies of the data in memory. For problems with a large amount of data you may want to use subarrays or iterables.

Here is minibatch in action:

julia> batchsize=100;
julia> trn = minibatch(MNIST.xtrn, MNIST.ytrn, batchsize)
600-element Array{Any,1}:...
julia> tst = minibatch(MNIST.xtst, MNIST.ytst, batchsize)
100-element Array{Any,1}:...


Each element of trn and tst is an x, y pair that contains 100 examples:

julia> trn[1]
(28x28x1x100 Array{Float32,4}: ...,
10x100 Array{Float32,2}: ...)


Here are some simple train and test functions that use this type of minibatched data. Note that they take the loss function as a third argument and iterate through the x,y pairs (minibatches) in data:

function train(f, data, loss)
    for (x,y) in data
        forw(f, x)
        back(f, y, loss)
        update!(f)
    end
end

function test(f, data, loss)
    sumloss = numloss = 0
    for (x,ygold) in data
        ypred = forw(f, x)
        sumloss += loss(ypred, ygold)
        numloss += 1
    end
    return sumloss / numloss
end


Before training, we compile the model and set the learning rate to 0.2, which works well for this example. We use two new loss functions: softloss computes the cross-entropy loss, $$-\sum_i p_i \log \hat{p}_i$$, commonly used for training classification models, and zeroone computes the zero-one loss, which is the proportion of predictions that were wrong. I got 7.66% test error after 40 epochs of training. Your results may be slightly different on different machines, or different runs on the same machine, because of random initialization.

julia> model = compile(:softmax);
julia> setp(model; lr=0.2);
julia> for epoch=1:40; train(model, trn, softloss); end
julia> test(model, tst, zeroone)
0.0766...
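For intuition, both loss functions can be sketched in plain Julia (illustrative versions, not Knet's actual implementations):

```julia
# cross entropy: -(1/n) * sum over instances of ygold .* log(ypred)
softloss_manual(ypred, ygold) = -sum(ygold .* map(log, ypred)) / size(ygold,2)

# zero-one: fraction of columns where the predicted class differs from the gold class
function zeroone_manual(ypred, ygold)
    wrong = 0
    for j = 1:size(ypred,2)
        wrong += findmax(ypred[:,j])[2] != findmax(ygold[:,j])[2]
    end
    return wrong / size(ypred,2)
end
```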


In this section we saw how splitting the training data into minibatches can speed up training. We trained our first classification model on MNIST and used two new loss functions: softloss and zeroone.

## 7. Convnet¶

@knet function as operator, keyword arguments, size inference

In this section we will use the LeNet convolutional neural network model, designed to recognize handwritten digits, to illustrate how @knet functions can be used as operators. Here is the LeNet model defined using only the primitive operators of Knet:

@knet function lenet1(x)        # dims=(28,28,1,N)
    w1 = par(init=Xavier(),   dims=(5,5,1,20))
    c1 = conv(w1,x)             # dims=(24,24,20,N)
    b1 = par(init=Constant(0),dims=(1,1,20,1))
    a1 = c1 .+ b1
    r1 = relu(a1)
    p1 = pool(r1; window=2)     # dims=(12,12,20,N)

    w2 = par(init=Xavier(),   dims=(5,5,20,50))
    c2 = conv(w2,p1)            # dims=(8,8,50,N)
    b2 = par(init=Constant(0),dims=(1,1,50,1))
    a2 = c2 .+ b2
    r2 = relu(a2)
    p2 = pool(r2; window=2)     # dims=(4,4,50,N)

    w3 = par(init=Xavier(),   dims=(500,800))
    d3 = dot(w3,p2)             # dims=(500,N)
    b3 = par(init=Constant(0),dims=(500,1))
    a3 = d3 .+ b3
    r3 = relu(a3)

    w4 = par(init=Xavier(),   dims=(10,500))
    d4 = dot(w4,r3)             # dims=(10,N)
    b4 = par(init=Constant(0),dims=(10,1))
    a4 = d4 .+ b4
    return soft(a4)             # dims=(10,N)
end


Don’t worry about the details of the model if you don’t know much about neural nets. At over 20 lines long, this model looks a lot more complicated than our linear regression model. Compared to state of the art image processing models however, it is still tiny. You would not want to code a state-of-the-art model like GoogLeNet using these primitives.

If you are familiar with neural nets, and peruse the Knet primitives table, you can see that the model has two convolution-pooling layers (commonly used in image processing), a fully connected relu layer and a final softmax output layer (I separated them by blank lines to help). Wouldn’t it be nice to say just that:

@knet function lenet2(x)
    a = conv_pool_layer(x)
    b = conv_pool_layer(a)
    c = relu_layer(b)
    return softmax_layer(c)
end


lenet2 is a lot more readable than lenet1. But before we can use this definition, we have to solve two problems:

• conv_pool_layer etc. are not primitive operators, we need a way to add them to Knet.
• Each layer has some attributes, like init and dims, that we need to be able to configure.

Knet solves the first problem by allowing @knet functions to be used as operators as well as models. For example, we can define conv_pool_layer as an operator with:

@knet function conv_pool_layer(x)
    w = par(init=Xavier(), dims=(5,5,1,20))
    c = conv(w,x)
    b = par(init=Constant(0), dims=(1,1,20,1))
    a = c .+ b
    r = relu(a)
    return pool(r; window=2)
end


With this definition, the first a = conv_pool_layer(x) operation in lenet2 will work exactly as we want, but not the second (it has different convolution dimensions).

This brings us to the second problem, layer configuration. It would be nice not to hard-code numbers like (5,5,1,20) in the definition of a new operation like conv_pool_layer. Making these numbers configurable would make such operations more reusable across models. Even within the same model, you may want to use the same layer type in more than one configuration. For example in lenet2 there is no way to distinguish the two conv_pool_layer operations, but looking at lenet1 we clearly want them to do different things.

Knet solves the layer configuration problem using keyword arguments, whose syntax we reviewed earlier: they are identified by name rather than position, can be passed in any order (or not at all) after the regular arguments, and take their default values when unspecified. We have already seen examples: dims and init are keyword arguments for par (which has no regular arguments), and window is a keyword argument for pool. Extra keyword arguments can be collected using three dots in a definition, e.g. function pool(x; window=2, padding=0, o...), and passed along in function calls: pool(x; o...).

Here is a configurable version of conv_pool_layer using keyword arguments:

@knet function conv_pool_layer(x; cwindow=0, cinput=0, coutput=0, pwindow=0)
    w = par(init=Xavier(), dims=(cwindow,cwindow,cinput,coutput))
    c = conv(w,x)
    b = par(init=Constant(0), dims=(1,1,coutput,1))
    a = c .+ b
    r = relu(a)
    return pool(r; window=pwindow)
end


Similarly, we can define relu_layer and softmax_layer with keyword arguments and make them more reusable. If you did this, however, you’d notice that we are repeating a lot of code. That is almost always a bad idea. Why don’t we define a generic_layer that contains the shared code for all our layers:

@knet function generic_layer(x; f1=:dot, f2=:relu, wdims=(), bdims=(), winit=Xavier(), binit=Constant(0))
    w = par(init=winit, dims=wdims)
    y = f1(w,x)
    b = par(init=binit, dims=bdims)
    z = y .+ b
    return f2(z)
end


Note that in this example we are not only making initialization parameters like winit and binit configurable, we are also making internal operators like relu and dot configurable (their names need to be escaped with colons when passed as keyword arguments). This generic layer will allow us to define many layer types easily:

@knet function conv_pool_layer(x; cwindow=0, cinput=0, coutput=0, pwindow=0)
    y = generic_layer(x; f1=:conv, f2=:relu, wdims=(cwindow,cwindow,cinput,coutput), bdims=(1,1,coutput,1))
    return pool(y; window=pwindow)
end

@knet function relu_layer(x; input=0, output=0)
    return generic_layer(x; f1=:dot, f2=:relu, wdims=(output,input), bdims=(output,1))
end

@knet function softmax_layer(x; input=0, output=0)
    return generic_layer(x; f1=:dot, f2=:soft, wdims=(output,input), bdims=(output,1))
end


Finally we can define a working version of LeNet using 4 lines of code:

@knet function lenet3(x)
    a = conv_pool_layer(x; cwindow=5, cinput=1,  coutput=20, pwindow=2)
    b = conv_pool_layer(a; cwindow=5, cinput=20, coutput=50, pwindow=2)
    c = relu_layer(b; input=800, output=500)
    return softmax_layer(c; input=500, output=10)
end


There are still a lot of hard-coded dimensions in lenet3. Some of these, like the filter size (5), and the hidden layer size (500) can be considered part of the model design. We should make them configurable so the user can experiment with different sized models. But some, like the number of input channels (1), and the input to the relu_layer (800) are determined by input size. If we tried to apply lenet3 to a dataset with different sized images, it would break. Knet solves this problem using size inference: Any dimension that relies on the input size can be left as 0, which tells Knet to infer that dimension when the first input is received. Leaving input dependent dimensions as 0, and using keyword arguments to determine model size we arrive at a fully configurable version of LeNet:

@knet function lenet4(x; cwin1=5, cout1=20, pwin1=2, cwin2=5, cout2=50, pwin2=2, hidden=500, nclass=10)
    a = conv_pool_layer(x; cwindow=cwin1, coutput=cout1, pwindow=pwin1)
    b = conv_pool_layer(a; cwindow=cwin2, coutput=cout2, pwindow=pwin2)
    c = relu_layer(b; output=hidden)
    return softmax_layer(c; output=nclass)
end


To compile an instance of lenet4 with particular dimensions, we pass keyword arguments to compile:

julia> f = compile(:lenet4; cout1=30, cout2=60, hidden=600)
...


In this section we saw how to use @knet functions as new operators, and configure them using keyword arguments. Using the power of abstraction, not only did we cut the amount of code for the LeNet model in half, we made its definition a lot more readable and configurable, and gained a bunch of reusable operators to boot. I am sure you can think of more clever ways to define LeNet and other complex models using your own set of operators. To see some example reusable operators take a look at the Knet compound operators table and see their definitions in kfun.jl.

## 8. Conditional Evaluation¶

if-else, runtime conditions (kwargs for forw), dropout

There are cases where you want to execute parts of a model conditionally, e.g. only during training, or only during some parts of the input in sequence models. Knet supports the use of runtime conditions for this purpose. We will illustrate the use of conditions by implementing a training technique called dropout to improve the generalization power of the LeNet model.

If you keep training the LeNet model on MNIST for about 30 epochs you will observe that the training error drops to zero but the test error hovers around 0.8%:

for epoch=1:100
    train(net, trn, softloss)
    println((epoch, test(net, trn, zeroone), test(net, tst, zeroone)))
end

(1,0.020466666666666505,0.024799999999999996)
(2,0.013649999999999905,0.01820000000000001)
...
(29,0.0,0.008100000000000003)
(30,0.0,0.008000000000000004)


This is called overfitting. The model has memorized the training set, but does not generalize equally well to the test set.

Dropout prevents overfitting by injecting random noise into the model. Specifically, for each forw call during training, dropout layers placed between two operations replace a random portion of their input with zeros, and scale the rest to keep the total output the same. During testing random noise would degrade performance, so we would like to turn dropout off. Here is one way to implement this in Knet:

@knet function drop(x; pdrop=0, o...)
    if dropout
        return x .* rnd(init=Bernoulli(1-pdrop, 1/(1-pdrop)))
    else
        return x
    end
end


The keyword argument pdrop specifies the probability of dropping an input element. The if ... else ... end block causes conditional evaluation the way one would expect. The variable dropout next to if is a global condition variable: it is not declared as an argument to the function. Instead, once a model with a drop operation is compiled, the call to forw accepts dropout as an optional keyword argument and passes it down as a global condition:

forw(model, input; dropout=true)


This means every time we call forw, we can change whether dropout occurs or not. During test time, we would like to stop dropout, so we can run the model with dropout=false:

forw(model, input; dropout=false)


By default, all unspecified condition variables are false, so we could also omit the condition during test time:

forw(model, input)  # dropout=false is assumed
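For intuition, here is what an inverted-dropout mask does numerically, sketched in plain Julia (illustrative only; Knet's drop uses the rnd operator shown above):

```julia
function dropout_manual(x, pdrop)
    keep = rand(size(x)...) .> pdrop   # Bernoulli(1-pdrop) keep mask
    return x .* keep ./ (1 - pdrop)    # rescale survivors so the expected total is unchanged
end
```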


Here is one way to add dropout to the LeNet model:

@knet function lenet5(x; pdrop=0.5, cwin1=5, cout1=20, pwin1=2, cwin2=5, cout2=50, pwin2=2, hidden=500, nclass=10)
    a = conv_pool_layer(x; cwindow=cwin1, coutput=cout1, pwindow=pwin1)
    b = conv_pool_layer(a; cwindow=cwin2, coutput=cout2, pwindow=pwin2)
    bdrop = drop(b; pdrop=pdrop)
    c = relu_layer(bdrop; output=hidden)
    return softmax_layer(c; output=nclass)
end


Whenever the condition variable dropout is true, this will replace half of the entries in the b array with zeros. We need to modify our train function to pass the condition to forw:

function train(f, data, loss)
    for (x,y) in data
        forw(f, x; dropout=true)
        back(f, y, loss)
        update!(f)
    end
end


Here is our training script. Note that we reduce the learning rate whenever the test error gets worse, another precaution against overfitting:

lrate = 0.1
decay = 0.9
lasterr = 1.0
net = compile(:lenet5)
setp(net; lr=lrate)

for epoch=1:100
    train(net, trn, softloss)
    trnerr = test(net, trn, zeroone)
    tsterr = test(net, tst, zeroone)
    println((epoch, lrate, trnerr, tsterr))
    if tsterr > lasterr
        lrate = decay*lrate
        setp(net; lr=lrate)
    end
    lasterr = tsterr
end


In 100 epochs, this should converge to about 0.5% error, i.e. reduce the total number of errors on the 10K test set from around 80 to around 50. Congratulations! This is fairly close to the state of the art compared to other benchmark results on the MNIST website:

(1,0.1,0.020749999999999824,0.01960000000000001)
(2,0.1,0.013699999999999895,0.01600000000000001)
...
(99,0.0014780882941434613,0.0003333333333333334,0.005200000000000002)
(100,0.0014780882941434613,0.0003666666666666668,0.005000000000000002)


In this section, we saw how to use the if ... else ... end construct to perform conditional evaluation in a model, where the conditions are passed using keyword arguments to forw. We used this to implement dropout, an effective technique to prevent overfitting.

## 9. Recurrent neural networks¶

In this section we will see how to implement recurrent neural networks (RNNs) in Knet. An RNN is a class of neural network where connections between units form a directed cycle, which allows them to keep a persistent state (memory) over time. This gives them the ability to process sequences of arbitrary length one element at a time, while keeping track of what happened at previous elements. Contrast this with feed forward nets like LeNet, which have a fixed sized input and output and perform a fixed number of operations. See (Karpathy, 2015) for a nice introduction to RNNs.

To support RNNs, all local variables in Knet functions are static variables, i.e. their values are preserved between calls unless otherwise specified. It turns out this is the only language feature you need to define RNNs. Here is a simple example:

@knet function rnn1(x; hsize=100, xsize=50)
    a = par(init=Xavier(), dims=(hsize, xsize))
    b = par(init=Xavier(), dims=(hsize, hsize))
    c = par(init=Constant(0), dims=(hsize, 1))
    d = a * x .+ b * h .+ c
    h = relu(d)
end


Notice anything strange? The first three lines define three model parameters. Then the fourth line sets d to a linear combination of the input x and the hidden state h. But h hasn’t been defined yet. Exactly! Having read-before-write variables is the only thing that distinguishes an RNN from feed-forward models like LeNet.

The way Knet handles read-before-write variables is by initializing them to 0 arrays before any input is processed, then preserving the values between the calls. Thus during the first call in the above example, h would start as 0, d would be set to a * x .+ c, which in turn would cause h to get set to relu(a * x .+ c). During the second call, this value of h would be remembered and used, thus making the value of h at time t dependent on its value at time t-1.
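This recurrence can be checked numerically. Below is a NumPy sketch of a single rnn1 step (my own function name); h starts as zeros, so the first call reduces to relu(a*x + c) exactly as described, and every later call depends on the h from the previous step.

```python
import numpy as np

def rnn_step(a, b, c, x, h):
    # h_t = relu(a*x_t + b*h_{t-1} + c)
    return np.maximum(0, a @ x + b @ h + c)

hsize, xsize = 3, 2
rng = np.random.default_rng(0)
a = rng.normal(size=(hsize, xsize))
b = rng.normal(size=(hsize, hsize))
c = np.zeros((hsize, 1))
h = np.zeros((hsize, 1))       # read-before-write variable, initialized to 0
x = rng.normal(size=(xsize, 1))

h1 = rnn_step(a, b, c, x, h)   # equals relu(a@x + c), since h is all zeros
h2 = rnn_step(a, b, c, x, h1)  # depends on h1, i.e. on the state at time t-1
```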

It turns out simple RNNs like rnn1 are not very good at remembering things for a very long time. There are some techniques to improve their retention based on better initialization or smarter updates, but currently the most popular solution is using more complicated units like LSTMs and GRUs. These units control the information flow into and out of the unit using gates similar to digital circuits and can model long term dependencies. See (Colah, 2015) for a good overview of LSTMs.

Defining an LSTM in Knet is almost as concise as writing its mathematical definition:

@knet function lstm(x; fbias=1, o...)
    input  = wbf2(x,h; o..., f=:sigm)
    forget = wbf2(x,h; o..., f=:sigm, binit=Constant(fbias))
    output = wbf2(x,h; o..., f=:sigm)
    newmem = wbf2(x,h; o..., f=:tanh)
    cell = input .* newmem + cell .* forget
    h  = tanh(cell) .* output
    return h
end


The wbf2 operator applies an affine function (linear function + bias) to its two inputs, followed by an activation function (specified by the f keyword argument). Try to define this operator yourself as an exercise (see kfun.jl for the Knet definition).
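As a hint for the exercise, here is a NumPy sketch of the computation wbf2 performs, f(wx*x .+ wy*y .+ b). Passing the weights as explicit arguments is my own simplification for clarity; in Knet the parameters live inside the operator.

```python
import numpy as np

def sigm(z):
    return 1 / (1 + np.exp(-z))

def wbf2(x, y, wx, wy, b, f=sigm):
    """Affine function of two inputs followed by an activation:
    f(wx @ x + wy @ y + b)."""
    return f(wx @ x + wy @ y + b)

out, nx, ny = 4, 3, 4
rng = np.random.default_rng(0)
wx = rng.normal(size=(out, nx))
wy = rng.normal(size=(out, ny))
b = np.zeros((out, 1))
g = wbf2(rng.normal(size=(nx, 1)), rng.normal(size=(ny, 1)), wx, wy, b)
# with f=sigm the result is a gate vector with entries between 0 and 1
```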

The LSTM has an input gate, forget gate and an output gate that control information flow. Each gate depends on the current input x, and the last output h. The memory value cell is computed by blending a new value newmem with its old value under the control of input and forget gates. The output gate decides how much of the cell is shared with the outside world.

If an input gate element is close to 0, the corresponding element of the new input x will have little effect on the memory cell. If a forget gate element is close to 1, the contents of the corresponding memory cell will be preserved for a long time. Thus the LSTM has the ability to pay attention to the current input or to reminisce about the past, and it can learn when to do which based on the problem.
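These gating claims can be checked numerically. The sketch below isolates the cell = input .* newmem + cell .* forget line with the gates pinned to extreme values for illustration; in the real LSTM each gate is a learned function of x and h.

```python
import numpy as np

def cell_update(cell, newmem, igate, fgate):
    # The LSTM memory update: blend the new value with the old one
    # under the control of the input and forget gates.
    return igate * newmem + fgate * cell

cell = np.array([0.7])
for _ in range(1000):                 # forget gate at 1, input gate at 0:
    cell = cell_update(cell, np.array([5.0]), igate=0.0, fgate=1.0)
assert cell[0] == 0.7                 # the memory is preserved exactly

cell = cell_update(cell, np.array([5.0]), igate=1.0, fgate=0.0)
# with the input gate open, the new value 5.0 overwrites the cell
```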

In this section we introduced simple recurrent neural networks and LSTMs. We saw that having static variables is the only language feature necessary to implement RNNs. Next we will look at how to train them.

## 10. Training with sequences¶

(Karpathy, 2015) has lots of fun examples showing how character based language models based on LSTMs are surprisingly adept at generating text in many genres, from Wikipedia articles to C programs. To demonstrate training with sequences, we’ll implement one of these examples and build a model that can write like Shakespeare! After training on “The Complete Works of William Shakespeare” for less than an hour, here is a sample of brilliant writing you can expect from your model:

LUCETTA. Welcome, getzing a knot. There is as I thought you aim
Cack to Corioli.
MACBETH. So it were timen'd nobility and prayers after God'.
FIRST SOLDIER. O, that, a tailor, cold.
DIANA. Good Master Anne Warwick!
SECOND WARD. Hold, almost proverb as one worth ne'er;
And do I above thee confer to look his dead;
I'll know that you are ood'd with memines;
The name of Cupid wiltwite tears will hold
As so I fled; and purgut not brightens,
Their forves and speed as with these terms of Ely
Whose picture is not dignitories of which,
Their than disgrace to him she is.
GOBARIND. O Sure, ThisH more.,
wherein hath he been not their deed of quantity,
No ere we spoke itation on the tent.
I will be a thought of base-thief;
Then tears you ever steal to have you kindness.
And so, doth not make best in lady,
Your love was execreed'd fray where Thoman's nature;
I have bad Tlauphie he should sray and gentle,


First let’s download “The Complete Works of William Shakespeare” from Project Gutenberg:

julia> using Requests
julia> url="http://gutenberg.pglaf.org/1/0/100/100.txt";
julia> text=get(url).data
5589917-element Array{UInt8,1}:...


The text array now has all 5,589,917 characters of “The Complete Works” in a Julia array. If get does not work, you can download 100.txt by other means and use text=readall("100.txt") on the local file. We will use one-hot vectors to represent characters, so let’s map each character to an integer index $$1\ldots n$$:

julia> char2int = Dict();
julia> for c in text; get!(char2int, c, 1+length(char2int)); end
julia> nchar = length(char2int)
92


Dict is Julia’s standard associative collection for mapping arbitrary keys to values. get!(dict,key,default) returns the value for the given key, storing key=>default in dict if no mapping for the key is present. Going over the text array we discover 92 unique characters and map them to integers $$1\ldots 92$$.
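The same dictionary-building idiom in Python, for comparison: setdefault plays the role of Julia's get!, assigning each new character the next integer index the first time it is seen and leaving existing entries untouched.

```python
# Build a character-to-index mapping over a short sample text.
text = "to be or not to be"
char2int = {}
for c in text:
    char2int.setdefault(c, 1 + len(char2int))
# 't' was seen first and keeps index 1; repeats do not change the mapping
```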

We will train our RNN to read characters from text in sequence and predict the next character after each one. The training will go much faster if we use the minibatching trick we saw earlier and process multiple inputs at a time. To do this, we split the text array into batchsize subsequences of equal length. The first minibatch then holds the first character of each subsequence, the second minibatch holds the second characters, and so on. Each minibatch is represented by an nchar x batchsize matrix with one-hot columns. Here is a function that implements this type of sequence minibatching:

function seqbatch(seq, dict, batchsize)
    data = Any[]
    T = div(length(seq), batchsize)
    for t=1:T
        d=zeros(Float32, length(dict), batchsize)
        for b=1:batchsize
            c = dict[seq[t + (b-1) * T]]
            d[c,b] = 1
        end
        push!(data, d)
    end
    return data
end


Let’s use it to split text into minibatches of size 128:

julia> batchsize = 128;
julia> data = seqbatch(text, char2int, batchsize)
43671-element Array{Any,1}:...
julia> data[1]
92x128 Array{Float32,2}:...


The data array returned has T=length(text)/batchsize minibatches. The columns of minibatch data[t] refer to characters t, t+T, t+2T, ... from text. During training, when data[t] is the input, data[t+1] will be the desired output. Now that we have the data ready to go, let’s talk about RNN training.
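The column layout described above can be verified with a small Python sketch of the same minibatching scheme (0-based indices, so column b of minibatch t holds the one-hot code of character t + b*T): reading any single column across t = 0..T-1 walks through one contiguous subsequence of the text.

```python
import numpy as np

def seqbatch(seq, dict_, batchsize):
    """Split seq into batchsize equal-length subsequences and emit T
    one-hot minibatches of shape (len(dict_), batchsize)."""
    T = len(seq) // batchsize
    data = []
    for t in range(T):
        d = np.zeros((len(dict_), batchsize), dtype=np.float32)
        for b in range(batchsize):
            d[dict_[seq[t + b * T]] - 1, b] = 1  # 1-based indices -> rows
        data.append(d)
    return data

text = "abcdabcdabcd"
dict_ = {"a": 1, "b": 2, "c": 3, "d": 4}
data = seqbatch(text, dict_, 3)   # T = 4, three subsequences of length 4
# column 0 of data[0..3] spells out "abcd", the first subsequence
```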

RNN training is a bit more involved than training feed-forward models. We still have the prediction, gradient calculation and update steps, but not all three steps should be performed after every input. Here is a basic algorithm: Go forward nforw steps, remembering the desired outputs and model state, then perform nforw back steps accumulating gradients, finally update the parameters and reset the network for the next iteration:

function train(f, data, loss; nforw=100, gclip=0)
    reset!(f)
    ystack = Any[]
    T = length(data) - 1
    for t = 1:T
        x = data[t]
        y = data[t+1]
        sforw(f, x; dropout=true)
        push!(ystack, y)
        if (t % nforw == 0 || t == T)
            while !isempty(ystack)
                ygold = pop!(ystack)
                sback(f, ygold, loss)
            end
            update!(f; gclip=gclip)
            reset!(f; keepstate=true)
        end
    end
end


Note that we use sforw and sback instead of forw and back during sequence training: these save and restore internal state to allow multiple forward steps followed by multiple backward steps. reset! is necessary to zero out or recover the internal state before a sequence of forward steps. ystack is used to store the gold answers. The gclip keyword is for gradient clipping, a common RNN training strategy that keeps the parameters from diverging.
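A common way to clip gradients, sketched below in NumPy, is to rescale the gradient whenever its norm exceeds the threshold; I am assuming gclip works along these lines, though Knet's exact rule may differ in detail.

```python
import numpy as np

def clip_gradient(dw, gclip):
    """If the gradient norm exceeds gclip, scale the gradient so its
    norm equals gclip; otherwise leave it unchanged. gclip=0 disables
    clipping."""
    norm = np.linalg.norm(dw)
    if gclip > 0 and norm > gclip:
        dw = dw * (gclip / norm)
    return dw

dw = np.array([3.0, 4.0])           # norm 5
clipped = clip_gradient(dw, 1.0)    # rescaled to norm 1: [0.6, 0.8]
```

Clipping keeps a single large gradient from an unlucky minibatch from blowing up the parameters, which is a particular risk when gradients are accumulated over many unrolled time steps.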

With data and training script ready, all we need is a model. We will define a character based RNN language model using an LSTM:

@knet function charlm(x; embedding=0, hidden=0, pdrop=0, nchar=0)
    a = wdot(x; out=embedding)
    b = lstm(a; out=hidden)
    c = drop(b; pdrop=pdrop)
    return wbf(c; out=nchar, f=:soft)
end


wdot multiplies the one-hot representation x of the input character with an embedding matrix and turns it into a dense vector of size embedding. We apply an LSTM of size hidden to this dense vector, and dropout the result with probability pdrop. Finally wbf applies softmax to a linear function of the LSTM output to get a probability vector of size nchar for the next character.

(Karpathy, 2015) uses not one but several LSTM layers to simulate Shakespeare. In Knet, we can define a multi-layer LSTM model using the high-level operator repeat:

@knet function lstmdrop(a; pdrop=0, hidden=0)
    b = lstm(a; out=hidden)
    return drop(b; pdrop=pdrop)
end

@knet function charlm2(x; nlayer=0, embedding=0, hidden=0, pdrop=0, nchar=0)
    a = wdot(x; out=embedding)
    c = repeat(a; frepeat=:lstmdrop, nrepeat=nlayer, hidden=hidden, pdrop=pdrop)
    return wbf(c; out=nchar, f=:soft)
end


In charlm2, the repeat instruction will perform the frepeat operation nrepeat times starting with input a. Using charlm2 with nlayer=1 would be equivalent to the original charlm.

In the interest of time we will start with a small single layer model. With the following parameters, 10 epochs of training takes about 35-40 minutes on a K20 GPU:

julia> net = compile(:charlm; embedding=256, hidden=512, pdrop=0.2, nchar=nchar);
julia> setp(net; lr=1.0)
julia> for i=1:10; train(net, data, softloss; gclip=5.0); end


After spending this much time training a model, you probably want to save it. Knet uses the JLD module to save and load models and data. Calling clean(model) during a save is recommended to strip the model of temporary arrays which may save a lot of space. Don’t forget to save the char2int dictionary, otherwise it will be difficult to interpret the output of the model:

julia> using JLD
julia> JLD.save("charlm.jld", "model", clean(net), "dict", char2int);
julia> net2 = JLD.load("charlm.jld", "model")        # should create a copy of net
...



Finally, to generate the Shakespearean output we promised, we need to implement a generator. The following generator samples a character from the probability vector output by the model, prints it and feeds it back to the model to get the next character. Note that we use regular forw in generate, sforw is only necessary when training RNNs.

function generate(f, int2char, nchar)
    reset!(f)
    x=zeros(Float32, length(int2char), 1)
    y=zeros(Float32, length(int2char), 1)
    xi = 1
    for i=1:nchar
        copy!(y, forw(f,x))
        x[xi] = 0
        xi = sample(y)
        x[xi] = 1
        print(int2char[xi])
    end
    println()
end

function sample(pdist)
    r = rand(Float32)
    p = 0
    for c=1:length(pdist)
        p += pdist[c]
        r <= p && return c
    end
    return length(pdist)  # guard against rounding: pdist may sum to slightly less than 1
end

julia> int2char = Array(Char, length(char2int));
julia> for (c,i) in char2int; int2char[i] = Char(c); end
julia> generate(net, int2char, 1024)  # should generate 1024 chars of Shakespeare


In this section we saw how to train RNNs on sequence data using sforw, sback and reset!, how to minibatch sequences with one-hot columns, how to save and load models with JLD, and how to generate text by sampling from the model's predictions.

## Some useful tables¶

Table 1: Primitive Knet operators

| Operator | Description |
|----------|-------------|
| par() | a parameter array, updated during training; kwargs: dims, init |
| rnd() | a random array, updated every call; kwargs: dims, init |
| arr() | a constant array, never updated; kwargs: dims, init |
| dot(A,B) | matrix product of A and B; alternative notation: A * B |
| add(A,B) | elementwise broadcasting addition of arrays A and B; alternative notation: A .+ B |
| mul(A,B) | elementwise broadcasting multiplication of arrays A and B; alternative notation: A .* B |
| conv(W,X) | convolution with filter W and input X; kwargs: padding=0, stride=1, upscale=1, mode=CUDNN_CONVOLUTION |
| pool(X) | pooling; kwargs: window=2, padding=0, stride=window, mode=CUDNN_POOLING_MAX |
| axpb(X) | computes a*x^p+b; kwargs: a=1, p=1, b=0 |
| copy(X) | copies X to output |
| relu(X) | rectified linear activation function: (x > 0 ? x : 0) |
| sigm(X) | sigmoid activation function: 1/(1+exp(-x)) |
| soft(X) | softmax activation function: exp(xi) / Σ exp(xj) |
| tanh(X) | hyperbolic tangent activation function |

Table 2: Compound Knet operators

These operators combine several primitive operators and typically hide the parameters in their definitions to make code more readable.

| Operator | Description |
|----------|-------------|
| wdot(x) | apply a linear transformation w * x; kwargs: out=0, winit=Xavier() |
| bias(x) | add a bias x .+ b; kwargs: binit=Constant(0) |
| wb(x) | apply an affine function w * x .+ b; kwargs: out=0, winit=Xavier(), binit=Constant(0) |
| wf(x) | linear transformation + activation function f(w * x); kwargs: f=:relu, out=0, winit=Xavier() |
| wbf(x) | affine function + activation function f(w * x .+ b); kwargs: f=:relu, out=0, winit=Xavier(), binit=Constant(0) |
| wbf2(x,y) | affine function + activation function for two variables f(a*x .+ b*y .+ c); kwargs: f=:sigm, out=0, winit=Xavier(), binit=Constant(0) |
| wconv(x) | apply a convolution conv(w,x); kwargs: out=0, window=0, padding=0, stride=1, upscale=1, mode=CUDNN_CONVOLUTION, cinit=Xavier() |
| cbfp(x) | convolution, bias, activation function, and pooling; kwargs: f=:relu, out=0, cwindow=0, pwindow=0, cinit=Xavier(), binit=Constant(0) |
| drop(x) | replace a fraction pdrop of the input with 0 and scale the rest by 1/(1-pdrop); kwargs: pdrop=0 |
| lstm(x) | LSTM; kwargs: fbias=1, out=0, winit=Xavier(), binit=Constant(0) |
| irnn(x) | IRNN; kwargs: scale=1, out=0, winit=Xavier(), binit=Constant(0) |
| gru(x) | GRU; kwargs: out=0, winit=Xavier(), binit=Constant(0) |
| repeat(x) | apply operator frepeat to input x nrepeat times; kwargs: frepeat=nothing, nrepeat=0 |

Table 3: Random distributions

This table lists random distributions and other array fillers that can be used to initialize parameters (used with the init keyword argument of par).

| Distribution | Description |
|--------------|-------------|
| Bernoulli(p,scale) | outputs scale with probability p and 0 otherwise |
| Constant(val) | fills with the constant value val |
| Gaussian(mean,std) | normally distributed random values with mean mean and standard deviation std |
| Identity(scale) | identity matrix multiplied by scale |
| Uniform(min,max) | uniformly distributed random values between min and max |
| Xavier() | Xavier initialization: uniform in $$[-\sqrt{3/n},\sqrt{3/n}]$$ where n=length(a)/size(a)[end]; deprecated, please use Glorot |

Table 4: Loss functions

| Function | Description |
|----------|-------------|
| softloss(ypred,ygold) | Cross entropy loss: $$-E[p\log\hat{p}]$$ |
| quadloss(ypred,ygold) | Quadratic loss: $$\frac{1}{2} E[(y-\hat{y})^2]$$ |
| zeroone(ypred,ygold) | Zero-one loss: $$E[\arg\max y \neq \arg\max\hat{y}]$$ |

Table 5: Training options

We can control how exactly update! behaves by setting training options like the learning rate lr. I'll explain the mathematical motivation elsewhere, but algorithmically these options manipulate the dw array (sometimes using an auxiliary array dw2) before the subtraction, in order to reduce the loss faster. Here is a list of the training options supported by Knet and how they manipulate dw:

| Option | Description |
|--------|-------------|
| lr | Learning rate: dw *= lr |
| l1reg | L1 regularization: dw += l1reg * sign(w) |
| l2reg | L2 regularization: dw += l2reg * w |
| adagrad | Adagrad (boolean): dw2 += dw .* dw; dw = dw ./ (1e-8 + sqrt(dw2)) |
| rmsprop | Rmsprop (boolean): dw2 = 0.9 * dw2 + 0.1 * dw .* dw; dw = dw ./ (1e-8 + sqrt(dw2)) |
| adam | Adam (boolean); see http://arxiv.org/abs/1412.6980 |
| momentum | Momentum: dw += momentum * dw2; dw2 = dw |
| nesterov | Nesterov: dw2 = nesterov * dw2 + dw; dw += nesterov * dw2 |
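
To make the table concrete, here is a minimal NumPy sketch of how two of these options compose into a single parameter update. The order of application (regularization before the learning-rate scaling) is my own assumption for illustration; the function name is hypothetical.

```python
import numpy as np

def sgd_update(w, dw, lr=0.1, l2reg=0.0):
    """One SGD step: modify the raw gradient dw per the table, then
    subtract it from the weights."""
    dw = dw + l2reg * w      # L2 regularization: dw += l2reg * w
    dw = dw * lr             # learning rate:     dw *= lr
    return w - dw

w = np.array([1.0, -2.0])
dw = np.array([0.5, 0.5])
w = sgd_update(w, dw, lr=0.1, l2reg=0.01)
# w is now [0.949, -2.048]
```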

Table 6: Summary of modeling related functions

| Function | Description |
|----------|-------------|
| @knet function ... end | defines a @knet function that can be used as a model or a new operator |
| if cond ... else ... end | conditional evaluation in a @knet function with condition variable cond supplied by forw |
| compile(:kfun; o...) | creates a model from @knet function kfun; kwargs are used for model configuration |
| forw(f,x; o...) | returns the prediction of model f on input x; kwargs are used for setting conditions |
| back(f,ygold,loss) | computes the loss gradients for the parameters of f given the desired output ygold and loss function loss |
| update!(f) | updates the parameters of f using the gradients computed by back to reduce the loss |
| get(f,:w) | returns parameter w of model f |
| setp(f; opt=val...) | sets training options for model f |
| minibatch(x,y,batchsize) | splits data into minibatches |