Introduction to Tensorflow and Keras
This notebook is part of the AI for Beginners Curriculum. Visit the repository for the complete set of learning materials.
Neural Frameworks
We have learnt that to train neural networks you need to:
- Quickly multiply matrices (tensors)
- Compute gradients to perform gradient descent optimization
What neural network frameworks allow you to do:
- Operate with tensors on whatever compute is available, CPU or GPU, or even TPU
- Automatically compute gradients (they are explicitly programmed for all built-in tensor functions)
Optionally:
- Neural Network constructor / higher level API (describe network as a sequence of layers)
- Simple training functions (`fit`, as in Scikit Learn)
- A number of optimization algorithms in addition to gradient descent
- Data handling abstractions (that will ideally work on GPU, too)
Most Popular Frameworks
- Tensorflow 1.x - one of the first widely adopted frameworks (Google). It let you define a static computation graph, push it to the GPU, and explicitly evaluate it
- PyTorch - a framework from Facebook that is growing in popularity
- Keras - a higher-level API on top of Tensorflow/PyTorch that unifies and simplifies the use of neural networks (Francois Chollet)
- Tensorflow 2.x + Keras - a new version of Tensorflow with integrated Keras functionality, which supports dynamic computation graphs and lets you perform tensor operations in a way very similar to numpy (and PyTorch)
We will consider Tensorflow 2.x and Keras. Make sure you have version 2.x.x of Tensorflow installed:
pip install tensorflow
or
conda install tensorflow
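A quick way to confirm which version is installed is to import the library and print its version string. This is a minimal sketch; the alias `tf` is the usual convention and is assumed throughout the examples added to this notebook.

```python
import tensorflow as tf

# Print the installed version; the examples here assume a 2.x release
print(tf.__version__)
```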
2.7.0
Basic Concepts: Tensor
A tensor is a multi-dimensional array. It is very convenient to use tensors to represent different types of data:
- 400x400 - black-and-white picture
- 400x400x3 - color picture
- 16x400x400x3 - minibatch of 16 color pictures
- 25x400x400x3 - one second of 25-fps video
- 8x25x400x400x3 - minibatch of 8 1-second videos
Simple Tensors
You can easily create simple tensors from lists or NumPy arrays, or generate random ones:
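A minimal sketch of these options could look as follows. The variable names `a`, `b` and `r` are illustrative, and the random values will differ from run to run.

```python
import numpy as np
import tensorflow as tf

# A constant tensor built from a nested Python list (dtype is inferred as int32)
a = tf.constant([[1, 2], [3, 4]])
print(a)

# A tensor built from a NumPy array keeps the array's dtype (float64 here)
b = tf.constant(np.array([[1.0, 2.0], [3.0, 4.0]]))

# A 10x3 tensor of samples from the standard normal distribution
r = tf.random.normal(shape=(10, 3))
print(r)
```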
tf.Tensor(
[[1 2]
 [3 4]], shape=(2, 2), dtype=int32)
tf.Tensor(
[[-0.33552304 -1.8252622  -1.8532339 ]
 [ 1.0871267  -1.2779568   0.5240014 ]
 [-0.12793781 -1.8618349  -0.9020286 ]
 [ 0.5948797   0.11144501 -2.0396452 ]
 [ 0.47620854  1.1726047  -0.4405675 ]
 [-0.27211484 -0.08985762 -0.03376012]
 [ 0.64274263  0.53368104 -0.9006528 ]
 [-0.43745974 -1.0081122  -0.13442488]
 [ 0.36497566  1.3221073  -1.8739727 ]
 [ 0.94821155 -0.02817811  1.3563292 ]], shape=(10, 3), dtype=float32)
You can use arithmetic operations on tensors, which are performed element-wise, as in numpy. If the shapes differ, tensors are automatically expanded (broadcast) to the required dimensions. To extract a numpy array from a tensor, use .numpy():
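As an illustration, the sketch below subtracts the first row from every row (so the first row of the result is all zeros) and converts one row of an element-wise exponent to a NumPy array. The names `diff` and `first_row` are hypothetical.

```python
import numpy as np
import tensorflow as tf

a = tf.random.normal(shape=(10, 3))

# a[0] has shape (3,); it is broadcast across all 10 rows of a
diff = a - a[0]
print(diff)

# Element-wise exponent, then the first row converted to a NumPy array
first_row = tf.exp(a)[0].numpy()
print(first_row)
```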
tf.Tensor(
[[ 0.          0.          0.        ]
 [ 1.4226497   0.54730535  2.3772354 ]
 [ 0.20758523 -0.03657269  0.9512053 ]
 [ 0.93040276  1.9367073  -0.18641126]
 [ 0.8117316   2.9978669   1.4126664 ]
 [ 0.0634082   1.7354046   1.8194739 ]
 [ 0.97826564  2.3589432   0.9525811 ]
 [-0.1019367   0.81715     1.718809  ]
 [ 0.7004987   3.1473694  -0.02073872]
 [ 1.2837346   1.7970841   3.2095633 ]], shape=(10, 3), dtype=float32)
[0.71496403 0.16117539 0.15672949]
Variables
Variables are useful to represent tensor values that can be modified using assign and assign_add. They are often used to represent neural network weights.
As an example, here is a silly way to get a sum of all rows of tensor a:
<tf.Variable 'Variable:0' shape=(3,) dtype=float32, numpy=array([ 2.9411097, -2.9513645, -6.2979555], dtype=float32)>
A much better way to do it:
<tf.Tensor: shape=(3,), dtype=float32, numpy=array([ 2.9411097, -2.9513645, -6.2979555], dtype=float32)>
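The two approaches can be sketched side by side. This is an illustrative reconstruction rather than the notebook's exact cells; the names `s` and `total` are assumptions.

```python
import numpy as np
import tensorflow as tf

a = tf.random.normal(shape=(10, 3))

# The slow way: accumulate the rows one by one in a mutable variable
s = tf.Variable(tf.zeros_like(a[0]))
for row in a:
    s.assign_add(row)

# The idiomatic way: a single reduction along axis 0
total = tf.reduce_sum(a, axis=0)
print(s.numpy(), total.numpy())
```

Both results agree up to floating point rounding, but the reduction is a single tensor operation and does not need a Python loop.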
Computing Gradients
For back propagation, you need to compute gradients. This is done using tf.GradientTape() idiom:
- Add a `with tf.GradientTape()` block around our computations
- Mark those tensors with respect to which we need to compute gradients by calling `tape.watch` (all variables are watched automatically)
- Compute whatever we need (build computational graph)
- Obtain gradients using `tape.gradient`
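The steps above can be sketched as follows. The function being differentiated, c = sqrt(a^2 + b^2), is chosen for illustration; its gradient with respect to `a` is a / c, which makes the result easy to check.

```python
import numpy as np
import tensorflow as tf

a = tf.random.normal(shape=(2, 2))
b = tf.random.normal(shape=(2, 2))

with tf.GradientTape() as tape:
    tape.watch(a)  # a is a plain tensor, so it must be watched explicitly
    c = tf.sqrt(tf.square(a) + tf.square(b))

# Gradient of the sum of c's elements with respect to a, i.e. a / sqrt(a^2 + b^2)
dc_da = tape.gradient(c, a)
print(dc_da)
```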
tf.Tensor( [[ 0.40935674 -0.3495818 ] [ 0.94165146 -0.33209163]], shape=(2, 2), dtype=float32)
Example 1: Linear Regression
Now we know enough to solve the classical problem of linear regression. Let's generate a small synthetic dataset:
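Linear regression is defined by a straight line $f_{W,b}(x) = Wx + b$, where $W, b$ are model parameters that we need to find. An error on our dataset $\{x_i, y_i\}_{i=1}^N$ (also called loss function) can be defined as mean square error:

$$\mathcal{L}(W,b) = \frac{1}{N}\sum_{i=1}^N \left(f_{W,b}(x_i) - y_i\right)^2$$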
Let's define our model and loss function:
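We will train the model on a series of minibatches. We will use gradient descent, adjusting model parameters using the following formulae, where $\eta$ is the learning rate and $\mathcal{L}$ is the loss:

$$W^{(n+1)} = W^{(n)} - \eta\frac{\partial\mathcal{L}}{\partial W}, \qquad b^{(n+1)} = b^{(n)} - \eta\frac{\partial\mathcal{L}}{\partial b}$$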
Let's do the training. We will do several passes through the dataset (so-called epochs), divide it into minibatches and call the function defined above:
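A minimal end-to-end sketch of this procedure is shown below. It is not the notebook's original code: the ground-truth line y = 2x + 1, the noise level, the seed, the batch size of 4 and the helper names `f`, `compute_loss` and `train_on_batch` are all assumptions made for illustration. Labels are reshaped to a column so that they do not broadcast against the (batch, 1) predictions.

```python
import numpy as np
import tensorflow as tf

# Synthetic data: assumed ground truth y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(13)
train_x = np.linspace(0, 3, 120).astype(np.float32)
train_labels = (2 * train_x + 1 + rng.normal(0, 0.5, train_x.shape)).astype(np.float32)

learning_rate = 0.1
w = tf.Variable([[100.0]])               # deliberately poor initial weight
b = tf.Variable(tf.zeros(shape=(1,)))

def f(x):
    return tf.matmul(x, w) + b

def compute_loss(labels, predictions):
    return tf.reduce_mean(tf.square(labels - predictions))

def train_on_batch(x, y):
    with tf.GradientTape() as tape:
        loss = compute_loss(y, f(x))
    dloss_dw, dloss_db = tape.gradient(loss, [w, b])
    # One gradient descent step for each parameter
    w.assign_sub(learning_rate * dloss_dw)
    b.assign_sub(learning_rate * dloss_db)
    return loss

# Shuffle once, then iterate over minibatches for several epochs
indices = rng.permutation(len(train_x))
features = tf.constant(train_x[indices].reshape(-1, 1))
labels = tf.constant(train_labels[indices].reshape(-1, 1))
batch_size = 4
for epoch in range(10):
    for i in range(0, features.shape[0], batch_size):
        loss = train_on_batch(features[i:i + batch_size], labels[i:i + batch_size])
    print('Epoch %d: last batch loss = %.4f' % (epoch, float(loss)))
```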
Epoch 0: last batch loss = 94.5247
Epoch 1: last batch loss = 9.3428
Epoch 2: last batch loss = 1.4166
Epoch 3: last batch loss = 0.5224
Epoch 4: last batch loss = 0.3807
Epoch 5: last batch loss = 0.3495
Epoch 6: last batch loss = 0.3413
Epoch 7: last batch loss = 0.3390
Epoch 8: last batch loss = 0.3384
Epoch 9: last batch loss = 0.3382
We have now obtained optimized parameters $W$ and $b$. Note that their values are close to the original values used when generating the dataset.
(<tf.Variable 'Variable:0' shape=(1, 1) dtype=float32, numpy=array([[1.8616779]], dtype=float32)>, <tf.Variable 'Variable:0' shape=(1,) dtype=float32, numpy=array([1.0710956], dtype=float32)>)
Computational Graph and GPU Computations
Whenever we compute a tensor expression, Tensorflow builds a computational graph that can be executed on the available computing device, e.g. CPU or GPU. Since we used an arbitrary Python function in our code, it cannot be included in the computational graph, so when running the code on a GPU we would need to pass data back and forth between CPU and GPU and compute the custom function on the CPU.
Tensorflow allows us to mark our Python function with the @tf.function decorator, which makes the function part of the same computational graph. This decorator can be applied to functions that use standard Tensorflow tensor operations.
The code has not changed, but if you ran it on a GPU and on a larger dataset, you would notice the difference in speed.
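As a sketch of what this looks like in practice, the training step below is identical to a plain Python version except for the decorator. The variable names, the tiny dataset and the learning rate are assumptions for illustration.

```python
import tensorflow as tf

w = tf.Variable([[100.0]])
b = tf.Variable(tf.zeros(shape=(1,)))
learning_rate = 0.1

@tf.function
def train_on_batch(x, y):
    # On the first call this Python function is traced into a Tensorflow graph
    with tf.GradientTape() as tape:
        predictions = tf.matmul(x, w) + b
        loss = tf.reduce_mean(tf.square(y - predictions))
    dloss_dw, dloss_db = tape.gradient(loss, [w, b])
    w.assign_sub(learning_rate * dloss_dw)
    b.assign_sub(learning_rate * dloss_db)
    return loss

x = tf.constant([[0.0], [1.0], [2.0], [3.0]])
y = 2 * x + 1
first = train_on_batch(x, y)
second = train_on_batch(x, y)
print(float(first), float(second))
```

Note that the variables are created outside the decorated function; creating new variables on every call inside a `tf.function` is not supported.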
Dataset API
Tensorflow contains a convenient API to work with data. Let's try to use it. We will also train our model from scratch.
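One possible sketch uses `tf.data.Dataset.from_tensor_slices` to wrap NumPy arrays, then shuffles and batches them. The synthetic data (y = 2x + 1 plus noise), the batch size of 32 and the helper `train_on_batch` are assumptions for illustration, not the notebook's original cell.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(13)
train_x = np.linspace(0, 3, 120).astype(np.float32)
train_labels = (2 * train_x + 1 + rng.normal(0, 0.5, train_x.shape)).astype(np.float32)

# Start from scratch with a poor initial guess
w = tf.Variable([[100.0]])
b = tf.Variable(tf.zeros(shape=(1,)))
learning_rate = 0.1

@tf.function
def train_on_batch(x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(y - (tf.matmul(x, w) + b)))
    dloss_dw, dloss_db = tape.gradient(loss, [w, b])
    w.assign_sub(learning_rate * dloss_dw)
    b.assign_sub(learning_rate * dloss_db)
    return loss

# Each dataset element is a (feature, label) pair; shuffle, then group into batches
dataset = tf.data.Dataset.from_tensor_slices(
    (train_x.reshape(-1, 1), train_labels.reshape(-1, 1)))
dataset = dataset.shuffle(buffer_size=len(train_x)).batch(32)

initial_loss = float(tf.reduce_mean(
    tf.square(train_labels.reshape(-1, 1) - (tf.matmul(train_x.reshape(-1, 1), w) + b))))

for epoch in range(10):
    for x, y in dataset:
        loss = train_on_batch(x, y)
    print('Epoch %d: last batch loss = %.4f' % (epoch, float(loss)))
```

With 120 samples and a batch size of 32, each epoch consists of three full batches and one final batch of 24 elements.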
Epoch 0: last batch loss = 173.4585
Epoch 1: last batch loss = 13.8459
Epoch 2: last batch loss = 4.5407
Epoch 3: last batch loss = 3.7364
Epoch 4: last batch loss = 3.4334
Epoch 5: last batch loss = 3.1790
Epoch 6: last batch loss = 2.9458
Epoch 7: last batch loss = 2.7311
Epoch 8: last batch loss = 2.5332
Epoch 9: last batch loss = 2.3508
Example 2: Classification
Now we will consider a binary classification problem. A good example of such a problem would be classifying a tumour as malignant or benign based on its size and age.
The core model is similar to regression, but we need to use a different loss function. Let's start by generating sample data:
Normalizing Data
Before training, it is common to bring the input features to a standard range of [0,1] (or [-1,1]). We will discuss the exact reasons later in the course, but in short: we want to avoid values flowing through the network becoming too big or too small, so we normally keep all values in a small range close to 0. Thus we initialize the weights with small random numbers, and we keep signals in the same range.
When normalizing data, we need to subtract min value and divide by range. We compute min value and range using training data, and then normalize test/validation dataset using the same min/range values from the training set. This is because in real life we will only know the training set, and not all incoming new values that the network would be asked to predict. Occasionally, the new value may fall out of the [0,1] range, but that's not crucial.
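The following sketch illustrates min-max normalization with statistics taken from the training set only. The feature values are made up for the example, and the second validation row deliberately falls outside the training range.

```python
import numpy as np

train_x = np.array([[1.0, 20.0], [3.0, 40.0], [5.0, 60.0]])   # hypothetical training features
valid_x = np.array([[2.0, 30.0], [6.0, 10.0]])               # hypothetical validation features

# Minimum and range are computed on the training set only
train_min = train_x.min(axis=0)
train_range = train_x.max(axis=0) - train_min

train_x_norm = (train_x - train_min) / train_range
valid_x_norm = (valid_x - train_min) / train_range   # reuse the training statistics
print(valid_x_norm)
```

Here the training features land exactly in [0,1], while the second validation row maps to 1.25 and -0.25, which is the out-of-range case mentioned above.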
Training One-Layer Perceptron
Let's use the Tensorflow gradient computing machinery to train a one-layer perceptron.
Our neural network will have 2 inputs and 1 output. The weight matrix $W$ will have size $2\times 1$, and the bias vector $b$ will have size $1$.
The core model will be the same as in the previous example, but the loss function will be a logistic loss. To apply logistic loss, we need to get the value of probability as the output of our network, i.e. we need to bring the output $z$ to the range [0,1] using the sigmoid activation function: $p = \sigma(z)$, where $\sigma(z) = \frac{1}{1+e^{-z}}$.
If we get the probability $p_i$ for the i-th input value corresponding to the actual class $y_i \in \{0,1\}$, we compute the loss as $\mathcal{L}_i = -(y_i \log p_i + (1-y_i)\log(1-p_i))$.
In Tensorflow, both of those steps (applying sigmoid and then logistic loss) can be done using one call to the tf.nn.sigmoid_cross_entropy_with_logits function. Since we are training our network in minibatches, we need to average the loss across all elements of a minibatch using reduce_mean:
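To make the equivalence concrete, the sketch below computes the averaged loss both ways, once with the fused function and once by applying sigmoid and the logistic loss formula explicitly. The logits `z` and labels `y` are made-up values.

```python
import numpy as np
import tensorflow as tf

z = tf.constant([[2.0], [-1.0], [0.5]])   # raw network outputs (logits), illustrative values
y = tf.constant([[1.0], [0.0], [1.0]])    # true classes

# Fused, numerically stable version
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=z))

# The same quantity spelled out: sigmoid followed by the logistic loss
p = tf.sigmoid(z)
manual = tf.reduce_mean(-(y * tf.math.log(p) + (1 - y) * tf.math.log(1 - p)))
print(float(loss), float(manual))
```

The fused version is generally preferable because it avoids taking the logarithm of values that are very close to 0.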
We will use minibatches of 16 elements, and do a few epochs of training:
Epoch 0: last batch loss = 0.3823
Epoch 1: last batch loss = 0.5243
Epoch 2: last batch loss = 0.4510
Epoch 3: last batch loss = 0.3261
Epoch 4: last batch loss = 0.4177
Epoch 5: last batch loss = 0.3323
Epoch 6: last batch loss = 0.6294
Epoch 7: last batch loss = 0.6334
Epoch 8: last batch loss = 0.2571
Epoch 9: last batch loss = 0.3425
To make sure our training worked, let's plot the line that separates the two classes. The separation line is the set of points where the predicted probability equals 0.5, i.e. $\sigma(W x + b) = 0.5$, or equivalently $W x + b = 0$.
Let's see how our model behaves on the validation data.
To compute the accuracy on the validation data, we can cast boolean type to float, and compute the mean:
<tf.Tensor: shape=(), dtype=float32, numpy=0.46666667>
Let's explain what goes on here:
- `pred` is the values predicted by the network. They are not quite probabilities, because we have not used an activation function, but values greater than 0.5 correspond to class 1, and smaller values to class 0. (Strictly speaking, a sigmoid probability of 0.5 corresponds to a raw output of 0, so this threshold is slightly biased towards class 0.)
- `pred[0]>0.5` creates a boolean tensor of results, where `True` corresponds to class 1, and `False` to class 0
- We compare that tensor to the expected labels `valid_labels`, getting a boolean vector of correct predictions, where `True` corresponds to a correct prediction, and `False` to an incorrect one
- We convert that tensor to floating point using `tf.cast`
- We then compute the mean value using `tf.reduce_mean`; that is exactly our desired accuracy
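The steps above can be traced on a handful of made-up values. The tensors `pred` and `valid_labels` below are hypothetical stand-ins for the network outputs and the validation labels; three of the four predictions are correct.

```python
import tensorflow as tf

pred = tf.constant([[0.9], [0.2], [0.7], [0.1]])          # hypothetical network outputs
valid_labels = tf.constant([[1.0], [0.0], [0.0], [0.0]])  # hypothetical expected labels

# Boolean tensors: predicted class vs. expected class
correct = tf.equal(pred > 0.5, valid_labels > 0.5)

# True/False become 1.0/0.0, so the mean is the fraction of correct predictions
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
print(float(accuracy))  # 0.75
```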
Using TensorFlow/Keras Optimizers
Tensorflow is closely integrated with Keras, which contains a lot of useful functionality. For example, we can use different optimization algorithms. Let's do that, and also print obtained accuracy during training.
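A minimal sketch of a custom training loop driven by a Keras optimizer is shown below. It departs from the earlier examples in two ways that are assumptions of this sketch: the weights live in a `tf.keras.layers.Dense` layer rather than in hand-made variables, and the toy data (class 1 when x0 + x1 > 0) is synthetic. Accuracy is computed by thresholding the raw outputs at 0, which corresponds to a probability of 0.5.

```python
import numpy as np
import tensorflow as tf

# Hypothetical, linearly separable toy data: class 1 when x0 + x1 > 0
rng = np.random.default_rng(0)
features = rng.uniform(-1, 1, size=(64, 2)).astype(np.float32)
labels = (features.sum(axis=1, keepdims=True) > 0).astype(np.float32)
x_all, y_all = tf.constant(features), tf.constant(labels)

layer = tf.keras.layers.Dense(1)                        # holds W and b, outputs logits
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

def train_on_batch(x, y):
    with tf.GradientTape() as tape:
        z = layer(x)
        loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=z))
    grads = tape.gradient(loss, layer.trainable_variables)
    # The optimizer decides how gradients are turned into parameter updates
    optimizer.apply_gradients(zip(grads, layer.trainable_variables))
    acc = tf.reduce_mean(tf.cast(tf.equal(z > 0, y > 0.5), tf.float32))
    return loss, acc

for epoch in range(20):
    for i in range(0, x_all.shape[0], 16):
        loss, acc = train_on_batch(x_all[i:i + 16], y_all[i:i + 16])
    print('Epoch %d: last batch loss = %.4f, acc = %.4f' % (epoch, float(loss), float(acc)))
```

Swapping `Adam` for another optimizer such as `tf.keras.optimizers.SGD` only changes the line that constructs the optimizer.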
Epoch 0: last batch loss = 4.7787, acc = 1.0000
Epoch 1: last batch loss = 8.4343, acc = 0.5000
Epoch 2: last batch loss = 8.3255, acc = 0.5000
Epoch 3: last batch loss = 7.5579, acc = 0.5000
Epoch 4: last batch loss = 6.5254, acc = 0.5000
Epoch 5: last batch loss = 7.3800, acc = 0.5000
Epoch 6: last batch loss = 7.7586, acc = 0.5000
Epoch 7: last batch loss = 10.4724, acc = 0.0000
Epoch 8: last batch loss = 9.4423, acc = 0.5000
Epoch 9: last batch loss = 4.1888, acc = 1.0000
Epoch 10: last batch loss = 11.2127, acc = 0.0000
Epoch 11: last batch loss = 9.0417, acc = 0.5000
Epoch 12: last batch loss = 7.9847, acc = 0.5000
Epoch 13: last batch loss = 3.7879, acc = 1.0000
Epoch 14: last batch loss = 6.8455, acc = 0.5000
Epoch 15: last batch loss = 6.5204, acc = 0.5000
Epoch 16: last batch loss = 9.2386, acc = 0.5000
Epoch 17: last batch loss = 6.2447, acc = 0.5000
Epoch 18: last batch loss = 3.9107, acc = 1.0000
Epoch 19: last batch loss = 5.7645, acc = 1.0000
Task 1: Plot the graphs of loss function and accuracy on training and validation data during training
Task 2: Try to solve the MNIST classification problem using this code. Hint: use softmax_cross_entropy_with_logits or sparse_softmax_cross_entropy_with_logits as the loss function. In the first case you need to feed expected output values in one-hot encoding, and in the second case as an integer class number.
Keras
Deep Learning for Humans
- Keras is a library originally developed by Francois Chollet to work on top of Tensorflow, CNTK and Theano, to unify all lower-level frameworks. You can still install Keras as a separate library, but it is not advised to do so.
- Now Keras is included as part of Tensorflow library
- You can easily construct neural networks from layers
- Contains a `fit` function to do all training, plus a lot of functions to work with typical data (pictures, text, etc.)
- A lot of samples
- Functional API vs. Sequential API
Keras provides higher level abstractions for neural networks, allowing us to operate in terms of layers, models and optimizers, and not in terms of tensors and gradients.
A classic deep learning book from the creator of Keras: Deep Learning with Python
Functional API
When using the functional API, we define the input to the network as keras.Input, and then compute the output by passing it through a series of computations. Finally, we define the model as an object that transforms input into output.
Once we have obtained the model object, we need to:
- Compile it, by specifying the loss function and the optimizer that we want to use with our model
- Train it by calling the `fit` function with the training (and possibly validation) data
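A minimal sketch of these steps for a one-layer classifier might look as follows. The toy data, the learning rate of 0.1 and the batch size of 8 are assumptions of this sketch; the layer arguments were chosen to give the same parameter count (3) as the summary shown below.

```python
import numpy as np
import tensorflow as tf

# Hypothetical, already normalized toy data: class 1 when x0 + x1 > 1
rng = np.random.default_rng(0)
train_x_norm = rng.uniform(0, 1, size=(70, 2)).astype(np.float32)
train_labels = (train_x_norm.sum(axis=1, keepdims=True) > 1).astype(np.float32)

# Input -> one dense unit with sigmoid activation
inputs = tf.keras.Input(shape=(2,))
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Compile: choose the optimizer, the loss and the metrics to report
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

# Train: fit returns a History object with per-epoch loss and metrics
history = model.fit(train_x_norm, train_labels, batch_size=8, epochs=15)
```

The per-epoch values in `history.history['loss']` and `history.history['accuracy']` can then be plotted to inspect training progress.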
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 2)] 0
dense (Dense) (None, 1) 3
=================================================================
Total params: 3
Trainable params: 3
Non-trainable params: 0
_________________________________________________________________
Epoch 1/15
9/9 [==============================] - 1s 2ms/step - loss: 0.7812 - accuracy: 0.2857
Epoch 2/15
9/9 [==============================] - 0s 2ms/step - loss: 0.7142 - accuracy: 0.4000
Epoch 3/15
9/9 [==============================] - 0s 2ms/step - loss: 0.6683 - accuracy: 0.6143
Epoch 4/15
9/9 [==============================] - 0s 2ms/step - loss: 0.6221 - accuracy: 0.8429
Epoch 5/15
9/9 [==============================] - 0s 2ms/step - loss: 0.5843 - accuracy: 0.8857
Epoch 6/15
9/9 [==============================] - 0s 2ms/step - loss: 0.5447 - accuracy: 0.9429
Epoch 7/15
9/9 [==============================] - 0s 2ms/step - loss: 0.5135 - accuracy: 0.9286
Epoch 8/15
9/9 [==============================] - 0s 2ms/step - loss: 0.4878 - accuracy: 0.9429
Epoch 9/15
9/9 [==============================] - 0s 2ms/step - loss: 0.4679 - accuracy: 0.9429
Epoch 10/15
9/9 [==============================] - 0s 2ms/step - loss: 0.4446 - accuracy: 0.9429
Epoch 11/15
9/9 [==============================] - 0s 2ms/step - loss: 0.4349 - accuracy: 0.8714
Epoch 12/15
9/9 [==============================] - 0s 2ms/step - loss: 0.4156 - accuracy: 0.9286
Epoch 13/15
9/9 [==============================] - 0s 2ms/step - loss: 0.4019 - accuracy: 0.9429
Epoch 14/15
9/9 [==============================] - 0s 2ms/step - loss: 0.3908 - accuracy: 0.9286
Epoch 15/15
9/9 [==============================] - 0s 2ms/step - loss: 0.3777 - accuracy: 0.9286
Sequential API
Alternatively, we can think of a model as a sequence of layers, and just specify those layers by adding them to the model object:
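A sketch of a two-layer network built this way is shown below. The hidden layer size of 5 matches the parameter counts in the summary that follows (15 + 6 = 21); the toy data, the helper `make_split` and the choice of sigmoid for the hidden layer are assumptions of this sketch.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)

def make_split(n):
    # Hypothetical toy data: class 1 when x0 + x1 > 1
    x = rng.uniform(0, 1, size=(n, 2)).astype(np.float32)
    y = (x.sum(axis=1, keepdims=True) > 1).astype(np.float32)
    return x, y

train_x_norm, train_labels = make_split(70)
valid_x_norm, valid_labels = make_split(15)

# Layers are added one after another; the output of each feeds the next
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(2,)))
model.add(tf.keras.layers.Dense(5, activation='sigmoid'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

# Passing validation data adds val_loss and val_accuracy to the training log
history = model.fit(train_x_norm, train_labels,
                    validation_data=(valid_x_norm, valid_labels),
                    batch_size=8, epochs=15)
```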
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 5) 15
dense_2 (Dense) (None, 1) 6
=================================================================
Total params: 21
Trainable params: 21
Non-trainable params: 0
_________________________________________________________________
Epoch 1/15
9/9 [==============================] - 1s 64ms/step - loss: 0.6994 - accuracy: 0.5000 - val_loss: 0.6719 - val_accuracy: 0.4667
Epoch 2/15
9/9 [==============================] - 0s 6ms/step - loss: 0.6635 - accuracy: 0.5429 - val_loss: 0.6531 - val_accuracy: 0.4667
Epoch 3/15
9/9 [==============================] - 0s 5ms/step - loss: 0.6469 - accuracy: 0.5857 - val_loss: 0.5775 - val_accuracy: 1.0000
Epoch 4/15
9/9 [==============================] - 0s 4ms/step - loss: 0.5639 - accuracy: 0.9143 - val_loss: 0.5395 - val_accuracy: 0.7333
Epoch 5/15
9/9 [==============================] - 0s 5ms/step - loss: 0.5236 - accuracy: 0.7143 - val_loss: 0.4498 - val_accuracy: 0.9333
Epoch 6/15
9/9 [==============================] - 0s 5ms/step - loss: 0.4573 - accuracy: 0.8714 - val_loss: 0.3584 - val_accuracy: 1.0000
Epoch 7/15
9/9 [==============================] - 0s 5ms/step - loss: 0.3867 - accuracy: 0.8714 - val_loss: 0.2989 - val_accuracy: 0.9333
Epoch 8/15
9/9 [==============================] - 0s 7ms/step - loss: 0.3388 - accuracy: 0.8857 - val_loss: 0.2204 - val_accuracy: 1.0000
Epoch 9/15
9/9 [==============================] - 0s 6ms/step - loss: 0.2815 - accuracy: 0.9429 - val_loss: 0.1957 - val_accuracy: 1.0000
Epoch 10/15
9/9 [==============================] - 0s 6ms/step - loss: 0.2692 - accuracy: 0.8857 - val_loss: 0.1323 - val_accuracy: 1.0000
Epoch 11/15
9/9 [==============================] - 0s 5ms/step - loss: 0.2591 - accuracy: 0.9429 - val_loss: 0.1105 - val_accuracy: 1.0000
Epoch 12/15
9/9 [==============================] - 0s 6ms/step - loss: 0.2229 - accuracy: 0.9286 - val_loss: 0.1051 - val_accuracy: 1.0000
Epoch 13/15
9/9 [==============================] - 0s 5ms/step - loss: 0.2146 - accuracy: 0.9143 - val_loss: 0.0919 - val_accuracy: 1.0000
Epoch 14/15
9/9 [==============================] - 0s 5ms/step - loss: 0.2031 - accuracy: 0.9429 - val_loss: 0.0859 - val_accuracy: 1.0000
Epoch 15/15
9/9 [==============================] - 0s 5ms/step - loss: 0.1997 - accuracy: 0.9429 - val_loss: 0.0829 - val_accuracy: 1.0000
Classification Loss Functions
It is important to correctly specify loss function and activation function on the last layer of the network. The main rules are the following:
- If the network has one output (binary classification), we use sigmoid activation function, for multiclass classification - softmax
- If the output class is represented as one-hot-encoding, the loss function will be cross entropy loss (categorical cross-entropy), if the output contains class number - sparse categorical cross-entropy. For binary classification - use binary cross-entropy (same as log loss)
- Multi-label classification is when an object can belong to several classes at the same time. In this case, we encode the labels as a one-hot-style (multi-hot) vector with one 0/1 entry per class, use sigmoid as the activation function so that each class probability is independently between 0 and 1, and use binary cross-entropy as the loss.
| Classification | Label Format | Activation Function | Loss |
|---|---|---|---|
| Binary | Probability of 1st class | sigmoid | binary crossentropy |
| Binary | One-hot encoding (2 outputs) | softmax | categorical crossentropy |
| Multiclass | One-hot encoding | softmax | categorical crossentropy |
| Multiclass | Class Number | softmax | sparse categorical crossentropy |
| Multilabel | One-hot encoding | sigmoid | binary crossentropy |
Binary classification can also be handled as a special case of multi-class classification with two outputs. In this case, we need to use softmax.
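The relationship between the two multiclass label formats can be checked directly. In the sketch below, the probabilities and class numbers are made up; the same predictions are scored once against one-hot labels and once against integer class numbers.

```python
import numpy as np
import tensorflow as tf

probs = tf.constant([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])   # softmax outputs for 2 samples
class_numbers = tf.constant([0, 2])                        # labels as class numbers
one_hot = tf.one_hot(class_numbers, depth=3)               # the same labels, one-hot encoded

cce = tf.keras.losses.categorical_crossentropy(one_hot, probs)
scce = tf.keras.losses.sparse_categorical_crossentropy(class_numbers, probs)
print(cce.numpy(), scce.numpy())
```

Both calls return the negative log-probability assigned to the correct class for each sample, so the choice between them depends only on how the labels are stored.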
Task 3: Use Keras to train MNIST classifier:
- Notice that Keras contains some standard datasets, including MNIST. To use MNIST from Keras, you only need a couple of lines of code (more information here)
- Try several network configurations, with different numbers of layers/neurons and different activation functions.
What is the best accuracy you were able to achieve?
Takeaways
- Tensorflow allows you to operate on tensors at a low level, which gives you the most flexibility.
- There are convenient tools to work with data (`tf.data`) and layers (`tf.keras.layers`)
- For beginners/typical tasks, it is recommended to use Keras, which allows you to construct networks from layers
- If non-standard architecture is needed, you can implement your own Keras layer, and then use it in Keras models
- It is a good idea to look at PyTorch as well and compare approaches.
A good sample notebook from the creator of Keras on Keras and Tensorflow 2.0 can be found here.