TensorFlow 2.0 + Keras Crash Course + W&B

This tutorial is an adaptation of Francois Chollet's brilliant introduction to TensorFlow 2.0. It modifies the code to add the Weights & Biases callbacks to help the reader save and track the progress of their models.

You can find the accompanying W&B dashboard here.

[ ]
[ ]
2.0.0
[ ]
[ ]
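The setup cells above aren't shown; here's a minimal sketch of what they likely contain (the pip install line is an assumption):

```python
# Likely contents of the setup cells (hedged reconstruction):
# !pip install wandb   # install the Weights & Biases client
import tensorflow as tf

# The notebook was run against TensorFlow 2.0, hence the "2.0.0" output:
print(tf.__version__)
```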

TensorFlow 2.0 + Keras Overview for Deep Learning Researchers

@fchollet, October 2019


This document serves as an introduction, crash course, and quick API reference for TensorFlow 2.0.


TensorFlow and Keras were both released over four years ago (March 2015 for Keras and November 2015 for TensorFlow). That's a long time in deep learning years!

In the old days, TensorFlow 1.x + Keras had a number of known issues:

  • Using TensorFlow meant manipulating static computation graphs, which would feel awkward and difficult to programmers used to imperative styles of coding.
  • While the TensorFlow API was very powerful and flexible, it lacked polish and was often confusing or difficult to use.
  • While Keras was very productive and easy to use, it would often lack flexibility for research use cases.

TensorFlow 2.0 is an extensive redesign of TensorFlow and Keras that takes into account over four years of user feedback and technical progress. It fixes the issues above in a big way.

It's a machine learning platform from the future.


TensorFlow 2.0 is built on the following key ideas:

  • Let users run their computation eagerly, like they would in Numpy. This makes TensorFlow 2.0 programming intuitive and Pythonic.
  • Preserve the considerable advantages of compiled graphs (for performance, distribution, and deployment). This makes TensorFlow fast, scalable, and production-ready.
  • Leverage Keras as its high-level deep learning API, making TensorFlow approachable and highly productive.
  • Extend Keras into a spectrum of workflows ranging from the very high-level (easier to use, less flexible) to the very low-level (requires more expertise, but provides great flexibility).

Part 1: TensorFlow basics

Tensors

This is a constant tensor:

[ ]
tf.Tensor(
[[5 2]
 [1 3]], shape=(2, 2), dtype=int32)

You can get its value as a Numpy array by calling .numpy():

[ ]
array([[5, 2],
       [1, 3]], dtype=int32)

Much like a Numpy array, it features the attributes dtype and shape:

[ ]
dtype: <dtype: 'int32'>
shape: (2, 2)
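The hidden cells above presumably look something like this sketch:

```python
import tensorflow as tf

# A constant tensor:
x = tf.constant([[5, 2], [1, 3]])
print(x)

# Its value as a Numpy array:
print(x.numpy())

# Numpy-like attributes:
print('dtype:', x.dtype)   # dtype: <dtype: 'int32'>
print('shape:', x.shape)   # shape: (2, 2)
```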

A common way to create constant tensors is via tf.ones and tf.zeros (just like np.ones and np.zeros):

[ ]
tf.Tensor(
[[1.]
 [1.]], shape=(2, 1), dtype=float32)
tf.Tensor(
[[0.]
 [0.]], shape=(2, 1), dtype=float32)
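The cell that produced the output above can be sketched as:

```python
import tensorflow as tf

print(tf.ones(shape=(2, 1)))   # all-ones float32 tensor of shape (2, 1)
print(tf.zeros(shape=(2, 1)))  # all-zeros float32 tensor of shape (2, 1)
```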

Random constant tensors

This is all pretty normal:

[ ]
<tf.Tensor: id=12, shape=(2, 2), dtype=float32, numpy=
array([[-1.635773  , -0.16110057],
       [-0.95438343, -0.87519467]], dtype=float32)>

And here's an integer tensor with values drawn from a random uniform distribution:

[ ]
<tf.Tensor: id=16, shape=(2, 2), dtype=int32, numpy=
array([[9, 4],
       [2, 6]], dtype=int32)>
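The two random-tensor cells can be sketched as (the exact values will of course differ run to run):

```python
import tensorflow as tf

# Values drawn from a normal distribution:
a = tf.random.normal(shape=(2, 2), mean=0., stddev=1.)

# Integer values drawn from a uniform distribution in [0, 10):
b = tf.random.uniform(shape=(2, 2), minval=0, maxval=10, dtype='int32')
print(a)
print(b)
```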

Variables

Variables are special tensors used to store mutable state (like the weights of a neural network). You create a Variable using some initial value.

[ ]
<tf.Variable 'Variable:0' shape=(2, 2) dtype=float32, numpy=
array([[-0.75942826, -1.4234152 ],
       [ 0.43387443, -1.5654352 ]], dtype=float32)>

You update the value of a Variable by using the methods .assign(value), or .assign_add(increment) or .assign_sub(decrement):

[ ]
[ ]
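The hidden cells above presumably look like this sketch:

```python
import tensorflow as tf

# Create a Variable from an initial value:
a = tf.Variable(tf.random.normal(shape=(2, 2)))

# Overwrite its value in place:
new_value = tf.random.normal(shape=(2, 2))
a.assign(new_value)

# Increment and decrement in place:
added_value = tf.random.normal(shape=(2, 2))
a.assign_add(added_value)
a.assign_sub(added_value)  # back to new_value
```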

Doing math in TensorFlow

You can use TensorFlow exactly like you would use Numpy. The main difference is that your TensorFlow code can run on GPU and TPU.

[ ]
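The math cell is likely along these lines:

```python
import tensorflow as tf

a = tf.random.normal(shape=(2, 2))
b = tf.random.normal(shape=(2, 2))

c = a + b          # elementwise addition, like Numpy
d = tf.square(c)   # elementwise square
e = tf.exp(d)      # elementwise exponential
```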

Computing gradients with GradientTape

Oh, and there's another big difference with Numpy: you can automatically retrieve the gradient of any differentiable expression.

Just open a GradientTape, start "watching" a tensor via tape.watch(), and compose a differentiable expression using this tensor as input:

[ ]
tf.Tensor(
[[-0.6456953  -0.92202187]
 [ 0.35869437 -0.10704172]], shape=(2, 2), dtype=float32)

By default, variables are watched automatically, so you don't need to manually watch them:

[ ]
tf.Tensor(
[[-0.6456953  -0.92202187]
 [ 0.35869437 -0.10704172]], shape=(2, 2), dtype=float32)

Note that you can compute higher-order derivatives by nesting tapes:

[ ]
tf.Tensor(
[[0.3387409  0.06030157]
 [4.625122   2.1051965 ]], shape=(2, 2), dtype=float32)
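A sketch of the nested-tape cell:

```python
import tensorflow as tf

a = tf.Variable(tf.random.normal(shape=(2, 2)))
b = tf.random.normal(shape=(2, 2))

with tf.GradientTape() as outer_tape:
    with tf.GradientTape() as tape:
        c = tf.sqrt(tf.square(a) + tf.square(b))
        dc_da = tape.gradient(c, a)
    # Differentiate the gradient itself to get second derivatives:
    d2c_da2 = outer_tape.gradient(dc_da, a)
print(d2c_da2)
```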

An end-to-end example: linear regression

So far you've learned that TensorFlow is a Numpy-like library that is GPU or TPU accelerated, with automatic differentiation. Time for an end-to-end example: let's implement a linear regression, the FizzBuzz of Machine Learning.

For the sake of demonstration, we won't use any of the higher-level Keras components like Layer or MeanSquaredError. Just basic ops.

[ ]
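The basic-ops implementation above isn't shown; here's a hedged reconstruction (the hyperparameter values are assumptions):

```python
import tensorflow as tf

input_dim = 2
output_dim = 1
learning_rate = 0.01

# Weight matrix and bias vector:
w = tf.Variable(tf.random.uniform(shape=(input_dim, output_dim)))
b = tf.Variable(tf.zeros(shape=(output_dim,)))

def compute_predictions(features):
    return tf.matmul(features, w) + b

def compute_loss(labels, predictions):
    return tf.reduce_mean(tf.square(labels - predictions))

def train_on_batch(x, y):
    with tf.GradientTape() as tape:
        predictions = compute_predictions(x)
        loss = compute_loss(y, predictions)
        # Gradients of the loss with respect to the weights:
        dloss_dw, dloss_db = tape.gradient(loss, [w, b])
    # Plain gradient-descent update:
    w.assign_sub(learning_rate * dloss_dw)
    b.assign_sub(learning_rate * dloss_db)
    return loss
```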

Let's generate some artificial data to demonstrate our model:

[ ]
[Output: scatter plot of the generated data]

Now let's train our linear regression by iterating batch-by-batch over the data and repeatedly calling train_on_batch:

[ ]
Epoch 0: last batch loss = 0.0764
Epoch 1: last batch loss = 0.0803
Epoch 2: last batch loss = 0.0417
Epoch 3: last batch loss = 0.0248
Epoch 4: last batch loss = 0.0331
Epoch 5: last batch loss = 0.0288
Epoch 6: last batch loss = 0.0294
Epoch 7: last batch loss = 0.0245
Epoch 8: last batch loss = 0.0256
Epoch 9: last batch loss = 0.0313

Here's how our model performs:

[ ]
[Output: scatter plot of the model's predictions against the data]

Making it fast with tf.function

But how fast is our current code running?

[ ]
Time per epoch: 0.124 s

Let's compile the training function into a static graph. Literally all we need to do is add the tf.function decorator on it:

[ ]
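The decorated cell can be sketched as follows (made self-contained here by redefining the weights):

```python
import tensorflow as tf

w = tf.Variable(tf.random.uniform(shape=(2, 1)))
b = tf.Variable(tf.zeros(shape=(1,)))
learning_rate = 0.01

@tf.function  # this one-line change compiles the function into a static graph
def train_on_batch(x, y):
    with tf.GradientTape() as tape:
        predictions = tf.matmul(x, w) + b
        loss = tf.reduce_mean(tf.square(y - predictions))
        dloss_dw, dloss_db = tape.gradient(loss, [w, b])
    w.assign_sub(learning_rate * dloss_dw)
    b.assign_sub(learning_rate * dloss_db)
    return loss
```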

Let's try this again:

[ ]
Time per epoch: 0.072 s

40% reduction, neat. In this case we used a trivially simple model; in general the bigger the model the greater the speedup you can get by leveraging static graphs.

Remember: eager execution is great for debugging and printing results line-by-line, but when it's time to scale, static graphs are a researcher's best friends.

Part 2: The Keras API

Keras is a Python API for deep learning. It has something for everyone:

  • If you're an engineer, Keras provides you with reusable blocks such as layers, metrics, training loops, to support common use cases. It provides a high-level user experience that's accessible and productive.

  • If you're a researcher, you may prefer not to use these built-in blocks such as layers and training loops, and instead create your own. Of course, Keras allows you to do this. In this case, Keras provides you with templates for the blocks you write, it provides you with structure, with an API standard for things like Layers and Metrics. This structure makes your code easy to share with others and easy to integrate in production workflows.

  • The same is true for library developers: TensorFlow is a large ecosystem. It has many different libraries. In order for different libraries to be able to talk to each other and share components, they need to follow an API standard. That's what Keras provides.

Crucially, Keras brings high-level UX and low-level flexibility together fluently: you no longer have on one hand, a high-level API that's easy to use but inflexible, and on the other hand a low-level API that's flexible but only approachable by experts. Instead, you have a spectrum of workflows, from the very high-level to the very low-level. Workflows that are all compatible because they're built on top of the same concepts and objects.

Spectrum of Keras workflows

The base Layer class

The first class you need to know is Layer. Pretty much everything in Keras derives from it.

A Layer encapsulates a state (weights) and some computation (defined in the call method).

[ ]

A layer instance works like a function. Let's call it on some data:

[ ]

The Layer class takes care of tracking the weights assigned to it as attributes:

[ ]
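The three cells above can be sketched as one hedged reconstruction:

```python
import tensorflow as tf

class Linear(tf.keras.layers.Layer):
    """y = w.x + b"""

    def __init__(self, units=32, input_dim=32):
        super().__init__()
        w_init = tf.random_normal_initializer()
        self.w = tf.Variable(
            initial_value=w_init(shape=(input_dim, units), dtype='float32'),
            trainable=True)
        b_init = tf.zeros_initializer()
        self.b = tf.Variable(
            initial_value=b_init(shape=(units,), dtype='float32'),
            trainable=True)

    def call(self, inputs):
        return tf.matmul(inputs, self.w) + self.b

# A layer instance works like a function:
linear_layer = Linear(units=4, input_dim=2)
y = linear_layer(tf.ones((2, 2)))

# Weights assigned as attributes are tracked automatically:
assert linear_layer.weights == [linear_layer.w, linear_layer.b]
```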

Note that there's also a shortcut method for creating weights: add_weight. Instead of doing

	w_init = tf.random_normal_initializer()
	self.w = tf.Variable(initial_value=w_init(shape=shape, dtype='float32'))

You would typically do:

	self.w = self.add_weight(shape=shape, initializer='random_normal')

It’s good practice to create weights in a separate build method, called lazily with the shape of the first inputs seen by your layer. Here, this pattern prevents us from having to specify input_dim in the constructor:

[ ]
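The cell above presumably looks like this sketch of the build pattern:

```python
import tensorflow as tf

class Linear(tf.keras.layers.Layer):
    """y = w.x + b, with weights created lazily in build()."""

    def __init__(self, units=32):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        # Called once, with the shape of the first inputs seen by the layer:
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer='random_normal',
                                 trainable=True)
        self.b = self.add_weight(shape=(self.units,),
                                 initializer='zeros',
                                 trainable=True)

    def call(self, inputs):
        return tf.matmul(inputs, self.w) + self.b

# No input_dim needed at construction time:
linear_layer = Linear(units=4)
y = linear_layer(tf.ones((2, 2)))  # build() runs here, seeing input dim 2
```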

Trainable and non-trainable weights

Weights created by layers can be either trainable or non-trainable. They're exposed in trainable_weights and non_trainable_weights. Here's a layer with a non-trainable weight:

[ ]
[2. 2.]
[4. 4.]
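The printed values are consistent with a running-sum layer like the following sketch:

```python
import tensorflow as tf

class ComputeSum(tf.keras.layers.Layer):
    """Keeps a running sum of its inputs in a non-trainable weight."""

    def __init__(self, input_dim):
        super().__init__()
        self.total = tf.Variable(initial_value=tf.zeros((input_dim,)),
                                 trainable=False)

    def call(self, inputs):
        self.total.assign_add(tf.reduce_sum(inputs, axis=0))
        return self.total

my_sum = ComputeSum(2)
x = tf.ones((2, 2))

print(my_sum(x).numpy())  # [2. 2.]
print(my_sum(x).numpy())  # [4. 4.]

assert my_sum.non_trainable_weights == [my_sum.total]
assert my_sum.trainable_weights == []
```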

Recursively composing layers

Layers can be recursively nested to create bigger computation blocks. Each layer will track the weights of its sublayers (both trainable and non-trainable).

[ ]
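A minimal sketch of recursive composition (using built-in Dense sublayers for brevity):

```python
import tensorflow as tf

class MLPBlock(tf.keras.layers.Layer):
    """Two nested Dense sublayers; their weights are tracked by the parent."""

    def __init__(self):
        super().__init__()
        self.dense_1 = tf.keras.layers.Dense(32, activation='relu')
        self.dense_2 = tf.keras.layers.Dense(1)

    def call(self, inputs):
        return self.dense_2(self.dense_1(inputs))

mlp = MLPBlock()
y = mlp(tf.ones(shape=(3, 64)))  # first call builds the sublayers' weights
assert len(mlp.weights) == 4     # two kernels + two biases
```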

Built-in layers

Keras provides you with a wide range of built-in layers, so that you don't have to implement your own layers all the time.

  • Convolution layers
  • Transposed convolutions
  • Separable convolutions
  • Average and max pooling
  • Global average and max pooling
  • LSTM, GRU (with built-in cuDNN acceleration)
  • BatchNormalization
  • Dropout
  • Attention
  • ConvLSTM2D
  • etc.

Keras follows the principle of exposing good default configurations, so that layers will work fine out of the box for most use cases if you leave keyword arguments at their default values. For instance, the LSTM layer uses an orthogonal recurrent matrix initializer by default, and initializes the forget gate bias to one by default.

The training argument in call

Some layers, in particular the BatchNormalization layer and the Dropout layer, have different behaviors during training and inference. For such layers, it is standard practice to expose a training (boolean) argument in the call method.

By exposing this argument in call, you enable the built-in training and evaluation loops (e.g. fit) to correctly use the layer in training and inference.

[ ]
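A sketch of a Dropout layer that honors the training argument:

```python
import tensorflow as tf

class Dropout(tf.keras.layers.Layer):
    """Drops units at training time only."""

    def __init__(self, rate):
        super().__init__()
        self.rate = rate

    def call(self, inputs, training=None):
        if training:
            return tf.nn.dropout(inputs, rate=self.rate)
        return inputs

layer = Dropout(0.5)
x = tf.ones((2, 4))
# At inference time (training=False), inputs pass through unchanged:
assert layer(x, training=False).numpy().tolist() == x.numpy().tolist()
```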

A more Functional way of defining models

To build deep learning models, you don't have to use object-oriented programming all the time. Layers can also be composed functionally, like this (we call it the "Functional API"):

[ ]

The Functional API tends to be more concise than subclassing, and provides a few other advantages (generally the same advantages that functional, typed languages provide over untyped OO development). However, it can only be used to define DAGs of layers -- recursive networks should be defined as Layer subclasses instead.

Key differences between models defined via subclassing and Functional models are explained in this blog post.

Learn more about the Functional API here.

In your research workflows, you may often find yourself mix-and-matching OO models and Functional models.

For models that are simple stacks of layers with a single input and a single output, you can also use the Sequential class which turns a list of layers into a Model:

[ ]
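The two model-building cells can be sketched as (layer sizes are assumptions):

```python
import tensorflow as tf

# Functional API: a DAG of layers, starting from an Input node.
inputs = tf.keras.Input(shape=(16,))
x = tf.keras.layers.Dense(32, activation='relu')(inputs)
outputs = tf.keras.layers.Dense(10)(x)
functional_model = tf.keras.Model(inputs, outputs)

# Sequential: the same stack expressed as a list of layers.
sequential_model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(10),
])
```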

Loss classes

Keras features a wide range of built-in loss classes, like BinaryCrossentropy, CategoricalCrossentropy, KLDivergence, etc. They work like this:

[ ]
Loss: 11.522857

Note that loss classes are stateless: the output of __call__ is only a function of the input.
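The loss cell above presumably looks like this sketch (the input values are assumptions; wildly wrong predictions with probabilities at exactly 0 or 1 produce a very large cross-entropy):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

y_true = [0., 0., 1., 1.]   # targets
y_pred = [1., 1., 1., 0.]   # confidently wrong predictions -> large loss
loss = bce(y_true, y_pred)
print('Loss:', loss.numpy())
```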

Metric classes

Keras also features a wide range of built-in metric classes, such as BinaryAccuracy, AUC, FalsePositives, etc.

Unlike losses, metrics are stateful. You update their state using the update_state method, and you query the scalar metric result using result:

[ ]
Intermediate result:  0.6666667
Final result:  0.71428573

The internal state can be cleared with metric.reset_states.
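Here's a sketch that reproduces the printed numbers with BinaryAccuracy (the exact inputs are assumptions):

```python
import tensorflow as tf

m = tf.keras.metrics.BinaryAccuracy()

# First batch: 2 of 3 predictions land on the right side of the 0.5 threshold.
m.update_state([[0], [1], [1]], [[0.4], [0.8], [0.3]])
print('Intermediate result: ', m.result().numpy())  # 0.6666667

# Second batch: 3 of 4 correct, so the running total becomes 5/7.
m.update_state([[1], [1], [0], [0]], [[0.9], [0.8], [0.6], [0.1]])
print('Final result: ', m.result().numpy())  # 0.71428573
```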

You can easily roll your own metrics by subclassing the Metric class:

  • Create the state variables in __init__
  • Update the variables given y_true and y_pred in update_state
  • Return the metric result in result
  • Clear the state in reset_states

Here's a quick implementation of a BinaryTruePositives metric as a demonstration:

[ ]
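The implementation above isn't shown; a hedged sketch following the four steps:

```python
import tensorflow as tf

class BinaryTruePositives(tf.keras.metrics.Metric):
    def __init__(self, name='binary_true_positives', **kwargs):
        super().__init__(name=name, **kwargs)
        # 1. Create the state variable in __init__:
        self.true_positives = self.add_weight(name='tp', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        # 2. Update the state given y_true and y_pred:
        y_true = tf.cast(y_true, tf.bool)
        y_pred = tf.cast(y_pred, tf.bool)
        # Count samples that are positive in both y_true and y_pred:
        values = tf.cast(tf.logical_and(y_true, y_pred), self.dtype)
        if sample_weight is not None:
            values *= tf.cast(sample_weight, self.dtype)
        self.true_positives.assign_add(tf.reduce_sum(values))

    def result(self):
        # 3. Return the metric result:
        return self.true_positives

m = BinaryTruePositives()
m.update_state([0, 1, 1, 1], [0, 1, 0, 1])
print('Result:', m.result().numpy())  # 2 true positives
```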

Optimizer classes & a quick end-to-end training loop

You don't normally have to define by hand how to update your variables during gradient descent, like we did in our initial linear regression example. You would usually use one of the built-in Keras optimizers, like SGD, RMSprop, or Adam.

Here's a simple MNIST example that brings together loss classes, metric classes, and optimizers.

[ ]
Step: 0
Loss from last step: 2.2916040420532227
Total running accuracy so far: 0.15625
Step: 100
Loss from last step: 0.32435721158981323
Total running accuracy so far: 0.8261138796806335
Step: 200
Loss from last step: 0.2299645096063614
Total running accuracy so far: 0.8729011416435242
Step: 300
Loss from last step: 0.23473316431045532
Total running accuracy so far: 0.8927533030509949
Step: 400
Loss from last step: 0.2581174373626709
Total running accuracy so far: 0.9058213829994202
Step: 500
Loss from last step: 0.13789471983909607
Total running accuracy so far: 0.9129241704940796
Step: 600
Loss from last step: 0.2330610603094101
Total running accuracy so far: 0.9195091724395752
Step: 700
Loss from last step: 0.13869448006153107
Total running accuracy so far: 0.9248840808868408
Step: 800
Loss from last step: 0.16177119314670563
Total running accuracy so far: 0.9285463690757751
Step: 900
Loss from last step: 0.08393995463848114
Total running accuracy so far: 0.9324014782905579
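The loop that produced the log above isn't shown; a hedged sketch (the model architecture, batch size, and pipeline details are assumptions):

```python
import tensorflow as tf

# Data pipeline:
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.shuffle(buffer_size=1024).batch(64)

# A simple classifier:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(10),
])

# Loss, metric, and optimizer:
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
accuracy = tf.keras.metrics.SparseCategoricalAccuracy()
optimizer = tf.keras.optimizers.Adam()

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        logits = model(x)
        loss = loss_fn(y, logits)
    # Backprop and parameter update:
    grads = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    # Running accuracy over all batches seen so far:
    accuracy.update_state(y, logits)
    if step % 100 == 0:
        print('Step:', step)
        print('Loss from last step: %s' % loss.numpy())
        print('Total running accuracy so far: %s' % accuracy.result().numpy())
```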

We can reuse our SparseCategoricalAccuracy metric instance to implement a testing loop:

[ ]
Final test accuracy: 0.963699996471405
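A sketch of the testing loop (self-contained here with an untrained stand-in model; the notebook reuses the trained model and the same metric instance, cleared with reset_states):

```python
import tensorflow as tf

# Stand-in model and metric for a self-contained sketch:
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
accuracy = tf.keras.metrics.SparseCategoricalAccuracy()

(_, _), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_test = x_test.reshape(-1, 784).astype('float32') / 255.
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(128)

for x, y in test_dataset:
    logits = model(x)
    accuracy.update_state(y, logits)
print('Final test accuracy: %s' % accuracy.result().numpy())
```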

The add_loss method

Sometimes you need to compute loss values on the fly during a forward pass (especially regularization losses). Keras allows you to compute loss values at any time, and to recursively keep track of them via the add_loss method.

Here's an example of a layer that adds a sparsity regularization loss based on the L2 norm of the inputs:

[ ]
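A sketch of such a layer:

```python
import tensorflow as tf

class ActivityRegularization(tf.keras.layers.Layer):
    """Adds a loss proportional to the squared L2 norm of its inputs."""

    def __init__(self, rate=1e-2):
        super().__init__()
        self.rate = rate

    def call(self, inputs):
        # Register a loss value for the current forward pass:
        self.add_loss(self.rate * tf.reduce_sum(tf.square(inputs)))
        return inputs

layer = ActivityRegularization()
_ = layer(tf.ones((2, 2)))
print(layer.losses)  # a list with one scalar loss tensor
```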

Loss values added via add_loss can be retrieved in the .losses list property of any Layer or Model:

[ ]
[<tf.Tensor: id=186011, shape=(), dtype=float32, numpy=0.61277115>]

These losses are cleared by the top-level layer at the start of each forward pass -- they don't accumulate. So layer.losses always contains only the losses created during the last forward pass. You would typically use these losses by summing them before computing your gradients when writing a training loop.
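A minimal sketch of one such training step (using the built-in ActivityRegularization layer; the model and data here are assumptions):

```python
import tensorflow as tf

# A model with an activity-regularized layer, whose extra losses
# are summed into the main loss at each step:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.ActivityRegularization(l2=1e-2),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)

x = tf.random.normal((8, 16))
y = tf.zeros((8,), dtype=tf.int32)

with tf.GradientTape() as tape:
    logits = model(x)
    # Main loss plus the losses collected during this forward pass:
    loss = loss_fn(y, logits) + sum(model.losses)
grads = tape.gradient(loss, model.trainable_weights)
optimizer.apply_gradients(zip(grads, model.trainable_weights))
```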

[ ]
0 4.0465497970581055
100 2.3009450435638428
200 2.2300477027893066
300 2.2246510982513428
400 2.1922898292541504
500 2.0787229537963867
600 1.9927294254302979
700 1.8819587230682373
800 1.9145734310150146
900 1.7928987741470337

A detailed end-to-end example: a Variational AutoEncoder (VAE)

If you want to take a break from the basics and look at a slightly more advanced example, check out this Variational AutoEncoder implementation that demonstrates everything you've learned so far:

  • Subclassing Layer
  • Recursive layer composition
  • Loss classes and metric classes
  • add_loss
  • GradientTape

Using built-in training loops

It would be a bit silly if you had to write your own low-level training loops every time for simple use cases. Keras provides you with a built-in training loop on the Model class. If you want to use it, either subclass from Model or create a Functional or Sequential model.

To demonstrate it, let's reuse the MNIST setup from above:

[ ]

First, call compile to configure the optimizer, loss, and metrics to monitor.

[ ]

Then we call fit on our model to pass it the data:

[ ]
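The compile and fit cells can be sketched as follows (the W&B project name is taken from the run URLs below; a subset of MNIST is used here to keep the sketch quick):

```python
import os
import tensorflow as tf

# Rebuild a small MNIST pipeline and model for a self-contained sketch:
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.
dataset = tf.data.Dataset.from_tensor_slices(
    (x_train[:10000], y_train[:10000])).batch(64)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(10),
])

# Configure the optimizer, loss, and metrics to monitor:
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

# Log to W&B when credentials are configured:
callbacks = []
if os.environ.get('WANDB_API_KEY'):
    import wandb
    from wandb.keras import WandbCallback
    wandb.init(project='tf2-keras')
    callbacks.append(WandbCallback())

model.fit(dataset, epochs=3, callbacks=callbacks)
```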
W&B Run: https://app.wandb.ai/lavanyashukla/tf2-keras/runs/3wa6uyc5
[ ]
Epoch 1/3
938/938 [==============================] - 4s 4ms/step - loss: 0.2187 - sparse_categorical_accuracy: 0.9353
Epoch 2/3
938/938 [==============================] - 2s 3ms/step - loss: 0.0875 - sparse_categorical_accuracy: 0.9726
Epoch 3/3
938/938 [==============================] - 2s 3ms/step - loss: 0.0560 - sparse_categorical_accuracy: 0.9828
<tensorflow.python.keras.callbacks.History at 0x7f97d1e8bb00>

Done! Now let's test it:

[ ]
79/79 [==============================] - 0s 2ms/step - loss: 0.0921 - sparse_categorical_accuracy: 0.9732
loss: 0.09207370897380729 acc: 0.9732

Note that you can also monitor your loss and metrics on some validation data during fit.

Also, you can call fit directly on Numpy arrays, so no need for the dataset conversion:

[ ]
W&B Run: https://app.wandb.ai/lavanyashukla/tf2-keras/runs/lp0reroh
[ ]
Train on 50000 samples, validate on 10000 samples
Epoch 1/3
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.656626). Check your callbacks.
   64/50000 [..............................] - ETA: 12:19 - loss: 2.2869 - sparse_categorical_accuracy: 0.0938
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.328359). Check your callbacks.
50000/50000 [==============================] - 3s 59us/sample - loss: 0.2398 - sparse_categorical_accuracy: 0.9302 - val_loss: 0.1124 - val_sparse_categorical_accuracy: 0.9675
Epoch 2/3
50000/50000 [==============================] - 2s 37us/sample - loss: 0.0939 - sparse_categorical_accuracy: 0.9711 - val_loss: 0.1005 - val_sparse_categorical_accuracy: 0.9701
Epoch 3/3
50000/50000 [==============================] - 2s 36us/sample - loss: 0.0602 - sparse_categorical_accuracy: 0.9813 - val_loss: 0.0874 - val_sparse_categorical_accuracy: 0.9745
<tensorflow.python.keras.callbacks.History at 0x7f9780058a58>

Callbacks

One of the neat features of fit (besides built-in support for sample weighting and class weighting) is that you can easily customize what happens during training and evaluation by using callbacks.

A callback is an object that is called at different points during training (e.g. at the end of every batch or at the end of every epoch) and does stuff.

There are a bunch of built-in callbacks available, like ModelCheckpoint, which saves your model after each epoch during training, or EarlyStopping, which interrupts training when your validation metrics stop improving.

And you can easily write your own callbacks.
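A sketch of a custom callback alongside a built-in one (the model and data here are assumptions):

```python
import tensorflow as tf

class LossHistory(tf.keras.callbacks.Callback):
    """Records the loss at the end of every training batch."""

    def on_train_begin(self, logs=None):
        self.losses = []

    def on_train_batch_end(self, batch, logs=None):
        self.losses.append(logs.get('loss'))

# Built-in callbacks are configured the same way:
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                  patience=2)

history_cb = LossHistory()
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
x = tf.random.normal((64, 8))
y = tf.zeros((64,), dtype=tf.int32)
model.fit(x, y, batch_size=16, epochs=1, verbose=0, callbacks=[history_cb])
print(len(history_cb.losses))  # one entry per batch: 4
```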

[ ]
W&B Run: https://app.wandb.ai/lavanyashukla/tf2-keras/runs/3dzbmpic
[ ]
Train on 50000 samples, validate on 10000 samples
Epoch 1/30
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (2.082854). Check your callbacks.
   64/50000 [..............................] - ETA: 31:57 - loss: 2.3758 - sparse_categorical_accuracy: 0.0156
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.041440). Check your callbacks.
50000/50000 [==============================] - 4s 89us/sample - loss: 0.2406 - sparse_categorical_accuracy: 0.9283 - val_loss: 0.1143 - val_sparse_categorical_accuracy: 0.9662
Epoch 2/30
50000/50000 [==============================] - 2s 37us/sample - loss: 0.0921 - sparse_categorical_accuracy: 0.9717 - val_loss: 0.0971 - val_sparse_categorical_accuracy: 0.9711
Epoch 3/30
50000/50000 [==============================] - 2s 37us/sample - loss: 0.0606 - sparse_categorical_accuracy: 0.9809 - val_loss: 0.0948 - val_sparse_categorical_accuracy: 0.9718
Epoch 4/30
50000/50000 [==============================] - 2s 36us/sample - loss: 0.0430 - sparse_categorical_accuracy: 0.9859 - val_loss: 0.0979 - val_sparse_categorical_accuracy: 0.9723
<tensorflow.python.keras.callbacks.History at 0x7f97e1262860>

Parting words

I hope this guide has given you a good overview of what's possible with TensorFlow 2.0 and Keras!

Remember that TensorFlow and Keras don't prescribe a single workflow: they offer a spectrum of workflows, each with its own trade-off between usability and flexibility. For instance, you've noticed that it's much easier to use fit than to write a custom training loop, but fit doesn't give you the same level of granular control for research use cases.

So use the right tool for the job!

A core principle of Keras is "gradual disclosure of complexity": it's easy to get started, and you can gradually dive into workflows where you write more and more logic from scratch, providing you with complete control.

This applies to both model definition, and model training.

Model definition: spectrum of workflows

Model training: spectrum of workflows