In [0]:
%install '.package(url: "https://github.com/tensorflow/swift-models", .branch("tensorflow-0.10"))' Datasets ImageClassificationModels
print("\u{001B}[2J")
In [0]:
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
By default, Swift for TensorFlow performs tensor operations using eager dispatch. This allows for rapid iteration, but isn't the most performant option for training machine learning models.
The X10 tensor library adds a high-performance backend to Swift for TensorFlow, leveraging tensor tracing and the XLA compiler. This tutorial will introduce X10 and guide you through the process of updating a training loop to run on GPUs or TPUs.
Accelerated calculations in Swift for TensorFlow are performed through the Tensor type. Tensors can participate in a wide variety of operations, and are the fundamental building blocks of machine learning models.
By default, a Tensor uses eager execution to perform calculations on an operation-by-operation basis. Each Tensor has an associated Device that describes what hardware it is attached to and what backend is used for it.
In [0]:
import TensorFlow
import Foundation
In [0]:
let eagerTensor1 = Tensor([0.0, 1.0, 2.0])
let eagerTensor2 = Tensor([1.5, 2.5, 3.5])
let eagerTensorSum = eagerTensor1 + eagerTensor2
eagerTensorSum
In [0]:
eagerTensor1.device
If you are running this notebook on a GPU-enabled instance, you should see that hardware reflected in the device description above. The eager runtime does not have support for TPUs, so if you are using one of them as an accelerator you will see the CPU being used as a hardware target.
When creating a Tensor, the default eager-mode device can be overridden by specifying an alternative. This is how you opt in to performing calculations using the X10 backend.
In [0]:
let x10Tensor1 = Tensor([0.0, 1.0, 2.0], on: Device.defaultXLA)
let x10Tensor2 = Tensor([1.5, 2.5, 3.5], on: Device.defaultXLA)
let x10TensorSum = x10Tensor1 + x10Tensor2
x10TensorSum
In [0]:
x10Tensor1.device
If you're running this in a GPU-enabled instance, you should see that accelerator listed in the X10 tensor's device. Unlike for eager execution, if you are running this in a TPU-enabled instance, you should now see that calculations are using that device. X10 is how you take advantage of TPUs within Swift for TensorFlow.
The default eager and X10 devices will attempt to use the first accelerator on the system. If you have GPUs attached, they will use the first available GPU. If TPUs are present, X10 will use the first TPU core by default. If no accelerator is found or supported, the default device will fall back to the CPU.
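To check which devices were actually selected on your instance, you can print the eager and X10 defaults (a minimal sketch; the output varies with the attached hardware):
In [0]:
// Inspect the devices picked by default. On a CPU-only instance,
// both will report the CPU as the hardware target.
print(Device.default)
print(Device.defaultXLA)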
Beyond the default eager and XLA devices, you can provide specific hardware and backend targets in a Device:
In [0]:
// Hypothetical example: requires a TPU-enabled instance with at least two
// cores, since ordinal 1 addresses the second TPU core.
// let tpu1 = Device(kind: .TPU, ordinal: 1, backend: .XLA)
// let tpuTensor1 = Tensor([0.0, 1.0, 2.0], on: tpu1)
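As a further illustration, the same initializer can target a specific GPU through the XLA backend; this is likewise commented out because it only runs where that hardware is present:
In [0]:
// Hypothetical example: valid only on a GPU-enabled instance.
// let gpu0 = Device(kind: .GPU, ordinal: 0, backend: .XLA)
// let gpuTensor1 = Tensor([0.0, 1.0, 2.0], on: gpu0)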
Let's take a look at how you'd set up and train a model using the default eager execution mode. In this example, we'll be using the simple LeNet-5 model from the swift-models repository and the MNIST handwritten digit classification dataset.
First, we'll set up and download the MNIST dataset.
In [0]:
import Datasets
let epochCount = 5
let batchSize = 128
let dataset = MNIST(batchSize: batchSize)
Next, we will configure the model and optimizer.
In [0]:
import ImageClassificationModels
var eagerModel = LeNet()
var eagerOptimizer = SGD(for: eagerModel, learningRate: 0.1)
Now, we will implement basic progress tracking and reporting. All intermediate statistics are kept as tensors on the same device where training is run, and scalarized() is called only during reporting. This will be especially important later when using X10, because it avoids unnecessary materialization of lazy tensors.
In [0]:
struct Statistics {
    var correctGuessCount = Tensor<Int32>(0, on: Device.default)
    var totalGuessCount = Tensor<Int32>(0, on: Device.default)
    var totalLoss = Tensor<Float>(0, on: Device.default)
    var batches: Int = 0

    var accuracy: Float {
        Float(correctGuessCount.scalarized()) / Float(totalGuessCount.scalarized()) * 100
    }
    var averageLoss: Float { totalLoss.scalarized() / Float(batches) }

    init(on device: Device = Device.default) {
        correctGuessCount = Tensor<Int32>(0, on: device)
        totalGuessCount = Tensor<Int32>(0, on: device)
        totalLoss = Tensor<Float>(0, on: device)
    }

    mutating func update(logits: Tensor<Float>, labels: Tensor<Int32>, loss: Tensor<Float>) {
        let correct = logits.argmax(squeezingAxis: 1) .== labels
        correctGuessCount += Tensor<Int32>(correct).sum()
        totalGuessCount += Int32(labels.shape[0])
        totalLoss += loss
        batches += 1
    }
}
Finally, we'll run the model through a training loop for five epochs.
In [0]:
print("Beginning training...")
for (epoch, batches) in dataset.training.prefix(epochCount).enumerated() {
let start = Date()
var trainStats = Statistics()
var testStats = Statistics()
Context.local.learningPhase = .training
for batch in batches {
let (images, labels) = (batch.data, batch.label)
let 𝛁model = TensorFlow.gradient(at: eagerModel) { eagerModel -> Tensor<Float> in
let ŷ = eagerModel(images)
let loss = softmaxCrossEntropy(logits: ŷ, labels: labels)
trainStats.update(logits: ŷ, labels: labels, loss: loss)
return loss
}
eagerOptimizer.update(&eagerModel, along: 𝛁model)
}
Context.local.learningPhase = .inference
for batch in dataset.validation {
let (images, labels) = (batch.data, batch.label)
let ŷ = eagerModel(images)
let loss = softmaxCrossEntropy(logits: ŷ, labels: labels)
testStats.update(logits: ŷ, labels: labels, loss: loss)
}
print(
"""
[Epoch \(epoch)] \
Training Loss: \(String(format: "%.3f", trainStats.averageLoss)), \
Training Accuracy: \(trainStats.correctGuessCount)/\(trainStats.totalGuessCount) \
(\(String(format: "%.1f", trainStats.accuracy))%), \
Test Loss: \(String(format: "%.3f", testStats.averageLoss)), \
Test Accuracy: \(testStats.correctGuessCount)/\(testStats.totalGuessCount) \
(\(String(format: "%.1f", testStats.accuracy))%) \
seconds per epoch: \(String(format: "%.1f", Date().timeIntervalSince(start)))
""")
}
As you can see, the model trained as we would expect, and its accuracy against the validation set increased each epoch. This is how Swift for TensorFlow models are defined and run using eager execution. Now let's see what modifications are needed to take advantage of X10. First, we obtain the X10 device that calculations will run on:
In [0]:
let device = Device.defaultXLA
device
The dataset, model, and optimizer all need their tensors placed on the X10 device. For the dataset, we'll do that at the point where batches are processed in the training loop, so we can reuse the dataset from the eager execution model.
In the case of the model and optimizer, we'll initialize them with their internal tensors on the eager execution device, then move them over to the X10 device.
In [0]:
var x10Model = LeNet()
x10Model.move(to: device)
var x10Optimizer = SGD(for: x10Model, learningRate: 0.1)
x10Optimizer = SGD(copying: x10Optimizer, to: device)
The modifications needed for the training loop come at a few specific points. First, we'll need to move the batches of training data over to the X10 device. This is done via Tensor(copying:to:) when each batch is retrieved.
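As a quick illustration of that copy, using one of the eager tensors created earlier and the X10 device obtained above:
In [0]:
// Copy an eager tensor over to the X10 device; the result participates
// in lazy tracing on that device.
let copiedTensor = Tensor(copying: eagerTensor1, to: device)
copiedTensor.device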
The next change is to indicate where to cut off the traces during the training loop. X10 works by tracing through the tensor calculations needed in your code and just-in-time compiling an optimized representation of that trace. In the case of a training loop, you're repeating the same operations over and over again, making it an ideal section to trace, compile, and reuse.
In the absence of code that explicitly requests a value from a Tensor (these usually stand out as .scalars or .scalarized() calls), X10 will attempt to compile all loop iterations together. To prevent this, and to cut the trace at a specific point, we place an explicit LazyTensorBarrier() after the optimizer updates the model weights, and after the loss and accuracy are obtained during validation. This creates two reused traces: each step in the training loop and each batch of inference during validation.
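Before applying this to the full training loop, here is a minimal sketch of the barrier pattern, using the X10 device obtained above: the barrier at the end of each iteration cuts the trace, so X10 compiles the loop body once and reuses it.
In [0]:
// Minimal sketch of trace cutting: LazyTensorBarrier() marks the end of one
// iteration's trace, so each pass through the body reuses one compiled trace.
var accumulator = Tensor<Float>(0, on: device)
for step in 1...3 {
    accumulator += Tensor<Float>(Float(step), on: device)
    LazyTensorBarrier()
}
// scalarized() forces materialization of the lazy result for reporting.
print(accumulator.scalarized())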
These modifications result in the following training loop.
In [0]:
print("Beginning training...")
for (epoch, batches) in dataset.training.prefix(epochCount).enumerated() {
let start = Date()
var trainStats = Statistics(on: device)
var testStats = Statistics(on: device)
Context.local.learningPhase = .training
for batch in batches {
let (eagerImages, eagerLabels) = (batch.data, batch.label)
let images = Tensor(copying: eagerImages, to: device)
let labels = Tensor(copying: eagerLabels, to: device)
let 𝛁model = TensorFlow.gradient(at: x10Model) { x10Model -> Tensor<Float> in
let ŷ = x10Model(images)
let loss = softmaxCrossEntropy(logits: ŷ, labels: labels)
trainStats.update(logits: ŷ, labels: labels, loss: loss)
return loss
}
x10Optimizer.update(&x10Model, along: 𝛁model)
LazyTensorBarrier()
}
Context.local.learningPhase = .inference
for batch in dataset.validation {
let (eagerImages, eagerLabels) = (batch.data, batch.label)
let images = Tensor(copying: eagerImages, to: device)
let labels = Tensor(copying: eagerLabels, to: device)
let ŷ = x10Model(images)
let loss = softmaxCrossEntropy(logits: ŷ, labels: labels)
LazyTensorBarrier()
testStats.update(logits: ŷ, labels: labels, loss: loss)
}
print(
"""
[Epoch \(epoch)] \
Training Loss: \(String(format: "%.3f", trainStats.averageLoss)), \
Training Accuracy: \(trainStats.correctGuessCount)/\(trainStats.totalGuessCount) \
(\(String(format: "%.1f", trainStats.accuracy))%), \
Test Loss: \(String(format: "%.3f", testStats.averageLoss)), \
Test Accuracy: \(testStats.correctGuessCount)/\(testStats.totalGuessCount) \
(\(String(format: "%.1f", testStats.accuracy))%) \
seconds per epoch: \(String(format: "%.1f", Date().timeIntervalSince(start)))
""")
}
Training the model using the X10 backend should have proceeded in the same manner as the eager execution model did before. You may have noticed a delay before the first batch and again at the end of the first epoch, due to the just-in-time compilation of the unique traces at those points. If you're running this with an accelerator attached, you should have seen training proceed faster after that point than it did in eager mode.
There is a tradeoff between initial trace compilation time and faster throughput, but in most machine learning models the increased throughput from repeated operations should more than offset the compilation overhead. In practice, we've seen over a 4x improvement in throughput with X10 in some training cases.
As stated before, X10 makes it not only possible but easy to work with TPUs, unlocking that whole class of accelerators for your Swift for TensorFlow models.