Assorted things I learned about training machine learning models in the cloud

I’m using Google Cloud Platform (GCP) to train a machine learning model. Specifically, I’m using TensorFlow, which I installed myself on a Compute Engine instance. GCP offers a product that lets you train with TensorFlow without having to set it up yourself, but I elected to install it from scratch on a bare machine running just a fresh Ubuntu operating system. (Side note: this website is also running on a Compute Engine instance.)

This post contains some interesting things I learned, and a link to something I wrote to be able to cleanly stop an active training process while saving progress.

1. You can train on a GPU, and use preemptible instances, with free trial credits.

I found training on a GPU to be much faster than training on CPUs. Using a preemptible instance means that you pay half price in exchange for Google reserving the right to randomly shut off your computer if they feel like it. That’s a reasonable trade if you don’t need the result right away; Google gives the instance a short warning (about 30 seconds) to shut down running processes and save data.

Also, Google offers free trial credits! The amount they offer has changed over time but I got started with $300 for free. The catch: they don’t allow training on GPUs or using preemptible instances while you’re in your free trial.

The double catch: You can upgrade your free trial to a paid account, keep the free trial credits in your paid account, and train with GPUs and preemptible instances with those free credits.

2. Installing software necessary to train on a GPU is very annoying.

Nvidia does not, so far as I know, have a one-step process to install all of the drivers and software required to train on its GPUs using TensorFlow. I had to install the CUDA Toolkit, cuDNN, and a driver for the GPU. Each of those had to be exactly the right version for the version of TensorFlow I was using, or else it would not work. Usually if something says it requires “Version 4.0 of thing X” then installing Version 4.1 of thing X is okay. Not so in this case. If TensorFlow expects CUDA Toolkit version 9.0, then you cannot use version 9.1.
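
To make the exact-match rule concrete, here is a minimal sketch of a check that could be run before training. It assumes the toolkit version is read from the text printed by nvcc --version; the sample string and the function names parse_cuda_release and cuda_matches are illustrative, not part of any Nvidia or TensorFlow tooling, and the required version "9.0" is simply the example from the paragraph above.

```python
import re


def parse_cuda_release(nvcc_output):
    """Extract (major, minor) from text such as 'release 9.1, V9.1.85'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no CUDA release found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def cuda_matches(required, nvcc_output):
    """True only when major AND minor both equal the required version."""
    wanted = tuple(int(part) for part in required.split("."))
    return parse_cuda_release(nvcc_output) == wanted


# Illustrative sample of what `nvcc --version` prints (assumed, not captured).
SAMPLE_NVCC_OUTPUT = "Cuda compilation tools, release 9.1, V9.1.85"

print(cuda_matches("9.0", SAMPLE_NVCC_OUTPUT))  # False: 9.1 is not 9.0
print(cuda_matches("9.1", SAMPLE_NVCC_OUTPUT))  # True
```

The comparison is deliberately strict equality rather than "at least this version", because a newer minor release is exactly the case that fails here.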

Nvidia “helpfully” provides multiple different ways of installing each of those things. That’s not actually that helpful, because it means that there is not a single way to do it, so searching elsewhere for instructions provides multiple answers. Some of those answers change over time and become invalid. No one provides easy methods of uninstalling anything, so if you mess up and install the wrong version, you have to spend more time searching for how to remove the stuff.

For the record, my winning solution involved installing CUDA Toolkit via “the runfile method” (sudo sh cuda-toolkit.run — apparently you have to make sure to select “no” when prompted to install the driver), installing cuDNN via copying files manually to the /usr directory, and installing the driver via sudo ubuntu-drivers autoinstall (ubuntu-drivers had to be installed separately).

3. Cleanly interrupting training using TensorFlow’s Estimator API

I’m training a model using TensorFlow’s Estimator API. TensorFlow is a Python library, and by default, if you send an interrupt signal (SIGINT) to a running Python process (for example by pressing Ctrl-C at the terminal), the Python runtime raises a KeyboardInterrupt exception, which abruptly halts everything unless the program catches it.
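
The default behavior can be seen without pressing any keys. The sketch below calls signal.default_int_handler, the standard-library function Python normally installs for SIGINT, from inside a stand-in training loop. The loop, the train function, and the completed_steps list are hypothetical placeholders, not real training code.

```python
import signal

completed_steps = []


def train(total_steps, interrupt_before_step):
    """Fake training loop that gets 'interrupted' partway through."""
    for step in range(total_steps):
        if step == interrupt_before_step:
            # Calling the default SIGINT handler directly raises
            # KeyboardInterrupt, just as a real Ctrl-C would.
            signal.default_int_handler(signal.SIGINT, None)
        completed_steps.append(step)


try:
    train(total_steps=10, interrupt_before_step=3)
    outcome = "finished"
except KeyboardInterrupt:
    outcome = "interrupted"

print(outcome)          # interrupted
print(completed_steps)  # [0, 1, 2]
```

KeyboardInterrupt derives from BaseException rather than Exception, so an ordinary `except Exception` block does not catch it. That is part of why an interrupt tends to tear down the whole program.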

This is usually what one would want: if you interrupt something then you want it to stop right away. But sometimes you’d like it to clean up and put away its things first. Since I’m using preemptible instances to make training cheaper, I’m running the risk of getting shut down at any moment, which means that I’d like to save my current progress instead of immediately tearing everything down if I get interrupted.

The Estimator API allows one to specify custom “hooks”: a hook is a piece of code that runs at pre-defined times, such as before training starts, before or after each training step, or after training ends. So I wrote a custom hook that catches an interrupt signal and politely requests that the Estimator stop training after the current step and save its current progress. You can view the source code.
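
For readers who want the general shape of the idea, here is a minimal sketch of such a hook. It is not the source code linked above. The class name InterruptStopHook is made up, and it is written as a plain class so it runs without TensorFlow installed. The assumption is that in a real TF 1.x program it would subclass tf.train.SessionRunHook, whose begin, after_run, and end methods it mirrors, and that the run context passed to after_run provides request_stop(). Writing the final checkpoint is left to the Estimator and is not shown.

```python
import signal


class InterruptStopHook:
    """Hypothetical hook: ask training to stop once a signal has arrived.

    Assumption: real code would subclass tf.train.SessionRunHook (TF 1.x).
    """

    def __init__(self, signum=signal.SIGINT):
        self._signum = signum
        self._previous_handler = None
        self.interrupted = False

    def _on_signal(self, signum, frame):
        # Only record the request; the step in progress is allowed to finish.
        self.interrupted = True

    def begin(self):
        # Runs once before training starts: replace the default handler
        # so SIGINT no longer raises KeyboardInterrupt mid-step.
        self._previous_handler = signal.signal(self._signum, self._on_signal)

    def after_run(self, run_context, run_values):
        # Runs after every training step.
        if self.interrupted:
            run_context.request_stop()

    def end(self, session):
        # Runs when training ends: restore the handler that was there before.
        if self._previous_handler is not None:
            signal.signal(self._signum, self._previous_handler)
```

With TensorFlow installed, an instance of a hook like this would be passed through the hooks argument of the Estimator’s train method. Note that Python only allows signal.signal to be called from the main thread, so begin() has to run there.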

Then I add a shutdown script to my preemptible instance:

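#!/bin/bash
# Send SIGINT to every process whose command line contains "train".
# -r (GNU xargs) skips running kill when pgrep finds nothing.
pgrep -f "train" | xargs -r kill -SIGINT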

This just searches for all running processes whose command line contains the term “train” (for instance, it will match a command executed with python train.py; my training process should be the only one that matches the search) and then sends an interrupt signal to whatever it finds. When my preemptible instance is being shut down, GCP is kind enough to first run the shutdown script, which gives my training process time to cleanly save its progress.

Addendum: It turns out that GCP is not so reliable about running the shutdown script. See my post about working around unreliable shutdown scripts for details.