Update, Dec 2020: I wrote this several years ago, before Google released Colab. If I were training ML models in the cloud now as an individual, rather than setting up scalable infrastructure for training, I’d just use Colab instead of all the stuff I describe below.
Reliably executing shutdown scripts in Google Compute Engine
In a previous post about training machine learning models in the cloud, I wrote about some interesting things I learned in the process of installing and using TensorFlow on Google Compute Engine instances. One of the things I wrote about was using shutdown scripts to save current progress when an instance gets preempted. It turns out I was a bit optimistic about their reliability; in this post I’ll describe the difficulties I had getting a shutdown script to run, and how I worked around them.
What is a shutdown script?
When you create an instance with Google Compute Engine, you can specify a shutdown script. That means you specify a script that gets automatically run, ostensibly, whenever your instance is rebooted, turned off, stopped, or deleted. If you only ever take those actions manually, then a shutdown script might not be too important — if you’re taking one action to stop your instance, you might as well take two actions and manually run a script right before you stop your instance.
But if your instance might ever be stopped by someone else for some other reason (say you’re trying to save money by using a preemptible instance, where you get half off the price in exchange for the possibility that the instance will be shut off at some random time), then it can be very important to have a script that runs automatically just before shutoff and saves any computations in progress.
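For illustration, attaching a shutdown script with the gcloud CLI might look like the sketch below. The function names, instance name, zone, and script path are placeholders made up for this example; the functions are only defined here, not called.

```bash
#!/bin/bash
# Sketch: attach a shutdown script to a Compute Engine instance with gcloud.
# Function names and all arguments are hypothetical placeholders.

# Create a new preemptible instance with a shutdown script attached.
create_preemptible_with_shutdown_script() {
  local instance="$1" zone="$2" script="$3"
  gcloud compute instances create "$instance" \
    --zone "$zone" \
    --preemptible \
    --metadata-from-file "shutdown-script=$script"
}

# Attach (or replace) the shutdown script on an existing instance.
set_shutdown_script() {
  local instance="$1" zone="$2" script="$3"
  gcloud compute instances add-metadata "$instance" \
    --zone "$zone" \
    --metadata-from-file "shutdown-script=$script"
}
```

A call would then look something like `set_shutdown_script my-trainer us-central1-a ./shutdown.sh`, with your own instance name and zone substituted.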
Can I trust a shutdown script to be run?
Maybe. In Google’s docs that I linked above, they write:
Create and run shutdown scripts that execute commands right before an instance is terminated or restarted, on a best-effort basis. Compute Engine only executes shutdown scripts on a best-effort basis and does not guarantee that the shutdown script will be run in all cases.
Emphasis theirs. Later, on the same page:
Before an instance shuts down or restarts, the shutdown script has a limited time period to run. During this period, Compute Engine attempts to run your shutdown script. If the script takes longer than this time period to complete, the instance automatically terminates and all running tasks are killed.
How long is this “limited time period”? Let’s look at their documentation on the shutdown period:
When you shut down or delete an instance, Compute Engine sends the ACPI Power Off signal to the instance and waits a short period of time for your instance to shut down cleanly. If your instance is still running after this grace period, Compute Engine forcefully terminates it even if your shutdown script is still running. The length of the shutdown period depends on the type of your instance.
- Normal instances have a shutdown period that usually lasts at least 90 seconds, but could be longer.
- Preemptible instances have a shutdown period that lasts 30 seconds, which is the same length as the shutdown period that happens during the preemption process.
So, 30 seconds for a preemptible instance? Not so fast:
Note: Compute Engine does not guarantee the length of these shutdown periods and we recommend that you do not create any hard dependencies on these time limits.
So, not much in the way of guarantees.
What actually happens in practice?
What I want my shutdown script to do is find my training process and send it an interrupt signal. I’d also like to pause until the training process cleans up, so that the computer doesn’t just shut down immediately after the interrupt signal. Here’s the content of my script:

    #!/bin/bash
    pgrep -f "train" | xargs kill -SIGINT
    while pgrep -f "train" > /dev/null; do sleep 1; done
My training process is designed so that when it receives an interrupt signal, it finishes the last training step and then saves a training checkpoint. This takes about 6 seconds.
    $ time ./shutdown.sh

    real    0m6.194s
    user    0m0.029s
    sys     0m0.016s
Well within the documented 30 seconds for preemptible instances! It should work fine… yet it doesn’t. (For testing, I added a line to write a “did this work” message to a file, to check if the script was being run at all. It was not being run.)
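The interrupt behavior the shutdown script relies on (finish the current step, then save a checkpoint) can be sketched in bash as follows. This is a hypothetical stand-in, not the actual training code: the step function, the step count, and the checkpoint path are all invented for illustration.

```bash
#!/bin/bash
# Hypothetical stand-in for a training loop that checkpoints on SIGINT.
# The step body, step counts, and checkpoint path are illustrative only.

CHECKPOINT_FILE="${TMPDIR:-/tmp}/train_checkpoint.txt"
stop_requested=0

# On SIGINT, only set a flag. When the signal is sent to this process alone
# (as `kill -SIGINT <pid>` does), bash defers the trap until the current
# foreground command returns, so the step in progress is not cut short.
trap 'stop_requested=1' INT

train_step() {
  sleep 0.1   # placeholder for one real training step
}

save_checkpoint() {
  echo "step=$1" > "$CHECKPOINT_FILE"
}

train() {
  local max_steps="$1" step=0
  while [ "$step" -lt "$max_steps" ] && [ "$stop_requested" -eq 0 ]; do
    train_step
    step=$((step + 1))
  done
  # Reached both on normal completion and after an interrupt.
  save_checkpoint "$step"
}

train 5
```

The key design choice is that the signal handler does no work itself; the loop checks the flag between steps and the checkpoint is written exactly once on the way out.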
When I execute the script via google_metadata_script_runner --script-type=shutdown, it runs. When I reboot via sudo reboot, it runs. When I stop the instance via the Compute Engine console… nope. When the instance gets preempted… nope.
I tried creating a new fresh instance, not preemptible, with just a sample shutdown script… and it did run when I stopped the instance via the console, so there’s something weird going on with the particular preemptible instance I’m running.
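For debugging this kind of thing, a slightly more defensive variant of the shutdown script can help: it writes a marker to a log file so you can tell afterwards whether it ran at all, and it caps the wait so the loop cannot outlast the shutdown period. The sketch below assumes GNU xargs; the function name, log path, and the cap are illustrative choices, not values taken from Google’s documentation.

```bash
#!/bin/bash
# Sketch of a more defensive shutdown script. The function name, log path,
# and wait cap are assumptions made for illustration.

# Usage: stop_and_wait PATTERN MAX_WAIT_SECONDS LOG_FILE
# Returns 0 once no matching process remains, 1 if the wait timed out.
stop_and_wait() {
  local pattern="$1" max_wait="$2" log_file="$3" waited=0

  # Marker so you can check later whether the script ran at all.
  echo "$(date) shutdown script started" >> "$log_file"

  # -r (GNU xargs): skip kill entirely when pgrep finds nothing.
  pgrep -f "$pattern" | xargs -r kill -SIGINT

  # Wait for the process to exit, but never longer than max_wait seconds.
  while pgrep -f "$pattern" > /dev/null; do
    if [ "$waited" -ge "$max_wait" ]; then
      echo "$(date) gave up after ${waited}s" >> "$log_file"
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done

  echo "$(date) finished after ${waited}s" >> "$log_file"
}
```

In a real shutdown script the last line might be something like `stop_and_wait "train" 25 /var/tmp/shutdown.log`. Note that `pgrep -f` matches against full command lines, so the pattern should not also appear in the command line of the shell running the script.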
What did I do about it?
I definitely wanted a working shutdown script. How does Google implement theirs, anyway? Maybe if I figure that out, I can adapt their method to my own use.
Here’s a hint, from the same excerpt I posted above:
When you shut down or delete an instance, Compute Engine sends the ACPI Power Off signal to the instance and waits a short period of time for your instance to shut down cleanly.
What’s an ACPI Power Off signal? Searching around for keywords like “acpi power off ubuntu” (Ubuntu is the operating system I’m running on this instance) led me to several useful pages. It turns out there are two systems in place (at least in Ubuntu 16.04) that listen for the power button: acpid (configured under /etc/acpi) and systemd-logind.
Taking a look at the file /etc/acpi/events/powerbtn suggests that the script /etc/acpi/powerbtn.sh is called when the power button is pushed. That script, in turn, starts with:

    #!/bin/bash
    # /etc/acpi/powerbtn.sh
    # Initiates a shutdown when the power putton has been
    # pressed.

    [ -r /usr/share/acpi-support/power-funcs ] && . /usr/share/acpi-support/power-funcs

    # If logind is running, it already handles power button presses; desktop
    # environments put inhibitors to logind if they want to handle the key
    # themselves.
    if pidof systemd-logind >/dev/null; then
        exit 0
    fi

    ... # more stuff
The line that begins if pidof systemd-logind >/dev/null; suggests that in fact systemd-logind handles shutdown behavior, but that this powerbtn.sh script is still called as some sort of legacy system that hands off responsibility to the newer system.
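To see which branch applies on a given machine, the same check can be run by hand. The sketch below simply mirrors the test at the top of powerbtn.sh; the function name is made up, and it assumes pidof is installed (if it is not, the check falls through to the second branch).

```bash
#!/bin/bash
# Sketch mirroring the check at the top of powerbtn.sh: report which
# component would act on a power button press. Function name is made up.
power_button_handler() {
  if pidof systemd-logind > /dev/null 2>&1; then
    # powerbtn.sh would exit early and leave the event to logind.
    echo "systemd-logind"
  else
    # powerbtn.sh would carry on with its own shutdown logic.
    echo "powerbtn.sh"
  fi
}

power_button_handler
```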
Presumably Google uses the logind system to add their own custom shutdown behavior. I didn’t want to go investigating how the logind system works, and luckily it turned out that I didn’t have to, because altering the powerbtn.sh script is sufficient! I added the following line just above that if pidof systemd-logind check:

    sudo su - h_nuchi bash -c '/home/h_nuchi/shutdown.sh'
And that did the trick! Now the shutdown script is called, and finishes cleanly, when the instance is manually stopped. I haven’t gotten the chance to see whether it’s called when the instance is preempted, because that happens unpredictably, but I’m cautiously optimistic because the documentation for preemptible instances states that you can simulate preemption by stopping the instance.
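If you would rather script the edit to powerbtn.sh than make it by hand, something like the following sketch could work. It inserts a hook line just above the first if pidof systemd-logind line and does nothing if the hook is already there. The function name is invented for this example, and it is worth backing up the file first.

```bash
#!/bin/bash
# Sketch: insert a hook line just above the `if pidof systemd-logind` check,
# idempotently. The function name is illustrative.

# Usage: add_powerbtn_hook FILE HOOK_LINE
add_powerbtn_hook() {
  local file="$1" hook="$2" tmp

  # Already present as a whole line? Then leave the file alone.
  if grep -qxF -e "$hook" "$file"; then
    return 0
  fi

  # Refuse to edit a file that lacks the expected check.
  grep -q '^if pidof systemd-logind' "$file" || return 1

  tmp="$(mktemp)" || return 1
  # Print the hook once, right before the first matching `if` line.
  # Passing it through the environment avoids awk escape processing.
  HOOK="$hook" awk '
    /^if pidof systemd-logind/ && !inserted { print ENVIRON["HOOK"]; inserted = 1 }
    { print }
  ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }

  # Overwrite in place so the original owner and mode are kept.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

On the instance this would be run as root against /etc/acpi/powerbtn.sh, passing the sudo su line shown above as the second argument.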
If I were managing a large number of instances of different types, and needed custom shutdown behavior for each one, in complicated ways depending on lots of different factors… then the solution I discovered is probably too hacky. I’d want to use the hooks that Google provides, and properly use the shutdown metadata for the instances to handle the clean shutdown. But given that I’m only running this one instance (for now), and given that their system was unreliable, I’ll stick to the hacky solution.