It was just another day at work, and I was working on just another task in my everyday routine. I needed to log in to a VM (let's call it $INSTANCE throughout this post) and update a few configs. I logged into the Google Cloud console, selected the project from the project selector, navigated to Compute Engine and clicked on SSH. Normally this opens a pop-up window and drops you into the familiar bash shell. Not today. Instead, it kept on loading.
I was confused; this had never happened before. I double-checked my internet connection, tried a different browser, and switched to an alternate internet connection. Every attempt ended with the same result: the pop-up window that never finished loading.
Attempt #1 : gcloud command
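The exact command is not preserved in this post. A minimal sketch of what a plain `gcloud` SSH attempt typically looks like, wrapped in a hypothetical helper `ssh_plain` whose arguments (instance name and zone) are placeholders:

```shell
# Hypothetical helper; $1 = instance name, $2 = zone (both placeholders).
ssh_plain() {
  gcloud compute ssh "$1" --zone "$2"
}
```

It would be called as `ssh_plain "$INSTANCE" us-central1-a`, where the zone is only an example value.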
Attempt #2 : gcloud command with username
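Presumably this attempt put an explicit username in front of the instance name, which `gcloud compute ssh` accepts as `USER@INSTANCE`. A sketch under that assumption (the helper name `ssh_as_user` and all argument values are hypothetical):

```shell
# Hypothetical helper; $1 = username, $2 = instance name, $3 = zone.
ssh_as_user() {
  gcloud compute ssh "$1@$2" --zone "$3"
}
```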
Attempt #3 : gcloud command with verbose flag
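Which verbose flag was used is not recorded here. Two options that exist are gcloud's own `--verbosity=debug` and passing `-vvv` through to the underlying `ssh` client with `--ssh-flag`; the sketch below assumes both, inside a hypothetical helper `ssh_verbose`:

```shell
# Hypothetical helper; $1 = instance name, $2 = zone.
ssh_verbose() {
  # --verbosity=debug makes gcloud log what it is doing;
  # --ssh-flag="-vvv" turns on the ssh client's own debug output.
  gcloud compute ssh "$1" --zone "$2" --verbosity=debug --ssh-flag="-vvv"
}
```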
Attempt #4 : gcloud command with compute engine and my newly generated ssh keypair
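A rough sketch of pointing `gcloud` at a specific key pair, assuming the new key was generated beforehand (for example with `ssh-keygen -t rsa -f ~/.ssh/my-new-key`; that path is made up). The helper name `ssh_with_key` is hypothetical too:

```shell
# Hypothetical helper; $1 = instance name, $2 = zone,
# $3 = path to the private key of the newly generated pair.
ssh_with_key() {
  gcloud compute ssh "$1" --zone "$2" --ssh-key-file "$3"
}
```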
Attempt #5 : Reconfiguring gcloud ssh
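What "reconfiguring" involved is not spelled out. One plausible reading, and only an assumption, is dropping and regenerating the SSH configuration that `gcloud compute config-ssh` maintains in `~/.ssh/config`:

```shell
# Hypothetical helper: remove the gcloud-managed entries from ~/.ssh/config,
# then regenerate them for the instances in the current project.
reconfigure_ssh() {
  gcloud compute config-ssh --remove &&
  gcloud compute config-ssh
}
```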
After this step, I went through all of the above steps once again. All yielded the same result.
Attempt #6 : ssh command with default and new keys
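A sketch of bypassing `gcloud` and calling `ssh` directly. By default gcloud keeps its key at `~/.ssh/google_compute_engine`; the username, IP address and the helper name `ssh_direct` are placeholders, not values from this incident:

```shell
# Hypothetical helper; $1 = private key path, $2 = username, $3 = external IP.
ssh_direct() {
  ssh -i "$1" "$2@$3"
}
```

For example, `ssh_direct ~/.ssh/google_compute_engine alice 203.0.113.10` for the default key, and the same call with the new key's path for the second try.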
Every attempt ended with the same error:
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code 
I discussed the problem with my project manager, who asked me to get help from a member of our cloud team. During our conversation, he suggested that enabling the serial port would help with debugging the problem. He also mentioned something called startup-script, which does what it says: it runs a script when the VM starts up. With these newfound hints I started to dig deeper.
Analysing serial port log
This step revealed that the VM had run out of storage.
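For reference, a minimal sketch of enabling the serial port and reading its log from the command line. The helper name `read_serial_log` and its arguments are hypothetical; `serial-port-enable` is the metadata key Compute Engine uses for serial console access:

```shell
# Hypothetical helper; $1 = instance name, $2 = zone.
read_serial_log() {
  # Enable serial console access on the instance, then dump the boot/serial log.
  gcloud compute instances add-metadata "$1" --zone "$2" \
    --metadata serial-port-enable=TRUE &&
  gcloud compute instances get-serial-port-output "$1" --zone "$2"
}
```

Piping the output through something like `grep -i "no space left"` can narrow things down, though the exact wording of the log message is an assumption.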
Solution #1 : startup-script
I added a startup-script metadata entry to the VM. I also tried the same script with sudo, making sure I left no stone unturned. After 3-4 rounds of trial and error and extensive log analysis, I concluded that the startup-script was not being triggered either.
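The script itself is not preserved here. A sketch of how a local script file can be attached as `startup-script` metadata, with the helper name `set_startup_script` and the file name being hypothetical:

```shell
# Hypothetical helper; $1 = instance name, $2 = zone,
# $3 = path to a local script file (its content is not known from this post).
set_startup_script() {
  gcloud compute instances add-metadata "$1" --zone "$2" \
    --metadata-from-file startup-script="$3"
}
```

The script only runs on the next boot, so the VM has to be restarted afterwards. On Linux images the startup script runs as root, so `sudo` inside it should not normally be needed.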
Solution #2 : shutdown-script
shutdown-script is, likewise, a script that is executed before the machine is switched off; its content was the same as the startup-script. Neither script was triggering, since there was not enough storage on the VM.
Solution #3 : Resizing disk
If the VM ran out of storage, simply adding more storage to its boot disk should fix the problem. So I decided to resize the boot disk after switching off the VM. I must say the resize command completed almost instantly.
I started the VM, thinking the issue was resolved, but I was wrong. It greeted me with the same error message when I tried connecting to it.
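A sketch of that stop, resize, start sequence with `gcloud`. The helper name `resize_boot_disk`, the disk name and the target size are all placeholders rather than the values used here:

```shell
# Hypothetical helper; $1 = instance name, $2 = zone,
# $3 = boot disk name, $4 = new size (for example 50GB).
resize_boot_disk() {
  gcloud compute instances stop "$1" --zone "$2" &&
  gcloud compute disks resize "$3" --zone "$2" --size "$4" &&
  gcloud compute instances start "$1" --zone "$2"
}
```

Note that resizing only grows the block device. The partition and filesystem still have to be expanded inside the guest, which many public images do automatically at boot.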
Solution #4 : Final Solution
While I was skimming through the documentation, I read that you can detach and re-attach boot disks. That gave me an idea: I remembered that there was a snapshot of this VM, taken when things were green. Here are my steps to the solution.
- Switch off the VM
- Create a disk from the snapshot
- Detach the current boot disk
- Attach the disk created from the snapshot as the boot disk
- Switch the VM back on and hope it works
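The steps above can be sketched with `gcloud` roughly as follows. This is not a record of the exact commands that were run: the helper name `swap_boot_disk` and every argument (instance, zone, snapshot and disk names) are hypothetical.

```shell
# Hypothetical helper.
# $1 = instance, $2 = zone, $3 = snapshot name,
# $4 = current (full) boot disk, $5 = name for the new disk.
swap_boot_disk() {
  instance="$1"; zone="$2"; snapshot="$3"; old_disk="$4"; new_disk="$5"
  # Each step runs only if the previous one succeeded.
  gcloud compute instances stop "$instance" --zone "$zone" &&
  gcloud compute disks create "$new_disk" --zone "$zone" --source-snapshot "$snapshot" &&
  gcloud compute instances detach-disk "$instance" --zone "$zone" --disk "$old_disk" &&
  gcloud compute instances attach-disk "$instance" --zone "$zone" --disk "$new_disk" --boot &&
  gcloud compute instances start "$instance" --zone "$zone"
}
```

The `--boot` flag on `attach-disk` is what marks the new disk as the boot disk. The old disk is only detached, not deleted, so it can still be inspected later.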
Voilà! I was able to access the machine. Someone might ask: why go through all the hassle when you could have just created a new VM from the snapshot? I couldn't do that. I didn't want to lose the VM metadata and, more importantly, the VM's IP, since this server is used by many of our customers and they connect to it via IP.
What I have learned
- There is a serial port on compute instances that GCP provides.
- startup-script and shutdown-script
- You can detach and re-attach a boot disk, though this might not work exactly the same way for a Windows VM
I still have some cleaning up to do: deleting the old boot disk, removing the extra SSH keys from metadata, and updating my code so that it removes old log files. These log files were the very reason this problem existed.
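As a rough idea of what the log cleanup could look like, here is a minimal sketch using `find`. It assumes the logs are plain `*.log` files in a single directory and that age in days is the right criterion; the helper name `prune_old_logs`, the directory and the threshold are hypothetical.

```shell
# Hypothetical helper; $1 = log directory, $2 = age threshold in days.
# Deletes regular files named *.log that were last modified more than $2 days ago.
prune_old_logs() {
  find "$1" -type f -name '*.log' -mtime +"$2" -delete
}
```

Run from cron or a similar scheduler, for example as `prune_old_logs /var/log/myapp 7` (a made-up path), this would keep roughly the last week of logs on disk.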