
GCP VM SSH problem and solution

 ·  ☕ 4 min read  ·  ✍️ Syed Dawood

The Start

It was just another day at work, and just another task in my everyday routine: I needed to log in to a VM, let's call it $INSTANCE throughout this post, and update a few configs. I logged into the Google Cloud console, selected the project from the project selector, navigated to Compute Engine, and clicked SSH. Normally that opens a pop-up window and drops you into the familiar bash shell. Not today. Instead, it kept on loading.

The Denial

I was confused; this had never happened before. I double-checked my internet, tried a different browser, and even switched to an alternate internet connection. Every attempt ended the same way: the loading pop-up window.

Attempt #1 : gcloud command

gcloud beta compute ssh $INSTANCE --zone $ZONE --project $PROJECT

Attempt #2 : gcloud command with username

gcloud compute ssh $USR@$INSTANCE --zone $ZONE --project $PROJECT

Attempt #3 : gcloud command with verbose flag

gcloud compute ssh --zone $ZONE $INSTANCE --project $PROJECT --ssh-flag="-vvvvv"

Attempt #4 : gcloud command with the compute engine default key and my newly generated ssh keypair

gcloud compute ssh --zone $ZONE $INSTANCE --project $PROJECT --ssh-key-file=$HOME/.ssh/google_compute_engine --ssh-flag="-vvv" # compute engine default
gcloud compute ssh --zone $ZONE $INSTANCE --project $PROJECT --ssh-key-file=$HOME/.ssh/new-ssh-key --ssh-flag="-vvv"

Attempt #5 : Reconfiguring gcloud ssh

rm $HOME/.ssh/google_compute_engine $HOME/.ssh/google_compute_engine.pub # removing default key pair
gcloud compute config-ssh

After this step, I went through all the above attempts once again, all yielding the same result.
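For what it's worth, `gcloud compute config-ssh` writes `Host` aliases for your instances into `~/.ssh/config`, so after running it you can (in a healthy setup) connect with plain `ssh`. A quick sketch, with placeholder names:

```shell
# Regenerate the default key pair and ~/.ssh/config aliases, then
# connect via the <instance>.<zone>.<project> alias that config-ssh
# creates ($INSTANCE, $ZONE and $PROJECT are placeholders):
gcloud compute config-ssh --project $PROJECT
ssh $INSTANCE.$ZONE.$PROJECT
```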

Attempt #6 : ssh command with default and new keys

ssh -i $HOME/.ssh/new_key $USR@$INSTANCE_IP
ssh -i $HOME/.ssh/google_compute_engine $USR@$INSTANCE_IP

Every single attempt ended with the same error:

ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255]
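Exit code 255 is ssh's catch-all for "connection failed", so it says very little by itself. Before blaming keys, it is worth ruling out instance state and firewall rules; a rough checklist, assuming the usual gcloud setup:

```shell
# Is the instance actually running?
gcloud compute instances describe $INSTANCE --zone $ZONE --project $PROJECT \
    --format='value(status)'   # expect RUNNING
# Is there a firewall rule allowing tcp:22 from your address?
gcloud compute firewall-rules list --project $PROJECT
```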

The Hint

I discussed the problem with my project manager, who asked me to get help from one of our cloud team members. During our conversation he suggested that enabling the serial port would help debug the problem, and mentioned something called a startup-script, which does what it says: runs a script on VM start-up. With these new-found hints I started to dig deeper.

Analysing serial port log

gcloud compute connect-to-serial-port $INSTANCE --zone=$ZONE --project=$PROJECT

The serial port log revealed that the VM had run out of storage.
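Once you are on the serial console, a disk-full diagnosis is quick to confirm. A minimal sketch (the /home/user path mirrors the log location used later in this post):

```shell
# Root filesystem usage -- 100% use confirms the out-of-storage theory
df -h /
# Find the biggest offenders under the suspect directory
du -sh /home/user/* 2>/dev/null | sort -rh | head
```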

Solution #1 : startup-script

I added a startup-script metadata entry with the content below. I also tried the script with sudo, making sure I left no stone unturned. After three or four rounds of trial and error and extensive log analysis, I concluded that the startup-script wasn't being triggered at all.

#!/usr/bin/env bash
find /home/user/ -name "*.log" -delete
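For completeness, this is roughly how a script like the one above gets attached as startup-script metadata (the cleanup.sh filename is my own placeholder):

```shell
# Attach the cleanup script as startup-script metadata; GCE runs it
# on every boot of the instance
gcloud compute instances add-metadata $INSTANCE --zone $ZONE --project $PROJECT \
    --metadata-from-file startup-script=cleanup.sh
```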

Solution #2 : shutdown-script

A shutdown-script is, again, a script that is executed before the machine is switched off; its content was the same as the startup-script. Neither was triggering, since there was not enough storage on the VM.

Solution #3 : Resizing disk

If the VM ran out of storage, then simply adding more storage to its boot disk should fix the problem. So I decided to resize the boot disk after switching off the VM. I must say the resize command completed almost instantly.

gcloud compute disks resize $INSTANCE --zone $ZONE --size <int> --project $PROJECT

I started the VM thinking the issue was resolved, but I was wrong. It greeted me with the same error message when I tried connecting.
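In hindsight, one plausible reason the resize alone changed nothing is that growing the disk does not automatically grow the partition and filesystem; on most Linux images that step runs inside the guest at boot, which a wedged VM may never get to. On a healthy Debian/Ubuntu guest the manual version would look roughly like this (device names are assumptions; check with `lsblk` first):

```shell
# Grow partition 1 of the boot disk to fill the resized device,
# then grow the ext4 filesystem to fill the partition
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```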

Solution #4 : Final Solution

While I was skimming through the documentation, I read that you could detach and re-attach boot disks, and I got an idea. I remembered that there was a snapshot of this VM, taken back when things were green. Here are my steps to the solution:

  • Switch off the VM
  • Create a disk from the snapshot
  • Detach the current boot disk
  • Re-attach the disk created in the first step
  • Switch it back on and hope it works
gcloud compute disks create $NEW_DISK --source-snapshot $SNAPSHOT --project=$PROJECT --size <int> --zone $ZONE
gcloud beta compute instances detach-disk $INSTANCE --disk $OLD_DISK --project=$PROJECT
gcloud beta compute instances attach-disk $INSTANCE --disk $NEW_DISK --boot --project=$PROJECT
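Before switching the VM back on, it is worth sanity-checking that the snapshot disk really is attached as the boot device; something along these lines:

```shell
# List the attached disk sources; the new disk should be among them
gcloud compute instances describe $INSTANCE --zone $ZONE --project $PROJECT \
    --format='value(disks[].source)'
# Then bring the instance back up
gcloud compute instances start $INSTANCE --zone $ZONE --project $PROJECT
```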

Conclusion

Voilà! I was able to access the machine. Someone might ask why I went through all this hassle: you could have just created a new VM from the snapshot. I couldn't do that, because I didn't want to lose the VM metadata and, more importantly, the VM's IP. This server was used by many of our customers, and they connect to it via that IP.

What I have learned

  • There is a serial port on the compute instances that GCP provides.
  • startup-script and shutdown-script
  • You can detach and re-attach a boot disk, though this might not work exactly the same for a Windows VM.

Clean up

I still have some cleaning up to do: deleting the old boot disk, removing the extra ssh keys from the metadata, and updating my code so that it removes old log files. Those log files were the very reason this problem existed.
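A sketch of that cleanup, with placeholder names, plus a cron line that would stop the logs from piling up again:

```shell
# Delete the detached old boot disk once you are sure the VM is healthy
gcloud compute disks delete $OLD_DISK --zone $ZONE --project $PROJECT
# A crontab entry like this deletes *.log files older than 7 days,
# every night at 03:00, so the disk never fills up again:
# 0 3 * * * find /home/user/ -name '*.log' -mtime +7 -delete
```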
