Troubleshooting environments is challenging because many things can go wrong: installation scripts can fail, dependencies can break, and artifacts can be missing or packaged incorrectly. Deploying environments to a public cloud makes it harder still, because quota limitations and complex network setups can make problems difficult to identify and fix.
CloudShell Colony improves the troubleshooting experience with the Bastion, a feature that helps you work through errors that can occur when deploying your sandboxes. The Bastion creates a direct communication link to your applications from within the sandbox. CloudShell Colony's Bastion is provided with an initial state of "Enabled-on". You can change its state to "Enabled-off" or "Disabled" in the YAML file. You can also disable the Bastion directly from the UI.
Enabling the Bastion feature incurs an infrastructure cost from your cloud provider, although it is designed to minimize cost while making troubleshooting as efficient as possible. Once a sandbox is deployed, the Bastion can be turned on or off. However, if the Bastion is not enabled before deployment, it does not exist in the sandbox and therefore cannot be turned on or off.
To ensure the Bastion is attached, see Enabling the Bastion Capability.
For more information about the sandbox orchestration process, see The Sandbox Deployment Process.
If your sandbox fails to launch, it’s usually related to one of the following scenarios:
- Errors that prevented your sandbox from starting (there is no sandbox in this case)
- Errors that prevented applications from starting to deploy
- Errors that occurred while applications were deploying
- No error but the sandbox is stuck in deploying state
Errors that prevented your sandbox from starting (there is no sandbox in this case)
You start a sandbox and instantly receive an error message. In this case the sandbox is not created. This can happen for any of the following reasons:
- CloudShell Colony could not correctly interpret your blueprint and/or applications.
- CloudShell Colony could not access your cloud account using the given role.
- CloudShell Colony tried to send your infrastructure request to your cloud account but received a critical error.
For example, a blueprint specifies the wrong region name.
Errors that prevented applications from starting to deploy
You start a sandbox, but some of the infrastructure cannot be created, which moves an application to the "aborted" state.
This typically happens when you reach the limit of the cloud provider’s quota, or when you specify incorrect image identifiers or incorrect machine sizes. In this case, some cloud resources may have been created. You can access the VMs to investigate further, and you can end the sandbox manually whenever you like.
For example, if you specify a wrong AMI ID in AWS, an error is displayed in the sandbox summary tab and the application aborts.
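A mistyped image identifier can sometimes be caught before launch with a quick format check. The sketch below is a hypothetical pre-flight helper, not part of CloudShell Colony. It assumes AMI IDs have the form `ami-` followed by 8 (older) or 17 lowercase hexadecimal characters, and it cannot tell whether an ID actually exists in the target region or account.

```python
import re

# Assumption: an AMI ID is "ami-" plus 8 (older format) or 17 hex characters.
AMI_ID_PATTERN = re.compile(r"ami-(?:[0-9a-f]{8}|[0-9a-f]{17})")


def looks_like_ami_id(value):
    """Return True if value has the shape of an AMI ID.

    This is only a syntax check: a well-formed ID can still be wrong,
    for example because the image lives in a different region.
    """
    return AMI_ID_PATTERN.fullmatch(value) is not None
```

A check like this only filters out obvious typos. An ID that passes can still cause the application to abort, so the sandbox summary tab remains the place to confirm the actual error.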
Errors that occurred while applications were deploying
You start a sandbox and the infrastructure is created and provisioned, but while the applications are deploying, some of them fail and report an error state.
The error here comes from the health check that runs in parallel to the start script of your applications. The health check reports a failure to start the application when it reaches its defined timeout. In this case, cloud resources will remain live to allow you to access the VMs and investigate further. You may end the sandbox manually and clean up the cloud resources whenever you like.
To investigate further, you need to connect to the instance and look in the deployment log files. For troubleshooting guidelines, see Troubleshooting Initialization Scripts.
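The relationship between the start script and the health check described above can be illustrated with a small model. This is a conceptual sketch only, not CloudShell Colony's implementation: the function name, the polling interval, and the "healthy"/"failed" return values are all hypothetical. It runs a start routine in the background while polling a health check until the check passes or the timeout is reached.

```python
import threading
import time


def deploy_with_health_check(start_script, health_check, timeout_s, poll_s=0.05):
    """Run start_script in the background and poll health_check in parallel.

    Returns "healthy" as soon as health_check() is true, or "failed" if
    timeout_s elapses first. Nothing is cleaned up on failure, which mirrors
    the idea that resources stay live for investigation.
    """
    worker = threading.Thread(target=start_script, daemon=True)
    worker.start()

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if health_check():
            return "healthy"
        time.sleep(poll_s)
    return "failed"
```

The model shows why a failure reported at the timeout does not by itself say what went wrong: the start script may have crashed early, may still be running, or may have finished without bringing the application up. The deployment logs on the instance are what distinguish these cases.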
A simple scenario of a failed initialization script. In the Applications tab, one of the applications is marked with an error:
In the Troubleshooting tab, selecting the application displays exactly which compute instance failed:
Connecting to the failed instance using SSH allows you to browse the ‘events’ logs, which indicate the deployment failed, as expected:
Printing the initialization script log shows the cause of the problem:
No error but the sandbox is stuck in deploying state
You start a sandbox, the infrastructure is created and provisioned, and then some applications remain in the deploying state indefinitely.
This usually happens when your initialization script never exits. It may contain an ‘infinite loop’ bug, or it may mistakenly wait for user input. To verify that this is the case, you can log in to the compute instance and check the events log.
The deployment process is stuck on the Initialization step:
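To tell an infinite loop apart from a script that waits for user input, it can help to rerun the script with its standard input closed and a time limit. The sketch below assumes you can run the script by hand on the instance; `probe_script` and its return strings are hypothetical names chosen for this illustration.

```python
import subprocess


def probe_script(cmd, timeout_s):
    """Run cmd with stdin closed and a time limit.

    Returns "timeout" if the command is still running after timeout_s
    seconds (it is killed at that point), otherwise "exit N" where N is
    the command's exit code.
    """
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # any read from stdin sees end-of-file
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return f"exit {result.returncode}"
```

If the script exits quickly once stdin is closed, often with an error about missing input, the original hang was probably a prompt waiting for a user. If it still times out, the script is more likely looping or blocked on something else, such as a network call.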