Wednesday, September 8, 2021

SSM Execution Timeout

 

Problem Statement

One of our core business servers failed to boot when a new instance was scaled out.  The new instance stalled and never finished bootstrapping.  The business impact: no further deployments could be released.

Why

Deeper analysis showed that the bootstrap execution got stuck after 60 minutes.  The question is how to resolve this.

One clue is the timeout settings that apply when a command is sent with send-command against an AWS SSM document.  There are two of them, a delivery timeout and an execution timeout, both with a default of 3600 seconds.  The 60-minute stall matches that 3600-second default execution timeout.

Total timeout is equal to the value of delivery timeout plus execution timeout. If execution timeout isn't required by the SSM document, then total timeout is equal to the value of delivery timeout plus default execution timeout.
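As a small worked example of the rule above, the sum can be written as Terraform locals. This is only a sketch: the local and output names are made up for illustration, and the two 3600-second values are the defaults quoted in this post, not values read from any real document.

```hcl
locals {
  # Defaults as quoted in this post (assumed, not read from AWS).
  delivery_timeout_seconds  = 3600
  execution_timeout_seconds = 3600

  # Total timeout = delivery timeout + execution timeout.
  total_timeout_seconds = local.delivery_timeout_seconds + local.execution_timeout_seconds
}

# Hypothetical output, only there to make the result visible in a plan.
output "total_timeout_seconds" {
  value = local.total_timeout_seconds
}
```

With both defaults in place the sum is 7200 seconds. Raising only the execution timeout raises the total by the same amount.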

How

Let us review how Systems Manager reports the two timeouts.

If Systems Manager receives an execution timeout reply from SSM Agent on a target, then Systems Manager marks the command invocation as executionTimeout.

If Run Command doesn't receive a document terminal response from SSM Agent, the command invocation is marked as deliveryTimeout.

To fix the stalled bootstrap, the SSM document is defined in Terraform with an explicit execution timeout:

resource "aws_ssm_document" "TestServer-ssmCommand" {
  name          = "TestServer-Execute-Userdata-Prod"
  document_type = "Command"

  # timeoutSeconds below sets the execution timeout of the step to 4500 seconds.
  content = <<DOC
  {
    "schemaVersion": "2.0",
    "description": "Downloads and executes the userdata for Test Server",
    "parameters": {},
    "mainSteps": [
      {
        "action": "aws:runShellScript",
        "name": "runShellScript",
        "inputs": {
          "timeoutSeconds": 4500,
          "runCommand": [
            "sudo yum install dos2unix -y",
            "sudo yum install aws-cli -y",
            ......
          ]
        }
      }
    ]
  }
DOC
}

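The key takeaway is the timeoutSeconds property in the mainSteps -> inputs section of the aws_ssm_document content.  It sets the execution timeout for that step, here 4500 seconds instead of the 3600-second default.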
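If the timeout should be adjustable per invocation rather than hard-coded, one possible variation is to expose it as a document parameter and reference it from timeoutSeconds, similar to how the AWS-managed AWS-RunShellScript document uses an executionTimeout parameter. The sketch below is an illustration under assumptions, not the configuration used in this incident: the resource name, document name, variable name and the placeholder command are all hypothetical. It uses jsonencode instead of a heredoc so that Terraform generates the JSON and unbalanced braces are caught at plan time.

```hcl
# Hypothetical variable: default execution timeout in seconds.
variable "execution_timeout_seconds" {
  type    = number
  default = 4500
}

resource "aws_ssm_document" "example_parameterized" {
  name          = "Example-Execute-Userdata" # hypothetical name
  document_type = "Command"

  # jsonencode builds the JSON document from an HCL object.
  content = jsonencode({
    schemaVersion = "2.0"
    description   = "Runs a bootstrap command with a configurable execution timeout"
    parameters = {
      executionTimeout = {
        type        = "String"
        default     = tostring(var.execution_timeout_seconds)
        description = "Seconds the step may run before it is timed out"
      }
    }
    mainSteps = [
      {
        action = "aws:runShellScript"
        name   = "runShellScript"
        inputs = {
          # Resolved by Systems Manager at run time, not by Terraform.
          timeoutSeconds = "{{ executionTimeout }}"
          runCommand     = ["echo bootstrap placeholder"] # placeholder command
        }
      }
    ]
  })
}
```

A caller could then override executionTimeout when sending the command, while the Terraform variable keeps the 4500-second value as the default.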

Conclusion

With the longer execution timeout in place, the reported boot timeout issue is resolved and the business expectation is met.  Technology needs to enable the business.

