Monday, September 13, 2021

Make500


 On BlogSpot alone (beyond other publications), I've written over 500 passionate weekly blog posts.

I thank the Almighty and every person who made the Make500 milestone possible, as I step into my 50th year.

Details are inked at https://www.linkedin.com/pulse/make500-ganesan-senthilvel

Wednesday, September 8, 2021

SSM Execution Timeout

 

Problem Statement

One of our core business servers failed to boot when a new instance was scaled out.  The new box went stale and did not proceed.  The business impact: no new deployments could be released.

Why

Deeper analysis showed that the bootstrap execution got stuck after 60 minutes.  The question is how to resolve this.

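One clue lies in the timeout settings that apply when a command is sent (send-command) with an AWS SSM document.  There are two of them: the delivery timeout and the execution timeout.  The execution timeout defaults to 3600 seconds, which matches the 60-minute stall; the delivery timeout is set separately on the command (the Systems Manager console defaults it to 600 seconds).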

Total timeout is equal to the value of delivery timeout plus execution timeout. If execution timeout isn't required by the SSM document, then total timeout is equal to the value of delivery timeout plus default execution timeout.
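The arithmetic above can be checked with a small sketch. The function name and constant are illustrative and not part of any AWS SDK; the 600-second delivery timeout is a hypothetical input, and the 3600-second default execution timeout is the value discussed above.

```python
DEFAULT_EXECUTION_TIMEOUT = 3600  # seconds; default execution timeout


def total_timeout(delivery_timeout, execution_timeout=None):
    """Return the total timeout in seconds: delivery plus execution timeout.

    When the SSM document does not set an execution timeout (None here),
    the default execution timeout is used instead.
    """
    if execution_timeout is None:
        execution_timeout = DEFAULT_EXECUTION_TIMEOUT
    return delivery_timeout + execution_timeout


# Hypothetical delivery timeout of 600 seconds.
print(total_timeout(600))        # document sets no execution timeout -> 4200
print(total_timeout(600, 4500))  # document sets timeoutSeconds to 4500 -> 5100
```

With the default in place, a bootstrap that needs more than 3600 seconds of execution time will hit the execution timeout no matter how generous the delivery timeout is.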

How

Let us review the purpose of the two timeout parameters.

If Systems Manager receives an execution timeout reply from SSM Agent on a target, then Systems Manager marks the command invocation as executionTimeout.

If Run Command doesn't receive a document terminal response from SSM Agent, the command invocation is marked as deliveryTimeout.

To fix this stale bootstrap state, the SSM document is defined in Terraform with a longer execution timeout:

resource "aws_ssm_document" "TestServer-ssmCommand" {
  name          = "TestServer-Execute-Userdata-Prod"
  document_type = "Command"

  content = <<DOC
  {
    "schemaVersion": "2.0",
    "description": "Downloads and executes the userdata for Test Server",
    "parameters": {},
    "mainSteps": [
      {
        "action": "aws:runShellScript",
        "name": "runShellScript",
        "inputs": {
          "timeoutSeconds": 4500,
          "runCommand": [
            "sudo yum install dos2unix -y",
            "sudo yum install aws-cli -y",
            ......
          ]
        }
      }
    ]
  }
DOC
}

The key takeaway is the timeoutSeconds property in the mainSteps -> inputs section of the aws_ssm_document resource: it raises the execution timeout from the default to 4500 seconds.
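A missing brace inside a hand-written heredoc is easy to overlook, so one option is to generate the document content with a script and parse it back before handing it to Terraform. The sketch below is only an illustration: build_ssm_document is a hypothetical helper, and it includes just the two commands visible in the post, not the elided ones.

```python
import json


def build_ssm_document(commands, timeout_seconds):
    """Return the JSON content of a Command document with an execution timeout."""
    document = {
        "schemaVersion": "2.0",
        "description": "Downloads and executes the userdata for Test Server",
        "parameters": {},
        "mainSteps": [
            {
                "action": "aws:runShellScript",
                "name": "runShellScript",
                "inputs": {
                    # Execution timeout for this step, in seconds.
                    "timeoutSeconds": timeout_seconds,
                    "runCommand": list(commands),
                },
            }
        ],
    }
    return json.dumps(document, indent=2)


content = build_ssm_document(
    ["sudo yum install dos2unix -y", "sudo yum install aws-cli -y"],
    timeout_seconds=4500,
)
# Round-trip through the parser to confirm the JSON is well formed.
parsed = json.loads(content)
print(parsed["mainSteps"][0]["inputs"]["timeoutSeconds"])
```

Because json.dumps always emits balanced braces, the generated content cannot suffer from the kind of structural slip that a manual edit can introduce.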

Conclusion

Thus the reported booting timeout issue is resolved to meet the business expectation.  Technology needs to enable the business.


Tuesday, September 7, 2021

Hudson FilePath Pipeline


Problem Statement

Last week, one of our build servers started failing every job.  The business impact: the entire build and deployment process was blocked for every release.


Every Jenkins build job reported the exception below, with no further clue.
[INFO] ------------------------------------------------------------------------
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] sh
Required context class hudson.FilePath is missing
Perhaps you forgot to surround the code with a step that provides this, such as: node
[Pipeline] End of Pipeline
org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing
......
Finished: FAILURE
 

Why

Jenkins Pipeline is a suite of plugins that supports implementing and integrating continuous delivery pipelines into Jenkins.  A pipeline is an automated expression of the process for getting the right version of the software from version control onto the underlying build server and beyond.
 

There are two types of pipeline syntax in Jenkins, namely declarative and scripted.  Declarative Pipeline is the more recent feature and provides richer syntactical features than the scripted syntax.
 

How

Troubleshooting the build job failures showed that some software configuration/setup on the server had changed.


The resolution was applied in two steps:
1. Upgraded the required pipeline plugins, as in the screenshot.
2. Restarted the build server after the plugin upgrades.
 

There are two ways to restart, namely cold and warm.  A cold restart is the complete physical restart of the build server.  The warm option does not restart the server itself: once a pipeline run has completed, click "Restart from Stage" in the classic UI panel to rerun the pipeline from a chosen stage.
 

Conclusion

Thus the reported build server error is resolved to meet the business expectation.  Technology needs to enable the business.

Sunday, September 5, 2021

Lifecycle Hooks

 

Problem Statement

One of our monster production boxes holds the core business functionality, with the complete set of data and services.

During deployment, the scale-in process kicks off only after an hour.  The business impact is that the newly scaled-out services take effect only after that one-hour wait.

Why

A lifecycle hook provides a specified amount of time (one hour by default) to complete the lifecycle action before the instance transitions to the next state.

It lets you control what happens when Amazon EC2 instances are launched and terminated as the group scales out and in.

How

Troubleshooting the lifecycle hook setup revealed an error in the instance_terminate lifecycle transition.  As the resolution, a new lifecycle hook was created for instance_terminate with the right heartbeat timeout and with the auto scaling default result set to 'Continue'.

The fix can be applied in two ways: manually in the AWS console, or programmatically in Terraform.

resource "aws_autoscaling_lifecycle_hook" "app" {
    depends_on             = ....
    autoscaling_group_name = ....
    name                   = ....
    default_result         = "CONTINUE"
    heartbeat_timeout      = 600
    lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
    notification_target_arn = ....
    role_arn                = ....
}
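For readers who script this outside Terraform, the same settings can be sketched as keyword arguments for the Auto Scaling PutLifecycleHook call. This is a minimal sketch under assumptions: lifecycle_hook_params is a hypothetical helper, the group and hook names are placeholders, and the 30 to 7200 second range reflects my understanding of the documented heartbeat limits. The notification target and role are left out because the post elides them.

```python
VALID_RESULTS = {"CONTINUE", "ABANDON"}


def lifecycle_hook_params(group_name, hook_name, heartbeat_timeout=600,
                          default_result="CONTINUE"):
    """Build keyword arguments for a terminate-time lifecycle hook."""
    if default_result not in VALID_RESULTS:
        raise ValueError(f"default_result must be one of {sorted(VALID_RESULTS)}")
    # Assumed limits for the heartbeat timeout, in seconds.
    if not 30 <= heartbeat_timeout <= 7200:
        raise ValueError("heartbeat_timeout must be between 30 and 7200 seconds")
    return {
        "AutoScalingGroupName": group_name,
        "LifecycleHookName": hook_name,
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
        "HeartbeatTimeout": heartbeat_timeout,
        "DefaultResult": default_result,
    }


# Placeholder names; substitute the real group and hook names.
params = lifecycle_hook_params("example-asg", "example-terminate-hook")
print(params["HeartbeatTimeout"], params["DefaultResult"])
# With boto3 installed and credentials configured, these could be passed as
# boto3.client("autoscaling").put_lifecycle_hook(**params)
```

Dropping the heartbeat from the one-hour default to 600 seconds is what shortens the wait before a terminating instance moves to the next state.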

Conclusion

Thus the reported scale-in delay problem is resolved to meet the business expectation.  Technology needs to enable the business.

Thursday, September 2, 2021

Custom Widgets for CloudWatch

 


Last week, Amazon CloudWatch announced the immediate availability of custom widgets.  This new feature is used to gain operational visibility and agility by customizing the content of your CloudWatch dashboard such as adding visualizations, displaying information from multiple data sources or adding controls like buttons to take remediation actions.

Custom widgets can help you to correlate trends over time and spot issues more easily by displaying related data from different sources side by side on CloudWatch dashboards. Custom widgets allow you to extend your CloudWatch dashboards’ out of the box capabilities including line, bar and pie charts with rich, business specific visualizations that represent the operational health and performance of your workloads.

Technically, a custom widget works as a three-step process:

  1. CloudWatch dashboard calls the Lambda function containing the widget code. It passes in any custom defined parameters in the widget.
  2. The Lambda function returns a string of HTML, JSON, or Markdown. Markdown is returned as JSON {"markdown":"markdown content"}
  3. The dashboard displays the returned HTML or JSON.
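The three steps above can be sketched as a small Lambda handler. This is an illustration rather than an official template: it assumes the widget's custom parameters arrive as keys of the event, and the name and format parameters are hypothetical.

```python
import html


def lambda_handler(event, context):
    """Return content for a CloudWatch custom widget."""
    # Assumption: custom parameters defined on the widget are keys of the event.
    name = html.escape(str(event.get("name", "friend")))
    if event.get("format") == "markdown":
        # Markdown is returned wrapped in JSON, as described in step 2.
        return {"markdown": f"## Hello {name}"}
    # Otherwise return a plain string of HTML for the dashboard to display.
    return f"<h1>Hello {name}</h1>"


# Local check with hypothetical events; no AWS account is needed.
print(lambda_handler({"name": "Ops <team>"}, None))
print(lambda_handler({"name": "Ops", "format": "markdown"}, None))
```

Escaping the parameter before embedding it keeps user-supplied values from injecting markup into the dashboard.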


A set of templates and a sample library is available at https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/add_custom_widget_samples.html