Easy Health Checks With SocketPing Test

Technical Deep Dive
February 4, 2025
by
Dana Gibson

In the world of cloud computing, maintaining a vigilant eye on our systems is crucial. Continuous improvement in our monitoring and alerting practices is a fundamental philosophy. Fortunately, AWS provides valuable tools which simplify the automation of these processes. These tools play an important role in detecting and addressing potential issues promptly.

About two months ago... "How long has this been non-functional?" I'm sure the team had never been more thankful that their cameras were off, because no one really knew. That was how a meeting ended with a company I was supporting.

The next morning, I looked at my Jira board, and there it was: a shiny new story called SocketPing Test. I had no idea what this was, but as I opened up the architecture diagram, it all started to come into focus: a way for us to periodically run a connectivity test against different applications and be alerted if one became unavailable.

The concept was simple: trigger a Lambda function to "ping" a specific IP address or host name (here, a TCP connection attempt to a given port rather than an ICMP ping). If the connection succeeds, no further action is needed. If the connection fails, a message is sent to Microsoft Teams (although it could be sent anywhere) indicating an outage.

Let's work through the code. I deployed everything as a CloudFormation stack, hence the YAML.
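As a rough illustration of that idea, here is a minimal sketch of a single connectivity check in Python. The function name `can_connect` and its default timeout are assumptions made for this example and are not taken from the deployed template.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and applies the timeout to the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False
```

Calling `can_connect("www.google.com", 443)` from a machine with internet access would be expected to succeed, while a closed or filtered port returns False once the timeout elapses.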

Triggering the Lambda

The Lambda function will be triggered in two different ways: a scheduled event using EventBridge, and the PutObject action on an S3 bucket. When creating a rule in EventBridge, your rule can be triggered either at a scheduled rate or when a specific event occurs (e.g. "CreateUser"). In this case, we want to trigger the Lambda function every 15 minutes, so we will use a scheduled rate. You must also list the Lambda function as the rule's target.

EventRule:
  Type: "AWS::Events::Rule"
  Properties:
    Description: Event rule to trigger Lambda every 15 minutes
    Name: EventRule
    ScheduleExpression: rate(15 minutes)
    State: ENABLED
    Targets:
      - Arn: !GetAtt 'SocketPingTestFunction.Arn'
        Id: SocketPingTestFunction

That takes care of our scheduled trigger. Now let's look at how we trigger the Lambda function when we upload a new manifest file. Manifest file? I had the same question. Our manifest file is simply a JSON file where we list the different IPs or hostnames we want to ping.

Below is a sample of what a manifest file could look like. In this example, the first entry reaches Google on port 443, which results in a successful ping test. The second entry reaches Google on port 111, which results in a failed ping test and sends an alert to Microsoft Teams.

{
  "manifest": [
    {"host": "www.google.com", "port": "443", "env": "dev", "description": "google-success"},
    {"host": "www.google.com", "port": "111", "env": "dev", "description": "google-fail"}
  ]
}
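Because the sample manifest stores ports as strings, it can help to validate and normalize the entries before running any checks. The sketch below is one way to do that; `parse_manifest` is a hypothetical helper for illustration and is not part of the original template.

```python
import json


def parse_manifest(raw: str) -> list[dict]:
    """Parse manifest JSON and return entries with the port converted to int."""
    data = json.loads(raw)
    entries = []
    for item in data.get("manifest", []):
        host = item.get("host")
        # The sample manifest stores ports as strings, so convert them here.
        # A missing "port" key raises KeyError; a non-numeric one raises ValueError.
        port = int(item["port"])
        if not host or not 0 < port < 65536:
            raise ValueError(f"invalid manifest entry: {item}")
        entries.append({
            "host": host,
            "port": port,
            "env": item.get("env", ""),
            "description": item.get("description", ""),
        })
    return entries
```

Failing fast on a malformed entry is a design choice; skipping the bad entry and alerting on it separately would be an equally reasonable alternative.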

We will create the S3 bucket following AWS best practices, and also include the NotificationConfiguration property, which tells S3 to invoke the Lambda function whenever an object is created (that is, whenever the manifest file is updated and uploaded to the bucket).

S3Bucket:
  Type: 'AWS::S3::Bucket'
  Properties:
    BucketName: socketpingtest
    PublicAccessBlockConfiguration:
      BlockPublicAcls: true
      BlockPublicPolicy: true
      IgnorePublicAcls: true
      RestrictPublicBuckets: true
    NotificationConfiguration:
      LambdaConfigurations:
        - Function: !GetAtt SocketPingTestFunction.Arn
          Event: 's3:ObjectCreated:*'
    Tags:
      - Key: Owner
        Value:
      - Key: Environment
        Value:

Before we jump to the Lambda function, I'm going to go ahead and share the secret sauce...the part that makes it all work...also the part I forgot. Permissions. YES! In order for the EventBridge rule and the manifest file upload to be able to trigger the Lambda function, you have to grant them permission to communicate with the Lambda.

EventBridgeLambdaPermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref 'SocketPingTestFunction'
    Action: lambda:InvokeFunction
    Principal: events.amazonaws.com
    SourceArn: !GetAtt 'EventRule.Arn'

S3LambdaPermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref SocketPingTestFunction
    Action: lambda:InvokeFunction
    Principal: s3.amazonaws.com
    SourceArn: arn:aws:s3:::socketpingtest
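Since two different sources can now invoke the same function, it may be useful for the handler to know which one fired, for example to log it. The sketch below relies on the documented shape of the two event types: S3 notifications carry a `Records` list whose items have `eventSource` set to `aws:s3`, and scheduled EventBridge events have `source` set to `aws.events`. The helper name `trigger_source` is an assumption for this example.

```python
def trigger_source(event: dict) -> str:
    """Classify a Lambda event as 's3', 'schedule', or 'unknown'."""
    records = event.get("Records") or []
    # S3 event notifications arrive as a list of records tagged with aws:s3.
    if records and records[0].get("eventSource") == "aws:s3":
        return "s3"
    # Scheduled EventBridge rules deliver an event whose source is aws.events.
    if event.get("source") == "aws.events":
        return "schedule"
    return "unknown"
```

The function in this article runs the same checks regardless of the trigger, so this distinction is optional.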

Lambda Function

What exactly do we want the Lambda function to do? After it is triggered, we want it to retrieve the manifest file from our S3 bucket, iterate through the file, and ping each entry. If it connects successfully, no further action is needed. If the connection is not successful, a message should be sent. You can customize your message to include whatever information you want; the example below sends the account ID, the host we pinged, the port, and the description. We use a function inside the Lambda code to pull the account ID, and the other information is retrieved from the manifest file.

One suggestion I will share: as you are testing your Lambda code, set explicit timeouts. A socket with no timeout can block on an unreachable host until the function itself is killed, and a Lambda function can be configured to run for as long as 15 minutes (the default is 3 seconds), which makes for a long debug session :) Spoken from experience!

I am not going to go into depth on sending a message to Teams and using a Webhook URL; that is a future article...stay tuned. But you can see the code I used to send the Teams message below.
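Building on the timeout advice, one possible safeguard is to stop iterating before the function runs out of time, using the `get_remaining_time_in_millis()` method that the Lambda context object provides. The generator name `hosts_to_check` and the two parameters are assumptions for this sketch.

```python
def hosts_to_check(entries, context, per_check_seconds=1.0, safety_ms=2000):
    """Yield entries while enough invocation time remains for one more check."""
    # Time needed for one socket check plus a margin for sending an alert.
    needed_ms = per_check_seconds * 1000 + safety_ms
    for entry in entries:
        if context.get_remaining_time_in_millis() < needed_ms:
            print(f"Stopping early, not enough time left for {entry}")
            break
        yield entry
```

With a one-second socket timeout and a 60-second function timeout, this only matters for long manifests, but it keeps a slow run from being cut off partway through an alert.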

SocketPingTestFunction:
  Type: "AWS::Lambda::Function"
  Properties:
    Code:
      ZipFile: |
        import json
        import logging
        import os
        import socket
        from urllib.error import HTTPError, URLError
        from urllib.request import Request, urlopen

        import boto3

        logger = logging.getLogger()
        logger.setLevel(logging.INFO)

        s3 = boto3.client('s3')

        # Get account id for Teams message
        def get_aws_account_id():
            client = boto3.client('sts')
            response = client.get_caller_identity()
            return response['Account']

        def lambda_handler(event, context):
            aws_account_id = get_aws_account_id()
            print(aws_account_id)

            # Bucket name comes from the S3BucketName environment variable below
            bucket_name = os.environ.get('S3BucketName', 'socketpingtest')
            object_key = 'manifest.json'

            # Read the JSON object from S3
            response = s3.get_object(Bucket=bucket_name, Key=object_key)
            data = json.loads(response['Body'].read())
            print(data)

            # Iterate through each manifest item
            for item in data.get("manifest", []):
                host = item.get("host")
                port = item.get("port")
                env = item.get("env")
                description = item.get("description")
                print(host)

                # Perform socket test
                s = socket.socket()
                s.settimeout(1)
                try:
                    s.connect((host, int(port)))
                    print("Success connecting")
                except Exception as e:
                    print(f"Connection failed: {e}")

                    # Send Teams message for failed connection
                    hook_url = os.environ.get('HookUrl')
                    message = {
                        "@context": "https://schema.org/extensions",
                        "@type": "MessageCard",
                        "themeColor": "#FF0000",
                        "title": "SocketPingTest",
                        "text": f"A connectivity test against Host: {host}, Port: {port} failed. More details below...",
                        "sections": [
                            {
                                "facts": [
                                    {"name": "Account", "value": f"{aws_account_id}"},
                                    {"name": "Host:", "value": f"{host}"},
                                    {"name": "Port", "value": f"{port}"},
                                    {"name": "Description", "value": f"{description}"}
                                ]
                            }
                        ]
                    }
                    message_json = json.dumps(message).encode('utf-8')
                    req = Request(hook_url, data=message_json,
                                  headers={'Content-Type': 'application/json'},
                                  method='POST')
                    try:
                        teamsresponse = urlopen(req)
                        teamsresponse.read()
                        logger.info("Message posted")
                        print("Teams message sent")
                    except HTTPError as e:
                        print(f"Request failed: {e.code} {e.reason}")
                    except URLError as e:
                        print(f"Server connection failed: {e.reason}")
                    except Exception as e:
                        print(f"Unexpected error sending Teams message: {e}")
                finally:
                    s.close()
    Handler: index.lambda_handler
    FunctionName: SocketPingTestFunction
    Runtime: python3.11
    TracingConfig:
      Mode: Active
    Timeout: 60
    Role: !GetAtt 'LambdaRole.Arn'
    Environment:
      Variables:
        S3BucketName: socketpingtest
        HookUrl:

The last step is creating a role for the Lambda to use. Technically, the only things you need are a trust policy that allows Lambda to assume the role and access to S3 so that the function can pull the manifest file. However, I always include permissions for logs so that I can debug and track down errors as I am attempting to deploy my stack.

LambdaRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: lambda.amazonaws.com
          Action: sts:AssumeRole
    Path: /
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
    Policies:
      - PolicyName: SocketTestPolicy
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Resource: "*"
              Action:
                - logs:CreateLogGroup
                - logs:CreateLogStream
                - logs:PutLogEvents
                - xray:PutTraceSegments
                - xray:PutTelemetryRecords
            - Effect: Allow
              Action:
                - sts:AssumeRole
              Resource: "*"
            - Effect: Allow
              Action:
                - s3:GetObject
                - s3:ListBucket
              Resource:
                - 'arn:aws:s3:::socketpingtest/*'
                - 'arn:aws:s3:::socketpingtest'

That's it! I realize scheduled, periodic checks are suboptimal, but this is a quick and easy way to know when something you expect to be up isn't. Plus, now you get to be the attentive SRE who discovered the outage in 15 minutes or less, instead of watching your client sit in awkward silence when the CEO asks, "How long has this been down?"

You can find the full template in my GitHub account.
