At Flux7 we work with a wide variety of customers, and regardless of their level of IT maturity, we are passionate about helping them apply DevOps processes in pursuit of continuous improvement. In doing so, we naturally find ourselves moving from simple application-driven problems to deeper layers of the technology stack, where DevOps automation can play a significant role in improving efficiency and productivity. Today we’d like to share the story of how we deployed AWS Step Functions to drive DevOps automation for a Flux7 customer.
One of our more sophisticated customers, this client had reached the point in their continuous improvement where they were interested in growing their self-serve IT. For those of you new to the concept, self-serve IT revolves around automating technological overhead such as approval processes, auditing and logging systems, and deployment, so that when someone needs a new application or service, they can simply request it. While we used a custom web portal in this case, you could also use an API or chatbot.
For this customer’s self-serve IT needs, we wanted to simplify the process of bringing up new AWS accounts and VPCs by creating a REST API that their internal customers could call. However, this process is complex: it requires multiple actions across different departments, with a mix of manual and automated tasks. Even the automated tasks run over a variety of APIs and platforms.
The cloud platform team had to take several steps to coordinate between different teams to create accounts and VPCs. Therefore, our goal was to apply automation to simplify the process and eliminate extra steps. Specifically, we set out to:
- Automate the creation of new accounts and VPCs including accounting, security, networking, and operations tasks.
- Handle notification for manual tasks and provide a mechanism for the operator of a manual task to mark a task as done or failed.
- Integrate with existing automation spread over REST APIs and Jenkins jobs, while adding new automation as needed.
- Define the work in a way that makes it easy for the customer to maintain and extend.
The best solution for our goal was AWS Step Functions, which lets you build distributed applications and coordinate their components using visual workflows. Several features made the service an attractive option:
- They define the process as a state machine (an abstract machine that can be in exactly one of a finite number of states at any given time). This makes it easy to break complex processes into smaller, more manageable parts. Moreover, the individual parts are defined by well-contained states, and states can be added or removed as needed by adding or deleting blocks in a simple declarative JSON structure.
- Step Functions state machines are themselves serverless and integrate directly with Lambda functions to perform tasks automatically.
- They support activities, which let you integrate with other tools, use other platforms, and add manual steps.
With the ability to define an easily extensible serverless solution, the direction was clear.
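To make the "declarative JSON structure" point concrete, here is a minimal sketch of what such a definition can look like in the Amazon States Language, built as a Python dictionary. The state names, the Lambda ARNs, and the `add_state_after` helper are all hypothetical and chosen only for illustration; this is not the customer's actual workflow.

```python
import json

# Hypothetical two-step workflow; state names and ARNs are placeholders.
definition = {
    "Comment": "Sketch of an account and VPC provisioning workflow",
    "StartAt": "CreateAccount",
    "States": {
        "CreateAccount": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:create-account",
            "Next": "CreateVpc",
        },
        "CreateVpc": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:create-vpc",
            "End": True,
        },
    },
}


def add_state_after(definition, existing, new_name, resource):
    """Insert a new Task state right after `existing` and rewire transitions."""
    states = definition["States"]
    previous = states[existing]
    new_state = {"Type": "Task", "Resource": resource}
    if "Next" in previous:
        # The new state takes over the old outgoing transition.
        new_state["Next"] = previous["Next"]
    else:
        # `existing` was terminal, so the new state becomes the terminal one.
        previous.pop("End", None)
        new_state["End"] = True
    previous["Next"] = new_name
    states[new_name] = new_state
    return definition


# Extending the process is a matter of adding one more block.
add_state_after(
    definition,
    "CreateAccount",
    "ConfigureSecurity",
    "arn:aws:lambda:us-east-1:123456789012:function:configure-security",
)
state_machine_json = json.dumps(definition, indent=2)
```

The resulting `state_machine_json` string is the kind of document you would supply as the state machine definition.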
AWS Step Functions
We fronted our state machine with Amazon API Gateway. (Note: Amazon API Gateway integrates with AWS Step Functions, so you can call Step Functions through APIs that you create to simplify and customize the interfaces to your applications.) A request arrived at API Gateway, and after a Lambda function performed some initial validation of the input parameters, we triggered AWS Step Functions to start a workflow execution and returned an HTTP 202 Accepted response to signal that the request was being processed. In the meantime, the Step Functions state machine worked through the various stages of the process. Most of the automated tasks could be completed using AWS Lambda, which has native integration with AWS Step Functions.
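The validate-then-start flow can be sketched roughly as follows. This is an illustration under assumptions, not our production handler: the required field names are invented, and the Step Functions client is passed in as a parameter (in a real Lambda function you would typically create it with `boto3.client("stepfunctions")`) so that the sketch has no dependencies outside the standard library.

```python
import json

# Hypothetical input fields; the real request schema is not shown here.
REQUIRED_FIELDS = ("account_name", "owner_email", "vpc_cidr")


def validate(params):
    """Return a list of problems; an empty list means the request is acceptable."""
    if not isinstance(params, dict):
        return ["request body must be a JSON object"]
    return [f"missing field: {name}" for name in REQUIRED_FIELDS if not params.get(name)]


def handle_request(event, sfn_client, state_machine_arn):
    """Validate an API Gateway proxy event and start a workflow execution."""
    try:
        params = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"errors": ["body is not valid JSON"]})}

    errors = validate(params)
    if errors:
        return {"statusCode": 400, "body": json.dumps({"errors": errors})}

    # start_execution returns immediately; the workflow runs asynchronously.
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(params),
    )
    # 202 Accepted: the request was received and is being worked on.
    return {"statusCode": 202, "body": json.dumps({"executionArn": response["executionArn"]})}
```

Returning the execution ARN in the body is a design choice of this sketch; it gives the caller something to look up later while the workflow is still running.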
For manual tasks we used a solution similar to the one proposed by AWS’ Ali Baghani in his blog post, Implementing Serverless Manual Approval Steps in AWS Step Functions and Amazon API Gateway. Specifically, we created an activity that we used as the target for all our manual tasks in the state machine. This activity was monitored by a Lambda function that was periodically triggered by Amazon CloudWatch Events. We created an email template in Jinja that was used to prepare an outgoing email. When it found a pending activity task, our Lambda function took the request and parameters and sent an email (using our template) to the operator of the next step.
For this customer, we had previously set up a notification API that sent notification messages to appropriate parties; we used this system to send out our notification emails. In addition to other information the operator might need, the notification emails included two links for marking the completion or failure of the manual step. As the same Lambda function would be reused for multiple manual steps, we also added a state before each manual task to normalize the step inputs into a common format.
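A rough sketch of the manual-task mechanics described above follows. Several things here are assumptions made for illustration: the callback URL and the `task_name` input field are hypothetical, Python's built-in `string.Template` stands in for the Jinja template, and the Step Functions client is injected rather than created with boto3. The client methods used (`get_activity_task`, `send_task_success`, `send_task_failure`) are the standard Step Functions activity calls.

```python
import json
from string import Template
from urllib.parse import urlencode

# string.Template is a stand-in for the Jinja template mentioned in the text.
EMAIL_TEMPLATE = Template(
    "A manual step needs your attention: $task_name\n"
    "Mark as done:   $done_link\n"
    "Mark as failed: $failed_link\n"
)
CALLBACK_URL = "https://api.example.com/manual-task"  # hypothetical endpoint


def poll_and_build_email(sfn_client, activity_arn):
    """Fetch one pending activity task and render its notification email.

    Returns None when no task is waiting. Note that get_activity_task
    long-polls, so a real poller needs a generous client read timeout.
    """
    task = sfn_client.get_activity_task(
        activityArn=activity_arn, workerName="manual-task-notifier"
    )
    token = task.get("taskToken")
    if not token:
        return None
    task_input = json.loads(task["input"])

    def link(result):
        # The task token identifies which waiting step the click resolves.
        return CALLBACK_URL + "?" + urlencode({"result": result, "taskToken": token})

    return EMAIL_TEMPLATE.substitute(
        task_name=task_input.get("task_name", "unnamed task"),
        done_link=link("done"),
        failed_link=link("failed"),
    )


def handle_callback(sfn_client, result, task_token):
    """Resolve the waiting step when the operator clicks one of the links."""
    if result == "done":
        sfn_client.send_task_success(
            taskToken=task_token, output=json.dumps({"status": "done"})
        )
    else:
        sfn_client.send_task_failure(
            taskToken=task_token,
            error="ManualTaskFailed",
            cause="Operator marked the task as failed",
        )
```

The rendered text would then be handed to whatever sends the email; in our case that was the customer's existing notification API.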
Adding in Jenkins
In addition to this work, we decided to implement other automated tasks as activities using Jenkins. There were two reasons for this:
- Part of our ethos at Flux7 is to limit the number of parallel changes so that our customers can manage change effectively. As this customer had existing Jenkins jobs that we could leverage, reusing them let us accomplish the same goal while minimizing the amount of change, and
- Jenkins was the right solution because these tasks included making changes to a Git repository and deploying a CloudFormation template. Both of these tasks could exceed the five-minute time limit that Lambda functions had at the time (the limit has since been raised to 15 minutes). Moreover, when making a commit to a Git repo, it doesn’t make sense to check out the latest version of the repo in a completely stateless system every time a request is made.
To leverage existing jobs, we created a system similar to the one for manual tasks. That is, we created an activity, and a Lambda function periodically triggered by CloudWatch Events looked for new activity tasks and triggered a Jenkins job. The Jenkins jobs were modified to take the activity Amazon Resource Name (ARN) as an additional parameter. We also added a post-build step to these Jenkins jobs to mark the task as failed or completed on the activity. (Note: you could also add a polling agent to Jenkins instances or containers to long-poll for activity tasks, which eliminates the need for a scheduled Lambda function as a dispatcher. For this customer, our approach was the least intrusive.)
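The dispatcher and the post-build step could look roughly like the sketch below. Treat the details as assumptions: the Jenkins URL, job name, and parameter names are hypothetical, authentication is omitted, and this sketch forwards the task token alongside the activity ARN because the Step Functions calls that mark a task as succeeded or failed are keyed on the task token. The `send` argument is whatever function actually performs the HTTP request.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

JENKINS_URL = "https://jenkins.example.com"  # hypothetical Jenkins base URL


def build_trigger_request(job_name, activity_arn, task_token, task_input):
    """Build (but do not send) a POST to Jenkins' buildWithParameters endpoint."""
    params = {
        "ACTIVITY_ARN": activity_arn,         # hypothetical parameter names
        "TASK_TOKEN": task_token,
        "TASK_INPUT": json.dumps(task_input),
    }
    url = f"{JENKINS_URL}/job/{job_name}/buildWithParameters"
    # Credentials are omitted; a real request would need to authenticate.
    return Request(url, data=urlencode(params).encode("utf-8"), method="POST")


def dispatch(sfn_client, activity_arn, job_name, send):
    """Scheduled dispatcher: hand one pending activity task to a Jenkins job."""
    task = sfn_client.get_activity_task(
        activityArn=activity_arn, workerName="jenkins-dispatcher"
    )
    token = task.get("taskToken")
    if not token:
        return False  # nothing waiting on this activity
    send(build_trigger_request(job_name, activity_arn, token, json.loads(task["input"])))
    return True


def report_build_result(sfn_client, task_token, succeeded):
    """Post-build step: tell Step Functions how the Jenkins job ended."""
    if succeeded:
        sfn_client.send_task_success(
            taskToken=task_token, output=json.dumps({"status": "built"})
        )
    else:
        sfn_client.send_task_failure(
            taskToken=task_token, error="JenkinsJobFailed", cause="Build did not succeed"
        )
```

In a Lambda function, `send` could be as simple as `urllib.request.urlopen`; passing it in keeps the dispatcher easy to exercise without a live Jenkins server.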
For the initial implementation, we wanted to get people using the workflow quickly and therefore didn’t want to overcomplicate the solution. To simplify error handling, we created a single error state to which all other states route on failure. That error state sends a notification to a member of the cloud team for investigation. The cloud team can manually correct any issues observed or manually revert the changes as needed. As the team works with the new automated process, the errors will become better understood, and the workflow can be improved continuously.
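One way to express the single error state is to give every state that supports error catching a `Catch` rule pointing at the same handler. The sketch below does this to a state machine definition held as a Python dictionary; the state names, the terminal `Failed` state, and the notifier ARN are hypothetical, and only `Task`, `Parallel`, and `Map` states receive a catcher because those are the state types that accept one.

```python
import copy

CATCHABLE_TYPES = ("Task", "Parallel", "Map")


def route_errors_to(definition, error_state, handler_resource):
    """Return a copy of `definition` in which every catchable state
    routes any error to a single notification state."""
    result = copy.deepcopy(definition)
    states = result["States"]
    for state in states.values():
        if state.get("Type") in CATCHABLE_TYPES:
            state["Catch"] = [
                {
                    "ErrorEquals": ["States.ALL"],  # match any error
                    "ResultPath": "$.error",        # keep the original input
                    "Next": error_state,
                }
            ]
    # Added after the loop so the handler does not catch into itself.
    states[error_state] = {
        "Type": "Task",
        "Resource": handler_resource,  # e.g. a Lambda that notifies the cloud team
        "Next": "Failed",
    }
    states["Failed"] = {
        "Type": "Fail",
        "Error": "ProvisioningFailed",
        "Cause": "See the notification sent to the cloud team",
    }
    return result
```

Ending in a `Fail` state means the execution is reported as failed after the notification goes out, which matches the idea that a person then investigates and corrects or reverts the changes by hand.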