
Using AWS SQS to Create a Distributed Task Scheduler and Execution Framework 


Engineering Insights is an ongoing blog series that gives a behind-the-scenes look into the technical challenges, lessons and advances that help our customers protect people and defend data every day. Each post is a firsthand account by one of our engineers about the process that led up to a Proofpoint innovation. 

A distributed task executor is a crucial component when you’re building scalable and resilient systems that can handle a wide range of tasks, from data processing and batch jobs to real-time event processing and microservices orchestration. 

At Proofpoint, we were looking for a simple distributed task executor for a subset of the tasks we execute in AWS. We wanted it to be easy to maintain and deploy, provide consistent ordering of tasks, handle errors well, and be able to retry failed tasks. Integration with serverless components was a plus, as it would give us many more options to scale up and parallelize task processing.

Among the several frameworks we evaluated, the most notable were Hazelcast, Apache Airflow and Facebook's Bistro. Apache Airflow and Bistro need additional infrastructure deployment, while Hazelcast needs persistence to be configured to store task state. We wanted to limit dependencies beyond what we already use in AWS. In the end, we decided that AWS SQS was the best option.

Using the features of AWS SQS queues, we were able to put together a distributed task scheduler and execution framework as shown below in Figure 1. 


Figure 1: AWS SQS FIFO queue as an orchestrator between task source and executors. 

Key design features of the above architecture 

The blueprint of the above architecture is shared as a Terraform module so that each team or service can deploy its own highly available distributed task scheduler. Here's what you should know about its key design features:
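
To make this concrete, here is a minimal sketch, using boto3, of the two queues the blueprint revolves around: the FIFO task queue and its dead-letter queue. The queue names, visibility timeout and retry count are illustrative, not the values from our module.

```python
# Minimal provisioning sketch (illustrative, not our actual Terraform module).
import json
import boto3

sqs = boto3.client("sqs")

# Create the dead-letter queue first so the main queue's redrive policy
# can reference its ARN. A FIFO source queue requires a FIFO DLQ.
dlq = sqs.create_queue(
    QueueName="task-scheduler-dlq.fifo",  # illustrative name
    Attributes={"FifoQueue": "true"},
)
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq["QueueUrl"], AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Main FIFO task queue: content-based deduplication, plus a redrive policy
# that moves a task to the DLQ after five failed receives.
sqs.create_queue(
    QueueName="task-scheduler.fifo",  # illustrative name
    Attributes={
        "FifoQueue": "true",
        "ContentBasedDeduplication": "true",
        "VisibilityTimeout": "300",  # seconds; tune to expected task duration
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        ),
    },
)
```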

Availability 

A key aspect of any distributed task executor is to eliminate single points of failure. Because AWS SQS is resilient and highly available, the task queue is as well.

Scalability 

  • SQS features out-of-the-box integration with AWS Lambda through Lambda event source mappings (see the handler sketch after this list).

  • Concurrency controls on the event source mapping let us tune how many tasks are processed in parallel.

  • By default, the supported message size is 256 KiB; it can be extended up to 2 GB with the Amazon SQS Extended Client Library.
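
Here is a sketch of what an SQS-triggered Lambda worker can look like. It assumes the event source mapping has ReportBatchItemFailures enabled, so that returning the IDs of failed records makes Lambda retry only those rather than the whole batch; run_task is a hypothetical stand-in for the actual task logic.

```python
import json

def handler(event, context):
    """SQS-triggered Lambda: execute each task, report per-record failures."""
    failures = []
    for record in event["Records"]:
        try:
            task = json.loads(record["body"])
            run_task(task)  # hypothetical: your task execution logic
        except Exception:
            # Only the failed records are retried (and eventually dead-lettered).
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```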

Fault tolerance  

  • The message visibility timeout ensures that tasks whose execution fails become visible again and are redistributed to available workers.

  • Heartbeats can be added to extend visibility timeouts for long-running tasks (see the sketch below).
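
A heartbeat can be as simple as a background thread that keeps calling ChangeMessageVisibility while the task runs. The sketch below is illustrative: the intervals and the run_task helper are assumptions, and the message dict is one item from a receive_message response.

```python
import threading
import boto3

sqs = boto3.client("sqs")

def heartbeat(queue_url, receipt_handle, stop, interval=60, extension=120):
    # Every `interval` seconds, push the visibility timeout out another
    # `extension` seconds so no other worker picks up the in-flight task.
    while not stop.wait(interval):
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=extension,
        )

def process(queue_url, message):
    stop = threading.Event()
    hb = threading.Thread(
        target=heartbeat, args=(queue_url, message["ReceiptHandle"], stop)
    )
    hb.start()
    try:
        run_task(message["Body"])  # hypothetical: the actual task logic
        # Delete only on success; on failure the message becomes visible
        # again after the timeout and is redistributed to another worker.
        sqs.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
    finally:
        stop.set()
        hb.join()
```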

Retries 

Once the configured number of retries is reached, SQS redirects the message to a built-in dead-letter queue (DLQ). This allows us to manually intervene, debug or re-process tasks. It also helps us mitigate and remove poison-pill tasks if they show up in the queue.
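
Re-processing from the DLQ can be done by hand or with a small script along these lines; the queue URLs are illustrative.

```python
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/task-scheduler-dlq.fifo"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/task-scheduler.fifo"

resp = sqs.receive_message(
    QueueUrl=DLQ_URL,
    MaxNumberOfMessages=10,
    AttributeNames=["MessageGroupId"],  # preserve the original ordering group
)
for msg in resp.get("Messages", []):
    # After debugging (and fixing whatever made the task fail), send it back
    # to the main queue in its original group for another attempt.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=msg["Body"],
        MessageGroupId=msg["Attributes"]["MessageGroupId"],
    )
    sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
```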

Message ordering and grouping 

  • SQS FIFO queues provide message grouping and strict ordering within each group.

  • Deduplication on FIFO queues helps prevent duplicate task execution (see the producer sketch after this list).
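
On the producer side, grouping and deduplication come down to two parameters on send_message. In this illustrative sketch, the queue URL, task shape and IDs are made up: tasks for the same tenant stay in order, and the deduplication ID suppresses accidental duplicate submissions within SQS's five-minute deduplication window.

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/task-scheduler.fifo"

task = {"type": "reindex", "tenant": "tenant-42", "doc_id": 7}  # illustrative
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps(task),
    MessageGroupId=task["tenant"],  # ordering is guaranteed per group
    MessageDeduplicationId=f"reindex-{task['doc_id']}",  # duplicate suppression
)
```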

 

Workflows 

By using AWS Step Functions, we can compose task executions into multi-step workflows.
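
For example, a worker or Lambda function can hand a task off to a state machine with a single start_execution call; the state machine ARN and task shape below are illustrative.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

def start_workflow(task):
    # The execution name doubles as an idempotency key: Step Functions
    # detects a repeated start with the same name.
    return sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:task-workflow",
        name=f"task-{task['id']}",
        input=json.dumps(task),
    )
```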

Deployment 

All the dependent components can be deployed with AWS CloudFormation or Terraform.

Testing 

Teams using the task consumer pipeline only need to test their task execution logic, because the architecture takes care of the infrastructure that delivers tasks to the workers.

Security 

Logging and monitoring 

Amazon CloudWatch provides the ability to capture, store and search through logs and metrics. Component observability decreases mean time to respond (MTTR).
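
For instance, queue depth can be watched through the AWS/SQS namespace in CloudWatch. This sketch reads ApproximateNumberOfMessagesVisible, a standard SQS metric, for the (illustrative) task queue to spot backlog build-up.

```python
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cw.get_metric_statistics(
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "task-scheduler.fifo"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,  # 5-minute buckets
    Statistics=["Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```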

Cost 

The cost of using SQS itself is minimal; as expected, most of the cost comes from executing the tasks.

About the author 


Sathish Krishna Raju is a senior staff engineer on the shared services team at Proofpoint. In his role as a technical lead, he has spearheaded numerous initiatives aimed at constructing data processing pipelines within the organization. Currently, he is at the forefront of an effort to enhance the identity and access management (IAM) system for Proofpoint.