AWS Serverless Architecture Patterns:
S3 Event Driven Data Processing
Amazon Simple Storage Service (S3) is an object storage service that lets you store and retrieve any amount of data from anywhere. Combined with Amazon S3 Event Notifications, it enables you to act on S3 events (such as object creation, removal, or replication) by publishing the event to a Lambda function, an SNS topic, an SQS queue, or Amazon EventBridge.
S3 Event Types & Destinations
Amazon S3 Event Notifications lets you receive notifications when certain events happen in your S3 bucket, for example when a new object is created, removed, restored, or replicated.
There are four S3 event destinations:
- AWS Lambda functions
- Amazon SQS queues
- Amazon SNS topics
- Amazon EventBridge
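As a sketch of how a bucket gets wired to one of these destinations, the helper below builds the notification configuration that routes ObjectCreated events matching a key filter to a Lambda function, for use with boto3's put_bucket_notification_configuration. The bucket name, function ARN, and filter values are hypothetical placeholders, not from the article.

```python
# Sketch: wiring S3 ObjectCreated events to a Lambda function via the
# bucket notification configuration. All names below are hypothetical.

def build_notification_config(function_arn: str, prefix: str, suffix: str) -> dict:
    """Build the NotificationConfiguration payload for
    s3.put_bucket_notification_configuration."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": prefix},
                            {"Name": "suffix", "Value": suffix},
                        ]
                    }
                },
            }
        ]
    }

def apply_notification_config(bucket: str, config: dict) -> None:
    """Apply the configuration; requires AWS credentials at call time."""
    import boto3  # imported lazily so the pure helper stays testable offline
    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=config,
    )
```

Note that S3 also requires a resource-based permission on the Lambda function allowing the bucket to invoke it; that step is omitted here.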
Amazon S3 invokes your Lambda function asynchronously with an event that contains details about the object. For asynchronous invocation, Lambda places the event in an internal queue and returns a success response without additional information.
Details worth knowing about Lambda asynchronous invocation:
- Lambda retains an event in the asynchronous event queue for up to 6 hours
- If the function returns an error, Lambda retries the invocation up to two more times, with a one-minute wait between the first and second attempts and two minutes between the second and third
- If the function doesn't have enough concurrency available to process all events, additional requests are throttled
- If the queue contains many entries, Lambda increases the retry interval and reduces the rate at which it reads events from the queue
- Even if your function doesn't return an error, it's possible for it to receive the same event from Lambda multiple times because the queue itself is eventually consistent
- If the function can't keep up with incoming events, events might also be deleted from the queue without being sent to the function
- When an event expires or fails all processing attempts, Lambda discards it
Considering the above, Lambda is a good fit as an S3 event destination when:
- your function code gracefully handles duplicate events
- your function has enough concurrency available to handle all invocations
- you know what to do with (or can accept) discarded events (e.g., by configuring a dead-letter queue for further processing)
- event ordering is not of critical importance
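Since duplicate delivery is possible, the handler itself should be idempotent. Below is a minimal sketch of deduplicating on the event record's sequencer field; an in-memory set stands in for the durable store (e.g., a DynamoDB table with conditional writes) that a real service would need.

```python
# Sketch: making a Lambda handler idempotent against duplicate S3 events.
# A module-level set keeps the sketch self-contained; in production the
# "seen" store would be durable, e.g. a DynamoDB conditional put.

_seen_events: set[str] = set()

def dedup_key(record: dict) -> str:
    """Build a deduplication key from one S3 event record.

    The sequencer field orders events for a given object key, so
    (bucket, key, sequencer) identifies one logical event."""
    s3 = record["s3"]
    return "/".join([
        s3["bucket"]["name"],
        s3["object"]["key"],
        s3["object"]["sequencer"],
    ])

def handle_record(record: dict) -> bool:
    """Process one record; return False if it was a duplicate we skipped."""
    key = dedup_key(record)
    if key in _seen_events:
        return False
    _seen_events.add(key)
    # ... actual processing of the object would happen here ...
    return True
```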
If your use case must cover scenarios where many data files arrive simultaneously and it is business critical that every event is processed (none discarded), then to make your service more fault tolerant you should consider SQS as the event destination (see the asynchronous processing section below).
Use cases (for Lambda as event destination):
- Image processing (e.g., AI/ML object analysis of an image). An image is pushed to an S3 bucket, and S3 asynchronously invokes a Lambda function. The function sends a request with the image details to the Rekognition service (e.g., the DetectLabels API) to analyze the objects in the image. When Rekognition responds with a list of labels, the Lambda function tags the S3 image object based on that list
- Data file (e.g., CSV, XML, JSON) ETL processing. A data file is pushed to an S3 bucket, and S3 asynchronously invokes a Lambda function. The function initiates an AWS Glue job based on the received S3 event. The Glue job processes the source data, stores Parquet-format files in S3 (e.g., for later Athena queries), and loads the processed data into Amazon Redshift
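The first use case above could be sketched roughly as follows. The DetectLabels parameters, tag naming scheme, and confidence threshold are assumptions for illustration, not the article's exact implementation.

```python
# Sketch of the image-tagging use case: S3 invokes this handler
# asynchronously, the handler asks Rekognition for labels, then writes
# them back as object tags. Thresholds and tag naming are assumptions.

import urllib.parse

def labels_to_tag_set(labels: list[dict], max_tags: int = 10) -> list[dict]:
    """Convert Rekognition DetectLabels output into an S3 TagSet.
    S3 allows at most 10 tags per object."""
    return [
        {"Key": f"label-{i}", "Value": label["Name"]}
        for i, label in enumerate(labels[:max_tags])
    ]

def handler(event: dict, context=None) -> None:
    import boto3  # lazy import: AWS credentials only needed at invoke time
    rekognition = boto3.client("rekognition")
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        response = rekognition.detect_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MaxLabels=10,
            MinConfidence=80.0,
        )
        s3.put_object_tagging(
            Bucket=bucket,
            Key=key,
            Tagging={"TagSet": labels_to_tag_set(response["Labels"])},
        )
```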
Asynchronous and Queued Point-to-Point Processing
By using Amazon SQS as the event destination, you can process S3 events asynchronously. Lambda polls the SQS queue at its own pace (starting with five concurrent batches) and invokes your function synchronously with each batch. This lets you control the processing flow and work through files at a steady rate without risk of being overloaded.
You should consider SQS as event destination when:
- It's business critical that all published events are successfully processed. SQS is a better alternative as it provides guaranteed delivery and event re-driving capabilities
- The service must handle incomplete or partial uploads when a connection is temporarily lost
- Your service should successfully handle and process any traffic spikes or thousands of simultaneous events
- There is a need for more custom event processing configuration, such as delay queues, visibility timeouts, or a longer message retention period (for Lambda asynchronous invocation the maximum is 6 hours, but for SQS it is 14 days)
- Event ordering is not of critical importance (SQS FIFO queues are still not supported as an S3 event destination)
Let's look at the same use cases from the previous section (with Lambda as the destination), but now with SQS as the event destination. The image and data ETL processing services become more fault tolerant and reliable, and can support even thousands of simultaneous events:
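When Lambda consumes the queue, each SQS message body carries the JSON-encoded S3 notification. A minimal handler sketch follows; it assumes ReportBatchItemFailures is enabled on the event source mapping (so failed messages are retried without reprocessing the whole batch), and process_object is a hypothetical placeholder for the real work.

```python
# Sketch: Lambda triggered by the SQS queue that S3 publishes to.
# The SQS message body is the JSON-encoded S3 notification.

import json

def extract_objects(sqs_body: str) -> list[tuple[str, str]]:
    """Pull (bucket, key) pairs out of one SQS message body."""
    notification = json.loads(sqs_body)
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in notification.get("Records", [])
    ]

def process_object(bucket: str, key: str) -> None:
    """Hypothetical per-object processing (e.g., start a Glue job)."""
    ...

def handler(event: dict, context=None) -> dict:
    failures = []
    for message in event["Records"]:
        try:
            for bucket, key in extract_objects(message["body"]):
                process_object(bucket, key)
        except Exception:
            # Report only the failed message so the rest of the batch
            # is not retried (requires ReportBatchItemFailures).
            failures.append({"itemIdentifier": message["messageId"]})
    return {"batchItemFailures": failures}
```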
Parallel Processing With “Fan-Out” Architecture
Both of the previously described S3 event destinations, Lambda and SQS, cover most use cases where point-to-point processing is needed.
But when there is a need for a "fan-out" style architecture, where a single event is sent to many destinations in parallel, SNS can be a good choice as the event destination.
Use cases (for SNS as event destination):
- Invoice PDF file processing. An invoice PDF is pushed to an S3 bucket, and S3 sends an event to an SNS topic. There are three subscribers to this topic:
- An email is sent to a dedicated accountants channel
- An SQS queue receives the message; Lambda polls it and processes the invoice PDF by calling Amazon Textract to extract the text and store the resulting JSON data in an S3 bucket
- Another SQS-and-Lambda integration stores invoice-related metadata (e.g., file name, creation date) in a DynamoDB table used by the invoice management application's front end
- Processing an image in different resolutions and analyzing its objects with Rekognition in parallel. An image is pushed to an S3 bucket, and S3 sends an event to an SNS topic. There are three subscribers to this topic:
- Two SQS-and-Lambda integrations process the same image in different resolutions in parallel
- A third SQS queue receives the message; Lambda polls it and sends a request with the image details to Rekognition to analyze the objects in the image
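One practical detail when wiring this fan-out pattern: unless raw message delivery is enabled on the SNS subscription, the Lambda behind each queue receives the S3 event double-wrapped, with the SQS message body holding an SNS envelope whose Message field contains the S3 notification JSON. A small unwrapping sketch (field names follow the standard SNS envelope):

```python
# Sketch: with S3 -> SNS -> SQS -> Lambda, the S3 event arrives
# double-wrapped unless raw message delivery is enabled on the
# SNS subscription. This helper handles both cases.

import json

def unwrap_s3_records(sqs_body: str) -> list[dict]:
    """Return the S3 event records from one SQS message body."""
    envelope = json.loads(sqs_body)
    if "Message" in envelope:
        # Standard SNS envelope: the S3 event is JSON inside "Message"
        inner = json.loads(envelope["Message"])
    else:
        # Raw message delivery: the body already is the S3 event
        inner = envelope
    return inner.get("Records", [])
```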
Route Events Dynamically With EventBridge
The latest addition to the list of S3 event destinations is EventBridge, a service that routes events to different targets based on conditions in the event itself.
Previously, to react to S3 events, EventBridge had to extract S3 API calls from CloudTrail logs (which added noticeable latency to the whole process), but now S3 Event Notifications can be configured to deliver directly to EventBridge. This means patterns are matched more quickly and directly.
A couple of reasons why you might consider EventBridge instead of SNS as the S3 event destination:
- EventBridge supports many additional targets (compared to SNS), such as Kinesis, ECS, Step Functions, Redshift, and more
- EventBridge is not built to handle only S3 events; it can capture events from many other AWS services, web applications, and third-party partners
- In addition to the object-level events provided by S3 Event Notifications, it supports bucket-level events such as CreateBucket, DeleteBucket, security-related events, etc.
- EventBridge rules support content filtering, which allows more complex pattern matching than S3 Notifications support
- EventBridge allows you to transform the event before passing it as input to the targets
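As a sketch of that content filtering, here is roughly what a rule pattern for direct S3 "Object Created" events could look like, built and registered with boto3. The rule name, bucket, and key prefix are hypothetical values for illustration.

```python
# Sketch: an EventBridge rule pattern matching "Object Created" events
# delivered directly from S3, filtered to one bucket and key prefix.
# Bucket and prefix values are hypothetical.

import json

def build_event_pattern(bucket: str, prefix: str) -> str:
    pattern = {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            # Content filtering: a prefix operator on the object key
            "object": {"key": [{"prefix": prefix}]},
        },
    }
    return json.dumps(pattern)

def create_rule(rule_name: str, bucket: str, prefix: str) -> None:
    """Register the rule; requires AWS credentials at call time."""
    import boto3  # lazy import so the pure helper stays testable offline
    boto3.client("events").put_rule(
        Name=rule_name,
        EventPattern=build_event_pattern(bucket, prefix),
    )
```

Targets (e.g., a Step Functions state machine) would then be attached to the rule with put_targets.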
Here is a simple use case using an EventBridge target: a Step Functions workflow that supports an ETL data process. A data file is pushed to an S3 bucket, and S3 sends an event to EventBridge. An EventBridge rule is configured to route the event (s3:ObjectCreated) to a Step Functions workflow that orchestrates the ETL pipeline.
All the diagrams have been created with our Cloudviz app and will be available there as diagram templates to kick-start your next awesome AWS Serverless Architecture. Happy Diagramming!