AWS Serverless Architecture Patterns: S3 Event Driven Data Processing

S3 event driven data processing diagrams

by Valts Ausmanis · April 11, 2022

Amazon Simple Storage Service (S3) is an object storage service that allows you to store and retrieve any amount of data from anywhere. Combined with Amazon S3 Event Notifications, it lets you act on different S3 events (such as object creation, removal or replication) by publishing the event to a Lambda function, SNS topic, SQS queue or Amazon EventBridge.

S3 Event Types & Destinations

Amazon S3 Event Notifications lets you receive notifications when certain events happen in your S3 bucket, for example when an object is created, removed or replicated. The full list of supported event types is available in the Amazon S3 documentation.

There are four S3 event destinations:

AWS Lambda functions
Amazon SQS queues
Amazon SNS topics
Amazon EventBridge
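To make the four destinations more concrete, here is a minimal sketch of a bucket notification configuration in the shape expected by the boto3 `put_bucket_notification_configuration` call. The ARNs, the `Id` values and the key prefixes are placeholders assumed for illustration; they do not come from a real setup.

```python
def build_notification_config(lambda_arn, queue_arn, topic_arn):
    """Build a NotificationConfiguration dict that uses all four destinations.

    Each destination gets its own key prefix (an assumption for this sketch),
    so the three object-created configurations do not overlap.
    """
    def prefix_filter(prefix):
        return {"Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}}

    return {
        "LambdaFunctionConfigurations": [{
            "Id": "images-to-lambda",          # hypothetical Id
            "LambdaFunctionArn": lambda_arn,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": prefix_filter("images/"),
        }],
        "QueueConfigurations": [{
            "Id": "data-to-sqs",               # hypothetical Id
            "QueueArn": queue_arn,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": prefix_filter("data/"),
        }],
        "TopicConfigurations": [{
            "Id": "invoices-to-sns",           # hypothetical Id
            "TopicArn": topic_arn,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": prefix_filter("invoices/"),
        }],
        # An empty dict turns on delivery of bucket events to EventBridge.
        "EventBridgeConfiguration": {},
    }


config = build_notification_config(
    "arn:aws:lambda:eu-west-1:123456789012:function:process-image",
    "arn:aws:sqs:eu-west-1:123456789012:data-files",
    "arn:aws:sns:eu-west-1:123456789012:invoices",
)
# Applying it would look roughly like this (requires boto3 and credentials):
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="my-bucket", NotificationConfiguration=config)
```

Keep in mind that this call replaces the whole notification configuration of the bucket, and that the Lambda function, queue and topic each need a resource policy allowing S3 to invoke or publish to them.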

Point-to-point Processing

Amazon S3 invokes your Lambda function asynchronously with an event that contains details about the object. For asynchronous invocation, Lambda places the event in an internal queue and returns a success response to the caller without additional information.

Good to know details about Lambda asynchronous invocation:

Lambda retains an event in the asynchronous event queue for up to 6 hours (the maximum event age)
Lambda retries a failed invocation up to 2 times
If the function returns an error, Lambda attempts to run it two more times, with a one-minute wait between the first two attempts, and two minutes between the second and third attempts
If the function doesn't have enough concurrency available to process all events, additional requests are throttled
If the queue contains many entries, Lambda increases the retry interval and reduces the rate at which it reads events from the queue
Even if your function doesn't return an error, it's possible for it to receive the same event from Lambda multiple times because the queue itself is eventually consistent
If the function can't keep up with incoming events, events might also be deleted from the queue without being sent to the function
When an event expires or fails all processing attempts, Lambda discards it
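The retry count and the maximum event age listed above are configurable per function. The sketch below builds the arguments for the boto3 `put_function_event_invoke_config` call and validates them against the documented limits (0 to 2 retries, 60 to 21600 seconds). The function name and the queue ARN are assumptions made for the example.

```python
def build_async_invoke_config(function_name, max_retries=2,
                              max_event_age_seconds=21600,
                              on_failure_arn=None):
    """Return keyword arguments for put_function_event_invoke_config."""
    if not 0 <= max_retries <= 2:
        raise ValueError("max_retries must be between 0 and 2")
    if not 60 <= max_event_age_seconds <= 21600:
        raise ValueError("max_event_age_seconds must be between 60 and 21600")

    kwargs = {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_event_age_seconds,
    }
    if on_failure_arn:
        # Events that expire or fail all attempts go here instead of being dropped.
        kwargs["DestinationConfig"] = {"OnFailure": {"Destination": on_failure_arn}}
    return kwargs


invoke_config = build_async_invoke_config(
    "process-image",                      # hypothetical function name
    max_retries=1,
    max_event_age_seconds=3600,
    on_failure_arn="arn:aws:sqs:eu-west-1:123456789012:failed-events",
)
# boto3.client("lambda").put_function_event_invoke_config(**invoke_config)
```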

Based on your use case and the number of events you receive, if you choose Lambda as the event destination you should ensure that:

your function code gracefully handles duplicate events
your function has enough concurrency available to handle all invocations
you know what to do with (or accept) discarded events (e.g. by configuring a dead-letter queue for further processing)
event ordering is not of critical importance
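One way to handle duplicate events gracefully is to make the handler idempotent. The sketch below derives a deduplication key from the bucket name, the object key and the `sequencer` value of each S3 event record, and skips records it has already seen. The in-memory set is only a stand-in assumed for illustration: it lives in a single execution environment, so a real function would more likely use a shared store (for example a conditional write to a DynamoDB table). `process_object` is a hypothetical placeholder for the actual work.

```python
import urllib.parse

_seen = set()  # stand-in for a shared deduplication store (assumption)


def process_object(bucket, key):
    """Hypothetical placeholder for the real processing logic."""
    print(f"processing s3://{bucket}/{key}")


def handler(event, context=None):
    """Process each S3 record once and return the objects that were handled."""
    processed = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(s3["object"]["key"])
        dedup_key = (bucket, key, s3["object"].get("sequencer", ""))
        if dedup_key in _seen:
            continue  # duplicate delivery of the same event
        _seen.add(dedup_key)
        process_object(bucket, key)
        processed.append((bucket, key))
    return processed
```

Invoking `handler` twice with the same event processes the object only the first time; the second call returns an empty list.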

If your use case must cover many data files arriving simultaneously and it's business critical that every event is processed (none discarded), consider SQS as the event destination to make your service more fault tolerant (see Asynchronous and Queued Point-to-point processing).

Use cases (for Lambda as event destination):

Image processing (e.g. AI/ML object analysis of a specific image). An image is pushed to the S3 bucket and S3 asynchronously invokes the Lambda function. Lambda sends a request with the image details to the Rekognition service (e.g. the DetectLabels API) to analyze objects/labels in that image. When Rekognition responds with a list of labels, the Lambda function tags the S3 image file based on this list.
Data file (e.g. CSV, XML, JSON) ETL processing. A data file is pushed to the S3 bucket and S3 asynchronously invokes the Lambda function. The Lambda function starts an AWS Glue job based on the received S3 event. The Glue job processes the source data, stores it as Parquet files in S3 (e.g. for later Athena queries) and loads the processed files into Amazon Redshift.
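The tagging step of the image processing use case can be sketched as a small pure function that turns a Rekognition DetectLabels response into the `Tagging` argument of the S3 `put_object_tagging` call. The confidence threshold and the choice to store the rounded confidence as the tag value are assumptions of this example; the cap of 10 reflects the S3 limit on tags per object.

```python
def labels_to_tag_set(detect_labels_response, min_confidence=80.0, max_tags=10):
    """Convert Rekognition labels into an S3 tag set, most confident first."""
    labels = [
        label for label in detect_labels_response.get("Labels", [])
        if label["Confidence"] >= min_confidence
    ]
    labels.sort(key=lambda label: label["Confidence"], reverse=True)
    return {
        "TagSet": [
            # Tag keys are limited to 128 characters.
            {"Key": label["Name"][:128], "Value": f'{label["Confidence"]:.0f}'}
            for label in labels[:max_tags]
        ]
    }


sample_response = {"Labels": [
    {"Name": "Grass", "Confidence": 55.5},
    {"Name": "Pet", "Confidence": 91.0},
    {"Name": "Dog", "Confidence": 98.2},
]}
tagging = labels_to_tag_set(sample_response)
# boto3.client("s3").put_object_tagging(Bucket=bucket, Key=key, Tagging=tagging)
```

Note that `put_object_tagging` replaces the existing tag set of the object, so any tags that must survive need to be merged in first.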

Asynchronous and Queued Point-to-point processing

By using Amazon SQS as the event destination, you can process S3 events asynchronously. Lambda polls the SQS queue at its own pace (starting with five concurrent batches handled by five function instances at a time) and invokes your Lambda function synchronously. This lets you control the processing flow, so the function works through the queued files without the risk of being overloaded.

You should consider SQS as event destination when:

It's business critical that all published events are successfully processed. SQS is a better alternative as it provides at-least-once delivery and dead-letter queue redrive capabilities
The service must handle incomplete or partial uploads when a connection is temporarily lost
Your service should successfully handle and process any traffic spikes or thousands of simultaneous events
You need more custom event processing configuration, like delay queues, visibility timeouts or a longer message retention period (for Lambda asynchronous invocation the maximum is 6 hours, but for SQS it is 14 days)
Event ordering is not of critical importance (because SQS FIFO queues are still not supported as an S3 event destination)
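With SQS in front of Lambda, each invocation receives a batch of messages, and the body of each message contains the S3 event as a JSON string. The sketch below processes such a batch and returns the partial batch response format, so that only the failed messages return to the queue. It assumes that `ReportBatchItemFailures` is enabled on the event source mapping; `process_object` is a hypothetical placeholder.

```python
import json
import urllib.parse


def process_object(bucket, key):
    """Hypothetical placeholder for the real processing logic."""
    print(f"processing s3://{bucket}/{key}")


def handle_batch(event, process):
    """Run `process` for every S3 record and report failed SQS messages."""
    failures = []
    for message in event.get("Records", []):
        try:
            body = json.loads(message["body"])
            # The s3:TestEvent message sent when the notification is
            # configured has no "Records" key, so it is skipped here.
            for record in body.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
                process(bucket, key)
        except Exception:
            # Only this message becomes visible in the queue again.
            failures.append({"itemIdentifier": message["messageId"]})
    return {"batchItemFailures": failures}


def handler(event, context=None):
    return handle_batch(event, process_object)
```

Messages that keep failing are retried until the `maxReceiveCount` of the queue's redrive policy is reached, after which SQS moves them to the dead-letter queue (if one is configured).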

Let's look at the same use cases from the previous section (with Lambda as the destination), but this time with SQS added as the event destination. The image and data ETL processing services are now more fault tolerant and reliable and can support even thousands of simultaneous events.

Parallel Processing With “Fan-Out” Architecture

The previously described S3 event destinations, Lambda and SQS, cover most use cases where point-to-point processing is needed. But when you need a "fan-out" style architecture, where a single event is sent to many destinations in parallel, SNS as the event destination could be a good choice.

Use cases (for SNS as event destination):

Invoice PDF file processing. An invoice PDF file is pushed to the S3 bucket and S3 sends the event to an SNS topic. There are three subscribers to this topic:
- An email is sent to a specific accountants' channel
- SQS receives the message, Lambda polls it and processes the invoice PDF file by calling Amazon Textract to extract the text data and store it as JSON in an S3 bucket
- An SQS and Lambda integration stores invoice PDF metadata (e.g. file name, creation date) in a DynamoDB table that is used by the invoice management application front-end
Image processing for different resolutions and analyzing image objects with Rekognition in parallel. An image is pushed to the S3 bucket and S3 sends the event to an SNS topic. There are three subscribers to this topic:
- Two SQS and Lambda integrations process the same image in different resolutions in parallel
- SQS receives the message, Lambda polls it and sends a request with the image details to the Rekognition service to analyze objects in the image

S3 SNS parallel image processing diagram
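In the fan-out setups above, a message that travels from S3 through SNS to SQS is wrapped twice: the SQS body is an SNS notification envelope, and the `Message` field of that envelope holds the S3 event as a JSON string. A minimal sketch of unwrapping it follows; it also accepts bodies without the envelope, which is what a subscription with raw message delivery enabled would produce.

```python
import json


def extract_s3_records(sqs_body):
    """Return the S3 event records carried by an SQS message body."""
    payload = json.loads(sqs_body)
    # Unwrap the SNS envelope unless raw message delivery is enabled.
    if payload.get("Type") == "Notification" and "Message" in payload:
        payload = json.loads(payload["Message"])
    return payload.get("Records", [])


s3_event = {"Records": [{
    "eventName": "ObjectCreated:Put",
    "s3": {"bucket": {"name": "invoices"}, "object": {"key": "2022/inv-1.pdf"}},
}]}
sns_envelope = {"Type": "Notification", "Message": json.dumps(s3_event)}

records = extract_s3_records(json.dumps(sns_envelope))
```

Each subscriber in the fan-out can reuse the same helper and then apply its own processing to the extracted records.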

Route Events Dynamically With EventBridge

The latest addition to the S3 event destinations is EventBridge, a service that lets you route events to different targets based on event conditions. Previously, to react to S3 events, EventBridge relied on S3 API calls extracted from CloudTrail logs (which added latency to the whole process), but now it's possible to configure S3 Event Notifications to deliver directly to EventBridge. This means patterns are matched more quickly and directly.

A couple of reasons why you could consider EventBridge instead of SNS as the S3 event destination:

EventBridge supports many additional targets (compared to SNS) like Kinesis, ECS, Step Functions, Redshift and more
EventBridge is not built to handle only S3 events; it can also capture events from many other AWS services, web applications and third-party partners
In addition to the object events provided by S3 Event Notifications, it supports bucket-specific events such as CreateBucket and DeleteBucket, security-related events, etc.
EventBridge rules support content filtering which allows more complex pattern matching than S3 Notifications support
EventBridge allows you to transform the event before passing it as input to the targets

Here is a simple use case with a Step Functions workflow as the EventBridge target to support the ETL data process. A data file is pushed to the S3 bucket and S3 sends the event to EventBridge. EventBridge rules are configured to route the event (S3:ObjectCreated) to the Step Functions workflow, which orchestrates the ETL pipeline.

S3 EventBridge Step Functions data ETL process diagram
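The rule in this use case could match on an event pattern like the one sketched below. S3 events delivered directly to EventBridge use the source `aws.s3` and the detail type `Object Created`; the bucket name and the key prefix are assumptions made for the example.

```python
import json


def build_object_created_pattern(bucket, key_prefix):
    """Build an EventBridge event pattern for new objects under a key prefix."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            # Content filtering: match only keys that start with the prefix.
            "object": {"key": [{"prefix": key_prefix}]},
        },
    }


pattern = build_object_created_pattern("etl-source-data", "incoming/")
pattern_json = json.dumps(pattern)
# boto3.client("events").put_rule(Name="etl-on-upload", EventPattern=pattern_json)
```

The Step Functions state machine would then be attached to the rule as a target, together with an IAM role that allows EventBridge to start executions.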

All the diagrams have been created with our Cloudviz app and will be available there as diagram templates to kick-start your next awesome AWS Serverless Architecture. Happy Diagramming!
