
Amazon S3

Amazon S3 is a flexible object storage service offered by Amazon Web Services. It can be used as an upstream or downstream resource in your Turbine streaming apps, for example by calling the write function to send records to a chosen S3 bucket.

Setup

Resource Configuration

Use the meroxa resource create command to configure your Amazon S3 resource.

The following example shows how to use this command to create an Amazon S3 resource named datalake with the minimum required configuration.

$ meroxa resource create datalake \
--type s3 \
--url "s3://$AWS_ACCESS_KEY:$AWS_ACCESS_SECRET@$AWS_REGION/$AWS_S3_BUCKET"

In the command above, replace the following variables with valid credentials from your Amazon S3 environment:

  • $AWS_ACCESS_KEY - AWS Access Key
  • $AWS_ACCESS_SECRET - AWS Access Secret
  • $AWS_REGION - AWS Region (e.g., us-east-2)
  • $AWS_S3_BUCKET - AWS S3 Bucket Name

Configuration Options

To learn more about what you can do with this resource, see its connector configuration options.

Using it with Turbine is as simple as the following (TypeScript example):

let destination = await turbine.resources("s3");

await destination.write(anonymized, `my_directory_in_s3`, {
  "file.name.template": "{{topic}}-{{partition}}-{{start_offset}}-{{timestamp:unit=yyyy}}{{timestamp:unit=MM}}{{timestamp:unit=dd}}{{timestamp:unit=HH}}.gz"
});

In the code snippet above, we only set file.name.template, but you can specify any of the other options as well.
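To build intuition for what file.name.template produces, here is a rough sketch of how its placeholders might expand for a single record batch. This is not the connector's actual implementation; the expandTemplate helper and the sample values are purely illustrative.

```typescript
// Illustrative sketch only: expands the file.name.template placeholders
// for one record batch. NOT the connector's real code.
type TemplateContext = {
  topic: string;
  partition: number;
  startOffset: string; // already zero-padded, as in the S3 key examples
  timestamp: Date;
};

function expandTemplate(template: string, ctx: TemplateContext): string {
  const pad = (n: number, w: number) => String(n).padStart(w, "0");
  // Supported timestamp units in this sketch: yyyy, MM, dd, HH
  const units: Record<string, string> = {
    yyyy: String(ctx.timestamp.getUTCFullYear()),
    MM: pad(ctx.timestamp.getUTCMonth() + 1, 2),
    dd: pad(ctx.timestamp.getUTCDate(), 2),
    HH: pad(ctx.timestamp.getUTCHours(), 2),
  };
  return template
    .replace(/\{\{topic\}\}/g, ctx.topic)
    .replace(/\{\{partition\}\}/g, String(ctx.partition))
    .replace(/\{\{start_offset\}\}/g, ctx.startOffset)
    .replace(/\{\{timestamp:unit=(\w+)\}\}/g, (_, u) => units[u] ?? "");
}

const name = expandTemplate(
  "{{topic}}-{{partition}}-{{start_offset}}-{{timestamp:unit=yyyy}}{{timestamp:unit=MM}}{{timestamp:unit=dd}}{{timestamp:unit=HH}}.gz",
  {
    topic: "orders",
    partition: 0,
    startOffset: "0000000000",
    timestamp: new Date(Date.UTC(2023, 0, 15, 9)),
  }
);
console.log(name); // "orders-0-0000000000-2023011509.gz"
```

With the template from the snippet above, each file name carries the topic, partition, starting offset, and an hourly timestamp bucket, which keeps objects sortable by stream and time.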

Permissions

The following AWS access policy must be attached to the IAM user associated with the AWS_ACCESS_KEY provided in the connection URL:

{
    "Statement": [
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts",
                "s3:ListBucketMultipartUploads",
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<bucket-name>/*",
                "arn:aws:s3:::<bucket-name>"
            ]
        }
    ],
    "Version": "2012-10-17"
}
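As a quick sanity check before creating the resource, you could verify that a policy document grants all of the actions listed above. The missingActions helper below is a hypothetical sketch for illustration, not part of the Meroxa CLI or the AWS SDK.

```typescript
// Illustrative check: report which required S3 actions a policy document
// is missing. Assumed helper, not part of the Meroxa CLI or AWS SDK.
const REQUIRED_ACTIONS = [
  "s3:GetObject", "s3:PutObject", "s3:DeleteObject",
  "s3:AbortMultipartUpload", "s3:ListMultipartUploadParts",
  "s3:ListBucketMultipartUploads", "s3:ListBucket",
];

type Statement = { Action: string[]; Effect: string; Resource: string[] };
type Policy = { Statement: Statement[]; Version: string };

function missingActions(policy: Policy): string[] {
  const granted = new Set(
    policy.Statement
      .filter((s) => s.Effect === "Allow")
      .flatMap((s) => s.Action)
  );
  return REQUIRED_ACTIONS.filter((a) => !granted.has(a));
}

// Example: a policy that only allows reads and writes is missing
// the delete, multipart, and listing permissions.
const policy: Policy = {
  Statement: [{
    Action: ["s3:GetObject", "s3:PutObject"],
    Effect: "Allow",
    Resource: ["arn:aws:s3:::my-bucket/*", "arn:aws:s3:::my-bucket"],
  }],
  Version: "2012-10-17",
};

console.log(missingActions(policy));
```

Note that this sketch only inspects the Action lists; a real check would also confirm the Resource ARNs cover both the bucket and its objects.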

Data Record

Data records are written to a folder within the root of the S3 bucket as gzipped JSON, one record per file, using the following naming format:

<stream-name>-<partition-number>-<starting-offset>

In the following example, the record is from the resource-5-499379.public.orders stream with starting offset 0000000000 and partition 0.

aws s3 ls s3://data-lake-bucket/resource-7-133274/resource-5-499379.public.orders-0-0000000000.gz
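If you need to locate a specific record programmatically, the object key can be derived from the stream name, partition, and starting offset. The objectKey helper below is an illustrative sketch; the 10-digit zero padding of the offset is inferred from the example above rather than documented behavior.

```typescript
// Sketch of the naming scheme <stream-name>-<partition-number>-<starting-offset>.gz.
// Assumption: the starting offset is zero-padded to 10 digits, as seen in
// the example key above. Not an official Meroxa helper.
function objectKey(stream: string, partition: number, startOffset: number): string {
  const padded = String(startOffset).padStart(10, "0");
  return `${stream}-${partition}-${padded}.gz`;
}

const key = objectKey("resource-5-499379.public.orders", 0, 0);
console.log(key); // "resource-5-499379.public.orders-0-0000000000.gz"
```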