Skip to main content

How to Ingest and Analyze Data in Amazon Athena

· 2 min read
@anaptfox

Amazon Athena allows you easily query data stored in Amazon S3 with SQL. With Meroxa, you can build a pipeline to pull data from any source and send it to Amazon S3 in real-time. Altogether, this pipeline allows you to query and analyze real-time data across multiple sources using Athena.

Data Pipeline Diagram

To accomplish ingestion and analysis, you will:

  1. Send data to Amazon S3.
  2. Create a table in Amazon Athena.
  3. Query data.

Send Data to Amazon S3

To build a pipeline to send data to Amazon S3, you can use the Meroxa Dashboard or the Meroxa CLI.

Selected Destination PostgreSQLSelected Destination Amazon S3

Add PostgreSQL Resource:

 meroxa resource create my-postgres --type postgres -u postgres://$PG_USER:$PG_PASS@$PG_URL:$PG_PORT/$PG_DB

Add Amazon S3 Resource:

 meroxa resource create my-s3 --type s3 -u s3://$AWS_ACCESS_KEY:$AWS_ACCESS_SECRET@$AWS_REGION/$AWS_S3_BUCKET

Create Pipeline:

meroxa pipeline create to-datalake-for-athena

Add PostgreSQL Source Connector:

meroxa connector create from-my-postgres --from postgres --input $INPUT --pipeline to-datalake-for-athena

Add Amazon S3 Destination Connector:

meroxa connector create to-my-s3  --to s3 --input $STREAM_NAME --pipeline to-datalake-for-athena

For more information, see:

Verifying

Once your pipeline is up and running, you will be able to see data within an S3 bucket:

Meroxa Data in Amazon S3

You may change the behavior of the connector with configuration. For example, you can change how the data record is compressed when stored in the bucket.

Create Table in Amazon Athena

Next, you can create a database in Athena.

CREATE DATABASE userdemo

The command above creates a database called userdemo.

Then, you can create a table within this database.

Meroxa Data in Amazon S3

CREATE EXTERNAL TABLE IF NOT EXISTS userdemo.`user` (
`schema` string,
`payload` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://your/location/to/folder';

The SQL query above creates a table called user that points to a location in S3 (s3://your/location/to/folder). The table created will have two columns schema and payload. The data records emitted by the connector are in the following format:

{ "schema": <schema>, payload: <event data> }

If desired, you can create mappings from a nested JSON data structure.

Query Data

Now that the table is created, you can use SQL to perform a query:

SELECT * FROM "userdemo"."user" limit 10

Amazon Athena Results

In Athena, you can also parse and extract from JSON data. This is helpful to serialize JSON into columns for easy SQL analysis. For more information, see Amazon's Guide on Extracting data from JSON.

You have now completed this guide and now have a running pipeline to analyze information in S3.

Happy Building 🚀