Amazon Athena
This article introduces Amazon Athena, along with its use cases and best practices.
Amazon Athena is a serverless, interactive query service that makes it easy to analyze data stored in Amazon S3 using ANSI-standard SQL.
Amazon Athena uses Presto, an open-source distributed SQL query engine. It can handle complex analysis, including large joins, window functions, and arrays. Users can query Amazon S3 data directly with SQL, which makes Athena a convenient way to run quick, interactive, ad hoc queries on data in S3 without setting up or managing any servers.
Amazon Athena can analyze unstructured, semi-structured, and structured data stored in Amazon S3. Examples include CSV, JSON, and columnar data formats such as Apache Parquet and Apache ORC.
Amazon Athena can be accessed through the AWS Management Console, the API, or an ODBC or JDBC driver. You control access to your data by using AWS IAM policies, ACLs, and Amazon S3 bucket policies.
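As a rough illustration of the API route, the sketch below submits a query and polls until it reaches a terminal state. It assumes a boto3 Athena client (created with boto3.client("athena")) is passed in; the helper name run_query and its parameters are hypothetical and not part of any AWS library.

```python
import time


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query to Athena and wait for it to finish.

    `client` is expected to be a boto3 Athena client, e.g. boto3.client("athena").
    Returns (query_execution_id, final_state, bytes_scanned).
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes the query results to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        response = client.get_query_execution(QueryExecutionId=query_id)
        execution = response["QueryExecution"]
        state = execution["Status"]["State"]
        # QUEUED and RUNNING are the non-terminal states; keep polling on those.
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            scanned = execution.get("Statistics", {}).get("DataScannedInBytes", 0)
            return query_id, state, scanned
        time.sleep(poll_seconds)
```

A call might look like run_query(boto3.client("athena"), "SELECT 1", "my_database", "s3://my-bucket/athena-results/"), where the database and bucket names are placeholders. The returned byte count is useful later when thinking about cost.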
To get started, you need to create a database and define table schemas before you can query the data stored in Amazon S3. Amazon Athena uses a managed data catalog (the AWS Glue Data Catalog) to store information and schemas about the databases and tables that you create for your data stored in Amazon S3.
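To make the schema step more concrete, here is a minimal sketch that builds a CREATE EXTERNAL TABLE statement as a string. The helper create_table_ddl, the database sales_db, the table orders, its columns, and the bucket path are all made up for illustration; you would adapt them to your own data and run the resulting statement in Athena (after a CREATE DATABASE IF NOT EXISTS sales_db).

```python
def create_table_ddl(database, table, columns, partitions, location, stored_as="PARQUET"):
    """Build a CREATE EXTERNAL TABLE statement for data that already sits in S3.

    `columns` and `partitions` are lists of (name, type) pairs.
    """
    column_sql = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    lines = [
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (",
        f"  {column_sql}",
        ")",
    ]
    if partitions:
        partition_sql = ", ".join(f"{name} {dtype}" for name, dtype in partitions)
        lines.append(f"PARTITIONED BY ({partition_sql})")
    lines.append(f"STORED AS {stored_as}")
    lines.append(f"LOCATION '{location}'")
    return "\n".join(lines)


# Hypothetical example table; none of these names come from a real dataset.
ddl = create_table_ddl(
    database="sales_db",
    table="orders",
    columns=[("order_id", "string"), ("amount", "double")],
    partitions=[("country", "string"), ("dt", "string")],
    location="s3://example-bucket/orders/",
)
print(ddl)
```

The table is "external" because Athena only records the schema and the S3 location in the catalog; the data files themselves stay where they are.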
Pricing is set at $5 per TB of data scanned, and you pay only for the queries you run. You should therefore always look for ways to reduce the amount of data scanned in order to lower query costs.
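The arithmetic behind that pricing can be sketched in a few lines. This is only an estimate under stated assumptions: it uses the $5 per TB figure quoted above, treats a terabyte as 10^12 bytes, and ignores any rounding or minimum charge that may apply per query. The function name estimate_query_cost is hypothetical.

```python
PRICE_PER_TB_USD = 5.0   # figure quoted in this article
BYTES_PER_TB = 10 ** 12  # assumption: decimal terabyte


def estimate_query_cost(bytes_scanned):
    """Rough cost in USD for a query that scans `bytes_scanned` bytes."""
    return bytes_scanned * PRICE_PER_TB_USD / BYTES_PER_TB


# Compare a query that scans 2 TB with one that scans only 50 GB.
full_scan_cost = estimate_query_cost(2 * 10 ** 12)
reduced_scan_cost = estimate_query_cost(50 * 10 ** 9)

print(f"2 TB scanned:  ${full_scan_cost:.2f}")
print(f"50 GB scanned: ${reduced_scan_cost:.2f}")
```

Since cost scales linearly with bytes scanned, any technique that cuts the scan by a given factor cuts the bill by the same factor.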
Query & Cost Optimization:
- Using a compressed, columnar format can lower both query and storage costs while improving performance, because a query reads only the columns it needs and the files take up less space.
- Partitioning divides a table into parts and keeps related data together based on column values such as date, country, or region. A query that filters on a partition column reads only the matching partitions, which reduces the amount of data scanned and in turn boosts query performance.
- You can use Amazon EMR or AWS Glue to convert your data into more efficient file formats and layouts.
- And finally, as usual, you should write optimized SQL queries, for example by selecting only the columns you need instead of using SELECT *.
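To see why partitioning reduces the scan, the small simulation below models a table laid out in S3 with Hive-style key=value prefixes and sums the bytes a query would touch with and without a partition filter. The partition sizes, the bucket path, and the helper names are invented for illustration; real numbers would come from your own data.

```python
from datetime import date

# Hypothetical partition inventory: (country, dt) -> bytes stored in that partition.
partitions = {
    ("US", date(2024, 1, 1)): 400_000_000,
    ("US", date(2024, 1, 2)): 420_000_000,
    ("DE", date(2024, 1, 1)): 150_000_000,
    ("DE", date(2024, 1, 2)): 160_000_000,
}


def partition_prefix(table_location, country, dt):
    """S3 prefix for one partition, using the Hive-style key=value layout."""
    return f"{table_location}country={country}/dt={dt.isoformat()}/"


def bytes_scanned(partitions, country=None, dt=None):
    """Sum the sizes of the partitions that match the filter (None means no filter)."""
    return sum(
        size
        for (p_country, p_dt), size in partitions.items()
        if (country is None or p_country == country)
        and (dt is None or p_dt == dt)
    )


full = bytes_scanned(partitions)
pruned = bytes_scanned(partitions, country="DE", dt=date(2024, 1, 2))

print(partition_prefix("s3://example-bucket/orders/", "DE", date(2024, 1, 2)))
print(f"no filter: {full} bytes, with partition filter: {pruned} bytes")
```

In this toy inventory, a query such as WHERE country = 'DE' AND dt = '2024-01-02' touches a single partition rather than all four, which is the effect the partitioning bullet describes.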
In the next article, we will do a hands-on walkthrough of how to query S3 data using Athena.