
Starting with Amazon Athena



Hi All! It's been a long time since I offered my readers something technical to go through. The wait is over, and it's really worth it. We will dive into the cloud today. I will explain in detail how you can use one of the popular AWS services, known as Athena.
This blog is written keeping in mind that you have little or no idea about any AWS service. That said, a bit of knowledge of the S3 service will help you understand the blog better.

Let's take a use case before proceeding. Suppose we have a CSV file that contains some data, say the details of the employees in a company. I have another CSV file that contains the details of the departments of that company. Now, I wish to do some operations that need data from the employee file and the department file to be joined, e.g. to get the maximum salary in every department. If we had the same data in two tables in a database, things would have been simple, right? We could have just joined both tables, grouped on department ids, and found the maximum salary. Since I have the data in files, the same work becomes more tedious. Or perhaps it only seemed tedious, because it is really very simple with Athena now.

Athena is a service used to run SQL queries against delimited text files in S3 without the need for an installed database. Being serverless, it is inherently scalable and highly available, and you pay only for the queries you run.

 



The above diagram shows how Athena works. It works closely with another AWS service known as S3. The files to be queried need to be stored in S3. The query results are not only displayed in the Athena GUI panel but are also stored in a defined S3 bucket path. All the metadata (information about the data) is stored in the Athena Data Catalog.

I have the data set ready with me. Below is the data from the dep.csv file, containing dep_id and dep_name:

01,DI
02,BIG DATA
03,BI


The emp.csv file contains name, salary, and dep_id:

Suraj,100000,01
Rajbikram,200000,01
Koushik,170000,02
Sudipta,100000,02
Pritam,400000,02
Sourabh,150000,03
Nabarun,190000,03


It's time to get our hands dirty now.

1. Create an S3 bucket and objects

a. Log in to AWS and select the S3 service. You can see I already have a bucket created with the name athena.sj. However, for ease of understanding, I will create a new one by clicking the Create Bucket button.


b. Enter the bucket name and don't tamper with the region for now; I will get back to explaining regions later. Click on the Create button to create your bucket.



c. You can see that my new bucket named athena.abc is created. Now let's open it and create two folders inside, employee and department: one for my emp.csv file and another for my dep.csv file.



d. So I have both my folders created now. 


e. Let's upload the files to their respective folders now. Enter a folder and click on Upload. A pop-up will appear; upload your file there.




f. Repeat the same for the employee file as well.




--------------------------------------------------------------------------------------------------------------------------


With this, we are done with the S3 part. Now let's move to the Athena console, which looks like this.


default is the name of the default database. You can also see some of the tables I have already created under that database.

2. Create the metadata tables

a. Creating a table is easy here using the Create Table wizard. Click on the Create table link beside the Tables label. Then, click on "from S3 bucket data".




b. On the next page, you need to enter the table name and the location of the file to be queried. You can also create a new database or use an existing one. In this example, I am using the existing database default.
In the Table Name field I have entered emp.
In the Location of Input Data Set field I entered the address of my bucket, that is s3://athena.abc/employee/ .
It is of the format s3://bucket_name/folder_name. Once done, click on Next.



c. In the next step, select the data format of the file. In our case, it's a CSV file.



d. Column names and their types are to be added now. In my emp.csv file the columns are name, salary, and dep_id, with types string, int, and int respectively. If you have a data set with many columns, Athena also has a bulk add option to save you the task of manually adding columns and types one by one, as shown below.
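If you use bulk add, the columns are entered as a single comma-separated list of name and type pairs. For our file, the list should look something like the line below (best confirmed against the wizard's own hint text):

name string, salary int, dep_id int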




e. Then, in the next step, click on Create table. After this, you can see a DDL generated for you.
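For reference, the generated DDL for the emp table should look roughly like the sketch below. The exact serde and table properties may differ slightly depending on the options you picked, so treat this as an illustration rather than the wizard's verbatim output:

CREATE EXTERNAL TABLE IF NOT EXISTS default.emp (
  name string,
  salary int,
  dep_id int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',')
LOCATION 's3://athena.abc/employee/'
TBLPROPERTIES ('has_encrypted_data' = 'false');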



f. Click on Run query to execute the DDL, and your table will be created. You might be asked to select a storage location for query output; just enter a bucket path where you wish to store your output.

g. Let's preview the data in the emp table now. You can do it using a select query.

select * from default.emp;
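Alternatively, the console's Preview table option generates a limited select for you, typically something like SELECT * FROM "default"."emp" limit 10;. Either way, you should see the seven rows from emp.csv in the results pane.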



h. The same thing is done for the dep.csv file as well.
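The corresponding DDL sketch for the dep table, under the same assumptions as the emp one above:

CREATE EXTERNAL TABLE IF NOT EXISTS default.dep (
  dep_id int,
  dep_name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',')
LOCATION 's3://athena.abc/department/'
TBLPROPERTIES ('has_encrypted_data' = 'false');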








Ta-da! So we have both the emp.csv and dep.csv data in the emp and dep tables. Let's try to achieve our goal, which was to get the maximum salary in every department.

select dep.dep_name, max(salary) as max_sal_in_dep from default.emp join default.dep on emp.dep_id = dep.dep_id group by dep.dep_name;

This is the query that should give me the desired result. Let's check!
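Working it out by hand from our sample data, the expected result (in no particular order) is:

DI,200000
BIG DATA,400000
BI,190000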




So, it worked. The possibilities are immense now.




Hope the blog helped you understand Athena. If you want me to dive into any other AWS service, do comment below. Also, share this with other technocrats!
