May 8th, 2017
Do you want your business intelligence (BI) users to be able to use “self-service” for analytics and reporting in a modern cloud architecture? Then this article is for you. Amazon has a perfect combination of storage, transformation and visualization that can solve this very common request, empowering data analysts to do their job without the long wait to load, model and prepare data.
First, let’s review the AWS services you’ll need for this solution:
The solution for this self-service BI is described in the figure below. Let’s review the data flow from left to right.
In the next few sections, we will review each of the key tasks in detail.
For this demonstration, we will use all files from the Data.gov College Scorecard. The dataset is available from the following public URL: https://catalog.data.gov/dataset/college-scorecard
Here are the detailed steps to upload a file to S3 filesystem:
1. Download the CollegeScorecard_Raw_Data.zip to your local system (laptop) and unzip the file.
2. Next, upload the file to AWS S3, login to our account and select S3 from the Services menu.
3. Next, select the S3 bucket or create a new S3 bucket. In the following screenshot, I have the select
collegescorecard bucket that I created earlier.
4. Next, create a folder CollegeRaw and then sub-folders, one for each year 2010, 2011, 2012 and 2013 as shown in the following figure:
5. To each sub-folder, upload the corresponding data file; for example, MERGED2010_PP.csv file goes to year=2010 sub-folder and repeat the same for all years.
This completes the data loading to S3. Next, we’ll see how to view this data in Athena.
Need an expert solution, quickly? Zaloni’s FastTrack Package provides the data management software and expertise to quickly solve your most pressing data challenge and cloud analytics . . . in just 6 weeks!
Now that the data is loaded into S3, we can query it using Athena with the following steps:
1. Open the AWS management console for Athena using this link https://console.aws.amazon.com/athena/home OR search for Athena in the AWS services search bar.
2. Using the Query Editor, run the create database collegestatsathenadb; statement.
3. The new database, collegestatsathenadb, should appear in the drop down on the left-hand side. Select the new database.
4. Next, create a new table for the files in S3 on CollegeScorecard Raw [RP1] data with partition clause. The query is in GitHub here: https://github.com/rnadipalli/quicksight/blob/master/sqlscripts/loadtoAthena.sql
5. After the table is created, verify it by browsing for it on the left-hand panel.
6. Next, to load all partitions of the table, run the following command:
7. Next, you can query the table and view data as shown in the following figure:
This completes the creation of the table in Athena database. Next, we will see how to visualize this data in QuickSight.
Now that data is accessible via Athena, follow the steps below to create a new dataset in QuickSight:
1. From the QuickSight home page, click on Manage Data.
2. Next, select New data set and then select Athena option.
3. For the Data source name, enter the same name as the Athena database CollegeStatsAthenaDB.
4. Click on Validate connection to confirm QuickSight can connect to Athena. After it is Validated, click on Create data source to complete the data source creation as shown in the following figure:
5. Next, select the table collegestats from the table selection and then select Edit/Preview data.
6. The table has over 1700 fields; for this demonstration we will focus only on enrollment related fields and only for public colleges. For this we will use the QuickSight option to report data based on custom SQL as checked in https://github.com/rnadipalli/quicksight/blob/master/sqlscripts/loadtoAthena.sql.
7. Next update the data type of all numeric fields to integer as this will help with the reporting and visualization.
8. Next, click on save and visualize to analyze this data.
9. Next, change the visualization type to vertical stacked bar chart; set the Enrollment_All_Count to a measure; select X axis as Year, Value as Enrollement_All_Count and Group Color as STABBAR. Finally, to focus only on top states by count, filter the chart with Enrollment_All_Count greater than 200,000.
10. This will give you a pretty useful trend that shows that the state of Ohio has a drop of enrollment from 219K in year 2010 to 217K in year 2013 in comparison to the state of Texas where enrollment grew from 438K to 467K in same time period as shown in the following figure:
This shows how we can go from data in flat files to meaningful charts with zero infrastructure administration, no custom software installation and easy to use cloud services.
Further, If you are interested to also move your complex transformations on S3 to a transient cluster, read “S3, Athena and QuickSight: The Perfect Combination for Cloud Analytics“.
If you are looking at strategies on Data Lake migration from a local data center to cloud-based Data Lakes, read “Migrating On-Premises Data Lakes to Cloud“.
For further reading, order “Effective Business Intelligence with QuickSight” by Rajesh Nadipalli.