
WORKING WITH TPC-DS DATA & RDS MYSQL DATABASE

Welcome to Part 1!

In Part 1 - Working with TPC-DS Data & RDS MySQL Database - you will install and configure everything required to prepare and load TPC-DS data into an RDS MySQL database instance, supplying the dataset samples used throughout the ETL Train the Trainer Workshop.

1. Preparing & Generating TPCDS Dataset

  • TPC-DS - TPC-DS is a decision-support benchmark dataset. When you load this data, the tpcds schema is populated with sample data. For more information about the data, see TPC-DS.

Before you can install and generate TPC-Data, you need to build the necessary folders and download the necessary files.

Run the following commands to create the base-level directory for the ETL Train the Trainer Workshop and to download the TPC-DS tool to your local AWS Cloud9 Environment:

mkdir -p ~/environment/ttt-demo
cd ~/environment/ttt-demo/
aws s3 cp s3://ee-assets-prod-us-east-1/modules/31e125cc66e9400c9244049b3b243c38/v1/downloads/etl-ttt-workshop/16d7423c-23d3-4185-8ab4-5b002ec51153-tpc-ds-tool.zip tpc-ds-tool.zip

Next, run the following commands to unzip the previously downloaded (.zip) file, change into the DSGen software's tools directory, and build the tool.

unzip tpc-ds-tool.zip 
cd DSGen-software-code-3.2.0rc1/tools/
make
head -200 tpcds.sql 

The last command (head) prints the first 200 lines of the TPC-DS SQL script containing the table schemas, validating the installation of the tool.
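As an additional sanity check (a small sketch, run from the same tools directory), you can count the table definitions in the schema script; the TPC-DS schema defines 24 tables, so the count should match that, assuming one "create table" statement per table:

```shell
# Count the "create table" statements in the schema script (case-insensitive).
# TPC-DS defines 24 tables, so this should print 24.
count_tables() { grep -ci 'create table' "$1"; }
count_tables tpcds.sql
```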

Now, let's run the DSGen software to generate the sample data for all the TPC-DS tables and store the sample datasets in a temporary directory. TIP: This should take about 3 minutes to complete!

mkdir -p /tmp/dsd/csv_tables/
./dsdgen -scale 1 -dir /tmp/dsd -parallel 2 -child 1
cd /tmp/dsd/
ls -lrt


[Optional]: (Performing this optional step while the previous command runs is a good use of the waiting time!)

While the above command is running, you can proactively test the reachability of your RDS MySQL database instance from your AWS Cloud9 Environment. To do that, run the following commands in a NEW terminal tab (+ icon):

sudo yum -y install telnet
mysqlendpoint=$(aws cloudformation describe-stacks --query 'Stacks[*].Outputs[?OutputKey==`MySQLEndpoint`].OutputValue | [0] | [0]' --output text)
telnet $mysqlendpoint 3306

TIP: Close this additional terminal tab once the telnet command succeeds and go back to the original one!
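If you prefer not to install telnet, bash's built-in /dev/tcp pseudo-device can perform the same reachability check. A small sketch (the `port_open` helper is hypothetical; `$mysqlendpoint` is the variable set above):

```shell
# Reachability check using bash's /dev/tcp redirection (no telnet needed);
# port_open succeeds (exit 0) only if a TCP connection can be opened.
port_open() { timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; }
port_open "$mysqlendpoint" 3306 && echo "MySQL port reachable" || echo "MySQL port NOT reachable"
```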

2. Populating the Amazon RDS-MySQL Database with TPCDS Dataset

Once the dataset samples have been generated, it is time to load the data into the RDS MySQL database. But first, run the following commands to create the TPC-DS table schemas in the MySQL database:

tpcds_script_path=~/environment/ttt-demo/DSGen-software-code-3.2.0rc1/tools/tpcds.sql
mysql -h ${mysqlendpoint} -u etluser -petltttdemopwd -Dtpcds < $tpcds_script_path
mysql -h ${mysqlendpoint} -u etluser -petltttdemopwd -Dtpcds -e "show tables"

The last output should be a list of all the tables that have been created in the database.


Now, it is time to load the data into the tables. To do that easily, run the following commands to download and run a shell script that performs the whole loading process into the RDS MySQL database. TIP: This should take about 3 minutes to complete!

aws s3 cp s3://ee-assets-prod-us-east-1/modules/31e125cc66e9400c9244049b3b243c38/v1/downloads/etl-ttt-workshop/load_TPC-DS_MySQL.sh .
. load_TPC-DS_MySQL.sh $mysqlendpoint

You can see all the tables getting loaded one by one...


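The script itself isn't reproduced here, but as a rough sketch (hypothetical, not the workshop's actual code), such a loader typically maps each generated .dat file back to its table name and issues a LOAD DATA LOCAL INFILE statement per table:

```shell
# Hypothetical sketch of a TPC-DS loader (NOT the workshop script):
# dsdgen with -parallel names chunks like customer_1_2.dat; strip the
# _<child>_<parallel> suffix to recover the table name, then build the
# LOAD DATA statement for that pipe-delimited file.
load_sql_for() {
  local f="$1" table
  table=$(basename "$f" .dat | sed 's/_[0-9]*_[0-9]*$//')
  echo "LOAD DATA LOCAL INFILE '$f' INTO TABLE $table FIELDS TERMINATED BY '|';"
}
for f in /tmp/dsd/*.dat; do
  load_sql_for "$f"   # in a real loader this SQL would be piped to mysql
done
```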

NOTE: You don't need to run it again, but if you just want to have a look at the script, simply run the following commands:

cd /tmp/dsd/ 
cat load_TPC-DS_MySQL.sh

3. Unloading Tables (in CSV) from RDS MySQL Database and Uploading to S3

Finally, running the following commands will:

  • Extract the 3 required tables (web_page, customer and income_band) from the MySQL database;
  • Save these tables in the local temporary directory (in CSV format);
  • And, subsequently, upload these CSV files into the Workshop's S3 Bucket's path: " .../etl-ttt-demo/csv_tables/"
mysql -h ${mysqlendpoint} -u etluser -petltttdemopwd -Dtpcds --batch -e "select * from web_page" | sed 's/\t/","/g;s/^/"/;s/$/"/;s/\n//g' > /tmp/dsd/csv_tables/web_page.csv
mysql -h ${mysqlendpoint} -u etluser -petltttdemopwd -Dtpcds --batch -e "select * from customer" | sed 's/\t/","/g;s/^/"/;s/$/"/;s/\n//g' > /tmp/dsd/csv_tables/customer.csv
mysql -h ${mysqlendpoint} -u etluser -petltttdemopwd -Dtpcds --batch -e "select * from income_band" | sed 's/\t/","/g;s/^/"/;s/$/"/;s/\n//g' > /tmp/dsd/csv_tables/income_band.csv
aws s3 cp --recursive /tmp/dsd/csv_tables/ s3://$BUCKET_NAME/etl-ttt-demo/csv_tables/
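The sed pipeline above simply turns MySQL's tab-separated --batch output into quoted CSV; note that --batch output includes a header row, so the first line of each CSV file will be the quoted column names. You can see the transform in isolation on a sample record:

```shell
# Demonstration of the tab-to-CSV transform on one sample tab-separated line:
# replace each tab with ",", then wrap the whole line in quotes.
printf 'a\tb\tc\n' | sed 's/\t/","/g;s/^/"/;s/$/"/'
```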



You are finished populating the RDS MySQL database with TPC-DS sample data. Now, feel free to explore the workshop's Amazon S3 bucket again (etl-ttt-demo-${AWS_ACCOUNT_ID}-${AWS_REGION}) and check the files under the etl-ttt-demo/csv_tables/ path. Once you are ready, you can move on to Part 2 - AWS GLUE COMPONENTS!