To demonstrate the power of data lake architectures, in this workshop I ingested streaming data from the Kinesis Data Generator (KDG) into Amazon S3, then built a big data processing pipeline without servers or clusters, ready to process huge amounts of data. The dataset, GDELT, is an open dataset from the AWS Open Data Registry; it is roughly 170 GB in size and comprises thousands of uncompressed CSV files. I also created an AWS Glue transform job to perform basic transformations on the Amazon S3 source data. And finally, I used the larger public dataset, with more tables, to observe the various AWS services working together through Amazon Athena.
- Create a CloudFormation stack by uploading the template file (serverlessDataLakeDay.json)
- Create a Kinesis Data Firehose delivery stream to ingest data into your data lake
- Install the Kinesis Data Generator Tool (KDG)
  - Monitoring the Firehose delivery stream
  - Amazon Kinesis Data Firehose writes the data to Amazon S3
- Cataloging your Data with AWS Glue
- Create a crawler to auto-discover the schema of your data in S3
- Create a database and a table, then edit the metadata schema
- Create a Transformation Job with Glue Studio
- SQL analytics on a Large-Scale Open Dataset using Amazon Athena
- Create a database:

  ```sql
  CREATE DATABASE gdelt;
  ```
- Create a metadata table for the GDELT EVENTS data:

  ```sql
  CREATE EXTERNAL TABLE IF NOT EXISTS gdelt.events (
    globaleventid INT,
    day INT,
    monthyear INT,
    year INT,
    fractiondate FLOAT,
    actor1code STRING,
    actor1name STRING,
    actor1countrycode STRING,
    actor1knowngroupcode STRING,
    actor1ethniccode STRING,
    actor1religion1code STRING,
    actor1religion2code STRING,
    actor1type1code STRING,
    actor1type2code STRING,
    actor1type3code STRING,
    actor2code STRING,
    actor2name STRING,
    actor2countrycode STRING,
    actor2knowngroupcode STRING,
    actor2ethniccode STRING,
    actor2religion1code STRING,
    actor2religion2code STRING,
    actor2type1code STRING,
    actor2type2code STRING,
    actor2type3code STRING,
    isrootevent BOOLEAN,
    eventcode STRING,
    eventbasecode STRING,
    eventrootcode STRING,
    quadclass INT,
    goldsteinscale FLOAT,
    nummentions INT,
    numsources INT,
    numarticles INT,
    avgtone FLOAT,
    actor1geo_type INT,
    actor1geo_fullname STRING,
    actor1geo_countrycode STRING,
    actor1geo_adm1code STRING,
    actor1geo_lat FLOAT,
    actor1geo_long FLOAT,
    actor1geo_featureid INT,
    actor2geo_type INT,
    actor2geo_fullname STRING,
    actor2geo_countrycode STRING,
    actor2geo_adm1code STRING,
    actor2geo_lat FLOAT,
    actor2geo_long FLOAT,
    actor2geo_featureid INT,
    actiongeo_type INT,
    actiongeo_fullname STRING,
    actiongeo_countrycode STRING,
    actiongeo_adm1code STRING,
    actiongeo_lat FLOAT,
    actiongeo_long FLOAT,
    actiongeo_featureid INT,
    dateadded INT,
    sourceurl STRING
  )
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  WITH SERDEPROPERTIES (
    'serialization.format' = '\t',
    'field.delim' = '\t'
  )
  LOCATION 's3://gdelt-open-data/events/';
  ```
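  With the events table in place, a quick query is a handy sanity check that the schema lines up with the underlying tab-delimited files; the column selection below is just illustrative:

  ```sql
  -- Preview a few rows to confirm Athena can parse the raw files.
  SELECT globaleventid, day, actor1name, actor2name, eventcode, avgtone
  FROM gdelt.events
  LIMIT 10;
  ```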
- Create metadata tables for the GDELT lookup tables
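  The workshop provides the exact DDL for these lookup tables; as a minimal sketch, here is one plausible shape for the CAMEO event-code lookup, assuming the tab-delimited lookup file has been copied into a bucket you own (the `s3://YOUR-BUCKET/...` path and the `gdelt.eventcodes` name are placeholders):

  ```sql
  -- Hypothetical lookup table; point LOCATION at wherever you copied
  -- the CAMEO event-code file in your own bucket.
  CREATE EXTERNAL TABLE IF NOT EXISTS gdelt.eventcodes (
    eventcode STRING,
    eventdescription STRING
  )
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  WITH SERDEPROPERTIES (
    'serialization.format' = '\t',
    'field.delim' = '\t'
  )
  LOCATION 's3://YOUR-BUCKET/gdelt/eventcodes/';
  ```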
- Example output: (screenshot not shown here)
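To close the loop, a query along the following lines (again assuming the hypothetical `gdelt.eventcodes` table sketched above) joins the raw events against the lookup to rank the most frequent event types, which is the kind of large-scale SQL Athena runs directly over the data in S3:

```sql
-- Rank event types by frequency, translating CAMEO codes into descriptions.
SELECT ec.eventdescription,
       COUNT(*) AS event_count
FROM gdelt.events e
JOIN gdelt.eventcodes ec
  ON e.eventcode = ec.eventcode
GROUP BY ec.eventdescription
ORDER BY event_count DESC
LIMIT 10;
```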
This workshop is based on an AWS Workshop Studio lab; the link is below. https://catalog.us-east-1.prod.workshops.aws/workshops/ea7ddf16-5e0a-4ec7-b54e-5cadf3028b78/en-US