Build and automate a Serverless Data Lake using AWS Glue, Lambda and CloudWatch

This video explains in depth, from scratch, how to create a fully automated data cataloging and ETL pipeline that transforms your data.

Prerequisite:
-----------------------
Implement a CloudWatch Events Rule That Calls an AWS Lambda Function
Using AWS Lambda with Amazon CloudWatch Events | Send notification when ec2 stops
Pipeline design with monitoring and alert functionalities using Cloudwatch Alarm , EC2 & Lambda
Enable CloudWatch logs for API Gateway | Monitoring and Logging API Activity
Invoking State Machine with CloudWatch
AWS Glue Workflow in-depth intuition with Lab
An automated data pipeline using Lambda, S3 and Glue - Big Data with Cloud Computing

Lambda Code to trigger Glue Crawler:
---------------------------------------------------------------
import json
import boto3

glue = boto3.client('glue')

def lambda_handler(event, context):
    # Start the Glue Crawler so the newly arrived data gets catalogued
    response = glue.start_crawler(
        Name='{Put the Name of the Glue Crawler here}'
    )
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
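
Optional hardening (not covered in the video): if the trigger fires again while the crawler is still running, the StartCrawler call fails with CrawlerRunningException. A minimal sketch that catches it, using the same boto3 client as above:

import json
import boto3

glue = boto3.client('glue')

def lambda_handler(event, context):
    try:
        # Start the Glue Crawler so the newly arrived data gets catalogued
        glue.start_crawler(Name='{Put the Name of the Glue Crawler here}')
        message = 'Crawler started'
    except glue.exceptions.CrawlerRunningException:
        # A crawler allows only one run at a time; skip instead of failing
        message = 'Crawler already running, skipping'
    return {
        'statusCode': 200,
        'body': json.dumps(message)
    }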



Lambda Code to trigger Glue Job:
----------------------------------------------------------

import json
import boto3

def lambda_handler(event, context):
    glue = boto3.client('glue')
    # Start the Glue ETL job once the crawler has finished cataloguing the data
    response = glue.start_job_run(JobName="{Put the Glue ETL Job name here}")
    print("Lambda Invoke")

Glue Code:
---------------------

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the table that the crawler catalogued in the Glue Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "{}", table_name = "{}", transformation_ctx = "datasource0")

# Write the data back to S3 in Parquet format
datasink4 = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3",
    connection_options = {"path": "s3://{}/{}/"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()
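
The job above copies the catalogued table to S3 as Parquet without changing the schema. If you also want to rename or retype columns before writing, Glue's ApplyMapping transform can sit between the read and the write; a sketch with hypothetical column names (replace them with the columns your crawler catalogued):

# Each mapping is (source name, source type, target name, target type)
applymapping1 = ApplyMapping.apply(
    frame = datasource0,
    mappings = [("id", "string", "id", "int"), ("name", "string", "name", "string")],
    transformation_ctx = "applymapping1")

# ...then pass applymapping1 instead of datasource0 as the frame to write_dynamic_frame.from_options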

CloudWatch rule to trigger the Lambda on success of the Glue Crawler:
-----------------------------------------------------------------------------------------------------------------------
{
    "source": [
        "aws.glue"
    ],
    "detail-type": [
        "Glue Crawler State Change"
    ],
    "detail": {
        "state": [
            "Succeeded"
        ],
        "crawlerName": [
            "{Put your Crawler Name here}"
        ]
    }
}
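
If you prefer to create this rule from code instead of the console, here is a boto3 sketch (the rule name, Lambda name and ARN below are placeholders):

import json
import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

crawler_success_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Crawler State Change"],
    "detail": {
        "state": ["Succeeded"],
        "crawlerName": ["{Put your Crawler Name here}"]
    }
}

# Create the rule with the event pattern shown above
rule = events.put_rule(
    Name="glue-crawler-succeeded",
    EventPattern=json.dumps(crawler_success_pattern))

# Point the rule at the Lambda that starts the Glue job
events.put_targets(
    Rule="glue-crawler-succeeded",
    Targets=[{"Id": "1", "Arn": "{ARN of the Lambda that starts the Glue job}"}])

# Allow CloudWatch Events to invoke that Lambda
lambda_client.add_permission(
    FunctionName="{Name of the Lambda that starts the Glue job}",
    StatementId="allow-cloudwatch-crawler-rule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"])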

CloudWatch rule to trigger SNS on success of the Glue Job:
---------------------------------------------------------------------------------------------------------
{
    "source": [
        "aws.glue"
    ],
    "detail-type": [
        "Glue Job State Change"
    ],
    "detail": {
        "jobName": [
            "{Put your Job name here}"
        ],
        "state": [
            "SUCCEEDED"
        ]
    }
}
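
This rule can be created from code in the same way as the crawler rule above; the only differences are the event pattern and the target, which is an SNS topic instead of a Lambda (the topic's access policy must also allow events.amazonaws.com to publish). A sketch with placeholder names:

import json
import boto3

events = boto3.client('events')

job_success_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {
        "jobName": ["{Put your Job name here}"],
        "state": ["SUCCEEDED"]
    }
}

events.put_rule(Name="glue-job-succeeded", EventPattern=json.dumps(job_success_pattern))
events.put_targets(
    Rule="glue-job-succeeded",
    Targets=[{"Id": "1", "Arn": "{SNS topic ARN for the success notification}"}])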

Check this playlist for more AWS projects in the Big Data domain:
   • Demystifying Data Engineering with Cl...  
