Set up and monitor AWS Glue crawlers using the enhanced AWS Glue UI and crawler history



A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. Setting up and managing data lakes today involves a lot of manual, complicated, and time-consuming tasks. AWS Glue and AWS Lake Formation make it easy to build, secure, and manage data lakes. As data from existing data stores is moved into the data lake, there is a need to catalog the data to prepare it for analytics from services such as Amazon Athena.

AWS Glue crawlers are a popular way to populate the AWS Glue Data Catalog. AWS Glue crawlers are a key component that let you connect to data sources or targets, use different classifiers to determine the logical schema for the data, and create metadata in the Data Catalog. You can run crawlers on a schedule, on demand, or triggered based on an Amazon Simple Storage Service (Amazon S3) event to make sure that the Data Catalog is up to date. Using S3 event notifications can reduce the cost and time a crawler needs to update large and frequently changing tables.
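For reference, these three trigger modes show up directly in the crawler API. The following is a minimal sketch of each mode, assuming a crawler named MyCrawler that already exists; the crawler name, cron expression, bucket path, and queue ARN are placeholders, not values from this post:

    # Run on demand
    aws glue start-crawler --name MyCrawler

    # Run on a schedule (every day at 12:00 UTC)
    aws glue update-crawler --name MyCrawler --schedule "cron(0 12 * * ? *)"

    # Run based on Amazon S3 events delivered to an SQS queue
    aws glue update-crawler --name MyCrawler \
        --recrawl-policy RecrawlBehavior=CRAWL_EVENT_MODE \
        --targets '{"S3Targets": [{"Path": "s3://my-bucket/my-prefix/", "EventQueueArn": "arn:aws:sqs:us-east-1:123456789012:MyQueue"}]}'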

The AWS Glue crawlers UI has been redesigned to offer a better user experience, and new functionalities have been added. This new UI provides easier setup of crawlers across multiple sources, including Amazon S3, Amazon DynamoDB, Amazon Redshift, Amazon Aurora, Amazon DocumentDB (with MongoDB compatibility), Delta Lake, MariaDB, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, and MongoDB. A new AWS Glue crawler history feature has also been launched, which provides a convenient way to view crawler runs, their schedules, data sources, and tags. For each crawl, the crawler history offers a summary of data modifications such as changes in the database schema or Amazon S3 partition changes. Crawler history also provides DPU hours, which can reduce the time needed to analyze and debug crawler operations and costs.

This post shows how to create an AWS Glue crawler that supports S3 event notification using the new UI. We also show how to navigate through the new crawler history section and get useful insights.

Overview of solution

To demonstrate how to create an AWS Glue crawler using the new UI, we use the Toronto parking tickets dataset, specifically the data about parking tickets issued in the city of Toronto between 2017 and 2018. The goal is to create a crawler based on S3 events, run it, and explore the information shown in the UI about the run of this crawler.

As mentioned before, instead of crawling all the subfolders on Amazon S3, we use an S3 event-based approach. This helps improve the crawl time by using S3 events to identify the changes between two crawls, listing only the files from the subfolder that triggered the event instead of listing the full Amazon S3 target. For this post, we create an S3 event, an Amazon Simple Notification Service (Amazon SNS) topic, and an Amazon Simple Queue Service (Amazon SQS) queue.
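To see how this notification chain fits together outside of the CloudFormation template, the following AWS CLI sketch shows the idea; the bucket name, topic ARN, and queue ARN are placeholders for values in your own account, and it assumes the SQS queue policy already allows the SNS topic to deliver messages to the queue:

    # Send object-created events under the torontotickets/ prefix to an SNS topic
    aws s3api put-bucket-notification-configuration \
        --bucket glue-crawler-blog-123456789012 \
        --notification-configuration '{
            "TopicConfigurations": [{
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:GlueSNSTopic",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "torontotickets/"}]}}
            }]
        }'

    # Subscribe the SQS queue to the topic so the crawler can consume the events
    aws sns subscribe \
        --topic-arn arn:aws:sns:us-east-1:123456789012:GlueSNSTopic \
        --protocol sqs \
        --notification-endpoint arn:aws:sqs:us-east-1:123456789012:GlueCrawlerQueue

You don't need to run these commands for this post, because the CloudFormation stack in the next section creates the topic, the queue, and the bucket notification for you.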

The following diagram illustrates our solution architecture.

Prerequisites

For this walkthrough, you should have the following prerequisites:

If the AWS account you use to follow this post uses Lake Formation to manage permissions on the AWS Glue Data Catalog, make sure that you log in as a user with access to create databases and tables. For more information, refer to Implicit Lake Formation permissions.

Launch your CloudFormation stack

To create your resources for this use case, complete the following steps:

1. Launch your CloudFormation stack in us-east-1:
   BDB-2063-launch-cloudformation-stack
2. Under Parameters, enter a name for your S3 bucket (include your account number).
3. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
4. Choose Create stack.
5. Wait until the creation of the stack is complete, as shown on the AWS CloudFormation console.
6. On the stack's Outputs tab, note the SQS queue ARN; we use it during the crawler creation process.

Launching this stack creates AWS resources. You need the following resources from the Outputs tab for the next steps (a CLI alternative to the console launch is sketched after this list):

• GlueCrawlerRole – The IAM role to run AWS Glue jobs
• BucketName – The name of the S3 bucket to store solution-related files
• GlueSNSTopic – The SNS topic, which we use as the target for the S3 event
• SQSArn – The SQS queue ARN; this queue is going to be consumed by the AWS Glue crawler
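If you prefer the AWS CLI over the console button, the stack launch and output lookup sketched below are roughly equivalent; the stack name, template URL, and parameter key are assumptions, so substitute the values from the launch link in this post:

    # Launch the stack in us-east-1 (template URL and parameter key are placeholders)
    aws cloudformation create-stack \
        --region us-east-1 \
        --stack-name glue-crawler-blog \
        --template-url https://example-bucket.s3.amazonaws.com/glue-crawler-blog.yaml \
        --parameters ParameterKey=BucketName,ParameterValue=glue-crawler-blog-123456789012 \
        --capabilities CAPABILITY_NAMED_IAM

    # Wait for the stack to finish, then read the outputs listed above
    aws cloudformation wait stack-create-complete --stack-name glue-crawler-blog
    aws cloudformation describe-stacks --stack-name glue-crawler-blog \
        --query "Stacks[0].Outputs"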

Create an AWS Glue crawler

Let's first create the dataset that is going to be used as the source of the AWS Glue crawler:

1. Open AWS CloudShell.
2. Run the following command:

    aws s3 cp s3://aws-bigdata-blog/artifacts/gluenewcrawlerui/sourcedata/year=2017/Parking_Tags_Data_2017_2.csv s3://glue-crawler-blog-<YOUR ACCOUNT NUMBER>/torontotickets/year=2017/Parking_Tags_Data_2017_2.csv


This action triggers an S3 event that sends a message to the SNS topic that you created using the CloudFormation template. This message is consumed by an SQS queue that will be input for the AWS Glue crawler.
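Optionally, you can verify that the notification actually reached the queue before creating the crawler. This is a minimal sketch, assuming the queue URL that corresponds to the SQSArn output of your stack:

    # The approximate message count should be at least 1 once the S3 event has propagated
    aws sqs get-queue-attributes \
        --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/GlueCrawlerQueue \
        --attribute-names ApproximateNumberOfMessages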

Now, let's create the AWS Glue crawler.

3. On the AWS Glue console, choose Crawlers in the navigation pane.
4. Choose Create crawler.
5. For Name, enter a name (for example, BlogPostCrawler).
6. Choose Next.
7. For Is your data already mapped to Glue tables, select Not yet.
8. In the Data sources section, choose Add a data source.

   For this post, you use an S3 dataset as a source.
9. For Data source, choose S3.
10. For Location of S3 data, select In this account.
11. For S3 path, enter the path to the S3 bucket you created with the CloudFormation template (s3://glue-crawler-blog-YOUR ACCOUNT NUMBER/torontotickets/).
12. For Subsequent crawler runs, select Crawl based on events.
13. Enter the SQS queue ARN you created earlier.
14. Choose Add an S3 data source.
15. Choose Next.
16. For Existing IAM role, choose the role you created (GlueCrawlerBlogRole).
17. Choose Next.

   Now let's create an AWS Glue database.
18. Under Target database, choose Add database.
19. For Name, enter blogdb.
20. For Location, choose the S3 bucket created by the CloudFormation template.
21. Choose Create database.
22. On the Set output and scheduling page, for Target database, choose the database you just created (blogdb).
23. For Table name prefix, enter blog.
24. For Maximum table threshold, you can optionally set a limit for the number of tables that this crawler can scan. For this post, we leave this option blank.
25. For Frequency, choose On demand.
26. Choose Next.
27. Review the configuration and choose Create crawler.
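The wizard you just completed maps to the CreateCrawler API, so the same crawler can be scripted. The following is a rough sketch rather than the exact configuration the console produces; the role name, bucket name, and queue ARN are placeholders for your CloudFormation outputs:

    # Event-mode crawler that reads object-level changes from the SQS queue
    aws glue create-crawler \
        --name BlogPostCrawler \
        --role GlueCrawlerBlogRole \
        --database-name blogdb \
        --table-prefix blog \
        --recrawl-policy RecrawlBehavior=CRAWL_EVENT_MODE \
        --targets '{
            "S3Targets": [{
                "Path": "s3://glue-crawler-blog-123456789012/torontotickets/",
                "EventQueueArn": "arn:aws:sqs:us-east-1:123456789012:GlueCrawlerQueue"
            }]
        }'

Leaving out the --schedule option corresponds to the On demand frequency chosen in step 25.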

Run the AWS Glue crawler

To run the crawler, navigate to the crawler on the AWS Glue console.

Choose Run crawler.

On the Crawler runs tab, you can see the current run of the crawler.
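You can do the same from the CLI; here is a minimal sketch, assuming the crawler name used earlier:

    # Start the crawler on demand
    aws glue start-crawler --name BlogPostCrawler

    # Check the crawler state (READY, RUNNING, or STOPPING) and the status of the last run
    aws glue get-crawler --name BlogPostCrawler \
        --query "Crawler.{State: State, LastRun: LastCrawl.Status}"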

Explore the crawler run history data

When the crawler is complete, you can see the following details:

• Duration – The exact duration of the crawler run
• DPU hours – The number of DPU hours spent during the crawler run; this is very useful for calculating costs
• Table changes – The changes applied to the table, like new columns or partitions

Choose Table changes to see the crawler run summary.

You can see that the table blogtorontotickets was created, along with a 2017 partition.
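The run details shown on the console are also exposed through the crawler history API (ListCrawls). A minimal sketch, assuming the crawler name above:

    # List recent crawls for the crawler, including start and end times, DPU hours, and a summary of changes
    aws glue list-crawls --crawler-name BlogPostCrawler --max-results 5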

Let's add more data to the S3 bucket to see how the crawler processes this change.

1. Open CloudShell.
2. Run the following command:

    aws s3 cp s3://aws-bigdata-blog/artifacts/gluenewcrawlerui/sourcedata/year=2018/Parking_Tags_Data_2018_1.csv s3://glue-crawler-blog-<YOUR ACCOUNT NUMBER>/torontotickets/year=2018/Parking_Tags_Data_2018_1.csv

3. Choose Run crawler to run the crawler one more time.

You can see the second run of the crawler listed.

Note that the DPU hours were reduced by more than half; this is because only one partition was scanned and added. Having an event-based crawler helps reduce runtime and cost.

You can choose the Table changes information of the second run to see more details.

Note under Partitions added that the 2018 partition was created.

Additional notes

Keep in mind the following considerations:

• Crawler history is supported for crawls that have occurred since the launch of the crawler history feature, and it only retains up to 12 months of crawls. Older crawls will not be returned.
• To set up a crawler using AWS CloudFormation, you can use the following template.
• You can get all the crawls of a specified crawler by using the ListCrawls API.
• You can update existing crawlers with a single Amazon S3 target to use this new feature. You can do this either via the AWS Glue console or by calling the update_crawler API (see the sketch after this list).
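For the last point, here is a minimal sketch of switching an existing crawler with a single Amazon S3 target to event mode through the UpdateCrawler API; the crawler name, path, and queue ARN below are placeholders:

    # Switch an existing single-target crawler to S3 event mode
    aws glue update-crawler \
        --name YourExistingCrawler \
        --recrawl-policy RecrawlBehavior=CRAWL_EVENT_MODE \
        --targets '{
            "S3Targets": [{
                "Path": "s3://your-existing-bucket/your-prefix/",
                "EventQueueArn": "arn:aws:sqs:us-east-1:123456789012:YourQueue"
            }]
        }'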

Clean up

To avoid incurring future charges, and to clean up unused roles and policies, delete the resources you created: the CloudFormation stack, S3 bucket, AWS Glue crawler, AWS Glue database, and AWS Glue table.
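A minimal cleanup sketch with the AWS CLI, assuming the names used earlier in this post; note that the data bucket usually has to be emptied before CloudFormation can delete it:

    # Empty the data bucket so the stack can remove it
    aws s3 rm s3://glue-crawler-blog-123456789012 --recursive

    # Delete the crawler and the database (the database and its tables were created outside the stack)
    aws glue delete-crawler --name BlogPostCrawler
    aws glue delete-database --name blogdb

    # Delete the CloudFormation stack and the resources it created
    aws cloudformation delete-stack --stack-name glue-crawler-blog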

Conclusion

You can use AWS Glue crawlers to discover datasets, extract schema information, and populate the AWS Glue Data Catalog. AWS Glue crawlers now provide an easier-to-use UI workflow to set up crawlers, and they also provide metrics associated with past crawler runs to simplify monitoring and auditing. In this post, we provided a CloudFormation template to set up AWS Glue crawlers to use S3 event notifications, which reduces the time and cost needed to incrementally process table data updates in the AWS Glue Data Catalog. We also showed you how to monitor and understand the cost of crawlers.

Special thanks to everyone who contributed to the crawler history launch: Theo Xu, Jessica Cheng, and Joseph Barlan.

Happy crawling!


About the authors

Leonardo Gómez is a Senior Analytics Specialist Solutions Architect at AWS. Based in Toronto, Canada, he has over a decade of experience in data management, helping customers around the globe address their business and technical needs. Connect with him on LinkedIn.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.
