Run an information processing job on Amazon EMR Serverless with AWS Step Features


    There are a number of infrastructure as code (IaC) frameworks out there as we speak, that can assist you outline your infrastructure, such because the AWS Cloud Improvement Package (AWS CDK) or Terraform by HashiCorp. Terraform, an AWS Companion Community (APN) Superior Expertise Companion and member of the AWS DevOps Competency, is an IaC device much like AWS CloudFormation that permits you to create, replace, and model your AWS infrastructure. Terraform offers pleasant syntax (much like AWS CloudFormation) together with different options like planning (visibility to see the adjustments earlier than they really occur), graphing, and the power to create templates to interrupt infrastructure configurations into smaller chunks, which permits higher upkeep and reusability. We use the capabilities and options of Terraform to construct an API-based ingestion course of into AWS. Let’s get began!

    On this put up, we showcase methods to construct and orchestrate a Scala Spark software utilizing Amazon EMR Serverless, AWS Step Features, and Terraform. On this end-to-end resolution, we run a Spark job on EMR Serverless that processes pattern clickstream information in an Amazon Easy Storage Service (Amazon S3) bucket and shops the aggregation ends in Amazon S3.

    With EMR Serverless, you don’t must configure, optimize, safe, or function clusters to run functions. You’ll proceed to get the advantages of Amazon EMR, akin to open supply compatibility, concurrency, and optimized runtime efficiency for well-liked information frameworks. EMR Serverless is appropriate for patrons who need ease in working functions utilizing open-source frameworks. It gives fast job startup, automated capability administration, and easy price controls.

    Resolution overview

    We offer the Terraform infrastructure definition and the supply code for an AWS Lambda perform utilizing pattern buyer consumer clicks for on-line web site inputs, that are ingested into an Amazon Kinesis Knowledge Firehose supply stream. The answer makes use of Kinesis Knowledge Firehose to transform the incoming information right into a Parquet file (an open-source file format for Hadoop) earlier than pushing it to Amazon S3 utilizing the AWS Glue Knowledge Catalog. The generated output S3 Parquet file logs are then processed by an EMR Serverless course of, which outputs a report detailing combination clickstream statistics in an S3 bucket. The EMR Serverless operation is triggered utilizing Step Features. The pattern structure and code are spun up as proven within the following diagram.

    emr serverless application

    The offered samples have the supply code for constructing the infrastructure utilizing Terraform for working the Amazon EMR software. Setup scripts are offered to create the pattern ingestion utilizing Lambda for the incoming software logs. For the same ingestion sample pattern, confer with Provision AWS infrastructure utilizing Terraform (By HashiCorp): an instance of net software logging buyer information.

    The next are the high-level steps and AWS providers used on this resolution:

    • The offered software code is packaged and constructed utilizing Apache Maven.
    • Terraform instructions are used to deploy the infrastructure in AWS.
    • The EMR Serverless software offers the choice to submit a Spark job.
    • The answer makes use of two Lambda features:
      • Ingestion – This perform processes the incoming request and pushes the information into the Kinesis Knowledge Firehose supply stream.
      • EMR Begin Job – This perform begins the EMR Serverless software. The EMR job course of converts the ingested consumer click on logs into output in one other S3 bucket.
    • Step Features triggers the EMR Begin Job Lambda perform, which submits the applying to EMR Serverless for processing of the ingested log information.
    • The answer makes use of 4 S3 buckets:
      • Kinesis Knowledge Firehose supply bucket – Shops the ingested software logs in Parquet file format.
      • Loggregator supply bucket – Shops the Scala code and JAR for working the EMR job.
      • Loggregator output bucket – Shops the EMR processed output.
      • EMR Serverless logs bucket – Shops the EMR course of software logs.
    • Pattern invoke instructions (run as a part of the preliminary setup course of) insert the information utilizing the ingestion Lambda perform. The Kinesis Knowledge Firehose supply stream converts the incoming stream right into a Parquet file and shops it in an S3 bucket.

    For this resolution, we made the next design selections:

    • We use Step Features and Lambda on this use case to set off the EMR Serverless software. In a real-world use case, the information processing software may very well be lengthy working and will exceed Lambda’s timeout limits. On this case, you should use instruments like Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Amazon MWAA is a managed orchestration service makes it simpler to arrange and function end-to-end information pipelines within the cloud at scale.
    • The Lambda code and EMR Serverless log aggregation code are developed utilizing Java and Scala, respectively. You should utilize any supported languages in these use instances.
    • The AWS Command Line Interface (AWS CLI) V2 is required for querying EMR Serverless functions from the command line. You too can view these from the AWS Administration Console. We offer a pattern AWS CLI command to check the answer later on this put up.


    To make use of this resolution, it’s essential to full the next conditions:

    • Set up the AWS CLI. For this put up, we used model 2.7.18. That is required in an effort to question the aws emr-serverless AWS CLI instructions out of your native machine. Optionally, all of the AWS providers used on this put up will be considered and operated by way of the console.
    • Be sure to have Java put in, and JDK/JRE 8 is about within the setting path of your machine. For directions, see the Java Improvement Package.
    • Set up Apache Maven. The Java Lambda features are constructed utilizing mvn packages and are deployed utilizing Terraform into AWS.
    • Set up the Scala Construct Device. For this put up, we used model 1.4.7. Be sure to obtain and set up primarily based in your working system wants.
    • Arrange Terraform. For steps, see Terraform downloads. We use model 1.2.5 for this put up.
    • Have an AWS account.

    Configure the answer

    To spin up the infrastructure and the applying, full the next steps:

    1. Clone the next GitHub repository.
      The offered shell script builds the Java software JAR (for the Lambda ingestion perform) and the Scala software JAR (for the EMR processing) and deploys the AWS infrastructure that’s wanted for this use case.
    2. Run the next instructions:
      $ chmod +x
      $ ./

      To run the instructions individually, set the applying deployment Area and account quantity, as proven within the following instance:

      $ APP_DIR=$PWD
      $ APP_PREFIX=clicklogger
      $ STAGE_NAME=dev
      $ REGION=us-east-1
      $ ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account')

      The next is the Maven construct Lambda software JAR and Scala software bundle:

      $ cd $APP_DIR/supply/clicklogger
      $ mvn clear bundle
      $ sbt reload
      $ sbt compile
      $ sbt bundle

    3. Deploy the AWS infrastructure utilizing Terraform:
      $ terraform init
      $ terraform plan
      $ terraform apply --auto-approve

    Take a look at the answer

    After you construct and deploy the applying, you’ll be able to insert pattern information for Amazon EMR processing. We use the next code for example. The script has a number of pattern insertions for Lambda. The ingested logs are utilized by the EMR Serverless software job.

    The pattern AWS CLI invoke command inserts pattern information for the applying logs:

    aws lambda invoke --function-name clicklogger-dev-ingestion-lambda —cli-binary-format raw-in-base64-out —payload '{"requestid":"OAP-guid-001","contextid":"OAP-ctxt-001","callerid":"OrderingApplication","part":"login","motion":"load","sort":"webpage"}' out

    To validate the deployments, full the next steps:

    1. On the Amazon S3 console, navigate to the bucket created as a part of the infrastructure setup.
    2. Select the bucket to view the information.
      You must see that information from the ingested stream was transformed right into a Parquet file.
    3. Select the file to view the information.
      The next screenshot reveals an instance of our bucket contents.
      Now you’ll be able to run Step Features to validate the EMR Serverless software.
    4. On the Step Features console, open clicklogger-dev-state-machine.
      The state machine reveals the steps to run that set off the Lambda perform and EMR Serverless software, as proven within the following diagram.
    5. Run the state machine.
    6. After the state machine runs efficiently, navigate to the clicklogger-dev-output-bucket on the Amazon S3 console to see the output information.
    7. Use the AWS CLI to test the deployed EMR Serverless software:
      $ aws emr-serverless list-applications 
            | jq -r '.functions[] | choose(.identify=="clicklogger-dev-loggregrator-emr-<Your-Account-Quantity>").id'

    8. On the Amazon EMR console, select Serverless within the navigation pane.
    9. Choose clicklogger-dev-studio and select Handle functions.
    10. The Utility created by the stack can be as proven under clicklogger-dev-loggregator-emr-<Your-Account-Quantity>
      Now you’ll be able to overview the EMR Serverless software output.
    11. On the Amazon S3 console, open the output bucket (us-east-1-clicklogger-dev-loggregator-output-).
      The EMR Serverless software writes the output primarily based on the date partition, akin to 2022/07/28/ next code reveals an instance of the file output:

    Clear up

    The offered ./ script has the required steps to delete all of the information from the S3 buckets that have been created as a part of this put up. The terraform destroy command cleans up the AWS infrastructure that you just created earlier. See the next code:

    $ chmod +x
    $ ./

    To do the steps manually, you too can delete the sources by way of the AWS CLI:

    # CLI Instructions to delete the Amazon S3  
    aws s3 rb s3://clicklogger-dev-emr-serverless-logs-bucket-<your-account-number> --force
    aws s3 rb s3://clicklogger-dev-firehose-delivery-bucket-<your-account-number> --force
    aws s3 rb s3://clicklogger-dev-loggregator-output-bucket-<your-account-number> --force
    aws s3 rb s3://clicklogger-dev-loggregator-source-bucket-<your-account-number> --force
    aws s3 rb s3://clicklogger-dev-loggregator-source-bucket-<your-account-number> --force
    # Destroy the AWS Infrastructure 
    terraform destroy --auto-approve


    On this put up, we constructed, deployed, and ran an information processing Spark job in EMR Serverless that interacts with varied AWS providers. We walked by deploying a Lambda perform packaged with Java utilizing Maven, and a Scala software code for the EMR Serverless software triggered with Step Features with infrastructure as code. You should utilize any mixture of relevant programming languages to construct your Lambda features and EMR job software. EMR Serverless will be triggered manually, automated, or orchestrated utilizing AWS providers like Step Features and Amazon MWAA.

    We encourage you to check this instance and see for your self how this general software design works inside AWS. Then, it’s simply the matter of changing your particular person code base, packaging it, and letting EMR Serverless deal with the method effectively.

    In the event you implement this instance and run into any points, or have any questions or suggestions about this put up, please go away a remark!


    Concerning the Authors

    Sivasubramanian Ramani (Siva Ramani) is a Sr Cloud Utility Architect at Amazon Net Companies. His experience is in software optimization & modernization, serverless options and utilizing Microsoft software workloads with AWS.

    Naveen Balaraman is a Sr Cloud Utility Architect at Amazon Net Companies. He’s keen about Containers, serverless Purposes, Architecting Microservices and serving to clients leverage the ability of AWS cloud.


    Please enter your comment!
    Please enter your name here