If you are currently running Amazon EMR 5.X clusters, consider moving to Amazon EMR 6.X, because it includes new features that help you improve performance and optimize cost. For instance, Apache Hive is two times faster with LLAP on Amazon EMR 6.X, and Spark 3 reduces costs by 40%. Additionally, Amazon EMR 6.x releases include Trino, a fast distributed SQL engine, and Iceberg, a high-performance open data format for petabyte-scale tables.
To upgrade Amazon EMR clusters from 5.X to 6.X, a Hive Metastore upgrade is the first step before applications such as Hive and Spark can be migrated. This post provides guidance on how to upgrade the Amazon EMR Hive Metastore from 5.X to 6.X, as well as how to migrate the Hive Metastore to the AWS Glue Data Catalog. Because the Hive 3 Metastore is compatible with Hive 2 applications, you can continue to use Amazon EMR 5.X with the upgraded Hive Metastore.
Solution overview
In the following sections, we provide steps to upgrade the Hive Metastore schema using MySQL as the backend. For any other backend (such as MariaDB, Oracle, or SQL Server), update the commands accordingly.
There are two options to upgrade the Amazon EMR Hive Metastore:
- Upgrade the Hive Metastore schema from 2.X to 3.X by using the Hive Schema Tool
- Migrate the Hive Metastore to the AWS Glue Data Catalog
We walk through the steps for both options.
Pre-upgrade prerequisites
Before upgrading the Hive Metastore, you must complete the following prerequisite steps:
- Verify the Hive Metastore database is running and accessible.
You should be able to run Hive DDL and DML queries successfully. Any errors or issues must be fixed before proceeding with the upgrade process. Use the following sample queries to test the database:
- To get the Metastore schema version in the current EMR 5.X cluster, run the following command on the primary node:
The following code shows our sample output:
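The Schema Tool reports the schema version through its `-info` option (typically invoked as `hive --service schematool -dbType mysql -info`). As a minimal sketch, the following Python parses output of that shape into a dictionary. The sample text, including the `example-host` endpoint and the version numbers, is illustrative and not captured from a real cluster.

```python
# Illustrative -info output; host, user, and versions are placeholders.
SAMPLE_INFO_OUTPUT = """\
Metastore connection URL:\t jdbc:mysql://example-host:3306/hive?createDatabaseIfNotExist=true
Metastore Connection Driver :\t org.mariadb.jdbc.Driver
Metastore connection User:\t hive
Hive distribution version:\t 2.3.0
Metastore schema version:\t 2.3.0
schemaTool completed
"""


def parse_schema_info(text):
    """Return the 'key: value' lines of Schema Tool -info output as a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so JDBC URLs stay intact.
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, such as "schemaTool completed"
            info[key.strip().lower()] = value.strip()
    return info


info = parse_schema_info(SAMPLE_INFO_OUTPUT)
print(info["metastore schema version"])
```

Recording the parsed values before the upgrade gives you a baseline to compare against during post-upgrade validation.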
- Stop the Metastore service and restrict access to the Metastore MySQL database.
It’s essential that no one else accesses or modifies the contents of the Metastore database while you’re performing the schema upgrade.
To stop the Metastore, use the following commands:
For Amazon EMR release 5.30 and 6.0 onwards (Amazon Linux 2 is the operating system for the Amazon EMR 5.30+ and 6.x release series), use the following commands:
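The two command styles differ because earlier releases manage services with upstart, whereas Amazon Linux 2 uses systemd. The following sketch picks the style from a release label. The service names `hive-server2` and `hive-hcatalog-server` and the `emr-X.Y.Z` label format are assumptions here; confirm them on your cluster before running anything.

```python
import re

# Assumed EMR service names for HiveServer2 and the Hive Metastore.
HIVE_SERVICES = ("hive-server2", "hive-hcatalog-server")


def uses_systemd(release_label):
    """True for EMR 5.30+ and 6.0+, which run on Amazon Linux 2."""
    match = re.match(r"emr-(\d+)\.(\d+)", release_label)
    if not match:
        raise ValueError(f"unrecognized release label: {release_label!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (5, 30)


def service_commands(release_label, action="stop"):
    """Build one shell command per Hive service for stop, start, or status."""
    if uses_systemd(release_label):
        return [f"sudo systemctl {action} {svc}" for svc in HIVE_SERVICES]
    return [f"sudo {action} {svc}" for svc in HIVE_SERVICES]


for command in service_commands("emr-5.29.0"):
    print(command)
```

The same helper can produce the `start` and `status` commands needed after the upgrade by passing a different `action`.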
You can also note the total number of databases and tables present in the Hive Metastore before the upgrade, and verify the number of databases and tables after the upgrade.
- To get the total number of tables and databases before the upgrade, run the following commands after connecting to the external Metastore database (assuming the Hive Metadata DB name is hive):
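In the Metastore schema, databases are recorded in the `DBS` table and tables in `TBLS`, so the counts come down to two `SELECT COUNT(*)` queries. The sketch below runs them through Python's DB-API. It uses an in-memory SQLite database with simplified stand-in tables so that it is self-contained; against the real Metastore you would pass a MySQL connection instead.

```python
import sqlite3


def metastore_counts(conn):
    """Count rows in DBS (databases) and TBLS (tables)."""
    cur = conn.cursor()
    counts = {}
    for label, table in (("databases", "DBS"), ("tables", "TBLS")):
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        counts[label] = cur.fetchone()[0]
    return counts


# Stand-in for the Metastore database; columns are reduced to the minimum.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT)")
conn.execute("CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER, TBL_NAME TEXT)")
conn.executemany("INSERT INTO DBS VALUES (?, ?)", [(1, "default"), (2, "sales")])
conn.executemany(
    "INSERT INTO TBLS VALUES (?, ?, ?)",
    [(1, 1, "t1"), (2, 2, "orders"), (3, 2, "items")],
)

counts_before = metastore_counts(conn)
print(counts_before)
```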
- Take a backup or snapshot of the Hive database.
This allows you to revert any changes made during the upgrade process if something goes wrong. If you’re using Amazon Relational Database Service (Amazon RDS), refer to Backing up and restoring an Amazon RDS instance for instructions.
- Take note of the Hive table storage location if data is stored in HDFS.
If all the table data is on Amazon Simple Storage Service (Amazon S3), then no action is required. If HDFS is used as the storage layer for Hive databases and tables, then take note of them. You will need to copy the files on HDFS to a similar path on the new cluster, and then verify or update the location attribute for databases and tables on the new cluster accordingly.
Upgrade the Amazon EMR Hive Metastore schema with the Hive Schema Tool
In this approach, you use the persistent Hive Metastore on a remote database (Amazon RDS for MySQL or Amazon Aurora MySQL-Compatible Edition). The following diagram shows the upgrade procedure.
To upgrade the Amazon EMR Hive Metastore from 5.X (Hive version 2.X) to 6.X (Hive version 3.X), we can use the Hive Schema Tool. The Hive Schema Tool is an offline tool for Metastore schema manipulation. You can use it to initialize, upgrade, and validate the Metastore schema. Run the following command to show the available options for the Hive Schema Tool:
Be sure to complete the prerequisites mentioned earlier, including taking a backup or snapshot, before proceeding with the next steps.
- Note down the details of the existing Hive external Metastore to be upgraded.
This includes the RDS for MySQL endpoint host name, database name (for this post, hive), user name, and password. You can do this through one of the following options:
- Get the Hive Metastore DB information from the Hive configuration file: log in to the EMR 5.X primary node, open the file /etc/hive/conf/hive-site.xml, and note the four properties:
- Get the Hive Metastore DB information from the Amazon EMR console: navigate to the EMR 5.X cluster, choose the Configurations tab, and note down the Metastore DB information.
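For the configuration-file option, the connection details can be pulled out programmatically. This sketch assumes the four properties are the standard JDO connection settings (`javax.jdo.option.ConnectionURL`, `ConnectionDriverName`, `ConnectionUserName`, and `ConnectionPassword`); the sample XML and every value in it are placeholders.

```python
import xml.etree.ElementTree as ET

# Assumed to be the four properties that describe the Metastore database.
METASTORE_PROPERTIES = (
    "javax.jdo.option.ConnectionURL",
    "javax.jdo.option.ConnectionDriverName",
    "javax.jdo.option.ConnectionUserName",
    "javax.jdo.option.ConnectionPassword",
)

# Placeholder content standing in for /etc/hive/conf/hive-site.xml.
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://old-hostname:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.mariadb.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>old-username</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>example-password</value>
  </property>
  <property>
    <name>hive.execution.engine</name>
    <value>tez</value>
  </property>
</configuration>"""


def read_metastore_properties(xml_text):
    """Return the Metastore connection properties found in hive-site XML."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name in METASTORE_PROPERTIES:
            found[name] = prop.findtext("value")
    return found


props = read_metastore_properties(SAMPLE_HIVE_SITE)
print(props["javax.jdo.option.ConnectionURL"])
```

To read the real file, pass its contents to `read_metastore_properties`, and treat the password value as a secret.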
- Create a new EMR 6.X cluster.
To use the Hive Schema Tool, we need to create an EMR 6.X cluster. You can create a new EMR 6.X cluster through the Amazon EMR console or the AWS Command Line Interface (AWS CLI), without specifying external Hive Metastore details. This lets the EMR 6.X cluster launch successfully using the default Hive Metastore. For more information about EMR cluster management, refer to Plan and configure clusters.
- After your new EMR 6.X cluster is launched successfully and is in the waiting state, SSH to the EMR 6.X primary node and take a backup of /etc/hive/conf/hive-site.xml:
- Stop Hive services:
Now you update the Hive configuration and point it to the old Hive Metastore database.
- Modify /etc/hive/conf/hive-site.xml and update the properties with the values you collected earlier:
- In the same or a new SSH session, run the Hive Schema Tool to check that the Metastore is pointing to the old Metastore database:
The output should look as follows (old-hostname, old-dbname, and old-username are the values you changed):
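The hive-site.xml change that this check confirms can also be scripted rather than edited by hand. The following is a rough sketch that replaces property values in a hive-site document; the default values shown for the new cluster (`new-cluster-primary`, `hive`) are invented for illustration, and the property names are assumed to be the standard JDO connection settings.

```python
import xml.etree.ElementTree as ET

URL = "javax.jdo.option.ConnectionURL"
USER = "javax.jdo.option.ConnectionUserName"

# Placeholder for the default hive-site.xml on the new EMR 6.X cluster.
DEFAULT_HIVE_SITE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://new-cluster-primary:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
</configuration>"""


def set_properties(xml_text, updates):
    """Return hive-site XML with the given property values replaced."""
    root = ET.fromstring(xml_text)
    remaining = dict(updates)
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name in remaining:
            prop.find("value").text = remaining.pop(name)
    if remaining:
        # Fail loudly rather than silently skipping a property.
        raise KeyError(f"properties not found: {sorted(remaining)}")
    return ET.tostring(root, encoding="unicode")


updated = set_properties(
    DEFAULT_HIVE_SITE,
    {
        URL: "jdbc:mysql://old-hostname:3306/old-dbname?createDatabaseIfNotExist=true",
        USER: "old-username",
    },
)
print(updated)
```

Note that `xml.etree.ElementTree` discards XML comments by default, which is one more reason to keep the backup of the original file.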
You can upgrade the Hive Metastore by passing the -upgradeSchema option to the Hive Schema Tool. The tool figures out the SQL scripts required to initialize or upgrade the schema and then runs those scripts against the backend database.
- Run the upgradeSchema command with -dryRun, which only lists the SQL scripts needed during the actual run:
The output should look like the following code. It shows the Metastore upgrade path from the old version to the new version. You can find the upgrade order on the GitHub repo. In case of failure during the upgrade process, these scripts can be run manually in the same order.
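To illustrate what "the same order" means, the sketch below chains upgrade scripts so that each one starts at the version the previous one produced. It assumes the `upgrade-<from>-to-<to>.mysql.sql` naming convention used for Hive's MySQL upgrade scripts; the file names listed are examples, so check the actual script directory for the real set.

```python
import re

SCRIPT_PATTERN = re.compile(
    r"upgrade-(?P<src>[\d.]+)-to-(?P<dst>[\d.]+)\.mysql\.sql$"
)


def upgrade_path(script_names, current, target):
    """Order upgrade scripts into a chain leading from current to target."""
    steps = {}
    for name in script_names:
        match = SCRIPT_PATTERN.match(name)
        if match:
            steps[match.group("src")] = (match.group("dst"), name)

    path, version = [], current
    while version != target:
        # Stop on a missing link or on a cycle that never reaches the target.
        if version not in steps or len(path) >= len(steps):
            raise ValueError(f"no upgrade path from {current} to {target}")
        version, name = steps[version]
        path.append(name)
    return path


# Example file names following the assumed convention, in arbitrary order.
scripts = [
    "upgrade-3.0.0-to-3.1.0.mysql.sql",
    "upgrade-2.2.0-to-2.3.0.mysql.sql",
    "upgrade-2.3.0-to-3.0.0.mysql.sql",
]
for script in upgrade_path(scripts, "2.3.0", "3.1.0"):
    print(script)
```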
- To upgrade the Hive Metastore schema, run the Hive Schema Tool with -upgradeSchema:
The output should look like the following code:
In case of any issues or failures, you can run the preceding command with the -verbose option. This prints all the queries being run in order and their output.
If you encounter any failures during this process and you want to upgrade your Hive Metastore by running the SQL yourself, refer to Upgrading Hive Metastore.
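The Schema Tool invocations in this section differ only in their flags, so they can be assembled by one small helper. This is a sketch under the assumption that the tool is launched as `hive --service schematool` (possibly with `sudo`, depending on your setup); it only builds the argument list and does not run anything.

```python
# Schema Tool actions referenced in this post, plus -initSchema and -validate.
VALID_ACTIONS = {"-info", "-initSchema", "-upgradeSchema", "-validate"}


def schematool_argv(action, db_type="mysql", dry_run=False, verbose=False):
    """Build the argument list for one Hive Schema Tool invocation."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    argv = ["hive", "--service", "schematool", "-dbType", db_type, action]
    if dry_run:
        argv.append("-dryRun")
    if verbose:
        argv.append("-verbose")
    return argv


# Print the sequence used in this section: check, dry run, then upgrade.
for argv in (
    schematool_argv("-info"),
    schematool_argv("-upgradeSchema", dry_run=True),
    schematool_argv("-upgradeSchema", verbose=True),
):
    print(" ".join(argv))
```

The resulting list can be handed to `subprocess.run` on the primary node once you have verified the command line against your Hive version.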
If HDFS was used as storage for the Hive warehouse or any Hive DB location, you need to update the NameNode alias or URI with the new cluster’s HDFS alias.
- Use the following commands to update the HDFS NameNode alias (replace <new-loc> and <old-loc> with the HDFS root location of the new and old clusters, respectively):
You can run the following command on any EMR cluster node to get the HDFS NameNode alias:
At first, you can run with the dryRun option, which displays all the changes without persisting them. For example:
However, if the new location needs to be changed to a different HDFS or S3 path, then use the following approach.
First, connect to the remote Hive Metastore database and run the following query to pull all the tables for a specific database and list the locations. Replace HiveMetastore_DB with the database name used for the Hive Metastore in the external database (for this post, hive) and the Hive database name (default):
Identify the tables for which the location needs to be updated. Then run the ALTER TABLE command to update the table locations. You can prepare a script or chain of ALTER TABLE commands to update the locations for multiple tables.
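Preparing a chain of ALTER TABLE commands is easy to automate once you have the list of tables and locations. The sketch below generates `ALTER TABLE ... SET LOCATION` statements for tables whose location starts with an old prefix. The table list, the `old-namenode` URI, and the `example-bucket` path are all hypothetical.

```python
def relocate(location, old_prefix, new_prefix):
    """Swap old_prefix for new_prefix, or return None if it does not apply."""
    if location.startswith(old_prefix):
        return new_prefix + location[len(old_prefix):]
    return None


def alter_table_statements(tables, old_prefix, new_prefix):
    """Generate one ALTER TABLE statement per table under old_prefix."""
    statements = []
    for db_name, table_name, location in tables:
        new_location = relocate(location, old_prefix, new_prefix)
        if new_location is not None:
            statements.append(
                f"ALTER TABLE `{db_name}`.`{table_name}` "
                f"SET LOCATION '{new_location}';"
            )
    return statements


# Hypothetical (database, table, location) rows from the Metastore query.
tables = [
    ("default", "orders", "hdfs://old-namenode:8020/user/hive/warehouse/orders"),
    ("default", "events", "s3://example-bucket/events"),
]
statements = alter_table_statements(
    tables, "hdfs://old-namenode:8020", "s3://example-bucket/warehouse"
)
for statement in statements:
    print(statement)
```

Review the generated statements before running them in Hive. Partition locations are stored separately in the Metastore and are not covered by this sketch.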
- Start and check the status of Hive Metastore and HiveServer2:
Post-upgrade validation
Perform the following post-upgrade steps:
- Confirm the Hive Metastore schema is upgraded to the new version:
The output should look like the following code:
- Run the following Hive Schema Tool command to query the Hive schema version and verify that it’s upgraded:
- Run some DML queries against old tables and ensure they run successfully.
- Verify the table and database counts using the same commands mentioned in the prerequisites section, and compare the counts.
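For the last check, a small comparison helper makes any mismatch explicit. This sketch assumes you recorded the pre-upgrade and post-upgrade counts as dictionaries; the numbers below are made up.

```python
def count_differences(before, after):
    """Return {name: (before, after)} for every count that changed."""
    return {
        name: (before.get(name), after.get(name))
        for name in sorted(set(before) | set(after))
        if before.get(name) != after.get(name)
    }


# Hypothetical counts recorded before and after the schema upgrade.
before = {"databases": 2, "tables": 3}
after = {"databases": 2, "tables": 3}

differences = count_differences(before, after)
print("counts match" if not differences else f"mismatch: {differences}")
```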
The Hive Metastore schema migration process is complete, and you can start working on your new EMR cluster. If for some reason you want to relaunch the EMR cluster, you just need to provide the Hive Metastore remote database that we upgraded in the previous steps using the options on the Amazon EMR Configurations tab.
Migrate the Amazon EMR Hive Metastore to the AWS Glue Data Catalog
The AWS Glue Data Catalog is flexible and reliable, and can reduce your operation cost. Moreover, the Data Catalog supports different versions of EMR clusters. Therefore, when you migrate your Amazon EMR 5.X Hive Metastore to the Data Catalog, you can use the same Data Catalog with any new EMR 5.8+ cluster, including Amazon EMR 6.x. There are some factors you should consider when using this approach; refer to Considerations when using AWS Glue Data Catalog for more information. The following diagram shows the upgrade procedure.
To migrate your Hive Metastore to the Data Catalog, you can use the Hive Metastore migration script from GitHub. The following are the major steps for a direct migration.
Make sure all the table data is stored in Amazon S3 and not HDFS. Otherwise, tables migrated to the Data Catalog will have the table location pointing to HDFS, and you can’t query the table. You can check your table data location by connecting to the MySQL database and running the following SQL:
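As a sketch of such a check, the query below joins the Metastore tables `TBLS`, `SDS`, and `DBS` to list each table with its storage location, and the helper keeps only the rows that still point to HDFS. It runs against an in-memory SQLite stand-in with simplified columns so that it is self-contained; the sample rows are invented.

```python
import sqlite3

LOCATION_QUERY = """
SELECT d.NAME, t.TBL_NAME, s.LOCATION
FROM TBLS t
JOIN SDS s ON t.SD_ID = s.SD_ID
JOIN DBS d ON t.DB_ID = d.DB_ID
ORDER BY d.NAME, t.TBL_NAME
"""


def tables_on_hdfs(conn):
    """Return (database, table, location) rows whose location is on HDFS."""
    return [
        row
        for row in conn.execute(LOCATION_QUERY)
        if (row[2] or "").startswith("hdfs://")
    ]


# Stand-in for the Metastore database with invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DBS (DB_ID INTEGER, NAME TEXT);
CREATE TABLE SDS (SD_ID INTEGER, LOCATION TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER, DB_ID INTEGER, SD_ID INTEGER, TBL_NAME TEXT);
INSERT INTO DBS VALUES (1, 'default');
INSERT INTO SDS VALUES (10, 'hdfs://old-namenode:8020/user/hive/warehouse/orders');
INSERT INTO SDS VALUES (11, 's3://example-bucket/events');
INSERT INTO TBLS VALUES (100, 1, 10, 'orders');
INSERT INTO TBLS VALUES (101, 1, 11, 'events');
""")

hdfs_tables = tables_on_hdfs(conn)
for row in hdfs_tables:
    print(row)
```

An empty result means every table location is already outside HDFS.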
Make sure to complete the prerequisite steps mentioned earlier before proceeding with the migration. Ensure the EMR 5.X cluster is in a waiting state and the status of all components is in service.
- Note down the details of the existing EMR 5.X cluster Hive Metastore database to be upgraded.
As mentioned before, this includes the endpoint host name, database name, user name, and password. You can do this through one of the following options:
- Get the Hive Metastore DB information from the Hive configuration file: log in to the Amazon EMR 5.X primary node, open the file /etc/hive/conf/hive-site.xml, and note the four properties:
- Get the Hive Metastore DB information from the Amazon EMR console: navigate to the Amazon EMR 5.X cluster, choose the Configurations tab, and note down the Metastore DB information.
- On the AWS Glue console, create a connection to the Hive Metastore as a JDBC data source.
Use the connection JDBC URL, user name, and password you gathered in the previous step. Specify the VPC, subnet, and security group associated with your Hive Metastore. You can find these on the Amazon EMR console if the Hive Metastore is on the EMR primary node, or on the Amazon RDS console if the Metastore is an RDS instance.
- Download two extract, transform, and load (ETL) job scripts from GitHub and upload them to an S3 bucket:
If you configured AWS Glue to access Amazon S3 from a VPC endpoint, you must upload the script to a bucket in the same AWS Region where your job runs.
Now you must create a job on the AWS Glue console to extract metadata from your Hive Metastore and migrate it to the Data Catalog.
- On the AWS Glue console, choose Jobs in the navigation pane.
- Choose Create job.
- Select Spark script editor.
- For Options, select Upload and edit an existing script.
- Choose Choose file and upload the import_into_datacatalog.py script you downloaded earlier.
- Choose Create.
- On the Job details tab, enter a job name (for example, Import-Hive-Metastore-To-Glue).
- For IAM Role, choose a role.
- For Type, choose Spark.
- For Glue version, choose Glue 3.0.
- For Language, select Python 3.
- For Worker type, choose G.1X.
- For Requested number of workers, enter 2.
- In the Advanced properties section, for Script filename, enter import_into_datacatalog.py.
- For Script path, enter the S3 path you used earlier (just the parent folder).
- Under Connections, choose the connection you created earlier.
- For Python library path, enter the S3 path you used earlier for the file hive_metastore_migration.py.
- Under Job parameters, enter the following key-value pairs:
--mode: from-jdbc
--connection-name: EMR-Hive-Metastore
--region: us-west-2
- Choose Save to save the job.
- Run the job on demand on the AWS Glue console.
If the job runs successfully, Run status should show as Succeeded. When the job is finished, the metadata from the Hive Metastore is visible on the AWS Glue console. Check the databases and tables listed to verify that they were migrated correctly.
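If you prefer to define the job in code rather than on the console, the same settings can be collected into a job definition. The following sketch only builds a dictionary; the field names follow our understanding of the AWS Glue CreateJob API (for example, as accepted by the boto3 `create_job` call), and the role ARN and bucket path are placeholders, so verify everything against the current API reference before use.

```python
def glue_job_definition(name, role_arn, script_dir, connection_name, region):
    """Collect the console settings from the steps above into one dict."""
    script_dir = script_dir.rstrip("/")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark job type
            "ScriptLocation": f"{script_dir}/import_into_datacatalog.py",
            "PythonVersion": "3",
        },
        "DefaultArguments": {
            # Assumed equivalent of the console's Python library path field.
            "--extra-py-files": f"{script_dir}/hive_metastore_migration.py",
            "--mode": "from-jdbc",
            "--connection-name": connection_name,
            "--region": region,
        },
        "Connections": {"Connections": [connection_name]},
        "GlueVersion": "3.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
    }


job = glue_job_definition(
    name="Import-Hive-Metastore-To-Glue",
    role_arn="arn:aws:iam::111122223333:role/ExampleGlueRole",  # placeholder
    script_dir="s3://example-bucket/scripts/",  # placeholder
    connection_name="EMR-Hive-Metastore",
    region="us-west-2",
)
print(job["Command"]["ScriptLocation"])
```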
Known issues
In some cases, such as when the Hive Metastore schema version is on a very old release or some required metadata tables are missing, the upgrade process may fail. In this case, you can use the following steps to identify and fix the issue. Run the schemaTool upgradeSchema command with verbose as follows:
This prints all the queries being run in order and their output:
Note down the query and the error message, then take the required steps to address the issue. For example, depending on the error message, you may have to create the missing table or alter an existing table. Then you can either rerun the schemaTool upgradeSchema command, or you can manually run the remaining queries required for the upgrade. You can get the complete script that schemaTool runs from the path /usr/lib/hive/scripts/metastore/upgrade/mysql/ on the primary node, or from GitHub.
Clean up
Running additional EMR clusters to perform the upgrade activity in your AWS account may incur additional charges. When you complete the Hive Metastore upgrade successfully, we recommend deleting the additional EMR clusters to save cost.
Conclusion
To upgrade Amazon EMR from 5.X to 6.X and take advantage of some features from Hive 3.X or Spark SQL 3.X, you have to upgrade the Hive Metastore first. If you’re using the AWS Glue Data Catalog as your Hive Metastore, you don’t need to do anything because the Data Catalog supports both Amazon EMR versions. If you’re using a MySQL database as the external Hive Metastore, you can upgrade by following the steps outlined in this post, or you can migrate your Hive Metastore to the Data Catalog.
There are some functional differences between the different versions of Hive, Spark, and Flink. If you have applications running on Amazon EMR 5.X, be sure to test them in Amazon EMR 6.X and validate the function compatibility. We will cover application upgrades for Amazon EMR components in a future post.
About the authors
Jianwei Li is a Senior Analytics Specialist TAM. He provides consultant services for AWS enterprise support customers to design and build modern data platforms. He has more than 10 years of experience in the big data and analytics domain. In his spare time, he likes running and hiking.
Narayanan Venkateswaran is an Engineer in the AWS EMR group. He works on developing Hive in EMR. He has over 17 years of work experience in the industry across several companies including Sun Microsystems, Microsoft, Amazon, and Oracle. Narayanan also holds a PhD in databases with a focus on horizontal scalability in relational stores.
Partha Sarathi is an Analytics Specialist TAM at AWS based in Sydney, Australia. He brings 15+ years of technology expertise and helps enterprise customers optimize analytics workloads. He has worked extensively on both on-premises and cloud big data workloads, along with various ETL platforms, in his earlier roles. He also actively works on conducting proactive operational reviews around analytics services like Amazon EMR, Redshift, and OpenSearch.
Krish is an Enterprise Support Manager responsible for leading a team of specialists in EMEA focused on BigData & Analytics, Databases, Networking, and Security. He is also an expert in helping enterprise customers modernize their data platforms and inspiring them to implement operational best practices. In his spare time, he enjoys spending time with his family, travelling, and video games.