Part 1 – the good stuff
We recently upgraded 4 clusters from CDH 6.1.2 to CDP 7.1.7 for one of our customers. Two main factors triggered this: support for CDH 6.1 was running out, and CDP brought new features. All our clusters are bare-metal, on-premise deployments. We have a multitenant setup with many projects that should be separated from each other (each one gets its own HDFS directory, YARN queue, Hive databases, HBase namespace, etc.).
The whole upgrade was carried out in place. This article covers the process from the administrator’s perspective: things that were good, things that were broken and things that changed behaviour. We will not cover any code changes required for applications running on the cluster (different default storage format for Impala, changes in Hive managed/external tables, etc.).
The first cluster upgrade took us a solid week. The last one, the production environment, we managed to do in about 8 hours (although spread over a whole week because of safety margins and the need to communicate downtime windows).
The good
There are many things Cloudera improved with the CDP release. There are, of course, new services, some of them vastly improving the admin experience (e.g. Streams Messaging Manager gives you a great overview of what is going on with Kafka, and Ranger finally gives you a graphical overview of permissions).
However, there are a few things that are more important in the upgrade than “new and flashy stuff”. These are the things Cloudera did right:
Sentry to Ranger migration
Cloudera mostly did a good job on permission migration (for Hive/Impala and Kafka). In CDP they just work out of the box. Granted, you will get a mess of permissions for Hive/Impala (6k+ entries in our case), which you will have to clean up, but you can do the clean-up later.
There are also some things that did not pan out nicely, but we will talk about these later.
Fair scheduler to Capacity scheduler – migration of queues
A Cloudera-provided tool migrates queues from the Fair Scheduler to the Capacity Scheduler quite nicely. All weights and structures are retained. There are some changes resulting from the Capacity Scheduler working differently.
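To make "weights are retained" more concrete, here is a minimal sketch of how relative Fair Scheduler weights of sibling queues can be expressed as percentage capacities, which in the Capacity Scheduler's percentage mode must sum to 100 among siblings. This is only an illustration and not the migration tool's actual algorithm; the function name and the queue names are made up for the example.

```python
from typing import Dict


def weights_to_capacities(weights: Dict[str, float]) -> Dict[str, float]:
    """Map sibling queue weights to percentage capacities that sum to 100.

    Hypothetical helper for illustration only; it does not reproduce what
    the Cloudera migration tool does internally.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    # Each queue gets a share proportional to its weight.
    caps = {name: round(100.0 * w / total, 2) for name, w in weights.items()}

    # Rounding can leave the siblings slightly off 100; give the
    # difference to the largest queue so the total is exact.
    drift = round(100.0 - sum(caps.values()), 2)
    largest = max(caps, key=caps.get)
    caps[largest] = round(caps[largest] + drift, 2)
    return caps


if __name__ == "__main__":
    # Two example tenants with a 3:1 weight ratio.
    print(weights_to_capacities({"projA": 3, "projB": 1}))
```

The ratio between queues stays the same; only the notation changes from relative weights to shares of the parent queue.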
Upgrade command itself
This is, in my opinion, the biggest advantage of upgrading to CDP 7. The upgrade tool is finally reliable and resumable, and overall a big improvement on previous releases. You can hit “back” in your browser, you can close the tab, you can even resume it from a different window.
This is a great improvement over previous versions, where closing the tab while the upgrade command was running could leave you with a very broken cluster stuck in the middle of the upgrade, and you would have to do all the remaining steps manually.
Hive-sre tool
If you are planning the upgrade, you will find a lot of talk about “changes in Hive”. There are basically 2 things here: managed to external tables migration and missing directories. Both can be identified with the hive-sre tool (you set it up to connect to your Metastore DB, run it with HDFS superuser credentials and it lists the problematic tables), available on Cloudera's GitHub: https://github.com/cloudera-labs/hive-sre.
Of the issues it reports, you must deal with the missing directories (leaving any of them in place can lead to an amusing upgrade failure). For managed tables, it is more of a performance measure: all of them will be migrated to external anyway, so it is better to do that in advance than to wait for it during the upgrade.
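To illustrate what a "missing directory" means here, the following is a minimal sketch that checks whether the HDFS location recorded for each table actually exists. It is not how hive-sre works internally. It assumes you already have a mapping of table names to locations (for example, exported from the Metastore by some other means); the function names and the example paths are hypothetical. The only external command used is `hdfs dfs -test -d`, which exits with 0 when the path is an existing directory.

```python
import subprocess
from typing import Callable, Dict, List


def hdfs_dir_exists(path: str) -> bool:
    """Return True if `path` is an existing HDFS directory."""
    # `hdfs dfs -test -d` exits with 0 only when the path is a directory.
    return subprocess.run(["hdfs", "dfs", "-test", "-d", path]).returncode == 0


def find_missing_locations(
    table_locations: Dict[str, str],
    exists: Callable[[str], bool] = hdfs_dir_exists,
) -> List[str]:
    """Return the names of tables whose recorded location does not exist.

    `table_locations` maps a table name to its location; how you obtain it
    is outside the scope of this sketch.
    """
    return sorted(
        table for table, location in table_locations.items()
        if not exists(location)
    )


if __name__ == "__main__":
    # Hypothetical example data; replace with your own export.
    tables = {
        "proj_a.events": "/data/proj_a/events",
        "proj_b.users": "/data/proj_b/users",
    }
    for name in find_missing_locations(tables):
        print(f"missing directory for table {name}")
```

The check is passed in as a parameter so the logic can be tried out without a cluster, by supplying a function that looks paths up in a plain set.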
In the next part, we will look at the things that Cloudera broke during the upgrade and how they can be remedied. Stay tuned for “The bad” part.