Encryption of data at rest is a highly desirable, and sometimes mandatory, requirement for data platforms in a range of industry verticals, including healthcare, financial services, and government. The capability increases security and protects sensitive data from a variety of attacks, whether internal or external to the platform.
Access to HDFS data can be managed by Apache Ranger HDFS policies, and audit trails help administrators monitor activity. However, any user with HDFS admin or root access on cluster nodes could impersonate the “hdfs” user and access sensitive data in clear text. HDFS encryption prevents access to clear text data. This approach bolsters data security and data privacy, protecting the sensitive and personal data whose exposure in an accidental or malicious breach would harm both the individuals concerned (customers, employees, partners) and the organization as a whole.
HDFS encryption delivers transparent end-to-end encryption of data at rest and is an integral part of HDFS. End-to-end encryption means that data is encrypted and decrypted only by the client; in other words, data remains encrypted until it reaches the HDFS client.
Each HDFS file is encrypted using its own encryption key. To prevent the management of these keys (which can number in the millions) from becoming a performance bottleneck, each file’s key is stored in that file’s metadata. To add another layer of security, the file encryption key is stored in encrypted form, wrapped by an “encryption zone key”.
Configuring this feature is relatively straightforward. It protects data by controlling decrypt access to HDFS data through key management policies handled by Ranger.
HDFS native encryption works in combination with solutions such as Protegrity tokenization, where encrypted data in HDFS can be tokenized and detokenized based on the policies defined by the Protegrity ESA server. What’s more, Ranger offers dynamic column masking features, including redacting, hashing, and masking, that can be applied on top of data that is already encrypted at rest in HDFS for an additional layer of security.
HDFS encryption combined with column masking by Ranger and/or Protegrity forms a complete solution in which data is fully protected: at rest, over the network, and with clear text access managed by authorization policies.
Encryption & Decryption Flow:
The way HDFS encrypts data is explained very well in the Cloudera documentation and many articles. However, I am going to go over the basic flow here:
- An HDFS encryption zone key (EZK) needs to be created to encrypt files in HDFS
- An HDFS encryption zone needs to be created; this is an empty HDFS folder associated with an EZK
- For every file created in or copied into an HDFS encryption zone, a data encryption key (DEK) is created
- Data in the file is encrypted with the DEK
- The DEK is encrypted using the EZK to produce an encrypted data encryption key (EDEK)
- Each file has an EDEK, which is stored in the file’s metadata
- Any attempt to access an encrypted file requires the user to have “DECRYPT” access on the corresponding EZK
- Running “hdfs dfs -cat” on the file triggers a Hadoop KMS API call to validate the “DECRYPT” access
- If the user has access to the EZK, the file’s EDEK is decrypted using the EZK
- The DEK is then used to decrypt the contents of the file and display them to the user
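The envelope pattern above (a DEK encrypts the data, an EZK wraps the DEK) can be sketched with ordinary openssl commands. This is only an illustration of the concept, not what HDFS does internally; every key name and file name below is made up for the example.

```shell
# 1. An "encryption zone key" (EZK) -- here just a random 256-bit value
openssl rand -hex 32 > ezk.key

# 2. A per-file "data encryption key" (DEK)
openssl rand -hex 32 > dek.key

# 3. Encrypt the file contents with the DEK
echo "sensitive record" > file.txt
openssl enc -aes-256-cbc -pbkdf2 -pass file:dek.key -in file.txt -out file.enc

# 4. Wrap the DEK with the EZK to produce the EDEK (HDFS stores this in file metadata)
openssl enc -aes-256-cbc -pbkdf2 -pass file:ezk.key -in dek.key -out edek.bin

# 5. To read: unwrap the EDEK with the EZK, then decrypt the file with the recovered DEK
openssl enc -d -aes-256-cbc -pbkdf2 -pass file:ezk.key -in edek.bin -out dek.recovered
openssl enc -d -aes-256-cbc -pbkdf2 -pass file:dek.recovered -in file.enc -out file.dec

diff file.txt file.dec && echo "round trip OK"
```

In real HDFS, steps 4 and 5 are carried out by the KMS, so the EZK never leaves the key server.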
The following diagram shows how the HDFS client invokes Key provider API to decrypt the EDEK and gain access to the file contents:
The following diagram shows how the EZK, DEK, EDEK are related to each other:
The HDFS native encryption capability relies entirely on the Ranger KMS service, which is central to creating encryption zone keys and to the authorization policies that grant ENCRYPT and DECRYPT access on those keys.
Because of its crucial role, installing and configuring Ranger KMS is the first step to enable HDFS Native Encryption.
Ranger KMS needs a backend infrastructure to store & retrieve encryption zone keys. Cloudera Manager offers two different options to install & operate Ranger KMS:
- Ranger KMS backed by an RDBMS
- Ranger KMS backed by Key Trustee Server
Cloudera Manager makes it straightforward to configure either.
Enable HDFS Data at Rest Encryption:
On the Cloudera Manager (CM) UI, click on “Clusters” and then click on the cluster name (in our case, the cluster name is “mycdp”).
Click on the “Actions” drop down, and click on “Set up HDFS Data at Rest Encryption” as shown below:
The “Set up HDFS Data at Rest Encryption” operation in the Cloudera Manager UI will prompt you to pick one of three choices:
(1) Ranger KMS with RDBMS
(2) Ranger KMS with KTS
(3) File based Keystore
The following screenshot shows all the three options mentioned above:
Assuming the option “Ranger Key Management Service backed by Key Trustee Server” in the above screenshot is chosen, Cloudera Manager prompts you to complete a few prerequisites and presents choices on how the KTS infrastructure can be stood up.
The screenshot below indicates these strong recommendations, to be implemented before enabling HDFS Data at Rest Encryption:
- Enable Kerberos security
- Enable TLS/SSL
The following are the two choices for how the KTS infrastructure can be set up:
- Add a dedicated cluster for KTS (this keeps the KTS infrastructure outside the cluster and is a best practice as well)
- Install KTS using parcels (this requires downloading the parcels from archive.cloudera.com and configuring them in CM)
Once KTS is in place through one of the above two choices:
- Add KTS as a service by selecting the “Add Service” option in the Cloudera Manager UI
- Add the Ranger KMS with Key Trustee Server service by selecting “Add Service” in the Cloudera Manager UI
In this document, the option of installing KTS as a service inside the cluster is chosen, since the additional nodes needed to create a dedicated cluster of KTS servers are not available in our demo system.
Parcels Configuration for KTS:
Download the parcels for KTS, as they are not part of the CDP parcels.
$ wget https://username:firstname.lastname@example.org/p/keytrusteeserver7/22.214.171.124/parcels/KEYTRUSTEE_SERVER-126.96.36.199-1.keytrustee188.8.131.52.p0.3050880-el7.parcel
$ wget https://username:email@example.com/p/keytrusteeserver7/184.108.40.206/parcels/KEYTRUSTEE_SERVER-220.127.116.11-1.keytrustee18.104.22.168.p0.3050880-el7.parcel.sha
Copy the parcel files into /opt/cloudera/parcel-repo folder on the Cloudera Manager server.
Once the files are copied, change their ownership to the user “cloudera-scm”.
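The copy and ownership change can be scripted as below. This is a sketch: the parcel file names are stand-ins for the files you actually downloaded, and a local directory is used in place of /opt/cloudera/parcel-repo so the commands can be tried safely (the chown is shown as a comment, since it only applies on a real CM host).

```shell
# Stage KTS parcel files for Cloudera Manager.
# On a real CM host, PARCEL_REPO would be /opt/cloudera/parcel-repo.
PARCEL_REPO=${PARCEL_REPO:-./parcel-repo}
mkdir -p "$PARCEL_REPO"

# Stand-ins for the downloaded parcel and checksum files
touch KEYTRUSTEE_SERVER-demo.parcel KEYTRUSTEE_SERVER-demo.parcel.sha

# Copy both the parcel and its .sha checksum into the repo
cp KEYTRUSTEE_SERVER-*.parcel KEYTRUSTEE_SERVER-*.parcel.sha "$PARCEL_REPO"/

# On a real CM host, the files must be owned by the cloudera-scm user:
# chown cloudera-scm:cloudera-scm /opt/cloudera/parcel-repo/KEYTRUSTEE_SERVER-*

ls "$PARCEL_REPO"
```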
Now, in CM, click on Parcels, and click on “Check for New Parcels”. You should see a new parcel, “KEYTRUSTEE_SERVER”.
Distribute and activate the KEYTRUSTEE_SERVER parcel. Once it is activated, you will see the status “Distributed, Activated” on the Parcels page:
Installation & Configuration:
Now, select the “Add Service” option in Cloudera Manager, and select Key Trustee Server. Select hosts for the active and passive KTS servers.
Check entropy using the command :
$ cat /proc/sys/kernel/random/entropy_avail
Entropy should be greater than 500. If the available entropy is low, you must increase it by installing additional software packages; otherwise, subsequent cryptographic operations can take a long time.
To determine the amount of available entropy on the target machines, run these commands:
ssh firstname.lastname@example.org cat /proc/sys/kernel/random/entropy_avail

If the result is below 500, you may want to consider the following workaround: installing an entropy generator such as rng-tools. Consult the security policies, procedures, and practices in your organization before proceeding.

Install rng-tools:

yum install rng-tools      # For CentOS/RHEL 6, 7+ systems
apt-get install rng-tools  # For Debian systems
zypper install rng-tools   # For SLES systems

For CentOS/RHEL 6, Debian, and SLES systems:

echo 'EXTRAOPTIONS="-r /dev/urandom"' >> /etc/sysconfig/rngd
service rngd start
chkconfig rngd on
cat /proc/sys/kernel/random/entropy_avail

For CentOS/RHEL 7+ systems:

cat /proc/sys/kernel/random/entropy_avail
cp /usr/lib/systemd/system/rngd.service /etc/systemd/system/
sed -i -e 's|ExecStart=/sbin/rngd -f|ExecStart=/sbin/rngd -f -r /dev/urandom|' /etc/systemd/system/rngd.service
systemctl daemon-reload
systemctl start rngd
systemctl status rngd
# If the status command shows the service is loaded and enabled, skip the following step
systemctl enable rngd
Generate the private key on the active KTS by running the following command:
[root@ccycloud-4 ~]# ktadmin init
INFO:keytrustee.server.util:Creating self-signed cert
INFO:keytrustee.util:`/usr/bin/openssl req -nodes -new -days 3650 -subj /C=US/ST=TX/L=Austin/CN=ccycloud-4.cdpvcb.root.hwx.site/Eemail@example.com -x509 -out /tmp/tmpeKpVfQ.csr -keyout /tmp/tmpdO8Bnp.key`
INFO:keytrustee.server.util:Generating PGP key, this may take a while
Initialized directory for 4096R/0A9F3FEBEE343FEA839DA417F1516D034C0E2E78
Install “rsync” on both the active and passive KTS servers.
Run the following command to keep the private key of the active KTS in sync between the active and passive KTS:
[root@ccycloud-4 ~]# rsync -zav --exclude .ssl /var/lib/keytrustee/.keytrustee ccycloud-3.cdpvcb.root.hwx.site:/var/lib/keytrustee/
firstname.lastname@example.org's password:
sending incremental file list
.keytrustee/
.keytrustee/gpg.conf
.keytrustee/keytrustee.conf
.keytrustee/logging.conf
.keytrustee/pubring.gpg
.keytrustee/pubring.gpg~
.keytrustee/random_seed
.keytrustee/secring.gpg
.keytrustee/trustdb.gpg

sent 11,286 bytes  received 172 bytes  2,546.22 bytes/sec
total size is 12,317  speedup is 1.07
Initialize the Passive Key Trustee Server with the same private key. Ensure both ktadmin commands output the same initialized directory.
[root@ccycloud-4 ~]# ssh ccycloud-3.cdpvcb.root.hwx.site
email@example.com's password:
Last login: Thu Feb 25 19:47:10 2021 from 172.27.172.135
[root@ccycloud-3 ~]# ktadmin init
INFO:keytrustee.server.util:Creating self-signed cert
INFO:keytrustee.util:`/usr/bin/openssl req -nodes -new -days 3650 -subj /C=US/ST=TX/L=Austin/CN=ccycloud-3.cdpvcb.root.hwx.site/Efirstname.lastname@example.org -x509 -out /tmp/tmpVxzRKB.csr -keyout /tmp/tmpC6NvaB.key`
Initialized directory for 4096R/0A9F3FEBEE343FEA839DA417F1516D034C0E2E78
[root@ccycloud-3 ~]#
The Initialized directory values must be identical on both active and passive KTS.
Install Ranger KMS with KTS Backend:
On Cloudera Manager UI, click on Add Service, and choose “Ranger KMS with Key Trustee Server”.
Setup Authorization Secret
This step creates an organization and retrieves the “auth_secret” value for this Ranger KMS with Key Trustee Server to use. An organization is required to register with the Key Trustee Server.
The following screenshot indicates where to enter the “Org Name” and where the generated “auth_secret” is to be entered.
Enter a name for “Org name”, say, “qa-test” (you can choose any name here), and proceed further.
Switch to the primary Key Trustee Server and run the following commands:
[root@ccycloud-4 ~]# keytrustee-orgtool add -n qa-test -c root@localhost
Dropped privileges to keytrustee
2021-02-26 12:32:53,561 - keytrustee.server.orgtool - INFO - Adding organization to database
2021-02-26 12:32:53,564 - keytrustee.server.orgtool - INFO - Initializing random secret
2021-02-26 12:32:53,584 - keytrustee.server.util - ERROR - An exception of type error occurred. Arguments:(111, 'Connection refused'). This probably happened because there is no Mail Transfer Agent setup. You will not receive any emails you were to receive from the Key Trustee Server.
[root@ccycloud-4 ~]# keytrustee-orgtool list
Dropped privileges to keytrustee
{
  "qa-test": {
    "auth_secret": "ZJ76qlaTev6ehyP/D9GJ/Q==",
    "contacts": [
      "root@localhost"
    ],
    "creation": "2021-02-26T12:32:53",
    "expiration": "9999-12-31T15:59:59",
    "key_info": null,
    "name": "qa-test",
    "state": 0,
    "uuid": "I8eCm6jxihRwJmFJJkthYk9CUgFf10o94dYsgTWPxHB"
  }
}
Copy the “auth_secret” value above and enter it on the Cloudera Manager screen where it asks for “auth_secret”, as shown in the screenshot above. CM then takes you to the next page, “Setup TLS for Ranger KMS for Key Trustee Server”.
Here, the best practice is to enable TLS across all nodes of the CDP cluster with certificates signed by a well-known Certificate Authority. However, we can continue without enabling TLS for the purpose of this blog. For the same reason, I have also chosen to run both the “Active KeyTrustee server” and the “Ranger KMS with KTS” on the same host for the sake of simplicity.
Upon clicking Next, you will be prompted to review your changes. Choose Kerberos as the authentication method, then click Next to complete the installation.
If Ranger KMS with KTS is not started automatically, start the service along with any other stale services.
In case Ranger KMS does not start, please go through the following logs:
- Cloudera Manager agent logs at /var/log/cloudera-scm-agent/ on the host where Ranger KMS is installed
- Ranger KMS server logs at /var/log/ranger/kms/
Install & Configure Ranger KMS with RDBMS Backend:
Ranger KMS installation with an RDBMS backend is much simpler. The prerequisite for this option is an installed and configured RDBMS. In this article, we will provide instructions on how to install and configure a MySQL instance as a backend for Ranger KMS.
Install & Configure RDBMS:
In case option 1 above is chosen, the following instructions help to stand up an RDBMS instance that can act as a backend to Ranger KMS. Some customers might use the same RDBMS that backs the Hive or Ranger metastore. However, it is a best practice to have a dedicated RDBMS for Ranger KMS, since it stores sensitive information such as encryption zone keys and the master secret key.
Ranger KMS supports MySQL and PostgreSQL as well as Oracle. In this article, we will install a dedicated MySQL instance as a backend for Ranger KMS.
Run the below command to install MySQL 5.7 from the internet. If you don’t have internet access from the cluster Linux host, download the file and SCP it onto the Linux host.
$ yum localinstall https://dev.mysql.com/get/mysql57-community-release-el7-11.noarch.rpm
$ yum install -y mysql-community-server
$ systemctl enable mysqld
$ systemctl start mysqld

The initial root password is written to the MySQL log file and can be found as follows:

[root@ccycloud-1 ~]# grep 'temporary password' /var/log/mysqld.log
2021-02-16T02:50:34.638064Z 1 [Note] A temporary password is generated for root@localhost: E;Pm;YgNp3zh
[root@ccycloud-1 ~]#
Run the following command to enter the default password and change it to a new password by following the prompts:
Once the root password is changed, log in and create a database and user for Ranger KMS. In this example, the MySQL root password is set to “Hadoop_123”.
CREATE DATABASE rangerkms DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
GRANT ALL ON rangerkms.* TO 'rangerkms'@'%' IDENTIFIED BY 'Hadoop_123';
GRANT ALL ON rangerkms.* TO 'rangerkms'@'localhost' IDENTIFIED BY 'Hadoop_123';
Download and install the MySQL Java connector jar:
$ wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.46.tar.gz
$ tar zxvf mysql-connector-java-5.1.46.tar.gz
$ sudo mkdir -p /usr/share/java/
$ cd mysql-connector-java-5.1.46
$ sudo cp mysql-connector-java-5.1.46-bin.jar /usr/share/java/mysql-connector-java.jar
Create /usr/share/java folder on all the hosts:
$ for i in $(cat hosts);do echo $i;ssh $i 'mkdir -p /usr/share/java';done
Copy the mysql connector jar to the /usr/share/java folder on all the hosts:
$ for i in $(cat hosts);do echo $i;scp /root/mysql-connector-java-5.1.46/mysql-connector-java-5.1.46-bin.jar $i:/usr/share/java/mysql-connector-java.jar;done
Install Ranger KMS:
Click on the “Add Service” option in Cloudera Manager UI:
and choose “Ranger KMS” as shown below:
Configure the Ranger KMS backend database using the MySQL instance, along with the user, the database, and the user’s access to the database.
The below screenshot shows the configuration page along with a successful test connection.
If the cluster is Kerberized, choose “Kerberos” as the authentication method; otherwise, choose “simple”.
Also, enter the “Ranger KMS Master Key Password” and save this password. This field is not auto-populated; the user must enter the master secret password.
Follow the prompts on Cloudera Manager to complete the installation and start Ranger KMS.
If the Ranger KMS service does not start, look into the Cloudera Manager agent logs on the host at /var/log/cloudera-scm-agent, or the Ranger KMS logs at /var/log/ranger/kms.
Usually, the cause is connectivity with the backend database (MySQL). Verify the host name and port number of the database server, and confirm that the database user has sufficient privileges on the database created for storing the encryption keys.
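Two quick checks can help narrow this down, assuming the database name, user, and port used earlier in this article (“rangerkms”, “rangerkms”, 3306); replace <mysql-host> with your database host:

```shell
# Is the MySQL port reachable from the Ranger KMS host?
timeout 3 bash -c '</dev/tcp/<mysql-host>/3306' && echo "port reachable"

# Can the rangerkms user log in and see its database?
mysql -h <mysql-host> -u rangerkms -p -e 'SHOW DATABASES LIKE "rangerkms";'
```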
Ranger User Roles:
Once Ranger KMS is started, go to Ranger service, open up the Ranger Web UI.
Please note that the Web UI URL is the same for both the Ranger UI and the Ranger KMS UI. Depending on the user’s role, the URL takes the user to either the Ranger UI or the Ranger KMS UI. This ensures separation of duties between the information security team, which manages encryption keys and KMS policies, and the cluster admin or data steward team, which manages Ranger Hive, HDFS, and HBase policies.
Here are the roles users can have in Ranger / Ranger KMS :
Users with the KeyAdmin role can log in to the Ranger KMS UI, create encryption keys, and create KMS policies that define which users and groups can decrypt files in encryption zones.
When Ranger KMS is installed and configured, Cloudera Manager asks for the password of the “keyadmin” user. Please save this password so that you can log in to the Ranger KMS UI later.
Configure Ranger KMS service:
Log in to the Ranger KMS UI with the keyadmin user credentials; the URL would look something like this: http://ccycloud-4.cdpvcb.root.h:6080/login.jsp
Click on the edit button of the KMS policy service, and modify the KMS URL in the “Config Properties” section.
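The value expected in that field is the KMS provider URI. To the best of my knowledge, the usual format in CDP is as follows, with the TLS variant on a different port; treat the exact host and port values as assumptions to verify against your own deployment:

```
# Non-TLS Ranger KMS (hypothetical host name)
kms://http@<ranger-kms-host>:9292/kms

# With TLS enabled
kms://https@<ranger-kms-host>:9494/kms
```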
Click on “save” to save the changes.
Validate HDFS Data Encryption:
HDFS data encryption at rest works with the construct of “encryption zones”. An encryption zone is an HDFS folder in which all files, including those in sub-folders, are encrypted using an encryption zone key.
An encryption zone is created by associating an empty HDFS folder with an encryption zone key.
Create Encryption key:
To create the encryption key, the administrator needs to log in to the Ranger KMS UI as the “keyadmin” user or any user with the “keyadmin” role.
- Click on Encryption button at the top
- Click on Key Manager
- Select the service “cm_kms” from drop down menu
- Click on “Add Key” button
Click on “Save”, and you have successfully created an encryption key.
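If you prefer the command line, keys can also be created with the hadoop key utility by a user that has the corresponding KMS privileges; the key name below matches the one used later in this article:

```shell
# Create a 256-bit encryption zone key through the configured KMS
hadoop key create myenckey -size 256

# Verify the key is visible
hadoop key list
```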
Create Encryption Zone:
Log in to one of the cluster nodes, and kinit as the “hdfs” user or any user that has privileges for the “Get Metadata” and “Generate EEK” operations.
In our case, the “hdfs” user has access to create keys as shown in the screenshot below, so we will kinit as “hdfs” and try to create an encryption zone.
[root@ccycloud-1 ~]# hdfs crypto -createZone -keyName myenckey -path /user/anatva/protected
Added encryption zone /user/anatva/protected
[root@ccycloud-1 ~]# hdfs crypto -listZones
/user/anatva/protected  myenckey
Since the “hdfs” user does not have access to decrypt the key, it cannot write or read files in the encryption zone:
[root@ccycloud-1 ~]# hdfs dfs -put /etc/passwd /user/anatva/protected/
put: User:hdfs not allowed to do 'DECRYPT_EEK' on 'myenckey'
However, since the user “anatva” has the “DECRYPT_EEK” privilege, anatva should be able to read files within the encryption zone:
[anatva@ccycloud-1 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_1002
Default principal: anatva@EXAMPLE.COM

Valid starting       Expires              Service principal
02/23/2021 20:01:01  02/24/2021 20:01:01  krbtgt/EXAMPLE.COM@EXAMPLE.COM

[anatva@ccycloud-1 ~]$ hdfs dfs -cat /user/anatva/protected/passwd | head -3
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
Replication of Encrypted Data:
With third-party encryption systems, simple read/write operations may not decrypt/encrypt the data automatically. Replication of encrypted data between two on-prem clusters, or between on-prem and cloud storage, usually fails with mismatched file checksums if the encryption keys differ between the source and destination clusters. To make distributed copy work, either use the “skipcrccheck” flag, or maintain the same encryption key on source and destination, which is not recommended.
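For instance, a copy of third-party-encrypted data that would otherwise fail on checksum comparison can be run with the flag mentioned above (cluster addresses and paths are illustrative):

```shell
hadoop distcp -update -skipcrccheck \
    hdfs://source-nn:8020/data/encrypted \
    hdfs://target-nn:8020/data/encrypted
```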
With HDFS native encryption, read/write operations on files within encryption zones automatically decrypt/encrypt the data, provided the user has “DECRYPT_EEK” access on the encryption zone key. The replication process (e.g., distributed copy) automatically decrypts data read from the source and encrypts data written to the target cluster. While the file checksums don’t match in this scenario either, it allows for different encryption keys on source and target.
Cloudera strongly recommends that customers enable encryption of data at rest, as it protects sensitive data within the enterprise against external as well as internal threats. Since Cloudera supports running the Key Trustee Server cluster outside the main cluster, it can be managed by information security teams on separate hardware and, if required, on a separate network.
The post HDFS Data Encryption at Rest on Cloudera Data Platform appeared first on Cloudera Blog.