In an enterprise on-premises data center, we may have multiple secured Hadoop clusters serving different purposes. Sometimes these clusters run different Hadoop distributions or versions, or are even located in different data centers. To fulfill business requirements, data synchronization between these clusters can be an important mechanism. However, in the real world a secured multi-cluster setup is far more complicated than a distcp between two non-secured clusters running the same Hadoop version.
We would like to walk through our experience enabling live data synchronization across multiple Kerberos-enabled Hadoop clusters, including functionality verification, multi-cluster configuration, and the automated setup process. After that, we share the use cases among these Kerberos-federated Hadoop clusters. Finally, we provide our common practices for multi-cluster data synchronization.
6. Data Synchronization
Data synchronization is the process of establishing consistency among data from a source to a target data storage and vice versa and the continuous harmonization of the data over time.
- From Wikipedia "Data synchronization"
7. One-way file synchronization
Updated files are copied from source to destination
Two-way file synchronization
Updated files are copied in both directions
Dropbox, SafeSync, etc.
14. DistCp with the same Hadoop version is trivial.
With a different version, it is a little bit tricky.
15. Oops …
[root@tw-spnhadoop1 hadooppet]# hadoop distcp hdfs://cluster1/test hdfs://krb-1.spn.lab.trendnet.org:8020/test
15/01/22 15:11:44 INFO tools.DistCp: srcPaths=[hdfs://cluster1/test]
15/01/22 15:11:44 INFO tools.DistCp: destPath=hdfs://krb-1.spn.lab.trendnet.org:8020/test
15/01/22 15:11:45 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 381 for hdfs on ha-hdfs:cluster1
15/01/22 15:11:45 INFO security.TokenCache: Got dt for hdfs://cluster1; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:cluster1, Ident: (HDFS_DELEGATION_TOKEN token 381 for hdfs)
15/01/22 15:11:46 ERROR security.UserGroupInformation: PriviledgedActionException as:hdfs/tw-spnhadoop1.spn.tw.trendnet.org@ISPN.TRENDMICRO.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(null): org.apache.hadoop.ipc.RPC$VersionMismatch
15/01/22 15:11:46 INFO security.UserGroupInformation: Initiating logout for hdfs/tw-spnhadoop1.spn.tw.trendnet.org@ISPN.TRENDMICRO.COM
15/01/22 15:11:46 INFO security.UserGroupInformation: Initiating re-login for hdfs/tw-spnhadoop1.spn.tw.trendnet.org@ISPN.TRENDMICRO.COM
15/01/22 15:11:50 ERROR security.UserGroupInformation: PriviledgedActionException as:hdfs/tw-spnhadoop1.spn.tw.trendnet.org@ISPN.TRENDMICRO.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(null): org.apache.hadoop.ipc.RPC$VersionMismatch
15/01/22 15:11:50 WARN security.UserGroupInformation: Not attempting to re-login since the last re-login was attempted less than 600 seconds before.
15/01/22 15:11:53 ERROR security.UserGroupInformation: PriviledgedActionException as:hdfs/tw-spnhadoop1.spn.tw.trendnet.org@ISPN.TRENDMICRO.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(null): org.apache.hadoop.ipc.RPC$VersionMismatch
16. Apache Hadoop 2.0 (Cluster1) to Apache Hadoop 2.6 (Cluster2)
$ hadoop distcp hftp://cluster1_nn:50070/test hdfs://cluster2_nn:8020/test
HftpFileSystem is a read-only FileSystem, so DistCp must be run on the destination cluster
17. TMH6 Cluster1 (CDH based) to TMH7 Cluster2 (Apache based)
$ hadoop distcp ????://TMH6_NN:????/test hdfs://TMH7_NN:8020/test
18. TMH6 Cluster1 (CDH based) to TMH7 Cluster2 (Apache based)
$ hadoop distcp hftp://TMH6_NN:50070/test hdfs://TMH7_NN:8020/test
Only supports data sync from TMH6 to TMH7
19. DistCp with a different Hadoop version is a little bit tricky; plus Kerberos security, it is annoying!!
21. DistCp Data Copy Matrix: HDP1/HDP2 to HDP2
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.1/bk_system-admin-guide/content/distcp-table.html
WebHDFS is an HTTP REST API that supports the complete FileSystem interface for HDFS
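Because WebHDFS is read-write and works across Hadoop versions, it can be used on either side of the copy. A minimal sketch, assuming the NameNodes expose WebHDFS on the default HTTP port 50070 (the host names are the placeholder ones from the previous slides):
$ # run on the destination (TMH7) cluster, reading the source over webhdfs
$ hadoop distcp webhdfs://TMH6_NN:50070/test hdfs://TMH7_NN:8020/test
$ # webhdfs is also writable, so the reverse direction works too (run on TMH6)
$ hadoop distcp webhdfs://TMH7_NN:50070/test hdfs://TMH6_NN:8020/test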
22. DistCp Data Copy Matrix: TMH6/TMH7 to TMH6/TMH7
[Matrix of which protocol (hdfs, hftp, or webhdfs) to use for each TMH6/TMH7 source-to-destination combination, both insecure and secure]
24. Hadoop Security with Kerberos
Kerberos is a computer network authentication protocol which works on the basis of 'tickets' to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.
- From Wikipedia "Kerberos (protocol)"
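As a quick illustration of the ticket model (our own example, not from the deck; the keytab path and principal are hypothetical), a client first obtains a ticket and can then talk to the secured cluster:
$ kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/machine1.cluster1.domain.com@CLUSTER1.DOMAIN.COM
$ klist    # shows the ticket-granting ticket that was just issued
$ hdfs dfs -ls /    # HDFS access now authenticates via that ticket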
27. Kerberos Federation for Hadoop
Kerberos Setting
• Set a different REALM in each cluster's KDC
• Add both clusters' Kerberos information to configs (see the krb5.conf sketch after this slide)
• Add federated Kerberos principals to both KDC DBs
• Restart Kerberos services
Hadoop Setting
• Add Hadoop configurations
• Make sure both clusters' nodes can recognize each other
• Restart necessary Hadoop services
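A minimal sketch of the "add both clusters' Kerberos information to configs" step, as an /etc/krb5.conf excerpt on every node of both clusters. The realm and domain names follow the illustrative examples used later in this deck; the KDC host names are assumptions:
/etc/krb5.conf (excerpt)
[realms]
 CLUSTER1.DOMAIN.COM = {
  kdc = kdc.cluster1.domain.com
  admin_server = kdc.cluster1.domain.com
 }
 CLUSTER2.DOMAIN.COM = {
  kdc = kdc.cluster2.domain.com
  admin_server = kdc.cluster2.domain.com
 }
[domain_realm]
 .cluster1.domain.com = CLUSTER1.DOMAIN.COM
 .cluster2.domain.com = CLUSTER2.DOMAIN.COM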
28. Multi-Cluster Kerberos Federation
For each of Cluster1, Cluster2, ... Cluster N:
• Set a different REALM in each cluster's KDC
• Add all other clusters' Kerberos information to the configuration
• Add all federated Kerberos principals to the KDC DB
• Add Hadoop configurations
• Make sure all cluster nodes can recognize each other
• Restart necessary services
29. DistCp with a different Hadoop version plus Kerberos federation in a cross-DC multi-cluster setup is not easy. Done!!
30. DistCp with a different Hadoop version plus Kerberos federation in cross-DC multi-clusters is not easy at all.
34. Computing Resource
• Principle
– Avoid any production service impact while many DistCp jobs are running
• Strategy
– Run DistCp on the Staging Env. instead of the Production Env. (see the throttling sketch below)
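On top of that (our addition, not from the slides), DistCp itself can be throttled so a copy job never starves production workloads: -m caps the number of map tasks and -bandwidth caps MB/s per map. The paths and limits below are made up for illustration:
$ # at most 20 maps, each limited to 10 MB/s
$ hadoop distcp -m 20 -bandwidth 10 webhdfs://TMH6_NN:50070/data hdfs://TMH7_NN:8020/data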
37. Zero-downtime
• Principle
– No Production Env. downtime
• Strategy
– Change the KDC REALM in Staging only
– Rolling restart services (see the sketch after this list)
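A rough sketch of what "rolling restart services" could mean in practice for the DataNodes, restarting one node at a time so HDFS stays available (the host names and the 60-second wait are assumptions; the service name matches slide 51):
$ for node in machine1.cluster1.domain.com machine2.cluster1.domain.com machine3.cluster1.domain.com; do
>   ssh $node "service hadoop-hdfs-datanode restart"
>   sleep 60    # let the DataNode re-register before touching the next one
> done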
38. Schedule Limitation
• Principle
– Provide the minimum dataset that fulfills production service requirements
• Strategy
– Divide the dataset into cold data and hot data
– All necessary hot data needs to be ready before the service moves to the new DC
45. Kerberos Cross-Realm Federation
• Set a different REALM in each cluster's KDC
• Add both clusters' Kerberos information to configs
• Add federated Kerberos principals to both KDC DBs
• Add Hadoop configurations (see the core-site.xml sketch after this list)
• Make sure both clusters' nodes can recognize each other
• Restart necessary services
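This extract does not reproduce the "Add Hadoop configurations" slide, so the following is only a sketch of the kind of change that is typically required: hadoop.security.auth_to_local rules in core-site.xml so that principals from the remote realm map to local short names (realm names follow the deck's examples; the rules themselves are our assumption):
core-site.xml (excerpt, on both clusters)
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](.*@CLUSTER1\.DOMAIN\.COM)s/@.*//
    RULE:[1:$1@$0](.*@CLUSTER2\.DOMAIN\.COM)s/@.*//
    RULE:[2:$1@$0](.*@CLUSTER1\.DOMAIN\.COM)s/@.*//
    RULE:[2:$1@$0](.*@CLUSTER2\.DOMAIN\.COM)s/@.*//
    DEFAULT
  </value>
</property>
Depending on the client version, dfs.namenode.kerberos.principal.pattern in hdfs-site.xml may also need to be relaxed (for example to *) before cross-realm DistCp will connect.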
48. Add federated Kerberos principals to both KDC DBs
$ kadmin.local: addprinc -e "rc4-hmac:normal des3-hmac-sha1:normal" krbtgt/CLUSTER1.DOMAIN.COM@CLUSTER2.DOMAIN.COM
WARNING: no policy specified for krbtgt/CLUSTER1.DOMAIN.COM@CLUSTER2.DOMAIN.COM; defaulting to no policy
Enter password for principal "krbtgt/CLUSTER1.DOMAIN.COM@CLUSTER2.DOMAIN.COM": // 123456
Re-enter password for principal "krbtgt/CLUSTER1.DOMAIN.COM@CLUSTER2.DOMAIN.COM": // 123456
Principal "krbtgt/CLUSTER1.DOMAIN.COM@CLUSTER2.DOMAIN.COM" created.
$ kadmin.local: addprinc -e "rc4-hmac:normal des3-hmac-sha1:normal" krbtgt/CLUSTER2.DOMAIN.COM@CLUSTER1.DOMAIN.COM
WARNING: no policy specified for krbtgt/CLUSTER2.DOMAIN.COM@CLUSTER1.DOMAIN.COM; defaulting to no policy
Enter password for principal "krbtgt/CLUSTER2.DOMAIN.COM@CLUSTER1.DOMAIN.COM": // 654321
Re-enter password for principal "krbtgt/CLUSTER2.DOMAIN.COM@CLUSTER1.DOMAIN.COM": // 654321
Principal "krbtgt/CLUSTER2.DOMAIN.COM@CLUSTER1.DOMAIN.COM" created.
Use the same password for each principal on both KDCs to make sure the encryption key is the same
50. Make sure both cluster nodes can recognize each other
• /etc/hosts for both cluster1 and cluster2 nodes
10.1.145.1 machine1.cluster1.domain.com
10.1.145.2 machine2.cluster1.domain.com
10.1.145.3 machine3.cluster1.domain.com
10.1.144.1 machine1.cluster2.domain.com
10.1.144.2 machine2.cluster2.domain.com
10.1.144.3 machine3.cluster2.domain.com
51. Restart necessary services
• KDC server
– service krb5kdc restart
– service kadmin restart
• Namenodes, Datanodes
– service hadoop-hdfs-namenode restart
– service hadoop-hdfs-datanode restart
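With everything restarted, one way we can verify the federation end to end (a hypothetical check, not from the slides; the principal, keytab and paths are illustrative, and it assumes both clusters run compatible Hadoop versions, otherwise substitute webhdfs as discussed earlier):
$ kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/machine1.cluster1.domain.com@CLUSTER1.DOMAIN.COM
$ hdfs dfs -ls hdfs://machine1.cluster2.domain.com:8020/     # cross-realm read against cluster2
$ hadoop distcp hdfs://machine1.cluster1.domain.com:8020/test hdfs://machine1.cluster2.domain.com:8020/test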