2017 might have just started, but I’ve already noticed a trend that I believe will be huge this year. Many of the people I talk with who are using Hadoop & friends are curious about HDFS Federation.
Here are a few of the questions I hear:
How can we use HDFS Federation to extend our current Hadoop environment?
Is there any way to offload some of the workload from our NameNode to help speed it up?
Or my favorite……
We originally thought we were boxed in with our Hadoop architecture but now with HDFS Federation our cluster has more flexibility.
So what is HDFS Federation? First we need to level set on how the NameNode and Namespace work in HDFS.
How the NameNode Works in Hadoop
The HDFS architecture is a master/slave architecture: the NameNode is the leader and the DataNodes are the followers. Before data is written to HDFS, the client must first go through the NameNode, which indexes the file and tracks which DataNodes hold each of its blocks. The DataNodes are responsible for storing the data blocks, but have no clue about the other DataNodes or the files those blocks belong to. So if the NameNode falls off the end of the earth, you’re in trouble, because what good are the data blocks without the index?
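To make that split of responsibilities concrete, here is a minimal toy sketch in Python. It is not how HDFS is implemented: the class names, the tiny block size, and the round-robin placement are all assumptions for illustration. The only point is that the DataNodes hold anonymous blocks while the NameNode holds the index that turns those blocks back into a file.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Stores raw blocks by ID; knows nothing about files or other DataNodes."""
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes


@dataclass
class NameNode:
    """Holds only the index: which blocks make up a file and where they live."""
    index: dict = field(default_factory=dict)  # path -> [(block_id, datanode_name)]

    def record(self, path, block_id, datanode_name):
        self.index.setdefault(path, []).append((block_id, datanode_name))


def write_file(namenode, datanodes, path, data, block_size=4):
    """Split data into blocks, place them round-robin, record each placement."""
    for n, start in enumerate(range(0, len(data), block_size)):
        block_id = f"{path}#blk{n}"          # hypothetical block naming scheme
        node = datanodes[n % len(datanodes)]
        node.blocks[block_id] = data[start:start + block_size]
        namenode.record(path, block_id, node.name)


def read_file(namenode, datanodes, path):
    """Reassemble a file by following the NameNode's index to each block."""
    by_name = {d.name: d for d in datanodes}
    return b"".join(by_name[dn].blocks[bid] for bid, dn in namenode.index[path])


nodes = [DataNode("dn1"), DataNode("dn2")]
nn = NameNode()
write_file(nn, nodes, "/sales/q1.csv", b"0123456789")
restored = read_file(nn, nodes, "/sales/q1.csv")
```

If you swap in a fresh, empty NameNode and call read_file again, the lookup fails with a KeyError even though every block is still sitting on the DataNodes. That is the "blocks without the index" problem in miniature.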
HDFS not only stores the data, but also provides the file system that users and clients use to access it. For example, in my Hadoop environment I have Sales and Marketing data I want to logically separate. So I would set up two different directories and populate subdirectories in each, depending on the data. It is just like the setup on your own workstation, where Pictures and Documents live in different directories or file folders. The key is that this structure is stored as metadata, and the NameNode in HDFS retains all of it.
HDFS Namespace
The NameNode is also responsible for the HDFS namespace in the Hadoop environment. The namespace is hierarchical: all files and directories follow a tree structure. The namespace gives users the structure they need to traverse the file system. Imagine an organized toolbox with all the tools laid out in a structured way. Once the tools are used, they are put back in the same place.
Back to our workstation example: on Windows, the “C” drive is the top-level directory and everything else on the computer resides under it. Try to create another “Program Files” directory directly under it and you will get an error stating that the name already exists. However, if you drop down one level into another directory, you can create a “Program Files” there, because its full path would be C:/Program Files/Program Files.
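Here is a small Python sketch of that rule, assuming a namespace modeled as nested dictionaries. The Namespace class and its mkdir method are hypothetical names for illustration, not part of any HDFS API. A name only has to be unique among its siblings, so the same name can appear again one level down.

```python
class Namespace:
    """Toy hierarchical namespace: each directory maps child names to subtrees."""

    def __init__(self):
        self.root = {}

    def mkdir(self, path):
        parts = [p for p in path.split("/") if p]
        if not parts:
            raise ValueError("cannot create the root directory")
        node = self.root
        for name in parts[:-1]:
            node = node[name]  # raises KeyError if a parent directory is missing
        if parts[-1] in node:
            # Same name, same parent: the tree cannot hold both.
            raise FileExistsError(path)
        node[parts[-1]] = {}


ns = Namespace()
ns.mkdir("/Program Files")                # first one at the top level: fine
ns.mkdir("/Program Files/Program Files")  # same name one level down: also fine
```

Calling ns.mkdir("/Program Files") a second time raises FileExistsError, which mirrors the "name already exists" error from the desktop example.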
As more and more data pours into HDFS, the NameNode begins to outgrow its compute and storage. Just like a hermit crab moving into a bigger shell, the NameNode has to move to bigger hardware again and again (a vicious and expensive cycle). What if we could begin using a scale-out architecture without having to re-architect the entire Hadoop environment? Well, this is where HDFS Federation helps big time.
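For a rough feel of why the NameNode outgrows its shell, here is a back-of-the-envelope estimate in Python. It assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode memory per namespace object (file, directory, or block). That figure is an assumption for illustration, not a measurement from my clusters, and real usage varies by Hadoop version and configuration.

```python
BYTES_PER_OBJECT = 150  # assumption: rough rule of thumb, not a measured value


def estimate_namenode_heap_bytes(files, blocks_per_file=1, directories=0):
    """Estimate NameNode memory as (files + blocks + directories) * bytes per object."""
    objects = files + files * blocks_per_file + directories
    return objects * BYTES_PER_OBJECT


# 10 million single-block files on one NameNode.
single = estimate_namenode_heap_bytes(10_000_000)

# The same files split evenly across two namespaces, each with its own NameNode.
per_namenode = estimate_namenode_heap_bytes(5_000_000)
```

Under these assumptions the single NameNode needs about 3 GB just for metadata, and splitting the namespace in two halves that to about 1.5 GB each. The absolute numbers matter less than the shape: metadata grows with the number of files and blocks, and on one NameNode the only way to keep up is a bigger box.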
Hadoop Federation to the Rescue
A little-known change in Hadoop 2.x was the addition of HDFS Federation. It is often confused with high availability (HA) for Hadoop clusters or with secondary NameNodes. However, HDFS Federation is something different: it allows a Hadoop cluster to add another NameNode with its own namespace. This federated NameNode has access to the same DataNodes and indexes the data stored on them, but only the data that flows through that NameNode.
For example, I have two NameNodes in my cluster, NN1 and NN2. NN1 will support all data in hdfs/data/… and NN2 will handle the hdfs/users directory. So as data from users and applications comes into my hdfs/data namespace, NN1 will index it and place it on the DataNodes. However, if an application connects to NN1 and tries to query data in the hdfs/users directory, it will get an error saying no known directory. For the application to query data in that namespace, it needs a connection to NN2. Think of HDFS Federation as adding a new cluster, in the form of a NameNode, while still using the same DataNodes for storage.
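The NN1/NN2 example can be sketched in a few lines of Python. Everything here is a simplified assumption: the FederatedNameNode class, the route helper, and the /data and /users prefixes are hypothetical stand-ins, not Hadoop code. In a real cluster, a client-side mount table such as Hadoop's ViewFs can play the role of route, but this sketch does not use its API.

```python
class FederatedNameNode:
    """Toy NameNode that only indexes paths under its own namespace prefix."""

    def __init__(self, name, prefix):
        self.name = name
        self.prefix = prefix.rstrip("/")
        self.files = set()

    def owns(self, path):
        return path == self.prefix or path.startswith(self.prefix + "/")

    def create(self, path):
        if not self.owns(path):
            raise FileNotFoundError(f"{self.name}: no known directory for {path}")
        self.files.add(path)

    def lookup(self, path):
        if not self.owns(path) or path not in self.files:
            raise FileNotFoundError(f"{self.name}: no known directory for {path}")
        return path


def route(mount_table, path):
    """Pick the NameNode whose prefix is the longest match for the path."""
    matches = [nn for nn in mount_table if nn.owns(path)]
    if not matches:
        raise FileNotFoundError(path)
    return max(matches, key=lambda nn: len(nn.prefix))


nn1 = FederatedNameNode("NN1", "/data")
nn2 = FederatedNameNode("NN2", "/users")
mount_table = [nn1, nn2]

# Route the write to whichever NameNode owns the path (NN2 here).
route(mount_table, "/users/alice/report.csv").create("/users/alice/report.csv")
owner = route(mount_table, "/users/alice/report.csv").name
```

Asking NN1 directly for /users/alice/report.csv raises FileNotFoundError, while the same lookup through NN2 succeeds. Both NameNodes would still be sharing one pool of DataNodes underneath, which the sketch leaves out.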
Benefits of HDFS Federation
Here are a few of the immediate benefits I see being played out with HDFS Federation in the Hadoop world.
- NameNode Dockerization – The ability to set up multiple NameNodes allows for a modular Hadoop architecture. As we start to move into a microservices world, we will see architectures that contain multiple NameNodes. Hadoop environments will have the ability to break down and spin up new NameNodes on the fly.
- Logically Separate Namespaces – For enterprise IT shops that do chargeback, HDFS Federation gives Hadoop administrators another tool for setting up multiple environments. These environments will still have the cost savings of a single Hadoop environment.
- Ease NameNode Bottlenecks – The pain of having all data indexed through a single NameNode can be eased by creating multiple NameNodes.
- Options for Tiering Performance – Segmenting NameNodes and namespaces by customer requirements, instead of setting up multiple complicated performance quotas, is now an option. Simply provision the NameNode specs and assign each customer to a NameNode based on their initial requirements.
One of the big reasons for HDFS Federation’s uptick this year is the growing adoption of Hadoop and the sheer amount of data being analyzed. More data, more problems, and those problems show up particularly at scale.
Final Thoughts on HDFS Federation
HDFS Federation is helping solve NameNode problems at scale. Since Hadoop’s 1.x release, the NameNode has been the soft underbelly of the architecture. It has continued to struggle with high availability, bottlenecks, and replication. The community is continually working on improving the NameNode, and HDFS Federation and the movement toward Virtualized/Dockerized Hadoop are helping to mitigate these issues. As the Hadoop community continues to innovate with projects like Kudu and others, look for HDFS Federation to play a bigger role.