What happens when you need a duplicate file in two different locations?
It sounds like a trivial problem: you just need to copy that file to the new location. In Hadoop and HDFS you can copy files easily. You just have to understand how you want to copy, then pick the correct command. Let’s walk through all the different ways of copying data in HDFS.
HDFS dfs or Hadoop fs?
Many commands in HDFS are prefixed with hdfs dfs -[command] or hadoop fs -[command]. The two are not always interchangeable: hadoop fs is the generic file system shell and can address any file system Hadoop supports, while hdfs dfs is specific to HDFS. To ease the confusion, below I have broken down both the hdfs dfs and hadoop fs copy commands. My preference is to use the hdfs dfs prefix over hadoop fs.
Copy Data in HDFS Examples
The example commands assume my HDFS data is located in /user/thenson and my local files are in the /tmp directory (not to be confused with the HDFS /tmp directory). The example data is a loan data set from Kaggle. Using the same data set or file structure isn’t necessary; it’s just a frame of reference.
Hadoop fs Commands
Hadoop fs cp – Easiest way to copy data from one source directory to another. Use the hadoop fs -cp [source] [destination].
hadoop fs -cp /user/thenson/loan.csv /loan.csv
Hadoop fs copyFromLocal – Need to copy data from local file system into HDFS? Use the hadoop fs -copyFromLocal [source] [destination].
hadoop fs -copyFromLocal /tmp/loan.csv /user/thenson/loan.csv
Hadoop fs copyToLocal – Copying data from HDFS to local file system? Use the hadoop fs -copyToLocal [source] [destination].
hadoop fs -copyToLocal /user/thenson/loan.csv /tmp/
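After pulling a file down with copyToLocal, a quick sanity check is to compare byte counts on both sides. Here is a rough sketch of how that could look. The function name same_size and the HADOOP_CMD override are my own inventions for illustration (the override just lets you try the function without a cluster); only hadoop fs -stat %b and wc -c are real commands.

```shell
#!/usr/bin/env bash
# Sketch: confirm a local copy has the same byte count as the HDFS original.
# HADOOP_CMD is a hypothetical override so the function can be exercised
# without a cluster; by default it calls the real hadoop client.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

same_size() {
  local hdfs_path="$1" local_file="$2"
  local hdfs_bytes local_bytes
  # -stat %b prints the size of the HDFS file in bytes.
  hdfs_bytes=$("$HADOOP_CMD" fs -stat %b "$hdfs_path") || return 2
  # wc -c counts bytes; tr strips the padding some platforms add.
  local_bytes=$(wc -c < "$local_file" | tr -d '[:space:]')
  [ "$hdfs_bytes" = "$local_bytes" ]
}

# Example (requires a running cluster):
#   hadoop fs -copyToLocal /user/thenson/loan.csv /tmp/
#   same_size /user/thenson/loan.csv /tmp/loan.csv && echo "sizes match"
```

Matching sizes don't prove the contents are identical, but a mismatch is a cheap signal that the copy was cut short.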
HDFS dfs Commands
HDFS dfs cp – Easiest way to copy data from one source directory to another. The same as using hadoop fs -cp. Use the hdfs dfs -cp [source] [destination].
hdfs dfs -cp /user/thenson/loan.csv /loan.csv
HDFS dfs copyFromLocal – Need to copy data from local file system into HDFS? The same as using hadoop fs -copyFromLocal. Use the hdfs dfs -copyFromLocal [source] [destination].
hdfs dfs -copyFromLocal /tmp/loan.csv /user/thenson/loan.csv
HDFS dfs copyToLocal – Copying data from HDFS to local file system? The same as using hadoop fs -copyToLocal. Use the hdfs dfs -copyToLocal [source] [destination].
hdfs dfs -copyToLocal /user/thenson/loan.csv /tmp/loan.csv
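One thing to watch for: copyFromLocal fails if the destination already exists, unless you pass -f to overwrite it. If you would rather skip files that are already there, something along these lines could work. Treat it as a sketch: safe_put and the HDFS_CMD override are hypothetical names I made up for this example, while hdfs dfs -test -e and hdfs dfs -copyFromLocal are the real commands doing the work.

```shell
#!/usr/bin/env bash
# Sketch: upload a local file to HDFS only when the destination is absent.
# HDFS_CMD is a hypothetical override so the function can be tried
# without a cluster; by default it calls the real hdfs client.
HDFS_CMD="${HDFS_CMD:-hdfs}"

safe_put() {
  local src="$1" dest="$2"
  # -test -e exits 0 when the path exists in HDFS.
  if "$HDFS_CMD" dfs -test -e "$dest"; then
    echo "skip: $dest already exists"
    return 0
  fi
  "$HDFS_CMD" dfs -copyFromLocal "$src" "$dest" && echo "copied: $src -> $dest"
}

# Example (requires a running cluster):
#   safe_put /tmp/loan.csv /user/thenson/loan.csv
```

I chose to skip rather than overwrite here so that rerunning the same upload is harmless. Swap in -copyFromLocal -f if overwriting is what you want.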
Hadoop Cluster to Cluster Copy
Distcp used in Hadoop – Need to copy data from one cluster to another? Use Hadoop’s distributed copy (DistCp), which moves the data with a MapReduce job. For the command listed below, the original data exists on the namenode cluster in the /user/thenson directory and is being transferred to the newNameNode cluster. Make sure to use the full HDFS URL in the command. Command: hadoop distcp [source] [destination].
hadoop distcp hdfs://namenode:8020/user/thenson hdfs://newNameNode:8020/user/thenson
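If you copy the same path between clusters often, it can help to generate the command rather than type out both URLs each time. Below is a small sketch that only prints the command so you can review it before running it. The helper name distcp_cmd is hypothetical, and the 8020 default is simply the port from my example. The -update flag is a real DistCp option that copies only files that are missing or different on the target.

```shell
#!/usr/bin/env bash
# Sketch: print a distcp command for copying one path between two clusters.
# The port defaults to 8020 as an assumption; pass a fourth argument to change it.
distcp_cmd() {
  local src_nn="$1" dst_nn="$2" path="$3" port="${4:-8020}"
  # -update copies only files that are missing or different at the target.
  printf 'hadoop distcp -update hdfs://%s:%s%s hdfs://%s:%s%s\n' \
    "$src_nn" "$port" "$path" "$dst_nn" "$port" "$path"
}

# Example: print the command, review it, then run it yourself.
#   distcp_cmd namenode newNameNode /user/thenson
```

As far as I understand it, -update also changes how an existing target directory is treated: the contents of the source directory are synced into the target instead of the source directory being nested underneath it. Test on a small directory first.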
It’s the Scale that Matters
While copying data is a simple matter in most applications, everything in Hadoop is more complicated because of the scale. When copying data in HDFS, make sure you understand the use case and scale, then choose one of the commands above.
Interested in learning more HDFS commands? Check out my Top HDFS Commands post.