A common way to restrict resource usage on a Hadoop cluster is to set quotas. Recently, I ran into a problem where I could not upload a small file to an empty HDFS directory for which a space quota of 200 MB was set.
I won’t go into the details of using quotas in Hadoop, but here is the gist. Hadoop differentiates between two kinds of quotas: name quotas and space quotas. The former limits the number of file and (sub)directory names, whereas the latter limits the HDFS “disk” space of a directory (i.e. the number of bytes used by files under the tree rooted at that directory).
You can set HDFS name quotas with the command
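hadoop dfsadmin -setQuota <max_number_of_names> <directory>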
and you can set HDFS space quotas with the command
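hadoop dfsadmin -setSpaceQuota <max_size_in_bytes> <directory>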
To clear quotas, use -clrQuota and -clrSpaceQuota, respectively.

So much for the introduction. Recently, I stumbled upon a problem where Hadoop (version 0.20.2) reported a quota violation for a directory for which a space quota of 200 MB (209,715,200 bytes) was set but no name quota:
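The report below was produced with hadoop fs -count -q; the values are reconstructed from the figures discussed in this post rather than copied verbatim:

hadoop fs -count -q /user/kiran
none    inf    209715200    209715200    1    0    0    /user/kiran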
The output columns for fs -count -q are: QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME.

The directory /user/kiran was empty, and not a single byte was used yet (according to the quota report above). However, when I tried to upload a very small file to the directory, the upload failed with a (space) quota violation. Clearly, I thought, such a small file could not have exceeded the 200 MB space quota. I also checked whether Hadoop’s Trash feature could be the culprit, but that wasn’t the case.
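For illustration, the failed upload was nothing more than a plain copy into the directory, along these lines (the local file name is just a placeholder):

hadoop fs -copyFromLocal small-file.txt /user/kiran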
Eventually, the mystery was solved: Hadoop checks space quotas during space allocation, i.e. when blocks are allocated rather than when the actual bytes are written. This means that the HDFS block size (here: 128 MB) and the replication factor of the file (here: 3, i.e. the default value in the cluster set by the dfs.replication property) play an important role. In my case, this is what seems to have happened: when I tried to copy the local file to HDFS, Hadoop figured it would need a single block of 128 MB to store the small file. With replication factored in, the total space would be 3 * 128 MB = 384 MB, and this would violate the 200 MB space quota. I verified this by manually overriding the default replication factor of 3 with a factor of 1 for the upload, which worked successfully.
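A sketch of that verification, overriding the replication factor on the command line (the local file name is again just a placeholder):

hadoop fs -D dfs.replication=1 -copyFromLocal small-file.txt /user/kiran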
To be honest, I was a bit puzzled at first because, remembering Tom White’s Hadoop book, a file in HDFS that is smaller than a single HDFS block does not occupy a full block’s worth of underlying storage (i.e. a kind of “sparse” use of storage). The catch is that the quota check happens up front, at allocation time, when Hadoop does not yet know how small the file will turn out to be, so it has to assume the worst case of a full block per replica.
References: http://hadoop.apache.org and Michael G. Noll