HDFS
ATTENTION! AS OF OCTOBER 2022 THE HADOOP FILE SYSTEM HAS BEEN PHASED OUT IN FAVOR OF CEPH
See this page for some basic commands that can be used with Ceph.
HDFS (Hadoop Distributed File System) is the chosen file system for our main storage.
Users should be concerned with two directories:
- /cms/store/ which hosts CMS MC and data files, and the output of CMS grid (i.e. CRAB) jobs. Files in this subdirectory can only be added or removed via grid proxy by anyone with proper credentials (read: anyone who has a valid grid certificate).
- /local/$USER which is meant for personal storage. It is important to remember that local users share the same resources as the rest of the world.
There are a few things that every user should be aware of:
- the central part of HDFS is the name node, which acts like a DNS server for the file system. Whenever a user tries to access a file on the file system, the request goes through the name node, which then points to the actual data blocks in the data nodes that make up the file;
- data blocks have a fixed size of 128 MiB;
- this means that the files stored on HDFS should be as big as possible. If the files stored there are very small and the data block that backs a file gets corrupted, any other files backed by the same data block would be rendered corrupted as well;
- if there are too many small files, HDFS rebalancing may take ages;
- HDFS has built-in redundancy, whereby every data block is backed up by a copy on a different data node;
- even though we have plenty of free storage in HDFS most of the time, it doesn't hurt to think twice when copying over huge datasets. For instance, if you plan to copy over 1 TB of data, it will take up 2 TB of physical storage. Tools like df report the physical storage and do not take the redundancy factor of HDFS into account;
- the file system supports only reads and writes, which means that:
  - files cannot be opened in append mode (i.e. you cannot "grow" or outright edit a file);
  - files cannot be given executable rights;
- there can be significant delays of O(10 s) between the creation of a file on HDFS and its "actual" creation. This behavior is not fully understood, but it probably happens more often during heavy loads on the file system, as these create a higher-latency environment and put more stress on the name node. For instance, if the user needs to check whether a file has been copied over to HDFS successfully, there must be a delay of at least 10 seconds between the copy operation and such a check (see the sketch after this list);
- HDFS is written in Java and needs a JVM to function.
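To illustrate the point about delayed visibility: if a script needs to verify that a copy to HDFS succeeded, the check can be wrapped in a simple polling loop. The sketch below is not part of our code base; the function name and timeouts are made up for illustration, and it uses the FUSE mount point for simplicity (the hdfs helpers described later would work just as well):
# minimal sketch: wait for a freshly copied file to become visible under /hdfs
import os
import time

def wait_until_visible(path, timeout = 60., poll_interval = 10.):
    """Poll for 'path' until it shows up or 'timeout' seconds have passed."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_interval)
    return False

# example: wait_until_visible(os.path.join('/hdfs/local', os.environ['USER'], 'somefile.txt'))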
The HDFS storage is mounted via FUSE to /hdfs so that it looks like a regular file system. Thus, every path that starts with /hdfs must go through FUSE. On the other hand, "pure" HDFS is completely unaware of the /hdfs mount point and would consider paths that start with /hdfs to be invalid. FUSE also makes the usage of POSIX tools (cd, mkdir, ls, rm, touch etc.) totally seamless.
Unfortunately, FUSE appears to perform suboptimally when executing ls on a directory that contains hundreds or even thousands of files. From what we know and understand, once the user executes ls on such a directory, FUSE tries to cache the file names in memory before passing them on to the user. When multiple users do this at the same time, FUSE may run out of memory. The situation becomes especially dire when ls --color=auto is called, since this forces FUSE to look up and cache additional file attributes such as file type and permissions.
It's not entirely clear which JVM options have been tried to gain performance and stability, and it's even more confusing how JVM settings could affect FUSE in the first place (since it's just a library written in C). All we know right now is that experimentation with such settings has not been fruitful and that FUSE should be avoided as much as possible. Besides, playing around with various JVM options is expensive, as it would require a complete restart of HDFS. This in turn implies significant downtime and is therefore prohibitive.
At the very minimum, users should avoid ls-ing a directory that contains hundreds or thousands of files with --color=auto enabled.
Fortunately, the Hadoop stack comes with alternative tools that can access files on HDFS almost like a regular file system. Each command starts with hdfs dfs followed by a POSIX-equivalent option.
The most common options that the users should be aware of are:
-ls [-d] [-h] [-R] [<path> ...]
-mkdir [-p] <path> ...
-rm [-f] [-r|-R] <src> ...
-copyFromLocal <localsrc> ... <dst>
-copyToLocal [-ignoreCrc] [-crc] <src> ... <localdst>
-du [-s] [-h] <path> ...
-count [path]
For example, hdfs dfs -ls /local/$USER does the same job as ls /hdfs/local/$USER, but the former skips FUSE altogether (which is the preferred choice). For more information, see hdfs dfs -help.
These tools are not very comfortable to use, though:
- there is no autocompletion of non-FUSEd paths, since vanilla bash only recognizes paths that are visible through regular mount points. It may be possible to create a custom autocompletion function that supports pure HDFS paths, but this needs to be investigated;
- there is no equivalent of POSIX find in the HDFS stack. The next best thing is to use hdfs dfs -ls -R. With or without FUSE, this is a bit prohibitive anyway in case the directory contains tens of thousands of files;
- the commands are very long to type.
Therefore, it makes sense to alias them to something shorter, e.g.
alias hls='hdfs dfs -fs hdfs://hdfs-nn:9000/ -ls' # example: hls /local/$USER
alias hcp='hdfs dfs -fs hdfs://hdfs-nn:9000/ -copyToLocal' # example: hcp /local/$USER/afile.txt /home/$USER/afile.txt
# (make up your own aliases)
All these aliases should be put into the ~/.bashrc file (and source it if you want these changes to take effect immediately). Note that the aliases also include an explicit file system URL, which may be useful if you're working on a host machine that has incorrect defaults.
As mentioned earlier, HDFS relies on the JVM, which in turn is set up by the JRE libraries. It turns out that the setup for grid computing uses a slightly different set of libraries which are incompatible with the HDFS stack. This mismatch of libraries basically comes down to one environment variable: JAVA_HOME. In a vanilla shell session this variable is unset and the above hdfs commands work, but the grid setup populates it with a value that breaks these commands. The solution is to temporarily unset JAVA_HOME when executing these commands:
alias hdfs='JAVA_HOME="" hdfs'
This line should also be placed in ~/.bashrc, together with the other aliases.
In the Python code we make extensive use of os.path.* functions, which all eventually go through FUSE. The reason is that these functions come down to basic syscalls such as stat(), which can only see "conventional" files accessible via regular mount points, and not pure HDFS files. Fortunately, we don't necessarily need FUSE here either, thanks to our hdfs Python module which automatically skips FUSE when appropriate. Since the module works as a wrapper for both HDFS and non-HDFS paths, the user is still expected to pass FUSE-compatible paths to its functions, as this is the only distinguishing feature in the path name that tells the module to make direct calls to HDFS instead of treating the path as a "regular" one.
The direct calls are possible only when an appropriate "session" has been established between the user and the HDFS service. In Python this can be done simply by executing:
from tthAnalysis.HiggsToTauTau.hdfs import hdfs
Note that only a single session is established per process, because creating a session is expensive (read: slow, as in seconds or more). Therefore, it is ill-advised to access the module from many individual processes at a time. In other words, don't call python -c "from tthAnalysis.HiggsToTauTau.hdfs import hdfs; hdfs..." in a loop, but try to make use of one hdfs instance as much as possible, as illustrated below.
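For instance, a script that needs to check many paths should import the module once and reuse the same instance for all calls within that process. A minimal sketch (the file names are made up for illustration):
# one hdfs session, created at import time, serves all subsequent calls in this process
import os
from tthAnalysis.HiggsToTauTau.hdfs import hdfs

paths = [ os.path.join('/hdfs/local', os.environ['USER'], 'file_%d.root' % i) for i in range(100) ]
existing = [ path for path in paths if hdfs.exists(path) ]  # no extra sessions are created here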
Important It turns out that importing ROOT overrides the default Java signal handlers, which in turn breaks the HDFS Java libraries (as explained here). This can be prevented by resetting the ROOT signal handlers with ROOT.gSystem.ResetSignals(). For some reason, importing ROOT also breaks the ability of argparse to properly parse sys.argv. The only viable solution I found is to temporarily set sys.argv to an empty array before import ROOT is issued. Since all of this implies quite a few lines of code that the user would have to keep in mind, I've encapsulated it all in python/safe_root.py.
Whenever the user needs to import ROOT, they should use
from tthAnalysis.HiggsToTauTau.safe_root import ROOT
instead.
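For reference, the gist of what safe_root.py does might look roughly like the sketch below; this is only an illustration of the steps described above, not the literal contents of the file:
# rough sketch of the idea behind safe_root.py
import sys

_argv_backup = sys.argv          # remember the real command line arguments
sys.argv = []                    # hide them so that argparse keeps working after the import
import ROOT
sys.argv = _argv_backup          # restore the original arguments

ROOT.gSystem.ResetSignals()      # undo ROOT's override of the (Java) signal handlers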
Here are some examples how to use the module:
- check if a file or directory exists on the file system:
# on HDFS
hdfs.exists(os.path.join('/hdfs/local', getpass.getuser()))
# True
# on regular file system
hdfs.exists(os.path.expanduser('~'))
# True
- check if a given path is a directory
# on HDFS
hdfs.isdir(os.path.join('/hdfs/local', getpass.getuser()))
# True
# on regular file system
hdfs.isdir(os.path.expanduser('~'))
# True
- check if a given path is a file
# on HDFS
hdfs.isfile(os.path.join('/hdfs/local', getpass.getuser()))
# False
# on regular file system
hdfs.isfile(os.path.expanduser('~'))
# False
- obtain file sizes in bytes (does not work with directories!)
# on HDFS
hdfs.getsize(os.path.join('/hdfs/local', getpass.getuser(), 'somefile.txt'))
# 10242
# on regular file system
hdfs.getsize(os.path.join('/home', getpass.getuser(), 'someotherfile.txt'))
# 15644
- check the file type and its size in a more efficient way (works only with HDFS paths!)
path_object = hdfs.get_path_info(os.path.join('/hdfs/local', getpass.getuser(), 'somefile.txt'))
path_object.isfile()
# True
path_object.size
# 10242
- get a list of files as a string array
# on HDFS
hdfs.listdir(os.path.join('/hdfs/local', getpass.getuser()))
# [ '/hdfs/local/$USER/somefile.txt', '/hdfs/local/$USER/someotherfile.txt', ... ]
# on regular file system
hdfs.listdir(os.path.expanduser('~'))
# [ '/home/$USER/somefile.txt', '/home/$USER/someotherfile.txt', ... ]
- get a list of files as an array of objects (works only with HDFS paths!)
# on HDFS
path_objs = hdfs.listdir(os.path.join('/hdfs/local', getpass.getuser()), return_objs = True)
path_objs[0].name
# '/hdfs/local/$USER/somefile.txt'
path_objs[1].size
# 15644
More examples of how to e.g. move, copy or remove files/directories and create directories can be found in the unit tests.
Another avenue for avoiding FUSE is to use the HDFS plugin in ROOT. However, we need to build the HDFS plugin ourselves because the CMSSW stack doesn't ship ROOT with HDFS support. This can be done by executing the following command, which builds the plugin and sets the necessary environment variables to access any file via the HDFS protocol:
source $CMSSW_BASE/src/tthAnalysis/HiggsToTauTau/misc/set_env.sh
The files residing on /hdfs can then be read anywhere via
TFile * f = TFile::Open("hdfs:///your/desired/path");
NB! The plugin does not support writing via this protocol.
Disclaimer: some problems surfaced with the plugin a while back when running analysis jobs at full scale. If my memory serves right, the problem was that each analysis job spawned a ridiculous number of JVM threads, which quickly hit the limit on the maximum number of threads in the JVM. The experimentation with the plugin took place before we completely abandoned TChain in favor of TTreeWrapper, which keeps only one file link alive per analysis job. It may be the case that each file link spawned its own set of JVM threads and therefore hit the cap quickly. It may be worth revisiting this problem, but for now it's not recommended to use this plugin.