Best Practices for I/O
Efficient I/O is essential for good performance in data-intensive applications. Often, the file system is a substantial bottleneck on HPC systems, because CPU and memory technology has improved much more drastically in the last few decades than I/O technology.
Parallel I/O libraries such as MPI-IO, HDF5 and netCDF can help parallelize, aggregate and efficiently manage I/O operations. HDF5 and netCDF also have the benefit of using self-describing binary file formats that support complex data models and provide system portability. However, some simple guidelines can be used for almost any type of I/O on Oscar:
- Try to aggregate small chunks of data into larger reads and writes.For the GPFS file systems, reads and writes in multiples of 512KBprovide the highest bandwidth.
- Avoid using ASCII representations of your data. They will usuallyrequire much more space to store, and require conversion to/frombinary when reading/writing.
- Avoid creating directory hierarchies with thousands or millions offiles in a directory. This causes a significant overhead in managingfile metadata.
While it may seem convenient to use a directory hierarchy for managing large sets of very small files, this causes severe performance problems due to the large amount of file metadata. A better approach might be to implement the data hierarchy inside a single HDF5 file using HDF5's grouping and dataset mechanisms. This single data file would exhibit better I/O performance and would also be more portable than the directory approach.
Last modified 4yr ago