# Best Practices for I/O

Efficient I/O is essential for good performance in data-intensive applications. The file system is often a substantial bottleneck on HPC systems, because CPU and memory technology have improved far more rapidly over the last few decades than I/O technology.

Parallel I/O libraries such as MPI-IO, HDF5, and netCDF can help parallelize, aggregate, and efficiently manage I/O operations. HDF5 and netCDF also have the benefit of using self-describing binary file formats that support complex data models and provide portability across systems. Regardless of the library you use, a few simple guidelines apply to almost any type of I/O on Oscar:

* Aggregate small chunks of data into larger reads and writes. On the GPFS file systems, reads and writes in multiples of 512 KB provide the highest bandwidth.
* Avoid ASCII representations of your data. They usually require much more space to store, and converting to and from binary on every read and write adds overhead.
* Avoid creating directory hierarchies with thousands or millions of files in a directory. This causes significant overhead in managing file metadata.
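
The first two guidelines can be sketched together: buffer many small logical writes so the file system sees large, aligned requests, and store values in binary rather than ASCII. This is a minimal stdlib-only illustration; the `write_records`/`read_records` helpers and the record format are hypothetical, and the 512 KB figure comes from the GPFS guideline above.

```python
import os
import struct
import tempfile

# GPFS performs best when reads and writes are multiples of 512 KB,
# so buffer small logical writes into large physical ones.
CHUNK = 512 * 1024  # 512 KB

def write_records(path, records):
    """Write small binary records through a 512 KB buffer.

    Each record is packed as an 8-byte binary double rather than
    ASCII text, so it is smaller and needs no parsing on read.
    """
    with open(path, "wb", buffering=CHUNK) as f:
        for value in records:
            f.write(struct.pack("<d", value))  # little-endian double

def read_records(path):
    """Read the doubles back in one large request, not many small ones."""
    with open(path, "rb") as f:
        data = f.read()  # single large read
    count = len(data) // 8
    return list(struct.unpack(f"<{count}d", data))

# Usage sketch: 100,000 8-byte records -> ~0.8 MB of binary data,
# versus roughly two to three times that as ASCII text.
path = os.path.join(tempfile.mkdtemp(), "records.bin")
values = [float(i) for i in range(100_000)]
write_records(path, values)
assert read_records(path) == values
```

In a real application the buffering would typically come from a parallel I/O library (MPI-IO collective buffering, HDF5 chunking) rather than hand-rolled code, but the principle is the same: fewer, larger, aligned I/O operations.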

While it may seem convenient to use a directory hierarchy for managing large sets of very small files, this causes severe performance problems due to the large amount of file metadata. A better approach is to implement the data hierarchy inside a single HDF5 file using HDF5's grouping and dataset mechanisms. This single data file exhibits better I/O performance and is also more portable than the directory approach.
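
As a sketch of that approach, the following uses the `h5py` package (assumed to be available, e.g. via `pip install h5py`) to replace a hypothetical directory tree like `run000/sensor_a`, `run000/sensor_b`, ... with groups and datasets inside one HDF5 file; the run and sensor names are illustrative only.

```python
import os
import tempfile

import h5py  # assumed available in the environment

# One HDF5 file holds the whole hierarchy that would otherwise be
# thousands of small files spread across directories.
path = os.path.join(tempfile.mkdtemp(), "experiment.h5")

with h5py.File(path, "w") as f:
    for run in range(3):                        # hypothetical runs
        grp = f.create_group(f"run{run:03d}")   # plays the role of a directory
        grp.create_dataset("sensor_a", data=[run * 1.0, run * 2.0])
        grp.create_dataset("sensor_b", data=[run * 3.0])
        grp.attrs["description"] = "example run"  # self-describing metadata

# Reading back uses path-like access, but it is a single file on GPFS,
# so there is only one file's worth of metadata to manage.
with h5py.File(path, "r") as f:
    assert list(f) == ["run000", "run001", "run002"]
    assert f["run001/sensor_a"][0] == 1.0
```

Attributes attached to groups and datasets keep the file self-describing, so a collaborator can open it without a separate description of the layout.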
