Computational fluid dynamics (CFD) enables the study of the dynamics of things that flow, typically fluids, by generating numerical solutions to a system of partial differential equations that describe fluid flow. The primary purpose of CFD is to better understand qualitative and quantitative physical phenomena in the flow, which can be used to improve engineering design.
Figure 1: Productivity of OpenFOAM with InfiniBand vs. 10GigE and GigE.
CFD simulations are typically carried out on high-performance compute clusters (HPCC) or other large computational systems because they require a compute resource that can handle complex problems with large computational requirements, large memory requirements, and possibly large storage requirements. Because CFD codes are scalable, HPCCs typically offer the best price/performance solution for running simulations.
Scalability Analysis Specs
The scalability analysis was performed at the HPC Advisory Council compute center, using the following components:
• Dell PowerEdge SC 1435 24-node cluster with Quad-Core AMD Opteron 2382 (“Shanghai”) CPUs
• Mellanox InfiniBand ConnectX HCAs
Each node had 16GB of DDR2 800MHz memory. The software stack comprised RHEL 5U3, OFED 1.4.1 InfiniBand drivers, Open MPI 1.3.3 and OpenFOAM 1.6. For testing purposes, the lid-driven cavity flow input dataset was used: a mesh of 1000 x 1000 cells, run with the icoFoam solver for laminar, 2D flow over 1000 steps.
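For reference, a mesh of that density is described in OpenFOAM's blockMeshDict. The fragment below is a sketch modeled on the standard lid-driven cavity tutorial case; the file header and vertex list are elided, and only the 1000 x 1000 x 1 cell counts come from the benchmark description:

```
// Sketch of the blocks entry in constant/polyMesh/blockMeshDict,
// based on the standard lid-driven cavity tutorial case.
blocks
(
    // a single hex block: 1000 x 1000 cells in-plane, 1 cell deep (2D case)
    hex (0 1 2 3 4 5 6 7) (1000 1000 1) simpleGrading (1 1 1)
);
```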
Connecting Compute Nodes
A basic HPCC system consists of off-the-shelf servers, a high-speed interconnect and a storage solution. The interconnect has a great influence on total cluster performance and scalability. A slow interconnect delays data transfers between servers, and between servers and storage, causing poor utilization of the compute resources and slow execution of simulations. An interconnect that consumes a large number of CPU cycles as part of the networking process (“onloading”) reduces the compute resources available to the applications, and therefore slows down and limits the number of simulations that can be executed on a given cluster. It also limits cluster scalability, because as the number of CPU cores increases, a corresponding increase in CPU cycles is required to handle networking.
The InfiniBand high-speed interconnect provides fast communication between compute nodes, constantly feeding the CPU cores with data and eliminating idle time. Moreover, InfiniBand was designed to be fully offloaded, meaning all communications are handled within the interconnect hardware with no involvement from the CPU. This further enhances the code's ability to scale, with close-to-linear performance as more compute resources are added.
OpenFOAM CFD Applications
Open Field Operation and Manipulation, or OpenFOAM, is an open-source CFD application for simulating fluid flow problems. The broad physical modeling capabilities of OpenFOAM have been used by the aerospace, automotive, biomedical, energy and processing industries.
The original development of OpenFOAM began in the late 1980s at Imperial College London, motivated by a desire to find a more powerful and flexible general simulation platform than the de facto standard at the time, Fortran. Since then, it has evolved by exploiting the latest features of the C++ language, having been effectively rewritten several times over. The predecessor, FOAM, was sold by Nabla Ltd. before being released to the public under the GNU General Public License in 2004. It is now developed primarily by OpenCFD Ltd., with assistance from an active user community.
OpenFOAM is among the first major scientific packages written in C++ to provide relatively simple, top-level, human-readable descriptions of partial differential equations. It is also one of the first major general-purpose CFD packages to use polyhedral cells, a capability that is a natural consequence of its hierarchical description of simulation objects, and it is the first general-purpose CFD package to be released under an open-source license.
Figure 2: Productivity of OpenFOAM with Optimized InfiniBand vs. the out-of-the-box setting.
At the core of any CFD calculation is a computational grid that divides the solution domain into thousands or millions of elements on which the partial differential equations are solved. OpenFOAM uses a finite volume approach to solve systems of partial differential equations on any 3D unstructured mesh of polyhedral cells. In addition, domain decomposition parallelism is fundamental to its solvers, and OpenFOAM can decompose the domain for good scalability.
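As an illustration, domain decomposition for a parallel run is controlled through OpenFOAM's decomposeParDict dictionary. The fragment below is a sketch only; the subdomain counts are hypothetical values chosen to match a 192-core run and are not taken from the benchmark setup:

```
// Sketch of system/decomposeParDict for a 192-process run (illustrative values)
numberOfSubdomains 192;

method          simple;      // regular geometric split; metis is an alternative

simpleCoeffs
{
    n           (16 12 1);   // 16 x 12 x 1 = 192 subdomains
    delta       0.001;
}
```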
For performance and scalability analysis, OpenFOAM provides a set of solvers and utilities representing typical industry usage for performing pre- and post-processing tasks. It also has a set of libraries of physical models. Mesh generation is made simple by representing a cell as a list of faces, and each face as a list of vertices. OpenFOAM applications handle unstructured meshes of mixed polyhedra with any number of faces: hexahedra, tetrahedra or even degenerate cells. The scalability analysis of OpenFOAM was done using the lid-driven cavity flow benchmark case.
Figure 3: OpenFOAM MPI profiling.
Using OpenFOAM to Analyze Scalability
Using InfiniBand, we achieved higher productivity at every job size (number of cores) compared to Gigabit Ethernet (GigE) and 10 Gigabit Ethernet (10GigE). In particular, at 24 server nodes (192 cores), we achieved 219% higher performance than GigE and 109% higher than 10GigE. The lid-driven cavity flow benchmark timings are shown in Figure 1 as the number of jobs run per day, a measure of productivity.
The results shown in Figure 1 are not out-of-the-box results. The default OpenFOAM binary is not optimized for InfiniBand-connected clusters: even when InfiniBand is present, it still routes some data transfers over the slow Ethernet network and does not take full advantage of InfiniBand's networking capabilities. Choosing a precompiled Open MPI does not solve the issue; recompiling OpenFOAM properly is necessary.
There are two ways to compile OpenFOAM. Please note that Option 2 takes longer to build (more than four hours):
Option 1, linking against the system Open MPI:
• Modify OpenFOAM-1.6/etc/bashrc to use the MPI entry rather than OPENMPI (WM_MPLIB:=MPI).
• Change the MPI entry within settings.sh to the system Open MPI (export MPI_HOME=/usr/mpi/gcc/openmpi-1.3.3).
• Add the following to wmake/rules/linux64Gcc/mplib: PFLAGS = -DOMPI_SKIP_MPICXX; PINC = -I$(MPI_ARCH_PATH)/include; PLIBS = -L$(MPI_ARCH_PATH)/lib64 -lmpi.
Option 2, rebuilding the bundled Open MPI:
• Keep the default OPENMPI entry in bashrc.
• Modify the default Open MPI compiler options in ThirdParty-1.6/Allwmake (refer to the Open MPI website for the full compile options).
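Written out as a file, the additions to wmake/rules/linux64Gcc/mplib amount to the following make-rule fragment (MPI_ARCH_PATH is derived from the Open MPI entry set in settings.sh):

```make
# wmake/rules/linux64Gcc/mplib: link OpenFOAM against the system Open MPI
PFLAGS = -DOMPI_SKIP_MPICXX
PINC   = -I$(MPI_ARCH_PATH)/include
PLIBS  = -L$(MPI_ARCH_PATH)/lib64 -lmpi
```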
With the proper optimization, InfiniBand-delivered productivity improves by 172%, as shown in Figure 2, and delivers the performance advantages shown in Figure 1.
While this can be expected because of the InfiniBand latency, bandwidth and low CPU overhead advantages, the ability to execute more than twice the number of simulations per day solely because of the interconnect demonstrates the importance of choosing the right elements when building HPC clusters. With the same server configuration, a user with an InfiniBand-connected cluster will be able to reduce the engineering design time by more than half, resulting in faster time to market and competitive advantage.
We also noticed that OpenFOAM did not scale with Gigabit Ethernet beyond 16 servers (128 cores) for this problem, which means that any server added to a 16-node cluster with Gigabit Ethernet will not add any performance or productivity to OpenFOAM tasks.
Recent CFD and FEA application profiling analysis has shown that optimal scalability depends on the highest throughput and lowest latency. Many CFD codes generate large MPI messages, in the range of 16KB to 64KB, for data transfer, and use fast small messages for synchronization. As the cluster grows in server count, the number of messages, both large and small, increases as well, raising the burden on the cluster interconnect. This is where Ethernet becomes the bottleneck and limits any gain in performance or productivity. The profiling results of lid-driven cavity flow are shown in Figure 3.
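To put the 16KB to 64KB figure in perspective, here is a back-of-envelope sketch. It assumes, purely for illustration, that each halo exchange carries two double-precision values per boundary cell; under that assumption, one interface of the 1000 x 1000 benchmark mesh produces messages at the low end of the observed range:

```shell
# Illustrative arithmetic only: the per-cell payload of two 8-byte values
# is an assumption, not a measurement from the benchmark.
cells_per_interface=1000   # one row of the 1000 x 1000 mesh
values_per_cell=2          # assumed: two double-precision fields exchanged
bytes_per_value=8
echo $((cells_per_interface * values_per_cell * bytes_per_value))
# prints 16000, i.e. roughly 16KB per message
```

More exchanged fields or a coarser decomposition push the message size toward the upper end of the range, while adding ranks multiplies the number of such messages per step.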
The most used MPI functions are MPI_Allreduce, MPI_Waitall, MPI_Isend and MPI_Recv. The number of MPI calls increases with cluster size, as expected. The majority of the MPI timing overhead is within the MPI_Allreduce collective operation. The recent development of MPI collective offloads by Mellanox Technologies and Oak Ridge National Laboratory can increase OpenFOAM productivity through more efficient execution of MPI_Allreduce and MPI_Waitall and, more importantly, by enabling asynchronous progress; that is, by overlapping compute cycles with MPI communication cycles.
OpenFOAM provides an open-source CFD simulation code, and it has demonstrated a high degree of parallelism and scalability. This enables it to take full advantage of multi-core HPC clusters. InfiniBand shows greater efficiency and scalability with cluster size, providing more than 100% higher performance, or twice the productivity. The productivity advantage of InfiniBand translates not only to the ability to complete a product design in half the time, but also to more efficient use of the HPCC.
The study shows that economical integration of the CPUs and the InfiniBand network can save up to $8,400 a year in power costs while achieving the same number of application jobs vs. GigE, and up to $6,400 vs. 10GigE, based on the 24-node cluster configuration. As cluster size increases, more power can be saved. For the user, this translates into a reduction of operating expenses.
HPC Advisory Council
InfiniBand Trade Association
Oak Ridge National Laboratory
Gilad Shainer is a high-performance computing evangelist who focuses on HPC, high-speed interconnects, leading-edge technologies and performance characterizations. Tong Liu is the manager of application performance at Mellanox. Pak Lui works at Mellanox Technologies as a senior HPC applications performance engineer. Jeffrey Layton is the enterprise technologist for HPC within Dell. Scot Schultz is a board member of OpenFabrics Alliance, representing AMD, and also maintains active member roles with the HyperTransport Consortium, HPC Advisory Council, SEG and many other industry organizations.