FDS now uses OpenMP to run faster on PCs that have multiple cores. This post presents two benchmark simulations that show how much improvement you can expect and how to configure simulations to use OpenMP effectively.
As of FDS version 6.1.0, OpenMP is integrated with FDS and is used automatically. OpenMP is not present in the MPI version of FDS. OpenMP works by allocating a number of threads at startup and then running specific sections of the code in parallel.
This loop in mass.f90 is one of about 40 similar parallel regions in FDS.
Starting and stopping an OpenMP parallel region incurs some computational overhead, so a program that runs its parallel regions with only 1 OpenMP thread can be expected to run slower than the same program without OpenMP.
By default, OpenMP queries the operating system for the number of processors when the FDS simulation starts and creates a matching number of threads. Unlike other parts of FDS that are controlled by the input file, OpenMP is controlled by environment variables. You can control the number of threads made available to OpenMP by setting the OMP_NUM_THREADS environment variable. In PyroSim, this is done by setting the OpenMP Threads property in the OpenMP Environment dialog.
By default, PyroSim leaves the OMP_NUM_THREADS variable unset.
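For readers who launch FDS outside PyroSim, here is a minimal Python sketch of what setting this variable amounts to: it builds a process environment with OMP_NUM_THREADS set and passes it to a child process. The helper names `openmp_env` and `run_with_threads`, and the command shown in the final comment, are illustrative assumptions rather than part of FDS or PyroSim; substitute whatever command starts FDS on your system.

```python
import os
import subprocess


def openmp_env(threads, base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set.

    `threads` is the number of OpenMP threads to request. `base_env`
    defaults to the current process environment and is never modified.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(threads)
    return env


def run_with_threads(command, threads):
    """Run `command` (a list of strings) with the given thread count."""
    return subprocess.run(command, env=openmp_env(threads), check=True)


# Hypothetical usage; the executable name and input file are assumptions:
# run_with_threads(["fds", "burner.fds"], threads=6)
```

Leaving the variable unset, as PyroSim does by default, corresponds to not calling `openmp_env` at all and letting OpenMP pick its own thread count.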
The following benchmarks examine performance using different numbers of OpenMP threads. The test machine’s processor was an Intel Xeon E5-1650 with 6 physical cores. With Intel’s Hyper Threading technology enabled, the operating system recognized 12 logical cores and OpenMP’s default number of threads was 12. With Hyper Threading disabled, OpenMP’s default was 6 threads.
Hyper Threading is enabled and disabled using the BIOS.
Each model was run 24 times using FDS version 6.1.0 – 12 different OMP_NUM_THREADS settings with Hyper Threading active and another 12 with Hyper Threading disabled. Additionally, each model was run using FDS version 6.0.1 to determine a baseline for comparing the OpenMP version to a previous version. Hyper Threading was disabled for the FDS 6.0.1 runs.
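The FDS 6.1.0 test matrix for one model can be sketched in Python as follows. This is only an illustration of how the 24 runs break down; it assumes the 12 OMP_NUM_THREADS settings were the values 1 through 12, which matches the 12 logical cores of the test machine but is not stated explicitly above.

```python
from itertools import product

THREAD_SETTINGS = range(1, 13)     # assumed: OMP_NUM_THREADS = 1..12
HYPER_THREADING = (True, False)    # toggled in the BIOS, not by FDS


def run_matrix():
    """List every (hyper_threading, omp_num_threads) pair for one model."""
    return [
        {"hyper_threading": ht, "omp_num_threads": n}
        for ht, n in product(HYPER_THREADING, THREAD_SETTINGS)
    ]


runs = run_matrix()
print(len(runs))  # 24
```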
These tests were carried out informally on a workstation that was in use, and the reported times come from the “Total Elapsed Wall Clock Time” output in the FDS OUT file.
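If you want to collect these times from many OUT files, a short Python sketch such as the one below may help. It assumes the OUT file reports the value on a single line made up of the label, optional unit text, a colon, and a number in seconds; check this against your own OUT files, since the exact layout is an assumption here. The sample text is invented purely to show the expected shape.

```python
import re

# Assumed line format: the label, optional unit text, a colon, then a number.
_WALL_CLOCK = re.compile(
    r"Total Elapsed Wall Clock Time[^:\n]*:\s*([0-9]+(?:\.[0-9]+)?)"
)


def wall_clock_seconds(out_text):
    """Return the last reported wall clock time in seconds, or None."""
    matches = _WALL_CLOCK.findall(out_text)
    return float(matches[-1]) if matches else None


# Invented sample text, not taken from a real benchmark run.
sample = "...\n Total Elapsed Wall Clock Time (s):    3600.000\n"
seconds = wall_clock_seconds(sample)
print(seconds / 3600.0)  # wall clock time in hours
```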
The first benchmark is a simple burner that uses 1 mesh with 1.5 million cells and simulates 10 seconds of model time. This model was designed to give good OpenMP performance based on guidance in the NIST wiki document Running FDS with OpenMP, which suggests that models in the 0.5 million to 2 million cell range will demonstrate the most speedup as additional threads are added.
The FDS 6.0.1 baseline run (shown as a horizontal line) completed in 6.86 hours. The default OpenMP run with Hyper Threading enabled used 12 threads and ran in 5.19 hours (24% faster than baseline). The default OpenMP run with Hyper Threading disabled used 6 threads, ran in 4.15 hours (40% faster than baseline), and was the fastest of the 24 FDS 6.1.0 simulations.
The burner results show that using 4 threads is roughly twice as fast as using 1 thread and that running with Hyper Threading is usually slower than without. Both of these results were predicted accurately by NIST’s Running FDS with OpenMP guidelines.
The second benchmark was performed using the Switchgear PyroSim example. This model uses 2 meshes with a total of 162,000 cells (144,000 and 18,000 cells) and runs for 600 seconds. The total cell count is below the recommendation for OpenMP.
The FDS 6.0.1 baseline run (shown as a horizontal line) completed in 4.86 hours. The default OpenMP run with Hyper Threading enabled used 12 threads and ran in 5.82 hours (20% slower than baseline). The default OpenMP run with Hyper Threading disabled used 6 threads, ran in 4.32 hours (11% faster than baseline), and was the fastest of the 24 FDS 6.1.0 simulations.
This benchmark reinforces the suggestion that Hyper Threading is not helpful. In the best case, the addition of OpenMP results in only an 11% reduction in simulation time compared to the baseline Switchgear run. This poor performance is likely due to a combination of the relatively low cell count and the poor balancing of the meshes. Identifying the impact of mesh balancing when using OpenMP will require a different benchmark where the meshes are varied while maintaining the total cell count.
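The percentages quoted for both benchmarks are reductions in wall clock time relative to the FDS 6.0.1 baseline. The short Python check below reproduces them from the reported hours; the function name `percent_faster` is just an illustrative choice.

```python
def percent_faster(baseline_hours, run_hours):
    """Reduction in run time relative to the baseline, in percent.

    Positive values mean the run beat the baseline; negative values
    mean it was slower.
    """
    return (baseline_hours - run_hours) / baseline_hours * 100.0


# (baseline, run) wall clock times in hours, as reported above
results = {
    "burner, 12 threads, HT on": (6.86, 5.19),
    "burner, 6 threads, HT off": (6.86, 4.15),
    "switchgear, 12 threads, HT on": (4.86, 5.82),
    "switchgear, 6 threads, HT off": (4.86, 4.32),
}

for label, (baseline, run) in results.items():
    print(f"{label}: {percent_faster(baseline, run):+.0f}%")
```

The four lines printed are +24%, +40%, -20%, and +11%, matching the figures in the text.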
You can download the raw OUT files and representative FDS input files at the following location (only 2 FDS input files are included, as only the HEAD parameter CHID changed across variants of the same simulation):
PyroSim users seeking faster runtimes with the OpenMP version of FDS should consider either disabling Hyper Threading in the BIOS or limiting the number of threads using the OpenMP Environment dialog. Even in the burner benchmark, where the penalty for the default 12-thread run was less pronounced, setting the number of threads to match the number of physical cores (6) saved 38 minutes from the default 5 hours and 12 minutes (i.e. 12% faster).
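The arithmetic behind that last figure can be checked with a few lines of Python. The sketch reads the comparison as the default 12-thread run against a 6-thread run on the same machine with Hyper Threading still enabled, which is an interpretation of the text rather than something it states outright.

```python
default_minutes = 5 * 60 + 12   # default run: 12 threads, Hyper Threading enabled
saved_minutes = 38              # saving reported for the 6-thread run

limited_minutes = default_minutes - saved_minutes
percent_saved = saved_minutes / default_minutes * 100.0

hours, minutes = divmod(limited_minutes, 60)
print(f"6-thread run: {hours} h {minutes} min")
print(f"time saved: {percent_saved:.0f}%")
```

This gives a 6-thread run time of 4 h 34 min and a saving of 12%.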