Tools for Monitoring CPU Usage and Affinity in
Multicore Supercomputers
Lei Huang
Texas Advanced Computing Center
The University of Texas at Austin
10100 Burnet Rd. Austin, TX, 78758
Kent Milfeld
Texas Advanced Computing Center
The University of Texas at Austin
10100 Burnet Rd. Austin, TX, 78758
Si Liu
Texas Advanced Computing Center
The University of Texas at Austin
10100 Burnet Rd. Austin, TX, 78758
Abstract—Performance boosts in HPC nodes have come from
making SIMD units wider and aggressively packing more and
more cores into each processor. With multiple processors and so
many cores, it has become necessary to understand and manage
process and thread affinity and pinning. However, affinity tools
have not been designed specifically for HPC users to quickly
evaluate process affinity and execution location. To fill this
gap, three HPC user-friendly tools, core_usage, show_affinity, and
amask, have been designed to eliminate barriers that frustrate
users and impede the evaluation and analysis of affinity
for applications. These tools focus on providing convenient
methods, easy-to-understand affinity representations for large
process counts, process locality, and run-time core load with
socket aggregation. These tools will significantly help HPC users,
developers and site administrators easily monitor processor
utilization from an affinity perspective.
Index Terms—Supercomputers, User support tool, Multicore
system, Affinity, Resource utilization, Core binding, Real-time
monitoring, Debugging
I. INTRODUCTION
Until the turn of the millennium, the processor frequency of
commodity CPUs increased exponentially year after year. Increasing
CPU frequency was one of the major driving forces behind
CPU performance gains, alongside the introduction of vector
processing units. However, frequency has ceased to grow significantly in
recent years due to both technical limits and market forces.
To accommodate the high demand for computing power in
HPC, significantly more cores are being packed into a single
compute node [15].
The needs of HPC and the use of core-rich processors are
exemplified in the extraordinarily large-scale supercomputers
found throughout the world. The Sierra supercomputer [17] at
the Lawrence Livermore National Laboratory and the Summit
supercomputer [19] at the Oak Ridge National Lab have 44
processing cores per compute node with two IBM Power9
CPUs [13]. The Sunway TaihuLight supercomputer [18] at
the National Supercomputing Center in Wuxi deploys Sunway
SW26010 manycore processors, with 256 processing cores plus
4 auxiliary cores for system management per node [32].
The Stampede2 supercomputer [28] at
the Texas Advanced Computing Center (TACC) provides Intel
Knights Landing (KNL) nodes with 68 cores per node. The
Stampede2 [28] and Frontera [27] supercomputers at TACC
provide 48 and 56 processing cores per node with Intel’s
Skylake (SKX) and Cascade Lake (CLX) processors [31],
respectively. These, and other HPC processors, also support
Simultaneous Multi-Threading (SMT) at a level of 2 to 4
hardware threads per core. Consequently, there can be 2x to 4x more logical
processors than physical cores on a node.
When working with nodes of such large core counts, the
performance of HPC applications depends not only on
the number and speed of the cores, but also on proper
scheduling of processes and threads. HPC applications run
with proper affinity settings take full advantage of resources
such as local memory and reusable caches, and obtain a
distinct performance benefit.
II. BACKGROUND
A. Process and Thread Affinity
A modern compute node often has more than one socket,
and therefore HPC applications may have non-uniform
access to memory. Ideally, an application process should be
placed on a processor close to the data it accesses in memory
to get the best performance. Process and thread
affinity/pinning allows a process or a thread to bind to a
single processor or a set of (logical) processors. The processes
or threads with specific affinity settings will then only run
on the designated processor(s). For Single-Program Multiple-
Data (SPMD) applications, managing this affinity can be
difficult. Moreover, the present-day workflows on modern
supercomputers have moved beyond the SPMD approach and
now include hierarchical levels of Multiple-Program Multiple-
Data (MPMD), demanding even more attention to affinity.
Intel MPI (IMPI), MVAPICH2 (MV2), Open MPI (OMPI), and
IBM Spectrum MPI (SMPI) provide a variety of mechanisms for
setting MPI affinity. IMPI relies solely on "I_MPI_*" environment
variables, as does MV2 (MV2_CPU/HYBRID_BINDING_*,
MV2_CPU_MAPPING, etc.). SMPI uses both environment
variables (MP_TASK/CPU_*) and mpirun command-line
options (-map-by, -bind-to, -aff shortcuts, etc.). Similarly,
OMPI uses mpirun options (-bind-to-core, -cpus-per-proc,
etc.) and also accepts a rankfile with a map (slot-list) for
each rank.
When no affinity is specified, these MPIs evaluate a node's
hardware configuration (for example, with hwloc for MV2 and
OMPI) and make appropriate default affinity settings. OpenMP
affinity for hybrid runs can be specified by various "vendor"
methods. However, since all of these MPIs accept OpenMP's
OMP_PLACES/OMP_PROC_BIND specifications, it is best
to use the standard's mechanism. Even so, for portable hybrid
computing a user must deal with many ways of setting each
rank's affinity. (When a master thread encounters a parallel
region, it inherits the MPI rank's mask, and OpenMP affinity
specifications take over.)
Fig. 1 shows a schematic of the affinity process. A mask
is maintained by the kernel for each process, describing
which processor(s) the process can run on. The mask consists
of a bit for each processor, and the process can execute on any
processor whose mask bit is set. There are a myriad of ways
to set and alter the affinity mask for processes of a parallel
application. For instance, vendors have their own way to set
affinities for MPI and OpenMP, usually through environment
variables. Only recently has OpenMP 4.5 [20], [21] provided a
standard way to set affinity for threads, and MPI has yet to do
this for MPI tasks. As shown in Fig. 1, the affinity can be affected
not only before an application is launched but also while
it is running. There are utilities such as numactl [2] and util-linux
taskset [7] to do this. Furthermore, the affinity can even
be changed within a program with the sched_setaffinity [6]
function.
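As an illustration of the programmatic route, the following minimal C sketch (not taken from the tools described in this paper) uses sched_setaffinity to restrict the calling process to cores 1, 3, 5, and 7, the same mask illustrated in Fig. 1:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void)
  {
      cpu_set_t mask;
      CPU_ZERO(&mask);
      CPU_SET(1, &mask);      /* set bits for cores 1, 3, 5, and 7 */
      CPU_SET(3, &mask);
      CPU_SET(5, &mask);
      CPU_SET(7, &mask);

      /* pid 0 means "the calling thread/process" */
      if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
          perror("sched_setaffinity");
      return 0;
  }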
Fig. 1. The left box indicates mechanisms for setting the affinity mask. The
right box illustrates how a BIOS setting has designated the processor ids
for the hardware (cores). The center section shows a mask with bits set for
execution on cores 1, 3, 5, and 7.
Understanding the “vernaculars” of all these methods can be
challenging. Even the default settings are sometimes unknown
to users. In addition, users are commonly uncertain whether their
attempts to set the affinity for the processes of their parallel
applications have succeeded. Other factors (see Fig. 1), such as user
environment variables, the MPI launcher, etc., also create a lack of
confidence in a user's attempt to control affinity. Incorrect core
binding for processes and/or threads can have adverse effects
on performance, even reducing program performance by
single- or multi-digit factors.
B. Related Work
There are many ways to view CPU loads and the affinity
mask of a process. However, some methods are not well
known or are only accessible through non-obvious means.
The Linux command-line tool top [8] and its more recent
counterparts (htop or atop) can be used to monitor CPU loads,
and manage process and thread affinity in real time. The
Linux command ps [3] can report which core a process is
presently running on, but it does not report the affinity mask
explicitly. The taskset [7] command-line utility is normally
more helpful since it can query and modify the binding affinity
of a given thread or process. Linux also provides the API functions
sched_getaffinity [5] and pthread_setaffinity_np [4] for a
process or thread to query and set its own affinity (kernel
mask). While these tools are pervasive and do provide the
information needed, they are sometimes cumbersome to use,
particularly for supercomputer users working with large core
counts.
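For completeness, here is a minimal sketch (illustrative only, not from the tools described below) of these two calls: a thread binds itself with pthread_setaffinity_np and then reads its mask back with sched_getaffinity:

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>

  static void *worker(void *arg)
  {
      int core = *(int *)arg;
      cpu_set_t want, got;

      CPU_ZERO(&want);
      CPU_SET(core, &want);
      pthread_setaffinity_np(pthread_self(), sizeof(want), &want);

      sched_getaffinity(0, sizeof(got), &got);   /* 0 = calling thread */
      printf("worker bound to core %d? %s\n",
             core, CPU_ISSET(core, &got) ? "yes" : "no");
      return NULL;
  }

  int main(void)
  {
      pthread_t t;
      int core = 2;                              /* example core id */
      pthread_create(&t, NULL, worker, &core);
      pthread_join(t, NULL);
      return 0;
  }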
For HPC users, these tools may provide too much admin-
istrative information, and it may not be apparent how to get
HPC-relevant information for their applications. Users need
to remember extra options or give extra instructions to obtain
relevant CPU information. For instance, top does not show
the loads of individual processors by default: pressing "1"
within a top session is required to display the performance
data of individual CPUs, and pressing "z" is needed to display
running processes in color. The htop utility does show usage
information for all logical processors, and the load on each
individual core is represented by a progress bar in text mode.
It works well up to about 160 cores. However, the progress
bars become distracting on a computer with many cores.
Furthermore, such tools were originally designed for admin-
istrators to display multi-user information, not for a single-
user screen display of HPC information. Therefore, there is
a real need for convenient HPC tools that readily display
per-CPU utilization and affinity information for multicore compute
nodes.
The MPI libraries and OpenMP 5.0 [21] implementations
themselves can present affinity mask information for the
processes or threads they instantiate. For instance, by setting the
I_MPI_DEBUG environment variable to 4 or above, the Intel MPI
runtime will print a list of processor ids (mask bits set) for each
process (rank) at launch time. Likewise, for OpenMP 5.0
implementations, setting OMP_DISPLAY_AFFINITY to TRUE
will have the runtime print a line for each thread (number)
reporting the processor ids associated with its binding at the
beginning of the first parallel region. However, it is difficult to
make sense of these lists for multicore or manycore compute
nodes.
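A minimal sketch of the API counterpart of this environment variable (assuming an OpenMP 5.0-compliant compiler and runtime; this is not part of the tools presented here): each thread can print its own binding explicitly with omp_display_affinity:

  #include <omp.h>

  int main(void)
  {
      #pragma omp parallel
      {
          /* NULL means "use the current affinity format" */
          omp_display_affinity(NULL);
      }
      return 0;
  }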
There are other comprehensive tools that can be used to
collect the CPU loads and the affinity information. TACC
Stats [12] is a well-established one. It monitors parallel jobs
on supercomputers and collects a series of system statistics
and hardware performance counts, including the CPU usage of
each core. However, the data processing and display are not real-time,
and this tool is mainly designed for system administrators.
Another practical tool suite is Likwid [23], [30], which consists
of many convenient command-line applications. In particular,
likwid-topology prints thread, cache, and NUMA topology
information, and likwid-pin can be used to pin threaded applications.
Based on years of experience administering multiple supercomputer
systems and supporting thousands of HPC users, the
following questions are frequently asked by users and administrators
when monitoring a program running on an HPC system:
Does my application use the maximum capacity of the CPUs?
How many physical or logical processors are practical for a
running application? What process and thread affinity pattern
is currently in use given my settings? To help
HPC users and administrators answer these questions easily,
three innovative tools, core_usage, show_affinity, and amask,
were designed and developed. They are now serving the HPC
community by presenting real-time CPU usage and affinity
information on systems with large core counts.
III. THREE INNOVATIVE TOOLS
A. core_usage
1) IMPLEMENTATION: The first tool we designed and
developed to quickly and efficiently show processor loads
is core_usage [25]. It obtains the logical processor (core)
usage information directly from /proc/stat on Linux systems.
Specifically, non-idle time (t_NonIdle) is the sum of six components:
the user, nice, system, irq, softirq, and steal columns. Idle time
(t_Idle) is calculated as the sum of the idle and iowait columns.
core_usage regularly reads this kernel activity information for every
logical processor on a node, then calculates core utilization
from the two most recent samples according to
the following equation.
utilization =
( t_NonIdle_new - t_NonIdle_old ) /
( (t_NonIdle_new + t_Idle_new) -
(t_NonIdle_old + t_Idle_old) )
core_usage then displays the CPU load of all logical processors.
The data are grouped by socket id, and the first core of
every socket is highlighted to make it easy to determine whether the
processes/threads are evenly distributed across sockets.
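A minimal C sketch of this sampling logic (an illustration based on the /proc/stat field layout documented in proc(5), not the actual core_usage source) is shown below; it takes two samples one second apart and prints the per-CPU utilization:

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  #define MAX_CPUS 1024

  /* Read per-CPU non-idle and idle jiffies; returns the number of CPUs found. */
  static int sample(long long nonidle[], long long idle[])
  {
      char line[512];
      int n = 0;
      FILE *fp = fopen("/proc/stat", "r");
      if (!fp) return -1;
      while (fgets(line, sizeof line, fp) && n < MAX_CPUS) {
          long long u, ni, s, id, io, irq, sirq, st;
          /* keep only "cpu0", "cpu1", ...; skip the aggregate "cpu" line */
          if (strncmp(line, "cpu", 3) != 0 || line[3] < '0' || line[3] > '9')
              continue;
          if (sscanf(line, "cpu%*d %lld %lld %lld %lld %lld %lld %lld %lld",
                     &u, &ni, &s, &id, &io, &irq, &sirq, &st) == 8) {
              nonidle[n] = u + ni + s + irq + sirq + st;   /* t_NonIdle */
              idle[n]    = id + io;                        /* t_Idle    */
              n++;
          }
      }
      fclose(fp);
      return n;
  }

  int main(void)
  {
      static long long n0[MAX_CPUS], i0[MAX_CPUS], n1[MAX_CPUS], i1[MAX_CPUS];
      int ncpu = sample(n0, i0);
      sleep(1);                                  /* update interval */
      sample(n1, i1);
      for (int c = 0; c < ncpu; c++) {
          double total = (double)((n1[c] + i1[c]) - (n0[c] + i0[c]));
          double util  = total > 0 ? (n1[c] - n0[c]) / total : 0.0;
          printf("cpu%-3d %.2f\n", c, util);
      }
      return 0;
  }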
2) COMMANDS AND REPORTS: The syntax to run
core_usage is:
core_usage [<int>] [txt]
where <int> is the update interval in seconds (an integer or
floating-point value; the default is 1). For example, "core_usage
3" provides an update every three seconds. Users can also
add "txt" as a parameter to force text mode.
The core_usage command can present a graphical user
interface (GUI) or a command-line interface (CLI). When X
Forwarding is supported in the current environment, the GUI
version is presented. In the GUI, the size of the plot area
is automatically set according to the number of cores on the
running computer. The usage percentage of each individual
logical processor is represented by the height of a blue bar, as
shown in Fig. 2, for a hybrid job run on the Stampede2 system.
The GUI version is an ideal way to visualize core usage
information and can easily be extended to support thousands
of cores per node in the future by adding more rows in the
bar chart.
Fig. 2. Snapshot of the core_usage (GUI) display for a hybrid application run
with 4 MPI tasks and 8 threads per MPI task on a Stampede2 Intel Xeon Skylake
compute node. Colors and format are slightly modified for presentation.
If X Forwarding is not detected, the CLI version automatically
launches and reports in text mode, as shown in Fig. 3. A
floating-point number between 0.0 and 1.0 is calculated and
displayed for each logical processor to represent the current
core usage. The text is monochrome if the core is idle (usage
less than 2%); otherwise it is green to highlight the cores in
use.
From these figures, it can be seen that core_usage presents,
in logical order, the current usage of each individual processor
in real time. The results are collected and displayed in a
socket-aware manner so that users can easily track processor
status by socket. In the latest version, core_usage explicitly
displays the name of the application that keeps each individual
core busy, as shown in Fig. 3. To keep the results clear and
concise, core_usage shows only the application with the top
usage for each individual logical processor in this version.
As mentioned above, it is also possible to run core_usage
manually in terminal mode with the argument "txt", even if
an X11 environment is available and the GUI is the default starting
mode.
B. show_affinity
1) BACKGROUND AND IMPLEMENTATION: Though
core_usage is valuable for monitoring individual CPU core
usage and detecting underutilization issues, it does not report
the process and thread binding (affinity) that may be needed
to adjust the resource usage.
Normally, the Linux tool taskset [7] can be used to retrieve
the CPU binding affinity of individual threads or processes
with commands like "taskset -p <pid>". However, users would need
to compile a full list of pids/tids for all processes/threads
and run taskset for each of them. This
would be tedious and error-prone, and the "simple" command-line
expressions or scripts involved may require Unix skills unfamiliar
to inexperienced HPC users.
Fig. 3. Snapshot of the core_usage CLI text report for a Weather Research and Forecasting (WRF) run with 16 MPI tasks and 4 OpenMP threads per MPI task
on one Stampede2 KNL compute node. Colors and format are slightly modified for presentation.
To report this type of affinity information automatically
and clearly, the show_affinity [26] tool was developed. This
tool was also designed to be intuitive and simple to use.
When show_affinity executes, all running processes/threads
on a compute node are enumerated by inspecting the directories
under /proc and their owners. To avoid unnecessary
information, show_affinity only queries and reports the binding
affinity of the processes owned by the current user on the
compute node. Application names are then extracted from
/proc/<pid>/exe. For each process the threads are enumerated, and
the core binding affinity of each individual process/thread is
queried and displayed.
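A condensed sketch of this enumeration (illustrative only, not the actual show_affinity source) walks /proc/<pid>/task/<tid> for processes owned by the current user and queries each thread's mask with sched_getaffinity, which accepts a thread id:

  #define _GNU_SOURCE
  #include <dirent.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/stat.h>
  #include <sys/types.h>
  #include <unistd.h>

  static void print_thread_mask(pid_t tid)
  {
      cpu_set_t mask;
      if (sched_getaffinity(tid, sizeof(mask), &mask) != 0)
          return;
      printf("  tid %d:", (int)tid);
      for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
          if (CPU_ISSET(cpu, &mask))
              printf(" %d", cpu);
      printf("\n");
  }

  int main(void)
  {
      uid_t me = getuid();
      DIR *proc = opendir("/proc");
      struct dirent *p;
      while (proc && (p = readdir(proc)) != NULL) {
          char path[512];
          struct stat st;
          if (p->d_name[0] < '0' || p->d_name[0] > '9')
              continue;                          /* not a pid directory */
          snprintf(path, sizeof(path), "/proc/%s", p->d_name);
          if (stat(path, &st) != 0 || st.st_uid != me)
              continue;                          /* not owned by this user */
          printf("pid %s\n", p->d_name);
          snprintf(path, sizeof(path), "/proc/%s/task", p->d_name);
          DIR *task = opendir(path);
          struct dirent *t;
          while (task && (t = readdir(task)) != NULL)
              if (t->d_name[0] >= '0' && t->d_name[0] <= '9')
                  print_thread_mask((pid_t)atol(t->d_name));
          if (task) closedir(task);
      }
      if (proc) closedir(proc);
      return 0;
  }

Reading the executable name from /proc/<pid>/exe (e.g., with readlink) would complete the four columns of the report.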
2) COMMANDS AND REPORTS: There are two modes of
operation for show_affinity. The syntax is:
show_affinity [all]
In the first and default mode, the tool shows the processes/threads
launched by the current user that keep CPUs busy,
as demonstrated in Fig. 4. To make the results concise
and clear, the output is organized in four columns: process
id (pid), executable name, thread id (tid), and binding affinity,
grouped by pid. The second mode is invoked with the "all"
argument, in which show_affinity displays all running processes
and threads owned by the current user on the current compute
node, as demonstrated in Fig. 5.
C. amask
1) BACKGROUND AND IMPLEMENTATION: The initial
amask utility [24] was designed [1] as an analysis tool to
confirm affinity settings for the OpenMP 4.0 Affinity implementation
on the manycore Intel Xeon Phi system (68 cores, 272 processor ids).
pid Exe_Name tid Affinity
91884 namd2_skx 91884 0
91910 2
91915 4
...
91942 20
91945 22
91885 namd2_skx 91885 24
91911 26
91914 28
...
91941 44
91944 46
91886 namd2_skx 91886 1
91909 3
91913 5
...
91943 21
91946 23
91887 namd2_skx 91887 25
91908 27
91912 29
...
91950 45
91951 47
Fig. 4. Snapshot of show_affinity showing the running processes and threads
that keep CPUs busy for a NAMD [22] run with 4 MPI tasks and 12 threads
per MPI task on one Stampede2 Skylake compute node. The output contains
four columns: process id (pid), executable name, thread id (tid), and core
binding affinity. Format is slightly modified for presentation.
pid Exe_Name tid Affinity
91544 slurm_script 91544 0-95
91551 sleep 91551 0-95
91649 sshd 91649 0-95
91650 bash 91650 0-95
91829 ibrun 91829 0-95
91879 mpiexec.hydra 91879 0-95
91880 pmi_proxy 91880 0-95
91884 namd2_skx 91884 0
91905 0,2,4,...
91910 2
...
91942 20
91945 22
91885 namd2_skx 91885 24
91904 24,26,...
91911 26
...
91941 44
91944 46
91886 namd2_skx 91886 1
91906 1,3,5,...
91909 3
...
91943 21
91946 23
91887 namd2_skx 91887 25
91907 25,27,...
91908 27
...
91950 45
91951 47
91975 tee 91975 0-95
Fig. 5. Snapshot of show_affinity with the "all" argument showing all running
processes and threads for a NAMD run with 4 MPI tasks and 12 threads
per MPI task on one Stampede2 Skylake compute node. Format is slightly
modified for presentation.
It consisted of a single, argumentless
library function called within an application. However, it was
found that users were more interested in executing a command
immediately before their application (and after setting the
affinity environment) to report the affinity of a "generic"
parallel region, rather than instrumenting an application with
a library call. Therefore, stand-alone external commands
were created. amask was soon adapted for MPI, and the
external commands (and library calls) became amask_omp,
amask_mpi, and amask_hybrid for pure OpenMP, pure MPI,
and hybrid OpenMP-MPI applications, respectively.
The OpenMP component works for any version of OpenMP.
The commands with MPI components must be compiled/used
with the same flavor (OpenMPI, IMPI, MVAPICH2, etc.) used
by the application, so that the same runtimes are invoked. It
is worth mentioning that the amask code does not rely on
any vendor-specific features or APIs. The library (and other
utilities) remain available for developers or power-users who
want in-situ reporting.
2) COMMANDS AND API: All three amask commands
accept the same options. The syntax is:
amask_[omp|mpi|hybrid] -h -vk -w# -pf
The amask_omp and amask_mpi commands report masks
for a pure OpenMP or pure MPI run, respectively. The
amask_hybrid command reports the (parent)
MPI masks followed by the OpenMP thread masks for each
MPI task. The -h option provides help. The -vk option
overrides the automatic core view, forcing a kernel (k) view
(mask of processor ids). amask will load each process for #
seconds when the -w# (wait) option is invoked (the default is 10).
This is helpful when used in combination with monitoring
tools like core_usage and htop. A slight pause after printing
each row (mask) was found to give the viewer time to start
comprehending the content of each mask, and then allow
analysis of the pattern as more rows are reported. This slow
mode can be turned off by requesting the fast printing mode
with -pf.
The library has function names corresponding to the com-
mands:
C/C++ Fortran
amask_omp(); call amask_omp()
amask_mpi(); call amask_mpi()
amask_hybrid(); call amask_hybrid()
These can be inserted in a pure OpenMP parallel region,
after MPI_Init in a pure MPI code, or within an OpenMP
parallel region of a hybrid code (calls within a loop structure
should be conditionally executed for only one iteration).
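The placement might look like the following sketch for a hybrid code (illustrative only: the argumentless call names are from the list above, but the prototype declaration and build details are assumptions; see the amask project page [24] for the actual instructions):

  #include <mpi.h>
  #include <omp.h>

  void amask_hybrid(void);       /* assumed prototype for the library call */

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      #pragma omp parallel
      {
          amask_hybrid();        /* called once by every thread of every rank */
      }

      /* ... application work ... */

      MPI_Finalize();
      return 0;
  }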
3) REPORTS: The important feature that makes amask
especially useful is that it reports a mask for each process of
a parallel execution in a matrix format (process number vs.
processor id), so that the user can quickly visualize relevant
affinity patterns (such as socket, NUMA node, tile,
core, and single hardware-thread assignments).
In the reports shown in Fig. 6, each row represents a
"kernel" mask for the process number labeled at the left. Each
label (process) is followed by N characters, one for each bit
of the kernel's affinity mask. A dash (-) represents an unset
bit, while a digit (0-9) represents a set bit. To make it easy to
evaluate the processor id of a set bit, the digit shown is added
to the group label in the header at the top (labels represent
groups of 10). For instance, the mask of process 1 in Fig. 6a) has mask
bit 12 (proc-id 12) set (proc-id = 2 + 10 from the group
value). This single-character bit-mask representation is ideal
for working with systems with hundreds of logical processors.
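A small sketch of this row encoding (an illustration, not the amask source): the calling thread's kernel mask is rendered as one character per processor id, a dash for an unset bit and the last digit of the processor id for a set bit, so that digit plus column-group label recovers the processor id:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  static void print_mask_row(int process, cpu_set_t *mask, int nprocs)
  {
      printf("%04d ", process);
      for (int id = 0; id < nprocs; id++)
          putchar(CPU_ISSET(id, mask) ? (char)('0' + id % 10) : '-');
      putchar('\n');
  }

  int main(void)
  {
      cpu_set_t mask;
      sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = calling thread */
      print_mask_row(0, &mask, 24);                /* e.g., a 24-processor node */
      return 0;
  }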
Fig. 6a) and Fig. 6b) show processes bound to single cores
and to sockets, respectively (where proc-id sets 0-11 and 12-23
are on different sockets). In the latter case each process
can “float” on any core in the socket. Fig. 6c) illustrates a
socket affinity, just as in Fig. 6b), but for a system with even
and odd proc-id sets for each socket. While the sequential
or even-odd assignments could have been determined from
hwloc [9], or /proc/cpuinfo on certain Linux systems, the
amask report identifies the proc-id assignment pattern. The
last report, Fig. 6d), shows a scenario where each process is
allowed to execute on any core of the system.
proc-id > | 0 | 10 | 20 | a)
process v | | | |
0000 0-----------------------
0001 ------------2----------
proc-id > | 0 | 10 | 20 | b)
process v | | | |
0000 012345678901------------
0001 ------------234567890123
proc-id > | 0 | 10 | 20 | c)
process v | | | |
0000 0-2-4-6-8-0-2-4-6-8-0-2-
0001 -1-3-5-7-9-1-3-5-7-9-1-3
proc-id > | 0 | 10 | 20 | d)
process v | | | |
0000 012345678901234567890123
0001 012345678901234567890123
Fig. 6. Masks for 2 processes on a 2-socket, 24-core platform. A dash (-)
represents an unset bit, while a single digit represents a set bit. Add the digit to
the column group value to obtain the processor id (core number). a) Process
0 can only execute on core 0; process 1 can only execute on core 12. b)
Process 0 can execute on cores 0-11; process 1 can execute on cores 12-23.
c) Process 0 can execute on even-numbered cores; process 1 can execute on
odd-numbered cores. d) Processes 0 and 1 can execute on any core.
With simultaneous multithreading (SMT), available on IBM,
Intel, AMD, and other processors, the OS assigns multiple
(virtual) processors to a core. Hence each core has multiple
processor ids, also called hardware threads (HWTs), the term
used here.
When amask detects hardware threading, it reports a "core"
view, showing a column for each core id, with each process
reporting one row (mask) per hardware thread. Hence, core
group numbers appear in the header instead of processor-id
group numbers.
For a 2-socket system with sequential processor id numbering,
Fig. 7a) shows the affinity masks for process 0 executing on
either HWT of core 0 and process 1 executing on either HWT
of core 12, while Fig. 7b) shows execution is available only
on the first hardware thread of two adjacent cores. Fig. 7c)
shows 68 threads executing with "cores" affinity (execution
available on all hardware threads of a core) for a 4-SMT, 68-core
Intel Xeon Phi system. It is easy to see that each process
is assigned to all HWTs of a core. The processor id list for core
number 67 is the set {67, 135, 203, 271}; determining that
these represent a single core in the amask kernel (processor
id) view would be difficult, and checking the assignment with
just a processor id listing would be tedious.
a) Core ids
proc-id > | 0 | 10 | 20 |
process v | | | |
0000 0======================= HWT0
0----------------------- HWT1
0001 ===========2============
-----------2------------
b) Core ids
proc-id > | 0 | 10 | 20 |
process v | | | |
0000 01======================
------------------------
0001 ===========23===========
------------------------
c) Core ids
proc-id > | 0 | ... | 60 |
process v | | | |
0000 0========== ======== HWT0
0---------- -------- HWT1
0---------- -------- HWT2
0---------- -------- HWT3
: :
0067 =========== =======7
----------- -------7
----------- -------7
----------- -------7
Fig. 7. Core view of masks for SMT systems: equal signs (=) represent unset bits
on the first-HWT row (to distinguish it), and dashes (-) represent unset bits for the other
HWTs. a) Process 0 can execute on either HWT of core 0. Process 1 can
execute on either HWT of core 12. b) Process 0 can execute only on HWT
0 of cores 0 and 1. Process 1 can execute only on HWT 0 of cores 12 and
13. c) Process i can execute on any HWT of core i (for a 68-core, 4-SMT
system).
For multi-node executions, amask reports the masks on
each node and labels each process (row) with a node name
and rank number, as shown in Fig. 8a). Masks for hybrid
(OpenMP/MPI) executions are reported by the amask_hybrid
command. The report consists of two parts. The first part
contains the masks of the MPI tasks (just as amask_mpi
would report). It is important to show these masks because
the OpenMP runtime inherits the task's mask for the parallel
region and can only assign thread masks as subsets (or the
full set) of the set bits in the MPI task mask. Fig. 8b) shows
the MPI (parent) masks and thread masks for a hybrid run of
4 MPI tasks with 6 OpenMP threads per task. Each thread is
bound to a single core, as would be desired for a 4 x 6 (task
x thread) run on a 24-core system.
IV. CASE STUDY
A. Unexpected Slow VASP Runs
In 2018, one of our experienced Vienna Ab Initio Simulation
Package (VASP) [16] users reported an unexpected performance
drop in his jobs and needed help debugging the issue on
TACC's Stampede2 system.
a)
proc-id > | 0 | 10 | 20 |
node rank | | | |
c123-509 0000 012345678901------------
c123-509 0001 ------------234567890123
c123-802 0002 012345678901------------
c123-802 0003 ------------234567890123
b) (parent) MPI mask
proc-id > | 0 | 10 | 20 |
rank v | | | |
0000 012345------------------
0001 ------678901------------
0002 ------------234567------
0003 ------------------890123
MPI-thread mask
proc-id > | 0 | 10 | 20 |
rank thrd | | | |
0000 0000 0-----------------------
0000 0001 -1----------------------
0000 0002 --2---------------------
0000 0003 ---3--------------------
...
0003 0003 ---------------------1--
0003 0004 ----------------------2-
0003 0005 -----------------------3
Fig. 8. a) Affinity for a pure MPI multi-node execution, 4 MPI tasks on
two separate nodes. b) Affinity for an OpenMP/MPI hybrid (amask_hybrid)
execution with parent MPI and hybrid (rank/thread) reports.
The user had created a top-level
script in Python to manage the overall workflow and invoked
NumPy [29] in this script for scientific computing work.
NumPy in turn invoked the Intel Math Kernel Library (MKL) [14]
for threaded and vectorized function calls. VASP executions
were launched later by the Python script for materials
modeling simulations.
We were able to reproduce the user's issue and employed our
new tools on the user's workflow. core_usage showed
that all cores were allocated and used at the beginning of
the job; however, after a while only a single core was busy
once the VASP runs finally started. show_affinity further showed
that all running threads on a compute node were bound to a
single core (core 0) instead of separate cores. A more in-depth
investigation revealed that the Intel MKL functions called
the OpenMP function omp_get_num_procs from the Intel
OpenMP library. This function was directly binding the parent
process (where the Python script was calling NumPy) to
core 0 only. This is an Intel default behavior when the environment
variable OMP_PROC_BIND is not set. Consequently, all
child processes of the Python script, including the subsequent
VASP runs, unexpectedly inherited this binding affinity. Hence
only a single core was used by all new processes/threads for
the VASP runs, even though they were designed to run in parallel
on all the available cores.
Due to this incorrect binding, the job ran a hundred times
slower. With the core_usage and show_affinity tools, the
source of the problem was quickly and efficiently determined.
The fix was easy: OMP_PROC_BIND was set to TRUE in
the user's default environment, forcing each parallel region to
obey OpenMP's affinity policy.
B. MPI Library Evaluation on A New System
When building and deploying the new Frontera system [27]
at TACC in 2019, several different MPI stacks were tested
and evaluated, including the Intel MPI library [10], the
MVAPICH2 library [11], and others. The objective was to determine
configurations and settings for optimal performance of the
system. In the evaluation process, two significant issues related
to process/thread affinity were discovered for hybrid (MPI +
OpenMP) application runs.
The first problem was an unbalanced work distribution when
not all cores on a compute node were used (due to memory
or other limitations). On Frontera CLX nodes there are 56
physical cores on two sockets (28 cores/socket). The MVAPICH2
variable MV2_HYBRID_BINDING_POLICY was initially set
to "linear" at TACC as the default and recommended value.
By evaluating application executions with our tools, it was
soon discovered that this was not an appropriate setting for
hybrid applications that do not use all the cores of a node. For
instance, for applications requiring 2^n cores per node, an execution
with 2 tasks per node and 16 threads per task was
assigned 28 cores on the first socket and 4 cores on the
second, while ideally one would want 16 cores assigned to
each socket. The default setting was changed to "spread",
which generally works well for all cases.
The second problem was with the different thread binding
behavior of Intel MPI and MVAPICH2. The default/recom-
mended setting of Intel MPI binds each process to a single
core throughout a run. The bind-to-one-core affinity assumes
that cache/memory locality will normally provide optimal per-
formance. However, with MVAPICH2, every bit in a process
mask is set to 1 by default, and therefore each process can
run on any core. Explicitly setting core binding for certain
hybrid applications compiled with MVAPICH2 has been found
to increase performance slightly.
C. Affinity Discovery
While amask has benefited many in discovering the affinity
for their HPC applications, another potential feature is learning
how to correctly interpret the complicated syntax of certain
MPI implementations. That is, a concise report of affinity can
help users when experimenting with unfamiliar options and
syntax. While the syntax for OpenMP Affinity is standard, the
implementation of certain features is implementation defined
amask can quickly show the effects of an implementation-
defined affinity setting in its report.
Discovering the affinity for a pure MPI application can
be complicated even when the number of processes evenly
divides the number of processors. For instance, on Intel KNL
with 68 cores (4 SMT threads per core, 272 processor ids), users
might want to use 16 MPI tasks. The default setting will mask
17 sequential bits (272/16 = 17, in proc-id space from 0 to 271)
for each task. However, this mask allows tasks (processes)
to overlap on a single core, as shown in Fig. 9. Using 17 MPI
tasks instead produces 16 sequential bits (272/17 = 16, i.e., 4 cores)
for each mask and makes for a more balanced
distribution without core sharing.
core > | 0 | 10 | ... 60 |
process v
0000 01234================
0123-----------------
0123-----------------
0123-----------------
0001 =====5678============
----45678------------
----4567-------------
----4567-------------
0002 =========9012========
---------9012--------
--------89012--------
--------8901---------
...
Fig. 9. amask shows the masks of processes 0 and 1 overlapping on core 4;
likewise, processes 1 and 2 overlap on core 8.
D. General
These cases demonstrate that with our affinity tools,
process/thread kernel masks can be determined easily and process
execution locations can be monitored in real time. This
can be particularly important when one begins to work on a
new system and/or in an unfamiliar environment. These tools
also help users and site staff discover issues that can be im-
mediately reported back to developers and site administrators
so that parallel applications can achieve higher performance.
V. BEST PRACTICE
The core_usage and show_affinity tools are simple and
convenient. They are recommended for daily use, especially
when a workflow is changed or any environment variables
related to process/thread binding are introduced or modified.
Neither the source code nor the workflow needs to be changed.
A user or a system administrator can simply ssh to the compute
node that is running an application and then run core_usage
and show_affinity at any time. core_usage shows how many
cores are being used by an application so that a user can
validate that it is the expected number. If the core occupation
count is smaller than the number of tasks/threads set by the
user, show_affinity should be run to check whether multiple
processes/threads are bound to the same core. Whenever users
observe a drastic drop in application performance, show_affinity
should be executed alongside the job to make sure that the number
of worker threads/processes is correct and that they have the expected
binding affinities.
It should be noted that it may take some time, e.g., up to
several minutes, for a large job to complete MPI initialization
or read large input files before all worker threads execute.
Complicated model designs and workflows may also alter the
process/thread binding status over the course of a job, and different
process/thread binding patterns are likely present during the
run of these jobs. Users can combine the Linux watch command
with show_affinity (e.g., "watch -n 1 show_affinity") to monitor
thread affinity in real time for very complicated workflows.
If a test application finishes in less than a few milliseconds,
show_affinity may not have enough time to determine the binding affinity.
The amask tool allows users to quickly see the kernel mask
of all processes/threads in a “matrix” format that facilitates
analysis of the interaction between processes/threads. It can
also be used to evaluate the effects of changing affinity settings
that may not be familiar to the user.
VI. CONCLUSION
Working with modern supercomputers with large core
counts is not trivial. To help supercomputer users run parallel
applications efficiently on the hardware, three convenient
tools, core_usage, show_affinity, and amask, were designed
and developed to monitor how computing resources are utilized
in practice. These tools have helped many HPC users and
administrators detect, understand, and resolve issues related to
process and thread affinity. Consequently, they help user jobs
to run faster and supercomputers to be used more efficiently.
ACKNOWLEDGMENT
We would like to thank all our users who worked with
these new tools and provided us with constructive feedback
and suggestions to make improvements. We would also like
to thank our colleagues in the High-Performance Computing
group and Advanced Computing Systems group who provided
expertise and insight that significantly assisted this work.
Particularly, we would like to show our gratitude to Hang Liu,
Albert Lu, John Cazes, Robert McLay, Victor Eijkhout, and
Bill Barth who helped us design, test, and debug the early
versions of these products. We also appreciate the technical
writing assistance from Bob Garza.
All these tools were mainly developed and tested on TACC's
supercomputer systems, including Stampede, Stampede2,
Lonestar5, Wrangler, Maverick2, and Frontera. The computation
of all experiments was supported by the National Science
Foundation, through the Frontera (OAC-1818253), Stampede2
(OAC-1540931) and XSEDE (ACI-1953575) awards.
REFERENCES
[1] 2017 IXPUG US Annual Meeting, Austin, TX, USA. https://www.ixpug.
org/events/ixpug-2017-us, 2017. [Online; accessed 27-Aug-2019].
[2] Linux Documentation: numactl(8): Linux man page. https://linux.die.
net/man/8/numactl, 2019. [Online; accessed 27-Aug-2019].
[3] Linux Documentation: ps(1): Linux man page. https://linux.die.net/man/
1/ps, 2019. [Online; accessed 27-Aug-2019].
[4] Linux Documentation: pthread_setaffinity_np(3) - Linux man page.
https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html,
2019. [Online; accessed 27-Aug-2019].
[5] Linux Documentation: sched_getaffinity(2): Linux man page. https://
linux.die.net/man/2/sched_getaffinity, 2019. [Online; accessed 27-Aug-
2019].
[6] Linux Documentation: sched_setaffinity(2): Linux man page. https://
linux.die.net/man/2/sched_setaffinity, 2019. [Online; accessed 27-Aug-
2019].
[7] Linux Documentation: taskset(1): Linux man page. https://linux.die.net/
man/1/taskset, 2019. [Online; accessed 27-Aug-2019].
[8] Linux Documentation: top(1) - Linux man page. https://linux.die.net/
man/1/top, 2019. [Online; accessed 27-Aug-2019].
[9] Francois Broquedis, Jerome Clet-Ortega, Stephanie Moreaud, Nathalie
Furmento, Brice Goglin, Guillaume Mercier, Samuel Thibault, and
Raymond Namyst. hwloc: a Generic Framework for Managing Hardware
Affinities in HPC Applications. In PDP 2010 - The 18th Euromicro
International Conference on Parallel, Distributed and Network-Based
Computing, 2010.
[10] Intel MPI developers. https://software.intel.com/en-us/mpi-library,
2019. [Online; accessed 27-Aug-2019].
[11] Mvapich developers. http://mvapich.cse.ohio-state.edu/, 2019. [Online;
accessed 27-Aug-2019].
[12] T. Evans, W. L. Barth, J. C. Browne, R. L. DeLeon, T. R. Furlani,
S. M. Gallo, M. D. Jones, and A. K. Patra. Comprehensive resource use
monitoring for hpc systems with tacc stats. In 2014 First International
Workshop on HPC User Support Tools, pages 13–21, Nov 2014.
[13] IBM. POWER9 Servers Overview, Scalable servers to meet the business
needs of tomorrow. https://www.ibm.com/downloads/cas/KDQRVQRR,
2019. [Online; accessed 27-Aug-2019].
[14] Intel. Intel Math Kernel Library Developer Reference. https://software.
intel.com/en-us/articles/mkl-reference-manual, 2019. [Online; accessed
27-Aug-2019].
[15] John Hennessy and David Patterson. Computer Architecture: A Quanti-
tative Approach (The Morgan Kaufmann Series in Computer Architec-
ture and Design), 6th edition. 2017.
[16] Jürgen Hafner and Georg Kresse. The Vienna AB-Initio Simulation
Program VASP: An Efficient and Versatile Tool for Studying the
Structural, Dynamic, and Electronic Properties of Materials. In: Gonis
A., Meike A., Turchi P.E.A. (eds) Properties of Complex Inorganic
Solids. Springer, Boston, MA, 1997.
[17] Lawrence Livermore National Laboratory. Sierra supercomputer. https://
computation.llnl.gov/computers/sierra, 2019. [Online; accessed 27-Aug-
2019].
[18] National Supercomputing Center in Wuxi. The Sunway TaihuLight
system. http://www.nsccwx.cn/wxcyw/soft1.php?word=soft&i=46, 2019.
[Online; accessed 27-Aug-2019].
[19] Oak Ridge National Lab. Summit: Oak Ridge National Laboratory’s
200 petaflop supercomputer. https://www.olcf.ornl.gov/olcf-resources/
compute-systems/summit/, 2019. [Online; accessed 27-Aug-2019].
[20] OpenMP Architecture Review Board. OpenMP Application Program-
ming Interface, Version 4.5, November 2015.
[21] OpenMP Architecture Review Board. OpenMP Application Program-
ming Interface, Version 5.0, November 2018.
[22] James C. Phillips, Rosemary Braun, Wei Wang, James Gumbart, Emad
Tajkhorshid, Elizabeth Villa, Christophe Chipot, Robert D. Skeel,
Laxmikant Kale, and Klaus Schulten. Scalable molecular dynamics with
NAMD. Journal of Computational Chemistry, 26:1781–1802, 2005.
[23] T. Roehl, J. Treibig, G. Hager, and G. Wellein. Overhead analysis of
performance counter measurements. In 43rd International Conference
on Parallel Processing Workshops (ICPPW), pages 176–185, Sept 2014.
[24] TACC Staff. TACC: amask project page. https://github.com/TACC/
amask/, 2019. [Online; accessed 27-Aug-2019].
[25] TACC Staff. TACC core_usage project page. https://github.com/TACC/
core_usage/, 2019. [Online; accessed 27-Aug-2019].
[26] TACC Staff. TACC show_affinity project page. https://github.com/
TACC/show_affinity/, 2019. [Online; accessed 27-Aug-2019].
[27] Texas Advanced Computing Center. Frontera User Guide. https://portal.
tacc.utexas.edu/user-guides/frontera, 2019. [Online; accessed 27-Aug-
2019].
[28] Texas Advanced Computing Center. Stampede2 User Guide. https:
//portal.tacc.utexas.edu/user-guides/stampede2, 2019. [Online; accessed
27-Aug-2019].
[29] Travis Oliphant. NumPy: A guide to NumPy. USA: Trelgol Publishing,
http://www.numpy.org/, 2006–. [Online; accessed 27-Aug-2019].
[30] J. Treibig, G. Hager, and G. Wellein. Likwid: A lightweight
performance-oriented tool suite for x86 multicore environments. In
Proceedings of PSTI2010, the First International Workshop on Parallel
Software Tools and Tool Infrastructures, San Diego CA, 2010.
[31] Wikipedia contributors. List of Intel CPU microarchitectures. https://en.
wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures, 2019. [On-
line; accessed 27-Aug-2019].
[32] Wikipedia contributors. The Sunway TaihuLight Supercomputer. https://
en.wikipedia.org/wiki/Sunway_TaihuLight, 2019. [Online; accessed 27-
Aug-2019].