Tools for Monitoring CPU Usage and Affinity in
Multicore Supercomputers
Lei Huang
Texas Advanced Computing Center
The University of Texas at Austin
10100 Burnet Rd. Austin, TX, 78758
Kent Milfeld
Texas Advanced Computing Center
The University of Texas at Austin
10100 Burnet Rd. Austin, TX, 78758
Si Liu
Texas Advanced Computing Center
The University of Texas at Austin
10100 Burnet Rd. Austin, TX, 78758
Abstract—Performance boosts in HPC nodes have come from
making SIMD units wider and aggressively packing more and
more cores in each processor. With multiple processors and so
many cores it has become necessary to understand and manage
process and thread affinity and pinning. However, affinity tools
have not been designed specifically for HPC users to quickly
evaluate process affinity and execution location. To fill this
gap, three HPC user-friendly tools, core_usage, show_affinity, and
amask, have been designed to eliminate the barriers that frustrate
users and impede them from evaluating and analyzing affinity
for applications. These tools focus on providing convenient
methods, easy-to-understand affinity representations for large
process counts, process locality, and run-time core load with
socket aggregation. These tools will significantly help HPC users,
developers and site administrators easily monitor processor
utilization from an affinity perspective.
Index Terms—Supercomputers, User support tool, Multicore
system, Affinity, Resource utilization, Core binding, Real-time
monitoring, Debugging
I. INTRODUCTION
Until the turn of the millennium, the processor frequency of
commodity CPUs increased exponentially year after year. High
CPU frequency had been one of the major driving forces behind
CPU performance, apart from the introduction of vector
processing units. However, frequencies have ceased to grow
significantly in recent years due to both technical limits and
market forces.
To accommodate the high demand of computing power in
HPC, significantly more cores are being packed into a single
compute node [15].
The needs of HPC and the use of core-rich processors are
exemplified in the extraordinarily large-scale supercomputers
found throughout the world. The Sierra supercomputer [17] at
the Lawrence Livermore National Laboratory and the Summit
supercomputer [19] at the Oak Ridge National Lab have 44
processing cores per compute node with two IBM Power9
CPUs [13]. The Sunway TaihuLight supercomputer [18] at
the National Supercomputing Center in Wuxi deploys Sunway
SW26010 manycore processors, containing 256 processing
cores plus 4 auxiliary cores for system management per
node [32]. The Stampede2 supercomputer [28] at
the Texas Advanced Computing Center (TACC) provides Intel
Knights Landing (KNL) nodes with 68 cores per node. The
Stampede2 [28] and Frontera [27] supercomputers at TACC
provide 48 and 56 processing cores per node with Intel’s
Skylake (SKX) and Cascade Lake (CLX) processors [31],
respectively. These, and other HPC processors, also support
Simultaneous Multi-Threading (SMT) at a level of 2 to 4
hardware threads per core. Consequently, there can be 2x to
4x as many logical processors as physical cores on a node.
When working with nodes of such large core counts, the
performance of HPC applications is not only dependent upon
the number and speed of the cores, but also upon proper
scheduling of processes and threads. An HPC application run
with proper affinity settings takes full advantage of resources
such as local memory and reusable caches, and obtains a
distinct performance benefit.
II. BACKGROUND
A. Process and Thread Affinity
A modern compute node often has more than one socket,
and therefore HPC applications may have non-uniform access
to memory. Ideally, an application process should be placed on
a processor close to the memory that holds the data it accesses
in order to get the best performance. Process and thread
affinity/pinning allows a process or a thread to bind to a
single processor or a set of (logical) processors. The processes
or threads with specific affinity settings will then only run
on the designated processor(s). For Single-Program Multiple-
Data (SPMD) applications, managing this affinity can be
difficult. Moreover, the present-day workflows on modern
supercomputers have moved beyond the SPMD approach and
now include hierarchical levels of Multiple-Program Multiple-
Data (MPMD), demanding even more attention to affinity.
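To make the notion of pinning concrete, the short C sketch below
(an illustration only, not code from the tools presented in this
paper) binds the calling process to a single logical processor with
Linux's sched_setaffinity(2) and then reads the mask back; the
choice of logical processor 0 is arbitrary.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Pin the calling process to logical processor 0 (an arbitrary
       choice for illustration); the scheduler will then run it only there. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Read the binding back to confirm that it took effect. */
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);
    printf("process may now run on %d logical processor(s)\n",
           CPU_COUNT(&mask));
    return 0;
}

MPI launchers and threading runtimes typically apply this same
operating-system interface on the user's behalf when affinity is
requested.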
The MPI libraries Intel MPI (IMPI), MVAPICH2 (MV2),
Open MPI (OMPI), and IBM Spectrum MPI (SMPI) provide
a variety of mechanisms for setting affinity. IMPI relies solely
on "I_MPI_*" environment variables, as does MV2
(MV2_CPU/HYBRID_BINDING_*, MV2_CPU_MAPPING_*, etc.).
SMPI uses both environment variables (MP_TASK/CPU_*) and
mpirun command-line options (-map-by, -bind-to, -aff shortcuts,
etc.). Similarly, OMPI uses mpirun options (-bind-to-core,
--cpus-per-proc, etc.) and also accepts a rankfile with a map
(slot-list) for each rank.
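Whichever mechanism is used, the resulting binding can be verified
from within the application itself. The following minimal MPI sketch
(a hypothetical example, not part of the tools described here) has
every rank report the logical processors in its affinity mask:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[64];
    gethostname(host, sizeof(host));

    /* Query the set of logical processors this rank may run on. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);

    /* Collect the allowed logical processor IDs into a string. */
    char buf[1024] = "";
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    for (long cpu = 0; cpu < ncpus; cpu++)
        if (CPU_ISSET(cpu, &mask))
            snprintf(buf + strlen(buf), sizeof(buf) - strlen(buf), " %ld", cpu);

    printf("rank %d on %s is bound to:%s\n", rank, host, buf);
    MPI_Finalize();
    return 0;
}

Compiled with an MPI wrapper compiler and launched with mpirun, this
prints one line per rank and makes an unintended binding (for example,
all ranks sharing a single core) immediately visible.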
When no affinity is specified, these MPIs evaluate a node’s
hardware configuration (for example with hwloc for MV2 and