Tools for Monitoring CPU Usage and Affinity in
Multicore Supercomputers
Lei Huang
Texas Advanced Computing Center
The University of Texas at Austin
10100 Burnet Rd. Austin, TX, 78758
Kent Milfeld
Texas Advanced Computing Center
The University of Texas at Austin
10100 Burnet Rd. Austin, TX, 78758
Si Liu
Texas Advanced Computing Center
The University of Texas at Austin
10100 Burnet Rd. Austin, TX, 78758
Abstract—Performance boosts in HPC nodes have come from
making SIMD units wider and aggressively packing more and
more cores into each processor. With multiple processors and so
many cores, it has become necessary to understand and manage
process and thread affinity and pinning. However, affinity tools
have not been designed specifically for HPC users to quickly
evaluate process affinity and execution location. To fill this
gap, three HPC user-friendly tools, core_usage, show_affinity, and
amask, have been designed to eliminate barriers that frustrate
users and impede the evaluation and analysis of affinity
for applications. These tools focus on providing convenient
methods, easy-to-understand affinity representations for large
process counts, process locality, and run-time core load with
socket aggregation. These tools will significantly help HPC users,
developers and site administrators easily monitor processor
utilization from an affinity perspective.
Index Terms—Supercomputers, User support tool, Multicore
system, Affinity, Resource utilization, Core binding, Real-time
monitoring, Debugging
I. INTRODUCTION
Until the turn of the millennium, the processor frequency of
commodity CPUs increased exponentially year after year. Increasing
CPU frequency was one of the major driving forces behind
CPU performance gains, alongside the introduction of vector
processing units. However, frequency has ceased to grow significantly in
recent years due to both technical limits and market forces.
To accommodate the high demand for computing power in
HPC, significantly more cores are being packed into a single
compute node [15].
The needs of HPC and the use of core-rich processors are
exemplified in the extraordinarily large-scale supercomputers
found throughout the world. The Sierra supercomputer [17] at
the Lawrence Livermore National Laboratory and the Summit
supercomputer [19] at the Oak Ridge National Lab have 44
processing cores per compute node with two IBM Power9
CPUs [13]. The Sunway TaihuLight supercomputer [18] at
the National Supercomputing Center in Wuxi deploys Sunway
SW26010 manycore processors, with 256 processing cores plus
4 auxiliary cores for system management per node [32].
The Stampede2 supercomputer [28] at
the Texas Advanced Computing Center (TACC) provides Intel
Knights Landing (KNL) nodes with 68 cores per node. The
Stampede2 [28] and Frontera [27] supercomputers at TACC
provide 48 and 56 processing cores per node with Intel’s
Skylake (SKX) and Cascade Lake (CLX) processors [31],
respectively. These, and other HPC processors, also support
Simultaneous Multi-Threading (SMT) at a level of 2 to 4
hardware threads per core. Consequently, there can be 2x to 4x more logical
processors than physical cores on a node.
When working with nodes of such large core counts, the
performance of HPC applications depends not only on
the number and speed of the cores, but also on proper
scheduling of processes and threads. HPC applications run
with proper affinity settings take full advantage of resources
such as local memory and reusable caches, and obtain a
distinct performance benefit.
II. BACKGROUND
A. Process and Thread Affinity
A modern compute node often has more than one socket,
and therefore HPC applications may have non-uniform
access to memory. Ideally, an application process should be
placed on a processor close to the data it accesses in memory
to get the best performance. Process and thread
affinity/pinning allows a process or a thread to bind to a
single processor or a set of (logical) processors. The processes
or threads with specific affinity settings will then only run
on the designated processor(s). For Single-Program Multiple-
Data (SPMD) applications, managing this affinity can be
difficult. Moreover, the present-day workflows on modern
supercomputers have moved beyond the SPMD approach and
now include hierarchical levels of Multiple-Program Multiple-
Data (MPMD), demanding even more attention to affinity.
Intel MPI (IMPI), MVAPICH2 (MV2), Open MPI (OMPI), and
IBM Spectrum MPI (SMPI) provide a variety of mechanisms for
setting MPI affinity. IMPI relies solely on "I_MPI_*" environment
variables, as does MV2 (MV2_CPU/HYBRID_BINDING_*,
MV2_CPU_MAPPING, etc.). SMPI uses both environment
variables (MP_TASK/CPU_*) and mpirun command-line
options (-map-by, -bind-to, -aff shortcuts, etc.). Similarly,
OMPI uses mpirun options (-bind-to-core, -cpus-per-proc,
etc.) and also accepts a rankfile with a map (slot-list) for
each rank.
When no affinity is specified, these MPIs evaluate a node's
hardware configuration (for example, with hwloc for MV2 and
OMPI) and make appropriate default affinity settings. OpenMP
affinity for hybrid runs can be specified by various "vendor"
methods. However, since all of these MPIs accept OpenMP's
OMP_PLACES/OMP_PROC_BIND specifications, it is best
to use the standard's mechanism. Even so, for portable hybrid
computing a user must deal with many ways of setting each
rank's affinity. (When a master thread encounters a parallel
region, it inherits the MPI rank's mask, and OpenMP affinity
specifications take over.)
Fig. 1 shows a schematic of the affinity process. A mask
is maintained by the kernel for each process, describing
which processor(s) the process can run on. The mask consists
of a bit for each processor, and the process can execute on any
processor whose mask bit is set. There are a myriad of ways
to set and alter the affinity mask for processes of a parallel
application. For instance, vendors have their own way to set
affinities for MPI and OpenMP, usually through environment
variables. Only recently has OpenMP 4.5 [20], [21] provided a
standard way to set affinity for threads, and MPI has yet to do
this for MPI tasks. As shown in Fig. 1, the affinity can be affected
not only before an application is launched but also while
it is running. There are utilities such as numactl [2] and util-linux
taskset [7] to do this. Furthermore, the affinity can even
be changed within a program with the sched_setaffinity [6]
function.
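As an illustration of the programmatic route, the following minimal C sketch (not taken from the tools described in this paper) uses sched_setaffinity to restrict the calling process to cores 1, 3, 5, and 7, the same mask illustrated in Fig. 1:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void)
  {
      cpu_set_t mask;
      CPU_ZERO(&mask);
      CPU_SET(1, &mask);      /* set bits for cores 1, 3, 5, and 7 */
      CPU_SET(3, &mask);
      CPU_SET(5, &mask);
      CPU_SET(7, &mask);

      /* pid 0 means "the calling thread/process" */
      if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
          perror("sched_setaffinity");
      return 0;
  }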
Fig. 1. The left box indicates mechanisms for setting the affinity mask. The
right box illustrates how a BIOS setting has designated the processor ids
for the hardware (cores). The center section shows a mask with bits set for
execution on cores 1, 3, 5, and 7.
Understanding the “vernaculars” of all these methods can be
challenging. Even the default settings are sometimes unknown
to users. In addition, users are commonly uncertain whether their
attempts to set the affinity for the processes of their parallel
applications have succeeded. Other factors (see Fig. 1), such as user
environment variables, the MPI launcher, etc., also create a lack of
confidence in a user's attempt to control affinity. Incorrect core
binding for processes and/or threads can have adverse effects
on performance, even reducing program performance by
single- or multi-digit factors.
B. Related Work
There are many ways to view CPU loads and the affinity
mask of a process. However, some methods are not well
known or are only accessible through non-obvious means.
The Linux command-line tool top [8] and its more recent
counterparts (htop or atop) can be used to monitor CPU loads,
and manage process and thread affinity in real time. The
Linux command ps [3] can report which core a process is
presently running on, but it does not report the affinity mask
explicitly. The taskset [7] command-line utility is normally
more helpful since it can query and modify the binding affinity
of a given thread or process. Linux also provides the API functions
sched_getaffinity [5] and pthread_setaffinity_np [4] for a
process or thread to query and set its own affinity (kernel
mask). While these tools are pervasive and do provide the
information needed, they are sometimes cumbersome to use,
particularly for supercomputer users working with large core
counts.
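For completeness, here is a minimal sketch (illustrative only, not from the tools described below) of these two calls: a thread binds itself with pthread_setaffinity_np and then reads its mask back with sched_getaffinity:

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>

  static void *worker(void *arg)
  {
      int core = *(int *)arg;
      cpu_set_t want, got;

      CPU_ZERO(&want);
      CPU_SET(core, &want);
      pthread_setaffinity_np(pthread_self(), sizeof(want), &want);

      sched_getaffinity(0, sizeof(got), &got);   /* 0 = calling thread */
      printf("worker bound to core %d? %s\n",
             core, CPU_ISSET(core, &got) ? "yes" : "no");
      return NULL;
  }

  int main(void)
  {
      pthread_t t;
      int core = 2;                              /* example core id */
      pthread_create(&t, NULL, worker, &core);
      pthread_join(t, NULL);
      return 0;
  }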
For HPC users, these tools may provide too much admin-
istrative information, and it may not be apparent how to get
HPC-relevant information for their applications. Users need
to remember extra options or give extra instructions to obtain
relevant CPU information. For instance, top does not show
the loads of individual processors by default: pressing "1"
within a top session is required to display the performance
data of individual CPUs, and pressing "z" is needed to display
running processes in color. The htop utility does show usage
information for all logical processors, and the load on each
individual core is represented by a progress bar in text mode.
It works well up to about 160 cores. However, the progress
bars become distracting on a computer with many cores.
Furthermore, such tools were originally designed for admin-
istrators to display multi-user information, not for a single-
user screen display of HPC information. Therefore, there is
a real need for convenient HPC tools that readily display
per-CPU utilization and affinity information for multicore compute
nodes.
The MPI libraries and OpenMP 5.0 [21] implementations
themselves can present affinity mask information for the
processes or threads they instantiate. For instance, by setting the
I_MPI_DEBUG environment variable to 4 or above, the Intel MPI
runtime will print a list of processor ids (mask bits set) for each
process (rank) at launch time. Likewise, for OpenMP 5.0
implementations, setting OMP_DISPLAY_AFFINITY to TRUE
will have the runtime print a line for each thread (number)
reporting the processor ids associated with its binding at the
beginning of the first parallel region. However, it is difficult to
make sense of these lists for multicore or manycore compute
nodes.
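A minimal sketch of the API counterpart of this environment variable (assuming an OpenMP 5.0-compliant compiler and runtime; this is not part of the tools presented here): each thread can print its own binding explicitly with omp_display_affinity:

  #include <omp.h>

  int main(void)
  {
      #pragma omp parallel
      {
          /* NULL means "use the current affinity format" */
          omp_display_affinity(NULL);
      }
      return 0;
  }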
There are other comprehensive tools that can be used to
collect the CPU loads and the affinity information. TACC
Stats [12] is a well-established one. It monitors parallel jobs
on supercomputers and collects a series of system statistics
and hardware performance counts, including the CPU usage of
each core. However, the data processing and display are not real-time,
and this tool is mainly designed for system administrators.
Another practical tool suite is Likwid [23], [30], which consists
of many convenient command-line applications. In particular,
likwid-topology prints thread, cache, and NUMA topology
information, and likwid-pin can be used to pin threaded applications.
Based on years of experience administering multiple supercomputer
systems and supporting thousands of HPC users, the
following questions are frequently asked by users and administrators
when monitoring a program running on an HPC system:
Does my application use the maximum capacity of the CPUs?
How many physical or logical processors are practical for a
running application? What process and thread affinity pattern
is currently in use given my settings? To help
HPC users and administrators answer these questions easily,
three innovative tools, core_usage, show_affinity, and amask,
were designed and developed. They are now serving the HPC
community by presenting real-time CPU usage and affinity
information on systems with large core counts.
III. THREE INNOVATIVE TOOLS
A. core_usage
1) IMPLEMENTATION: The first tool we designed and
developed to quickly and efficiently show processor loads
is core_usage [25]. It obtains the logical processor (core)
usage information directly from /proc/stat on Linux systems.
Specifically, non-idle time (t_NonIdle) is the sum of six components:
the user, nice, system, irq, softirq, and steal columns. Idle time
(t_Idle) is calculated as the sum of the idle and iowait columns.
core_usage regularly reads this kernel activity information for every
logical processor on a node, then calculates core utilization
from the two most recent samples according to
the following equation.
utilization =
( t_NonIdle_new - t_NonIdle_old ) /
( (t_NonIdle_new + t_Idle_new) -
(t_NonIdle_old + t_Idle_old) )
core_usage then displays the CPU load of all logical processors.
The data are grouped by socket id, and the first core of
every socket is highlighted to make it easy to determine whether the
processes/threads are evenly distributed across sockets.
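A minimal C sketch of this sampling logic (an illustration based on the /proc/stat field layout documented in proc(5), not the actual core_usage source) is shown below; it takes two samples one second apart and prints the per-CPU utilization:

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  #define MAX_CPUS 1024

  /* Read per-CPU non-idle and idle jiffies; returns the number of CPUs found. */
  static int sample(long long nonidle[], long long idle[])
  {
      char line[512];
      int n = 0;
      FILE *fp = fopen("/proc/stat", "r");
      if (!fp) return -1;
      while (fgets(line, sizeof line, fp) && n < MAX_CPUS) {
          long long u, ni, s, id, io, irq, sirq, st;
          /* keep only "cpu0", "cpu1", ...; skip the aggregate "cpu" line */
          if (strncmp(line, "cpu", 3) != 0 || line[3] < '0' || line[3] > '9')
              continue;
          if (sscanf(line, "cpu%*d %lld %lld %lld %lld %lld %lld %lld %lld",
                     &u, &ni, &s, &id, &io, &irq, &sirq, &st) == 8) {
              nonidle[n] = u + ni + s + irq + sirq + st;   /* t_NonIdle */
              idle[n]    = id + io;                        /* t_Idle    */
              n++;
          }
      }
      fclose(fp);
      return n;
  }

  int main(void)
  {
      static long long n0[MAX_CPUS], i0[MAX_CPUS], n1[MAX_CPUS], i1[MAX_CPUS];
      int ncpu = sample(n0, i0);
      sleep(1);                                  /* update interval */
      sample(n1, i1);
      for (int c = 0; c < ncpu; c++) {
          double total = (double)((n1[c] + i1[c]) - (n0[c] + i0[c]));
          double util  = total > 0 ? (n1[c] - n0[c]) / total : 0.0;
          printf("cpu%-3d %.2f\n", c, util);
      }
      return 0;
  }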
2) COMMANDS AND REPORTS: The syntax to run
core_usage is:
core_usage [<int>] [txt]
where <int> is the update interval in seconds (an integer or
floating-point value; the default is 1). For example, "core_usage
3" provides an update every three seconds. Users can also
add "txt" as a parameter to force text mode.
The core_usage command can present a graphical user
interface (GUI) or a command-line interface (CLI). When X
Forwarding is supported in the current environment, the GUI
version is presented. In the GUI, the size of the plot area
is automatically set according to the number of cores on the
running computer. The usage percentage of each individual
logical processor is represented by the height of a blue bar, as
shown in Fig. 2, for a hybrid job run on the Stampede2 system.
The GUI version is an ideal way to visualize core usage
information and can easily be extended to support thousands
of cores per node in the future by adding more rows in the
bar chart.
Fig. 2. Snapshot of the core_usage (GUI) display for a hybrid application run
with 4 MPI tasks and 8 threads per MPI task on a Stampede2 Intel Xeon Skylake
compute node. Colors and format are slightly modified for presentation.
If X Forwarding is not detected, the CLI version automatically
launches and reports in text mode, as shown in Fig. 3. A
floating-point number between 0.0 and 1.0 is calculated and
displayed for each logical processor to represent the current
core usage. The text is monochrome if the core is idle (usage
less than 2%); otherwise it is green to highlight the cores in
use.
From these figures, it can be seen that core_usage presents,
in logical order, the current usage of each individual processor
in real time. The results are collected and displayed in a
socket-aware manner so that users can easily track processor
status by socket. In the latest version, core_usage explicitly
displays the name of the application that keeps each individual
core busy, as shown in Fig. 3. To keep the results clear and
concise, core_usage shows only the application with the top
usage for each individual logical processor in this version.
As mentioned above, it is also possible to run core_usage
manually in terminal mode with the argument "txt", even if
an X11 environment is available and the GUI is the default starting
mode.
B. show_affinity
1) BACKGROUND AND IMPLEMENTATION: Though
core_usage is valuable for monitoring individual CPU core
usage and detecting underutilization issues, it does not report
the process and thread binding (affinity) that may be needed
to adjust the resource usage.
Normally, the Linux tool taskset [7] can be used to retrieve
the CPU binding affinity of individual threads or processes
with commands like "taskset -p <pid>". However, users would need
to compile a full list of pids/tids for all processes/threads
and run taskset for each of them. This
would be tedious and error-prone, and the "simple" command-line
expressions or scripts involved may require Unix skills unfamiliar
to inexperienced HPC users.
Fig. 3. Snapshot of the core_usage CLI text report for a Weather Research and Forecasting (WRF) run with 16 MPI tasks and 4 OpenMP threads per MPI task
on one Stampede2 KNL compute node. Colors and format are slightly modified for presentation.
To report this type of affinity information automatically
and clearly, the show_affinity [26] tool was developed. This
tool was also designed to be intuitive and simple to use.
When show_affinity executes, all running processes/threads
on a compute node are enumerated by inspecting the directories
under /proc and their owners. To avoid unnecessary
information, show_affinity only queries and reports the binding
affinity of the processes owned by the current user on the
compute node. Application names are then extracted from
/proc/<pid>/exe. For each process the threads are enumerated, and
the core binding affinity of each individual process/thread is
queried and displayed.
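A condensed sketch of this enumeration (illustrative only, not the actual show_affinity source) walks /proc/<pid>/task/<tid> for processes owned by the current user and queries each thread's mask with sched_getaffinity, which accepts a thread id:

  #define _GNU_SOURCE
  #include <dirent.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/stat.h>
  #include <sys/types.h>
  #include <unistd.h>

  static void print_thread_mask(pid_t tid)
  {
      cpu_set_t mask;
      if (sched_getaffinity(tid, sizeof(mask), &mask) != 0)
          return;
      printf("  tid %d:", (int)tid);
      for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
          if (CPU_ISSET(cpu, &mask))
              printf(" %d", cpu);
      printf("\n");
  }

  int main(void)
  {
      uid_t me = getuid();
      DIR *proc = opendir("/proc");
      struct dirent *p;
      while (proc && (p = readdir(proc)) != NULL) {
          char path[512];
          struct stat st;
          if (p->d_name[0] < '0' || p->d_name[0] > '9')
              continue;                          /* not a pid directory */
          snprintf(path, sizeof(path), "/proc/%s", p->d_name);
          if (stat(path, &st) != 0 || st.st_uid != me)
              continue;                          /* not owned by this user */
          printf("pid %s\n", p->d_name);
          snprintf(path, sizeof(path), "/proc/%s/task", p->d_name);
          DIR *task = opendir(path);
          struct dirent *t;
          while (task && (t = readdir(task)) != NULL)
              if (t->d_name[0] >= '0' && t->d_name[0] <= '9')
                  print_thread_mask((pid_t)atol(t->d_name));
          if (task) closedir(task);
      }
      if (proc) closedir(proc);
      return 0;
  }

Reading the executable name from /proc/<pid>/exe (e.g., with readlink) would complete the four columns of the report.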
2) COMMANDS AND REPORTS: There are two modes of
operation for show_affinity. The syntax is:
show_affinity [all]
In the first and default mode, the tool shows the processes/threads
launched by the current user that keep CPUs busy,
as demonstrated in Fig. 4. To make the results concise
and clear, the output is organized in four columns: process
id (pid), executable name, thread id (tid), and binding affinity,
grouped by pid. The second mode is invoked with the "all"
argument, in which show_affinity displays all running processes
and threads owned by the current user on the current compute
node, as demonstrated in Fig. 5.
C. amask
1) BACKGROUND AND IMPLEMENTATION: The initial
amask utility [24] was designed [1] as an analysis tool to
confirm affinity settings for the OpenMP 4.0 Affinity implementation
on the manycore Intel Xeon Phi system (68 cores, 272 processor ids).
pid Exe_Name tid Affinity
91884 namd2_skx 91884 0
91910 2
91915 4
...
91942 20
91945 22
91885 namd2_skx 91885 24
91911 26
91914 28
...
91941 44
91944 46
91886 namd2_skx 91886 1
91909 3
91913 5
...
91943 21
91946 23
91887 namd2_skx 91887 25
91908 27
91912 29
...
91950 45
91951 47
Fig. 4. Snapshot of show_affinity showing the running processes and threads
that keep CPUs busy for a NAMD [22] run with 4 MPI tasks and 12 threads
per MPI task on one Stampede2 Skylake compute node. The output contains
four columns: process id (pid), executable name, thread id (tid), and core
binding affinity. Format is slightly modified for presentation.
pid Exe_Name tid Affinity
91544 slurm_script 91544 0-95
91551 sleep 91551 0-95
91649 sshd 91649 0-95
91650 bash 91650 0-95
91829 ibrun 91829 0-95
91879 mpiexec.hydra 91879 0-95
91880 pmi_proxy 91880 0-95
91884 namd2_skx 91884 0
91905 0,2,4,...
91910 2
...
91942 20
91945 22
91885 namd2_skx 91885 24
91904 24,26,...
91911 26
...
91941 44
91944 46
91886 namd2_skx 91886 1
91906 1,3,5,...
91909 3
...
91943 21
91946 23
91887 namd2_skx 91887 25
91907 25,27,...
91908 27
...
91950 45
91951 47
91975 tee 91975 0-95
Fig. 5. Snapshot of show_affinity with the "all" argument showing all running
processes and threads for a NAMD run with 4 MPI tasks and 12 threads
per MPI task on one Stampede2 Skylake compute node. Format is slightly
modified for presentation.
It consisted of a single, argumentless
library function called within an application. However, it was
found that users were more interested in executing a command
immediately before their application (and after setting the
affinity environment) to report the affinity of a "generic"
parallel region, rather than instrumenting an application with
a library call. Therefore, stand-alone external commands
were created. amask was soon adapted for MPI, and the
external commands (and library calls) became amask_omp,
amask_mpi, and amask_hybrid for pure OpenMP, pure MPI,
and hybrid OpenMP-MPI applications, respectively.
The OpenMP component works for any version of OpenMP.
The commands with MPI components must be compiled/used
with the same flavor (OpenMPI, IMPI, MVAPICH2, etc.) used
by the application, so that the same runtimes are invoked. It
is worth mentioning that the amask code does not rely on
any vendor-specific features or APIs. The library (and other
utilities) remain available for developers or power-users who
want in-situ reporting.
2) COMMANDS AND API: All three amask commands
accept the same options. The syntax is:
amask_[omp|mpi|hybrid] -h -vk -w# -pf
The amask_omp and amask_mpi commands report masks
for a pure OpenMP or pure MPI run, respectively. The
amask_hybrid command reports the (parent)
MPI masks followed by the OpenMP thread masks for each
MPI task. The -h option provides help. The -vk option
overrides the automatic core view, forcing a kernel (k) view
(mask of processor ids). amask will load each process for #
seconds when the -w# (wait) option is invoked (the default is 10).
This is helpful when used in combination with monitoring
tools like core_usage and htop. A slight pause after printing
each row (mask) was found to give the viewer time to start
comprehending the content of each mask, and then allow
analysis of the pattern as more rows are reported. This slow
mode can be turned off by requesting the fast printing mode
with -pf.
The library has function names corresponding to the com-
mands:
C/C++ Fortran
amask_omp(); call amask_omp()
amask_mpi(); call amask_mpi()
amask_hybrid(); call amask_hybrid()
These can be inserted in a pure OpenMP parallel region,
after MPI_Init in a pure MPI code, or within an OpenMP
parallel region of a hybrid code (calls within a loop structure
should be conditionally executed for only one iteration).
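The placement might look like the following sketch for a hybrid code (illustrative only: the argumentless call names are from the list above, but the prototype declaration and build details are assumptions; see the amask project page [24] for the actual instructions):

  #include <mpi.h>
  #include <omp.h>

  void amask_hybrid(void);       /* assumed prototype for the library call */

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      #pragma omp parallel
      {
          amask_hybrid();        /* called once by every thread of every rank */
      }

      /* ... application work ... */

      MPI_Finalize();
      return 0;
  }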
3) REPORTS: The important feature that makes amask
especially useful is that it reports a mask for each process of
a parallel execution in a matrix format (process number vs.
processor id), so that the user can quickly visualize relevant
affinity patterns (such as socket, NUMA node, tile,
core, and single hardware-thread assignments).
In the reports shown in Fig. 6, each row represents a
"kernel" mask for the process number labeled at the left. Each
label (process) is followed by N characters, one for each bit
of the kernel's affinity mask. A dash (-) represents an unset
bit, while a digit (0-9) represents a set bit. To make it easy to
evaluate the processor id of a set bit, the digit shown is added
to the group label in the header at the top (labels represent
groups of 10). For instance, the mask of process 1 in Fig. 6a) has mask
bit 12 (proc-id 12) set (proc-id = 2 + 10 from the group
value). This single-character bit-mask representation is ideal
for working with systems with hundreds of logical processors.
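A small sketch of this row encoding (an illustration, not the amask source): the calling thread's kernel mask is rendered as one character per processor id, a dash for an unset bit and the last digit of the processor id for a set bit, so that digit plus column-group label recovers the processor id:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  static void print_mask_row(int process, cpu_set_t *mask, int nprocs)
  {
      printf("%04d ", process);
      for (int id = 0; id < nprocs; id++)
          putchar(CPU_ISSET(id, mask) ? (char)('0' + id % 10) : '-');
      putchar('\n');
  }

  int main(void)
  {
      cpu_set_t mask;
      sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = calling thread */
      print_mask_row(0, &mask, 24);                /* e.g., a 24-processor node */
      return 0;
  }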
Fig. 6a) and Fig. 6b) show processes bound to single cores
and to sockets, respectively (where proc-id sets 0-11 and 12-23
are on different sockets). In the latter case each process
can “float” on any core in the socket. Fig. 6c) illustrates a
socket affinity, just as in Fig. 6b), but for a system with even
and odd proc-id sets for each socket. While the sequential
or even-odd assignments could have been determined from
hwloc [9], or /proc/cpuinfo on certain Linux systems, the
amask report identifies the proc-id assignment pattern. The
last report, Fig. 6d), shows a scenario where each process is
allowed to execute on any core of the system.
proc-id > | 0 | 10 | 20 | a)
process v | | | |
0000 0-----------------------
0001 ------------2----------
proc-id > | 0 | 10 | 20 | b)
process v | | | |
0000 012345678901------------
0001 ------------234567890123
proc-id > | 0 | 10 | 20 | c)
process v | | | |
0000 0-2-4-6-8-0-2-4-6-8-0-2-
0001 -1-3-5-7-9-1-3-5-7-9-1-3
proc-id > | 0 | 10 | 20 | d)
process v | | | |
0000 012345678901234567890123
0001 012345678901234567890123
Fig. 6. Masks for 2 processes on a 2-socket, 24-core platform. A dash (-)
represents an unset bit, while a single digit represents a set bit. Add the digit to
the column group value to obtain the processor id (core number). a) Process
0 can only execute on core 0; process 1 can only execute on core 12. b)
Process 0 can execute on cores 0-11; process 1 can execute on cores 12-23.
c) Process 0 can execute on even-numbered cores; process 1 can execute on
odd-numbered cores. d) Processes 0 and 1 can execute on any core.
With simultaneous multithreading (SMT), available on IBM,
Intel, AMD, and other processors, the OS assigns multiple
(virtual) processors to a core. Hence each core has multiple
processor ids, also called hardware threads (HWTs), the term
used here.
When amask detects hardware threading, it reports a "core"
view, showing a column for each core id, with each process
reporting one row (mask) per hardware thread. Hence, core
group numbers appear in the header instead of processor-id
group numbers.
For a 2-socket system with sequential processor id numbering,
Fig. 7a) shows the affinity masks for process 0 executing on
either HWT of core 0 and process 1 executing on either HWT
of core 12, while Fig. 7b) shows execution is available only
on the first hardware thread of two adjacent cores. Fig. 7c)
shows 68 threads executing with "cores" affinity (execution
available on all hardware threads of a core) for a 4-SMT, 68-core
Intel Xeon Phi system. It is easy to see that each process
is assigned to all HWTs of a core. The processor id list for core
number 67 is the set {67, 135, 203, 271}; determining that
these represent a single core in the amask kernel (processor
id) view would be difficult, and checking the assignment with
just a processor id listing would be tedious.
a) Core ids
proc-id > | 0 | 10 | 20 |
process v | | | |
0000 0======================= HWT0
0----------------------- HWT1
0001 ===========2============
-----------2------------
b) Core ids
proc-id > | 0 | 10 | 20 |
process v | | | |
0000 01======================
------------------------
0001 ===========23===========
------------------------
c) Core ids
proc-id > | 0 | ... | 60 |
process v | | | |
0000 0========== ======== HWT0
0---------- -------- HWT1
0---------- -------- HWT2
0---------- -------- HWT3
: :
0067 =========== =======7
----------- -------7
----------- -------7
----------- -------7
Fig. 7. Core view of masks for SMT systems: equal signs (=) represent unset bits
on the first-HWT row (to distinguish it), and dashes (-) represent unset bits for the other
HWTs. a) Process 0 can execute on either HWT of core 0. Process 1 can
execute on either HWT of core 12. b) Process 0 can execute only on HWT
0 of cores 0 and 1. Process 1 can execute only on HWT 0 of cores 12 and
13. c) Process i can execute on any HWT of core i (for a 68-core, 4-SMT
system).
For multi-node executions, amask reports the masks on
each node and labels each process (row) with a node name
and rank number, as shown in Fig. 8a). Masks for hybrid
(OpenMP/MPI) executions are reported by the amask_hybrid
command. The report consists of two parts. The first part
contains the masks of the MPI tasks (just as amask_mpi
would report). It is important to show these masks because
the OpenMP runtime inherits the task's mask for the parallel
region and can only assign thread masks as subsets (or the
full set) of the set bits in the MPI task mask. Fig. 8b) shows
the MPI (parent) masks and thread masks for a hybrid run of
4 MPI tasks with 6 OpenMP threads per task. Each thread is
bound to a single core, as would be desired for a 4 x 6 (task
x thread) run on a 24-core system.
IV. CASE STUDY
A. Unexpected Slow VASP Runs
In 2018, one of our experienced Vienna Ab Initio Simulation
Package (VASP) [16] users reported an unexpected performance
drop in his jobs and needed help debugging the issue on
TACC's Stampede2 system.
a)
proc-id > | 0 | 10 | 20 |
node rank | | | |
c123-509 0000 012345678901------------
c123-509 0001 ------------234567890123
c123-802 0002 012345678901------------
c123-802 0003 ------------234567890123
b) (parent) MPI mask
proc-id > | 0 | 10 | 20 |
rank v | | | |
0000 012345------------------
0001 ------678901------------
0002 ------------234567------
0003 ------------------890123
MPI-thread mask
proc-id > | 0 | 10 | 20 |
rank thrd | | | |
0000 0000 0-----------------------
0000 0001 -1----------------------
0000 0002 --2---------------------
0000 0003 ---3--------------------
...
0003 0003 ---------------------1--
0003 0004 ----------------------2-
0003 0005 -----------------------3
Fig. 8. a) Affinity for a pure MPI multi-node execution, 4 MPI tasks on
two separate nodes. b) Affinity for an OpenMP/MPI hybrid (amask_hybrid)
execution with parent MPI and hybrid (rank/thread) reports.
The user had created a top-level
script in Python to manage the overall workflow and invoked
NumPy [29] in this script for scientific computing work.
NumPy in turn invoked the Intel Math Kernel Library (MKL) [14]
for threaded and vectorized function calls. VASP executions
were launched later by the Python script for materials
modeling simulations.
We were able to reproduce the user's issue and employed our
new tools on the user's workflow. core_usage showed
that all cores were allocated and used at the beginning of
the job; however, after a while only a single core was busy
once the VASP runs finally started. show_affinity further showed
that all running threads on a compute node were bound to a
single core (core 0) instead of separate cores. A more in-depth
investigation revealed that the Intel MKL functions called
the OpenMP function omp_get_num_procs from the Intel
OpenMP library. This function was directly binding the parent
process (where the Python script was calling NumPy) to
core 0 only. This is an Intel default behavior when the environment
variable OMP_PROC_BIND is not set. Consequently, all
child processes of the Python script, including the subsequent
VASP runs, unexpectedly inherited this binding affinity. Hence
only a single core was used by all new processes/threads for
the VASP runs, even though they were designed to run in parallel
on all the available cores.
Due to this incorrect binding, the job ran a hundred times
slower. With the core_usage and show_affinity tools, the
source of the problem was quickly and efficiently determined.
The fix was easy: OMP_PROC_BIND was set to TRUE in
the user's default environment, forcing each parallel region to
obey OpenMP's affinity policy.
B. MPI Library Evaluation on A New System
When building and deploying the new Frontera system [27]
at TACC in 2019, several different MPI stacks were tested
and evaluated, including the Intel MPI library [10], the
MVAPICH2 library [11], and others. The objective was to determine
configurations and settings for optimal performance of the
system. In the evaluation process, two significant issues related
to process/thread affinity were discovered for hybrid (MPI +
OpenMP) application runs.
The first problem was an unbalanced work distribution when
not all cores on a compute node were used (due to memory
or other limitations). On Frontera CLX nodes there are 56
physical cores on two sockets (28 cores/socket). The MVAPICH2
variable MV2_HYBRID_BINDING_POLICY was initially set
to "linear" at TACC as the default and recommended value.
By evaluating application executions with our tools, it was
soon discovered that this was not an appropriate setting for
hybrid applications that do not use all the cores of a node. For
instance, for applications requiring 2^n cores per node, an execution
with 2 tasks per node and 16 threads per task was
assigned 28 cores on the first socket and 4 cores on the
second, while ideally one would want 16 cores assigned to
each socket. The default setting was changed to "spread",
which generally works well for all cases.
The second problem was with the different thread binding
behavior of Intel MPI and MVAPICH2. The default/recom-
mended setting of Intel MPI binds each process to a single
core throughout a run. The bind-to-one-core affinity assumes
that cache/memory locality will normally provide optimal per-
formance. However, with MVAPICH2, every bit in a process
mask is set to 1 by default, and therefore each process can
run on any core. Explicitly setting core binding for certain
hybrid applications compiled with MVAPICH2 has been found
to increase performance slightly.
C. Affinity Discovery
While amask has benefited many in discovering the affinity
for their HPC applications, another potential feature is learning
how to correctly interpret the complicated syntax of certain
MPI implementations. That is, a concise report of affinity can
help users when experimenting with unfamiliar options and
syntax. While the syntax for OpenMP Affinity is standard, the
implementation of certain features is implementation defined
amask can quickly show the effects of an implementation-
defined affinity setting in its report.
Discovering the affinity for a pure MPI application can
be complicated even when the number of processes evenly
divides the number of processors. For instance, on Intel KNL
with 68 cores (4 SMT threads per core, 272 processor ids), users
might want to use 16 MPI tasks. The default setting will mask
17 sequential bits (272/16 = 17, in proc-id space from 0 to 271)
for each task. However, this mask allows tasks (processes)
to overlap on a single core, as shown in Fig. 9. Using 17 MPI
tasks instead produces 16 sequential bits (272/17 = 16, i.e., 4 cores)
for each mask and makes for a more balanced
distribution without core sharing.
core > | 0 | 10 | ... 60 |
process v
0000 01234================
0123-----------------
0123-----------------
0123-----------------
0001 =====5678============
----45678------------
----4567-------------
----4567-------------
0002 =========9012========
---------9012--------
--------89012--------
--------8901---------
...
Fig. 9. amask shows the masks of processes 0 and 1 overlapping on core 4;
likewise, processes 1 and 2 overlap on core 8.
D. General
These cases demonstrate that with our affinity tools,
process/thread kernel masks can be determined easily and process
execution locations can be monitored in real time. This
can be particularly important when one begins to work on a
new system and/or in an unfamiliar environment. These tools
also help users and site staff discover issues that can be im-
mediately reported back to developers and site administrators
so that parallel applications can achieve higher performance.
V. BEST PRACTICE
The core_usage and show_affinity tools are simple and
convenient. They are recommended for daily use, especially
when a workflow is changed or any environment variables
related to process/thread binding are introduced or modified.
Neither the source code nor the workflow needs to be changed.
A user or a system administrator can simply ssh to the compute
node that is running an application and then run core_usage
and show_affinity at any time. core_usage shows how many
cores are being used by an application so that a user can
validate that it is the expected number. If the core occupation
count is smaller than the number of tasks/threads set by the
user, show_affinity should be run to check whether multiple
processes/threads are bound to the same core. Whenever users
observe a drastic drop in application performance, show_affinity
should be executed alongside the job to make sure that the number
of worker threads/processes is correct and that they have the expected
binding affinities.
It should be noted that it may take some time, e.g., up to
several minutes, for a large job to complete MPI initialization
or read large input files before all worker threads execute.
Complicated model designs and workflows may also alter the
process/thread binding status over the course of a job, and different
process/thread binding patterns are likely present during the
run of these jobs. Users can combine the Linux watch command
with show_affinity (e.g., "watch -n 1 show_affinity") to monitor
thread affinity in real time for very complicated workflows.
If a test application finishes in less than a few milliseconds,
show_affinity may not have enough time to determine the binding affinity.
The amask tool allows users to quickly see the kernel mask
of all processes/threads in a “matrix” format that facilitates
analysis of the interaction between processes/threads. It can
also be used to evaluate the effects of changing affinity settings
that may not be familiar to the user.
VI. CONCLUSION
Working with modern supercomputers with large core
counts is not trivial. To help supercomputer users run parallel
applications efficiently on the hardware, three convenient
tools, core_usage, show_affinity, and amask, were designed
and developed to monitor how computing resources are utilized
in practice. These tools have helped many HPC users and
administrators detect, understand, and resolve issues related to
process and thread affinity. Consequently, they help user jobs
to run faster and supercomputers to be used more efficiently.
ACKNOWLEDGMENT
We would like to thank all our users who worked with
these new tools and provided us with constructive feedback
and suggestions to make improvements. We would also like
to thank our colleagues in the High-Performance Computing
group and Advanced Computing Systems group who provided
expertise and insight that significantly assisted this work.
Particularly, we would like to show our gratitude to Hang Liu,
Albert Lu, John Cazes, Robert McLay, Victor Eijkhout, and
Bill Barth who helped us design, test, and debug the early
versions of these products. We also appreciate the technical
writing assistance from Bob Garza.
All these tools were mainly developed and tested on TACC's
supercomputer systems, including Stampede, Stampede2,
Lonestar5, Wrangler, Maverick2, and Frontera. The computation
of all experiments was supported by the National Science
Foundation, through the Frontera (OAC-1818253), Stampede2
(OAC-1540931) and XSEDE (ACI-1953575) awards.
REFERENCES
[1] 2017 IXPUG US Annual Meeting, Austin, TX, USA. https://www.ixpug.
org/events/ixpug-2017-us, 2017. [Online; accessed 27-Aug-2019].
[2] Linux Documentation: numactl(8): Linux man page. https://linux.die.
net/man/8/numactl, 2019. [Online; accessed 27-Aug-2019].
[3] Linux Documentation: ps(1): Linux man page. https://linux.die.net/man/
1/ps, 2019. [Online; accessed 27-Aug-2019].
[4] Linux Documentation: pthread_setaffinity_np(3) - Linux man page.
https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html,
2019. [Online; accessed 27-Aug-2019].
[5] Linux Documentation: sched_getaffinity(2): Linux man page. https://
linux.die.net/man/2/sched_getaffinity, 2019. [Online; accessed 27-Aug-
2019].
[6] Linux Documentation: sched_setaffinity(2): Linux man page. https://
linux.die.net/man/2/sched_setaffinity, 2019. [Online; accessed 27-Aug-
2019].
[7] Linux Documentation: taskset(1): Linux man page. https://linux.die.net/
man/1/taskset, 2019. [Online; accessed 27-Aug-2019].
[8] Linux Documentation: top(1) - Linux man page. https://linux.die.net/
man/1/top, 2019. [Online; accessed 27-Aug-2019].
[9] Francois Broquedis, Jerome Clet-Ortega, Stephanie Moreaud, Nathalie
Furmento, Brice Goglin, Guillaume Mercier, Samuel Thibault, and
Raymond Namyst. hwloc: a Generic Framework for Managing Hardware
Affinities in HPC Applications. In PDP 2010 - The 18th Euromicro
International Conference on Parallel, Distributed and Network-Based
Computing, 2010.
[10] Intel MPI developers. https://software.intel.com/en-us/mpi-library,
2019. [Online; accessed 27-Aug-2019].
[11] Mvapich developers. http://mvapich.cse.ohio-state.edu/, 2019. [Online;
accessed 27-Aug-2019].
[12] T. Evans, W. L. Barth, J. C. Browne, R. L. DeLeon, T. R. Furlani,
S. M. Gallo, M. D. Jones, and A. K. Patra. Comprehensive resource use
monitoring for hpc systems with tacc stats. In 2014 First International
Workshop on HPC User Support Tools, pages 13–21, Nov 2014.
[13] IBM. POWER9 Servers Overview, Scalable servers to meet the business
needs of tomorrow. https://www.ibm.com/downloads/cas/KDQRVQRR,
2019. [Online; accessed 27-Aug-2019].
[14] Intel. Intel Math Kernel Library Developer Reference. https://software.
intel.com/en-us/articles/mkl-reference-manual, 2019. [Online; accessed
27-Aug-2019].
[15] John Hennessy and David Patterson. Computer Architecture: A Quanti-
tative Approach (The Morgan Kaufmann Series in Computer Architec-
ture and Design), 6th edition. 2017.
[16] Jürgen Hafner and Georg Kresse. The Vienna AB-Initio Simulation
Program VASP: An Efficient and Versatile Tool for Studying the
Structural, Dynamic, and Electronic Properties of Materials. In: Gonis
A., Meike A., Turchi P.E.A. (eds) Properties of Complex Inorganic
Solids. Springer, Boston, MA, 1997.
[17] Lawrence Livermore National Laboratory. Sierra supercomputer. https://
computation.llnl.gov/computers/sierra, 2019. [Online; accessed 27-Aug-
2019].
[18] National Supercomputing Center in Wuxi. The Sunway TaihuLight
system. http://www.nsccwx.cn/wxcyw/soft1.php?word=soft&i=46, 2019.
[Online; accessed 27-Aug-2019].
[19] Oak Ridge National Lab. Summit: Oak Ridge National Laboratory’s
200 petaflop supercomputer. https://www.olcf.ornl.gov/olcf-resources/
compute-systems/summit/, 2019. [Online; accessed 27-Aug-2019].
[20] OpenMP Architecture Review Board. OpenMP Application Program-
ming Interface, Version 4.5, November 2015.
[21] OpenMP Architecture Review Board. OpenMP Application Program-
ming Interface, Version 5.0, November 2018.
[22] James C. Phillips, Rosemary Braun, Wei Wang, James Gumbart, Emad
Tajkhorshid, Elizabeth Villa, Christophe Chipot, Robert D. Skeel,
Laxmikant Kale, and Klaus Schulten. Scalable molecular dynamics with
NAMD. Journal of Computational Chemistry, 26:1781–1802, 2005.
[23] T. Roehl, J. Treibig, G. Hager, and G. Wellein. Overhead analysis of
performance counter measurements. In 43rd International Conference
on Parallel Processing Workshops (ICPPW), pages 176–185, Sept 2014.
[24] TACC Staff. TACC: amask project page. https://github.com/TACC/
amask/, 2019. [Online; accessed 27-Aug-2019].
[25] TACC Staff. TACC core_usage project page. https://github.com/TACC/
core_usage/, 2019. [Online; accessed 27-Aug-2019].
[26] TACC Staff. TACC show_affinity project page. https://github.com/
TACC/show_affinity/, 2019. [Online; accessed 27-Aug-2019].
[27] Texas Advanced Computing Center. Frontera User Guide. https://portal.
tacc.utexas.edu/user-guides/frontera, 2019. [Online; accessed 27-Aug-
2019].
[28] Texas Advanced Computing Center. Stampede2 User Guide. https:
//portal.tacc.utexas.edu/user-guides/stampede2, 2019. [Online; accessed
27-Aug-2019].
[29] Travis Oliphant. NumPy: A guide to NumPy. USA: Trelgol Publishing,
http://www.numpy.org/, 2006–. [Online; accessed 27-Aug-2019].
[30] J. Treibig, G. Hager, and G. Wellein. Likwid: A lightweight
performance-oriented tool suite for x86 multicore environments. In
Proceedings of PSTI2010, the First International Workshop on Parallel
Software Tools and Tool Infrastructures, San Diego CA, 2010.
[31] Wikipedia contributors. List of Intel CPU microarchitectures. https://en.
wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures, 2019. [On-
line; accessed 27-Aug-2019].
[32] Wikipedia contributors. The Sunway TaihuLight Supercomputer. https://
en.wikipedia.org/wiki/Sunway_TaihuLight, 2019. [Online; accessed 27-
Aug-2019].