When: 9:30 - 13:00, September 2nd, 2025
Where: Holyrood, John McIntyre Conference Center in Edinburgh, United Kingdom.

Overview

Production HPC software and system is becoming increasingly complex due to its deep software abstractions as well as hardware hierarchies. Such software and system complexity usually leads to unexpected inefficiencies that are hard to detect and locate at large scale through manual inspection. Versatile profilers have revealed a promising way to detect the above inefficiencies with accurate root cause analysis for optimization guidance, which is vital to achieve superior performance for scientific applications running at extremely large scale.

In this tutorial, we will present a series of works on various profilers that detect performance inefficiencies due to sub-optimal computation, communication and I/O occurring at large scale. After the presentation of each profiler, we will provide hands-on exercises to acquaint the audience with the usage of each profiler and show case the effectiveness of each profiler for guiding performance optimization. The tools demonstrated in this tutorial have already been open-sourced for public access (MSToolkit).

This tutorial is expected to provide the audiences with the background of performance analysis, illustrate the well known performance profilers and their limitations, and then present our large-scale tools. We hope after the tutorial the audiences can master the usage of large-scale tools to pinpoint hardware and software inefficiencies.

Audience and Prerequisites

The target audience of the tutorial can be identified as two categories: 1) application developers, people who are interested in optimizing the performance of their applications by leveraging the strength of performance analysis tools, and 2) tool developers, people who are interested in building customized value profilers to pinpoint unrevealed hardware and software inefficiencies for guiding performance optimization.

The audience is expected to have a solid understanding of C/C++ programming languages, and a basic understanding of computer architecture. Not required, but the audiences with a preliminary experience of performance analysis tools are preferred.

Hardware and Software Requirements

Hardware: There is no special hardware required by the tutorial. The only thing we expect the audiences is to bring his/her own laptops. The laptop should have decent computation, memory and storage capability to access the remote server to run the hands-on exercises.
Software: The necessary software environments have already been installed on the remote server. The audiences only need to copy the source files for the hands-on exercises from the shared file system to their own workspaces. After that, they can start with the hands-on exercises right away. Note that the audiences shall install their own favorite terminals in advance in order to access the remote server.

Program

Stage	Content	Presenters	Schedule
1	Performance Engineering in HPC from A Tool Perspective	Hailong Yang	9:30-10:00
2	Introduction of High Performance Cluster at CNIC	Ningming Nie, Yuhang Hu	10:00-10:10
3	Tutorial: Detecting Performance Variance on Large-Scale Heterogeneous Systems	Xin You	10:10-11:00
4	Break		11:00-11:30
5	Tutorial: Runtime Performance Inefficiency Detection and Program Debugging	Ningming Nie, Yuhang Hu	11:30-12:00
6	Tutorial: MPI Communication Performance Modeling	Ziheng Wang	12:00-12:30
7	Tutorial: Identifying Parallel I/O Performance Bottleneck Using IOP Tool	Genshen Chu	12:30-13:00

Organizers

Hailong Yang, Beihang University
Xin You, Beihang University
Zhibo Xuan, Beihang University
Ningming Nie, Yuhang Hu, Computer Network Information Center, Chinese Academy of Science
Ziheng Wang, Xi’an Jiaotong University
Genshen Chu, University of Science and Technology Beijing

You X, Xuan Z, Yang H, et al. GVARP: Detecting performance variance on large-scale heterogeneous systems[C]//SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2024: 1-16.
Xuan Z, You X, Feng T, et al. SimTrace: Exploiting Spatial and Temporal Sampling for Large-Scale Performance Analysis[J]. ACM Transactions on Architecture and Code Optimization, 2025, 22(2): 1-26.
Xuan Z, Sun X, You X, et al. Identifying Performance Inefficiencies of Parallel Program With Spatial and Temporal Trace Analysis[J]. IEEE Transactions on Parallel and Distributed Systems, 2025.
Wang Z, Chen H, Cai W, et al. C-Lop: Accurate contention-based modeling of MPI concurrent communication[J]. Parallel Computing, 2022, 111: 102925.

Questions

For questions about this tutorial, please contact Xin You (youxin2015@buaa.edu.cn) and Zhibo Xuan(xzb@buaa.edu.cn).

Identifying Software and Hardware Inefficiency at Scale