Following very successful previous editions initiated at ISC HPC, we are inviting contributions to the 3rd ISC HPC International Workshop on Monitoring and Operational Data Analytics (MODA22). The goal of the MODA workshop series is to provide a venue for sharing insight into current trends in MODA, to identify potential gaps, and to offer an outlook on the future of the involved fields (high performance computing, databases, and machine learning) and on possible solutions that can contribute to the design and procurement of upcoming Exascale systems.
While MODA is already common practice at various HPC sites, each site takes a different, insular approach that is not always deployed in production environments and is mostly limited to visualizing system and building infrastructure metrics for health check purposes. In this regard, we observe a gap between the collection of operational data and its meaningful and effective analysis and exploitation, which prevents the closing of the feedback loop between the monitored HPC system, its operation, and its end-users. Under these premises, the goals of the MODA22 workshop can be summarized as follows:
- Gather and share knowledge and establish a common ground within the international community with respect to best practices in monitoring and operational data analytics.
- Discuss future strategies and alternatives for MODA, potentially improving existing solutions and envisioning a common baseline approach in HPC sites and data centers.
- Establish a debate on the usefulness and applicability of AI/ML techniques to collected operational data for optimizing the operation of production systems (e.g., for practices such as predictive maintenance, runtime optimization, optimal resource allocation and scheduling).
We seek novel research ideas that align with the above goals and match (see note below) the scope of the MODA workshop series:
- Challenges, solutions, and best practices for monitoring systems at data centers and HPC sites. Significant focus will be placed on operational data collection mechanisms:
  - covering different system levels, from building infrastructure sensor data to CPU-core performance metrics, and
  - targeting different end-users, from system administrators and operators to application developers, computer scientists, and computational scientists.
- Effective strategies for analyzing and interpreting the collected operational data. Such strategies should particularly include (but are not limited to):
  - different visualization approaches and
  - machine learning-based techniques, potentially inferring knowledge of the system behavior and allowing for the realization of a proactive control loop.
Note: New solutions proposed in the context of application performance modeling and/or application performance analysis tools fall outside the scope of MODA. Novel contributions in the area of compiler analysis, debugging, programming models and/or sustainability of scientific software are also considered out of the scope of the workshop.
We cordially invite you to submit your contributions to the MODA22 workshop, in the form of short (6 pages) and full (12 pages) submissions, that address, but are not limited to, the following topics:
- Monitoring and operational data analysis challenges and approaches (data collection, storage, visualization, integration into system software, adoption).
- State-of-the-practice methods, tools, and techniques for monitoring at various HPC sites.
- Solutions for monitoring and analysis of operational data that work very well on large- to extreme-scale systems, with a large number of users.
- Solutions that have proven limitations in terms of the efficiency of operational data collection in real time or in terms of the quality of the collected data.
- Opportunities and challenges of using machine learning methods for efficient monitoring and analysis of operational data.
- Integration of monitoring and analysis practices into production system software (energy and resource management) and runtime systems (scheduling and resource allocation).
- Explicit gaps between operational data collection, processing, effective analysis, and impactful exploitation; new approaches for closing these gaps to improve HPC center planning, operations, and research.
- Means to identify misuse of resources, whether intentional or unintentional, and methods to mitigate its effects: taking automatic steps to contain the impact of one application/job/user allocation on others, helping users identify the causes of their application's misbehavior, and linking to intrusion detection and safe multitenancy.
- Concepts to integrate MODA into the system design at all levels, including dedicated hardware components, middleware features, and tool support that make ‘monitoring by default’ a viable option without sacrificing performance.
- FAIR data practices, including sharing of monitoring workflows and tools across sites while ensuring compliance with the GDPR and user access agreements.