Call for Papers

Following the successful previous editions of this series, which was initiated at ISC HPC, we invite contributions to the 5th ISC HPC International Workshop on Monitoring and Operational Data Analytics (MODA24). The goal of the MODA workshop series is to provide a venue for sharing insights into current trends in MODA for HPC systems and data centers, identifying potential gaps, and offering an outlook on the future of the involved fields (high-performance computing, databases, and machine learning) and on possible solutions that can contribute to the codesign and procurement of future computing and data processing systems.

Goals

While MODA is already common practice at various HPC and data centers, each site takes a different, insular approach that is rarely adopted in production environments and is mostly limited to visualizing system and building infrastructure metrics for health check purposes. In this regard, we observe a gap between the collection of operational data and its meaningful and effective analysis and exploitation. This gap prevents closing the feedback loop between the monitored HPC and data processing system, its operation, and its end users. Under the above premises, the goals of the MODA 2024 workshop are:

  1. Gather and share knowledge and establish a common ground within the international community with respect to best practices in monitoring and operational data analytics.
  2. Discuss future strategies and alternatives for MODA, potentially improving existing solutions and envisioning a common baseline approach in computing and data centers.
  3. Establish a debate on the usefulness and applicability of AI/ML techniques to collected operational data for optimizing the operation and energy consumption of production systems (for practices such as predictive maintenance, runtime optimization, and optimal and adaptive resource allocation and scheduling).

Scope

We seek novel research ideas that align with the above goals and match (see note below) the scope of the MODA workshop series:

  1. Challenges, solutions, and best practices for monitoring systems at HPC and data centers. A significant focus is on operational data collection mechanisms
    • covering different system levels, from building infrastructure sensor data to processing-core performance metrics, and 
    • targeting different end-users, from system administrators and operators to computer scientists, application developers, and computational scientists.
  2. Effective strategies for analyzing and interpreting the collected operational data. Such strategies should particularly include (but are not limited to):
    • different visualization approaches and 
    • machine learning-based strategies, potentially inferring knowledge of the system behavior and allowing for the realization of a proactive control loop.
  3. Methods, tools, and techniques for automated control of HPC systems. Such contributions may include (but are not limited to):
    • use-cases that need autonomy loops and automated control and 
    • instances of partial autonomy loops that need more automation and control.

Note: New solutions proposed in the context of application performance modeling and/or application performance analysis tools fall outside the scope of the MODA workshop series. Novel contributions in the area of compiler analysis, debugging, programming models, and/or sustainability of scientific software are also considered out of the scope of the workshop.

Topics

We cordially invite you to submit your contributions to the MODA24 workshop in the form of full papers (12 pages), short/work-in-progress papers (6 pages), and abstracts for lightning talks (1 page). Contributions may address, but are not limited to, the following topics:

  • Monitoring and operational data analysis challenges and approaches (data collection, storage, visualization, integration into system software, adoption).
  • State-of-the-practice methods, tools, and techniques in monitoring at various HPC and data centers.
  • Solutions for monitoring and analysis of operational data deployed productively on large- to extreme-scale systems, with a large number of users.
  • Solutions that have proven limitations in terms of quality of the collected data or efficiency of real-time collection of operational data.
  • Opportunities and challenges of using machine learning methods for efficient monitoring and analysis of operational data.
  • Examples of successful integration of monitoring and data analysis practices into production system software (energy and resource management) and runtime systems (scheduling and resource allocation).
  • Explicit gaps between operational data collection, processing, effective analysis, and impactful exploitation; new approaches for closing these gaps for the benefit of improving HPC and data center planning, operations, and research.
  • Means to identify (intentional or unintentional) misuse of resources, and methods to mitigate its effects: taking automatic steps to contain the effects of one application/job/user allocation on others, supporting users in identifying causes for the misbehavior of their application, and linking to intrusion detection and safe and trusted multitenancy.
  • Concepts to integrate MODA into the system codesign at all levels, including dedicated hardware components, middleware features, and tool support that make ‘monitoring and analysis by design and by default’ a viable option without sacrificing performance.
  • Examples and challenges of applying FAIR data practices, including sharing of monitoring workflows and tools across sites while ensuring compliance with GDPR regulations, user access agreements, or special operational security requirements.
  • Concrete use cases of improvements achieved by applying ML models in HPC operations.
  • Suggestions for appropriate data to collect towards an open data set (ODS) (ideally containing anomalies) that captures the execution of a set of representative applications on a representative production HPC system.
  • Methodologies to prepare the ODS in view of deploying appropriate ML models.
  • Use of MODA to tackle the rising challenges of sustainable HPC, mainly at the level of energy efficiency, but also in hardware stability and replacement policies, from application-level to site-wide approaches.