Infrastructure renewal

Major upgrade of our Advanced Research Computing infrastructure

Our Advanced Research Computing infrastructure is undergoing major changes in the winter of 2024-2025 and the spring of 2025 to provide better High Performance Computing (HPC) and cloud services for Canadian researchers. This page will be updated regularly to keep you informed of activities related to the transition to the new equipment.

The infrastructure renewal will replace the nearly 80% of our current equipment that is approaching end-of-life. The new equipment will offer faster processing speeds, greater storage capacity, and improved reliability.

New system details

New System | Old System to be Replaced | Documentation
Arbutus | Cloud (as a virtual infrastructure, there is no change to the cloud interface) | see this page
Rorqual | Béluga | see this page
Fir | Cedar | see this page
Trillium | Niagara & Mist | see this page
Nibi | Graham | see this page

System capacity reductions and outages

During the installation and the transition to the new systems, outages and reductions will be unavoidable due to constraints on space and electrical power. We recommend that you consider the possibility of outages when you plan research programs, graduate examinations, etc.

For a list of completed events, please see Infrastructure renewal completed events.

Start Time | End Time | Status | System | Type | Description
Jan 22, 2025 | Jan 22, 2025 (1 day) | Upcoming | Niagara, Mist | Outage | Niagara and Mist compute nodes will be shut down on January 22, 2025 from 8 AM to 5 PM EST to support ongoing system improvements and the integration with the new system, Trillium.

The login nodes, file systems, and the HPSS system will remain available.

The scheduler will hold submitted jobs until the maintenance has finished.

Jan 22, 2025 | Ongoing | Upcoming | Cedar (70%) | Reduction | Starting January 22, the Cedar cluster will operate at approximately 70% capacity until Fir is commissioned in the spring of 2025.
Jan 13, 2025 | Jan 21, 2025 (9 days) | In Progress | Cedar (100%) | Outage | The Cedar compute cluster will be shut down in preparation for the infrastructure renewal. Jobs submitted to the cluster will queue and may start running if they can complete before the shutdown. Jobs that cannot run will remain in the queue until the cluster is fully operational on January 21. The Cedar /scratch filesystem will be migrated to new storage. Please move any important data immediately to your /project, /nearline, or /home directory (see the minimal copy sketch after this table).

Cedar cloud will remain operational during this period.

Jan 16, 2025 | Ongoing | In Progress | Graham (25%) | Reduction | Starting January 16, the Graham cluster will operate at approximately 25% capacity (see here for details) until the new system, expected in March 2025, is in service. Users will be restricted to 256 cores per job, and all existing queued jobs will be cleared before the reduction begins. However, all existing user data on /home, /project, and /scratch will remain available. The Graham cloud will return to service at normal capacity on January 4.

Jan 15, 2025 UPDATE: Please note that while the reduction is currently scheduled to begin on January 16, this timeline may be adjusted due to the extension of the Graham outage, which is now expected to end on the same day (January 16). Users are encouraged to check https://status.alliancecan.ca for the latest updates and any potential changes to the schedule.

Jan 6, 2025 | Ongoing | In Progress | Niagara (50%), Mist (35%) | Reduction | Niagara will operate at 50% capacity and Mist at 35% to support ongoing system improvements and the integration with the new system, Trillium, expected in spring 2025.

Mist required a temporary shutdown for a few hours on January 6.

Jan 13, 2025 | Jan 31, 2025 (18 days) | In Progress | Béluga (100%), Narval (50%) | Temporary Reduction | Performance and stability tests on Rorqual will require the shutdown of all Béluga compute nodes and about half of the Narval compute nodes from 8 a.m. on January 13 until 12 p.m. (noon) on January 31, 2025 (EST). Login nodes and data access will remain operational. On Narval, approximately 50% of nodes from each category (CPU, GPU, and large memory) will be shut down. During the shutdown, the Béluga storage (/lustre01, /lustre02, /lustre03, /lustre04) will be mounted on Narval. Béluga and Juno cloud instances are unaffected. Jobs on Béluga scheduled to complete after 8 a.m. on January 13 will remain queued until the cluster resumes.
Dec 7, 2024 | Jan 16, 2025 (Extended from Jan 3) | In Progress | Graham (100%) | Outage | Ongoing renovations require a complete data center shutdown from Dec 7, 2024 to Jan 16, 2025. During this time, all Graham cluster services, storage, and cloud services will be entirely unavailable.

Jan 15, 2025 UPDATE: This outage has been extended to January 16 due to some delays. For updated information, please see https://status.alliancecan.ca.
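
One of the entries above asks Cedar users to move important data off /scratch before the migration. As a minimal sketch only (the paths below are placeholders for your own username and project directory, not real locations), copying a directory tree in Python could look like this:

```python
# Minimal sketch: copy a directory tree off /scratch before the Cedar
# /scratch migration. The source and destination paths are placeholders;
# substitute your own username and project directory.
from pathlib import Path
import shutil

src = Path("/scratch/your_username/results")                # hypothetical source
dst = Path("/project/your_project/your_username/results")   # hypothetical destination

# dirs_exist_ok=True merges into an existing destination; copytree uses
# copy2 by default, so file modification times are preserved.
shutil.copytree(src, dst, dirs_exist_ok=True)

copied = sum(1 for p in dst.rglob("*") if p.is_file())
print(f"Destination now contains {copied} files under {dst}")
```

For anything large, a resumable transfer tool is a better choice; the point here is simply to get important results into /project (or /home or /nearline) before the /scratch migration begins.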

Resource Allocation Competition (RAC)

The Resource Allocation Competition will be impacted by this transition, but the application process remains the same.
2024/25 allocations will remain in effect on retiring clusters for as long as each cluster remains in service. The 2025/26 allocations will be implemented everywhere once all new clusters are in service.
Because most of the old clusters will be out of service before all of the new ones are available, if you hold both a 2024/25 and a 2025/26 RAC award you will experience a period when neither award is available to you. You will be able to compute with your default allocation (def-xxxxxx) on each new cluster as soon as it goes into service, but the 2025/26 RAC allocations will only become available when all new clusters are in service.

User training resources

Course Title | Course Provider | Instructor | Date | Description | Audience | Format | Registration
Mastering GPU Efficiency | SHARCNET | Sergey Mashchenko | Available Anytime | This online self-paced course provides basic training for Alliance users on using GPUs on our national systems. Modern GPUs (such as NVIDIA A100 and H100) are massively parallel and very expensive devices. Most GPU jobs are incapable of utilizing these GPUs efficiently, either because the problem size is too small to saturate the GPU or because of an intermittent (bursty) GPU utilization pattern. This course will teach you how to measure the GPU utilization of your jobs on our clusters and show how to use two NVIDIA technologies, MPS (Multi-Process Service) and MIG (Multi-Instance GPU), to improve GPU utilization (a minimal utilization-measurement sketch follows this table). | Prospective users of the upgraded systems | 1-hour self-paced online course with a certificate of completion | Access the course here / Alliance CCDB account is required
Introduction to the Fir cluster | Simon Fraser University (SFU) / West DRI | Alex Razoumov | Tuesday, May 20, 2025, 10:00 AM PT | SFU's newest cluster, Fir, will come online towards the end of spring 2025. In this webinar, we will give an overview of the cluster and its hardware, walk through the filesystems and their recommended usage, and discuss job submission policies and overall best practices for using the cluster. | Prospective users of the Fir cluster | Webinar | Registration / Free event open to all
Survival guide for the upcoming GPU upgrades | SHARCNET | Sergey Mashchenko | Wednesday, November 20, 2024, 12:00 PM to 1:00 PM ET | In the coming months, national systems will be undergoing significant upgrades. In particular, older GPUs (P100, V100) will be replaced with the newest H100 GPUs from NVIDIA. The total GPU computing power of the upgraded systems will grow by a factor of 3.5, but the number of GPUs will decrease significantly (from 3200 to 2100). This will present a significant challenge for users, as the usual practice of using a whole GPU for each process or MPI rank will no longer be feasible in most cases. Fortunately, NVIDIA provides two powerful technologies that can be used to mitigate this situation: MPS (Multi-Process Service) and MIG (Multi-Instance GPU). The presentation will walk the audience through both technologies and discuss the ways they can be used on the clusters. The discussion will include how to determine which approach will work best for specific code, and a live demonstration will be given at the end. | Prospective users of the upgraded systems; users intending to use a substantial amount of H100 resources (e.g., more than one GPU at a time, and/or over 24 hours runtime) | 1-hour presentation and slides | Past
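
Both GPU-related courses above emphasize measuring how busy your GPU actually is before reaching for MPS or MIG. As a rough illustration only (it assumes the pynvml bindings for the NVIDIA Management Library are available in your environment, which may not be the case on every cluster), a job can sample the utilization of its first assigned GPU like this:

```python
# Rough sketch: sample GPU utilization during a job using the NVIDIA
# Management Library Python bindings (pynvml). Availability of pynvml on a
# given cluster is an assumption; install it in a virtual environment if needed.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)    # first GPU visible to the job

samples = []
for _ in range(10):                              # ten one-second samples
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    samples.append(util.gpu)                     # percent of time the GPU was busy
    time.sleep(1)

print(f"mean GPU utilization over {len(samples)} s: {sum(samples) / len(samples):.0f}%")
pynvml.nvmlShutdown()
```

Consistently low percentages suggest the job is a candidate for sharing a GPU through MPS, or for a MIG slice rather than a whole device, which is exactly the situation the courses above address.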

Frequently asked questions

Will my data be copied to the new system?

Data migration to the new systems is the responsibility of each National Host Site, which will inform you of what you need to do.

Will my files be deleted when a system is undergoing a complete data center shutdown as part of renewal activities?

No, your files will not be deleted. During renewal activities, each National Host Site will migrate /project and /home data from the existing storage system to the new storage system once it is installed. These migrations typically occur during outages, but specific details may vary by National Host Site. Each National Host Site will keep users informed of any specific, user-visible effects. Additionally, tape systems for backups and /nearline data are not being replaced, so backups and /nearline data will remain unchanged. For further technical questions, please email technical support. This goes directly to our ticketing system, where a support expert can provide a detailed response.

When will outages occur?

Each National Host Site will have its own schedule for outages as the installation of and transition to new equipment proceeds. As usual, specific outages will be described on our system status web page. We will provide more general updates on this wiki page and you will periodically receive emails with updates and outage notices.

Whom can I contact for questions about the transition?

Contact our technical support. They will do their best to answer any questions about the transition.

Will my jobs and applications still be able to run on the new system?

Generally yes, but the new CPUs and GPUs may require recompilation or reconfiguration of some applications. More details will be provided as the transition unfolds.
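
For example, applications built with architecture-specific compiler flags on the old CPUs may need rebuilding to run well (or at all) on the new ones. A small, Linux-only sketch for checking which vector instruction sets the node you are on reports (the particular flags listed are only illustrative, not an official compatibility list):

```python
# Linux/x86-only sketch: list a few SIMD instruction-set flags reported by
# the current node's CPUs, to help judge whether a rebuild is worthwhile.
# The specific flags checked here are illustrative, not an official list.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("avx2", "avx512f", "avx512bw"):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")
```

Comparing this against the architecture your application was originally built for is a reasonable first check before deciding whether to recompile.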

Will the software from the current systems still be available?

Yes, our standard software environment will be available on the new systems.

Will commercial, licensed software be migrated to the new systems?

Yes, the plan is that current commercial software licenses will be transitioned from each old system to its new replacement, so to the extent possible users should see identical access to those specialized applications (Gaussian, AMS/ADF, etc.). There is a small risk that software providers will change their licensing terms for the new systems. Such issues will be addressed individually as they come up.

Will there be staggered outages?

We will do our best to limit overlapping outages, but because we are very constrained by delivery schedules and funding deadlines, there will probably be periods when several of our systems are simultaneously offline. Outages will be announced as early as possible.

Can I purchase old hardware after equipment upgrades?

Most of the equipment is legally the property of the hosting institution. When the equipment is retired, the host institution manages its disposal following that institution's guidelines. This typically involves "e-cycling", that is, recycling the equipment rather than selling it. If you're looking to acquire the old hardware, it's best to contact the host institution directly, as they may have specific policies or options for selling equipment.