Infrastructure renewal

Major upgrade of our Advanced Research Computing infrastructure

Our Advanced Research Computing infrastructure is undergoing major changes in the winter of 2024-2025 and the spring of 2025 to provide better High Performance Computing (HPC) and Cloud services for Canadian researchers. This page will be updated regularly to keep you informed about activities related to the transition to the new equipment.

The infrastructure renewal will replace the nearly 80% of our current equipment that is approaching end-of-life. The new equipment will offer faster processing speeds, greater storage capacity, and improved reliability.

New system details

New System | Old System to be Replaced | Documentation
Arbutus | Cloud (as a virtual infrastructure, there is no change to the cloud interface) | Coming Soon
Rorqual | Béluga | see this page
Fir | Cedar | see this page
Trillium | Niagara & Mist | see this page
(TBD) | Graham | Coming Soon

System outages

During the installation and the transition to the new systems, outages will be unavoidable due to constraints on space and electrical power. We recommend that you consider the possibility of outages when you plan research programs, graduate examinations, etc.

Start: Jan 6, 2025 | End: Jan 24, 2025 (18 days) | Status: Upcoming | System: Béluga (100%), Narval (50%)
Description: Performance and stability tests on Rorqual will require the shutdown of all Béluga compute nodes and about half of Narval compute nodes from 8 a.m. on January 6 until 12 p.m. (noon) on January 24, 2025 (EST). On Béluga, jobs scheduled to complete after 8 a.m. on January 6 will remain in the queue until the cluster is back online. Login nodes and access to data will not be affected. On Narval, about half of the nodes of each type (CPU, GPU and large memory) will be shut down. Cloud instances on the Béluga and Juno clouds will not be affected by this shutdown.

Start: Nov 25, 2024 | End: Nov 26, 2024 (1 day) | Status: Upcoming | System: Niagara
Description: A full power shutdown will take place for main panel upgrades ahead of Trillium cluster setup. The scheduler will hold jobs that cannot finish before the start of the shutdown. Users are encouraged to submit small and short jobs that can take advantage of this, as the scheduler may be able to fit these jobs in before the maintenance on otherwise idle nodes (see the sketch after this table).

Start: Nov 7, 2024 | End: Nov 8, 2024 (1 day) | Status: Complete | System: Niagara
Description: All systems and storage located at the SciNet Datacenter (Niagara, Mist, HPSS, Rouge, Teach, JupyterHub, Balam) will be unavailable from 7 a.m. to 5 p.m. ET. This outage is required to install new electrical equipment (UPS) for the upcoming systems refresh. The work is expected to be completed in one day. The scheduler will hold jobs that cannot finish before the start of the shutdown. Users are encouraged to submit small and short jobs that can take advantage of this, as the scheduler may be able to fit these jobs in before the maintenance on otherwise idle nodes.

Start: Nov 7, 2024, 6 a.m. PST | End: Nov 8, 2024, 6 a.m. PST | Status: Complete | System: Cedar
Description: Cedar compute nodes will be unavailable during this period (jobs will not run). Cedar login nodes and storage, as well as the Cedar cloud, will remain online and are not affected by this work.
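
Several of the notices above point out that the scheduler can backfill small, short jobs onto otherwise idle nodes right up until a shutdown begins. The sketch below is a rough illustration of that idea, not an official tool: it assumes a Slurm cluster, a placeholder batch script named job.sh, and an example shutdown time, and it computes a --time request short enough for the job to finish before the maintenance window.

 # Hedged illustration: pick a --time limit that ends before a scheduled
 # shutdown so the scheduler can backfill the job onto idle nodes.
 # The shutdown time and the script name "job.sh" are placeholders.
 import subprocess
 from datetime import datetime, timedelta
 maintenance_start = datetime(2024, 11, 25, 8, 0)   # example shutdown time (assumption)
 margin = timedelta(minutes=30)                      # leave some slack before the outage
 available = maintenance_start - datetime.now() - margin
 if available > timedelta(0):
     hours, rem = divmod(int(available.total_seconds()), 3600)
     time_limit = f"{hours:02d}:{rem // 60:02d}:00"
     # Submit with a walltime that fits entirely before the maintenance window.
     subprocess.run(["sbatch", f"--time={time_limit}", "job.sh"], check=True)
 else:
     print("Too close to the maintenance window; submit after the outage instead.")

When submitting by hand, the same effect is achieved simply by passing sbatch a --time value short enough for the job to finish before the shutdown starts.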

Resource Allocation Competition (RAC)

The Resource Allocation Competition will be impacted by this transition, but the application process remains the same.
2024/25 allocations will remain in effect on the retiring clusters for as long as each cluster stays in service. The 2025/26 allocations will be implemented everywhere once all new clusters are in service.
Because most of the old clusters will be out of service before all of the new ones are available, if you hold both a 2024 and a 2025 RAC award you will experience a period during which neither award is available to you. You will be able to compute with your default allocation (def-xxxxxx) on each new cluster as soon as it goes into service, but the 2025 RAC allocations will only become available once all new clusters are in service.

General progress updates

Nov 8, 2024 The Nov. 7 outages have been completed. Work is continuing at all sites on the power and cooling infrastructure. All sites have begun receiving equipment and will start installation over November and December. We are currently planning for significant outages during December and January; details will be provided when they are available.
Oct 7, 2024 Details of the necessary infrastructure (power and cooling) upgrades are being worked out. Timelines are not yet available, but we expect some outages of a day or more in November.
Sep 13, 2024 The RFP processes for all sites except Rorqual (replacing Béluga) have been completed, and purchase orders have been sent to vendors. The Rorqual storage Request for Proposals is still open and is scheduled to close on September 18.

All sites are working on infrastructure design (power and cooling) and implementation. We are expecting some outages throughout the fall for cabling and plumbing upgrades.

Sep 3, 2024 All sites have completed their Requests for Proposals and are working with the vendors on deliverables and purchase orders.

User training resources

Course Title: Survival guide for the upcoming GPU upgrades
Course Provider: SHARCNET
Instructor: Sergey Mashchenko
Date: Wednesday, November 20, 2024, 12:00 PM to 1:00 PM ET
Description: In the coming months, national systems will be undergoing significant upgrades. In particular, older GPUs (P100, V100) will be replaced with the newest H100 GPUs from NVIDIA. The total GPU computing power of the upgraded systems will grow by a factor of 3.5, but the number of GPUs will decrease significantly (from 3200 to 2100). This will present a significant challenge for users, as the usual practice of using a whole GPU for each process or MPI rank will no longer be feasible in most cases. Fortunately, NVIDIA provides two powerful technologies that can be used to mitigate this situation: MPS (Multi-Process Service) and MIG (Multi-Instance GPU). The presentation will walk the audience through both technologies and discuss the ways they can be used on the clusters. The discussion will include how to determine which approach will work best for specific code, and a live demonstration will be given at the end. (See the GPU-sharing sketch below.)
Audience: Prospective users of the upgraded systems. Users intending to use a substantial amount of H100 resources (e.g., more than one GPU at a time, and/or over 24 hours runtime)
Format: 1-hour live presentation, recorded for later access
Registration: No registration required. More info
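
As background for the course above, the sketch below illustrates the kind of GPU oversubscription that MPS is designed to handle: several MPI ranks issuing work to the same physical GPU. It is not course material; the package choices (mpi4py and CuPy) and the pattern of mapping every rank to device 0 are assumptions for illustration, and the MPS control daemon must be running for the ranks' kernels to overlap rather than serialize.

 # Illustrative only: several MPI ranks share one GPU, the situation NVIDIA MPS
 # is meant to multiplex efficiently. Assumes mpi4py and CuPy are installed.
 from mpi4py import MPI
 import cupy as cp
 comm = MPI.COMM_WORLD
 rank = comm.Get_rank()
 # Every rank selects the same physical GPU; with MPS active, their kernels
 # are interleaved on the device instead of each rank needing its own GPU.
 cp.cuda.Device(0).use()
 x = cp.random.random((4096, 4096))
 y = x @ x          # each rank runs its own matrix multiply on the shared GPU
 comm.Barrier()
 if rank == 0:
     print("all ranks completed GPU work on a shared device")

MIG takes the opposite approach: it partitions a large GPU into isolated slices, so in that case each rank would be bound to its own MIG instance rather than sharing the whole device.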

Frequently asked questions

Will my data be copied to its new system?

Data migration to the new systems is the responsibility of each National Host Site, which will inform you of anything you need to do.

When will outages occur?

Each National Host Site will have its own schedule for outages as the installation of and transition to new equipment proceeds. As usual, specific outages will be described on our system status web page. We will provide more general updates on this wiki page and you will periodically receive emails with updates and outage notices.

Who can I contact for questions about the transition?

Contact our technical support; they will do their best to answer your questions about the transition.

Will my jobs and applications still be able to run on the new system?

Generally yes, but the new CPUs and GPUs may require recompilation or reconfiguration of some applications. More details will be provided as the transition unfolds.
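
For example, code built with aggressive architecture flags (such as -march=native) on the current CPUs can fail with an "Illegal instruction" error on different hardware until it is rebuilt. The snippet below is a small, hypothetical check, not an official utility: it reads /proc/cpuinfo on a Linux compute node to see which vector extensions the node actually supports.

 # Hypothetical helper: list which common vector extensions this Linux node
 # supports, to help decide whether an application needs to be recompiled.
 def cpu_flags(path="/proc/cpuinfo"):
     with open(path) as f:
         for line in f:
             if line.startswith("flags"):
                 return set(line.split(":", 1)[1].split())
     return set()
 flags = cpu_flags()
 for ext in ("avx2", "avx512f"):
     status = "supported" if ext in flags else "NOT supported"
     print(f"{ext}: {status} on this node")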

Will the software from the current systems still be available?

Yes, our standard software environment will be available on the new systems.

Will commercial, licensed software be migrated to the new systems?

Yes. The plan is to transition current commercial software licenses from each old system to its replacement, so that, to the extent possible, users will see identical access to those special applications (Gaussian, AMS/ADF, etc.). There is a small risk that software providers will change their licensing terms for the new systems; such issues will be addressed individually as they come up.

Will there be staggered outages?

We will do our best to limit overlapping outages, but because we are very constrained by delivery schedules and funding deadlines, there will probably be periods when several of our systems are simultaneously offline. Outages will be announced as early as possible.

Can I purchase old hardware after equipment upgrades?

Most of the equipment is legally the property of the hosting institution. When the equipment is retired, the host institution manages its disposal following that institution's guidelines. This typically involves "e-cycling", that is, recycling the equipment rather than selling it. If you're looking to acquire the old hardware, it's best to contact the host institution directly, as they may have specific policies or options for selling equipment.