Infrastructure renewal

From Alliance Doc
Revision as of 16:33, 20 September 2024 by Diane27

Major upgrade of our Advanced Research Computing infrastructure

Our Advanced Research Computing infrastructure is undergoing major changes so that we can continue to improve our High Performance Computing (HPC) and Cloud services for Canadian researchers. This page will be updated regularly to keep you informed of activities related to the transition to the new equipment.
The infrastructure renewal will replace the nearly 80% of our current equipment that is approaching end-of-life. The new equipment will offer faster processing speeds, greater storage capacity, and improved reliability.

The systems involved are:

  • Arbutus, cloud
  • Béluga, compute cluster only (not cloud)
  • Cedar, compute cluster and cloud
  • Graham, compute cluster and cloud
  • Niagara, compute cluster

Technical specifications

Technical specifications for each new system will be added to this page in future updates. In general, the new systems will be similar in architecture to the current ones, but with considerably greater capacity and performance.
For example, we expect to have fewer compute nodes, but each node will have significantly more cores, so the total number of CPU cores will increase.
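The arithmetic behind "fewer nodes, more cores overall" can be illustrated with a minimal sketch. All node and core counts below are hypothetical placeholders chosen for illustration; they are not specifications of any current or future system.

```python
def total_cores(nodes: int, cores_per_node: int) -> int:
    """Total CPU cores in a cluster of identical nodes."""
    return nodes * cores_per_node


# Hypothetical figures, for illustration only (not actual specifications).
old_total = total_cores(nodes=1000, cores_per_node=40)
new_total = total_cores(nodes=700, cores_per_node=192)

# Fewer nodes, yet more cores in total.
print(f"old: {old_total} cores, new: {new_total} cores")
```

With these made-up numbers, a 30% reduction in node count still yields more than three times as many cores, because the per-node core count grows much faster than the node count shrinks.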

Impacts

System outages

An intense period of work will be conducted in the winter of 2024-2025 and spring of 2025. During the installation and the transition to the new systems, outages will be unavoidable due to constraints on space and electrical power.
We recommend that you consider the possibility of outages when you plan research programs, graduate examinations, etc.

Resource Allocation Competition (RAC)

The Resource Allocation Competition will be impacted by this transition, but the application process remains the same. The application deadline this year is October 30, 2024.
The 2024/25 allocations will remain in effect on each retiring cluster for as long as that cluster remains in service. The 2025/26 allocations will be implemented everywhere once all new clusters are in service.
Because the old clusters will mostly be out of service before all new ones are available, if you hold both a 2024 and a 2025 RAC award you will experience a period when neither award is available to you. You will be able to compute with your default allocation (def-xxxxxx) on each new cluster as soon as it goes into service, but the 2025 RAC allocations will only become available when all new clusters are in service.
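The rules above can be summarized as a small decision function. This is only a sketch of the policy as described on this page, not an actual scheduler or accounting interface; the function name, parameters, and labels are hypothetical, and it assumes that default allocations also remain usable on retiring clusters.

```python
def usable_allocations(cluster_is_new: bool,
                       cluster_in_service: bool,
                       all_new_clusters_in_service: bool) -> list[str]:
    """Return the allocations usable on one cluster (hypothetical model)."""
    if not cluster_in_service:
        # A cluster that is retired or not yet deployed offers nothing.
        return []
    if not cluster_is_new:
        # Retiring cluster: 2024/25 awards stay in effect while it runs.
        # Assumption: the default allocation is usable here as well.
        return ["default", "RAC 2024/25"]
    # New cluster: the default allocation works as soon as it is in service.
    allocations = ["default"]
    if all_new_clusters_in_service:
        # 2025/26 awards start only once every new cluster is in service.
        allocations.append("RAC 2025/26")
    return allocations


# A new cluster is up, but other new clusters are still being installed.
print(usable_allocations(True, True, False))
```

The case shown in the usage line is the gap described above: only the default allocation is usable, even for holders of both a 2024 and a 2025 RAC award.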

General progress updates

Sep 13, 2024: The Request for Proposals (RFP) processes for all sites except Rorqual (replacing Béluga) have been completed, and purchase orders have been sent to vendors. The Rorqual storage RFP is still open and is scheduled to close on September 18.

All sites are working on infrastructure design (power and cooling) and implementation. We are expecting some outages throughout the fall for cabling and plumbing upgrades.

Sep 3, 2024: All sites have completed their Requests for Proposals and are working with the vendors on deliverables and purchase orders.

Activities by system

Arbutus, cloud

Coming soon

Béluga, compute cluster only (not cloud)

The cluster replacing Béluga will be named Rorqual. Details coming soon.

Cedar, compute cluster and cloud

Coming soon

Graham, compute cluster and cloud

Coming soon

Niagara, compute cluster

Coming soon

System-specific updates

Arbutus

Coming soon

Béluga

The cluster that is being deployed to replace Béluga will be named Rorqual.

More details coming soon.

Cedar and Cedar Cloud

Coming soon

Graham and Graham Cloud

Coming soon

Niagara

Coming soon

Frequently asked questions

As we work on finalizing the details, here are a few key points to keep in mind.

We are committed to providing the most up-to-date information. Please check back regularly, as this section will be updated frequently to reflect new developments.


Will data be copied to the new systems?

Data migration to the new systems is the responsibility of each site. Once the details are finalized, each site will let you know what you need to do and what will be done for you.

When will outages occur?

Each site will have its own schedule for outages as the new equipment is installed and brought into service. As usual, specific outages will be described on the status page (https://status.alliancecan.ca). We will provide more general updates on this wiki page as we know more, probably in early autumn 2024, and we will periodically send emails with updates and outage notices.

Who should I contact for questions about the transition?

Contact our Technical support. Note that, for now, they may not have much more information than what is published on this page.

Will my jobs/applications run without change on the new system?

Generally yes, but with new CPUs and GPUs some codes may need recompiling or reconfiguring. More details will be provided during the transition.

Will the software from the old systems still be available?

Yes, our standard software environment will be available on the new systems.

Will there be staggered outages?

We will do our best to limit overlapping outages, but we are tightly constrained by delivery schedules and funding deadlines, so there will probably be periods when many of our systems are out of service at the same time. We will communicate all outages as early as possible.