Infrastructure renewal: Difference between revisions

m (replace external link with Wiki link)
 
(60 intermediate revisions by 8 users not shown)
Line 1: Line 1:
{{Draft}}
<languages />


Welcome to the ARC/Cloud renewal transition documentation for the Digital Research Alliance of Canada (the Alliance). This is the primary source for users with questions about the upgrade of our HPC/Cloud infrastructure. The upgrade will replace the nearly 80% of our current HPC and Community Cloud equipment which is approaching end-of-life.
<translate>
=Major upgrade of our Advanced Research Computing infrastructure= <!--T:1-->


= What's coming in 2025? =
<!--T:2-->
In 2023, the Digital Research Alliance of Canada was given formal approval and funding for a complete replacement of aging national systems.
Our Advanced Research Computing infrastructure is undergoing major changes in the winter of 2024-2025 and spring of 2025 to provide better High Performance Computing (HPC) and Cloud services for Canadian researchers. This page will be regularly updated to keep you informed of the activities concerning the transition to the new equipment.
The new equipment will offer:
* Increased processing capacity
* Increased storage capacity
* Improved reliability


This new infrastructure will better support your computational tasks, providing a better-performing and more efficient environment for your research.
<!--T:31-->
The infrastructure renewal will replace the nearly 80% of our current equipment that is approaching end-of-life. The new equipment will offer faster processing speeds, greater storage capacity, and improved reliability.


The systems being replaced are [[Arbutus]], [[Béluga]], [[Cedar]], [[Graham]] and [[Niagara]]. The new systems will be broadly comparable to the old systems, but with significantly increased capacity.
<!--T:3-->
The systems involved are
*[[Infrastructure renewal#Arbutus,_cloud|Arbutus]], cloud
*[[Infrastructure renewal#Béluga,_compute_cluster_only_(not_cloud)|Béluga, compute cluster only (not cloud)]]
*[[Infrastructure renewal#Cedar,_compute_cluster_and_cloud|Cedar, compute cluster and cloud]]
*[[Infrastructure renewal#Graham,_compute_cluster_and_cloud|Graham, compute cluster and cloud]]
*[[Infrastructure renewal#Niagara,_compute_cluster|Niagara, compute cluster]]


= Outages during the transition =
=Technical specifications= <!--T:4-->
This renewal will be implemented during an intense period in the winter of 2024-2025. Constraints on space and electrical power mean that there will have to be service outages during the installation and transition to the new systems. Each site will develop a transition plan for their new system. We expect to hear more details in the autumn and will continue to update this landing page as those details become known.  
Technical specifications for each new system will be provided further down this page in future updates. Generally, they will be similar in architecture to the current systems, but with considerably increased capacity and performance.
For example, we expect to have fewer compute nodes, but each node will have a significant increase in the number of its cores, for an overall increase in the total number of CPU cores.


{{Callout
=Impacts= <!--T:5-->
  |title=Important information
  |content=
There will be outages in the winter of 2024-25 and spring of 2025. We recommend that researchers consider the possibility of such outages when planning research programs, graduate examinations, etc., for next winter and spring.
}}


= Status =
==System outages== <!--T:6-->
 
During the installation and the transition to the new systems, outages will be unavoidable due to constraints on space and electrical power.
For current outages please see the [https://status.computecanada.ca system status page].
We recommend that you consider the possibility of outages when you plan research programs, graduate examinations, etc.


<!--T:29-->
{| class="wikitable"
{| class="wikitable"
|-
|-
| Sep 13, 2024 || All sites except McGill have completed their RFP processes and have sent Purchase Orders to vendors. The McGill (Rorqual) storage RFP is still open and is scheduled to complete on Sep 18.
| '''Start Time''' || '''End Time'''  || '''System''' || '''Description'''
All sites are working on infrastructure (power and cooling) design and implementation. We are expecting some outages over the autumn for cabling and plumbing upgrades, and will update this page when we know more.
|-
| Nov 7, 2024 || Nov 8, 2024 (1 day) || Niagara || All systems and storage located at the SciNet Datacenter (Niagara, Mist, HPSS, Rouge, Teach, JupyterHub, Balam) will be unavailable from 7 a.m. to 5 p.m. ET. This outage is required to install new electrical equipment (UPS) for the upcoming systems refresh. The work is expected to be completed in one day. The scheduler will hold jobs that cannot finish before the start of the shutdown. Users are encouraged to submit small and short jobs that can take advantage of this, as the scheduler may be able to fit these jobs in before the maintenance on otherwise idle nodes.
|-
|-
| Sep 3, 2024 || Currently all sites have completed Requests for Proposals, and are working with the vendors on deliverables and purchase orders.   
| Nov 7, 2024 6am PST || Nov 8, 2024 6am PST || Cedar || Cedar compute nodes will be unavailable during this period (jobs will not run). Cedar login nodes and storage, as well as Cedar cloud will remain online and are not affected by this work.   
|}
|}


= What we know so far =
==Resource Allocation Competition (RAC)== <!--T:7-->
We'll keep this table updated regularly as we move through the acquisition, installation, and migration phases. Please be aware that all dates are subject to change.
The [https://www.alliancecan.ca/en/services/advanced-research-computing/accessing-resources/resource-allocation-competition Resource Allocation Competition] will be impacted by this transition, but the application process remains the same. The application deadline this year is October 30, 2024.<br>
2024/25 allocations will remain in effect on retiring clusters while each cluster remains in service.  The 2025/26 allocations will be implemented everywhere once all new clusters are in service.<br>
Because the old clusters will mostly be out of service before all new ones are available, if you hold both a 2024 and a 2025 RAC award you will experience a period when neither award is available to you. You will be able to compute with your default allocation (<code>def-xxxxxx</code>) on each new cluster as soon as it goes into service, but the 2025 RAC allocations will only become available when all new clusters are in service.  


=General progress updates= <!--T:8-->
{| class="wikitable"
{| class="wikitable"
|-
|-
| Current Status || Sep 3, 2024: Currently all sites have completed Requests for Proposals, and are working with the vendors on deliverables and purchase orders.
| Oct 7, 2024 || Details for necessary infrastructure (power and cooling) upgrades are being worked out. Timelines are not yet available but we expect some outages of a day or more in November.
|-
|-
| Specifications || The sites cannot yet provide detailed technical specifications of the new systems. Generally, the new systems will be similar in architecture to the old systems but with considerably increased capacity and performance. For instance, we expect to have fewer compute nodes, but each node will have a significant increase in the number of cores due to the increase in the size of multi-core CPUs since 2017.
| Sep 13, 2024 || The RFP processes for all sites except for Rorqual (replacing Béluga) have been completed, and purchase orders have been sent to vendors. The Rorqual storage Request for Proposals is still open and is scheduled to complete on September 18.
All sites are working on infrastructure design (power and cooling) and implementation. We are expecting some outages throughout the fall for cabling and plumbing upgrades.
|-
|-
| Timeline || We expect the new systems to be installed in the first quarter of 2025, with a reasonable expectation that they will be in production and available to users in early summer 2025. More specific delivery and installation schedules are not yet available.
| Sep 3, 2024 || All sites have completed their Requests for Proposals, and are working with the vendors on deliverables and purchase orders.
|}
|}


= Resource Allocation Competition and renewals =
=Activities by system= <!--T:9-->
The Resource Allocation Competition (RAC) and RAC renewals will be affected by this transition, but we are not changing the normal RAC process. Expect to see the usual announcements for the competition in September 2024. We expect to implement the 2025/26 allocations on the new machines when they become available so there may be some delay in RAC implementation. Detailed updates to follow.
 
See RAC documentation available [https://www.alliancecan.ca/en/services/advanced-research-computing/accessing-resources/resource-allocation-competition here].
==Arbutus, cloud== <!--T:10-->
[[Cloud resources#Arbutus cloud|Arbutus]]
= System-specific updates =
<i>coming soon</i>
 
==Béluga, compute cluster only (not cloud)== <!--T:11-->


== Arbutus ==
<!--T:30-->
Coming soon
The cluster replacing [[Beluga/en|Béluga]] will be named <b>Rorqual</b>. Specifications are available on [[Rorqual/en|this page]].


== Béluga ==
==Cedar, compute cluster and cloud== <!--T:12-->
Coming soon
The cluster replacing [[Cedar]] will be named <b>Fir</b>. Specifications are available on [[Fir/en|this page]].


== Cedar and Cedar Cloud ==
==Graham, compute cluster and cloud== <!--T:13-->
Coming soon
[[Graham]]
<i>coming soon</i>


== Graham and Graham Cloud ==
==Niagara, compute cluster== <!--T:14-->
Coming soon
The cluster replacing [[Niagara]] and [[Mist]] in early 2025 will be named [[Trillium]]. Specifications are available on the [[Trillium|Trillium page]]. Hardware delivery starts in late 2024, and the new cluster will be available for users in the spring of 2025. To make room, half of Niagara will be decommissioned starting in December 2024 or January 2025. We’ll update you when we have a better idea of Trillium’s installation schedule.


== Niagara ==
= Frequently asked questions = <!--T:15-->
Coming soon


= Frequently asked questions =
== Will my data be copied to its new system? == <!--T:16-->
As we work on finalizing the details, here are a few key points to keep in mind.
Data migration to the new systems is the responsibility of each National Host Site, which will inform you of what you need to do.
{{Note|We are committed to providing the most up-to-date information. Please check back regularly as this section will be updated frequently to reflect any new developments}}


== Will data be copied to the new systems? ==
== When will outages occur? == <!--T:17-->
Data migration to the new systems is a site responsibility. Each site will let you know what you need to do and what will be done for you once the details are finalized.
Each National Host Site will have its own schedule for outages as the installation of and transition to new equipment proceeds. As usual, specific outages will be described on [https://status.alliancecan.ca our system status web page]. We will provide more general updates on this wiki page and you will periodically receive emails with updates and outage notices.


== When will outages occur? ==
== Who can I contact for questions about the transition? == <!--T:18-->
Each site will have their own schedule for outages as the new equipment is installed and transitioned. Specific outages will as usual be described on the status pages (https://status.alliancecan.ca). We will also provide more general updates through this wiki page as we know more, probably in early autumn 2024.
Contact our [[technical support]] team, who will do their best to answer your questions.
We will also periodically send emails with updates and outage notices.


== Who should I contact for questions about the transition? ==
== Will my jobs and applications still be able to run on the new system? == <!--T:19-->
Contact our [[Technical support]], but don't expect them to know a great deal more than you read here.
Generally yes, but the new CPUs and GPUs may require recompilation or reconfiguration of some applications. More details will be provided as the transition unfolds.


== Will my jobs/applications run without change on the new system? ==
== Will the software from the current systems still be available? == <!--T:20-->
Generally yes, but with new CPUs and GPUs some codes may need recompiling or reconfiguring. More details will be provided during the transition.
Yes, our [[Standard software environments|standard software environment]] will be available on the new systems.


== Will the software from the old systems still be available? ==
== Will there be staggered outages? == <!--T:21-->
Yes, our standard software environment will be available on the new systems.
We will do our best to limit overlapping outages, but  because we are very constrained by delivery schedules and funding deadlines, there will probably be periods when several of our systems are simultaneously offline. Outages will be announced as early as possible.


== Will there be staggered outages? ==
== Can I purchase old hardware after equipment upgrades? == <!--T:28-->
We will do our best to limit overlapping outages, but we are very constrained by delivery schedules and funding deadlines so there will probably be periods when many of our systems are simultaneously out. We’ll communicate all outages as early as possible.
Most of the equipment is legally the property of the hosting institution. When the equipment is retired, the host institution manages its disposal following that institution's guidelines. This typically involves "e-cycling", that is, recycling the equipment rather than selling it. If you're looking to acquire the old hardware, it's best to contact the host institution directly, as they may have specific policies or options for selling equipment.
</translate>

Latest revision as of 13:46, 6 November 2024

Major upgrade of our Advanced Research Computing infrastructure

Our Advanced Research Computing infrastructure is undergoing major changes in the winter of 2024-2025 and spring of 2025 to provide better High Performance Computing (HPC) and Cloud services for Canadian researchers. This page will be regularly updated to keep you informed of the activities concerning the transition to the new equipment.

The infrastructure renewal will replace the nearly 80% of our current equipment that is approaching end-of-life. The new equipment will offer faster processing speeds, greater storage capacity, and improved reliability.

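The systems involved are:
* Arbutus, cloud
* Béluga, compute cluster only (not cloud)
* Cedar, compute cluster and cloud
* Graham, compute cluster and cloud
* Niagara, compute cluster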

Technical specifications

Technical specifications for each new system will be provided further down this page in future updates. Generally, they will be similar in architecture to the current systems, but with considerably increased capacity and performance. For example, we expect to have fewer compute nodes, but each node will have a significant increase in the number of its cores, for an overall increase in the total number of CPU cores.

Impacts

System outages

During the installation and the transition to the new systems, outages will be unavoidable due to constraints on space and electrical power. We recommend that you consider the possibility of outages when you plan research programs, graduate examinations, etc.

{| class="wikitable"
|-
| '''Start Time''' || '''End Time''' || '''System''' || '''Description'''
|-
| Nov 7, 2024 || Nov 8, 2024 (1 day) || Niagara || All systems and storage located at the SciNet Datacenter (Niagara, Mist, HPSS, Rouge, Teach, JupyterHub, Balam) will be unavailable from 7 a.m. to 5 p.m. ET. This outage is required to install new electrical equipment (UPS) for the upcoming systems refresh. The work is expected to be completed in one day. The scheduler will hold jobs that cannot finish before the start of the shutdown. Users are encouraged to submit small and short jobs that can take advantage of this, as the scheduler may be able to fit these jobs in before the maintenance on otherwise idle nodes.
|-
| Nov 7, 2024 6am PST || Nov 8, 2024 6am PST || Cedar || Cedar compute nodes will be unavailable during this period (jobs will not run). Cedar login nodes and storage, as well as Cedar cloud will remain online and are not affected by this work.
|}
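
The advice in the Niagara entry above (submit short jobs that can finish before the shutdown) can be checked with a little date arithmetic. The following is only a rough sketch under stated assumptions, not a description of how the scheduler actually decides: the helper names are hypothetical, Eastern Time is taken to be UTC-5 on that date, and queue wait time is ignored, so a job is treated as if it starts the moment it is submitted.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Eastern Time is UTC-5 on Nov 7, 2024 (standard time in effect).
EASTERN = timezone(timedelta(hours=-5))

# Start of the SciNet outage as listed in the table: Nov 7, 2024, 7 a.m. ET.
OUTAGE_START = datetime(2024, 11, 7, 7, 0, tzinfo=EASTERN)


def can_finish_before_outage(start_time, requested_walltime,
                             outage_start=OUTAGE_START):
    """Return True if a job starting at start_time ends by outage_start.

    Hypothetical helper: it ignores queue wait, so it is an optimistic
    estimate of whether the job would be allowed to run.
    """
    return start_time + requested_walltime <= outage_start


def longest_walltime(now, outage_start=OUTAGE_START):
    """Longest requested walltime that still fits before the outage."""
    return max(outage_start - now, timedelta(0))


if __name__ == "__main__":
    # Example: 7 p.m. ET on the evening before the outage.
    now = datetime(2024, 11, 6, 19, 0, tzinfo=EASTERN)
    print(can_finish_before_outage(now, timedelta(hours=8)))   # True
    print(can_finish_before_outage(now, timedelta(hours=24)))  # False
    print(longest_walltime(now))                               # 12:00:00
```

In practice the requested time limit in your job script is what matters, so asking for a shorter limit than usual improves the chance that a job fits in before the maintenance window.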

Resource Allocation Competition (RAC)

The Resource Allocation Competition will be impacted by this transition, but the application process remains the same. The application deadline this year is October 30, 2024.
2024/25 allocations will remain in effect on retiring clusters while each cluster remains in service. The 2025/26 allocations will be implemented everywhere once all new clusters are in service.
Because the old clusters will mostly be out of service before all new ones are available, if you hold both a 2024 and a 2025 RAC award you will experience a period when neither award is available to you. You will be able to compute with your default allocation (def-xxxxxx) on each new cluster as soon as it goes into service, but the 2025 RAC allocations will only become available when all new clusters are in service.

General progress updates

{| class="wikitable"
|-
| Oct 7, 2024 || Details for necessary infrastructure (power and cooling) upgrades are being worked out. Timelines are not yet available but we expect some outages of a day or more in November.
|-
| Sep 13, 2024 || The RFP processes for all sites except for Rorqual (replacing Béluga) have been completed, and purchase orders have been sent to vendors. The Rorqual storage Request for Proposals is still open and is scheduled to complete on September 18.
All sites are working on infrastructure design (power and cooling) and implementation. We are expecting some outages throughout the fall for cabling and plumbing upgrades.
|-
| Sep 3, 2024 || All sites have completed their Requests for Proposals, and are working with the vendors on deliverables and purchase orders.
|}

Activities by system

Arbutus, cloud

Details for Arbutus are coming soon.

Béluga, compute cluster only (not cloud)

The cluster replacing Béluga will be named Rorqual. Specifications are available on the Rorqual page.

Cedar, compute cluster and cloud

The cluster replacing Cedar will be named Fir. Specifications are available on the Fir page.

Graham, compute cluster and cloud

Details for Graham are coming soon.

Niagara, compute cluster

The cluster replacing Niagara and Mist in early 2025 will be named Trillium. Specifications are available on the Trillium page. Hardware delivery starts in late 2024, and the new cluster will be available for users in the spring of 2025. To make room, half of Niagara will be decommissioned starting in December 2024 or January 2025. We’ll update you when we have a better idea of Trillium’s installation schedule.

Frequently asked questions

Will my data be copied to the new system?

Data migration to the new systems is the responsibility of each National Host Site, which will inform you of what you need to do.

When will outages occur?

Each National Host Site will have its own outage schedule as the installation of, and transition to, the new equipment proceeds. As usual, specific outages will be described on our system status web page (https://status.alliancecan.ca). We will provide more general updates on this wiki page and you will periodically receive emails with updates and outage notices.

Who can I contact for questions about the transition?

Contact our technical support team, who will do their best to answer your questions.

Will my jobs and applications still be able to run on the new system?

Generally yes, but the new CPUs and GPUs may require recompilation or reconfiguration of some applications. More details will be provided as the transition unfolds.

Will the software from the current systems still be available?

Yes, our standard software environment will be available on the new systems.

Will there be staggered outages?

We will do our best to limit overlapping outages, but because we are very constrained by delivery schedules and funding deadlines, there will probably be periods when several of our systems are simultaneously offline. Outages will be announced as early as possible.

Can I purchase old hardware after equipment upgrades?

Most of the equipment is legally the property of the hosting institution. When the equipment is retired, the host institution manages its disposal following that institution's guidelines. This typically involves "e-cycling", that is, recycling the equipment rather than selling it. If you're looking to acquire the old hardware, it's best to contact the host institution directly, as they may have specific policies or options for selling equipment.