RAC transition FAQ

''This article is a draft: a work in progress intended to be published as a complete article. It should not necessarily be considered factual or authoritative.''

Allocations from the 2019 Resource Allocation Competition (RAC) come into effect on 2019 April 4. Here are some notes on how we expect the transition from the 2018 to the 2019 allocations to proceed.

=== Storage ===

* There will be 30 days of overlap between the 2018 and 2019 storage allocations, starting on 2019 April 4.
* On a given system, the larger of the two quotas (2018, 2019) will be in effect during the transition period.
* If an allocation has moved from one site to another, users are expected to transfer the data themselves (via Globus, <tt>scp</tt>, <tt>rsync</tt>, ''etc.''; see [[Transferring data]] and the example after this list). For large amounts of data (''e.g.'', 200TB or more), please [[Technical support|contact support]] for advice or assistance with the transfer.
* Groups with an allocation that has moved to [[Béluga]] are encouraged to start migrating their data '''now'''; Béluga storage is already active.
* Contributed storage systems have different dates of activation and decommissioning. For these, the quota during the 30-day transition period will be the sum of the 2018 and 2019 quotas.
* Every other PI will have default quotas.
* After the transition period, the quotas on the original sites from which data has been migrated will be reset to the default. Users are expected to delete data from those original sites if their usage exceeds the default quota; otherwise, staff will delete it.
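
For groups moving data between sites themselves, a minimal <tt>rsync</tt> sketch is shown below. The hostnames, username, and paths are placeholders only (assumptions, not tied to any particular allocation); adjust them to your own project spaces, and see [[Transferring data]] for Globus, which is usually the better choice for very large transfers.

<pre>
# Run from a login node on the destination cluster (here: pulling from Cedar).
# "someuser", "def-someuser" and the paths are placeholders; substitute your own.
# --no-g --no-p avoid carrying over group and permission settings that may not
# match the destination project space.
rsync -avP --no-g --no-p \
    someuser@cedar.computecanada.ca:/project/def-someuser/dataset/ \
    /project/def-someuser/dataset/
</pre>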

=== Job scheduling ===

* The scheduler team is planning to archive and compact the Slurm database on April 4, before activating the new allocations. We hope to schedule this during off-peak hours. During this process the database, and in particular <tt>sacct</tt> and <tt>sacctmgr</tt>, may be unresponsive.
* Once the database is compacted, the 2018 allocations will be replaced with the 2019 allocations.
* We are not sure how long these steps (database archiving and compaction, then the allocation cutover) will take; we hope a few hours.
* Job priority may be inconsistent during the allocation cutover. In particular, default allocations may see decreased priority.
* Jobs already in the system will be retained. Running jobs will not be stopped; queued jobs may be held.
* Waiting jobs attributed to an allocation which has been moved or not renewed may not schedule after the cutover. Advice on how to detect and handle such jobs will be forthcoming; a starting point is sketched after this list.
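
As a rough starting point only (not official guidance; the job ID and account name below are placeholders), you can list your pending jobs together with the Slurm account they are charged to, and compare those accounts against your 2019 awards:

<pre>
# Show pending jobs with the account (allocation) they are charged to.
# Format fields: %i job ID, %a account, %P partition, %j name, %T state.
squeue -u $USER -t PENDING -o "%.10i %.20a %.12P %.20j %.10T"

# If a job is charged to an account whose allocation has moved or was not
# renewed, it may need to be moved to another account, e.g. (hypothetical
# job ID and account name):
#   scontrol update jobid=12345 account=def-someuser
</pre>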