Technical glossary for the resource allocation competitions

'''Compute node:''' A computational unit of a cluster, one or more of which can be allocated to a job. A node has its own operating system image, one or more CPU cores and some memory (RAM). Nodes can be used by jobs in either an exclusive or a shared manner, depending on the cluster.


'''Core year:''' The equivalent of using one CPU core continuously for a full year. Using 12 cores for a month and using 365 cores for a single day are both equivalent to one core year. Compute allocations are expressed in core years.
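The arithmetic is simple enough to sketch in a few lines of Python (the function below is purely illustrative and not part of any allocation tool):

<syntaxhighlight lang="python">
def core_years(cores: int, days: float) -> float:
    """Core years consumed by running `cores` CPU cores for `days` days."""
    return cores * days / 365.0

# Both of the examples above work out to one core year:
print(core_years(12, 365 / 12))  # 12 cores for one month -> 1.0
print(core_years(365, 1))        # 365 cores for one day  -> 1.0
</syntaxhighlight>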


'''Core equivalent:''' A bundle made up of a single core and the amount of memory considered to be associated with each core on a given system. See the detailed explanation [[Allocations_and_compute_scheduling|here]].
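As a rough sketch of how core equivalents might be counted, assuming for illustration a 4 GiB-per-core memory bundle and a "whichever resource dominates" rule (the actual bundle size and counting rule are system-dependent; see the linked page):

<syntaxhighlight lang="python">
import math

def core_equivalents(cores: int, mem_gib: float, mem_per_core_gib: float = 4.0) -> int:
    """Count core equivalents for a request of `cores` cores and `mem_gib` of
    memory, assuming each core comes bundled with `mem_per_core_gib` of memory.
    Whichever of the two resources dominates determines the count."""
    return max(cores, math.ceil(mem_gib / mem_per_core_gib))

# Example: 2 cores with 16 GiB of memory count as 4 core equivalents
# under the assumed 4 GiB-per-core bundle.
print(core_equivalents(2, 16.0))  # -> 4
</syntaxhighlight>
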
'''Head or Login node:''' Typically, when you access a cluster you are accessing a head node, also called a gateway or login node. A head node is configured to be the launching point for jobs running on the cluster. When you are told or asked to log into or access a cluster, you are invariably being directed to log into the head node, which is often nothing more than a node configured to act as an intermediary between the cluster and the outside network.


'''Fair share allocation:''' Generally speaking, batch processing priority is allocated based on a fair-share algorithm. Each user is allocated a share of the total system resources, which effectively translates into priority access to the system. If you have used a large fraction of the system recently (i.e., larger than your fair share), your priority drops. However, the scheduling system has a limited time window over which it calculates priority. After some time (e.g., weeks) of reduced usage, it gradually “forgets” that you overused in the past. This is designed to ensure full system usage and not to penalize users who take advantage of idle compute resources. A consequence is that your total allocation is not a limit on how many compute resources you can consume. Rather, your total allocation represents what you should be able to get over the course of the year if you submit a constant workload to the system and it is fully busy. In other words, once your “total allocation” is used, just keep working.
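The exact priority calculation depends on the scheduler and its configuration; the toy sketch below only illustrates the general idea of recent usage decaying over time and depressing priority. The half-life, the exponential shape, and all numbers are illustrative assumptions, not the formula actually used:

<syntaxhighlight lang="python">
def effective_usage(daily_usage, half_life_days=14.0):
    """Sum past usage (e.g. in core days), discounting usage that is `age` days
    old by a half-life decay, so that old usage is gradually forgotten.
    `daily_usage` is a list of (age_in_days, core_days) pairs."""
    return sum(u * 0.5 ** (age / half_life_days) for age, u in daily_usage)

def priority_factor(usage, fair_share):
    """Toy priority: 1.0 with no recent usage, 0.5 when recent usage equals
    your fair share, and approaching 0 as you exceed it."""
    return 2.0 ** (-usage / fair_share)

recent = [(0, 50.0), (7, 200.0), (30, 400.0)]  # hypothetical (age, core-days) usage
print(priority_factor(effective_usage(recent), fair_share=300.0))
</syntaxhighlight>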


'''Job:''' A job is the basic execution object managed by the batch system. It is a collection of one or more related computing processes that is managed as a whole. Users define resource requirements for the job when they submit it to the batch system. A job description includes a resource request, such as the amount of memory required, the duration of the job, and the number of compute cores it needs. Jobs can be either serial (running on one compute core) or parallel (running on multiple compute cores).

'''Nearline:''' The nearline filesystem is a disk-tape hybrid storage system, in which data above a certain size threshold is automatically migrated from disk to tape, and back again upon read operations. Access to this storage resource requires deliberate action by users (e.g., via Linux commands such as cp, mv or rsync) to place files into the designated nearline location, or to transfer them from another filesystem (scratch, project, home, etc.). The tape subsystem has very high capacity, but adds latency when files need to be accessed again. This storage system should be used for datasets that are infrequently accessed and need to be retained for long periods of time. It is not true “archival” storage in that the datasets must be part of an “active” project. Nearline capacity is managed by quotas, and allocations are made via the RAC process.


'''dCache:''' dCache is a storage filesystem originally developed for high-energy physics projects with very large datasets (petabytes). dCache storage is essentially an object storage layer on top of classical storage, providing a single namespace and various authorized access and transfer protocols to the underlying immutable data. Allocations here tend to be for large projects with many Principal Investigators and researchers. dCache capacity is allocated via the RAC process. If you wish to use this storage, contact the Subatomic Physics National Team by writing to our [[Technical support]].

'''Local storage:''' This refers to the hard drive or solid-state drive in a compute node that can be used to temporarily store programs, input files, or their results. Files in local storage on one node cannot be accessed from any other node. Local storage may not be persistent, so files created there should be moved to non-local storage to avoid data loss.
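A common pattern is to stage results back to network storage before the job ends. Here is a minimal sketch, in which the SLURM_TMPDIR variable and both paths are assumptions to be adapted to the cluster you are using:

<syntaxhighlight lang="python">
import os
import shutil

# Node-local scratch directory (assumed name; check your cluster's documentation).
local_dir = os.environ.get("SLURM_TMPDIR", "/tmp/myjob")
# Destination on a network filesystem that outlives the job (assumed path).
persistent_dir = os.path.expanduser("~/scratch/results")

os.makedirs(persistent_dir, exist_ok=True)
for name in os.listdir(local_dir):
    src = os.path.join(local_dir, name)
    if os.path.isfile(src):
        shutil.copy2(src, persistent_dir)  # copy each result file off the node
</syntaxhighlight>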


'''Site:''' A member of one of the Alliance's partners providing advanced research computing (ARC) resources (such as high-performance computing clusters, Clouds, storage, and/or technical support).


'''Tape:''' Tape is a storage technology used to store long-term data that are infrequently accessed. It is considerably lower in cost than disk and is a viable option for many use cases.


== Cloud ==
'''Alliance Cloud:''' A pool of hardware supporting virtualization. This can be thought of as Infrastructure as a Service (IaaS).


'''Compute Cloud:''' These are instances that have a limited lifetime and typically have constant, high CPU requirements for that lifetime. They have also been referred to as ‘batch’ instances. They are granted higher vCPU/memory quotas since they are time-limited instances.

'''Ephemeral local disk:''' Ephemeral disks are often used in cloud-native application use cases where VMs are expected to be short-lived and the data does not need to persist beyond the life of the VM. This may include a VM with data that is cached for short usage; a VM hosting applications that replicate their data across multiple VMs; or a VM whose persistent data is saved on a volume or on other external storage. Ephemeral local disks use the storage directly attached to a virtualisation host and are not expected to survive if the hardware fails, unlike volumes, which are backed by a resilient storage cluster. An ephemeral local disk is purged and deleted when the instance is itself deleted by the cloud user or by the admin.


'''Service Portal:''' We host many research web portals that serve datasets or tools to a broad research community. These portals generally do not require large computing or storage resources, but may require support effort from our technical team. Groups applying for a service portal often use our cloud, generally require a public IP address, and may (or may not) have more stringent up-time requirements than most research projects. This option is shown as “Portal” in the online form.


'''Virtual Machine (VM):''' See Instance above.