This site replaces the former Compute Canada documentation site, and is now being managed by the Digital Research Alliance of Canada.
Please use data transfer nodes, also called data mover nodes, instead of login nodes whenever you are transferring data to and from our clusters. If a data transfer node is available, its URL will be given near the top of the main page for each cluster: Béluga, Narval, Cedar, Graham and Niagara.
Globus automatically uses data transfer nodes.
To and from your personal computer
You will need software that supports secure transfer of files between your computer and our machines. The commands scp and sftp can be used in a command-line environment on Linux or Mac OS X computers. On Microsoft Windows platforms, MobaXterm offers both a graphical file transfer function and a command-line interface via SSH, while WinSCP is another free program that supports file transfer. Setting up a connection to a machine using SSH keys with WinSCP can be done by following the steps in this link. PuTTY comes with pscp and psftp, which are essentially the same as the Linux and Mac command line programs.
Globus is the preferred tool for transferring data between systems and should be used whenever it is available.
However, other common tools for transferring data both inside and outside of our systems are also available, including SFTP, SCP and rsync, which are described below.
Note: If you want to transfer files between another of our clusters and Niagara, use the SSH agent forwarding flag -A when logging into the other cluster. For example, to copy files to Niagara from Cedar, use:
ssh -A USERNAME@cedar.computecanada.ca
then perform the copy:
[USERNAME@cedar5 ~]$ scp file USERNAME@niagara.computecanada.ca:/scratch/g/group/USERNAME/
From the World Wide Web
The standard tool for downloading data from websites is wget. Also available is curl. The two are compared in this StackExchange article. For getting data from various cloud services such as Google Cloud Storage, Google Drive and Google Photos, consider rclone. All three (wget, curl, rclone) are available on our clusters without loading a module.
To synchronize or sync files (or directories) stored in two different locations means to ensure that the two copies are the same. Here are several different ways to do this.
We find Globus usually gives the best performance and reliability.
Normally when a Globus transfer is initiated it will overwrite the files on the destination with the files from the source, which means all of the files on the source will be transferred. If some of the files may already exist on the destination and need not be transferred when they match, open the Transfer & Timer Options and choose to sync instead.
You may choose how Globus decides which files to transfer:
| Their checksums are different | This is the slowest option but the most accurate. It will catch changes or errors that result in the same size of file, but with different contents. |
| File doesn't exist on destination | This will only transfer files that have been created since the last sync. Useful if you are incrementally creating files. |
| File size is different | A quick test. If the file size has changed then its contents must have changed, and it will be re-transferred. |
| Modification time is newer | This will check the file's recorded modification time and only transfer the file if it is newer on the source than on the destination. If you want to depend on this, it is important to check the preserve source file modification times option when initiating a Globus transfer. |
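The four criteria can be pictured as simple tests on a pair of files. The sketch below is only an illustration written as local shell checks; it is not how Globus is implemented. The function name needs_transfer and the criterion labels exists, size, mtime and checksum are made up for this example, and it assumes that a file missing from the destination is sent under every criterion.

```shell
#!/usr/bin/env bash
# Illustration only: the four sync criteria expressed as local file tests.
# needs_transfer and its criterion labels are hypothetical names for this sketch.
set -u

needs_transfer() {
    # usage: needs_transfer CRITERION SOURCE DEST
    # returns 0 if the file would be sent, 1 if it would be skipped
    local criterion=$1 src=$2 dest=$3
    # Assumption: a file missing from the destination is always transferred.
    [ -e "$dest" ] || return 0
    case "$criterion" in
        exists)   return 1 ;;
        size)     [ "$(wc -c < "$src")" -ne "$(wc -c < "$dest")" ] ;;
        mtime)    [ "$src" -nt "$dest" ] ;;
        checksum) [ "$(sha1sum < "$src")" != "$(sha1sum < "$dest")" ] ;;
        *)        echo "unknown criterion: $criterion" >&2; return 2 ;;
    esac
}

# Two files with the same size and modification time but different contents.
WORK=$(mktemp -d)
printf 'abcd' > "$WORK/src.dat"
printf 'abcX' > "$WORK/dest.dat"
touch -r "$WORK/src.dat" "$WORK/dest.dat"   # copy the timestamps of src.dat

for criterion in exists size mtime checksum; do
    if needs_transfer "$criterion" "$WORK/src.dat" "$WORK/dest.dat"; then
        echo "$criterion: transfer"
    else
        echo "$criterion: skip"
    fi
done
```

In this demonstration the two files differ only in their contents, so the checksum criterion is the only one that reports a transfer is needed.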
For more information about Globus please see Globus.
Rsync is a popular tool for ensuring that two separate datasets are the same but can be quite slow if there are a lot of files or there is a lot of latency between the two sites, i.e. they are geographically apart or on different networks. Running
rsync will check the modification time and size of each file, and will only transfer the file if one or the other does not match. If you expect modification times not to match on the two systems, you can use the
-c option, which will compute checksums at the source and destination, and transfer only if the checksums do not match.
When transferring files into the /project file systems, do not use the -p and -g flags (or -a, which implies those two). The quotas in /project are enforced based on group ownership, and thus preserving the group ownership will lead to the Disk quota exceeded error message.
If you are using
-a when transferring files into the
/project file systems, you can add
--no-g --no-p to your options, like so
[name@server ~]$ rsync -avz --no-g --no-p LOCALNAME email@example.com:projects/def-professor/someuser/
or avoid using -a altogether:
[name@server ~]$ rsync -rltv LOCALNAME firstname.lastname@example.org:projects/def-professor/someuser/
where LOCALNAME can be a folder or file. For large transfers consider adding --partial so that interrupted transfers may be restarted, and/or --progress to see a summary of the transfer progress.
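Before starting a large transfer it can help to preview what rsync would send. The following is a minimal sketch that uses rsync's --dry-run and --itemize-changes options with the same -rlt flags recommended above. Two local temporary directories stand in for the source and the remote destination, and the file names are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: preview an rsync transfer with --dry-run, then run it for real.
# Local temporary directories stand in for the source and the remote /project path.
set -u

WORK=$(mktemp -d)
mkdir -p "$WORK/src" "$WORK/dest"
echo "old data" > "$WORK/src/unchanged.txt"
# cp -p keeps the modification time, so size and mtime match on both sides.
cp -p "$WORK/src/unchanged.txt" "$WORK/dest/unchanged.txt"
echo "new data" > "$WORK/src/new.txt"

# -rlt is -a without -p, -g, -o and -D, so permissions and group are not preserved.
RSYNC_OPTS=(-rlt --partial --itemize-changes)

PREVIEW=""
if command -v rsync >/dev/null 2>&1; then
    # --dry-run lists what would be transferred without copying anything.
    PREVIEW=$(rsync "${RSYNC_OPTS[@]}" --dry-run "$WORK/src/" "$WORK/dest/")
    echo "$PREVIEW"
    # The real transfer uses the same options without --dry-run.
    rsync "${RSYNC_OPTS[@]}" "$WORK/src/" "$WORK/dest/"
else
    echo "rsync is not installed; skipping the demonstration" >&2
fi
```

For a real transfer, replace the two local paths with LOCALNAME and the remote destination used in the commands above.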
Using checksums to check if files match
If Globus is unavailable between the two systems being synchronized and Rsync is taking too long, then you can use a checksum utility on both systems to determine if the files match. In this example we use sha1sum.
[name@server ~]$ find /home/username/ -type f -print0 | xargs -0 sha1sum | tee checksum-result.log
This command will create a new file called checksum-result.log in the current directory; the file will contain all of the checksums for the files in /home/username/. It will also print out all of the checksums to the screen as it goes. If you have a lot of files or very large files you may want to run this command in the background, in a screen or tmux session; anything that allows it to continue if your SSH connection times out.
After you run it on both systems, you can use the
diff utility to find files that don't match.
[name@server ~]$ diff checksum-result-silo.log checksum-dtn.log
69c69
< 017f14f6a1a194a5f791836d93d14beead0a5115 /home/username/file-0025048576-0000008
---
> 8836913c2cc2272c017d0455f70cf0d698daadb3 /home/username/file-0025048576-0000008
It is possible that the
find command will crawl through the directories in a different order, resulting in a lot of false differences so you may need to run
sort on both files before running diff such as:
[name@server ~]$ sort -k2 checksum-result-silo.log -o checksum-result-silo.log
[name@server ~]$ sort -k2 checksum-dtn.log -o checksum-dtn.log
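The checksum, sort and diff steps can be combined into one small script. This is a sketch under stated assumptions: the function name checksum_tree is made up, the two temporary directories are stand-ins for the copies on the two systems, and GNU versions of sort and xargs are assumed. In practice each checksum log would be produced on its own system and the logs compared afterwards.

```shell
#!/usr/bin/env bash
# Sketch: compare two directory trees by checksum.
# copy_a and copy_b are hypothetical stand-ins for the copies on the two systems.
set -u

WORK=$(mktemp -d)
mkdir -p "$WORK/copy_a/sub" "$WORK/copy_b/sub"
echo "same" > "$WORK/copy_a/sub/ok.dat"
echo "same" > "$WORK/copy_b/sub/ok.dat"
echo "good" > "$WORK/copy_a/bad.dat"
echo "corrupt" > "$WORK/copy_b/bad.dat"

checksum_tree() {
    # Run from inside the tree so the recorded paths are relative and therefore
    # comparable even when the two copies live under different parent directories.
    # Sorting the file list removes any dependence on the order find visits files.
    (cd "$1" && find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r sha1sum)
}

checksum_tree "$WORK/copy_a" > "$WORK/a.log"
checksum_tree "$WORK/copy_b" > "$WORK/b.log"

# diff exits with status 1 when the logs differ, so capture its output
# rather than treating that status as a failure.
DIFF_OUTPUT=$(diff "$WORK/a.log" "$WORK/b.log" || true)
echo "$DIFF_OUTPUT"
```

Only bad.dat appears in the diff output here, because ok.dat has the same checksum in both trees.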
SFTP (Secure File Transfer Protocol) uses the SSH protocol to transfer files between machines, so the data is encrypted while it is being transferred.
For example, you can connect to a remote machine at
ADDRESS as user
USERNAME with SFTP to transfer files like so:
[name@server]$ sftp USERNAME@ADDRESS
The authenticity of host 'ADDRESS (###.###.###.##)' can't be established.
RSA key fingerprint is ##:##:##:##:##:##:##:##:##:##:##:##:##:##:##:##.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ADDRESS,###.###.###.##' (RSA) to the list of known hosts.
USERNAME@ADDRESS's password:
Connected to ADDRESS.
sftp>
or using an SSH key for authentication with the -i option:
[name@server]$ sftp -i /home/name/.ssh/id_rsa USERNAME@ADDRESS
Connected to ADDRESS.
sftp>
which returns the sftp> prompt where commands to transfer files can be issued. To get a list of commands available to use at the sftp prompt, enter the help command.
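When the same sftp commands need to be run repeatedly, they can be placed in a batch file and passed to sftp with its -b option, which requires non-interactive authentication such as an SSH key. The sketch below writes such a batch file and prints the command it would run. USERNAME and ADDRESS are the placeholders used above, while the remote directory work, the file names and the RUN_SFTP switch are assumptions made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: run a fixed list of sftp commands non-interactively with a batch file.
set -u

# Placeholders as in the examples above; replace them before real use.
REMOTE="USERNAME@ADDRESS"
KEY="${HOME:-/home/name}/.ssh/id_rsa"

# Hypothetical session: upload one file, download another, list the directory.
BATCH=$(mktemp)
cat > "$BATCH" <<'EOF'
cd work
put foo.txt
get results/output.dat
ls -l
bye
EOF

SFTP_CMD=(sftp -i "$KEY" -b "$BATCH" "$REMOTE")

# RUN_SFTP is a made-up switch: the remote machine is contacted only when it is set to 1.
if [ "${RUN_SFTP:-0}" = "1" ]; then
    "${SFTP_CMD[@]}"
else
    echo "Would run: ${SFTP_CMD[*]}"
    cat "$BATCH"
fi
```

In batch mode sftp stops at the first command that fails, so a missing file ends the session early instead of being silently skipped.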
SCP stands for Secure Copy Protocol. Like SFTP it uses the SSH protocol to encrypt data being transferred. It does not support synchronization like Globus or rsync. Some examples of the most common use of SCP include
[name@server ~]$ scp foo.txt email@example.com:work/
which will copy the file
foo.txt from the current directory on my local computer to the directory
$HOME/work on the cluster Béluga. To copy a file,
output.dat from my project space on the cluster Cedar to my local computer I can use a command like
[name@server ~]$ scp firstname.lastname@example.org:projects/def-jdoe/username/results/output.dat .
Many other examples of the use of SCP are shown here. Note that you always execute this
scp command on your local computer, not the remote cluster - the SCP connection, regardless of whether you are transferring data to or from the remote cluster, should always be initiated from your local computer.
SCP supports the option
-r to recursively transfer a set of directories and files. We recommend against using
scp -r to transfer data into
/project because the setgid bit is turned off in the created directories, which may lead to
Disk quota exceeded or similar errors if files are later created there (see Disk quota exceeded error on /project filesystems).
Note: If you chose a custom SSH key name, i.e. something other than a default name such as id_rsa, you will need to use the -i option of scp and specify the path to your private key before the file paths:
[name@server ~]$ scp -i /path/to/key foo.txt email@example.com:work/