Transferring data: Difference between revisions

Latest revision as of 01:47, 5 October 2024

Other languages:

English
français

Please use data transfer nodes, also called data mover nodes, instead of login nodes whenever you are transferring data to and from our clusters. If a data transfer node is available, its URL will be given near the top of the main page for each cluster: Béluga, Narval, Cedar, Graham and Niagara.

Globus automatically uses data transfer nodes.

To and from your personal computer[edit]

You will need software that supports secure transfer of files between your computer and our machines. The commands scp and sftp can be used in a command-line environment on Linux or Mac OS X computers. On Microsoft Windows platforms, MobaXterm offers both a graphical file transfer function and a command-line interface via SSH, while WinSCP is another free program that supports file transfer. Setting up a connection to a machine using SSH keys with WinSCP can be done by following the steps in this link. PuTTY comes with pscp and psftp which are essentially the same as the Linux and Mac command line programs.

If it takes more than one minute to move your files to or from our servers, we recommend you install and try Globus Personal Connect. Globus transfers can be set up and will run in the background.

Between resources[edit]

Globus is the preferred tool for transferring data between systems, and if it can be used, it should.

However, other common tools can also be found for transferring data both inside and outside of our systems, including

SFTP
SCP or Secure Copy Protocol
rsync

Note: If you want to transfer files between another of our clusters and Niagara use the SSH agent forwarding flag -A when logging into another cluster. For example, to copy files to Niagara from Cedar, use:

ssh -A USERNAME@cedar.computecanada.ca

then perform the copy:

[USERNAME@cedar5 ~]$ scp file USERNAME@niagara.computecanada.ca:/scratch/g/group/USERNAME/

From the World Wide Web[edit]

The standard tool for downloading data from websites is wget. Another often used is curl. Their similarities and differences are compared in several places such as this StackExchange article or here. While the focus here is transferring data on Alliance Linux systems this tutorial also addresses Mac and Windows machines. Both wget and curl can resume interrupted downloads by rerunning them with the -c and -C - command line options respectively. When getting data from various cloud services such as Google cloud storage, Google Drive and Google Photos, consider using the rclone tool instead. All of these tools (wget, curl, rclone) are available on every Alliance cluster by default (without loading a module). For a detailed listing of command line options check the man page for each tool or run them with --help or simply -h on the cluster.

Synchronizing files[edit]

To synchronize or sync files (or directories) stored in two different locations means to ensure that the two copies are the same. Here are several different ways to do this.

Globus transfer[edit]

We find Globus usually gives the best performance and reliability.

Normally when a Globus transfer is initiated it will overwrite the files on the destination with the files from the source, which means all of the files on the source will be transferred. If some of the files may already exist on the destination and need not be transferred if they match, you should go to the Transfer & Timer Options shown in the screenshot and choose to sync instead.

You may choose how Globus decides which files to transfer:

Their checksums are different	This is the slowest option but most accurate. This will catch changes or errors that result in the same size of file, but with different contents.
File doesn't exist on destination	This will only transfer files that have been created since the last sync. Useful if you are incrementally creating files.
File size is different	A quick test. If the file size has changed then its contents must have changed, and it will be re-transferred.
Modification time is newer	This will check the file's recorded modification time and only transfer the file if it is newer on the source than the destination. If you want to depend on this, it is important to check the preserve source file modification times option when initiating a Globus transfer.

For more information about Globus please see Globus.

Rsync[edit]

Rsync is a popular tool for ensuring that two separate datasets are the same but can be quite slow if there are a lot of files or there is a lot of latency between the two sites, i.e. they are geographically apart or on different networks. Running rsync will check the modification time and size of each file, and will only transfer the file if one or the other does not match. If you expect modification times not to match on the two systems, you can use the -c option, which will compute checksums at the source and destination, and transfer only if the checksums do not match.

When transferring files into the /project file systems, do not use -p and -g flags since the quotas in /project are enforced based on group ownership, and thus preserving the group ownership will lead to the Disk quota exceeded error message. Since -a includes -p and -g by default, the --no-g --no-p options should be added, like so

[name@server ~]$ rsync -avzh --no-g --no-p LOCALNAME someuser@graham.computecanada.ca:projects/def-professor/someuser/somedir/

where LOCALNAME can be a directory or file preceded by its path location and somedir will be created if it doesn't exist. The -z option compresses files (not in the default file suffixes --skip-compress list) and requires additional cpu resources while the -h option makes transferred file sizes human readable. If you are transferring very large files add the --partial option so interrupted transfers maybe restarted:

[name@server ~]$ rsync -avzh --no-g --no-p --partial --progress LOCALNAME someuser@graham.computecanada.ca:projects/def-professor/someuser/somedir/

The --progress option will display the percent progress of each file as its transferred. If you are transferring very many smaller files, then it maybe more desirable to display a single progress bar that represents the transfer progress of all files:

[name@server ~]$ rsync -azh --no-g --no-p --info=progress2 LOCALNAME someuser@graham.computecanada.ca:projects/def-professor/someuser/somedir/

The above rsync examples all involve transfers from a local system into a project directory on a remote system. Rsync transfers from a remote system into a project directory on a local system work in much the same way, for example:

[name@server ~]$ rsync -avzh --no-g --no-p someuser@graham.computecanada.ca:REMOTENAME ~/projects/def-professor/someuser/somedir/

where REMOTENAME can be a directory or file preceded by its path location and somedir will be created if it doesn't already exist. In its simplest incarnation rsync can also be used locally within a single system to transfer a directory or file (from home or scratch) into project by dropping the cluster name:

[name@server ~]$ rsync -avh --no-g --no-p LOCALNAME ~/projects/def-professor/someuser/somedir/

where somedir will be created if it doesn't already exist before copying LOCALNAME into it. For comparison purposes, the copy command can similarely be used to transfer LOCALNAME from home to project by doing:

[name@server ~]$ cp -rv --preserve="mode,timestamps" LOCALNAME ~/projects/def-professor/someuser/somedir/

however unlike rsync, if LOCALNAME is a directory, it will be renamed to somedir if somedir does not exist.

Using checksums to check if files match[edit]

If Globus is unavailable between the two systems being synchronized and Rsync is taking too long, then you can use a checksum utility on both systems to determine if the files match. In this example we use sha1sum.

[name@server ~]$ find /home/username/ -type f -print0 | xargs -0 sha1sum | tee checksum-result.log

This command will create a new file called checksum-result.log in the current directory; the file will contain all of the checksums for the files in /home/username/. It will also print out all of the checksums to the screen as it goes. If you have a lot of files or very large files you may want to run this command in the background, in a screen or tmux session; anything that allows it to continue if your SSH connection times out.

After you run it on both systems, you can use the diff utility to find files that don't match.

[name@server ~]$ diff checksum-result-silo.log checksum-dtn.log
69c69
 < 017f14f6a1a194a5f791836d93d14beead0a5115  /home/username/file-0025048576-0000008
 ---
 > 8836913c2cc2272c017d0455f70cf0d698daadb3  /home/username/file-0025048576-0000008

It is possible that the find command will crawl through the directories in a different order, resulting in a lot of false differences so you may need to run sort on both files before running diff such as:

[name@server ~]$ sort -k2 checksum-result-silo.log -o checksum-result-silo.log
[name@server ~]$ sort -k2 checksum-dtn.log -o checksum-dtn.log

SFTP[edit]

SFTP (Secure File Transfer Protocol) uses the SSH protocol to transfer files between machines which encrypts data being transferred.

For example, you can connect to a remote machine at ADDRESS as user USERNAME with SFTP to transfer files like so:

[name@server]$ sftp USERNAME@ADDRESS
The authenticity of host 'ADDRESS (###.###.###.##)' can't be established.
RSA key fingerprint is ##:##:##:##:##:##:##:##:##:##:##:##:##:##:##:##.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ADDRESS,###.###.###.##' (RSA) to the list of known hosts.
USERNAME@ADDRESS's password:
Connected to ADDRESS.
sftp>

or using an SSH Key for authentication using the -i option

[name@server]$ sftp -i /home/name/.ssh/id_rsa USERNAME@ADDRESS
Connected to ADDRESS.
sftp>

which returns the sftp> prompt where commands to transfer files can be issued. To get a list of commands available to use at the sftp prompt enter the help command.

There are also a number of graphical programs available for Windows, Linux and Mac OS, such as WinSCP and MobaXterm (Windows), filezilla (Windows, Mac, and Linux), and cyberduck (Mac and Windows).

SCP[edit]

SCP stands for Secure Copy Protocol. Like SFTP it uses the SSH protocol to encrypt data being transferred. It does not support synchronization like Globus or rsync. Some examples of the most common use of SCP include

[name@server ~]$ scp foo.txt username@beluga.computecanada.ca:work/

which will copy the file foo.txt from the current directory on my local computer to the directory $HOME/work on the cluster Béluga. To copy a file, output.dat from my project space on the cluster Cedar to my local computer I can use a command like

[name@server ~]$ scp username@cedar.computecanada.ca:projects/def-jdoe/username/results/output.dat .

Many other examples of the use of SCP are shown here. Note that you always execute this scp command on your local computer, not the remote cluster - the SCP connection, regardless of whether you are transferring data to or from the remote cluster, should always be initiated from your local computer.

SCP supports the option -r to recursively transfer a set of directories and files. We recommend against using scp -r to transfer data into /project because the setgid bit is turned off in the created directories, which may lead to Disk quota exceeded or similar errors if files are later created there (see Disk quota exceeded error on /project filesystems).

***Note*** if you chose a custom SSH key name, i.e. something other than the default names: id_dsa, id_ecdsa, id_ed25519 and id_rsa, you will need to use the -i option of scp and specify the path to your private key before the file paths via

[name@server ~]$ scp -i /path/to/key foo.txt username@beluga.computecanada.ca:work/

@@ Line 1: / Line 1: @@
-{{Draft}}
+<languages />
-==From your personal computer==
+<translate>
-You will need software that supports secure transfer of files between your computer and the Compute Canada machines. The command line programs <code>scp</code> and <code>sftp</code> can be used from within terminal programs on '''Linux''' or '''Mac''' OS X computers. On '''Microsoft Windows''' platforms, [http://mobaxterm.mobatek.net/MobaXterm MobaXterm] offers both file transfer and a [[SSH|terminal function]], while [http://winscp.net/eng/index.php WinSCP] is another free program that supports file transfer. [http://www.chiark.greenend.org.uk/~sgtatham/putty/ PuTTY] comes with <code>pscp</code> and <code>psftp</code> which are essentially the same as the Linux and Mac command line programs.
+<!--T:30-->
+Please use <b>data transfer nodes</b>, also called <b>data mover nodes</b>, instead of login nodes whenever you are transferring data to and from our clusters. If a data transfer node is available, its URL will be given near the top of the main page for each cluster: [[Béluga/en|Béluga]], [[Narval/en|Narval]], [[Cedar]], [[Graham]] and [[Niagara]].
-If it takes more than about a minute to move your files to or from Compute Canada servers, you should install [[Globus#Personal_Computers|Globus Personal Connect]] and try [[Globus]] for transfers which can be set up and will then go on in the background without you. Most Compute Canada legacy systems can be reached with Globus, but not all.
+<!--T:31-->
+[[Globus]] automatically uses data transfer nodes.
-==Between Compute Canada resources==
+==To and from your personal computer== <!--T:1-->
-[[Globus]] is the preferred tool for transferring data between Compute Canada systems, and if it can be used, it should.
+You will need software that supports secure transfer of files between your computer and our machines. The commands <code>scp</code> and <code>sftp</code> can be used in a command-line environment on <b>Linux</b> or <b>Mac</b> OS X computers. On <b>Microsoft Windows</b> platforms, [https://docs.computecanada.ca/wiki/Connecting_with_MobaXTerm/en MobaXterm] offers both a graphical file transfer function and a [[Linux introduction|command-line]] interface via [[SSH]], while [http://winscp.net/eng/index.php WinSCP] is another free program that supports file transfer. Setting up a connection to a machine using SSH keys with WinSCP can be done by following the steps in this [https://www.exavault.com/blog/import-ssh-keys-winscp link]. [https://docs.computecanada.ca/wiki/Connecting_with_PuTTY/en PuTTY] comes with <code>pscp</code> and <code>psftp</code> which are essentially the same as the Linux and Mac command line programs.
-However, other common tools can also be found for transferring data both inside and outside of Compute Canada, including
+<!--T:2-->
-* scp
+If it takes more than one minute to move your files to or from our servers, we recommend you install and try [[Globus#Personal_Computers|Globus Personal Connect]]. [[Globus]] transfers can be set up and will run in the background.
-* [SFTP]
-* rsync
-* wget
-==Verify or Synchronize Files After Transfer==
+==Between resources== <!--T:3-->
-There are several tools that you can use to verify that your files have transferred safely or sync a changed dataset to update a location.
+[[Globus]] is the preferred tool for transferring data between systems, and if it can be used, it should.
-===Globus Transfer===
+<!--T:4-->
-For performance and reliabilty reasons we recommend Globus Transfer.
+However, other common tools can also be found for transferring data both inside and outside of our systems, including
+* [[Transferring_data#SFTP | SFTP]]
+* [[Transferring_data#SCP | SCP]] or Secure Copy Protocol
+* [[Transferring_data#Rsync | rsync]]
-Normally when a Globus Transfer is initiated it will overwrite all the files on the destination with the files from the source, which potentially means you would transfer all of the files. Instead, if you go to the bottom of the transfer window as shown in the screenshot you can choose to "sync" instead.
+<!--T:35-->
+Note: If you want to transfer files between another of our clusters and Niagara use the SSH agent forwarding flag <code>-A</code> when logging into another cluster. For example, to copy files to Niagara from Cedar, use:
-[[File:Globus_Transfer_Sync_Options.png|280px|thumb|left]]
+<!--T:36-->
+<pre>
+ssh -A USERNAME@cedar.computecanada.ca
+</pre>
+then perform the copy:
+<pre>
+[USERNAME@cedar5 ~]$ scp file USERNAME@niagara.computecanada.ca:/scratch/g/group/USERNAME/
+</pre>
-This gives you the option to only transfer new or changed files if:
+==From the World Wide Web== <!--T:5-->
+The standard tool for downloading data from websites is [https://en.wikipedia.org/wiki/Wget wget]. Another often used is [https://curl.haxx.se/ curl]. Their similarities and differences are compared in several places such as this StackExchange [https://unix.stackexchange.com/questions/47434/what-is-the-difference-between-curl-and-wget article] or [https://draculaservers.com/tutorials/wget-and-curl-for-files/ here].  While the focus here is transferring data on Alliance Linux systems this [https://www.techtarget.com/searchnetworking/tutorial/Use-cURL-and-Wget-to-download-network-files-from-CLI tutorial] also addresses Mac and Windows machines.   Both [https://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/ wget] and [https://www.thegeekstuff.com/2012/04/curl-examples/ curl] can resume interrupted downloads by rerunning them with the [https://www.cyberciti.biz/tips/wget-resume-broken-download.html -c] and [https://www.cyberciti.biz/faq/curl-command-resume-broken-download/ -C -] command line options respectively. When getting data from various cloud services such as Google cloud storage, Google Drive and Google Photos, consider using the [https://rclone.org/ rclone] tool instead.  All of these tools (wget, curl, rclone) are available on every Alliance cluster by default (without loading a module).  For a detailed listing of command line options check the man page for each tool or run them with <code>--help</code> or simply <code>-h</code> on the cluster.
+==Synchronizing files== <!--T:6-->
+To synchronize or <i>sync</i> files (or directories) stored in two different locations means to ensure that the two copies are the same. Here are several different ways to do this.
+===Globus transfer=== <!--T:7-->
+We find Globus usually gives the best performance and reliability.
+<!--T:8-->
+Normally when a Globus transfer is initiated it will overwrite the files on the destination with the files from the source, which means all of the files on the source will be transferred. If some of the files may already exist on the destination and need not be transferred if they match, you should go to the <i>Transfer & Timer Options</i> shown in the screenshot and choose to <i>sync</i> instead.
+<!--T:9-->
+[[File:Globus_Transfer_Sync_Options.png|280px|thumb|left|alt="Globus file transfer sync options"]]
+<!--T:10-->
+You may choose how Globus decides which files to transfer:
 {| class="wikitable"
 |-
-| Their checksums are different  || This is the slowest option but most accurate. This will catch errors that may have resulted in the same size of file, but with different contents.
+| Their checksums are different  || This is the slowest option but most accurate. This will catch changes or errors that result in the same size of file, but with different contents.
 |-
-| File doesn't exist on destination || This will only transfer new files that have been created since the last transfer / sync which is useful if you are incrementally creating files.
+| File doesn't exist on destination || This will only transfer files that have been created since the last sync. Useful if you are incrementally creating files.
 |-
-| File size is different || A quick process that checks to see if data has been removed / added to a file so that its size changed and therefore needs to be re-transferred
+| File size is different || A quick test. If the file size has changed then its contents must have changed, and it will be re-transferred.
 |-
-| Modification time is newer || It will check the file's recorded modification time and only transfer the file if it is newer on the source than the destination. If you want to depend on this it is important to check the "preserve source file modification times" option when initiating a Globus Transfer
+| Modification time is newer || This will check the file's recorded modification time and only transfer the file if it is newer on the source than the destination. If you want to depend on this, it is important to check the <i>preserve source file modification times</i> option when initiating a Globus transfer.
 |}
-For more information about Globus please see our documentation at: https://docs.computecanada.ca/wiki/Globus
+<!--T:11-->
+For more information about Globus please see [[Globus]].
+<!--T:37-->
+<br clear="all"/>
+===Rsync=== <!--T:12-->
+[https://en.wikipedia.org/wiki/Rsync Rsync] is a popular tool for ensuring that two separate datasets are the same but can be quite slow if there are a lot of files or there is a lot of latency between the two sites, i.e. they are geographically apart or on different networks. Running <code>rsync</code> will check the modification time and size of each file, and will only transfer the file if one or the other does not match. If you expect modification times not to match on the two systems, you can use the <code>-c</code> option, which will compute checksums at the source and destination, and transfer only if the checksums do not match.
+<!--T:26-->
+When transferring files into the <code>/project</code> file systems, do not use <code>-p</code> and <code>-g</code> flags since the quotas in <code>/project</code> are enforced based on group ownership, and thus preserving the group ownership will lead to the [[Frequently_Asked_Questions#Disk_quota_exceeded_error_on_.2Fproject_filesystems | Disk quota exceeded]] error message.  Since <code>-a</code> includes <code>-p</code> and <code>-g</code> by default, the <code>--no-g --no-p</code> options should be added, like so
+{{Command|rsync -avzh --no-g --no-p LOCALNAME someuser@graham.computecanada.ca:projects/def-professor/someuser/somedir/}}
+where LOCALNAME can be a directory or file preceded by its path location and somedir will be created if it doesn't exist.  The <code>-z</code> option compresses files (not in the default file suffixes <code>--skip-compress</code> list) and requires additional cpu resources while the <code>-h</code> option makes transferred file sizes human readable.  If you are transferring very large files add the <code>--partial</code> option so interrupted transfers maybe restarted:
+{{Command|rsync -avzh --no-g --no-p --partial --progress LOCALNAME someuser@graham.computecanada.ca:projects/def-professor/someuser/somedir/}}
+The <code>--progress</code> option will display the percent progress of each file as its transferred.  If you are transferring very many smaller files, then it maybe more desirable to display a single progress bar that represents the transfer progress of all files:
+{{Command|rsync -azh --no-g --no-p --info{{=}}progress2 LOCALNAME someuser@graham.computecanada.ca:projects/def-professor/someuser/somedir/}}
+The above rsync examples all involve transfers from a local system into a project directory on a remote system.  Rsync transfers from a remote system into a project directory on a local system work in much the same way, for example:
+{{Command|rsync -avzh --no-g --no-p someuser@graham.computecanada.ca:REMOTENAME ~/projects/def-professor/someuser/somedir/}}
+where REMOTENAME can be a directory or file preceded by its path location and somedir will be created if it doesn't already exist.  In its simplest incarnation rsync can also be used locally within a single system to transfer a directory or file (from home or scratch) into project by dropping the cluster name:
+{{Command|rsync -avh --no-g --no-p LOCALNAME ~/projects/def-professor/someuser/somedir/}}
+where somedir will be created if it doesn't already exist before copying LOCALNAME into it.  For comparison purposes, the copy command can similarely be used to transfer LOCALNAME from home to project by doing:
+{{Command|cp -rv --preserve{{=}}"mode,timestamps" LOCALNAME ~/projects/def-professor/someuser/somedir/}}
+however unlike rsync, if LOCALNAME is a directory, it will be renamed to somedir if somedir does not exist.
+===Using checksums to check if files match=== <!--T:13-->
+If Globus is unavailable between the two systems being synchronized and Rsync is taking too long, then you can use a  [https://en.wikipedia.org/wiki/Checksum checksum] utility on both systems to determine if the files match. In this example we use <code>sha1sum</code>.
+<!--T:14-->
+{{Command
+|find /home/username/ -type f -print0 {{!}} xargs -0 sha1sum {{!}} tee checksum-result.log
+}}
+<!--T:15-->
+This command will create a new file called checksum-result.log in the current directory; the file will contain all of the checksums for the files in /home/username/. It will also print out all of the checksums to the screen as it goes. If you have a lot of files or very large files you may want to run this command in the background, in a [https://en.wikipedia.org/wiki/GNU_Screen screen] or [https://en.wikipedia.org/wiki/Tmux tmux] session; anything that allows it to continue if your [[SSH]] connection times out.
+<!--T:16-->
+After you run it on both systems, you can use the <code>diff</code> utility to find files that don't match.
+<!--T:17-->
+{{Command
+|diff checksum-result-silo.log checksum-dtn.log
+|result=69c69
+ < 017f14f6a1a194a5f791836d93d14beead0a5115  /home/username/file-0025048576-0000008
+ ---
+ > 8836913c2cc2272c017d0455f70cf0d698daadb3  /home/username/file-0025048576-0000008
+}}
+<!--T:18-->
+It is possible that the <code>find</code> command will crawl through the directories in a different order, resulting in a lot of false differences so you may need to run <code>sort</code> on both files before running diff such as:
+<!--T:19-->
+{{Commands
+|sort -k2 checksum-result-silo.log -o checksum-result-silo.log
+|sort -k2 checksum-dtn.log -o checksum-dtn.log
+}}
+==SFTP== <!--T:21-->
+[https://en.wikipedia.org/wiki/SSH_File_Transfer_Protocol SFTP] (Secure File Transfer Protocol) uses the SSH protocol to transfer files between machines which encrypts data being transferred.
+<!--T:22-->
+For example, you can connect to a remote machine at <code>ADDRESS</code> as user <code>USERNAME</code> with SFTP to transfer files like so:
+<!--T:23-->
+<source lang="console">
+[name@server]$ sftp USERNAME@ADDRESS
+The authenticity of host 'ADDRESS (###.###.###.##)' can't be established.
+RSA key fingerprint is ##:##:##:##:##:##:##:##:##:##:##:##:##:##:##:##.
+Are you sure you want to continue connecting (yes/no)? yes
+Warning: Permanently added 'ADDRESS,###.###.###.##' (RSA) to the list of known hosts.
+USERNAME@ADDRESS's password:
+Connected to ADDRESS.
+sftp>
+</source>
+or using an [[SSH Keys|SSH Key]] for authentication using the <code>-i</code> option
+<source lang="console">
+[name@server]$ sftp -i /home/name/.ssh/id_rsa USERNAME@ADDRESS
+Connected to ADDRESS.
+sftp>
+</source>
-===Rsync===
+<!--T:24-->
-Rsync is a popular tool for ensuring that two separate datasets are the same but can be quite slow if your dataset has a lot of files or there is a lot of latency between the two sites, i.e. they are geographically apart or on different networks. Running rsync will check the modification time and size of each file before transmitting it. If for some reason your modification times do not match on the two systems you can also run using the "-c" option which will create a checksum of the source and destination file before transferring. Generating checksums for files can slow down.
+which returns the <code>sftp></code> prompt where commands to transfer files can be issued. To get a list of commands available to use at the sftp prompt enter the <code>help</code> command.
-===Using sha1sum checksums locally to check if files match===
+<!--T:25-->
-If the two systems have high latency, Globus Transfer is unavailable and Rsync is taking too long then you can use checksums from both systems to determine if the files match. You can use the command:
+There are also a number of graphical programs available for Windows, Linux and Mac OS, such as [https://winscp.net/eng/index.php WinSCP] and [http://mobaxterm.mobatek.net/ MobaXterm] (Windows), [https://filezilla-project.org filezilla] (Windows, Mac, and Linux), and [https://cyberduck.io/?l=en cyberduck] (Mac and Windows).
+[[Category:Connecting]]
-    find /home/username/ -type f -print0  | xargs -0 sha1sum | tee sha1-checksum-result.log
+==SCP== <!--T:27-->
-This command will create a new file called sha1-checksum-result.log in the current directory that will contain all of the checksums for the files in /home/username/. It will also print out all of the checksums to the screen as well as it goes. If you have a lot of files or very large files you may want to run this command in the background, in a screen or tmux session etc. Anything that allow it to continue if your ssh connection times out.
+<!--T:28-->
+SCP stands for [https://en.wikipedia.org/wiki/Secure_copy <i>Secure Copy Protocol</i>]. Like SFTP it uses the SSH protocol to encrypt data being transferred. It does not support synchronization like [[Globus]] or [[Transferring_data#Rsync|rsync]]. Some examples of the most common use of SCP include
+{{Command
+|scp foo.txt username@beluga.computecanada.ca:work/
+}}
+which will copy the file <code>foo.txt</code> from the current directory on my local computer to the directory <code>$HOME/work</code> on the cluster [[Béluga/en|Béluga]]. To copy a file, <code>output.dat</code> from my project space on the cluster [[Cedar]] to my local computer I can use a command like
+{{Command
+|scp username@cedar.computecanada.ca:projects/def-jdoe/username/results/output.dat .
+}}
+Many other examples of the use of SCP are shown [http://www.hypexr.org/linux_scp_help.php here]. Note that you always execute this <code>scp</code> command on your local computer, not the remote cluster - the SCP connection, regardless of whether you are transferring data to or from the remote cluster, should always be initiated from your local computer.
-After you run it on both systems you can use the diff utility to find files that don't match.
+<!--T:29-->
+SCP supports the option <code>-r</code> to recursively transfer a set of directories and files. We <b>recommend against using <code>scp -r</code></b> to transfer data into <code>/project</code> because the setgid bit is turned off in the created directories, which may lead to <code>Disk quota exceeded</code> or similar errors if files are later created there (see [[Frequently_Asked_Questions#Disk_quota_exceeded_error_on_.2Fproject_filesystems | Disk quota exceeded error on /project filesystems]]).
-    diff sha1-checksum-result-silo.log sha1-checksum-dtn.log
+<!--T:33-->
-c69
+<b><big>***Note***</big></b> if you chose a custom SSH key name, <i>i.e.</i> something other than the default names: <code>id_dsa</code>, <code>id_ecdsa</code>, <code>id_ed25519</code> and <code>id_rsa</code>, you will need to use the <code>-i</code> option of scp and specify the path to your private key before the file paths via
-    < 017f14f6a1a194a5f791836d93d14beead0a5115  /home/username/file-0025048576-0000008
-    ---
-    > 8836913c2cc2272c017d0455f70cf0d698daadb3  /home/username/file-0025048576-0000008
-It is possible that the find command will crawl through the directories in a different order resulting in a lot of false differences so you may need to run sort on both files before running diff such as:
+<!--T:34-->
+{{Command
+|scp -i /path/to/key foo.txt username@beluga.computecanada.ca:work/
+}}
-sort sha1-checksum-result-silo.log -o sha1-checksum-result-silo.log
+<!--T:20-->
-sort sha1-checksum-dtn.log -o sha1-checksum-dtn.log
+[[Category:Connecting]]
+</translate>

Transferring data: Difference between revisions

Latest revision as of 01:47, 5 October 2024

Contents

To and from your personal computer[edit]

Between resources[edit]

From the World Wide Web[edit]

Synchronizing files[edit]

Globus transfer[edit]

Rsync[edit]

Using checksums to check if files match[edit]

SFTP[edit]

SCP[edit]

Navigation menu

Transferring data: Difference between revisions

Latest revision as of 01:47, 5 October 2024

To and from your personal computer[edit]

Between resources[edit]

From the World Wide Web[edit]

Synchronizing files[edit]

Globus transfer[edit]

Rsync[edit]

Using checksums to check if files match[edit]

SFTP[edit]

SCP[edit]

Navigation menu

Search