General directives for migration: Difference between revisions

From Alliance Doc
(simplify, shorten & copy-edit)
No edit summary
 
(17 intermediate revisions by 5 users not shown)
Line 1: Line 1:
{{Draft}}
<languages/>


This page is for users of Compute Canada clusters concerned about data migration. It explains issues related to transferring your data between Compute Canada facilities and its regional partners ([http://www.ace-net.ca/ ACENET], [http://www.calculquebec.ca/en/ Calcul Quebec], [http://computeontario.ca/ Compute Ontario] and [https://www.westgrid.ca/ WestGrid]).  
<translate>
<!--T:1-->
This page explains issues related to transferring your data between our facilities and our regional partners.  


If you are in any doubt about details of the following advice, contact [mailto:support@computecanada.ca support@computecanada.ca] for help.
<!--T:2-->
If you are in any doubt about details of the following advice, contact our [[technical support]] for help.


== What to do before the migration starts? ==
== What to do before migrating? == <!--T:3-->
Make sure you know whether you are responsible for your own data migration, or whether Compute Canada staff will be migrating your data. Migration of certain legacy systems like [[Migration2016:Silo|Silo]] is being handled by staff. If you are in any doubt, write [mailto:support@computecanada.ca support@computecanada.ca].
Make sure you know whether you are responsible for your own data migration, or whether our staff will be migrating your data. If you are in any doubt, contact our [[technical support]].


If you haven't used [[Globus]] before, read about it now and verify that it works on the system you are migrating from. Test any other tools you will use (like [[tar]], [[gzip]], [[zip]]) on test data to ensure you know how they work before using them on important data.  
<!--T:4-->
If you haven't used [[Globus]] before, read about it now and verify that it works on the system you are migrating from. Test any other tools you will use (like [http://www.howtogeek.com/248780/how-to-compress-and-extract-files-using-the-tar-command-on-linux/ tar], [https://www.gnu.org/software/gzip/manual/gzip.html gzip], [https://www.cyberciti.biz/faq/how-to-create-a-zip-file-in-unix/ zip]) on test data to ensure you know how they work before using them on important data.  


<!--T:5-->
Do not wait until the last minute to start your migration. Depending on how much data you have and how much load there is on the machines and network, you may be surprised at how long it will take to finish a large transfer. Expect hundreds of gigabytes to take hours to transfer, but give yourself days in case there is a problem. Expect terabytes to take days.
Do not wait until the last minute to start your migration. Depending on how much data you have and how much load there is on the machines and network, you may be surprised at how long it will take to finish a large transfer. Expect hundreds of gigabytes to take hours to transfer, but give yourself days in case there is a problem. Expect terabytes to take days.


=== Clean up ===
=== Clean up === <!--T:6-->
It is a good practice to look at your files regularly and see what can be deleted, but unfortunately many of us do not have the habit. A major data migration is a good reminder to clean up your files and directories. Moving less data will take less time, and storage space even on new systems is in great demand and should not be wasted.
It is a good practice to look at your files regularly and see what can be deleted, but unfortunately many of us do not have this habit. A major data migration is a good reminder to clean up your files and directories. Moving less data will take less time, and storage space even on new systems is in great demand and should not be wasted.
* If you compile programs and keep source code, delete any intermediate files. One or more of <code>make clean</code>, <code>make realclean</code>, or <code>rm *.o</code> might be appropriate, depending on your [[Make|makefile]].
* If you compile programs and keep source code, delete any intermediate files. One or more of <code>make clean</code>, <code>make realclean</code>, or <code>rm *.o</code> might be appropriate, depending on your [[Make|makefile]].
* If you find any large files named like <code>core.12345</code> and you don't know what they are, they are probably [https://en.wikipedia.org/wiki/Core_dump core dumps] and can be deleted.
* If you find any large files named like <code>core.12345</code> and you don't know what they are, they are probably [https://en.wikipedia.org/wiki/Core_dump core dumps] and can be deleted.


=== Compress and archive ===
=== Archive and compress === <!--T:7-->
Most file transfer programs move one file of a reasonable size more efficiently than thousands of small files of equal total size. If you have directories or directory trees containing many small files, use [[tar]] or [[zip]] to combine (archive) and compress them.
Most file transfer programs move one file of a reasonable size more efficiently than thousands of small files of equal total size. If you have directories or directory trees containing many small files, use [[Archiving and compressing files|tar]] to combine (archive) them.


Large files can also benefit from compression in many cases, especially text files or numeric data stored as human-readable text. You can use again use [[tar]] for this, or [[gzip]], or [[zip]].
<!--T:8-->
Large files can benefit from compression in some cases, especially text files which can usually be compressed a great deal. Compressing a file <b>only</b> for the purpose of transferring it, and then decompressing it at the end of the transfer will not necessarily save time. It depends on how much the file can be compressed, how long it takes to compress it, and the transfer bandwidth. The calculation is described under <i>Data Compression and transfer discussion</i> in [https://bluewaters.ncsa.illinois.edu/data-transfer-doc this document] from the US National Center for Supercomputing Applications.


=== Avoid duplication ===
<!--T:9-->
If you decide compression is worthwhile, you can again use [[Archiving and compressing files|tar]] for this, or [https://www.gnu.org/software/gzip/manual/gzip.html gzip].
 
=== Avoid duplication === <!--T:10-->
Try not to move the same data twice. If you are migrating from more than one existing system to one new system and you have data duplicated on the sources, choose one and only move the duplicate data from that one.  
Try not to move the same data twice. If you are migrating from more than one existing system to one new system and you have data duplicated on the sources, choose one and only move the duplicate data from that one.  


Beware of files with duplicate names, but which do not contain duplicate information. Ensure that you will not accidentally over-write one file with another of the same name.
<!--T:11-->
Beware of files with duplicate names, but which do not contain duplicate information. Ensure that you will not accidentally overwrite one file with another of the same name.
 
== What to do during the migration process == <!--T:12-->
If it is supported at your source site, use [[Globus|Globus]] to set up your file transfer. It is the most user-friendly and efficient tool we know for this task. Globus is designed to recover from network interruptions automatically. We recommend you enable the setting to <i>preserve source file modification times</i> in the <i>Transfer & Timer Options</i>.


== What to do during the migration process? ==
<!--T:13-->
If it is supported at your source site, use [[Globus|Globus Online]] to set up your file transfer. It is the most user-friendly and efficient tool we know of for this task. Globus is designed to recover from network interruptions automatically. We recommend you select the following options at the bottom of the "Transfer files" screen:
If Globus is not supported at your source site, then compressing data and avoiding duplication is even more important. If you are using [https://en.wikipedia.org/wiki/Secure_copy scp], [https://en.wikipedia.org/wiki/SSH_File_Transfer_Protocol sftp], or [https://en.wikipedia.org/wiki/Rsync rsync], then:
* preserve source file modification times
* Make a schedule to migrate your data in blocks of a few hundred GB at a time. If the transfer stops for some reason, you will be able to try again starting from the incomplete file, but you will not have to re-transfer files that are already complete. An organized list of files will help here.
* verify file integrity after transfer
* Check regularly to see that the transfer process has not stopped. File size is a good indicator of progress. If no files have changed size for several minutes, then something may have gone wrong. If restarting the transfer does not work, contact our [[technical support]].


If Globus is not supported at your source site, then the advice to compress data and avoid duplication is even more important. If you must use one of [[scp]], [[sftp]], or [[rsync]], then:
<!--T:14-->
* Make a schedule to migrate your data part by part. If the transfer stops for any reason you will be able to try again starting from the incomplete file, but you will not have to re-transfer files that are already complete. An organized list of files will help here.
Be patient. Even with Globus, transferring large volumes of data can be time consuming. Specific transfer speeds will vary, but expect hundreds of gigabytes to take hours and terabytes to take days.
* Check regularly to see that the transfer process has not stopped. File size is a good indicator of progress. If no files have changed size for several minutes, then something may have gone wrong. If restarting the transfer does not work, contact [mailto:support@computecanada.ca support@computecanada.ca].


Be patient. Even with Globus, transferring large volumes of data can be time consuming. Specific transfer speeds will vary a lot, but expect hundreds of gigabytes to take hours and terabytes to take days.
== What to do after migration == <!--T:15-->
If you did not use Globus, or if you did but did not check <i>verify file integrity</i>, make sure that the data you have transferred are not corrupted. A crude way to do this is to compare file sizes at the source with file sizes at the destination. For greater assurance, you can use [http://man7.org/linux/man-pages/man1/cksum.1.html cksum] or [http://man7.org/linux/man-pages/man1/md5sum.1.html md5sum] at each end, and see if the results match. Any files with mismatching sizes or checksums should be transferred again.


== What to do after migration? ==
== Where and how to get help == <!--T:16-->
If you did not use Globus, or if you did but did not check "verify file integrity", make sure that the data you have transferred are not corrupted. A crude way to do this is to compare file sizes at the source with file sizes at the destination. For greater confidence you can use [http://man7.org/linux/man-pages/man1/cksum.1.html cksum] or [http://man7.org/linux/man-pages/man1/md5sum.1.html md5sum] at each end, and see that the results match. Any files with mismatching sizes or checksums should be transferred again.
* To know how to use different archiving and compression utilities, use a Linux command like <code>man <command></code> or <code><command> --help</code>.  
* Contact our [[technical support]]


== Where and how to get HELP? ==
</translate>
* To know how to use different archiving and compression utilities, use the Linux command like <code>man <command></code> or <code><command> --help</code>.
* Email [mailto:support@computecanada.ca support@computecanada.ca]

Latest revision as of 16:40, 27 November 2023

This page explains issues related to transferring your data between our facilities and our regional partners.

If you are in any doubt about details of the following advice, contact our technical support for help.

What to do before migrating?

Make sure you know whether you are responsible for your own data migration, or whether our staff will be migrating your data. If you are in any doubt, contact our technical support.

If you haven't used Globus before, read about it now and verify that it works on the system you are migrating from. Test any other tools you will use (like tar, gzip, zip) on test data to ensure you know how they work before using them on important data.

Do not wait until the last minute to start your migration. Depending on how much data you have and how much load there is on the machines and network, you may be surprised at how long it will take to finish a large transfer. Expect hundreds of gigabytes to take hours to transfer, but give yourself days in case there is a problem. Expect terabytes to take days.

Clean up

It is a good practice to look at your files regularly and see what can be deleted, but unfortunately many of us do not have this habit. A major data migration is a good reminder to clean up your files and directories. Moving less data will take less time, and storage space even on new systems is in great demand and should not be wasted.

  • If you compile programs and keep source code, delete any intermediate files. One or more of make clean, make realclean, or rm *.o might be appropriate, depending on your makefile.
  • If you find any large files named like core.12345 and you don't know what they are, they are probably core dumps and can be deleted.
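
As a cautious first step before deleting anything, the sketch below only lists files whose names look like core dumps and whose size exceeds a threshold. The function name list_core_candidates and the default threshold of 1024 KiB are assumptions made for illustration, and the k size suffix is supported by GNU and BSD find but is not strictly POSIX.

```shell
# List files named "core" or "core.<digits>..." that are larger than a
# given size. Nothing is deleted; review the list before removing files.
list_core_candidates() {
    dir="${1:-.}"          # directory to search (defaults to the current one)
    min_kib="${2:-1024}"   # minimum size in KiB (assumed default: 1 MiB)
    find "$dir" -type f \( -name 'core' -o -name 'core.[0-9]*' \) \
        -size "+${min_kib}k" -print
}
```

For example, list_core_candidates ~/projects 10240 would print candidate files larger than 10 MiB under a hypothetical ~/projects directory.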

Archive and compress

Most file transfer programs move one file of a reasonable size more efficiently than thousands of small files of equal total size. If you have directories or directory trees containing many small files, use tar to combine (archive) them.

Large files can benefit from compression in some cases, especially text files, which can usually be compressed a great deal. Compressing a file only for the purpose of transferring it, and then decompressing it at the end of the transfer, will not necessarily save time. It depends on how much the file can be compressed, how long it takes to compress it, and the transfer bandwidth. The calculation is described under "Data Compression and transfer discussion" in this document (https://bluewaters.ncsa.illinois.edu/data-transfer-doc) from the US National Center for Supercomputing Applications.
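
The trade-off can be roughed out with a few measured numbers. The sketch below is a simplified estimate and not the calculation from the NCSA document: it assumes compression, transfer, and decompression run one after another, and the function name compression_pays_off and all five parameters are hypothetical values you would measure yourself on a sample of your data.

```shell
# Compare estimated wall time for "transfer as is" against
# "compress, transfer, decompress".
# Arguments: size in MB, bandwidth in MB/s, compression ratio
# (original size / compressed size), compression speed in MB/s,
# decompression speed in MB/s.
# Returns success (0) only if the compressed route is estimated to be faster.
compression_pays_off() {
    awk -v size="$1" -v bw="$2" -v ratio="$3" -v cspeed="$4" -v dspeed="$5" 'BEGIN {
        direct = size / bw
        packed = size / cspeed + (size / ratio) / bw + size / dspeed
        printf "direct=%.0fs compressed=%.0fs\n", direct, packed
        exit !(packed < direct)
    }'
}
```

With 100000 MB of data, a 50 MB/s link, a ratio of 4, and speeds of 200 and 400 MB/s, this prints direct=2000s compressed=1250s, so compression would be worthwhile under those assumed numbers.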

If you decide compression is worthwhile, you can again use tar for this, or gzip.
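
A minimal sketch of archiving and compressing in one step with tar is shown below. The function name archive_dir and the example paths are illustrative only. The archive is listed once after creation as a basic readability check.

```shell
# Pack a directory into a single gzip-compressed tar archive, then list
# the archive contents to confirm that it can be read back.
archive_dir() {
    src_dir="$1"   # directory to pack
    archive="$2"   # output file, for example results.tar.gz
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" &&
        tar -tzf "$archive" > /dev/null
}
```

For example, archive_dir /scratch/me/results /scratch/me/results.tar.gz (hypothetical paths) produces one file to transfer, which can be unpacked at the destination with tar -xzf results.tar.gz.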

Avoid duplication

Try not to move the same data twice. If you are migrating from more than one existing system to one new system and the same data exist on several of the sources, choose one source and move the duplicated data only from that one.

Beware of files with duplicate names, but which do not contain duplicate information. Ensure that you will not accidentally overwrite one file with another of the same name.
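
One way to spot such collisions before merging two directory trees is sketched below. It assumes both trees are readable from the same machine, and that file names contain no newline characters. The function name name_collisions is hypothetical. If the trees live on different systems, comparing checksum lists produced on each side would serve the same purpose.

```shell
# Print relative paths that exist in both trees but whose contents differ.
# These are the files that would be overwritten with different data if
# one tree were copied over the other.
name_collisions() {
    a="$1"   # first directory tree
    b="$2"   # second directory tree
    ( cd "$a" && find . -type f ) | while IFS= read -r rel; do
        if [ -f "$b/$rel" ] && ! cmp -s "$a/$rel" "$b/$rel"; then
            printf '%s\n' "${rel#./}"
        fi
    done
}
```

An empty result means that every file name shared by the two trees refers to identical content.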

What to do during the migration process

If it is supported at your source site, use Globus to set up your file transfer. It is the most user-friendly and efficient tool we know for this task. Globus is designed to recover from network interruptions automatically. We recommend you enable the "preserve source file modification times" setting in the "Transfer & Timer Options".

If Globus is not supported at your source site, then compressing data and avoiding duplication is even more important. If you are using scp, sftp, or rsync, then:

  • Make a schedule to migrate your data in blocks of a few hundred GB at a time. If the transfer stops for some reason, you will be able to try again starting from the incomplete file, but you will not have to re-transfer files that are already complete. An organized list of files will help here.
  • Check regularly to see that the transfer process has not stopped. File size is a good indicator of progress. If no files have changed size for several minutes, then something may have gone wrong. If restarting the transfer does not work, contact our technical support.
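
The schedule described above can be prepared as a set of file lists, one per block. The sketch below is one possible approach, with a hypothetical function name split_into_batches and output prefix. It reads lines of the form "size in KiB, tab, path" (the format printed by du -k) and assumes paths contain no tab or newline characters.

```shell
# Read "size_in_KiB<TAB>path" lines on stdin and write the paths to
# <prefix>001.list, <prefix>002.list, ... A new list is started whenever
# adding the next file would push the current list over the size limit.
# Example input (hypothetical path), with a 100 GiB limit:
#   find /path/to/data -type f -exec du -k {} + | split_into_batches 104857600 ./batch_
split_into_batches() {
    limit_kib="$1"   # maximum total size per batch, in KiB
    prefix="$2"      # output file prefix, for example ./batch_
    awk -F '\t' -v limit="$limit_kib" -v prefix="$prefix" '
        BEGIN { n = 1; total = 0 }
        {
            if (total > 0 && total + $1 > limit) { close(out); n++; total = 0 }
            out = sprintf("%s%03d.list", prefix, n)
            print $2 > out
            total += $1
        }'
}
```

Each list can then be transferred and ticked off separately, for instance by passing it to rsync with its --files-from option. A single file larger than the limit ends up alone in its own list.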

Be patient. Even with Globus, transferring large volumes of data can be time-consuming. Specific transfer speeds will vary, but expect hundreds of gigabytes to take hours and terabytes to take days.

What to do after migration

If you did not use Globus, or if you did but did not select the "verify file integrity" option, make sure that the data you have transferred are not corrupted. A crude way to do this is to compare file sizes at the source with file sizes at the destination. For greater assurance, you can use cksum or md5sum at each end, and see if the results match. Any files with mismatching sizes or checksums should be transferred again.
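
The checksum comparison can be sketched with md5sum as follows. The function names make_manifest and verify_manifest are hypothetical, and the sketch assumes the directory layout is the same at both ends and that the manifest is given as an absolute path stored outside the directory being checked.

```shell
# At the source: print an MD5 checksum for every file under a directory,
# with paths relative to that directory.
make_manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + )
}

# At the destination: re-compute the checksums listed in a manifest.
# Returns non-zero if any file is missing or does not match.
verify_manifest() {
    ( cd "$1" && md5sum -c "$2" )
}
```

For example, run make_manifest /old/data > /tmp/data.md5 at the source, copy data.md5 to the destination together with the data, and run verify_manifest /new/data /tmp/data.md5 there (all paths hypothetical). Files reported as FAILED should be transferred again.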

Where and how to get help

  • To learn how to use the different archiving and compression utilities, run a Linux command like man <command> or <command> --help.
  • Contact our technical support.