Transferring data: Difference between revisions

Line 44: Line 44:
Rsync is a popular tool for ensuring that two separate datasets are the same, but it can be quite slow if your dataset has a lot of files or there is a lot of latency between the two sites, e.g. they are geographically far apart or on different networks. By default, rsync checks the modification time and size of each file and skips files that already match. If for some reason your modification times do not match on the two systems, you can also use the "-c" option, which compares a checksum of the source and destination files instead. Generating checksums for every file can slow the run down considerably.


===Using sha1sum checksums locally to check if files match===
===Using checksums to check if files match===
If the two systems have high latency, Globus Transfer is unavailable and Rsync is taking too long then you can use checksums from both systems to determine if the files match. You can use the command:
If Globus is unavailable between the two systems being synchronized and Rsync is taking too long, then you can use [https://en.wikipedia.org/wiki/Checksum checksums] on both systems to determine if the files match.


    find /home/username/ -type f -print0  | xargs -0 sha1sum | tee sha1-checksum-result.log
{{Command
|find /home/username/ -type f -print0  | xargs -0 sha1sum | tee checksum-result.log
}}


This command will create a new file called sha1-checksum-result.log in the current directory that will contain all of the checksums for the files in /home/username/. It will also print out all of the checksums to the screen as well as it goes. If you have a lot of files or very large files you may want to run this command in the background, in a screen or tmux session etc. Anything that allow it to continue if your ssh connection times out.
This command will create a new file called checksum-result.log in the current directory that will contain all of the checksums for the files in /home/username/. It will also print out all of the checksums to the screen as it goes. If you have a lot of files or very large files you may want to run this command in the background, in a [https://en.wikipedia.org/wiki/GNU_Screen screen] or [https://en.wikipedia.org/wiki/Tmux tmux] session; anything that allows it to continue if your [[SSH]] connection times out.
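The command above records absolute paths, so logs from two systems only line up if the files live under the same path on both. A minimal sketch of one way around that, assuming GNU <code>find</code>, <code>sort</code>, <code>xargs</code> and <code>sha1sum</code>; the function name <code>make_checksum_log</code> is hypothetical and not part of the original page:

```shell
# Hypothetical helper (not from the original page): record SHA-1 checksums
# for every regular file under directory $1 into the log file $2.
# Assumes GNU find, sort, xargs and sha1sum are available.
make_checksum_log() {
    # Run inside the tree so the recorded paths are relative ("./sub/file").
    # The redirection is applied outside the subshell, so a relative log
    # path is resolved against the caller's working directory.
    # LC_ALL=C makes the path ordering identical on both systems.
    ( cd "$1" && find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r sha1sum ) > "$2"
}

# Example call (paths are placeholders):
#   make_checksum_log /home/username checksum-result.log
```

Keep the log file outside the directory being scanned; otherwise the log itself is picked up by <code>find</code> and will never match between the two systems.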


After you run it on both systems you can use the diff utility to find files that don't match.
After you run it on both systems you can use the <code>diff</code> utility to find files that don't match.


    diff sha1-checksum-result-silo.log sha1-checksum-dtn.log
    69c69
    < 017f14f6a1a194a5f791836d93d14beead0a5115  /home/username/file-0025048576-0000008
    ---
    > 8836913c2cc2272c017d0455f70cf0d698daadb3  /home/username/file-0025048576-0000008
{{Command
|diff checksum-result-silo.log checksum-dtn.log
|result=69c69
< 017f14f6a1a194a5f791836d93d14beead0a5115  /home/username/file-0025048576-0000008
---
> 8836913c2cc2272c017d0455f70cf0d698daadb3  /home/username/file-0025048576-0000008
}}
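When many files differ, the raw <code>diff</code> output can be long. The following sketch reduces two checksum logs to just the paths whose entries do not match. It assumes <code>sha1sum</code> output (a 40-character hash, two separator characters, then the path), that each path appears once per log, and that file names contain no newlines or backslashes; <code>list_mismatches</code> is a hypothetical name:

```shell
# Hypothetical helper: print the paths that differ between two sha1sum logs.
# $1 and $2 are the log files produced on the two systems.
list_mismatches() {
    # Merge both logs; a line present in both appears twice, so
    # "uniq -u" keeps only lines that occur in exactly one log.
    # The path starts at column 43 (40 hex digits + 2 separator characters).
    LC_ALL=C sort "$1" "$2" | uniq -u | cut -c43- | LC_ALL=C sort -u
}

# Example call (file names are placeholders):
#   list_mismatches checksum-result-silo.log checksum-dtn.log
```

A path is listed if its checksum differs or if the file exists on only one of the two systems. Because the logs are merged and sorted inside the function, the order in which <code>find</code> visited the files does not matter here.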


It is possible that the find command will crawl through the directories in a different order resulting in a lot of false differences so you may need to run sort on both files before running diff such as:
The <code>find</code> command may crawl through the directories in a different order on each system, producing many false differences, so you may need to run <code>sort</code> on both files before running <code>diff</code>, for example:


sort sha1-checksum-result-silo.log -o sha1-checksum-result-silo.log
sort sha1-checksum-dtn.log -o sha1-checksum-dtn.log
{{Commands
|sort -k2 checksum-result-silo.log -o checksum-result-silo.log
|sort -k2 checksum-dtn.log -o checksum-dtn.log
}}
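If you would rather keep the original logs untouched, the sort and compare steps can be combined. This is a rough sketch under the same assumptions as the commands above; <code>compare_sorted</code> is a hypothetical name and the temporary files are created with <code>mktemp</code>:

```shell
# Hypothetical helper: sort copies of two checksum logs by path (field 2)
# and diff the copies, leaving the original logs unmodified.
# Returns diff's exit status: 0 if the logs match, 1 if they differ.
compare_sorted() {
    sorted_a=$(mktemp)
    sorted_b=$(mktemp)
    # LC_ALL=C gives the same byte-wise ordering regardless of locale.
    LC_ALL=C sort -k2 "$1" > "$sorted_a"
    LC_ALL=C sort -k2 "$2" > "$sorted_b"
    diff "$sorted_a" "$sorted_b"
    status=$?
    rm -f "$sorted_a" "$sorted_b"
    return "$status"
}

# Example call (file names are placeholders):
#   compare_sorted checksum-result-silo.log checksum-dtn.log && echo "all files match"
```

Using the exit status makes the check easy to script: no output and a status of 0 means every checksum matched.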


[[Category:Connecting]]