{{Draft}}
<languages />


<translate>
<!--T:1-->
''Parent page: [[Storage and file management]]''

<!--T:2-->
The [http://dar.linux.free.fr <code>dar</code>] (stands for Disk ARchive) utility was written from the ground up as a modern
replacement for the classical Unix <code>tar</code> tool. First released in 2002, <code>dar</code> is open
source, actively maintained, and can be compiled on any Unix-like system.

<!--T:3-->
Similar to <code>tar</code>,
<code>dar</code> supports full / differential / incremental backups. Unlike <code>tar</code>, each
<code>dar</code> archive includes a file index for fast file access and restore -- this is especially useful for large
archives! <code>dar</code> has built-in compression on a file-by-file basis, making it more resilient
against data corruption, and you can optionally tell it not to compress already highly compressed files
such as mp4 and gz. <code>dar</code> supports strong encryption, can split archives at 1-byte resolution, supports extended
file attributes, sparse files, hard and symbolic (soft) links, can detect data corruption in both headers
and saved data and recover with minimal
data loss, and has many other desirable features. On the [http://dar.linux.free.fr <code>dar</code> page] you can find a [http://dar.linux.free.fr/doc/FAQ.html#tar detailed feature-by-feature <code>tar</code>-to-<code>dar</code> comparison].
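
For example, compression can be combined with per-file exclusion masks directly on the command line. The sketch below (the <code>-c</code>/<code>-g</code> syntax is explained in the sections that follow) requests bzip2 compression while leaving already-compressed <code>.mp4</code> and <code>.gz</code> files uncompressed; the <code>-Z</code> exclusion masks are an assumption based on current <code>dar</code> releases, so check <code>man dar</code> on your system:

<source lang="console">
[user_name@localhost]$ dar -w -c all -g test -zbzip2 -Z '*.mp4' -Z '*.gz'
</source>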


== Where to find <code>dar</code> == <!--T:4-->

<!--T:5-->
On our clusters, <code>dar</code> is available on <code>/cvmfs</code>.
With [[Standard software environments|StdEnv/2020]]:
</translate>
<source lang="console">
[user_name@localhost]$ which dar
/cvmfs/soft.computecanada.ca/gentoo/2020/usr/bin/dar
[user_name@localhost]$ dar --version
dar version 2.5.11, Copyright (C) 2002-2052 Denis Corbin
...
</source>
<translate>
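
If your session uses a different standard environment, loading the corresponding module first should make this version available; a minimal sketch, assuming the Lmod module system used on our clusters:

<source lang="console">
[user_name@localhost]$ module load StdEnv/2020
[user_name@localhost]$ which dar
</source>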


If you want a newer version, you can compile it from source (replace 2.6.3 with the latest version number):

<source lang="console">
[user_name@localhost]$ wget https://sourceforge.net/projects/dar/files/dar/2.6.3/dar-2.6.3.tar.gz
[user_name@localhost]$ tar xvfz dar-*.gz && /bin/rm -f dar-*.gz
[user_name@localhost]$ cd dar-*
[user_name@localhost]$ ./configure --prefix=$HOME/dar --disable-shared
[user_name@localhost]$ make
[user_name@localhost]$ make install-strip
[user_name@localhost]$ $HOME/dar/bin/dar --version
</source>
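
To use the newly compiled binary by default in your shell sessions, one simple option (a sketch, assuming a bash-like shell) is to prepend its directory to your <code>PATH</code>:

<source lang="console">
[user_name@localhost]$ export PATH=$HOME/dar/bin:$PATH
</source>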
 
== Using <code>dar</code> manually == <!--T:7-->

=== Basic archiving and extracting === <!--T:8-->


<!--T:9-->
Let's say you have a subdirectory <code>test</code> in the current directory. To pack it into an archive, type:
 
</translate>
<source lang="console">
[user_name@localhost]$ dar -w -c all -g test
</source>
 
<translate>
<!--T:10-->
This will create an archive file <code>all.1.dar</code>, where <code>all</code> is the base name and
<code>1</code> is the slice number. You can break a single archive into multiple slices (see below). You can
include multiple directories and files in an archive, e.g.
 
</translate>
<source lang="console">
[user_name@localhost]$ dar -w -c all -g testDir1 -g testDir2 -g file1 -g file2
</source>
 
<translate>
<!--T:11-->
Please note that all paths should be relative to the current directory.
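
If the data you want to archive is not under the current directory, one option is to set the root of the operation with the <code>-R</code> flag (the same flag used for restores below); a minimal sketch, assuming your data lives in a hypothetical <code>/home/username/projects</code>:

<source lang="console">
[user_name@localhost]$ dar -w -c all -R /home/username/projects -g test
</source>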


<!--T:12-->
To list the archive's contents, use only the base name:


<!--T:13-->
<source lang="console">
[user_name@localhost]$ dar -l all
</source>


<!--T:14-->
To extract a single file into a subdirectory <code>restore</code>, use the base name and the file path:


<!--T:15-->
<source lang="console">
[user_name@localhost]$ dar -R restore/ -O -w -x all -v -g test/filename
</source>


<!--T:16-->
The flag <code>-O</code> tells <code>dar</code> to ignore file ownership. Wrong ownership would be a
problem if you are restoring someone else's files and you are not root. However, even if you are
restoring your own files, <code>dar</code> will throw a message that you are doing this as non-root and
will ask you to confirm. To disable this warning, use <code>-O</code>. The flag <code>-w</code> will
disable a warning if <code>restore/test</code> already exists.


<!--T:17-->
To extract an entire directory, type:


<!--T:18-->
<source lang="console">
[user_name@localhost]$ dar -R restore/ -O -w -x all -v -g test
</source>


<!--T:19-->
Similar to creating an archive, you can pass multiple directories and files by using multiple
<code>-g</code> flags. Note that <code>dar</code> does not accept Unix wild masks after <code>-g</code>.
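
If you do need pattern-based selection, <code>dar</code> has separate filename masks: <code>-I</code> to include and <code>-X</code> to exclude matching file names. A hedged example (mask behaviour can vary between versions, so check <code>man dar</code>) that archives only the <code>.txt</code> files under <code>test</code>:

<source lang="console">
[user_name@localhost]$ dar -w -c all -g test -I '*.txt'
</source>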


==== A note about the Lustre filesystem ==== <!--T:86-->
 
<!--T:87-->
If the archived files come from a [https://www.lustre.org/ Lustre filesystem]
(typically in <code>/home</code>, <code>/project</code> or <code>/scratch</code>
on [[National_systems|our ''general-purpose'' compute clusters]]),
some <i>extended attributes</i> are saved automatically.
To see which extended attributes are assigned to each archived file, use the <code>-alist-ea</code> flag:
</translate>


{{Command2
|dar -l all -alist-ea
}}
<translate>
<!--T:88-->
We can see strings like: <code>Extended Attribute: [lustre.lov]</code>.
With this attribute, any file extraction to a location formatted in Lustre will still work as usual.
But if one tries to extract files to the [[Using_node-local_storage|node local storage]]
(also known as <code>$SLURM_TMPDIR</code>),
the extraction will show error messages like:
<code>Error while adding EA lustre.lov : Operation not supported</code>.

<!--T:89-->
To avoid these error messages, the <code>-u</code> flag can be used to exclude a specific type of attribute,
while the "affected" files are still extracted. For example:
</translate>
{{Command2
|dar -R restore/ -O -w -x all -v -g test -u 'lustre*'
}}
<translate>
<!--T:90-->
Another solution is to get rid of the <code>lustre.lov</code> attribute
while creating the archive with the same <code>-u</code> flag:
</translate>
{{Command2
|dar -w -c all -g test -u 'lustre*'
}}
<translate>
<!--T:91-->
In conclusion, excluding the <code>lustre.lov</code> attribute is necessary only if you intend to extract files to a location that is not on a Lustre filesystem.

=== Incremental backups === <!--T:20-->

<!--T:21-->
You can create differential and incremental backups with <code>dar</code> by passing the base name of
the reference archive with <code>-A</code>. For example, let's say on Monday you create a full backup
named <code>monday</code>:


<!--T:22-->
<source lang="console">
[user_name@localhost]$ dar -w -c monday -g test
</source>


<!--T:23-->
On Tuesday you modify some of the files and then include only these files into a new, incremental backup
named <code>tuesday</code>, using the <code>monday</code> archive as a reference:


<!--T:24-->
<source lang="console">
[user_name@localhost]$ dar -w -A monday -c tuesday -g test
</source>


<!--T:25-->
On Wednesday you modify more files, and at the end of the day you create a new backup named
<code>wednesday</code>, now using the <code>tuesday</code> archive as a reference:


<!--T:26-->
<source lang="console">
[user_name@localhost]$ dar -w -A tuesday -c wednesday -g test
</source>


<!--T:27-->
Now you have three files:


<!--T:28-->
<source lang="console">
[user_name@localhost]$ ls *.dar
monday.1.dar     tuesday.1.dar    wednesday.1.dar
</source>


<!--T:29-->
The file <code>wednesday.1.dar</code> contains only the files that you modified on Wednesday, but not the
files from Monday or Tuesday. Therefore, the command


<!--T:30-->
<source lang="console">
[user_name@localhost]$ dar -R restore -O -x wednesday
</source>


<!--T:31-->
will only restore files that were modified on Wednesday. To restore everything, you have to go through
all backups in chronological order:


<!--T:32-->
<source lang="console">
[user_name@localhost]$ dar -R restore -O -w -x monday      # restore the full backup
[user_name@localhost]$ dar -R restore -O -w -x tuesday     # restore the first incremental backup
[user_name@localhost]$ dar -R restore -O -w -x wednesday   # restore the second incremental backup
</source>
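
Before deleting the original files, it is worth checking that every archive in the chain is readable. <code>dar</code> offers a test operation for this; a minimal sketch:

<source lang="console">
[user_name@localhost]$ dar -t monday
[user_name@localhost]$ dar -t tuesday
[user_name@localhost]$ dar -t wednesday
</source>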


=== Limiting the size of each slice === <!--T:33-->


<!--T:34-->
To limit the maximum size of each slice, use the flag <code>-s</code> followed by a number and one of k/M/G/T. For example, for a 1340 MB archive, the command


<!--T:35-->
<source lang="console">
[user_name@localhost]$ dar -s 100M -w -c monday -g test
</source>


<!--T:36-->
will create 14 slices named <code>monday.{1..14}.dar</code>. To extract from all of these, use their base name:


<!--T:37-->
<source lang="console">
[user_name@localhost]$ dar -O -x monday
</source>


== External scripts == <!--T:84-->

<!--T:85-->
One of our team members has written bash functions that can facilitate the use of <code>dar</code>; see [https://github.com/razoumov/sharedSnippets here] for details. You can use these functions as inspiration to write your own scripts; several of them are reproduced below. Please note that these functions assume that you are below your quota (so you can write files!) and that you have read and write permissions, i.e. all the common-sense assumptions. It is your job to ensure that this is the case, and that <code>dar</code> archived/restored your files correctly before you delete the originals. In other words, please test everything before including these functions in your workflow.

=== Limiting the number of files in each slice with <code>multidar</code> ===
 
Using copy and paste, define the following function in your shell, or put this definition into your
<code>$HOME/.bashrc</code> file and then source it with <code>source ~/.bashrc</code>:
 
<source lang="console">
function multidar() {
    if ! [ $# = 2 ]; then
        echo Usage: multidar sourceDirectory maxNumberOfFilesPerArchive
    else
        sourceDirectory=${1%/}          # strip a trailing slash, if any
        maxNumberOfFilesPerArchive=$2
        if which dar 2>/dev/null; then
            echo great, I found dar at $(which dar)
            find $sourceDirectory -type f > .fullList        # list all files to be archived
            sed -i -e '/DS_Store/d' .fullList                # drop macOS junk files from the list
            sed -i -e 's/\/\//\//' .fullList                 # collapse any double slashes in paths
            split -a 3 -l $maxNumberOfFilesPerArchive .fullList .partial   # split the list into chunks
            for i in .partial*; do
                echo archiving from $i to $sourceDirectory-${i:8:3}
                dar -w -c $sourceDirectory-${i:8:3} --include-from-file $i
                /bin/rm -rf $i
            done
            /bin/rm -rf .fullList*
            ls -lh $sourceDirectory*.dar
        else
            echo please install dar
        fi
    fi
}
</source>
 
Now, running the command without arguments will show you the syntax:
 
<source lang="console">
[user_name@localhost]$ multidar
Usage: multidar sourceDirectory maxNumberOfFilesPerArchive
</source>
 
Let's assume that we have 1000 files inside <code>test</code>. Running the command
 
<source lang="console">
[user_name@localhost]$ multidar test 300
</source>
 
will produce four archives, each with its own basename and no more than 300 files inside. To restore from
these archives, use a bash loop:
 
<source lang="console">
[user_name@localhost]$ for f in test-aa{a..d}
                      do
                        dar -R restore/ -O -w -x $f
                      done
</source>
 
=== Backup ===
 
Let's define the following function:
 
<source lang="console">
function backup() {
    BREF='/home/username/tmp'            # absolute path of the parent directory to archive from
    BSRC='-g test'                       # subdirectories/files relative to BREF; cannot use an absolute path
    BDEST=/home/username/tmp/backups     # backup destination
    BTAG=all                             # root of the backup basename
    FLAGS=(-s 5G -zbzip2 -asecu -w -X "*~" -X "*.o")  # bash array with some flags
    #FLAGS+=(-K aes:)  # add encryption
    if [ $# == 0 ]; then
        echo missing argument ... needs to be one of: show 0 1 2 3 .. 98 99
    elif [ $1 == 'show' ]; then
        ls -lhtr $BDEST/"$BTAG"*
    elif [ $1 == '0' ]; then
        echo backing up $BSRC to $BDEST
        dar "${FLAGS[@]}" -c $BDEST/"$BTAG"0 -R $BREF $BSRC       # full backup
        /bin/rm -rf $BDEST/"$BTAG"{1..100}.*.dar                  # remove all higher-numbered backups
        ls -lhtr $BDEST/"$BTAG"*
    else
        level=$1
        if [ -n "$level" ] && [ "$level" -eq "$level" ] 2>/dev/null; then  # check if it is a number
            echo backing up $BSRC to $BDEST
            dar "${FLAGS[@]}" -A $BDEST/"$BTAG"$((level-1)) -c $BDEST/"$BTAG"$level -R $BREF $BSRC   # incremental backup
            for i in $(seq $((level+1)) 100); do
                /bin/rm -rf $BDEST/"$BTAG"$i.*.dar                # remove all higher-numbered backups
            done
            ls -lhtr $BDEST/"$BTAG"*
        else
            echo $level is not a number ...; return 1
        fi
    fi
}
</source>
 
You need to define the four variables at the top:
 
* <code>BREF</code> stores the absolute path of the parent directory (containing all subdirectories and files to archive)
* <code>BSRC</code> stores a relative (to <code>BREF</code>) list of subdirectories and files to archive; <code>BSRC</code> cannot be an absolute path
* <code>BDEST</code> is the backup destination
* <code>BTAG</code> will form the root of the backup basename
 
To create the full backup <code>all0.*.dar</code>, type
 
<source lang="console">
[user_name@localhost]$ backup 0
</source>
 
To create the first incremental backup <code>all1.*.dar</code>, type
 
<source lang="console">
[user_name@localhost]$ backup 1
</source>
 
To create the second incremental backup <code>all2.*.dar</code>, type
 
<source lang="console">
[user_name@localhost]$ backup 2
</source>
 
and so on. To see all backups, type
 
<source lang="console">
[user_name@localhost]$ backup show
</source>
 
If your backup exceeds 5 GB, more than one slice will be created.
 
If you have too many incremental backups, you can always create a lower-numbered backup, e.g.
 
<source lang="console">
[user_name@localhost]$ backup 1
</source>
 
will overwrite the first incremental backup and will remove all higher-numbered backups.
 
=== Restore from backup ===
 
Let's define the following function:
 
<source lang="console">
function restore() {
    BSRC=/home/username/tmp/backups      # where the backups are stored
    BTAG=all                             # root of the backup basename
    BDEST=/home/username/tmp/restore     # where to put the restored files
    if [ $# == 0 ]; then
        echo Examples:
        echo '  'restore -l anyPattern
        echo '  'restore -x Pictures/1995
        echo '  'restore -x Documents/notes
        echo '  'restore -x Documents/notes/quantum.txt
        echo '  'restore -n 0 Documents/misc/someFile.txt
        echo 'Notes: (1)' restore -x/-n does not understand Unix wildmasks, so you need to specify the full directory or file name
        echo '      (2)' always specify one name per command
        echo '      (3)' restore will put the restored files into \$BDEST
    elif [ $1 == '-l' ]; then
        echo Listing all versions
        for file in $BSRC/"$BTAG"{0..99}; do
            if [ -f $file.1.dar ]; then
                echo --- in $file:
                dar -l $file | grep $2
            fi
        done
    elif [ $1 == '-x' ]; then
        echo Restoring from the earliest version:
        echo '  'important to go through all previous backups if restoring a directory or a sparsebundle
        echo '  'or if the most recent version of the file is stored in an earlier backup
        for file in $BSRC/"$BTAG"{0..99}; do
            if [ -f $file.1.dar ]; then
                echo --- from $file:
                dar -R $BDEST -O -w -x $file -v -g $2
            fi
        done
    elif [ $1 == '-n' ]; then
        echo Be careful with restoring from a single layer: it might not work as naively expected
        echo Restoring from version $2
        dar -R $BDEST -O -w -x $BSRC/"$BTAG"$2 -v -g $3
    else
        echo unrecognized option ...
    fi
}
</source>


=== Symmetric encryption ===
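
<code>dar</code> can encrypt archives with a symmetric cipher via the <code>-K</code> flag, already shown (commented out) in the <code>backup</code> function above. A minimal sketch: with an empty password after <code>aes:</code>, <code>dar</code> should prompt for a passphrase when creating the archive and again when extracting from it; confirm the exact behaviour with <code>man dar</code> for your version:

<source lang="console">
[user_name@localhost]$ dar -w -c all -g test -K aes:
[user_name@localhost]$ dar -R restore/ -O -w -x all -K aes:
</source>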
</translate>
