CVMFS
This page describes CERN Virtual Machine File System (CVMFS). Compute Canada uses CVMFS to distribute software, data and other content. Refer to accessing CVMFS for instructions on configuring a CVMFS client to access this content, and the official documentation and webpage for further information.
Introduction
CVMFS is a distributed read-only software distribution system, implemented as a POSIX filesystem in user space (FUSE) using HTTP transport. It was originally developed for the LHC (Large Hadron Collider) experiments at CERN to deliver software to virtual machines and to replace diverse shared software installation areas and package management systems at numerous computing sites. Designed to deliver software in a fast, scalable and reliable fashion, it has grown rapidly in recent years to serve dozens of projects, ~10^10 files and directories, ~10^2 compute sites, and ~10^5 clients around the world. The CernVM Monitor shows many research groups which use CVMFS and the stratum sites which replicate their repositories.
Features
- Only one copy of the software needs to be maintained; it can be propagated to and used at multiple sites. Commonly used software can be installed on CVMFS to reduce remote software administration.
- Software applications and their prerequisites can be run from CVMFS, eliminating any requirement on the Linux distribution type or release level of a client node.
- The project software stack and OS can be decoupled. For the cloud use case in particular, this allows software to be accessed in a VM without being embedded in the VM image, enabling VM images and software to be updated and distributed separately.
- Content versioning is provided via repository catalog revisions. Updates are committed in transactions and can be rolled back to a previous state.
- Updates are propagated to clients automatically and atomically.
- Clients can view historical versions of repository content.
- Files are fetched using the standard HTTP protocol. Client nodes do not require any extra ports to be opened or firewall exceptions to be added.
- Fault tolerance and reliability are achieved by using multiple redundant proxy and stratum servers. Clients transparently fail over to the next available proxy or server (see the failover sketch after this list).
- Hierarchical caching makes the CVMFS model highly scalable and robust and minimizes network traffic. There can be several levels in the content delivery and caching hierarchy:
  - The stratum 0 holds the master copy of the repository.
  - Multiple stratum 1 servers replicate the repository contents from the stratum 0.
  - HTTP proxy servers cache network requests from clients to stratum 1 servers.
  - The CVMFS client downloads files on demand into the local client cache(s).
    - Two tiers of local cache can be used, e.g. a fast SSD cache and a large HDD cache. A cluster filesystem can also be used as a shared cache for all nodes in a cluster (see the cache-lookup sketch after this list).
- CVMFS clients have read-only access to the filesystem.
- Because CVMFS uses Merkle trees and content-addressable storage, and encodes metadata in catalogs, all metadata is treated as data, and practically all data is immutable and highly amenable to caching (see the content-addressable storage sketch after this list).
- Metadata storage and operations scale by using nested catalogs, allowing metadata queries to be resolved locally by the client (see the nested-catalog sketch after this list).
- File integrity and authenticity are verified using signed cryptographic hashes, so that data corruption or tampering is detected.
- Automatic de-duplication and compression minimize storage usage on the server side. File chunking and on-demand access minimize storage usage on the client side (see the chunking sketch after this list).
- Versatile configurations can be deployed by writing authorization helpers or cache plugins to interact with external authorization or storage providers.
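The transparent failover mentioned above can be pictured as trying a list of sources in order. This is a minimal sketch under assumed, hypothetical URLs; a real CVMFS client configures HTTP proxies and stratum 1 servers separately rather than as one flat list, and its retry logic is more involved.

```python
import urllib.request

# Hypothetical sources for illustration only; these are not real endpoints.
SOURCES = [
    "http://cvmfs-proxy.example.org:3128/cvmfs/repo.example.org",
    "http://stratum1-a.example.org/cvmfs/repo.example.org",
    "http://stratum1-b.example.org/cvmfs/repo.example.org",
]

def fetch_object(relative_path):
    """Try each source in turn over plain HTTP, failing over on errors."""
    last_error = None
    for base in SOURCES:
        try:
            with urllib.request.urlopen(f"{base}/{relative_path}", timeout=10) as response:
                return response.read()
        except OSError as error:   # urllib.error.URLError is a subclass of OSError
            last_error = error     # this source is unavailable; try the next one
    raise RuntimeError(f"all sources failed: {last_error}")
```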
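The two-tier local cache amounts to a simple lookup order: check the fast tier, then the large tier, and download only on a miss. The sketch below is illustrative, not the client's actual cache manager; the digest-named cache files and the fetch callback are assumptions.

```python
from pathlib import Path

def cached_read(digest, fast_cache: Path, large_cache: Path, fetch):
    """Return an object by content hash, preferring the fast cache tier,
    then the large tier, and calling fetch() (e.g. an HTTP download)
    only as a last resort."""
    for tier in (fast_cache, large_cache):
        candidate = tier / digest
        if candidate.exists():
            return candidate.read_bytes()
    data = fetch(digest)                      # cache miss: go to the network
    (fast_cache / digest).write_bytes(data)   # populate the fastest tier
    return data
```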
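Content-addressable storage and hash verification can be illustrated with a toy object store in which each object is named by the hash of its content: identical files collapse to one object, objects never change once written, and corruption is detected when the hash is rechecked. The dictionary store, SHA-1-over-raw-content hashing and zlib compression here are simplifications; the actual CVMFS object layout, signing and compression details differ, and catalog signatures are not shown.

```python
import hashlib
import zlib

def store_object(store, content):
    """Store a file under the hash of its content; identical files map to
    the same key, so they are de-duplicated automatically."""
    digest = hashlib.sha1(content).hexdigest()
    store[digest] = zlib.compress(content)   # objects are stored compressed
    return digest                            # a catalog would record this hash

def fetch_and_verify(store, digest):
    """Retrieve an object and check it against the hash it was requested by."""
    content = zlib.decompress(store[digest])
    if hashlib.sha1(content).hexdigest() != digest:
        raise ValueError("object corrupted or tampered with")
    return content

# Identical content is stored only once, and later reads are verified.
objects = {}
first = store_object(objects, b"echo hello\n")
second = store_object(objects, b"echo hello\n")
assert first == second and len(objects) == 1
print(fetch_and_verify(objects, first))
```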
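Nested catalogs can be thought of as a tree of lookup tables, each covering one subtree of the repository, so a client only needs the few catalogs along a path to answer a metadata query locally. The layout below is purely hypothetical and far simpler than real CVMFS catalogs, which are SQLite databases referenced by content hash.

```python
# Hypothetical catalogs keyed by the subtree they cover. Each catalog lists
# its own entries plus the nested catalogs mounted beneath it.
CATALOGS = {
    "/": {
        "entries": {"/software": "dir"},
        "nested": ["/software/app-1.0"],
    },
    "/software/app-1.0": {
        "entries": {"/software/app-1.0/bin/tool": "file"},
        "nested": [],
    },
}

def responsible_catalog(path):
    """Descend from the root catalog to the most specific nested catalog
    covering the path; only the catalogs visited need to be cached locally."""
    current = "/"
    while True:
        deeper = [p for p in CATALOGS[current]["nested"] if path.startswith(p)]
        if not deeper:
            return current
        current = max(deeper, key=len)

print(responsible_catalog("/software/app-1.0/bin/tool"))  # -> /software/app-1.0
```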
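File chunking can be sketched as splitting a large file into fixed-size pieces that are stored by hash, so repeated pieces are stored once and a client downloads only the pieces an application actually reads. The 4 MiB fixed chunk size and the dictionary store are illustrative assumptions, not CVMFS defaults.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative fixed chunk size

def chunk_and_store(data, store):
    """Split a file into chunks stored by hash and return the chunk list
    (the role a catalog entry would play for a chunked file)."""
    chunk_hashes = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha1(chunk).hexdigest()
        store.setdefault(digest, chunk)   # identical chunks are stored once
        chunk_hashes.append(digest)
    return chunk_hashes
```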