Galaxy
Introduction
Galaxy is an open source, web-based platform for data-intensive biomedical research. It makes computational biology more accessible by not requiring extensive experience in programming or system administration. Originally designed for genomics research, Galaxy now adapts to most domains and serves as a bioinformatics workflow management system.
For an overview of applications, see this list of tutorials.
Available on Cedar only
Each research group can obtain a Galaxy instance on the Cedar cluster. Because the installation requires special configuration, contact our technical support team.
Directory structure
Galaxy is usually installed in the research group's /project directory. The name of the source directory is formed from the first two characters of the principal investigator's (PI) username, followed by glxy. For example, for the PI davidc, the source directory is named daglxy and is located in /project/group name/, where group name is the default group name for that PI (def-davidc). The main Galaxy directory contains a set of subdirectories that differs somewhat from the original Galaxy package:
- config: contains all the configuration files used to set up and optimize the Galaxy server. This page covers only the basic principles for our high-performance computing environment.
- galaxy: contains the base package, written mostly in Python.
- logs: contains the file galaxy.log, which records messages generated at run time, and the file server.log, which records messages generated when the server starts and stops.
- plugins: contains the extensions; in the original Galaxy package, this directory is located inside the galaxy directory.
- tmp: contains temporary files used when compiling and installing tools (Galaxy ToolShed).
- venv: the Python virtual environment directory, which contains the dependencies for the Python packages.
- tool-data: contains the data used by the tools; see the examples in Data Integration for Local Instances.
- tool-dependencies: contains all the packages required by ToolShed tools; these packages are installed with Anaconda.
- database: contains the error files and the input and output files for jobs run on the cluster nodes.
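The naming rule above can be sketched in a few lines of Python. This is only an illustration: the function name galaxy_root and the default base path argument are assumptions made for the example, and the actual location of your instance is whatever the administrator set up.

```python
from pathlib import PurePosixPath

# Subdirectories listed in this section.
SUBDIRECTORIES = [
    "config", "galaxy", "logs", "plugins", "tmp",
    "venv", "tool-data", "tool-dependencies", "database",
]


def galaxy_root(pi_username: str, project_base: str = "/project") -> PurePosixPath:
    """Return the expected Galaxy directory for a PI username.

    Hypothetical helper: it only applies the naming rule described in
    the text (first two characters of the PI username + "glxy", inside
    the PI's default group directory "def-<username>").
    """
    if len(pi_username) < 2:
        raise ValueError("PI username must have at least two characters")
    source_dir = pi_username[:2] + "glxy"
    default_group = "def-" + pi_username
    return PurePosixPath(project_base) / default_group / source_dir


root = galaxy_root("davidc")
print(root)  # /project/def-davidc/daglxy
for name in SUBDIRECTORIES:
    print(root / name)
```

For the PI davidc this yields /project/def-davidc/daglxy, matching the example in the text.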
File ownership and modification
All files of your Galaxy instance belong to a "pseudo-account", a shared account that an administrator creates at installation time. A pseudo-account does not belong to an individual but to a specific group, and everyone in the group can log in to it using SSH keys. Its name is the same as that of the top-level Galaxy directory described above, e.g. daglxy.
To modify any file of your Galaxy instance, such as a configuration file, you must first log in to the pseudo-account. Before you can log in, generate an SSH key pair, store your public key somewhere in your home directory, and let the administrator know where it is. The administrator will then install your public key in the appropriate place, after which you can log in to the pseudo-account.
Galaxy server management
Before you can use Galaxy, you must start your Galaxy server. The server must NOT be run on a Cedar login node or on any compute node. A dedicated server called "gateway" is used for this purpose; it hosts a web server and has the relevant Cedar filesystems (the /project and /home directories) mounted. For security reasons, users cannot make an SSH connection to this machine. Instead, a website on the gateway lets you start and stop your own Galaxy server and use the Galaxy web interface to communicate with it. Go to https://gateway.cedar.computecanada.ca/ and click on the Galaxy link. You will be asked for your username and password, which are the same as for your Compute Canada account. Once authenticated, you are automatically redirected to your Galaxy server manager page, where you can manage your server or use the Galaxy web interface.
Galaxy configuration
Files in the config directory are used to configure your Galaxy server. Configuring and optimizing Galaxy is tricky, and explaining all the configuration files is beyond the scope of this article. For more information, we recommend the documentation on the Galaxy website. Listed below are some basic variables that the administrator sets for you. We strongly recommend that you do not change them.
- In the file galaxy.yml (the main configuration file):
  - http: contains your unique port number.
  - database_connection is the name of your Galaxy database and your database server.
  - virtualenv is the path to a Python virtual environment on the gateway machine.
  - file_path, new_file_path, tool_config_file, shed_tool_config_file, tool_dependency_dir, tool_data_path, visualization_plugins_directory, job_working_directory, cluster_files_directory, template_cache_path, citation_cache_data_dir, citation_cache_lock_dir are appropriate paths for tools, tool sheds and dependencies.
Other variables and files in this directory can be changed by the user.
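Before saving an edited galaxy.yml, it can help to confirm that none of the administrator-managed variables listed above was touched. The following is a minimal sketch under stated assumptions: the function name changed_admin_keys is hypothetical, the settings are treated as flat dictionaries (a simplification of the real file layout), and the example values, including the cleanup_job setting, are invented for illustration.

```python
from typing import Dict, List

# Variables the text says are set by the administrator.
ADMIN_MANAGED_KEYS = {
    "http", "database_connection", "virtualenv",
    "file_path", "new_file_path", "tool_config_file",
    "shed_tool_config_file", "tool_dependency_dir", "tool_data_path",
    "visualization_plugins_directory", "job_working_directory",
    "cluster_files_directory", "template_cache_path",
    "citation_cache_data_dir", "citation_cache_lock_dir",
}


def changed_admin_keys(original: Dict[str, str], edited: Dict[str, str]) -> List[str]:
    """Return the administrator-managed keys whose values differ.

    Hypothetical helper: both arguments are plain dictionaries holding
    the settings before and after an edit.
    """
    return sorted(
        key for key in ADMIN_MANAGED_KEYS
        if original.get(key) != edited.get(key)
    )


# Invented example values, not taken from a real installation.
before = {"http": "127.0.0.1:4001", "cleanup_job": "always"}
after = {"http": "127.0.0.1:9999", "cleanup_job": "onsuccess"}

print(changed_admin_keys(before, after))  # ['http']
```

Only http is reported here, because cleanup_job is not among the administrator-managed variables and may be changed by the user.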
Running Tools
There are two ways to run tools in a Galaxy instance: "locally", which in this case means on the gateway machine, or by submitting jobs to the cluster.
Please DO NOT run tools locally on Cedar, because the gateway machine has little memory and cannot run jobs efficiently. Galaxy is configured to submit jobs to the cluster through the file job_conf.xml. Some variables in this file are already set by the administrator for submitting jobs, but you may need to adjust the entries depending on the tools you use; for example, some tools require more memory or more walltime. First examine the file to understand these variables and how they are used, then run some tests to find suitable job specification values for every tool you want to run.
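One way to examine job_conf.xml is to list each job destination together with the resources it requests. The sketch below parses an illustrative file with Python's standard library. The sample content is an assumption written in the general style of a Galaxy job_conf.xml for a Slurm runner; it is not the file installed on Cedar, and the destination names, resource values and tool id are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; not the configuration installed on Cedar.
SAMPLE_JOB_CONF = """<job_conf>
    <plugins>
        <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner"/>
    </plugins>
    <destinations default="slurm_default">
        <destination id="slurm_default" runner="slurm">
            <param id="nativeSpecification">--time=03:00:00 --mem=4G --cpus-per-task=1</param>
        </destination>
        <destination id="slurm_bigmem" runner="slurm">
            <param id="nativeSpecification">--time=24:00:00 --mem=32G --cpus-per-task=8</param>
        </destination>
    </destinations>
    <tools>
        <tool id="hypothetical_assembler" destination="slurm_bigmem"/>
    </tools>
</job_conf>
"""


def destination_specs(xml_text):
    """Map each destination id to its nativeSpecification string."""
    root = ET.fromstring(xml_text)
    specs = {}
    for dest in root.findall("./destinations/destination"):
        param = dest.find("param[@id='nativeSpecification']")
        has_text = param is not None and param.text
        specs[dest.get("id")] = param.text.strip() if has_text else ""
    return specs


def tool_destinations(xml_text):
    """Return the default destination and the per-tool overrides."""
    root = ET.fromstring(xml_text)
    default = root.find("destinations").get("default")
    mapping = {
        tool.get("id"): tool.get("destination", default)
        for tool in root.findall("./tools/tool")
    }
    return default, mapping


for dest_id, spec in destination_specs(SAMPLE_JOB_CONF).items():
    print(dest_id, "->", spec)

default, mapping = tool_destinations(SAMPLE_JOB_CONF)
print("default destination:", default)
print("tool overrides:", mapping)
```

In this sample, a tool that needs more memory or walltime is mapped to a second destination with larger values, while every other tool uses the default destination.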
GenAP
The Genetics and Genomics Analysis Platform (GenAP) is a computing infrastructure and software environment for life science researchers that has offered services since 2015. GenAP aims to facilitate the work of researchers and students by offering out-of-the-box web applications on an infrastructure that currently leverages Compute Canada cloud and HPC resources.
Any Compute Canada user can request a GenAP account free of charge. Non-Compute Canada users can be invited if they have a Compute Canada sponsor.
Galaxy on GenAP
GenAP lets you use your own private Galaxy instance, loaded with more than 700 preinstalled tools. GenAP-Galaxy is fully integrated with the GenAP infrastructure, allowing users to leverage storage and compute resources from Compute Canada and to interact with other GenAP applications.
Tools
GenAP-Galaxy comes with more than 700 preinstalled tools and reference genomes. There is no need for user configuration or installation.
The GenAP team has also developed some single-cell tools that are integrated with GenAP visualization tools.
GenAP also has close ties with the main Galaxy community, and updates of reference genomes and index files are synchronized with the main usegalaxy.org.
GenAP-Galaxy tries to keep its tool set as close as possible to the main Galaxy (usegalaxy.org).
In some cases we can add new tools on request; this decision is made on a case-by-case basis.
For security reasons, users are not allowed to install tools.
Running tools
There is no need for any configuration on GenAP-Galaxy. All tools, index files and reference genomes come pre-installed.
Jobs are submitted to a cluster using Slurm (no configuration needed), with default parameters of 10 GB RAM, 24 hours of walltime and 2 CPUs.
Through the Job Resources Parameters menu, users can modify the default walltime, number of processors and RAM of a job. This is especially useful for large jobs (such as genome assembly), where the defaults may not be enough.
Galaxy documentation
GenAP has extensive documentation on data analysis (more than 50 tutorials) and on how to get started with Galaxy on GenAP.
GenAP or Cedar: where should I analyze my data?
This table is meant to help new users choose where to analyze their data. Each of the two Galaxy services has specific features that may make it more suitable for some research groups.
Feature | GenAP | Cedar |
---|---|---|
Server | Arbutus | Cedar |
Galaxy configuration | None | High (done by user) |
Required knowledge of Linux | None | High |
Management/Updates | GenAP team | User |
Configuration | None | User |
Pre-installed tools | Yes | Yes (subset) |
IRIDA integration | No | Yes |
Reference Genome | Yes (through CVMFS) | Managed by user |
Quota | 1.5 TB (default) | storage RAC |