Galaxy: Difference between revisions

Latest revision as of 15:19, 30 October 2024

Introduction

Galaxy is an open-source, web-based platform for data-intensive biomedical research. It aims to make computational biology accessible to research scientists who do not have computer programming or systems administration experience. Although it was initially developed for genomics research, it is largely domain-agnostic and is now used as a general workflow management system in bioinformatics.

The list of tutorials at https://training.galaxyproject.org/ suggests the range of applications of Galaxy.

Galaxy on Cedar

On Cedar, we provide one Galaxy instance per research group. Installing Galaxy requires a special setup that must be done by Alliance staff. If you need Galaxy for your group, please email the support team.

Galaxy directory structure

Galaxy is usually installed in the group's project directory and contains several sub-directories. The name of the Galaxy top directory is formed by taking the first two characters of the PI's username and appending "glxy". For example, if the PI's username is "davidc", the Galaxy top directory is "daglxy", located in /project/group name/, where "group name" is the PI's default group name, e.g., def-davidc. The Galaxy main directory contains the following sub-directories; this layout differs slightly from that of the original Galaxy package.

- config: contains all the configuration files required to set up and optimize the Galaxy server. Below we explain the basic concepts behind some of the configuration files that must be set up to be compatible with our HPC environment; we do not cover all concepts.
- galaxy: contains the core Galaxy package, which is written mostly in Python.
- logs: contains two files, galaxy.log and server.log. All messages produced during startup or shutdown of the server are written to server.log, while all messages produced during the run are written to galaxy.log.
- plugins: contains all plugins. In the original Galaxy package, this directory is located inside the galaxy directory.
- tmp: contains all temporary files that Galaxy needs for compiling and installing tool sheds.
- venv: a Python virtual environment directory that contains all Python package dependencies.
- tool-data: contains data used by tools. See the samples at https://galaxyproject.org/admin/data-integration.
- tool-dependencies: contains all packages needed for tool sheds. By default, packages in this directory are installed using Anaconda.
- database: contains the input, output, and error files of all jobs that run on cluster nodes.
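
The naming rule and layout described above can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in this section; the function names galaxy_top_dir and galaxy_path are hypothetical and are not part of Galaxy or of any Alliance tooling.

```python
from pathlib import PurePosixPath


def galaxy_top_dir(pi_username: str) -> str:
    """Return the Galaxy top directory name: first two characters of the PI username + "glxy"."""
    if len(pi_username) < 2:
        raise ValueError("PI username must have at least two characters")
    return pi_username[:2] + "glxy"


def galaxy_path(pi_username: str, group_name: str) -> PurePosixPath:
    """Return the expected location of the Galaxy top directory under /project."""
    return PurePosixPath("/project") / group_name / galaxy_top_dir(pi_username)


# Sub-directories listed in this section.
SUBDIRECTORIES = [
    "config", "galaxy", "logs", "plugins", "tmp",
    "venv", "tool-data", "tool-dependencies", "database",
]

if __name__ == "__main__":
    top = galaxy_path("davidc", "def-davidc")
    print(top)
    for name in SUBDIRECTORIES:
        print(top / name)
```

For the example PI "davidc" in the group def-davidc, the first line printed is /project/def-davidc/daglxy.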

Galaxy files ownership and modification

All files of your Galaxy instance belong to a "pseudo-account", a shared account that is created by an administrator at installation time. A pseudo-account does not belong to an individual person but to a specific group. Everyone in the group can log in to the pseudo-account using SSH keys (see https://docs.computecanada.ca/wiki/Using_SSH_keys_in_Linux). The name of the pseudo-account is the same as the name of the top Galaxy directory explained above, e.g., daglxy. To modify any file of your Galaxy instance, such as a configuration file, you must first log in to the pseudo-account. Before you can log in, you must generate an SSH key pair, store your public key somewhere in your home directory, and tell the administrator where it is. The administrator will then store your public key in an appropriate place, after which you can log in to your pseudo-account.
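
As a rough sketch of the key-generation step, the Python snippet below builds an ssh-keygen command line for an Ed25519 key pair and shows where the public key would end up. The key file name galaxy_pseudo_ed25519, the comment string, and the choice of Ed25519 are assumptions made for illustration; follow the SSH keys documentation for the recommended settings. The snippet only prints the command and does not run it.

```python
from __future__ import annotations

from pathlib import Path


def ssh_keygen_command(key_path: Path, comment: str) -> list[str]:
    """Build an ssh-keygen command that creates an Ed25519 key pair at key_path."""
    return ["ssh-keygen", "-t", "ed25519", "-f", str(key_path), "-C", comment]


def public_key_path(key_path: Path) -> Path:
    """ssh-keygen writes the public key next to the private key, with a .pub suffix."""
    return key_path.with_name(key_path.name + ".pub")


if __name__ == "__main__":
    # Hypothetical key location inside the user's home directory.
    key = Path.home() / ".ssh" / "galaxy_pseudo_ed25519"
    print(" ".join(ssh_keygen_command(key, "galaxy pseudo-account key")))
    print("Public key to report to the administrator:", public_key_path(key))
```

Only the public key (the .pub file) is shared with the administrator; the private key stays with you.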

Galaxy server management

Starting the Galaxy server is the first thing users need to do. The Galaxy server should NOT be run on a Cedar login node or on any compute node. A dedicated server called "gateway" is used for this purpose. It hosts a web server and has the relevant Cedar filesystems, /project and /home, mounted on it. For security reasons, users cannot make an SSH connection to this machine; instead, a website on this machine allows users to start and stop their own Galaxy server. The website also lets users use the Galaxy web interface to communicate with the server. To do so, go to https://gateway.cedar.computecanada.ca/ and click on the Galaxy link. You will be asked to enter your username and password, which are the same as your computecanada ones. Once you authenticate, you are automatically redirected to your Galaxy server manager website, where you can manage your server or use the Galaxy web interface.

Galaxy configuration

Files in the config directory are used to configure your Galaxy server. Configuring and optimizing Galaxy is tricky, and explaining all the configuration files is beyond the scope of this article. For more information, we recommend reading the documentation on the Galaxy website (https://docs.galaxyproject.org/en/master/admin/config.html). Below we list some basic variables that are set for you by the administrator. We strongly recommend that you do not change them.

In the file galaxy.yml (the main configuration file):
- http contains your unique port number.
- database_connection is the name of your Galaxy database and your database server.
- virtualenv is the path to a Python virtual environment on the gateway machine.
- file_path, new_file_path, tool_config_file, shed_tool_config_file, tool_dependency_dir, tool_data_path, visualization_plugins_directory, job_working_directory, cluster_files_directory, template_cache_path, citation_cache_data_dir, citation_cache_lock_dir are the appropriate paths for tools, tool sheds and dependencies.

Other variables and files in this directory can be changed by the user.
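
One way to guard against accidentally changing an administrator-managed variable is to compare the settings before and after an edit. The sketch below is a minimal illustration, not an official tool: it assumes the settings have already been loaded into flat Python dictionaries (it ignores how the real galaxy.yml groups settings into sections), and the function name changed_admin_keys and all sample values are hypothetical.

```python
# Variables in galaxy.yml that this section says are set by the administrator.
ADMIN_MANAGED_KEYS = {
    "http",
    "database_connection",
    "virtualenv",
    "file_path",
    "new_file_path",
    "tool_config_file",
    "shed_tool_config_file",
    "tool_dependency_dir",
    "tool_data_path",
    "visualization_plugins_directory",
    "job_working_directory",
    "cluster_files_directory",
    "template_cache_path",
    "citation_cache_data_dir",
    "citation_cache_lock_dir",
}


def changed_admin_keys(original: dict, edited: dict) -> list:
    """Return the admin-managed keys whose values differ between two flat config mappings."""
    return sorted(
        key for key in ADMIN_MANAGED_KEYS
        if original.get(key) != edited.get(key)
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    original = {
        "http": "127.0.0.1:8080",
        "tool_data_path": "/project/def-davidc/daglxy/tool-data",
        "example_user_option": "a",
    }
    edited = dict(original, http="127.0.0.1:9090", example_user_option="b")
    print(changed_admin_keys(original, edited))
```

In this example only http is reported, because example_user_option is not in the administrator-managed set.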

Running Tools

There are two ways to run tools in a Galaxy instance: "locally", which in this case means on the gateway machine, or by submitting jobs to the cluster.

Please DO NOT run tools locally at Cedar, because the gateway machine has little memory and cannot run jobs efficiently. Galaxy is configured to submit jobs to the cluster through the file job_conf.xml. Some variables in this file are already set by the administrator for submitting jobs. However, you may need to tune the entries in this file depending on the tools you will use; for example, some tools require more memory or more walltime. Please examine the file first to understand these variables and how they are used, then run some tests to find suitable job specification values for every tool you plan to run.
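
To get a feel for what job_conf.xml expresses, the sketch below parses a small sample with Python's standard library and reports which scheduler options a given tool would be submitted with. The sample follows the general plugins/destinations/tools layout of Galaxy's XML job configuration, but every id, tool name and Slurm option in it is hypothetical; the file installed on Cedar may be organized differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical job_conf.xml content; the destination ids, tool id and Slurm
# options below are illustrative, not the values set on Cedar.
SAMPLE_JOB_CONF = """\
<job_conf>
    <plugins>
        <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner"/>
    </plugins>
    <destinations default="small">
        <destination id="small" runner="slurm">
            <param id="nativeSpecification">--time=01:00:00 --mem=4G --cpus-per-task=1</param>
        </destination>
        <destination id="large" runner="slurm">
            <param id="nativeSpecification">--time=24:00:00 --mem=32G --cpus-per-task=8</param>
        </destination>
    </destinations>
    <tools>
        <tool id="example_assembler" destination="large"/>
    </tools>
</job_conf>
"""


def destination_specs(job_conf_xml: str) -> dict:
    """Map each destination id to its nativeSpecification string (empty if absent)."""
    root = ET.fromstring(job_conf_xml)
    specs = {}
    for dest in root.findall("./destinations/destination"):
        param = dest.find("./param[@id='nativeSpecification']")
        has_text = param is not None and param.text
        specs[dest.get("id")] = param.text.strip() if has_text else ""
    return specs


def spec_for_tool(job_conf_xml: str, tool_id: str) -> str:
    """Return the scheduler options for a tool, falling back to the default destination."""
    root = ET.fromstring(job_conf_xml)
    target = root.find("./destinations").get("default")
    for tool in root.findall("./tools/tool"):
        if tool.get("id") == tool_id:
            target = tool.get("destination", target)
    return destination_specs(job_conf_xml)[target]


if __name__ == "__main__":
    print(spec_for_tool(SAMPLE_JOB_CONF, "example_assembler"))
    print(spec_for_tool(SAMPLE_JOB_CONF, "some_other_tool"))
```

Mapping a demanding tool to a destination with more memory and walltime, while leaving other tools on the default destination, is the kind of tuning the paragraph above refers to.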

Galaxy on GenAP

The Genetics and Genomics Analysis Platform (GenAP, https://www.genap.ca/) is a computing infrastructure and software environment for life science researchers that has offered services since 2015. GenAP aims to facilitate the work of researchers and students by offering out-of-the-box web applications running on an infrastructure that currently leverages Alliance cloud and HPC resources.

Any Alliance user can request an account on GenAP free of charge. Other users can be invited if they have an Alliance sponsor.

GenAP lets you use your own private Galaxy instance, loaded with more than 700 pre-installed tools. GenAP-Galaxy is fully integrated with the GenAP infrastructure, allowing users to leverage storage and compute resources from the Alliance as well as interact with other GenAP applications.

Tools

GenAP-Galaxy comes with more than 700 pre-installed tools and reference genomes. There is no need for user configuration or installation.

The GenAP team has also developed some single-cell tools that are integrated with GenAP visualization tools.

GenAP also has close ties with the main Galaxy community, and updates of reference genomes and index files are synchronized with the main usegalaxy.org.

GenAP-Galaxy tries to keep its tool set as close as possible to that of the main Galaxy (usegalaxy.org).

In some cases we can add new tools on request; this decision is made on a case-by-case basis.

For safety reasons users are not allowed to install tools.

Running tools

There is no need for any configuration on GenAP-Galaxy. All tools, index files and reference genomes come pre-installed.

Jobs are submitted to a cluster using Slurm (no configuration needed), with the default parameters of 10 GB RAM, 24 hours of walltime, and 2 CPUs.

Through the Job Resources Parameters menu, users can modify the default walltime, number of processors and RAM of a job. This is especially useful for large jobs (such as genome assembly), where the defaults may not be enough.
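
As a small worked example of the defaults above, the sketch below compares a job's estimated needs with the stated defaults (10 GB RAM, 24 hours, 2 CPUs) and lists which settings would have to be raised. The class and function names are hypothetical, the estimates in the example are invented for illustration, and nothing here talks to GenAP itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobResources:
    ram_gb: int
    walltime_hours: int
    cpus: int


# Default job parameters stated in this section.
GENAP_DEFAULTS = JobResources(ram_gb=10, walltime_hours=24, cpus=2)


def settings_to_raise(needed: JobResources,
                      defaults: JobResources = GENAP_DEFAULTS) -> list:
    """List the settings whose default value is too small for the job."""
    over = []
    if needed.ram_gb > defaults.ram_gb:
        over.append("RAM")
    if needed.walltime_hours > defaults.walltime_hours:
        over.append("walltime")
    if needed.cpus > defaults.cpus:
        over.append("number of processors")
    return over


if __name__ == "__main__":
    # Invented estimate for a large assembly job.
    assembly = JobResources(ram_gb=64, walltime_hours=48, cpus=2)
    print(settings_to_raise(assembly))
```

A job that fits within all three defaults yields an empty list and needs no change to its resource parameters.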

Galaxy documentation

GenAP has extensive documentation on data analysis (more than 50 tutorials) and on how to get started with Galaxy on GenAP; see https://www.genap.ca/p/help/galaxy-in-genap.

GenAP or Cedar: where to analyze my data?

This table is meant to help new users choose where to analyze their data. Each of the two Galaxy options has specific features that may make it more suitable for some research groups.

Feature	GenAP	Cedar
Server	Arbutus	Cedar
Galaxy configuration	None	High (done by user)
Required knowledge of Linux	None	High
Management/Updates	GenAP team	User
Server configuration	None	User
Pre-installed tools	Yes	Yes (subset)
IRIDA integration	No	Yes
Reference Genome	Yes (through CVMFS)	Managed by user
Quota	1.5 TB (default)	storage RAC

