Galaxy

Introduction

Galaxy is an open source, web-based platform for data-intensive biomedical research. It is a scientific workflow, data integration, and data and analysis persistence and publishing platform that aims to make computational biology accessible to research scientists who do not have computer programming or systems administration experience. Although it was initially developed for genomics research, it is largely domain agnostic and is now used as a general bioinformatics workflow management system.

Galaxy on Cedar

On Cedar we provide one Galaxy instance per research group. Installing Galaxy requires a special setup that must be done by Compute Canada (CC) staff. If your group needs Galaxy, please email the support team.

Galaxy Directory Structure

Galaxy is usually installed in the group's project directory and contains several sub-directories. The name of the Galaxy top directory is formed by taking the first two characters of the PI's username and appending "glxy". For example, if the PI's username is "davidc", the Galaxy top directory will be "daglxy", located in /project/group name/, where group name is the PI's default group name, e.g., def-davidc. The Galaxy main directory contains the following sub-directories, a layout that differs slightly from the original Galaxy package.
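The naming rule can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the "glxy" suffix is an assumption taken from the "daglxy" example above.

```python
from pathlib import PurePosixPath

# Assumption: the suffix follows the "daglxy" example given for PI "davidc".
SUFFIX = "glxy"


def galaxy_top_dir(pi_username: str) -> str:
    """Return the Galaxy top directory name for a PI username."""
    if len(pi_username) < 2:
        raise ValueError("PI username must have at least two characters")
    # First two characters of the PI username, followed by the suffix.
    return pi_username[:2] + SUFFIX


def galaxy_top_path(pi_username: str, group_name: str) -> PurePosixPath:
    """Return the expected location under /project/<group name>/."""
    return PurePosixPath("/project") / group_name / galaxy_top_dir(pi_username)


if __name__ == "__main__":
    print(galaxy_top_path("davidc", "def-davidc"))
```

For the PI "davidc" in group "def-davidc", this yields the path /project/def-davidc/daglxy.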

  • config: Contains all the configuration files required to set up and optimize Galaxy. Below we explain the basic concepts of some configuration files that need to be set up in our HPC environment; we cannot cover all of them, as they are beyond the scope of this page. In the original Galaxy package this directory is located within the galaxy directory.
  • galaxy: Contains the core Python scripts of Galaxy.
  • logs: Contains all log files written during the server stop/start/run process. The most important file in this directory is galaxy.log, which is widely used to diagnose issues during a Galaxy run.
  • plugins: Contains all plugins. In the original Galaxy package this directory is located within the galaxy directory.
  • tmp: Contains all temporary files that Galaxy needs for compiling and installing tools from tool sheds.
  • venv: A Python virtual environment directory that contains all Python package dependencies.
  • tool-data: Contains data used by tools. See the samples in data-integration.
  • tool-dependencies: Contains all dependency packages needed by tools from tool sheds. By default, packages in this directory are installed using Anaconda.
  • database: Contains the input, output and error files of all jobs that run on cluster nodes.
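As a quick sanity check of this layout, the sketch below reports which of the listed sub-directories are missing from a Galaxy top directory. The function name and the command-line usage are hypothetical; only the directory names come from the list above.

```python
import sys
from pathlib import Path
from typing import List

# Sub-directory names as listed above.
EXPECTED_SUBDIRS = (
    "config", "galaxy", "logs", "plugins", "tmp",
    "venv", "tool-data", "tool-dependencies", "database",
)


def missing_subdirs(top: Path) -> List[str]:
    """Return the expected sub-directories that do not exist under `top`."""
    return [name for name in EXPECTED_SUBDIRS if not (top / name).is_dir()]


if __name__ == "__main__":
    # Usage: python check_layout.py /project/<group name>/<galaxy top directory>
    top = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    missing = missing_subdirs(top)
    if missing:
        print("Missing sub-directories:", ", ".join(missing))
    else:
        print("All expected sub-directories are present.")
```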

Galaxy Files ownership and modification

All files of your Galaxy instance belong to a "pseudo account" (or "shared account") that is created by the admin at installation time. Pseudo accounts do not belong to a real user but to a specific group. They never expire, and everyone within the group can log in to the pseudo account using an SSH key. The name of the pseudo account is the same as the Galaxy top directory explained above, e.g., daglxy. To modify any file, you first need to log in as the pseudo account. To do so, generate an SSH key, store your public key somewhere in your home directory, and let the admin know where it is.

Galaxy Server

The Galaxy server cannot be run on Cedar; please do not run the startup script on Cedar. Instead we use another machine (called gateway) that runs a web server and has all of Cedar's /project and /home mounted. For security reasons users cannot connect to this machine via SSH, but you can start, stop or run your Galaxy server by going to https://gateway.cedar.computecanada.ca/ and following the Galaxy link.

Galaxy configuration

All files in the config directory are used to configure your Galaxy server. Configuring and optimizing Galaxy is tricky and requires broad scientific and technical knowledge; explaining all of it is beyond the scope of this page. We assume that users who request Galaxy have the knowledge to further set up their own Galaxy instance. Here we explain some important, basic settings that the admin must make for the server to work. While we recommend going through the configuration files and setting them up appropriately, we strongly recommend not overwriting the following variables, which are set by the admin.

  • galaxy.yml: The main and most important configuration file. We set the following variables:
    • http: contains your unique port number.
    • database_connection: the name of your Galaxy database and your database server.
    • virtualenv: the path to the Python virtual environment on the gateway machine.
    • file_path, new_file_path, tool_config_file, shed_tool_config_file, tool_dependency_dir, tool_data_path, visualization_plugins_directory, job_working_directory, cluster_files_directory, template_cache_path, citation_cache_data_dir, citation_cache_lock_dir: set the appropriate paths for tools, tool sheds and dependencies.
  • job_conf.xml: All variables in this file are used for job submission to Cedar. Different packages have different job specifications; for example, the package "spades" uses 8 cores with a walltime of 3 hours, and the job is submitted under your default group name, def-xxxxx. Please take a look at this file and set up your desired job specification.
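To review the job specification of each tool before changing it, a small script can list what job_conf.xml assigns. The sketch below parses an embedded sample rather than your real file. The sample is an assumption modelled on Galaxy's generic Slurm runner layout (plugins, destinations, tools, and a nativeSpecification parameter); the destination ids and the values of the default destination are placeholders, and the file installed by the admin may be organised differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the "spades" values (8 cores, 3 hours, def-xxxxx)
# come from the text above; everything else is a placeholder.
SAMPLE_JOB_CONF = """\
<job_conf>
    <plugins>
        <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner"/>
    </plugins>
    <destinations default="default">
        <destination id="default" runner="slurm">
            <param id="nativeSpecification">--account=def-xxxxx --cpus-per-task=1 --time=01:00:00</param>
        </destination>
        <destination id="spades_8core" runner="slurm">
            <param id="nativeSpecification">--account=def-xxxxx --cpus-per-task=8 --time=03:00:00</param>
        </destination>
    </destinations>
    <tools>
        <tool id="spades" destination="spades_8core"/>
    </tools>
</job_conf>
"""


def tool_specs(xml_text):
    """Map each tool id to the native specification of its destination."""
    root = ET.fromstring(xml_text)
    dest_specs = {}
    for dest in root.iter("destination"):
        param = dest.find("param[@id='nativeSpecification']")
        has_text = param is not None and param.text
        dest_specs[dest.get("id")] = param.text.strip() if has_text else ""
    # Tools that point to an unknown destination get an empty specification.
    return {
        tool.get("id"): dest_specs.get(tool.get("destination"), "")
        for tool in root.iter("tool")
    }


if __name__ == "__main__":
    for tool_id, spec in tool_specs(SAMPLE_JOB_CONF).items():
        print(tool_id, "->", spec)
```

To inspect your own instance, read the text of config/job_conf.xml and pass it to tool_specs instead of the sample.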