Running Metascape enrichment analysis with Docker containers is quite simple. By default it analyzes a human gene list; note the -S parameter.
PS:
My test path is: sftp://root@ip:22/home/softwares/MSBio/data
Command: bin/ms.sh -u -o /data/output_single_id_txt /data/example/single_list_id.txt
Pay special attention to this!
Introduction
Metascape for Bioinformaticians (MSBio) enables Metascape analyses to be carried out in batch mode using users' own hardware. Metascape is a complex piece of software with many third-party dependencies (luckily no commercial ones), so Docker container technology is used to run Metascape offline. If you do not have access to a Docker infrastructure, please continue to use the Metascape.org web site. Although we have tested Docker on Mac (M1/M2 chips are not supported) and Windows, the instructions below are written with Linux in mind.
To run MSBio, you need two Docker images (~3GB each) and a valid license. Free MSBio licenses are available for non-web and non-commercial use only. To run MSBio within a commercial entity, please check what the commercial license enables.
Installation
Before you start, make sure you have access to a Docker infrastructure.
The installation package can be obtained by registering for a free license (valid for a year). Create a new folder for MSBio and unzip the installation zip file to get three subfolders:
$ unzip msbio_v3.5.20210815.zip
$ ls
bin data license
To download the MSBio Docker images, enter the MSBio working folder and run:
bin/install.sh
Note: on Windows, use winbin/install.bat instead.
As each image is ~3GB in size, be patient while they are downloaded. If successful, you should see two Docker images (Image ID and Size vary):
$ docker image list
REPOSITORY TAG IMAGE ID CREATED SIZE
metadocker8/msdata latest 99ea12d0ba82 30 hours ago 3.23GB
metadocker8/msbio latest 7914a952382e 30 hours ago 3.34GB
The Metascape Docker container requires a minimum of 6GB of memory to run, as the database consumes memory. We recommend providing 8GB+.
Note: the installation script also creates a data folder and makes it writable to all users (MSBio creates files with user ID 1002).
Usage
Container Management
The MSBio containers must be running in order to run Metascape analyses. To launch the containers, from the MSBio working folder run:
bin/up.sh
After you are done with your analyses, shut down the containers to free memory:
bin/down.sh
Gene-List Analysis
To analyze your gene list(s), use bin/ms.sh. The minimum syntax is:
bin/ms.sh -o output_folder input_list_file
However, because the input file format differs between single-gene-list and multi-gene-list analyses, you must specify -u if your input follows the single-gene-list format. Another very important point: both output_folder and input_list_file must be located under the data folder, because the data working folder is mounted inside the container as /data, so the files within it are visible inside the Docker container. Also, both output_folder and input_list_file must start with /data, since this is the path within the container! (Since version v3.5.20211016, a path that starts with "data" without the leading "/" has the "/" prepended automatically and will work.)
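The path rule above can be sketched as a small shell function. This is only an illustration: to_container_path is a hypothetical helper, not part of MSBio, and it assumes you call it with paths relative to the MSBio working folder.

```shell
# Hypothetical helper (not part of MSBio): convert a path given relative to
# the MSBio working folder into the path seen inside the container.
to_container_path() {
    case "$1" in
        /data|/data/*) printf '%s\n' "$1" ;;          # already a container path
        data|data/*)   printf '/%s\n' "$1" ;;         # add the missing leading "/"
        ./data/*)      printf '/%s\n' "${1#./}" ;;    # drop "./", then add "/"
        *) printf 'error: %s is not under data/\n' "$1" >&2
           return 1 ;;                                # not visible in the container
    esac
}

# Usage (requires running containers):
#   bin/ms.sh -u -o "$(to_container_path data/output_single_id_txt)" \
#       "$(to_container_path data/example/single_list_id.txt)"
```

Rejecting paths outside data early avoids starting a run whose input the container cannot see.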
Example for single-gene-list analysis:
bin/ms.sh -u -o /data/output_single_id_txt /data/example/single_list_id.txt
Here, -u stands for "unique", as our genes are in a column. This is important because the file format (.txt, .csv, or .xlsx) for a single gene list is different from the format used for multiple gene lists. The exact format of the input files is described in the online menu, and example input files are also available under the data/example folder. Our recommendation is to always use the multiple gene list format; we might need to retire the single gene list format in the future, as it causes some confusion.
If the gene list is not for human, use -S. Please read the next section for important options.
If your analysis command crashes without error, chances are the process within the container was killed due to insufficient memory, so it did not get a chance to complain. You should make sure your Docker server allows 8GB+ for the container.
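To check the memory available to Docker before a run, a sketch along the following lines may help. memory_status is a hypothetical helper, and reading the 6GB minimum and 8GB recommendation as decimal gigabytes is an assumption, since the text does not define the unit.

```shell
# Hypothetical helper: classify a memory size given in bytes against the
# documented 6GB minimum and 8GB recommendation (decimal GB assumed).
memory_status() {
    if [ "$1" -lt 6000000000 ]; then
        echo "below-minimum"
    elif [ "$1" -lt 8000000000 ]; then
        echo "below-recommended"
    else
        echo "ok"
    fi
}

# Usage (requires Docker): MemTotal is the total memory Docker reports, in bytes.
#   memory_status "$(docker info --format '{{.MemTotal}}')"
```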
Example of multi-gene-list analysis:
bin/ms.sh -o /data/output_multiple_sym_txt /data/example/multiple_list_symbol.txt
Advanced Options
MSBio supports many options, but you can ignore most of them. Here we explain only a few important ones:
-o OUTPUT, --output OUTPUT
The output folder path must be provided. It must start with /data/ as this is the path within the container.
-u, --one_list
This is important when your input uses the single-list file format.
-p, --PPI
By default, MSBio performs PPI network analysis. If you do not want PPI analysis, use this option. (Note: MSBio alpha did not run PPI by default; we changed the behavior in beta.)
-G, --skip_go
By default, MSBio performs GO enrichment analysis. If you would like to skip it, use this option.
-t ID_TYPE, --id_type ID_TYPE
ID type of the genes in the input file. By default, you do not need to specify it; Metascape guesses the type automatically. But you can also force Metascape to interpret your IDs as one of the following types: "Entrez", "RefSeq", "Symbol", or "dbxref". Type strings are case-sensitive.
-s, --skip_convert
If you are sure the input gene IDs are already correct Entrez Gene IDs, you can use this option to skip the ID conversion and speed things up slightly.
-S SOURCE_TAX_ID, --source_tax_id SOURCE_TAX_ID
By default, Metascape treats the source organism as human. If it is not, you can specify the source taxonomy ID using this option.
-T TARGET_TAX_ID, --target_tax_id TARGET_TAX_ID
By default, Metascape treats the target organism as human. If it is not, you can specify the target taxonomy ID using this option.
--option option.json
All settings for Metascape "Custom Analysis" and more can be changed using a JSON file. data/example/option.json is an example file containing all default settings. This is what is used if the --option is not provided. You can provide your own option.json file to customize ontology categories and annotation categories. Although not recommended, you can even overwrite gene list and PPI network size limits.
For -S and -T you can use either taxonomy IDs or common names. The supported IDs are: 9606, 10090, 10116, 4932, 5833, 6239, 7227, 7955, 3702, and 4896. The supported names are: human, mouse, rat, yeast, malaria, "c. elegans", fly, zebrafish, arabidopsis, and "s. pombe".
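The supported organisms can be captured in a small lookup, sketched below. tax_id_for is a hypothetical helper; the pairing assumes the two lists above are given in the same order, which agrees with the NCBI Taxonomy IDs for these species. Since -S and -T accept the names directly, a lookup like this is mainly useful for validating an organism before starting a long run.

```shell
# Hypothetical helper: map a supported common name to its taxonomy ID.
# Fails for any organism that is not in the supported list.
tax_id_for() {
    case "$1" in
        human)        echo 9606 ;;
        mouse)        echo 10090 ;;
        rat)          echo 10116 ;;
        yeast)        echo 4932 ;;
        malaria)      echo 5833 ;;
        "c. elegans") echo 6239 ;;
        fly)          echo 7227 ;;
        zebrafish)    echo 7955 ;;
        arabidopsis)  echo 3702 ;;
        "s. pombe")   echo 4896 ;;
        *) echo "unsupported organism: $1" >&2
           return 1 ;;
    esac
}

# Usage (requires running containers; the input file name is hypothetical):
#   bin/ms.sh -o /data/output_mouse -S "$(tax_id_for mouse)" /data/my_mouse_lists.txt
```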
Batch Processing
At the beginning of each bin/ms.sh run, the databases must first be loaded. If you need to run multiple tasks and would like to avoid this overhead, you can use a .job file as the input file; see data/example/test.job for examples. This way the databases are loaded only once and Metascape can run multiple tasks afterward.
Each line in a .job file is a JSON-format description of a Metascape task. You must minimally specify the input, the output, and "single":true if the input file uses the single-gene-list format (equivalent to the -u option). You can even provide a job-specific option.json file if you want to alter the default behavior.
To run the job file:
bin/ms.sh /data/example/test.job
Since v3.5.20211016, you may also omit the "/" at the beginning of the input and output arguments, e.g.:
bin/ms.sh data/example/test.job
For debugging purposes, if you want to skip a task, use "#" to comment out that task line.
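A job file of the kind described above could be generated with a helper like the following. This is a sketch: job_line is a hypothetical function, and the key names "input" and "output" are inferred from the description, so compare the result against data/example/test.job before relying on it.

```shell
# Hypothetical helper: print one .job line for the given input and output
# paths. Pass "single" as the third argument for the single-gene-list format.
# Paths containing double quotes or backslashes are not escaped.
job_line() {
    if [ "${3:-}" = "single" ]; then
        printf '{"input":"%s","output":"%s","single":true}\n' "$1" "$2"
    else
        printf '{"input":"%s","output":"%s"}\n' "$1" "$2"
    fi
}

# Usage, from the MSBio working folder (my.job is a hypothetical file name):
#   {
#     job_line /data/example/single_list_id.txt /data/output_single_id_txt single
#     job_line /data/example/multiple_list_symbol.txt /data/output_multiple_sym_txt
#   } > data/my.job
#   bin/ms.sh /data/my.job
```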
When Metascape executes a task, it encloses the output message within two lines, starting with 'START>' and 'COMPLETE>'. For example:
START> job #12, input=/data/example/multiple_list_id_bg.xlsx, output=/data/output_multiple_id_xlsx_bg
...
Cytoscape Free Memory: 1531
COMPLETE> job #12, input=/data/example/multiple_list_id_bg.xlsx, output=/data/output_multiple_id_xlsx_bg
If a task line is commented out or the input or output path for a task is missing, there will be a line:
SKIP> job #1
This is intended to make it easier for you to parse the batch processing output and identify the failed tasks.
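The markers above can be parsed with a short awk script, sketched here. summarize_jobs is a hypothetical helper; it assumes that a failed task leaves a START> line without a matching COMPLETE> line, which the text implies but does not state outright.

```shell
# Hypothetical helper: read batch output on stdin and report skipped jobs
# and jobs that started but never completed.
summarize_jobs() {
    awk '
        # The third field looks like "#12," or "#1"; reduce it to the number.
        function job_id(field) { sub(/^#/, "", field); sub(/,$/, "", field); return field }
        /^START> job #/    { started[job_id($3)] = 1 }
        /^COMPLETE> job #/ { delete started[job_id($3)] }
        /^SKIP> job #/     { print "skipped " job_id($3) }
        END { for (id in started) print "unfinished " id }
    '
}

# Usage (batch.log is a hypothetical file holding the saved ms.sh output):
#   summarize_jobs < batch.log
```

Unfinished jobs are printed in no particular order, since awk does not guarantee the iteration order of arrays.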
Parallel Metascape Analyses
When one bin/ms.sh is running, you must not execute another bin/ms.sh command! This is because the backend plotting components can only plot one task at a time, so if you run two ms.sh processes simultaneously, plots from the two gene lists may cross-talk with each other.
In case you really need to run multiple tasks in parallel, you need to use multiple MSBio containers, which are isolated from each other. Each ms.sh process in a container should only process one gene list at a time!
As an example, to launch two MSBio containers, do:
bin/up.sh
bin/up.sh 2
There will be two MSBio containers running, named msbio1 and msbio2. The first command, bin/up.sh, names the container msbio1 by default; it is equivalent to bin/up.sh 1. Now you can use both containers in parallel. The following two commands can be run at once:
bin/ms.sh -u -o /data/output_single_id_txt /data/example/single_list_id.txt &
bin/ms.sh 2 -o /data/output_multiple_sym_txt /data/example/multiple_list_symbol.txt
bin/ms.sh 2 runs the command in the msbio2 container. When bin/ms.sh is followed directly by an option (an argument starting with "-"), msbio1 is used. You may also write bin/ms.sh 1 if you want to select msbio1 explicitly.
To shut down both containers:
bin/down.sh
bin/down.sh 2
If you need more containers, just follow this usage pattern. To minimize resource consumption, only msbio1 runs the database server, and all other containers talk to msbio1. So bin/down.sh 1 will only work if there are no other containers depending on msbio1.
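Because msbio1 hosts the database server, one reasonable approach is to shut containers down from the highest number to the lowest. The sketch below only prints the commands; down_commands is a hypothetical helper, and it assumes the containers were started as 1 through N.

```shell
# Hypothetical helper: print the shutdown commands for N containers, highest
# number first, so that msbio1 (the database server) goes down last.
down_commands() {
    i=$1
    while [ "$i" -ge 1 ]; do
        echo "bin/down.sh $i"
        i=$((i - 1))
    done
}

# Usage, from the MSBio working folder with three containers running:
#   down_commands 3 | sh
```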
Mac and Windows
MSBio worked in our tests with Docker Desktop for Mac and Windows. On Mac, the commands are the same as in the examples above. On Windows, the scripts are in the winbin folder instead, so the commands bin/up.sh, bin/down.sh, and bin/ms.sh are replaced by winbin/up.bat, winbin/down.bat, and winbin/ms.bat, respectively.