MultiScan: Multiple species motif scanning with heterogeneous background constructed by HtBackground.

Related paper

Chen, G. and Zhou, Q. Heterogeneity in DNA Multiple Alignments: Modeling, Inference, and Applications in motif finding.

Executable programs

 MultiScan (OS executable)

Instructions for running the programs

Assume that the directory containing data, estimate, and results is D, and that the directory containing the executable program is P.

The following command is used to run the program:
P/MultiScan -r "D/" -t "test_case_23_6sp/" -d "data.txt" -m MYF

Options

-r: Root directory of data and results
-t: Relative directory of the input data and estimate (this directory should be located within the root directory)
-d: Input multiple alignment file name
-m: Motif name
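
The command line above can also be assembled from a script. The following is a minimal Python sketch and is not part of the MultiScan distribution. The helper name build_multiscan_command is hypothetical; only the four options documented above are used, and the trailing slashes follow the sample command.

```python
def build_multiscan_command(program_dir, root_dir, dataset, data_file, motif):
    """Return the argument list for one MultiScan run (hypothetical helper)."""
    # The sample command passes both directories with a trailing slash,
    # so normalize the inputs to that form.
    root = root_dir.rstrip("/") + "/"
    relative = dataset.strip("/") + "/"
    return [
        program_dir.rstrip("/") + "/MultiScan",
        "-r", root,       # root directory of data and results
        "-t", relative,   # data set directory, relative to the root
        "-d", data_file,  # input multiple alignment file name
        "-m", motif,      # motif name
    ]


cmd = build_multiscan_command("P", "D", "test_case_23_6sp", "data.txt", "MYF")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once P and D are replaced by real paths.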

Input

In D, there should be three directories /data, /estimate and /results, for input data, input parameter estimate, and output files, respectively. In each of them, there should be a directory for a data set. For example, a sample data set is located at D/data/test_case_23_6sp. Its parameters (for both background and motif) are in D/estimate/test_case_23_6sp. Its output results are in D/results/test_case_23_6sp.
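
The layout described above can be checked before a run. This is a small Python sketch under the assumption that all three directories should already exist; whether MultiScan creates a missing results directory itself is not stated here. The function names are hypothetical.

```python
from pathlib import Path

ROLES = ("data", "estimate", "results")


def dataset_dirs(root, dataset):
    """Map each role (data, estimate, results) to its directory under root."""
    root = Path(root)
    return {role: root / role / dataset for role in ROLES}


def missing_dirs(root, dataset):
    """Return the roles whose directory does not exist on disk."""
    return [role for role, path in dataset_dirs(root, dataset).items()
            if not path.is_dir()]


for role, path in dataset_dirs("D", "test_case_23_6sp").items():
    print(role, path.as_posix())
```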

In D/data/test_case_23_6sp/, there are three files: data.txt, tree.txt, and map_from_leave_label_to_seq_id.txt.

data.txt contains input alignment. The coding scheme is as follows:
0--gap indicating absence of nucleotide bases, denoted by "-" in UCSC genome browser database;
1--A; 2--C; 3--G; 4--T;
5--alignment is not available, denoted by "=" in UCSC genome browser database (and denoted by "?" in the files showing the original alignments for the sample data set with outgroup in horizontal format upstream_horizontal_alignments_w_outgroup.txt and the one without outgroup in vertical format upstream_vertical_alignments_letters.txt);
6--repeats for the target species of motif finding (for example, human). The program does not construct any background for regions where the target species has repeats. All output is for the data with such regions removed.
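
The coding scheme above can be illustrated with a short Python sketch. It assumes that each row holds the integer codes of one alignment position, with one column per sequence; the letter "R" for code 6 and the function names are choices made for this sketch only, and the rows in the example are invented.

```python
# Integer codes from the scheme above. "R" for repeats is an arbitrary
# display symbol chosen for this sketch.
CODE_TO_SYMBOL = {0: "-", 1: "A", 2: "C", 3: "G", 4: "T", 5: "=", 6: "R"}


def decode(codes):
    """Turn a sequence of integer codes into a readable string."""
    return "".join(CODE_TO_SYMBOL[code] for code in codes)


def drop_target_repeats(rows, target_col):
    """Keep only the rows where the target species is not coded 6 (repeat)."""
    return [row for row in rows if row[target_col] != 6]


# Invented rows: three alignment positions, three sequences, target in column 0.
rows = [[1, 1, 0], [6, 2, 2], [3, 5, 3]]
kept = drop_target_repeats(rows, target_col=0)
print([decode(row) for row in kept])
```

The filtering step mirrors the statement that output refers to the data with target-species repeat regions removed, so positions reported by the program may not match positions in the unfiltered alignment.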

tree.txt contains information (parent node, child node, and branch length between them) of a phylogenetic tree estimated from the multiple alignment in upstream_horizontal_alignments_w_outgroup.txt. outfile shows the original output of the PHYLIP inference package, and outgroup.txt shows the outgroup species used to locate the root of the tree. Note that the nodes of the original tree are re-labeled with numbers (4->0, 3->1, 2->2, 1->3, cow->4, horse->5, rhesus->6, human->7, chimpanzee->8) in tree.txt.
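
As an illustration of the tree description above, the following Python sketch records the re-labeling and reads edges. The exact layout of tree.txt is an assumption here: one edge per line, written as "parent child branch_length" and separated by whitespace. The function names are hypothetical, and the toy edges are invented and do not reflect the sample tree.

```python
# Re-labeling quoted from the description above: original label -> number.
RELABEL = {"4": 0, "3": 1, "2": 2, "1": 3, "cow": 4, "horse": 5,
           "rhesus": 6, "human": 7, "chimpanzee": 8}
NUMBER_TO_LABEL = {number: label for label, number in RELABEL.items()}


def parse_edges(lines):
    """Parse 'parent child branch_length' lines (assumed layout) into tuples."""
    edges = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        parent, child, length = line.split()
        edges.append((int(parent), int(child), float(length)))
    return edges


def find_root(edges):
    """Return the single node that appears as a parent but never as a child."""
    parents = {parent for parent, _, _ in edges}
    children = {child for _, child, _ in edges}
    (root,) = parents - children
    return root


toy_lines = ["0 1 0.1", "0 2 0.2", "1 3 0.05"]  # invented toy edges
toy_edges = parse_edges(toy_lines)
print(find_root(toy_edges), NUMBER_TO_LABEL[7])
```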

map_from_leave_label_to_seq_id.txt contains the mapping from leaf labels of the tree to sequence ids in the alignment (i.e., the column number in data.txt).

Estimate

Background parameters (these can be obtained by HtBackground)
HeteMultiDMHMM_estimate_betas_1.txt: Column-wise estimate for transition probabilities of segmented Markov chain (\beta) from A;
HeteMultiDMHMM_estimate_betas_2.txt: Column-wise estimate for transition probabilities of segmented Markov chain (\beta) from C;
HeteMultiDMHMM_estimate_betas_3.txt: Column-wise estimate for transition probabilities of segmented Markov chain (\beta) from G;
HeteMultiDMHMM_estimate_betas_4.txt: Column-wise estimate for transition probabilities of segmented Markov chain (\beta) from T;
HeteMultiDMHMM_estimate_thetas.txt: Column-wise estimate for cell probabilities of multinomial distribution for mutated bases (\theta), in the order of A, C, G, and T;
HeteMultiDMHMM_estimate_deletion_rates.txt: Column-wise estimate for deletion rates;
HeteMultiDMHMM_estimate_mutation_rates.txt: Column-wise estimate for mutation rates.

Motif parameters (these can be treated as input parameters)
MYF_DM_estimate_thetas.txt: Position-specific estimate for cell probabilities of multinomial distribution for root nodes (\theta^m), in the order of A, C, G, and T;
MYF_DM_estimate_deletion_rates.txt: Position-specific estimate for deletion rates--the sample used a pooled estimate from all the positions;
MYF_DM_estimate_mutation_rates.txt: Position-specific estimate for mutation rates--the sample used a pooled estimate from all the positions.

Output

Generic output
auxiliary_ouput.txt: It shows the tree information and motif parameters as read by the program, for validation;
open_file_errors.txt: It shows file opening errors.

Scoring output
MYF_HeteMultiDMHMMDM_positive_strand_scan.txt: For each candidate position (start position and end position), it gives the motif likelihood, the background likelihood, and the likelihood ratio of the motif model over the background model. The last three are in log scale. This file contains all results of positive strand scanning.
MYF_HeteMultiDMHMMDM_negative_strand_scan.txt: This file contains all results of negative strand scanning.
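
Because the three scores are in log scale, the log likelihood ratio of a candidate should equal its log motif likelihood minus its log background likelihood. The Python sketch below checks this and ranks candidates. It assumes each line holds five whitespace-separated fields in the order listed above, with integer positions; that layout, the names Hit, parse_hit, and top_hits, and the example numbers are all assumptions made for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    start: int
    end: int
    log_motif: float
    log_background: float
    log_ratio: float


def parse_hit(line):
    """Parse one scan line, assuming five whitespace-separated fields."""
    start, end, log_motif, log_background, log_ratio = line.split()
    return Hit(int(start), int(end), float(log_motif),
               float(log_background), float(log_ratio))


def top_hits(hits, n):
    """Return the n candidates with the largest log likelihood ratio."""
    return sorted(hits, key=lambda hit: hit.log_ratio, reverse=True)[:n]


# Invented example lines, not taken from any real output file.
lines = ["10 17 -9.5 -12.0 2.5", "11 18 -14.0 -11.0 -3.0"]
hits = [parse_hit(line) for line in lines]

# In log scale, ratio = motif - background.
consistent = all(abs(hit.log_ratio - (hit.log_motif - hit.log_background)) < 1e-9
                 for hit in hits)
best = top_hits(hits, 1)[0]
print(consistent, best.start, best.end, best.log_ratio)
```

A positive log ratio means the motif model explains the candidate window better than the background model; any cutoff for calling a site is left to the user.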