Chen, G. and Zhou, Q. Heterogeneity in DNA Multiple Alignments: Modeling, Inference, and Applications in motif finding.
Assume that the directory containing the data, estimates, and results is D and the directory containing the executable program is P.
The following command is used to run the program:
P/MultiScan -r "D/" -t "test_case_23_6sp/" -d "data.txt" -m MYF
-r: Root directory of data and results
-t: Relative directory of the input data and estimates (this directory should be located within the root directory)
-d: Input multiple alignment file name
-m: Motif name
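As a minimal sketch, the command above can be assembled from its four options in Python. The helper name build_multiscan_command is hypothetical, and keeping the trailing slashes on the -r and -t values simply mirrors the example command; whether the program requires them is an assumption.

```python
import posixpath

def build_multiscan_command(program_dir, root_dir, dataset_dir, data_file, motif):
    """Return the argument list for one MultiScan run (hypothetical helper)."""
    def with_slash(path):
        # Assumption: directories are passed with a trailing slash, as in the example.
        return path if path.endswith("/") else path + "/"
    return [
        posixpath.join(program_dir, "MultiScan"),
        "-r", with_slash(root_dir),
        "-t", with_slash(dataset_dir),
        "-d", data_file,
        "-m", motif,
    ]

cmd = build_multiscan_command("P", "D", "test_case_23_6sp", "data.txt", "MYF")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once P and D are replaced with real paths.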
In D, there should be three directories, /data, /estimate and /results, for input data, input parameter estimates, and output files, respectively.
In each of them, there should be a directory for each data set. For example, a sample data set is located at D/data/test_case_23_6sp.
Its parameters (for both background and motif) are in D/estimate/test_case_23_6sp. Its output results are in D/results/test_case_23_6sp.
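The layout described above can be checked before a run. The following is only a sketch: the function name missing_dataset_dirs is hypothetical, and whether the program creates the results directory itself is not stated here, so the check treats all three directories as required.

```python
from pathlib import Path

def missing_dataset_dirs(root, dataset):
    """Return the expected per-data-set directories that do not exist."""
    expected = [Path(root) / sub / dataset for sub in ("data", "estimate", "results")]
    return [path for path in expected if not path.is_dir()]

# Replace "D" with the real root directory before relying on the result.
for path in missing_dataset_dirs("D", "test_case_23_6sp"):
    print("missing directory:", path)
```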
In D/data/test_case_23_6sp/, there are three files: data.txt, tree.txt, and map_from_leave_label_to_seq_id.txt.
data.txt contains the input alignment. The coding scheme is as follows:
0--gap indicating absence of nucleotide bases, denoted by "-" in the UCSC genome browser database;
1--A; 2--C; 3--G; 4--T;
5--alignment is not available, denoted by "=" in the UCSC genome browser database (and denoted by "?" in the files showing the original alignments for the sample data set: upstream_horizontal_alignments_w_outgroup.txt, with outgroup, in horizontal format, and upstream_vertical_alignments_letters.txt, without outgroup, in vertical format);
6--repeats for the target species of motif finding (for example, human). The program does not construct any background for regions where the target species has repeats. All output is for the data with such regions removed.
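The coding scheme can be illustrated with a short decoding sketch. It assumes that each line of data.txt holds one alignment position, with one whitespace-separated code per sequence; the letter "r" for code 6, the helper names, and the toy rows are all hypothetical. The filter only mimics the described removal of target-species repeat regions and is not the program's own code.

```python
# Codes 0-5 follow the scheme above; "r" for code 6 is a placeholder chosen here.
CODE_TO_SYMBOL = {0: "-", 1: "A", 2: "C", 3: "G", 4: "T", 5: "=", 6: "r"}
REPEAT_CODE = 6

def decode_row(line):
    """Translate one line of integer codes into a string of symbols."""
    return "".join(CODE_TO_SYMBOL[int(token)] for token in line.split())

def drop_target_repeats(rows, target_col):
    """Keep only rows where the target species (0-based column) is not a repeat."""
    kept = []
    for line in rows:
        codes = [int(token) for token in line.split()]
        if codes[target_col] != REPEAT_CODE:
            kept.append(line)
    return kept

# Toy rows, not taken from the sample data set.
rows = ["1 1 0 5", "6 2 2 2", "4 4 4 3"]
kept = drop_target_repeats(rows, target_col=0)
print([decode_row(line) for line in kept])
```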
tree.txt contains information (parent node, child node, and branch length between them) of a phylogenetic tree estimated from the multiple alignment in upstream_horizontal_alignments_w_outgroup.txt.
outfile shows the original output of the PHYLIP inference package, and outgroup.txt shows the outgroup species used to locate the root of the tree.
Note that the nodes of the original tree are re-labeled with numbers (4->0, 3->1, 2->2, 1->3, cow->4, horse->5, rhesus->6, human->7, chimpanzee->8) in tree.txt.
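A sketch of reading such a tree follows. It assumes that each line of tree.txt is "parent child branch_length" separated by whitespace, which is not confirmed here. The relabeling table is copied from the note above; the edges and branch lengths in the example are toy values, not the sample tree.

```python
# Relabeling of the original node labels, as listed above.
RELABEL = {"4": 0, "3": 1, "2": 2, "1": 3, "cow": 4, "horse": 5,
           "rhesus": 6, "human": 7, "chimpanzee": 8}

def parse_tree(lines):
    """Return {child: (parent, branch_length)} from 'parent child length' lines."""
    edges = {}
    for line in lines:
        if not line.strip():
            continue
        parent, child, length = line.split()
        edges[int(child)] = (int(parent), float(length))
    return edges

def root_and_leaves(edges):
    """The root is the only parent that is never a child; leaves are never parents."""
    parents = {parent for parent, _ in edges.values()}
    children = set(edges)
    (root,) = parents - children
    return root, sorted(children - parents)

# Toy tree with made-up branch lengths (illustration only).
toy = ["0 1 0.1", "0 2 0.2", "1 3 0.05", "1 4 0.07"]
edges = parse_tree(toy)
root, leaves = root_and_leaves(edges)
print(root, leaves)
```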
map_from_leave_label_seq_id.txt contains a mapping from the leaf labels of the tree to sequence ids in the alignment (i.e., the column numbers in data.txt).
Background parameters (these can be obtained by HtBackground)
HeteMultiDMHMM_estimate_betas_1.txt:
Column-wise estimate for transition probabilities of the segmented Markov chain (\beta) from A;
HeteMultiDMHMM_estimate_betas_2.txt:
Column-wise estimate for transition probabilities of the segmented Markov chain (\beta) from C;
HeteMultiDMHMM_estimate_betas_3.txt:
Column-wise estimate for transition probabilities of the segmented Markov chain (\beta) from G;
HeteMultiDMHMM_estimate_betas_4.txt:
Column-wise estimate for transition probabilities of the segmented Markov chain (\beta) from T;
HeteMultiDMHMM_estimate_thetas.txt:
Column-wise estimate for cell probabilities of the multinomial distribution for mutated bases (\theta), in the order of A, C, G, and T;
HeteMultiDMHMM_estimate_deletion_rates.txt:
Column-wise estimate for deletion rates;
HeteMultiDMHMM_estimate_mutation_rates.txt:
Column-wise estimate for mutation rates.
Motif parameters (these can be treated as input parameters)
MYF_DM_estimate_thetas.txt:
Position-specific estimate for cell probabilities of the multinomial distribution for root nodes (\theta^m), in the order of A, C, G, and T;
MYF_DM_estimate_deletion_rates.txt:
Position-specific estimate for deletion rates; the sample uses a pooled estimate from all positions;
MYF_DM_estimate_mutation_rates.txt:
Position-specific estimate for mutation rates; the sample uses a pooled estimate from all positions.
Generic output
auxiliary_ouput.txt:
It shows the tree information and motif parameters as read from the input, for validation;
open_file_errors.txt:
It shows file opening errors.
Scoring output
MYF_HeteMultiDMHMMDM_positive_strand_scan.txt:
It gives the candidate position (start position and end position), the motif likelihood, the background likelihood, and the likelihood ratio of the motif model over the background model. The last three are on a log scale. This file contains all results of positive-strand scanning.
MYF_HeteMultiDMHMMDM_negative_strand_scan.txt:
This file contains all results of negative-strand scanning.
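A scan file can be post-processed along the following lines. This is a sketch under assumptions: it takes each line to hold five whitespace-separated fields in the order listed above (start, end, motif log-likelihood, background log-likelihood, log-likelihood ratio) with no header line, and the function names, threshold, and example lines are hypothetical. Because the values are on a log scale, the ratio is expected to equal the motif value minus the background value.

```python
def parse_scan_line(line):
    """Split one scan line into named fields (assumed column order)."""
    start, end, motif_ll, background_ll, llr = line.split()
    return {
        "start": int(start),
        "end": int(end),
        "motif_ll": float(motif_ll),
        "background_ll": float(background_ll),
        "llr": float(llr),
    }

def top_hits(lines, threshold):
    """Return candidates with log-likelihood ratio >= threshold, best first."""
    hits = [parse_scan_line(line) for line in lines if line.strip()]
    kept = [hit for hit in hits if hit["llr"] >= threshold]
    return sorted(kept, key=lambda hit: hit["llr"], reverse=True)

# Made-up lines for illustration; not output of the program.
example = ["10 15 -8.0 -12.5 4.5", "40 45 -11.0 -10.0 -1.0", "70 75 -7.5 -9.5 2.0"]
best = top_hits(example, threshold=0.0)
print([(hit["start"], hit["end"], hit["llr"]) for hit in best])
```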