Chen, G. and Zhou, Q. Heterogeneity in DNA Multiple Alignments: Modeling, Inference, and Applications in motif finding.
Assume that the directory containing data and results is D and the directory containing the executable program is P.
The following command is used to run the program:
P/HtBackground -s 3 -r "D/" -n 2 -t "test_case_23_6sp/" -d "data.txt" -k 4
-s: | Pseudo-random seed (different seeds lead to different starting points of the sampler) |
-r: | Root directory of data and results |
-n: | Number of sampling iterations |
-t: | Relative directory of the input data (this directory should be located within the root directory) |
-d: | Input multiple alignment file name |
-k: | Maximum number of segments (k_max) |
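The options above can also be assembled programmatically, which helps when launching several runs with different seeds. The following is a minimal sketch, not part of the program itself: the function name build_command is hypothetical, and appending a trailing slash to the -r and -t values is an assumption based only on the sample command.

```python
import shlex


def build_command(program_dir, root_dir, seed, iterations, dataset_dir, data_file, k_max):
    """Assemble the HtBackground argument list from the documented options."""

    def with_slash(path):
        # Assumption: the sample command passes directories with a trailing slash.
        return path if path.endswith("/") else path + "/"

    return [
        with_slash(program_dir) + "HtBackground",
        "-s", str(seed),               # pseudo-random seed
        "-r", with_slash(root_dir),    # root directory of data and results
        "-n", str(iterations),         # number of sampling iterations
        "-t", with_slash(dataset_dir), # data set directory inside the root
        "-d", data_file,               # input multiple alignment file name
        "-k", str(k_max),              # maximum number of segments
    ]


cmd = build_command("P", "D", 3, 2, "test_case_23_6sp", "data.txt", 4)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once P and D are replaced by real paths.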
In D, there should be two directories, /data and /results, the former for input data and the latter for output files. Each of them should contain one directory per data set. For example, a sample data set is located at D/data/test_case_23_6sp, and its output is written to D/results/test_case_23_6sp.
In D/data/test_case_23_6sp/, there are three files: data.txt, tree.txt, and map_from_leave_label_to_seq_id.txt.
data.txt contains the input alignment. The coding scheme is as follows:
0--gap indicating the absence of a nucleotide base, denoted by "-" in the UCSC genome browser database;
1--A; 2--C; 3--G; 4--T;
5--alignment is not available, denoted by "=" in the UCSC genome browser database (and by "?" in the files showing the original alignments for the sample data set: upstream_horizontal_alignments_w_outgroup.txt, with outgroup and in horizontal format, and upstream_vertical_alignments_letters.txt, without outgroup and in vertical format);
6--repeats in the target species of motif finding (for example, human). The program does not construct any background for regions where the target species has repeats. All output is for the data with such regions removed.
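To make the coding scheme concrete, here is a small sketch that turns one row of codes back into symbols and drops rows where the target species is coded as a repeat. Several details are assumptions rather than facts about the program: that each row of data.txt is one alignment position, that codes are single digits (possibly separated by whitespace), that "R" is used as a display symbol for repeats, and that the target species' column index is known. All function names are hypothetical.

```python
# Codes 0-5 follow the scheme described above; "R" for code 6 is only a
# display choice made for this sketch.
CODE_TO_SYMBOL = {0: "-", 1: "A", 2: "C", 3: "G", 4: "T", 5: "=", 6: "R"}


def parse_codes(line):
    """Read single-digit codes from a line, ignoring any separators."""
    return [int(ch) for ch in line if ch.isdigit()]


def decode_row(line):
    """Translate one row of codes into a string of symbols."""
    return "".join(CODE_TO_SYMBOL[code] for code in parse_codes(line))


def drop_target_repeats(rows, target_col):
    """Keep only rows where the target species (column target_col) is not a repeat."""
    return [row for row in rows if row[target_col] != 6]


print(decode_row("1 2 0 4 5 6"))
print(drop_target_repeats([[1, 2], [6, 3], [4, 6]], target_col=0))
```

The filter only mimics the documented behavior that regions with repeats in the target species are excluded; the program's own handling may differ in detail.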
tree.txt contains the information (parent node, child node, and the branch length between them) of a phylogenetic tree estimated from the multiple alignment in upstream_horizontal_alignments_w_outgroup.txt. outfile shows the original output of the PHYLIP inference package, and outgroup.txt shows the outgroup species used to locate the root of the tree. Note that the nodes of the original tree are relabeled with numbers (4->0, 3->1, 2->2, 1->3, cow->4, horse->5, rhesus->6, human->7, chimpanzee->8) in tree.txt.
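As an illustration of how such a tree description might be read, the sketch below parses (parent, child, branch length) triples and finds the root and the leaves. It assumes one whitespace-separated triple per line, which is not confirmed by this document. The topology and branch lengths in the sample text are invented for the example and are not the contents of the real tree.txt; only the node relabeling is taken from the description above.

```python
# Relabeling stated above, inverted: number used in tree.txt -> original label.
NODE_LABELS = {0: "4", 1: "3", 2: "2", 3: "1", 4: "cow", 5: "horse",
               6: "rhesus", 7: "human", 8: "chimpanzee"}


def parse_tree(text):
    """Return {parent: [(child, branch_length), ...]} from triple lines."""
    children = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        parent, child, length = int(fields[0]), int(fields[1]), float(fields[2])
        children.setdefault(parent, []).append((child, length))
    return children


def find_root(children):
    """The root is the only parent that never appears as a child."""
    child_nodes = {c for kids in children.values() for c, _ in kids}
    roots = [p for p in children if p not in child_nodes]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {roots}")
    return roots[0]


def find_leaves(children):
    """Leaves are children that have no children of their own."""
    child_nodes = {c for kids in children.values() for c, _ in kids}
    return sorted(child_nodes - set(children))


# Hypothetical topology and branch lengths, for illustration only.
sample = """0 1 0.10
0 4 0.20
1 5 0.15
1 2 0.05
2 6 0.03
2 3 0.02
3 7 0.01
3 8 0.01"""

tree = parse_tree(sample)
print(find_root(tree), [NODE_LABELS[n] for n in find_leaves(tree)])
```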
map_from_leave_label_seq_id.txt contains the mapping from the leaf labels of a tree to sequence ids in the alignment (i.e., the column numbers in data.txt).
The output files are prefixed by the pseudo-random seed (an input parameter). Some files are suffixed by a number indicating the sampling iteration. Some of the output files are described in detail below.
Generic output
3_auxiliary_ouput.txt: It shows the input parameters (k_max, tree information, and the mapping from leaf labels to sequence ids) for validation;
3_errors.txt: It shows file opening errors;
3_post_para.txt: It stores the quantity P(W, \beta, \theta, \lambda, \alpha, U | S) (specified up to a constant and in log scale) along iterations for convergence assessment;
3_post_dist_k.txt: Each row shows the conditional probabilities of the number of segments P(K | X, V) (specified up to a constant and in log scale) for one iteration.
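Because the values in 3_post_dist_k.txt are given in log scale and only up to a constant, they need to be normalized before they can be read as probabilities. A minimal sketch of that step, using the standard log-sum-exp shift for numerical stability, follows. The function name is hypothetical, and the input is assumed to be one row already parsed into a list of floats.

```python
import math


def normalize_log_probs(log_values):
    """Convert unnormalized log-probabilities into probabilities summing to 1."""
    shift = max(log_values)  # subtracting the maximum avoids underflow in exp()
    weights = [math.exp(v - shift) for v in log_values]
    total = sum(weights)
    return [w / total for w in weights]


# Two unnormalized values whose ratio is 1:3; the additive constant cancels.
probs = normalize_log_probs([-1000.0, -1000.0 + math.log(3)])
print(probs)
```

Exponentiating the raw values directly would underflow to zero for entries such as -1000, which is why the maximum is subtracted first.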
Segment-specific output
3_from_tos_1.txt: For the 1st iteration, each pair of numbers (segment start position and end position) indicates a sampled segment;
3_beta_shapes_1.txt: For the 1st iteration, every four rows present the conditional expectations of the transition probabilities of root bases E(\beta | W, R), from A, C, G, and T (rows) to A, C, G, and T (columns), for one segment;
3_theta_shapes_1.txt: For the 1st iteration, every row presents the conditional expectations of the emission probabilities for mutated bases E(\theta | W, S, I, V), in the order of A, C, G, and T, for one segment.
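Files that store one matrix per segment as consecutive rows, such as 3_beta_shapes_1.txt with four rows per segment, can be regrouped into per-segment matrices. The sketch below assumes whitespace-separated numbers on each row, which this document does not confirm; the sample values are invented for illustration, and the function names are hypothetical.

```python
def parse_rows(text):
    """Parse whitespace-separated numbers, one list per non-empty line."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]


def group_rows(rows, size):
    """Split rows into consecutive blocks of `size` rows (one block per segment)."""
    if len(rows) % size != 0:
        raise ValueError(f"{len(rows)} rows cannot be split into blocks of {size}")
    return [rows[i:i + size] for i in range(0, len(rows), size)]


BASES = "ACGT"  # row = source base, column = destination base

# Hypothetical values for two segments, for illustration only.
sample = """0.25 0.25 0.25 0.25
0.25 0.25 0.25 0.25
0.25 0.25 0.25 0.25
0.25 0.25 0.25 0.25
0.10 0.20 0.30 0.40
0.40 0.30 0.20 0.10
0.25 0.25 0.25 0.25
0.25 0.25 0.25 0.25"""

matrices = group_rows(parse_rows(sample), 4)
a_to_t = matrices[1][BASES.index("A")][BASES.index("T")]
print(len(matrices), a_to_t)
```

The same grouping with a block size of 3 would apply to files described as presenting three rows per unit.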
Conservation related or state-specific output
3_rate_state_transitions.txt: Every three rows present the conditional expectations of the conservation state transition probabilities E(\alpha | U), from states 1, 2, and 3 to states 1, 2, and 3, for one iteration;
3_deletions_1.txt: For the 1st iteration, the three rows show the sampled state-specific deletion rates, in the order of states 1, 2, and 3;
3_mutations_1.txt: For the 1st iteration, the three rows show the sampled state-specific mutation rates, in the order of states 1, 2, and 3;
3_map_column_to_rate_state_1.txt: For the 1st iteration, every row indicates the conservation state associated with an alignment column (indexed by the row number in the file).
Error: "The program cannot open output file D/results/test_case_23_6sp/3_errors.txt".
Solution: Create a directory test_case_23_6sp under the directory D/results.
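The error above can be avoided by creating the results directory before each run. A minimal sketch using the Python standard library follows; the function name ensure_dataset_dirs is hypothetical, and a temporary directory stands in for D so that the example can be run anywhere.

```python
import tempfile
from pathlib import Path


def ensure_dataset_dirs(root, dataset):
    """Create root/results/<dataset> if missing and return (data_dir, results_dir)."""
    root = Path(root)
    data_dir = root / "data" / dataset
    results_dir = root / "results" / dataset
    # parents=True also creates root/results; exist_ok=True makes reruns harmless.
    results_dir.mkdir(parents=True, exist_ok=True)
    return data_dir, results_dir


# A temporary directory plays the role of D in this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    data_dir, results_dir = ensure_dataset_dirs(tmp, "test_case_23_6sp")
    created = results_dir.is_dir()

print(created, results_dir.name)
```

The data directory is only returned, not created, since the input files are expected to be there already.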