HtBackground: Constructing heterogeneous background for multiple species motif finding by a Bayesian segmentation approach coupled with a hidden Markov model.

Related paper

Chen, G. and Zhou, Q. Heterogeneity in DNA Multiple Alignments: Modeling, Inference, and Applications in motif finding.

Executable programs

HtBackground (OS executable)

Instructions for running the programs

Assume that the directory containing the data and results is D and the directory containing the executable program is P.

The following command is used to run the program:
P/HtBackground -s 3 -r "D/" -n 2 -t "test_case_23_6sp/" -d "data.txt" -k 4

Options

-s:	Pseudo-random seed (different seeds lead to different starting points of the sampler)
-r:	Root directory of data and results
-n:	Number of sampling iterations
-t:	Relative directory of the input data set (this directory should be located within the data directory under the root directory)
-d:	Input multiple alignment file name
-k:	Maximum number of segments (k_max)
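
The sample command can also be assembled from a script, which helps when launching several runs with different seeds. The following is a minimal Python sketch, not part of the HtBackground distribution; the function name build_command and its default values are illustrative assumptions taken from the sample command above.

```python
from pathlib import Path


def build_command(program_dir, root_dir, dataset, seed=3, iterations=2,
                  data_file="data.txt", k_max=4):
    """Return the argument list for one HtBackground run (illustrative helper)."""
    # The sample command passes both directories with a trailing slash,
    # so the same convention is kept here.
    root = str(root_dir).rstrip("/") + "/"
    rel = str(dataset).rstrip("/") + "/"
    return [
        str(Path(program_dir) / "HtBackground"),
        "-s", str(seed),
        "-r", root,
        "-n", str(iterations),
        "-t", rel,
        "-d", data_file,
        "-k", str(k_max),
    ]


cmd = build_command("P", "D", "test_case_23_6sp")
print(" ".join(cmd))
# To launch the sampler (requires the executable to exist at that path):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a different seed value to build_command gives an independent run whose output files carry that seed as their prefix.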

Input

In D, there should be two directories, /data and /results, the former for input data and the latter for output files. Each of them should contain one directory per data set. For example, a sample data set is located at D/data/test_case_23_6sp, and its output is written to D/results/test_case_23_6sp.

In D/data/test_case_23_6sp/, there are three files: data.txt, tree.txt, and map_from_leave_label_to_seq_id.txt.

data.txt contains the input alignment. The coding scheme is as follows:
0--gap indicating absence of nucleotide bases, denoted by "-" in UCSC genome browser database;
1--A; 2--C; 3--G; 4--T;
5--alignment is not available, denoted by "=" in UCSC genome browser database (and denoted by "?" in the files showing the original alignments for the sample data set with outgroup in horizontal format upstream_horizontal_alignments_w_outgroup.txt and the one without outgroup in vertical format upstream_vertical_alignments_letters.txt);
6--repeats for the target species of motif finding (for example, human). The program does not construct any background for regions where the target species has repeats. All output is for the data with such regions removed.
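
To inspect an encoded alignment by eye, the integer codes can be mapped back to symbols. The sketch below is illustrative only: it assumes that each line of data.txt holds whitespace-separated integer codes, one alignment position per line and one sequence per column, which this README does not state explicitly. The symbol "r" for code 6 is a placeholder chosen here, and the helper names are hypothetical.

```python
# Codes 0-5 follow the scheme listed above; "r" for code 6 (repeats)
# is an illustrative placeholder, not a symbol defined by the program.
CODE_TO_SYMBOL = {0: "-", 1: "A", 2: "C", 3: "G", 4: "T", 5: "=", 6: "r"}


def parse_row(line):
    """Split one text line into integer codes (assumes whitespace separation)."""
    return [int(token) for token in line.split()]


def decode_codes(codes):
    """Translate a list of integer codes into a string of symbols."""
    return "".join(CODE_TO_SYMBOL[code] for code in codes)


def drop_repeat_positions(rows, target_column):
    """Keep only the positions where the target species is not a repeat (code 6)."""
    return [row for row in rows if row[target_column] != 6]


rows = [parse_row(line) for line in ["1 1 0", "6 2 2", "3 5 3"]]
kept = drop_repeat_positions(rows, target_column=0)
print([decode_codes(row) for row in kept])
```

The filtering step mirrors the statement above that regions where the target species has repeats are removed before any output is produced; which column holds the target species depends on the data set.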

tree.txt contains information (parent node, child node, and branch length between them) on a phylogenetic tree estimated from the multiple alignment in upstream_horizontal_alignments_w_outgroup.txt. The file outfile contains the original output of the PHYLIP inference package, and outgroup.txt lists the outgroup species used to locate the root of the tree. Note that the nodes of the original tree are re-labeled with numbers (4->0, 3->1, 2->2, 1->3, cow->4, horse->5, rhesus->6, human->7, chimpanzee->8) in tree.txt.

map_from_leave_label_seq_id.txt contains the mapping from the leaf labels of the tree to sequence ids in the alignment (i.e., the column numbers in data.txt).

Output

The output files are prefixed by the pseudo-random seed (an input parameter; 3 in the examples below). Some files are suffixed by a number indicating the sampling iteration. Detailed descriptions of some of the output files follow.

Generic output
3_auxiliary_ouput.txt: It shows input parameters (k_max, tree information, and mapping from leaf labels to sequence ids) for validation;
3_errors.txt: It shows file opening errors;
3_post_para.txt: It stores the quantity P(W, \beta, \theta, \lambda, \alpha, U | S) (specified up to a constant and in log scale) along iterations for convergence assessment;
3_post_dist_k.txt: Each row shows the conditional probabilities of the number of segments P(K | X, V) (specified up to a constant and in log scale) for one iteration;
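
Because the values in 3_post_dist_k.txt are log probabilities specified up to a constant, they have to be exponentiated and normalized before they can be read as a distribution over the number of segments. The sketch below shows one numerically stable way to do this for a single row. It assumes the values on a row are whitespace-separated; the function names are hypothetical, and the README does not state which column corresponds to which value of K.

```python
import math


def parse_floats(line):
    """Split one text line into floats (assumes whitespace separation)."""
    return [float(token) for token in line.split()]


def normalize_log_probs(log_values):
    """Turn unnormalized log probabilities into probabilities that sum to 1."""
    # Subtracting the maximum before exponentiating avoids underflow
    # when all log values are very negative.
    largest = max(log_values)
    weights = [math.exp(value - largest) for value in log_values]
    total = sum(weights)
    return [weight / total for weight in weights]


row = parse_floats("-1052.3 -1050.1 -1049.8 -1051.0")
probs = normalize_log_probs(row)
print(["%.3f" % p for p in probs])
```

The same normalization does not apply to 3_post_para.txt, where the unnormalized log values are meant to be followed across iterations rather than compared within a row.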

Segment-specific output
3_from_tos_1.txt: For the 1st iteration, each pair of numbers (segment start position and end position) indicates a sampled segment;
3_beta_shapes_1.txt: For the 1st iteration, every four rows present conditional expectations of transition probabilities of root bases E(\beta | W, R)--from A, C, G, and T (row) to A, C, G, and T (column)--for one segment;
3_theta_shapes_1.txt: For the 1st iteration, every row presents conditional expectations of emission probabilities for mutated bases E(\theta | W, S, I, V), in the order of A, C, G, and T, for one segment;
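
Since 3_beta_shapes_1.txt stores one 4-by-4 transition matrix per segment as four consecutive rows, a reader of that file has to regroup its rows into blocks. The following sketch illustrates the regrouping under the assumption that the values on each row are whitespace-separated; the helper names are hypothetical and the numbers in the example are made up.

```python
def parse_rows(text):
    """Parse non-empty lines of text into lists of floats (assumes whitespace separation)."""
    return [[float(token) for token in line.split()]
            for line in text.splitlines() if line.strip()]


def group_rows(rows, rows_per_block=4):
    """Split a list of rows into consecutive blocks of rows_per_block rows."""
    if len(rows) % rows_per_block != 0:
        raise ValueError("number of rows is not a multiple of %d" % rows_per_block)
    return [rows[start:start + rows_per_block]
            for start in range(0, len(rows), rows_per_block)]


# Two made-up segments, each with a 4x4 matrix (rows: from A, C, G, T).
example = "0.4 0.2 0.2 0.2\n0.2 0.4 0.2 0.2\n0.2 0.2 0.4 0.2\n0.2 0.2 0.2 0.4\n" * 2
matrices = group_rows(parse_rows(example))
print(len(matrices), "segments;", "first row of segment 1:", matrices[0][0])
```

Each row of 3_theta_shapes_1.txt already corresponds to one segment, so parse_rows alone is enough for that file under the same separator assumption.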

Conservation related or state-specific output
3_rate_state_transitions.txt: Every three rows present conditional expectations of conservation state transition probabilities E(\alpha | U) from 1, 2, and 3 to 1, 2, and 3 for one iteration;
3_deletions_1.txt: For the 1st iteration, the three rows show sampled state-specific deletion rates, in the order of 1, 2, and 3;
3_mutations_1.txt: For the 1st iteration, the three rows show sampled state-specific mutation rates, in the order of 1, 2, and 3;
3_map_column_to_rate_state_1.txt: For the 1st iteration, every row indicates the conservation state associated with an alignment column (indexed by the row number in the file).

Common error

Error: "The program cannot open output file D/results/test_case_23_6sp/3_errors.txt".
Solution: A directory test_case_23_6sp should be created under the directory D/results.
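
The results directory can be created ahead of a run to avoid this error. Here is a minimal Python sketch, assuming the directory layout described in the Input section; the function name ensure_results_dir is hypothetical.

```python
from pathlib import Path


def ensure_results_dir(root_dir, dataset):
    """Create root_dir/results/dataset if it does not exist and return its path."""
    path = Path(root_dir) / "results" / dataset
    # parents=True also creates root_dir/results when it is missing;
    # exist_ok=True makes repeated calls harmless.
    path.mkdir(parents=True, exist_ok=True)
    return path


# Example call (creates D/results/test_case_23_6sp relative to the current directory):
# ensure_results_dir("D", "test_case_23_6sp")
```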