Code for CMF (contrast motif finder) can be found at http://www.stat.ucla.edu/~zhou/CMF/

CMF was created for Unix machines but should work on Linux and Windows systems too.

To compile CMF, ensure that the files cmf.cpp and cmf.h are in the same folder and type

    g++ cmf.cpp -o cmf

After compiling, make sure that cmf is executable: check with "ls -l cmf" and use "chmod" if necessary.

RUNNING CMF:

CMF is run from the command line with the following options (just type ./cmf to see these options):

    -w    the length of the motif seed, default = 7
    -m    number of mismatches in seed, default = 2
    -F    the FDR level for determining the LR cutoff for identifying TF binding sites, default = 0.667
    -t    the number of top seeds to test, default = 10
    -i1   first set of sequences, should be FASTA formatted
    -i2   second set of sequences, should be FASTA formatted
    -d    1: enrichment only in i1 (traditional scenario), 2: enrichment in both datasets (contrasting scenario), default = 1
    -l    lower bound on length of motifs, default = 5
    -u    upper bound on length of motifs, default = 20
    -o    output of seed statistics
    -f    folder for all other output
    -c    C/G content filter; if > 0, filter out seeds based on C/G content (e.g. -c 4 filters out seeds with more than 4 C's or G's), default = 0

Input files must be in FASTA format. If repeat masking is desired, repeat regions should be masked with "N", since lowercase nucleotides are converted to uppercase. If the input files differ greatly in C/G content, we recommend considering the -c option (this should be done with care, since some motifs are C/G rich, e.g. Klf4).

Below is an example of how to find the Oct4 motif in the traditional scenario (i.e. enriched in a bound set of sequences as compared to a control):

    ./cmf -w 7 -m 2 -d 1 -t 50 -i1 youngOct4Bound.txt -i2 youngOct4Control.txt -o seedsInfo.txt -f outputFolder/

To contrast two sets of bound sequences, simply change the -d option to 2 and use the appropriate sequences:

    ./cmf -w 7 -m 2 -d 2 -t 50 -i1 youngOCT4wSOX2.txt -i2 youngOCT4woSOX2.txt -o seedsInfo.txt -f outputFolder/

OUTPUT:

Output will be in the folder specified by the -f option, in a file called "output.txt". For each seed successfully updated into a motif, the file contains:

    The consensus motif
    The initial seed
    The m flexible positions in the seed
    The likelihood threshold
    The t-score
    The enrichment (log2)
    The PWM

Additionally, positive and negative motif sites are given. For each site, the 1) sequence name, 2) location in the sequence, and 3) likelihood ratio score are given in tab-delimited format. The actual w-mer at that site is given on the next line.
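
CHECKING C/G CONTENT (OPTIONAL):

The recommendation above to consider the -c option when the two input files differ greatly in C/G content can be checked before running CMF. The following Python sketch is not part of CMF; the script name gc_check.py and the function gc_fraction are hypothetical names chosen for illustration. It assumes plain FASTA input and counts only A, C, G and T, so "N"-masked positions are ignored. Lowercase letters are converted to uppercase first, matching what CMF is described as doing.

```python
import sys


def gc_fraction(fasta_path):
    """Return the fraction of A/C/G/T bases that are C or G in a FASTA file.

    Hypothetical helper, not part of CMF. Header lines (starting with '>')
    and any characters other than A, C, G, T (such as 'N') are ignored.
    """
    gc = 0
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith(">"):
                continue  # skip blank lines and sequence names
            seq = line.upper()  # treat lowercase bases like uppercase ones
            gc += seq.count("C") + seq.count("G")
            total += sum(seq.count(base) for base in "ACGT")
    return gc / total if total else 0.0


if __name__ == "__main__":
    # Print one tab-delimited line per FASTA file given on the command line.
    for path in sys.argv[1:]:
        print(f"{path}\t{gc_fraction(path):.3f}")
```

For example, if saved as gc_check.py, it could be run as "python gc_check.py youngOct4Bound.txt youngOct4Control.txt". How large a difference counts as "greatly" is left to the user; the script only reports the two fractions.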