_____________________________________________________________________
|                                                                     |    
| SOME NOTES ON GENETIC DISTANCE AND LOD SCORE, HOPEFULLY CLARIFYING  |
| TERMINOLOGY AND APPROPRIATE USE                                     |
|_____________________________________________________________________|


Chiasmata, Crossovers, Recombinations and Distances
===================================================

These notes are mostly a memo for myself.
The Haldane function maps genetic distance to the recombination
probability between markers.  If you consider two markers on a
chromosome, you can speak of three kinds of events that can take place
between them; they are related to each other and to the genetic distance.

1. chiasma, plural chiasmata (a cross in the chromatid bundle, before the
   formation of gametes; for one chiasma you'll have 2 gametes with a
   crossover and 2 without)
2. crossover (now we are looking at a single chromosome and seeing whether
   it is formed by pieces of the two original ones or not)
3. recombination == any odd number of crossovers (if you just look at
   the marker alleles and you had two crossovers between the markers,
   you'll see the same marker alleles as in the chromosome without crossover)

| *         | | * *       |  |  *  *   
| *         | * | *       |  *  |  *

original    chromatid     four gametes  
autosomal   bundle        (two with crossover 
            with 1         and two  without)
            chiasma 


The genetic distance is the expected number of crossovers per gamete.
In terms of probabilities, here is how it goes when you assume the
Poisson model.
               
1.
The number of chiasmata between two markers that have distance d follows a
Poisson distribution with mean 2d (distance = 1/2 expected number of
chiasmata)

Pr (at least 1 chiasma) = 1 - exp{-2d}

2.
The number of crossovers between two markers that have distance d follows a
Poisson distribution with mean d (distance=expected number of crossovers 
per gamete)

Pr (at least one crossover) = 1 - exp{-d}

3.
To get to recombination, one has to consider
(A) no recombination= 0 or even number of crossovers
(B) recombination = odd number of crossovers

Pr(recombination between markers) = 1/2 Pr(more than 0 chiasmata)
(see Lange's book, p. 207, summarized below)

= 1/2 (1 - exp{-2d})

This last is Haldane's formula, connecting recombination to distance.
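
A minimal Python sketch of Haldane's formula and its inverse, assuming
d is measured in Morgans. The function names are mine, chosen for
illustration only.

```python
import math

def haldane_theta(d):
    """Recombination fraction for map distance d (Morgans), Haldane's model."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def haldane_distance(theta):
    """Inverse map: distance (Morgans) for a recombination fraction theta."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must lie in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * theta)

# theta grows like d for small d and saturates below 1/2 for large d
for d in (0.01, 0.1, 0.5, 1.0, 3.0):
    print(f"d = {d:4.2f} M  ->  theta = {haldane_theta(d):.4f}")
```

The inverse is only defined for theta < 1/2, which is why the sketch
rejects larger values.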

Among the three concepts, I think crossover is the really natural one.
Anyway, if you are considering markers that are close together, it is true that

(A) no recombination = 0 crossovers
(B) recombination = 1 crossover 
      (whose probability is then about the same as that of at least
       1 crossover)

Which is to say that if d is close to 0
1 - exp{-d} ~ 1/2 (1 - exp{-2d})        (both are approximately d)

so that you can effectively forget about recombination and think of 
crossovers.
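
To see how good the small-distance approximation is, here is a short
Python sketch comparing the two probabilities. The helper names are
hypothetical. Algebraically the gap equals 1/2 (1 - exp{-d})^2, so the
relative error is about d/2 for small d.

```python
import math

def p_crossover(d):
    """Pr(at least one crossover) under the Poisson model, d in Morgans."""
    return 1.0 - math.exp(-d)

def p_recombination(d):
    """Haldane's recombination fraction, d in Morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def relative_gap(d):
    """Relative excess of the crossover probability over recombination."""
    return (p_crossover(d) - p_recombination(d)) / p_recombination(d)

for d in (0.001, 0.01, 0.1, 1.0):
    print(f"d = {d:5.3f}  crossover = {p_crossover(d):.5f}  "
          f"recombination = {p_recombination(d):.5f}  "
          f"relative gap = {relative_gap(d):.3f}")
```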


Mather's formula and why recombination < 1/2
============================================

Pr(2 loci are recombinant)= ?

Pr(2 loci are recombinant | n chiasmata between them) = r_n

For n >= 1 (and r_0 = 0, since without chiasmata there is no recombination):

r_n = 1/2 r_(n-1) [they were recombinant with n-1 chiasmata and are not
 affected by the new chiasma] + 1/2 (1 - r_(n-1)) [they were not
 recombinant and they became so because of the last chiasma]

so r_n = 1/2 r_(n-1) + 1/2 (1 - r_(n-1))
r_n = 1/2

Illustration:
-------------

( 1 chiasma )

| *         | | * *       |  |  *  *   
| *         | * | *       |  *  |  *
                          n  r  r  n

( 2 chiasmata )

| *         | | * *       |  |  *  *   
| *         | * | *       |  *  |  *
| *         * | | *       *  |  |  *
| *         * | | *       *  |  |  *
                          r  n  r  n

the frequency of recombinant gametes is always 1/2, 
given any positive number of chiasmata


Pr(2 loci are recombinant) = 
sum_{i=0}^{\infty} Pr(2 loci are rec | there are i chiasmata between
them) Pr(there are i chiasmata)
= Pr(2 rec | 0 chiasmata) Pr(0 chiasmata) + 
     + sum_{i=1}^{\infty} 1/2 Pr(there are i chiasmata) =
= 0 + 1/2 Pr(at least one chiasma)
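
The sum can be checked numerically. The sketch below assumes, as in the
first section, that the number of chiasmata is Poisson with mean 2d. It
adds up 1/2 Pr(i chiasmata) for i >= 1 and compares the result with
Haldane's closed form. The truncation point max_chiasmata is an
arbitrary choice of mine.

```python
import math

def poisson_pmf(k, mean):
    """Pr(N = k) for a Poisson variable with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def recombination_by_summation(d, max_chiasmata=100):
    """Sum r_i * Pr(i chiasmata), with r_0 = 0 and r_i = 1/2 for i >= 1."""
    mean = 2.0 * d  # expected number of chiasmata at distance d
    return sum(0.5 * poisson_pmf(i, mean)
               for i in range(1, max_chiasmata + 1))

d = 0.5
print(recombination_by_summation(d))       # term-by-term sum
print(0.5 * (1.0 - math.exp(-2.0 * d)))    # Haldane's closed form
```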


And the result is really a combination of the fact that you have 
1) two recombinant gametes per chiasma
2) a sort of equal chance of having an odd or an even number of
   crossovers at a certain point.


LOD Scores
==========

(A) 

The basic idea of the tests for linkage is a likelihood ratio test.

      L_0
LR = ----- is the test statistic and H_0 is rejected if LR < c
     L_max

-2log(LR) ~ Chisquare ( dim(unrestricted)-dim(restricted) )


How we go from this statistic and this distribution to the LOD scores
used in the genetics literature is a mixture of tradition, misnomer
and savvy evaluation of the peculiarities of the linkage problem.

(B)

Let's get some nomenclature straight, first

                 L_max
LOD score = Log (-----)
                  L_0

so, with respect to the general statistical practice, you take the
inverse of the LR and, most importantly, you use the Log base 10
rather than the natural log. Since
Log10(1/LR) = -2 ln(LR) / (2 ln 10), this means that the LOD score is
distributed as

LOD = Chisquare / (2 ln 10) ~ .217 Chisquare 
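
The conversion factor is just 1/(2 ln 10). A small Python sketch, with
names of my own choosing:

```python
import math

# LOD = log10(L_max/L_0) = (-2 ln LR) / (2 ln 10)
LOD_PER_CHISQ = 1.0 / (2.0 * math.log(10.0))

def lod_from_chisq(x):
    """LOD score corresponding to a likelihood-ratio chi-square value x."""
    return x * LOD_PER_CHISQ

def chisq_from_lod(lod):
    """Likelihood-ratio chi-square value corresponding to a LOD score."""
    return lod / LOD_PER_CHISQ

print(f"factor = {LOD_PER_CHISQ:.4f}")
print(f"LOD 3 corresponds to chi-square {chisq_from_lod(3.0):.2f}")
```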

(C) 

The name LOD stands for Log of the Odds, and it is evidently a
misnomer. The odds would be the odds of linkage vs non linkage

             P(A)
odds (A) = -------- 
            1- P(A)

obviously, the LOD is not a log of the ratio of probability of Linkage
vs probability of non linkage, because the likelihoods cannot be
interpreted in this way.

Another possible way of reconnecting it to the odds would be through
the odds ratio

              odds(A|link)
odds ratio = --------------
             odds(A|no link)

where A are the data. However, this is not correct either.

(D) 

It is common practice to say that we can reject the null hypothesis of
non linkage if the LOD score is bigger than 3.

If we refer to the distribution of the LOD score we notice that

Pr( LOD >3 | no link)=
Pr( LOD /.2172 > 3 /.2172 | no link)=
Pr (chisquare_1 > 13.81)= 0.0002022568

so that the cutoff does not correspond to a .05 significance level,
but to a much smaller one.
Why is this so?

There are a variety of justifications for this, some of historical
value and not really applicable now, and some of relevance today. It
is rather remarkable that they all agree on the value three. It has
to be stressed, however, that a lot of the ``hype'' around the value
three is purely traditional.
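
The tail probability quoted in (D) can be reproduced with the standard
library alone. The sketch relies on the fact that a chi-square with 1
degree of freedom is the square of a standard normal, so its upper tail
is erfc(sqrt(x/2)). Function names are hypothetical.

```python
import math

def chisq1_upper_tail(x):
    """Pr(chisquare_1 > x) = Pr(|Z| > sqrt(x)) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

def pvalue_from_lod(lod):
    """Significance level of a LOD cutoff, via LOD = chisq / (2 ln 10)."""
    return chisq1_upper_tail(2.0 * math.log(10.0) * lod)

print(f"Pr(LOD > 3 | no link) = {pvalue_from_lod(3.0):.6f}")
print(f"Pr(chisquare_1 > 3.84) = {chisq1_upper_tail(3.84):.4f}")
```

The second line is only there as a sanity check against the familiar
0.05 cutoff.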

(E)

The first person to give a rigorous treatment of the LOD scores and to
introduce the cut off value of three was Morton.
The context of his study was: 

[1]  comparing two loci at a time (not just in the sense of pairwise
     comparison, but meaning that the entire dataset included only
     information about two loci)
[2]  sequential tests: the data were collected one family at a time
     and he proposed to keep collecting and stop when the appropriate
     cutoffs were reached, suggesting that this would lead to a
     correct statistical result and that it would require examining
     fewer subjects.

These are the consequences of this approach:

-> the parameter is the recombination fraction (there is really no need
   to convert this to distances if you are looking at only two loci, it 
   would be absolutely equivalent)

-> to be able to add the contributions of the various families in a
   sequential manner, the Log of 1/LR had to be calculated for
   various values of theta (the max for one family may not be the max for
   the other), so that one calculated the Log likelihood ratio function
   for each family

-> the cutoff values of a sequential test have to be more extreme than
   for a normal test, as they have to say that the evidence suffices and
   that no further observation could change the outcome. The cutoff
   values that Morton obtained were 3 and -2 for a power of .99 and
   significance of 0.001
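
As a plausibility check, the sketch below uses Wald's approximate
boundaries for a sequential probability ratio test,
A = (1 - beta)/alpha and B = beta/(1 - alpha), taken on the Log10
scale. That these are the boundaries behind the 3 and -2 is my
assumption; the notes above only report the resulting values.

```python
import math

def wald_lod_thresholds(alpha, power):
    """Approximate sequential-test boundaries on the log10 scale.

    alpha: significance level; power: 1 - beta.
    Returns (upper, lower): stop and declare linkage above `upper`,
    stop and accept non linkage below `lower`.
    """
    beta = 1.0 - power
    upper = math.log10(power / alpha)
    lower = math.log10(beta / (1.0 - alpha))
    return upper, lower

upper, lower = wald_lod_thresholds(alpha=0.001, power=0.99)
print(f"upper = {upper:.3f}, lower = {lower:.3f}")
```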

Morton had decided on a significance level of 0.001 on the basis of a
Bayesian argument. His point was that the event that two random loci
are linked is very rare and one should take this into account before
declaring that enough evidence has been gathered to prove it.
A modified version of Morton's Bayesian reasoning justified the fact
that the 3 cutoff value stayed in use long after the sequential test
procedure was gone.

(F) 

With this Bayesian reasoning, we have abandoned the sequential test
procedure, but we are still in a world where 

[1] we compare two loci.

Let D be the data, L linkage and NL non linkage.


                P(D|L)P(L)
Pr(L|D) = ------------------------
          P(D|L)P(L) + P(D|NL)P(NL)


Now, P(L) has been calculated to be 0.02, so P(NL)/P(L) = .98/.02 = 49.
Dividing numerator and denominator by P(D|NL)P(L), one gets

          
               P(D|L)/P(D|NL) 
Pr(L|D) = ------------------------
            P(D|L)/P(D|NL) + 49


So, if we want Pr(L|D) > .95, then we need P(D|L)/P(D|NL) > 931, or
roughly 1000, which gets our lod score of 3.
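
The arithmetic is easy to replay in Python. The sketch assumes the
prior P(L) = 0.02 used above; the function names are mine.

```python
def posterior_linkage(likelihood_ratio, prior=0.02):
    """Pr(L|D) from the ratio P(D|L)/P(D|NL) and the prior P(L)."""
    prior_odds_against = (1.0 - prior) / prior  # 49 when prior = 0.02
    return likelihood_ratio / (likelihood_ratio + prior_odds_against)

def ratio_needed(posterior, prior=0.02):
    """Smallest likelihood ratio giving the requested posterior."""
    return posterior / (1.0 - posterior) * (1.0 - prior) / prior

print(f"ratio needed for Pr(L|D) = .95: {ratio_needed(0.95):.0f}")
print(f"Pr(L|D) at ratio 1000 (LOD 3): {posterior_linkage(1000.0):.3f}")
```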

We now want to clarify how one gets P(L) = 0.02. This is, as far as I
can tell, the work of Elston and Lange.

P(L) = Pr(two genes are on the same chromosome, less than 30 Mb
       apart)

G = total genome length ~ 3300 Mb
G_k = length of the k-th chromosome
d = 30 Mb

P(L) = sum_{k=1}^{22} P(l1 and l2 both on chr k) 
                            x P(l1 and l2 linked | both on chr k)

                                           G_k^2 - (G_k - d)^2
 P(l1 and l2 linked | both on chr k) = ---------------------
                                                G_k^2

(to see why draw a square ...)

 P(L) = sum_{k=1}^{22} (G_k/G)^2 (1 - (1 - d/G_k)^2) =
        
         2d              d^2
      = ----- sum G_k - ----- sum 1 =
         G^2             G^2
 
         2dG     22 d^2     2d      1
      = ----- - -------- ~ ---- ~ ----
         G^2      G^2       G       50
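
A numerical sketch of this calculation. The real chromosome lengths
are not given in these notes, so the example uses 22 chromosomes of
equal length G/22, which is purely an illustrative assumption. The
function name is mine as well.

```python
def prob_linked(chrom_lengths, d):
    """P(L): two random loci on the same chromosome, less than d apart."""
    total = sum(chrom_lengths)
    p = 0.0
    for g_k in chrom_lengths:
        both_on_k = (g_k / total) ** 2
        # area of the square minus the two corner triangles, over the area
        within_d = 1.0 - (1.0 - d / g_k) ** 2 if d < g_k else 1.0
        p += both_on_k * within_d
    return p

G, d = 3300.0, 30.0
equal_lengths = [G / 22] * 22   # hypothetical: all chromosomes equal
print(f"P(L), exact sum  = {prob_linked(equal_lengths, d):.4f}")
print(f"2d/G             = {2 * d / G:.4f}")
```

With these made-up lengths both numbers come out a little below 0.02,
which is in line with the rounding to 1/50 above.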

(G)

With time, the context has changed: instead of comparing a pair of
loci, current linkage studies compare one disease locus, say, with a
series of markers scattered around the genome and covering it, so that
the prior probability that the gene of interest is within 30 Mb of one
of the markers in the dataset is 1.
Clearly, this is a different situation from the above. How come, then,
that we are still looking for LOD scores of three or more?

The answer, this time, depends on the multiple comparison phenomenon.
If we conduct 100 tests in which the null hypothesis is true, and each
of them rejects it wrongly with probability 0.05, on average we will
have 5 tests that result in a wrong rejection. To avoid this problem,
one has to lower the significance level. 
In the case of n independent tests, to guarantee an overall
significance level sall, the single-test significance level sone has to
be adjusted downward

sall =  1-(1-sone)^n
      
sone =  1- (1-sall)^(1/n)

which translates approximately into sone ~ sall/n 

So if we consider 200 markers and sall = 0.05, then sone = 0.00025,
which translates into a chisq cutoff value of 13.41215 and a lod score
cutoff value of approx 3, and there we go.
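
The same numbers can be recomputed with the Python standard library.
The sketch assumes sall = 0.05 and gets the 1 degree of freedom
chi-square quantile from the normal quantile (a chisquare_1 variable
is a squared standard normal). Function names are hypothetical.

```python
import math
from statistics import NormalDist

def per_test_level(overall, n):
    """Exact single-test level for n independent tests."""
    return 1.0 - (1.0 - overall) ** (1.0 / n)

def chisq1_cutoff(p):
    """Value x such that Pr(chisquare_1 > x) = p."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return z * z

def lod_cutoff(p):
    """LOD cutoff with single-test significance level p."""
    return chisq1_cutoff(p) / (2.0 * math.log(10.0))

print(f"exact sone   = {per_test_level(0.05, 200):.6f}")
print(f"approx sone  = {0.05 / 200:.6f}")
print(f"chisq cutoff = {chisq1_cutoff(0.00025):.3f}")
print(f"lod cutoff   = {lod_cutoff(0.00025):.2f}")
```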

Actually, the tests conducted in genome screens are not
independent. Lander and Kruglyak (1995) have done simulations that
suggest that 3.3 may be an appropriate cutoff level in the
non-independence case. So the value 3 is still a working tool.