xlispstat glossary
Some examples and commands that we'll use in class. Check back
for updates.
Boxplots
Suppose you have two variables, x and y, and you want to make side-by-side
boxplots to compare them:
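A minimal sketch of one way to do this, assuming x and y are lists of numbers (the values below are made up for illustration). Passing a list of variables to boxplot should draw one box per variable in a single window:

```lisp
; Made-up data, for illustration only.
(def x (list 2 4 4 5 7 9))
(def y (list 1 3 6 8 8 12))

; One boxplot per list, side by side in the same plot window.
(boxplot (list x y))
```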
Correlation/Covariance
Nothing is ever easy... To compute a correlation, you must first compute
the covariance. To do this, you need to compute the covariance matrix,
which is a matrix that has the variances of x and y on the diagonal, and
the covariance on the off-diagonal. I'm going to do this the slow, simple,
imprecise way, but you can combine these steps into one to get a more
precise answer. Also, if you want to try your hand at writing functions,
you can write your own correlation function. (I'm not going to have time
to teach this until much later, though. So you're on your own.)
This example assumes you have variables named weight and height, perhaps
from the classdata. But any two variables with equal lengths will do.
- 1. (def ourmat (covariance-matrix weight height))
- 2. (print-matrix ourmat)
- 3. (def sweight (standard-deviation weight))
- 4. (def sheight (standard-deviation height))
- 5. (/ -10.94 (* sweight sheight))
Notes:
#1 creates a covariance matrix named "ourmat". #2 prints it. The
off-diagonal entries are the covariance. #3 defines the standard deviation
of weight as "sweight" (this is just the square root of the first diagonal
entry in the matrix). #4 does the same for height. #5 computes the
correlation. The -10.94 is the off-diagonal entry in "ourmat". This is the
sample covariance. I don't know how to extract this directly from the
matrix, so I just typed it in.
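The typed-in -10.94 can probably be avoided. A covariance matrix is an ordinary Lisp array, so, assuming aref works on it as it does on other matrices, the entry in row 0, column 1 is the covariance. A sketch with made-up data (the name cov-wh is just for illustration):

```lisp
; Made-up data, for illustration only.
(def weight (list 150 160 172 181 195))
(def height (list 64 66 69 70 74))

(def ourmat (covariance-matrix weight height))

; Row 0, column 1 is the off-diagonal entry: the sample covariance.
(def cov-wh (aref ourmat 0 1))

; Correlation = covariance / (sd of weight * sd of height).
(/ cov-wh (* (standard-deviation weight) (standard-deviation height)))
```

Because nothing is retyped by hand, no rounding creeps into the answer.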
A Word of Caution: the class data contain missing values, which are coded
as -9. In practice, these need to be removed before computing any
statistics. For now, we assume there are no missing observations.
Regression
Assume you already have two lists of equal length called x and y. We have
two things to handle here: performing a regression and viewing the output.
To do both things requires learning a new sort of lisp object. Do the
following:
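A minimal sketch of the basic call, assuming x is the predictor and y is the response (the data below are made up). By default regression-model prints a summary as soon as the fit is done:

```lisp
; Made-up data, for illustration only.
(def x (list 1 2 3 4 5))
(def y (list 2.1 3.9 6.2 7.8 10.1))

; Fits y on x and prints the summary.
(regression-model x y)
```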
This will provide simple output: coefficients, R squared (the squared
correlation coefficient), and estimate of error standard deviation. To do
more complicated displays:
- (def myregress (regression-model x y :print nil))
- (send myregress :help)
The first step created a "regression object" called myregress. (The
":print nil" keeps it from printing the summary right away.) Now that
you have such an object, you can send it messages. In return, it will
respond to your messages. To send a message, you type "(send myregress
:message)" where the message you wish to send follows the colon.
What messages can you send? Well, in the example above we sent the message
":help". In return, we got a list of all allowable messages that we could
have sent. For example, ":coef-estimates" would have returned the values of
the estimated intercept and slope. ":plot-residuals" would have plotted the
residuals.
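As a sketch of what sending those two messages looks like (the data are made up):

```lisp
; Made-up data, for illustration only.
(def x (list 1 2 3 4 5))
(def y (list 2.1 3.9 6.2 7.8 10.1))

; Build the regression object without printing the summary.
(def myregress (regression-model x y :print nil))

; Returns a list of the estimated coefficients (intercept first).
(send myregress :coef-estimates)

; Opens a plot window showing the residuals.
(send myregress :plot-residuals)
```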
Generating Data
You can make up your own data from various probability distributions.
- (uniform-rand 50)
- creates 50 observations from a (continuous) uniform distribution on
the interval (0,1)
- (binomial-rand 50 5 .3)
- generates a sample of 50 binomial observations, each one the number of
"heads" in 5 tosses of a coin whose probability of landing "heads" is 0.30.
- (normal-rand (list 10 10 10))
- generates a list of three lists, each of which has ten standard normal
random observations.
- (sample (iseq 1 20) 5)
- draws a random sample of size 5, without replacement, from the
integers 1 to 20.
- (sample (iseq 1 20) 5 t)
- draws the same kind of sample, but now with replacement.
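A quick way to convince yourself that the generated data behave as described is to save a sample and summarize it. A small sketch (the names u and b are just for illustration, and the results change every time because the data are random):

```lisp
; 50 uniform(0,1) observations.
(def u (uniform-rand 50))
(length u)   ; how many observations were drawn
(mean u)     ; should land near 0.5

; 50 binomial observations with n = 5 and p = .3.
(def b (binomial-rand 50 5 .3))
(mean b)     ; should land near n*p = 1.5
```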
Saving Work
If you want to record everything you do, type
- (dribble "myfile")
Everything you type and all output goes into this file. You end
the recording by typing
- (dribble)
To save variables (for example, suppose you have (def height (list 1 4
5)) and (def weight (list 3 5 6)) ):
- (savevar 'height "height") OR for more than one:
- (savevar '(height weight) "myvariables")
You can now exit safely. When you return to xlispstat, type
- (load "myvariables") to retrieve the information.
- (variables) gives you a list of all defined variables.
- (undef 'height) removes, or "undefines", the variable height.
Help
Online help is not great in xlispstat, but there are some choices:
- If you know the name of a function, for example, "median", then
(help 'median) will tell you something about it. (Note the quote before
the function name.)
- If you know the name of part of the function, for example, you know
it has "norm" in it, use (help* 'norm) to get help for ALL functions with
a "norm" in them.
- If you know the name of part of the function, but don't want help for
ALL functions with that name in them, type (apropos 'norm) to get a list of
what these functions are, and then use (help 'function) to get help for
the one you want.
The help function uses particular notation to tell you about the
functions. For example, you'll see something like Args: (x y z), which
means the function requires 3 input arguments. You might also see
Args: (x &optional y (z t)), which means that x is required, but y and
z are optional, and you can call (function-name x), (function-name x y),
or (function-name x y z). But you can NOT skip y and give only z: in
(function-name x z), the second value is taken as y. The (z t) means
that if you DON'T input z, it will be given the value "t".
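To see this notation in action, here is a small sketch that defines a function with exactly that argument list. The function show-args is made up for illustration; it just returns its three arguments as a list:

```lisp
; Hypothetical function with Args: (x &optional y (z t)).
(defun show-args (x &optional y (z t))
  (list x y z))

(show-args 1)       ; y is left as NIL, z gets its default value T
(show-args 1 2)     ; y is 2, z gets its default value T
(show-args 1 2 3)   ; all three supplied
```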
Binomial Density
Other densities follow similar commands.
- To calculate the probability that X=k, where X is a binomial random
variable with parameters n (number of coin flips) and p (probability of a
Head), type (binomial-pmf k n p). pmf stands for probability mass
function.
- To calculate the Cumulative Distribution Function at any value of x,
type (binomial-cdf x n p)
- For example, a "loaded" coin lands on Heads with probability .4.
Suppose we toss the coin 10 times.
What's the probability of getting exactly 5 heads? What's the probability
of getting 5 or fewer heads? What's the probability of getting more than
8 heads?
- (binomial-pmf 5 10 .4) returns 0.200658
- (binomial-cdf 5 10 .4) returns 0.833761
- (- 1 (binomial-cdf 8 10 .4)) returns 0.00167772
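The cdf is just the pmf added up, which gives a handy check on the answers. A sketch, assuming binomial-pmf accepts a list of k values (xlispstat's distribution functions are vectorized):

```lisp
; P(X <= 5): add up P(X = 0), ..., P(X = 5).
; Should agree with (binomial-cdf 5 10 .4).
(sum (binomial-pmf (iseq 0 5) 10 .4))

; P(X > 8): only 9 or 10 heads are possible.
; Should agree with (- 1 (binomial-cdf 8 10 .4)).
(+ (binomial-pmf 9 10 .4) (binomial-pmf 10 10 .4))
```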
Normal Density
Xlispstat does the standard normal, and leaves you to compute the rest.
Try (help* 'normal) for a full listing.
- To compute the probability of being less than or equal to x, use
(normal-cdf x) (Question: What if your rv X has mean 10 and SD 3?)
- To compute the quantile for a probability p between 0 and 1 (the
100p-th percentile), use (normal-quant p)
- To generate a list of n numbers from a standard normal distribution,
type (normal-rand n), for example (normal-rand 5)
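A few standard normal calls, as a sketch of how the three functions fit together:

```lisp
; P(Z <= 1.96), which is about 0.975.
(normal-cdf 1.96)

; The reverse question: which z has 97.5% of the area below it?
; About 1.96.
(normal-quant 0.975)

; Five standard normal random numbers (different every time).
(normal-rand 5)
```

Note that normal-cdf and normal-quant undo each other.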
Missing Data
If your data set has missing data (for example, the "student" data set
from HW 8), you'll need to treat the data specially. There are statistical
concerns here (sometimes ignoring missing data can introduce bias), but in
this case it's safe to just ignore the missing values. But you have to have
a way to tell the computer to ignore them.
The first step is to save the data and then edit it with your text
editor. In this case the missing values are indicated by a "." where
the number should be. (That's a period, by the way.) Because xlispstat
has a hard time mixing numbers with characters (as do most statistical
packages), a standard technique is to replace the "." with a value that
could not possibly be part of the data. In this case, since the data are
SAT scores and therefore positive, replacing "." with any negative number
will do. A common choice is -9. Use the "search and replace" feature of
your text editor, but be careful! If you just type ".", then the editor
will also replace decimal points, as well as periods! You can get around
this by replacing " . ", that is, space-period-space, with a -9.
Next, enter xlispstat and load the data as usual (using
read-data-columns). Define your two sat variables, say
- (def satverb (select student.dat 2)) (Remember that xlispstat starts
counting with 0, not with 1.)
- (def satmath (select student.dat 3))
Now we need to create new variables that do not have the -9's in them. The
command to do this is
- (def satverb2 (select satverb (which (/= -9 satverb))))
(Give the result a new name, as here, so that the original satverb, -9's
and all, is still available.) And you just replace "satmath" for "satverb"
to do the same for the satmath variable. This is how this works:
- (/= -9 satverb)
- This command returns a list of T's and NIL's (NIL is how Lisp writes
"false"): T if the ith element of satverb is not equal to -9, NIL if it is.
- (which ...)
- "which" returns the indices which have T's. Hence if the first
command returns (T T NIL NIL T NIL), then "which" returns (0 1 4).
(Remember that the first element of a list has index 0, not index 1.)
- (select satverb (which ...))
- You've seen this command before. (select list index) selects only
those items in "list" that are listed in "index". In this case, "list" is
"satverb" and "index" is the result of the "which" command.
Be sure to create a new vector for gender that ignores the same missing
values. Build the index from the original satverb, the one that still
contains the -9's:
- (def gender (select student.dat 0))
- (def gender2 (select gender (which (/= -9 satverb))))
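Here is the whole idea on a tiny made-up data set, so you can check each step by eye. Saving the index list once (under the illustrative name keep) guarantees that both variables lose exactly the same entries:

```lisp
; Made-up data: two verbal scores are missing (coded -9).
(def satverb (list 500 -9 620 -9 710))
(def gender  (list 1 0 0 1 1))

; Indices of the non-missing scores: (0 2 4).
(def keep (which (/= -9 satverb)))

; Keep only those entries in each variable.
(def satverb2 (select satverb keep))   ; (500 620 710)
(def gender2  (select gender keep))    ; (1 0 1)
```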
Selecting Based on Variable Values
The "student" data set for HW8 contains a variable for gender (1 for
female, 0 for male) and another for sat score. To compare the sat scores
for men and women, you need to create two new variables, satm and satw, for
example, so that satm has the sat scores only for the men, and satw only
for the women.
We assume that the gender variable and the sat variable are the same
length, so if you removed missing values from sat, you have to remove the
SAME entries from gender. (Note that gender is not missing any values, so
you have to remove the same ones that are missing from either satmath or
satverb (they are the same in this case) to make sure both variables are
the same length.)
The following example creates a list of sat verbal scores for women.
- (def satv (select student.dat 2))
- (def gender (select student.dat 0))
- (def satv2 (select satv (which (/= -9 satv))))
- (def gender2 (select gender (which (/= -9 satv))))
- (def satvfemale (select satv2 (which (= 1 gender2))))
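The scores for men follow the same pattern with 0 in place of 1, and the two groups can then be compared with side-by-side boxplots. A sketch on made-up data that stands in for satv2 and gender2 (the name satvmale is just for illustration):

```lisp
; Made-up stand-ins for the cleaned variables.
(def satv2   (list 500 620 710 480 650 590))
(def gender2 (list 1 0 1 0 0 1))

; Scores for women (gender 1) and men (gender 0).
(def satvfemale (select satv2 (which (= 1 gender2))))   ; (500 710 590)
(def satvmale   (select satv2 (which (= 0 gender2))))   ; (620 480 650)

; One box per group in the same window.
(boxplot (list satvmale satvfemale))
```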
Using XLISPSTAT as a Calculator
The trick here is to remember that the operator comes first, so 3+4 is
(+ 3 4).
- Adding two lists: if x and y are each lists of the same length, then
(+ x y) results in a new list in which each x_i was added to y_i
- Exponents: (^ 4 3), for example, is 4 raised to the 3rd power.
- (^ 4 (/ 1 2)) is 4 raised to the 1/2 power (in other words, the
square root of 4).
- (log 10) gives the natural log (base e) of 10
- (exp y) is e raised to the power y, the inverse of the natural log
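A short session that tries each of these, as a sketch (the lists are made up):

```lisp
; 3 + 4: the operator comes first.
(+ 3 4)

; Adding two lists element by element: (11 22 33).
(+ (list 1 2 3) (list 10 20 30))

; 4 cubed, and the square root of 4.
(^ 4 3)
(^ 4 (/ 1 2))

; Natural log of 10, about 2.302585.
(log 10)

; exp undoes log, so this gives back 10 (up to rounding).
(exp (log 10))
```

Operations nest the same way: 3 + 4 * 5 is written (+ 3 (* 4 5)).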