Transforming Data

Statisticians often work with transformed data. The most common reason is that a transformation can make the distribution of the data more bell-shaped, which can increase power and puts hypothesis tests that assume normality on surer ground. Transformations are also used in regression contexts to make relationships more linear and to "stabilize variance", that is, to make the spread of the response roughly constant.

There are many rules of thumb for how to do this, but the best is common sense: stick to transformations that are common and meaningful. It doesn't make sense to raise the data to the power of 1/pi, for example, but taking the square root is a little more comprehensible.

How do you choose a transformation? Because of today's computers, this has never been easier. One common approach is to use what is called the Box-Cox family of transformations. Say that you have a positive-valued variable x that you want to transform. The Box-Cox transformation defines x^(lambda) as

x^(lambda) = (x^lambda - 1)/lambda if lambda is nonzero and

x^(lambda) = log(x) if lambda is 0.

For example, if you set lambda = 1, you are merely subtracting 1 from each observation: (x^1 - 1)/1 = x - 1. If you set lambda = 2, you are essentially squaring the observations (and then subtracting 1 and dividing by 2): (x^2 - 1)/2. If lambda is 1/2, you are essentially taking a square root: (x^(1/2) - 1)/(1/2) = 2*sqrt(x) - 2. The log case fits the same family because (x^lambda - 1)/lambda approaches log(x) as lambda approaches 0.
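To make the special cases concrete, here is a minimal Python sketch of the Box-Cox formula applied to a single value. The function name box_cox and the sample value 9 are chosen here for illustration only; they are not from any particular package.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of one positive value x for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox is defined only for positive x")
    if lam == 0:
        return math.log(x)           # the lambda = 0 case
    return (x ** lam - 1) / lam      # the general case

x = 9.0                              # illustrative value only
shift = box_cox(x, 1)                # (9 - 1)/1: just subtracts 1
square = box_cox(x, 2)               # (81 - 1)/2: squaring, shifted and scaled
root = box_cox(x, 0.5)               # (3 - 1)/0.5: square root, shifted and scaled
log_case = box_cox(x, 0)             # log(9)
near_zero = box_cox(x, 1e-6)         # tiny lambda gives nearly the same as log(9)

print(shift, square, root, log_case, near_zero)
```

Comparing log_case with near_zero shows why the log is slotted in at lambda = 0: the general formula slides smoothly into it.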

ARC does this automatically (with the "transformation slidebar" option), and you can also create a similar slidebar in Fathom.
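If you have neither package handy, you can imitate moving a slidebar by trying a grid of lambda values and scoring each one. The sketch below scores each lambda with the Box-Cox profile log-likelihood under a normal model, which is one standard criterion; this is an assumption for illustration, not a description of what ARC or Fathom compute internally. The data are artificial: their logs are exact normal quantiles, so a lambda near 0 should win.

```python
import math
from statistics import NormalDist

def box_cox(x, lam):
    """Box-Cox transform of one positive value."""
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def profile_loglik(data, lam):
    """Normal-model profile log-likelihood of lambda (up to a constant)."""
    n = len(data)
    y = [box_cox(x, lam) for x in data]
    mean_y = sum(y) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n   # divide by n, not n - 1
    return -0.5 * n * math.log(var_y) + (lam - 1) * sum(math.log(x) for x in data)

# Artificial data whose logs are evenly spaced normal quantiles.
n = 50
data = [math.exp(NormalDist().inv_cdf((i + 0.5) / n)) for i in range(n)]

# The "slidebar": lambda from -2 to 2 in steps of 0.1.
grid = [i / 10 for i in range(-20, 21)]
best_lambda = max(grid, key=lambda lam: profile_loglik(data, lam))

print(best_lambda)
```

In practice you would round the winning lambda to a nearby meaningful value such as 0, 1/2, or 1, in keeping with the common-sense advice above.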

Things to keep in mind about transformations:

• One reason for using a transformation is to satisfy the assumptions behind certain statistical models. But to do so you might have to sacrifice interpretability. We all understand units of, say, dollars. But what about log-dollars? What about dollars-to-the-2/3 power?

• Sometimes transformations can be "undone" to report back in the original units. Sometimes not. For example, exponentiating the mean of log-transformed data gives the geometric mean of the original data, not the ordinary (arithmetic) mean.

• The log is a handy transformation. The book "The Statistical Sleuth" has this to say:

---"If the ratio of the largest to the smallest measurement in a group is greater than 10, then the data are probably more conveniently expressed on the log scale. Also, if the graphical displays of the two samples show them both to be skewed and if the group with the larger average also has the larger spread, the log transformation is likely to be a good choice."

The context here is that of comparing two groups of data.
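Two of the points in the list above are easy to try out: the largest-to-smallest ratio check from the quotation, and what happens when you "undo" a log after averaging. The sketch below is illustrative only; the function names and the numbers are made up, and the threshold of 10 is taken from the quotation.

```python
import math

def log_scale_suggested(values, threshold=10.0):
    """Rule-of-thumb check: is largest/smallest greater than the threshold?

    Assumes all values are positive.
    """
    return max(values) / min(values) > threshold

def geometric_mean(values):
    """Average on the log scale, then undo the log with exp."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

wide_group = [1.0, 10.0, 100.0]        # made-up data, ratio 100
narrow_group = [150.0, 165.0, 180.0]   # made-up data, ratio 1.2

wide = log_scale_suggested(wide_group)       # ratio exceeds 10
narrow = log_scale_suggested(narrow_group)   # ratio does not exceed 10

back = geometric_mean(wide_group)            # about 10, in the original units
arith = sum(wide_group) / len(wide_group)    # 37, the ordinary mean

print(wide, narrow, back, arith)
```

The back-transformed value is in the original units, but it is the geometric mean (about 10), not the ordinary mean (37). So the log can be undone, but what you report afterwards is a different summary of the data.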