Lecture 2
MUST END Class at 12:30
Case Study of Observational Study
Case Study: Cholera
Slide 1
I'd like to examine a bit of statistical and epidemiological history. This case study hits many of the themes we'll be visiting in the next few weeks.
Slide 2
Cholera is a disease passed through germs that usually travel through water supplies. It kills quite quickly -- within 3 days -- causing severe diarrhea.
In 1854 the cause of cholera was unknown, but most experts thought it was passed on through air-borne contaminants. It had been a problem in London since the early 1800s, and seemed to occur most often in crowded, poor neighborhoods.
Slide 3
A particularly sudden outbreak occurred in Soho, London near an area called Golden Square. There were 500 deaths in just a few days, and the suddenness caused widespread panic. John Snow, a physician, believed -- based on past research -- that cholera spread through polluted water supplies. He left another cholera epidemic he was studying to investigate the Golden Square outbreak, and the result was a rather famous report that has popularly been considered the "solution" to cholera.
Slide 4
Snow went door-to-door and indicated on a map of the area where each pump was located and where each death occurred. Deaths are indicated with the black bars. Let's zoom in on the area with the most bars.
Slide 5
You see that many deaths are quite near the Broad St. pump. In fact, the number of deaths decreases as you move farther away. On one version of this map, Snow indicated which homes were closer to the Broad St. pump than to any other pump, and this area had almost all of the deaths.
There were some anomalies. The Brewery one block away had no deaths. Similarly for a nearby workhouse/prison. On the other hand, almost everyone in a household about a mile away died, even though there were no other deaths around that household.
Some further investigating showed:
a) the workers at the brewery didn't trust the water supply and drank beer
b) the workhouse had its own water supply
c) the woman who owned the distant house preferred the taste of the Broad St. pump and paid someone to bring drinking water from that pump every day.
Slide 6
Is this definitive "proof"? Snow presented his findings to local municipal administrators, who then removed the handle from the pump on Sept 8. This graph makes it seem as if deaths decreased rather dramatically after this.
But there's reason for suspicion.
a) Do we trust this graph?
b) People had been drinking from the pump for decades, with no ill effects. Why now?
c) This is what we call an observational study. It looks as if people who drank from the pump died, and people who didn't drink from the pump didn't die. There is a "treatment variable" -- whether or not someone drank from the Broad St. pump -- and a "response variable" -- whether they lived or died.
But it's possible, in an observational study, that there was a confounding variable: something that might explain both why they drank from the pump and why they died. Perhaps, for example, the disease really was air-borne, and there was something in the air around the pump? Some people, in fact, thought cholera was spread by "vapors" that arose from underground.
Slide 7
Here's the same graph, but with deaths reported daily. You can see that, in fact, deaths had already started decreasing when the pump handle was removed, and removing the handle seems to have had little to do with it.
While graphs can be a useful way of summarizing information, whenever you summarize something you have to leave other things out. The previous graph combines data into weeks, and leaves out too much detail. You'll sometimes see similar things in financial data.
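To make the point about aggregation concrete, here is a minimal sketch in Python. The daily counts and the intervention day are made up for illustration (they are not Snow's data), and the helper name `aggregate` is hypothetical. The weekly totals suggest a dramatic drop after the second week begins, while the daily series shows the decline was already under way.

```python
# Hypothetical daily death counts (NOT Snow's data): a sharp peak, then a steady decline.
daily_deaths = [2, 5, 40, 70, 60, 45, 30, 22, 15, 10, 8, 5, 3, 2]
intervention_day = 8  # hypothetical index of the day the intervention happened


def aggregate(counts, bin_size):
    """Sum consecutive counts in bins of bin_size (e.g. 7 days = 1 week)."""
    return [sum(counts[i:i + bin_size]) for i in range(0, len(counts), bin_size)]


weekly_deaths = aggregate(daily_deaths, 7)

# Day (index) with the most deaths in the daily series.
peak_day = daily_deaths.index(max(daily_deaths))

# Were deaths falling every day from the peak up to the intervention day?
already_falling = all(
    daily_deaths[i] > daily_deaths[i + 1] for i in range(peak_day, intervention_day)
)

print("weekly totals:", weekly_deaths)
print("peak day:", peak_day, "| falling before intervention:", already_falling)
```

With only two weekly totals, a reader cannot tell whether the decline started before or after day 8; the daily series answers that question directly.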
Slide 8
Net earnings from a particular company reported on a quarterly basis. (Tufte, Visual and Statistical Thinking: Displays of Evidence for Making Decisions.) You can see occasional dips in earnings.
Slide 9
Same slide, but now with annual earnings based on fiscal years. You notice the dip in 1982, and this is the basis of a claim for damages.
Slide 10
Same slide, but now with earnings in calendar years. The dip is gone. The units we use to aggregate matter, and it's important that we choose units that "fit". Probably fiscal years are a better way of reckoning than calendar years. Similarly, there's no particular reason to consider the cholera data in terms of weeks.
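The same effect can be sketched with a small Python example. The quarterly figures below are invented (they are not the numbers from the slides), and the fiscal-year convention (starting in the third quarter, named for the year in which it ends) is an assumption chosen for illustration. Two weak quarters that straddle a calendar-year boundary show up as a dip in one fiscal year but vanish when the same data are grouped by calendar year.

```python
from collections import defaultdict

# Hypothetical quarterly earnings (NOT the figures from the slides), keyed by
# (calendar year, quarter). Two weak quarters straddle the 1981/1982 boundary.
quarterly = {
    (1980, 3): 10, (1980, 4): 10,
    (1981, 1): 10, (1981, 2): 10, (1981, 3): 10, (1981, 4): 4,
    (1982, 1): 4, (1982, 2): 10, (1982, 3): 10, (1982, 4): 10,
    (1983, 1): 10, (1983, 2): 10,
}


def fiscal_year(year, quarter):
    # Assumption: the fiscal year starts in Q3 and is named for the year it ends in.
    return year + 1 if quarter >= 3 else year


def annual_totals(data, year_of):
    """Total earnings per year, keeping only years that have all four quarters."""
    totals, counts = defaultdict(int), defaultdict(int)
    for (year, quarter), amount in data.items():
        key = year_of(year, quarter)
        totals[key] += amount
        counts[key] += 1
    return {k: totals[k] for k in sorted(totals) if counts[k] == 4}


by_calendar = annual_totals(quarterly, lambda year, quarter: year)
by_fiscal = annual_totals(quarterly, fiscal_year)

print("calendar years:", by_calendar)
print("fiscal years:  ", by_fiscal)
```

Neither grouping is wrong arithmetic; they simply answer different questions, which is why the choice of unit has to be justified by the problem.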
Slide 11
Snow used other means to advance his argument. It was impossible for him to do what we call a controlled study. In a controlled study or experiment the researcher assigns subjects to one treatment or another. Clearly, it was neither possible nor ethical to assign some people to drink from one potentially contaminated water supply and others from another. But an odd quirk of London municipal works allowed something almost as good.
Water was supplied to London residents via a number of different water companies. For the most part, the company that you got water from was determined by where you lived. But in one case, two companies served the same neighborhood.
In 1849, both companies got their water from a polluted part of the Thames. The neighborhoods served by these companies experienced a cholera outbreak. In 1853-54 there was another cholera outbreak in these neighborhoods, but this time, the Lambeth company had switched to another, cleaner water supply.
Slide 12
Map of area served by both
Slide 13
The result was that the death rate was lower (61 per 100,000) in the neighborhoods served by both Lambeth and Southwark & Vauxhall (S&V) than in neighborhoods served solely by S&V.
Of course there could still be a confounder.
Slide 14
Today there's a pub, the John Snow, where the pump once was.
Slide 15
Across the street is a memorial pump:
Slide 16
Medical mysteries persist. This latest was reported in Monday's NY Times.
Graphical and Numerical Summaries
After collecting data, we're confronted with the task of trying to organize it in a way that makes sense. We hope to see patterns that help us understand the phenomenon we're studying. But we also just hope to have a compact way of communicating what we've found.
Framework: there's a population that's quite large. Out of this population we select a small sample and that's all we see. But our sample can give us insight into what this variable would look like in the population.
Guiding Principle: distribution function
This is a function that tells us the values that a variable takes and the frequency of those values in the population.
Note that we can never know what this function is unless we see everything in the population. But we can see what it might be for our data and use that to make some intelligent choices about what it might be in the population.
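As a rough illustration of "what it might be for our data", the sketch below tabulates the distribution of a small sample in Python: each observed value paired with its relative frequency. The sample values and the function name `empirical_distribution` are hypothetical, chosen only to show the idea.

```python
from collections import Counter

# Hypothetical sample of a variable (say, number of people per household).
sample = [2, 3, 3, 4, 2, 5, 3, 4, 3, 2]


def empirical_distribution(values):
    """Map each observed value to its relative frequency in the sample."""
    counts = Counter(values)
    n = len(values)
    return {value: counts[value] / n for value in sorted(counts)}


dist = empirical_distribution(sample)
for value, freq in dist.items():
    print(value, freq)
```

This describes the sample only; a different sample from the same population would give somewhat different frequencies.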
So the first order of business when studying data is to make a picture of its distribution. There are several means for doing so:
a) dot plots
b) stem and leaf plots
c) histograms.
1) How to make them
2) What to look for
How to make a dot plot:
Each observation is represented by a dot that is piled up on a number line
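Here is a minimal sketch of that construction as a text plot in Python. It assumes the observations are integers, uses "o" for each dot, and the sample data and the name `dot_plot` are made up for illustration.

```python
from collections import Counter


def dot_plot(values):
    """Return a text dot plot: dots stacked above each integer on the number line."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    width = max(len(str(x)) for x in axis)  # column width fits the widest label
    rows = []
    # Build from the tallest stack down so the dots pile up from the axis.
    for level in range(max(counts.values()), 0, -1):
        cells = [("o" if counts[x] >= level else " ").rjust(width) for x in axis]
        rows.append(" ".join(cells).rstrip())
    rows.append(" ".join(str(x).rjust(width) for x in axis))
    return "\n".join(rows)


# Hypothetical observations; note the gap at 6 and the lone value at 7.
observations = [2, 3, 3, 4, 2, 5, 3, 4, 3, 2, 7]
plot = dot_plot(observations)
print(plot)
```

Values with no observations still get a place on the number line, so gaps and isolated points stay visible.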
How to make a stem-and-leaf
Not always useful, but a good quick-and-dirty way if no computer is handy.
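The notes do not spell out the steps here, so the following Python sketch uses the usual convention as an assumption: for non-negative whole numbers, the stem is everything but the last digit and the leaf is the last digit, with leaves listed in increasing order. The data and the name `stem_and_leaf` are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative integers (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    # Include empty stems so gaps in the data remain visible.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem} | {digits}".rstrip())
    return rows


# Hypothetical observations.
data = [12, 15, 21, 23, 23, 28, 35, 41, 47, 47, 62]
rows = stem_and_leaf(data)
for row in rows:
    print(row)
```

Turned on its side, the display has the same shape as a dot plot or histogram of the data, but the individual values can still be read off.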