\documentclass[11pt]{amsart}
\usepackage{psfig,amssymb}
\parskip .2 in
\large
\begin{document}
\title{NELS88}
\author{Deborah Wang and Amy Braverman}
\maketitle

\section{Introduction}
This paper discusses various aspects of the NELS88 sample design and their implications for using bootstrap techniques in this setting. In particular, the NELS88 sample design is complex, and would seem to lend itself well to the techniques presented by Rao and Wu (1988) and Sitter (1989). Available NELS88 data are a condensed version of the actual NELS88 sample, but nonetheless form a large dataset in their own right. In this paper, we use the available NELS88 data to study the NELS88 sample design, and as test data for programs that carry out and evaluate the performance of the resampling techniques proposed by Rao and Wu and by Sitter.

Section 2 sets the stage with a discussion of the relationships between the population, the sampling frame, and the NELS88 sample. Section 3 describes the strategies we used on the NELS88 sample to select subsamples for analysis. These include stratified random sampling and a systematic sampling procedure that mimics the sample design actually used to select the NELS88 data. Section 4 presents XLISPSTAT code for bootstrap estimates derived from stratified random samples. This includes both the ``naive'' bootstrap and Rao and Wu's proposed improvement. In Section 5, we investigate Sitter's generalization of Rao and Wu's techniques. Sitter presents methods for bootstrapping from samples collected using complex designs other than stratified random sampling. We present XLISPSTAT code to carry out such a procedure on the NELS88 subsample collected using systematic sampling. Finally, in Section 6, we evaluate the performance of these techniques on the NELS88 subsample, and draw some conclusions about which methods are most appropriate for the NELS88 analysis.

\section{Target Population, Sampling Frame, and the NELS88 Sample}
The NELS88 sampling frame consists of 38,866 schools classified into superstrata based on geographic location (G8REGON) and school type (G8CTRL). Public superstrata are further broken down into substrata based on urbanicity (G8URBAN) and minority composition (G8MINOR). Some private superstrata are broken down into substrata based on urbanicity, and some are not; the latter contain only one substratum. According to the {\it Base Year Sample Design Report},
\begin{quote}
Investigation of various sources indicated that the most readily accessible source for a complete and accurate frame available was the data base compiled by Quality Education Data, Inc. (QED) of Denver, Colorado. The data base includes both public and private (parochial and non-parochial schools). QED performs annual, late-summer updates by telephoning each public school district, each Catholic diocese, and all private schools on its records. In addition, QED frequently receives updated information from agencies such as the National Catholic Educational Association, the Council of American Private Education, the Association of Christian Schools, and others, regarding school openings and closings, enrollments and so forth. $\ldots$ The QED list did not contain information about the racial/ethnic composition of public schools usable for constructing the NELS88 sampling frame. NORC obtained racial/ethnic composition data, on public schools only, from Westat, Inc., a subcontractor for the NELS:88 survey.
\end{quote}
Surely, such a list does not include every school in the United States; some are inevitably missing. The frame itself is, in a sense, a sample from the target population. If missingness is related to other school characteristics of interest, the frame may be a slightly distorted proxy for the target population.

Sampling from the frame was done in a way that ensured the sample would include specified numbers of schools from each superstratum and substratum. With respect to other (non-stratification) variables, the sampled schools may or may not be representative of the frame. Representativeness is achieved if the relationship between the stratification variables and the other variables in the sample is the same as the relationship between the stratification variables and the other variables on the frame.

This mirrors the situation that exists between the frame and the target population with regard to missingness. Regarding missingness {\it on the frame} as a (binary) stratification variable, the relationship between an indicator of frame-missingness and other variables in the target population should be the same as the relationship between frame-missingness and the other variables in the frame. Since there are no frame-missing cases on the frame by definition, this implies that there must be no relationship between frame-missingness and other variables in the target population if the frame is to be representative of the population.

Finally, the same set of circumstances governs the relationship between the NELS88 sample and samples taken from it. There are 1,734 schools in the NELS88 sample, which we refer to from now on as the NELS88 data. Samples may be drawn from these 1,734 schools, and statistics computed. However, unlike the two situations discussed above, we are able to compute from the data the actual quantities our statistics are designed to estimate. In other words, we may regard the NELS88 data as a target population in its own right. Since we can compute ``true population'' parameters, we can evaluate various estimators by comparing their behavior over repeated trials to the known true values of the parameters they are designed to estimate.

\section{Sample Design}
The NELS88 sample is a stratified sample with systematic sampling within strata.

\subsection{Stratification}
The data available to us (hereafter called NELS88(2)) are a condensed version of the NELS88 data described in the {\it Base Year Sample Design Report}. For instance, the NELS88 data are stratified into 32 superstrata based on school type (public, private, other private religious, and other private) and the geographic region in which schools are located. For public schools there are a total of 17 different geographic regions. In NELS88(2) these have been collapsed into four regions (Northeast, North Central, South, and West), resulting in only four public school superstrata. There are 15 private school superstrata in NELS88, owing to five geographic regions and three different private school types. In NELS88(2) there are 12 private school superstrata, resulting from the combination of three private school types and four regions.

For public schools, substrata are based on school urbanicity and minority composition. In NELS88 there are three levels of urbanicity (urban, suburban, and rural) and two levels of minority composition (minority and non-minority). Therefore, there is a maximum of six possible substrata within a superstratum. The number of substrata within public superstrata varies from superstratum to superstratum, but never exceeds four. For example, in the Public, New York superstratum there are four substrata: minority; urban, nonminority; suburban, nonminority; and rural, nonminority.

NELS88(2) contains information that could be used to form substrata in the same manner as was done in NELS88. However, since there are only 1035 schools in NELS88(2), this results in some substrata with insufficient numbers of schools. Therefore, we do not form substrata. The entire analysis is carried out at the superstratum level. Table A shows the stratification scheme for NELS88(2).

\begin{table}
\begin{center}
\begin{tabular}{|l|l|l|}
\hline
\multicolumn{3}{|c|}{Table A}\\
\multicolumn{3}{|c|}{NELS88(2) Stratification}\\
\hline\hline
$h$ & \em Superstratum & $N_h$ \\
\hline\hline
1 & Public, Northeast & 135 \\
2 & Public, North Central & 209 \\
3 & Public, South & 292 \\
4 & Public, West & 165 \\
\hline
\multicolumn{2}{|l|}{Total, Public} & 801 \\
\hline\hline
5 & Catholic, Northeast & 45 \\
6 & Catholic, North Central & 31 \\
7 & Catholic, South & 20 \\
8 & Catholic, West & 9 \\
\hline
\multicolumn{2}{|l|}{Total, Catholic} & 105 \\
\hline\hline
9 & Other Religious, Northeast & 10 \\
10 & Other Religious, North Central & 18 \\
11 & Other Religious, South & 22 \\
12 & Other Religious, West & 17 \\
\hline
\multicolumn{2}{|l|}{Total, Other Religious} & 67 \\
\hline\hline
13 & Other Non-Religious, Northeast & 24 \\
14 & Other Non-Religious, North Central & 8 \\
15 & Other Non-Religious, South & 22 \\
16 & Other Non-Religious, West & 6 \\
\hline
\multicolumn{2}{|l|}{Total, Other Non-Religious} & 60 \\
\hline\hline
\multicolumn{2}{|l|}{Total} & 1033 \\
\hline
\multicolumn{2}{|l|}{Missing} & 2 \\
\hline
\end{tabular}
\end{center}
\end{table}

\subsection{Systematic Sampling Within Strata}
Let $n$ be the number of schools from among $N=1035$ to be selected for the sample. These $n$ schools must be allocated among $H=16$ strata. Let $n_h$ be the number of schools drawn from stratum $h$. In a manner analogous to sample size allocation for NELS88, $n_h$ is based on the estimated total eighth grade enrollment for all schools in stratum $h$:
$$ n_{h} = {{\sum_{j \in {\cal B}_{h}} \mbox{G8ENROL}_{j}} \over {\sum_{j} \mbox{G8ENROL}_{j}}} \, n $$
Here $j$ indexes schools independent of stratum membership, and ${\cal B}_{h}$ is the set of indices $j$ such that school $j$ is in stratum $h$. Denote by $\alpha_{h}$ the quantity that multiplies $n$. Table B shows the values of this quantity for each $h$.
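As a worked illustration of this allocation rule, take a hypothetical total sample size of $n=100$ (a value assumed here purely for arithmetic convenience; it is not specified in the text) together with the $\alpha_{h}$ values reported in Table B:
$$ n_{1} = 0.1456 \times 100 = 14.56, \qquad n_{3} = 0.3495 \times 100 = 34.95, \qquad n_{16} = 0.0033 \times 100 = 0.33. $$
Under this assumed $n$, the allocations are not whole numbers, so some rounding rule would be needed before they could be used as sample sizes, and the smallest strata would receive less than one school before rounding.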
\begin{table} \begin{center} \begin{tabular}{|l|l|l|l|c|} \hline \multicolumn{5}{|c|}{Table B}\\ \hline\hline $h$ & \em Superstratum & $\alpha_h$ & $r_h$ & $\lceil N_{h}/r_h \rceil$ \\ \hline\hline 1 & Public, Northeast & 0.1456 & 15 & 9 \\ 2 & Public, North Central & 0.2017 & 21 & 10 \\ 3 & Public, South & 0.3495 & 35 & 9 \\ 4 & Public, West & 0.2066 & 21 & 8 \\ 5 & Catholic, Northeast & 0.0175 & 2 & 23 \\ 6 & Catholic, North Central & 0.0115 & 2 & 16 \\ 7 & Catholic, South & 0.0090 & 1 & 20 \\ 8 & Catholic, West & 0.0027 & 1 & 9 \\ 9 & Other Religious, Northeast & 0.0048 & 1 & 10 \\ 10 & Other Religious, North Central & 0.0057 & 1 & 18 \\ 11 & Other Religious, South & 0.0087 & 1 & 22 \\ 12 & Other Religious, West & 0.0066 & 1 & 17 \\ 13 & Other Non-Religious, Northeast & 0.0099 & 1 & 24 \\ 14 & Other Non-Religious, North Central & 0.0045 & 1 & 8 \\ 15 & Other Non-Religious, South & 0.0121 & 2 & 11 \\ 16 & Other Non-Religious, West & 0.0033 & 1 & 6 \\ \hline\hline \multicolumn{2}{|c|}{Total} & 0.9997 & & 224 \\ \hline \end{tabular} \end{center} \end{table} \begin{table} \begin{center} \begin{tabular}{|l|l|c|c|c|c|} \hline \multicolumn{6}{|c|}{Table C}\\ \hline\hline $h$ & \em Superstratum & \em Number of & \em Size of & $P(s_{hi} = 1)$ & $w_{hi}$ \\ & & \em Subpopulations & \em Subpopulations & & \\ \hline\hline 1 & Public, & 15 & 9 & 0.06667 & 15 \\ & Northeast & & & & \\ \hline 2 & Public, & 20 & 10 & 0.04762 & 21 \\ & North Central & 1 & 9 & & \\ \hline 3 & Public, & 12 & 9 & 0.02857 & 35 \\ & South & 23 & 8 & & \\ \hline 4 & Public, & 18 & 8 & 0.04762 & 21 \\ & West & 3 & 7 & & \\ \hline 5 & Catholic, & 1 & 23 & 0.50000 & 2 \\ & Northeast & 1 & 22 & & \\ \hline 6 & Catholic, & 1 & 16 & 0.50000 & 2\\ & North Central & 1 & 15 & & \\ \hline 7 & Catholic, & 1 & 20 & 1.00000 & 1 \\ & South & & & & \\ \hline 8 & Catholic, & 1 & 9 & 1.00000 & 1 \\ & West & & & & \\ \hline 9 & Other Religious, & 1 & 10 & 1.00000 & 1 \\ & Northeast & & & & \\ \hline 10 & Other 
Religious, & 1 & 18 & 1.00000 & 1 \\
& North Central & & & & \\ \hline
11 & Other Religious, & 1 & 22 & 1.00000 & 1 \\
& South & & & & \\ \hline
12 & Other Religious, & 1 & 17 & 1.00000 & 1 \\
& West & & & & \\ \hline
13 & Other Non-Religious, & 1 & 24 & 1.00000 & 1 \\
& Northeast & & & & \\ \hline
14 & Other Non-Religious, & 1 & 8 & 1.00000 & 1 \\
& North Central & & & & \\ \hline
15 & Other Non-Religious, & 2 & 11 & 0.50000 & 2 \\
& South & & & & \\ \hline
16 & Other Non-Religious, & 1 & 6 & 1.00000 & 1 \\
& West & & & & \\ \hline
\end{tabular}
\end{center}
\end{table}
The {\it Base Year Sample Design Report} states
\begin{quote}
The selection of public schools was accomplished using systematic sampling with random starts in each public superstratum and sampling intervals in each superstratum that were proportional to $MOS$. The selection of the private schools was accomplished using systematic sampling with random starts in each private substratum and with the sampling intervals proportional to $MOS$.
\end{quote}
$MOS_{j}$ is the {\it measure of size} for school $j$, and is calculated as follows:
\begin{quote}
$$ MOS = F \times G \times \max(24, \mbox{G8 enrollment}) $$
$\ldots$ The $MOS$ was equal to an adjustment factor, $F$, times another factor, $G$, times the maximum of 24 (which is the desired number of regular students per school to be sampled) or the estimated eighth grade enrollment of the school ($M_{j}/g_{j}$).
\end{quote}
NELS88(2) does not provide enough information to compute the factors $F$ and $G$. Therefore, we use $\alpha_{h}$ as a proxy for
$$ {{\sum_{j \in {\cal B}_{h}} MOS_{j}} \over {\sum_{j} MOS_{j}}} $$
and the sampling intervals within strata are proportional to $\alpha_{h}$. For stratum $h$ the sample size is $n_{h}= \alpha_{h} n$. Under systematic sampling, we sequentially assign each school in the frame to one of $r_{h}$ subpopulations, where $r_{h}$ is the sampling interval for stratum $h$.
Where $\alpha_{h}$ is large, and hence $r_{h}$ is large, relatively many subpopulations are formed, each with relatively few schools {\em given fixed $N_{h}$.} Where $\alpha_{h}$ is small, and hence $r_{h}$ is small, relatively few subpopulations are formed, each with relatively many schools (again for fixed $N_{h}$). One subpopulation is chosen (uniformly) at random to be the sample from that stratum. Since large strata have large values of $\alpha_{h}$, and small strata have small values of $\alpha_{h}$, the sample sizes should tend to be similar for all strata. In the larger strata, choosing a relatively small fraction of the schools offsets the fact that those strata are larger in the first place; similarly, in the smaller strata, choosing a relatively large fraction of the schools offsets the fact that those strata are smaller. The probability of selection for schools in larger strata is smaller than the probability of selection for schools in smaller strata, since more subpopulations are formed in the larger strata.

Let $r_{h}=\lceil100\alpha_{h}\rceil$. Then $r_{h}$ is proportional to $\alpha_{h}$ up to rounding, and $N_{h}/r_{h}$ is the subpopulation size implied by this choice of $r_{h}$. (See Table B.) Subpopulation sizes must be whole numbers, so we use the least integer greater than or equal to $N_{h}/r_h$ $\left( \lceil N_{h}/r_h \rceil \right)$. Rounding upward in this fashion may cause subpopulation sizes to be too large: the number of subpopulations multiplied by this implied subpopulation size may exceed the number of schools in some strata. Sequential assignment of schools to subpopulations by systematic sampling procedures automatically corrects for this: some subpopulations will simply have one fewer school than $\lceil N_{h}/r_h \rceil$ dictates. (See Table C.) To obtain a sample from stratum $h$, we select one of the $r_{h}$ subpopulations at random.
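The subpopulation sizes in Table C can be checked by integer division. The following is a short worked sketch, where the symbols $q$ and $m$ are introduced here only for illustration: write $N_{h} = q\, r_{h} + m$ with $0 \leq m < r_{h}$. Sequential assignment then yields $m$ subpopulations of size $q+1$ and $r_{h}-m$ subpopulations of size $q$. For Public, North Central ($N_{2}=209$, $r_{2}=21$),
$$ 209 = 9 \times 21 + 20, \qquad \lceil 209/21 \rceil = 10, \qquad 21 \times 10 = 210 > 209, $$
so 20 subpopulations contain 10 schools and one contains 9. For Public, South ($N_{3}=292$, $r_{3}=35$),
$$ 292 = 8 \times 35 + 12, $$
so 12 subpopulations contain 9 schools and 23 contain 8, in agreement with Table C.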
This sampling procedure can be described by specifying $P(\mbox{\boldmath $S$}_{h} = \mbox{\boldmath $s$}_{h})$, where $\mbox{\boldmath $S$}_{h}$ is a random vector of length $N_{h}$ having $n_{h}$ ones and $N_{h}-n_{h}$ zeros such that the $i$th element of $\mbox{\boldmath $S$}_{h}$ is one if school $i$ in stratum $h$ is selected for the sample, and zero if not. For example, the sample design for the Public, Northeast superstratum assigns probability $1/15$ to each of 15 vectors of length 135. The first such vector has one in positions 1, 16, 31, 46, $\ldots$, 121, and zero elsewhere. The second such vector has one in positions 2, 17, 32, 47, $\ldots$, 122, and zero elsewhere. The third through fifteenth vectors are analogous. All other vectors of length 135 have probability zero.

For strata such as Public, South, in which not all subpopulations have the same size, it is still the case that probability $1/r_{h}$ ($1/35$ for Public, South) is assigned to each vector of length $N_{h}$ (292 for Public, South) having a configuration of ones and zeros corresponding to obtainable combinations of schools. In this case 12 of those vectors will represent samples of size nine, while 23 will represent samples of size eight. The design for superstrata having only one subpopulation is specified by one vector of all ones with probability one, and probability zero assigned to all other vectors.

The design for sampling within stratum $h$ can be expressed compactly in the following form:
\begin{eqnarray*}
P(\mbox{\boldmath $S$}_{h}=\mbox{\boldmath $s$}_{h}) = \left\{ \begin{array}{ll}
r_{h}^{-1} & \mbox{for } \mbox{\boldmath $s$}_{h} \in \{\mbox{\boldmath $s$}^{(k)}_{h}\}, \hspace{.12 in} k=1,\ldots,r_{h}, \hspace{.12 in} \mbox{\boldmath $s$}_{h}^{(k)} = (s_{j}^{(k)}),\\
& \hspace{.90 in} s_{j}^{(k)}=I(j=k+pr_{h}), \hspace{.12 in} p=0, \ldots, p_{0},\\
& \hspace{.90 in} p_{0} = \max\{p : k+pr_{h} \leq N_{h}\}\\\\
0 & \mbox{otherwise.}
\end{array} \right.
\end{eqnarray*}
Since sampling within a given stratum is independent of sampling within any other stratum,
$$ P(\mbox{\boldmath $S$}=\mbox{\boldmath $s$}) = \prod_{h=1}^{16} P(\mbox{\boldmath $S$}_{h}=\mbox{\boldmath $s$}_{h}) $$
where {\boldmath $S$} is a vector of length $\sum_{h=1}^{16} N_{h}$ formed by concatenating the vectors $\mbox{\boldmath $S$}_{h}$. The overall sampling design is a probability distribution for vectors of length $\sum_{h=1}^{16} N_{h}$ assigning positive probability to possible combinations of schools as described above.

\subsection{Estimating the Population Mean}
We wish to estimate the true (finite) population mean,
$${\bar y} = {1 \over N}\sum_{h=1}^{H} \sum_{i=1}^{N_{h}} y_{hi}$$
where $N = \sum_{h} N_{h}$. A logical estimator of ${\bar y}$ is
$${\hat {\bar y}} = {1 \over n} \sum_{h=1}^{H} \sum_{i=1}^{N_{h}} S_{hi} w_{hi} y_{hi} $$
where $n=\sum_{h=1}^{H} {n_{h}}$, $w_{hi}$ is a weight assigned to school $i$ in stratum $h$, and (as before) $S_{hi}$ is an indicator for whether school $i$ in stratum $h$ was selected for the sample. ${\hat {\bar y}}$ is a weighted average of the sample values, $y_{hi}$. Specifically, if $w_{hi} = {[P(s_{hi}=1)]}^{-1}$ then
$$ {\hat {\bar y}} = {1 \over n} \sum_{h=1}^{H} \sum_{i=1}^{N_{h}} S_{hi} {[P(s_{hi}=1)]}^{-1} y_{hi} $$
With weights inversely proportional to the probability of selection, schools from large strata will be weighted more heavily than schools from small strata. Stratified systematic sampling intentionally keeps sample sizes across strata similar. The weights reflect the fact that schools selected from large strata stand in for large numbers of schools (since there are more schools in those strata to begin with), and schools selected from small strata stand in for small numbers of schools.
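The sense in which a sampled school ``stands in'' for others can be checked against Table C, where $w_{hi} = r_{h}$. This is an illustrative calculation only. In Public, Northeast every possible sample contains 9 schools, each with weight 15, so the weighted count of sampled schools recovers the stratum size exactly:
$$ 9 \times 15 = 135 = N_{1}. $$
In Public, South the weighted count depends on which subpopulation is drawn ($9 \times 35 = 315$ or $8 \times 35 = 280$), but its expectation over the 35 equally likely samples is again the stratum size:
$$ {{12 \times 315 + 23 \times 280} \over {35}} = {{10220} \over {35}} = 292 = N_{3}. $$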
By contrast, under simple random sampling, every school selected for the sample stands in for the same number of schools in the population: the greater importance ascribed to large strata is reflected in the fact that $n_{h}$ is bigger in large strata. Under SRS, the estimator ${\hat {\bar y}}$ can also be written as above, noting that in this case
\begin{eqnarray*}
w_{hi} & = & {[P(s_{hi}=1)]}^{-1} \\
 & = & \left({n_{h} \over N_{h}}\right)^{-1} \\
 & = & {N_{h} \over n_{h}}
\end{eqnarray*}
Weights correct for the non-representativeness of the sample with respect to the population. The simplest circumstance occurs when the sample is taken in such a way that it is fully representative of the population by virtue of the sample design. (This is the situation with simple random sampling.) Then each element in the sample receives the same weight in the computation of the sample mean. When the design deliberately imposes a distortion on the sample with respect to the population, say for the purpose of guaranteeing the presence of rare items, weights can and should be used to compensate.

\section{Bootstrap Estimates Under Stratified Random Sampling}

\section{Bootstrap Estimates Under Complex Sample Designs}

\section{Conclusion}

\end{document}