fastR
fastR is an R package that contains data and other utilities to support the book
Foundations and Applications of Statistics: An Introduction using R.
This book is designed for an upper-level undergraduate "mathematical statistics" course (or 2-semester
sequence), but it a different from other books for this audience.
Obtaining the book
Foundations and Applications of Statistics: An Introduction Using R is being published
by the American Mathematical Society and is scheduled to appear in early 2011. If you are interested
in using the book before it is published, contact the author
(rpruim@calvin.edu).
Approach of the book
Features of this book that help distinguish it from other books available for such a
course include
-
The use of R, a free software environment for statistical computing and graphics,
throughout the text.
Many books claim to integrate technology, but often technology appears
to be more of an afterthought. In this book, topics are selected, ordered, and
discussed in light of the current practice in statistics, where computers are an
indispensable tool, not an occasional add-on.
R was chosen because it both powerful and available. Its “market share”
is increasing rapidly, so experience with R is likely to serve students well in
their future careers in industry or academics. A large collection of add-on
packages are available, and new statistical methods are often available in R
before they are available anywhere else.
R is open source and available at the Comprehensive R Archive Network
(CRAN, http://cran.r-project.org) for a wide variety of computing platforms at no cost.
This allows students to obtain the software for their personal
computers – an essential ingredient if computation is to be used throughout
the course.
The R code in this book was executed on a 2.66 GHz Intel Core 2 Duo
MacBook Pro running OS X (version 10.5.8) and the current version of R
(2.11). Results using a different computing platform or different version of R
should be similar.
-
An emphasis on practical statistical reasoning.
The idea of a statistical study is introduced early on using Fisher’s famous
example of the lady tasting tea. Numerical and graphical summaries of data
are introduced early to give students experience with R and to allow them
to begin formulating statistical questions about data sets even before formal
inference is available to help answer those questions.
-
Probability for statistics.
One model for the undergraduate mathematical statistics sequence presents
a semester of probability followed by a semester of statistics. In this book,
I take a different approach and get to statistics early, developing the
necessary probability as we go along, motivated by questions that are primarily
statistical. Hypothesis testing is introduced almost immediately, and p-value
computation becomes a motivation for several probability distributions. The
binomial test and Fisher’s exact test are introduced formally early on, for example.
Where possible, distributions are presented as statistical models first,
and their properties (including the probability mass function or probability
density function) derived, rather than the other way around. Joint distributions
are motivated by the desire to learn about the sampling distribution of
a sample mean.
Confidence intervals and inference for means based on t-distributions must
wait until a bit more machinery has been developed, but my intention is that
a student who only takes the first semester of a two-semester sequence will
have a solid understanding of inference for one variable – either quantitative
or categorical.
-
The linear algebra middle road.
Linear models (regression and ANOVA) are treated using a geometric,
vector-based approach. A more common approach at this level is to intro-
duce these topics without referring to the underlying linear algebra. Such an
approach avoids the problem of students with minimal background in linear
algebra, but leads to mysterious and unmotivated identities and notions.
Here I rely on a small amount linear algebra that can be quickly reviewed
or learned and is based on geometric intuition and motivation (see Appendix C).
This works well in conjunction with R since R is in many ways vector-based
and facilitates vector- and matrix-operations. On other hand, I avoid
using an approach that is too abstract or requires too much background for
the typical student in my course.
Brief Outline of the book
Table of Contents [pdf]
The first four chapters of this book introduce important ideas in statistics (distributions,
variability, hypothesis testing, confidence intervals) while developing a
mathematical and computational toolkit. I cover this material in a one-semester
course. And since some of my students only take the first semester, I wanted to
be sure that they have gotten a sense for statistical practice and have some useful
statistical skills even if they do not continue. Interestingly, as a result of designing
my course so that stopping half-way makes some sense, I am finding that more
of my students are continuing on to the second semester. My sample size is still
small, but I hope that the trend continues, and would like to think it is due in part
because the students are enjoying the course and can see “where it is going.”
The last three chapters deal primarily with two important methods for handling
more complex statistical models: maximum likelihood and linear models (including
regression, ANOVA, and an introduction to generalized linear models). This is not
a comprehensive treatment of these topics, of course, but I hope it both provides
flexible, usable statistical skills and prepares students for further learning.
Chi-squared tests for goodness of fit and for two-way tables using both the
Pearson and likelihood ratio test statistics are covered after first generating empirical
p-values based on simulations. The use of simulations here reinforces the
notion of a sampling distribution and allows for a discussion about what makes a
good test statistic when multiple test statistics are available. I have also included
a brief introduction to Bayesian inference, some examples that use use simulations
to investigate robustness, a few examples of permutations tests, and a discussion
of Bradley-Terry models. The latter topic is one that I cover between Selection
Sunday and the beginning of the NCAA Division I Basketball Tournament each
year. An application of the method to the 2009–2010 season is included.
Various R functions and methods are described as we go along, and Appendix A
provides an introduction to R focusing on the way R is used in the rest of the book.
I recommend that you work through Appendix A simultaneously with first chapter
– especially if you are unfamiliar with programming or with R.
Some of my students enter the course unfamiliar with the notation for things
like sets, functions, and summation, so Appendix B contains a brief tour of the basic
mathematical results and notation that are needed. The linear algebra required for
parts of Chapter 4 and again in Chapters 6 and 7 is covered in Appendix C. These
can be covered as needed or used as a quick reference. Appendix D is a review of
the first four chapters in outline form. It is intended to remind prepare students
for the remainder of the book after a semester break, but it could also be used as
an end of term review.
fastR Package Project Summary
You can find the project summary page for the fastR package here.