Copyright © 2009. National Academy of Sciences. All rights reserved.
I. INTRODUCTION
One of the first problems of national importance that was considered
by the Committee on Applied and Theoretical Statistics (CATS) was
posed to it by staff members of the Internal Revenue Service (IRS). They
were concerned with the lack of appropriate statistical methodologies for
certain nonstandard situations that arise in auditing where the distributions
appropriate for modeling the data are markedly different from those for
which most statistical analyses were designed.
The quality of the procedures used in a statistical analysis depends
heavily on the probability model or distributions assumed. Because of
this, considerable effort over the years has been expended in the
development of large classes of standard distributions, along with relevant
statistical methodologies, designed to serve as models for a wide range of
phenomena. However, there still remain many important problems where
the data do not follow any of these more "standard" models. The problem
raised by the IRS provides a strikingly simple example of data from a
nonstandard distribution for which statistical methodologies have only
recently begun to be developed, and for which much additional research is
needed. The example is of such national importance, both for government
agencies and for business and industry, that it is the primary focus of this
report. The potential monetary losses associated with poor statistical
practice in this auditing context are exceedingly high.
It is the purpose of this report to give a survey of the available
statistical methods, to provide an annotated bibliography of the literature
on which the survey is based, to summarize important open questions, to
present recommendations designed to improve the level and direction of
research on these matters, and to encourage greater interaction between
statisticians and accountants. This report is primarily directed towards
researchers, both in statistics and accounting, and students who wish to
become familiar with the important problems and literature associated
with statistical auditing. It is hoped that this report will stimulate the
needed collaborative research in statistical auditing involving both
statisticians and accountants. It is also hoped that practitioners will benefit
from the collection of methodologies presented here, and possibly will be
able to incorporate some of these ideas into their own work.
Although this report is centered upon a particular nonstandard
distribution that arises in auditing, the original proposal for this study
recognized that this same type of nonstandard model arises in many quite
different applications covering almost all other disciplines. Three general
areas of application (accounting, medicine and engineering) were initially
chosen for consideration by the Panel. Later in this Introduction we list
several examples in order to illustrate the widespread occurrence of
similar nonstandard models throughout most areas of knowledge. These
examples will, however, primarily reflect the initial areas of emphasis of
the Panel. Before describing these examples, however, we briefly discuss
the general concept of a mixture of distributions since it appears in the
name of the Panel.
Nonstandard Mixtures:
The phrase "mixture of distributions" usually refers to a situation in
which the j-th of k (taken here to be finite) underlying distributions is
chosen with probability p_j, j = 1, ..., k. The selection probabilities are
usually unknown and the number of underlying distributions k may be
fixed or random. The special case of two underlying distributions is an
important classical problem which encompasses this report's particular
problem in which, with probability p, a specified constant is observed
while, with probability 1-p, one observes a random measurement whose
distribution has a density function. That is, it is a mixture of a degenerate
distribution and an absolutely continuous one.
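As an illustration only, the two-component case just described can be
simulated in a few lines of Python. The function name, the choice of zero
as the specified constant, and the exponential distribution standing in for
the absolutely continuous component are assumptions made for this sketch;
none of them comes from the report.

```python
import random

def sample_nonstandard_mixture(p, n, constant=0.0, mean_continuous=100.0, seed=0):
    """Draw n observations: `constant` with probability p, otherwise a draw
    from an exponential distribution (an assumed stand-in for the
    absolutely continuous component)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        if rng.random() < p:
            draws.append(constant)  # degenerate component
        else:
            # continuous component with mean `mean_continuous`
            draws.append(rng.expovariate(1.0 / mean_continuous))
    return draws

sample = sample_nonstandard_mixture(p=0.9, n=10_000)
share_at_constant = sum(x == 0.0 for x in sample) / len(sample)
print(f"share of observations equal to the constant: {share_at_constant:.3f}")
```

The printed share should lie close to the chosen p, while the remaining
observations spread over the positive values of the continuous component.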
There are many examples of probability models that are best
described as mixtures of two or more other models in the above sense. For
example, a probability model for the heights of 16 year olds would
probably best be described as the mixture of two unimodal distributions,
one representing the model for the heights of girls and one for the boys.
Karl Pearson in 1894 was possibly the first to study formally the case of a
mixture of two distributions; in this case they were two normal
distributions, thereby providing one possible mixture model for the above
example of heights. Following this, there were few if any notable studies
until the paper of Robbins and Pitman (1949) in which general mixtures of
chi-square distributions were derived as probability models for quadratic
forms of normal random variables. Since then, there have been many
other papers dealing with particular mixture models. The published
research primarily deals with mixtures of distributions of similar types,
such as mixtures of normal distributions, mixtures of chi-square
distributions, mixtures of exponential distributions, mixtures of binomial
distributions, and so on. However, the literature contains very few papers
that provide and deal with special "nonstandard" mixtures that mix
discrete (degenerate, even) and continuous distributions as emphasized in
this Report.
In general, the word mixture refers to a convex combination of
distributions or random variables. To illustrate, suppose X and Y are
random variables with distribution functions F and G respectively. Let
0 ≤ p ≤ 1. Then H = pF + (1-p)G is again a distribution function; it is a
model in which the distribution F is used with probability p while G is
used with probability 1-p. In terms of random variables, one may say that
H models an observation Z that is obtained as follows: With probability p
observe X having distribution F, and with probability 1-p observe Y
having distribution G. Such mixtures may then be viewed as models for
data that may be interpreted as the outcomes of a two-stage experiment:
In the first stage, a population is randomly chosen and then in the second
stage an observation is made from the chosen population.
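A minimal sketch of the two-stage experiment follows, writing the mixture
as H(z) = pF(z) + (1-p)G(z). The two normal components, their parameters,
and the function names are hypothetical values chosen for illustration
(loosely in the spirit of the heights example); they are not taken from the
report. The sketch compares H at one point with the empirical frequency
obtained from simulated two-stage draws.

```python
import random
from statistics import NormalDist

def mixture_cdf(z, p, F, G):
    """H(z) = p*F(z) + (1 - p)*G(z) for two NormalDist components."""
    return p * F.cdf(z) + (1 - p) * G.cdf(z)

def two_stage_draw(rng, p, F, G):
    """Stage 1: choose a population; stage 2: observe from it."""
    chosen = F if rng.random() < p else G
    return rng.gauss(chosen.mean, chosen.stdev)

F = NormalDist(mu=162.0, sigma=6.0)  # hypothetical population 1
G = NormalDist(mu=174.0, sigma=7.0)  # hypothetical population 2
p, z = 0.5, 170.0

rng = random.Random(1)
draws = [two_stage_draw(rng, p, F, G) for _ in range(50_000)]
empirical = sum(x <= z for x in draws) / len(draws)
print(f"H({z}) = {mixture_cdf(z, p, F, G):.4f}, empirical = {empirical:.4f}")
```

The two printed values should agree up to simulation error, which is the
sense in which H models the observation Z.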
It is not necessary to limit oneself to mixtures of just two or even a
finite number of distributions. In general, one may have an arbitrarily
indexed family of distributions, for which an index is randomly chosen
from a given mixing distribution. It should also be emphasized that there
is considerable ambiguity associated with mixtures; every distribution may
be expressed as a mixture in infinitely many ways. Nevertheless, when
mixture models are formulated reasonably, they can provide useful tools
for statistical analysis. There is by now a large literature pertaining to
statistical analyses of mixtures of distributions; for a source of references,
see Titterington, Smith and Makov (1985). Problems and applications of
mixtures also appear in the literature associated with the term
heterogeneity; see Keyfitz (1984).
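The ambiguity is easy to see in a small case. As a sketch (the uniform
distribution and the function names are chosen here for illustration and
are not from the report), the uniform distribution on (0, 1) equals the
mixture of the uniform distributions on (0, a) and (a, 1) with weights a
and 1-a, for every a strictly between 0 and 1, so a single distribution
already has infinitely many mixture representations.

```python
def uniform_cdf(z, lo, hi):
    """CDF of the uniform distribution on (lo, hi)."""
    if z <= lo:
        return 0.0
    if z >= hi:
        return 1.0
    return (z - lo) / (hi - lo)

def split_mixture_cdf(z, a):
    """CDF of the mixture a*Uniform(0, a) + (1 - a)*Uniform(a, 1), 0 < a < 1."""
    return a * uniform_cdf(z, 0.0, a) + (1 - a) * uniform_cdf(z, a, 1.0)

grid = [i / 20 for i in range(21)]
for a in (0.1, 0.5, 0.73):
    worst = max(abs(split_mixture_cdf(z, a) - uniform_cdf(z, 0.0, 1.0)) for z in grid)
    print(f"a = {a}: largest discrepancy on the grid = {worst:.2e}")
```

For each a the discrepancy is zero up to floating-point rounding, so the
data alone cannot say which split, if any, produced the observations.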
Applications Involving Nonstandard Mixtures
The interpretation of the nonstandard mixtures emphasized in this
report is quite simple. If F, the degenerate distribution, is chosen in the
first stage, the observed value of the outcome is zero; otherwise the
observed value is drawn from the other distribution. In what follows we
illustrate several situations in which this type of nonstandard mixture may
arise, and indicate thereby its wide range of applications. There are of
course fundamental differences among many of these applications. For
example, in some of these applications, the mixtures are distinguishable in
the sense that one can tell from which population an observation has come,
while in others the mixtures are indistinguishable. In many applications it
is necessary to form restrictive parametric models for the non-degenerate
distribution G; in at least one example, G is itself seen to arise from a
mixture. In some cases G admits only positive values of X; in other cases
G permits positive, negative, or even zero values. Of course, if G also
permits zero values with positive probability, then the mixture is clearly
indistinguishable.
The descriptions of the following applications are brief and
somewhat simplified. They should suffice, however, to indicate the broad
diversity of important situations in which these nonstandard mixtures
arise. We begin with the auditing application that is the focus of this
report.
1. In auditing, some population elements contain no errors
while other population elements contain errors of varying
amounts. The distribution of errors can, therefore, be
viewed as a mixture of two distinguishable distributions,
one with a discrete probability mass at zero and the other a
continuous distribution of non-zero positive and/or
negative error amounts. The main statistical objective in
this auditing problem is to provide a statistical bound for
the total error amount in the population. The difficulty
inherent in this problem is the typical presence of only a
few or no errors in a given sample. This application will
be the main focus of this report; it is studied at length in
Chapter II.
Independent public accountants often use samples to
estimate the amount of monetary error in an account
balance or class of transactions. Their interest usually
centers on obtaining a statistical upper bound for the true
monetary error, a bound that is most likely going to be
greater than the error. A major concern is that the
estimated upper bound of monetary error may in fact be
less than the true amount more often than desired.
Governmental auditors are also interested in monetary
error, for example the difference between the costs reported
and what should have been reported. Because the
government may not wish to overestimate the adjustment
that the auditee owes the government, interest often
centers on the lower confidence limit of monetary error at
a specified confidence level allowed by the policy.
The mixture problem affects both groups of auditors as
well as internal auditors who may be concerned with both
upper and lower limits. In all cases there is a serious
tendency for the use of standard statistical techniques, that
are based upon the approximate normality of the estimator
of total monetary error, to provide erroneous results.
Specifically, as will be reviewed in the following chapter,
both confidence limits tend to be too small. Upper limits
being too small means that the frequency of upper limits
exceeding the true monetary error is less than the nominal
confidence level. Lower limits being too small means that
the frequency of lower limits being smaller than the true
monetary error is greater than the nominal confidence
level. To the auditors these deficiencies have important
practical consequences.
Most of the research to date has been directed toward the
independent public accountants' concern with the upper
limit. For example, the research outlined in the next
chapter that is concerned with sampling of dollar units
represents a major thrust in this direction. By contrast,
very little research has been done on the problem of the
lower confidence bound. This represents an area of
considerable importance where research is needed.
2. In a community a particular service, such as a specific
medical care, may not be utilized by all families in the
community. There may be a substantial portion of
non-takers of such a service. Those families who subscribe to
it do so in varying amounts. Thus the distribution of the
consumption of the service may be represented by a
mixture of zeros and positive values.
3. In the mass production of technological components of
hardware, intended to function over a period of time, some
components may fail on installation and therefore have
zero life lengths. A component that does not fail on
installation will have a life length which is a positive
random variable whose distribution may take different
forms. Thus, the overall distribution of lifetimes which
includes the duds is a nonstandard mixture.
4. In measuring precipitation amounts for specified time
periods, one must deal with the problem that a proportion
of these amounts will be zero (i.e. measured as zero). The
remaining proportion is characterized by some positive
random variable. The distribution of this positive random
variable usually looks reasonably smooth, but in fact is
itself a complex mixture arising from many different types
of events.
5. In the study of human smoking behavior, two variables of
interest are smoking status (Ever Smoked or Never
Smoked) and score on a "Pharmacological Scale" for
people who have smoked. This also is a bivariate problem
with a discrete variate, 0 (Never Smoked) or 1 (Ever
Smoked), and a continuous variate, "Pharmacological
Score." A nontrivial conditional distribution of the second
variate can be defined only in association with the 1
outcome of the first variate. This problem can be further
complicated by nonresponse on either of the first or
second variates.
6. In the study of tumor characteristics, two variates may be
recorded. The first is the absence (0) or presence (1) of a
tumor and the second is tumor size measured on a
continuous scale. In this problem, it is sometimes of
interest to consider a marginal tumor measurement which
is 0 with nonzero probability, an example of a mixture of
unrelated distributions. The problem can be further
complicated by recognizing that the absence of a tumor is
an operational definition and that in fact patients with
non-detectable tumors will be included in this category.
7. In studies of genetic birth defects, children can be
characterized by two variates, a discrete or categorical
variable to indicate if one is not affected, affected and
born dead, or affected and born alive, and a continuous
variable measuring the survival time of affected children
born alive. The conditional distribution of survival time
given this first variable is undefined for children who are
not affected, a mass point at 0 for children who are
affected and born dead, and nontrivial for children who
are born alive. In some cases it may be necessary to
consider the conditional survival time distribution for
affected children as a mixture of a mass point (at 0) and a
nontrivial continuous distribution.
8. Consider measurements of physical performance scores of
patients with a debilitating disease such as multiple
sclerosis. There will be frequent zero measurements from
those giving no performance and many observations with
graded positive performance.
9. In a study of tooth decay, the number of surfaces in a
mouth which are filled, missing, or decayed are scored to
produce a decay index. Healthy teeth are scored 0 for no
evidence of decay. The distribution is a mixture of a mass
point at 0 and a nontrivial continuous distribution of decay
score. The problem could be further complicated if the
decay score is expressed as a percent of damage to
measured teeth. The distribution should then be a mixture
of a discrete random variable (0 - healthy teeth, 1 - all
teeth missing) with nonzero probability of both outcomes
and a continuous random variable (amount of decay in the
(0,1) interval).
10. In studies of methods for removing certain behaviors (e.g.,
predatory behavior, or salt consumption), the amount of
the behavior which is exhibited at a certain point in time
may be measured. In this context, complete absence of
the target behavior may represent a different result than
would a reduction from a baseline level of the behavior.
Thus, one would model the distribution of activity levels
as a mixture of a discrete value of zero and a continuous
random level.
11. Time until remission is of interest in studies of drug
effectiveness for treatment of certain diseases. Some
patients respond and some do not. The distribution is a
mixture of a mass point at 0 and a nontrivial continuous
distribution of positive remission times.
12. In a quite different context, important problems exist in
time-series analysis in which there are mixed spectra
containing both discrete and continuous components.
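The coverage problem described in Example 1 can be explored with a small
simulation. The sketch below is illustrative only: the error rate, the
exponential distribution for error amounts, the sample size, and the
function name are all assumptions made here, not values from the report,
and items are drawn independently rather than from a finite account
population. It computes the usual normal-theory upper bound for the mean
error per item (multiplying by the number of items in the population would
give a bound for the total) and records how often that bound reaches the
true mean.

```python
import random
from statistics import NormalDist, mean, stdev

def upper_bound_coverage(error_rate=0.02, mean_error=50.0, n=100,
                         reps=2000, level=0.95, seed=0):
    """Fraction of simulated samples whose normal-theory upper bound on
    the mean error per item is at least the true mean error."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(level)       # one-sided normal quantile
    true_mean = error_rate * mean_error   # mean of the zero/exponential mixture
    covered = 0
    for _ in range(reps):
        # Each item is error-free (0.0) unless an error occurs.
        sample = [rng.expovariate(1.0 / mean_error)
                  if rng.random() < error_rate else 0.0
                  for _ in range(n)]
        bound = mean(sample) + z * stdev(sample) / n ** 0.5
        covered += bound >= true_mean
    return covered / reps

coverage = upper_bound_coverage()
print(f"nominal level 0.95, simulated coverage {coverage:.3f}")
```

With an error rate of 2% and samples of 100 items, about 13% of samples
(0.98^100) contain no errors at all; for those the sample standard
deviation is zero and the bound collapses to zero, so the simulated
coverage has to fall short of the nominal level.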
In some of the above examples, the value zero is a natural extension
of the possible measurements, and in other examples it is not. For
example, in measuring behavioral activity (Example 10), a zero
measurement can occur because the subject has totally ceased the
behavior, or because the subject has reduced the behavior to such a low
average level that the time of observation is insufficient to observe the
behavior. This indecision might also occur in the example concerning
tumor measurement or in rainfall measurement. In other examples,
however, it is possible to determine the source of the observation. The
very fact that the service lifetime of a component in Example 3 is zero
identifies that component as a dud, and in Example 7 there is a clear
distinction between stillborn and liveborn children. These two kinds of
examples represent applications of indistinguishable and distinguishable
mixtures, respectively.