An English translation of Laplace's Théorie Analytique des Probabilités is available online at the departmental website of Richard Pulskamp (Math Dept, Xavier U, Cincinnati). Ed Jaynes would have rejoiced to see this day! Pulskamp won't try to publish his translation (2014) in book form until/unless he (or another) annotates it; meanwhile it is online. It is perhaps the greatest single work on probability ever published. Written in two parts: part 1 (routine) on generating fns, part 2 on probability theory. Pulskamp has translated the important part 2. Laplace also wrote a philosophical essay on probabilities which acts as the introduction to the 3rd edition; this has been translated into English previously (twice!)
From samples, how many species in the population? Anton Garrett, Visiting Researcher, Cavendish Laboratory, Department of Physics, University of Cambridge
How many nationalities are represented in a crowd, based on samples? How many subject categories are there in a library, based on inspection of books pulled out at random? (NB categories are written on book covers by name, not number; otherwise the largest category number observed is an immediate lower bound.) How many species of bacteria in a pond (population ecology)? With and without replacement of samples.
The Bayesian solution to this problem will be given: the posterior distribution for the number of classes represented, in terms of the prior distn for parameters relating to this number. Forms for the likelihood (ie the sampling distribution) and the prior will be discussed. The problem is routine for Bayesians, but we shall also look at the harder problem in which we don't know how many countries/categories there are. Contains a twist! ONLY the sum and product rules of probability will be used; Bayes theorem is a corollary. Extra parameters will be marginalised over; marginalisation is another corollary.
The answer depends on the prior info: a strength of Bayesianism, not a weakness. For, if we knew the answer in advance and were doing the sampling only under orders, our prior would be a discrete δ-fn at the answer. In Bayes theorem a δ-fn prior carries through unchanged to the posterior (the posterior is proportional to the prior, so the zero prior probability everywhere except at the δ-fn carries through). The variety of sampling-theoretical methods designed to "let the data speak for themselves" while ignoring the prior info give impossible answers (ie, nonzero probability away from the δ-fn) in such problems, and are therefore WRONG. Don't trust any method that fails in a simple problem.
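The point about a δ-fn prior surviving the update can be checked in a few lines. A minimal sketch, with made-up likelihood values and a hypothetical problem in which we already know there are 3 classes:

```python
# A delta-function prior carries through a Bayes update unchanged.
# Hypothetical setup: the number of classes n can be 1..5; the prior puts
# all its mass at n = 3 (we knew the answer in advance). Likelihood values
# p(data | n) below are invented for illustration.

prior = {n: (1.0 if n == 3 else 0.0) for n in range(1, 6)}   # delta at 3
likelihood = {1: 0.05, 2: 0.20, 3: 0.40, 4: 0.25, 5: 0.10}

unnorm = {n: prior[n] * likelihood[n] for n in prior}
Z = sum(unnorm.values())
posterior = {n: unnorm[n] / Z for n in unnorm}

print(posterior)  # all mass remains at n = 3
```

Any method that returns nonzero probability away from n = 3 here has failed the simplest possible test.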
What is probability? p(A|B): how strongly B implies A. Formally, a measure of how strongly A is implied to be true upon supposing that B is true, according to relations known between their referents (A, B: binary propositions, true/false). Degree of implication is what you actually want in any problem involving uncertainty. RT Cox (1946), Knuth: if propositions obey Boolean algebra, then the degrees of implication for them obey corresponding algebraic relations that turn out to be the sum and product rules (and hence Bayes theorem). So let's call degree of implication "probability". But if frequentists (etc) object, then bypass them: calculate the degree of implication in each problem, because it is what you want in order to solve any problem. In defining probability there are no higher criteria than consistency and universality: no worries over belief or imaginary ensembles, and all probs are automatically conditional. This viewpoint downplays "random": all it means is unpredictable (but by whom?)
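The rules referred to can be written out explicitly. A sketch in standard notation (the overbar denotes negation):

```latex
% Sum rule and product rule for degrees of implication:
p(A \mid C) + p(\bar{A} \mid C) = 1, \qquad
p(AB \mid C) = p(A \mid BC)\, p(B \mid C).
% The product rule is symmetric in A and B; equating the two factorisations
% of p(AB|C) gives Bayes theorem as a corollary:
p(A \mid BC) = \frac{p(B \mid AC)\, p(A \mid C)}{p(B \mid C)}.
```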
You have to use the prior info, and you have to use Bayes theorem to update it in the light of the data. Anything else is inequivalent to the sum and product rules, and they follow from Boolean algebra of the propositions that are the arguments of the probabilities. This is real Bayesianism; accept no other!
Make this an urn problem. We are sampling ball bearings from an urn containing N ball bearings, identical except that each bearing is stamped lightly with its manufacturer's name. How many manufacturers are represented in the urn? The probability of sampling the observed number of ball bearings from each manufacturer, supposing that we know the number in the urn from each manufacturer, and that ball bearings are replaced in the urn after sampling, is the multinomial distn (standard). Without replacement, it is the (multivariate) hypergeometric distribution (so named because its generating fn is a hypergeometric function).
Attached to the variables representing the numbers of ball bearings is a suffix identifying the manufacturer. So our answer works when our prior info identifies every manufacturer that might be in the urn. This solution is a routine application of Bayesianism. (Sampling-theoretical approach??) But what if we have lost the list of manufacturers and their output capacities (the "key")? Or never had one? Suppose that after 20 samplings we have seen 15 ball bearings stamped by Smith, 3 by Jones and 2 by Davies. If we didn't know that these manufacturers even existed before the sampling, how can we have any prior info about them?
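The two sampling distributions can be evaluated directly for the observed sample (15 Smith, 3 Jones, 2 Davies). A sketch using only the standard library; the urn composition below is an assumption invented for illustration, not given in the talk:

```python
from math import comb, factorial, prod

# Assumed urn composition (N = 100 bearings); the observed counts are from
# the text. Both likelihoods condition on knowing the composition.
urn = {"Smith": 70, "Jones": 20, "Davies": 10}
obs = {"Smith": 15, "Jones": 3, "Davies": 2}     # n = 20 samplings
N, n = sum(urn.values()), sum(obs.values())

# With replacement: multinomial likelihood.
p_multinomial = (factorial(n) / prod(factorial(k) for k in obs.values())
                 * prod((urn[m] / N) ** obs[m] for m in obs))

# Without replacement: multivariate hypergeometric likelihood.
p_hypergeom = prod(comb(urn[m], obs[m]) for m in obs) / comb(N, n)

print(p_multinomial, p_hypergeom)
```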
We can still make progress if we have statistical info about the manufacturers. If we know that (eg) one manufacturer has more ball bearings in the urn than the rest combined, that manufacturer is likely to be Smith. More generally, we can assign a probability to manufacturers specified only statistically in the prior being the ones observed in the sampling. Then we borrow the analysis above, assuming that particular identification; then marginalise over all possible identifications. Finally, extract a posterior distn for the no. of manufacturers with ball bearings in the urn, using the counting trick with the δ-fn.
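Schematically, in notation assumed here rather than taken from the talk (D the data, I the prior info, σ an identification of the observed names with the statistically specified manufacturers, n the urn composition):

```latex
% Marginalising over possible identifications \sigma:
p(D \mid I) = \sum_{\sigma} p(D \mid \sigma, I)\, p(\sigma \mid I).
% The counting trick with a (Kronecker) delta then extracts the posterior
% for the number m of manufacturers represented, m(n) being the number of
% manufacturers with nonzero count in composition n:
p(m \mid D, I) = \sum_{n} \delta_{m,\, m(n)}\; p(n \mid D, I).
```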
Suppose we have statistical info that distinguishes between manufacturers, even though we don't know which is which, or how many there are. The principles of economics might give a scaling law, such that (eg) twice as many manufacturers make 10,000 ball bearings/day as make 100,000, etc (logarithmic). This induces a prior (see following slides). The scaling law allows us to choose a labelling of manufacturers according to the expected output of each (or any other variable that distinguishes them statistically).
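One way to realise such a scaling law numerically (an illustrative assumption, not the talk's formula): a power-law density of manufacturers over daily output x, n(x) ∝ x^(−a), with a chosen so that the count doubles when the output drops by a factor of 10:

```python
import math

# Power-law density of manufacturers over daily output x: n(x) proportional
# to x**(-a). Choosing a = log10(2) makes the count halve per decade of
# output, ie twice as many manufacturers at 10,000/day as at 100,000/day.
a = math.log10(2)   # about 0.301

def density_ratio(x1, x2):
    """Ratio of manufacturer counts at outputs x1 and x2 under the law."""
    return (x1 / x2) ** (-a)

print(density_ratio(1e4, 1e5))  # about 2: twice as many at the lower output
```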
Another labelling can be generated from the stats of the samples (or simply the order in which the manufacturers came out of the urn). This labelling is an unknown subpermutation of the labelling of the manufacturers in the prior, which was based on the scaling law. Using Bayes theorem we can get a posterior distn for which subpermutation it is (the prior over permutations is uniform). This variable enters the analysis as just another unknown that is ultimately to be marginalised over. Our answer is now a sum over a large number of quantities. (Care needed with normalisations!) But Bayesian computing continues to make progress too...
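A toy sketch of the combinatorics (all numbers assumed for illustration): matching 3 observed names to M = 5 statistically specified labels gives the injective assignments ("subpermutations"), each with a uniform prior:

```python
from itertools import permutations

# 3 observed manufacturers must be matched to M = 5 labels from the
# scaling-law prior. Each injective assignment is one term in the
# marginalisation sum; the prior over assignments is uniform.
observed = ["Smith", "Jones", "Davies"]
M = 5
assignments = list(permutations(range(M), len(observed)))
prior_each = 1.0 / len(assignments)

print(len(assignments))  # 5 * 4 * 3 = 60 injective assignments
# The full answer would be: p(D) = sum over assignments of
#   likelihood(D | assignment) * prior_each   (likelihood from the urn analysis)
```

Even in this toy case the sum has 60 terms; with realistic M the number of terms grows factorially, which is the computational burden noted above.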
NB This situation is not the same as the prior for a die that we know is weighted, but we don't know which face it is weighted towards. In that case we know the faces by name, and the prior is an exchangeable sum, with each term in the sum weighted toward a different face. Our prior might reasonably take a form in which the mean μ depends on the label j according to the scaling law, which gives a density of manufacturers wrt μ. The standard deviation, ie the variability of the manufacturer's output, is assumed to be small. The urn is filled in proportion to factory output.
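One concrete realisation of such a prior (an assumption for illustration, not a formula from the talk): a narrow Gaussian for n_j, the number of bearings in the urn from the manufacturer labelled j:

```latex
p(n_j \mid \mu_j, \sigma, I) \propto
  \exp\!\left[ -\frac{(n_j - \mu_j)^2}{2\sigma^2} \right],
\qquad \mu_j \text{ fixed by the scaling law}, \quad \sigma \text{ small}.
```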
The number of manufacturers was part of the conditioning info in this prior probability for the number of ball bearings in the urn by manufacturer. So we need a prior for the number of manufacturers in existence (of course!). Economics might also furnish this prior, given the size of the economy. Its tail will be important in answering how many manufacturers are/aren't in the urn.
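To see why the tail matters, here is a sketch with an assumed prior (a Poisson over the total number M of manufacturers; both the form and the mean are invented for illustration):

```python
import math

# Assumed prior over the total number M of manufacturers in existence:
# Poisson with mean set (say) by the size of the economy.
mean_M = 4.0

def prior_M(m):
    return math.exp(-mean_M) * mean_M ** m / math.factorial(m)

# We observed 3 distinct manufacturers in the sample. The tail P(M > 3) is
# the prior probability that manufacturers exist whose bearings we have
# not (yet) seen, which drives the posterior on how many are in the urn.
tail = 1.0 - sum(prior_M(m) for m in range(4))
print(tail)
```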
Conclusion: this is a nice "blind" problem, in which something apparently vital in defining the problem can be demoted to being known only probabilistically, and marginalised out at the end (although the computations become formidable). Tricky, in that what is lost when this demotion happens is the labelling, without which you apparently cannot get the problem off the ground. Actually, you can, by defining different labellings from the prior stats and the data stats, and relating these labellings probabilistically. What other problems might this trick solve?