Pages 363-388

From page 363...
... I am going to be talking about a whole bunch of data structures which are very promising for allowing us to do apparently rather expensive statistics on large amounts of data; rather than trying to speed up cheap statistics, we are trying to do some of the apparently very expensive algorithms.
From page 365...
... Even building a covariance matrix requires you to do that, and in some cases there are things for which you need to look at all triples and all quadruples of covariances. So, these things which scale quadratically or cubically with the amount of data or the number of attributes are very frightening.
From page 366...
... Another related example is if you are doing something like Gaussian mixture model clustering, where you have a large number of records and a large number of clusters. Certainly, if you are using something like an AIC or a BIC type of measure for creating your model, you would notice that the number of Gaussians that you created in your mixture model grows as the number of records grows.
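That growth in the number of components under an information criterion can be seen with off-the-shelf tools. Below is a minimal sketch, assuming scikit-learn's GaussianMixture and synthetic data (not the speaker's accelerated implementation), of choosing the number of Gaussians by minimizing BIC:

```python
# Minimal sketch: pick the number of mixture components by BIC.
# Synthetic data; scikit-learn's GaussianMixture is used as a stand-in
# for the accelerated implementation discussed in the talk.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(500, 2)) for c in (0.0, 2.0, 5.0)])

best_k, best_bic = None, np.inf
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic = gmm.bic(X)                      # lower BIC is better
    if bic < best_bic:
        best_k, best_bic = k, bic

print(f"BIC selects {best_k} components")
```

With more records, the criterion tends to justify more components, which is the scaling problem the speaker is pointing at.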
From page 367...
... For building dependency trees or Bayesian networks, two now very popular things to do on commercial and scientific data, you also need to look not just over pairs, but also over n-tuples of attributes.
From page 369...
... recursion, I happen to be in a situation where I am asking: how many pairs of data points have one data point in this left rectangle and one data point in this right rectangle, and are within distance 0.4 of each other? Now, the way I am going to answer that is by breaking that single question into two smaller questions.
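The recursion just described (prune when the rectangles are too far apart, count everything when every cross pair must qualify, and split the question otherwise) can be sketched as follows. This is an illustrative reconstruction, not the speaker's code; the node layout, leaf size, and splitting rule are assumptions.

```python
# Sketch: count pairs (p, q), p from node A and q from node B, with dist(p, q) <= r,
# using bounding-box distance bounds to prune or bulk-count whole subtrees.
import numpy as np

class Node:
    def __init__(self, points):
        self.points = points                      # (n, d) array of data points
        self.lo = points.min(axis=0)              # bounding-box corners
        self.hi = points.max(axis=0)
        self.left = self.right = None
        if len(points) > 32:                      # split along the widest dimension
            dim = np.argmax(self.hi - self.lo)
            order = np.argsort(points[:, dim])
            mid = len(points) // 2
            self.left = Node(points[order[:mid]])
            self.right = Node(points[order[mid:]])

def min_dist(a, b):
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    return np.sqrt((gap ** 2).sum())

def max_dist(a, b):
    span = np.maximum(a.hi - b.lo, b.hi - a.lo)
    return np.sqrt((span ** 2).sum())

def count_pairs(a, b, r):
    if min_dist(a, b) > r:                        # exclusion: no pair can qualify
        return 0
    if max_dist(a, b) <= r:                       # inclusion: every pair qualifies
        return len(a.points) * len(b.points)
    if a.left is None and b.left is None:         # two leaves: count exhaustively
        d = np.linalg.norm(a.points[:, None, :] - b.points[None, :, :], axis=-1)
        return int((d <= r).sum())
    # otherwise break the single question into smaller questions, as in the talk
    if a.left is None:
        return count_pairs(a, b.left, r) + count_pairs(a, b.right, r)
    if b.left is None:
        return count_pairs(a.left, b, r) + count_pairs(a.right, b, r)
    return (count_pairs(a.left, b.left, r) + count_pairs(a.left, b.right, r) +
            count_pairs(a.right, b.left, r) + count_pairs(a.right, b.right, r))

# Example: pairs within distance 0.4, one point from each of two rectangles.
pts = np.random.default_rng(1).random((2000, 2))
left, right = Node(pts[pts[:, 0] < 0.5]), Node(pts[pts[:, 0] >= 0.5])
print(count_pairs(left, right, 0.4))
```

The exclusion and inclusion tests are what let whole blocks of pairs be counted or discarded without ever touching the individual points.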
From page 370...
... We have been able to run this on data sets with 14 million records, and that would have taken over a year to run otherwise; in our case, it took 3 hours. So, that is one example of very simple geometry helping do something we wanted, and we didn't resort to any approximation there.
From page 371...
... [Slide: the two-point function example, counting how many pairs.]
From page 372...
... Suppose you have this as a two-dimensional Gaussian, and I am just showing you the covariance matrix to give you an idea.
From page 374...
... I am going to run Gaussian mixture modeling and I am going to run 15 iterations. So, the thing I just wanted to show you is that, by using tricks like that, you can do these things to your mixture models very quickly.
From page 377...
... It is slowing down a little bit now. It is up to about 50 Gaussians.
From page 379...
... For example, you end up wanting to do this if you are building a Bayesian network. You end up wanting to do it if you are, for instance, using a very nice algorithm by DuMouchel and Pregibon that they introduced a couple of years ago, based on doing lots and lots of empirical Bayes tests on subsets of the data.
From page 380...
... Here is a data structure which actually implements this impossibly expensive database I mentioned.
From page 381...
... Any node here with a count of zero: any specializations of those queries also count as zero. [Slide: with a tiny census dataset this would need 400 terabytes of memory; with a medium-size pregnancy-tracking "Birth" dataset it would need far more.] So, that saved us a little bit of memory, but not much.
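Here is a simplified sketch of the zero-count pruning idea just mentioned: store a count at each node and build children only for attribute values that actually occur, so any query specializing a zero-count node is implicitly zero. This is not the full data structure from the talk (it lacks the subtraction trick described on the following pages), and, consistent with the speaker's remark, zero pruning alone does not save much. The function names and layout are illustrative.

```python
# Simplified count tree with zero-count pruning: absent (zero-count) children
# are never stored, so their counts are returned as zero implicitly.

def build(records, attr=0):
    """records: list of tuples of attribute values; returns a nested count tree."""
    node = {"count": len(records), "children": {}}
    n_attrs = len(records[0]) if records else 0
    for a in range(attr, n_attrs):
        groups = {}
        for rec in records:
            groups.setdefault(rec[a], []).append(rec)
        # only values that occur get a child; absent values are implicitly zero
        node["children"][a] = {v: build(g, a + 1) for v, g in groups.items()}
    return node

def count(node, query):
    """query: dict {attribute index: value}; count of records matching all of it."""
    if not query:
        return node["count"]
    a = min(query)                                   # descend attributes in order
    child = node["children"].get(a, {}).get(query[a])
    if child is None:                                # pruned branch: count is zero
        return 0
    return count(child, {k: v for k, v in query.items() if k != a})

# Example: records over three attributes.
tree = build([(0, 1, 1), (0, 1, 0), (1, 0, 1)])
print(count(tree, {0: 0, 1: 1}))   # 2
print(count(tree, {0: 1, 1: 1}))   # 0, never stored
```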
From page 382...
... I will use this notation of a contingency table of a set of attributes.
From page 383...
... There are two properties that contingency tables have, and they are very similar to the properties that you get from the axioms of probability, only this one is based on counts. One of them is that, for the marginal contingency table over a2, I can just row-wise add together the conditional contingency tables over each of the values of a1.
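A tiny numerical illustration of that first property, with made-up counts: the rows of the joint table are the conditional contingency tables given each value of a1, and adding them together gives the marginal over a2.

```python
# Synthetic counts: rows are ct(a2 | a1 = v); summing rows gives the marginal over a2.
import numpy as np

joint = np.array([[10, 3, 7],      # ct(a2 | a1 = 0)
                  [ 4, 8, 2],      # ct(a2 | a1 = 1)
                  [ 1, 5, 6]])     # ct(a2 | a1 = 2)

marginal_a2 = joint.sum(axis=0)    # add the conditional tables together
print(marginal_a2)                 # [15 16 15]
```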
From page 384...
... To do this one, it just realizes it has to create a table full of zeroes. This is the place where it looks like we are getting stuck, because we don't have any data structure to run down to compute this conditional contingency table.
From page 385...
... If I do another call, a recursive call on my data structure here, asking it to create for me just the marginal contingency table for a3, then once I subtract these three contingency tables, that allows me to create this conditional contingency table, which I glue together with the remaining three, and that gets me the joint.
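Here is a small worked sketch of that subtraction step, with made-up counts: one conditional contingency table is not stored, so it is recovered by subtracting the stored conditionals from the marginal obtained by the recursive call, and then glued back in to form the joint. The attribute values and numbers are illustrative.

```python
# Recovering a missing conditional contingency table by subtraction, then gluing
# it back with the stored ones to form the joint table. Synthetic counts.
import numpy as np

# Stored conditional contingency tables over a3, for three of a1's four values.
stored = {
    0: np.array([5, 2, 1]),
    1: np.array([3, 3, 4]),
    3: np.array([1, 0, 6]),
}
marginal_a3 = np.array([12, 9, 14])          # obtained by a recursive call

# Recover the missing conditional (here, for a1 = 2) by subtraction, then glue.
missing = marginal_a3 - sum(stored.values())
joint = np.vstack([stored.get(v, missing) for v in range(4)])
print(missing)   # [3 4 3]
print(joint)
```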
From page 386...
... So, when we actually do use this for various things, we have to search, in this case, over a relatively small database, just 1.5 million records and 27 attributes, so 27 covariates.
From page 387...
... We are currently trying to take that national, to the stage where we are still going to be, by your standards, quite small. We will be getting in 10 million records a day, so not the billions of records that some of you have been talking about, and we are hoping to apply this in those cases.
From page 388...
... So, with some of these algorithms, like the mixture-of-Gaussians ones, we get into computational problems at about 20 dimensions. Some of the other things, like the kernel methods or the two-point counts we have done, we actually these days run in a different data structure called a metric tree, where everything is stored in balls instead of hyper-rectangles.
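A metric (ball) tree of this kind is available off the shelf in scikit-learn, which also exposes a two-point counting routine. The sketch below, on synthetic data, is a stand-in for the speaker's own implementation rather than a description of it.

```python
# Two-point counting with a ball tree (points stored in balls, not hyper-rectangles),
# using scikit-learn's BallTree on synthetic 3-D data.
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.random((100_000, 3))

tree = BallTree(X)
radii = np.array([0.01, 0.05, 0.1])
counts = tree.two_point_correlation(X, radii)   # number of pairs within each radius
print(counts)
```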

