From page 252...
... So, let me give you a little bit of background on this work. In a local information sphere, what we are really building is a distributed data-stream event-processing and data-mining system.
From page 253...
... Then I am going to talk a little bit about the actual techniques we are using, which are random projections. They are also called sketches, and they allow us to process network data streams at high speed and in very small space.
From page 254...
... There is one application which is out there, which is sensor networks, and actually, later in the talk, I am going to switch and talk just a little bit about that work, because I think it fits very nicely into the scope here. So, in computations over streaming data, the model really is that you have this stream processing engine maintaining a synopsis; these different streams are streaming through the stream processing engine, and the engine is physically distributed over different parts of the network.
From page 255...
... Again, the setting is that there are distributed data streams, and I would like to find out the number of matches.
From page 256...
... So, the main idea is that I would like to summarize these frequency vectors, and the way I can summarize them is as follows: I take the frequency vector and project it onto a random vector.
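The projection the speaker describes is the basic "tug-of-war" (AMS-style) sketch. A minimal Python illustration, with an assumed toy frequency vector and variable names of my own (not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frequency vector over attribute domain {0, ..., 7}:
# f[v] = number of stream items with value v.
f = np.array([5, 0, 3, 7, 1, 0, 2, 4])

# Random +/-1 vector (in the theory, four-wise independence
# suffices; here it is drawn fully at random for illustration).
xi = rng.choice([-1, 1], size=f.size)

# The sketch is a single counter: the projection of f onto xi.
# It can be maintained online: on seeing item v, add xi[v].
sketch = int(f @ xi)
print(sketch)
```

Because the sketch is linear in f, it can be maintained incrementally as items stream by, and sketches built at distributed sites over the same random vector can be combined by simple addition.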
From page 257...
... It is sort of a really bad property, if you look at it. Again, in expectation, I still get exactly what I am trying to estimate, but my variance is getting extremely large. Hopefully, I have a little bit of time, and I will tell you now about some techniques for reducing the variance, and then hopefully some techniques for doing multiple characterization for a set of sketch queries.
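One standard way to reduce the variance of sketch estimates, which may be what the speaker has in mind here, is to average many independent copies and then take a median of such averages. A Python sketch under that assumption (toy data and names are mine), estimating the self-join size of a frequency vector:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy frequency vector; its self-join size is what we estimate.
f = np.array([5, 0, 3, 7, 1, 0, 2, 4])
true_F2 = int((f ** 2).sum())

def estimate_F2(n_avg, n_med):
    """Median of n_med group estimates, each the mean of n_avg
    independent squared sketches (each unbiased for true_F2)."""
    group_means = []
    for _ in range(n_med):
        xis = rng.choice([-1, 1], size=(n_avg, f.size))
        estimates = (xis @ f) ** 2            # E[estimate] = sum(f**2)
        group_means.append(estimates.mean())  # averaging divides variance by n_avg
    return float(np.median(group_means))      # median boosts confidence

print(true_F2, estimate_F2(64, 9))
```

Averaging drives the variance down linearly in the number of copies; the outer median turns that into a high-probability guarantee.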
From page 258...
... and these are Fi and Gi, the frequency vectors. These are the frequencies of the different attribute values, for attribute values one through four.
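For two such frequency vectors F and G held at different sites, sketching both over the same random +/-1 vectors makes the product of the two sketches an unbiased estimate of the join size F·G. An illustrative Python sketch (the vectors below are my assumptions, not the talk's values):

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed frequencies of attribute values one through four at two sites.
F = np.array([4, 1, 0, 3])
G = np.array([2, 2, 5, 0])
true_join = int(F @ G)  # join size = sum_i F_i * G_i

# Both sites use the SAME random +/-1 vectors (e.g., a shared seed),
# so the product of their sketches is unbiased for the join size.
xis = rng.choice([-1, 1], size=(2000, F.size))
est = float(np.mean((xis @ F) * (xis @ G)))
print(true_join, round(est, 1))
```

Only the small sketches, not the frequency vectors themselves, need to be shipped across the network to produce the estimate.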
From page 259...
... So, I mean, you want to partition the domain into two parts, or 2K parts, such that it minimizes the sum of the variances, and the variance is sort of a product of self-join sizes. We can actually show that you should allocate this space in proportion to the variance, and that we only have to look at partitions in the order of the ratios of the two frequencies. This actually follows from some nice theory that comes from decision tree construction, a data mining problem, called Raymond's theorem.
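The contiguous-split observation — once attribute values are ordered by the ratio of the two frequencies, only contiguous splits need to be examined — can be illustrated with a small Python sketch. The frequency vectors and the variance proxy (a product of per-partition self-join sizes) are my illustrative assumptions, not values from the talk:

```python
import numpy as np

# Assumed frequency vectors F and G over a six-value attribute domain.
F = np.array([9, 1, 4, 2, 6, 3])
G = np.array([1, 8, 2, 5, 4, 3])

# Order attribute values by the ratio F_i / G_i; per the talk, the
# variance-minimizing split can then be found among contiguous splits.
order = np.argsort(F / G, kind="stable")
Fs, Gs = F[order], G[order]

def cost(fp, gp):
    # Variance proxy for one partition: product of self-join sizes.
    return int((fp ** 2).sum() * (gp ** 2).sum())

# Evaluate every contiguous two-way split and keep the cheapest.
best_cost, best_k = min(
    (cost(Fs[:k], Gs[:k]) + cost(Fs[k:], Gs[k:]), k)
    for k in range(1, F.size)
)
print(best_cost, best_k)
```

Restricting attention to contiguous splits in ratio order reduces the search from exponentially many partitions to a linear scan.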
From page 260...
... AUDIENCE: It seems possible in some cases you could get additional acceleration by allowing your two data sources to do a little bit of negotiation between themselves before they start sending off lots of information. For instance, they could initially show each other what their most frequent items are and, if they tend to agree on different items -- [comment off microphone.]

