What do I mean by data mining (see Figure 15.1)? Data mining in a very general sense is finding interesting structure in databases, and "structure" to me is a loose word. It could be patterns, statistical models, relationships, or models of a whole data set. A pattern is a simple description of a subset of the data. The relationships could hold over just a small sample of the data: some of the columns are correlated, say. We'll come back and argue why these things can be very meaningful even though they sound a bit fishy. Some of them are not even statistical models, but you can actually find a lot of interesting things this way. What do you do when you are data mining? You could do predictive modeling, so you have a variable that you are predicting from other variables. Or you could solve a harder problem, such as clustering, where you are just finding groups in the data. Dependency modeling is when we do a density estimate: basically, you are trying to model the joint probability density that generated the data in the first place, a much harder problem, though some techniques work. Another is summarization, which looks for relations between fields or associations. Sometimes finding correlations and pieces of structure in the data can be useful. Finally, the last class of techniques accounts for sequence. For example, suppose I want to analyze how people use my Web site. I could simply record which pages each user visited, ignoring the order. Or I could actually keep track of the fact that a user visited this page before that page before that page. It turns out that there are amazingly efficient algorithms that will find you all the frequent sequences in there, which is an interesting reduction. A lot of people are working on trying to relate this back to classical analysis techniques. There are a lot of interesting things that can be done when you account for the sequence in data and changes in data.
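The flavor of that reduction can be shown in a few lines. Here is a minimal pure-Python sketch of counting frequent contiguous page-visit sequences across user sessions; the page names, the session data, and the support threshold are all made up for illustration, and real frequent-sequence miners (e.g., the Apriori family) are far more sophisticated:

```python
from collections import Counter

def frequent_sequences(sessions, length, min_support):
    """Count contiguous page-visit subsequences of a given length and
    keep those that appear in at least min_support sessions."""
    counts = Counter()
    for session in sessions:
        seen = set()  # count each distinct sequence once per session
        for i in range(len(session) - length + 1):
            seen.add(tuple(session[i:i + length]))
        counts.update(seen)
    return {seq: n for seq, n in counts.items() if n >= min_support}

# Hypothetical clickstreams: each list is one user's page-visit order.
sessions = [
    ["home", "search", "product", "cart"],
    ["home", "search", "product"],
    ["home", "deals", "product", "cart"],
]

print(frequent_sequences(sessions, 2, min_support=2))
```

The point of the reduction is that instead of raw clickstreams you end up with a short list of order-sensitive patterns you can reason about.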
I want to make data mining easy to use, so let me start with an illustration. Suppose you have a Web store. Microsoft puts out a product called Site Server Commerce Edition 3.0, which helps you put up a Web store on the Internet. This means not just Web pages; people now have shopping baskets. Consumers can shop around and add things to their baskets, and you charge them money. Of course, at a Web store you don't have salespeople to help customers find what they want. So wouldn't it be nice to be able to predict, based on the mouse clicks and perhaps on what you have in your basket, what else in the store might be interesting? This is not a trivial problem, because these stores typically carry something like 50,000 to 100,000 different products, and finding what you are looking for can be an interesting experience. The basic idea is to take apart the pattern recognition algorithms and ask how people do analysis. How do they use analysis techniques? The traditional method is to take data out of the database. You create your own infrastructure to do analysis: you extract the data, and you start running these scripts and so forth. Soon enough you have created a whole bunch of droppings, and if you come back to the session two weeks later, you don't recall what the files meant. That's called a data management problem. It's exactly what a database was created to solve. So the whole idea is to decompose these operations in such a way that a lot of the work can live on the server. You never have to move data around; the data stay put. Of course, there are more sophisticated things that you want to do on the client, but the idea is that you almost never need--at least in the cases I'm familiar with--the details of the data. You actually can get away with what we call "sufficient statistics." If the database can provide these, you can build models much more easily.
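To make the "sufficient statistics" idea concrete, here is a small sketch under invented data: the "server side" reduces the raw transaction rows to co-occurrence counts once, and the "client side" estimates a conditional probability from those counts alone, never touching the rows again. The column names and numbers are hypothetical:

```python
from collections import defaultdict

# Hypothetical transaction rows: (basket_contains_camera, bought_tripod),
# each flag 1 or 0.
rows = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1)]

# "Server side": reduce the data to sufficient statistics -- here,
# simple co-occurrence counts. The raw rows never leave the database.
counts = defaultdict(int)
for camera, tripod in rows:
    counts[(camera, tripod)] += 1

# "Client side": estimate P(tripod | camera) from the counts alone.
def p_tripod_given_camera(camera):
    yes = counts[(camera, 1)]
    no = counts[(camera, 0)]
    return yes / (yes + no)

print(p_tripod_given_camera(1))
```

Notice that the model-building step needs only the four counts, not the transactions themselves, which is exactly why it can be pushed down to the database server.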
To give you an idea of how Microsoft thinks of the world, the database connectivity standard is called OLEDB, for "object linking and embedding for databases." It's the standard way to talk to databases or other data sources. The next release of our primary database product, SQL Server 2000, will include a component of the database engine that talks the same language, OLEDB, for data mining. We just published the specs on the Web. We have been working with approximately 100 vendors on revising the standard. The idea is basically to address some of the costs of data mining. Right now data mining is expensive. There are three big costs. One is data cleaning and data maintenance and all that kind of stuff. That's never going to go away. The second is building models, and that's usually the tiny part of it. The third is actually deploying models. Suppose you built the data mining model and now wanted to deploy it in the organization. Typically, you have to build your own infrastructure for deploying it. Every vendor has his own standard, and it becomes a mess. So the idea is that if everybody agrees on an interface, hopefully everybody wins. So this is the basic idea behind OLEDB. I won't talk much about it, but OLEDB is a language-based application program interface, with the look and feel of SQL. The idea was to go out of our way to make it feel like the familiar objects of the relational world. A data mining model looks like a table. You insert data into it. You join it with another table to do predictions, and so forth.
Figure 15.2 illustrates the mining process. You create the mining model, and then you enter data into it and you get the model. Then, since the model looks like another table in your database, you do what we call "a join" in the database to get the predicted values. The whole idea is to make it simple enough for a database developer to just use as a natural thing. Let's talk about data mining. I will use the example of an application I did at the Jet Propulsion Laboratory with George Djorgovski and Nick Weir at Caltech called SKICAT--Sky Image Cataloging and Analysis Tool (see Figure 15.3). The idea was to take data from the Second Palomar Observatory Sky Survey and turn them into a catalogue with about 2 billion entries that identifies objects for every position in the sky. I think we had approximately 40 different measurements. Of course, if we had done the work last year we probably would have reached 200.
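The create/insert/join lifecycle can be mimicked in a toy Python class. This is not the OLEDB API--just a sketch of the mental model, with a trivial majority-vote "model" and made-up column names (`age_group`, `buys_bike`):

```python
class MiningModel:
    """Toy stand-in for a mining model that 'looks like a table':
    you insert training rows into it, then join new rows against it
    to get predicted values (a per-group majority vote here)."""

    def __init__(self, input_col, target_col):
        self.input_col = input_col
        self.target_col = target_col
        self.votes = {}

    def insert(self, rows):           # analogous to INSERT INTO the model
        for row in rows:
            group = self.votes.setdefault(row[self.input_col], {})
            target = row[self.target_col]
            group[target] = group.get(target, 0) + 1

    def prediction_join(self, rows):  # analogous to a prediction join
        out = []
        for row in rows:
            votes = self.votes.get(row[self.input_col], {})
            pred = max(votes, key=votes.get) if votes else None
            out.append({**row, self.target_col: pred})
        return out

model = MiningModel("age_group", "buys_bike")
model.insert([
    {"age_group": "young", "buys_bike": "yes"},
    {"age_group": "young", "buys_bike": "yes"},
    {"age_group": "old", "buys_bike": "no"},
])
print(model.prediction_join([{"age_group": "young"}]))
```

The design point being illustrated is the interface, not the learner: any algorithm could sit behind `insert` and `prediction_join`, and the developer still only sees table-like operations.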
The basic idea is to go from the digitized images to the scatter plot (see Figure 15.4). There is a recognition problem here. The significance is that people couldn't tell you what these objects were by just inspecting them. The state-of-the-art methods at the time, maybe 4 or 5 years ago, would perform only down to a certain magnitude level, and we were trying to go one magnitude fainter, which is a lot more data in astronomy. The idea was to automate the processing and building of this catalogue.
Once you have done the basic image processing, you know that a bunch of pixels belong to an object. Now you can go and measure things about these objects. Someone asked me, "What should we measure?" Since I'm not an astronomer, I suggested using some standard reference guides and measuring everything that you can measure. We finally selected 40 measurements, because this was the limitation of what we could store eventually in the catalogue. This is where things get very interesting. Along with the digitized photo plate, which covers about 6 degrees square of the sky, we took an image using a CCD (charge-coupled device) telescope, allowing for a high-resolution view of a tiny portion of that plate. The trick here is that you can tell what the object is by looking at the high-resolution image. The idea is to measure all the attributes from the low-resolution plates and get the astronomer to label 3,000 objects to figure out the mapping. This is a standard problem, but if we solve it, we have a magic box that can tell you what objects are, with high reliability. Well, you actually don't know immediately what they are, but you can verify that, of course. I want to skip some of the details, but the point is that we were able to get 94 percent accuracy on this problem. Ninety percent was our magic threshold. The operating point in the field was somewhere between 65 and 75 percent accuracy. So this is a significant advance. It turns out that astronomers need the 90 percent accuracy to write papers and shoot down someone else's theory about the large-scale structure of the universe and how it emerged. So there are many benefits to this work. What I think is interesting to this audience is, Why did it work? This is a problem that astronomers looked at for 20 or 30 years, and this is what blew my mind: Here I come around and know nothing about astronomy. I build an algorithm and it works. So why did it work? It worked because this is intrinsically an eight-dimensional problem.
We had 40 dimensions, and you needed eight at once. If you dropped any one of those crucial eight--we didn't know which ones they were a priori--your accuracy would drop significantly. So why did astronomers not find it? They had been doing this for a good 20 years. They were plotting one variable against another, and when they got ambitious, they threw in one more variable. The astronomers never knew that eight was the magic number. That's why they never stumbled on the right technique: they could never look at all the data at once, which the algorithms could. Another problem was clustering. For example, once we succeeded in recognizing these objects, we returned to the data to see whether there were new kinds of objects in the data--things we didn't know about. I want to summarize quickly some of the results here. We assembled, just over one weekend, some 1,500 data points where the astronomers were certain there were only two classes of objects, stars and galaxies, and nothing else. We kept running the algorithms, and the algorithms kept identifying four types of objects, not two. It turns out that the algorithms never mixed stars and galaxies; they just separated galaxies into two subclasses and stars into two subclasses. I still remember the expression of shock on the graduate student's face when we pulled out the images and he recognized some classes of objects that are very well known in astronomy, sort of very standard. Then they started believing the algorithms too much, and I said let's do some real discovery here. So what's a real discovery? Let's find some high-redshift quasars, which are some of the most distant objects in the universe. To make the story very short, we used the catalogue's accurate star classifications to pull out a list of candidates. Then we did clustering on top of that to further separate out things that looked like suspicious objects. Then the astronomers went and targeted them with very high resolution instruments.
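To show the kind of clustering at work in finding unexpected subclasses, here is a minimal one-dimensional k-means sketch in pure Python; the "brightness" values are invented, and a real run would cluster in the full 40-dimensional measurement space:

```python
def kmeans_1d(values, k, iters=20):
    """Minimal 1-D k-means: repeatedly assign each value to its
    nearest center, then move each center to its group's mean."""
    # Spread the initial centers across the sorted data.
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return sorted(centers)

# Two well-separated blobs of made-up "brightness" measurements.
data = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
print(kmeans_1d(data, 2))
```

The algorithm is told only how many groups to look for, not what they mean; the surprise in the sky-survey story was that the groups it found corresponded to physically meaningful subclasses.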
The point is that this is all numbers, but they discovered and documented 20 new quasars in 40 times less observation time. This is when we started getting a lot of attention, because they were finding quasars faster than anybody else. The point is, this is challenging. This is the kind of problem in which you can't just do sampling. You can't just pick some data and say I hope to find something, because most likely in any sample you take, you are not going to find anything of interest. Let's go back to our 13 coin flips, since we are almost out of time, and do some analysis. Hopefully you have diligently done your experiments and you are ready to report. The point of the experiment is that in the data mining I talked about, we can model some data, and now we can do it over a lot of databases. We can build lots of models in very little time. So what are the dangers here? I want to take you down one dangerous road and then show you how easily you can overcome it. What we are doing in our experiment is, as a group, maximizing over many models. In the mutual fund industry, Peter Lynch is somewhat of a legend. One of the most outstanding things he did was that for 11 out of 13 years, he outperformed the market. Let's do some analysis and see whether we have any Peter Lynches in the audience. What you did was to flip a coin. You're going to assume that you are mutual fund managers and that each year you have a 50-50 chance of beating the market or doing worse than the market. You can do some analysis, and you could see that any one of you has a pretty small chance of beating the market. To do something like what Peter Lynch did, forget it. You're not going to beat him, at least not in this crowd. What's different is this: if we define a new variable that is the maximum over all of you as random variables, the picture changes dramatically. When I ask you to keep track of who the best people in this room are at managing mutual funds, the equations change.
I don't know what size our audience is here, but let's say we have about 100 trials, since all of you did two experiments. I would have a 67 percent chance of finding a Peter Lynch in this audience. Let's just test that. How many people got seven or more heads? Eight or more heads? Nine or more heads? Ten or more? We have almost a Peter Lynch. Eleven or more? Yes, we have a winner. This is where the danger lies. You are running lots of models, and you are taking the maximum over them. In fact, it turns out that in a group of 500, the expected maximum is 11.6 heads. So it's nothing special. Remember that there are 2,000 mutual fund managers in the market, so draw your own conclusion. The point is this sounds scary, and it sounds dangerous, and when we were mining, it was like finding thousands of models over a fixed data set. So what is happening? It turns out that with a little bit of hygiene, you can actually avoid this problem completely. That whole analysis collapses the minute I do something as simple as holding out a sample on the side that I never touch, to verify my results against. I'm pretty sure the Peter Lynch in this room would probably fail on these extra data, and that's why I can avoid that danger. So a lot of these mining dangers are actually overblown. In reality, if you do a little bit of hygiene, you can overcome them.
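The arithmetic behind that 67 percent figure is a short binomial-tail calculation, sketched here in Python (the audience size of 100 trials is the figure assumed above):

```python
from math import comb

def p_at_least(heads, flips, p=0.5):
    """Probability that one fair-coin 'fund manager' beats the market
    in at least `heads` of `flips` years."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) * p ** flips

# One manager matching Peter Lynch's 11-of-13 record by pure luck:
p_one = p_at_least(11, 13)          # about 0.011

# Chance that at least one of 100 independent trials does it:
p_any = 1 - (1 - p_one) ** 100      # about 0.68

print(p_one, p_any)
```

So an event that is roughly a 1-in-100 fluke for any single manager becomes more likely than not once you take the maximum over a hundred of them, which is exactly the multiple-comparisons trap the holdout sample guards against.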
Let me conclude, since I'm running out of time. Figure 15.5 is the summary take-home message: Where are we today in terms of the world of databases, data warehouses, and so forth? The analogy that always comes to my mind is ancient Egypt. Why is that? Well, we know that we are building these huge data warehouses. In fact, if you look at a data warehouse, you hear about companies spending millions of dollars, and you look at how they use their warehouses. These data stores are only stores. You just store into them; you almost never take the data out. That's a data tomb. It's an engineering feat to build one of these warehouses, but there are many challenges to making them useful.