What do I mean by data mining (see Figure 15.1)? Data mining in a very general sense is finding interesting structure in databases, and "structure" to me is a loose word. It could be patterns, statistical models, relationships, or models of a whole data set. A pattern is a simple description of a subset of the data. The relationships could hold over just a small sample of the data: some of the columns are correlated, say. We'll come back and argue why these things can be very meaningful even though they sound a bit fishy. Some of them are not even statistical models, but you can actually find a lot of interesting things this way. What do you do when you are data mining? You could do predictive modeling, so you have a variable that you are predicting from other variables. Or you could solve a harder problem, such as clustering, where you are just finding groups in the data. Dependency modeling is when we do a density estimate: basically, you are trying to model the joint probability density that generated the data in the first place, a much harder problem, though some techniques work. Another is summarization, which looks for relations between fields or associations. Sometimes finding correlations and pieces of structure in the data can be useful. Finally, the last class of techniques accounts for sequence. For example, suppose I want to analyze how people use my Web site. I could simply record which pages each user visited, ignoring the order. Or I could actually keep track of the fact that a user visited this page before that page before that page. It turns out that there are amazingly efficient algorithms that will find you all the frequent sequences in there, which is an interesting reduction. A lot of people are working on trying to relate this back to classical analysis techniques. There are a lot of interesting things that can be done when you account for the sequence in data and changes in data.
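The flavor of that reduction can be shown in a few lines. Here is a minimal pure-Python sketch of counting frequent contiguous page-visit sequences across user sessions; the page names, the session data, and the support threshold are all made up for illustration, and real frequent-sequence miners (e.g., the Apriori family) are far more sophisticated:

```python
from collections import Counter

def frequent_sequences(sessions, length, min_support):
    """Count contiguous page-visit subsequences of a given length and
    keep those that appear in at least min_support sessions."""
    counts = Counter()
    for session in sessions:
        seen = set()  # count each distinct sequence once per session
        for i in range(len(session) - length + 1):
            seen.add(tuple(session[i:i + length]))
        counts.update(seen)
    return {seq: n for seq, n in counts.items() if n >= min_support}

# Hypothetical clickstreams: each list is one user's page-visit order.
sessions = [
    ["home", "search", "product", "cart"],
    ["home", "search", "product"],
    ["home", "deals", "product", "cart"],
]

print(frequent_sequences(sessions, 2, min_support=2))
```

The point of the reduction is that instead of raw clickstreams you end up with a short list of order-sensitive patterns you can reason about.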
I want to make data mining easy to use, so let me start with an illustration. Suppose you have a Web store. Microsoft puts out a product called Site Server Commerce Edition 3.0, which helps you put up a Web store on the Internet. This means not just Web pages; people now have shopping baskets. Consumers can shop around and add things to their baskets, and you charge them money. Of course, at a Web store you don't have salespeople to help customers find what they want. So wouldn't it be nice to be able to predict, based on the mouse clicks and perhaps on what you have in your basket, what else in the store might be interesting? This is not a trivial problem, because these stores typically carry something like 50,000 to 100,000 different products, and finding what you are looking for can be an interesting experience. The basic idea is to take apart the pattern recognition algorithms and ask how people do analysis. How do they use analysis techniques? The traditional method is to take data out of the database. You create your own infrastructure to do analysis: you extract the data, and you start running these scripts and so forth. Soon enough you have created a whole bunch of droppings, and if you come back to the session two weeks later, you don't recall what the files meant. That's called a data management problem. It's exactly what a database was created to solve. So the whole idea is to decompose these operations in such a way that a lot of the work can live on the server. You never have to move data around; the data stay put. Of course, there are more sophisticated things that you want to do on the client, but the idea is that you almost never need--at least in the cases I'm familiar with--the details of the data. You actually can get away with what we call "sufficient statistics." If the database can provide these, you can build models much more easily.
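To make the "sufficient statistics" idea concrete, here is a small sketch under invented data: the "server side" reduces the raw transaction rows to co-occurrence counts once, and the "client side" estimates a conditional probability from those counts alone, never touching the rows again. The column names and numbers are hypothetical:

```python
from collections import defaultdict

# Hypothetical transaction rows: (basket_contains_camera, bought_tripod),
# each flag 1 or 0.
rows = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1)]

# "Server side": reduce the data to sufficient statistics -- here,
# simple co-occurrence counts. The raw rows never leave the database.
counts = defaultdict(int)
for camera, tripod in rows:
    counts[(camera, tripod)] += 1

# "Client side": estimate P(tripod | camera) from the counts alone.
def p_tripod_given_camera(camera):
    yes = counts[(camera, 1)]
    no = counts[(camera, 0)]
    return yes / (yes + no)

print(p_tripod_given_camera(1))
```

Notice that the model-building step needs only the four counts, not the transactions themselves, which is exactly why it can be pushed down to the database server.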
To give you an idea of how Microsoft thinks of the world, the database connectivity standard is called OLEDB, for "object linking and embedding for databases." It's the standard way to talk to databases or other data sources. The next release of our primary database product, SQL Server 2000, will include a component of the database engine that talks the same language, OLEDB, for data mining. We just published the specs on the Web. We have been working with approximately 100 vendors on revising the standard. The idea is basically to address some of the costs of data mining. Right now data mining is expensive. There are three big costs. One is data cleaning and data maintenance and all that kind of stuff. That's never going to go away. The second is building models, and that's usually the tiny part of it. The third is actually deploying models. Suppose you built the data mining model and now wanted to deploy it in the organization. Typically, you have to build your own infrastructure for deploying it. Every vendor has his own standard, and it becomes a mess. So the idea is that if everybody agrees on an interface, hopefully everybody wins. So this is the basic idea behind OLEDB. I won't talk much about it, but OLEDB is a language-based application program interface, with the look and feel of SQL. The idea was to go out of our way to make it feel like the familiar objects of the relational world. A data mining model looks like a table. You insert data into it. You join it with another table to do predictions, and so forth.
Figure 15.2 illustrates the mining process. You create the mining model, and then you enter data into it and you get the model. Then, since the model looks like another table in your database, you do what we call "a join" in the database to get the predicted values. The whole idea is to make it simple enough for a database developer to just use as a natural thing. Let's talk about data mining. I will use the example of an application I did at the Jet Propulsion Laboratory with George Djorgovski and Nick Weir at Caltech called SKICAT--Sky Image Cataloging and Analysis Tool (see Figure 15.3). The idea was to take data from the Second Palomar Observatory Sky Survey and turn them into a catalogue with about 2 billion entries that identifies objects for every position in the sky. I think we had approximately 40 different measurements. Of course, if we had done the work last year we probably would have reached 200.
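The create/insert/join lifecycle can be mimicked in a toy Python class. This is not the OLEDB API--just a sketch of the mental model, with a trivial majority-vote "model" and made-up column names (`age_group`, `buys_bike`):

```python
class MiningModel:
    """Toy stand-in for a mining model that 'looks like a table':
    you insert training rows into it, then join new rows against it
    to get predicted values (a per-group majority vote here)."""

    def __init__(self, input_col, target_col):
        self.input_col = input_col
        self.target_col = target_col
        self.votes = {}

    def insert(self, rows):           # analogous to INSERT INTO the model
        for row in rows:
            group = self.votes.setdefault(row[self.input_col], {})
            target = row[self.target_col]
            group[target] = group.get(target, 0) + 1

    def prediction_join(self, rows):  # analogous to a prediction join
        out = []
        for row in rows:
            votes = self.votes.get(row[self.input_col], {})
            pred = max(votes, key=votes.get) if votes else None
            out.append({**row, self.target_col: pred})
        return out

model = MiningModel("age_group", "buys_bike")
model.insert([
    {"age_group": "young", "buys_bike": "yes"},
    {"age_group": "young", "buys_bike": "yes"},
    {"age_group": "old", "buys_bike": "no"},
])
print(model.prediction_join([{"age_group": "young"}]))
```

The design point being illustrated is the interface, not the learner: any algorithm could sit behind `insert` and `prediction_join`, and the developer still only sees table-like operations.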
The basic idea is to go from the digitized images to the scatter plot (see Figure 15.4). There is a recognition problem here. The significance is that people couldn't tell you what these objects were by just inspecting them. The state-of-the-art methods at the time, maybe 4 or 5 years ago, would perform only down to a certain magnitude level, and we were trying to go one magnitude fainter, which is a lot more data in astronomy. The idea was to automate the processing and building of this catalogue.
Once you have done the basic image processing, you know that a bunch of pixels belong to an object. Now you can go and measure things about these objects. Someone asked me, "What should we measure?" Since I'm not an astronomer, I suggested using some standard reference guides and measuring everything that you can measure. We finally selected 40 measurements, because this was the limitation of what we could store eventually in the catalogue. This is where things get very interesting. Along with the digitized photo plate, which covers about 6 degrees square of the sky, we took an image using a CCD (charge-coupled device) telescope, allowing for a high-resolution view of a tiny portion of that plate. The trick here is that you can tell what the object is by looking at the high-resolution image. The idea is to measure all the attributes from the low-resolution plates and get the astronomer to label 3,000 objects to figure out the mapping. This is a standard problem, but if we solve it, we have a magic box that can tell you what objects are, with high reliability. Well, you actually don't know immediately what they are, but you can verify that, of course. I want to skip some of the details, but the point is that we were able to get 94 percent accuracy on this problem. Ninety percent was our magic threshold. The operating point in the field was somewhere between 65 and 75 percent accuracy. So this is a significant advance. It turns out that astronomers need the 90 percent accuracy to write papers and shoot down someone else's theory about the large-scale structure of the universe and how it emerged. So there are many benefits to this work. What I think is interesting to this audience is, Why did it work? This is a problem that astronomers looked at for 20 or 30 years, and this is what blew my mind: Here I come around and know nothing about astronomy. I build an algorithm and it works. So why did it work? It worked because this is intrinsically an eight-dimensional problem.
We had 40 dimensions, and you needed eight at once. If you dropped any one of those crucial eight--we didn't know which ones they were a priori--your accuracy would drop significantly. So why did astronomers not find it? They had been doing this for a good 20 years. They were plotting one variable against another, and when they got ambitious, they threw in one more variable. The astronomers never knew that eight was the magic number. That's why they never stumbled on the right technique: they could never look at all the data at once, which the algorithms could. Another problem was clustering. For example, once we succeeded in recognizing these objects, we returned to the data to see whether there were new kinds of objects in the data--things we didn't know about. I want to summarize quickly some of the results here. We assembled, just over one weekend, some 1,500 data points where the astronomers were certain there were only two classes of objects, stars and galaxies, and nothing else. We kept running the algorithms, and the algorithms kept identifying four types of objects, not two. It turns out that the algorithms never mixed stars and galaxies; they just separated galaxies into two subclasses and stars into two subclasses. I still remember the expression of shock on the graduate student's face when we pulled out the images and he recognized some classes of objects that are very well known in astronomy, sort of very standard. Then they started believing the algorithms too much, and I said let's do some real discovery here. So what's a real discovery? Let's find some high-redshift quasars, which are some of the most distant objects in the universe. To make the story very short, we used the catalogue's accurate star classifications to pull out a list of candidates. Then we did clustering on top of that to further separate out things that looked like suspicious objects. Then the astronomers went and targeted them with very high resolution instruments.
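To show the kind of clustering at work in finding unexpected subclasses, here is a minimal one-dimensional k-means sketch in pure Python; the "brightness" values are invented, and a real run would cluster in the full 40-dimensional measurement space:

```python
def kmeans_1d(values, k, iters=20):
    """Minimal 1-D k-means: repeatedly assign each value to its
    nearest center, then move each center to its group's mean."""
    # Spread the initial centers across the sorted data.
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return sorted(centers)

# Two well-separated blobs of made-up "brightness" measurements.
data = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
print(kmeans_1d(data, 2))
```

The algorithm is told only how many groups to look for, not what they mean; the surprise in the sky-survey story was that the groups it found corresponded to physically meaningful subclasses.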
The point is that this is all numbers, but they discovered and documented 20 new quasars in 40 times less observation time. This is when we started getting a lot of attention, because they were finding quasars faster than anybody else. The point is, this is challenging. This is the kind of problem in which you can't just do sampling. You can't just pick some data and say I hope to find something, because most likely in any sample you take, you are not going to find anything of interest. Let's go back to our 13 coin flips, since we are almost out of time, and do some analysis. Hopefully you have diligently done your experiments and you are ready to report. The point of the experiment is that in the data mining I talked about, we can model some data, and now we can do it over a lot of databases. We can build lots of models in very little time. So what are the dangers here? I want to take you down one dangerous road and then show you how easily you can overcome it. What we are doing in our experiment is, as a group, maximizing over many models. In the mutual fund industry, Peter Lynch is somewhat of a legend. One of the most outstanding things he did was that for 11 out of 13 years, he outperformed the market. Let's do some analysis and see whether we have any Peter Lynches in the audience. What you did was to flip a coin. You're going to assume that you are mutual fund managers and that each year you have a 50-50 chance of beating the market or doing worse than the market. You can do some analysis, and you could see that any one of you has a pretty small chance of beating the market. To do something like what Peter Lynch did, forget it. You're not going to beat him, at least not in this crowd. What's different is this: if we define a new variable that is the maximum over all of you as random variables, the picture changes dramatically. When I ask you to keep track of who the best people in this room are at managing mutual funds, the equations change.
I don't know what size our audience is here, but let's say we have about 100 trials, since all of you did two experiments. I would have a 67 percent chance of finding a Peter Lynch in this audience. Let's just test that. How many people got seven or more heads? Eight or more heads? Nine or more heads? Ten or more? We have almost a Peter Lynch. Eleven or more? Yes, we have a winner. This is where the danger lies. You are running lots of models, and you are taking the maximum over them. In fact, it turns out that in a group of 500, the expected maximum is 11.6 heads. So it's nothing special. Remember that there are 2,000 mutual fund managers in the market, so draw your own conclusion. The point is this sounds scary, and it sounds dangerous, and when we were mining, it was like finding thousands of models over a fixed data set. So what is happening? It turns out that with a little bit of hygiene, you can actually avoid this problem completely. That whole analysis collapses the minute I do something as simple as holding out a sample on the side that I never touch, to verify my results against. I'm pretty sure the Peter Lynch in this room would probably fail on these extra data, and that's why I can avoid that danger. So a lot of these mining dangers are actually overblown. In reality, if you do a little bit of hygiene, you can overcome them.
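The arithmetic behind that 67 percent figure is a short binomial-tail calculation, sketched here in Python (the audience size of 100 trials is the figure assumed above):

```python
from math import comb

def p_at_least(heads, flips, p=0.5):
    """Probability that one fair-coin 'fund manager' beats the market
    in at least `heads` of `flips` years."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) * p ** flips

# One manager matching Peter Lynch's 11-of-13 record by pure luck:
p_one = p_at_least(11, 13)          # about 0.011

# Chance that at least one of 100 independent trials does it:
p_any = 1 - (1 - p_one) ** 100      # about 0.68

print(p_one, p_any)
```

So an event that is roughly a 1-in-100 fluke for any single manager becomes more likely than not once you take the maximum over a hundred of them, which is exactly the multiple-comparisons trap the holdout sample guards against.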
Let me conclude, since I'm running out of time. Figure 15.5 is the summary take-home message: Where are we today in terms of the world of databases, data warehouses, and so forth? The analogy that always comes to my mind is ancient Egypt. Why is that? Well, we know that we are building these huge data warehouses. In fact, if you look at a data warehouse, you hear about companies spending millions of dollars, and you look at how they use their warehouses. These data stores are only stores. You just store into them; you almost never take the data out. That's a data tomb. It's an engineering feat to build one of these warehouses, but there are many challenges to making them useful.