(Don't read this page. It is a work in progress for a Fall'19 graduate automated SE subject at NC State. Come back in mid-October!)
Recall that columns of numbers can be divided into a few bins (a.k.a. ranges) using discretization. Discretization converts a (potentially) infinite range of numbers into a small set of symbols. For example, a range of numbers might be divided into 12 bins.
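For instance, here is a minimal sketch of equal-width binning (the bin count and the sample values are purely illustrative):

```python
def equal_width_bins(numbers, k=12):
    """Map each number to one of k equal-width bins, labelled 0..k-1."""
    lo, hi = min(numbers), max(numbers)
    width = (hi - lo) / k or 1          # avoid divide-by-zero when all values are equal
    return [min(k - 1, int((x - lo) / width)) for x in numbers]

print(equal_width_bins([3, 19, 27, 44, 61, 118]))   # [0, 1, 2, 4, 6, 11]
```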
Clustering can be along the x-axis (as above) or (for time-series data) on the y-axis.
<img align=right width=400 src="img/ngram.png"> Discretization can lose some numeric nuances. On the other hand:
Not only is explanation easier over ranges, it is also more general (in the sense that it covers more examples). That is, before discretization, users can expect rows with point values like loc=50 and experience=2years. While that is interesting, it does not tell the user how much they can safely change things without changing the effects seen in the data. If the data is discretized, then they can instead inspect rows described by ranges (bins). When reasoning about models built from such discretized ranges, it is easier to work out how much things can safely change without affecting the overall results.
Hence, discretization is very useful. For example, having discretized a time series, NASA would watch its rockets looking for unusual sequences of time steps. To that end, they built an n-gram model that used the last n letters to predict the (n+1)th letter. This can be displayed in a simple tree where the width of each branch shows how often it is used (so fat branches are most frequent). Note that in the following, if ever we see bcb then we can relax (since that is normal operations), but if we see aaa then we might get worried (since that is something new and rare).
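A minimal sketch of that idea (the symbol stream, the value of n, and the rarity threshold are all illustrative):

```python
from collections import Counter

def ngram_counts(symbols, n=3):
    """Count how often each n-letter sequence appears in a discretized stream."""
    return Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))

def is_unusual(window, counts, min_count=2):
    """Flag a window of symbols that was rarely (or never) seen before."""
    return counts[tuple(window)] < min_count

train = list("bcbcbcbabcbcbcbcbabcb")      # "letters" from a discretized time series
counts = ngram_counts(train)

print(is_unusual("bcb", counts))   # False: seen often, so normal operations
print(is_unusual("aaa", counts))   # True: new and rare, so worth a closer look
```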
It is surprising how few discretized ranges are enough to capture domain semantics.
Discretization can be unsupervised or supervised. Unsupervised discretization just looks at each column by itself. Simple unsupervised strategies include splitting a column using statistics such as its mean and standard deviation. Those statistics can be collected incrementally: start with $n, \mu, m_2 = 0, 0, 0$. As every new number $x$ arrives, set $n = n + 1$ and $d = x - \mu$ and $\mu = \mu + d/n$ and $m_2 = m_2 + d(x - \mu)$ and, if $n > 1$, the standard deviation is $\sqrt{m_2/(n-1)}$.
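In code, a sketch of that incremental update (often attributed to Welford) looks like this:

```python
import math

class Num:
    """Incrementally track the mean and standard deviation of a column."""
    def __init__(self):
        self.n, self.mu, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        d = x - self.mu
        self.mu += d / self.n
        self.m2 += d * (x - self.mu)

    @property
    def sd(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

num = Num()
for x in [53, 7.25, 84, 7.9, 71.2, 8.05]:   # the fare column used later on this page
    num.add(x)
print(round(num.mu, 1), round(num.sd, 1))    # 38.6 35.2
```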
Another way to divide up numbers is to split column1 according to how much that split changes column2. In this approach, column2 is said to supervise the splitting of column1.
Traditionally, discretization is applied recursively just to divide a single column of numbers. Here, we note that if the discretizer is allowed to switch attributes at each level of dividing the data, then the "discretizer" becomes a tree learner. The lesson here is that by the time you have a discretizer working, you are more than halfway to having a fully fledged learner.
One discretization method, which we might call ARGBEST, finds the split that most selects for some desired value of the dependent variable. For example, suppose we want to find ways to maximize some numeric class variable $y$. Consider a split that divides a column into groups of $n_1, n_2, n_3$ rows with mean scores $\mu_1, \mu_2, \mu_3$. Let worst denote the group with the lowest mean score. The best split would then be the one with the most $y$ values greater than those seen in worst; that is, we would seek splits that maximize the count of dependent values that exceed everything in the worst group.
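A small sketch of one way to score candidate splits in that spirit (this exact scoring rule is an illustrative reading of the idea, not a definitive definition):

```python
def argbest_score(groups):
    """Score a candidate split: count the dependent values that beat everything
    in the worst (lowest mean) group. Higher scores suggest better splits."""
    means = [sum(g) / len(g) for g in groups]
    worst = groups[means.index(min(means))]
    top_of_worst = max(worst)
    return sum(1 for g in groups if g is not worst
                 for y in g if y > top_of_worst)

# two candidate ways of grouping the same dependent values
split_a = [[2, 3, 4], [9, 10], [11, 12]]
split_b = [[2, 9, 11], [3, 10, 4], [12]]
print(argbest_score(split_a), argbest_score(split_b))   # 4 2: split_a separates good from bad more cleanly
```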
Another method, which we might call ARGMIN, divides the data using the split that most reduces the variety of the dependent column. Next, we add sub-trees by recursing into each division; that is, one way to build a model is to just recursively apply discretization. To implement ARGMIN, we need some way to measure variety (since that is what we want to minimize). For numeric columns, a standard variety measure is the standard deviation; for symbolic columns, a standard measure is the entropy of the symbol frequencies, $e = -\sum_i p_i \log_2 p_i$.
This expression was named as follows. In 1949, Claude Shannon visited the mathematician John von Neumann, who asked him how he was getting on with his theory of missing information. Shannon replied that the theory was in excellent shape, except that he needed a good name for "missing information". "Why don't you call it entropy", von Neumann suggested. "In the first place, a mathematical development very much like yours already exists in Boltzmann's statistical mechanics, and in the second place, no one understands entropy very well, so in any discussion you will be in a position of advantage."
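For example, a minimal sketch of those two variety measures (the sample data is illustrative):

```python
import math
from collections import Counter

def stdev(nums):
    """Variety of a numeric column: the standard deviation."""
    mu = sum(nums) / len(nums)
    return math.sqrt(sum((x - mu) ** 2 for x in nums) / (len(nums) - 1))

def entropy(symbols):
    """Variety of a symbolic column: -sum(p * log2(p)) over the symbol frequencies."""
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())

print(round(stdev([53, 7.25, 84, 7.9, 71.2, 8.05]), 1))   # 35.2 (the baseline used below)
print(round(entropy("aaabbc"), 2))                        # 1.46
```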
For example, suppose we have data like this and we want to use supervised discretization to divide the age column (supervised by the fare column):
age  fare   E[V of y]
---  -----  ---------
35   53
22   7.25
40   84
26   7.9
38   71.2
35   8.05
First we compute the baseline variety score for fare. This column has a standard deviation of 35.2.
Second, we sort on age, then look at the effects on the variety of fare for different splits on age. If we split at age ≥ 38, this would produce two regions where the standard deviations of the fares are 22.6 and 9.1, respectively (with an expected value of $4/6 \times 22.6 + 2/6 \times 9.1 = 18.1$):
age  fare   standard deviation
---  -----  ------------------
22   7.25   22.6
26   7.9
35   8.05
35   53
---  -----  ------------------
38   71.2    9.1
40   84
Age ≥ 38 is a useful split since it reduces the overall variety from 35.2 to 18.1. But we can do better. If we recurse into the first split, then split again at age ≥ 35, we get three splits:
age  fare   standard deviation
---  -----  ------------------
22   7.25    0.5
26   7.9
---  -----  ------------------
35   8.05   31.8
35   53
---  -----  ------------------
38   71.2    9.1
40   84
which yields an expected value of $2/6 \times 0.5 + 2/6 \times 31.8 + 2/6 \times 9.1 = 13.8$. Once again, we see the split has successfully reduced the variety of the fare variable.
(Technical aside: in practice, we would not split down to just two rows per split. A more common stopping criterion is to never build splits that contain less than some minimum number of rows.)
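For example, here is a minimal sketch of that split search over the age/fare table above (the minimum split size of two rows is used purely to match this toy example):

```python
import math

def stdev(nums):
    mu = sum(nums) / len(nums)
    return math.sqrt(sum((x - mu) ** 2 for x in nums) / (len(nums) - 1))

def best_split(rows, min_size=2):
    """Try every legal split of column 1 (age); return the split value whose
    weighted (expected) standard deviation of column 2 (fare) is smallest."""
    rows = sorted(rows)                                  # sort on age
    n = len(rows)
    best, lowest = None, stdev([fare for _, fare in rows])   # baseline: 35.2
    for i in range(min_size, n - min_size + 1):
        if rows[i][0] == rows[i - 1][0]:                 # cannot split between equal ages
            continue
        left  = [fare for _, fare in rows[:i]]
        right = [fare for _, fare in rows[i:]]
        expected = len(left) / n * stdev(left) + len(right) / n * stdev(right)
        if expected < lowest:
            best, lowest = rows[i][0], expected
    return best, round(lowest, 1)

rows = [(35, 53), (22, 7.25), (40, 84), (26, 7.9), (38, 71.2), (35, 8.05)]
print(best_split(rows))   # (38, 18.1): split at age >= 38, as in the first table above
```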
For ARGMIN to work, it has to explore all possible splits of all numeric attributes; in practice, that search has to be sped up somehow.
We said above that once a discretizer works, we are more than halfway to having a learning system. Four illustrations of this are STAR, FFTtrees, CART, and C4.5, discussed in the next chapter.
Another application of "discretization" is actually a statistical clustering method called the Scott-Knott procedure. In this method, various treatments are applied to collect a bag of score values for each treatment. For example, a standard evaluation rig in data mining is an M-times-N cross-validation where:

- M times, shuffle the order of the rows (randomly);
- divide the shuffled data into N bins;
- for each bin in turn, train on the other N-1 bins and test on the held-out bin.
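A minimal sketch of such a rig (the toy data, learner, and scoring function here are placeholders, not part of the original text):

```python
import random

def m_by_n_scores(rows, learn, score, m=5, n=5):
    """Collect a bag of m*n scores for one treatment (one learner)."""
    scores = []
    for _ in range(m):
        random.shuffle(rows)                      # re-order the data each repeat
        bins = [rows[i::n] for i in range(n)]     # n roughly equal-sized bins
        for i, test in enumerate(bins):
            train = [row for j, b in enumerate(bins) if j != i for row in b]
            scores.append(score(learn(train), test))
    return scores

# toy usage: a "learner" that just memorizes the mean of the class column (row[-1])
rows  = [[x, x * 2] for x in range(30)]
learn = lambda train: sum(r[-1] for r in train) / len(train)
score = lambda model, test: sum(abs(model - r[-1]) for r in test) / len(test)
print(len(m_by_n_scores(rows, learn, score)))     # 25 scores = 5 repeats * 5 bins
```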
The Scott-Knott procedure "discretizes" these treatments into sets of similar values as follows. The thing to note here is that, with very few changes, this procedure is the same as the ARGMIN method discussed above.
Scott-Knott sorts a list of treatments (each with multiple measurements) by their median score. It then splits that list into sub-lists in order to maximize the expected value of the differences in the observed performances before and after the split. For example, we could sort methods based on their median score, then consider the different ways of dividing the sorted list into two sub-lists $m$ and $n$. Scott-Knott would declare one of these divisions to be best, as follows. For lists $l, m, n$ of size $ls, ms, ns$, where $l = m \cup n$, the "best" division maximizes $E(\Delta)$, the difference in the expected mean value before and after the split:

$$E(\Delta) = \frac{ms}{ls}(m.\mu - l.\mu)^2 + \frac{ns}{ls}(n.\mu - l.\mu)^2$$
Scott-Knott then checks if that best division is actually useful (using some statistical procedure). To implement that check, Scott-Knott applies some statistical hypothesis test $H$ to check if $m$ and $n$ are significantly different. If so, Scott-Knott then recurses on each half of the "best" division.
For example, consider treatments:
rx1 = [0.34, 0.49, 0.51, 0.6]
rx2 = [0.6, 0.7, 0.8, 0.9]
rx3 = [0.15, 0.25, 0.4, 0.35]
rx4 = [0.6, 0.7, 0.8, 0.9]
rx5 = [0.1, 0.2, 0.3, 0.4]
After sorting and splitting, Scott-Knott assigns each treatment a rank; treatments that are statistically indistinguishable share the same rank.
Note that Scott-Knott found little difference between rx5 and rx3. Hence, they have the same rank, even though their medians differ. This is to say that there is no real difference in the performance of rx3 and rx5 (and, indeed, rx2 and rx4). This is the essence of discretization: ignoring trivial differences and grouping together similar values into a few, easy-to-browse sets.
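To make the procedure concrete, here is a minimal sketch (the `different` check below is a crude stand-in for the statistical hypothesis test $H$ mentioned above, and its 0.15 threshold is an illustrative assumption):

```python
import statistics

def scott_knott(treatments, different):
    """Sort treatments by median, recursively split where E(delta) is maximized;
    treatments left together in the same leaf share a rank."""
    items = sorted(treatments.items(), key=lambda kv: statistics.median(kv[1]))
    ranks, rank = {}, [0]

    def flat(items):                      # all scores in a list of (name, scores) pairs
        return [x for _, scores in items for x in scores]

    def recurse(items):
        scores = flat(items)
        mu, n = statistics.mean(scores), len(scores)
        cut, best = None, 0
        for i in range(1, len(items)):    # find the split that maximizes E(delta)
            left, right = flat(items[:i]), flat(items[i:])
            e = (len(left) / n * (statistics.mean(left) - mu) ** 2 +
                 len(right) / n * (statistics.mean(right) - mu) ** 2)
            if e > best:
                cut, best = i, e
        if cut and different(flat(items[:cut]), flat(items[cut:])):
            recurse(items[:cut])
            recurse(items[cut:])
        else:                             # no useful split: everyone here shares a rank
            rank[0] += 1
            for name, _ in items:
                ranks[name] = rank[0]

    recurse(items)
    return ranks

# crude stand-in for a real hypothesis test: "different" if the means are far apart
different = lambda a, b: abs(statistics.mean(a) - statistics.mean(b)) > 0.15

rx = {"rx1": [0.34, 0.49, 0.51, 0.6], "rx2": [0.6, 0.7, 0.8, 0.9],
      "rx3": [0.15, 0.25, 0.4, 0.35], "rx4": [0.6, 0.7, 0.8, 0.9],
      "rx5": [0.1, 0.2, 0.3, 0.4]}
print(scott_knott(rx, different))   # {'rx5': 1, 'rx3': 1, 'rx1': 2, 'rx2': 3, 'rx4': 3}
```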
In statistics, the median, mean, and mode of a distribution are its central, average, and most frequent values, respectively. The mean of $n$ numbers $n_i$ is $(\sum_i n_i)/n$. To find the median, sort the numbers, then find the middle value (or, if the list has an even number of items, report the average of the two middle values). When finding the median, if sorting all the numbers is inconvenient, just keep a small random sample of the numbers. ↩
Warning: many distributions do not conform to this shape. ↩
To incrementally compute the mean and standard deviation, see the update rule given in the text above. ↩
According to Wikiquotes ↩
E.g., consider a data set from a hospital where all the female patients are listed last. In such a data set, attempting to learn risk factors for pregnancy using the first part of the data will fail (since that part contains no female examples). ↩
The standard values are M=N=10, but the empirical evidence for the need for so many cross-validations is not strong. Using M=N=5 reduces the overall number of trainings by a factor of four (from 10*10=100 to 5*5=25) while still returning a large enough sample to be statistically interesting. ↩
Statistics are discussed later in this book. ↩