(Don't read this page. It is a work in progress for a Fall'19 graduate automated SE subject at NC State. Come back in mid-October!)

About Learners


Learners divide old data into (say):

Then, when new data arrives, we can:

For now, we focus on clustering, data mining, and regression. Later we will talk about optimization (but as we shall see, data mining and optimization are connected in that either one can help the other).

The goal of this chapter is to offer some core intuitions about, and common features of, learners. It is not a comprehensive guide (for that, see [the excellent Ian Witten book on data mining](REFS#witten-2016)) but it does introduce much of the learning technology used later in this book.

But which learner is best?

Data Miners work on Data

  1. Tables have rows and columns of data.
    • When data does not come in a nice neat table1, then the first task of the data scientist is to wrangle the data into such a format.
    • Later in this book, we devote three chapters to that wrangling (when we talk about data collection, data cleaning, and data labelling).
  2. Columns are also sometimes called features, attributes, or variables.
  3. Columns can be numeric or symbolic.
    • Numbers can be added together.
    • Symbols can only be compared using equals and not equals.
  4. Some columns might be goals (things we want to minimize or maximize) or classes (which we want to predict).
    • Goals and classes are sometimes called the dependent variables.
    • Goals are always numeric columns and the others can be symbolic or numeric.
    • Classifiers predict for symbolic goals (e.g. car=ford or disease=measles).
    • Regression algorithms predict for numeric goals (e.g. the time to failure of a hard drive).
    • Columns that are not dependent are independent.

Operations on data

Discretization

Columns of numbers can be divided into a few ranges using discretization. For example, here is a numeric range divided into 12 bins:

Discretization is usually applied just to one column, or a pair of columns.

  • In supervised discretization, we break up one column according to how much those breaks divide some goal column.
  • For example, consider a table of data about every Australian born since 1788. If that table has independent and dependent columns age and alive?, then it is tempting to split age at (say) 120 since, above that point, everyone is alive?=no.
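To make that concrete, here is a minimal Python sketch of simple, unsupervised, equal-width discretization (the function names are illustrative assumptions; the 12 bins echo the example above):

```python
def discretize(column, bins=12):
    """Divide a list of numbers into equal-width bins.
    Returns the low edge of each bin."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / bins or 1          # avoid zero-width bins
    return [lo + i * width for i in range(bins)]

def bin_of(x, edges):
    "Return the index of the bin that holds x."
    for i, edge in enumerate(edges):
        if x < edge:
            return max(0, i - 1)
    return len(edges) - 1

ages = [1, 5, 12, 18, 25, 33, 47, 60, 71, 85, 99, 120]
edges = discretize(ages)
print([bin_of(a, edges) for a in ages])
```

Supervised discretization works the same way, except that the break points are chosen to best separate the values of some goal column (e.g. alive?=yes versus alive?=no), rather than being spaced evenly.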

Clustering

Clusters divide the table into groups of similar rows (and each group is a cluster). Usually, clustering ignores the goal/class columns (but it is possible and sometimes useful to cluster on the goals/class).

If a new row is not similar to any existing cluster, then it is an anomaly.

Note that recursive clustering lets us divide large problems into smaller ones.

If we only build models for the smaller clusters, then model construction is much faster.

Model Building

There are many ways to build models, most of which involve dividing the data.

Different learners divide the data in different ways. For example:

Uncertainty

The less cleanly things divide, or the more that fall into the gaps between the divisions, the more we doubt those conclusions. For example:

The other place we need to doubt conclusions is when they fall outside of everything that has been seen before.

Operations and Ethics

Explanation and transparency:

Privacy:

Inclusiveness:

Reliability

Effectiveness



Columns and Rows

Learners process examples (also called rows) of the form $y = f(x_1, x_2, \ldots, x_n)$, where the $x_i$ are input variables and $y$ is an output.

Note that variables are also called features, attributes, or columns. The outputs can be numeric goals (predicted by regression algorithms) or symbolic classes (predicted by classifiers).

A set of rows is sometimes called a table, or a relation. In such sets, we often keep statistics on each column. If n,s are instances of the classes Num,Sym then n+x, s+x are methods that add x to the summary. For example:

n = Num()
for x in [1,2,3,4]: n+x
print(n, n.sd)

==> {'n': 4, 'mu': 2.5, 'lo': 1, 'hi': 4}
==> sd = 1.29

#-----------
s = Sym()
for x in 'abbcccc': s+x
print(s,s.mode, s.ent)

==> Sym{bag={'a': 1, 'b': 2, 'c': 4}, n=7} c 1.38
==> mode = 'c'
==> ent = 1.38

In the Numeric summary, we see a count n of how many numbers were seen, their mean mu, their minimum and maximum values (denoted lo,hi), and their standard deviation sd.

In the Symbolic summary, we see a count n of how many symbols were seen, their most common value (denoted mode), and their entropy ent.
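The Num and Sym classes are not defined in this chapter, so here is a minimal sketch that reproduces the numbers above. It assumes Welford's online method for the standard deviation and base-2 Shannon entropy; it is an illustration, not necessarily this book's actual implementation:

```python
import math

class Num:
    "Summarize a stream of numbers (count, mean, lo, hi, standard deviation)."
    def __init__(self):
        self.n, self.mu, self.m2 = 0, 0, 0
        self.lo, self.hi = float('inf'), float('-inf')
    def __add__(self, x):
        # Welford's online update of the mean and the sum of squared deviations
        self.n += 1
        d = x - self.mu
        self.mu += d / self.n
        self.m2 += d * (x - self.mu)
        self.lo, self.hi = min(self.lo, x), max(self.hi, x)
        return self
    @property
    def sd(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0
    def __repr__(self):
        return str(dict(n=self.n, mu=self.mu, lo=self.lo, hi=self.hi))

class Sym:
    "Summarize a stream of symbols (counts, mode, entropy)."
    def __init__(self):
        self.n, self.bag = 0, {}
    def __add__(self, x):
        self.n += 1
        self.bag[x] = self.bag.get(x, 0) + 1
        return self
    @property
    def mode(self):
        return max(self.bag, key=self.bag.get)
    @property
    def ent(self):
        # Shannon entropy, in bits
        return -sum(v / self.n * math.log(v / self.n, 2) for v in self.bag.values())
    def __repr__(self):
        return f"Sym{{bag={self.bag}, n={self.n}}}"
```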

Supervised learners learn a model of the form $y = f(x_1, x_2, \ldots, x_n)$. Unsupervised learners ignore the dependent variable and group together similar rows, based on their independent values. For that grouping, some distance function is required. A standard distance function between two rows $R_1,R_2$ with features $f$3 is the Euclidean distance

$$dist(R_1,R_2) = \sqrt{\sum_i diff(x_i,y_i)^2}$$

where:

  • $x_i$ and $y_i$ are the values of feature $i$ in rows $R_1$ and $R_2$;
  • $diff$ is a per-column difference function, defined below for numeric and symbolic columns.

In the dist equation, the diff function handles both symbolic and numeric columns. For numbers, a usual diff has the range $[0,1]$ and is calculated as follows:

$$diff(x_i,y_i) = |norm(x_i) - norm(y_i)|$$

where $norm(z) = (z - lo)/(hi - lo + \epsilon)$ is a function that normalises $z$ to the range $[0,1]$ using $lo,hi$ (i.e. using the smallest and largest value seen in column $i$) and $\epsilon$ is a small constant added to the denominator to avoid divide-by-zero errors.

For symbols, a usual diff is

$$diff(x_i,y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases}$$

If either of $x_i,y_i$ are unknown values, then diff returns the maximal possible difference.
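Here is a minimal Python sketch of that distance function. The column summaries passed as (lo, hi) pairs, the use of '?' for unknown values, and the epsilon default are assumptions for illustration:

```python
def norm(z, lo, hi, eps=1e-32):
    "Map a number onto [0,1] using the column's observed lo and hi."
    return (z - lo) / (hi - lo + eps)

def diff(x, y, lo=None, hi=None):
    "Difference between two cells; '?' marks an unknown value."
    if x == '?' or y == '?':
        return 1                                  # maximal possible difference
    if lo is not None:                            # numeric column
        return abs(norm(x, lo, hi) - norm(y, lo, hi))
    return 0 if x == y else 1                     # symbolic column

def dist(row1, row2, cols):
    """Euclidean distance over the per-column diffs.
    cols holds a (lo, hi) pair for each numeric column, or None for symbols."""
    total = 0
    for x, y, c in zip(row1, row2, cols):
        d = diff(x, y, *c) if c else diff(x, y)
        total += d * d
    return total ** 0.5

# tiny example: columns are (age, sex), where age ranges over [1, 120]
cols = [(1, 120), None]
print(dist([20, 'f'], [30, 'f'], cols))   # ==> about 0.08
print(dist([20, 'f'], ['?', 'm'], cols))  # ==> about 1.41
```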

Clustering (Unsupervised Learning)

The simplest (and sometimes slowest) learner is a clusterer. Clustering ignores any class or goal variables and groups together similar rows using some distance function. Clustering can be slow since, if implemented naively, it requires multiple comparisons between all pairs of rows. The famous K-means algorithm reduces that cost as follows:

  1. Declare that $k$ rows (picked at random) are the centroids;
  2. Mark each example with the id of its nearest centroid;
  3. Find the central point of all the rows marked with the same centroid;
  4. Declare those new central points to be the new centroids;
  5. Goto 2.
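To make steps 1 to 5 concrete, here is a minimal K-means sketch over purely numeric rows. The plain (un-normalised) Euclidean distance and the fixed round limit are simplifying assumptions:

```python
import random

def kmeans(rows, k, rounds=20):
    "Cluster numeric rows into k groups (steps 1 to 5 above)."
    centroids = random.sample(rows, k)                  # step 1: pick k rows at random
    groups = {}
    for _ in range(rounds):
        groups = {i: [] for i in range(k)}
        for row in rows:                                # step 2: mark rows by nearest centroid
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(row, centroids[i])))
            groups[nearest].append(row)
        new = [tuple(sum(col) / len(col) for col in zip(*g)) if g else centroids[i]
               for i, g in groups.items()]              # steps 3-4: find the new central points
        if new == centroids:                            # stop when the centroids stop moving
            break
        centroids = new                                 # step 5: repeat from step 2
    return centroids, groups

rows = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
print(kmeans(rows, k=2)[0])
```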

K-means terminates when the next set of centroids are similar to the last ones. There are several heuristics for determining a good value for $k$:

Once you know your $k$, you've got to select $k$ initial centroids. One way is to do it randomly (see the mini-batch k-means algorithm shown below). Another way is to use k-means++, which always tries to select centroids that are far away from the existing centroids:

For each row $x$, let $D(x)$ be the distance from $x$ to its nearest existing centroid. Pick a random number $r$ between $0$ and $\sum_x D(x)^2$ and run over the sorted $X$. At each step, set $r = r - D(x)^2$. When $r \le 0$, return $x$ as the next centroid.
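Here is a small sketch of that k-means++ selection rule; the helper names (d2, pick_next) are just illustrative:

```python
import random

def d2(row, centroids):
    "Squared distance from row to its nearest existing centroid."
    return min(sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids)

def pick_next(rows, centroids):
    "k-means++ step: pick a row with probability proportional to d2(row)."
    weights = [d2(row, centroids) for row in rows]
    r = random.uniform(0, sum(weights))
    for row, w in zip(rows, weights):
        r -= w
        if r <= 0:
            return row
    return rows[-1]          # guard against floating-point round-off

def kmeans_pp_seeds(rows, k):
    "Pick k initial centroids: one at random, the rest far from the others."
    centroids = [random.choice(rows)]
    while len(centroids) < k:
        centroids.append(pick_next(rows, centroids))
    return centroids

rows = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
print(kmeans_pp_seeds(rows, k=2))
```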

It is insightful to think of K-means as an example of an expectation-minimization algorithm. Such algorithms make guesses, then change something to minimize the errors associated with those guesses (in this case, move the centroids).

Mini-batch K-means is a more memory-efficient version of K-means. This variant never loads all the examples into main memory. Instead, it only loads batches of rows of some fixed size. The first $k$ rows (picked at random) become the first centroids. Rows arriving after that (in the same batch) add themselves to the cluster associated with the nearest centroid. When the batch is done, each cluster adjusts its centroid by reflecting over each row in the cluster (using the update rule shown below).

The next batch of rows is read and the process repeats. Note that for this to work, we must have a way of moving some cluster's centroid $c$ towards a row $x$. When processing the $n$-th item in a cluster, a standard update is

$$c \leftarrow c + \frac{x - c}{n}$$

i.e. the centroid is the running mean of the rows seen so far in that cluster.
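A sketch of that incremental centroid update (the Cluster class and its fields are assumptions for illustration):

```python
class Cluster:
    "A centroid plus a count of the rows it has absorbed."
    def __init__(self, centroid):
        self.centroid = list(centroid)
        self.n = 0
    def add(self, row):
        "Move the centroid towards row by 1/n of the gap (a running mean)."
        self.n += 1
        for i, x in enumerate(row):
            self.centroid[i] += (x - self.centroid[i]) / self.n

c = Cluster([0, 0])
for row in [(2, 2), (4, 4), (6, 6)]:
    c.add(row)
print(c.centroid)   # ==> [4.0, 4.0], the mean of the rows seen so far
```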

An even more efficient clustering algorithm uses random projections. The above clustering algorithms are very sensitive to quirks in the distance function. One way to mitigate those quirks is to sort the rows using multiple, randomly generated distance measures. Sure, a few may be less-than-perfect, but the more often a random projection says two rows are similar, the more likely they actually are similar.

Here is the RP0 random projection clustering (there are many others, but this one is pretty simple):

  1. Let $M$ be some small cluster size (say, a few dozen rows).
  2. Read the data. While you have less than $2M$ rows, add the rows into a local data cache.
  3. If you have more than $2M$ rows, then divide:
    • A few times: pick any two poles (i.e. two rows picked at random from the cache) and find the distance between them.
    • Pick the pair of poles with maximum distance.
    • Divide the cached rows according to whether they are closer to one pole or the other.
    • For each division, goto 2.

Note that RP0 is like a faster version of the recursive $k$-means clustering described above.
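Here is a minimal recursive sketch of RP0, under the simplifying assumptions that rows are numeric tuples and that plain Euclidean distance is good enough:

```python
import random

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def rp0(rows, m=16, tries=10):
    "Recursively split rows in two, using distant 'poles', until clusters are small."
    if len(rows) <= 2 * m:                           # small enough: this is one cluster
        return [rows]
    best = None
    for _ in range(tries):                           # keep the most distant of a few random pole pairs
        a, b = random.sample(rows, 2)
        d = euclid(a, b)
        if best is None or d > best[0]:
            best = (d, a, b)
    _, a, b = best
    left  = [r for r in rows if euclid(r, a) <= euclid(r, b)]
    right = [r for r in rows if euclid(r, a) >  euclid(r, b)]
    if not left or not right:                        # degenerate split: stop here
        return [rows]
    return rp0(left, m, tries) + rp0(right, m, tries)

rows = [(random.random(), random.random()) for _ in range(200)]
print([len(c) for c in rp0(rows)])
```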

There are many interesting ways to modify RP0:

From Unsupervised to Supervised Learning

Clustering is a knowledge acquisition technique. In short, rather than asking everyone about everything, first cluster the data then only check a sample of rows from each cluster.

For example, clustering is one way to help active learning. Often we have more rows than labels (a.k.a. classes) for that data (that is, most of our data is suitable only for unsupervised learning). This is a common situation. For example:


For another example:

Given some oracle that can label data (e.g. a human being), and assuming that oracle takes time or costs money every time you ask it a question, then active learning is the process of learning an adequate model while asking the oracle for labels using minimum time and/or cost.

(Aside: Another way to handle label-shortage is semi-supervised learning, which finds things similar to the labelled examples, then propagates the labels to the unlabelled space. For example, if you cluster data, then (a) each cluster will have a few labels and (b) the unlabelled examples in each cluster can be given that cluster's majority label. Of course, you still have to check if the resulting new labels are correct. If you want to do that check efficiently (i.e. not check everything) then you'll need some smart way to check a sub-sample of the new labels. At which point you are back to active learning, but perhaps with fewer overall queries to the oracle.)
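A small sketch of that cluster-then-propagate idea, assuming we already have the rows of one cluster and that None marks an unlabelled row:

```python
from collections import Counter

def propagate(cluster_labels):
    """cluster_labels: the labels of one cluster's rows (None = unlabelled).
    Give every unlabelled row that cluster's majority known label."""
    known = [l for l in cluster_labels if l is not None]
    majority = Counter(known).most_common(1)[0][0] if known else None
    return [l if l is not None else majority for l in cluster_labels]

print(propagate(['yes', None, None, 'yes', 'no', None]))
# ==> ['yes', 'yes', 'yes', 'yes', 'no', 'yes']
```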

An acquisition function is one way to guide an active learner: it scores each unlabelled row by how much we expect to gain by asking the oracle about that row, and the learner asks about the highest-scoring rows first.
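For example, one simple and widely used acquisition function is uncertainty sampling: ask the oracle about the unlabelled row the current model is least sure about. A sketch follows; the predict_proba interface is an assumption (in the style of scikit-learn classifiers), not something defined in this chapter:

```python
def most_uncertain(model, unlabelled_rows):
    """Uncertainty sampling: return the row whose predicted class
    probability is closest to a coin toss (i.e. the model is least sure)."""
    def certainty(row):
        return max(model.predict_proba([row])[0])   # probability of the most likely class
    return min(unlabelled_rows, key=certainty)

# usage sketch: label that row, add it to the training data, retrain, repeat
# row = most_uncertain(model, pool)
# y   = oracle(row)                # e.g. ask a human
```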

Here are some important details about the above process:

Supervised Learning

Don't train on all the examples; use some subset (selected at random, or selected from a cluster). Alternatively, move the learning into the clusterer; i.e. incrementally do top-down recursive clustering and build one model per leaf cluster.

Some maths-based methods are highly optimized. For example, an SVM with a linear kernel runs very fast over a large text corpus.
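For instance, here is a sketch using scikit-learn (an assumption; this chapter does not prescribe any particular library) of a linear-kernel SVM over a tiny text corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs   = ["the build failed again", "all tests passed",
          "compile error in module", "release shipped on time"]
labels = ["bad", "good", "bad", "good"]

# TF-IDF features plus a linear SVM: both steps scale well to large corpora
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["tests failed"]))
```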

XXX active: and suppose there are

Divide the data, at random, into 10 buckets.
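A quick sketch of that random division into 10 buckets (as done, for example, at the start of a 10-fold cross-validation):

```python
import random

def buckets(rows, n=10):
    "Shuffle the rows, then deal them into n buckets of (nearly) equal size."
    rows = rows[:]
    random.shuffle(rows)
    return [rows[i::n] for i in range(n)]

for b in buckets(list(range(23))):
    print(len(b), b)
```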

Since the learner chooses the examples, the number of examples needed to learn a concept can often be much lower than the number required in normal supervised learning.

Of all the above algorithms, standard K-means might produce the best clusters (since it is always reasoning over all the data). On the other hand, the other variants might scale to larger data sets.

There are many other ways to cluster data. For a more complete survey, see any textbook on data mining or some of the excellent on-line tutorials.

Naive Bayes: group the rows by their class, then look for differences between those groups.

XXX decision trees: find a cut on the data that best separates the classes, make a decision at that cut, then recurse through the rest.

neural nets

Divide numbers into ranges (discretization)

Quiz


  1. If you don't normalize, what happens?
  2. Given the standard deviation formula, derive sd.
  3. Given the entropy formula, derive e.
  4. Compute the distance between two rows.
  5. Derive the cosine rule.

  1. Which is, in fact, the usual case. 

  2. Pollution marking is adding a flag to some regions of a model saying "keep away! Don't use me!". In a hierarchical clustering of data, we can decide about new examples by descending the hierarchy, looking for the smallest cluster nearest the example that is not marked as polluted. 

  3. Sometimes $f$ can be the dependent variables. In two-tiered clustering you might cluster first by the dependent variables, and then by the independent variables. Since the number of dependents is typically much less than the independents, this can run very fast. 

  4. Some might comment that this four weeks could be done in an hour, via crowdsourcing. That's true BUT consider: when commissioning a crowdsourcing rig, you have to first certify the efficacy of that rig. That means you need some "ground truth" to check the conclusions. So now you are back to labelling some significant percentage of the data. 


© 2019 timm + zimm