
Simpler AdaBoost -1

이현봉 2013. 2. 2. 16:49

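* Looking around, I found that there have been attempts with R in Korea as well, and there are many good posts (Reference-1, Reference-2, Reference-3). I am a beginner with R and have little to add, but I am leaving my small stumbles here in the hope that they help beginners like me. There are many good texts that explain and analyze AdaBoost; I want to take the one I felt was explained most easily and make it easier still. Explaining things simply is also a valuable skill, and I feel it is one I need to learn. Writing technical text in English is still more convenient for me, so let me write in English for now.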

<Preamble>

This note is a modest effort to provide a gentler introduction to the AdaBoost algorithm than the already very accessible note by Raúl Rojas. As such, Raúl’s original text will be used in its entirety, and any further explanation or translation (Korean) will be set in this Palatino typeface.

Let’s say we have the following data (abridged from weather.csv in the R rattle package):

 

Perhaps a brief introduction to machine learning would be useful here. The table shows 11 meteorological measurements for each of 20 days, along with whether it rained the following day. The first row lists the names of the fields (also called columns, attributes, features, or variables) for the measurements taken. There are 12 fields, from “MinTemp” to “RainTomorrow”. The first 11 fields, from “MinTemp” to “Temp9am”, hold the measurements taken on a given day, while “RainTomorrow” tells whether rain fell the following day. Note that the table is based on historical data, so we know whether rain fell the following day and can record that in the row for the day before.

Rows 2 through 21 list the 20 actual records of observations, one observation per row. The first column [1, … 21] simply identifies/indexes the observation/record number; it carries no information.

The table above is also called a “dataset”, and each observation is also called a “datapoint”. Each datapoint has 12 measurements/values, one for each of the 12 fields/variables, so we say that an observation has a dimension of 12. Each observation occupies a row of 12 fields, so an observation can be represented as a row vector with 12 values. Similarly, since each field/variable holds its measurement for every observation, each field can be represented as a column vector of length 20. When a measurement is not available, we usually leave the cell empty or set it to NA (Not Available).

We say that the dataset above has a dimension of [20, 12], meaning it is composed of 20 rows/observations, each with 12 columns/fields. The index column and the row of variable names are not counted. Again, a row can have missing field values. A dataset can be built either by stacking observation vectors or by listing variable/field vectors.
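To make the row/column picture concrete, here is a minimal Python sketch. It is only an illustration: the values are made up (they are not taken from weather.csv), only three of the twelve field names are used, and `None` stands in for NA.

```python
# Hypothetical miniature dataset: 4 observations x 3 fields.
# The values are invented for illustration and are NOT from weather.csv.
field_names = ["MinTemp", "Temp9am", "RainTomorrow"]
rows = [
    [8.0, 14.4, "Yes"],
    [14.0, None, "Yes"],   # None plays the role of NA (value not available)
    [13.7, 15.5, "No"],
    [9.2, 12.1, "No"],
]

# Dimension is [observations, fields]; the header row is not counted.
dimension = [len(rows), len(field_names)]

# An observation is a row vector.
second_observation = rows[1]

# A field is a column vector: one value per observation.
columns = {name: [row[i] for row in rows] for i, name in enumerate(field_names)}

# Listing the field vectors and stacking the observation vectors
# describe the same dataset, so we can rebuild one from the other.
rebuilt_rows = [list(values) for values in zip(*columns.values())]

print(dimension)             # [4, 3]
print(columns["Temp9am"])    # [14.4, None, 15.5, 12.1]
print(rebuilt_rows == rows)  # True
```

The same idea scales to the [20, 12] table: 20 row vectors of length 12, or equivalently 12 column vectors of length 20.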

We learn in order to use the acquired knowledge, experience, or insight in the future. It’s the same with machine learning. A machine learning model/algorithm eats data and hopefully comes up with some knowledge or “intelligence” that is useful. Often, learning algorithms try to extract the patterns underlying the data. Humans may be good at finding visual patterns, but when it comes to finding patterns embedded in millions of observations and hundreds of variables, we are no match for computers.


Let’s view the earlier dataset in the context of learning. What we want is to extract the pattern of “raining tomorrow” based on measurements taken today. If we get that pattern, it will be useful. At the very least, we could use it to decide whether or not to go on a picnic tomorrow. Each observation records, for one day’s 11 measurement values from “MinTemp” to “Temp9am”, whether rain came the following day, as noted in “RainTomorrow”. So our learning task can be phrased as follows: learn from the 20 example observations, each with a day’s 11 measurements and the outcome “RainTomorrow”, and give us a learned model that predicts (satisfactorily) whether or not there will be rain tomorrow based on today’s 11 input values.

It’s as if we give a naïve learning model example conditions and answers, and ask it to learn from them. We train the model. When the model appears to have learned, we give it a test set to see whether it has indeed learned. It is the underlying learning algorithm that drives the model as it consumes the examples and gradually builds a specific model for the task. Much of the eventual success or failure of constructing a learning model depends on the learning algorithm we choose. This form of learning is called supervised learning because the model has a supervisor/teacher that gives it both the conditions and the answers, which the learning algorithm uses as it builds up the model.
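The train-then-test routine described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the single input value per example, the made-up numbers, and the very simple learning algorithm (pick the one threshold that makes the fewest mistakes on the training examples). It is not the algorithm this note builds toward, just the smallest learner I could think of.

```python
# Hypothetical examples: (input value, answer). 1 = rain next day, 0 = no rain.
train = [(35, 0), (42, 0), (50, 0), (61, 1), (70, 1), (88, 1)]
test = [(40, 0), (58, 1), (75, 1), (65, 0)]


def learn_threshold(examples):
    """Learning algorithm: choose the threshold t for which the rule
    'predict 1 if x > t' makes the fewest errors on the examples."""
    def errors(t):
        return sum((x > t) != bool(y) for x, y in examples)

    candidates = sorted(x for x, _ in examples)
    return min(candidates, key=errors)


def predict(threshold, x):
    """The learned model: a single threshold rule."""
    return 1 if x > threshold else 0


def accuracy(threshold, examples):
    correct = sum(predict(threshold, x) == y for x, y in examples)
    return correct / len(examples)


threshold = learn_threshold(train)            # training: conditions and answers given
train_accuracy = accuracy(threshold, train)
test_accuracy = accuracy(threshold, test)     # testing: examples the model has not seen

print(threshold)        # 50
print(train_accuracy)   # 1.0
print(test_accuracy)    # 0.75
```

Note that the model is perfect on the examples it was trained on but misses one of the four test examples, which is exactly why a separate test set is needed to see whether it has really learned.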

In the earlier dataset, we noted 12 variables. The values of the first 11 variables act as the model’s training “input” examples, and the value of “RainTomorrow” acts as the desired target “output” example. So variables are typically further classified into “input” variables and “output” variables. Output variables are also called target, outcome, or supervisory variables. A dataset typically has many input variables but only one or a few output variables. Thus the input variables are almost always represented as a vector, and if there is only one output variable, each observation becomes a pair consisting of an input vector and an output value.
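As a small sketch of this split, the Python below turns each observation into an (input vector, output value) pair. The values are again made up for illustration, and only three of the twelve fields are shown.

```python
# Hypothetical observations (values invented for illustration).
field_names = ["MinTemp", "Temp9am", "RainTomorrow"]
rows = [
    [8.0, 14.4, "Yes"],
    [13.7, 15.5, "No"],
    [9.2, 12.1, "No"],
]

target_name = "RainTomorrow"                  # the single output variable
target_index = field_names.index(target_name)

# Every field except the target is an input variable.
input_names = [name for name in field_names if name != target_name]

# Each observation becomes a pair: (input vector, output value).
pairs = [
    ([value for i, value in enumerate(row) if i != target_index], row[target_index])
    for row in rows
]

print(input_names)   # ['MinTemp', 'Temp9am']
print(pairs[0])      # ([8.0, 14.4], 'Yes')
```

With the full table, each input vector would have 11 values and there would be 20 such pairs.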

The input vector is usually denoted x, and the output variable y. We got these from our old junior high days of “y = ax + b”. If we suppose that a model has the form “y = ax + b”, then the learning algorithm has to determine ‘a’ and ‘b’. We plug input values into x and the model spits out a y value. Again, during training both the x and y values are given to the model. The real capability of a model becomes known only after it is deployed and its output y is assessed against real-world operational values of x.
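To see what “the learning algorithm has to determine ‘a’ and ‘b’” can mean in practice, here is a minimal Python sketch. It assumes a scalar x rather than a vector, uses ordinary least squares as the learning algorithm (one common choice; this note does not prescribe it), and the training values are made up so that they lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b for scalar x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Training: both x and y values are given to the learning algorithm.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # hypothetical data generated by y = 2x + 1

a, b = fit_line(xs, ys)

# Using the learned model: plug in a new x, get y out.
new_x = 4.0
predicted_y = a * new_x + b

print(a, b)          # 2.0 1.0
print(predicted_y)   # 9.0
```

Once ‘a’ and ‘b’ are fixed, the model no longer needs the training data; it only needs a new x.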