Hyun Bong Lee
October 5, 2015
LASSO and LAR examples with the LARS package, by HBLEE
- Focuses on the LASSO (L1 norm: sum of absolute values; penalty/constraint-based variable selection), one of the shrinkage methods for regression covered in HTF's "The Elements of Statistical Learning".
- Ridge regression is another regularized regression, but the LASSO seemed the more useful of the two.
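Since everything below hinges on that L1 constraint, it is worth writing out. In ESL's notation the lasso solves

```latex
\hat{\beta}^{\text{lasso}}
  = \underset{\beta}{\operatorname{arg\,min}}
    \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
  \quad \text{subject to} \quad
  \sum_{j=1}^{p}\lvert\beta_j\rvert \le t ,
```

or equivalently the penalized form with a Lagrange multiplier on the L1 term. The budget t is what the 's' parameter of the lars package indexes (as a fraction of the unconstrained L1 norm).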
References:
- CRAN LARS package manual: https://cran.r-project.org/web/packages/lars/lars.pdf
- LAR paper: http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf
Data: the 'diabetes' data bundled with the LARS package, also used in the LAR paper.
*Dusting off an old exercise and posting it.
library(lars)
## Warning: package 'lars' was built under R version 3.2.2
## Loaded lars 1.2
data(diabetes)
str(diabetes)
## 'data.frame': 442 obs. of 3 variables:
## $ x : AsIs [1:442, 1:10] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr "age" "sex" "bmi" "map" ...
## $ y : num 151 75 141 206 135 97 138 63 110 310 ...
## $ x2: AsIs [1:442, 1:64] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
## ..- attr(*, ".Names")= chr "age" "age" "age" "age" ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr "1" "2" "3" "4" ...
## .. ..$ : chr "age" "sex" "bmi" "map" ...
Note the structure of "diabetes": the predictors x and x2 are matrices.
That is, the input variables must come as a matrix. Either x or x2 can be used as the predictor; x2 is a superset of x containing the squares and interactions of x's 10 variables.
The response is the 'y' variable.
Important: x and x2 are declared 'AsIs'. Be careful when converting them with data.frame() or using them inside a formula. See ?AsIs.
What the data mean:
Predictors (x): 10 diabetes-related measurements collected from each of 442 patients.
Response (y): each patient's disease status one year after the x measurements.
Model: regression-type supervised learning; use x to predict the patients' prognosis one year out.
Let's take a look at each variable, x first.
str(diabetes$x) # named numeric matrix of 442x10. x is a matrix. No categorical data
## AsIs [1:442, 1:10] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:10] "age" "sex" "bmi" "map" ...
colnames(diabetes$x) # "age" "sex" "bmi" "map" "tc" "ldl" "hdl" "tch" "ltg" "glu"
## [1] "age" "sex" "bmi" "map" "tc" "ldl" "hdl" "tch" "ltg" "glu"
sum(is.na(diabetes$x)) # No missing values
## [1] 0
diabetes$x[1:7,]
## age sex bmi map tc
## [1,] 0.038075906 0.05068012 0.06169621 0.021872355 -0.044223498
## [2,] -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## [3,] 0.085298906 0.05068012 0.04445121 -0.005670611 -0.045599451
## [4,] -0.089062939 -0.04464164 -0.01159501 -0.036656447 0.012190569
## [5,] 0.005383060 -0.04464164 -0.03638469 0.021872355 0.003934852
## [6,] -0.092695478 -0.04464164 -0.04069594 -0.019442093 -0.068990650
## [7,] -0.045472478 0.05068012 -0.04716281 -0.015999223 -0.040095640
## ldl hdl tch ltg glu
## [1,] -0.03482076 -0.043400846 -0.002592262 0.019908421 -0.017646125
## [2,] -0.01916334 0.074411564 -0.039493383 -0.068329744 -0.092204050
## [3,] -0.03419447 -0.032355932 -0.002592262 0.002863771 -0.025930339
## [4,] 0.02499059 -0.036037570 0.034308859 0.022692023 -0.009361911
## [5,] 0.01559614 0.008142084 -0.002592262 -0.031991445 -0.046640874
## [6,] -0.07928784 0.041276824 -0.076394504 -0.041180385 -0.096346157
## [7,] -0.02480001 0.000778808 -0.039493383 -0.062912950 -0.038356660
# These values differ from Table 1 of the LARS paper. So this diabetes data has been preprocessed.
unique(diabetes$x[,2]) # just two values. So one must be male, the other female.
## [1] 0.05068012 -0.04464164
??? But sex, at least, should be categorical, while everything in x is numeric. So sex has presumably already been recoded as an indicator variable; the odd part is that the values are not [0|1].
Without the original data it is hard to tell what preprocessing was applied. But since the lasso is sensitive to the scale of the predictors, each variable may well have been standardized in advance.
colMeans(diabetes$x) # OK. Means are zero
## age sex bmi map tc
## -3.613464e-16 9.824343e-17 -7.990599e-16 1.356826e-16 -9.239659e-17
## ldl hdl tch ltg glu
## 1.285989e-16 -4.504396e-16 3.861403e-16 -3.836857e-16 -3.423209e-16
apply(diabetes$x, 2, sd)
## age sex bmi map tc ldl
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## hdl tch ltg glu
## 0.04761905 0.04761905 0.04761905 0.04761905
# what the... why standardize to sd = 0.0476, of all values?
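In hindsight there is a plausible explanation: 0.04761905 = 1/21 = 1/sqrt(442-1), which is exactly the sample sd of a zero-mean column rescaled to unit L2 norm, i.e. the standardization the LARS paper uses. A quick check (a sketch, assuming the lars diabetes data is loaded as above):

```r
# the sd of a centered column with unit L2 norm is sqrt(1/(n-1))
1 / sqrt(nrow(diabetes$x) - 1)          # 0.04761905, the sd observed above
# so every column of x should then have squared L2 norm (close to) 1
range(colSums(unclass(diabetes$x)^2))
```

If both come out as expected, each column of x was centered and scaled to unit norm, which produces exactly the sd seen above.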
Now let's look at the response 'y'.
diabetes$y # matches with the values in the paper
## [1] 151 75 141 206 135 97 138 63 110 310 101 69 179 185 118 171 166
## [18] 144 97 168 68 49 68 245 184 202 137 85 131 283 129 59 341 87
## [35] 65 102 265 276 252 90 100 55 61 92 259 53 190 142 75 142 155
## [52] 225 59 104 182 128 52 37 170 170 61 144 52 128 71 163 150 97
## [69] 160 178 48 270 202 111 85 42 170 200 252 113 143 51 52 210 65
## [86] 141 55 134 42 111 98 164 48 96 90 162 150 279 92 83 128 102
## [103] 302 198 95 53 134 144 232 81 104 59 246 297 258 229 275 281 179
## [120] 200 200 173 180 84 121 161 99 109 115 268 274 158 107 83 103 272
## [137] 85 280 336 281 118 317 235 60 174 259 178 128 96 126 288 88 292
## [154] 71 197 186 25 84 96 195 53 217 172 131 214 59 70 220 268 152
## [171] 47 74 295 101 151 127 237 225 81 151 107 64 138 185 265 101 137
## [188] 143 141 79 292 178 91 116 86 122 72 129 142 90 158 39 196 222
## [205] 277 99 196 202 155 77 191 70 73 49 65 263 248 296 214 185 78
## [222] 93 252 150 77 208 77 108 160 53 220 154 259 90 246 124 67 72
## [239] 257 262 275 177 71 47 187 125 78 51 258 215 303 243 91 150 310
## [256] 153 346 63 89 50 39 103 308 116 145 74 45 115 264 87 202 127
## [273] 182 241 66 94 283 64 102 200 265 94 230 181 156 233 60 219 80
## [290] 68 332 248 84 200 55 85 89 31 129 83 275 65 198 236 253 124
## [307] 44 172 114 142 109 180 144 163 147 97 220 190 109 191 122 230 242
## [324] 248 249 192 131 237 78 135 244 199 270 164 72 96 306 91 214 95
## [341] 216 263 178 113 200 139 139 88 148 88 243 71 77 109 272 60 54
## [358] 221 90 311 281 182 321 58 262 206 233 242 123 167 63 197 71 168
## [375] 140 217 121 235 245 40 52 104 132 88 69 219 72 201 110 51 277
## [392] 63 118 69 273 258 43 198 242 232 175 93 168 275 293 281 72 140
## [409] 189 181 209 136 261 113 131 174 257 55 84 42 146 212 233 91 111
## [426] 152 120 67 310 94 183 66 173 72 49 64 48 178 104 132 220 57
plot(table(diabetes$y))
Let's look into the x2 variable.
str(diabetes$x2) # 442x64 named numeric matrix. 64 predictors. Let's see those vars.
## AsIs [1:442, 1:64] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
## - attr(*, ".Names")= chr [1:28288] "age" "age" "age" "age" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:442] "1" "2" "3" "4" ...
## ..$ : chr [1:64] "age" "sex" "bmi" "map" ...
colnames(diabetes$x2)
## [1] "age" "sex" "bmi" "map" "tc" "ldl" "hdl"
## [8] "tch" "ltg" "glu" "age^2" "bmi^2" "map^2" "tc^2"
## [15] "ldl^2" "hdl^2" "tch^2" "ltg^2" "glu^2" "age:sex" "age:bmi"
## [22] "age:map" "age:tc" "age:ldl" "age:hdl" "age:tch" "age:ltg" "age:glu"
## [29] "sex:bmi" "sex:map" "sex:tc" "sex:ldl" "sex:hdl" "sex:tch" "sex:ltg"
## [36] "sex:glu" "bmi:map" "bmi:tc" "bmi:ldl" "bmi:hdl" "bmi:tch" "bmi:ltg"
## [43] "bmi:glu" "map:tc" "map:ldl" "map:hdl" "map:tch" "map:ltg" "map:glu"
## [50] "tc:ldl" "tc:hdl" "tc:tch" "tc:ltg" "tc:glu" "ldl:hdl" "ldl:tch"
## [57] "ldl:ltg" "ldl:glu" "hdl:tch" "hdl:ltg" "hdl:glu" "tch:ltg" "tch:glu"
## [64] "ltg:glu"
# In addition to the 10 predictors from x, there are squared terms and interaction terms.
sum(is.na(diabetes$x2)) # No NAs here either.
## [1] 0
sum(diabetes$x2[,1:10] != diabetes$x) # as expected, the first 10 measurements match x: no elements differ
## [1] 0
Let's reproduce how x2 was built from x.
** Note that diabetes$x and diabetes$x2 are annotated 'AsIs'. This is a funny crutch R uses to enforce (partial) immutability on objects that shouldn't be changed. But in R, every object's innards are visible if you try hard enough.
Some nasty reminders when dealing with "AsIs" objects.
# 1) Simple case;
?model.matrix # takes data.frame object as data argument.
## starting httpd help server ... done
x2_my <- model.matrix(~., data=as.data.frame(diabetes$x))[,-1] # get rid of 'intercept' column
str(x2_my) # note that 'model.matrix' returns matrix. x2_my is a named matrix
## num [1:442, 1:10] 0.03808 -0.00188 0.0853 -0.08906 0.00538 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:442] "1" "2" "3" "4" ...
## ..$ : chr [1:10] "xage" "xsex" "xbmi" "xmap" ...
x2_my[1:3,]
## xage xsex xbmi xmap xtc
## 1 0.038075906 0.05068012 0.06169621 0.021872355 -0.044223498
## 2 -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## 3 0.085298906 0.05068012 0.04445121 -0.005670611 -0.045599451
## xldl xhdl xtch xltg xglu
## 1 -0.03482076 -0.04340085 -0.002592262 0.019908421 -0.01764613
## 2 -0.01916334 0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262 0.002863771 -0.02593034
colnames(x2_my) # 'x' is prepended
## [1] "xage" "xsex" "xbmi" "xmap" "xtc" "xldl" "xhdl" "xtch" "xltg" "xglu"
# x2_my[,'xage'] # works
# x2_my$xage # doesn't work. remember, x2_my is not a dataframe.
# 2) A slightly less simple case;
# x2_my <- model.matrix(~ xage + xsex + xbmi, data=as.data.frame(diabetes$x))[,-1]
"Error in eval(expr, envir, enclos) : object 'xage' not found" ; what the... something weird is happening here. So let's check the matrix -> data.frame conversion.
str(as.data.frame(diabetes$x))
## 'data.frame': 442 obs. of 1 variable:
## $ x: AsIs [1:442, 1:10] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr "age" "sex" "bmi" "map" ...
# "'data.frame': 442 obs. of 1 variable" ; it's got only one freaking variable.
as.data.frame(diabetes$x)[1:3,]
## age sex bmi map tc
## [1,] 0.038075906 0.05068012 0.06169621 0.021872355 -0.044223498
## [2,] -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## [3,] 0.085298906 0.05068012 0.04445121 -0.005670611 -0.045599451
## ldl hdl tch ltg glu
## [1,] -0.03482076 -0.04340085 -0.002592262 0.019908421 -0.01764613
## [2,] -0.01916334 0.07441156 -0.039493383 -0.068329744 -0.09220405
## [3,] -0.03419447 -0.03235593 -0.002592262 0.002863771 -0.02593034
It prints out innocently, but the internal data structure is not what I wanted. R is quietly "pretty printing" it for me, and that is not what I asked for...
That's because 'diabetes$x' is an 'AsIs' object: it wants to keep its original data structure and resists both conversion to a data.frame and comprehension inside a function's formula notation.
- Remember, R has a funny notion of polymorphism and no privacy. I can mess with objects any way I want.
methods(as.data.frame)
## [1] as.data.frame.aovproj* as.data.frame.array
## [3] as.data.frame.AsIs as.data.frame.character
## [5] as.data.frame.complex as.data.frame.data.frame
## [7] as.data.frame.Date as.data.frame.default
## [9] as.data.frame.difftime as.data.frame.factor
## [11] as.data.frame.ftable* as.data.frame.integer
## [13] as.data.frame.list as.data.frame.logical
## [15] as.data.frame.logLik* as.data.frame.matrix
## [17] as.data.frame.model.matrix as.data.frame.noquote
## [19] as.data.frame.numeric as.data.frame.numeric_version
## [21] as.data.frame.ordered as.data.frame.POSIXct
## [23] as.data.frame.POSIXlt as.data.frame.raw
## [25] as.data.frame.table as.data.frame.ts
## [27] as.data.frame.vector
## see '?methods' for accessing help and source code
As expected: the "as.data.frame" generic dispatches to "as.data.frame.AsIs" when the object it is called on is declared 'AsIs'.
x_df = as.data.frame.matrix(diabetes$x) # use 'as.data.frame.matrix'
str(x_df) # looks like plain vanilla data.frame
## 'data.frame': 442 obs. of 10 variables:
## $ age: num 0.03808 -0.00188 0.0853 -0.08906 0.00538 ...
## $ sex: num 0.0507 -0.0446 0.0507 -0.0446 -0.0446 ...
## $ bmi: num 0.0617 -0.0515 0.0445 -0.0116 -0.0364 ...
## $ map: num 0.02187 -0.02633 -0.00567 -0.03666 0.02187 ...
## $ tc : num -0.04422 -0.00845 -0.0456 0.01219 0.00393 ...
## $ ldl: num -0.0348 -0.0192 -0.0342 0.025 0.0156 ...
## $ hdl: num -0.0434 0.07441 -0.03236 -0.03604 0.00814 ...
## $ tch: num -0.00259 -0.03949 -0.00259 0.03431 -0.00259 ...
## $ ltg: num 0.01991 -0.06833 0.00286 0.02269 -0.03199 ...
## $ glu: num -0.01765 -0.0922 -0.02593 -0.00936 -0.04664 ...
names(x_df)
## [1] "age" "sex" "bmi" "map" "tc" "ldl" "hdl" "tch" "ltg" "glu"
x_df$sex # yup, this x_df is the real deal. We are out of that 'AsIs' curse
## [1] 0.05068012 -0.04464164 0.05068012 -0.04464164 -0.04464164
## [6] -0.04464164 0.05068012 0.05068012 0.05068012 -0.04464164
## [11] -0.04464164 0.05068012 -0.04464164 0.05068012 -0.04464164
## [16] 0.05068012 -0.04464164 0.05068012 -0.04464164 -0.04464164
## [21] -0.04464164 0.05068012 -0.04464164 0.05068012 -0.04464164
## [26] 0.05068012 -0.04464164 -0.04464164 -0.04464164 0.05068012
## [31] -0.04464164 -0.04464164 0.05068012 -0.04464164 -0.04464164
## [36] 0.05068012 -0.04464164 -0.04464164 0.05068012 0.05068012
## [41] 0.05068012 -0.04464164 0.05068012 -0.04464164 0.05068012
## [46] 0.05068012 -0.04464164 -0.04464164 0.05068012 0.05068012
## [51] -0.04464164 0.05068012 -0.04464164 -0.04464164 -0.04464164
## [56] -0.04464164 -0.04464164 -0.04464164 -0.04464164 0.05068012
## [61] -0.04464164 0.05068012 0.05068012 -0.04464164 0.05068012
## [66] 0.05068012 0.05068012 0.05068012 0.05068012 -0.04464164
## [71] -0.04464164 -0.04464164 0.05068012 0.05068012 0.05068012
## [76] 0.05068012 0.05068012 -0.04464164 -0.04464164 -0.04464164
## [81] -0.04464164 0.05068012 -0.04464164 -0.04464164 -0.04464164
## [86] -0.04464164 0.05068012 -0.04464164 0.05068012 -0.04464164
## [91] -0.04464164 -0.04464164 -0.04464164 -0.04464164 -0.04464164
## [96] -0.04464164 0.05068012 -0.04464164 0.05068012 -0.04464164
## [101] -0.04464164 0.05068012 -0.04464164 0.05068012 -0.04464164
snip......
** End of "AsIs" digression --------------------
Use x_df to generate diabetes$x2
x2_my <- model.matrix(~.^2 + I(age^2) + I(bmi^2) + I(map^2) + I(tc^2) + I(ldl^2) + I(hdl^2) + I(tch^2) +
I(ltg^2) + I(glu^2), data=x_df)[,-1] # get rid of 'intercept' column
str(x2_my)
## num [1:442, 1:64] 0.03808 -0.00188 0.0853 -0.08906 0.00538 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:442] "1" "2" "3" "4" ...
## ..$ : chr [1:64] "age" "sex" "bmi" "map" ...
str(diabetes$x2)
## AsIs [1:442, 1:64] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
## - attr(*, ".Names")= chr [1:28288] "age" "age" "age" "age" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:442] "1" "2" "3" "4" ...
## ..$ : chr [1:64] "age" "sex" "bmi" "map" ...
x2_my[1:3,]
## age sex bmi map tc
## 1 0.038075906 0.05068012 0.06169621 0.021872355 -0.044223498
## 2 -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## 3 0.085298906 0.05068012 0.04445121 -0.005670611 -0.045599451
## ldl hdl tch ltg glu
## 1 -0.03482076 -0.04340085 -0.002592262 0.019908421 -0.01764613
## 2 -0.01916334 0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262 0.002863771 -0.02593034
## I(age^2) I(bmi^2) I(map^2) I(tc^2) I(ldl^2)
## 1 1.449775e-03 0.003806422 4.783999e-04 1.955718e-03 0.0012124855
## 2 3.541986e-06 0.002649579 6.931549e-04 7.138094e-05 0.0003672336
## 3 7.275903e-03 0.001975910 3.215582e-05 2.079310e-03 0.0011692615
,,,,
## map:hdl map:tch map:ltg map:glu tc:ldl
## 1 -0.0009492787 -5.669887e-05 4.354440e-04 -0.0003859623 0.0015398960
## 2 -0.0019590954 1.039775e-03 1.798974e-03 0.0024275330 0.0001619058
## 3 0.0001834779 1.469971e-05 -1.623933e-05 0.0001470409 0.0015592489
## tc:hdl tc:tch tc:ltg tc:glu ldl:hdl
## 1 0.0019193372 0.0001146389 -0.0008804200 0.0007803734 0.001511251
## 2 -0.0006286828 0.0003336687 0.0005772992 0.0007790066 -0.001425974
## 3 0.0014754128 0.0001182057 -0.0001305864 0.0011824092 0.001106394
## ldl:tch ldl:ltg ldl:glu hdl:tch hdl:ltg
## 1 9.026454e-05 -0.0006932264 0.0006144515 1.125064e-04 -8.640423e-04
## 2 7.568251e-04 0.0013094261 0.0017669375 -2.938764e-03 -5.084523e-03
## 3 8.864101e-05 -0.0000979251 0.0008866741 8.387505e-05 -9.265996e-05
## hdl:glu tch:ltg tch:glu ltg:glu
## 1 0.0007658568 -5.160784e-05 4.574338e-05 -3.513065e-04
## 2 -0.0068610475 2.698573e-03 3.641450e-03 6.300279e-03
## 3 0.0008390003 -7.423643e-06 6.721823e-05 -7.425854e-05
diabetes$x2[1:3,]
## age sex bmi map tc
## 1 0.038075906 0.05068012 0.06169621 0.021872355 -0.044223498
## 2 -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## 3 0.085298906 0.05068012 0.04445121 -0.005670611 -0.045599451
## ldl hdl tch ltg glu
## 1 -0.03482076 -0.04340085 -0.002592262 0.019908421 -0.01764613
## 2 -0.01916334 0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262 0.002863771 -0.02593034
## age^2 bmi^2 map^2 tc^2 ldl^2
## 1 -0.01485516 0.022504574 -0.03104468 -0.004331120 -0.01373992
## 2 -0.04129154 0.005642773 -0.02730766 -0.030938902 -0.02480103
## 3 0.09164344 -0.004176421 -0.03880990 -0.002585937 -0.01430556
## hdl^2 tch^2 ltg^2 glu^2 age:sex
## 1 -0.004631425 -0.030448463 -0.02881622 -0.02752556 0.032864976
## 2 0.040036524 -0.009485482 0.03716124 0.08802196 -0.006609993
## 3 -0.014861456 -0.030448463 -0.03480993 -0.02243261 0.084051745
## age:bmi age:map age:tc age:ldl age:hdl
## 1 0.040571674 0.001660641 -0.04655325 -0.038244710 -0.0345115069
## 2 -0.006764804 -0.015934224 -0.01172880 -0.009655541 0.0006995477
## 3 0.070889129 -0.027912869 -0.09174427 -0.071641516 -0.0602920920
## age:tch age:ltg age:glu sex:bmi sex:map
## 1 -0.012112261 0.003064892 -0.03027751 0.06210301 0.01228204
## 2 -0.008368996 -0.010201680 -0.01138014 0.04451819 0.01373927
## 3 -0.014760528 -0.007763517 -0.06469909 0.04356153 -0.01815796
,,,,,
## ldl:glu hdl:tch hdl:ltg hdl:glu tch:ltg
## 1 -0.0009221095 0.03349363 0.0008521487 0.03115026 -0.02819118
## 2 0.0237834359 -0.02381466 -0.0945055990 -0.14037759 0.02529772
## 3 0.0049134553 0.03295588 0.0182807986 0.03279524 -0.02733183
## tch:glu ltg:glu
## 1 -0.01765816 -0.02779368
## 2 0.05303354 0.10401328
## 3 -0.01723596 -0.02230374
# x2_my <-> diabetes$x2 ; the numbers differ again. Getting on my nerves. Standardization again?
colMeans(diabetes$x2) # OK. Means are zero
## age sex bmi map tc
## -4.807774e-20 7.347064e-18 1.427492e-18 -4.155996e-18 3.376235e-18
## ldl hdl tch ltg glu
## 2.087212e-18 2.392604e-18 2.628454e-18 1.007916e-18 2.880740e-18
## age^2 bmi^2 map^2 tc^2 ldl^2
## -2.562053e-17 -2.487680e-17 2.863079e-18 5.103624e-17 1.603920e-17
## hdl^2 tch^2 ltg^2 glu^2 age:sex
## -2.200538e-17 6.511517e-17 -1.233121e-17 -4.219067e-19 -2.473796e-18
## age:bmi age:map age:tc age:ldl age:hdl
## -6.574386e-18 1.253455e-18 -1.434924e-17 -1.945799e-18 -4.604180e-19
## age:tch age:ltg age:glu sex:bmi sex:map
## 1.544865e-18 -5.812010e-18 -3.373169e-18 -4.895467e-18 2.879759e-19
## sex:tc sex:ldl sex:hdl sex:tch sex:ltg
## -1.410015e-18 6.082570e-18 1.594979e-17 -6.308045e-17 -4.605652e-18
## sex:glu bmi:map bmi:tc bmi:ldl bmi:hdl
## 1.442823e-17 -3.066428e-18 -1.311173e-17 2.385490e-20 2.539584e-17
## bmi:tch bmi:ltg bmi:glu map:tc map:ldl
## -8.461438e-19 -5.782820e-20 -9.609171e-18 1.023099e-17 -8.185481e-18
## map:hdl map:tch map:ltg map:glu tc:ldl
## 3.437681e-18 -8.184500e-18 1.106279e-17 -1.515948e-17 1.793643e-17
## tc:hdl tc:tch tc:ltg tc:glu ldl:hdl
## -1.677815e-18 -2.114418e-17 -1.042220e-17 4.673537e-18 2.068999e-18
## ldl:tch ldl:ltg ldl:glu hdl:tch hdl:ltg
## -2.395670e-18 1.845505e-18 -1.498814e-17 6.730516e-18 -3.275664e-18
## hdl:glu tch:ltg tch:glu ltg:glu
## -6.698996e-19 1.297241e-18 -1.880796e-19 -2.796359e-19
apply(diabetes$x2, 2, sd) # I knew it.
## age sex bmi map tc ldl
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## hdl tch ltg glu age^2 bmi^2
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## map^2 tc^2 ldl^2 hdl^2 tch^2 ltg^2
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## glu^2 age:sex age:bmi age:map age:tc age:ldl
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## age:hdl age:tch age:ltg age:glu sex:bmi sex:map
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## sex:tc sex:ldl sex:hdl sex:tch sex:ltg sex:glu
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## bmi:map bmi:tc bmi:ldl bmi:hdl bmi:tch bmi:ltg
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## bmi:glu map:tc map:ldl map:hdl map:tch map:ltg
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## map:glu tc:ldl tc:hdl tc:tch tc:ltg tc:glu
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## ldl:hdl ldl:tch ldl:ltg ldl:glu hdl:tch hdl:ltg
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## hdl:glu tch:ltg tch:glu ltg:glu
## 0.04761905 0.04761905 0.04761905 0.04761905
scale(x2_my[,11:64])[1:3,]*(0.0149/0.312) # Test. It matches closely (0.0149/0.312 = 0.04776, slightly above the target sd 0.04762)
## I(age^2) I(bmi^2) I(map^2) I(tc^2) I(ldl^2)
## 1 -0.01489801 0.022569491 -0.03113423 -0.004343613 -0.01377956
## 2 -0.04141065 0.005659051 -0.02738643 -0.031028148 -0.02487257
## 3 0.09190780 -0.004188469 -0.03892186 -0.002593396 -0.01434683
## I(hdl^2) I(tch^2) I(ltg^2) I(glu^2) age:sex
## 1 -0.004644785 -0.030536295 -0.02889934 -0.02760496 0.03295978
## 2 0.040152014 -0.009512844 0.03726844 0.08827587 -0.00662906
## 3 -0.014904326 -0.030536295 -0.03491034 -0.02249732 0.08429420
## age:bmi age:map age:tc age:ldl age:hdl
## 1 0.040688708 0.001665431 -0.04668754 -0.038355032 -0.0346110594
## 2 -0.006784318 -0.015980188 -0.01176263 -0.009683393 0.0007015656
## 3 0.071093617 -0.027993387 -0.09200892 -0.071848175 -0.0604660115
,,,,
## tc:hdl tc:tch tc:ltg tc:glu ldl:hdl ldl:tch
## 1 0.03993796 -0.01935874 -0.04301523 0.0009657478 0.04247711 -0.02210140
## 2 -0.01650978 -0.01554592 -0.01237870 0.0009353736 -0.02131774 -0.01159761
## 3 0.03010345 -0.01929665 -0.02725616 0.0099000824 0.03368385 -0.02212698
## ldl:ltg ldl:glu hdl:tch hdl:ltg hdl:glu
## 1 -0.03121435 -0.0009247694 0.03359024 0.0008546068 0.03124011
## 2 0.01301080 0.0238520420 -0.02388336 -0.0947782113 -0.14078252
## 3 -0.01806814 0.0049276287 0.03305094 0.0183335317 0.03288985
## tch:ltg tch:glu ltg:glu
## 1 -0.02827250 -0.01770909 -0.02787386
## 2 0.02537069 0.05318652 0.10431332
## 3 -0.02741067 -0.01728568 -0.02236807
x2_my[,11:64] = scale(x2_my[,11:64])*(0.0149/0.312)
sqrt(sum(x2_my - diabetes$x2)^2) # essentially zero: the two agree. Reconstruction of the preprocessing done. (For a strict check, sqrt(sum((x2_my - diabetes$x2)^2)) is the Frobenius norm; the plain sum here can cancel, since every column is centered.)
## [1] 7.310626e-13
# Now we have it. x2 is generated from x: the 10 originals, their quadratic terms, and the first-order interactions.
### Do the Lasso, LAR, and forward-stagewise regression as in the LARS manual
par(mfrow=c(2,2))
object <- lars(diabetes$x, diabetes$y)
plot(object)
object2 <- lars(diabetes$x, diabetes$y, type='lar')
plot(object2)
object3 <- lars(diabetes$x, diabetes$y, type='for') # Can use abbreviations
plot(object3)
# The three look similar.
# In particular, as in the paper, Lasso and LAR give nearly identical results on this data.
Let's go deeper.
Regularized regressions like the lasso are meant to select variables and thereby build a decent model automatically when there are many predictors. The example above has few variables, and even the 64 in diabetes$x2 are few. Let's push further: build a new predictor matrix x3 from x_df and run the lasso on that.
On a hunch that bmi (body mass index) might interact again with the first-order interactions;
x3 <- model.matrix(~ bmi*(.^2) + I(age^2) + I(bmi^2) + I(map^2) + I(tc^2) + I(ldl^2) + I(hdl^2) + I(tch^2) +
I(ltg^2) + I(glu^2), data=x_df)[,-1] # get rid of 'intercept' column
str(x3)
## num [1:442, 1:100] 0.0617 -0.0515 0.0445 -0.0116 -0.0364 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:442] "1" "2" "3" "4" ...
## ..$ : chr [1:100] "bmi" "age" "sex" "map" ...
x3[1:3,]
## bmi age sex map tc
## 1 0.06169621 0.038075906 0.05068012 0.021872355 -0.044223498
## 2 -0.05147406 -0.001882017 -0.04464164 -0.026327835 -0.008448724
## 3 0.04445121 0.085298906 0.05068012 -0.005670611 -0.045599451
## ldl hdl tch ltg glu
## 1 -0.03482076 -0.04340085 -0.002592262 0.019908421 -0.01764613
## 2 -0.01916334 0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262 0.002863771 -0.02593034
## I(age^2) I(bmi^2) I(map^2) I(tc^2) I(ldl^2)
## 1 1.449775e-03 0.003806422 4.783999e-04 1.955718e-03 0.0012124855
## 2 3.541986e-06 0.002649579 6.931549e-04 7.138094e-05 0.0003672336
## 3 7.275903e-03 0.001975910 3.215582e-05 2.079310e-03 0.0011692615
## I(hdl^2) I(tch^2) I(ltg^2) I(glu^2) age:sex
## 1 0.001883633 6.719822e-06 3.963452e-04 0.0003113857 0.0019296915
## 2 0.005537081 1.559727e-03 4.668954e-03 0.0085015868 0.0000840163
## 3 0.001046906 6.719822e-06 8.201182e-06 0.0006723825 0.0043229587
......
## bmi:ldl:hdl bmi:ldl:tch bmi:ldl:ltg bmi:ldl:glu bmi:hdl:tch
## 1 9.323843e-05 5.568980e-06 -4.276944e-05 3.790933e-05 6.941216e-06
## 2 7.340068e-05 -3.895686e-05 -6.740148e-05 -9.095145e-05 1.512701e-04
## 3 4.918055e-05 3.940201e-06 -4.352890e-06 3.941374e-05 3.728348e-06
## bmi:hdl:ltg bmi:hdl:glu bmi:tch:ltg bmi:tch:glu bmi:ltg:glu
## 1 -5.330813e-05 4.725046e-05 -3.184008e-06 2.822193e-06 -2.167428e-05
## 2 2.617211e-04 3.531660e-04 -1.389065e-04 -1.874402e-04 -3.243010e-04
## 3 -4.118848e-06 3.729458e-05 -3.299900e-07 2.987932e-06 -3.300882e-06
diabetes$x2[1:3,]
## age sex bmi map tc
## 1 0.038075906 0.05068012 0.06169621 0.021872355 -0.044223498
## 2 -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## 3 0.085298906 0.05068012 0.04445121 -0.005670611 -0.045599451
## ldl hdl tch ltg glu
## 1 -0.03482076 -0.04340085 -0.002592262 0.019908421 -0.01764613
## 2 -0.01916334 0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262 0.002863771 -0.02593034
## age^2 bmi^2 map^2 tc^2 ldl^2
## 1 -0.01485516 0.022504574 -0.03104468 -0.004331120 -0.01373992
## 2 -0.04129154 0.005642773 -0.02730766 -0.030938902 -0.02480103
## 3 0.09164344 -0.004176421 -0.03880990 -0.002585937 -0.01430556
## hdl^2 tch^2 ltg^2 glu^2 age:sex
## 1 -0.004631425 -0.030448463 -0.02881622 -0.02752556 0.032864976
## 2 0.040036524 -0.009485482 0.03716124 0.08802196 -0.006609993
## 3 -0.014861456 -0.030448463 -0.03480993 -0.02243261 0.084051745
,,,,,,
## map:hdl map:tch map:ltg map:glu tc:ldl tc:hdl
## 1 -0.01120743 -0.01382511 -0.01019384 -0.02577846 -0.006868965 0.03982309
## 2 -0.03197943 0.00987452 0.02036965 0.03136196 -0.026235312 -0.01646229
## 3 0.01209345 -0.01228188 -0.02031831 -0.01495348 -0.006596978 0.03001686
## tc:tch tc:ltg tc:glu ldl:hdl ldl:tch ldl:ltg
## 1 -0.01930305 -0.04289151 0.0009629700 0.04235493 -0.02203783 -0.03112456
## 2 -0.01550120 -0.01234310 0.0009326831 -0.02125642 -0.01156425 0.01297338
## 3 -0.01924114 -0.02717776 0.0098716066 0.03358696 -0.02206334 -0.01801617
## ldl:glu hdl:tch hdl:ltg hdl:glu tch:ltg
## 1 -0.0009221095 0.03349363 0.0008521487 0.03115026 -0.02819118
## 2 0.0237834359 -0.02381466 -0.0945055990 -0.14037759 0.02529772
## 3 0.0049134553 0.03295588 0.0182807986 0.03279524 -0.02733183
## tch:glu ltg:glu
## 1 -0.01765816 -0.02779368
## 2 0.05303354 0.10401328
## 3 -0.01723596 -0.02230374
x3[,11:100] = scale(x3[,11:100])*(0.0149/0.312) # data prep done.
So, let's run the lasso on 100 predictors.
# x3_lasso <- lars(x=x3, y=diabetes$y, type="lasso", trace=TRUE) # set trace=TRUE to see the lasso path
x3_lasso <- lars(x=x3, y=diabetes$y, type="lasso", trace=F) # Print out is too long
# Variables 1, 9, 4 and 7 enter first. Variable 16 (hdl^2), selected at step 37, is dropped again at step 48.
# Roughly a third or more of the variables get dropped along the path.
par(mfrow=c(1,1))
plot(x3_lasso) # too crowded to see anything.
x3_lasso
##
## Call:
## lars(x = x3, y = diabetes$y, type = "lasso", trace = F)
## R-squared: 0.626
## Sequence of LASSO moves:
## bmi ltg map hdl bmi:map age:sex I(glu^2) I(bmi^2) age:map age:glu sex
## Var 1 9 4 7 37 20 19 12 22 28 3
## Step 1 2 3 4 5 6 7 8 9 10 11
## bmi:tc:hdl bmi:hdl:glu glu age:ltg bmi:tc:tch I(age^2) bmi:age:hdl
## Var 87 97 10 27 88 11 69
## Step 12 13 14 15 16 17 18
## sex:map bmi:age:sex map:hdl age:ldl sex:hdl tch:glu bmi:ldl:ltg
## Var 30 65 46 24 33 63 93
## Step 19 20 21 22 23 24 25
## ldl:glu bmi:sex sex:tch bmi:age:ltg I(ltg^2) ldl:ltg tc:hdl
## Var 58 29 34 71 18 57 51
## Step 26 27 28 29 30 31 32
## bmi:tc:ltg bmi:hdl:ltg hdl:tch hdl:ltg I(hdl^2) bmi:map:tch bmi:ltg
## Var 89 96 59 60 16 83 42
## Step 33 34 35 36 37 38 39
## bmi:map:tc age age:hdl bmi:age:ldl tch:ltg bmi:age:glu map:glu
## Var 80 2 25 68 62 72 49
## Step 40 41 42 43 44 45 46
## bmi:sex:map I(hdl^2) tc bmi:ldl:hdl I(ltg^2) bmi:tc:hdl bmi:sex:ldl
## Var 73 -16 5 91 -18 -87 75
## Step 47 48 49 50 51 52 53
## bmi:ldl:tch I(tch^2) bmi:ldl:hdl I(bmi^2) bmi:sex:hdl sex:ldl hdl:ltg
## Var 92 17 -91 -12 76 32 -60
## Step 54 55 56 57 58 59 60
## bmi:ldl tch tc:glu tc:ltg bmi:map:hdl hdl:glu map:tch tc:tch sex:glu
## Var 39 8 54 53 82 61 47 52 36
## Step 61 62 63 64 65 66 67 68 69
,,, snip
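The enter/drop bookkeeping in the printout above can also be pulled straight from the fitted object (a sketch; I believe the lars object stores the per-step moves in its actions component, with negative entries marking drops):

```r
moves <- unlist(x3_lasso$actions)   # one entry per step: +var enters, -var leaves
names(moves)[moves < 0]             # the variables dropped along the path, e.g. hdl^2
```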
x3_lasso_coef_mat = coef(x3_lasso, s=c(.25,.50,0.75,1.00), mode="fraction")
# Analysis of x3_lasso_coef_mat:
# s = 1, i.e., no penalty (OLS):
sum(x3_lasso_coef_mat[4,] == 0) # all the variables are included, as expected.
## [1] 0
sum(abs(x3_lasso_coef_mat[4,])) # 113971, this is the total sum of beta's absolute values in OLS
## [1] 113971.2
# the total the betas reach when no penalty, i.e. no shrinkage, is applied
# s = 0.75 : Give some penalty
sum(x3_lasso_coef_mat[3,] == 0) # 2 variables got sacked.
## [1] 2
sum(abs(x3_lasso_coef_mat[3,])) # 85413, about what we'd expect: 113971*0.75 = 85478, within ~0.1% of 85413
## [1] 85413.39
# s = 0.25 :
sum(x3_lasso_coef_mat[1,] == 0) # 5 variables got sacked at this stage
## [1] 5
sum(abs(x3_lasso_coef_mat[1,])) # 28477
## [1] 28477.45
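As a follow-up, the predictors that survive at a given s can be listed straight from the coefficient matrix (a sketch reusing x3_lasso_coef_mat from above; row 1 corresponds to s = 0.25):

```r
beta_s25 <- x3_lasso_coef_mat[1, ]
names(beta_s25)[beta_s25 != 0]        # the predictors the lasso keeps at s = 0.25
# sanity check: the L1 norm should be roughly the stated fraction of the OLS norm
sum(abs(beta_s25)) / sum(abs(x3_lasso_coef_mat[4, ]))   # ~0.25
```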
At which penalty value do we get the best result? ========
Use 7-fold CV.
?cv.lars
# CV_x3_lasso = cv.lars(x=x3, y=diabetes$y, K=7, index=seq(from = 0, to = 1, length=200),
#                       type="lasso", trace=TRUE, mode="fraction") # verbose version
CV_x3_lasso = cv.lars(x=x3, y=diabetes$y, K=7, index=seq(from = 0, to = 1, length=200),
                      type="lasso", trace=FALSE, mode="fraction") # nice.
Analysis of the results:
The best performance shows up where s is nearly 0. No surprise: with only 442 data points, forcing a 100-variable model gives this kind of result. Still, the lasso is cool.
Then at which 's' does the lasso show its best result? Easy; I don't even have to run predictions for this.
Since there are 200 indices between 0 and 1, each step is about 0.005. The plot puts the optimum at the 5th tick, so build the model at s = 0.005*5 = 0.025.
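Instead of counting ticks on the plot, the optimal s can be read out of the object cv.lars returns (a sketch, assuming it returns its usual list with cv and index components):

```r
best_i <- which.min(CV_x3_lasso$cv)    # position of the smallest cross-validated MSE
CV_x3_lasso$index[best_i]              # the corresponding fraction s; should land near 0.025
```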
Prediction with the lars lasso model.
As with other model types, the lars package provides a predict method (predict.lars). Since predict() is generic, predict.lars() is dispatched automatically when the model is a lars object.
predict_to_self = predict(x3_lasso, x3, s=0.025, type="fit", mode="fraction") # not an out-of-sample test
str(predict_to_self) # it's got fit
## List of 4
## $ s : num 0.025
## $ fraction: num 0.025
## $ mode : chr "fraction"
## $ fit : Named num [1:442] 206.5 73.8 190.2 179.1 117.5 ...
## ..- attr(*, "names")= chr [1:442] "1" "2" "3" "4" ...
MSE_lasso_0.025 = mean((predict_to_self$fit - diabetes$y)^2) # MSE at s=0.025
MSE_lasso_0.025
## [1] 2580.15
**2580 agrees with the plot. It should in fact be a bit smaller than the plotted value, since this is not an out-of-sample test.
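For an honest out-of-sample figure, the same machinery works with a held-out split (a sketch under hypothetical choices: the seed and the 100-row test size are arbitrary, and x3 and diabetes$y are assumed as above):

```r
set.seed(1)                                    # arbitrary seed, for reproducibility
test_idx  <- sample(nrow(x3), 100)             # hold out 100 of the 442 rows
fit_train <- lars(x3[-test_idx, ], diabetes$y[-test_idx], type = "lasso")
pred_test <- predict(fit_train, x3[test_idx, ], s = 0.025, type = "fit", mode = "fraction")
mean((pred_test$fit - diabetes$y[test_idx])^2) # out-of-sample MSE; expect it above 2580
```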
Lastly, just for fun:
Lasso with s=1 is simply OLS, so let's check it against an OLS regression result.
predict_to_self_s_1.0 = predict(x3_lasso, x3, s=1.0, type="fit", mode="fraction")
LM_OLS = lm(diabetes$y ~ x3)
predict_LM_OLS = predict(LM_OLS)
# The lasso at s=1 should reproduce the OLS result. Let's check.
mean((predict_to_self_s_1.0$fit - predict_LM_OLS)^2)
## [1] 2.916263e-18
# 2.92e-18 ; effectively 0. The two are the same.
Done