Learning & Reasoning/R

R로 Lasso regression 연습

이현봉 2015. 10. 5. 01:33

Lasso_LAR_with_LARS_Package Lasso Regression with R

LARS 패키지를 이용한 LASSO, LAR examples, by HBLEE

  • 특히 HTF의 “The Elements of Statistical Learning” 에 언급된 Regression의 Shrinkage Method 중 하나인 LASSO (L1 Norm: Sum of Absolute values, PENALTY/CONSTRAINT-BASED VARIABLE SELECTION)에 집중
  • Regularized Regression으로 Ridge Regression도 있으나, LASSO가 더 유용하리라 판단

Reference :

- CRAN LARS 패키지 매뉴얼 : https://cran.r-project.org/web/packages/lars/lars.pdf

- LAR 논문 : http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

Data : LARS 패키지에 내재된 ‘diabetes’ 데이타. LAR 논문에서도 사용됨.

*

library(lars)
## Warning: package 'lars' was built under R version 3.2.2
## Loaded lars 1.2
data(diabetes)
str(diabetes) 
## 'data.frame':    442 obs. of  3 variables:
##  $ x : AsIs [1:442, 1:10] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr  "age" "sex" "bmi" "map" ...
##  $ y : num  151 75 141 206 135 97 138 63 110 310 ...
##  $ x2: AsIs [1:442, 1:64] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
##   ..- attr(*, ".Names")= chr  "age" "age" "age" "age" ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr  "1" "2" "3" "4" ...
##   .. ..$ : chr  "age" "sex" "bmi" "map" ...

“diabetes” 의 구조에 주목: Predictor로 쓰이는 X와 X2가 matrix.
즉, input variable이 matrix이어야 함. Predictor로 X 또는 X2를 사용 가능 x2는 x의 10개 variable들의 squared, interaction들을 포함한 super set.

Response는 ‘y’ variable

중요 : x와 x2가 ’AsIs’로 declare됨. 따라서 data.frame() 으로 coversion이나 formula 내에서 쓸 때 주의. ?AsIs 참고토록.

Data 의미 :

Predictors (x): 442명 각각 환자들에게서 10개의 당뇨병에 관련된 측정치를 수집

Response (y) : x 측정 1년후 환자들의 당뇨 상태.
모델 : Regression type Supervised Learning. x를 활용해 1년 후 환자들의 예후를 예측.

Let’s take a look at each variables, x first.

str(diabetes$x)  # named numeric matrix of 442x10. x is a matrix. No categorical data 
##  AsIs [1:442, 1:10] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:10] "age" "sex" "bmi" "map" ...
colnames(diabetes$x) # "age" "sex" "bmi" "map" "tc"  "ldl" "hdl" "tch" "ltg" "glu"
##  [1] "age" "sex" "bmi" "map" "tc"  "ldl" "hdl" "tch" "ltg" "glu"
sum(is.na(diabetes$x)) # No missing values
## [1] 0
diabetes$x[1:7,]
##               age         sex         bmi          map           tc
## [1,]  0.038075906  0.05068012  0.06169621  0.021872355 -0.044223498
## [2,] -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## [3,]  0.085298906  0.05068012  0.04445121 -0.005670611 -0.045599451
## [4,] -0.089062939 -0.04464164 -0.01159501 -0.036656447  0.012190569
## [5,]  0.005383060 -0.04464164 -0.03638469  0.021872355  0.003934852
## [6,] -0.092695478 -0.04464164 -0.04069594 -0.019442093 -0.068990650
## [7,] -0.045472478  0.05068012 -0.04716281 -0.015999223 -0.040095640
##              ldl          hdl          tch          ltg          glu
## [1,] -0.03482076 -0.043400846 -0.002592262  0.019908421 -0.017646125
## [2,] -0.01916334  0.074411564 -0.039493383 -0.068329744 -0.092204050
## [3,] -0.03419447 -0.032355932 -0.002592262  0.002863771 -0.025930339
## [4,]  0.02499059 -0.036037570  0.034308859  0.022692023 -0.009361911
## [5,]  0.01559614  0.008142084 -0.002592262 -0.031991445 -0.046640874
## [6,] -0.07928784  0.041276824 -0.076394504 -0.041180385 -0.096346157
## [7,] -0.02480001  0.000778808 -0.039493383 -0.062912950 -0.038356660
# LARS 논문(Table-1) 에 있는 값과 다름. So, 이 diabetes 데이타는 전처리를 거쳤음.  
unique(diabetes$x[,2]) # just two values. So one must be male, the other female. 
## [1]  0.05068012 -0.04464164

??? 그런데 최소한 sex는 categorical이어야 하는데, x는 모두 numeric 임. so, sex는 이미 indicator variable로 바뀌었음을 예상. 그런데 [0|1] 이 아니고 값이 이상함.
원 데이타가 없는 상황에서 어떤 전처리를 했는지 알기 어려움. But, Lasso가 predictor 값의 scale에 영향을 받기에, 각 변수들을 미리 standardize 했을 수도.

colMeans(diabetes$x)  # OK. Means are zero
##           age           sex           bmi           map            tc 
## -3.613464e-16  9.824343e-17 -7.990599e-16  1.356826e-16 -9.239659e-17 
##           ldl           hdl           tch           ltg           glu 
##  1.285989e-16 -4.504396e-16  3.861403e-16 -3.836857e-16 -3.423209e-16
apply(diabetes$x, 2, sd)  
##        age        sex        bmi        map         tc        ldl 
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 
##        hdl        tch        ltg        glu 
## 0.04761905 0.04761905 0.04761905 0.04761905
# what the...  왜, 하필 sd=0.0476으로 표준화했을까? I don't want a know.

이제 response ’y’를 보자.

diabetes$y  # matches with the values in the paper
##   [1] 151  75 141 206 135  97 138  63 110 310 101  69 179 185 118 171 166
##  [18] 144  97 168  68  49  68 245 184 202 137  85 131 283 129  59 341  87
##  [35]  65 102 265 276 252  90 100  55  61  92 259  53 190 142  75 142 155
##  [52] 225  59 104 182 128  52  37 170 170  61 144  52 128  71 163 150  97
##  [69] 160 178  48 270 202 111  85  42 170 200 252 113 143  51  52 210  65
##  [86] 141  55 134  42 111  98 164  48  96  90 162 150 279  92  83 128 102
## [103] 302 198  95  53 134 144 232  81 104  59 246 297 258 229 275 281 179
## [120] 200 200 173 180  84 121 161  99 109 115 268 274 158 107  83 103 272
## [137]  85 280 336 281 118 317 235  60 174 259 178 128  96 126 288  88 292
## [154]  71 197 186  25  84  96 195  53 217 172 131 214  59  70 220 268 152
## [171]  47  74 295 101 151 127 237 225  81 151 107  64 138 185 265 101 137
## [188] 143 141  79 292 178  91 116  86 122  72 129 142  90 158  39 196 222
## [205] 277  99 196 202 155  77 191  70  73  49  65 263 248 296 214 185  78
## [222]  93 252 150  77 208  77 108 160  53 220 154 259  90 246 124  67  72
## [239] 257 262 275 177  71  47 187 125  78  51 258 215 303 243  91 150 310
## [256] 153 346  63  89  50  39 103 308 116 145  74  45 115 264  87 202 127
## [273] 182 241  66  94 283  64 102 200 265  94 230 181 156 233  60 219  80
## [290]  68 332 248  84 200  55  85  89  31 129  83 275  65 198 236 253 124
## [307]  44 172 114 142 109 180 144 163 147  97 220 190 109 191 122 230 242
## [324] 248 249 192 131 237  78 135 244 199 270 164  72  96 306  91 214  95
## [341] 216 263 178 113 200 139 139  88 148  88 243  71  77 109 272  60  54
## [358] 221  90 311 281 182 321  58 262 206 233 242 123 167  63 197  71 168
## [375] 140 217 121 235 245  40  52 104 132  88  69 219  72 201 110  51 277
## [392]  63 118  69 273 258  43 198 242 232 175  93 168 275 293 281  72 140
## [409] 189 181 209 136 261 113 131 174 257  55  84  42 146 212 233  91 111
## [426] 152 120  67 310  94 183  66 173  72  49  64  48 178 104 132 220  57
plot(table(diabetes$y))

Let’s look into x2 variable

str(diabetes$x2)  # 442x64 named numeric matrix. 64 predictors. Let's see those vars. 
##  AsIs [1:442, 1:64] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
##  - attr(*, ".Names")= chr [1:28288] "age" "age" "age" "age" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:442] "1" "2" "3" "4" ...
##   ..$ : chr [1:64] "age" "sex" "bmi" "map" ...
colnames(diabetes$x2) 
##  [1] "age"     "sex"     "bmi"     "map"     "tc"      "ldl"     "hdl"    
##  [8] "tch"     "ltg"     "glu"     "age^2"   "bmi^2"   "map^2"   "tc^2"   
## [15] "ldl^2"   "hdl^2"   "tch^2"   "ltg^2"   "glu^2"   "age:sex" "age:bmi"
## [22] "age:map" "age:tc"  "age:ldl" "age:hdl" "age:tch" "age:ltg" "age:glu"
## [29] "sex:bmi" "sex:map" "sex:tc"  "sex:ldl" "sex:hdl" "sex:tch" "sex:ltg"
## [36] "sex:glu" "bmi:map" "bmi:tc"  "bmi:ldl" "bmi:hdl" "bmi:tch" "bmi:ltg"
## [43] "bmi:glu" "map:tc"  "map:ldl" "map:hdl" "map:tch" "map:ltg" "map:glu"
## [50] "tc:ldl"  "tc:hdl"  "tc:tch"  "tc:ltg"  "tc:glu"  "ldl:hdl" "ldl:tch"
## [57] "ldl:ltg" "ldl:glu" "hdl:tch" "hdl:ltg" "hdl:glu" "tch:ltg" "tch:glu"
## [64] "ltg:glu"
# x에 있는 10개 predictor에 더하여, square term, 그리고 interaction term들이 있음.
sum(is.na(diabetes$x2))   # NA 없고.
## [1] 0
sum(diabetes$x2[,1:10] == diabetes$x)  # 예상대로 앞쪽 10개 측정치는 x와 같음. 
## [1] 0

x 에서 x2를 어찌 만들었는지 재현하자.

** Note that; diabetes\(x와 diabetes\)x2가 ’AsIs’로 annotate되어있음. This is funny clutch of R to enforce (partial) immutability for some objects that shouldn’t be changed. But in R all the objects innards are visible if you try hard.

Some nasty reminder when dealing with “AsIs” objects.

# 1) Simple case;
?model.matrix   # takes data.frame object as data argument. 
## starting httpd help server ... done
x2_my <- model.matrix(~., data=as.data.frame(diabetes$x))[,-1] # get rid of 'intercept' column
str(x2_my)   # note that 'model.matrix' returns matrix. x2_my is a named matrix
##  num [1:442, 1:10] 0.03808 -0.00188 0.0853 -0.08906 0.00538 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:442] "1" "2" "3" "4" ...
##   ..$ : chr [1:10] "xage" "xsex" "xbmi" "xmap" ...
x2_my[1:3,]
##           xage        xsex        xbmi         xmap          xtc
## 1  0.038075906  0.05068012  0.06169621  0.021872355 -0.044223498
## 2 -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## 3  0.085298906  0.05068012  0.04445121 -0.005670611 -0.045599451
##          xldl        xhdl         xtch         xltg        xglu
## 1 -0.03482076 -0.04340085 -0.002592262  0.019908421 -0.01764613
## 2 -0.01916334  0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262  0.002863771 -0.02593034
colnames(x2_my)  # 'x' is preprended 
##  [1] "xage" "xsex" "xbmi" "xmap" "xtc"  "xldl" "xhdl" "xtch" "xltg" "xglu"
# x2_my[,'xage']   # works
# x2_my$xage       # doesn't work. remember, x2_my is not a dataframe.

2) a little less simple;

# x2_my <- model.matrix(~ xage + xsex + xbmi, data=as.data.frame(diabetes$x))[,-1]

“Error in eval(expr, envir, enclos) : object ‘xage’ not found” ; what the … something weird is happening here. So, let’s check matrix -> data.frame conversion

str(as.data.frame(diabetes$x)) 
## 'data.frame':    442 obs. of  1 variable:
##  $ x: AsIs [1:442, 1:10] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr  "age" "sex" "bmi" "map" ...
# 'data.frame': 442 obs. of  1 variable:" ; It's got only one freaking variable.
as.data.frame(diabetes$x)[1:3,] 
##               age         sex         bmi          map           tc
## [1,]  0.038075906  0.05068012  0.06169621  0.021872355 -0.044223498
## [2,] -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## [3,]  0.085298906  0.05068012  0.04445121 -0.005670611 -0.045599451
##              ldl         hdl          tch          ltg         glu
## [1,] -0.03482076 -0.04340085 -0.002592262  0.019908421 -0.01764613
## [2,] -0.01916334  0.07441156 -0.039493383 -0.068329744 -0.09220405
## [3,] -0.03419447 -0.03235593 -0.002592262  0.002863771 -0.02593034

It prints out innocently. But internal data structure is not what I wanted. R is automatically pretty printing. pervert SOB. R이 나도 모르게 “예쁘게” 출력을 해줌. 내가 원하는 것 아닌데…

It’s cause ‘diabetes$x’ is ‘AsIs’ object. It wants to keep its original data structure and resists conversion to dataframe, and function formula notation comprehension.

  1. Remember, R has funny polymorphic idea and it’s got no privacy. I can mess up with with objects in anyway I want to.
methods(as.data.frame)
##  [1] as.data.frame.aovproj*        as.data.frame.array          
##  [3] as.data.frame.AsIs            as.data.frame.character      
##  [5] as.data.frame.complex         as.data.frame.data.frame     
##  [7] as.data.frame.Date            as.data.frame.default        
##  [9] as.data.frame.difftime        as.data.frame.factor         
## [11] as.data.frame.ftable*         as.data.frame.integer        
## [13] as.data.frame.list            as.data.frame.logical        
## [15] as.data.frame.logLik*         as.data.frame.matrix         
## [17] as.data.frame.model.matrix    as.data.frame.noquote        
## [19] as.data.frame.numeric         as.data.frame.numeric_version
## [21] as.data.frame.ordered         as.data.frame.POSIXct        
## [23] as.data.frame.POSIXlt         as.data.frame.raw            
## [25] as.data.frame.table           as.data.frame.ts             
## [27] as.data.frame.vector         
## see '?methods' for accessing help and source code

예상대로… “as.data.frame” generic 함수는 call 대상 객체가 ’AsIs’로 선언된 것이면 “as.data.frame.AsIs”로 부른다.

x_df = as.data.frame.matrix(diabetes$x) # use 'as.data.frame.matrix'
str(x_df)  # looks like plain vanilla data.frame
## 'data.frame':    442 obs. of  10 variables:
##  $ age: num  0.03808 -0.00188 0.0853 -0.08906 0.00538 ...
##  $ sex: num  0.0507 -0.0446 0.0507 -0.0446 -0.0446 ...
##  $ bmi: num  0.0617 -0.0515 0.0445 -0.0116 -0.0364 ...
##  $ map: num  0.02187 -0.02633 -0.00567 -0.03666 0.02187 ...
##  $ tc : num  -0.04422 -0.00845 -0.0456 0.01219 0.00393 ...
##  $ ldl: num  -0.0348 -0.0192 -0.0342 0.025 0.0156 ...
##  $ hdl: num  -0.0434 0.07441 -0.03236 -0.03604 0.00814 ...
##  $ tch: num  -0.00259 -0.03949 -0.00259 0.03431 -0.00259 ...
##  $ ltg: num  0.01991 -0.06833 0.00286 0.02269 -0.03199 ...
##  $ glu: num  -0.01765 -0.0922 -0.02593 -0.00936 -0.04664 ...
names(x_df)
##  [1] "age" "sex" "bmi" "map" "tc"  "ldl" "hdl" "tch" "ltg" "glu"
x_df$sex   # yup, this x_df is a real deal. We are out of that 'AsIs' curse
##   [1]  0.05068012 -0.04464164  0.05068012 -0.04464164 -0.04464164
##   [6] -0.04464164  0.05068012  0.05068012  0.05068012 -0.04464164
##  [11] -0.04464164  0.05068012 -0.04464164  0.05068012 -0.04464164
##  [16]  0.05068012 -0.04464164  0.05068012 -0.04464164 -0.04464164
##  [21] -0.04464164  0.05068012 -0.04464164  0.05068012 -0.04464164
##  [26]  0.05068012 -0.04464164 -0.04464164 -0.04464164  0.05068012
##  [31] -0.04464164 -0.04464164  0.05068012 -0.04464164 -0.04464164
##  [36]  0.05068012 -0.04464164 -0.04464164  0.05068012  0.05068012
##  [41]  0.05068012 -0.04464164  0.05068012 -0.04464164  0.05068012
##  [46]  0.05068012 -0.04464164 -0.04464164  0.05068012  0.05068012
##  [51] -0.04464164  0.05068012 -0.04464164 -0.04464164 -0.04464164
##  [56] -0.04464164 -0.04464164 -0.04464164 -0.04464164  0.05068012
##  [61] -0.04464164  0.05068012  0.05068012 -0.04464164  0.05068012
##  [66]  0.05068012  0.05068012  0.05068012  0.05068012 -0.04464164
##  [71] -0.04464164 -0.04464164  0.05068012  0.05068012  0.05068012
##  [76]  0.05068012  0.05068012 -0.04464164 -0.04464164 -0.04464164
##  [81] -0.04464164  0.05068012 -0.04464164 -0.04464164 -0.04464164
##  [86] -0.04464164  0.05068012 -0.04464164  0.05068012 -0.04464164
##  [91] -0.04464164 -0.04464164 -0.04464164 -0.04464164 -0.04464164
##  [96] -0.04464164  0.05068012 -0.04464164  0.05068012 -0.04464164
## [101] -0.04464164  0.05068012 -0.04464164  0.05068012 -0.04464164
snip......

** End of “AsIs” digression——————–

Use x_df to generate diabetes$x2

x2_my <- model.matrix(~.^2 + I(age^2) + I(bmi^2) + I(map^2) + I(tc^2) + I(ldl^2) + I(hdl^2) + I(tch^2) +
                                 I(ltg^2) + I(glu^2), data=x_df)[,-1] # get rid of 'intercept' column
str(x2_my)
##  num [1:442, 1:64] 0.03808 -0.00188 0.0853 -0.08906 0.00538 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:442] "1" "2" "3" "4" ...
##   ..$ : chr [1:64] "age" "sex" "bmi" "map" ...
str(diabetes$x2)
##  AsIs [1:442, 1:64] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
##  - attr(*, ".Names")= chr [1:28288] "age" "age" "age" "age" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:442] "1" "2" "3" "4" ...
##   ..$ : chr [1:64] "age" "sex" "bmi" "map" ...
x2_my[1:3,]
##            age         sex         bmi          map           tc
## 1  0.038075906  0.05068012  0.06169621  0.021872355 -0.044223498
## 2 -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## 3  0.085298906  0.05068012  0.04445121 -0.005670611 -0.045599451
##           ldl         hdl          tch          ltg         glu
## 1 -0.03482076 -0.04340085 -0.002592262  0.019908421 -0.01764613
## 2 -0.01916334  0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262  0.002863771 -0.02593034
##       I(age^2)    I(bmi^2)     I(map^2)      I(tc^2)     I(ldl^2)
## 1 1.449775e-03 0.003806422 4.783999e-04 1.955718e-03 0.0012124855
## 2 3.541986e-06 0.002649579 6.931549e-04 7.138094e-05 0.0003672336
## 3 7.275903e-03 0.001975910 3.215582e-05 2.079310e-03 0.0011692615
,,,,
##         map:hdl       map:tch       map:ltg       map:glu       tc:ldl
## 1 -0.0009492787 -5.669887e-05  4.354440e-04 -0.0003859623 0.0015398960
## 2 -0.0019590954  1.039775e-03  1.798974e-03  0.0024275330 0.0001619058
## 3  0.0001834779  1.469971e-05 -1.623933e-05  0.0001470409 0.0015592489
##          tc:hdl       tc:tch        tc:ltg       tc:glu      ldl:hdl
## 1  0.0019193372 0.0001146389 -0.0008804200 0.0007803734  0.001511251
## 2 -0.0006286828 0.0003336687  0.0005772992 0.0007790066 -0.001425974
## 3  0.0014754128 0.0001182057 -0.0001305864 0.0011824092  0.001106394
##        ldl:tch       ldl:ltg      ldl:glu       hdl:tch       hdl:ltg
## 1 9.026454e-05 -0.0006932264 0.0006144515  1.125064e-04 -8.640423e-04
## 2 7.568251e-04  0.0013094261 0.0017669375 -2.938764e-03 -5.084523e-03
## 3 8.864101e-05 -0.0000979251 0.0008866741  8.387505e-05 -9.265996e-05
##         hdl:glu       tch:ltg      tch:glu       ltg:glu
## 1  0.0007658568 -5.160784e-05 4.574338e-05 -3.513065e-04
## 2 -0.0068610475  2.698573e-03 3.641450e-03  6.300279e-03
## 3  0.0008390003 -7.423643e-06 6.721823e-05 -7.425854e-05
diabetes$x2[1:3,]
##            age         sex         bmi          map           tc
## 1  0.038075906  0.05068012  0.06169621  0.021872355 -0.044223498
## 2 -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## 3  0.085298906  0.05068012  0.04445121 -0.005670611 -0.045599451
##           ldl         hdl          tch          ltg         glu
## 1 -0.03482076 -0.04340085 -0.002592262  0.019908421 -0.01764613
## 2 -0.01916334  0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262  0.002863771 -0.02593034
##         age^2        bmi^2       map^2         tc^2       ldl^2
## 1 -0.01485516  0.022504574 -0.03104468 -0.004331120 -0.01373992
## 2 -0.04129154  0.005642773 -0.02730766 -0.030938902 -0.02480103
## 3  0.09164344 -0.004176421 -0.03880990 -0.002585937 -0.01430556
##          hdl^2        tch^2       ltg^2       glu^2      age:sex
## 1 -0.004631425 -0.030448463 -0.02881622 -0.02752556  0.032864976
## 2  0.040036524 -0.009485482  0.03716124  0.08802196 -0.006609993
## 3 -0.014861456 -0.030448463 -0.03480993 -0.02243261  0.084051745
##        age:bmi      age:map      age:tc      age:ldl       age:hdl
## 1  0.040571674  0.001660641 -0.04655325 -0.038244710 -0.0345115069
## 2 -0.006764804 -0.015934224 -0.01172880 -0.009655541  0.0006995477
## 3  0.070889129 -0.027912869 -0.09174427 -0.071641516 -0.0602920920
##        age:tch      age:ltg     age:glu    sex:bmi     sex:map
## 1 -0.012112261  0.003064892 -0.03027751 0.06210301  0.01228204
## 2 -0.008368996 -0.010201680 -0.01138014 0.04451819  0.01373927
## 3 -0.014760528 -0.007763517 -0.06469909 0.04356153 -0.01815796
,,,,,
## ldl:glu hdl:tch hdl:ltg hdl:glu tch:ltg ## 1 -0.0009221095 0.03349363 0.0008521487 0.03115026 -0.02819118 ## 2 0.0237834359 -0.02381466 -0.0945055990 -0.14037759 0.02529772 ## 3 0.0049134553 0.03295588 0.0182807986 0.03279524 -0.02733183 ## tch:glu ltg:glu ## 1 -0.01765816 -0.02779368 ## 2 0.05303354 0.10401328 ## 3 -0.01723596 -0.02230374
# x2_my <-> diabetes$x2 ; Numbers are different again. Getting into my nerves. Standardization?
colMeans(diabetes$x2)  # OK. Means are zero
##           age           sex           bmi           map            tc 
## -4.807774e-20  7.347064e-18  1.427492e-18 -4.155996e-18  3.376235e-18 
##           ldl           hdl           tch           ltg           glu 
##  2.087212e-18  2.392604e-18  2.628454e-18  1.007916e-18  2.880740e-18 
##         age^2         bmi^2         map^2          tc^2         ldl^2 
## -2.562053e-17 -2.487680e-17  2.863079e-18  5.103624e-17  1.603920e-17 
##         hdl^2         tch^2         ltg^2         glu^2       age:sex 
## -2.200538e-17  6.511517e-17 -1.233121e-17 -4.219067e-19 -2.473796e-18 
##       age:bmi       age:map        age:tc       age:ldl       age:hdl 
## -6.574386e-18  1.253455e-18 -1.434924e-17 -1.945799e-18 -4.604180e-19 
##       age:tch       age:ltg       age:glu       sex:bmi       sex:map 
##  1.544865e-18 -5.812010e-18 -3.373169e-18 -4.895467e-18  2.879759e-19 
##        sex:tc       sex:ldl       sex:hdl       sex:tch       sex:ltg 
## -1.410015e-18  6.082570e-18  1.594979e-17 -6.308045e-17 -4.605652e-18 
##       sex:glu       bmi:map        bmi:tc       bmi:ldl       bmi:hdl 
##  1.442823e-17 -3.066428e-18 -1.311173e-17  2.385490e-20  2.539584e-17 
##       bmi:tch       bmi:ltg       bmi:glu        map:tc       map:ldl 
## -8.461438e-19 -5.782820e-20 -9.609171e-18  1.023099e-17 -8.185481e-18 
##       map:hdl       map:tch       map:ltg       map:glu        tc:ldl 
##  3.437681e-18 -8.184500e-18  1.106279e-17 -1.515948e-17  1.793643e-17 
##        tc:hdl        tc:tch        tc:ltg        tc:glu       ldl:hdl 
## -1.677815e-18 -2.114418e-17 -1.042220e-17  4.673537e-18  2.068999e-18 
##       ldl:tch       ldl:ltg       ldl:glu       hdl:tch       hdl:ltg 
## -2.395670e-18  1.845505e-18 -1.498814e-17  6.730516e-18 -3.275664e-18 
##       hdl:glu       tch:ltg       tch:glu       ltg:glu 
## -6.698996e-19  1.297241e-18 -1.880796e-19 -2.796359e-19
apply(diabetes$x2, 2, sd)  # 이럴 줄 알았음.   
##        age        sex        bmi        map         tc        ldl 
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 
##        hdl        tch        ltg        glu      age^2      bmi^2 
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 
##      map^2       tc^2      ldl^2      hdl^2      tch^2      ltg^2 
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 
##      glu^2    age:sex    age:bmi    age:map     age:tc    age:ldl 
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 
##    age:hdl    age:tch    age:ltg    age:glu    sex:bmi    sex:map 
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 
##     sex:tc    sex:ldl    sex:hdl    sex:tch    sex:ltg    sex:glu 
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 
##    bmi:map     bmi:tc    bmi:ldl    bmi:hdl    bmi:tch    bmi:ltg 
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 
##    bmi:glu     map:tc    map:ldl    map:hdl    map:tch    map:ltg 
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 
##    map:glu     tc:ldl     tc:hdl     tc:tch     tc:ltg     tc:glu 
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 
##    ldl:hdl    ldl:tch    ldl:ltg    ldl:glu    hdl:tch    hdl:ltg 
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 
##    hdl:glu    tch:ltg    tch:glu    ltg:glu 
## 0.04761905 0.04761905 0.04761905 0.04761905
scale(x2_my[,11:64])[1:3,]*(0.0149/0.312)  # Test. It matches.
##      I(age^2)     I(bmi^2)    I(map^2)      I(tc^2)    I(ldl^2)
## 1 -0.01489801  0.022569491 -0.03113423 -0.004343613 -0.01377956
## 2 -0.04141065  0.005659051 -0.02738643 -0.031028148 -0.02487257
## 3  0.09190780 -0.004188469 -0.03892186 -0.002593396 -0.01434683
##       I(hdl^2)     I(tch^2)    I(ltg^2)    I(glu^2)     age:sex
## 1 -0.004644785 -0.030536295 -0.02889934 -0.02760496  0.03295978
## 2  0.040152014 -0.009512844  0.03726844  0.08827587 -0.00662906
## 3 -0.014904326 -0.030536295 -0.03491034 -0.02249732  0.08429420
##        age:bmi      age:map      age:tc      age:ldl       age:hdl
## 1  0.040688708  0.001665431 -0.04668754 -0.038355032 -0.0346110594
## 2 -0.006784318 -0.015980188 -0.01176263 -0.009683393  0.0007015656
## 3  0.071093617 -0.027993387 -0.09200892 -0.071848175 -0.0604660115
,,,,
##        tc:hdl      tc:tch      tc:ltg       tc:glu     ldl:hdl     ldl:tch
## 1  0.03993796 -0.01935874 -0.04301523 0.0009657478  0.04247711 -0.02210140
## 2 -0.01650978 -0.01554592 -0.01237870 0.0009353736 -0.02131774 -0.01159761
## 3  0.03010345 -0.01929665 -0.02725616 0.0099000824  0.03368385 -0.02212698
##       ldl:ltg       ldl:glu     hdl:tch       hdl:ltg     hdl:glu
## 1 -0.03121435 -0.0009247694  0.03359024  0.0008546068  0.03124011
## 2  0.01301080  0.0238520420 -0.02388336 -0.0947782113 -0.14078252
## 3 -0.01806814  0.0049276287  0.03305094  0.0183335317  0.03288985
##       tch:ltg     tch:glu     ltg:glu
## 1 -0.02827250 -0.01770909 -0.02787386
## 2  0.02537069  0.05318652  0.10431332
## 3 -0.02741067 -0.01728568 -0.02236807
x2_my[,11:64] = scale(x2_my[,11:64])*(0.0149/0.312) 

sqrt(sum(x2_my - diabetes$x2)^2)  # Two are the same. 데이터를 어떻게 만들었나 재현 끝.
## [1] 7.310626e-13
# Now we have it. x2 is generated from x1, it's quadratic term. and first order interaction.
### Do the Lasso, Lar, forward stagewise Regression as done in LARS manual
par(mfrow=c(2,2))
object <- lars(diabetes$x, diabetes$y)
plot(object)
object2 <- lars(diabetes$x, diabetes$y, type='lar')
plot(object2)
object3 <- lars(diabetes$x, diabetes$y, type='for')  # Can use abbreviations
plot(object3)
# 3개 비슷하다.  
# 특히, 논문에 있는 것 같이 Lasso와 LAR는 이 데이터에 대해 거의 같은 결과

Let’s go deeper

Lasso와 같은 regularized regression은 변수가 많을 때 변수 selection 하면서 자동적으로 괜찮은 모델을 만들고자 하려고 한 것. 위 예는 변수 수가 작다. diabetes$x2 의 64개도 작다. 더 가보자. 새로운 predictor x3를 x_df를 이용해 만들어 이것으로 Lasso에 적용하자.

왠지 bmi(body mass index)가 앞의 1’st order interaction과 또 interaction 할지도 모른다는 생각이 들어;

x3 <- model.matrix(~ bmi*(.^2) + I(age^2) + I(bmi^2) + I(map^2) + I(tc^2) + I(ldl^2) + I(hdl^2) + I(tch^2) +
                        I(ltg^2) + I(glu^2), data=x_df)[,-1] # get rid of 'intercept' column

str(x3)
##  num [1:442, 1:100] 0.0617 -0.0515 0.0445 -0.0116 -0.0364 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:442] "1" "2" "3" "4" ...
##   ..$ : chr [1:100] "bmi" "age" "sex" "map" ...
x3[1:3,]
##           bmi          age         sex          map           tc
## 1  0.06169621  0.038075906  0.05068012  0.021872355 -0.044223498
## 2 -0.05147406 -0.001882017 -0.04464164 -0.026327835 -0.008448724
## 3  0.04445121  0.085298906  0.05068012 -0.005670611 -0.045599451
##           ldl         hdl          tch          ltg         glu
## 1 -0.03482076 -0.04340085 -0.002592262  0.019908421 -0.01764613
## 2 -0.01916334  0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262  0.002863771 -0.02593034
##       I(age^2)    I(bmi^2)     I(map^2)      I(tc^2)     I(ldl^2)
## 1 1.449775e-03 0.003806422 4.783999e-04 1.955718e-03 0.0012124855
## 2 3.541986e-06 0.002649579 6.931549e-04 7.138094e-05 0.0003672336
## 3 7.275903e-03 0.001975910 3.215582e-05 2.079310e-03 0.0011692615
##      I(hdl^2)     I(tch^2)     I(ltg^2)     I(glu^2)      age:sex
## 1 0.001883633 6.719822e-06 3.963452e-04 0.0003113857 0.0019296915
## 2 0.005537081 1.559727e-03 4.668954e-03 0.0085015868 0.0000840163
## 3 0.001046906 6.719822e-06 8.201182e-06 0.0006723825 0.0043229587

......
##    bmi:ldl:hdl   bmi:ldl:tch   bmi:ldl:ltg   bmi:ldl:glu  bmi:hdl:tch
## 1 9.323843e-05  5.568980e-06 -4.276944e-05  3.790933e-05 6.941216e-06
## 2 7.340068e-05 -3.895686e-05 -6.740148e-05 -9.095145e-05 1.512701e-04
## 3 4.918055e-05  3.940201e-06 -4.352890e-06  3.941374e-05 3.728348e-06
##     bmi:hdl:ltg  bmi:hdl:glu   bmi:tch:ltg   bmi:tch:glu   bmi:ltg:glu
## 1 -5.330813e-05 4.725046e-05 -3.184008e-06  2.822193e-06 -2.167428e-05
## 2  2.617211e-04 3.531660e-04 -1.389065e-04 -1.874402e-04 -3.243010e-04
## 3 -4.118848e-06 3.729458e-05 -3.299900e-07  2.987932e-06 -3.300882e-06
diabetes$x2[1:3,]
##            age         sex         bmi          map           tc
## 1  0.038075906  0.05068012  0.06169621  0.021872355 -0.044223498
## 2 -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## 3  0.085298906  0.05068012  0.04445121 -0.005670611 -0.045599451
##           ldl         hdl          tch          ltg         glu
## 1 -0.03482076 -0.04340085 -0.002592262  0.019908421 -0.01764613
## 2 -0.01916334  0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262  0.002863771 -0.02593034
##         age^2        bmi^2       map^2         tc^2       ldl^2
## 1 -0.01485516  0.022504574 -0.03104468 -0.004331120 -0.01373992
## 2 -0.04129154  0.005642773 -0.02730766 -0.030938902 -0.02480103
## 3  0.09164344 -0.004176421 -0.03880990 -0.002585937 -0.01430556
##          hdl^2        tch^2       ltg^2       glu^2      age:sex
## 1 -0.004631425 -0.030448463 -0.02881622 -0.02752556  0.032864976
## 2  0.040036524 -0.009485482  0.03716124  0.08802196 -0.006609993
## 3 -0.014861456 -0.030448463 -0.03480993 -0.02243261  0.084051745

,,,,,, ## map:hdl map:tch map:ltg map:glu tc:ldl tc:hdl ## 1 -0.01120743 -0.01382511 -0.01019384 -0.02577846 -0.006868965 0.03982309 ## 2 -0.03197943 0.00987452 0.02036965 0.03136196 -0.026235312 -0.01646229 ## 3 0.01209345 -0.01228188 -0.02031831 -0.01495348 -0.006596978 0.03001686 ## tc:tch tc:ltg tc:glu ldl:hdl ldl:tch ldl:ltg ## 1 -0.01930305 -0.04289151 0.0009629700 0.04235493 -0.02203783 -0.03112456 ## 2 -0.01550120 -0.01234310 0.0009326831 -0.02125642 -0.01156425 0.01297338 ## 3 -0.01924114 -0.02717776 0.0098716066 0.03358696 -0.02206334 -0.01801617 ## ldl:glu hdl:tch hdl:ltg hdl:glu tch:ltg ## 1 -0.0009221095 0.03349363 0.0008521487 0.03115026 -0.02819118 ## 2 0.0237834359 -0.02381466 -0.0945055990 -0.14037759 0.02529772 ## 3 0.0049134553 0.03295588 0.0182807986 0.03279524 -0.02733183 ## tch:glu ltg:glu ## 1 -0.01765816 -0.02779368 ## 2 0.05303354 0.10401328 ## 3 -0.01723596 -0.02230374
x3[,11:100] = scale(x3[,11:100])*(0.0149/0.312)   # 데이타 준비 끝.

So, let’s do Lasso on 100 predictors

# x3_lasso <- lars(x=x3, y=diabetes$y, type="lasso", trace=TRUE) # set trace=TRUE to see the lasso path
x3_lasso <- lars(x=x3, y=diabetes$y, type="lasso", trace=F) # Print out is too long
# 1, 9, 4, 7 번째 변수부터 포함시켜나감. 37번째 선택된 16번 변수 (hdl^2)가 48번째 step에서 제외됨.
# 대충 1/3 이상의 변수가 떨어져 나감.

par(mfrow=c(1,1))
plot(x3_lasso)  # can't see nothing.

x3_lasso
## 
## Call:
## lars(x = x3, y = diabetes$y, type = "lasso", trace = F)
## R-squared: 0.626 
## Sequence of LASSO moves:
##      bmi ltg map hdl bmi:map age:sex I(glu^2) I(bmi^2) age:map age:glu sex
## Var    1   9   4   7      37      20       19       12      22      28   3
## Step   1   2   3   4       5       6        7        8       9      10  11
##      bmi:tc:hdl bmi:hdl:glu glu age:ltg bmi:tc:tch I(age^2) bmi:age:hdl
## Var          87          97  10      27         88       11          69
## Step         12          13  14      15         16       17          18
##      sex:map bmi:age:sex map:hdl age:ldl sex:hdl tch:glu bmi:ldl:ltg
## Var       30          65      46      24      33      63          93
## Step      19          20      21      22      23      24          25
##      ldl:glu bmi:sex sex:tch bmi:age:ltg I(ltg^2) ldl:ltg tc:hdl
## Var       58      29      34          71       18      57     51
## Step      26      27      28          29       30      31     32
##      bmi:tc:ltg bmi:hdl:ltg hdl:tch hdl:ltg I(hdl^2) bmi:map:tch bmi:ltg
## Var          89          96      59      60       16          83      42
## Step         33          34      35      36       37          38      39
##      bmi:map:tc age age:hdl bmi:age:ldl tch:ltg bmi:age:glu map:glu
## Var          80   2      25          68      62          72      49
## Step         40  41      42          43      44          45      46
##      bmi:sex:map I(hdl^2) tc bmi:ldl:hdl I(ltg^2) bmi:tc:hdl bmi:sex:ldl
## Var           73      -16  5          91      -18        -87          75
## Step          47       48 49          50       51         52          53
##      bmi:ldl:tch I(tch^2) bmi:ldl:hdl I(bmi^2) bmi:sex:hdl sex:ldl hdl:ltg
## Var           92       17         -91      -12          76      32     -60
## Step          54       55          56       57          58      59      60
##      bmi:ldl tch tc:glu tc:ltg bmi:map:hdl hdl:glu map:tch tc:tch sex:glu
## Var       39   8     54     53          82      61      47     52      36
## Step      61  62     63     64          65      66      67     68      69

,,, snip

x3_lasso_coef_mat = coef(x3_lasso, s=c(.25,.50,0.75,1.00), mode="fraction")
# x3_lasso_coef_mat 분석:
# s = 1, 즉 no penalty (OLS) 시 : 
sum(x3_lasso_coef_mat[4,] == 0)  # all the variables are included, as expected.
## [1] 0
sum(abs(x3_lasso_coef_mat[4,]))  # 113971, this is the total sum of beta's absolute values in OLS
## [1] 113971.2
# penalty 즉 shrinkage를 안 줄 때 beta 들이 갖는 총합 

# s = 0.75 :  Give some penalty
sum(x3_lasso_coef_mat[3,] == 0)  # 2 variables got sacked.
## [1] 2
sum(abs(x3_lasso_coef_mat[3,]))  # 85413. 예상한 값. note '113971*0.75 = 85413'
## [1] 85413.39
# s = 0.25 :
sum(x3_lasso_coef_mat[1,] == 0)  # 5 variables got sacked at this stage
## [1] 5
sum(abs(x3_lasso_coef_mat[1,]))  # 28477
## [1] 28477.45

어떤 penalty 값에서 가장 좋은 결과를 보이는가 ========

Use 7-fold CV

?cv.lars
# CV_x3_lasso = cv.lars(x=x3, y=diabetes$y, K=7, index=seq(from = 0, to = 1, length=200), 
#                      type="lasso", trace=TRUE, mode="fraction")  # 좋다.  
CV_x3_lasso = cv.lars(x=x3, y=diabetes$y, K=7, index=seq(from = 0, to = 1, length=200), 
                      type="lasso", trace=FALSE, mode="fraction")  # 좋다.  

결과 분석

s가 거의 0에 가까운 곳에서 최고의 성능: 역시나 442개 밖에 되지 않는 data point들을 갖고, 억지로 100개 변수 모델을 만들면 이런 결과 어쨌든, lasso is cool.

Then at which ‘s’, does the lasso shows best result?? Easy, I don’t exactly have to do predictions for this.
Since the there are 200 indices between 0 to 1, each is 0.005 플롯을 보니 5번째 tick에서 최적. 따라서, s = 0.005*5 = 0.025에서 모델을 만들 것.

prediction with lars lasso model.

다른 경우와 같이 lars 패키지도 predict함수를 제공 (predict.lars). predict()가 generic이므로 모델이 lars lasso 타입이면 자동적으로 predict.lars()가 적용됨.

predict_to_self = predict(x3_lasso, x3, s=0.025, type="fit", mode="fraction") # It's not out-of-sample test
str(predict_to_self)  # it's got fit 
## List of 4
##  $ s       : num 0.025
##  $ fraction: num 0.025
##  $ mode    : chr "fraction"
##  $ fit     : Named num [1:442] 206.5 73.8 190.2 179.1 117.5 ...
##   ..- attr(*, "names")= chr [1:442] "1" "2" "3" "4" ...
MSE_lasso_0.025 = mean((predict_to_self$fit - diabetes$y)^2) # s=0.025시 MSE
MSE_lasso_0.025
## [1] 2580.15

**2580 Plot과 맞음. plot의 값보다 좀 더 작아야 함. since this is not a out-of-sample test.

Lastly, just for fun

Lasso with s=1 is simply OLS. So let’s check it with OLS regression result.

predict_to_self_s_1.0 = predict(x3_lasso, x3, s=1.0, type="fit", mode="fraction")
LM_OLS = lm(diabetes$y ~ x3)
predict_LM_OLS = predict(LM_OLS)

# Lasso at s=1의 결과와 OLS 결과가 같아야 함. Let's check.
mean((predict_to_self_s_1.0$fit - predict_LM_OLS)^2) 
## [1] 2.916263e-18
# 2.92e-18  ; 0 와 마찬가지.  둘은 같음. 

Done