Hyun Bong Lee
October 5, 2015
LASSO and LAR examples with the LARS package, by HBLEE
- Focuses on the LASSO (L1 norm: sum of absolute values; penalty/constraint-based variable selection), one of the shrinkage methods for regression covered in HTF's "The Elements of Statistical Learning".
- Ridge regression is another regularized regression, but the LASSO seemed the more useful of the two.
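Since everything below hinges on that L1 constraint, it is worth writing out. In ESL's notation the lasso solves

```latex
\hat{\beta}^{\text{lasso}}
  = \underset{\beta}{\operatorname{arg\,min}}
    \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
  \quad \text{subject to} \quad
  \sum_{j=1}^{p}\lvert\beta_j\rvert \le t ,
```

or equivalently the penalized form with a Lagrange multiplier on the L1 term. The budget t is what the 's' parameter of the lars package indexes (as a fraction of the unconstrained L1 norm).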
References:
- CRAN LARS package manual: https://cran.r-project.org/web/packages/lars/lars.pdf
- LAR paper: http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf
Data: the 'diabetes' data bundled with the LARS package, also used in the LAR paper.
*Dusting off an old exercise and posting it.
library(lars)
## Warning: package 'lars' was built under R version 3.2.2
## Loaded lars 1.2
data(diabetes)
str(diabetes)
## 'data.frame': 442 obs. of 3 variables:
## $ x : AsIs [1:442, 1:10] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr "age" "sex" "bmi" "map" ...
## $ y : num 151 75 141 206 135 97 138 63 110 310 ...
## $ x2: AsIs [1:442, 1:64] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
## ..- attr(*, ".Names")= chr "age" "age" "age" "age" ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr "1" "2" "3" "4" ...
## .. ..$ : chr "age" "sex" "bmi" "map" ...
Note the structure of "diabetes": the predictors x and x2 are matrices.
That is, the input variables must come as a matrix. Either x or x2 can be used as the predictor; x2 is a superset of x containing the squares and interactions of x's 10 variables.
The response is the 'y' variable.
Important: x and x2 are declared 'AsIs'. Be careful when converting them with data.frame() or using them inside a formula. See ?AsIs.
What the data mean:
Predictors (x): 10 diabetes-related measurements collected from each of 442 patients.
Response (y): each patient's disease status one year after the x measurements.
Model: regression-type supervised learning; use x to predict the patients' prognosis one year out.
Let's take a look at each variable, x first.
str(diabetes$x) # named numeric matrix of 442x10. x is a matrix. No categorical data
## AsIs [1:442, 1:10] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:10] "age" "sex" "bmi" "map" ...
colnames(diabetes$x) # "age" "sex" "bmi" "map" "tc" "ldl" "hdl" "tch" "ltg" "glu"
## [1] "age" "sex" "bmi" "map" "tc" "ldl" "hdl" "tch" "ltg" "glu"
sum(is.na(diabetes$x)) # No missing values
## [1] 0
diabetes$x[1:7,]
## age sex bmi map tc
## [1,] 0.038075906 0.05068012 0.06169621 0.021872355 -0.044223498
## [2,] -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## [3,] 0.085298906 0.05068012 0.04445121 -0.005670611 -0.045599451
## [4,] -0.089062939 -0.04464164 -0.01159501 -0.036656447 0.012190569
## [5,] 0.005383060 -0.04464164 -0.03638469 0.021872355 0.003934852
## [6,] -0.092695478 -0.04464164 -0.04069594 -0.019442093 -0.068990650
## [7,] -0.045472478 0.05068012 -0.04716281 -0.015999223 -0.040095640
## ldl hdl tch ltg glu
## [1,] -0.03482076 -0.043400846 -0.002592262 0.019908421 -0.017646125
## [2,] -0.01916334 0.074411564 -0.039493383 -0.068329744 -0.092204050
## [3,] -0.03419447 -0.032355932 -0.002592262 0.002863771 -0.025930339
## [4,] 0.02499059 -0.036037570 0.034308859 0.022692023 -0.009361911
## [5,] 0.01559614 0.008142084 -0.002592262 -0.031991445 -0.046640874
## [6,] -0.07928784 0.041276824 -0.076394504 -0.041180385 -0.096346157
## [7,] -0.02480001 0.000778808 -0.039493383 -0.062912950 -0.038356660
# These values differ from Table 1 of the LARS paper. So this diabetes data has been preprocessed.
unique(diabetes$x[,2]) # just two values. So one must be male, the other female.
## [1] 0.05068012 -0.04464164
??? But sex, at least, should be categorical, while everything in x is numeric. So sex has presumably already been recoded as an indicator variable; the odd part is that the values are not [0|1].
Without the original data it is hard to tell what preprocessing was applied. But since the lasso is sensitive to the scale of the predictors, each variable may well have been standardized in advance.
colMeans(diabetes$x) # OK. Means are zero
## age sex bmi map tc
## -3.613464e-16 9.824343e-17 -7.990599e-16 1.356826e-16 -9.239659e-17
## ldl hdl tch ltg glu
## 1.285989e-16 -4.504396e-16 3.861403e-16 -3.836857e-16 -3.423209e-16
apply(diabetes$x, 2, sd)
## age sex bmi map tc ldl
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## hdl tch ltg glu
## 0.04761905 0.04761905 0.04761905 0.04761905
# what the... why standardize to sd = 0.0476, of all values?
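In hindsight there is a plausible explanation: 0.04761905 = 1/21 = 1/sqrt(442-1), which is exactly the sample sd of a zero-mean column rescaled to unit L2 norm, i.e. the standardization the LARS paper uses. A quick check (a sketch, assuming the lars diabetes data is loaded as above):

```r
# the sd of a centered column with unit L2 norm is sqrt(1/(n-1))
1 / sqrt(nrow(diabetes$x) - 1)          # 0.04761905, the sd observed above
# so every column of x should then have squared L2 norm (close to) 1
range(colSums(unclass(diabetes$x)^2))
```

If both come out as expected, each column of x was centered and scaled to unit norm, which produces exactly the sd seen above.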
Now let's look at the response 'y'.
diabetes$y # matches with the values in the paper
## [1] 151 75 141 206 135 97 138 63 110 310 101 69 179 185 118 171 166
## [18] 144 97 168 68 49 68 245 184 202 137 85 131 283 129 59 341 87
## [35] 65 102 265 276 252 90 100 55 61 92 259 53 190 142 75 142 155
## [52] 225 59 104 182 128 52 37 170 170 61 144 52 128 71 163 150 97
## [69] 160 178 48 270 202 111 85 42 170 200 252 113 143 51 52 210 65
## [86] 141 55 134 42 111 98 164 48 96 90 162 150 279 92 83 128 102
## [103] 302 198 95 53 134 144 232 81 104 59 246 297 258 229 275 281 179
## [120] 200 200 173 180 84 121 161 99 109 115 268 274 158 107 83 103 272
## [137] 85 280 336 281 118 317 235 60 174 259 178 128 96 126 288 88 292
## [154] 71 197 186 25 84 96 195 53 217 172 131 214 59 70 220 268 152
## [171] 47 74 295 101 151 127 237 225 81 151 107 64 138 185 265 101 137
## [188] 143 141 79 292 178 91 116 86 122 72 129 142 90 158 39 196 222
## [205] 277 99 196 202 155 77 191 70 73 49 65 263 248 296 214 185 78
## [222] 93 252 150 77 208 77 108 160 53 220 154 259 90 246 124 67 72
## [239] 257 262 275 177 71 47 187 125 78 51 258 215 303 243 91 150 310
## [256] 153 346 63 89 50 39 103 308 116 145 74 45 115 264 87 202 127
## [273] 182 241 66 94 283 64 102 200 265 94 230 181 156 233 60 219 80
## [290] 68 332 248 84 200 55 85 89 31 129 83 275 65 198 236 253 124
## [307] 44 172 114 142 109 180 144 163 147 97 220 190 109 191 122 230 242
## [324] 248 249 192 131 237 78 135 244 199 270 164 72 96 306 91 214 95
## [341] 216 263 178 113 200 139 139 88 148 88 243 71 77 109 272 60 54
## [358] 221 90 311 281 182 321 58 262 206 233 242 123 167 63 197 71 168
## [375] 140 217 121 235 245 40 52 104 132 88 69 219 72 201 110 51 277
## [392] 63 118 69 273 258 43 198 242 232 175 93 168 275 293 281 72 140
## [409] 189 181 209 136 261 113 131 174 257 55 84 42 146 212 233 91 111
## [426] 152 120 67 310 94 183 66 173 72 49 64 48 178 104 132 220 57
plot(table(diabetes$y))
Let's look into the x2 variable.
str(diabetes$x2) # 442x64 named numeric matrix. 64 predictors. Let's see those vars.
## AsIs [1:442, 1:64] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
## - attr(*, ".Names")= chr [1:28288] "age" "age" "age" "age" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:442] "1" "2" "3" "4" ...
## ..$ : chr [1:64] "age" "sex" "bmi" "map" ...
colnames(diabetes$x2)
## [1] "age" "sex" "bmi" "map" "tc" "ldl" "hdl"
## [8] "tch" "ltg" "glu" "age^2" "bmi^2" "map^2" "tc^2"
## [15] "ldl^2" "hdl^2" "tch^2" "ltg^2" "glu^2" "age:sex" "age:bmi"
## [22] "age:map" "age:tc" "age:ldl" "age:hdl" "age:tch" "age:ltg" "age:glu"
## [29] "sex:bmi" "sex:map" "sex:tc" "sex:ldl" "sex:hdl" "sex:tch" "sex:ltg"
## [36] "sex:glu" "bmi:map" "bmi:tc" "bmi:ldl" "bmi:hdl" "bmi:tch" "bmi:ltg"
## [43] "bmi:glu" "map:tc" "map:ldl" "map:hdl" "map:tch" "map:ltg" "map:glu"
## [50] "tc:ldl" "tc:hdl" "tc:tch" "tc:ltg" "tc:glu" "ldl:hdl" "ldl:tch"
## [57] "ldl:ltg" "ldl:glu" "hdl:tch" "hdl:ltg" "hdl:glu" "tch:ltg" "tch:glu"
## [64] "ltg:glu"
# In addition to the 10 predictors from x, there are squared terms and interaction terms.
sum(is.na(diabetes$x2)) # No NAs here either.
## [1] 0
sum(diabetes$x2[,1:10] != diabetes$x) # as expected, the first 10 measurements match x: no elements differ
## [1] 0
Let's reproduce how x2 was built from x.
** Note that diabetes$x and diabetes$x2 are annotated 'AsIs'. This is a funny crutch R uses to enforce (partial) immutability on objects that shouldn't be changed. But in R, every object's innards are visible if you try hard enough.
Some nasty reminders when dealing with "AsIs" objects.
# 1) Simple case;
?model.matrix # takes data.frame object as data argument.
## starting httpd help server ... done
x2_my <- model.matrix(~., data=as.data.frame(diabetes$x))[,-1] # get rid of 'intercept' column
str(x2_my) # note that 'model.matrix' returns matrix. x2_my is a named matrix
## num [1:442, 1:10] 0.03808 -0.00188 0.0853 -0.08906 0.00538 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:442] "1" "2" "3" "4" ...
## ..$ : chr [1:10] "xage" "xsex" "xbmi" "xmap" ...
x2_my[1:3,]
## xage xsex xbmi xmap xtc
## 1 0.038075906 0.05068012 0.06169621 0.021872355 -0.044223498
## 2 -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## 3 0.085298906 0.05068012 0.04445121 -0.005670611 -0.045599451
## xldl xhdl xtch xltg xglu
## 1 -0.03482076 -0.04340085 -0.002592262 0.019908421 -0.01764613
## 2 -0.01916334 0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262 0.002863771 -0.02593034
colnames(x2_my) # 'x' is prepended
## [1] "xage" "xsex" "xbmi" "xmap" "xtc" "xldl" "xhdl" "xtch" "xltg" "xglu"
# x2_my[,'xage'] # works
# x2_my$xage # doesn't work. remember, x2_my is not a dataframe.
# 2) A slightly less simple case;
# x2_my <- model.matrix(~ xage + xsex + xbmi, data=as.data.frame(diabetes$x))[,-1]
"Error in eval(expr, envir, enclos) : object 'xage' not found" ; what the... something weird is happening here. So let's check the matrix -> data.frame conversion.
str(as.data.frame(diabetes$x))
## 'data.frame': 442 obs. of 1 variable:
## $ x: AsIs [1:442, 1:10] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr "age" "sex" "bmi" "map" ...
# "'data.frame': 442 obs. of 1 variable" ; it's got only one freaking variable.
as.data.frame(diabetes$x)[1:3,]
## age sex bmi map tc
## [1,] 0.038075906 0.05068012 0.06169621 0.021872355 -0.044223498
## [2,] -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## [3,] 0.085298906 0.05068012 0.04445121 -0.005670611 -0.045599451
## ldl hdl tch ltg glu
## [1,] -0.03482076 -0.04340085 -0.002592262 0.019908421 -0.01764613
## [2,] -0.01916334 0.07441156 -0.039493383 -0.068329744 -0.09220405
## [3,] -0.03419447 -0.03235593 -0.002592262 0.002863771 -0.02593034
It prints out innocently, but the internal data structure is not what I wanted. R is quietly "pretty printing" it for me, and that is not what I asked for...
That's because 'diabetes$x' is an 'AsIs' object: it wants to keep its original data structure and resists both conversion to a data.frame and comprehension inside a function's formula notation.
- Remember, R has a funny notion of polymorphism and no privacy. I can mess with objects any way I want.
methods(as.data.frame)
## [1] as.data.frame.aovproj* as.data.frame.array
## [3] as.data.frame.AsIs as.data.frame.character
## [5] as.data.frame.complex as.data.frame.data.frame
## [7] as.data.frame.Date as.data.frame.default
## [9] as.data.frame.difftime as.data.frame.factor
## [11] as.data.frame.ftable* as.data.frame.integer
## [13] as.data.frame.list as.data.frame.logical
## [15] as.data.frame.logLik* as.data.frame.matrix
## [17] as.data.frame.model.matrix as.data.frame.noquote
## [19] as.data.frame.numeric as.data.frame.numeric_version
## [21] as.data.frame.ordered as.data.frame.POSIXct
## [23] as.data.frame.POSIXlt as.data.frame.raw
## [25] as.data.frame.table as.data.frame.ts
## [27] as.data.frame.vector
## see '?methods' for accessing help and source code
As expected: the "as.data.frame" generic dispatches to "as.data.frame.AsIs" when the object it is called on is declared 'AsIs'.
x_df = as.data.frame.matrix(diabetes$x) # use 'as.data.frame.matrix'
str(x_df) # looks like plain vanilla data.frame
## 'data.frame': 442 obs. of 10 variables:
## $ age: num 0.03808 -0.00188 0.0853 -0.08906 0.00538 ...
## $ sex: num 0.0507 -0.0446 0.0507 -0.0446 -0.0446 ...
## $ bmi: num 0.0617 -0.0515 0.0445 -0.0116 -0.0364 ...
## $ map: num 0.02187 -0.02633 -0.00567 -0.03666 0.02187 ...
## $ tc : num -0.04422 -0.00845 -0.0456 0.01219 0.00393 ...
## $ ldl: num -0.0348 -0.0192 -0.0342 0.025 0.0156 ...
## $ hdl: num -0.0434 0.07441 -0.03236 -0.03604 0.00814 ...
## $ tch: num -0.00259 -0.03949 -0.00259 0.03431 -0.00259 ...
## $ ltg: num 0.01991 -0.06833 0.00286 0.02269 -0.03199 ...
## $ glu: num -0.01765 -0.0922 -0.02593 -0.00936 -0.04664 ...
names(x_df)
## [1] "age" "sex" "bmi" "map" "tc" "ldl" "hdl" "tch" "ltg" "glu"
x_df$sex # yup, this x_df is the real deal. We are out of that 'AsIs' curse
## [1] 0.05068012 -0.04464164 0.05068012 -0.04464164 -0.04464164
## [6] -0.04464164 0.05068012 0.05068012 0.05068012 -0.04464164
## [11] -0.04464164 0.05068012 -0.04464164 0.05068012 -0.04464164
## [16] 0.05068012 -0.04464164 0.05068012 -0.04464164 -0.04464164
## [21] -0.04464164 0.05068012 -0.04464164 0.05068012 -0.04464164
## [26] 0.05068012 -0.04464164 -0.04464164 -0.04464164 0.05068012
## [31] -0.04464164 -0.04464164 0.05068012 -0.04464164 -0.04464164
## [36] 0.05068012 -0.04464164 -0.04464164 0.05068012 0.05068012
## [41] 0.05068012 -0.04464164 0.05068012 -0.04464164 0.05068012
## [46] 0.05068012 -0.04464164 -0.04464164 0.05068012 0.05068012
## [51] -0.04464164 0.05068012 -0.04464164 -0.04464164 -0.04464164
## [56] -0.04464164 -0.04464164 -0.04464164 -0.04464164 0.05068012
## [61] -0.04464164 0.05068012 0.05068012 -0.04464164 0.05068012
## [66] 0.05068012 0.05068012 0.05068012 0.05068012 -0.04464164
## [71] -0.04464164 -0.04464164 0.05068012 0.05068012 0.05068012
## [76] 0.05068012 0.05068012 -0.04464164 -0.04464164 -0.04464164
## [81] -0.04464164 0.05068012 -0.04464164 -0.04464164 -0.04464164
## [86] -0.04464164 0.05068012 -0.04464164 0.05068012 -0.04464164
## [91] -0.04464164 -0.04464164 -0.04464164 -0.04464164 -0.04464164
## [96] -0.04464164 0.05068012 -0.04464164 0.05068012 -0.04464164
## [101] -0.04464164 0.05068012 -0.04464164 0.05068012 -0.04464164
snip......
** End of "AsIs" digression --------------------
Use x_df to generate diabetes$x2
x2_my <- model.matrix(~.^2 + I(age^2) + I(bmi^2) + I(map^2) + I(tc^2) + I(ldl^2) + I(hdl^2) + I(tch^2) +
I(ltg^2) + I(glu^2), data=x_df)[,-1] # get rid of 'intercept' column
str(x2_my)
## num [1:442, 1:64] 0.03808 -0.00188 0.0853 -0.08906 0.00538 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:442] "1" "2" "3" "4" ...
## ..$ : chr [1:64] "age" "sex" "bmi" "map" ...
str(diabetes$x2)
## AsIs [1:442, 1:64] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
## - attr(*, ".Names")= chr [1:28288] "age" "age" "age" "age" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:442] "1" "2" "3" "4" ...
## ..$ : chr [1:64] "age" "sex" "bmi" "map" ...
x2_my[1:3,]
## age sex bmi map tc
## 1 0.038075906 0.05068012 0.06169621 0.021872355 -0.044223498
## 2 -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## 3 0.085298906 0.05068012 0.04445121 -0.005670611 -0.045599451
## ldl hdl tch ltg glu
## 1 -0.03482076 -0.04340085 -0.002592262 0.019908421 -0.01764613
## 2 -0.01916334 0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262 0.002863771 -0.02593034
## I(age^2) I(bmi^2) I(map^2) I(tc^2) I(ldl^2)
## 1 1.449775e-03 0.003806422 4.783999e-04 1.955718e-03 0.0012124855
## 2 3.541986e-06 0.002649579 6.931549e-04 7.138094e-05 0.0003672336
## 3 7.275903e-03 0.001975910 3.215582e-05 2.079310e-03 0.0011692615
,,,,
## map:hdl map:tch map:ltg map:glu tc:ldl
## 1 -0.0009492787 -5.669887e-05 4.354440e-04 -0.0003859623 0.0015398960
## 2 -0.0019590954 1.039775e-03 1.798974e-03 0.0024275330 0.0001619058
## 3 0.0001834779 1.469971e-05 -1.623933e-05 0.0001470409 0.0015592489
## tc:hdl tc:tch tc:ltg tc:glu ldl:hdl
## 1 0.0019193372 0.0001146389 -0.0008804200 0.0007803734 0.001511251
## 2 -0.0006286828 0.0003336687 0.0005772992 0.0007790066 -0.001425974
## 3 0.0014754128 0.0001182057 -0.0001305864 0.0011824092 0.001106394
## ldl:tch ldl:ltg ldl:glu hdl:tch hdl:ltg
## 1 9.026454e-05 -0.0006932264 0.0006144515 1.125064e-04 -8.640423e-04
## 2 7.568251e-04 0.0013094261 0.0017669375 -2.938764e-03 -5.084523e-03
## 3 8.864101e-05 -0.0000979251 0.0008866741 8.387505e-05 -9.265996e-05
## hdl:glu tch:ltg tch:glu ltg:glu
## 1 0.0007658568 -5.160784e-05 4.574338e-05 -3.513065e-04
## 2 -0.0068610475 2.698573e-03 3.641450e-03 6.300279e-03
## 3 0.0008390003 -7.423643e-06 6.721823e-05 -7.425854e-05
diabetes$x2[1:3,]
## age sex bmi map tc
## 1 0.038075906 0.05068012 0.06169621 0.021872355 -0.044223498
## 2 -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## 3 0.085298906 0.05068012 0.04445121 -0.005670611 -0.045599451
## ldl hdl tch ltg glu
## 1 -0.03482076 -0.04340085 -0.002592262 0.019908421 -0.01764613
## 2 -0.01916334 0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262 0.002863771 -0.02593034
## age^2 bmi^2 map^2 tc^2 ldl^2
## 1 -0.01485516 0.022504574 -0.03104468 -0.004331120 -0.01373992
## 2 -0.04129154 0.005642773 -0.02730766 -0.030938902 -0.02480103
## 3 0.09164344 -0.004176421 -0.03880990 -0.002585937 -0.01430556
## hdl^2 tch^2 ltg^2 glu^2 age:sex
## 1 -0.004631425 -0.030448463 -0.02881622 -0.02752556 0.032864976
## 2 0.040036524 -0.009485482 0.03716124 0.08802196 -0.006609993
## 3 -0.014861456 -0.030448463 -0.03480993 -0.02243261 0.084051745
## age:bmi age:map age:tc age:ldl age:hdl
## 1 0.040571674 0.001660641 -0.04655325 -0.038244710 -0.0345115069
## 2 -0.006764804 -0.015934224 -0.01172880 -0.009655541 0.0006995477
## 3 0.070889129 -0.027912869 -0.09174427 -0.071641516 -0.0602920920
## age:tch age:ltg age:glu sex:bmi sex:map
## 1 -0.012112261 0.003064892 -0.03027751 0.06210301 0.01228204
## 2 -0.008368996 -0.010201680 -0.01138014 0.04451819 0.01373927
## 3 -0.014760528 -0.007763517 -0.06469909 0.04356153 -0.01815796
,,,,,
## ldl:glu hdl:tch hdl:ltg hdl:glu tch:ltg
## 1 -0.0009221095 0.03349363 0.0008521487 0.03115026 -0.02819118
## 2 0.0237834359 -0.02381466 -0.0945055990 -0.14037759 0.02529772
## 3 0.0049134553 0.03295588 0.0182807986 0.03279524 -0.02733183
## tch:glu ltg:glu
## 1 -0.01765816 -0.02779368
## 2 0.05303354 0.10401328
## 3 -0.01723596 -0.02230374
# x2_my <-> diabetes$x2 ; the numbers differ again. Getting on my nerves. Standardization again?
colMeans(diabetes$x2) # OK. Means are zero
## age sex bmi map tc
## -4.807774e-20 7.347064e-18 1.427492e-18 -4.155996e-18 3.376235e-18
## ldl hdl tch ltg glu
## 2.087212e-18 2.392604e-18 2.628454e-18 1.007916e-18 2.880740e-18
## age^2 bmi^2 map^2 tc^2 ldl^2
## -2.562053e-17 -2.487680e-17 2.863079e-18 5.103624e-17 1.603920e-17
## hdl^2 tch^2 ltg^2 glu^2 age:sex
## -2.200538e-17 6.511517e-17 -1.233121e-17 -4.219067e-19 -2.473796e-18
## age:bmi age:map age:tc age:ldl age:hdl
## -6.574386e-18 1.253455e-18 -1.434924e-17 -1.945799e-18 -4.604180e-19
## age:tch age:ltg age:glu sex:bmi sex:map
## 1.544865e-18 -5.812010e-18 -3.373169e-18 -4.895467e-18 2.879759e-19
## sex:tc sex:ldl sex:hdl sex:tch sex:ltg
## -1.410015e-18 6.082570e-18 1.594979e-17 -6.308045e-17 -4.605652e-18
## sex:glu bmi:map bmi:tc bmi:ldl bmi:hdl
## 1.442823e-17 -3.066428e-18 -1.311173e-17 2.385490e-20 2.539584e-17
## bmi:tch bmi:ltg bmi:glu map:tc map:ldl
## -8.461438e-19 -5.782820e-20 -9.609171e-18 1.023099e-17 -8.185481e-18
## map:hdl map:tch map:ltg map:glu tc:ldl
## 3.437681e-18 -8.184500e-18 1.106279e-17 -1.515948e-17 1.793643e-17
## tc:hdl tc:tch tc:ltg tc:glu ldl:hdl
## -1.677815e-18 -2.114418e-17 -1.042220e-17 4.673537e-18 2.068999e-18
## ldl:tch ldl:ltg ldl:glu hdl:tch hdl:ltg
## -2.395670e-18 1.845505e-18 -1.498814e-17 6.730516e-18 -3.275664e-18
## hdl:glu tch:ltg tch:glu ltg:glu
## -6.698996e-19 1.297241e-18 -1.880796e-19 -2.796359e-19
apply(diabetes$x2, 2, sd) # I knew it.
## age sex bmi map tc ldl
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## hdl tch ltg glu age^2 bmi^2
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## map^2 tc^2 ldl^2 hdl^2 tch^2 ltg^2
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## glu^2 age:sex age:bmi age:map age:tc age:ldl
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## age:hdl age:tch age:ltg age:glu sex:bmi sex:map
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## sex:tc sex:ldl sex:hdl sex:tch sex:ltg sex:glu
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## bmi:map bmi:tc bmi:ldl bmi:hdl bmi:tch bmi:ltg
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## bmi:glu map:tc map:ldl map:hdl map:tch map:ltg
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## map:glu tc:ldl tc:hdl tc:tch tc:ltg tc:glu
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## ldl:hdl ldl:tch ldl:ltg ldl:glu hdl:tch hdl:ltg
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
## hdl:glu tch:ltg tch:glu ltg:glu
## 0.04761905 0.04761905 0.04761905 0.04761905
scale(x2_my[,11:64])[1:3,]*(0.0149/0.312) # Test. It matches closely (0.0149/0.312 = 0.04776, slightly above the target sd 0.04762)
## I(age^2) I(bmi^2) I(map^2) I(tc^2) I(ldl^2)
## 1 -0.01489801 0.022569491 -0.03113423 -0.004343613 -0.01377956
## 2 -0.04141065 0.005659051 -0.02738643 -0.031028148 -0.02487257
## 3 0.09190780 -0.004188469 -0.03892186 -0.002593396 -0.01434683
## I(hdl^2) I(tch^2) I(ltg^2) I(glu^2) age:sex
## 1 -0.004644785 -0.030536295 -0.02889934 -0.02760496 0.03295978
## 2 0.040152014 -0.009512844 0.03726844 0.08827587 -0.00662906
## 3 -0.014904326 -0.030536295 -0.03491034 -0.02249732 0.08429420
## age:bmi age:map age:tc age:ldl age:hdl
## 1 0.040688708 0.001665431 -0.04668754 -0.038355032 -0.0346110594
## 2 -0.006784318 -0.015980188 -0.01176263 -0.009683393 0.0007015656
## 3 0.071093617 -0.027993387 -0.09200892 -0.071848175 -0.0604660115
,,,,
## tc:hdl tc:tch tc:ltg tc:glu ldl:hdl ldl:tch
## 1 0.03993796 -0.01935874 -0.04301523 0.0009657478 0.04247711 -0.02210140
## 2 -0.01650978 -0.01554592 -0.01237870 0.0009353736 -0.02131774 -0.01159761
## 3 0.03010345 -0.01929665 -0.02725616 0.0099000824 0.03368385 -0.02212698
## ldl:ltg ldl:glu hdl:tch hdl:ltg hdl:glu
## 1 -0.03121435 -0.0009247694 0.03359024 0.0008546068 0.03124011
## 2 0.01301080 0.0238520420 -0.02388336 -0.0947782113 -0.14078252
## 3 -0.01806814 0.0049276287 0.03305094 0.0183335317 0.03288985
## tch:ltg tch:glu ltg:glu
## 1 -0.02827250 -0.01770909 -0.02787386
## 2 0.02537069 0.05318652 0.10431332
## 3 -0.02741067 -0.01728568 -0.02236807
x2_my[,11:64] = scale(x2_my[,11:64])*(0.0149/0.312)
sqrt(sum(x2_my - diabetes$x2)^2) # essentially zero: the two agree. Reconstruction of the preprocessing done. (For a strict check, sqrt(sum((x2_my - diabetes$x2)^2)) is the Frobenius norm; the plain sum here can cancel, since every column is centered.)
## [1] 7.310626e-13
# Now we have it. x2 is generated from x: the 10 originals, their quadratic terms, and the first-order interactions.
### Do the Lasso, LAR, and forward-stagewise regression as in the LARS manual
par(mfrow=c(2,2))
object <- lars(diabetes$x, diabetes$y)
plot(object)
object2 <- lars(diabetes$x, diabetes$y, type='lar')
plot(object2)
object3 <- lars(diabetes$x, diabetes$y, type='for') # Can use abbreviations
plot(object3)
# The three look similar.
# In particular, as in the paper, Lasso and LAR give nearly identical results on this data.
Let's go deeper.
Regularized regressions like the lasso are meant to select variables and thereby build a decent model automatically when there are many predictors. The example above has few variables, and even the 64 in diabetes$x2 are few. Let's push further: build a new predictor matrix x3 from x_df and run the lasso on that.
On a hunch that bmi (body mass index) might interact again with the first-order interactions;
x3 <- model.matrix(~ bmi*(.^2) + I(age^2) + I(bmi^2) + I(map^2) + I(tc^2) + I(ldl^2) + I(hdl^2) + I(tch^2) +
I(ltg^2) + I(glu^2), data=x_df)[,-1] # get rid of 'intercept' column
str(x3)
## num [1:442, 1:100] 0.0617 -0.0515 0.0445 -0.0116 -0.0364 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:442] "1" "2" "3" "4" ...
## ..$ : chr [1:100] "bmi" "age" "sex" "map" ...
x3[1:3,]
## bmi age sex map tc
## 1 0.06169621 0.038075906 0.05068012 0.021872355 -0.044223498
## 2 -0.05147406 -0.001882017 -0.04464164 -0.026327835 -0.008448724
## 3 0.04445121 0.085298906 0.05068012 -0.005670611 -0.045599451
## ldl hdl tch ltg glu
## 1 -0.03482076 -0.04340085 -0.002592262 0.019908421 -0.01764613
## 2 -0.01916334 0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262 0.002863771 -0.02593034
## I(age^2) I(bmi^2) I(map^2) I(tc^2) I(ldl^2)
## 1 1.449775e-03 0.003806422 4.783999e-04 1.955718e-03 0.0012124855
## 2 3.541986e-06 0.002649579 6.931549e-04 7.138094e-05 0.0003672336
## 3 7.275903e-03 0.001975910 3.215582e-05 2.079310e-03 0.0011692615
## I(hdl^2) I(tch^2) I(ltg^2) I(glu^2) age:sex
## 1 0.001883633 6.719822e-06 3.963452e-04 0.0003113857 0.0019296915
## 2 0.005537081 1.559727e-03 4.668954e-03 0.0085015868 0.0000840163
## 3 0.001046906 6.719822e-06 8.201182e-06 0.0006723825 0.0043229587
......
## bmi:ldl:hdl bmi:ldl:tch bmi:ldl:ltg bmi:ldl:glu bmi:hdl:tch
## 1 9.323843e-05 5.568980e-06 -4.276944e-05 3.790933e-05 6.941216e-06
## 2 7.340068e-05 -3.895686e-05 -6.740148e-05 -9.095145e-05 1.512701e-04
## 3 4.918055e-05 3.940201e-06 -4.352890e-06 3.941374e-05 3.728348e-06
## bmi:hdl:ltg bmi:hdl:glu bmi:tch:ltg bmi:tch:glu bmi:ltg:glu
## 1 -5.330813e-05 4.725046e-05 -3.184008e-06 2.822193e-06 -2.167428e-05
## 2 2.617211e-04 3.531660e-04 -1.389065e-04 -1.874402e-04 -3.243010e-04
## 3 -4.118848e-06 3.729458e-05 -3.299900e-07 2.987932e-06 -3.300882e-06
diabetes$x2[1:3,]
## age sex bmi map tc
## 1 0.038075906 0.05068012 0.06169621 0.021872355 -0.044223498
## 2 -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## 3 0.085298906 0.05068012 0.04445121 -0.005670611 -0.045599451
## ldl hdl tch ltg glu
## 1 -0.03482076 -0.04340085 -0.002592262 0.019908421 -0.01764613
## 2 -0.01916334 0.07441156 -0.039493383 -0.068329744 -0.09220405
## 3 -0.03419447 -0.03235593 -0.002592262 0.002863771 -0.02593034
## age^2 bmi^2 map^2 tc^2 ldl^2
## 1 -0.01485516 0.022504574 -0.03104468 -0.004331120 -0.01373992
## 2 -0.04129154 0.005642773 -0.02730766 -0.030938902 -0.02480103
## 3 0.09164344 -0.004176421 -0.03880990 -0.002585937 -0.01430556
## hdl^2 tch^2 ltg^2 glu^2 age:sex
## 1 -0.004631425 -0.030448463 -0.02881622 -0.02752556 0.032864976
## 2 0.040036524 -0.009485482 0.03716124 0.08802196 -0.006609993
## 3 -0.014861456 -0.030448463 -0.03480993 -0.02243261 0.084051745
,,,,,,
## map:hdl map:tch map:ltg map:glu tc:ldl tc:hdl
## 1 -0.01120743 -0.01382511 -0.01019384 -0.02577846 -0.006868965 0.03982309
## 2 -0.03197943 0.00987452 0.02036965 0.03136196 -0.026235312 -0.01646229
## 3 0.01209345 -0.01228188 -0.02031831 -0.01495348 -0.006596978 0.03001686
## tc:tch tc:ltg tc:glu ldl:hdl ldl:tch ldl:ltg
## 1 -0.01930305 -0.04289151 0.0009629700 0.04235493 -0.02203783 -0.03112456
## 2 -0.01550120 -0.01234310 0.0009326831 -0.02125642 -0.01156425 0.01297338
## 3 -0.01924114 -0.02717776 0.0098716066 0.03358696 -0.02206334 -0.01801617
## ldl:glu hdl:tch hdl:ltg hdl:glu tch:ltg
## 1 -0.0009221095 0.03349363 0.0008521487 0.03115026 -0.02819118
## 2 0.0237834359 -0.02381466 -0.0945055990 -0.14037759 0.02529772
## 3 0.0049134553 0.03295588 0.0182807986 0.03279524 -0.02733183
## tch:glu ltg:glu
## 1 -0.01765816 -0.02779368
## 2 0.05303354 0.10401328
## 3 -0.01723596 -0.02230374
x3[,11:100] = scale(x3[,11:100])*(0.0149/0.312) # data prep done.
So, let's run the lasso on 100 predictors.
# x3_lasso <- lars(x=x3, y=diabetes$y, type="lasso", trace=TRUE) # set trace=TRUE to see the lasso path
x3_lasso <- lars(x=x3, y=diabetes$y, type="lasso", trace=F) # Print out is too long
# Variables 1, 9, 4 and 7 enter first. Variable 16 (hdl^2), selected at step 37, is dropped again at step 48.
# Roughly a third or more of the variables get dropped along the path.
par(mfrow=c(1,1))
plot(x3_lasso) # too crowded to see anything.
x3_lasso
##
## Call:
## lars(x = x3, y = diabetes$y, type = "lasso", trace = F)
## R-squared: 0.626
## Sequence of LASSO moves:
## bmi ltg map hdl bmi:map age:sex I(glu^2) I(bmi^2) age:map age:glu sex
## Var 1 9 4 7 37 20 19 12 22 28 3
## Step 1 2 3 4 5 6 7 8 9 10 11
## bmi:tc:hdl bmi:hdl:glu glu age:ltg bmi:tc:tch I(age^2) bmi:age:hdl
## Var 87 97 10 27 88 11 69
## Step 12 13 14 15 16 17 18
## sex:map bmi:age:sex map:hdl age:ldl sex:hdl tch:glu bmi:ldl:ltg
## Var 30 65 46 24 33 63 93
## Step 19 20 21 22 23 24 25
## ldl:glu bmi:sex sex:tch bmi:age:ltg I(ltg^2) ldl:ltg tc:hdl
## Var 58 29 34 71 18 57 51
## Step 26 27 28 29 30 31 32
## bmi:tc:ltg bmi:hdl:ltg hdl:tch hdl:ltg I(hdl^2) bmi:map:tch bmi:ltg
## Var 89 96 59 60 16 83 42
## Step 33 34 35 36 37 38 39
## bmi:map:tc age age:hdl bmi:age:ldl tch:ltg bmi:age:glu map:glu
## Var 80 2 25 68 62 72 49
## Step 40 41 42 43 44 45 46
## bmi:sex:map I(hdl^2) tc bmi:ldl:hdl I(ltg^2) bmi:tc:hdl bmi:sex:ldl
## Var 73 -16 5 91 -18 -87 75
## Step 47 48 49 50 51 52 53
## bmi:ldl:tch I(tch^2) bmi:ldl:hdl I(bmi^2) bmi:sex:hdl sex:ldl hdl:ltg
## Var 92 17 -91 -12 76 32 -60
## Step 54 55 56 57 58 59 60
## bmi:ldl tch tc:glu tc:ltg bmi:map:hdl hdl:glu map:tch tc:tch sex:glu
## Var 39 8 54 53 82 61 47 52 36
## Step 61 62 63 64 65 66 67 68 69
,,, snip
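The enter/drop bookkeeping in the printout above can also be pulled straight from the fitted object (a sketch; I believe the lars object stores the per-step moves in its actions component, with negative entries marking drops):

```r
moves <- unlist(x3_lasso$actions)   # one entry per step: +var enters, -var leaves
names(moves)[moves < 0]             # the variables dropped along the path, e.g. hdl^2
```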
x3_lasso_coef_mat = coef(x3_lasso, s=c(.25,.50,0.75,1.00), mode="fraction")
# Analysis of x3_lasso_coef_mat:
# s = 1, i.e., no penalty (OLS):
sum(x3_lasso_coef_mat[4,] == 0) # all the variables are included, as expected.
## [1] 0
sum(abs(x3_lasso_coef_mat[4,])) # 113971, this is the total sum of beta's absolute values in OLS
## [1] 113971.2
# the total the betas reach when no penalty, i.e. no shrinkage, is applied
# s = 0.75 : Give some penalty
sum(x3_lasso_coef_mat[3,] == 0) # 2 variables got sacked.
## [1] 2
sum(abs(x3_lasso_coef_mat[3,])) # 85413, about what we'd expect: 113971*0.75 = 85478, within ~0.1% of 85413
## [1] 85413.39
# s = 0.25 :
sum(x3_lasso_coef_mat[1,] == 0) # 5 variables got sacked at this stage
## [1] 5
sum(abs(x3_lasso_coef_mat[1,])) # 28477
## [1] 28477.45
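As a follow-up, the predictors that survive at a given s can be listed straight from the coefficient matrix (a sketch reusing x3_lasso_coef_mat from above; row 1 corresponds to s = 0.25):

```r
beta_s25 <- x3_lasso_coef_mat[1, ]
names(beta_s25)[beta_s25 != 0]        # the predictors the lasso keeps at s = 0.25
# sanity check: the L1 norm should be roughly the stated fraction of the OLS norm
sum(abs(beta_s25)) / sum(abs(x3_lasso_coef_mat[4, ]))   # ~0.25
```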
At which penalty value do we get the best result? ========
Use 7-fold CV.
?cv.lars
# CV_x3_lasso = cv.lars(x=x3, y=diabetes$y, K=7, index=seq(from = 0, to = 1, length=200),
#                       type="lasso", trace=TRUE, mode="fraction") # verbose version
CV_x3_lasso = cv.lars(x=x3, y=diabetes$y, K=7, index=seq(from = 0, to = 1, length=200),
                      type="lasso", trace=FALSE, mode="fraction") # nice.
Analysis of the results:
The best performance shows up where s is nearly 0. No surprise: with only 442 data points, forcing a 100-variable model gives this kind of result. Still, the lasso is cool.
Then at which 's' does the lasso show its best result? Easy; I don't even have to run predictions for this.
Since there are 200 indices between 0 and 1, each step is about 0.005. The plot puts the optimum at the 5th tick, so build the model at s = 0.005*5 = 0.025.
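Instead of counting ticks on the plot, the optimal s can be read out of the object cv.lars returns (a sketch, assuming it returns its usual list with cv and index components):

```r
best_i <- which.min(CV_x3_lasso$cv)    # position of the smallest cross-validated MSE
CV_x3_lasso$index[best_i]              # the corresponding fraction s; should land near 0.025
```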
Prediction with the lars lasso model.
As with other model types, the lars package provides a predict method (predict.lars). Since predict() is generic, predict.lars() is dispatched automatically when the model is a lars object.
predict_to_self = predict(x3_lasso, x3, s=0.025, type="fit", mode="fraction") # not an out-of-sample test
str(predict_to_self) # it's got fit
## List of 4
## $ s : num 0.025
## $ fraction: num 0.025
## $ mode : chr "fraction"
## $ fit : Named num [1:442] 206.5 73.8 190.2 179.1 117.5 ...
## ..- attr(*, "names")= chr [1:442] "1" "2" "3" "4" ...
MSE_lasso_0.025 = mean((predict_to_self$fit - diabetes$y)^2) # MSE at s=0.025
MSE_lasso_0.025
## [1] 2580.15
**2580 agrees with the plot. It should in fact be a bit smaller than the plotted value, since this is not an out-of-sample test.
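For an honest out-of-sample figure, the same machinery works with a held-out split (a sketch under hypothetical choices: the seed and the 100-row test size are arbitrary, and x3 and diabetes$y are assumed as above):

```r
set.seed(1)                                    # arbitrary seed, for reproducibility
test_idx  <- sample(nrow(x3), 100)             # hold out 100 of the 442 rows
fit_train <- lars(x3[-test_idx, ], diabetes$y[-test_idx], type = "lasso")
pred_test <- predict(fit_train, x3[test_idx, ], s = 0.025, type = "fit", mode = "fraction")
mean((pred_test$fit - diabetes$y[test_idx])^2) # out-of-sample MSE; expect it above 2580
```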
Lastly, just for fun:
Lasso with s=1 is simply OLS, so let's check it against an OLS regression result.
predict_to_self_s_1.0 = predict(x3_lasso, x3, s=1.0, type="fit", mode="fraction")
LM_OLS = lm(diabetes$y ~ x3)
predict_LM_OLS = predict(LM_OLS)
# The lasso at s=1 should reproduce the OLS result. Let's check.
mean((predict_to_self_s_1.0$fit - predict_LM_OLS)^2)
## [1] 2.916263e-18
# 2.92e-18 ; effectively 0. The two are the same.
Done