Ref : What is the difference between a Differential and a Derivative? How should they be understood?
■ Differentials
z = f(x, y). If △x and △y are the increments of x and y, then the increment of z is
△z = f(x + △x, y + △y) - f(x, y)
Definition:
If z = f(x, y), then f is differentiable at (a, b) if △z can be expressed as
△z = fx(a, b)△x + fy(a, b)△y + ε1△x + ε2△y, where ε1 → 0 and ε2 → 0 as (△x, △y) → (0, 0).
In practice a sufficient condition is used: if the partial derivatives fx and fy exist near (a, b) and are continuous at (a, b), then f is differentiable at (a, b).
For a differentiable function of one variable, y = f(x), the differential of y is defined as
dy = f'(x)·dx
where dx is an independent variable that can take any real value.
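For a concrete sense of the difference between the differential dy and the increment △y (this small numeric illustration is not from the original post): take y = x². Then dy = 2x·dx, so at x = 1 with dx = △x = 0.1, dy = 0.2, while the actual increment is △y = (1.1)² − 1² = 0.21. The differential is the tangent-line (linear) approximation of the increment, and the two agree in the limit dx → 0.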
Definition:
For a differentiable function of two variables, z = f(x, y), and △x, △y that are increments of the independent variables x and y, the differentials dx and dy are the independent variables,
dx = △x , dy = △y
and the differential dz, also called the total differential, is defined by
dz = fx(x, y) dx + fy(x, y) dy = (∂z/∂x) dx + (∂z/∂y) dy
ex) Find the total differential of the function of three variables w = 2z³y sin x.
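A worked solution (the computation was presumably an image in the original and is reconstructed here):
dw = (∂w/∂x) dx + (∂w/∂y) dy + (∂w/∂z) dz = 2z³y cos x dx + 2z³ sin x dy + 6z²y sin x dz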
■ Chain rule for a function of several variables
Chain Rule : one independent variable. If z = f(x, y) is differentiable and x = g(t), y = h(t) are differentiable functions of t, then
dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)
ex) If z = x²y + 3xy³ where x = sin 2t and y = e^t, find dz/dt when t = 0.
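A worked solution (the steps were presumably an image in the original and are reconstructed here):
dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt) = (2xy + 3y³)(2 cos 2t) + (x² + 9xy²)e^t
At t = 0, x = sin 0 = 0 and y = e⁰ = 1, so dz/dt = (0 + 3)(2) + (0 + 0)(1) = 6.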
Chain Rule : several variables. If w = f(x1, ..., xn) and each xi is a differentiable function of t1, ..., tm, then for each j,
∂w/∂tj = (∂w/∂x1)(∂x1/∂tj) + ... + (∂w/∂xn)(∂xn/∂tj)
■ The Gradient of a function of several variables
Definition:
Let z = f(x1, x2, ..., xn) be a scalar function of x1, x2, ..., xn. Then the gradient of f, denoted by ∇f, is the vector
∇f = (∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn)
Let z = f(x, y) be a function of x and y in the Cartesian coordinate system. Then ∇f(x, y) is the vector
∇f(x, y) = (∂f/∂x, ∂f/∂y) = fx(x, y) i + fy(x, y) j
Note that ∇f (x, y) is a vector on the x-y 2-D plane, and not in the x-y-z three dimensional space.
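As a quick numerical sanity check of this definition (the snippet below, including the helper num_grad and the test function f_xy, is my own illustration and not from the post), a central-difference approximation of ∇f can be compared against the analytic gradient:

# num_grad is a hypothetical helper: central-difference approximation of the gradient
num_grad <- function(f, x, h = 1e-6) {
  sapply(seq_along(x), function(i) {
    e <- rep(0, length(x))
    e[i] <- h
    (f(x + e) - f(x - e)) / (2 * h)   # slope along each coordinate axis
  })
}
f_xy <- function(v) v[1]^2 * v[2] + 3 * v[1] * v[2]^3   # z = x^2*y + 3*x*y^3, the chain-rule example
num_grad(f_xy, c(1, 2))   # ~ (28, 37), matching the analytic gradient (2xy + 3y^3, x^2 + 9xy^2) at (1, 2)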
Some properties of the gradient:
- If f is a differentiable function of x1, x2, ..., xn, then the directional derivative of f in the direction of a unit vector u, denoted by ∇u f(x1, x2, ..., xn), is
∇u f = ∇f · u   (the dot product of the gradient with the unit direction vector)
- Note that the directional derivative is a scalar and is the generalization of the partial derivative. So, for z = f(x, y) in 3-D Cartesian space, the directional derivatives along the vectors i and j at a point are simply the partial derivative values fx and fy at that point.
- If ∇f(x, y) = 0, then ∇u f(x, y) = 0 for all u. That is, if the gradient at a point is the zero vector, then the directional derivative at that point is zero in every direction.
- For z = f(x, y), the direction of maximum increase (direction of steepest ascent) of f is given by the gradient ∇f(x, y). The maximum value of ∇u f(x, y) is ||∇f(x, y)||. (Note that ∇f(x, y) is a vector in the x-y plane that points in the direction of maximum increase of f, not a vector pointing along the tangential slope leading to that maximum increase.) This property of the direction of maximum increase generalizes to a function of n variables.
- For z = f(x, y), the direction of minimum increase (steepest decrease/descent) of f is given by -∇f(x, y). That is, since ∇f(x, y) is the vector pointing in the direction in which f increases fastest, moving in the opposite direction -∇f(x, y) naturally makes f decrease fastest. The minimum value of ∇u f(x, y) is -||∇f(x, y)||.
When f is shaped like a convex function near (x, y), saying that -∇f(x, y) points in the direction of minimum increase means that if you move infinitesimally from (x, y) in the direction of -∇f(x, y), f(x, y) decreases most steeply, changing at the rate -||∇f(x, y)||. The idea that repeatedly moving over f in the direction of the negative gradient -∇f(x, y) takes you as quickly as possible toward a local minimum of f(x, y) is exactly gradient descent (the update rule is written out below). Since ∇f(x, y) is only valid in the immediate neighborhood of that particular (x, y), once you have computed ∇f(x, y) at a point you should move only a little in the -∇f(x, y) direction; boldly taking a large step can end badly. In gradient descent this step size is expressed as the learning rate. (Of course, if the landscape of f(x, y) does not change very differently with respect to each variable and f(x, y) is generally gentle, a fairly large learning rate may work even without standardizing the variables. The problem is that we usually do not know the landscape of f(x, y) in advance.)
* One common source of confusion is using "gradient" loosely to mean slope. A slope such as df/dx or ∂f/∂y is the ratio of how much f changes when some variable changes, so it is a number, i.e. a scalar. Therefore what we usually call a slope is not the vector ∇f, which points in the direction of steepest ascent, but a directional derivative. But some people don't know the term "directional derivative". So when you say "gradient", be careful whether you really mean the gradient ∇f (the direction of steepest ascent) or a slope, which is a directional derivative (the slope along some given direction). In other words, the steepest downhill "slope" we usually speak of in gradient descent is the tangential slope at that (x, y) in the direction of -∇f(x, y).
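For reference, the iteration described above can be written compactly as an update rule (this restatement is mine, with α denoting the learning rate):
(x, y)_next = (x, y)_current - α·∇f((x, y)_current)
Gradient ascent simply flips the sign to +α·∇f. The R demos below implement exactly this update.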
■ Using the Gradient Descent/Ascent to get local minimum/maximum
ex-1) Let f(x) = 2(x-2)² + 3. Find its minimum.
a) Analytical Solution:
f'(x) = 4(x-2) = 0 gives x = 2 as the point where f(x) has slope 0. Since f''(x) = 4 > 0 (f is convex), x = 2 is where f(x) attains its local minimum.
f(2) = 3. So (2, 3) is the global minimum point of f(x).
b) Using Gradient Descent:
simple one dimensional gradient descent demo
costFun1D <- function(x) {
  2 * (x-2)^2 + 3
}

grad1D <- function(x) {
  # gradient of a single variable function is just a derivative
  2 * 2 * (x - 2)
}
#
#=========== 1-d gradient descent demo =============
grad_desc_1d <- function(x_current, alpha, delta) {
  costFun1D_current = costFun1D(x_current)
  x_path <- c(x_current)
  f_path <- c(costFun1D_current)
  x_next = x_current - alpha * grad1D(x_current)  # "-" for gradient descent
  iter = 0  # to guard against when the gradient descent diverges and blows up
  while (abs(costFun1D_current - costFun1D(x_next)) > delta & iter < 500) {
    iter = iter + 1
    x_current <- x_next
    x_path = c(x_path, x_current)
    x_next = x_current - alpha * grad1D(x_current)
    costFun1D_current = costFun1D(x_current)
    f_path = c(f_path, costFun1D_current)
  }
  return(list("x_path" = x_path, "f_path" = f_path))
}

# Plot the cost function
xs = seq(0, 4, len=40)
- First example
plot(xs, costFun1D(xs), type='l', xlab='x', ylab=expression(2(x-2)^2+3))
abline(v=2, col='red', lty=2)
paths = grad_desc_1d(0.1, 0.1, 1e-7)
lines(paths$x_path, paths$f_path, type="b", col="blue")
text(mean(paths$x_path[1:5]), 6, "Gradient Descent", col="blue")
print(tail(paths$x_path, 10))
## [1] 1.980852 1.988511 1.993107 1.995864 1.997518 1.998511 1.999107
## [8] 1.999464 1.999678 1.999807
print(tail(paths$f_path, 10))
## [1] 3.000733 3.000264 3.000095 3.000034 3.000012 3.000004 3.000002
## [8] 3.000001 3.000000 3.000000
- Second example
plot(xs, costFun1D(xs), type='l', xlab='x', ylab=expression(2(x-2)^2+3))
abline(v=2, col='red', lty=2)
paths = grad_desc_1d(10, 0.22, 1e-7)
lines(paths$x_path, paths$f_path, type="b", col="blue")
text(mean(paths$x_path[1:5]), 6, "Gradient Descent", col="blue")
print(tail(paths$x_path, 10))
## [1] 10.000000 2.960000 2.115200 2.013824 2.001659 2.000199
print(tail(paths$f_path, 10))
## [1] 131.000000 4.843200 3.026542 3.000382 3.000006 3.000000
- Third example
plot(xs, costFun1D(xs), type='l', xlab='x', ylab=expression(2(x-2)^2+3))
abline(v=2, col='red', lty=2)
paths = grad_desc_1d(3.2, 0.51, 1e-7)
lines(paths$x_path, paths$f_path, type="b", col="blue")  # does not converge.
text(mean(paths$x_path[1:5]), 6, "Gradient Descent", col="blue")
print(tail(paths$x_path, 10))
## [1] -277045333  288127150 -299652232  311638326 -324103855  337068013
## [7] -350550729  364572763 -379155669  394321900
print(tail(paths$f_path, 10))
## [1] 1.535082e+17 1.660345e+17 1.795829e+17 1.942369e+17 2.100866e+17
## [6] 2.272297e+17 2.457716e+17 2.658266e+17 2.875180e+17 3.109795e+17
- Fourth example
plot(xs, costFun1D(xs), type='l', xlab='x', ylab=expression(2(x-2)^2+3))
abline(v=2, col='red', lty=2)
paths = grad_desc_1d(3.2, 0.483, 1e-7)  # step size is reduced to 0.483
lines(paths$x_path, paths$f_path, type="b", col="blue")  # converges.
text(mean(paths$x_path[1:5]), 6, "Gradient Descent", col="blue")
print(tail(paths$x_path, 10))
## [1] 1.998874 2.001049 1.999022 2.000911 1.999151 2.000791 1.999262
## [8] 2.000687 1.999359 2.000597
print(tail(paths$f_path, 10))
## [1] 3.000003 3.000002 3.000002 3.000002 3.000001 3.000001 3.000001
## [8] 3.000001 3.000001 3.000001
- Last example
plot(xs, costFun1D(xs), type='l', xlab='x', ylab=expression(2(x-2)^2+3))
abline(v=2, col='red', lty=2)
paths = grad_desc_1d(3.2, 0.01, 1e-7)  # step size is reduced to 0.01
lines(paths$x_path, paths$f_path, type="b", col="blue")  # converges.
text(mean(paths$x_path[1:5]), 6, "Gradient Descent", col="blue")
print(tail(paths$x_path, 10))
## [1] 2.001116 2.001071 2.001028 2.000987 2.000948 2.000910 2.000873
## [8] 2.000838 2.000805 2.000773
print(tail(paths$f_path, 10))
## [1] 3.000002 3.000002 3.000002 3.000002 3.000002 3.000002 3.000002
## [8] 3.000001 3.000001 3.000001
ex-2) Two-variable example. Let f(x, y) = x² + y² - 2x + 3. Find its minimum. The plot is an upward-opening paraboloid.
Analytical Solution:
To find a relative minimum or maximum of a function of several variables, use the following property.
Let f be a function of several variables. If f has a relative extremum at a point (x0, x1, ..., xn), then one of the following is true:
a) ∇f(x0, x1, ..., xn) = 0
b) ∇f(x0, x1, ..., xn) does not exist (at least one partial derivative fails to exist at that point)
f(x, y) = x² + y² - 2x + 3. The partial derivatives with respect to x and y obviously exist, so ∇f(x, y) must be zero for f(x, y) to have an extremum. Thus,
∂f/∂x = 2x - 2 = 0 and ∂f/∂y = 2y = 0
So x = 1, y = 0 is an extremum point.
Because f(x, y) is a convex function, f(x, y) has a global minimum value of 2 at (1, 0).
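To back up the convexity claim with a standard check (this step is added here, it is not in the post as extracted): the second partial derivatives are fxx = 2, fyy = 2, fxy = 0, so fxx·fyy − fxy² = 4 > 0 and fxx > 0. The Hessian is positive definite everywhere, so f is convex and the critical point (1, 0) is a global minimum, with f(1, 0) = 1 + 0 − 2 + 3 = 2.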
Using Gradient Descent :
A taste of two-variable gradient descent
costFun2D <- function(x) {
  # input x is a two element vector of x and y
  x[1]^2 + x[2]^2 - 2*x[1] + 3
}

grad2D <- function(x) {
  c(2*x[1] - 2, 2*x[2])
}
#
#=========== 2-d gradient descent demo =============
grad_desc_2d <- function(xy_current, alpha=0.1, delta=1e-7) {
  costFun2D_current = costFun2D(xy_current)
  x_path <- c(xy_current[1])
  y_path <- c(xy_current[2])
  f_path <- c(costFun2D_current)
  xy_next = c(xy_current[1] - alpha * grad2D(xy_current)[1],
              xy_current[2] - alpha * grad2D(xy_current)[2])
  iter = 0  # to guard against when the gradient descent diverges and blows up
  while (abs(costFun2D_current - costFun2D(xy_next)) > delta & iter < 500) {
    iter = iter + 1
    xy_current <- xy_next
    x_path = c(x_path, xy_current[1])
    y_path = c(y_path, xy_current[2])
    xy_next = c(xy_current[1] - alpha * grad2D(xy_current)[1],
                xy_current[2] - alpha * grad2D(xy_current)[2])
    costFun2D_current = costFun2D(xy_current)
    f_path = c(f_path, costFun2D_current)
  }
  return(list("x_path" = x_path, "y_path" = y_path, "f_path" = f_path))
}
Plotting preparation
x = seq(-9, 11, len=100)
y = seq(-10, 10, len=100)
z = costFun2D(expand.grid(x, y))
- First example
res = grad_desc_2d(c(-10.5, 8), alpha=0.1, delta=1e-7)
paths = data.frame("x"=res$x_path, "y"=res$y_path, "f(x,y)"=res$f_path)
print(tail(paths))
##            x            y   f.x.y.
## 42 0.9987771 0.0008507059 2.000002
## 43 0.9990217 0.0006805647 2.000001
## 44 0.9992174 0.0005444518 2.000001
## 45 0.9993739 0.0004355614 2.000001
## 46 0.9994991 0.0003484491 2.000000
## 47 0.9995993 0.0002787593 2.000000
contour(x, y, matrix(z$Var1, length(x)), xlab="x", ylab="y")
points(res$x_path, res$y_path, col="blue", pch=19)
points(tail(res$x_path, 1), tail(res$y_path, 1), col="red", pch=17)
- Second example : very fast convergence, taking 8 steps
res = grad_desc_2d(c(-10.5, 8), alpha=0.4, delta=1e-7)
paths = data.frame("x"=res$x_path, "y"=res$y_path, "f(x,y)"=res$f_path)
print(tail(paths))
##           x         y   f.x.y.
## 3 0.5400000 0.3200000 2.314000
## 4 0.9080000 0.0640000 2.012560
## 5 0.9816000 0.0128000 2.000502
## 6 0.9963200 0.0025600 2.000020
## 7 0.9992640 0.0005120 2.000001
## 8 0.9998528 0.0001024 2.000000
contour(x, y, matrix(z$Var1, length(x)), xlab="x", ylab="y")
points(res$x_path, res$y_path, col="blue", pch=19)
points(tail(res$x_path, 1), tail(res$y_path, 1), col="red", pch=17)
- Third example : as the step size grows, overshoot occurs. It takes 157 steps to converge
res = grad_desc_2d(c(-10.5, 8), alpha=0.97, delta=1e-7)
paths = data.frame("x"=res$x_path, "y"=res$y_path, "f(x,y)"=res$f_path)
print(tail(paths))
##             x             y   f.x.y.
## 152 1.0010069 -0.0007004802 2.000002
## 153 0.9990535  0.0006584514 2.000001
## 154 1.0008897 -0.0006189443 2.000001
## 155 0.9991637  0.0005818077 2.000001
## 156 1.0007862 -0.0005468992 2.000001
## 157 0.9992610  0.0005140853 2.000001
contour(x, y, matrix(z$Var1, length(x)), xlab="x", ylab="y")
points(res$x_path, res$y_path, col="blue", pch=19)
points(tail(res$x_path, 1), tail(res$y_path, 1), col="red", pch=17)
- Fourth example : the step size is too large, so it fails to converge and never finds the minimum
res = grad_desc_2d(c(-10.5, 8), alpha=1.01, delta=1e-7)
paths = data.frame("x"=res$x_path, "y"=res$y_path, "f(x,y)"=res$f_path)
print(tail(paths))
##             x         y       f.x.y.
## 496  207866.7 -144602.2  64117961993
## 497 -212022.0  147494.3  66708327657
## 498  216264.5 -150444.2  69403344094
## 499 -220587.8  153453.0  72207239196
## 500  225001.5 -156522.1  75124411659
## 501 -229499.5  159652.6  78159437890
contour(x, y, matrix(z$Var1, length(x)), xlab="x", ylab="y")
points(res$x_path, res$y_path, col="blue", pch=19)
points(tail(res$x_path, 1), tail(res$y_path, 1), col="red", pch=17)
- Last example : the step size is too large, so instead of converging it diverges
res = grad_desc_2d(c(2.5, 3.3), alpha=1.02, delta=1e-7)
paths = data.frame("x"=res$x_path, "y"=res$y_path, "f(x,y)"=res$f_path)
print(tail(paths))
##              x           y       f.x.y.
## 496 -405129820  -891285606 9.585202e+17
## 497  421335015   926937030 1.036735e+18
## 498 -438188413  -964014511 1.121333e+18
## 499  455715952  1002575092 1.212834e+18
## 500 -473944588 -1042678095 1.311801e+18
## 501  492902373  1084385219 1.418844e+18
contour(x, y, matrix(z$Var1, length(x)), xlab="x", ylab="y")
points(res$x_path, res$y_path, col="blue", pch=19)
points(tail(res$x_path, 1), tail(res$y_path, 1), col="red", pch=17)
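As a cross-check (my own addition, not part of the original post), base R's optim() can be run on the same costFun2D and grad2D defined above; for this convex quadratic it lands on essentially the same answer, a minimum of 2 at (1, 0):

# sanity check with base R's general-purpose optimizer, reusing costFun2D / grad2D from above
chk <- optim(par = c(-10.5, 8), fn = costFun2D, gr = grad2D, method = "BFGS")
chk$par    # expected to be very close to c(1, 0)
chk$value  # expected to be very close to 2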
Friendly intro on constrained optimization by Khan (equality constraints using Lagrange multipliers)
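For reference (this one-line summary is mine, not from the linked material): to optimize f(x, y) subject to an equality constraint g(x, y) = c, the Lagrange-multiplier condition is to solve
∇f(x, y) = λ·∇g(x, y) together with g(x, y) = c
for x, y, and λ; at a constrained extremum the gradients of the objective and of the constraint are parallel.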
There are countless variants of gradient descent/ascent; the real limit to how useful they are is a lack of imagination.
Used R again after a long time. Compared with doing this in C 30 years ago, it is remarkably convenient. It would have been nice to have had more opportunities to use it. That is hard from here. Not R's fault.
Pity.
I'll miss it.