
A simple time series clustering

이현봉 2015. 4. 24. 19:23

Time Series Clustering with R

An exercise in time series clustering: a test with various time-series representations.

Reference:
Lin, J. & Keogh, E., “A Symbolic Representation of Time Series, with
Implications for Streaming Algorithms” (the SAX paper).
http://www.cs.ucr.edu/~eamonn/SAX.pdf

Data:
The synthetic control chart time series dataset is used in the examples
in the following sections. It contains 600 control charts synthetically
generated by the process described in Alcock and Manolopoulos (1999).
Each control chart is a time series with 60 values, and there are six
classes:
1-100: Normal;
101-200: Cyclic;
201-300: Increasing trend;
301-400: Decreasing trend;
401-500: Upward shift; and
501-600: Downward shift.

Attachments: synthetic_control.data, SAX.pdf


Read in data

sc <- read.table("synthetic_control.data", header=F, sep="")
str(sc)
## 'data.frame':    600 obs. of  60 variables:
##  $ V1 : num  28.8 24.9 31.4 25.8 27.2 ...
##  $ V2 : num  34.5 25.7 30.6 30.5 29.2 ...
##  $ V3 : num  31.3 27.6 26.4 35.4 33.7 ...
##  $ V4 : num  31.3 32.8 24.3 25.6 25.6 ...
##  $ V5 : num  28.9 27.9 27.9 28 24.7 ...
##  $ V6 : num  33.8 31.6 28.5 25.3 28.9 ...
##  $ V7 : num  25.4 31.5 25 28.1 35.8 ...
##  $ V8 : num  27.8 35.5 32.4 29.4 34.9 ...
##  $ V9 : num  35.2 28 25.2 31.5 24.6 ...
##  $ V10: num  27.1 31.7 27.3 27.3 34.2 ...
##  $ V11: num  32.9 27.5 31.8 29 28 ...
##  $ V12: num  29.2 31.2 27.3 29 25.3 ...
##  $ V13: num  36 27.5 28.3 30 35.4 ...
##  $ V14: num  32.3 31.4 26.6 30.3 34.9 ...
##  $ V15: num  34.5 27.8 24 30.4 25.1 ...
##  $ V16: num  32.9 24.5 35.1 24.3 29.5 ...
##  $ V17: num  34.1 27.6 31.6 24.3 33.2 ...
##  $ V18: num  26.5 35.6 32.6 35.1 31.1 ...
##  $ V19: num  27.7 35.4 31 25.4 31.4 ...
##  $ V20: num  26.4 31.4 34.1 32.1 26.5 ...
##  $ V21: num  25.8 30.7 26.9 33.3 28.6 ...
##  $ V22: num  29.3 24.1 31.5 25 31.7 ...
##  $ V23: num  30.7 35.1 35 35.3 35.9 ...
##  $ V24: num  29.5 30.5 32.4 31.6 33 ...
##  $ V25: num  33 32 24.3 29.3 24.6 ...
##  $ V26: num  25 33.7 30.2 34.2 33.2 ...
##  $ V27: num  28.9 25.6 31.2 26.5 27.4 ...
##  $ V28: num  24.3 30.5 26.7 32.2 32.6 ...
##  $ V29: num  26.1 33.6 31.5 25.5 35.9 ...
##  $ V30: num  34.9 25.1 28.9 24.8 28 ...
##  $ V31: num  25 34.1 27.3 27.6 33.1 ...
##  $ V32: num  26.6 32.6 24.2 28.4 33.4 ...
##  $ V33: num  35.7 28.3 27 32.4 26.9 ...
##  $ V34: num  28.4 26.1 25.3 27 30.2 ...
##  $ V35: num  29.1 26.9 31.6 35.9 29.7 ...
##  $ V36: num  28.2 31.5 24.7 35.1 30.9 ...
##  $ V37: num  26.2 33.1 27.5 24.4 24.5 ...
##  $ V38: num  33.3 24.1 24.2 27.6 34 ...
##  $ V39: num  31 28.5 26.8 27.8 33.3 ...
##  $ V40: num  27 25.8 35.1 29.9 33.2 ...
##  $ V41: num  35.5 36 32.6 32.4 31.3 ...
##  $ V42: num  26.2 26.5 31.1 26.9 27.9 ...
##  $ V43: num  29 24.9 26.4 31.3 35.1 ...
##  $ V44: num  32 26 28.1 29.4 35.1 ...
##  $ V45: num  31.1 32.8 31.4 34.3 33.8 ...
##  $ V46: num  34.3 28.5 27.3 24.7 25.9 ...
##  $ V47: num  28.1 26.3 29.6 35.8 29.1 ...
##  $ V48: num  28.9 30.6 36 31.9 24.3 ...
##  $ V49: num  35.5 29 34.1 34.2 32.3 ...
##  $ V50: num  29.7 29.4 27.2 31.2 34.9 ...
##  $ V51: num  31.4 32.6 33.6 34.6 27.7 ...
##  $ V52: num  24.6 31 26.6 28.7 28 ...
##  $ V53: num  33.7 26.6 25.5 28.3 35.7 ...
##  $ V54: num  25 28.4 32.5 31.6 27.6 ...
##  $ V55: num  34.9 33.7 25.6 34.6 35.3 ...
##  $ V56: num  35 26.4 30 32.5 30 ...
##  $ V57: num  32.5 28.5 31.4 31 34.2 ...
##  $ V58: num  33.4 34.2 33.9 24.9 33.1 ...
##  $ V59: num  25.5 32.1 29.5 27.4 31.1 ...
##  $ V60: num  25.9 26.7 29.3 25.3 31 ...
head(sc)
##        V1      V2      V3      V4      V5      V6      V7      V8      V9
## 1 28.7812 34.4632 31.3381 31.2834 28.9207 33.7596 25.3969 27.7849 35.2479
## 2 24.8923 25.7410 27.5532 32.8217 27.8789 31.5926 31.4861 35.5469 27.9516
## 3 31.3987 30.6316 26.3983 24.2905 27.8613 28.5491 24.9717 32.4358 25.2239
## 4 25.7740 30.5262 35.4209 25.6033 27.9700 25.2702 28.1320 29.4268 31.4549
## 5 27.1798 29.2498 33.6928 25.6264 24.6555 28.9446 35.7980 34.9446 24.5596
## 6 25.5067 29.7929 28.0765 34.4812 33.8000 27.6671 30.6122 25.6393 30.1171
##       V10     V11     V12     V13     V14     V15     V16     V17     V18
## 1 27.1159 32.8717 29.2171 36.0253 32.3370 34.5249 32.8717 34.1173 26.5235
## 2 31.6595 27.5415 31.1887 27.4867 31.3910 27.8110 24.4880 27.5918 35.6273
## 3 27.3068 31.8387 27.2587 28.2572 26.5819 24.0455 35.0625 31.5717 32.5614
## 4 27.3200 28.9564 28.9916 29.9578 30.2773 30.4447 24.3037 24.3140 35.0966
## 5 34.2366 27.9634 25.3216 35.4154 34.8620 25.1472 29.4686 33.1739 31.1274
## 6 26.5188 30.1524 27.8514 29.5582 32.3601 29.2064 26.1001 33.4677 33.9010
##       V19     V20     V21     V22     V23     V24     V25     V26     V27
## 1 27.6623 26.3693 25.7744 29.2700 30.7326 29.5054 33.0292 25.0400 28.9167
## 2 35.4102 31.4167 30.7447 24.1311 35.1422 30.4719 31.9874 33.6615 25.5511
## 3 31.0308 34.1202 26.9337 31.4781 35.0173 32.3851 24.3323 30.2001 31.2452
## 4 25.3679 32.0968 33.3303 25.0102 35.3155 31.6264 29.2806 34.2021 26.5077
## 5 31.3701 26.5173 28.6486 31.6565 35.9497 33.0321 24.6081 33.2025 27.4335
## 6 29.2674 34.8311 31.9815 26.4960 32.6645 27.7188 35.7385 32.8309 30.1509
##       V28     V29     V30     V31     V32     V33     V34     V35     V36
## 1 24.3437 26.1203 34.9424 25.0293 26.6311 35.6541 28.4353 29.1495 28.1584
## 2 30.4686 33.6472 25.0701 34.0765 32.5981 28.3038 26.1471 26.9414 31.5203
## 3 26.6814 31.5137 28.8778 27.3086 24.2460 26.9631 25.2919 31.6114 24.7131
## 4 32.2279 25.5265 24.8240 27.5587 28.3714 32.3667 26.9752 35.9346 35.1146
## 5 32.6355 35.8773 28.0295 33.1247 33.4129 26.9245 30.2123 29.6526 30.8644
## 6 30.5593 27.3321 27.4559 24.2361 34.7268 29.9207 27.2730 35.9963 32.3917
##       V37     V38     V39     V40     V41     V42     V43     V44     V45
## 1 26.1927 33.3182 30.9772 27.0443 35.5344 26.2353 28.9964 32.0036 31.0558
## 2 33.1089 24.1491 28.5157 25.7906 35.9519 26.5301 24.8578 25.9562 32.8357
## 3 27.4809 24.2075 26.8059 35.1253 32.6293 31.0561 26.3583 28.0861 31.4391
## 4 24.3749 27.6083 27.8433 29.8557 32.4185 26.8908 31.3209 29.3849 34.3336
## 5 24.5119 33.9931 33.3094 33.2040 31.2651 27.9072 35.1110 35.0757 33.8330
## 6 27.1390 26.4589 25.0466 35.5002 27.9961 25.8897 31.3951 30.7583 34.9652
##       V46     V47     V48     V49     V50     V51     V52     V53     V54
## 1 34.2553 28.0721 28.9402 35.4973 29.7470 31.4333 24.5556 33.7431 25.0466
## 2 28.5322 26.3458 30.6213 28.9861 29.4047 32.5577 31.0205 26.6418 28.4331
## 3 27.3057 29.6082 35.9725 34.1444 27.1717 33.6318 26.5966 25.5387 32.5434
## 4 24.7381 35.7690 31.8725 34.2054 31.1560 34.6292 28.7261 28.2979 31.5787
## 5 25.9481 29.1348 24.2875 32.3223 34.9244 27.7218 27.9601 35.7198 27.5760
## 6 28.0919 35.6706 33.4401 28.4580 31.1795 26.9458 35.8381 26.7134 25.1641
##       V55     V56     V57     V58     V59     V60
## 1 34.9318 34.9879 32.4721 33.3759 25.4652 25.8717
## 2 33.6564 26.4244 28.4661 34.2484 32.1005 26.6910
## 3 25.5772 29.9897 31.3510 33.9002 29.5446 29.3430
## 4 34.6156 32.5492 30.9827 24.8938 27.3659 25.3069
## 5 35.3375 29.9993 34.2149 33.1276 31.1057 31.0179
## 6 27.3410 25.2093 33.4669 24.1094 33.1669 35.4907
# Let's see one sample from each class
idx <- c(1,101,201,301,401,501)

# Note that in order to plot multiple time series (mts) with the plot.ts()
# function, each time series must be a column of the data frame. So,
sample1 <- t(sc[idx, ])  # sample1 now holds one series per column (mts form)
plot.ts(sample1, main="")

Now that we’ve seen what a typical time series of each class looks like,
let’s sample 3 instances from each of the normal, decreasing-trend, and
upward-shift classes, as done in the SAX paper above.

set.seed(1024)
s <- sample(1:100, 3)
idx <- c(s, 300+s, 400+s)  
samples <- sc[idx,]
plot.ts(t(samples))

# Now we have 3 instances from each of the "normal", "decreasing trend" and
# "upward shift" classes. However, instance '435' of the upward-shift class
# looks more like "increasing trend" than "upward shift". Let's see what the
# algorithms say about it.

Since the “SAX” paper used hierarchical clustering with “complete” linkage
as the agglomeration technique, we’ll do the same here.

1. hierarchical clustering using Euclidean Distance

dist(samples) # each row of the "samples" is a time series instance with 60 points.
##            22        98        35       322       398       335       422
## 98   41.03149                                                            
## 35   44.99728  40.46614                                                  
## 322 117.00905 114.73477 119.21542                                        
## 398 124.01885 126.79455 128.32119  39.61302                              
## 335 121.31889 122.71797 125.27042  43.90163  37.52390                    
## 422 110.17739 108.74740 103.94139 207.80285 215.04789 214.80720          
## 498  90.49287  87.61505  86.47926 185.48366 193.61355 192.32379  60.13164
## 435  72.30228  64.89646  61.77019 164.29250 175.85534 171.96494  75.86968
##           498
## 98           
## 35           
## 322          
## 398          
## 335          
## 422          
## 498          
## 435  65.15909
# Note that dist() treats each row as one time series; hclust() then takes
# the resulting distance object
?hclust  # see "plot()" usage 
## starting httpd help server ... done
hc <- hclust(dist(samples), method="complete")  # proxy:::dist(sc)
par(mfrow=c(1,1))
plot(hc, main="Hierarchical Clustering with Euclidean")

# cut the tree to 3 clusters
observedLabels = rep(1:3, each=3) # true class labels of the 9 samples, in order
rect.hclust(hc, k=3)

memb <- cutree(hc, k=3) ; memb  # instance 435 is grouped into cluster 1
##  22  98  35 322 398 335 422 498 435 
##   1   1   1   2   2   2   3   3   1
table(observedLabels, memb) # observedLabels and memb should be the same length 
##               memb
## observedLabels 1 2 3
##              1 3 0 0
##              2 0 3 0
##              3 1 0 2

435 is grouped with (22, 98, 35) rather than with (422, 498). We can see
why from the dist(samples) output above.

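To pull those distances out directly, index the full distance matrix (a
small sketch; the row names are the original sample indices):

d <- as.matrix(dist(samples))
round(d["435", ], 1)  # 435's distance to every other sample
# Complete linkage merges by the largest pairwise distance: the maximum
# from 435 to {22, 98, 35} (~72.3) is smaller than the maximum from 435
# to {422, 498} (~75.9), so 435 joins the (22, 98, 35) cluster.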

2. Hierarchical Clustering with DTW Distance

library(dtw) 
## Warning: package 'dtw' was built under R version 3.1.2
## Loading required package: proxy
## Warning: package 'proxy' was built under R version 3.1.2
## 
## Attaching package: 'proxy'
## 
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## 
## The following object is masked from 'package:base':
## 
##     as.matrix
## 
## Loaded dtw v1.17-1. See ?dtw for help, citation("dtw") for use in publication.
distMatrix_dtw <- dist(samples, method="DTW")
hc_dtw <- hclust(distMatrix_dtw, method="complete")
par(mfrow=c(1,1))
plot(hc_dtw,  main="DTW Distance Based")

# cut tree to get 3 clusters
rect.hclust(hc_dtw, k=3)

memb_dtw <- cutree(hc_dtw, k=3)
table(observedLabels, memb_dtw)
##               memb_dtw
## observedLabels 1 2 3
##              1 3 0 0
##              2 0 3 0
##              3 1 0 2
# DTW-based clustering gives the same result as the Euclidean-based one

3. Hierarchical Clustering with HAAR Wavelet

library(wavelets)
## Warning: package 'wavelets' was built under R version 3.1.2
samples_HAAR <- NULL
for (i in 1:nrow(samples)) {
  a <- t(samples[i,])
  wt <- dwt(a, filter="haar", boundary="periodic")  # dwt() operates on time series columns
  samples_HAAR <- rbind(samples_HAAR, unlist(c(wt@W, wt@V[[wt@level]])))
}
# @W : wavelet coefficients, @V : scaling coefficients
# Since the series length (60) is not a power of two, the DWT cannot yield
# exactly as many coefficients as the original series has data points.
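# A quick check of this (sketch), mirroring the loop's call on one series:
wt1 <- dwt(t(samples[1, ]), filter="haar", boundary="periodic")
sapply(wt1@W, length)        # wavelet coefficients per level
length(wt1@V[[wt1@level]])   # scaling coefficients kept at the deepest level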

samples_HAAR <- as.data.frame(samples_HAAR)
row.names(samples_HAAR) <- row.names(samples)
dist_dwt <- dist(samples_HAAR)
hc_dwt <- hclust(dist_dwt, method="complete")
par(mfrow=c(1,1))
plot(hc_dwt, main="HAAR wavelet distance")

# cut tree to get 3 clusters
rect.hclust(hc_dwt, k=3)

memb_dwt <- cutree(hc_dwt, k=3)
table(memb_dwt)
## memb_dwt
## 1 2 3 
## 4 3 2
table(observedLabels, memb_dwt)
##               memb_dwt
## observedLabels 1 2 3
##              1 3 0 0
##              2 0 3 0
##              3 1 0 2

Up to this point, all three distance measures (Euclidean, DTW, Haar
wavelet) return the same clustering result.
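As a quick cross-check (a sketch), tabulating one membership vector
against another shows whether the partitions coincide: a cross-table that
is a permutation matrix means identical partitions up to label names.

table(memb, memb_dtw)
table(memb, memb_dwt)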

4. Hierarchical Clustering with SAX on raw data

library(seewave)
## Warning: package 'seewave' was built under R version 3.1.3
## 
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Welcome to seewave ! 
## The package is regularly updated, please check for new version [http://rug.mnhn.fr/seewave]
## Thanks to use the right reference when citing seewave in publications
## See citation('seewave')
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
plot.ts(t(samples))  # See the TS once more 

SAX(as.numeric(samples[4,]), alphabet = 10, PAA = 10)  # a test
##  [1] "j" "i" "h" "g" "e" "f" "d" "c" "b" "a"
samples_SAX <- NULL
for (i in 1:nrow(samples)) {
  samples_SAX <- rbind(samples_SAX, 
                       SAX(as.numeric(samples[i,]), alphabet = 10, PAA = 10))
}
# samples_SAX <- as.data.frame(samples_SAX)
# apply(samples_SAX, 1, charToRaw)

# Convert the SAX letters 'a'..'j' to the integers 1..10 via their ASCII codes
samples_SAX_Num = data.frame()
for (i in 1:nrow(samples_SAX)) {
  for (j in 1:ncol(samples_SAX)) {
  samples_SAX_Num[i,j] <- as.integer(charToRaw(samples_SAX[i,j])) - 96
  }
}
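# An equivalent vectorized alternative (sketch): match() against the
# built-in 'letters' vector yields the same 1..10 codes.
samples_SAX_Num2 <- as.data.frame(matrix(match(samples_SAX, letters),
                                         nrow = nrow(samples_SAX)))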

row.names(samples_SAX_Num) <- row.names(samples)
dist_sax <- dist(samples_SAX_Num)
hc_sax <- hclust(dist_sax, method="complete")
par(mfrow=c(1,1))
plot(hc_sax, main="SAX Distance")
# Euclidean distance on the integer-coded SAX symbols clusters correctly.

# cut tree to get 3 clusters
rect.hclust(hc_sax, k=3)

memb_sax <- cutree(hc_sax, k=3)
table(observedLabels, memb_sax)  
##               memb_sax
## observedLabels 1 2 3
##              1 3 0 0
##              2 0 3 0
##              3 0 0 3

The SAX-based distance measure clusters the series as we originally
wanted. However, much of its power comes from the inherent smoothing
effect of its dimensionality reduction, as noted by the authors; a small
illustration follows.
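To see this effect, compare one raw series with its 10-segment PAA means,
the kind of per-segment averaging SAX performs before discretizing (a
small sketch on the raw scale; the step curve is built inline, not via
seewave):

x <- as.numeric(samples[4, ])
paa <- tapply(x, rep(1:10, each = 6), mean)  # 60 points -> 10 segment means
plot(x, type = "l", main = "Raw series vs. PAA means")
lines(rep(paa, each = 6), col = "red", lwd = 2)  # PAA as a step function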

So, let’s apply normalization and smoothing to the raw time series first,
and then cluster.

5. Do Normalization and Smoothing first on the raw Time Series

We normalize each time series to have a mean of zero and a
standard deviation of one because it is meaningless to
compare time series with different offsets and amplitudes
- Jessica Lin & Eamonn Keogh

# R has scale() for normalization to mean 0 and sd 1.
# scale() operates on the columns of a data frame, hence the transposes.

tt = scale(t(samples))
samples_norm = t(tt)

apply(samples_norm, 1, sd)  # check that each sd == 1
##  22  98  35 322 398 335 422 498 435 
##   1   1   1   1   1   1   1   1   1
apply(samples_norm, 1, mean)  # check that each mean == 0 (numerically zero). Correct.
##            22            98            35           322           398 
## -7.959851e-17 -1.147718e-16 -3.515486e-16 -7.064661e-17  6.486420e-17 
##           335           422           498           435 
## -5.590508e-17  2.567391e-16  3.907591e-16 -3.503573e-16
plot.ts(t(samples_norm))  

# The normalized series keep the same shapes as the raw series, on a common scale

5.1 Do the clustering on the normalized “samples_norm” using Euclidean Distance

hc_norm <- hclust(dist(samples_norm), method="complete")  # proxy:::dist(sc)
plot(hc_norm, main="Normalized & Euclidean")   

clust = rect.hclust(hc_norm, k=3)

memb <- cutree(hc_norm, k=3)
table(observedLabels, memb) # clustering performance is very poor
##               memb
## observedLabels 1 2 3
##              1 1 1 1
##              2 3 0 0
##              3 0 0 3
sort(memb)  # show the groupings
##  22 322 398 335  98  35 422 498 435 
##   1   1   1   1   2   3   3   3   3

The result is worse than without normalization. Plain Euclidean distance
performs poorly on the normalized patterns. Why? Normalization removed the
absolute magnitudes (levels and trend amplitudes) that Euclidean distance
relied on to separate the classes; what remains is shape alone, and
point-wise Euclidean distance cannot align noisy shapes.
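A quick look at a few pairwise distances makes this concrete (a sketch;
row/column names are the original sample indices):

m_raw  <- as.matrix(dist(samples))
m_norm <- as.matrix(dist(samples_norm))
# Raw data: the large between-class entries come from level/trend differences
round(m_raw[c("22","98","35"), c("98","35","322")], 1)
# Normalized: the level information is gone, and the noisy normal-class
# series no longer sit especially close to one another
round(m_norm[c("22","98","35"), c("98","35","322")], 1)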

5.2 Do the clustering on the normalized “samples_norm” using DTW Distance

library(dtw) 
distMatrix <- dist(samples_norm, method="DTW")
hc_norm_dtw <- hclust(distMatrix, method="complete")
par(mfrow=c(1,1))
# plot(hc_norm_dtw, labels=observedLabels, main="")
plot(hc_norm_dtw, main="Normalized & DTW")  # looks promising

# cut tree to get 3 clusters
rect.hclust(hc_norm_dtw, k=3)

memb_dtw <- cutree(hc_norm_dtw, k=3)
table(memb_dtw)
## memb_dtw
## 1 2 3 
## 3 3 3
sort(memb_dtw)
##  22  98  35 322 398 335 422 498 435 
##   1   1   1   2   2   2   3   3   3
table(observedLabels, memb_dtw)  
##               memb_dtw
## observedLabels 1 2 3
##              1 3 0 0
##              2 0 3 0
##              3 0 0 3
# DTW-based dissimilarity on the normalized TS recovers all three classes perfectly.

DTW-based clustering, with its flexible, elastic matching, performs better
on the normalized time series data. And we haven’t even applied smoothing
yet; a minimal sketch follows.
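As a final sketch (not run in the original exercise), a simple moving
average is one way to add that smoothing step before normalizing and
clustering; the window width k = 5 is an arbitrary assumption here.

smooth_ma <- function(x, k = 5) as.numeric(stats::filter(x, rep(1/k, k), sides = 2))
samples_smooth <- t(apply(samples, 1, smooth_ma))
samples_smooth <- samples_smooth[, colSums(is.na(samples_smooth)) == 0]  # drop NA edges
samples_smooth_norm <- t(scale(t(samples_smooth)))  # z-normalize each series
hc_sm <- hclust(dist(samples_smooth_norm, method="DTW"), method="complete")
plot(hc_sm, main="Smoothed, Normalized & DTW")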

This simple exercise shows that judicious preprocessing of time series
often brings better clustering results.

Time series classification also benefits from such preprocessing.