Aviso: Se está a ler esta mensagem, provavelmente, o browser que utiliza não é compatível com os "standards" recomendados pela W3C. Sugerimos vivamente que actualize o seu browser para ter uma melhor experiência de utilização deste "website". Mais informações em webstandards.org.

Warning: If you are reading this message, probably, your browser is not compliant with the standards recommended by the W3C. We suggest that you upgrade your browser to enjoy a better user experience of this website. More informations on webstandards.org.

Script rmarkdown (Link)

---
title: "Processing big geographic data sets with R"
author: "Manuel Campagnolo (ISA/ULisboa)"
date: "September 12-14, 2018"
output:
  html_document:
    toc: true # table of content true
    toc_depth: 3  # upto three depths of headings (specified by #, ## and ###)
    number_sections: true  ## if you want number sections at each table header
    #theme: united  # many options for theme, this one is my favorite.
    #highlight: tango  # specifies the syntax highlighting style
---



```{r setup, include=FALSE, purl=FALSE}
options(width=80)
knitr::opts_chunk$set(comment = "", warning = FALSE, message = FALSE, echo = TRUE, tidy = TRUE, size="small", cache=FALSE, eval=TRUE)
```


# Packages
```{r}
library(raster)
library(rgdal)
library(sf) # new package for geographic vector datasets (replaces sp)
library(mapview) #mapView
library(RANN) # nn2
library(data.table) # fread
library(fasterize)
library(tidyverse) #  contains dplyr
library(knitr)
library(date)
library(igraph)
library(randomcoloR)
library(concaveman)
```

<!--
Comments:

input<-'big-data-read-files.rmd';
outputR <- paste(tools::file_path_sans_ext(input), 'R', sep = '.');
outputhtml <- paste(tools::file_path_sans_ext(input), 'html', sep = '.');

to run from command line:
> rmarkdown::render('big-data-read-files.rmd',output_file='big-data-read-files.html')

to create r code and comment everything else:
> knitr::purl(input,outputR,documentation=2,quiet=T)

To interrupt processing use the following chunck:
```{r, echo=FALSE}
#knitr::knit_exit()
```
-->

```{r, echo=FALSE}
setwd("\\\\pinea\\mlc\\Aulas\\CURSOS_R\\isa-setembro-2018-grandes-cdg\\aulas")
```

<!--
nice tutorial on sf/tidyverse: http://strimas.com/r/tidy-sf/
-->

# Problem: determining meaningful spatial-temporal patches of active fires

Let's recall the approximate distribution of satellite (VIIRS) active fire observations over Portugal in 2017: it gives some idea of the spatial distribution of fires but doesn't distinguish dates. The goal of the problem at hand is to determine meaningful spatial-temporal patches (or clusters) of active fires.

```{r}
af<-data.table::fread(file.path(getwd(),"datasets","viirs-fires-2017-world.csv"))
afpt<-af[latitude>36 & latitude<43 & longitude < -6 & longitude > -10 ] # Continental Portugal
latlon<-CRS("+proj=longlat +datum=WGS84")
y<-raster(crs=latlon,resolution=0.01,xmn=-10,xmx=-6,ymn=36,ymx=42)
r<-rasterize(x=data.frame(lon=afpt$longitude,lat=afpt$latitude),y=y,field=afpt$frp, fun=max)
mapview(r)
```

The goal is to define spatial-temporal patches of fire activity. Let's first convert the dates in the active fire data set into Julian dates. For *date::as.date* the starting day is January 1, 1960, although this is irrelevant for our example. One could additionally take into account the time of acquisition available in the file and create a time in hours or minutes for each active fires. We are not going to do that to keep the example as simple as possible.

```{r}
dt<-date::as.date(x=afpt$acq_date, order = "ymd") %>% as.integer()
hist(dt, xlabel="Julian dates (starting January 1, 1960", ylabel="number of active fires", main="VIIRS active fires: Portugal 2017")
```

We have three dimensions: one is time, which is measured in days, and the other two are in space. To measure distances on the surface, one should avoid geographic coordinates. We can re-project the data onto a local coordinate reference system as the official CRS for Portugal (PT-TM06/ETRS89, EPSG:3763).

To make the sequence of operations easier to follow, we use *magrittr*'s pipe operator for chaining commands.

```{r}
sfpt<- as.data.frame(afpt) %>% st_as_sf(coords = c("longitude","latitude") , crs = 4326) %>% st_transform(crs=3763)
```

# Determining spatial distances between active fires

Spatial-temporal patches depend on the distance between active fires. Since coordinates are now in meters, we set the parameter *d* with a value in meters. For the time dimension, we define a time gap *t* between active fires.

```{r}
d<-2000 # meters
t<-2 # days
K<-20 # neighbors to consider
```
Next we build a graph which vertices represent active fires. In the graph, active fires are going to be connected if the distance between them is shorter than *d* and if the tile gap is not larger than *t*. We start by determining the list of edges that connect vertices. This is going to be ultimately represented by a *data.table* with columns *v1* and *v2*, where each pair *(v1,v2)* represents an edge in the graph.

```{r}
knn<-nn2(data=st_coordinates(sfpt),k=K,searchtype="radius",radius=d)
head(knn$nn.idx[,1:10],5)
head(knn$nn.dists[,1:5],3)
```

Note that the distance to the nearest neighbor is always 0. This is because in general the nearest neighbor is the vertex itself (if there are two vertices with precisely the same coordinates, the nearest neighbor can be a distinct vertex, but the distance is still 0). The next step is to determine all the edges between neighbor observations. In order to do this we rearrange the columns of *knn$nn.idx*. The first command is to create a first column with the indices of all observations (*1,2,3...*). The next command moves down to the second column of *edges* all neighbors, so the resulting matrix *edges* has just two columns, where the first column has the indices of the observations (repeated *K=ncol(neigh)* times), and the second column has the respective *j-th* neighbor.

```{r}
neigh<-cbind(1:nrow(knn$nn.idx),knn$nn.idx)
head(neigh[,1:10],4)
edges<-data.table(v1=rep(neigh[,1],ncol(neigh)-1),v2=as.vector(neigh[,-1]))
dim(edges)
```
At this point, one should remove pairs where one vertex is missing (in that case the index in the second column is 0).

```{r}
edges[v2==0] # non edges
edges<-edges[v2!=0] # remove non edges
```

# Adding temporal constraints

So far, we only used spatial coordinates, but we want our patches to satisfy a temporal restriction as well. As mentioned above, two active fires should be connected only if their acquisition dates differs at most by *t* days. In order to enforce this, we first need to add to our matrix the acquisition dates available in vector *dt*. Note, and this is a key aspect of the construction, that observations are indexed as in the original data set (the same order and length). Therefore the index of each observation in *edges* is precisely the same index than in *dt*.

```{r}
edges[,d1:=dt[v1]] # acquisition dates
edges[,d2:=dt[v2]] # acquisition dates
edges<-edges[abs(d1-d2)<=t]
```
Just for sake of clarity, we also add the information about the coordinates of the observations, although this is not necessary for the task at hand.
```{r}
xy<-st_coordinates(sfpt) # for clarity, but optional
edges[,x1:=xy[v1,"X"]] # coordinates, optional
edges[,y1:=xy[v1,"Y"]] # coordinates, optional
edges[,x2:=xy[v2,"X"]] # coordinates, optional
edges[,y2:=xy[v2,"Y"]] # coordinates, optional
edges[1:3,]
```

Let's double check that all distances are within  *d* and that all date gaps are within *t*:
```{r}
max(edges[,sqrt((x1-x2)^2+(y1-y2)^2)])
max(edges[,abs(d1-d2)])
sum(duplicated(edges)) # check that there are no duplicated edges
```

# Building and exploring the graph

We have now a list of `r nrow(edges)` and we can build a graph that represents active fires and their proximity relations. There are many types of graphs, including *directed* and *undirected* graphs. In our case, we only need an *undirected graph*, where the directions on the edges are not relevant. Note that the construction above doesn't guarantee that there are no multiple edges (i.e. two edges *(v1,v2)* and *(v2,v1)*). We could remove those edges directly from data.table *edges* but it is better to use available functions from package [igraph](http://igraph.org/r/doc/index.html) which are properly designed for that goal. In our case, we use function *igraph::simplify* that removes loops and multiple edges.

An introduction about drawing graphs with *igraph* can be found [here](http://igraph.org/r/doc/plot.common.html).

```{r}
Gall<-igraph::graph_from_data_frame(d=edges,directed=FALSE)
G<-igraph::simplify(Gall) # removes loops and multiple edges
```

Let's just confirm that the names of the vertices in the graph correspond to the indices of the vertices in the table *edges*.

```{r}
v<-as.integer(vertex_attr(G)$name) # vertex indices which match row indices of edges
range(v)== range(edges[,.(v1,v2)]) # confirm index numering
```

Next, we use a *igraph* functions that returns the connected components in *G*. Those are the clusters of vertices that are connected in the graph. Notice that the result is a named vector with one value per vertex: the name is the name of the vertex (of character type) and the value is the component number. This tell us what is the component for each vertex.

```{r}
CL<-igraph::components(G)
head(CL$membership) # returns vertice name and vertice cluster
head(names(CL$membership)) # names (character)
```

Let's consider one component of the graph and see what it looks like. It contains all vertices in that component and the edges between them. Function *induced_graph* returns the sub-graph with a subset of edges and all connections between them.

```{r}
head(CL$csize)
idxcomp<-2500
# vertices of the  component
vids<-as.integer(names(CL$membership)[CL$membership==idxcomp])
# subgraph
S<-igraph::induced_subgraph(G,vids=which(CL$membership==idxcomp))
```
We can plot the graph, and add to the plot additional information using labels and colors. Here, we label each vertex with the active fire acquisition date, and we color each vertex according to the FRP (Fire Radiative Power, expressed in Mega Watts). A vertex attribute (e.g. color) can be set for the graph with *set.vertex.attribute*. In alternative, the attribute can be set as a *plot* option.

```{r}
# colors for subgraph: depending on FRP
cores<-colorRampPalette(c(rgb(1,1,0), rgb(1,0,0)))(length(V(S)))[order(sfpt$frp[vids])]
S <- set.vertex.attribute(S, "color", value=cores)
# color is a new attribute
vertex.attributes(S)
# plot using xy coordinates, create layout for ploting
lxy<-st_coordinates(sfpt)[vids,]
```

Plotting defining labels and colors for the vertices, and *lxy* layout for the vertices.
```{r,eval=FALSE}
plot(S,layout=lxy,vertex.label=sfpt$acq_date[vids],vertex.size=30,vertex.color=cores)
plot(S,layout=lxy,vertex.label=sfpt$acq_time[vids],vertex.size=30,vertex.color=cores)
```
 
# Mapping spatial-temporal clusters of active fires

Let's go back to our *sf* object and add the information about the clusters in *sfpt*. This is very easy to do since it is just a new column with the cluster membership. Note that this works because the graph *G* was originally derived from the table *edges*, which, in turn, was build from all points in *sfpt*. Therefore, the vertex' indices match the rows of *sfpt*. For sake of completude, we re-compute components from graph *G* and then we incorporate that information in the spatial object *sfpt*.

```{r}
CL<-igraph::components(G)
sfpt$clust<-CL$membership
```
 
Let's then make the first map of spatial temporal patches. Here we assign the same color to all points that belong to the same patch, so we can visualize them in the map. We use random colors to make it unlikely that two neighbor patches have similar colors.
 
```{r}
n <- 20
palette <- distinctColorPalette(n)
cores<-rep(palette,times=max(sfpt$clust)/n)
mapview(sfpt,color=cores[sfpt$clust],cex=4)
```
 
Below, we consider a more sophisticated way of determining clusters: instead of considering connected components, we use Louvain algorithm that returns "dense" components. The map shows that this prevents long clusters with weak connections from occurring. Notice that the only difference with what we did earlier is replacing *components(G)* by *cluster_louvain(G)* since the graph is the same, and only the decomposition algorithm changes. In the map, we compare both results and we can see the result of creating "dense components" instead of the simplest form of "connected components".
 
```{r}
# Louvain
CLlouvain<-igraph::cluster_louvain(G) # returns community object; can use positive weights
sfpt2<-sfpt # make a copy of sfpt for mapview
sfpt2$clust<-CLlouvain$membership
mapview(sfpt,color=cores[sfpt$clust],cex=4)+mapview(sfpt2,color=cores[sfpt2$clust],cex=4)
```
 

# Converting clusters of active fires into polygon type features

Finally, we might want to represent the clusters as polygons (instead of points). We first determine new "MULTIPOINT" features, where each feature contains all points that belong to a cluster, and we select the features with at least three points (otherwise the polygons would be ill-defined). Function *summarize* below is used to determine the most frequent (the mode) date in each group (variable *date*), and the number of active fires per group (variable *N*).
 
```{r}
sfmpt<-sfpt2 %>% group_by(clust) %>% summarize(date=modal(acq_date), N=length(acq_date),do_union=TRUE) %>% st_cast(to="MULTIPOINT")
# number points in each feature
dimfeat<-sapply(st_geometry(sfmpt),function(x) length(x))
sfmpt3<-sfmpt[dimfeat>=6,] # two coordinates per point
```

There are many ways to create polygons from sets of points. The most direct way is to compute what is called the *convex hull*, which is the smallest convex polygon that contain all the points. R has a very efficient function to do this which is named *chull*. This function gets as input a set of points (represented by a two column matrix) and returns the subset of those points that define the convex hull. Since *chull* is applied just to the coordinates (not to the spatial object of call *sf*), we define a function that reads a *MULTIPOINT* feature, extracts the coordinates, applies *chull*, and returns a spatial object with the same coordinate reference system as the feature.

```{r}
fconv<-function(feat)
{
  xy<-st_coordinates(feat)[,c("X","Y")];
  pol<-xy[chull(xy),];
  st_polygon(list(rbind(pol,pol[1,])))
}
```
Now, we just need to apply *fconv* to all our clusters of points. Since the geometry of a *sf* object is a list of geometries, we want to apply *fconv* to each element of the list returned by *st_geometry(sfmpt3)*, where *sfmpt3* is our *MULTIPOLYGON* spatial object that contains all clusters with at least three points. This is done with *lapply* that read a list, applies function *FUN* to each element of the list, and then returns the new list of the results. The last command shows the result (the polygons that correspond to the clusters in map view).

```{r}
sfconv<-sfmpt3 # make a copy
# replace geometry by a polygon geometry applying convex hull to the groups of points
st_geometry(sfconv)<-st_sfc(lapply(st_geometry(sfmpt3),FUN=fconv), crs=st_crs(sfmpt)$proj4string)
mapview(sfconv)+mapview(sfpt2,color=cores[sfpt2$clust],cex=4)

```

The problem with determining the convex hull it that, usually, patches are not convex, so convex hulls tend to over evaluate the actual area of the patch of points. We can replace *chull* by a function that return what is called a *concave hull*. The concave hull is not uniquely defined as the convex hull is, so it depends on a few parameters that determine how "deep" the concave hall penetrates into the convex hull.

Package *concaveman* implements the fast algorithm [Park and Ho](http://www.iis.sinica.edu.tw/page/jise/2012/201205_10.pdf) for concave hulls. The parameter *concavity* varies from 1 to infinity and is 2 by default. Large values of the parameter approach the convex hull solution.  Below, we replace function  *chull* by function *concaveman* from package with the same name.
 
```{r}
fconc<-function(feat)
{
  xy<-st_coordinates(feat)[,c("X","Y")];
  pol<-concaveman(xy,concavity=3);
  st_polygon(list(rbind(pol,pol[1,])))
}
sfconc<-sfmpt3 # make a copy
st_geometry(sfconc)<-st_sfc(lapply(st_geometry(sfmpt3),FUN=fconc), crs=st_crs(sfmpt)$proj4string)
```

```{r,echo=FALSE, eval=FALSE}
aa<-st_read(file.path(getwd(),"datasets","Areas_Ardidas_2017_Portugal_Continental_v1.shp"),quiet=TRUE)
save(aa,sfpt2,sfconc,sfconv,file="patches.RData")
setwd("\\\\pinea\\mlc\\Aulas\\CURSOS_R\\isa-setembro-2018-grandes-cdg\\aulas")
load("patches.RData")
library("htmlwidgets")
library(mapview)
library(sf)
library("randomcoloR")
n <- 20
palette <- distinctColorPalette(n)
cores<-rep(palette,times=max(sfpt2$clust)/n)
aa<-st_read(file.path(getwd(),"datasets","Areas_Ardidas_2017_Portugal_Continental_v1.shp"),quiet=TRUE)
m<-mapview(sfpt2,color=cores[sfpt2$clust],cex=4)+mapview(aa,col.regions=distinctColorPalette(20))+mapview(sfconc,col.regions="gray")
# create html file from map m
saveWidget(m@map, file="areas_ardidas_pt_2017.html", selfcontained = FALSE)
```

# Comparison with the actual map of burned areas for Portugal in 2017

We look at the result and compare with the actual map of burned areas in Portugal for 2017, which are plotted using random colors. There are some main differences at locations where active fires are not available due to cloud cover or smoke plumes.

```{r}
aa<-st_read(file.path(getwd(),"datasets","Areas_Ardidas_2017_Portugal_Continental_v1.shp"),quiet=TRUE)
n <- 20
palette <- distinctColorPalette(n)
cores<-rep(palette,times=max(sfpt2$clust)/n)
mapview(sfpt2,color=cores[sfpt2$clust],cex=4,layer.name="active fires")+mapview(aa,col.regions=distinctColorPalette(20),layer.name="burned areas")+mapview(sfconc,col.regions="gray", layer.name="s-p patches")
```

Let's validate numerically those results. If we consider that the burned area map *aa* is the correct reference, we would like to know what is the proportion of the actual burned area that is classified as correct by the simple active fire based algorithm, and the area that is wrongly classified as burned by the active fire algorithm. In short, we want to compute a *error matrix* with two rows (burned and unburned according to the reference) and two columns (burned or not burned according to the active fire classification) that give all the relevant information.

One possibility to do this is to compute the intersection (with *st_intersection*) and the differences (with *st_difference*) and then compute the resulting areas with *st_area*. However, this is typically slow for a large number of polygons.

A much faster alternative is to compute an approximation of those areas by first rasterizing the polygons. In order to do that, we first create an empty raster with a given resolution. beow, we start by defining the raster over geographic coordinates since we know the extent we want to consider, but we could have started with cartographic coordinates right away. Funstion *st_transform* is applied to ensure that *aa* and *sfconc* have the same CRS.

```{r}
latlon<-CRS("+proj=longlat +datum=WGS84")
r<-raster(crs=latlon,resolution=0.0005,xmn=-10,xmx=-6,ymn=36,ymx=42)
r<-projectRaster(from=r, crs=CRS(st_crs(aa)$proj4string))
sfconc<-st_transform(sfconc,crs=st_crs(aa)$proj4string)
```

The next step is to rasterize both polygon data sets. This can be done very efficiently by *fasterize* of the package with the same name. To apply it, we first create a column with values 1, so fasterize assign that value to the pixels that overlap the polygons. 

```{r}
system.time({
sfconc$new<-1
rconc<-fasterize(sf=sfconc,raster=r,field="new",fun="max",background=0)
aa$new<-1
raa<-fasterize(sf=aa,raster=r,field="new",fun="max",background=0)
})
```

Finally, we need to count how many pixels occur in each of the four possible combinations. This can be done very easily and very fast using a *data.table*, creating it and grouping it by the possible values of *aa* and *conc*. This should take less than 10 seconds for approximately  100 million pixels.

```{r}
DT<-data.table(aa=values(raa),conc=values(rconc))
DT[,.N,by=.(aa,conc)]
```
If we want a more precise estimate, we can reduce the resolution o raster *r* and repeat the procedure.