Script Rmarkdown (Link)

The code below (in Rmarkdown format) can be copied and edited; it should then be processed with "knitr" to produce an HTML file:

 

#####################

 

---
title: "Processing big geographic data sets with R"
subtitle: "Read and display geographic data sets"
author: "Manuel Campagnolo (ISA/ULisboa)"
date: "September 12-14, 2018"
output:
  html_document:
    toc: true # table of contents
    toc_depth: 3  # up to three levels of headings (#, ## and ###)
    number_sections: true  # number sections at each header
    #theme: united  # many options for theme, this one is my favorite.
    #highlight: tango  # specifies the syntax highlighting style
---



```{r setup, include=FALSE, purl=FALSE}
options(width=80)
knitr::opts_chunk$set(comment = "", warning = FALSE, message = FALSE, echo = TRUE, tidy = TRUE, size="small", cache=FALSE, eval=TRUE)
```


# Packages
```{r}
library(raster)
library(rgdal) # read/write with GDAL
library(sf) # vector data (replaces sp)
library(mapview) #mapView
library(RANN) # nn2
library(data.table) # fread
library(velox) # alternative to raster
library(fasterize) # fast rasterization of polygons
library(viridisLite) # colors (used in mapview)
```


<!--
Comments:

input<-'big-data-read-files.rmd';
outputR <- paste(tools::file_path_sans_ext(input), 'R', sep = '.');
outputhtml <- paste(tools::file_path_sans_ext(input), 'html', sep = '.');

to render from the R console:
> rmarkdown::render('big-data-read-files.rmd',output_file='big-data-read-files.html')

to create r code and comment everything else:
> knitr::purl(input,outputR,documentation=2,quiet=T)

To interrupt processing use the following chunk:
```{r, echo=FALSE}
#knitr::knit_exit()
```
-->

```{r, echo=FALSE}
setwd("\\\\pinea\\mlc\\Aulas\\CURSOS_R\\isa-setembro-2018-grandes-cdg\\aulas")
library(knitr)
```

This short introduction shows how to read, write and display large geographic data sets with R. A very complete and easy-to-follow on-line resource is the eBook by [Lovelace, Nowosad and Muenchow](https://geocompr.robinlovelace.net/). Package *sp* has long been the main package for processing vector spatial data, but it is progressively being abandoned in favor of the more recent package *sf* ([Simple Features for R](https://r-spatial.github.io/sf/articles/sf1.html)). For raster data, package *raster* is still widely used, although some alternatives are available.

# Read and display big geographic data sets

## Vector data: the *sf* package

Let's read a reasonably large shapefile: a subset of classes of COS2015 (DGT) corresponding to the *flammable* classes, i.e. the areas in Continental Portugal essentially labelled "floresta" (forest), "matos" (shrubland) and "vegetacao esparsa" (sparse vegetation). Below, we compare the elapsed time to read the shapefile with *rgdal::readOGR*, which returns a *sp* object in R, and with *sf::st_read*, which returns a *sf* object. For large data sets, *st_read* is much faster.

```{r}
system.time({flam.sp<-rgdal::readOGR(dsn=file.path(getwd(),"datasets"),layer="Flam2015v4",verbose = FALSE)}) # 21.27
flam.sp
```

```{r}
system.time({flam.sf<-sf::st_read(dsn=file.path(getwd(),"datasets","Flam2015v4.shp"),quiet=TRUE)}) # 4.32
print(flam.sf,n=5)
```


A *sf* object has a table structure: the attribute table of the data is extended with a special column called *geometry*. The remaining columns are the features' attributes, and each row of the table describes one feature. The main simple feature geometry types are *POINT*, *LINESTRING*, *POLYGON*, *MULTIPOINT*, *MULTILINESTRING*, *MULTIPOLYGON* and *GEOMETRYCOLLECTION* as the mixed type. The geometry of the features in a *sf* object can be extracted with *st_geometry*, which returns a list of the geometries of all features (the list is an object of class *sfc*, for simple feature geometry list-column).
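
As a quick illustration, here is a minimal sketch using the *flam.sf* object read above:
```{r, eval=FALSE}
geom <- st_geometry(flam.sf) # sfc: the list-column of geometries
class(geom)                  # e.g. "sfc_MULTIPOLYGON" "sfc"
geom[[1]]                    # geometry of the first feature
```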

If one relies on packages that can only be applied to *sp* objects, it can be useful to convert objects from *sf* to *sp* and vice-versa. This can be done with the following commands:
```{r,eval=FALSE}
as(x, "Spatial") # x is a sf object
st_as_sf(x) # x is a sp object
```

Let's use *sf* functions for some data exploration. For instance, one may want to determine the largest feature in *flam.sf*. This can easily be done with *st_area*, which returns the areas of all features. Package *sf* supports all spatial methods of the GIS standard ISO 19125-1:2004.

```{r}
areas<-st_area(flam.sf)
bigflam<-flam.sf[which.max(areas),]
```

There are many ways of displaying vector data sets (see [here](https://cran.r-project.org/web/packages/sf/vignettes/sf5.html)).
To display geographic data sets interactively in the viewer window, combining them with background information, one can use *mapview*.

```{r, eval=TRUE}
mapview(bigflam)
```

Another common task is to combine several geographic data sets, possibly with distinct coordinate reference systems. Let's read the administrative map of Portugal (CAOP) and determine the flammable areas that intersect one of its municipalities ("concelhos").

```{r, eval=TRUE}
caop<-sf::st_read(dsn=file.path(getwd(),"datasets","Cont_AAD_CAOP2015.shp"),quiet=TRUE)
print(caop,n=3)
# reproject caop to the CRS of flam.sf
caop.proj<-sf::st_transform(caop, crs=sf::st_crs(flam.sf))
concelho<-caop.proj[caop.proj$Concelho=="MONCHIQUE",]
flam.concelho<-sf::st_intersection(flam.sf,concelho)
mapview(concelho,col.regions="blue")+mapview(flam.concelho,col.regions="green")
```

## Raster data: the *raster* package

Next, we use package *raster* to read and process raster data. This package can deal with very large data sets since it does not need to load the whole data set into memory. If one needs to read a very large file, *raster* can be used to create the connection, and the data can then be loaded in blocks of rows with *getValues* and processed one block at a time.
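
For instance, here is a minimal sketch of block-wise processing, computing the mean of one of the Landsat bands used below without ever loading the whole raster (*blockSize* suggests a convenient partition into blocks of rows):
```{r, eval=FALSE}
r <- raster(file.path(getwd(),"datasets","LC82030332014151LGN00_sr_band4.tif")) # connection only
bs <- blockSize(r) # suggested blocks of rows
total <- 0; n <- 0
for (i in seq_len(bs$n)) {
  v <- getValues(r, row=bs$row[i], nrows=bs$nrows[i]) # load one block of rows
  total <- total + sum(v, na.rm=TRUE)
  n <- n + sum(!is.na(v))
}
total/n # mean of the band
```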

In the example below the data set is not too big, and therefore it can be loaded into memory. There are three distinct files to be stacked (three bands of a Landsat 8 surface reflectance image over the Alentejo), so *raster::stack* is used to read the list of file names. Then, the stack is converted into a single multiband image with *raster::brick*. Note that building a brick (instead of reading a native multilayer tif file) creates an object "in memory".

```{r}
system.time({s<-raster::stack(as.list(file.path(getwd(),"datasets",
                                        c("LC82030332014151LGN00_sr_band3.tif",
                                        "LC82030332014151LGN00_sr_band4.tif",
                                        "LC82030332014151LGN00_sr_band5.tif"))))})
inMemory(s) #FALSE
system.time({b<-raster::brick(s)}) # takes some time
inMemory(b) #TRUE
writeRaster(b,"b.tif",overwrite=TRUE) # creates multilayer file
b2<-brick("b.tif")
inMemory(b2) #FALSE
b
```

The number of pixels in the image can be accessed with *ncell(b)*. In this case it is `r prettyNum(raster::ncell(b))`. Bands can be combined to compute indices or can be displayed as RGB color composites (for instance with *mapview::viewRGB*, as sketched after the next chunk). For this image, the Landsat 8 surface reflectance values are not supposed to be smaller than 0 or larger than 10000, so we assign *NA* to those pixels.
```{r}
names(b)<-c("b3","b4","b5")
b[b<=0 | b>10000]<-NA # non valid values
ndvi<-(b$b5-b$b4)/(b$b5+b$b4)
writeRaster(ndvi,file="ndvi.tif",overwrite=TRUE)
```
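
The color composite mentioned above could be displayed as follows (a sketch; layer indices 3, 2 and 1 refer to bands b5, b4 and b3 of *b*):
```{r, eval=FALSE}
viewRGB(b, r=3, g=2, b=1) # NIR/red/green false-color composite (mapview)
```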

The *velox* package provides an alternative to package *raster*. It represents bands as matrices, so unlike *raster* it always loads all pixel values into memory.
```{r}
system.time({s<-raster::stack(as.list(file.path(getwd(),"datasets",c("LC82030332014151LGN00_sr_band3.tif","LC82030332014151LGN00_sr_band4.tif", "LC82030332014151LGN00_sr_band5.tif"))))})
vxs<-velox(s)
names(vxs$rasterbands)<-c("b3","b4","b5") # each band is a matrix
```
With *velox*, each band needs to be accessed individually. So, for instance, if one wishes to apply some function to each band (like converting non-valid pixel values to *NA*), it has to be applied one band at a time (alternatively, one could use *lapply*, as sketched after the next chunk).

```{r}
v<-function(x) {x[x<=0 | x>10000]<-NA; return(x)}
b5<-v(vxs$rasterbands$b5) # assign NA to non valid values
b4<-v(vxs$rasterbands$b4) # idem
vals<-(b5-b4)/(b5+b4)
vxndvi<-velox(vals,extent=vxs$extent,res=vxs$res,crs=vxs$crs)
```
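
The *lapply* alternative mentioned above could look like this (a sketch that applies *v* to every band at once):
```{r, eval=FALSE}
vxs$rasterbands <- lapply(vxs$rasterbands, v) # assign NA to non-valid values in all bands
```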
Available methods for *VeloxRaster* can be listed with *?VeloxRaster*. In particular, velox objects can be converted to *raster* objects.
```{r}
r<-vxndvi$as.RasterLayer(band=1)
mapview(r)
```
The main interest of *velox* is to perform operations on rasters that are slow with the tools from package *raster*. For instance, suppose we wish to extract *ndvi* values at *N* random locations.

```{r}
# extract values
ndvi<-raster("ndvi.tif")
ext<-as.vector(extent(ndvi))
N<-10000; locs<-cbind(runif(N,min=ext[1],max=ext[2]),runif(N,min=ext[3],max=ext[4]))
system.time({x<-raster::extract(ndvi,y=locs)}) # 4.94 for N=10000
sf.locs<-st_as_sf(data.frame(x=locs[,1],y=locs[,2]), coords = c("x","y") , crs = as.character(crs(ndvi)))
system.time({y<-vxndvi$extract_points(sf.locs)})
n<-4; setequal(round(x,n),round(y,n)) # check that both methods return the same values
```
In the following example, we compare how long it takes to read the coordinates of the pixels and to crop a raster.
```{r}
# read coordinates
system.time({xy<-coordinates(ndvi)}) # 1.34
system.time({xy<-vxndvi$getCoordinates()}) # 0.44
system.time({x<-raster::crop(ndvi,0.5*extent(ndvi))}) # 2.66
system.time({vxndvi$crop(0.5*extent(ndvi))}) # 0.16
```

VeloxRaster objects are *ReferenceClass* objects and thus mutable, as explained in the [documentation](https://www.rdocumentation.org/packages/velox/versions/0.2.0). This means in particular that an object can be altered without the use of the *<-* or *=* symbols. For instance, the *crop* operation above re-sized *vxndvi* in place, and the initial version was lost. To make a new copy of the object, one needs to use *$copy*.

```{r}
vxndvi$extent # after crop
# original version
vxndvi<-velox(vals,extent=vxs$extent,res=vxs$res,crs=vxs$crs)
vxndvi$extent #
# make a copy that is not referenced to vxndvi
vxcopy<-vxndvi$copy()
```


## Maps and colors

Maps, like the ones produced by *plot* or *mapview*, have default colors. Usually, one wants to choose a given palette of colors and the intervals of values that correspond to each color. Argument *at* of *mapview* indicates the breaks in values that correspond to the colors. *R* provides different ways of defining vectors of colors: these are just strings that represent colors in hexadecimal notation (see [hex colors](http://www.color-hex.com/)). For instance, color *red* corresponds to code "#ff0000", and *yellow* to code "#ffff00". There are many ways of defining vectors of colors, including the ones below. Parameter *alpha* in [0,1] indicates transparency, with opaque corresponding to *alpha=1*.

```{r}
cores<-c("red","yellow","green","blue")
cores<-colorRampPalette(c(rgb(0,0,1,alpha=1), rgb(0,0,1,alpha=0)), alpha = TRUE)(8)
cores<-viridisLite::inferno(n=10,alpha=1,begin=0.2,end=1,direction= -1)
pie(rep(1,length(cores)),col=cores)
```

For instance, if we want to display the *ndvi* map with 4 classes between 0.1 and 1, we can set the colors and the breaks as in the following example. A description of options for *mapview* can be found [here](https://environmentalinformatics-marburg.github.io/mapview/advanced/advanced.html). If the number of colors is lower than the number of intervals, colors are recycled.

```{r}
cores<-colorRampPalette(c(rgb(1,1,0,0.5), rgb(0,1,0,0.5)), alpha = TRUE)(10) # any number >=4
mapview(ndvi,col.regions=cores,at=c(0.1,.3,.5,.7,1))
```
One can edit data interactively over *mapview* maps (for instance, to create new features) with *mapedit::editMap*.
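
A sketch of such an interactive session (assuming package *mapedit* is installed; see *?mapedit::editMap* for the structure of the returned object):
```{r, eval=FALSE}
edits <- mapedit::editMap(mapview(concelho)) # draw or edit features over the map
```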

## Tables

The simplest structure to represent geographic data is a table: this can be used to represent a set of points, including their coordinates in some known coordinate reference system. There are different packages for manipulating tables in R. The *base* function *data.frame* and its methods are widely used but can be very slow for large tables. As an alternative, one can use *tibbles*, which are an updated approach to *data.frame* functionality (part of *tidyverse*, a collection of R packages designed for data science). If we need to use very large tables, the most efficient approach to date in R is package *data.table*, as shown [here](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping).

Next, we read a data set that contains global active fire data acquired by satellite. We compare the basic utility *read.csv*, which returns a *data.frame*, with *fread* from package *data.table*.

```{r}
#system.time({df<-read.csv(file=file.path(getwd(),"datasets","viirs-fires-2017-world.csv"))}) # 148.41
system.time({af<-data.table::fread(file.path(getwd(),"datasets","viirs-fires-2017-world.csv"))}) # 14.34
af
```

The output of *fread* is a *data.table* object, which can be queried and explored efficiently. For instance, to select active fires over Continental Portugal, one can use variables longitude and latitude to perform the query. Function *fwrite* creates a new text file.

```{r}
afpt<-af[latitude>36 & latitude<43 & longitude < -6 & longitude > -10 ]
data.table::fwrite(afpt,file=file.path(getwd(),"datasets","viirs-fires-2017-portugal.csv"))
```

New columns can be easily and very efficiently computed. In the example below, the footprint of each observation is computed as the product of the scan and track ground sample distances. Moreover, a *data.table* can be quickly ordered or grouped. An easy introduction to *data.table* can be found in this [vignette](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html).
```{r}
af[,footprint:=scan*track] # create new variable "footprint"
af[order(frp,decreasing=TRUE)] # order table by frp value
af[,sum(frp),by=acq_date] # returns table with 365 rows
af[,s:=sum(frp),by=acq_date] # adds new column to af
af[,.N,by=confidence] # a bit faster than >table(af$confidence)
af[,.(confidence,frp)] # select columns
```

Note that *data.table* adds/updates/deletes columns *by reference*, as explained in this [vignette](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reference-semantics.html). While this considerably speeds up some operations, in practice it means that one has to be careful when making a copy of a *data.table* object.

```{r}
DT<-data.table(x=1:3,y=c(4,8,12))
newDT<-DT
newDT[,x:=NULL]
DT # is modified by reference
```

To avoid this problem, use function *copy*, which creates a *deep copy* of *DT*, and operate on that copy.
```{r}
DT<-data.table(x=1:3,y=c(4,8,12))
newDT<-copy(DT)
newDT[,x:=NULL]
DT # is not modified
```

# Converting from vector to raster

The goal of *rasterize* is to convert (more precisely, re-sample) a set of georeferenced points into raster format, where the raster can have an arbitrary extent and spatial resolution. Function *rasterize* from the *raster* package works with all geometries but is slow for large data sets.

```{r}
af<-data.table::fread(file.path(getwd(),"datasets","viirs-fires-2017-world.csv"))
afpt<-af[latitude>36 & latitude<43 & longitude < -6 & longitude > -10 ] # Continental Portugal
latlon<-CRS("+proj=longlat +datum=WGS84")
y<-raster(crs=latlon,resolution=0.01,xmn=-10,xmx=-6,ymn=36,ymx=42)
r<-rasterize(x=data.frame(lon=afpt$longitude,lat=afpt$latitude),y=y,field=afpt$frp, fun=max)
mapview(r)
```

In the example above, processing time was reasonable because the data set was rather small. If we consider a larger area (Southern Africa), the number of points is much larger and *rasterize* becomes quite slow.

```{r}
afaf<-af[latitude > -35 & latitude < 0 & longitude > 0 & longitude < 40] # S Africa
latlon<-CRS("+proj=longlat +datum=WGS84")
y<-raster(crs=latlon,resolution=0.01,xmn=0,xmx=40,ymn=-35,ymx=0)
system.time({r<-rasterize(x=data.frame(lon=afaf$longitude,lat=afaf$latitude),y=y,field=afaf$frp,fun=max)}) # 124.16
mapview(r)
```

To speed up this operation, perhaps the best option is to run from R algorithms of the Geospatial Data Abstraction Library [GDAL](https://www.gdal.org/), available through package *gdalUtils*. However, this relies on functions external to R, and installation can be tricky.
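
For reference, a sketch with *gdalUtils::gdal_rasterize*, which mirrors the command-line *gdal_rasterize* utility (it requires a working GDAL installation; the output file name is illustrative, and since the function burns vector sources it is shown here on the flammable-areas shapefile rather than on the active-fire points):
```{r, eval=FALSE}
library(gdalUtils)
gdal_rasterize(src_datasource=file.path(getwd(),"datasets","Flam2015v4.shp"),
               dst_filename="flam_gdal.tif", # illustrative output name
               a="FID_COS",                  # attribute to burn
               l="Flam2015v4",               # layer name
               tr=c(50,50))                  # target resolution (x,y)
```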

As an alternative, one can write a new function that works well with large data sets. The following function relies on *nn2* from package *RANN*, which provides a very fast way of finding nearest neighbors in *N* dimensions; it is used below to determine, for each point, its closest pixel. The value at the point is then assigned to that pixel, as long as the coordinates of the point fall within the pixel boundary.

```{r, eval=FALSE}
myrasterize<-function(x,y,field,background=NA,fun=max, radius=min(res(y)))
{
  colnames(x)<-c("lon","lat")
  # sort points so that, when several points fall in the same pixel, the final
  # assignment below leaves the value selected by fun (max or min)
  if (fun(0:1)==max(0:1)) {x<-x[order(field),]; field<-field[order(field)]}
  if (fun(0:1)==min(0:1)) {x<-x[order(field,decreasing=TRUE),]; field<-field[order(field,decreasing=TRUE)]}
  # for each point of x, look for the closest pixel center, within radius
  coordy<-coordinates(y)
  knn<-nn2(data=coordy,query=x,k=1,searchtype="radius",radius=radius)
  idys<-knn$nn.idx[,1] # index of the pixel closest to each point of x (0 if no neighbor)
  idys[idys==0]<-NA
  # remove x observations that do not have pixel neighbor or do not fall into the pixel
  DT<-data.table(idys,idxs=1:nrow(x),x,lony=coordy[idys,1],laty=coordy[idys,2])
  DT<-DT[!is.na(idys) & abs(lon-lony)<res(y)[1]/2 & abs(lat-laty)<res(y)[2]/2]
  # assign resampled values to y
  valsnew<-rep(background,ncell(y)) # initialize
  valsnew[DT$idys]<-field[DT$idxs]
  values(y)<-valsnew
  return(y)
}
system.time({r<-myrasterize(x=data.frame(lon=afaf$longitude,lat=afaf$latitude),y=y,field=afaf$frp,fun=max)}) # 26.09
```

For polygons, *fasterize* from the package of the same name is much faster than *rasterize*. The last command below compares the resulting raster with the original vector data over the extent of *concelho*.

```{r}
system.time({flam.sf<-sf::st_read(dsn=file.path(getwd(),"datasets","Flam2015v4.shp"))})
ext<-st_bbox(flam.sf)
RES<-50
r<-raster(xmn=ext$xmin,xmx=ext$xmax,ymn=ext$ymin,ymx=ext$ymax,crs=st_crs(flam.sf)$proj4string,resolution=RES)
rflam<-fasterize(flam.sf,r,field="FID_COS")
#
caop<-sf::st_read(dsn=file.path(getwd(),"datasets","Cont_AAD_CAOP2015.shp"),quiet=TRUE)
caop.proj<-sf::st_transform(caop, crs=sf::st_crs(flam.sf))
concelho<-caop.proj[caop.proj$Concelho=="MONCHIQUE",]
mapview(rflam)+mapview(sf::st_crop(flam.sf,sf::st_bbox(concelho)))
```