# Biolinv R package

The biolinv R (R Core Team 2013) package provides tools for modelling and forecasting biological invasions with special consideration to anthropochory, the transportation of propagules by man. Anthropic dispersal is at the core of biological invasions, however, available software that considers anthropochory often does so by modelling it as long distance dispersal. This implies that the new release sites for anthropic dispersal depend on the location of existent colonised sites, which is not necessarily true in common scenarios (e.g. when the source of propagules is the native distribution range or when anthropic dispersal is much faster than natural dispersal).

Starting from an on-going invasion time series, Biolinv provides tools for classifying sighting locations as resulting from either natural or anthropic dispersal, estimating the anthropic contribution to the invasion, for estimating the invasion’s natural dispersal kernel and for making future forecasts on the invasion process. This is done accounting for habitat suitability in the form of Habitat Suitability Model (HSM, in raster format) and providing a measure of precision in the estimates. For details on the methodology behind Biolinv see Butikofer *et al.* 2018.

## Core functions

The package is based on three main functions – namely *EM*, *simulacro* and *modsel*, for running the EM algorithm, generating a virtual dataset and measuring spatial similarity respectively – and several auxiliary functions. These latter ones are primarily called by the main functions, but can also be run on their own for specific tasks; others are for the graphical display of results.

### Function *EM*

The *EM* function is used to estimate which points are of natural origin and which points originated by anthropic dispersal, using nearest neighbour distances computed by the function itself from the invasion time series.

The *EM* algorithm at the core of this function iterates between updating the distribution of nearest neighbour distances for natural points (assumed to be from a half normal (0, sigma), and updating the probability each point is the result of natural dispersal (Pnat). Starting values for Pnat and sigma must be supplied, but the algorithm is robust to the exact values. The anthropic subset is assumed to come from a uniform distribution; the resulting distribution of nearest neighbour distances (with neighbours assumed to be extant points) is computed by simulation, using the observed points as the extant points and the specified spatial boundary (for details see *Myself 2017*). The output of this process is a vector of probabilities of being of natural origin (Pnat) for each data point. A threshold can then be used to classify points as natural or anthropic.

### Function *simulacro*

The *simulacro* function is used to simulate an invasion time series based on spatial boundaries, a proportion of natural vs. anthropic points, a natural dispersal kernel and a HSM. It can be used either to simulate ‘from scratch,’ or to match observed data in certain respects. Time series are generated by an iterative process for which every cycle represents a time step (e.g. one year). For each iteration (time step), a number of points of anthropic and natural origin are generated. Anthropic points are either generated randomly across the geographic boundaries with the function *RPG* (Random Points Generator) or copied from a data frame that includes columns named x, y, year and *Pnat*. In the latter case, x and y are the points coordinates in a projected coordinate system; year must contain the same year numbers for which the point generation is performed and Pnat (probability of being of natural origin) must contain values between zero and one (values smaller than a threshold specified in the simulacro function will then be considered of anthropic origin). Natural points are generated in user defined, yearly numbers starting from extant points at that time step. Distances from existing points are sampled from a user defined One Dimensional Dispersal Kernel (function *fx*) and directions are sampled randomly. This part is done with the *NPG* (Natural Points Generator) function. Spatial filtering is done on both the natural and anthropic subsets through the function spatFilter, which delete points from the input argument points based on the value of the raster cell from argument *MAP* they fall into. In order to account for the reduction in the number of points after the filtering, an excess of points is initially generated (arguments *FACNAT* and *FACANTH* express factors by which the wanted number of points is multiplied before filtering). Eventual superfluous points are then randomly deleted to reach the user defined quantity.

### Function *modSel*

*modSel* is used to compare several simulated time series with one observed invasion time series. It is a wrapper around the spatSim function which generates a measure of similarity between any two time series. This is done by computing Rippley’s K–function for the two datasets and measuring the sum of their square differences.

## Inputs

Requested input files are of two types: invasion time series and spatial boundaries.

### Invasion time series

The time series must be provided in the form of a data frame: one record per row containing the columns year, listing the year of the sighting in a numerical form; x and y, listing longitude and latitude respectively of the sighting locations in a projected coordinate system.

### Spatial boundaries

The spatial boundaries must be in the form of either a 1) RasterLayer object with cell values of either 0 (indicating the area where the computation occurs) or >=1 (indicating the area outside the study system) or 2) a SpatialPoligon(s) object. Like for the time series and all other spatial information in the package, also the spatial boundaries must be provided in a projected coordinate system.

An example invasion time series can be called with the command:

data(frogs)

frogs contains:

- year: year of the sighting
- y: latitude in the New Zealand Transverse Mercatore coordinate system (length unit of measure: meter).
- x: longitude in the New Zealand Transverse Mercatore coordinate system (length unit of measure: meter).

An example of spatial boundaries can be called with the command:

data(nzp)

nzp is a SpatialPolygons object with two features (the North and South islands of New Zealand).

## Sample analysis

Using the input data presented before, we show the typical Biolinv workflow to classify the sighting locations in the time series as either of anthropic or natural origin and quantify the human contribution to the invasion; estimate the natural dispersal kernel; and project the invasion in the future.

### Jackknife on invasion time series

data(frogs) frogsJK<- jackKnife(DF= frogs, N= 10)

The first step of the analysis is optional but we suggest implementing the Jackknife approach especially in the case of small time series (few localities or few years). Jackknifing generates multiple datasets from the original time series, all of which will need to be processed in the same way. Ideally the processing can be done in parallel; naturally, the processing time increases considerably if Jackknifed datasets must be processed in series. All following instructions are supposed to be repeated for each of the Jackknifed datasets.

### Running the EM algorithm

data(nzp) data(frogs) randp<- RPG(rpopn=1000, boundary=nzp, SP= 'random_frog') frogsEM<- EM(dataset= frogs, randompoints= randp, sigma=6, pi=0.5) # plot output plotlacro(x= frogsEM, outline= nzp)

Function *EM* runs the EM algorithm. The dataset parameter is the time series (frogs). The output of the EM function is a copy of dataset with a new column, frogs$Pnat, which indicates the probability of that point being of natural origin. Argument randompoints can be generated with the function *RPG* (Random Points Generator), which is used to generate a Poisson point process within the spatial boundaries of the analysis. This dataset is used to compute the nearest neighbour distances used in the *EM* algorithm. rpopn is the number of desired points to be generated (1000 is a good number for most cases). *SP* is the name to be stored in the column species of the output.

The output of the *EM* function and its argument dataset can be conveniently plotted by the function plotlacro, where argument x is the data frame to be plotted and argument outline is the spatial boundaries for the analysis*.*

### Setting up time series simulation

idst<- frogsEM[1:10,] Cr<- frogsEM[-(1:10),] yr<- unique(Cr$year) nNoYear<- rep(NA,length(unique(Cr$year))) hNoYear<- rep(NA,length(unique(Cr$year))) for(i in 1:length(unique(Cr$year))){ CrYear<- Cr[Cr$year==unique(Cr$year)[i],] #Cr for that year nNoYear[i]<- nrow(CrYear[CrYear$Pnat>=.5,]) #natural points for that year hNoYear[i]<- nrow(CrYear[CrYear$Pnat<.5,]) #human points for that year } AV<- c(2,3,4.5,7.5,11,15,20,25) #alpha values frogsLacro<- simulacro(INIDIST=idst,YEARS=yr, BOUNDARY=nzp,NNAT=nNoYear,NANTH=hNoYear, FACNAT=10, A=AV,X=seq(.1,30,.1), TRUEANTH=TRUE,TRUEDB=Cr,PROB=.5, ITERATIONS=10,HSM=nzp)

The simulacro function generates multiple (or one) virtual datasets that have the same anthropic points as the input dataset (frogs) and the same number of natural points, but generated with varying dispersal kernels. The simulated dataset that most resembles the frogs dataset will then be selected in the next paragraph with the modSel function. The simulacro function requires considerable processing power to generate all the required datasets, we therefore suggest to perform an initial test run using a low number of Alpha values and replicates before running the full dataset generation process.

The simulacro function requires several arguments: *INIDIST* is the initial distribution from which every simulated dataset starts. In the example above, the first ten records of frogs have been used for this (idst). *YEARS* is a vector of unique year numbers used to correctly name the output simulated time series. *BOUNDARY* is a map of the geographic boundaries within which points are generated (same as in previous functions). *NNAT* is either a vector of (integer) numbers of natural points to be generated per year or the mean (integer) number of natural points to be generated each year. *NANTH* is the same but for the anthropic points. *FACNAT* and *FACANTH* are factors by which the number of natural and anthropic points respectively are multiplied before the HSM filtering. They are only used if *HSM* is a raster file and should be bigger numbers if the proportion of suitable cells is low. A is a vector of unique alpha values used to make virtual datasets. In the example, eight values are used, this means that eight datasets per replicate will be created by the simulacro function. Similarly, *C* is a vector of unique C values, however it is not necessary to set it to anything else than the default value of 2. X is a vector of possible dispersal distances used for simulating natural dispersal. This set of distances will be sampled from the One Dimensional Dispersal Kernel function with C= 2 and Alpha= A. *TRUEANTH* can be either “true” or “false”. If “true”, *TRUEDB* will be used to generate anthropic points, if “false” anthropic points will be generated with the *RPG* function in the quantities specified in *NANTH*. The use of this latter option is discouraged at this stage of the process as it is mainly meant for future forecasts of the invasion process (see below). Using *TRUEDB* will generate better datasets for the estimation of the dispersal kernel. *TRUEDB* is a data frame containing points of anthropic origin; must be in the same format as *INIDIST*; rows where *TRUEDB$Pnat*>*PROB* will be ignored. *PROB* is a threshold value over which values in *TRUEDB* time series are considered of natural origin. *ITERATIONS* is the number of datasets with the same Alpha (A) and C values. For a full analysis we suggest a higher number of iterations than the one in the example above (this function has tested positively with 90 replicates). HSM is the habitat suitability map. It can be a RasterLayer object, giving the probability of successful establishment of new populations, or a SpatialPolygons object, giving the boundaries within which new points are generated.

The output of the simulacro function is a list of lists of data frames, each of which is a simulated time series. If *DIR*=TRUE one folder will also be created for each Alpha and C combination containing all the replicates datasets set in *ITERATION*. if *DIR*=FALSE (default) the order in the list will follow the order of the C and Alpha values respectively as set in *C* and *A*.

### Selecting best time series simulation

data(nzw) frogsSum<- modSel(WIN= nzw, M0= frogsEM, M2= frogsLacro, AV= c(2,3,4.5,7.5,11,15,20,25), RAD= seq(0,30000,1000)) # plot output: plotAlpha(SSIM= frogsSum, REP= 10) plotAlpha(SSIM= frogsSum, REP= 10, BP=TRUE)

To select the simulated time series that is the most similar to our observed one (frogs) use the *modSel* function. It takes five arguments. *WIN* is the spatial boundaries of the analysis (like nzp) in the form of a “window” object from package “spatstat”. An example *WIN* object (nzw) is called in the first row. *M0* is the output of the EM function (observed time series). *M2* is the output of the simulacro function (simulated time series list). *AV* is a numeric vector of the Alpha values used in the simulated datasets in the same order as in the list of argument *M2*. It is used to save the output data frame. *RAD* is a numeric vector of search distances for the K-function. This parameter indicates the radiuses at which the K–function *K*(r) is evaluated. Fewer values speed up the computation time but produce less accurate results.

The output of this function is a data frame with two columns: dissimilarity contains the dissimilarity values computed by the *spatSim* function; *compAlpha* contains the input values from *AV*. This data frame can be conveniently visualised with the *plotAlpha* function. When argument *BP* is set as “true” all replicates are represented as boxplots, otherwise only the average line is plotted.

### Future projections

idst<- frogsEM yr<- seq(2015,2030,1) hNoYear<-rep(c(1,0,0),length.out=length(yr)) AV<- 20 frogsLacro<- simulacro(INIDIST=idst,YEARS=yr, BOUNDARY=nzp,NNAT=2,NANTH=hNoYear, A=AV,X=seq(.1,30,.1), TRUEANTH=FALSE,PROB=.5, HSM=nzp,FACNAT=2) f<- frogsLacro[[1]][[1]] f<-f[f$year>2010,] plot(nzp) points(idst$x, idst$y,pch=3) points(f[f$Pnat>=.5,'x'],f[f$Pnat>=.5,'y'],col=4) points(f[f$Pnat<.5,'x'],f[f$Pnat<.5,'y'],col=2)

Function *simulacro* can also be used to project an invasion into the future. The main differences with respect to the use described above are the argument *YEARS*, which, for future projections, must contain the year numbers for which to produce new points, and the argument *TRUEANTH*, which must be set as “false”. The mean number of anthropic and natural points to be generated per year can be computed by averaging the number of anthropic and natural points per year as classified by function *EM*. Note that values of less than one are not acceptable as arguments *NNAT* and *NANTH* represent numbers of points to be generated.