Landmap is back!

Research, Teaching

Landmap was a service that provided UK geospatial data for academic use until its Joint Information Systems Committee (JISC) funding ceased in July 2014. Landmap hosted a large amount of data, including satellite imagery, digital elevation models (LiDAR, satellite and photogrammetric) and building heights and classifications. It was a sad day when Landmap was ‘switched off’, not least because it meant our GIS students had to put in a great deal more effort to find data for their projects!

The good news is that the Landmap data is now hosted at the Centre for Environmental Data Archival (CEDA) and is gradually becoming available once more for educational use. Those who used the previous incarnation of Landmap will have become familiar (or not) with the Kaia user interface. This has now gone, and all data is stored in a hierarchical file system organised into geographic regions. This organisation is fairly coarse, with the finest spatial level typically being the city.

So far, I have found that some collections have no unambiguous spatial reference in their file naming scheme. This makes it difficult to find the data you want if, for example, there are 100 image files in the ‘London’ folder. I am currently documenting these cases so that I am the only one who has to spend time on it!

Geocomputation SVM workshop: Your feedback

SpaceTimeLab, Teaching

At the recent International Conference on GeoComputation at UT Dallas, I ran a workshop on Support Vector Machines for Spatial and Temporal Analysis. Participation was boosted to around 30 by the cancellation of one of the other workshops, and I thank those who had expected to be discussing Emerging Trends in Data-Intensive, High Performance, and Scalable Geocomputation for their patience and effort!


This was the first workshop run by SpaceTimeLab, and was supported by the Crime, Policing and Citizenship Project (http://www.ucl.ac.uk/cpc/) and ISPRS Working Group II/5: GeoComputation and GeoSimulation (http://www2.isprs.org/commissions/comm2/wg5.html). The workshop was very enjoyable and I was incredibly impressed by the technical ability and enthusiasm of the participants.

At the end of the workshop I circulated a Google poll to get feedback and ideas for future workshops. Thank you to all who completed it. From the responses, it seems that Neural Networks are something that people are still very interested in learning more about. Random Forests also appear to be a hot topic. We are currently doing some work at SpaceTimeLab on Random Forests so I hope to make this a subject for a future workshop.

I will publish some of the workshop materials on this site in the near future.

Were you at the workshop? If so feel free to add any more suggestions in the comments below. What should be the topic of SpaceTimeLab’s next workshop?

The picture of a law-abiding cyclist

A bit of fun, R Stuff

A summary of my commutes

This is a picture of my commutes between home and work over the past three years. The points are coloured by the speed value recorded by the GPS device: red points are slower and white points are faster. The individual points from all my commutes are overlaid on each other, and the opacity is a proxy for the number of times a road segment has been traversed; road segments with stronger colour have been traversed more often. It is important to note that raw GPS speed records are not particularly accurate and depend on factors such as the number of satellites in view and multipath from buildings and trees, so the colours give only an indication of approximate speed.


Activities around UCL

In the second picture, the clustering of red points around road intersections represents stop points, showing that I am a law-abiding cyclist! Rightly or wrongly, cyclists as a group have a bad reputation in London for not stopping at red lights. As with any group of road users, the majority of cyclists obey the rules of the road. Unfortunately, red light running is very visible because junctions naturally have a captive audience. I will not get into the debate on this issue, but when looking at my commutes I found it interesting that I could clearly see all the junctions and pedestrian crossings appear. In the future, I hope to look into the various causes of delay for cycle commuters in more detail.

The method

Garmin activities come in a format called .tcx, which is like a .gpx file with the ability to store additional data such as heart rate and cadence. I imported the activities into R and converted them to a data frame using the following code:

library(XML)      # Parse the .tcx (XML) files
library(plyr)     # ldply, used below to build the data frame
library(sp)       # Spatial classes and plotting, used later
library(rgdal)    # spTransform, for transforming to OSGB 1936
library(maptools) # leglabs, used for the map legend

# Place your .tcx files in a folder on their own and list files
files <- list.files()

# Create an empty data frame to store the activities, using the column
# headers from the .tcx files
actv <- data.frame("value.ActivityID"=NA,
	"value.Time"=NA,
	"value.Position.LatitudeDegrees"=NA,
	"value.Position.LongitudeDegrees"=NA,
	"value.AltitudeMeters"=NA,
	"value.DistanceMeters"=NA,
	"value.Value"=NA,
	"value.Speed"=NA,
	"Time"=NA)

# Loop through the files and fill up the data frame
for(i in 1:length(files))
	{
	doc <- xmlParse(files[i])
	nodes <- getNodeSet(doc, "//ns:Trackpoint", "ns")  # Extract the trackpoint nodes
	if(length(nodes) > 0)  # Check that there is data in the activity
		{
		# Convert nodes to a data frame and give the activity an ID
		mydf <- cbind(i, plyr::ldply(nodes, as.data.frame(xmlToList))) 
		colnames(mydf)[1] <- "value.ActivityID"
		
		# I included this as some of my activities had different numbers of 
		# fields. This may not be needed (majority had 9).
		if(ncol(mydf)==9)
			{
			actv <- rbind(actv, mydf)
			}
		}
	}

To make the visualisations, I first converted the data frame to a SpatialPointsDataFrame using the following function:

tcxToPoints <- function(tcx=NA, actv=NA, proj4string=CRS("+proj=longlat +ellps=WGS84 +datum=WGS84"))
	{
	# A function to import Garmin Connect activities and convert them to spatialPoints objects
	# Inputs:
	# tcx = An individual .tcx file 
	# actv = A data frame of activities if using above code
	# proj4string = coordinate reference system for SpatialPointsDataFrame

	if(!is.na(tcx))
		{
		doc <- xmlParse(tcx)
		nodes <- getNodeSet(doc, "//ns:Trackpoint", "ns")
		mydf <- plyr::ldply(nodes, as.data.frame(xmlToList))
		}
	else
		{
		mydf <- actv
		}
	# Remove rows with missing coordinates
	mydf <- mydf[!is.na(mydf[,"value.Position.LatitudeDegrees"]), ]

	coords <- cbind(as.numeric(as.matrix(mydf[,"value.Position.LongitudeDegrees"])), as.numeric(as.matrix(mydf[,"value.Position.LatitudeDegrees"])))
	pts <- SpatialPointsDataFrame(coords=coords, proj4string=proj4string,
	data=subset(mydf, select=-c(value.Position.LatitudeDegrees, value.Position.LongitudeDegrees))) # data without the coordinates
	mode(pts@data[,6]) <- "numeric"
	
	# Create a speed column in kph
	pts@data <- cbind(pts@data, (pts@data[,6]*3600)/1000)
	colnames(pts@data)[ncol(pts@data)] <- "speedKph"
	
	# Change dates to POSIX format
	pts@data[,2] <- as.POSIXct(gsub("T", " ", pts@data[,2]))
	return(pts=pts)
	}
tcxPoints <- tcxToPoints(actv=actv)
# remove unrealistic speeds (in this case over 80kph)
tcxPoints <- tcxPoints[-which(tcxPoints@data[,8]>80),]
# transform to OSGB 1936
tcxPointsOSGB <- spTransform(tcxPoints, CRSobj=CRS("+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +datum=OSGB36 +units=m +no_defs"))

I carried out an intermediate step here to isolate journeys that had start and end points within 500 metres of my work location and 200 metres of my home locations. I then created the plot using the following code:
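
As a rough sketch of how this intermediate step might look (not the exact code I used), the snippet below keeps only those activities whose start and end points fall within the thresholds. The home and work coordinates are hypothetical placeholder values in OSGB 1936 eastings and northings, and the code assumes the tcxPointsOSGB object and value.ActivityID column created above.

# Hypothetical home and work coordinates (easting, northing in metres)
home <- c(523000, 172000)
work <- c(529900, 182500)

xy  <- coordinates(tcxPointsOSGB)
ids <- tcxPointsOSGB@data$value.ActivityID

# Index of the first and last point of each activity
firstIdx <- tapply(seq_along(ids), ids, min)
lastIdx  <- tapply(seq_along(ids), ids, max)

# Distances (metres) from each activity's start and end points to home and work
distHomeStart <- spDistsN1(xy[firstIdx, , drop=FALSE], home)
distWorkEnd   <- spDistsN1(xy[lastIdx,  , drop=FALSE], work)
distWorkStart <- spDistsN1(xy[firstIdx, , drop=FALSE], work)
distHomeEnd   <- spDistsN1(xy[lastIdx,  , drop=FALSE], home)

# Keep journeys that run home -> work or work -> home within the thresholds
isCommute  <- (distHomeStart < 200 & distWorkEnd < 500) |
              (distWorkStart < 500 & distHomeEnd < 200)
commuteIDs <- names(firstIdx)[isCommute]
tcxPointsOSGB <- tcxPointsOSGB[ids %in% commuteIDs, ]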

plt <- tcxPointsOSGB # Or a subset thereof

# Create breaks and define colour palette, the alpha value 
# is used to define transparency
brks <- quantile(plt@data$speedKph, seq(0,1,1/5), na.rm=T, digits=2)
cols <- colorRampPalette(c(rgb(1,0,0,0.025), 
rgb(1,1,1,0.05)), alpha=T)(5)

# Set background to black and plot
par(bg="black")
plot(plt, col=cols[findInterval(x=plt@data[,8], vec=brks)], pch=1, cex=0.35)
par(bg="white") #Reset background to white

# Create a palette for the legend without transparency
legCols <- colorRampPalette(c(rgb(1,0,0,1), 
rgb(1,1,1,1)), alpha=T)(5)

# Add a legend using the bounding box for positioning
legend(x=bbox(plt)[1,1], y=bbox(plt)[2,2], 
legend=leglabs(round(brks, digits=2)), 
fill=legCols, cex=0.75, title="Speed (km/h)",
text.col="white", bty="n")

# Add a North arrow using bounding box for positioning
# and height

arrowHeight <- (bbox(plt)[1,2]-bbox(plt)[1,1])/5
arrows(bbox(plt)[1,2], bbox(plt)[2,1], bbox(plt)[1,2], 
bbox(plt)[2,1]+arrowHeight, col="white", code=2, lwd=3)

Hopefully this code can be applied to any .tcx file without much modification, but .tcx files may vary according to the functionality of the device. I would be interested to hear from anyone who applies this code to their own data and comes across any problems.

A quick look at three years of commuting to UCL

A bit of fun

I love cycling and running, and around this time three years ago I purchased my first GPS watch to use as a training aid. As most who have purchased such a device will know, once you start using it you very soon start recording everything you do, even if it’s just commuting to and from work or going to the shops. If it isn’t recorded it didn’t happen, right?

Since I started using one, the popularity of GPS trackers has grown massively, and a huge number of apps have been designed to cater for this demand, including MapMyFitness, Endomondo and Strava to name a few. These apps, Strava in particular, have created a whole new social phenomenon whereby people compete against one another for the fastest time on particular ‘segments’ and earn the kudos of being KOM or QOM (King or Queen of the Mountain).

What I find most interesting about this phenomenon is the vast amount of data that is being collected and stored on cyclists’ mobility patterns. As is the case with me, many people now track their daily commute by bicycle as a matter of course. This creates significant opportunities to research cycling commuting behaviour at the aggregate level.

There is a general feeling, I believe, that cyclists’ journey times are not affected by vehicular traffic: if there is a queue, a cyclist can just go up the inside, or overtake and bypass the queue completely. However, anyone who cycles often in London will know that it is often not that simple. London is an old city with narrow roads, and frequently it is too dangerous to bypass traffic, or there is simply not enough space. This means that there can be considerable variation in the time it takes to commute the same route by bicycle.

As a start, I wanted to analyse this quantitatively by looking at my own tracks. The first thing I did was to plot the durations of the activities against their distances, which you can see below. I found it quite interesting that I could identify different activity types visually. For example, the two clusters of points arranged horizontally are commutes between UCL and my current and previous homes.
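
As an illustration, a plot along these lines can be produced from a trackpoint data frame like the actv object built in the .tcx import code shown earlier on this site. This is a rough sketch rather than the exact code behind the figure below, and it assumes the value.* column names produced by my device, which may differ on others.

# Summarise each activity by its duration (minutes) and total distance (metres)
actv$Time <- as.POSIXct(gsub("T", " ", actv$value.Time))
dur  <- tapply(actv$Time, actv$value.ActivityID,
               function(x) as.numeric(difftime(max(x), min(x), units="mins")))
dist <- tapply(as.numeric(as.character(actv$value.DistanceMeters)),
               actv$value.ActivityID, max, na.rm=TRUE)

plot(dist, dur, pch=16, cex=0.6,
     xlab="Distance (m)", ylab="Duration (minutes)")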

A quick view of a few years of activity data

 

The smaller cluster at 5000 metres and just over 20 minutes is the Wimbledon Common Park Run, a weekly 5k race that I do often. You can also clearly see the two distinct profiles for running and cycling activities.

What is interesting about the commuting activities is their horizontal spread on the plot. At a glance, the mean commuting time is approximately 40 minutes for my old home location and 45 minutes for the new one, but there is considerable variation around these values. There are many reasons why commuting times may vary like this, including level of effort, wind direction, precipitation, traffic congestion, cycle congestion (i.e. the number of cyclists occupying the space available for cyclists), the precise times at which the activity started and ended, and variations in the route, amongst others. In future work, I hope to look in more detail at how to isolate these effects.

 

 

Loading PostGIS geometries into R without rgdal – an approach without loops

R Stuff

Some work I have been doing recently has involved setting up a PostGIS database to store spatio-temporal data and the corresponding geometries. I like to do my analysis in R, so I needed to import the well-known binary (WKB) geometries into the R environment. If you have drivers for PostGIS in your gdal installation this is straightforward, and instructions are here. However, getting the drivers can be tricky on Windows and was not something I wanted to spend time doing. Luckily, it is possible to get around the problem by returning the geometries as well-known text (WKT) from PostGIS and converting them to spatial objects in R. Lee Hachadoorian describes a way of doing this here. The issue with Lee’s method (as he points out) is that it uses loops, which can be very inefficient in R. I wanted to avoid loops and came up with the following solution using RPostgreSQL, rgeos and sp:

#Load the required packages
require(RPostgreSQL)
require(rgeos)
require(sp)
# create driver
dDriver <- dbDriver("PostgreSQL")
#connect to the server, replacing with your credentials
conn <- dbConnect(dDriver, user="user", password="password", dbname="dbname")

In the following query I am selecting the ID (id) and geometry (geom) columns from a table called ‘your_geometry’ that fall within a specified bounding box. The function ST_AsText is used to convert WKB to WKT.

# Select and return the geometry as WKT
rs <- dbSendQuery(conn, 'select id, ST_AsText(geom) from your_geometry where
your_geometry.geom && ST_MakeEnvelope(-0.149, 51.51, -0.119, 51.532, 4326);')
# Fetch the results
res <- fetch(rs, -1)
dbClearResult(rs)

The readWKT function converts the WKT to R Spatial objects such as SpatialPoints, SpatialLines or SpatialPolygons. In this case I am using lines.

# Use the readWKT function to create a list of SpatialLines 
# objects from the PostGIS geometry column
str <- lapply(res[,2], "readWKT", p4s=CRS("+proj=longlat +datum=WGS84"))
# Add the IDs to the SpatialLines objects using spChFIDs
coords <- mapply(spChFIDs, str, as.character(res[,1]))

Now it is a case of creating a SpatialLinesDataFrame and adding the remaining attributes from the attribute table.

# Query the remaining fields in the attribute table
rs <- dbSendQuery(conn, 'select * from your_geometry where
your_geometry.geom && ST_MakeEnvelope(-0.149, 51.51, -0.119, 51.532, 4326);')
res <- fetch(rs, -1)
dbClearResult(rs)
# Create a SpatialLinesDataFrame with the geometry and the 
# attribute table
rownames(res) <- res[,1]
# Here I assume the geometry is in the last column and remove it from the attributes
data <- SpatialLinesDataFrame(SpatialLines(unlist(lapply(coords, function(x) x@lines)),proj4string=CRS("+proj=longlat +datum=WGS84")), res[,-ncol(res)])

The tricky bit of this was working out how Line, Lines, SpatialLines and SpatialLinesDataFrame objects interact. readWKT produces a SpatialLines object for each WKT string. If you try to create a SpatialLinesDataFrame directly from one of these, you get a single feature with space for one attribute. It is therefore necessary to extract the Lines objects from each SpatialLines object and combine them into a single SpatialLines object before building the SpatialLinesDataFrame. This was a real pain to work out, so I thought I would post it in case it is useful for anyone.
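
To illustrate the class hierarchy, here is a minimal example using rgeos and sp (not part of the workflow above, just a demonstration of what readWKT returns):

library(rgeos)
library(sp)

sl <- readWKT("LINESTRING(0 0, 1 1, 2 1)")  # readWKT returns a SpatialLines object
class(sl)                        # "SpatialLines"
class(sl@lines[[1]])             # "Lines" - a container for one or more Line objects
class(sl@lines[[1]]@Lines[[1]])  # "Line" - a single matrix of coordinates
length(sl@lines)                 # one Lines object per feature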

Note that I have queried the same data twice; first to select the geometry as WKT and then to get the remaining attributes. I did this because the query was fast and I didn’t want to manually type out the column names of the attribute table in the first query. It would be easy to do the whole operation in one query.
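
For example, something along the following lines should work, although it is an untested sketch against this particular table:

# Return the attributes and the WKT geometry in a single query
rs <- dbSendQuery(conn, 'select *, ST_AsText(geom) as wkt from your_geometry where
your_geometry.geom && ST_MakeEnvelope(-0.149, 51.51, -0.119, 51.532, 4326);')
res <- fetch(rs, -1)
dbClearResult(rs)

# The WKT is in the 'wkt' column; the attribute columns can be used directly
str <- lapply(res$wkt, "readWKT", p4s=CRS("+proj=longlat +datum=WGS84"))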

If you have any ways of simplifying this or speeding it up further I would be interested to know!

Local traffic forecasting models

Research, The STANDARD Project

The majority of my PhD research was centred on developing road traffic forecasting models. I worked on the EPSRC-funded STANDARD project (Spatio-temporal Analysis of Network Data and Route Dynamics), and was fortunate to have access to a large dataset of travel times provided by Transport for London, collected using automatic number plate recognition (ANPR) on London’s road network. Diagram 1 shows, in simple terms, how ANPR works.


Diagram 1 – Observing travel times using ANPR: a vehicle passes camera l1 at time t1 and its number plate is read. It then traverses link l and passes camera l2 at time t2, where its number plate is read again. The two number plates are matched using inbuilt software, and the travel time (TT) is calculated as t2 – t1. Raw TTs are converted to unit travel times (UTTs) by dividing by len(l), the length of link l. This figure is reproduced from the original article.
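
As a toy example of the calculation described in the caption (the numbers are made up):

t1 <- as.POSIXct("2011-06-08 08:01:10")  # plate read at camera l1
t2 <- as.POSIXct("2011-06-08 08:04:40")  # plate read at camera l2
linkLength <- 1400                       # length of link l in metres

TT  <- as.numeric(difftime(t2, t1, units="secs"))  # travel time: 210 seconds
UTT <- TT / linkLength                             # unit travel time: 0.15 s/m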

I was interested in short-term travel time forecasting, attempting to answer the question: given knowledge of what travel times are like now, how will they develop over the next hour? The standard way to do this is to build a model that accounts for the dependency between past, current and potential future values of travel time, using knowledge gained from a large dataset of historical traffic patterns. Statistical time series models and neural networks are popular choices. What we found in the STANDARD project, and what has been recognised elsewhere, is that a single model is not always sufficient to capture the variation in travel times across time and space, because of temporal nonstationarity and spatial heterogeneity. Therefore, I focussed a lot of my research effort on developing local models for travel time forecasting. To deal with the temporal nonstationarity I developed a kernel-based model, with local kernels centred on each time point, called Local Online Kernel Ridge Regression (LOKRR). The idea is that these local kernels capture all the relevant historical information about travel times at particular times of day.


Diagram 2 – The concept of local kernels: for time point t, a kernel Kt is created of size 2w+1, where w is the window size

Diagram 2 shows the concept of local kernels graphically. The green point is the point to be forecast and the green box is the most recently observed travel time pattern. The red boxes indicate the data stored in the local kernel, and the red dots represent the same time of day as the green point on previous days. The window size w is important because it captures the variability in travel times. For example, someone who commutes to work on the same road at approximately the same time each day may observe that the road tends to become congested at around the same point in the morning, and may be able to make statements such as “if I leave after 9am there is always too much traffic”, or “if I set off before 8am my journey is usually pretty quick”. However, there is usually significant variation around such trends. For instance, on some days a link may become congested earlier or later than usual, or the congestion may be slightly more or less severe, and one might find oneself saying “it’s especially busy today, something must have happened”, or “wow, it’s really quiet, it’s usually really busy by now”. These intuitive observations summarise the variability inherent in traffic data, and this variability is accounted for in the window.
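
To make the idea more concrete, below is a rough, illustrative sketch of a local-window kernel ridge regression forecast in R. This is not the published LOKRR implementation, which includes online updating and parameter selection among other things; the function names, window handling and parameter values here are my own simplifications, and the historical data are assumed to be arranged as a days-by-time-points matrix.

# Kernel ridge regression with a Gaussian kernel: fit on (X, y), predict for pattern x
krrForecast <- function(X, y, x, sigma=1, lambda=0.1)
	{
	K <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2))   # kernel matrix of training patterns
	alpha <- solve(K + lambda * diag(nrow(K)), y)     # ridge regression solution
	k <- exp(-colSums((t(X) - x)^2) / (2 * sigma^2))  # kernel between x and each training pattern
	sum(k * alpha)
	}

# Forecast the travel time at time point t + horizon using a local window of
# 2w+1 time points around t, taken from previous days only. 'hist' is a matrix
# of travel times with one row per day and one column per time point; the last
# row is the current, partially observed day.
localForecast <- function(hist, t, w=3, lags=4, horizon=4, ...)
	{
	past <- hist[-nrow(hist), , drop=FALSE]  # historical days
	idx <- (t - w):(t + w)                   # local window of 2w+1 time points
	X <- do.call(rbind, lapply(idx, function(j) past[, (j - lags + 1):j, drop=FALSE]))
	y <- unlist(lapply(idx, function(j) past[, j + horizon]))
	krrForecast(X, y, x=hist[nrow(hist), (t - lags + 1):t], ...)
	}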

The end result is a model that is better able to forecast the variability in travel times. Diagram 3 shows the performance of the model at forecasting travel times 1 hour ahead on a road link in Central London. Because LOKRR uses local knowledge of likely traffic conditions, it is better able to forecast the afternoon peak period than the comparison models.


Diagram 3 – Time series plots of the observed series (thick black line) against the series forecast 1 hour ahead on a) Wednesday 8th June, b) Thursday 9th June and c) Friday 10th June 2011. Comparison models are an Elman neural network (ANN), an autoregressive integrated moving average model (ARIMA) and support vector regression (SVR). These figures are reproduced from the original article.

This research is a good starting point, but there are still many challenges to be addressed. For example, none of the models used here can forecast the large peak on the Wednesday morning in Diagram 3a. This is non-recurrent congestion, which may be caused by an incident or a planned event. The occurrence and effect of such events are often unpredictable, but we can model their spread using spatio-temporal approaches, which is the focus of ongoing research.

This post is based on an article entitled Local Online Kernel Ridge Regression for Forecasting of Urban Travel Times, published in Transportation Research Part C: Emerging Technologies. The article is open access; you can get the PDF at the link below.

PDF: Local Online Kernel Ridge Regression for Forecasting of Urban Travel Times.

DOI: 10.1016/j.trc.2014.05.015