. . "VOSLODStatsAnalysisR" . . . . "2017-06-13T05:38:14Z" . "2017-06-13T05:38:14.711775"^^ . . . "2017-06-13T05:38:14.711775"^^ . "VOSLODStatsAnalysisR" . . . . . . . . . . . . . . "041ab49012c77d982b9e3395d75ec3f1" . . . . . . . . . . . "VOSLODStatsAnalysisR" . . . . "2017-06-13T05:38:14Z" . . "---+ Analyzing Linked Open Data with R\n\n%TOC%\n\n---++ Background\n\nThe idea of statistical data analysis has never been more popular, from Nate \nSilver's book The Signal and the Noise: The Art and Science of Prediction to \nindustry trends such as Big Data.\n\nData itself comes in a vast variety of models, formats, and sources. Users of \nthe R language are familiar with CSV and HDF5 files, ODBC databases, and more. \n\n---++ Linked Data\n\nAgainst this sits [[http://en.wikipedia.org/wiki/Linked_data][Linked Data]], \nbased on a graph structure: simple triple statements (entity-attribute-value or \nEAV) using HTTP URIs to denote and dereference entities, linking pools of data \nby means of shared data and vocabularies (ontologies). \n\nFor example, a photography website might use an entity for each photo it hosts, \nwhich, when dereferenced, displays a page-impression showcasing the photograph \nwith other metadata surrounding it. This metadata is a blend of the WGS84 geo \nontology (for expressing a photo's latitude and longitude) and the EXIF \nontology (for expressing its ISO sensitivity). By standardizing on well-known \nontologies for expressing these predicates, relationships can be built between \ndiverse fields. 
For example, UK crime report data can be tied to Ordnance \nSurvey map/gazetteer data; or a BBC service can share the same understanding of \na musical genre as Last.FM by standardizing on a term from the MusicBrainz \nontology.\n\nThe lingua franca of Linked Data is RDF, to express and store the triples, and \nSPARQL, to query over them.\n\n---++ Worked Example\n\nThe challenge: to explore the United Kingdom's population density using data from \nDBpedia. \n\n---+++ Collating the Data\n\nWe start by inspecting a well-known data-point, the city of Edinburgh.\n\nThe following URLs and URI patterns will come in handy:\n * [[http://dbpedia.org/sparql][http://dbpedia.org/sparql]] - the DBpedia SPARQL endpoint against which \nqueries are executed\n * [[http://en.wikipedia.org/wiki/Edinburgh][http://en.wikipedia.org/wiki/Edinburgh]] - a page in Wikipedia about \nEdinburgh\n * dbpprop:title - a CURIE, identifying the title attribute within \nthe dbpprop namespace ([[http://dbpedia.org/property/][http://dbpedia.org/property/]]), thereby assigning a \ntitle (\"The City of Edinburgh\").\n * [[http://dbpedia.org/resource/Edinburgh][http://dbpedia.org/resource/Edinburgh]] - an identifier for a DBpedia \nresource, composed of data automatically extracted from the corresponding Wikipedia page; if \nyou view it in a Web browser, it redirects to:\n * [[http://dbpedia.org/page/Edinburgh][http://dbpedia.org/page/Edinburgh]] - a human-readable view of the DBpedia \nresource, showing its attributes and values\n\nOn examining the last of these, we see useful properties such as \n\ndbpedia-owl:populationTotal 495360 (xsd:integer)\n\n\nThis tells us that dbpedia-owl:populationTotal is a useful predicate by which \nto identify a settlement's population. 
(Note: we do not need to search for \nentities of some kind of \"settlement\" type; we take merely having a populationTotal\nas sufficient indication that the entity is a settlement. Choosing the wrong kind of \nsettlement (stipulating \"it has to be a town or a city\") would risk losing data \nsuch as villages and hamlets.)\n\nLooking further down the page, we see the two properties:\n\ngeo:lat 55.953056 (xsd:float)\ngeo:long -3.188889 (xsd:float)\n\n\nAgain, we do not need to know the type of the entity; that it has a latitude \nand longitude is sufficient.\n\nSo far, we have some rudimentary filters we can apply to DBpedia to make a table \nof latitude, longitude, and corresponding population.\n\nFinally, we can filter it down to places in the UK, as we see the property:\n\ndbpedia-owl:country dbpedia:United_Kingdom\n\n\n---+++ Constructing the SPARQL Query\n\nWe can build a SPARQL query using the above constraints, thus:\n\n\nprefix dbpedia: <http://dbpedia.org/resource/>\nprefix dbpedia-owl: <http://dbpedia.org/ontology/>\nprefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>\n\nSELECT DISTINCT ?place \n ?latitude \n ?longitude \n ?population \n WHERE \n {\n ?place dbpedia-owl:country dbpedia:United_Kingdom .\n ?place dbpedia-owl:populationTotal ?population .\n ?place geo:lat ?latitude .\n ?place geo:long ?longitude .\n } \n ORDER BY ?place \n LIMIT 100\n\n\nand the result set looks like:\n\n| *place* | *latitude* | *longitude* | *population* |\n| [[http://dbpedia.org/resource/Aberporth][http://dbpedia.org/resource/Aberporth]] | 52.1333 | -4.55 | 2485 |\n| [[http://dbpedia.org/resource/Accrington][http://dbpedia.org/resource/Accrington]] | 53.7534 | -2.36384 | 35203 |\n| [[http://dbpedia.org/resource/Acomb,_North_Yorkshire][http://dbpedia.org/resource/Acomb,_North_Yorkshire]] | 53.955 | -1.126 | 22215 |\n| [[http://dbpedia.org/resource/Adamstown,_Pitcairn_Islands][http://dbpedia.org/resource/Adamstown,_Pitcairn_Islands]] | -25.0667 | -130.1 | 48 |\n| [[http://dbpedia.org/resource/Aldeburgh][http://dbpedia.org/resource/Aldeburgh]] | 52.15 | 1.6 | 2793 |\n| 
[[http://dbpedia.org/resource/Aldershot][http://dbpedia.org/resource/Aldershot]] | 51.247 | -0.7598 | 33840 |\n| [[http://dbpedia.org/resource/Alkborough][http://dbpedia.org/resource/Alkborough]] | 53.6835 | -0.667179 | 455 |\n| [[http://dbpedia.org/resource/Alkborough][http://dbpedia.org/resource/Alkborough]] | 53.6856 | -0.667179 | 455 |\n| [[http://dbpedia.org/resource/Alkborough][http://dbpedia.org/resource/Alkborough]] | 53.6835 | -0.6658 | 455 |\n| [[http://dbpedia.org/resource/Alkborough][http://dbpedia.org/resource/Alkborough]] | 53.6856 | -0.6658 | 455 |\n| [[http://dbpedia.org/resource/Ambleside][http://dbpedia.org/resource/Ambleside]] | 54.4251 | -2.9626 | 2600 |\n| [[http://dbpedia.org/resource/Applecross][http://dbpedia.org/resource/Applecross]] | 57.433 | -5.80958 | 544 |\n| [[http://dbpedia.org/resource/Arthington][http://dbpedia.org/resource/Arthington]] | 53.9 | -1.58 | 561 |\n| [[http://dbpedia.org/resource/Ashington][http://dbpedia.org/resource/Ashington]] | 55.181 | -1.568 | 27335 |\n\n\n---+++ Data Sanitization\n\nHowever, on executing this against the DBpedia SPARQL endpoint, we see some \nstrange \"noise\" points. Some of these might be erroneous (Wikipedia being \nhuman-curated), but some of them arise from political arrangements such as the \nremains of the British Empire - for example, Adamstown in the Pitcairn Islands \n(a British Overseas Territory, way out in the Pacific Ocean). 
Hence, to make \nplotting the map easier, the data is further filtered by latitude and longitude \nto points within a crude rectangle surrounding all the UK mainland.\n\n\nprefix dbpedia: <http://dbpedia.org/resource/>\nprefix dbpedia-owl: <http://dbpedia.org/ontology/>\nprefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>\n\nSELECT DISTINCT ?place \n ?latitude \n ?longitude \n ?population \n WHERE \n {\n ?place dbpedia-owl:country dbpedia:United_Kingdom .\n ?place dbpedia-owl:populationTotal ?population .\n ?place geo:lat ?latitude .\n ?place geo:long ?longitude .\n FILTER ( ?latitude > 50 \n && ?latitude < 60 \n && ?longitude < 2\n && ?longitude > -7 \n )\n }\n\n\n---++ Map\n\nThe [[http://gadm.org/][GADM database of Global Administrative Areas]] site has \nfree maps of country outlines available for download as Shapefiles, ESRI, \nKMZ, or R native. In this case, we download a Shapefile, unpack the zip archive, \nand move the file GBR_adm0.shp into the working directory.\n\n---++ R\n\nThere is an R package for executing SPARQL queries against an endpoint. We \ninstall it and the other dependencies as follows:\n\n\n> install.packages(\"maptools\")\n> install.packages(\"akima\")\n> install.packages(\"SPARQL\")\n\n\nThe maptools package allows us to load a Shapefile into R; akima \nprovides an interpolation function; and SPARQL provides the interface for executing queries.\n\n---+++ The Script\n\n\n#!/usr/bin/env Rscript\n\nlibrary(\"maptools\")\nlibrary(\"akima\")\nlibrary(\"SPARQL\")\n\n# UK places with a population and coordinates, limited to a crude bounding box\nquery<-\"prefix dbpedia: <http://dbpedia.org/resource/>\nprefix dbpedia-owl: <http://dbpedia.org/ontology/>\nprefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>\n\nSELECT DISTINCT ?place \n ?latitude \n ?longitude \n ?population \n WHERE \n {\n ?place dbpedia-owl:country dbpedia:United_Kingdom .\n ?place dbpedia-owl:populationTotal ?population .\n ?place geo:lat ?latitude .\n ?place geo:long ?longitude .\n FILTER ( ?latitude > 50 \n && ?latitude < 60 \n && ?longitude < 2\n && ?longitude > -7 \n )\n }\"\n\n# Draw the interpolated surface, the raw points, contour lines and the coastline\nplotmap<-function(map, pops, im) {\n image(im, col=terrain.colors(50))\n points(pops$results$longitude, pops$results$latitude, cex=0.25, col=\"#ff30000a\") \n contour(im, add=TRUE, col=\"brown\")\n lines(map, xlim=c(-8,3), ylim=c(54,56), 
col=\"black\")\n}\n\n# Smaller variant of the query, handy for quick tests (not used below)\nq100<-paste(query, \" limit 100\")\n\nmap<-readShapeLines(\"GBR_adm0.shp\")\n\n# Only hit the endpoint if the results are not already in the session\nif(!(exists(\"pops\"))) {\n pops<-SPARQL(\"http://dbpedia.org/sparql/\", query=query)\n}\n\ndata <- pops$results[with(pops$results, order(longitude,latitude)), ]\ndata <- data[with(data, order(latitude,longitude)), ]\n# Interpolate the fourth root of population onto a regular 200x200 grid\nim <- with(data, interp(longitude, latitude, population**.25, duplicate=\"mean\", xo=seq(-7,1.25, length=200), yo=seq(50,58, length=200), linear=FALSE))\n\nplotmap(map, pops, im)\n\nfit<-lm(population ~ latitude*longitude, data)\nprint(summary(fit))\n\nsubd<-data[c(\"latitude\",\"longitude\",\"population\")]\nprint(cor(subd))\n\n\nRun this interactively from R:\n\n\nbash$ R\n...\n> source(\"dbpedia-uk-map.R\")\n...\n\n\nAfter a few seconds to load and execute the query, you should see a map showing \nthe outline of the UK (including a bit of Northern Ireland) with green/yellow \nheat-map and contour lines of the population density. Individual data-points \nare plotted using small, faint red dots. \n\n\n\nThis is a rather naive plot: interpolation is not aware of water, so it \ninterpolates between Stranraer and Belfast regardless of the Irish Sea in the \nway; however, it looks reasonable on land, with higher values over large centers \nof population such as London, the Midlands, and the central belt in Scotland \n(from Edinburgh to Stirling to Glasgow).\n\nThe script runs two statistical analyses:\n\n * a simple linear regression of population with latitude and longitude:\n\nCall:\nlm(formula = population ~ latitude * longitude, data = data)\n\nResiduals:\n Min 1Q Median 3Q Max \n -37155 -17695 -12856 -8212 62278176 \n\nCoefficients:\n Estimate Std. 
Error t value Pr(>|t|)\n(Intercept) -273340.2 478917.6 -0.571 0.568\nlatitude 5489.0 9139.8 0.601 0.548\nlongitude -41694.5 146956.3 -0.284 0.777\nlatitude:longitude 776.8 2788.6 0.279 0.781\n\nResidual standard error: 660500 on 9206 degrees of freedom\nMultiple R-squared: 7.733e-05, Adjusted R-squared: -0.0002485 \nF-statistic: 0.2373 on 3 and 9206 DF, p-value: 0.8704 \n\n\n * correlation between latitude, longitude and population, shown as a matrix:\n\n latitude longitude population\nlatitude 1.000000000 -0.291285847 0.008083111\nlongitude -0.291285847 1.000000000 -0.004161835\npopulation 0.008083111 -0.004161835 1.000000000\n\n\nThis shows a very slight correlation of population with longitude (about -0.004) and \nroughly twice as much with latitude (about 0.008), but neither is statistically \nsignificant in the given data: the regression p-values are around 0.5-0.8, far \nabove a conventional threshold such as 0.01.\n\n---++ Next Steps\n\nSPARQL has a SERVICE keyword that allows federation, i.e., the execution of \nqueries against multiple SPARQL endpoints, joining disparate data by common \nvariables (also known as SPARQL-FED). For example, it should be possible to enrich the data by \nblending Geonames and the Ordnance Survey gazetteer in the query. \n\nOver to you to explore the data some more!" .