%META:TOPICPARENT{name="VOSIndex"}%

---+Faceted Views over Large-Scale Linked Data

%TOC%

---++What

Faceted views over structured and semi-structured data have been popular in user interfaces for some years. Deploying such views over arbitrary linked data at arbitrary scale has been hampered by the lack of suitable back-end technology. Many ontologies are also quite large, with hundreds of thousands of classes.

The linked data community has also been concerned with the processing cost, and the potential for denial of service, presented by public SPARQL endpoints.

This article discusses how we use Virtuoso Cluster Edition to provide interactive browsing over billions of triples, combining full-text search, structured querying, and result ranking. We discuss query planning, run-time inferencing, and partial query evaluation. This functionality is exposed through SPARQL, a specialized web service, and a web user interface.

---++Why

The transition of the web from a distributed document repository into a universal, ubiquitous database requires a new dimension of scalability to support rich user interaction. If the web is the database, it also needs a query and report-writing tool to match. The faceted user-interaction paradigm has proven useful for aiding discovery and querying of variously structured data. Numerous implementations exist, but they are chiefly client-side and limited in the data volumes they can handle.

At the present time, linked data is well beyond prototypes and proofs of concept. This means that what was previously done in limited specialty domains must now be done at real-world scale, in terms of both data volume and ontology size.
On the schema (T-box) side, there exist many comprehensive general-purpose ontologies, such as Yago[1], OpenCyc[2], Umbel[3], and the DBpedia[4] ontology, and many domain-specific ones, such as those in [5]. For these to enter into the user experience, the platform must be able to support the user's choice of terminology or terminologies as needed, preferably without blow-up of data and the concomitant slowdown.

Likewise, in the LOD world, many link sets have been created for bridging between data sets. Whether such linkage is relevant depends on the use case. Therefore we provide fine-grained control over which owl:sameAs assertions will be followed, if any.

Against this background, we discuss how we tackle incremental interactive query composition over arbitrary data with Virtuoso Cluster[6].

---++How

Using SPARQL or a web/web-service interface, the user can form combinations of text search and structured criteria, including joins to arbitrary depth. If queries are precise and select a limited number of results, the results are complete. If queries would select tens of millions of results, partial results are shown.

The system being described is under active development as of this writing, early March 2009, and is online at http://lod.openlinksw.com/. The data set is a combination of DBpedia, MusicBrainz, Freebase, UniProt, NeuroCommons, Bio2RDF, and web crawls from PingTheSemanticWeb.com.

The hardware consists of two 8-core servers with 16G RAM and 4 disks each. The system runs on Virtuoso 6 Cluster Edition.
All application code is written in SQL procedures with limited client-side Ajax; the Virtuoso platform itself is written in C.

The facets service allows the user to start with a text search or a fixed URI and to refine the search by specifying classes, property values, etc., on the selected subjects or on any subjects referenced from them.

This process generates queries involving combinations of text and structured criteria, often dealing with property and class hierarchies, and often involving aggregation over millions of subjects, especially at the initial stages of query composition. To make this work in interactive time, two things are needed:

 1. a query optimizer that can almost infallibly produce the right join order based on the cardinalities of the specific constants in the query;
 2. a query execution engine that can return partial results after a timeout.

It is often the case, especially at the beginning of query formulation, that the user only needs to know whether there are relatively many or few results of a given type or involving a given property. Partially evaluating a query is thus often sufficient for producing this information. This must, however, be possible with an arbitrary query; simply citing precomputed statistics is not enough.

It has long been a given that any search-like application ranks results by relevance.
Whenever the facets service shows a list of results, as opposed to an aggregation of result types or properties, the list is sorted on a composite of text-match score and link density.

The article is divided into the following parts:

 * SPARQL query optimization and execution adapted for run-time inference over large subclass structures
 * Resolving identity with inverse functional properties
 * Ranking entities based on graph link density
 * SPARQL partial query evaluation for displaying partial results in fixed time
 * A facets web service providing an XML interface for submitting queries, so that the user interface is not required to parse SPARQL
 * A sample web interface for interacting with this
 * Sample queries and their evaluation times against combinations of large LOD data sets

---+++Processing Large Hierarchies in SPARQL

Virtuoso has long had built-in superclass and superproperty inference. It is enabled by specifying the DEFINE input:inference "context" option, where context is previously declared to comprise all the subclass, subproperty, equivalence, inverse-functional-property, and same-as relations defined in a given graph. The ontology file is loaded into its own graph, and this graph is then used to construct the context.
Multiple ontologies and their equivalences can be loaded into a single graph, which then makes another context holding the union of the ontology information from the merged source ontologies.

Let us consider a sample query combining a full-text search and a restriction on the class of the desired matches:

DEFINE input:inference "yago"
PREFIX cy: <http://dbpedia.org/class/yago/>
SELECT DISTINCT ?s1 AS ?c1
       ( bif:search_excerpt
         ( bif:vector ( 'Shakespeare' ), ?o1 )
       ) AS ?c2
WHERE
  {
    ?s1 ?s1textp ?o1 .
    FILTER
      ( bif:contains (?o1, '"Shakespeare"') ) .
    ?s1 a cy:Performer110415638
  }
LIMIT 20

This selects all Yago performers that have a property containing "Shakespeare" as a whole word.

The DEFINE input:inference "yago" clause means that the subclass, subproperty, and inverse-functional-property statements contained in the inference context called yago are considered when evaluating the query. The built-in function bif:search_excerpt makes a search-engine-style summary of the found text, highlighting occurrences of "Shakespeare". The bif:contains function in the filter specifies the full-text search condition on ?o1.

This query is typical of the queries executed constantly as a user refines a search. We will now look at how we can make an efficient execution plan for the query.
First, we must know the cardinalities of the search conditions.

To see the count of subclasses of Yago performer, we can run:

PREFIX cy: <http://dbpedia.org/class/yago/>
SELECT COUNT (*)
FROM <yago>  # the graph holding the Yago ontology
WHERE
  {
    ?s rdfs:subClassOf cy:Performer110415638
       OPTION (TRANSITIVE, T_DISTINCT)
  }

There are 4601 distinct subclasses, including indirect ones. Next we look at how many Shakespeare mentions there are:

SELECT COUNT (*)
WHERE
  {
    ?s ?p ?o .
    FILTER
      ( bif:contains (?o, 'Shakespeare') )
  }

There are 10267 subjects with Shakespeare mentioned in some literal.

DEFINE input:inference "yago"
PREFIX cy: <http://dbpedia.org/class/yago/>
SELECT COUNT (*)
WHERE
  {
    ?s1 a cy:Performer110415638
  }

There are 184885 individuals that belong to some subclass of performer.

This is the data the SPARQL compiler must know in order to produce a valid query plan. Since these values vary wildly depending on the specific constants in the query, the actual database must be consulted as needed while preparing the execution plan. This is regular query-processing technology, here specially adapted for deep subclass and subproperty structures.

Conditions in the queries are not evaluated twice, once for the cardinality estimate and once for the actual run. Instead, the cardinality estimate is a rapid sampling of the index trees that reads at most one leaf page.

Consider a B-tree index, which we descend from the top to the leftmost leaf containing a match of the condition. At each level, we count how many children would match and always select the leftmost one. When we reach a leaf, we see how many entries are on the page.
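This descent can be sketched in Python. It is a toy model under stated assumptions: the Node class and estimate_matches function are illustrative, not Virtuoso's storage engine. The per-level counts of matching children multiply together with the matching entries found on the single sampled leaf page.

```python
class Node:
    """Toy B-tree node: internal nodes pair separator keys with children,
    leaves hold row keys (illustrative layout, not Virtuoso's)."""
    def __init__(self, keys, children=None):
        self.keys = keys          # separator keys (internal) or row keys (leaf)
        self.children = children  # None marks a leaf

def estimate_matches(root, matches):
    """Estimate the match count by reading a single root-to-leaf path.

    At each internal level, count the children that can hold matches and
    descend into the leftmost of them; on the leaf, count the matching
    entries. The product of the per-level counts is the estimate, on the
    assumption that sibling subtrees are of roughly equal size."""
    estimate, node = 1, root
    while node.children is not None:
        matching = [c for key, c in zip(node.keys, node.children) if matches(key)]
        if not matching:
            return 0
        estimate *= len(matching)
        node = matching[0]        # always the leftmost matching child
    return estimate * sum(1 for key in node.keys if matches(key))

# Three leaves of ten keys each under one root.
leaves = [Node(list(range(i, i + 10))) for i in (0, 10, 20)]
root = Node([0, 10, 20], leaves)
print(estimate_matches(root, lambda k: k >= 10))  # 2 matching children * 10 leaf entries = 20
```

The point is that one page read per level yields a usable estimate without evaluating the condition over the whole index.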
From these observations, we extrapolate the total count of matches.

With this method, the guess for the count of performers is 114213, acceptably close to the real number. Given these numbers, we see that it makes sense to first find the full-text matches, then retrieve the actual classes of each and check whether each class is a subclass of performer. This last check is done against a memory-resident copy of the Yago hierarchy, the same copy that was used for enumerating the subclasses of performer.

However, the query

DEFINE input:inference "yago"
PREFIX cy: <http://dbpedia.org/class/yago/>
SELECT DISTINCT ?s1 AS ?c1,
       ( bif:search_excerpt
         ( bif:vector ('Shakespeare'), ?o1 )
       ) AS ?c2
WHERE
  {
    ?s1 ?s1textp ?o1 .
    FILTER
      ( bif:contains (?o1, '"Shakespeare"') ) .
    ?s1 a cy:ShakespeareanActors
  }

will start with Shakespearean actors, since this is a leaf class with only 74 instances, and then check whether the properties contain "Shakespeare" and return their search summaries.

In principle this is common cost-based optimization, here adapted to deep hierarchies combined with text patterns. An unmodified SQL optimizer would have no possibility of arriving at these results.

The implementation reads the graphs designated as holding ontologies when first needed and subsequently keeps a memory-based copy of the hierarchy on all servers. This is used for quick iteration over sub/superclasses or properties, as well as for checking whether a given class or property is a subclass/subproperty of another. Triples with the OWL predicates equivalentClass, equivalentProperty, and sameAs are also cached in the same data structure if they occur in the ontology graphs.

Cardinality estimates for members of classes near the root of the class hierarchy also take some time, since a sample of each subclass is needed.
These are cached for some minutes in the inference context, so that repeated queries do not redo the sampling.

---+++ Inverse Functional Properties and Same As

Especially when navigating social data, as in FOAF[7] and SIOC[8] spaces, there are many blank nodes that are identified by properties only. For this, we offer an option for automatically joining to subjects that share an IFP value with the subject being processed. For example, the query for the friends of friends of Kjetil Kjernsmo returns empty:

SELECT COUNT (?f2)
WHERE
  {
    ?s a foaf:Person ;
       ?p ?o ;
       foaf:knows ?f1 .
    ?o bif:contains "'Kjetil Kjernsmo'" .
    ?f1 foaf:knows ?f2
  }

But with the option

DEFINE input:inference "b3sifp"
SELECT COUNT (?f2)
WHERE
  {
    ?s a foaf:Person ;
       ?p ?o ;
       foaf:knows ?f1 .
    ?o bif:contains "'Kjetil Kjernsmo'" .
    ?f1 foaf:knows ?f2
  }

we get 4022. We note that there are many duplicates, since the data consists of blank nodes only, with people easily represented 10 times over. The context b3sifp simply declares that foaf:name and foaf:mbox_sha1sum should be treated as inverse functional properties (IFPs). The name is not an IFP in the strict sense, but treating it as one for the purposes of this one query makes sense; otherwise nothing would be found.

This option is controlled by the choice of inference context, which is selectable in the interface discussed below.

The IFP inference can be thought of as a transparent addition of a subquery into the join sequence. The subquery joins each subject to its synonyms given by shared IFPs. This subquery has the special property that it has the initial binding automatically in its result set.
It could be expressed as:

SELECT ?f
WHERE
  {
    ?k foaf:name "Kjetil Kjernsmo" .
    {
      SELECT ?org ?syn
      WHERE
        {
          ?org ?p ?key .
          ?syn ?p ?key .
          FILTER
            ( bif:rdf_is_sub
              ( "b3sifp", ?p, b3s:any_ifp, 3 )
              &&
              ?syn != ?org
            )
        }
    }
    OPTION
      (
        TRANSITIVE,
        T_IN (?org),
        T_OUT (?syn),
        T_MIN (0),
        T_MAX (1)
      )
    FILTER
      ( ?org = ?k ) .
    ?syn foaf:knows ?f .
  }

It is true that each subject shares IFP values with itself, but the transitive construct with minimum depth 0 and maximum depth 1 allows passing the initial binding of ?org directly to ?syn, thus getting first results more rapidly. The rdf_is_sub function is an internal that simply tests whether ?p is a subproperty of b3s:any_ifp.

Internally, the implementation has a special query operator for this, and the internal form is more compact than would result from the above, but the above could be used to the same effect.

The issues of run-time vs. precomputed identity inference through IFPs and owl:sameAs are discussed in much more detail in [9].

Our general position is that identity criteria are highly application-specific, and thus we offer the full spectrum of choice between run time and precomputation. Further, identity statements weaker than sameness are difficult to use in queries; thus we prefer identity with the semantics of owl:sameAs, but make this an option that can be turned on and off query by query.

---+++Entity Ranking

It is a common end-user expectation to see text search results sorted by their relevance.
The term entity rank refers to a quantity describing the relevance of a URI in an RDF graph.

This is a sample query using entity rank:

PREFIX yago: <http://dbpedia.org/class/yago/>
PREFIX prop: <http://dbpedia.org/property/>
SELECT DISTINCT ?s2 AS ?c1
WHERE
  {
    ?s1 ?s1textp ?o1 .
    ?o1 bif:contains 'Shakespeare' .
    ?s1 a yago:Writer110794014 .
    ?s2 prop:writer ?s1
  }
ORDER BY DESC ( <LONG::IRI_RANK> (?s2) )
LIMIT 20
OFFSET 0

This selects works whose writer has "Shakespeare" in some property.

Here the query returns subjects, thus no text-search summaries, so only the entity rank of the returned subject is used for ordering. We order text results by a composite of text-hit score and the entity rank of the RDF subject where the text occurs. The entity rank of a subject is defined by the count of references to it, weighted by the rank of the referrers and the outbound link counts of the referrers. Such techniques are used in text-based information retrieval[15].

One interesting application of entity rank together with inference on IFPs and owl:sameAs is locating URIs for reuse. We can easily list synonym URIs in order of popularity, as well as locate URIs based on associated text. This can serve in applications such as the Entity Name Server[14].

Entity ranking is one of the few operations where we take a precomputing approach. Since a rank is calculated based on a possibly long chain of references, there is little choice but to precompute. The precomputation itself is straightforward enough: first, all outbound references are counted for all subjects. Next, the rank of each subject is incremented by 1 over each referrer's outbound link count. On successive iterations, the increment is based on the rank increment the referrer received in the previous round.

The operation is easily partitioned, since each partition increments the ranks of the subjects it holds. The referrers are spread throughout the cluster, though. When rank is calculated, each partition accesses every other partition.
This is done with relatively long messages; referee ranks are accessed in batches of several thousand at a time, thus absorbing network latency.

On the test system, this operation performs a single pass over the corpus of 2.2 billion triples and 356 million distinct subjects in about 30 minutes. The operation has 100% utilization of all 16 cores. Adding hardware would speed it up, as would implementing it in C instead of the SQL procedures it is written in at present.

The main query in rank calculation is:

SELECT O,
       P,
       iri_rank (S)
FROM rdf_quad TABLE
OPTION (NO CLUSTER)
WHERE isiri_id (O)
ORDER BY O

This is the SQL cursor iterated over by each partition. The NO CLUSTER option means that only rows in this process's partition are retrieved. The RDF_QUAD table holds the RDF quads in the store, i.e., triple plus graph. The S, P, and O columns are the subject, predicate, and object, respectively. The graph column is not used here. iri_rank() is a partitioned SQL function: the S argument determines which cluster node runs the function. The specifics of the partitioning are declared elsewhere. The calls are batched for each intended recipient and sent when the batches are full. The SQL compiler automatically generates the relevant control structures. This is like an implicit map operation in map-reduce terminology.

A SQL procedure then loops over this cursor and adds up the rank per distinct O, which is persisted into a table. Further, leveraging the fact that RDF relations (predicates) have discernible fine-grained semantics and are identified by IRIs, we can use these relation semantics to determine link coefficients for relation subjects or objects. With extraction of named entities from text content, we can further place a given entity into a referential context and use this as a weighting factor. This is to be explored in future work.
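The iteration described above can be sketched as follows. This is a single-process Python sketch under stated assumptions: the name entity_ranks is hypothetical, and the real computation runs partitioned across the cluster in SQL procedures. The first round spreads 1 over each referrer's outbound link count; each later round spreads only the increment the referrer gained in the previous round.

```python
from collections import defaultdict

def entity_ranks(edges, rounds=3):
    """Sketch of the iterative entity-rank computation described above.

    edges: (referrer, referee) pairs. Round 1 adds 1/out_degree(referrer)
    to each referee's rank; each later round propagates only the increment
    the referrer received in the previous round, again divided by its
    outbound link count."""
    out = defaultdict(list)
    for s, o in edges:
        out[s].append(o)
    rank = defaultdict(float)
    delta = {s: 1.0 for s in out}       # round-1 contribution per referrer
    for _ in range(rounds):
        next_delta = defaultdict(float)
        for s, targets in out.items():
            share = delta.get(s, 0.0) / len(targets)
            for o in targets:
                rank[o] += share        # weight by referrer's out-degree
                next_delta[o] += share  # what this referee passes on next round
        delta = next_delta
    return dict(rank)

# a references b and c; b references c: c accumulates rank through b.
print(entity_ranks([("a", "b"), ("a", "c"), ("b", "c")], rounds=2))
# -> {'b': 0.5, 'c': 2.0}
```

In the cluster, each partition runs this update for the subjects it holds, fetching referrer ranks from the other partitions in large batches as described above.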
The experience thus far shows that we benefit greatly from Virtuoso being a general-purpose DBMS, as we can create application-specific data structures and control flows where these are efficient. For example, it would make little sense to store entity ranks as triples, due to space consumption and locality considerations. With these tools, the whole ranking functionality took under a week to develop.

---+++ Query Evaluation Time Limits

When scaling the Linked Data model, we have to take it as a given that the workload will be unexpected and that query writers will often be unskilled in databases. Insofar as possible, we wish to promote the formation of a culture of creative reuse of data. To this effect, even poorly formulated questions deserve an answer that is better than just a timeout.

If a query produces a steady stream of results, interrupting it after a certain quota is simple. However, most interesting queries do not work this way: they contain aggregation, sorting, maybe transitivity.

When evaluating a query with a time limit in a cluster setup, all nodes monitor the time left for the query. When dealing with a potentially partial query to begin with, there is little point in transactionality. Therefore the facets service uses read-committed isolation. A read-committed query will never block, since it sees the before-image of any transactionally updated row. There is no waiting for locks, and timeouts can be managed locally by all servers in the cluster.

Thus, with a partitioned count, for example, we expect all the partitions to time out around the same time and send a ready message with the timeout information to the cluster node coordinating the query. The condition raised by hitting a partial-evaluation time limit differs from a run-time error in that it leaves the query state intact on all participating nodes.
This allows the timeout handling to fetch any accumulated aggregates.

Let us consider the query for the top 10 classes of things with "Shakespeare" in some literal. This is typical of the workload generated by the faceted browsing web service:

DEFINE input:inference "yago"
SELECT ?c
       COUNT (*)
WHERE
  {
    ?s a ?c ;
       ?p ?o .
    ?o bif:contains "Shakespeare"
  }
GROUP BY ?c
ORDER BY DESC 2
LIMIT 10

On the first execution, with an entirely cold cache, this times out after 2 seconds and returns:

?c                                        COUNT (*)
yago:class/yago/Entity100001740           566
yago:class/yago/PhysicalEntity100001930   452
yago:class/yago/Object100002684           452
yago:class/yago/Whole100003553            449
yago:class/yago/Organism100004475         375
yago:class/yago/LivingThing100004258      375
yago:class/yago/CausalAgent100007347      373
yago:class/yago/Person100007846           373
yago:class/yago/Abstraction100002137      150
yago:class/yago/Communicator109610660     125

The next repeat gets roughly double the counts, starting with 1291 entities.

With a warm cache, the query finishes in about 300 ms (4-core Xeon, Virtuoso 6 Cluster) and returns:

?c                                        COUNT (*)
yago:class/yago/Entity100001740           13329
yago:class/yago/PhysicalEntity100001930   10423
yago:class/yago/Object100002684           10408
yago:class/yago/Whole100003553            10210
yago:class/yago/LivingThing100004258      8868
yago:class/yago/Organism100004475         8868
yago:class/yago/CausalAgent100007347      8853
yago:class/yago/Person100007846           8853
yago:class/yago/Abstraction100002137      3284
yago:class/yago/Entertainer109616922      2356

It is well known that running from memory is thousands of times faster than running from disk.

The query plan begins with the text search. The subjects with "Shakespeare" in some property are dispatched to the partition that holds their class. Since all partitions know the class hierarchy, the superclass inference runs in parallel, as does the aggregation of the GROUP BY.
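This behaviour can be sketched as follows. The names partition_group_by and coordinate are hypothetical stand-ins, and the partitions are shown running serially rather than as parallel cluster processes: each one aggregates until a shared deadline, keeping whatever partial counts it has accumulated, and the coordinator merges and sorts the results.

```python
import time
from collections import Counter

def partition_group_by(rows, classify, deadline):
    """One partition's COUNT(*) ... GROUP BY: stop at the deadline, but keep
    the aggregate accumulated so far intact (a sketch of the scheme above)."""
    counts = Counter()
    for row in rows:
        if time.monotonic() >= deadline:
            break                     # timeout: the partial aggregate survives
        counts[classify(row)] += 1
    return counts

def coordinate(partitions, classify, timeout):
    """Coordinator: hand every partition the same deadline, then merge the
    (possibly partial) aggregates and sort by descending count."""
    deadline = time.monotonic() + timeout
    total = Counter()
    for rows in partitions:           # in Virtuoso these run in parallel
        total.update(partition_group_by(rows, classify, deadline))
    return total.most_common(10)      # ORDER BY 2 DESC LIMIT 10

# With a generous timeout, the counts are complete: odd=4, even=4.
print(coordinate([[1, 2, 3, 4], [5, 6, 7, 8]],
                 lambda n: "even" if n % 2 == 0 else "odd",
                 timeout=1.0))
```

With a short deadline, the same call returns smaller counts from each partition instead of failing outright, which is exactly the behaviour seen in the cold-cache run above.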
When all partitions have finished, the process coordinating the query fetches the partial aggregates, adds them up, and sorts them by count.

If a timeout occurs, it will most likely occur while the classes of the text matches are being retrieved. When this happens, this part of the query is reset, but the aggregate states are left in place. The process coordinating the query then proceeds as if the aggregates had completed. If there are many levels of nested aggregates, each timeout terminates the innermost aggregation that is still accumulating results; thus a query is guaranteed to return in no more than n timeouts, where n is the number of nested aggregations or subqueries.

---+++Facets Web Service

The Virtuoso facets web service is a general-purpose RDF query facility for facet-based browsing. It takes an XML description of the desired view and generates the reply as an XML tree containing the requested data. The user agent or a local web page can use XSLT to render this for the end user. The selection of facets and values is represented as an XML tree. The rationale is that such a representation is easier for an application to process than the SPARQL source text or a parse tree of SPARQL, and it more compactly captures the specific subset of SPARQL needed for faceted browsing. All such queries internally generate SPARQL, and the generated SPARQL is returned with the results. One can therefore use it as a starting point for hand-crafted queries.

The query has the top-level element <query>. Its child elements represent conditions pertaining to a single subject. A join is expressed with the property or property-of element. This has in turn children that state conditions on a property of the first subject. Property and property-of elements can be nested to arbitrary depth, and many can occur inside one containing element.
In this way, tree-shaped structures of joins can be expressed.

Expressing more complex relationships, such as intermediate grouping, subqueries, or arithmetic, requires writing the query in SPARQL. The XML format is for easy automatic composition of the queries needed for showing facets, not a replacement for SPARQL.

Consider composing a map of locations involved with Napoleon. Below we list user actions and the resulting XML query descriptions.

 * Enter "Napoleon" in the search form:

<query>
  <text>napoleon</text>
</query>

 * Select the "types" view:

<query>
  <text>napoleon</text>
  <view type="classes" />
</query>

 * Choose the "MilitaryConflict" type:

<query>
  <text>napoleon</text>
  <class iri="MilitaryConflict" />
  <view type="classes" />
</query>

 * Choose "NapoleonicWars":

<query>
  <text>napoleon</text>
  <class iri="MilitaryConflict" />
  <class iri="NapoleonicWars" />
</query>

 * Select "any location" in the select list beside the "map" link; then hit the "map" link:

<query>
  <text>napoleon</text>
  <class iri="MilitaryConflict" />
  <class iri="NapoleonicWars" />
  <property iri="any">
    <view type="geo" />
  </property>
</query>

This last XML fragment corresponds to the following SPARQL query:

SELECT ?location AS ?c1
       ?lat1 AS ?c2
       ?lng1 AS ?c3
WHERE
  {
    ?s1 ?s1textp ?o1 .
    FILTER
      ( bif:contains (?o1, '"Napoleon"') ) .
    ?s1 a <MilitaryConflict> .
    ?s1 a <NapoleonicWars> .
    ?s1 ?anyloc ?location .
    ?location geo:lat ?lat1 ;
              geo:long ?lng1
  }
LIMIT 200
OFFSET 0

The query takes all subjects with some literal property containing "Napoleon", then filters for military conflicts and Napoleonic wars, then takes all objects related to these where the related object has a location. The map shows the objects and their locations.

Figure 1: The displayed result

---+++ VoID Discoverability

A long-awaited addition to the LOD cloud is the Vocabulary of Interlinked Datasets ([[http://www.w3.org/TR/void/][VoID]])[10]. Virtuoso automatically generates VoID descriptions of the data sets it hosts. Virtuoso incorporates an SQL function, rdf_void_gen, which returns a Turtle representation of a given graph's VoID statistics.

---+++Test System and Data

The test system consists of two 2x4-core Xeon 5345, 2.33 GHz servers with 16G RAM and 4 disks each.
The machines are connected by two 1Gbit Ethernet connections. The software is Virtuoso 6 Cluster. The Virtuoso server is split into 16 partitions, 8 per machine. Each partition is managed by a separate server process.

The test database has the following data sets:

 * DBpedia 3.2
 * MusicBrainz
 * Bio2RDF
 * NeuroCommons
 * UniProt
 * Freebase (95M triples)
 * PingTheSemanticWeb (1.6M miscellaneous files from http://www.pingthesemanticweb.com/)

Ontologies:

 * Yago
 * OpenCyc
 * Umbel
 * DBpedia

The database holds 2.2 billion triples with 356 million distinct URIs.

---+++Future Work

All the functions discussed above are presently being productized for delivery with Virtuoso 6, such that single servers are open source and clusters commercial only. The most relevant future work is thus the final debugging and tuning of existing functionality.

The technology will first be used commercially as a platform for an Amazon EC2 offering of the whole LOD cloud on a cluster of servers. This complements the existing line of data sets pre-packaged by OpenLink[11].

For more sophisticated, also editable, user-facing functionality, OpenLink is presently working with the developers of OntoWiki[12] on integrating the functionality discussed here into OntoWiki as a new large-scale back end. From this development, we expect to have the functional equivalent of Freebase[13], except with more data, working with open, standard data models, being more integrable, and above all having a full range of deployment options. This means anything from the desktop to the data center, with either software as a service or installation at end-user sites as options.

We presently rank search results on text-match scores and link density around the URIs related to the text hits. We expect the semantics associated with links to open new possibilities in this domain.
We plan to leverage link semantics for ranking, but as of this writing we have not explored this extensively.

---+++Conclusions

We have presented a set of query-processing techniques, together with a web service and user interface, for interactive browsing of a large corpus of linked data. We have shown significant scalability on low-cost server hardware, with open-ended scale-out capacity for larger data sets and more concurrent usage.

The service described is online and is also packaged with Virtuoso 6 open source distributions.

The technical experience derived from developing this service emphasizes the following:

 * The central importance of a SPARQL/SQL cost model that is aware of hierarchies and capable of sampling data as needed. Without the right execution plan, no amount of hardware will save the day.
 * The importance of enforcing a cap on resource usage.
 * The need for scale-out in order to have enough data in memory. Disk is a far greater bottleneck than processor or network speed. Scaling out in a shared-nothing fashion is by far the most economical and scalable means of increasing total memory, disk bandwidth, and processing power.
 * Additional verification of our capacity to schedule parallel query processing on a distributed-memory cluster without being killed by latency.
 * Confirmation of the Virtuoso platform's flexibility for building additional data-intensive services, such as entity ranking.

Present work is therefore concentrated on refining and productizing the platform and its RDF applications.
We believe this to be a significant infrastructure element enabling the take-off of linked data.

---+++Tutorials

 * [[VirtuosoLODSampleTutorial][Faceted Browsing Sample using LOD Cloud Cache data space]]

---++Related

 * Facets Web Service:
    * [[VirtuosoFacetsWebService][Virtuoso Facets Web Service]]
 * Facet Browser Installation and Configuration:
    * [[VirtFacetBrowserInstallConfig][Virtuoso Facet Browser Installation and Configuration]]
 * Facet APIs:
    * [[VirtFacetBrowserAPIs][Virtuoso APIs for FCT REST services]]
    * [[VirtFacetBrowserAPIsFCTEXEC][fct_exec API Example]]
 * Pivot Viewer and CXML:
    * [[http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtSparqlCxmlFacetPivotBridge#AncSparqlCxmlFacetPivotBridge][Facet Pivot Bridge - A bridge to PivotViewer from Virtuoso's Faceted query service for RDF]]
    * [[http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtSparqlCxml#AncFacetTypeAutoDetection][Auto-Detection of Facet Type]]
 * Tutorials:
    * [[VirtuosoLODSampleTutorial][Faceted Browsing Sample using LOD Cloud Cache data space]]
    * [[VirtuosoFacetsWebServiceSOAPExample][SOAP Facets Example]]
    * [[VirtFacetBrowserInstallConfigQueried][Querying The Facet Browser Web Service endpoint]]
    * [[VirtFCTFeatureQueries][Virtuoso Facet Browser Featured Queries]]
    * [[http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtVisualizeWithPivotViewer#GenFCT][Visualizing Your Data With PivotViewer Using The Facet Browser]]
    * [[VirtTipsAndTricksCustomControlLabelsURI][Custom Controlling Virtuoso Labels for URI functionality Example]]
    * [[VirtuosoFacetsWebServiceCustmExamples][Facets Web Service: Examples for customizing different types]]
    * [[VirtuosoFacetsWebServiceChoiceExample][Facets Web Service: Choice of Labels Example]]
 * Downloads:
    * [[http://sourceforge.net/project/showfiles.php?group_id=161622&package_id=319652][Virtuoso 6.0 TP1]]
    * 
[[http://s3.amazonaws.com/opldownload/uda/vad-packages/6.2/virtuoso/fct_dav.vad][Virtuoso Facet Browser VAD package]]

---++References

[1] Suchanek, F.M.; Kasneci, G.; Weikum, G.: YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. WWW 2007, ACM 978-1-59593-654-7/07/0005.

[2] Overview of OpenCyc. http://www.cyc.com/cyc/opencyc/overview

[3] UMBEL Ontology, Vol. 1: Technical Documentation, TR 08-08-28-A1. http://www.umbel.org/doc/UMBELOntology+vA1.pdf

[4] Auer, S.; Bizer, C.; Lehmann, J.; Kobilarov, G.; Cyganiak, R.; Ives, Z.: DBpedia: A Nucleus for a Web of Open Data. In Aberer et al. (Eds.): The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007. LNCS 4825, Springer, 2007, ISBN 978-3-540-76297-3.

[5] The National Center for Biomedical Ontology: Resources. http://bioontology.org/repositories.html

[6] OpenLink Software, Inc.: Virtuoso 6 FAQ. http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html

[7] Brickley, D.; Miller, L.: FOAF Vocabulary Specification 0.91. http://xmlns.com/foaf/spec/

[8] Bojars, U.; Breslin, J.G. (Eds.): SIOC Core Ontology Specification. http://rdfs.org/sioc/spec/

[9] Erling, O.: "E Pluribus Unum", or "Inversely Functional Identity", or "Smooshing Without the Stickiness". http://www.openlinksw.com/weblog/oerling/?id=1498

[10] Hausenblas, M.: Discovery and Usage of Linked Datasets on the Web of Data. NodMag #4. http://www.talis.com/nodalities/pdf/nodalities+issue4.pdf

[11] OpenLink Software, Inc.: Virtuoso Universal Server (Cloud Edition) AMI for EC2. http://virtuoso.openlinksw.com/wiki/main/Main/VirtuosoEC2AMI

[12] Auer, S.; Dietzold, S.; Riechert, T.: OntoWiki, A Tool for Social, Semantic Collaboration. 5th International Semantic Web Conference, Nov 5-9, 2006, Athens, GA, USA. In I. Cruz et al. (Eds.): ISWC 2006, LNCS 4273, pp. 736-749, 2006.
Springer-Verlag, Berlin/Heidelberg, 2006.

[13] Metaweb Technologies, Inc.: What is Freebase? http://www.freebase.com/view/en/what+is+freebase

[14] Stoermer, H.: Entity Name System: The Back-bone of an Open and Scalable Web of Data. In: Proceedings of the IEEE International Conference on Semantic Computing, ICSC 2008, number CSS-ICSC 2008-4-28-25. IEEE, August 2008. http://www.okkam.org/publications/stoermer-EntityNameSystem.pdf

[15] Brin, S.; Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Seventh International World-Wide Web Conference (WWW 1998), April 14-18, 1998, Brisbane, Australia. http://ilpubs.stanford.edu:8090/361/