. "2017-06-13T05:47:58Z" . . "%VOSWARNING%\n\n\n%META:TOPICPARENT{name=\"VOSArticles\"}%\n\n---+Business Intelligence Extensions for SPARQL\n\nOrri Erling (Program Manager, OpenLink Virtuoso) and %BR%\nIvan Mikhailov (Lead Developer, OpenLink Virtuoso).\n\nOpenLink Software, 10 Burlington Mall Road Suite 265 Burlington, MA 01803 U.S.A%BR%\n[[http://www.openlinksw.com/][http://www.openlinksw.com/]]\n\n%TOC%\n\n---++Abstract\n\nWe believe that the possibility to use SPARQL as a front end to\nheterogenous data without significant cost in performance or expressive\npower is key to RDF taking its rightful place as the lingua franca of\ndata-integration. To this effect, we demonstrate how RDF and SPARQL can\ntackle a standard relational workload.\n\nWe discuss extending SPARQL for business intelligence (BI) workloads and\nrelate experiences on running SPARQL against relational and native RDF\ndatabases. We use the well known TPC-H benchmark as our reference schema\nand workload. We define a mapping of the TPC-H schema to RDF and restate\nthe queries as BI extended SPARQL. To this effect, we define aggregation\nand nested queries for SPARQL. \n\nWe demonstrate that it is possible to perform the TPC-H workload restated\nin SPARQL against an existing RDBMS without loss of performance or\nexpressivity and without changes to the RDBMS.\n\n\n---++Introduction and Motivation\n\nRDF promises to be a top-level representation for data extracted or dynamically\nmapped from any conceivable source. Thus, RDF's chief promise is in the field\nof information integration, analysis and discovery. Yet it is difficult to imagine\nany business reporting, let alone more complex information integration task that\nwould not involve aggregating and grouping.\n\nAs a data access and data integration vendor, OpenLink has a natural\ninterest in seeing SPARQL succeed as a top level language for answering\nbusiness questions on data mapped from any present day data warehouse or\nother repository. \n\nThis potential role of SPARQL is however fundamentally undermined if\nSPARQL cannot perform any part of the database industry's baseline business\nintelligence benchmark, TPC-H.\n\nTo this effect, we have extended SPARQL with expressions in results,\naggregates and grouping and derived tables. These extensions allow a\nstraightforward translation of arbitrary SQL queries to SPARQL. We call\nthis extended SPARQL \"SPARQL BI\".\n\nWe demonstrate the operation of SPARQL BI versions of TPC-H queries on\nrelational data managed by Virtuoso and Oracle. We also demonstrate the same\nworkload on the same data stored as RDF in Virtuoso.\n\n---++Test Data\nWe use the TPC-H schema mapped to RDF in all our examples. The table names\nare directly converted to classes and the column names are directly converted\nto predicates in namespace [[http://www.openlinksw.com/schemas/TPC-H][http://www.openlinksw.com/schemas/TPC-H]]. The\nprefix tpch is used to refer to this namespace throughout the paper.\n\n---++SPARQL Extensions\n---+++Expressions\n\nIn its proposed recommendation form, SPARQL does not allow returning any\nvalue that is not retrieved through a triple pattern. Expressions are only allowed\nin filters but cannot be returned.\nWe lift this restriction by allowing expressions in the result set.\nConsider:\n\n\nSELECT (?extendedprice * (1 - ?discount))\n WHERE\n {\n ?line a tpch:lineitem ;\n tpch:lineextendedprice ?extendedprice ;\n tpch:linediscount ?discount\n }\n\n\nWe can shorten the notation as\n\nSELECT ( ?line->tpch:extendedprice * ( 1 - ?line->tpch:discount ) )\n WHERE { ?line a tpch:lineitem }\n\n\nThe -> (dereference) operator allows referring to a property without naming\nit as a variable. This is exactly equivalent to having a triple pattern binding\na variable to the mentioned property of the subject within the group pattern\nenclosing the reference. For a select, the enclosing group pattern is considered\nto be the top level pattern of the where clause or in the event of a union, the\ntop level of each term of the union. Each distinct dereference adds exactly one\ntriple pattern to the enclosing group pattern, thus multiple uses -> do not each\nadd a triple pattern. Having multiple copies of an identical pattern might lead\nto changes in cardinality if multiple input graphs were being considered.\n\nIf a line item had multiple discounts or extended prices, then we would get\nthe cartesian product of both. If a property referenced via -> is absent, the\nexpression does not get evaluated in the first place.\n\nThe optional dereference operator |> will produce an unbound value if the\nproperty does not exist. Further, mentioning the same chain of dereferences\nmultiple times in the same group pattern will not cause redundant triple patterns\nto be added or result in more joining that is necessary.\n\nWe further allow expressions in the place of variables in triple patterns. To\nscope the above query to orders by customers in France, we could write:\n\n\nSELECT ( ?li->tpch:extendedprice * ?li->tpch:discount )\n WHERE\n {\n ?li a tpch:lineitem .\n ?li->tpch:l_orderkey->tpch:o_custkey->tpch:c_nationkey tpch:n_name \"France\"\n }\n\nThe sequence of dereferences expands into triple patterns, as in:\n\n... ?li tpch:l_orderkey ?v1 .\n ?v1 tpch:o_custkey ?v2 .\n ?v2 tpch:c_nationkey ?v3 .\n ?v3 tpch:n_name \"France\" .\n\n\n\n---+++Aggregation\n\nWe introduce the SUM, COUNT, AVG, MIN, and MAX aggregate functions\nfrom SQL. Their semantics with respect to NULL are inherited from SQL. To\ncount result rows without regard to any value being defined, COUNT (*) is\nintroduced.\n\nIf grouping is desired, aggregate expressions can be combined with non-\naggregate expressions in a selection list. The non-aggregate expressions\nwill function as grouping columns, i.e., the aggregates are calculated for\neach distinct combination of the grouping columns. No special GROUP BY\nclause is needed.\n\n\nSELECT ?l->tpch:l_linestatus \n COUNT(*) \n SUM(?l->tpch:extendedprice)\n WHERE { l a tpch:lineitem }\n\ngives the count and total value of lineitems for each distinct lineitem status.\nUser defined aggregates from Virtuoso SQL can be used in SPARQL as well,\nusing the sql: namespace.\n\n---+++Nesting of Queries and Named Results\n\nSQL allows nesting queries, in effect treating the evaluation of a query as\na table (derived table) or as a value in an expression (scalar subquery).\nWe allow embedding a SPARQL SELECT in the place of a triple pattern. The\nsyntax is as in --\n\n\nSELECT ?line\n WHERE\n {\n ?line a tpch:lineitem .\n { SELECT MAX ( ?l2->tpch:extendedprice ) AS ?maxprice\n WHERE { ?l2 a tpch:lineitem } \n } .\n FILTER ( line->tpch:extendedprice = ?maxprice ) \n }\n\n\nThis selects all lineitems with extendedprice equal to the highest extended-\nprice in the set of lineitems.\n\nWe note that we have a SQL-style explicit comparison for joining the nested\nselect with the outer select. The bindings that are in scope in the pattern\ncontaining the nested select are also in scope inside the nested select. In\nthis the scope rules resemble SQL' s rules for subqueries.\n\nIn Virtuoso SQL and Virtuoso/PL we allow SPARQL queries in all places\nwhere \"plain\" SQL SELECT could be used, e.g., SQL query can contain SPARQL\nsubqueries.\n\n---++Sample Queries\n\nDue to space constraints, we chose to pick only two of the [[http://demo.openlinksw.com/DAV/home/demo/tpch/][twenty-two queries of the TPC-H workload]]. These were selected because of their relative complexity\nand use of nested queries.\n\nThe TPC-H definition states the business questions for Q15 and Q18 as\nfollows:\n\nQ15, The Top Supplier Query finds the supplier who contributed the \nmost to the overall revenue for parts shipped during a given quarter \nof a given year. In case of a tie, the query lists all suppliers \nwhose contribution was equal to the maximum, presented in supplier \nnumber order.\n\nQ18, The Large Volume Customer Query finds a list of the top 100 \ncustomers who have ever placed large quantity orders. The query lists \nthe customer name, customer key, the order key, date and total price \nand the quantity for the order.\n\n\nThe Q15 SQL text used with Virtuoso is:\n\n SELECT s_suppkey, \n s_name, \n s_address, \n s_phone, \n total_revenue\n FROM supplier,\n ( SELECT l_suppkey AS supplier_no,\n SUM( l_extendedprice * ( 1 - l_discount ) ) AS total_revenue\n FROM lineitem\n WHERE l_shipdate >= { D '1996-01-01' } \n AND l_shipdate < { FN timestampadd ( SQL_TSI_MONTH, 3, {d '1996-01-01'} ) }\n GROUP BY l_suppkey\n ) AS revenue\n WHERE s_suppkey = supplier_no \n AND total_revenue = ( SELECT max(total_revenue)\n FROM ( SELECT l_suppkey AS supplier_no, \n SUM ( l_extendedprice * ( 1 - l_discount ) ) AS total_revenue\n FROM lineitem\n WHERE l_shipdate >= { D '1996-01-01' } \n AND l_shipdate < { FN timestampadd ( SQL_TSI_MONTH, 3, {d '1996-01-01'} ) }\n GROUP BY l_suppkey \n ) AS revenue\n )\n ORDER BY s_suppkey\n ;\n\n\n\nThe corresponding SPARQL BI text is:\n\n\nSPARQL\nPREFIX tpch \nSELECT ?supplier \n ?s_name \n ?s_address \n ?s_phone \n ?total_revenue\n WHERE\n {\n ?supplier a tpch:supplier ;\n tpch:s_name ?s_name ;\n tpch:s_address ?s_address ;\n tpch:s_phone ?s_phone .\n { \n SELECT ?supplier\n (\n SUM ( l_extendedprice * ( 1 - l_discount ) ) \n )\n AS ?total_revenue\n WHERE\n {\n ?lineitem a tpch:lineitem ;\n tpch:l_shipdate ?l_shipdate ;\n tpch:l_suppkey ?supplier .\n FILTER\n ( \n ?l_shipdate >= xsd:date ('1996-01-01') \n AND\n ?l_shipdate < bif:dateadd ('month', 3, xsd:date ('1996-01-01')) \n )\n }\n }\n { \n SELECT MAX ( ?all_totals.total_revenue ) AS ?maxtotal\n WHERE\n {\n { \n SPARQL \n SELECT ( \n SUM ( l_extendedprice * ( 1 - l_discount ) ) \n )\n AS ?total_revenue\n WHERE\n {\n ?lineitem a tpch:lineitem ;\n tpch:l_shipdate ?l_shipdate ;\n tpch:l_suppkey ?l_suppkey .\n FILTER\n ( ?l_shipdate >= xsd:date ('1996-01-01') \n AND\n ?l_shipdate < bif:dateadd ('month', 3, xsd:date ('1996-01-01')) \n )\n }\n } AS all_totals\n }\n }\n FILTER \n ( ?total_revenue = ?maxtotal )\n }\n ORDER BY ?supplier\n;\n\n\nThe Virtuoso text of Q18 is:\n\n\n SELECT c_name, \n c_custkey, \n o_orderkey, \n o_orderdate, \n o_totalprice,\n SUM ( l_quantity )\n FROM lineitem, \n orders, \n customer\n WHERE\n o_orderkey \n IN ( SELECT l_orderkey\n FROM lineitem\n GROUP BY l_orderkey\n HAVING SUM( l_quantity ) > 250 \n )\n AND c_custkey = o_custkey\n AND o_orderkey = l_orderkey\nGROUP BY c_name, \n c_custkey, \n o_orderkey, \n o_orderdate, \n o_totalprice\nORDER BY o_totalprice DESC, \n o_orderdate;\n\n\nThe SPARQL BI version is:\n\n SELECT ?c_name \n ?customer \n ?order \n ?o_orderdate \n ?o_totalprice\n SUM ( ?l_quantity )\n FROM \n WHERE \n {\n ?customer a tpch:customer ; \n foaf:name ?c_name .\n ?order a tpch:order ;\n tpch:ordertotalprice ?o_totalprice ;\n tpch:orderdate ?o_orderdate ; \n tpch:has_customer ?customer .\n [ a tpch:lineitem ; \n tpch:linequantity ?l_quantity ;\n tpch:has_order ?order \n ] .\n {\n SELECT ?sum_order \n SUM (?quantity) AS ?sum_q\n WHERE \n {\n [ a tpch:lineitem ;\n tpch:linequantity ?quantity ;\n tpch:has_order ?sum_order \n ]\n }\n } .\n FILTER\n (\n ?sum_order = ?order \n AND \n ?sum_q > 250\n )\n }\nORDER BY DESC (?o_totalprice) \n ?o_orderdate\n\n\n\n---++Experiments\n\nFor this test, we had the test data on both an Oracle 10G server and a Virtuoso 5.0 server\non the same machine. The tables from the Oracle and Virtuoso servers were\nattached to another Virtuoso server, which served as the SPARQL front end.\n\nFor Q15, Virtuoso SQL gave the answer in 180 ms and with SPARQL the\nanswer took 3800 ms. For Q18, Virtuoso SQL gave 340 ms and SPARQL 371 ms.\n\nThe large difference with Q15 is due to the SQL compiler choosing a different\nplan because the reformulated text has a different structure. Some further tuning\nwill eliminate the difference.\n\nWith the Oracle backend, we obtained the correct answers but our setup\ndid not pass the queries through to Oracle as a single SQL statement; hence\nthe performance was less than would have been seen if the queries were natively\nsubmitted to Oracle.\n\nSome further adjustments will result in the queries passing through the\npipeline as single statements, at which point we will have a negligible\ntranslation overhead.\n\nThe test database was at 1 percent scale, hence the results are not about\nTPC-H performance per se, but are solely aimed at verifying that the correct\nanswers are produced and queries are executed as close to the data as possible.\n\nWe also dumped the data as physical triples stored in Virtuoso. Our aim is to\narrange things so that the physical triples version will at no point be more than\nthree times slower than the equivalent relational setting on Virtuoso, running\nwith a database scaled to 100G. Reaching this requires some enhancements to\nour SQL implementation, specifically for dealing with queries with dozens of\njoined tables. We note that we get a self-join for each column referenced. This\nwork was not completed at the submission deadline.\n\nSince the features discussed were first implemented within days of the\nsubmission deadline, no tuning or adaptation of the Virtuoso SQL was\npossible within the time limit, hence results are not anywhere near final\nand most interesting experiments had to be left out. \n\nWe intend to further study the comparative performance of SPARQL going to\nnatively stored triples and compare this with SQL performance with single\ninstance and clustered Virtuoso databases. One line of future work is\nbenchmarking SPARQL and SQL based vertical storage schemes. We note that\nthe RDF model is a vertical storage scheme almost by nature. Declaring that\ntriples with given predicates be stored apart, in a table that keeps only\nsubject and object results in a column-oriented store.\n\nWe further intend to broaden the scope of the present example around TPC-H \nby including more sources in the mapping. This will demonstrate that the\nsame queries can be run without loss of performance on a number of similar\nbut distinct relational database instances. Thus SPARQL does become a data\nintegration tool that exceeds the capabilities of SQL views merging data from\nmultiple sources, for example.\n\n---++Conclusions\n\nThe work discussed here demonstrates the feasibility of querying existing \nrelational data through extended SPARQL without loss of performance or \nexpressivity and without any modification to the relational data store \nin question. A skeptic might ask what the value of an alternate syntax for SQL is, when\nSQL is universally known and applied. We would point out that us bringing\nSPARQL on par with SQL for decision support queries is not aimed at replacing\nSQL but at making SPARQL capable of fulfilling its role as a language for\nintegration.\n\nIndeed, we retain all of SPARQL's and RDF's flexibility for uniquely\nidentifying entities, for abstracting away different naming conventions,\nlayouts and types of primary and foreign keys and so forth.\n\nIn the context of mapping relational data to RDF, we could map several\ninstances of comparable but different schemas to the common terminology and\ncouch all our queries within this terminology. Further, we can join from this\nworld of mapped data to native RDF data, such as the data in the Linking\nOpen Data project. For example, we could join regional sales data to the US\ncensus data set within a single query.\n\nOnce we have demonstrated that performance or expressivity barriers do not\ncripple SPARQL when performing traditional SQL tasks, we have removed a\nsignificant barrier from enterprise adoption of RDF and open data.\n\n---++Related \n\n * [[http://demo.openlinksw.com/DAV/home/demo/tpch/][TPC-H workload Sample Queries]]\n\n\n---++References\n 1 W3C RDF Data Access Working Group: SPARQL Query Language for RDF: [[http://www.w3.org/TR/rdf-sparql-query/][http://www.w3.org/TR/rdf-sparql-query/]]\n 1 Transaction Processing Performance Council: TPC-H a Decision Support Benchmark. \n[[http://www.tpc.org/tpch/][http://www.tpc.org/tpch/]]\n 1 Orri Erling, Ivan Mikahilov: Adapting an ORDBMS for RDF Storage and Mapping. Proceedings \nof the First Conference on Social Semantic Web. Leipzig (CSSW 2007), SABRE. Volume P-113 \nof GI-Edition - Lecture Notes in Informatics. Bonner Kollen Verlag, ISBN 978-3-88579-207-9" . "a5b4b19f1a3fdddfd4945e4188025dd9" . . . "VOSArticleBISPARQL2" . "2017-06-13T05:47:58Z" . . "VOSArticleBISPARQL2" . . "VOSArticleBISPARQL2" . . . . . . . . . . . "2017-06-13T05:47:58.569581"^^ . "2017-06-13T05:47:58.569581"^^ . . . .