This document details how large RDF data set files can be bulk loaded into Virtuoso. The data sets may consist of multiple files, which may be loaded into one or several graphs.
The bulk loader functions are included in Virtuoso commercial version 06.02.3129 and open source version 6.1.3, but must be manually loaded into older versions.
The directory containing the files to be loaded must be listed in the DirsAllowed parameter defined in the Virtuoso INI file, after which the Virtuoso server must be restarted.
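Before restarting the server, it can help to confirm that the data directory really is listed. The following is a minimal sketch, not part of Virtuoso: the helper name `dir_allowed` is hypothetical, it assumes DirsAllowed is a single comma-separated line, and it does not check which INI section the line sits in.

```shell
# Hypothetical helper: succeed if DIR is one of the comma-separated
# entries on the first DirsAllowed line of INI_FILE.
dir_allowed() {
  ini_file=$1
  dir=$2
  # Take the DirsAllowed line, drop the "key =" part, split on commas,
  # trim surrounding whitespace, then look for an exact match.
  grep -E '^[[:space:]]*DirsAllowed[[:space:]]*=' "$ini_file" | head -n 1 |
    sed -e 's/^[^=]*=//' | tr ',' '\n' |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |
    grep -Fxq -- "$dir"
}
```

For example, `dir_allowed /path/to/virtuoso.ini /path/to/files || echo "add the directory to DirsAllowed"` prints the reminder when the directory is missing (both paths are placeholders).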
The data files must have one of the following file name extensions, which the rdf_loader_run() function understands.
Any of these may be compressed with gzip (i.e., with the additional .gz file name extension) to save space; in that case, they will be automatically expanded by the bulk loader.
.grdf | Geospatial RDF |
.nq | N-Quads |
.nt | N-Triples |
.owl | OWL |
.rdf | RDF/XML |
.trig | TriG |
.ttl | Turtle |
.xml | RDF/XML |
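The graph IRI into which each data file is loaded is taken from a graph file, if one exists, or otherwise from the graph IRI given in the ld_dir() or ld_dir_all() function call.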
The content of a file with the same name as a data file plus the .graph filename extension will be used for that data file (e.g., my_data.n3.graph will be used with my_data.n3).
The content of a file named global.graph will be used for any and all other data files in that directory.
If the graph IRI argument (graph_iri) of ld_dir() or ld_dir_all() is NULL, any data files that do not have a corresponding .graph file will not be loaded.
<source-file>.<ext>
<source-file>.<ext>.graph
global.graph

myfile.n3         ;; RDF data
myfile.n3.graph   ;; Contains the graph IRI into which RDF data from myfile.n3 will be loaded
global.graph      ;; Contains the graph IRI into which RDF data from any files that do not have a specific graph file will be loaded
Each *.graph file should contain only the graph IRI, e.g., http://dbpedia.org.
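The graph files described above can be created with any editor; the following is a small sketch of doing it from a shell. The function names are hypothetical, and writing the IRI without a trailing newline is a cautious assumption so that the file holds nothing but the IRI.

```shell
# Hypothetical helper: write the graph IRI for one data file
# into "<data-file>.graph" next to it.
write_graph_file() {
  data_file=$1
  graph_iri=$2
  # No trailing newline, so the file contains only the IRI.
  printf '%s' "$graph_iri" > "$data_file.graph"
}

# Hypothetical helper: write the fallback graph IRI for every other
# data file in DIR into "DIR/global.graph".
write_global_graph() {
  dir=$1
  graph_iri=$2
  printf '%s' "$graph_iri" > "$dir/global.graph"
}
```

For example, `write_graph_file /path/to/files/myfile.n3 http://dbpedia.org` creates /path/to/files/myfile.n3.graph containing http://dbpedia.org.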
Use isql to register the file(s) to be loaded by running the appropriate function, e.g.:
SQL> ld_dir ('/path/to/files', '*.n3', 'http://dbpedia.org');
Use ld_dir() to load only from the specified directory, excluding any subdirectories:
SQL> ld_dir ('<source-filename-or-directory>', '<file name pattern>', '<graph iri>');
Use ld_dir_all() to load from the specified directory, including any and all subdirectories:
SQL> ld_dir_all ('<source-filename-or-directory>', '<file name pattern>', '<graph iri>');
The DB.DBA.load_list table can be used to check the list of data sets registered for loading, and the graph IRIs into which they will be or have been loaded.
The ll_state field can have three values: 0 indicates the data set is yet to be loaded; 1 indicates the data set load is in progress; and 2 indicates the data set load is complete:
SQL> select * from DB.DBA.load_list;
ll_file             ll_graph      ll_state  ll_started            ll_done               ll_host  ll_work_time  ll_error
VARCHAR NOT NULL    VARCHAR       INTEGER   TIMESTAMP             TIMESTAMP             INTEGER  INTEGER       VARCHAR
_______________________________________________________________________________________________________________________
./dump/d1/file1.n3  http://file1  2         2010.10.20 9:21.18 0  2010.10.20 9:21.18 0  0        NULL          NULL
./dump/d2/file2.n3  http://file2  2         2010.10.20 9:21.18 0  2010.10.20 9:21.18 0  0        NULL          NULL
./dump/file.n3      http://file   2         2010.10.20 9:21.18 0  2010.10.20 9:21.18 0  0        NULL          NULL

3 Rows. -- 1 msec.
SQL>
Perform the bulk load of all registered data by running the rdf_loader_run() function:
SQL> rdf_loader_run();
The rdf_loader_run() function prototype is:
rdf_loader_run (IN max_files INTEGER := NULL, IN log_enable INT := 2)
The effect of the default log_enable = 2 setting is that triggers are disabled, which speeds up the loading of data.
If triggers are required (e.g., for RDF Graph replication between nodes), then the log_enable mode should be set to 3 when calling the rdf_loader_run() function, as follows:
rdf_loader_run (log_enable=>3);
On a multi-core machine, it is recommended that data sets be split into multiple files, and that these be registered in the DB.DBA.load_list table with the ld_dir() function.
Once the files are registered for load, the rdf_loader_run() function can be run multiple times in parallel (we recommend a maximum of one rdf_loader_run() call for every 2.5 processor cores) to parallelize the data load and hence maximize load speed.
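The 2.5-cores-per-loader guideline can be turned into a number with integer arithmetic. This is a sketch only: the function name `loader_count` is hypothetical, rounding down reflects the word "maximum" in the recommendation, and the floor of one loader on very small machines is an assumption.

```shell
# Hypothetical helper: number of parallel rdf_loader_run() calls for a
# machine with CORES processor cores, at one loader per 2.5 cores.
loader_count() {
  cores=$1
  # cores / 2.5 == cores * 2 / 5; shell integer division rounds down.
  n=$(( cores * 2 / 5 ))
  # Always run at least one loader.
  if [ "$n" -lt 1 ]; then
    n=1
  fi
  echo "$n"
}
```

On many Linux and macOS systems the core count can be read with `getconf _NPROCESSORS_ONLN`, so `loader_count "$(getconf _NPROCESSORS_ONLN)"` gives a starting point; an 8-core machine yields 3 loaders.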
A sample script that can be run from the command line (e.g., bulk_load.sh) might look like:
# Set up the Virtuoso environment
. /opt/openlink/virtuoso/virtuoso-enterprise.sh
# Start seven loaders in the background
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
# Wait for all loaders to finish, then checkpoint
wait
isql 1111 dba dba exec="checkpoint;"
This can be run with the simple command:
sh /opt/openlink/virtuoso/bin/bulk_load.sh
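When the number of loaders varies between machines, the repeated lines of such a script can be generated rather than typed. The sketch below only prints the script body and runs nothing; the function name `gen_bulk_load_script` is hypothetical, and the port 1111 and dba/dba credentials are simply carried over from the loader lines of the sample.

```shell
# Hypothetical helper: print the body of a bulk load script that starts
# LOADERS background rdf_loader_run() calls, waits, then checkpoints.
gen_bulk_load_script() {
  loaders=$1
  i=0
  while [ "$i" -lt "$loaders" ]; do
    printf '%s\n' 'isql 1111 dba dba exec="rdf_loader_run();" &'
    i=$((i + 1))
  done
  # Block until every background loader exits, then make the load durable.
  printf '%s\n' 'wait' 'isql 1111 dba dba exec="checkpoint;"'
}
```

For example, `gen_bulk_load_script 4 > my_bulk_load.sh` writes four loader lines followed by the wait and checkpoint lines; the environment setup line from the sample would still need to be added at the top.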
A running bulk load can be stopped by calling rdf_load_stop(), at which point all currently running threads will be allowed to complete and then exit:
SQL> rdf_load_stop();
Once rdf_loader_run() is complete, you can check the DB.DBA.load_list table to confirm all data sets were loaded successfully. This is indicated by an ll_state value of 2 and an ll_error value of NULL.
On a Virtuoso cluster, the "cl_exec('rdf_ld_srv(log_enable)')" command (where log_enable is 2 or 3, as with the rdf_loader_run() function) can be used to invoke a single "rdf_loader_run()" on each node of the cluster:
SQL> cl_exec('rdf_ld_srv()');

Done. -- 265956 msec.
SQL>