Friday, December 20, 2013

How to manually convert Non-RDF (CSV, TXT) to RDF, insert it into the Virtuoso Triple-store, and query the data using SPARQL

WHY I DID THIS

I was researching how to use OpenLink Virtuoso in my own project. I wanted to try something basic first, so I started with the triple-store and SPARQL queries. This article shows how to transform a non-RDF file into RDF, insert the RDF into the Virtuoso triple-store, and then query the data using SPARQL.

Brief introduction

Transform Non-RDF to RDF

Many tools can transform non-RDF data into RDF. I will use the GRefine RDF Extension. Here is a tutorial video.

What is Virtuoso

OpenLink Virtuoso is an ambitious piece of software: it provides web, file, and database server functionality alongside native XML storage and universal data access middleware. And yes, it includes a triple-store implementation for storing RDF.

Let's focus on the Virtuoso triple-store.

The Virtuoso triple-store is built on top of a traditional RDBMS (see implementation here). Triples are stored in a table called RDF_QUAD (see table below). Every RDF document is composed of triples, and each triple is inserted as a row into the RDF_QUAD table.

RDF_QUAD table

Column Name   Data Type   Description
G             IRI_ID      Graph - Primary Key
S             IRI_ID      Subject - Primary Key
P             IRI_ID      Predicate - Primary Key
O             ANY         Object - Primary Key

The triple-store uses graphs to group triples. Every graph is identified by an IRI, which is similar to a URI. So when you insert data into the triple-store, you should specify the IRI of the target graph.

To search the triples in a graph, first give the IRI of the graph, then use Virtuoso SPARQL (which Virtuoso implements on top of SQL) to query the triple-store.
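To make the idea of graph-scoped triples concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the G/S/P/O layout described above, not how Virtuoso actually stores or indexes data. The names Quad, insert_triple and triples_in_graph, and the ex: identifiers, are hypothetical.

```python
from collections import namedtuple

# One row of a toy RDF_QUAD-like table: graph, subject, predicate, object.
Quad = namedtuple("Quad", ["g", "s", "p", "o"])


def insert_triple(store, graph_iri, s, p, o):
    """Add one triple to the named graph; the set silently ignores duplicates."""
    store.add(Quad(graph_iri, s, p, o))


def triples_in_graph(store, graph_iri):
    """Return the (s, p, o) triples of one graph, sorted for stable output."""
    return sorted((q.s, q.p, q.o) for q in store if q.g == graph_iri)


store = set()
insert_triple(store, "http://mytest.com", "ex:obs1", "rdf:value", "42")
insert_triple(store, "http://mytest.com", "ex:obs1", "rdf:value", "42")  # duplicate row
insert_triple(store, "http://other.example", "ex:obs2", "rdf:value", "7")

# Only triples from the requested graph come back.
print(triples_in_graph(store, "http://mytest.com"))
```

Because all four columns together form the key, inserting the same triple into the same graph twice leaves a single row, while the same triple in a different graph would be a separate row.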

My experiment

My goal is to combine air quality data with disease data using the Virtuoso triple-store engine. I found the source data in the AirNow and CDC databases.

Step 1: Convert Non-RDF to RDF


First I imported the data into the GRefine RDF Extension. Of course, you should trim any irrelevant data before importing. After this step, you should see results similar to those below.

 
Then I created the RDF skeleton by selecting RDF->Edit RDF Skeleton. If you don't know how to do this, watch the tutorial video above.

You can see the RDF skeleton graph below.
The result should be like this:

The next step is to export the RDF file: choose Export->RDF as XML or Export->RDF as Turtle. Done!
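If you would rather script the conversion than click through a GUI, the same kind of output can be produced by hand. The following Python sketch is not what the GRefine RDF Extension generates. It assumes a hypothetical CSV with the columns state, city, year, month and o3, and it reuses the vocabularies from the SPARQL query later in this post (including string literals as rdf:type values, which mirrors that query rather than best practice). The sample row is made up.

```python
import csv
import io

# Prefix declarations; the namespaces are the ones used in the SPARQL query in this post.
PREFIXES = (
    "@prefix aa: <http://airnow.gov/> .\n"
    "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n"
    "@prefix dc: <http://purl.org/dc/elements/1.1/> .\n"
    "@prefix time: <http://www.w3.org/2006/time#> .\n"
    "@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .\n"
    "@prefix place: <http://purl.org/ontology/places#> .\n"
)


def lit(text):
    """Quote a value as a Turtle string literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def row_to_turtle(index, row):
    """Render one CSV row as an observation with blank nodes for place and date."""
    return (
        f"aa:obs{index} rdf:value {lit(row['o3'])} ;\n"
        f"  rdf:type {lit('O3')} ;\n"
        f"  geo:location [ place:State {lit(row['state'])} ; place:City {lit(row['city'])} ] ;\n"
        f"  dc:date [ time:year {lit(row['year'])} ; time:month {lit(row['month'])} ] .\n"
    )


def csv_to_turtle(csv_text):
    """Convert CSV text (with a header row) into a Turtle document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    body = "".join(row_to_turtle(i, row) for i, row in enumerate(reader, start=1))
    return PREFIXES + "\n" + body


# Made-up sample row, for illustration only.
sample = "state,city,year,month,o3\nCA,Fresno,2010,7,0.061\n"
print(csv_to_turtle(sample))
```

Every value is written as a plain string literal here. A real conversion would probably want typed literals for the numbers and stable IRIs for places.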

Step 2: Insert RDF into Virtuoso Triple-store

Before you start, install Virtuoso. I recommend the Virtuoso Open-Source Edition. Try building from source to get the most stable version (which was 6.1.8 when I wrote this post).

Then connect to Virtuoso using the command-line tool isql. By default, isql should be located at /usr/local/bin/isql.

To insert a Turtle (TTL) file, use the command below. The last argument is the IRI of the graph that receives the triples:
SQL> DB.DBA.TTLP_MT (file_to_string_output ('tmp/users.ttl'), '', 'http://mytest.com');
To insert an RDF/XML file, use the command below:
SQL> DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output ('tmp/Kingsley_Idehen.rdf'), '', 'http://mytest.com');
For more details about RDF insertion, see here.
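If you load many files, typing these statements by hand gets tedious. The sketch below builds the statement text for a given file, picking the loader from the file extension. It only uses the two procedures shown above. The helper names load_statement and sql_quote and the extension mapping are my own assumptions, and the resulting string still has to be pasted into or piped to isql yourself.

```python
import os

# Extension-to-procedure mapping; both procedures are the ones shown above.
LOADERS = {
    ".ttl": "DB.DBA.TTLP_MT",
    ".rdf": "DB.DBA.RDF_LOAD_RDFXML_MT",
}


def sql_quote(text):
    """Wrap text in single quotes, doubling embedded quotes as SQL requires."""
    return "'" + text.replace("'", "''") + "'"


def load_statement(path, graph_iri):
    """Return the isql statement that loads one file into the given graph."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in LOADERS:
        raise ValueError(f"no loader known for extension {ext!r}")
    return (
        f"{LOADERS[ext]} (file_to_string_output ({sql_quote(path)}), "
        f"'', {sql_quote(graph_iri)});"
    )


print(load_statement("tmp/users.ttl", "http://mytest.com"))
print(load_statement("tmp/Kingsley_Idehen.rdf", "http://mytest.com"))
```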

Step 3: Query data with SPARQL

First, open the Virtuoso web UI and log in as dba. Choose the Linked Data tab, then open the SPARQL sub-tab.
Set the Default Graph IRI to the one you used when inserting your RDF.

Enter the SPARQL query in the Query section. If you don't know what SPARQL is or how to use it, read this quick start guide.

Here is the result I got:


And this is the SPARQL I used:



PREFIX aa: <http://airnow.gov/>
PREFIX dd: <http://wonder.cdc.gov/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX time: <http://www.w3.org/2006/time#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX place: <http://purl.org/ontology/places#>
SELECT ?year ?month ?city ?state ?cause ?death ?O3 ?PM
where {

?deathURI rdf:value ?death;
  dc:date ?date1;
  geo:location ?loc1;
  rdf:type 'Disease of the nervous system';
  rdf:type ?cause.

?date1 time:year ?year;
  time:month ?month.

?loc1 place:State ?state.

?O3URI rdf:value ?O3;
  geo:location ?loc2;
  dc:date ?date2;
  rdf:type 'O3'.

OPTIONAL{
?PMURI rdf:value ?PM;
  geo:location ?loc2;
  dc:date ?date2;
  rdf:type 'PM2.5'.
}

?loc2 place:State ?state;
  place:City ?city.

?date2 time:year ?year;
  time:month ?month.
}
ORDER BY ?year ?month
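
The web UI is convenient for experiments, but the same query can also be sent over HTTP. The Python sketch below only builds the request URL. It assumes a default local install with the SPARQL endpoint at http://localhost:8890/sparql and JSON results requested through the format parameter, so adjust both for your setup. The function name sparql_url and the short sample query are placeholders.

```python
from urllib.parse import urlencode


def sparql_url(endpoint, query, graph_iri):
    """Build a GET URL for a SPARQL endpoint, scoped to one default graph."""
    params = {
        "default-graph-uri": graph_iri,  # same IRI that was used when loading the RDF
        "query": query,
        "format": "application/sparql-results+json",  # assumed result format
    }
    return endpoint + "?" + urlencode(params)


# Assumed endpoint of a default local Virtuoso install.
ENDPOINT = "http://localhost:8890/sparql"
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

url = sparql_url(ENDPOINT, QUERY, "http://mytest.com")
print(url)
```

With the server running, urllib.request.urlopen(url) should return the result set, which json.load can then parse.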