Federated Queries with SPARQL

Sometimes when you are writing SPARQL, you want to combine information from datasets that live in different places and have different SPARQL endpoints. In the past, I've done this using PHP or Ruby code. My code would perform the necessary query on each dataset and then combine the results returned.
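
For comparison, the client-side approach described above can be sketched in a few lines. This is a minimal Python sketch rather than the original PHP or Ruby code: the function name `join_bindings` and the sample rows are hypothetical, and the rows stand in for results that would already have been fetched from two separate endpoints.

```python
from collections import defaultdict

def join_bindings(left, right, key):
    """Join two lists of result rows (dicts) on a shared variable name."""
    # Index the right-hand rows by the value of the join variable.
    index = defaultdict(list)
    for row in right:
        if key in row:
            index[row[key]].append(row)
    # Pair every left-hand row with each right-hand row sharing that value.
    joined = []
    for row in left:
        for match in index.get(row.get(key), []):
            joined.append({**row, **match})
    return joined

# Hypothetical rows standing in for the results of two separate queries.
worldcat_rows = [
    {"oclcNumber": "1210", "bib": "http://www.worldcat.org/oclc/1210"},
]
other_rows = [
    {"oclcNumber": "1210", "work": "http://example.org/work/1"},
    {"oclcNumber": "9999", "work": "http://example.org/work/2"},
]

combined = join_bindings(worldcat_rows, other_rows, "oclcNumber")
print(combined)
```

The join logic lives in the client here, which is the work a federated query hands over to the SPARQL engine.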

I've always wondered whether there was a more efficient way to do this. Ideally, I wanted to perform this task within SPARQL itself. That is exactly what federated queries provide. They were introduced in SPARQL 1.1, and the basic idea is that, as a client of one SPARQL endpoint, you can also query datasets that sit behind other SPARQL endpoints. For my experiments and prototyping, I'm using a local instance of Fuseki 1.3 to run my queries.

Using WorldCat Data to Feed a Wikidata Query

So, what kind of federated query might you want to run? Well, how about combining data from a specific WorldCat Bib graph with Wikidata? Let's look at a very simple use case, in which I load the graph for a given OCLC record URI, extract the value of the OCLC Number predicate, and then query Wikidata's SPARQL endpoint based on that number.

PREFIX schema: <http://schema.org/>
PREFIX library: <http://purl.org/library/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

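SELECT ?work
FROM <http://www.worldcat.org/oclc/1210>
WHERE {
  # Read the OCLC Number from the WorldCat Bib graph
  <http://www.worldcat.org/oclc/1210> library:oclcnum ?oclcNumber .

  # Ask Wikidata for the item with that OCLC control number (P243)
  SERVICE <https://query.wikidata.org/sparql> {
    ?work wdt:P243 ?oclcNumber .
  }
}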

The crucial part of the query here is the SERVICE keyword. This allows you to specify the URL for a SPARQL endpoint for a dataset that you want to run a query against. The SERVICE keyword is also repeatable, so you can run a set of queries against multiple SPARQL endpoints. Keep in mind latency, though, because you will be waiting for each endpoint to respond before you get a complete response.
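
To make the "repeatable" point concrete, here is a minimal Python sketch that assembles a query with one SERVICE block per remote endpoint. The helper `build_federated_query` is hypothetical, and the pattern sent to the second endpoint is only a placeholder, not a claim about how that dataset is modelled. Note that SERVICE blocks written one after another are joined with each other, so every block has to match for a row to come back.

```python
def build_federated_query(variables, from_graph, local_patterns, services):
    """Assemble a SELECT query with one SERVICE block per remote endpoint.

    services is a list of (endpoint_url, [triple_pattern, ...]) pairs.
    """
    lines = ["SELECT " + " ".join("?" + v for v in variables)]
    lines.append("FROM <" + from_graph + ">")
    lines.append("WHERE {")
    lines += ["  " + pattern for pattern in local_patterns]
    for endpoint, patterns in services:
        lines.append("  SERVICE <" + endpoint + "> {")
        lines += ["    " + pattern for pattern in patterns]
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines)

query = build_federated_query(
    ["work", "other"],
    "http://www.worldcat.org/oclc/1210",
    ["<http://www.worldcat.org/oclc/1210> <http://purl.org/library/oclcnum> ?oclcNumber ."],
    [
        ("https://query.wikidata.org/sparql",
         ["?work <http://www.wikidata.org/prop/direct/P243> ?oclcNumber ."]),
        # Placeholder pattern for a second endpoint (an assumption for illustration).
        ("http://bnb.data.bl.uk/sparql",
         ["?other ?anyPredicate ?oclcNumber ."]),
    ],
)
print(query)
```

The resulting string would be sent to the one endpoint you are a client of, which then contacts each SERVICE endpoint in turn. That is where the latency adds up.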

Querying Multiple Datasets

Let's look at another example in which I'm getting data back from WorldCat, Wikidata and the British Library. Here, I'm trying to find all the "Works" Wikidata and the British Library think are associated with the creator within a specific WorldCat Bib graph. I'm returning their URIs and their "English" title.

PREFIX schema: <http://schema.org/> 
PREFIX library: <http://purl.org/library/> 
PREFIX wdt: <http://www.wikidata.org/prop/direct/> 
PREFIX dct: <http://purl.org/dc/terms/>
 
SELECT ?work ?workLabel
FROM <http://www.worldcat.org/oclc/470488115> 
WHERE {  
<http://www.worldcat.org/oclc/470488115> schema:author ?creatorURI.
BIND(replace(STR(?creatorURI), "^(.*[\\/])*", "") AS ?creatorID)  
 
{SERVICE <https://query.wikidata.org/sparql> {
  ?author wdt:P214 ?creatorID.
  ?work wdt:P50 ?author .
  ?work wdt:P1476 ?workLabel
  } 
}  
UNION
{
SERVICE <http://bnb.data.bl.uk/sparql> {  
?bl_creator  ?creatorURI.
?work dct:creator ?bl_creator.
?work dct:title ?workLabel
 }
} 
}
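
The BIND line is the least obvious part of this query: it strips everything up to and including the last slash of the creator URI, leaving only the trailing identifier to match against Wikidata's VIAF ID property (P214). A small Python sketch of the same regular expression shows the effect. The function name `creator_id` and the sample URI are hypothetical.

```python
import re

# The SPARQL string "^(.*[\\/])*" unescapes to the regex ^(.*[\/])*,
# which greedily matches everything up to and including the last slash.
LAST_SLASH_PREFIX = re.compile(r"^(.*[\/])*")

def creator_id(creator_uri):
    """Return the final path segment of a URI, as the BIND/replace does."""
    return LAST_SLASH_PREFIX.sub("", creator_uri)

# Hypothetical creator URI used only for illustration.
print(creator_id("http://viaf.org/viaf/12345"))
```

A value without any slash passes through unchanged, because the pattern then matches only the empty string at the start.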

Creating a New Graph from Data in Different Datasets

So, what if you have a dataset of your own that you want to extend with data from another dataset? One potential example of this is to get multi-lingual title and description information for a Bib graph and then create a new graph with this information as well as the original data from the Bib graph.

PREFIX schema: <http://schema.org/> 
PREFIX library: <http://purl.org/library/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX wdt: <http://www.wikidata.org/prop/direct/> 
 
CONSTRUCT
{
<http://www.worldcat.org/oclc/1210> ?p ?o.
<http://www.worldcat.org/oclc/1210> schema:name ?name.
<http://www.worldcat.org/oclc/1210> schema:description ?description .
}

FROM <http://www.worldcat.org/oclc/1210> 
WHERE {  
<http://www.worldcat.org/oclc/1210> ?p ?o.
<http://www.worldcat.org/oclc/1210> library:oclcnum ?oclcNumber.  
 
SERVICE <https://query.wikidata.org/sparql> {  
    ?work wdt:P243 ?oclcNumber. 
    ?work rdfs:label ?name.
    OPTIONAL {
        ?work schema:description ?description .
    }
} 
}

This will create a graph that has all of the basic statements associated with the Bib URI as well as "title" and "description" information from Wikidata. The new information from Wikidata has been added as statements associated with the existing Bib URI. I can now choose to store this graph in my own triple store or publish it as a static graph document at a URI of my choosing.
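
One way to picture what CONSTRUCT does: the template is filled in once for every solution the WHERE clause produces, and a template triple is simply left out for a solution in which one of its variables is unbound (for example, when the OPTIONAL description did not match). The following Python sketch mimics that behaviour on plain tuples. The function `construct` and the solution rows are hypothetical, and prefixed names are kept as plain strings for brevity.

```python
def construct(template, solutions):
    """Instantiate a CONSTRUCT-style template once per solution row.

    Terms starting with "?" are variables. A triple is skipped for a row
    when any of its variables is unbound in that row.
    """
    graph = set()
    for row in solutions:
        for triple in template:
            terms = []
            for term in triple:
                if term.startswith("?"):
                    term = row.get(term[1:])  # None when the variable is unbound
                terms.append(term)
            if None not in terms:
                graph.add(tuple(terms))
    return graph

BIB = "http://www.worldcat.org/oclc/1210"
template = [
    (BIB, "schema:name", "?name"),
    (BIB, "schema:description", "?description"),
]
# Hypothetical solutions; the second row has no description bound.
solutions = [
    {"name": "Example title", "description": "Example description"},
    {"name": "Titre d'exemple"},
]

graph = construct(template, solutions)
for triple in sorted(graph):
    print(triple)
```

Because the result is a set, duplicate triples produced by different solutions collapse into one, just as they do in an RDF graph.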

However, doing that wouldn't really meet best practices for linked data. Because I've created my own graph for a "thing," I should mint my own URI for my description. I can cache any statements I want to use in my description and add a schema:sameAs reference into the graph to link my description to the sources I drew the data from. The SPARQL query that does this looks like this.

PREFIX schema: <http://schema.org/> 
PREFIX library: <http://purl.org/library/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX wdt: <http://www.wikidata.org/prop/direct/> 
 
CONSTRUCT
{
?worldcatSubjects ?p ?o. 
<http://www.mylibrary.org/bib/oclc/1210> schema:name ?wc_name.
<http://www.mylibrary.org/bib/oclc/1210> schema:description ?wc_description .
<http://www.mylibrary.org/bib/oclc/1210> schema:name ?name.
<http://www.mylibrary.org/bib/oclc/1210> schema:description ?description .
<http://www.mylibrary.org/bib/oclc/1210> schema:sameAs <http://www.worldcat.org/oclc/1210> .
<http://www.mylibrary.org/bib/oclc/1210> schema:sameAs ?work .
}

FROM <http://www.worldcat.org/oclc/1210> 
WHERE {  
?worldcatSubjects ?p ?o.
<http://www.worldcat.org/oclc/1210> schema:name ?wc_name.
<http://www.worldcat.org/oclc/1210> schema:description ?wc_description.
<http://www.worldcat.org/oclc/1210> library:oclcnum ?oclcNumber.
FILTER regex(STR(?worldcatSubjects), "worldcat.org").  
 
SERVICE <https://query.wikidata.org/sparql> {  
    ?work wdt:P243 ?oclcNumber. 
    ?work rdfs:label ?name.
    OPTIONAL {
        ?work schema:description ?description .
    }
} 
}

What is happening in the CONSTRUCT? First, I'm pulling in all the statements whose subject contains the string "worldcat.org". Next, I'm adding triples for my new graph with the URI <http://www.mylibrary.org/bib/oclc/1210>. These statements include all the values for the schema:name and schema:description predicates I've gathered from both WorldCat.org and Wikidata. Lastly, I'm adding schema:sameAs predicates that say my graph describes the same thing as the graphs in WorldCat.org and Wikidata that I've gathered data from.

There is a lot of room for interpretation when creating a new graph. I've chosen to cache the entire bib graph in my local graph, but not to add the predicates and objects associated with the bib subject URI <http://www.worldcat.org/oclc/1210> to my bib subject URI <http://www.mylibrary.org/bib/oclc/1210>. I could also have chosen to incorporate all the predicates and objects associated with the bib subject URI into my bib subject URI. There are pros and cons to each approach. I chose this one because it makes for a simpler CONSTRUCT query.
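
For illustration, the alternative mentioned above (restating the bib subject's predicates and objects on the locally minted URI) could look roughly like this when done on an already fetched set of triples. This is a minimal Python sketch under stated assumptions: the function `restate_on_local_uri` and the sample triples are hypothetical, and prefixed names are kept as plain strings.

```python
def restate_on_local_uri(triples, source, target):
    """Copy every statement about `source` onto `target` and link the two."""
    # Restate each (source, p, o) as (target, p, o).
    copied = {(target, p, o) for (s, p, o) in triples if s == source}
    # Keep the cached originals and record where the data came from.
    link = {(target, "schema:sameAs", source)}
    return set(triples) | copied | link

WORLDCAT_BIB = "http://www.worldcat.org/oclc/1210"
MY_BIB = "http://www.mylibrary.org/bib/oclc/1210"

# Hypothetical cached statements standing in for the WorldCat Bib graph.
cached = {
    (WORLDCAT_BIB, "schema:name", "Example title"),
    (WORLDCAT_BIB, "library:oclcnum", "1210"),
    ("http://www.worldcat.org/example/other", "schema:name", "Something else"),
}

local_graph = restate_on_local_uri(cached, WORLDCAT_BIB, MY_BIB)
for triple in sorted(local_graph):
    print(triple)
```

The trade-off is visible even at this size: the local URI becomes self-describing, but every copied statement is now duplicated and has to be kept in sync with its source.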

I hope you've learned a lot from reading this month's posts on Querying Linked Data. Next month, our posts and webinar will focus on creating Linked Data. If you've been following along and have questions or need a refresher, next week is our Querying Linked Data webinar. So sign up and join us to learn more.

Karen Coombs
Senior Product Analyst