Making Sense of Linked Data with Python
This is a guest post by Jeff Mixter, a colleague in OCLC Research who works regularly with linked data using Python.
Introduction
Continuing the series on consuming linked data, this post will focus on how to use Python to load, parse and traverse RDF data. As with the “Server-Side Linked Data Consumption with Ruby” post, we will go over three basic tasks for consuming RDF with Python.
- Fetching the data
- Parsing the data returned into a graph
- Traversing the graph to display the data
To get started, you will need to download and install some Python modules. While there are several Python modules for working with RDF data, this post will focus on two. The first module is called RDFLib (https://github.com/RDFLib/rdflib), and it supports the fundamental functions required for loading, parsing and traversing RDF data. The second module is called RDFLib JSON-LD (https://github.com/RDFLib/rdflib-jsonld), and it adds the ability to parse and serialize RDF data as JSON-LD. JSON-LD is a popular serialization for use in web development and is also gaining traction as a serialization that search engines can consume.
To install these modules, you can download the GitHub code and manually install them by navigating to the download directory and entering the command:
python setup.py install
Or, if you have PIP (https://pypi.python.org/pypi/pip) installed, you can just enter the following commands:
pip install rdflib
pip install rdflib-jsonld
Please note that you will need to install RDFLib before installing RDFLib JSON-LD. Once these two modules are installed, you are ready to start working with RDF data.
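A quick way to confirm that both modules installed correctly is to import them and print the RDFLib version (a minimal sanity check; the exact version number will vary):
import rdflib
import rdflib_jsonld
# If both imports succeed without errors, the installs worked
print(rdflib.__version__)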
Fetching Data
There are three common scenarios for obtaining RDF data to work with in Python. Either,
- you have an RDF file on your local computer that you will read in your Python script,
- your Python code will retrieve RDF data from the web,
- or your Python code can create its own RDF data to load and parse.
In keeping with the previous Ruby blog post, we will work with the second option and retrieve our RDF data from the web.
Next, we need to decide which serialization to work with. RDF data can be serialized in a variety of ways, including as N-Triples, Turtle, RDF/XML and JSON-LD. Python makes handling this variety easy, as the combination of the RDFLib and RDFLib JSON-LD modules provides support for them all.
When fetching data from the web, different RDF serializations can be requested by setting the appropriate HTTP Accept header. Below is example code for requesting RDF data from WorldCat.org for the record with OCLC number 82671871, serialized as RDF/XML. Each RDF serialization has its own Accept header value. Due to shifting recommendations, some serializations have more than one header value that could be used, but typically a server will only accept one. The W3C provides recommendations on the HTTP headers for each serialization of RDF.
import urllib2
# The URI for the data you want to fetch
uri = 'http://www.worldcat.org/oclc/82671871'
# The content type you want to set in the Request Headers.
# This example is for RDF/XML
request_headers = {'Accept': 'application/rdf+xml'}
# Build the request with the URI and Header parameters
request = urllib2.Request(uri, headers = request_headers)
# Fetch the response
response = urllib2.urlopen(request)
# Read and print the response body
data = response.read()
print(data)
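For reference, here are the Accept header values commonly used for the serializations mentioned above, collected into a dictionary. This mapping is a convenience sketch, not an exhaustive list, and not every server honors every value:
# Common Accept header values for RDF serializations (a sketch)
accept_headers = {
    'rdfxml': 'application/rdf+xml',
    'ntriples': 'application/n-triples',  # some older servers expect 'text/plain'
    'turtle': 'text/turtle',
    'jsonld': 'application/ld+json',
}
# For example, to request Turtle instead of RDF/XML:
request_headers = {'Accept': accept_headers['turtle']}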
The urllib2 module is part of the Python 2 standard library (in Python 3 the same functionality lives in urllib.request), so you should not need to do anything other than import it. Once you have the data, the next step is to parse it into a graph using the RDFLib Python module.
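If you are running Python 3, an equivalent fetch using urllib.request looks like this (a minimal sketch of the same request):
from urllib.request import Request, urlopen
uri = 'http://www.worldcat.org/oclc/82671871'
request = Request(uri, headers={'Accept': 'application/rdf+xml'})
# read() returns bytes in Python 3, so decode to a string
data = urlopen(request).read().decode('utf-8')
print(data)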
Parsing the Data
Parsing data can also be referred to as loading data. In the following example code, the RDF/XML serialized data for OCLC number 82671871 is parsed into a graph, which will be stored in memory for the duration of the Python code’s execution. The script demonstrates that the parsing step worked by re-serializing the graph as N-Triples.
import urllib2
import rdflib
import rdflib_jsonld
# Code from Fetching Data example
uri = 'http://www.worldcat.org/oclc/82671871'
request_headers = {'Accept': 'application/rdf+xml'}
request = urllib2.Request(uri, headers = request_headers)
response = urllib2.urlopen(request).read()
rdf_triple_data = response
# Start of Parsing Data Code Example
# Create an empty graph that we can load data into
graph = rdflib.Graph()
# Parse the fetched data into the graph, telling RDFLib that the
# format of the data is RDF/XML ('xml')
graph.parse(data=rdf_triple_data, format='xml')
# To make sure it worked, we will serialize the graph as N-Triples
# ('nt') and print it out.
# The output should contain the same data that we initially parsed
# into the graph (the order of the triples does not matter)
new_data = graph.serialize(format='nt')
print(new_data)
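Because the rdflib_jsonld module registers a JSON-LD plugin with RDFLib, the same graph can also be re-serialized as JSON-LD. A minimal sketch:
# Re-serialize the graph as JSON-LD; the 'json-ld' format name
# becomes available once rdflib_jsonld is installed and imported
jsonld_data = graph.serialize(format='json-ld', indent=4)
print(jsonld_data)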
Instead of using the urllib2 module, you can use RDFLib alone to fetch and parse a specified URI. This has the benefit of not requiring you to determine the appropriate HTTP Accept headers. Instead, all you need to do is point RDFLib at the URI, and it will take care of the rest. Below is an example of how to use only RDFLib to parse a URI into a virtual graph:
import rdflib
import rdflib_jsonld
# Use the parse functions to point directly at the URI
uri = 'http://www.worldcat.org/oclc/82671871'
graph = rdflib.Graph()
graph.parse(uri)
new_graph = graph.serialize(format='nt')
print(new_graph)
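For completeness, the first scenario listed under Fetching Data, reading RDF from a local file, works the same way. A sketch, assuming an RDF/XML file named data.rdf in the working directory (the file name is illustrative):
import rdflib
# Parse a local RDF/XML file instead of a remote URI
graph = rdflib.Graph()
graph.parse('data.rdf', format='xml')
# len() on a graph returns the number of triples loaded
print(len(graph))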
Traversing the Graph
We are now ready to apply some interesting queries to the data that we fetched from the web and parsed into a virtual graph. Three examples follow, each with sample code.
- A SPARQL query for all Predicates
- Python functions for all Predicates
- A list of the Creative Work name(s), Author name(s), Descriptions and Subject name(s)
A SPARQL query to find all the Predicates
While you can use SPARQL queries in your Python code, note that they are not as efficient as the built-in RDFLib functions, which are described in the next example. The example code below loads the data from the previous step into a virtual graph and queries it for all of the Predicates.
import urllib2
import rdflib
import rdflib_jsonld
# Code from Fetching Data and Parsing Data examples
uri = 'http://www.worldcat.org/oclc/82671871'
request_headers = {'Accept': 'application/rdf+xml'}
request = urllib2.Request(uri, headers = request_headers)
response = urllib2.urlopen(request).read()
graph = rdflib.Graph()
graph.parse(data=response, format='xml')
# Form the SPARQL query
predicate_query = graph.query("""
select ?predicates
where {?s ?predicates ?o}
""")
# For each result, print the value
for row in predicate_query:
    print('%s' % row)
In addition to using RDFLib, Python has another popular SPARQL module called SPARQL Wrapper (https://github.com/RDFLib/sparqlwrapper), which can be used for querying remote SPARQL endpoints.
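A minimal SPARQL Wrapper sketch, using DBpedia's public endpoint purely as an illustration, looks like this:
from SPARQLWrapper import SPARQLWrapper, JSON
# Point at a remote SPARQL endpoint rather than an in-memory graph
sparql = SPARQLWrapper('http://dbpedia.org/sparql')
sparql.setQuery("""
select distinct ?predicate
where {?s ?predicate ?o}
limit 25
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
# Print each predicate URI from the JSON result bindings
for result in results['results']['bindings']:
    print(result['predicate']['value'])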
Using built-in RDFLib functions to find all the Predicates
Another way to traverse an RDF graph in Python is to use the built-in RDFLib functions. In addition to being faster to execute, the functions can also save a lot of time by combining what would otherwise be multiple SPARQL queries into a single function. The snippet of code that we will look at next does the same thing as the previous example code: it asks for all of the Predicates. But instead of using SPARQL, we will use the functions built into the RDFLib module. If you run both of these code snippets, notice how fast the RDFLib functions approach is compared to the SPARQL query approach.
import urllib2
import rdflib
import rdflib_jsonld
# Code from Fetching Data and Parsing Data examples
uri = 'http://www.worldcat.org/oclc/82671871'
request_headers = {'Accept': 'application/rdf+xml'}
request = urllib2.Request(uri, headers = request_headers)
response = urllib2.urlopen(request).read()
graph = rdflib.Graph()
graph.parse(data=response, format='xml')
# Grab a list of all of the Predicates in the graph
predicates = graph.predicates(subject=None, object=None)
# For each item in the predicates generator, print it out
for predicate in predicates:
    print(predicate)
There is a lot more that you can do with the RDFLib module code. For more detail, please take a look at their tutorial page on Navigating Graphs (http://rdflib.readthedocs.io/en/stable/intro_to_graphs.html).
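For instance, two other navigation helpers, subject_objects and value, cover common lookups. A sketch reusing the graph and uri from the example above (schema:name is assumed here as the labeling predicate):
# All (subject, object) pairs connected by schema:name
schema_name = rdflib.term.URIRef(u'http://schema.org/name')
for subj, obj in graph.subject_objects(predicate=schema_name):
    print('%s %s' % (subj, obj))
# Look up a single value directly for the record's URI
title = graph.value(subject=rdflib.term.URIRef(uri), predicate=schema_name)
print(title)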
Putting all the pieces together
Finally, we will go over a piece of functional code that extracts four sets of values from a virtual graph.
- Name(s) of the Creative Work
- Name(s) of the Author(s)
- Description of the Creative Work
- Name(s) of the Subject(s)
from __future__ import print_function
import sys
import rdflib
from rdflib import URIRef, Namespace, RDF, Graph, Literal, BNode, plugin, Variable
from optparse import OptionParser

# given a subject uri and a string for a schema.org predicate,
# return any matching objects, one per line,
# representing each object by its name property if available,
# otherwise representing the object by its uri
def get_labels(graph, uri, predicate_string):
    predicate = rdflib.term.URIRef(u'http://schema.org/' + predicate_string)
    name = rdflib.term.URIRef(u'http://schema.org/name')
    object_list = []
    for obj in graph.objects(uri, predicate):
        label = obj
        if graph.value(obj, name):
            label = graph.value(obj, name)
        object_list.append(label)
    object_labels = '\n'.join(object_list)
    return object_labels

def main():
    # set default uri and predicates
    uri = rdflib.term.URIRef(u'http://www.worldcat.org/oclc/82671871')
    predicates_delimited = "name,creator,description,about"
    # look for uri and predicates parameters that over-ride the defaults
    parser = OptionParser()
    parser.add_option("-u", dest="uri", help="The URI of the RDF resource", action='store')
    parser.add_option("-p", dest="predicates_delimited", help="A comma-separated list of predicates to list, e.g., name,creator,contributor,about", action='store')
    (options, args) = parser.parse_args(sys.argv)
    if options.uri:
        uri = rdflib.term.URIRef(options.uri)
    if options.predicates_delimited:
        predicates_delimited = options.predicates_delimited
    predicates = predicates_delimited.split(",")
    # create an in-memory RDF graph for the resource named in uri
    graph = rdflib.Graph()
    graph.parse(uri)
    # for each of the strings in the predicates list ...
    for predicate_string in predicates:
        # print a label(s) for any object(s) in the graph for the predicate
        print(get_labels(graph, uri, predicate_string))

if __name__ == "__main__":
    main()
The example code above can be broken into a few distinct pieces. First, we set a default URI (http://www.worldcat.org/oclc/82671871) and a default set of Predicates (name, creator, description and about) to query for. Second, we load the URI’s data into the graph, and then for each Predicate we call a function that searches the graph for the Predicate URI (which is composed by concatenating the Predicate label to the predefined URI prefix ‘http://schema.org/’). In the get_labels function, the code does two important things. First, it finds the Objects associated with the current Predicate, and then it checks whether each Object has a name. If the Object has a name, the function returns the name value; if there is no name for the Object, the URI is returned instead. This is important because names of related Objects are frequently not included, but you still want to be able to retrieve the URI for later lookup. This code can be viewed and downloaded from GitHub (https://github.com/mixterj/c4l16_ld_python).
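Assuming the script is saved as get_labels.py (a file name chosen here for illustration), it can be run with the defaults or with overrides:
python get_labels.py
python get_labels.py -u http://www.worldcat.org/oclc/82671871 -p name,about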
Conclusion
Python is very handy for working with RDF data. The RDFLib module provides a powerful set of tools for creating, parsing, traversing and editing RDF data. The maturity of the RDFLib module is complemented by Python’s intuitive syntax. These two standout features are the reasons that I choose to use Python when working with any RDF data.
-
Karen Coombs
Senior Product Analyst