Linked Data formats, tools, challenges, opportunities; CSVW, schema.org/Dataset, schema.org/ScholarlyArticle

There’s a lot of potential value in existing data and research.
Unfortunately, in general, we publish datasets without enough metadata to determine even the datatype or the physical unit of the measured variables. Was this treated as a dependent variable or an independent variable? When was the dataset collected?

We thus need tools that support publishing new data and research in Linked Data formats, and tools that let us annotate and extract structured data from PDFs (and from threaded comments, which may indicate relations to other datasets and findings).

What are the use cases for Linked Data in [open source] Data Engineering?

  • Publish a ScholarlyArticle predicated upon logical premises: Datasets, derived statistics
  • Prepare a Meta-Analysis predicated upon multiple ScholarlyArticles, trials
  • Consume ScholarlyArticles with a tool that discovers study control URIs, filters with and without inclusion criteria, performs blind analyses upon the linked Datasets
  • Publish a Dataset such that downstream users have sufficient metadata to merge, join, and concatenate if appropriate (see the sketch after this list)
  • Annotate already-published ScholarlyArticles not just with highlights and threaded comments,
    but with structured data points and relations to other resources that may appear to confirm or refute a given finding
  • Enable one-click installation of requisite software for conducting the statistical analysis and generating charts
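
For the Dataset-publishing use case above, here is a minimal sketch of a schema.org/Dataset description in JSON-LD, written as a Python dict so a tool could emit it automatically. Every name, URL, and property choice below is a placeholder, not a recommended profile.

    import json

    # Minimal sketch of a schema.org/Dataset description in JSON-LD.
    # All values are placeholders; the property selection is illustrative only.
    dataset = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": "Example temperature measurements",
        "description": "Hourly readings from a hypothetical sensor.",
        "identifier": "https://doi.org/10.0000/example",  # placeholder DOI
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "variableMeasured": {
            "@type": "PropertyValue",
            "name": "air temperature",
            "unitCode": "CEL",  # UN/CEFACT code for degrees Celsius
        },
        "distribution": {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/data/temperature.csv",
        },
    }

    with open("dataset.jsonld", "w") as f:
        json.dump(dataset, f, indent=2)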

Which tools support Linked Data? RDF, JSON-LD, CSVW?

How can I add a few metadata rows to the top of a spreadsheet to indicate (with URIs) what each column describes, its [XSD] datatype, and its physical unit [meters, metres, or centimeters], and then export that to a format from which other tools, today and tomorrow, can easily gain value by reusing the data in other contexts?
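
One hedged answer is CSVW: rather than putting extra rows into the spreadsheet itself, you publish a small JSON metadata file next to the CSV that gives each column a property URI and an XSD datatype. CSVW has no built-in unit property, so the sketch below borrows QUDT as a "common property" annotation; all URIs, column names, and the QUDT identifiers are assumptions.

    import json

    # Sketch of a CSVW metadata file (measurements.csv-metadata.json) that
    # describes a plain CSV.  Datatypes map onto XSD types; the unit
    # annotation reuses QUDT because CSVW itself has no unit keyword.
    metadata = {
        "@context": "http://www.w3.org/ns/csvw",
        "url": "measurements.csv",
        "tableSchema": {
            "columns": [
                {
                    "name": "specimen",
                    "titles": "specimen",
                    "datatype": "string",
                    "propertyUrl": "http://example.org/vocab#specimen",
                },
                {
                    "name": "length",
                    "titles": "length",
                    "datatype": "decimal",  # maps to xsd:decimal
                    "propertyUrl": "http://example.org/vocab#length",
                    # assumed QUDT IRIs for "unit = centimetre"
                    "http://qudt.org/schema/qudt/unit": {
                        "@id": "http://qudt.org/vocab/unit/CentiM"
                    },
                },
            ]
        },
    }

    with open("measurements.csv-metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)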

What are some of the existing formats, schema, vocabularies, and ontologies for linked data publishing, provenance, and reproducibility? Where are the gaps?

How can we store and publish Linked Data today?

  • Flat files on my computer:
    • data.rdf.xml, data.ttl, data.jsonld.json (JSON-LD)
    • HTML + extra attributes (RDFa, Microdata)
  • Data Catalogs / Data Repositories:
  • Databases:
    • SQL (BLOB/JSON columns, limited schema adaptations, EAV)
    • AtomSpace (a hypergraph for AGI)
    • Graph Databases
    • Triplestores / Quadstores
  • Query languages:
    • SQL (BLOB/JSON columns, limited schema adaptations, EAV)
    • SPARQL
      • RDF* and SPARQL* (RDF-star/SPARQL-star: statement-level annotations, closer to property graphs)
    • GraphQL-LD
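
To make the options above a bit more concrete, here is a small sketch using rdflib (assuming rdflib 6+, which bundles a JSON-LD serializer): parse some Turtle, run a SPARQL query over the in-memory graph, and re-serialize the same triples as JSON-LD. The data and URIs are placeholders.

    from rdflib import Graph

    # Placeholder Turtle describing one dataset.
    ttl = """
    @prefix ex:     <http://example.org/> .
    @prefix schema: <https://schema.org/> .

    ex:dataset1 a schema:Dataset ;
        schema:name "Example temperature measurements" ;
        schema:variableMeasured ex:airTemperature .
    """

    g = Graph()
    g.parse(data=ttl, format="turtle")

    # The same SPARQL works whether the graph lives in memory, in a
    # triplestore, or behind a remote endpoint.
    query = """
    PREFIX schema: <https://schema.org/>
    SELECT ?name WHERE { ?ds a schema:Dataset ; schema:name ?name . }
    """
    for row in g.query(query):
        print(row.name)

    # Round-trip the triples into JSON-LD (serializer bundled with rdflib 6+).
    print(g.serialize(format="json-ld"))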

What are the existing and possible solutions for overcoming the performance bottlenecks of more-verbose, more-expressive, and less-storage-efficient Linked Data formats?
(The data reusability unlocked by network effects is a significant opportunity.)

  • RDF HDT
  • Database optimization
  • Projects like Apache Arrow, which minimize reshaping when data is copied between disk and RAM
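
As a toy sketch of the Arrow point above (assuming pyarrow is installed; column names and values are invented): the on-disk Feather/Arrow layout mirrors the in-memory columnar layout, which is what keeps the disk-to-RAM copy cheap.

    import pyarrow as pa
    import pyarrow.feather as feather

    # A small columnar table; in practice this could be the tabular payload of
    # a published Dataset, with the Linked Data metadata carried alongside it.
    table = pa.table({
        "specimen": ["a", "b", "c"],
        "length_cm": [12.3, 14.1, 9.8],
    })

    # Write and read back; no row-by-row reshaping is needed because the
    # file layout matches the in-memory layout.
    feather.write_feather(table, "measurements.feather")
    roundtrip = feather.read_table("measurements.feather")
    print(roundtrip.schema)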

Really great summary of the problem and survey of the landscape. Solving this would significantly increase data enrichment and the ability to combine data sources from various publications.

Hey thanks. I probably should have clarified that I’m seeking everyone’s answers to these same questions and not just talking to myself.

I’ve been thinking about this at Gigantum (gigantum.com). The cloud platform side of our tool is pretty new, so still in the early stages of shaping what that looks like. Soon we’re going to add the ability to mint DOIs for Datasets which is a good step, but I’ve been thinking a lot about adding some sort of metadata via a JSON-LD file, so datasets are more useful and discoverable.

One of our foundational principles is that, given the crazy number of things people need to do to make their data science work transparent and useful, automation is necessary. So I’d really like to be able to automatically generate a JSON-LD file and serve it alongside the dataset. My biggest problem is that the standards are dense and I don’t see a lot of good examples. It’s not clear to me what the minimal useful set of metadata is, and, in particular, fields that can be automatically populated are ideal.
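
To make that concrete, the kind of minimal, automatically populated record I have in mind might look something like the sketch below; every field and value is just a guess at what a platform could fill in without asking the user.

    import hashlib
    import json
    import os
    from datetime import datetime, timezone

    def auto_dataset_metadata(path):
        """Guess at a minimal schema.org/Dataset record from facts the
        platform already knows: file name, size, checksum, timestamps."""
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        stat = os.stat(path)
        return {
            "@context": "https://schema.org",
            "@type": "Dataset",
            "name": os.path.basename(path),
            "dateModified": datetime.fromtimestamp(
                stat.st_mtime, tz=timezone.utc
            ).isoformat(),
            "distribution": {
                "@type": "DataDownload",
                "contentSize": str(stat.st_size),
                # "sha256" is not a schema.org term; stand-in for however
                # the platform records checksums.
                "sha256": digest,
            },
        }

    print(json.dumps(auto_dataset_metadata("measurements.csv"), indent=2))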

I’d be very interested to hear what metadata would be most useful to generate and include in a JSON-LD file and any other tips or examples of people doing this well.

I agree that it is worrying that we still haven’t realized the potential of machine-readable, self-describing data publishing on a large scale.

At the same time, you point out the historic problem: verbosity.

I’m just chiming in here with some pointers. I wrote a post on the topic, Linked Data Science - For improved understandability of computer-aided research, and also presented on it at Linked Data Sweden 2018: Semantic Web ❤ Data Science? Practical large scale semantic data handling with RDFIO & RDF-HDT. We also made a small, ugly URI resolver based on the RDF-HDT command-line client: urisolve.

In summary, my biggest hopes would be in the direction of something like SWI-Prolog (for its versatility and relative performance on in-memory datasets) and RDF-HDT.

Some people might cringe when I mention Prolog, but SWI is actually under continuous development and even has a web-based “Jupyter-style” notebook feature called SWISH. SWI also has excellent semweb support, including an RDF-HDT plugin. Blazegraph, too, has turned out to handle large datasets well, but I don’t find SPARQL to be a very productive environment: you can’t build re-usable queries the way you can in Prolog, so you can’t “build / sharpen your tools” to increase your power to do things. Prolog really shines there. Perhaps we need some more modern and/or data-oriented solution, though? If Datomic (which implements the Datalog subset of Prolog) were open source, I could imagine it being THE solution.

My 5c.