Instance Data Guidance

CASE provides an ontology to the community. The ontology is written using RDF as its base language, using the OWL2 vocabulary to define classes, properties and relationships. This RDF is serialized in the Turtle syntax.

CASE instance data is also written using RDF as its base language. Instance data is serialized in the JSON-LD syntax instead of Turtle, to support CASE producers and consumers that work day-to-day in JSON instead of graph engines. (Data can be converted between JSON-LD, Turtle, and other RDF formats with readily available tooling.)

This document provides CASE community guidance and practices for designing instance data.

Node identifier format

RDF graphs have nodes, which are linkable data; literals, which are data that can be linked to but can only be annotated with a limited set of primitive types; and edges, which link nodes to either other nodes or to literals.

Nodes and edges are namespaced identifiers, typically seen in instance data in an abbreviated Prefix format: prefix:identifier. A context definition in the graph will provide the expanded form of this abbreviation. For an instance data node identifier, the namespace will typically represent a knowledge base, e.g. kb:identifier would be defined as expanding to http://example.org/kb/identifier. This can be seen in JSON-LD in a context dictionary:

{
  "@context": {
    ...
    "kb": "http://example.org/kb/",
    ...
  },
  "@graph": [
    {
      "@id": "kb:node1",
      ...
    }
  ]
}

To an RDF engine, that JSON-LD snippet is equivalent to the following, with the prefix expanded:

{
  "@graph": [
    {
      "@id": "http://example.org/kb/node1",
      ...
    }
  ]
}
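The prefix expansion above can be sketched in a few lines of Python. This is a simplified illustration only, not the full JSON-LD expansion algorithm; the function name expand_compact_iri is hypothetical, and the context is assumed to be a flat dictionary of prefix-to-namespace strings.

```python
def expand_compact_iri(compact_iri: str, context: dict) -> str:
    """Expand prefix:identifier using a flat prefix-to-namespace mapping."""
    prefix, sep, suffix = compact_iri.partition(":")
    # Leave the value alone if it has no colon or the prefix is not declared.
    if not sep or prefix not in context:
        return compact_iri
    # A suffix starting with "//" indicates an absolute IRI such as http://...
    if suffix.startswith("//"):
        return compact_iri
    return context[prefix] + suffix


context = {"kb": "http://example.org/kb/"}
expanded = expand_compact_iri("kb:node1", context)
print(expanded)  # http://example.org/kb/node1
```

Production code would normally rely on a JSON-LD processor for this step rather than string handling.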

RDF is a flexible language, providing many ways to represent the same data. The example instance data that the CASE community provides follows a few conventions for the benefit of interoperability among CASE community members.

Node identifier prefixes

URN vs. HTTP(S)

RDF identifiers must be IRIs. An IRI can follow one of several schemes, including URN, HTTP, or HTTPS. Which scheme to use is up to the consumer, but CASE examples use one particular prefix, http://example.org/kb/, for a few different reasons:

Early in example data drafting, CASE based its knowledge base IRI on urn:example:, a special-case URN designated for use in examples. Unfortunately, a graph technology the community used was unable to handle the urn: scheme, specifically in JSON-LD. That technology has since released a fix, but we keep this historical note as a reminder to CASE producers to test instance data against the use cases of their expected consumers.
The domain example.org is defined in IETF RFC 6761, Section 6.5, to be a non-resolving domain. Graph engines might retrieve data for IRIs encountered in their graphs as users navigate, but they are expected to know that processing http://example.org should not result in a network retrieval (whether through their own hard-coded logic or through lower-level DNS resolution). Note, though, that the prohibitions on resolving example.org in RFC 6761 are worded as SHOULD NOTs; only DNS registrars are assigned a MUST NOT, which prohibits registration.

Hash vs. slash

Namespace prefixes in RDF typically end with one of two characters, which leads to the "hash or slash?" decision: Should the prefix end with a # character to represent an HTML within-page anchor point, or with a / character to represent an independent page at the end of an IRI?

CASE examples end their knowledge base prefix with a slash character, based on the assumption that a knowledge base navigator might be supporting multiple elementary types of clients: Graph engines, which might make programmatic requests of the IRI; and web browsers, for users wanting to view HTTP renders of the IRI. IRIs that end in hash might cause an expectation that a knowledge base provide a dump of all node identifiers to a web browser, and rely on the browser to skip into the middle of the page.

Note that CASE and UCO ontology files follow the # pattern, because even the largest ontology files between CASE and UCO have a concise memory footprint, on the order of kilobytes. In contrast, a knowledge base will likely hit millions of node identifiers early in its usage for any case analysis.
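The difference between the two styles can be seen by splitting an IRI into the part a client requests and the fragment it keeps to itself. The sketch below uses Python's standard urllib.parse.urldefrag; the hash-style IRI is a hypothetical counterexample, not a form CASE examples use, and no network request is made.

```python
from urllib.parse import urldefrag

# Hypothetical hash-style node IRI, for comparison only.
hash_style = "http://example.org/kb#node1"
# Slash-style node IRI, as used in CASE examples.
slash_style = "http://example.org/kb/node1"

# urldefrag returns the IRI without its fragment, plus the fragment itself.
hash_request, hash_fragment = urldefrag(hash_style)
slash_request, slash_fragment = urldefrag(slash_style)

# Hash style: every node shares one requestable document.
print(hash_request, hash_fragment)         # http://example.org/kb node1
# Slash style: each node has its own requestable IRI and no fragment.
print(slash_request, repr(slash_fragment))  # http://example.org/kb/node1 ''
```

With the hash style, every node identifier in the knowledge base maps to the same requested document, which is the "dump of all node identifiers" concern described above.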

Blank prefix avoidance

CASE examples use the prefix kb: for instance data, e.g. kb:node1. Early in example data drafting, a blank prefix was used, e.g. :node1. This is allowed in most RDF syntaxes, but JSON-LD requires prefixes not be blank (per JSON-LD 1.1, Section 9.1), because some JSON processing technologies are not able to handle the empty string as a dictionary key.
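A producer could guard against blank prefixes before publishing. The following is a minimal sketch, assuming the context is embedded in the document as a dictionary (or a list containing dictionaries); find_blank_prefixes is a hypothetical helper name, not part of any CASE tooling.

```python
import json


def find_blank_prefixes(document: dict) -> list:
    """Return the namespaces bound to an empty-string prefix in @context."""
    context = document.get("@context", {})
    # A context may be one dictionary, or a list mixing dictionaries and IRIs.
    contexts = context if isinstance(context, list) else [context]
    return [entry[""] for entry in contexts
            if isinstance(entry, dict) and "" in entry]


blank = json.loads(
    '{"@context": {"": "http://example.org/kb/"}, "@graph": [{"@id": ":node1"}]}'
)
named = json.loads(
    '{"@context": {"kb": "http://example.org/kb/"}, "@graph": [{"@id": "kb:node1"}]}'
)

print(find_blank_prefixes(blank))  # ['http://example.org/kb/']
print(find_blank_prefixes(named))  # []
```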

Node identifier suffixes

Background on UUIDs

This guidance considers five UUID versions (RFC 9562 has since added versions 6 through 8):

v1 and v2 potentially leak personally identifiable information (PII)
v4 is generated from random numbers
v3 (MD5) and v5 (SHA-1) are built by hashing controlled input data

Wikipedia provides a further description of UUID versions.

Guidance for using UUIDs in CASE JSON-LD

An RDF triple consists of a subject, predicate, and object, where the subject is a node, and the object may be a node or literal value. For all non-blank nodes in CASE RDF graphs, UUIDs should be generated as the trailing part of the identifier. (Blank nodes are nodes that do not have an explicit identifier.) CASE does not specify a version for adopters because of different pros/cons for the versions. The recommendation for end-users is to use v3/v5 for use cases where repeatable identifiers are desired, i.e. the same input will result in the same UUID. (This is helpful for, say, documentation example serializations generated by software.) Version 4 is recommended for use cases where every node created should have a unique UUID.
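The trade-off between repeatable and unique identifiers can be illustrated with Python's standard uuid module. This is a sketch under assumptions: the namespace derived from http://example.org/kb/ and the seed string are illustrative choices, and CASE does not prescribe either.

```python
import uuid

# Assumed namespace for repeatable UUIDs, derived from the example KB IRI.
KB_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/kb/")


def repeatable_uuid(seed: str) -> uuid.UUID:
    """Version 5: the same seed always yields the same UUID."""
    return uuid.uuid5(KB_NAMESPACE, seed)


def unique_uuid() -> uuid.UUID:
    """Version 4: every call yields a new random UUID."""
    return uuid.uuid4()


first = repeatable_uuid("documentation-example-node-1")
second = repeatable_uuid("documentation-example-node-1")
print(first == second)                 # True
print(unique_uuid() == unique_uuid())  # False
```

The seed for a version 3 or 5 UUID should be chosen so that distinct nodes never share it, since identical seeds collapse to one identifier.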

The JSON-LD produced via CASE (examples on caseontology.org and GitHub) uses one of the versions above in the following manner:

kb:name-of-CASE-concept-generated-UUID: (E.g. kb:cyber-relationship-aaaa9b36-0126-42aa-9cc6-1e81ca31111)

The name-of-CASE-concept portion is some rendering of the @type of the node. This portion of the identifier comes from community members' experience working with graph data and UUID-based node identifiers. For instance, an analyst querying for objects in a CASE bundle could be presented with these results to the query "What CASE objects are in this bundle?":

kb:4eda8e3d-f047-4fc4-90a6-6d99ebced1a8
kb:590e30b2-232a-4883-84da-2a45361eb57a
kb:762b9a42-596a-4c70-a43b-c3f57f90cf93
kb:9da7064c-9442-449f-96ea-f6e20fbcac43
kb:f1e888a4-7a9d-42d9-af5e-01144ceda3ef

Or, the analyst could be presented with these results:

kb:malware-f1e888a4-7a9d-42d9-af5e-01144ceda3ef
kb:person-762b9a42-596a-4c70-a43b-c3f57f90cf93
kb:person-9da7064c-9442-449f-96ea-f6e20fbcac43
kb:picture-590e30b2-232a-4883-84da-2a45361eb57a
kb:relationship-4eda8e3d-f047-4fc4-90a6-6d99ebced1a8

This practice is only meant to provide an informal hint to the type of the node, and carries no programmatically-derivable significance.
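Building identifiers in this style takes only string formatting. The sketch below uses a hypothetical helper, make_node_id, and reuses the UUIDs listed above; the lowercase rendering of the type hint is an assumption, since any rendering of the @type would serve.

```python
import uuid


def make_node_id(type_hint: str, node_uuid: uuid.UUID, prefix: str = "kb") -> str:
    """Join prefix, an informal type hint, and a UUID into a node identifier."""
    return f"{prefix}:{type_hint.lower()}-{node_uuid}"


node_ids = [
    make_node_id("relationship", uuid.UUID("4eda8e3d-f047-4fc4-90a6-6d99ebced1a8")),
    make_node_id("picture", uuid.UUID("590e30b2-232a-4883-84da-2a45361eb57a")),
    make_node_id("person", uuid.UUID("762b9a42-596a-4c70-a43b-c3f57f90cf93")),
    make_node_id("person", uuid.UUID("9da7064c-9442-449f-96ea-f6e20fbcac43")),
    make_node_id("malware", uuid.UUID("f1e888a4-7a9d-42d9-af5e-01144ceda3ef")),
]

# Sorting groups the identifiers by their hint, which helps a human reader.
for node_id in sorted(node_ids):
    print(node_id)
```

Consumers should still read the @type of a node rather than parse the hint out of its identifier.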

An @id is accompanied by an @type dictionary key where a node is defined, while @type is typically omitted where the @id is used as an object reference within a triple (that is, as a JSON-LD value). This is because the @type JSON dictionary key implements the RDF type designation. E.g. this JSON-LD:

{
  "@id": "kb:someclass-d903683d-c6e1-4ba8-a6cf-cc39006a58e0",
  "@type": "uco-core:UcoObject"
}

is semantically equivalent to this in Turtle:

kb:someclass-d903683d-c6e1-4ba8-a6cf-cc39006a58e0 rdf:type uco-core:UcoObject .

Most Turtle serializations would present that with rdf:type shortened to a:

kb:someclass-d903683d-c6e1-4ba8-a6cf-cc39006a58e0 a uco-core:UcoObject .
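The correspondence between the JSON-LD node and its Turtle line can be sketched by hand in Python. This is an illustration only: node_to_turtle is a hypothetical helper that assumes compact IRIs whose prefixes are declared elsewhere, and it handles nothing beyond @id and @type. Real conversions would use an RDF library.

```python
def node_to_turtle(node: dict) -> str:
    """Render the type statement of a JSON-LD node as one Turtle line."""
    types = node["@type"]
    # JSON-LD allows @type to be a single string or a list of strings.
    if isinstance(types, str):
        types = [types]
    # Turtle's "a" keyword stands for the RDF type predicate.
    return f'{node["@id"]} a {", ".join(types)} .'


node = {
    "@id": "kb:someclass-d903683d-c6e1-4ba8-a6cf-cc39006a58e0",
    "@type": "uco-core:UcoObject",
}
print(node_to_turtle(node))
# kb:someclass-d903683d-c6e1-4ba8-a6cf-cc39006a58e0 a uco-core:UcoObject .
```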