More granular control over Blank node serialization #2549

Open
TheMessik opened this issue Jun 21, 2024 · 1 comment
Labels
enhancement Incrementally add new feature

Comments

@TheMessik

Version

4.10.0

Feature

When serializing a DatasetGraph into NQ format, I find that all blank nodes with specified labels get a "B" prepended to the label, e.g. a blank node with a label "students" would be serialized as "_:Bstudents".
This is somewhat annoying for my use case: an RML engine needs to follow a particular spec, including filling in blank node patterns.

My current workaround is a regex replacement on the serialized output, but this is far from ideal.

I'd like to suggest a more granular control of how the NQ writer (and all writers in general) handle Blank nodes: give the user an option to preserve the original blank node without prepending a "B" in front of the label.

Code example that performs the serialization:

DatasetGraph graph = ...; // some graph
OutputStream out = new ByteArrayOutputStream();
RDFWriter.source(graph)
    .lang(Lang.NQ)
    .output(out);
String serialized = out.toString().replaceAll("_:B", "_:");
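
A plain `replaceAll("_:B", "_:")` also rewrites any `_:B` that happens to appear inside an IRI or a string literal. As a stopgap, a slightly safer version of the same workaround can be sketched in plain Java. This is only an illustration: the class and method names are hypothetical, and it assumes the writer added exactly one "B" and did not otherwise encode the label (i.e. the original labels were plain alphanumerics) and that the output contains no comments.

```java
// Hypothetical helper, not part of Jena: strips the leading "B" from blank
// node labels in N-Quads text while leaving IRIs and literals untouched.
final class BlankNodePrefixStripper {

    static String stripPrefix(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int n = text.length();
        int i = 0;
        while (i < n) {
            char ch = text.charAt(i);
            if (ch == '<') {
                // IRI: copy everything up to and including the closing '>'.
                int close = text.indexOf('>', i);
                int end = (close < 0) ? n : close + 1;
                out.append(text, i, end);
                i = end;
            } else if (ch == '"') {
                // Literal: copy up to the closing quote, skipping backslash escapes.
                int j = i + 1;
                while (j < n && text.charAt(j) != '"') {
                    j += (text.charAt(j) == '\\') ? 2 : 1;
                }
                int end = Math.min(j + 1, n);
                out.append(text, i, end);
                i = end;
            } else if (text.startsWith("_:", i)) {
                // Blank node: the label runs until the next whitespace.
                int end = i + 2;
                while (end < n && !Character.isWhitespace(text.charAt(end))) {
                    end++;
                }
                String label = text.substring(i + 2, end);
                // Assumption: a leading 'B' was added by the writer. Keep a
                // label that is just "B" so the result is never empty.
                if (label.length() > 1 && label.charAt(0) == 'B') {
                    label = label.substring(1);
                }
                out.append("_:").append(label);
                i = end;
            } else {
                out.append(ch);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "_:Bstudents <http://ex/p> \"see _:Bfoo\" <http://ex/_:Bg> .";
        System.out.println(stripPrefix(line));
    }
}
```

In the snippet above, this would replace the final line with `BlankNodePrefixStripper.stripPrefix(out.toString())`. It is still post-processing, so it does not remove the need for a proper writer option.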

Are you interested in contributing a solution yourself?

Perhaps?

@TheMessik TheMessik added the enhancement Incrementally add new feature label Jun 21, 2024
@afs
Member

afs commented Jun 22, 2024

Hi @TheMessik,

Blank node labels for data read by a parser are large random numbers, so I'm assuming you are controlling the RDF production and setting the blank node labels yourself.

The RDFWriter builder doesn't currently provide a way to set the NodeFormatter. It would be good to add this.

If you want to read such data in, and preserve the label (with care!), then use RDFParser.create().labelToNode(labelToNode) with LabelToNode.createUseLabelAsGiven(). Your code is responsible for blank node label uniqueness and the rules about what happens on graph merge and reading files multiple times.

For writing: NodeFormatter is the interface for controlling the RDF term output.

Extending RDFWriterBuilder means that the low-level per-format interfaces, WriterGraphRIOT and WriterDatasetGraphRIOT, will also need changing.

There are several kinds of writer for the N-Triples/Turtle family of syntaxes (streamed, flat, batching and collecting), and all of them use a NodeFormatter.

At the RDFWriter level, there isn't the "writer profile" abstraction like there is when reading (where there is a node maker FactoryRDF carried by ParserProfile).

N-Quads is the simplest output form. It is streamed and uses WriterStreamRDFPlain.

Below is the code that is used for N-Quads. You could use that, modified at NodeFmtLib.encodeBNodeLabel to just use the label. Be careful - some characters aren't legal in a blank node label string.

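    public static void main(String... args) {
        String input = "_:x <x:p> <x:o> .";
        Graph graph = RDFParser.fromString(input, Lang.NT).toGraph();
        AWriter out = IO.wrapUTF8(System.out);
        // Custom formatter: only the blank node output is overridden.
        NodeFormatter fmt = new NodeFormatterNT() {
            @Override
            public void formatBNode(AWriter w, String label) {
                w.print("_:");
                // This is the line to change: print the label directly
                // instead of the encoded form to avoid the "B" prefix.
                String lab = NodeFmtLib.encodeBNodeLabel(label);
                w.print(lab);
            }
        };
        // N-Quads/N-Triples output is streamed through WriterStreamRDFPlain.
        StreamRDF stream = new WriterStreamRDFPlain(out, fmt) ;
        StreamRDFOps.graphToStream(graph, stream);
    }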
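
On the caution about illegal characters: if the label is printed as given, it may be worth checking it first. The following is a minimal sketch in plain Java, not Jena code. The class and method names are hypothetical, and the pattern is a deliberately conservative ASCII-only subset of the blank node label grammar (the actual grammar also permits many non-ASCII characters), so it can reject labels that are in fact legal.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jena: a conservative check that a label
// can be written after "_:" without any encoding.
final class BlankNodeLabelCheck {

    // First character: ASCII letter, digit or underscore.
    // Middle characters: the same plus '-' and '.'.
    // Last character: anything allowed in the middle except '.'.
    private static final Pattern SAFE_LABEL =
            Pattern.compile("[A-Za-z0-9_]([A-Za-z0-9_.\\-]*[A-Za-z0-9_\\-])?");

    static boolean isSafeLabel(String label) {
        return label != null && SAFE_LABEL.matcher(label).matches();
    }

    public static void main(String[] args) {
        for (String label : new String[] {"students", "a.b", "a.", "-a", "has space"}) {
            System.out.println(label + " -> " + isSafeLabel(label));
        }
    }
}
```

A formatter that writes raw labels could fall back to the encoded form whenever this check fails, which keeps the output parseable for arbitrary labels.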

Hope that helps
