Generative AI and Linked Data

A detailed record of my research into the interplay of LD and ML. I am mostly interested in maintaining ML provenance, from requirements to phase-out and archival, with a focus on generative AI models. The general focus of the research:

  1. Getting generative models to spit out valid triples and maybe DL axioms
  2. End-to-end ML workflows in an operational context
  3. GPT alternatives, including ones that can run locally and can be fine-tuned more explicitly
  4. Prompting techniques... we already know a lot about this stuff but there are interesting fragments in the literature to consider

This repo reflects that research, organizing it as appropriate.

All linked papers saved locally as PDFs in this repo :)

Table of Contents

  1. Survey
  2. LLMs and LD Generation
  3. ML Workflows
    1. ML Workflow Papers
    2. MLOps Systems/Platforms
    3. Ontologies
    4. ML Model Provenance
  4. GPT Alternatives
    1. A List of Large Models
    2. Local GPTs
    3. Hosting Services
    4. Evaluating Models
    5. Jailbroken GPT
  5. Prompting Techniques
    1. Prompting Technique Papers

Survey

Tim (@A-J-S97) has included me on a paper of his that surveys exactly this topic: the interplay between Linked Data constructs and LLMs. He extracted five key sub-topics of research:

  • Knowledge Graph Generation
  • Knowledge Graph Completion
  • Knowledge Graph Enrichment
  • Ontology Alignment
  • Language Model Probing

The paper is in the publication pipeline and currently stuck on Teams. Any paper cited here is outside the scope of Tim's survey, or else it was missed there.

LLMs and LD Generation

Fine-tuning

One thought I had was fine-tuning GPT on LD instance data and DL axioms to get a model really good at spitting out triples. This should be possible, but it would require collecting prompt-response pairs (which could themselves be generated with GPT) to then fine-tune on. At the end of the day it would still take manual effort, which is why most work in the literature just uses GPT with a few examples in a well-crafted prompt to get valid RDF and OWL output. A rough sketch of what such fine-tuning data could look like is below.
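
As an illustration only, here is a small script that writes prompt/completion pairs to JSONL in the older OpenAI fine-tuning style (newer chat fine-tuning uses a messages list instead, but the idea is the same). The sentences, triples, and file name are made up, not data from this repo:

```python
import json

# Hypothetical prompt/completion pairs for teaching a model to emit Turtle triples.
# The sentences and triples below are invented examples.
examples = [
    {
        "prompt": "Extract RDF triples (Turtle) from: 'Ada Lovelace wrote the first algorithm.'",
        "completion": "ex:Ada_Lovelace ex:wrote ex:First_Algorithm .",
    },
    {
        "prompt": "Extract RDF triples (Turtle) from: 'Alan Turing was born in London.'",
        "completion": "ex:Alan_Turing ex:bornIn ex:London .",
    },
]

# Fine-tuning services typically expect one JSON object per line (JSONL).
with open("ld_finetune.jsonl", "w", encoding="utf-8") as f:
    for record in examples:
        f.write(json.dumps(record) + "\n")
```

The hard part, as noted above, is producing enough of these pairs with correct triples; generating them with GPT still means checking them by hand.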

ML Workflows

By "ML Workflows", I mean all associated nomenclature, e.g.:

  • ML Operations (MLOps)
  • ML Engineering
  • AutoML
  • Etc.

MLOps Systems/Platforms

What platforms exist for managing ML in an operational context? I mean not just training and storing models, but tracking their creation, phasing them out, archiving them, etc.: the entire lifecycle. Everyone simply assumes this is an ad-hoc or totally proprietary process, but that is incorrect. Below are several tools for MLOps, including the important aspect of provenance.

ML Workflow Papers

Ontologies

ML Model Provenance

It would be interesting to see whether it is possible to embed RDF data, or at least IRIs, into a model file so that it carries its own provenance. That way, models transferred between workers would retain their provenance.

Techniques

  1. Send a zip containing the model and its provenance (e.g., its relevant provenance graph)
  2. Embed the provenance into the model
    1. Normalize across serialization formats by zipping the model and embedding the provenance in the archive metadata; unzipping produces the original model, which can be used without problem (a minimal sketch follows this list)
      • The Python library zipfile lets you add comments to zip files that you can read when decompressing. See this script.
    2. Embed it some other way, depending on the serialization format (e.g., PMML is XML-based, but this probably does not scale to models with billions of parameters)
    3. Embed just an IRI (e.g., in the filename, or in a header/footer of the file) and let that IRI dereference to a web resource containing the full model info (a sketch using ONNX metadata appears after the format list below)
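
A minimal sketch of technique 2.1, using only the standard-library zipfile module; the file names and the provenance graph are illustrative:

```python
import zipfile

# Illustrative provenance graph in Turtle; in practice this would be the
# model's actual PROV-O (or similar) description.
provenance_ttl = """@prefix prov: <http://www.w3.org/ns/prov#> .
<urn:example:model-1> a prov:Entity ;
    prov:wasGeneratedBy <urn:example:training-run-42> .
"""

# Zip the serialized model and attach the provenance as the archive comment.
with zipfile.ZipFile("model_with_prov.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("model.h5")  # any serialization format works here; path is illustrative
    zf.comment = provenance_ttl.encode("utf-8")

# On the receiving end: read the comment, then extract the model unchanged.
with zipfile.ZipFile("model_with_prov.zip") as zf:
    print(zf.comment.decode("utf-8"))
    zf.extractall("unpacked/")
```

Because the provenance lives in the archive comment rather than in the model bytes, the unzipped model is byte-for-byte identical to the original and loads as before.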

Model Serialization Formats

  • Hierarchical Data Format 5 (HDF5) - binary
  • Open Neural Network Exchange (ONNX) - binary
  • Predictive Model Markup Language (PMML) - XML
  • Pickle - binary
  • SavedModel - folder structure
  • Model checkpoints - binary
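
Some of these formats can carry metadata natively, which makes technique 3 even simpler. ONNX, for example, exposes a metadata_props key-value store on the model protobuf. A sketch, assuming the onnx Python package and an existing model file (the paths and IRI are made up):

```python
import onnx

# Load an existing ONNX model (path is illustrative).
model = onnx.load("model.onnx")

# Attach a dereferenceable IRI pointing at the full provenance graph.
prov = model.metadata_props.add()
prov.key = "provenance_iri"
prov.value = "https://example.org/provenance/model-1"

onnx.save(model, "model_with_prov.onnx")

# Later, any consumer can read the IRI back and dereference it for full details.
for entry in onnx.load("model_with_prov.onnx").metadata_props:
    print(entry.key, "=", entry.value)
```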

GPT Alternatives

In a bit of Orwellian doublespeak, "OpenAI" is entirely closed-source. Directly quoting from the 100-page GPT-4 technical report:

"Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar."

So, this section pertains to any "alternatives" to GPT, including ways of getting around its limitations. This matters because a lot of military data is classified and cannot even be discussed on online platforms like ChatGPT.

A List of Large Models

There are tons of large transformer models, e.g., BERT. All of them are potential alternatives to GPT, but (disclaimer!) they are inferior in almost every circumstance. OpenAI has some secret sauce that simply places its models leagues above the rest.

  • Awesome Huge Models - The best resource on all of them (including GPTs, LLaMa, PaLM, BLOOM, etc.); I contributed some to it and it is a one-stop shop

Local GPTs

Hosting Services

Evaluating Models

Jailbroken GPT

It is possible to prompt GPT so heavily with instructional input that it can be "persuaded" to evade some of OpenAI's restrictions (e.g., ethical ones).

Prompting Techniques

There is a definitive guide on prompt engineering, with links to papers and outside resources, maintained as a GitHub repo.
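
To tie prompting back to LD generation, here is a sketch of the few-shot approach mentioned in the fine-tuning section, assuming the openai Python package (v1-style client); the example sentences, triples, and model name are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Few-shot prompt: a couple of worked examples steer the model toward valid Turtle.
few_shot = """Convert each sentence into RDF triples in Turtle, using the ex: prefix.

Sentence: Ada Lovelace wrote the first algorithm.
Triples: ex:Ada_Lovelace ex:wrote ex:First_Algorithm .

Sentence: Alan Turing was born in London.
Triples: ex:Alan_Turing ex:bornIn ex:London .

Sentence: Grace Hopper developed the first compiler.
Triples:"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": few_shot}],
)
print(response.choices[0].message.content)
```

In practice the output should still be validated (e.g., parsed with an RDF library) before being loaded into a graph, since nothing in the prompt guarantees syntactically valid Turtle.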

Prompting Technique Papers
