Convert from Prodigy’s JSONL format for labeled NER to spaCy’s training format?

I am new to Prodigy and spaCy, as well as to working on the command line. I'd like to use Prodigy to label my data for an NER model, and then use spaCy in Python to train the models.

Prodigy stores its annotations in an SQLite database and exports them as JSONL. spaCy expects a different training format, and I am not sure what it is called:

TRAIN_DATA = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("Do they bite?", {"entities": []}),
    (
        "horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
    ("horses?", {"entities": [(0, 6, LABEL)]}),
]

How can I convert from one to the other? It seems like this should be easy, but I cannot find it anywhere.

I have no problem loading in the dataset, just converting.

Answer

As of version 1.9, Prodigy can export this training format directly with the data-to-spacy recipe: https://prodi.gy/docs/recipes#data-to-spacy
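If you would rather convert the exported JSONL yourself, the mapping is small enough to write by hand. The following is only a minimal sketch, not Prodigy's own implementation. It assumes each JSONL record has a "text" field, a "spans" list whose items carry "start", "end" and "label", and an "answer" field, which is what Prodigy's NER recipes normally produce. The function name prodigy_to_spacy and the sample records are made up for illustration.

```python
import json


def prodigy_to_spacy(lines):
    """Convert Prodigy-style JSONL lines into (text, {"entities": [...]}) tuples.

    Hypothetical helper: assumes each record has "text", an optional "spans"
    list with "start"/"end"/"label" keys, and an optional "answer" field.
    """
    train_data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        # Keep only accepted annotations; rejected or ignored ones are skipped.
        if record.get("answer", "accept") != "accept":
            continue
        entities = [
            (span["start"], span["end"], span["label"])
            for span in record.get("spans", [])
        ]
        train_data.append((record["text"], {"entities": entities}))
    return train_data


# Made-up sample records, shaped like a Prodigy NER export.
sample = [
    '{"text": "horses?", "spans": [{"start": 0, "end": 6, "label": "ANIMAL"}], "answer": "accept"}',
    '{"text": "Do they bite?", "spans": [], "answer": "accept"}',
    '{"text": "ignored example", "spans": [], "answer": "reject"}',
]

TRAIN_DATA = prodigy_to_spacy(sample)
print(TRAIN_DATA)
```

Since an open file yields one line at a time, you can pass the file object straight in, for example prodigy_to_spacy(open("annotations.jsonl", encoding="utf8")), where annotations.jsonl stands in for whatever file you exported.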

User contributions licensed under: CC BY-SA