KoBERTSeg for Topic Segmentation

DSBA-Lab/kobertseg

 
 

KoBERTSEG: Local Context Based Topic Segmentation Using KoBERT

This is the official code for "KoBERTSEG: Local Context Based Topic Segmentation Using KoBERT" (JKIIE, 2022).
The code for building the backbone BERT architecture is adapted from PreSumm.

Highlights

KoBERTSEG is a KoBERT-based model for topic segmentation. Its objective is to split a document into sub-documents, each covering a single topic.

Our main contributions are as follows:

  • KoBERTSEG adopts BERTSUM, a structure originally used for document summarization. BERTSUM embeds a per-sentence representation for a given input document, so the model can compute relationships between sentences, which is critical for topic segmentation.

  • Because KoBERTSEG can model long input sequences, it produces fewer false-positive boundaries than the other methods compared (precision of 95.69 vs. 90.02 for BTS in the evaluation below). This matters for downstream tasks such as document summarization and document classification.

  • Beyond introducing this structure for topic segmentation, we introduce a novel way of repurposing a large Korean dataset, originally constructed for summarization, for topic segmentation. Training on a dataset built this way yields a model that generalizes across news articles covering a wide range of topics.
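This README does not spell out how the local context around a candidate boundary is built, so the following is only a minimal sketch of one plausible reading of "window": for every gap between two consecutive sentences, collect up to `window` sentences on each side. The function name `local_context_windows` and its signature are hypothetical and are not part of this repository.

```python
from typing import List, Tuple


def local_context_windows(
    sentences: List[str], window: int = 3
) -> List[Tuple[List[str], List[str]]]:
    """Return (left, right) sentence contexts for every gap between sentences.

    ASSUMPTION: this illustrates a generic window-based local context.
    It is not taken from the KoBERTSEG code base.
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    contexts = []
    # Gap g sits between sentences[g - 1] and sentences[g].
    for gap in range(1, len(sentences)):
        left = sentences[max(0, gap - window):gap]   # up to `window` sentences before the gap
        right = sentences[gap:gap + window]          # up to `window` sentences after the gap
        contexts.append((left, right))
    return contexts
```

A document with n sentences yields n - 1 candidate boundaries, and contexts near the start or end of the document are simply shorter than `window`.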

Evaluation

| Model | Precision | Recall | F-1 Score | $p_k$ | WindowDiff |
|---|---|---|---|---|---|
| Random | 6.40 | 6.30 | 6.34 | 0.5077 | 0.4877 |
| Even | 7.76 | 7.75 | 7.75 | 0.4118 | 0.4118 |
| LSTM (Koshorek et al., 2018) | 67.66 | 89.80 | 76.25 | 0.1526 | 0.1531 |
| BTS (Jeon et al., 2019) | 90.02 | 98.90 | 93.65 | 0.0355 | 0.0457 |
| KoBERTSEG (window=3) | 95.69 | 98.50 | 96.79 | 0.0193 | 0.0233 |
| KoBERTSEG-SUM (window=3) | 97.38 | 98.78 | 97.86 | 0.0136 | 0.0157 |
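For readers unfamiliar with the last two columns, here is a minimal sketch of the standard $p_k$ and WindowDiff definitions (lower is better for both). It is not the evaluation script of this repository. As an assumption of the sketch, a segmentation is encoded as a list of 0/1 flags, one per gap between consecutive sentences, and the window size `k` defaults to half the average reference segment length. The function names are illustrative.

```python
from typing import Optional, Sequence


def _default_k(reference: Sequence[int]) -> int:
    # Half of the average reference segment length, at least 1.
    n_sentences = len(reference) + 1
    n_segments = sum(reference) + 1
    return max(1, round(n_sentences / n_segments / 2))


def _check(reference: Sequence[int], hypothesis: Sequence[int], k: Optional[int]) -> int:
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    k = k if k is not None else _default_k(reference)
    if k < 1 or k > len(reference):
        raise ValueError("k must be between 1 and the number of gaps")
    return k


def pk(reference: Sequence[int], hypothesis: Sequence[int], k: Optional[int] = None) -> float:
    """Fraction of windows where the two segmentations disagree on whether
    the sentences at both ends of the window belong to the same segment."""
    k = _check(reference, hypothesis, k)
    n_windows = len(reference) + 1 - k
    errors = 0
    for i in range(n_windows):
        ref_same = sum(reference[i:i + k]) == 0   # no boundary inside the window
        hyp_same = sum(hypothesis[i:i + k]) == 0
        errors += ref_same != hyp_same
    return errors / n_windows


def window_diff(reference: Sequence[int], hypothesis: Sequence[int], k: Optional[int] = None) -> float:
    """Fraction of windows where the number of boundaries differs."""
    k = _check(reference, hypothesis, k)
    n_windows = len(reference) + 1 - k
    errors = 0
    for i in range(n_windows):
        errors += sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
    return errors / n_windows
```

Unlike precision and recall, both metrics give partial credit to a predicted boundary that lands close to a true one, because a near miss affects only a few of the sliding windows.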

Usage

Requirements

Linux, Python>=3.8, PyTorch>=1.7.1

We recommend using Anaconda to create a conda environment:

conda create -n kobertseg python=3.8 pip
conda activate kobertseg
conda install pytorch=1.7.1 cudatoolkit=9.2 -c pytorch

Dataset (bfly)

You can download the bfly-soft article dataset here for training and evaluation. Unzip it and put the files (train.jsonl, dev.jsonl, test.jsonl) under the directory dataset/bfly.
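As a quick sanity check after unzipping, the three splits can be read with the standard library alone. The sketch below only assumes that each file is in JSON Lines format (one JSON object per line), as the .jsonl extension suggests. It makes no assumption about the field names inside each record, and the helper names `read_jsonl` and `load_splits` are illustrative.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List, Union


def read_jsonl(path: Union[str, Path]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_splits(root: Union[str, Path] = "dataset/bfly") -> Dict[str, List[dict]]:
    """Load train/dev/test into memory, keyed by split name."""
    root = Path(root)
    return {
        split: list(read_jsonl(root / f"{split}.jsonl"))
        for split in ("train", "dev", "test")
    }


if __name__ == "__main__":
    splits = load_splits()
    for name, records in splits.items():
        print(name, len(records))
```

For a large training file, iterating over `read_jsonl` directly avoids holding the whole split in memory.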

Simple Inference

With a trained model at models/best_model.pth and a demo text file at ./simple_inference/demo.story, run simple inference with the command below:

python simple_inference.py --test_from models/best_model.pth

This command writes inference_result.txt to the directory ./simple_inference. The file contains a logit score for a topic boundary between each pair of consecutive sentences.
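To turn such scores into actual segments, one option is to pass each logit through a sigmoid and cut wherever the resulting probability reaches a threshold. The sketch below assumes the logits have already been parsed into a list with one value per gap between consecutive sentences; the layout of inference_result.txt is not documented here, so parsing is left out. The function names and the 0.5 threshold are illustrative choices, not settings from this repository.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def split_by_logits(
    sentences: Sequence[str],
    logits: Sequence[float],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Group sentences into segments.

    logits[i] is ASSUMED to score a boundary between sentences[i]
    and sentences[i + 1].
    """
    if not sentences:
        return []
    if len(logits) != len(sentences) - 1:
        raise ValueError("expected one logit per gap between sentences")

    segments = [[sentences[0]]]
    for sentence, logit in zip(sentences[1:], logits):
        if sigmoid(logit) >= threshold:
            segments.append([sentence])     # boundary: start a new segment
        else:
            segments[-1].append(sentence)   # no boundary: extend the current one
    return segments
```

Raising the threshold trades recall for precision, which may be preferable when false-positive boundaries are costly for a downstream task.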
