Vietnamese NLP tasks

Dependency parsing

Experiments use the benchmark Vietnamese dependency treebank VnDT (10K+ sentences), with 1,020 sentences for testing, 200 sentences for development and the remaining sentences for training. LAS and UAS scores are computed on all tokens (i.e. including punctuation).
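As a rough illustration of the scoring convention above, the sketch below computes LAS and UAS over every token, with no punctuation filtering. The function name, the `(head, label)` token representation and the dependency labels in the example are assumptions made for illustration; they are not taken from VnDT or from any of the listed parsers.

```python
from typing import List, Tuple

# Hypothetical token representation: (head index, dependency label).
# Head index 0 stands for the artificial root.
Token = Tuple[int, str]


def attachment_scores(gold: List[List[Token]],
                      pred: List[List[Token]]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent, counting every token, punctuation included."""
    total = uas_hits = las_hits = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences must have the same length")
        for (g_head, g_label), (p_head, p_label) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                uas_hits += 1          # correct head
                if g_label == p_label:
                    las_hits += 1      # correct head and correct label
    if total == 0:
        raise ValueError("no tokens to score")
    return 100.0 * las_hits / total, 100.0 * uas_hits / total


# Toy example with made-up labels: the last token is punctuation and still counts.
gold = [[(2, "sub"), (0, "root"), (2, "dob"), (2, "punct")]]
pred = [[(2, "sub"), (0, "root"), (2, "vmod"), (1, "punct")]]
las, uas = attachment_scores(gold, pred)
print(las, uas)  # 50.0 75.0
```

Excluding punctuation would shrink the denominator and typically change both scores, so numbers computed under the two conventions are not directly comparable.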

VnDT v1.1:

	Model	LAS	UAS	Paper	Code
Predicted POS	PhoBERT-base (2020)	78.77	85.22	PhoBERT: Pre-trained language models for Vietnamese	Official
Predicted POS	PhoBERT-large (2020)	77.85	84.32	PhoBERT: Pre-trained language models for Vietnamese	Official
Predicted POS	Biaffine (2017)	74.99	81.19	Deep Biaffine Attention for Neural Dependency Parsing
Predicted POS	jointWPD (2018)	73.90	80.12	A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing
Predicted POS	jPTDP-v2 (2018)	73.12	79.63	An improved neural network model for joint POS tagging and dependency parsing
Predicted POS	VnCoreNLP (2018)	71.38	77.35	VnCoreNLP: A Vietnamese Natural Language Processing Toolkit	Official

Results on the VnDT v1.1 for Biaffine, jPTDP-v2 and VnCoreNLP are reported in the jointWPD paper "A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing."

VnDT v1.0:

	Model	LAS	UAS	Paper	Code
Predicted POS	VnCoreNLP (2018)	70.23	76.93	VnCoreNLP: A Vietnamese Natural Language Processing Toolkit	Official
Gold POS	VnCoreNLP (2018)	73.39	79.02	VnCoreNLP: A Vietnamese Natural Language Processing Toolkit	Official
Gold POS	BIST BiLSTM graph-based parser (2016)	73.17	79.39	Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations	Official
Gold POS	BIST BiLSTM transition-based parser (2016)	72.53	79.33	Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations	Official
Gold POS	MSTparser (2006)	70.29	76.47	Online large-margin training of dependency parsers
Gold POS	MaltParser (2007)	69.10	74.91	MaltParser: A language-independent system for data-driven dependency parsing

Results for the BIST graph/transition-based parsers, MSTparser and MaltParser are reported in "An empirical study for Vietnamese dependency parsing."

Machine translation

English-Vietnamese translation

The dataset is from the IWSLT 2015 Evaluation Campaign and can also be obtained from https://github.com/tensorflow/nmt.

English-to-Vietnamese

tst2015 is used for test

Model	BLEU	Paper	Code
Stanford (2015)	26.4	Stanford Neural Machine Translation Systems for Spoken Language Domains

tst2013 is used for test

Model	BLEU	Paper	Code
Nguyen and Salazar (2019)	32.8	Transformers without Tears: Improving the Normalization of Self-Attention	Official
Provilkov et al. (2019)	33.27 (uncased)	BPE-Dropout: Simple and Effective Subword Regularization
Xu et al. (2019)	31.4	Understanding and Improving Layer Normalization	Official
CVT (2018)	29.6 (SST)	Semi-Supervised Sequence Modeling with Cross-View Training
ELMo (2018)	29.3 (SST)	Deep contextualized word representations
Transformer (2017)	28.9	Attention is all you need	Link
Kudo (2018)	28.5	Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Google (2017)	26.1	Neural machine translation (seq2seq) tutorial	Official
Stanford (2015)	23.3	Stanford Neural Machine Translation Systems for Spoken Language Domains

The ELMo score is reported in Semi-Supervised Sequence Modeling with Cross-View Training. The Transformer score is available at https://github.com/duyvuleo/Transformer-DyNet.

Vietnamese-to-English

tst2013 is used for test

Model	BLEU	Paper	Code
Provilkov et al. (2019)	32.99 (uncased)	BPE-Dropout: Simple and Effective Subword Regularization
Kudo (2018)	26.31	Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

Named entity recognition

16,861 sentences for training and development from the VLSP 2016 NER shared task:
- 14,861 sentences are used for training.
- 2k sentences are used for development.
Test data: 2,831 test sentences from the VLSP 2016 NER shared task.
NOTE that in the VLSP 2016 NER data, each word representing a full personal name is separated into the syllables that constitute the word. The VLSP 2016 NER data also includes gold POS and chunking tags, as reconfirmed by the VLSP 2016 organizers. This scheme results in an unrealistic scenario for a pipeline evaluation:
- The standard annotation for Vietnamese word segmentation and POS tagging treats each full name as one word token, so all word segmenters have been trained to output a full name as a single word and all POS taggers have been trained to assign a POS label to the entire full name.
- Gold POS and chunking tags are NOT available in a real-world application.
For a realistic scenario, contiguous syllables constituting a full name are merged to form a word. POS/chunking tags--if used--have to be automatically predicted!
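The merging step described above can be sketched as follows. This is a minimal illustration, not the preprocessing script used by any of the listed systems: it assumes BIO-style tags (`B-PER`, `I-PER`), assumes that a `PER` span corresponds exactly to one full name, and assumes syllables are joined with an underscore. The function name `merge_person_names` is hypothetical.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (token text, BIO tag)


def merge_person_names(tokens: List[TaggedToken],
                       joiner: str = "_") -> List[TaggedToken]:
    """Merge contiguous B-PER/I-PER syllables into one word tagged B-PER."""
    merged: List[TaggedToken] = []
    name_parts: List[str] = []

    def flush() -> None:
        # Emit the name collected so far (if any) as a single word token.
        if name_parts:
            merged.append((joiner.join(name_parts), "B-PER"))
            name_parts.clear()

    for text, tag in tokens:
        if tag == "B-PER":
            flush()                    # a new name starts; close any previous one
            name_parts.append(text)
        elif tag == "I-PER" and name_parts:
            name_parts.append(text)    # continue the current name
        else:
            flush()
            merged.append((text, tag))  # all other tokens pass through unchanged
    flush()
    return merged


sentence = [("Ông", "O"), ("Nguyễn", "B-PER"), ("Văn", "I-PER"), ("A", "I-PER"),
            ("đến", "O"), ("Hà_Nội", "B-LOC")]
print(merge_person_names(sentence))
# [('Ông', 'O'), ('Nguyễn_Văn_A', 'B-PER'), ('đến', 'O'), ('Hà_Nội', 'B-LOC')]
```

Gold POS and chunking columns are deliberately left out of the token representation here, since in the realistic setting those tags would be predicted after merging rather than copied from the data.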

Model	F1	Paper	Code	Note
PhoBERT-large (2020)	94.7	PhoBERT: Pre-trained language models for Vietnamese	Official
PhoBERT-base (2020)	93.6	PhoBERT: Pre-trained language models for Vietnamese	Official
VnCoreNLP (2018) [1]	91.30	VnCoreNLP: A Vietnamese Natural Language Processing Toolkit	Official	Used ETNLP embeddings
BiLSTM-CRF + CNN-char (2016) [1]	91.09	End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF	Official / Link	Used ETNLP embeddings
VNER (2019)	89.58	Attentive Neural Network for Named Entity Recognition in Vietnamese
VnCoreNLP (2018)	88.55	VnCoreNLP: A Vietnamese Natural Language Processing Toolkit	Official	Pre-trained embeddings learned from Baomoi corpus
BiLSTM-CRF + CNN-char (2016) [2]	88.28	End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF	Official / Link	Pre-trained embeddings learned from Baomoi corpus
BiLSTM-CRF + LSTM-char (2016) [2]	87.71	Neural Architectures for Named Entity Recognition	Link	Pre-trained embeddings learned from Baomoi corpus
BiLSTM-CRF (2015) [2]	86.48	Bidirectional LSTM-CRF Models for Sequence Tagging	Link	Pre-trained embeddings learned from Baomoi corpus

[1] denotes that scores are reported in "ETNLP: a visual-aided systematic approach to select pre-trained embeddings for a downstream task"
[2] denotes that BiLSTM-CRF-based scores are reported in "VnCoreNLP: A Vietnamese Natural Language Processing Toolkit"

Part-of-speech tagging

27,870 sentences for training and development from the VLSP 2013 POS tagging shared task:
- 27k sentences are used for training.
- 870 sentences are used for development.
Test data: 2,120 test sentences from the VLSP 2013 POS tagging shared task.

Model	Accuracy	Paper	Code
PhoBERT-large (2020)	96.8	PhoBERT: Pre-trained language models for Vietnamese	Official
PhoBERT-base (2020)	96.7	PhoBERT: Pre-trained language models for Vietnamese	Official
jointWPD (2018)	95.97	A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing
VnCoreNLP-VnMarMoT (2017)	95.88	From Word Segmentation to POS Tagging for Vietnamese	Official
jPTDP-v2 (2018)	95.70	An improved neural network model for joint POS tagging and dependency parsing
BiLSTM-CRF + CNN-char (2016)	95.40	End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF	Official / Link
BiLSTM-CRF + LSTM-char (2016)	95.31	Neural Architectures for Named Entity Recognition	Link
BiLSTM-CRF (2015)	95.06	Bidirectional LSTM-CRF Models for Sequence Tagging	Link
RDRPOSTagger (2014)	95.11	RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger	Official

The result for jPTDP-v2 is reported in "A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing."
Results for BiLSTM-CRF-based models and RDRPOSTagger are reported in "From Word Segmentation to POS Tagging for Vietnamese."

Word segmentation

Training & development data: 75k manually word-segmented training sentences from the VLSP 2013 word segmentation shared task.
Test data: 2,120 test sentences from the VLSP 2013 POS tagging shared task.
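The page does not spell out how the segmentation F1 is computed. One common convention, assumed here purely for illustration, is word-level F1 over syllable spans: a predicted word counts as correct only if it covers exactly the same syllables as a gold word. The sketch below also assumes that the syllables of a word are joined with an underscore and that the underscore never occurs as a token of its own; the function names are hypothetical.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str], joiner: str = "_") -> Set[Tuple[int, int]]:
    """Map a segmented sentence to the set of (start, end) syllable spans of its words."""
    spans = set()
    start = 0
    for word in words:
        n_syllables = len(word.split(joiner))
        spans.add((start, start + n_syllables))
        start += n_syllables
    return spans


def segmentation_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Word-level F1 in percent, micro-averaged over all sentences."""
    correct = n_gold = n_pred = 0
    for gold_sent, pred_sent in zip(gold, pred):
        gold_spans, pred_spans = word_spans(gold_sent), word_spans(pred_sent)
        correct += len(gold_spans & pred_spans)  # words with identical boundaries
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / n_pred
    recall = correct / n_gold
    return 100.0 * 2 * precision * recall / (precision + recall)


gold = [["Tôi", "là", "sinh_viên"]]
pred = [["Tôi", "là", "sinh", "viên"]]   # "sinh_viên" wrongly split in two
print(round(segmentation_f1(gold, pred), 2))  # 57.14
```

In the toy example two of the four predicted words match two of the three gold words, giving precision 0.5 and recall 2/3.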

Model	F1	Paper	Code
VnCoreNLP-RDRsegmenter (2018)	97.90	A Fast and Accurate Vietnamese Word Segmenter	Official
UETsegmenter (2016)	97.87	A hybrid approach to Vietnamese word segmentation	Official
jointWPD (2018)	97.81	A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing
vnTokenizer (2008)	97.33	A Hybrid Approach to Word Segmentation of Vietnamese Texts
JVnSegmenter (2006)	97.06	Vietnamese Word Segmentation with CRFs and SVMs: An Investigation
DongDu (2012)	96.90	Ứng dụng phương pháp Pointwise vào bài toán tách từ cho tiếng Việt

Results for vnTokenizer, JVnSegmenter and DongDu are reported in "A hybrid approach to Vietnamese word segmentation."
