[EMNLP 2023] The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

FSoft-AI4Code/TheVault

License: MIT Python 3.8 arXiv The Vault on HuggingFace datasets

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

The Vault Dataset

Data Summary

The Vault is a comprehensive, large-scale, multilingual parallel dataset of high-quality code-text pairs derived from The Stack, the largest permissively licensed source code dataset.

The Vault contains code snippets in 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. The dataset provides multiple code-snippet levels, metadata, and 11 docstring styles for enhanced usability and versatility.

Data Structure

Data Instances