SAR: Learning Cross-Language API Mappings with Little Knowledge, ESEC/FSE 2019

Problem

Cross-language API mapping: providing mappings between similar APIs across languages, e.g. the figure below.

Task

In addition, most previous work focuses on mining mappings from a fixed dataset. Even if the dataset is large, it does not reflect the fast evolution of software as new APIs are added.

Prominent previous work: MAM, StaMiner, Api2Api, DeepAM. Among them, Api2Api and DeepAM are neural-based methods that have shown advantages over MAM and StaMiner, so we focus on these two.

  • Api2Api: computes a translation matrix W from a large number of seed pairs (a dictionary); the matrix is then used to generalize the mapping between the two vector spaces X and Y, such that Y = WX (i.e., X is translated into Y). See the sketch after this list.

  • DeepAM: learns a joint embedding space that can represent both languages together.
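As a rough illustration of the Api2Api idea (a minimal sketch on made-up data, not the original implementation), a translation matrix can be fit from seed pairs with ordinary least squares:

    import numpy as np

    # Toy setup: 100 seed API pairs with 50-dimensional embeddings (made-up data).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))   # source-language API embeddings (e.g. Java)
    Y = rng.normal(size=(100, 50))   # target-language API embeddings (e.g. C#)

    # Fit W minimizing ||X W - Y||_F (row-vector form of Y = WX in the text).
    W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

    # Use W to map a previously unseen source embedding into the target space.
    x_new = rng.normal(size=(1, 50))
    y_pred = x_new @ W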

Common ground: both find an intermediate layer that connects the two vector spaces, the so-called joint embedding space.

Common drawback: both require some form of parallel data, which is not easy to obtain. They can be seen as supervised learning approaches.

  • Api2Api: requires a large seed dictionary to compute a good translation matrix.
  • DeepAM: requires a large number of <code, text description> pairs to compute the joint embedding.

Is there another way to map the two vector spaces without the need for parallel data? Or is there an unsupervised approach that can adapt the two spaces to each other?

Core idea

Represent the two languages as domain vector spaces X and Y and try to align them with very little supervision. This problem can be generalized as a domain adaptation problem; a more intuitive explanation can be found here.
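To make the domain-adaptation view concrete, below is a minimal, hedged sketch of adversarial alignment on toy data (an illustration, not the SAR code): a linear mapping tries to make mapped source vectors indistinguishable from target vectors, while a discriminator tries to tell them apart.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    dim, n = 50, 2000
    X = torch.randn(n, dim)        # source (e.g. Java) API embeddings, toy data
    Y = torch.randn(n, dim) + 0.5  # target (e.g. C#) API embeddings, toy data

    mapping = nn.Linear(dim, dim, bias=False)   # W: maps X into Y's space
    discriminator = nn.Sequential(
        nn.Linear(dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
    opt_map = torch.optim.SGD(mapping.parameters(), lr=0.1)
    opt_dis = torch.optim.SGD(discriminator.parameters(), lr=0.1)
    bce = nn.BCEWithLogitsLoss()

    for step in range(200):
        xb = X[torch.randint(0, n, (64,))]
        yb = Y[torch.randint(0, n, (64,))]

        # Discriminator step: label mapped-source as 0, real target as 1.
        opt_dis.zero_grad()
        d_loss = bce(discriminator(mapping(xb).detach()), torch.zeros(64, 1)) + \
                 bce(discriminator(yb), torch.ones(64, 1))
        d_loss.backward()
        opt_dis.step()

        # Mapping step: fool the discriminator so mapped source looks like target.
        opt_map.zero_grad()
        m_loss = bce(discriminator(mapping(xb)), torch.ones(64, 1))
        m_loss.backward()
        opt_map.step()

A refinement step (the iterative Procrustes refinement used by the commands below) then tightens the alignment using high-confidence pairs.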

Usage

Dependencies

Available on CPU or GPU, in Python 2 or 3. Faiss is optional for GPU users - though Faiss-GPU will greatly speed up the nearest neighbor search - and highly recommended for CPU users. Faiss can be installed using "conda install faiss-cpu -c pytorch" or "conda install faiss-gpu -c pytorch".

If you use Conda to install PyTorch and the default PyTorch package does not work, please try this command:

conda install pytorch --channel pytorch

For the other libraries, if using Conda, the command is of the form: conda install numpy

If you use pip, this single command installs all of the necessary requirements:

pip install -r requirements.txt

Run the code: adversarial training and refinement (CPU|GPU)

A sample command to learn a mapping using adversarial training and iterative Procrustes refinement:

python3 unsupervised.py --src_lang java --tgt_lang cs --src_emb data/java_vectors.txt --tgt_emb data/cs_vectors.txt --n_refinement 2 --emb_dim 50 --max_vocab 300000 --epoch_size 100000 --n_epochs 1 --identical_dict_path "dict/candidates_dict.txt" --dico_eval "eval/java-cs.txt"
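Note on the input files (an assumption based on common word-vector tooling; verify against your own data): data/java_vectors.txt and data/cs_vectors.txt are expected to be plain-text word-vector files, i.e. a "count dimension" header followed by one API token and its vector per line, for example (tokens and numbers made up):

    300000 50
    java.util.ArrayList.add 0.0421 -0.1830 0.0957 ... (50 values)
    java.util.HashMap.get 0.2214 0.0137 -0.1190 ... (50 values)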

Evaluate cross-lingual embeddings (CPU|GPU)

python3 evaluate.py --src_lang java --tgt_lang cs --src_emb dumped/debug/id/vectors-java.txt --tgt_emb dumped/debug/id/vectors-cs.txt --dico_eval "eval/java-cs.txt" --max_vocab 200000

Output

After the learning step, the output can be found in the directory "dumped/debug/id", where "id" is a random UUID generated each time the script "unsupervised.py" is run to train the model. Inside this directory there will be two files, vectors-java.txt and vectors-cs.txt, which represent the two newly aligned vector spaces.
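To turn the aligned spaces into concrete API mappings, nearest neighbors across the two files can be inspected. The sketch below is not part of the repository; it assumes the plain-text vector format mentioned above (replace "id" with the actual run ID) and uses cosine similarity with plain NumPy (Faiss would do the same search much faster):

    import numpy as np

    def load_vectors(path):
        # Read 'token v1 ... vd' lines, skipping a 'count dim' header if present.
        tokens, vecs = [], []
        with open(path, encoding='utf-8') as f:
            for i, line in enumerate(f):
                parts = line.rstrip().split(' ')
                if i == 0 and len(parts) == 2:
                    continue
                tokens.append(parts[0])
                vecs.append(np.asarray(parts[1:], dtype=np.float32))
        return tokens, np.vstack(vecs)

    java_tokens, java_vecs = load_vectors('dumped/debug/id/vectors-java.txt')
    cs_tokens, cs_vecs = load_vectors('dumped/debug/id/vectors-cs.txt')

    # Normalize so that a dot product equals cosine similarity.
    java_vecs /= np.linalg.norm(java_vecs, axis=1, keepdims=True)
    cs_vecs /= np.linalg.norm(cs_vecs, axis=1, keepdims=True)

    # For the first few Java APIs, print the most similar C# API in the aligned space.
    scores = java_vecs[:5] @ cs_vecs.T
    for i, j in enumerate(scores.argmax(axis=1)):
        print(java_tokens[i], '->', cs_tokens[j], float(scores[i, j]))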

Some explanations and tips:

  • src_emb: the source vector space
  • tgt_emb: the target vector space
  • n_epochs: number of epochs, usually up to 5 is good enough.
  • epoch_size: number of training iterations in one epoch over the data; for a large vocabulary (e.g. 100,000 words), it should be around 500,000-1,000,000. The current default is 100,000.
  • n_refinement: number of refinement steps; the result usually converges after 2 iterations if the initial result is already good.
  • emb_dim: size of the input embeddings; the default is 50, which is also the recommended size for good performance.
  • identical_dict_path: path to the synthetic dictionary. Since class and method names are used to induce a synthetic dictionary for the refinement, it should be precomputed and stored somewhere first; otherwise the computation will be slow if the two input embeddings are large. If no synthetic dictionary exists yet, the program generates it on the first run, which may take some time depending on the size of the embeddings. See the sketch at the end of this section.
  • dico_eval: path to the evaluation dictionary
  • If the discriminator loss reaches 0.35, it is a good sign that the model has converged; further training may not help much.
  • After the training step, a new folder with a unique ID is created under "dumped/debug" each time the script runs; the new embeddings are written there.
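For reference, here is a hypothetical sketch of precomputing the synthetic dictionary mentioned above (the pairing rule and the "source_token target_token" line format are assumptions, not the repository's exact logic): APIs that share an identical simple class or method name in both vocabularies are written out as candidate pairs.

    def simple_name(token):
        # Last dot-separated component, lowercased, e.g. 'java.util.ArrayList.add' -> 'add'.
        return token.rsplit('.', 1)[-1].lower()

    def build_identical_dict(src_tokens, tgt_tokens, out_path):
        tgt_by_name = {}
        for t in tgt_tokens:
            tgt_by_name.setdefault(simple_name(t), []).append(t)
        with open(out_path, 'w', encoding='utf-8') as out:
            for s in src_tokens:
                for t in tgt_by_name.get(simple_name(s), []):
                    out.write(s + ' ' + t + '\n')

    # Toy usage with made-up tokens; real vocabularies come from the embedding files.
    build_identical_dict(['java.util.ArrayList.add', 'java.util.HashMap.get'],
                         ['System.Collections.ArrayList.Add', 'System.Collections.Hashtable.get'],
                         'dict/candidates_dict.txt')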