CGCL-codes/awesome-code-intelligence

Awesome Deep Learning for Code Intelligence

This document presents a curated collection of research papers, datasets, and tools that apply machine learning techniques to code intelligence.

Code intelligence involves the application of machine learning techniques to extract knowledge from large-scale code repositories, with the aim of developing intelligent tools to improve the quality and productivity of computer programming.

The list includes the publication year for each paper (or the submission year for pre-prints and arXiv articles), the name of the first author, and the publication venue. Additionally, if the code associated with the research is available, it is linked via a corresponding hyperlink.

Related Survey

Year Title Author Venue Code
2018 A Survey of Machine Learning for Big Code and Naturalness Allamanis et al. CSUR Code
2021 A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research Watson et al. TOSEM Code
2020 Synergy between Machine/Deep Learning and Software Engineering: How Far Are We? Wang et al. arXiv Code
2020 A Survey on Deep Learning for Software Engineering Yang et al. CSUR Code
2020 Deep Learning & Software Engineering: State of Research and Future Directions Devanbu et al. arXiv Code
2021 CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation Lu et al. arXiv Code

Code Representation

To represent source code, we first need to determine what to represent. Various works have proposed extracting code features from multiple perspectives, including code tokens, intermediate representations, abstract syntax trees, and many kinds of flow graphs.

Code Tokens

Code tokens, which shape the textual appearance of source code, are composed of function names, keywords, and various variable identifiers. These tokens are a simple yet effective way to represent the semantics of programs. Most approaches to processing code break the program down into a sequence of tokens based on specific delimiters, such as spaces or the capitalization patterns in identifiers (for identifiers like SortList and intArray).
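The delimiter-based splitting described above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer of any listed paper: the function names `split_identifier` and `tokenize` and the regular expressions are assumptions chosen for this example.

```python
import re

def split_identifier(identifier: str) -> list[str]:
    """Split an identifier on underscores and capitalization boundaries."""
    subtokens = []
    for part in identifier.split("_"):
        # An acronym run (HTTP), a capitalized or lowercase word (Sort, int), or digits.
        subtokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [t.lower() for t in subtokens]

def tokenize(code: str) -> list[str]:
    """Break code into sub-tokens of identifiers, numbers, and single punctuation marks."""
    tokens = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", code):
        if re.match(r"[A-Za-z_]", tok):
            tokens += split_identifier(tok)   # identifier or keyword
        else:
            tokens.append(tok)                # number or punctuation
    return tokens

print(tokenize("intArray = SortList(intArray)"))
```

Splitting identifiers into sub-tokens keeps the vocabulary smaller, because `SortList` and `sort_list` both reduce to the same two words.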

Year Title Author Venue Code
2017 Synthesizing benchmarks for predictive modeling Cummins et al. CGO Code
2015 Toward deep learning software repositories White et al. ICSE Code
2016 Summarizing source code using a neural attention model Iyer et al. ACL Code
2016 A convolutional attention network for extreme summarization of source code Allamanis et al. ICML Code
2019 Open Vocabulary Learning on Source Code with a Graph-Structured Cache Cvitkovic et al. ICML Code
2021 A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code Chirkova et al. NAACL Code
2020 Learning and Evaluating Contextual Embedding of Source Code Kanade et al. ICML Code
2020 Codebert: A pre-trained model for programming and natural languages Feng et al. EMNLP Code
2020 Big code != big vocabulary: Open-vocabulary models for source code Karampatsis et al. ICSE Code

API

There have been multiple methods proposed to analyze the API sequences in programs. One line of work is about mining API usage patterns from a large code corpus to demonstrate how to use an API. Another line of work is API recommendation, which aims to recommend or generate a sequence of APIs for users.
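As a rough illustration of mining API usage patterns, the sketch below extracts the sequence of called names from Python snippets and counts which consecutive pairs occur most often across a small corpus. It is a simplified frequency count under stated assumptions, not the method of any listed paper: the helper names `api_sequence` and `frequent_pairs` are hypothetical, calls are ordered by source position rather than execution order, and receivers are ignored.

```python
import ast
from collections import Counter

def api_sequence(source: str) -> list[str]:
    """Return the names of called functions and methods in source order."""
    calls = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Call)]
    calls.sort(key=lambda n: (n.lineno, n.col_offset))
    names = []
    for call in calls:
        func = call.func
        if isinstance(func, ast.Attribute):   # e.g. f.read() -> "read"
            names.append(func.attr)
        elif isinstance(func, ast.Name):      # e.g. open(p) -> "open"
            names.append(func.id)
    return names

def frequent_pairs(snippets: list[str], k: int = 1):
    """Count consecutive call pairs over all snippets and return the k most common."""
    counts = Counter()
    for source in snippets:
        seq = api_sequence(source)
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(k)

corpus = [
    "f = open(p)\ndata = f.read()\nf.close()\n",
    "h = open(q)\ntext = h.read()\nprint(text)\nh.close()\n",
]
print(frequent_pairs(corpus))
```

A pair that recurs across many snippets, such as `open` followed by `read`, is a candidate usage pattern that could be shown to a developer or used as a recommendation.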

Year Title Author Venue Code
2015 How can I use this method? Moreno et al. ICSE Code
2017 An unsupervised approach for discovering relevant tutorial fragments for APIs Jiang et al. ICSE Code
2017 DeepAM: Migrate APIs with Multi-Modal Sequence to Sequence Learning Gu et al. IJCAI Code
2016 Deep API learning Gu et al. FSE Code
2017 Exploring API Embedding for API Usages and Applications Nguyen et al. ICSE Code
2019 SAR: learning cross-language API mappings with little knowledge Bui et al. FSE Code

AST

The Abstract Syntax Tree (AST) is a tree-structured intermediate representation of code that describes the syntactic structure of a program. In an AST, the leaf nodes typically correspond to the tokens of variables and method names in the source code, while the non-leaf nodes represent the syntactic structure of the code, such as function definitions and branch statements. As a result, ASTs capture both the lexical information (e.g., variable names) and the syntactic structure of the source code. In practice, ASTs can be extracted with several open-source tools, e.g., the tree-sitter parser and LLVM Clang.
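To make the split between non-leaf and leaf nodes concrete, here is a minimal sketch that parses a small function and lists its node types and identifiers. It uses Python's standard `ast` module as a stand-in for the tools named above; the helper names `node_types` and `leaf_identifiers` are assumptions for this example.

```python
import ast

source = """
def absolute(x):
    if x < 0:
        return -x
    return x
"""

tree = ast.parse(source)

def node_types(tree: ast.AST) -> list[str]:
    """Names of all node classes in the tree (syntactic structure)."""
    return [type(node).__name__ for node in ast.walk(tree)]

def leaf_identifiers(tree: ast.AST) -> list[str]:
    """Function, parameter, and variable names (lexical information)."""
    names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            names.append(node.name)
        elif isinstance(node, ast.arg):
            names.append(node.arg)
        elif isinstance(node, ast.Name):
            names.append(node.id)
    return names

print(sorted(set(node_types(tree))))
print(sorted(set(leaf_identifiers(tree))))
```

The first list contains structural nodes such as `FunctionDef`, `If`, and `Return`; the second contains only the identifiers that appear in the source.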

Year Title Author Venue Code
2016 Convolutional neural networks over tree structures for programming language processing Mou et al. AAAI Code
2020 Modeling programs hierarchically with stack-augmented LSTM Liu et al. JSS Code
2019 A novel neural source code representation based on abstract syntax tree Zhang et al. ICSE Code
2018 Deep code comment generation Hu et al. ICPC Code
2019 code2vec: Learning distributed representations of code Alon et al. PLDI Code
2019 code2seq: Generating Sequences from Structured Representations of Code Alon et al. ICLR Code
2020 Structural language models of code Alon et al. ICML Code
2017 A syntactic neural model for general-purpose code generation Yin et al. ACL Code
2018 Tree-to-tree neural networks for program translation Chen et al. ICLR Code

IR

The Intermediate Representation (IR) is a well-formed structure that is independent of programming languages and machine architectures. Compilers use it to accurately represent the source code while translating it into low-level machine code. The IR can express the operations of the target machine. It is natural to enhance code embeddings by utilizing IRs, whose limited vocabulary significantly alleviates the out-of-vocabulary (OOV) issue.
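The vocabulary argument can be illustrated with a small sketch. It uses CPython bytecode, obtained through the standard `dis` module, as a loose stand-in for a compiler IR such as LLVM IR; this substitution is an assumption made only to keep the example self-contained. Two functions with entirely different identifiers compile to the same opcode sequence, so a model trained on opcodes never meets an unseen name.

```python
import dis

def price_after_discount(total_price, discount_rate):
    return total_price * (1 - discount_rate)

def remaining_area(width, ratio):
    return width * (1 - ratio)

def opcode_sequence(func) -> list[str]:
    """Opcode names of a function, with operands (and thus identifiers) dropped."""
    return [instruction.opname for instruction in dis.get_instructions(func)]

# The identifiers differ, but the operation sequences are identical.
print(price_after_discount.__code__.co_varnames)
print(remaining_area.__code__.co_varnames)
print(opcode_sequence(price_after_discount) == opcode_sequence(remaining_area))
```

The exact opcode names depend on the Python version, which is why the sketch compares the two sequences instead of printing a fixed listing.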

Year Title Author Venue Code
2018 Neural code comprehension: A learnable representation of code semantics Ben-Nun et al. NeurIPS Code
2020 IR2Vec: LLVM IR based Scalable Program Embeddings Venkatakeerthy et al. TACO Code
2020 Compiler-based graph representations for deep learning models of code Brauckmann et al. CC Code
2021 ProGraML: Graph-based Deep Learning for Program Optimization and Analysis Cummins et al. ICML Code
2021 How could Neural Networks understand Programs? Peng et al. ICML Code

Code Graphs

Many approaches have been proposed to convert programs into graphs to better represent the rich structural information within them, including the Control-Flow Graph (CFG), Data-Flow Graph (DFG), and Code Property Graph (CPG). The CFG represents the computation and control flow of a program. In this representation, each node represents a basic block and each edge represents a transition of control flow in the program. The DFG is a directed graph that illustrates data relationships among various functions. Each node in the DFG has input and output data ports, and each edge links an output port to an input port on another node.
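A minimal sketch of both graph kinds follows. The CFG is written by hand as an adjacency dictionary for a four-block program. The data-flow part is a simplification: instead of modeling ports, it links each variable use to its most recent definition in straight-line code. The block labels and the helper names `reachable` and `data_flow_edges` are assumptions for this example.

```python
import ast

# CFG for:  if x < 0: y = -x  else: y = x;  return y
# Each key is a basic block; each listed successor is a control-flow edge.
cfg = {
    "entry: x < 0": ["then: y = -x", "else: y = x"],
    "then: y = -x": ["exit: return y"],
    "else: y = x": ["exit: return y"],
    "exit: return y": [],
}

def reachable(graph: dict, start: str) -> set:
    """Blocks reachable from start by following control-flow edges."""
    seen, stack = set(), [start]
    while stack:
        block = stack.pop()
        if block not in seen:
            seen.add(block)
            stack.extend(graph[block])
    return seen

def data_flow_edges(source: str) -> list:
    """(defining line, using line, variable) edges for straight-line assignments."""
    last_def, edges = {}, []
    for stmt in ast.parse(source).body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Uses are resolved before this statement's own definitions are recorded.
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id in last_def:
                edges.append((last_def[n.id], stmt.lineno, n.id))
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_def[n.id] = stmt.lineno
    return edges

print(sorted(reachable(cfg, "entry: x < 0")))
print(sorted(data_flow_edges("a = 1\nb = a + 2\nc = a * b\n")))
```

Graph neural networks for code typically consume edge lists like these, usually with several edge types combined in a single graph.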

Year Title Author Venue Code
2018 Learning to represent programs with graphs Allamanis et al. ICLR Code
2017 Smartpaste: Learning to adapt source code Allamanis et al. arXiv Code
2018 Generative code modeling with graphs Brockschmidt et al. ICLR Code
2020 Flow2Vec: value-flow-based precise code embedding Sui et al. OOPSLA Code
2021 ProGraML: Graph-based Deep Learning for Program Optimization and Analysis Cummins et al. ICML Code
2021 PLUR: A Unifying, Graph-Based View of Program Learning, Understanding, and Repair Chen et al. NeurIPS Code
2017 Intelligent development environment and software knowledge graph Lin et al. JCST Code
2020 Graph4code: A machine interpretable knowledge graph for code Abdelaziz et al. arXiv Code
2020 Exploiting Code Knowledge Graph for Bug Localization via Bi-directional Attention Zhang et al. ICPC Code

Other Features of Code

In addition to the aforementioned features of code that have already been widely explored, there also exist several kinds of features that are used in some specific scenarios.

Year Title Author Venue Code
2018 Code vectors: Understanding programs through embedded abstracted symbolic traces Henkel et al. FSE Code
2019 Learning to Represent Edits Yin et al. ICLR Code
2019 Neural Networks for Modeling Source Code Edits Zhao et al. arXiv Code
2020 Cc2vec: Distributed representations of code changes Hoang et al. ICSE Code
2019 On Learning Meaningful Code Changes via Neural Machine Translation Tufano et al. ICSE Code
2021 Copy that! Editing Sequences by Copying Spans Panthaplackel et al. AAAI Code
2020 A Structural Model for Contextual Code Changes Brody et al. OOPSLA Code
2021 Learning Structural Edits via Incremental Tree Transformations Yao et al. ICLR Code

Hybrid

To leverage multiple code features, several approaches to representing source code in a hybrid fashion have been developed.

Year Title Author Venue Code
2018 Deep code search Gu et al. ICSE Code
2016 Deep learning code fragments for code clone detection White et al. ASE Code
2018 Deepsim: deep learning code functional similarity Zhao et al. FSE Code
2018 Improving automatic source code summarization via deep reinforcement learning Wan et al. ASE Code
2019 Multi-modal attention network learning for semantic source code retrieval Wan et al. ASE Code

Application

Code Classification

Classifying source code into different classes (e.g., different functionalities and programming languages) is important for many tasks such as code categorization, programming language identification, code prediction, and vulnerability detection. Various studies have been conducted to classify code snippets into categories based on their functionalities.

Year Title Author Venue Code
2016 Convolutional neural networks over tree structures for programming language processing Mou et al. AAAI Code
2018 Adapting neural text classification for improved software categorization Leclair et al. ICSME Code
2019 Bilateral dependency neural networks for cross-language algorithm classification Bui et al. SANER Code
2018 SCC: Automatic classification of code snippets Alreshedy et al. SCAM Code
2020 SCC++: predicting the programming language of questions and snippets of Stack Overflow Alrashedy et al. JSS Code

Vulnerability Detection and Bug Finding

Detecting vulnerabilities or bugs in programs is essential for assuring software quality, and it saves considerable effort and time in software development. Although many tools have been developed for vulnerability detection, e.g., Clang Static Analyzer, Coverity, Fortify, Flawfinder, Infer, and SVF, most of them are based on static analysis. Recently, a growing number of works have employed deep learning to discover vulnerabilities.

Year Title Author Venue Code
2016 Automatically Learning Semantic Features for Defect Prediction Wang et al. ICSE Code
2017 Software defect prediction via convolutional neural network Li et al. QRS Code
2018 Automatic feature learning for predicting vulnerable software components Dam et al. TSE Code
2018 Vuldeepecker: A deep learning-based system for vulnerability detection Li et al. NDSS Code
2019 μVulDeePecker: A Deep Learning-Based System for Multiclass Vulnerability Detection Zou et al. TDSC Code
2021 SySeVR: A framework for using deep learning to detect software vulnerabilities Li et al. TDSC Code
2018 Cross-project transfer representation learning for vulnerable function discovery Lin et al. TII Code
2018 Maximal divergence sequential autoencoder for binary software vulnerability detection Le et al. ICLR Code
2019 Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks Zhou et al. NeurIPS Code
2020 Combining graph-based learning with automated data collection for code vulnerability detection Wang et al. TIFS Code
2021 DeepWukong: Statically detecting software vulnerabilities using deep graph neural network Cheng et al. TOSEM Code
2021 Combining Graph Neural Networks with Expert Knowledge for Smart Contract Vulnerability Detection Liu et al. TKDE Code
2021 Vulnerability Detection with Fine-Grained Interpretations Li et al. FSE Code
2021 Interpreting deep learning-based vulnerability detector predictions based on heuristic searching Zou et al. TOSEM Code
2018 Deepbugs: A learning approach to name-based bug detection Pradel et al. OOPSLA Code
2019 Improving bug detection via context-based code representation learning and attention-based neural networks Li et al. OOPSLA Code
2020 Neural Attribution for Semantic Bug-Localization in Student Programs Gupta et al. NeurIPS Code
2021 Fault Localization with Code Coverage Representation Learning Li et al. ICSE Code
2021 Learning to find naming issues with big code and small supervision He et al. PLDI Code

Code Completion

Code completion is a core feature of most modern IDEs. It offers the developers a list of possible code hints based on available information.
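A minimal sketch of how such hints can be ranked: a bigram language model counts which token follows which in a corpus and suggests the most frequent successors. This is the simplest form of a statistical language model, shown only as an illustration; the three-line corpus and the function names `train_bigram` and `complete` are assumptions for this example.

```python
from collections import Counter, defaultdict

def train_bigram(token_sequences):
    """Count, for every token, how often each other token follows it."""
    following = defaultdict(Counter)
    for tokens in token_sequences:
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following

def complete(model, prev_token, k=3):
    """Return up to k candidate next tokens, most frequent first."""
    return [tok for tok, _ in model.get(prev_token, Counter()).most_common(k)]

corpus = [
    ["for", "i", "in", "range", "(", "n", ")", ":"],
    ["for", "x", "in", "items", ":"],
    ["for", "i", "in", "range", "(", "10", ")", ":"],
]
model = train_bigram(corpus)
print(complete(model, "in"))
```

Neural completion models replace the count table with a learned distribution over a much longer context, but the interface stays the same: context in, ranked candidates out.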

Year Title Author Venue Code
2014 Code completion with statistical language models Raychev et al. PLDI Code
2017 Neural code completion Liu et al. ICLR Code
2018 Code completion with neural attention and pointer networks Li et al. IJCAI Code
2016 Learning python code suggestion with a sparse pointer network Bhoopchand et al. arXiv Code
2019 Pythia: Ai-assisted code completion system Svyatkovskiy et al. SIGKDD Code
2021 Code prediction by feeding trees to transformers Kim et al. ICSE Code
2020 Structural language models of code Alon et al. ICML Code
2021 Code completion by modeling flattened abstract syntax trees as graphs Wang et al. AAAI Code
2020 IntelliCode Compose: Code Generation Using Transformer Svyatkovskiy et al. FSE Code
2020 A Self-Attentional Neural Architecture for Code Completion with Multi-Task Learning Liu et al. ICPC Code
2020 Multi-task learning based pre-trained language model for code completion Liu et al. ASE Code
2021 Fast and memory-efficient neural code completion Svyatkovskiy et al. MSR Code
2020 On-the-Fly Adaptation of Source Code Models using Meta-Learning Shrivastava et al. arXiv Code
2019 Generative Code Modeling with Graphs Brockschmidt et al. ICLR Code
2018 A Retrieve-and-Edit Framework for Predicting Structured Outputs Hashimoto et al. NeurIPS Code

Type Inference

Dynamically typed programming languages, like Python and JavaScript, allow rapid prototyping and can dramatically reduce software development time. However, without type information, unexpected run-time errors are prone to occur, which may introduce bugs and produce low-quality code. Current work on type inference, which aims to automatically infer variable types, mainly falls into two categories: static-analysis-based and learning-based approaches.
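The static-analysis side can be sketched with a deliberately tiny rule set: a variable assigned a literal gets the type of that literal, and everything else is left unresolved. This is an illustrative assumption, far simpler than the constraint-based systems listed below, and the function name `infer_literal_types` is hypothetical.

```python
import ast

def infer_literal_types(source: str) -> dict:
    """Infer types only for variables assigned a literal value."""
    types = {}
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Assign) and len(node.targets) == 1):
            continue
        target, value = node.targets[0], node.value
        if not isinstance(target, ast.Name):
            continue
        if isinstance(value, ast.Constant):
            types[target.id] = type(value.value).__name__
        elif isinstance(value, (ast.List, ast.ListComp)):
            types[target.id] = "list"
        elif isinstance(value, (ast.Dict, ast.DictComp)):
            types[target.id] = "dict"
    return types

code = "count = 0\nname = 'ada'\nitems = []\nratio = count / 2\n"
print(infer_literal_types(code))
```

The variable `ratio` stays unresolved because its value is an expression. Gaps like this are where learning-based approaches predict a type from names and context instead of deriving it from rules.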

Year Title Author Venue Code
2018 MaxSMT-based type inference for Python 3 Hassan et al. CAV Code
2004 Faster than C: Static type inference with Starkiller Salib et al. PyCon Proceedings Code
2015 Predicting program properties from big code Raychev et al. Communications of the ACM Code
2016 Python probabilistic type inference with natural language support Xu et al. FSE Code
2018 Deep learning type inference Hellendoorn et al. FSE Code
2019 NL2Type: Inferring JavaScript Function Types from Natural Language Information Malik et al. ICSE Code
2020 Typewriter: Neural type prediction with search-based validation Pradel et al. FSE Code
2020 Lambdanet: Probabilistic type inference using graph neural networks Wei et al. ICLR Code
2020 OptTyper: Probabilistic Type Inference by Optimising Logical and Natural Constraints Pandi et al. arXiv Code
2020 Typilus: neural type hints Allamanis et al. PLDI Code
2021 Type4Py: Deep Similarity Learning-Based Type Inference for Python Mir et al. arXiv Code

Code Search

Code search aims to retrieve a code snippet by a natural-language query (nl-to-code) or a code query (code-to-code). The nl-to-code search refers to searching a codebase for code fragments that have similar semantics to the natural-language query. In contrast to nl-to-code search, the input of code-to-code search is source code rather than a natural-language description. The objective of code-to-code search is to find code snippets in a codebase that are semantically related to the input code.
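As a baseline illustration of retrieval, the sketch below ranks snippets by the cosine similarity between bag-of-words vectors of the query and of each snippet. It matches words lexically rather than semantically, so it is only a starting point compared with the learned embeddings used in the papers below. The two-entry `codebase` and the function names are assumptions for this example. Because the query is just text, the same function accepts either a natural-language query or a code query.

```python
import math
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; camelCase and snake_case names are split apart."""
    return Counter(w.lower() for w in re.findall(r"[A-Za-z][a-z]*", text))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query: str, codebase: dict, k: int = 1) -> list:
    """Return the names of the k snippets most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(codebase, key=lambda name: cosine(q, bag_of_words(codebase[name])), reverse=True)
    return ranked[:k]

codebase = {
    "read_file": "def read_file(path):\n    with open(path) as f:\n        return f.read()",
    "sort_list": "def sort_list(items):\n    return sorted(items)",
}
print(search("read a file from a path", codebase))
print(search("sort the items in a list", codebase))
```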

Year Title Author Venue Code
2015 Codehow: Effective code search based on api understanding and extended boolean model Lv et al. ASE Code
2016 Relationship-aware code search for JavaScript frameworks Li et al. FSE Code
2018 Deep code search Gu et al. ICSE Code
2019 Multi-modal attention network learning for semantic source code retrieval Wan et al. ASE Code
2020 A Multi-Perspective Architecture for Semantic Code Search Haldar et al. ACL Code
2020 OCoR: An Overlapping-Aware Code Retriever Zhu et al. ASE Code
2019 Coacor: Code annotation for code retrieval with reinforcement learning Yao et al. WWW Code
2019 Aroma: Code recommendation via structural code search Luan et al. OOPSLA Code
2020 Deep Graph Matching and Searching for Semantic Code Retrieval Ling et al. TKDD Code
2019 When deep learning met code search Cambronero et al. FSE Code
2018 FaCoY: a code-to-code search engine Kim et al. ICSE Code
2021 Interactive Cross-language Code Retrieval with Auto-Encoders Chen et al. ASE Code

Code Clone Detection

Numerous software engineering activities, including code reuse, vulnerability detection, and code search, rely on detecting similar code snippets (or code clones). There are four main types of code clones. Type-1 code clones are identical except for whitespace, layout, and comments. Type-2 code clones are identical except for variable, type, literal, and function names. Type-3 code clones are almost identical except for a few statements that have been added or removed. Type-4 code clones are heterogeneous code snippets with similar functionality but differing code structures or syntax. Various works have been proposed to handle these different types of code clones.
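The first two clone types can be checked by comparing normalized token streams, as in the following sketch. It uses Python's standard `tokenize` module: Type-1 normalization drops comments and layout differences, and Type-2 normalization additionally replaces identifiers and literals with placeholders. The placeholder strings `ID` and `LIT` and the function names are assumptions for this example. Type-3 and Type-4 clones need similarity measures or learned models and are not covered here.

```python
import io
import keyword
import tokenize

SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
LAYOUT = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}

def normalized_tokens(source: str, abstract_names: bool = False) -> list:
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type in LAYOUT:
            # Keep block structure but ignore the exact amount of whitespace.
            tokens.append(tokenize.tok_name[tok.type])
        elif abstract_names and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")
        elif abstract_names and tok.type in (tokenize.NUMBER, tokenize.STRING):
            tokens.append("LIT")
        else:
            tokens.append(tok.string)
    return tokens

def is_type1_clone(a: str, b: str) -> bool:
    return normalized_tokens(a) == normalized_tokens(b)

def is_type2_clone(a: str, b: str) -> bool:
    return normalized_tokens(a, True) == normalized_tokens(b, True)

original = "def add(x, y):\n    return x + y  # sum\n"
reformatted = "def add(x,y):\n        return x+y\n"
renamed = "def plus(left, right):\n    return left + right\n"

print(is_type1_clone(original, reformatted))
print(is_type1_clone(original, renamed), is_type2_clone(original, renamed))
```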

Year Title Author Venue Code
2002 CCFinder: A multilinguistic token-based code clone detection system for large scale source code Kamiya et al. TSE Code
2008 NICAD: Accurate Detection of Near-Miss Intentional Clones Using Flexible Pretty-Printing and Code Normalization Roy et al. ICPC Code
2007 Deckard: Scalable and accurate tree-based detection of code clones Jiang et al. ICSE Code
2016 Sourcerercc: Scaling code clone detection to big-code Sajnani et al. ICSE Code
2016 Deep learning code fragments for code clone detection White et al. ASE Code
2017 Supervised Deep Features for Software Functional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code Wei et al. IJCAI Code
2018 Deepsim: deep learning code functional similarity Zhao et al. FSE Code
2020 SCDetector: Software Functional Clone Detection Based on Semantic Tokens Analysis Wu et al. ASE Code
2019 A novel neural source code representation based on abstract syntax tree Zhang et al. ICSE Code
2019 Learning-based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection Buch et al. SANER Code
2020 Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree Wang et al. SANER Code
2020 funcGNN: A Graph Neural Network Approach to Program Similarity Nair et al. ESEM Code
2021 Modeling Functional Similarity in Source Code with Graph-Based Siamese Networks Mehrotra et al. TSE Code
2018 Deep Learning Similarities from Different Representations of Source Code Tufano et al. MSR Code

Code Summarization

Inspired by the text generation work in NLP, many approaches have been put forward to systematically generate a description or function name to summarize the semantics of source code.

Year Title Author Venue Code
2010 Supporting program comprehension with source code summarization Haiduc et al. ICSE Code
2013 Autocomment: Mining question and answer sites for automatic comment generation Wong et al. ASE Code
2015 Clocom: Mining existing source code for automatic comment generation Wong et al. SANER Code
2013 Evaluating source code summarization techniques: Replication and expansion Eddy et al. ICPC Code
2013 Natural Language Models for Predicting Programming Comments Movshovitz et al. ACL Code
2016 A convolutional attention network for extreme summarization of source code Allamanis et al. ICML Code
2016 Summarizing source code using a neural attention model Iyer et al. ACL Code
2018 Deep code comment generation Hu et al. ICPC Code
2019 code2seq: Generating Sequences from Structured Representations of Code Alon et al. ICLR Code
2019 Structured neural summarization Fernandes et al. ICLR Code
2020 A transformer-based approach for source code summarization Ahmad et al. ACL Code
2021 SIT: Code Summarization with Structure-Induced Transformer Wu et al. ACL Code
2018 Improving automatic source code summarization via deep reinforcement learning Wan et al. ASE Code
2020 Improved code summarization via a graph neural network Leclair et al. ICPC Code
2021 CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees Shi et al. EMNLP Code
2019 A Neural Model for Generating Natural Language Summaries of Program Subroutines Leclair et al. ICSE Code
2020 Improved Automatic Summarization of Subroutines via Attention to File Context Haque et al. MSR Code
2020 Suggesting Comment Completions for Python using Neural Language Models Ciurumelea et al. SANER Code
2020 Retrieval-based neural source code summarization Zhang et al. ICSE Code
2020 Retrieve and refine: exemplar-based neural comment generation Wei et al. ASE Code
2021 Retrieval-Augmented Generation for Code Summarization via Hybrid GNN Liu et al. ICLR Code
2021 EditSum: A Retrieve-and-Edit Framework for Source Code Summarization Li et al. ASE Code
2018 Summarizing source code with transferred api knowledge Hu et al. IJCAI Code
2019 Code generation as a dual task of code summarization Wei et al. NeurIPS Code
2020 Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning Ye et al. WWW Code
2019 Learning to Spot and Refactor Inconsistent Method Names Liu et al. ICSE Code
2021 Deep Just-In-Time Inconsistency Detection Between Comments and Source Code Panthaplackel et al. AAAI Code
2020 Suggesting Natural Method Names to Check Name Consistencies Nguyen et al. ICSE Code
2020 Learning to Update Natural Language Comments Based on Code Changes Panthaplackel et al. ACL Code
2020 Automating Just-In-Time Comment Updating Liu et al. ASE Code
2021 Automating the removal of obsolete TODO comments Gao et al. FSE Code

Program Translation

Translating programs from a deprecated programming language to a modern one is important for software maintenance. Many neural machine translation-based methods have been proposed for program translation.

Year Title Author Venue Code
2013 Lexical statistical machine translation for language migration Nguyen et al. FSE Code
2015 Using machine translation for converting python 2 to python 3 code Aggarwal et al. Technical Report Code
2015 Divide-and-conquer approach for multi-phase statistical migration for source code Nguyen et al. ASE Code
2018 Tree-to-tree neural networks for program translation Chen et al. ICLR Code
2017 DeepAM: Migrate APIs with Multi-Modal Sequence to Sequence Learning Gu et al. IJCAI Code
2020 Unsupervised translation of programming languages Lachaux et al. NeurIPS Code

Program Synthesis

Program synthesis is the task of generating source code from high-level specifications (e.g., program descriptions or input-output samples). Given natural-language inputs, current approaches resort to generating programs through machine translation.
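For the input-output flavor of specification, a minimal sketch is an enumerative search: try every composition of a few primitive operations, shortest first, and return the first one consistent with all examples. This is a classical brute-force technique, not the neural or translation-based approach of the papers below, and the four-operation `PRIMITIVES` table is a made-up toy language for this example.

```python
from itertools import product

# A toy domain-specific language of unary integer operations (hypothetical).
PRIMITIVES = {
    "double": lambda x: x * 2,
    "increment": lambda x: x + 1,
    "square": lambda x: x * x,
    "negate": lambda x: -x,
}

def run(program, x):
    """Apply the named operations to x from left to right."""
    for name in program:
        x = PRIMITIVES[name](x)
    return x

def synthesize(examples, max_length=3):
    """Return the shortest program matching every (input, output) pair, or None."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, i) == o for i, o in examples):
                return list(program)
    return None

# Specification by example: the target behaves like f(x) = 2 * (x + 1).
print(synthesize([(1, 4), (2, 6), (3, 8)]))
```

The search space grows exponentially with program length, which is the main reason learned models are used to guide or replace the enumeration.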

Year Title Author Venue Code
2006 Learning for semantic parsing with statistical machine translation Wong et al. NAACL Code
2011 Automating string processing in spreadsheets using input-output examples Gulwani et al. POPL Code
2014 Structured Generative Models of Natural Source Code Maddison et al. ICML Code
2015 Language to code: Learning semantic parsers for if-this-then-that recipes Quirk et al. ACL Code
2016 Language to logical form with neural attention Dong et al. ACL Code
2016 Latent attention for if-then program synthesis Liu et al. NIPS Code
2016 Improved semantic parsers for if-then statements Beltagy et al. ACL Code
2016 Latent Predictor Networks for Code Generation Ling et al. ACL Code
2017 A syntactic neural model for general-purpose code generation Yin et al. ACL Code
2017 Abstract Syntax Networks for Code Generation and Semantic Parsing Rabinovich et al. ACL Code
2017 Neural Programming by Example Shu et al. AAAI Code
2017 DeepCoder: Learning to write programs Balog et al. ICLR Code
2017 RobustFill: Neural Program Learning under Noisy I/O Devlin et al. ICML Code
2017 Seq2sql: Generating structured queries from natural language using reinforcement learning Zhong et al. arXiv Code
2018 Mapping Language to Code in Programmatic Context Iyer et al. EMNLP Code
2018 Selecting representative examples for program synthesis Pu et al. ICML Code
2018 NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System Lin et al. LREC Code
2018 An encoder-decoder framework translating natural language to database queries Cai et al. IJCAI Code
2018 Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task Yu et al. EMNLP Code
2018 Syntaxsqlnet: Syntax tree networks for complex and cross-domain text-to-sql task Yu et al. EMNLP Code
2019 Learning to infer program sketches Nye et al. ICML Code
2019 AutoPandas: neural-backed generators for program synthesis Bavishi et al. OOPSLA Code
2019 Sparc: Cross-domain semantic parsing in context Yu et al. ACL Code
2019 CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases Yu et al. EMNLP Code
2019 A Grammar-Based Structural CNN Decoder for Code Generation Sun et al. AAAI Code
2019 Spoc: Search-based pseudocode to code Kulal et al. NeurIPS Code
2020 HISyn: human learning-inspired natural language programming Nan et al. FSE Code
2021 Evaluating large language models trained on code Chen et al. arXiv Code
2022 Competition-Level Code Generation with AlphaCode Li et al. Science Code
2022 CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis Nijkamp et al. arXiv Code
2022 PaLM: Scaling Language Modeling with Pathways Chowdhery et al. arXiv Code
2023 InCoder: A Generative Model for Code Infilling and Synthesis Fried et al. ICLR Code
2022 PanGu-Coder: Program Synthesis with Function-Level Language Modeling Christopoulou et al. arXiv Code
2022 ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages Chai et al. ACL Findings Code
2023 StarCoder: may the source be with you! Li et al. TMLR Code
2023 Code Llama: Open Foundation Models for Code Roziere et al. arXiv Code
2023 CodeT5+: Open Code Large Language Models for Code Understanding and Generation Wang et al. EMNLP Code
2023 CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X Zheng et al. KDD Code

Program Repair

Automatically localizing and repairing bugs in programs can save much manual effort in software development. One line of work learns the patterns of how programmers edit source code, which can be used to check for syntax errors during compilation. Another line of work focuses on repairing programs by generating patches.

Year Title Author Venue Code
2016 Automated Correction for Syntax Errors in Programming Assignments using Recurrent Neural Networks Bhatia et al. arXiv Code
2018 Syntax and Sensibility: Using language models to detect and correct syntax errors Santos et al. SANER Code
2017 DeepFix: Fixing Common C Language Errors by Deep Learning Gupta et al. AAAI Code
2021 SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair Chen et al. TSE Code
2018 Deep Reinforcement Learning for Programming Language Correction Gupta et al. arXiv Code
2019 SampleFix: Learning to Correct Programs by Sampling Diverse Fixes Hajipour et al. arXiv Code
2019 Neural Program Repair by Jointly Learning to Localize and Repair Vasic et al. ICLR Code
2020 Hoppity: Learning graph transformations to detect and fix bugs in programs Dinella et al. ICLR Code
2014 Neural turing machines Graves et al. arXiv Code
2019 DeepDelta: Learning to Repair Compilation Errors Mesbah et al. FSE Code
2020 Learning to Fix Build Errors with Graph2Diff Neural Networks Tarlow et al. ICSE Code
2020 Codit: Code editing with tree-based neural models Chakraborty et al. TSE Code
2021 A Syntax-Guided Edit Decoder for Neural Program Repair Zhu et al. FSE Code
2020 Graph-based, Self-Supervised Program Repair from Diagnostic Feedback Yasunaga et al. ICML Code
2021 TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer Berabi et al. ICML Code
2020 Self-Supervised Bug Detection and Repair Allamanis et al. NeurIPS Code
2021 CURE: Code-Aware Neural Machine Translation for Automatic Program Repair Jiang et al. ICSE Code
2018 An empirical investigation into learning bug-fixing patches in the wild via neural machine translation Tufano et al. ASE Code
2018 Learning to Generate Corrective Patches using Neural Machine Translation Hata et al. arXiv Code
2018 Learning to Repair Software Vulnerabilities with Generative Adversarial Networks Harer et al. NeurIPS Code
2020 Synthesize, execute and debug: Learning to repair for neural program synthesis Gupta et al. NeurIPS Code
2020 DLFix: Context-based Code Transformation Learning for Automated Program Repair Li et al. ICSE Code
2020 Evaluating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair Tian et al. ASE Code
2004 At the end of synthesis: narrowing program candidates Shriver et al. ICSE-NIER Code
2020 Human-in-the-loop automatic program repair Bohme et al. ICST Code
2021 Interactive Patch Filtering as Debugging Aid Liang et al. ICSME Code
2019 Learning to optimize halide with tree search and random programs Adams et al. TOG Code

Code Optimization

Year Title Author Venue Code
2018 Learning to optimize tensor programs Chen et al. NeurIPS Code
2020 FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System Zheng et al. ASPLOS Code
2020 Ansor: Generating high-performance tensor programs for deep learning Zheng et al. OSDI Code
2013 Predictive modeling in a polyhedral optimization space Park et al. IJPP Code

Other Applications

Year Title Author Venue Code
2021 ProGraML: Graph-based Deep Learning for Program Optimization and Analysis Cummins et al. ICML Code
2020 Deep program structure modeling through multi-relational graph-based learning Ye et al. PACT Code
2020 Designing PairBuddy – A Conversational Agent for Pair Programming Robe et al. arXiv Code
2021 On the Evaluation of Commit Message Generation Models: An Experimental Study Tao et al. ICSME Code
2018 Large-scale and language-oblivious code authorship identification Abuhamad et al. CCS Code

Dataset

Year Title Author Venue Code
2019 Codesearchnet challenge: Evaluating the state of semantic code search Husain et al. arXiv Code
2021 CoSQA: 20,000+ Web Queries for Code Search and Question Answering Huang et al. ACL Code
2016 Probabilistic model for code with decision trees Raychev et al. OOPSLA Code
2017 A parallel corpus of Python functions and documentation strings for automated code documentation and code generation Barone et al. IJCNLP Code
2020 PyMT5: multi-mode translation of natural language and Python code with transformers Clement et al. EMNLP Code
2018 Deep code comment generation Hu et al. ICPC Code
2021 Retrieval-Augmented Generation for Code Summarization via Hybrid GNN Liu et al. ICLR Code
2018 Deep learning type inference Hellendoorn et al. FSE Code
2021 CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks Puri et al. arXiv Code
2019 JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation Agashe et al. EMNLP Code
2021 ProGraML: Graph-based Deep Learning for Program Optimization and Analysis Cummins et al. ICML Code
2019 Recommendations for Datasets for Source Code Summarization LeClair et al. NAACL Code
2021 CoDesc: A Large Code-Description Parallel Dataset Hasan et al. ACL Code
2021 Measuring Coding Challenge Competence With APPS Hendrycks et al. NeurIPS Code
2021 AVATAR: A Parallel Corpus for Java-Python Program Translation Ahmad et al. arXiv Code
2018 StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow Yao et al. WWW Code
2021 PyTorrent: A Python Library Corpus for Large-scale Language Models Bahrami et al. arXiv Code
2021 CodeQA: A Question Answering Dataset for Source Code Comprehension Liu et al. EMNLP Code
2021 CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation Lu et al. NeurIPS Code

Challenges and Opportunities

Comprehensive Code Representation

Year Title Author Venue Code
2019 Open Vocabulary Learning on Source Code with a Graph-Structured Cache Cvitkovic et al. ICML Code
2020 Big code != big vocabulary: Open-vocabulary models for source code Karampatsis et al. ICSE Code
2021 A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code Chirkova et al. NAACL Code

Multi-Lingual and Cross-Language

Year Title Author Venue Code
2021 Disentangled Code Representation Learning for Multiple Programming Languages Zhang et al. ACL Code
2022 Multilingual training for Software Engineering Ahmed et al. ICSE Code
2019 CLCDSA: cross language code clone detection using syntactical features and API documentation Nafi et al. ASE Code
2019 Bilateral dependency neural networks for cross-language algorithm classification Bui et al. SANER Code
2019 SAR: learning cross-language API mappings with little knowledge Bui et al. FSE Code
2021 Interactive Cross-language Code Retrieval with Auto-Encoders Chen et al. ASE Code
2022 Cross-Domain Deep Code Search with Few-Shot Meta Learning Chai et al. ICSE Code
2022 Cross-Language Binary-Source Code Matching with Intermediate Representations Gui et al. SANER Code

Model Interpretability

Year Title Author Venue Code
2021 Vulnerability Detection with Fine-grained Interpretations Li et al. FSE Code
2021 Interpreting deep learning-based vulnerability detector predictions based on heuristic searching Zou et al. TOSEM Code
2021 Interpretable Program Synthesis Zhang et al. CHI Code
2021 PyExplainer: Explaining the Predictions of Just-In-Time Defect Models Pornprasit et al. ASE Code

Robustness and Security

Year Title Author Venue Code
2017 Towards evaluating the robustness of neural networks Carlini et al. SP Code
2018 Robust physical-world attacks on deep learning visual classification Eykholt et al. CVPR Code
2019 On evaluating adversarial robustness Carlini et al. arXiv Code
2020 Adversarial attacks on deep-learning models in natural language processing: A survey Zhang et al. TIST Code
2020 Semantic Robustness of Models of Source Code Ramakrishnan et al. arXiv Code
2020 Adversarial Examples for Models of Code Yefet et al. OOPSLA Code
2021 Adversarial Attacks to API Recommender Systems: Time to Wake Up and Smell the Coffee? Nguyen et al. ASE Code
2020 Adversarial robustness for code Bielik et al. ICML Code
2021 Adversarial Robustness of Deep Code Comment Generation Zhou et al. arXiv Code
2019 Misleading Authorship Attribution of Source Code using Adversarial Learning Quiring et al. USENIX Security Code
2021 A Practical Black-box Attack on Source Code Authorship Identification Classifiers Liu et al. TIFS Code
2021 Backdoors in Neural Models of Source Code Ramakrishnan et al. arXiv Code
2021 You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion Schuster et al. USENIX Security Code
2021 Explanation-Guided Backdoor Poisoning Attacks Against Malware Classifiers Severi et al. USENIX Security Code
2020 Generating Adversarial Examples for Holding Robustness of Source Code Processing Models Zhang et al. AAAI Code
