- CIGAR: Contrastive Learning for GitHub Action Recommendation. Jiangnan Huang and Bin Lin. Source Code Analysis and Manipulation (SCAM), 2023 (Best paper award).
GitHub Actions was introduced in 2019 as an integrated CI/CD solution for automating software development workflows. Since then, it has gained tremendous popularity among developers. In a GitHub Actions workflow, actions are custom applications that perform complex but frequently repeated tasks. Actions can typically be found in GitHub Marketplace or in public GitHub repositories. Prior studies have shown that developers often reuse actions to avoid duplicated work and improve productivity. However, it is not trivial for developers, especially novices, to figure out which action to reuse, because of the large number of actions available and the limited search functionality GitHub Marketplace provides. To address this issue, we propose CIGAR (ContrastIve learning for GitHub Action Recommendation). Given the textual description of a task developers want to execute, CIGAR recommends the most relevant actions. CIGAR uses a pre-trained RoBERTa model to convert sequences of words into high-dimensional vector representations, and is fine-tuned with a contrastive learning objective. The performance of CIGAR was evaluated on a novel dataset curated based on prior research, and the results demonstrate that CIGAR can reliably recommend the actions developers need and significantly outperforms the GitHub Marketplace search engine. Our study indicates the promise of employing contrastive learning for GitHub action recommendation. The performance achieved can potentially drive wider adoption of GitHub Actions and facilitate the automation of software development workflows.
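The abstract does not say which contrastive objective CIGAR uses. As a rough illustration only, the sketch below shows one common choice, an in-batch InfoNCE loss over cosine similarities, followed by similarity-based ranking. The function names, the temperature value, and the toy two-dimensional vectors (standing in for RoBERTa embeddings) are all assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(queries, actions, temperature=0.1):
    """Mean cross-entropy in which queries[i] should match actions[i];
    every other action in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, a) / temperature for a in actions]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)

def recommend(query, actions, k=1):
    """Indices of the k actions most similar to the query embedding."""
    order = sorted(range(len(actions)),
                   key=lambda j: cosine(query, actions[j]),
                   reverse=True)
    return order[:k]

# Toy embeddings: task descriptions and action descriptions (hypothetical).
queries = [[1.0, 0.0], [0.0, 1.0]]
actions = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = info_nce(queries, actions)
shuffled_loss = info_nce(queries, actions[::-1])
best = recommend([1.0, 0.0], actions)
```

Minimizing such a loss pulls each task description toward its matching action and away from the others in the batch, which is what makes nearest-neighbour ranking usable as a recommender.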
- LTU Attacker for Membership Inference. Joseph Pedersen, Rafael Muñoz-Gómez, Jiangnan Huang, and 3 more authors. Algorithms, 2022.
We address the problem of defending predictive models, such as machine learning classifiers (Defender models), against membership inference attacks, in both the black-box and white-box settings, when the trainer and the trained model are publicly released. The Defender aims at optimizing a dual objective: utility and privacy. Privacy is evaluated with the membership prediction error of a so-called “Leave-Two-Unlabeled” (LTU) Attacker, which has access to all of the Defender and Reserved data, except for the membership label of one sample from each, giving the strongest possible attack scenario. We prove that, under certain conditions, even a “naïve” LTU Attacker can achieve lower bounds on privacy loss with simple attack strategies, leading to concrete necessary conditions to protect privacy, including preventing over-fitting and adding some amount of randomness. This attack is straightforward to implement against any model trainer, and we demonstrate its performance against MemGuard. However, we also show that such a naïve LTU Attacker can fail to attack the privacy of models known to be vulnerable in the literature, demonstrating that knowledge must be complemented with strong attack strategies to turn the LTU Attacker into a powerful means of evaluating privacy. The LTU Attacker can incorporate any existing attack strategy to compute individual privacy scores for each training sample. Our experiments on the QMNIST, CIFAR-10, and Location-30 datasets validate our theoretical results and confirm the roles of over-fitting prevention and randomness in the algorithms to protect against privacy attacks.
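To make the evaluation protocol concrete, here is a hypothetical sketch of one simple attack strategy in the LTU setting: given the two unlabeled samples, guess that the one with the lower model loss was in the Defender's training set. This loss-comparison rule is an assumption chosen for illustration and is not necessarily one of the strategies analysed in the paper; the loss values are made up.

```python
def guess_member(loss_a, loss_b):
    """Return 'a' or 'b' for the sample guessed to be the training member,
    or None on a tie. Lower loss is taken as a sign of membership."""
    if loss_a < loss_b:
        return "a"
    if loss_b < loss_a:
        return "b"
    return None

def attack_accuracy(pairs):
    """pairs holds (member_loss, reserved_loss) tuples, one per LTU trial.
    A tie counts as a coin flip (half credit)."""
    score = 0.0
    for member_loss, reserved_loss in pairs:
        guess = guess_member(member_loss, reserved_loss)
        if guess == "a":
            score += 1.0
        elif guess is None:
            score += 0.5
    return score / len(pairs)

# Made-up per-sample losses for four trials.
trials = [(0.1, 0.9), (0.2, 0.3), (0.5, 0.4), (0.3, 0.3)]
accuracy = attack_accuracy(trials)
privacy = 1.0 - accuracy  # membership prediction error of the attacker
```

An over-fitted model tends to give its training samples noticeably lower loss, which pushes the accuracy of such a rule above 0.5 and the privacy score below it.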
- Comparing Local Search Initialization for K-Means and K-Medoids Clustering in a Planar Pareto Front, a Computational Study. Jiangnan Huang, Zixi Chen, and Nicolas Dupin. In Optimization and Learning, 2021.
Given N points in a planar Pareto Front (2D PF), k-means and k-medoids are solvable in O(N^3) time by dynamic programming algorithms. Standard local search approaches, the PAM and Lloyd’s heuristics, are investigated in the 2D PF case to solve large instances faster. Specific initialization strategies related to 2D PF cases are implemented alongside the generic ones (Forgy’s, Hartigan’s, k-means++). Applying PAM and Lloyd’s local search iterations, the quality of the local minima is compared with the optimal values. Numerical results are computed using generated instances, which were made public. This study highlights that local minima of poor quality exist for 2D PF cases. A parallel or multi-start heuristic using four initialization strategies improves accuracy by avoiding poor local optima. Perspectives are still open to improve local search heuristics for the specific 2D PF cases.
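As an illustration of the local search being compared, the following is a minimal sketch of Lloyd’s iterations with k-means++ seeding, wrapped in a multi-start loop that keeps the lowest-cost result. It uses only one of the initialization strategies named above, restarted with different random draws, so it is a simplification of the multi-strategy heuristic described in the abstract; the function names and the six-point instance are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeanspp_init(points, k, rng):
    """k-means++ seeding: each new center is drawn with probability
    proportional to its squared distance to the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:  # every point already coincides with a center
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def lloyd(points, centers, max_iter=100):
    """Alternate assignment and centroid update until centers stop moving."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append((sum(x for x, y in cluster) / len(cluster),
                                    sum(y for x, y in cluster) / len(cluster)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, cost

def multi_start(points, k, starts=10, seed=0):
    """Run Lloyd's heuristic from several seedings and keep the best."""
    rng = random.Random(seed)
    runs = [lloyd(points, kmeanspp_init(points, k, rng)) for _ in range(starts)]
    return min(runs, key=lambda run: run[1])

# Six mutually non-dominated points (x increases while y decreases).
front = [(0, 10), (1, 9), (2, 8), (10, 2), (11, 1), (12, 0)]
centers, cost = multi_start(front, k=2)
```

Keeping the best of several runs is the simplest way to reduce the risk of stopping in a poor local minimum, at the price of repeating the local search.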