These abstracts are for talks at this event.
NEPLS is a venue for ongoing research, so the abstract and supplemental material associated with each talk are necessarily a snapshot in time. The work presented here may be in a state of flux. In all cases, please consult the authors' Web pages for up-to-date information, and please don't treat these pages as a definitive source.
Sound and Partially-Complete Static Analysis of Data-Races in GPU Programs
Dennis Liew (University of Massachusetts Boston)
GPUs are progressively being integrated into modern society, playing a pivotal role in Artificial Intelligence and High-Performance Computing. Programmers need a deep understanding of the GPU programming model to avoid subtle data-races in their code. Static verification that is sound and incomplete can guarantee data-race freedom, but the alarms it raises may be spurious and need to be validated. We establish a True Positive Theorem for a static data-race detector for GPU programs, i.e., a result that identifies a class of programs for which our technique only raises true alarms. Our work builds on the formalism of memory access protocols, which models the concurrency operations of CUDA programs. The crux of our approach is an approximation analysis that can correctly identify true alarms and pinpoint the conditions that make an alarm imprecise. Our approximation analysis detects when the reported locations are reachable (control independence, or CI) and when the reported locations are precise (data independence, or DI), as well as identifying inexact values in an alarm. In addition to a True Positive result for programs that are CI and DI, we establish the root causes of spurious alarms depending on whether CI or DI holds. We apply our theory to introduce FaialAA, the first sound and partially complete data-race detector. We evaluate FaialAA in three experiments. First, in a comparative study with state-of-the-art tools, we show that FaialAA confirms more data-race-free (DRF) programs than other tools while emitting 1.9× fewer potential alarms. Importantly, the approximation analysis of FaialAA detects 10 undocumented data-races. Second, in an experiment studying 6 commits of data-race fixes in the open-source projects OpenMM and Nvidia’s Megatron, FaialAA confirmed the buggy and fixed versions of 5 commits, while other tools were only able to confirm 2. Third, we show that 59.5% of 2,770 programs are CI and DI, quantifying when the approximation analysis of FaialAA is complete.
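As a rough illustration of the underlying idea (not FaialAA's actual analysis), a symbolic checker can ask an SMT solver whether two distinct threads may write the same memory location; the kernel shape, index expression, and parameter names below are hypothetical.

```python
# Minimal sketch (not FaialAA itself): ask an SMT solver whether two distinct
# GPU threads can write the same array index, assuming a hypothetical kernel
# in which each thread tid performs A[tid * stride] = ...
from z3 import Ints, Solver, sat

tid1, tid2, stride = Ints("tid1 tid2 stride")
s = Solver()
s.add(tid1 >= 0, tid2 >= 0, tid1 != tid2)   # two distinct threads
s.add(stride == 1)                          # hypothetical kernel parameter
s.add(tid1 * stride == tid2 * stride)       # both threads hit the same index
print("possible data-race" if s.check() == sat else "data-race free")
```

Changing `stride == 1` to `stride == 0` makes the query satisfiable, and the solver exhibits two threads colliding on index 0; this is the kind of alarm that an approximation analysis must then classify as precise or inexact.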
A Transactions Extension for WebAssembly
Eliot Moss (University of Massachusetts Amherst)
Our group has been working on designing and building an easily targeted runtime supporting managed-language persistence built on top of non-volatile memory. A significant part of that support is providing transactions (essentially software transactional memory) over heap objects. We recently completed a specification, a (slow) reference interpreter implementation, and much of a more performant interpreter+JIT implementation of a transactions extension for WebAssembly (Wasm). We will offer a concise overview of this extension with a few examples, and show how exploiting the type system can reduce run-time checks and bookkeeping related to transactions. No previous knowledge of Wasm is required.
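The extension itself is defined at the Wasm level; purely as a language-agnostic sketch of the per-access logging a software-transactional-memory runtime performs (the bookkeeping that type information can help reduce), here is a toy transaction over ordinary heap objects. All class and method names are illustrative and are not the extension's instructions.

```python
# Toy software-transactional-memory sketch (illustrative only; not the Wasm
# extension's actual semantics or instruction names).
class TxAbort(Exception):
    pass

class Transaction:
    def __init__(self):
        self.read_log = []      # (obj, field, value observed at read time)
        self.write_log = {}     # (obj, field) -> buffered new value

    def read(self, obj, field):
        if (obj, field) in self.write_log:          # read-your-own-writes
            return self.write_log[(obj, field)]
        value = getattr(obj, field)
        self.read_log.append((obj, field, value))   # bookkeeping per access
        return value

    def write(self, obj, field, value):
        self.write_log[(obj, field)] = value        # buffer until commit

    def commit(self):
        for obj, field, seen in self.read_log:      # validate all reads
            if getattr(obj, field) != seen:
                raise TxAbort()
        for (obj, field), value in self.write_log.items():
            setattr(obj, field, value)              # publish buffered writes
```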
How (Not) To Evaluate the Human Factors of PLs
Will Crichton (Brown University)
Claims abound in the PL literature about programming systems being "easy", "intuitive", or "usable" (all direct quotes from OOPSLA'23 papers). Yet, while PL papers will systematically evaluate claims about correctness and performance, they rarely investigate human factors with the same rigor. In this talk, I will discuss: why is there such a disconnect between our claims and our evidence? What are the challenges in importing traditional psychological or HCI methods into PL? And most importantly: the next time you want to make a claim about the human factors of your system, how can you do better than counting lines of code?
Metamorph: Synthesizing Large Objects from Dafny APIs
Alexander Bai (Tufts University)
In this talk, we focus on synthesizing sequences of method calls based on a user-provided formal specification of an API in Dafny. This class of synthesis problems is becoming increasingly relevant with the availability of fully verified APIs. We introduce a new approach that uses counterexamples returned by the Dafny verifier to reason about the effects of method calls one at a time, limiting the complexity of solver queries. To further guide the synthesis process, we observe that the use of counterexamples to represent object states allows us to easily decompose the synthesis goal into multiple subgoals. We identify which method calls satisfy which subgoals, and combine this information using integer programming. Our approach aims to limit both the complexity of SMT queries and the overall number of SMT queries. We implement these ideas in a tool called Metamorph that can produce a sequence of method calls that constructs an object and puts it in a state specified by the user. We compare Metamorph to a baseline approach and conduct an ablation study to show how different features of the tool contribute to its performance. This is joint work with Aleksandr Fedchin and Jeffrey S. Foster.
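A rough sketch of the "combine subgoal information with integer programming" step (hypothetical API calls and subgoals, and a 0-1 encoding of our own, not Metamorph's actual encoding):

```python
# Toy 0-1 integer program: pick a minimal set of method calls such that every
# synthesis subgoal is covered (hypothetical calls and subgoals).
from z3 import Bool, Optimize, Sum, If, Or, sat, is_true

calls = ["Push(1)", "Push(2)", "SetFlag()"]        # hypothetical API calls
covers = {                                         # call index -> subgoals it satisfies
    0: {"len>=1"},
    1: {"len>=1", "contains 2"},
    2: {"flag set"},
}
subgoals = {"len>=1", "contains 2", "flag set"}

use = [Bool(f"use_{i}") for i in range(len(calls))]
opt = Optimize()
for g in subgoals:                                 # every subgoal must be covered
    opt.add(Or([use[i] for i in covers if g in covers[i]]))
opt.minimize(Sum([If(u, 1, 0) for u in use]))      # prefer short call sequences
if opt.check() == sat:
    m = opt.model()
    print([calls[i] for i in range(len(calls)) if is_true(m.evaluate(use[i]))])
```

This sketch only chooses a covering set of calls; the tool additionally reasons about call order and object state via verifier counterexamples.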
Cobblestone: Iterative Automation for Formal Verification
Emily First (UC San Diego)
Formal verification using proof assistants, such as Coq, is an effective way of improving software quality, but it is expensive. Recent work has used machine learning to automatically synthesize proofs. This talk presents Cobblestone, a novel approach that improves on the state of the art by taking advantage of partial progress in proof synthesis attempts. Unlike prior tools, Cobblestone can produce multiple unsuccessful proofs using an LLM, identify the working portions of those proofs, and combine them into a single, successful proof, taking advantage of internal partial progress. We evaluate Cobblestone on two benchmarks of open-source Coq projects, controlling for the training data leakage that can happen with LLMs. Cobblestone proves 48% of the theorems, while the previous state-of-the-art tool proves 17%. We also evaluate Cobblestone in a setting where it is given external partial proof progress from oracles, serving as proxies for humans. When the theorem is broken down into a set of subgoals and given a set of relevant lemmas already proven in the project, Cobblestone can prove up to 58% of the theorems. This talk also details a qualitative study of the theorems Cobblestone is and is not able to prove, outlining potential future directions.
A Verified Foreign Function Interface Between Coq and C
Joomy Korkut (Princeton University and Bloomberg L.P.)
One can write dependently typed functional programs in Coq, and prove them correct in Coq; one can write low-level programs in C, and prove them correct with a C verification tool. We demonstrate how to write programs partly in Coq and partly in C, and interface the proofs together. The Verified Foreign Function Interface (VeriFFI) guarantees type safety and correctness of the combined program. It works by translating Coq function types (and constructor types) along with Coq functional models into VST function-specifications; if the user can prove in VST that the C functions satisfy those specs, then the C functions behave according to the user-specified functional models (even though the C implementation might be very different) and the proofs of Coq functions that call the C code can rely on that behavior. To achieve this translation, we employ a novel, hybrid deep/shallow description of Coq dependent types. This is joint work with Kathrin Stark (Heriot-Watt University) and Andrew W. Appel (Princeton University).
On Compositionality in Hardware Pipeline Design
Rishiyur S. Nikhil (Bluespec, Inc.)
When designing a hardware pipeline for a CPU or accelerator (using a hardware design language), we would like to design each pipeline stage independently in order to reason about correctness of each stage independently, to be able to test each stage independently, to be able to replace/improve/refine each stage independently, and to be able to compile each stage independently (into lower-level RTL or gates). But (1) there are often flow-control dependencies between stages---a stage can advance a datum only if the next stage also advances its datum. (2) Many pipelines have feedback paths (e.g., CPU redirection due to branch misprediction, CPU register bypassing and scoreboarding) which are complicated by flow-control dependencies. (3) If there are inter-stage combinational paths, they reduce the speed at which the pipeline can be clocked. In this talk, we show how we solve these issues using the BSV (Bluespec) hardware design language. This technique is used throughout the design of our open-source "Fife" RISC-V CPU, a 5+ stage pipeline with branch prediction, for which we present results and provide a demo.
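The talk's designs are written in BSV; purely as a language-neutral picture of the flow-control dependency in point (1), here is a toy simulation in which a stage may advance its datum only when the downstream stage can also advance. Stage names and the three-stage pipeline are hypothetical.

```python
# Toy ready/valid flow control between pipeline stages (illustration only; the
# talk's actual designs are expressed in BSV, not Python).
class Stage:
    def __init__(self, name):
        self.name = name
        self.reg = None                          # at most one buffered datum

def step(stages, new_input):
    """One clock cycle: compute readiness back-to-front, then move data forward."""
    n = len(stages)
    ready = [False] * n                          # ready[i]: stage i can accept a datum
    for i in reversed(range(n)):
        downstream_ready = ready[i + 1] if i + 1 < n else True   # sink always accepts
        ready[i] = stages[i].reg is None or downstream_ready
    for i in reversed(range(n)):                 # drain later stages first
        if stages[i].reg is not None and (ready[i + 1] if i + 1 < n else True):
            if i + 1 < n:
                stages[i + 1].reg = stages[i].reg
            stages[i].reg = None
    if ready[0]:
        stages[0].reg = new_input

pipe = [Stage("fetch"), Stage("decode"), Stage("execute")]
for datum in ["i0", "i1", "i2"]:
    step(pipe, datum)
    print([s.reg for s in pipe])
```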
Eventually, Understanding: An Adaptive Tutor for Linear Temporal Logic
Siddhartha Prasad (Brown University), Ben Greenman (University of Utah), Tim Nelson (Brown University), Shriram Krishnamurthi (Brown University)
For several years, we have been studying Linear Temporal Logic (LTL) from a human factors perspective. This has led us to an understanding of the sorts of misconceptions that LTL learners demonstrate. However, as educators, we recognize that it isn’t always easy to add new materials to classes. As a result, we have distilled these insights into a tutor designed to be used by learners with minimal prior instruction. In contrast to traditional tutoring systems, which are often tied to a specific curriculum or course, our tutor is grounded in the context of common LTL misconceptions. This allows it to give learners actionable feedback when they get questions wrong, tailored insight into their misconceptions, and adaptive question sets designed to drill them on their specific weaknesses.
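As one illustrative example of the kind of distinction such misconceptions involve (our example, not necessarily one drawn from the tutor's question bank):

```latex
% Two formulas that learners commonly conflate, though they are not equivalent:
% "p holds infinitely often" versus "eventually p holds forever".
\[
  \mathbf{G}\,\mathbf{F}\,p \quad\not\equiv\quad \mathbf{F}\,\mathbf{G}\,p
\]
% The alternating trace p, \neg p, p, \neg p, \ldots satisfies the first
% formula but not the second.
```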
Work-in-Progress: An SMT-Based, Correct-by-Construction Place-and-Route Framework
Edward Wang (MIT)
Place-and-route (P&R) is a key bottleneck in integrated circuit design, and existing approaches for P&R fall short in correctness and methodology. We introduce an SMT-based approach that addresses these concerns and opens new possibilities for improved co-optimization of designs. We describe the system design and problem encoding, and show some preliminary results.
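As a minimal sketch of what an SMT encoding of placement can look like (hypothetical cells, die size, and constraints; not the talk's actual encoding), one can ask a solver for cell positions such that every cell stays on the die and no two cells overlap:

```python
# Minimal SMT placement sketch (hypothetical cells and grid; not the talk's
# actual problem encoding).
from z3 import Ints, Solver, Or, sat

W, H = 8, 4                                                    # hypothetical die size
cells = {"alu": (3, 2), "regfile": (2, 2), "decoder": (2, 1)}  # name -> (width, height)

s = Solver()
pos = {}
for name, (w, h) in cells.items():
    x, y = Ints(f"{name}_x {name}_y")
    pos[name] = (x, y, w, h)
    s.add(x >= 0, y >= 0, x + w <= W, y + h <= H)              # stay on the die
names = list(cells)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        xi, yi, wi, hi = pos[names[i]]
        xj, yj, wj, hj = pos[names[j]]
        # No overlap: cell i is fully left of, right of, above, or below cell j.
        s.add(Or(xi + wi <= xj, xj + wj <= xi, yi + hi <= yj, yj + hj <= yi))
if s.check() == sat:
    m = s.model()
    for name, (x, y, w, h) in pos.items():
        print(name, m[x], m[y])
```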
Automated Program Repair, What Is It Good For? Not Absolutely Nothing!
Hadeel Eladawy (University of Massachusetts Amherst), Claire Le Goues (Carnegie Mellon University), Yuriy Brun (University of Massachusetts Amherst)
Industrial deployments of automated program repair (APR), e.g., at Facebook and Bloomberg, signal a new milestone for this exciting and potentially impactful technology. In these deployments, developers use APR-generated patch suggestions as part of a human-driven debugging process. Unfortunately, little is known about how using patch suggestions affects developers during debugging. This talk describes a controlled user study with 40 developers with a median of 6 years of experience. The developers engage in debugging tasks on nine naturally occurring defects in real-world, open-source Java projects, using Recoder, SimFix, and TBar, three state-of-the-art APR tools. For each debugging task, the developers have access either only to the project’s tests, or also to code suggestions that make all the tests pass. These suggestions are either developer-written or APR-generated; APR-generated suggestions can be correct or deceptive. Deceptive suggestions, which are a common APR occurrence, make all the available tests pass but fail to generalize to the intended specification. Through a total of 160 debugging sessions, we find that access to a code suggestion significantly increases the odds of submitting a patch. Access to correct APR suggestions increases the odds of debugging success by 14,000% as compared to having access only to tests, but access to deceptive suggestions decreases the odds of success by 65%. Correct suggestions also speed up debugging. Surprisingly, we observe no significant difference in how novice and experienced developers are affected by APR, suggesting that APR may find uses across the experience spectrum. Overall, developers come away with a strong positive impression of APR, suggesting promise for APR-mediated, human-driven debugging, despite existing challenges in APR-generated repair quality.
You can't think about thinking without thinking about thinking about MEMO!
Kartik Chandra (MIT)
Thinking is amazing enough. But think about the fact that we can think *about* thinking — our own and each other's! This remarkable human ability, often called "theory of mind," empowers us to communicate, cooperate, collaborate… really, to coexist at all. If we are to build AI that thinks like us (or even *with* us!), we had better understand how it works. Over the past decade, many cognitive scientists have argued that theory of mind is well-described by *recursive probabilistic programming*. Nearly every facet of human behavior—language, education, art, morality—has been studied through this lens. But it's hard work: for such models, existing PPLs are hard to debug and slow to run. In practice, our cogsci colleagues often end up expressing models in R and fitting them by days-long grid search. Here, I'll present a new domain-specific PPL, memo, that addresses these issues and is already being used by scientists at multiple institutions. Memo has two key ideas: (1) dedicated syntax/semantics for theory-of-mind concepts (agency, knowledge, belief), and (2) compilation of nested inference to fast/parallel/differentiable array code. I'll show how memo works by live-coding a real-time cogsci experiment✨, and then discuss what makes memo interesting from a PL perspective. Finally, I'll say why I think reflecting on memo's design can help us better understand the foundations of human intelligence. (This is joint work with Tony Chen, Josh Tenenbaum, and Jonathan Ragan-Kelley, made possible by dozens of collaborators across PL and cognitive science. You can read more about memo here: https://github.com/kach/memo)
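memo's real syntax and compiler are not reproduced here; purely as a generic illustration of nested "agents reasoning about agents" inference reduced to array code (the kind of computation such models boil down to), here is a tiny rational speaker/listener model in plain numpy over a made-up three-world, two-utterance domain.

```python
# Generic nested-inference sketch in plain numpy (memo's actual syntax and
# compilation strategy differ; this only illustrates reasoning about an agent
# reasoning about another agent, expressed as array code).
import numpy as np

# Hypothetical toy domain: truth[w, u] says whether utterance u is literally
# true in world w.
truth = np.array([[1, 0],
                  [1, 1],
                  [0, 1]], dtype=float)

prior = np.ones(3) / 3                          # listener's prior over worlds

# Literal listener: P(w | u) proportional to literal truth times the prior.
L0 = truth * prior[:, None]
L0 = L0 / L0.sum(axis=0, keepdims=True)

# Speaker: reasons about the literal listener, soft-maximizing informativity.
alpha = 4.0
S1 = np.exp(alpha * np.log(L0 + 1e-9))
S1 = S1 / S1.sum(axis=1, keepdims=True)         # P(u | w)

# Pragmatic listener: reasons about the speaker reasoning about the listener.
L1 = S1 * prior[:, None]
L1 = L1 / L1.sum(axis=0, keepdims=True)         # P(w | u)
print(np.round(L1, 2))
```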
Sike: Static Analysis For The Shell
Anirudh Narsipur (Brown University)
Shell programs are used routinely for configuration, processing, and automation tasks. Unfortunately, their execution can have catastrophic consequences when gone wrong: when buggy, these programs can lead to disastrous and unrecoverable errors due to their ubiquitous interaction with the filesystem and the broader environment. The combination of the shell’s unique semantics, the use of black-box components, and the dynamic nature of the broader environment significantly complicates reasoning about shell programs. We present a novel static analyzer for shell programs called Sike that notifies users of program incorrectness before a shell program is executed. Central to Sike is reasoning about the effects of these programs on the filesystem. Sike symbolically executes shell scripts, using SMT-based string reasoning to capture much of the shell’s built-in string-manipulation capabilities. We combine this with a Hoare-style logic to statically detect the incorrect composition of components with respect to their side effects on the filesystem. We are in the process of evaluating Sike on a variety of real-world shell scripts, including ones from Debian packages and a variety of open-source GitHub repositories.
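As a generic illustration of SMT-based string reasoning of the sort described (a hypothetical property and encoding, not Sike's), one can verify a fact about a shell-computed path by checking that its negation is unsatisfiable:

```python
# Generic string-SMT sketch (hypothetical property; not Sike's actual encoding):
# verify that a destination path computed as "$SRC.bak" always ends in ".bak"
# by checking that the negation of the property is unsatisfiable.
from z3 import String, StringVal, Concat, SuffixOf, Not, Solver, unsat

src = String("src")                            # an arbitrary shell variable, e.g. $SRC
dst = Concat(src, StringVal(".bak"))           # dst="$SRC.bak"

s = Solver()
s.add(Not(SuffixOf(StringVal(".bak"), dst)))   # can dst ever NOT end in ".bak"?
if s.check() == unsat:
    print("property holds for every value of $SRC")
else:
    print("counterexample:", s.model()[src])
```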
Last modified Sunday, September 22nd, 2024 8:29:17am