These abstracts are for talks at this event.

NEPLS is a venue for ongoing research, so the abstracts and supplemental material associated with each talk are necessarily snapshots in time. The work presented here may be in a state of flux. In all cases, please consult the authors' Web pages for up-to-date information, and please do not treat these pages as a definitive source.


Sound and Partially-Complete Static Analysis of Data-Races in GPU Programs

Dennis Liew (University of Massachusetts Boston)

GPUs are increasingly integrated into modern society, playing a
pivotal role in Artificial Intelligence and High-Performance
Computing. Programmers need a deep understanding of the GPU
programming model to avoid subtle data-races in their code. Static
verification that is sound and incomplete can guarantee data-race
freedom, but the alarms it raises may be spurious and need to be
validated.
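
As a minimal illustration (our own sketch, not an example from the
talk), consider a CUDA kernel in which thread 0 writes an element that
every other thread reads, with no synchronization between the write and
the reads; a sound static analysis must flag this race, while an
incomplete one may also flag accesses that are in fact safe:

    // Hypothetical kernel, for illustration only: out[0] is written by
    // thread 0 and read by all other threads with no synchronization,
    // so the kernel contains a data race.
    __global__ void racy(const int *in, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i == 0) {
            out[0] = in[0];            // write to out[0]
        } else if (i < n) {
            out[i] = out[0] + in[i];   // unsynchronized read of out[0]
        }
    }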

We establish a True Positive Theorem for a static data-race detector
for GPU programs, i.e., a result that identifies a class of programs
for which our technique only raises true alarms. Our work builds on
the formalism of memory access protocols, which models the concurrency
operations of CUDA programs. The crux of our approach is an
approximation analysis that can correctly identify true alarms and
pinpoint the conditions that make an alarm imprecise. Our
approximation analysis detects when the reported locations are
reachable (control independence, or CI) and when the reported
locations are precise (data independence, or DI), as well as
identifying inexact values in an alarm. In addition to a True Positive
result for programs that are CI and DI, we establish the root causes
of spurious alarms depending on whether CI or DI holds.

We apply our theory to introduce FaialAA, the first sound and
partially complete data-race detector. We evaluate FaialAA in three
experiments. First, in a comparative study with state-of-the-art
tools, we show that FaialAA confirms more data-race-free (DRF)
programs than other tools while emitting 1.9× fewer potential
alarms. Importantly, the approximation analysis of FaialAA detects 10
undocumented data-races. Second, in an experiment studying 6 commits
of data-race fixes in the open-source projects OpenMM and Nvidia’s
Megatron, FaialAA confirmed the buggy and fixed versions of 5 commits,
while other tools could only confirm 2. Third, we show that 59.5% of
2,770 programs are CI and DI, quantifying when the approximation
analysis of FaialAA is complete.


A Transactions Extension for WebAssembly

Eliot Moss (University of Massachusetts Amherst)

Our group has been working on designing and building an easily
targeted runtime supporting managed language persistence built on top
of non-volatile memory. A significant part of that support is
providing transactions (essentially software transactional memory)
over heap objects. We recently completed a specification, a (slow)
reference interpreter implementation, and much of a more performant
interpreter+JIT implementation of a transactions extension for
WebAssembly (Wasm). We will offer a concise overview of this extension
with a few examples, and show how exploiting the type system can
reduce run-time checks and bookkeeping related to transactions. No
previous knowledge of Wasm is required.


How (Not) To Evaluate the Human Factors of PLs

Will Crichton (Brown University)

Claims abound in the PL literature about programming systems being
"easy", "intuitive", or "usable" (all direct quotes from OOPSLA'23
papers). Yet, while PL papers will systematically evaluate claims
about correctness and performance, they rarely investigate human
factors with the same rigor. In this talk, I will discuss: why is
there such a disconnect between our claims and our evidence? What are
the challenges in importing traditional psychological or HCI methods
into PL? And most importantly: the next time you want to make a claim
about the human factors of your system, how can you do better than
counting lines of code?


Metamorph: Synthesizing Large Objects from Dafny APIs

Alexander Bai (Tufts University)

In this talk, we focus on synthesizing sequences of method calls
based on a user-provided formal specification of an API in Dafny. This
class of synthesis problems is becoming increasingly relevant with the
availability of fully verified APIs.

We introduce a new approach that uses counterexamples returned by the
Dafny verifier to reason about the effects of method calls one at a
time, limiting the complexity of solver queries. To further guide the
synthesis process, we observe that the use of counterexamples to
represent object states allows us to easily decompose the synthesis
goal into multiple subgoals. We identify which method calls satisfy
which subgoals, and combine this information using integer
programming. Our approach aims to limit both the complexity and the
overall number of SMT queries.
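
As a rough sketch of how such information can be combined (our
illustration; Metamorph's actual encoding also has to respect the
order of calls and may differ in its details), selecting a small set
of candidate calls that together cover every subgoal can be phrased as
an integer program:

    \min \sum_{m} x_m
    \text{subject to } \sum_{m \,:\, m \text{ satisfies } g} x_m \ge 1
        \quad \text{for every subgoal } g
    x_m \in \{0, 1\}

where x_m = 1 means that candidate method call m is selected.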

We implement these ideas in a tool called Metamorph that can produce a
sequence of method calls that constructs an object and puts it in a
state specified by the user. We compare Metamorph to a baseline
approach and conduct an ablation study to show how different features
of the tool contribute to its performance.

This is joint work with Aleksandr Fedchin and Jeffrey S. Foster.


Cobblestone: Iterative Automation for Formal Verification

Emily First (UC San Diego)

Formal verification using proof assistants, such as Coq, is an
effective way of improving software quality, but it is
expensive. Recent work has used machine learning to automatically
synthesize proofs. This talk presents Cobblestone, a novel approach
that improves on the state of the art by taking advantage of partial
progress in proof synthesis attempts. Unlike prior tools, Cobblestone
can produce multiple unsuccessful proofs using an LLM, identify the
working portions of those proofs, and combine them into a single,
successful proof, taking advantage of internal partial progress. We
evaluate Cobblestone on two benchmarks of open-source Coq projects,
controlling for the training data leakage that can happen with
LLMs. Cobblestone proves 48% of the theorems, while the previous
state-of-the-art tool proves 17%. We also evaluate Cobblestone in a
setting where it is given external partial proof progress from
oracles, serving as proxies for humans. When the theorem is broken
down into a set of subgoals and given a set of relevant lemmas already
proven in the project, Cobblestone can prove up to 58% of the
theorems. This talk also details a qualitative study of the theorems
Cobblestone is and is not able to prove, outlining potential future
directions.


A Verified Foreign Function Interface Between Coq and C

Joomy Korkut (Princeton University and Bloomberg L.P.)

One can write dependently typed functional programs in Coq, and prove
them correct in Coq; one can write low-level programs in C, and prove
them correct with a C verification tool. We demonstrate how to write
programs partly in Coq and partly in C, and interface the proofs
together. The Verified Foreign Function Interface (VeriFFI) guarantees
type safety and correctness of the combined program. It works by
translating Coq function types (and constructor types) along with Coq
functional models into VST function specifications; if the user can
prove in VST that the C functions satisfy those specs, then the C
functions behave according to the user-specified functional models
(even though the C implementation might be very different), and the
proofs of Coq functions that call the C code can rely on that
behavior. To achieve this translation, we employ a novel, hybrid
deep/shallow description of Coq dependent types.

This is joint work with Kathrin Stark (Heriot-Watt University) and
Andrew W. Appel (Princeton University).


On Compositionality in Hardware Pipeline Design

Rishiyur S. Nikhil (Bluespec, Inc.)

When designing a hardware pipeline for a CPU or accelerator (using a
hardware design language), we would like to design each pipeline stage
independently: to reason about its correctness independently, to test
it independently, to replace/improve/refine it independently, and to
compile it independently (into lower-level RTL or gates).

But (1) there are often flow-control dependencies between stages---a
stage can advance a datum only if the next stage also advances its
datum.  (2) Many pipelines have feedback paths (e.g., CPU redirection
due to branch misprediction, CPU register bypassing and scoreboarding)
which are complicated by flow-control dependencies. (3) If there are
inter-stage combinational paths, they reduce the speed at which the
pipeline can be clocked.

In this talk, we show how we solve these issues using the BSV
(Bluespec) hardware design language.  This technique is used
throughout the design of our open-source "Fife" RISC-V CPU, a 5+ stage
pipeline with branch prediction, for which we present results and
provide a demo.


Eventually, Understanding: An Adaptive Tutor for Linear Temporal Logic

Siddhartha Prasad (Brown University), Ben Greenman (University of Utah), Tim Nelson (Brown University), Shriram Krishnamurthi (Brown University)

For several years, we have been studying Linear Temporal Logic (LTL)
from a human factors perspective. This has led us to an understanding
of the sorts of misconceptions that LTL learners demonstrate. However,
as educators, we recognize that it isn’t always easy to add new
materials to classes. As a result, we have distilled these insights
into a tutor designed to be used by learners with minimal prior
instruction.

In contrast to traditional tutoring systems, which are often tied to a
specific curriculum or course, our tutor is grounded in the context of
common LTL misconceptions. This allows it to give learners actionable
feedback when they get questions wrong, tailored insight into their
misconceptions, and adaptive question sets designed to drill them on
their specific weaknesses.
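
As one illustrative example of the kind of distinction such questions
can probe (our example, not necessarily one drawn from the tutor's
question bank), consider the formulas

    G F p   ("p holds infinitely often")
    F G p   ("p eventually holds forever")

A trace in which p alternates between true and false forever satisfies
the first but not the second, a distinction that commonly trips up
newcomers to temporal logic.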


Work-in-Progress: An SMT-Based, Correct-by-Construction Place-and-Route Framework

Edward Wang (MIT)

Place-and-route (P&R) is a key bottleneck in integrated circuit
design, and existing approaches to P&R fall short in correctness and
methodology. We introduce an SMT-based approach that addresses these
concerns and opens up new possibilities for improved co-optimization
of designs. We present the system design and problem encoding and show
some preliminary results.


Automated Program Repair, What Is It Good For? Not Absolutely Nothing!

Hadeel Eladawy (University of Massachusetts, Amherst), Claire Le Goues (Carnegie Mellon University), Yuriy Brun (University of Massachusetts, Amherst)

Industrial deployments of automated program repair (APR), e.g., at
Facebook and Bloomberg, signal a new milestone for this exciting and
potentially impactful technology. In these deployments, developers use
APR-generated patch suggestions as part of a human-driven debugging
process. Unfortunately, little is known about how using patch
suggestions affects developers during debugging.

This talk describes a controlled user study with 40 developers with a
median of 6 years of experience. The developers engage in debugging
tasks on nine naturally-occurring defects in real-world, open-source,
Java projects, using Recoder, SimFix, and TBar, three state-of-the-art
APR tools. For each debugging task, the developers have access either
to the project’s tests alone or, additionally, to code suggestions
that make all the tests pass. These suggestions are either
developer-written or APR-generated, and the APR-generated ones can be
correct or deceptive. Deceptive
suggestions, which are a common APR occurrence, make all the available
tests pass but fail to generalize to the intended
specification. Through a total of 160 debugging sessions, we find that
access to a code suggestion significantly increases the odds of
submitting a patch. Access to correct APR suggestions increases the
odds of debugging success by 14,000% as compared to having access only
to tests, but access to deceptive suggestions decreases the odds of
success by 65%. Correct suggestions also speed up
debugging. Surprisingly, we observe no significant difference in how
novice and experienced developers are affected by APR, suggesting that
APR may find uses across the experience spectrum. Overall, developers
come away with a strong positive impression of APR, suggesting promise
for APR-mediated, human-driven debugging, despite existing challenges
in APR-generated repair quality.


You can't think about thinking without thinking about thinking about MEMO!

Kartik Chandra (MIT)

Thinking is amazing enough. But think about the fact that we can
think *about* thinking — our own and each other's! This remarkable
human ability, often called "theory of mind," empowers us to
communicate, cooperate, collaborate… really, to coexist at all. If we
are to build AI that thinks like us (or even *with* us!), we had
better understand how it works.

Over the past decade, many cognitive scientists have argued that
theory of mind is well-described by *recursive probabilistic
programming*. Nearly every facet of human behavior—language,
education, art, morality—has been studied through this lens. But it's
hard work: for such models, existing PPLs are hard to debug and slow
to run. In practice, our cogsci colleagues often end up expressing
models in R and fitting them by days-long grid search.
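
One classic instance of such recursive modeling from the
cognitive-science literature (our illustration, in its simplest form;
memo is not tied to this particular model) is the Rational Speech Acts
framework, in which a pragmatic listener reasons about a speaker who
in turn reasons about a literal listener:

    P_{L_1}(w \mid u) \propto P_{S_1}(u \mid w) \cdot P(w)
    P_{S_1}(u \mid w) \propto \exp\big(\alpha \cdot \log P_{L_0}(w \mid u)\big)
    P_{L_0}(w \mid u) \propto [[u]](w) \cdot P(w)

Here u is an utterance, w a world state, [[u]](w) the literal truth of
u in w, and \alpha a rationality parameter. Evaluating P_{L_1}
requires inference nested inside inference, exactly the pattern that
general-purpose PPLs make slow and awkward.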

Here, I'll present a new domain-specific PPL, memo, that addresses
these issues and is already being used by scientists at multiple
institutions. Memo has two key ideas: (1) dedicated syntax/semantics
for theory-of-mind concepts (agency, knowledge, belief), and (2)
compilation of nested inference to fast/parallel/differentiable array
code. I'll show how memo works by live-coding a real-time cogsci
experiment✨, and then discuss what makes memo interesting from a PL
perspective. Finally, I'll say why I think reflecting on memo's design
can help us better understand the foundations of human intelligence.

(This is joint work with Tony Chen, Josh Tenenbaum, and Jonathan
Ragan-Kelley, made possible by dozens of collaborators across PL and
cognitive science. You can read more about memo here:
https://github.com/kach/memo)


Sike: Static Analysis For The Shell

Anirudh Narsipur (Brown University)

Shell programs are used routinely for configuration, processing, and
automation tasks. Unfortunately, their execution can have catastrophic
consequences: when buggy, these programs can lead to
disastrous and unrecoverable errors due to their ubiquitous
interaction with the filesystem and the broader environment. The
combination of the shell’s unique semantics, the use of black box
components, and the dynamic nature of the broader environment
significantly complicates reasoning about shell programs.

We present Sike, a novel static analyzer that notifies users of
program incorrectness before a shell program is executed. Central to
Sike is reasoning about the effects of shell programs on the file
system. Sike symbolically executes shell scripts, using SMT-based
string reasoning to capture much of the shell’s built-in string
manipulation capabilities. We combine this with a Hoare-style logic to
statically detect the incorrect composition of components with respect
to their side effects on the filesystem. We are in the process of
evaluating Sike on a variety of real-world shell scripts, including
ones from Debian packages and a variety of open-source GitHub
repositories.

