Reactive Dataflow for Inflight Error Handling in ML Workflows

Abstract

Modern data analytics pipelines comprise traditional data transformation operations and pre-trained ML models deployed as user-defined functions (UDFs). Such pipelines, which we call ML workflows, often produce erroneous results because the ML models inadvertently introduce data errors. These model errors are one of the main obstacles to improving the accuracy of ML workflows. In this paper, we present Popper, a dataflow system for expressing ML workflows that natively supports inflight error handling. Users can extend ML workflows expressed in Popper by plugging in error handlers to improve accuracy. We propose reactive dataflow, a novel cyclic graph-based dataflow model that provides convenient abstractions for interleaving dataflow operators with user-defined error handlers, which detect and correct errors on the fly. We also propose an efficient execution strategy amenable to pipeline-parallel execution of reactive dataflow. We discuss open research challenges for making error handling a first-class citizen in dataflow systems and present a preliminary evaluation of our prototype system, which shows the effectiveness and benefits of inflight error handling in ML workflows.
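To make the idea of interleaving operators with error handlers concrete, the following is a minimal Python sketch. It is not Popper's API, which the abstract does not describe; every name here (`with_error_handler`, `run_pipeline`, `toy_model`, `low_confidence`, `rule_based_fix`) is hypothetical, and the "model" is a keyword rule standing in for a pre-trained ML UDF. The sketch only illustrates the general pattern: an operator's output is checked by a detector, and flagged records are routed through a corrector before flowing downstream.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict  # a record is a plain dict in this sketch


def with_error_handler(
    operator: Callable[[Record], Record],
    detect: Callable[[Record], bool],
    correct: Callable[[Record], Record],
    max_retries: int = 1,
) -> Callable[[Record], Record]:
    """Wrap an operator so that flagged outputs are corrected and re-checked.

    Hypothetical helper: the bounded loop stands in for a feedback edge
    from the handler back into the dataflow.
    """
    def wrapped(record: Record) -> Record:
        out = operator(record)
        for _ in range(max_retries):
            if not detect(out):
                break
            out = correct(out)
        return out
    return wrapped


def run_pipeline(
    records: Iterable[Record], stages: List[Callable[[Record], Record]]
) -> Iterator[Record]:
    """Chain stages lazily so records stream through one at a time."""
    stream: Iterator[Record] = iter(records)
    for stage in stages:
        stream = map(stage, stream)
    return stream


def toy_model(record: Record) -> Record:
    # Stand-in for a pre-trained model UDF (assumption, not a real model).
    if "good" in record["text"]:
        return {**record, "label": "pos", "conf": 0.9}
    return {**record, "label": "neg", "conf": 0.4}


def low_confidence(record: Record) -> bool:
    # Detector: flag predictions the model is unsure about.
    return record["conf"] < 0.5


def rule_based_fix(record: Record) -> Record:
    # Corrector: apply a simple rule and mark the record as resolved.
    label = "pos" if "great" in record["text"] else record["label"]
    return {**record, "label": label, "conf": 1.0}


def uppercase_label(record: Record) -> Record:
    # An ordinary downstream transformation.
    return {**record, "label": record["label"].upper()}


stages = [
    with_error_handler(toy_model, low_confidence, rule_based_fix),
    uppercase_label,
]
inputs = [{"text": "good movie"}, {"text": "great plot"}, {"text": "bad acting"}]
results = list(run_pipeline(inputs, stages))
labels = [r["label"] for r in results]
```

In this toy run, the second record is initially mislabeled by the stand-in model, flagged by the detector, and corrected before it reaches the downstream operator. A real system would presumably handle batching, parallelism, and handler placement quite differently.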

Type
Conference paper
Publication
In Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning (DEEM@SIGMOD)
Kaustubh Beedkar
Assistant Professor