P2D: A Transpiler Framework for Optimizing Data Science Pipelines

Yordan Grigorov, Haralampos Gavriilidis, Sergey Redyuk, Kaustubh Beedkar, Volker Markl

April 2023

Abstract

In this paper, we propose a transpilation-based approach to optimize data science pipelines that comprise databases (DBMSes) and data science runtimes (e.g., Python). Our approach identifies DBMS-supported operations and translates them into SQL, leveraging DBMSes to accelerate data science workloads. The optimization target is twofold: first, to improve data loading by reducing the amount of data transferred between runtimes; second, to exploit DBMS processing capabilities by "pushing down" certain pre-processing operations. Our optimizations are based on an intermediate representation, which allows supporting different data science libraries and DBMSes as frontends and backends, respectively, making the approach suitable for different data science pipelines. Our evaluation with real and synthetic datasets shows that our approach can accelerate data science workloads by up to an order of magnitude over state-of-the-art approaches.
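
To make the idea of pushing operations down through an intermediate representation concrete, here is a minimal sketch. It is not P2D's implementation: the node classes `Scan`, `Filter`, and `Project`, the function `to_sql`, and the `trips` table are all hypothetical names chosen for illustration, and SQLite stands in for the DBMS backend. The sketch shows how a filter and a projection that a data science script might apply after loading can instead be translated into a single SQL query, so only the needed rows and columns leave the database.

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical IR nodes (illustrative only, not P2D's actual representation).
@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    child: object
    column: str
    op: str
    value: object

@dataclass
class Project:
    child: object
    columns: list

ALLOWED_OPS = {"=", "<", ">", "<=", ">=", "<>"}

def to_sql(node):
    """Translate an IR tree into (sql, params) as nested subqueries."""
    if isinstance(node, Scan):
        return f'SELECT * FROM "{node.table}"', []
    if isinstance(node, Filter):
        if node.op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {node.op}")
        inner, params = to_sql(node.child)
        sql = f'SELECT * FROM ({inner}) WHERE "{node.column}" {node.op} ?'
        return sql, params + [node.value]
    if isinstance(node, Project):
        inner, params = to_sql(node.child)
        cols = ", ".join(f'"{c}"' for c in node.columns)
        return f"SELECT {cols} FROM ({inner})", params
    raise TypeError(f"unknown IR node: {type(node).__name__}")

def run_demo():
    # In-memory SQLite database standing in for the DBMS backend.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (id INTEGER, distance REAL, fare REAL)")
    conn.executemany(
        "INSERT INTO trips VALUES (?, ?, ?)",
        [(1, 0.5, 4.0), (2, 12.0, 30.0), (3, 7.5, 21.0)],
    )
    # Equivalent of "load trips, keep distance > 5, select id and fare".
    plan = Project(Filter(Scan("trips"), "distance", ">", 5.0), ["id", "fare"])
    sql, params = to_sql(plan)
    rows = conn.execute(sql + ' ORDER BY "id"', params).fetchall()
    conn.close()
    return sql, rows

if __name__ == "__main__":
    sql, rows = run_demo()
    print(sql)
    print(rows)
```

Because the translation targets a small tree of operators rather than a specific library call, a different frontend or a different SQL dialect could in principle be supported by changing only the code that builds or prints the tree, which is the role the abstract assigns to the intermediate representation.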

Type

Conference paper

Publication

In 7th Workshop on Data Management for End-To-End Machine Learning

Kaustubh Beedkar

Assistant Professor