P2D: A Transpiler Framework for Optimizing Data Science Pipelines

Abstract

In this paper, we propose a transpilation-based approach to optimize data science pipelines that comprise database management systems (DBMSes) and data science runtimes (e.g., Python). Our approach identifies DBMS-supported operations and translates them into SQL, leveraging DBMSes to accelerate data science workloads. The optimization target is twofold: first, to improve data loading by reducing the amount of data transferred between runtimes; second, to exploit DBMS processing capabilities by "pushing down" certain pre-processing operations. Our optimizations are based on an intermediate representation that supports different data science libraries as frontends and different DBMSes as backends, making the approach suitable for a variety of data science pipelines. Our evaluation with real and synthetic datasets shows that our approach can accelerate data science workloads by up to an order of magnitude over state-of-the-art approaches.
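The pushdown idea described in the abstract can be illustrated with a minimal sketch. This is not the P2D implementation: the `trips` table, the `to_sql` helper, and the tuple-based "intermediate representation" are all hypothetical, and the example uses only Python's standard `sqlite3` module. It contrasts a baseline that transfers every row and filters in Python with a version that renders the same filter and projection as SQL, so that only the matching rows and columns leave the database.

```python
import sqlite3

# Hypothetical toy table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, distance REAL, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(1, 2.0, 9.5), (2, 12.0, 40.0), (3, 15.5, 52.0), (4, 0.8, 5.0)],
)


def load_then_filter(conn):
    """Baseline: transfer every row, then filter and project in Python."""
    rows = conn.execute("SELECT id, distance, fare FROM trips").fetchall()
    return [(i, f) for (i, d, f) in rows if d > 10.0]


def to_sql(table, columns, predicate):
    """Render a toy IR (projection plus one comparison) as parameterized SQL.

    Assumption: table and column names come from trusted code, not user input.
    """
    col, op, value = predicate
    if op not in {">", "<", ">=", "<=", "="}:
        raise ValueError(f"unsupported operator: {op}")
    sql = f"SELECT {', '.join(columns)} FROM {table} WHERE {col} {op} ?"
    return sql, (value,)


def pushed_down(conn):
    """Pushdown: the DBMS filters and projects; only results are transferred."""
    sql, params = to_sql("trips", ["id", "fare"], ("distance", ">", 10.0))
    return conn.execute(sql, params).fetchall()


baseline = load_then_filter(conn)
optimized = pushed_down(conn)

# Both strategies must produce the same result set.
assert sorted(baseline) == sorted(optimized)
```

In this toy setting the baseline moves four rows of three columns across the database boundary, while the pushed-down query moves two rows of two columns. The abstract's claim concerns the same effect at realistic data sizes.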

Type
Conference paper
Publication
In 7th Workshop on Data Management for End-To-End Machine Learning
Kaustubh Beedkar
Assistant Professor