Adding Scripting with side-effects to RumbleDB

David-Marian Buzatu

ETH Zurich

Our work brings to RumbleDB the ability to execute side-effecting expressions and function invocations through scripting, and extends the language with local variable declarations, while loops and assign expressions. Built for large, unstructured and heterogeneous JSON datasets, RumbleDB is a query execution engine for the JSONiq query language and is tightly integrated with Apache Spark.

JSONiq adapts the syntax and semantics of XQuery to operate on JSON rather than XML. The similarities between the two languages allowed us to incorporate the XQuery specification for a scripting extension into JSONiq. The complexity of scripting stems from introducing side effects into the otherwise side-effect-free environment of JSONiq, which required careful design and implementation to support these effects at runtime. Scripting also moves the declarative style of JSONiq closer to an imperative experience: users may write operations similar to those offered by imperative languages like Python, from declaring local variables that can be reassigned later, to running while loops with break or continue operations, to stopping function execution early with exit statements. Adding scripting remains compatible with existing RumbleDB workloads and only introduces side effects when they are needed. Experiments show no significant degradation of RumbleDB's performance on previous workloads. Finally, we showcase the expressiveness of scripting within RumbleDB by providing an implementation of QuickSort and by modifying a dataset through repeated function invocations. Our results demonstrate the usability of scripting in RumbleDB workloads. Future work may provide integration with databases or data lakes to support long-lived updates performed directly through side-effecting operations.
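As a rough illustration of the imperative style described above, the following Python sketch sorts a list with QuickSort using only reassignable local variables, while loops, a continue and an early return. It is not RumbleDB or JSONiq code and is not the implementation presented in this work; the function name `quicksort` and the explicit-stack formulation are assumptions chosen for the example.

```python
def quicksort(items):
    """Iterative QuickSort written with imperative constructs only."""
    data = list(items)            # local variable, mutated in place below
    if len(data) < 2:
        return data               # early return, analogous to an exit statement
    stack = [(0, len(data) - 1)]  # pending (low, high) index ranges
    while stack:
        low, high = stack.pop()
        if low >= high:
            continue              # nothing to sort in this range
        pivot = data[high]        # Lomuto partition around the last element
        i = low
        j = low
        while j < high:
            if data[j] < pivot:
                data[i], data[j] = data[j], data[i]
                i = i + 1         # reassignment of a local variable
            j = j + 1
        data[i], data[high] = data[high], data[i]
        stack.append((low, i - 1))
        stack.append((i + 1, high))
    return data
```

Each construct here (reassignment, while, continue, early return) corresponds to one of the scripting features listed in the abstract, which is why such an algorithm is awkward to express in a purely declarative, side-effect-free query language.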

Presentation, 8 November 2024