XML processing on the cluster: Bringing XQuery 3.1 to RumbleDB
RumbleDB is a scalable, open-source engine for querying large-scale semi-structured data using JSONiq on Apache Spark. Building on previous contributions by David-Marian Buzatu and Marco Schöb, we have continued the shared effort to extend RumbleDB with initial support for XQuery 3.1, enabling distributed declarative processing of massive collections of XML documents. This new capability allows users from the XML ecosystem to leverage the power of Spark without abandoning familiar declarative patterns.
Our implementation covers a wide range of XQuery 3.1 features, including FLWOR expressions, node constructors and comparisons, built-in functions, arithmetic, and more. As a result, we now pass over 50% of the 32,000+ test cases in the official QT3 Test Suite, marking a major step toward full compliance. We also integrate community-driven proposals for the XQuery Scripting Extension and JSONiq Updating Expressions, further enriching the querying capabilities of RumbleDB.
This work has significantly increased the compatibility of RumbleDB with existing XQuery-based applications. By bridging the gap between the XML world and modern distributed Big Data systems, we hope to foster wider adoption of RumbleDB and grow an open, collaborative community around scalable declarative data processing.
Matteo Agnoletto is a Master’s student in Computer Science at ETH Zurich with professional experience in Systems Design, Cybersecurity, and DevOps. He has contributed to XML support in the RumbleDB query engine, which is now the focus of his Master’s thesis. Passionate about bridging research and practical applications in the tech industry, Matteo is also actively involved in entrepreneurial projects, collaborating with startups and applying his expertise in software architecture and automation to real-world challenges.