1.3. Models for NoSQL languages and systems
A striking aspect of the NoSQL ecosystem is its diversity. Concerning data stores, we find at least a dozen heavily used solutions (MongoDB, Apache Cassandra, Apache HBase, Apache CouchDB, Redis, Microsoft CosmosDB to name a few). Each of these solutions comes with its integrated query interface (some with high-level query languages, others with a low-level operator API). But besides data store, we also find processing engines such as Apache Hadoop (providing a MapReduce interface), Apache Spark (general cluster computing) or Apache Flink (stream-oriented framework), and each of these frameworks can target several of the aforementioned stores. This diversity of solutions translates to ever more complex application code, requiring careful and often brittle or inefficient abstractions to shield application business logic from the specificities of every data store. Reasoning about such programs, and in particular about their properties with respect to data access has become much more complex. This state of affairs has prompted a need for unifying approaches allowing us to target multiple data stores uniformly.
A first solution is to consider SQL as the unifying query language. Indeed, SQL is a well-known and established query language, and being able to query NoSQL data stores with the SQL language seems natural. This is the solution proposed by Curé et al. [CUR 11]. In this work, the database user queries a “virtual relational database” which can be seen as a relational view of different NoSQL stores. The approach consists of two complementary components. The first one is a data mapping which describes which part of a NoSQL data store is used to populate a virtual relation. The second is the Bridge Query Language (BQL), an intermediate query representation that bridges the gap between the high-level, declarative SQL, and the low-level programming API exposed by various data stores. The BQL makes some operations, such as iteration or sorting, explicit. In particular, BQL exposes a foreach construct that is used to implement the SQL join operator (using nested loops). A BQL program is then translated into the appropriate dialect. In [CUR 11], the authors give two translations: one targeting MongoDB and the other targeting Apache Cassandra, using their respective Java API.
While satisfactory from a design perspective, the solution of Curé et al. may lead to sub-optimal query evaluation, in particular in the case of joins. Indeed, in most frameworks (with the notable exception of Apache Spark), performing a double nested loop to implement a join implies that the join is actually performed on the client side of the application, that is, both collections to be joined are retrieved from the data store and joined in main memory. The most advanced contribution to date that provides not only a unified query language and data model, but also a robust query planner is the work of Kolev et al. on CloudMdsQL [KOL 16b]. At its heart, CloudMdsQL is a query language based on SQL, which is extended in two ways. First, a CloudMdsQL program may reference a table from a NoSQL store using the store’s native query language. Second, a CloudMdsQL program may contain blocks of Python code that can either produce synthetic tables or be used as user-defined functions (UDFs) to perform application logic. A programmer may query several data stores using SELECT statements (the full range of SQL’s SELECT syntax is supported, including joins, grouping, ordering and windowing constructs) that can be arbitrarily nested. One of the main contributions of Kolev et al. is a modular query planner that takes each data store’s capability into account and furthermore provides some cross data store optimizations. For instance, the planner may decide to use bind joins (see [HAA 97]) to efficiently compute a join between two collections stored in different data stores. With a bind join, rather than retrieving both collections on the client side and performing the join in main memory, one of the collections is migrated to the other data store where the join computation takes place. Another example of optimization performed by the CloudMdsQL planner is the rewriting of Python for each loops into plain database queries. The CloudMdsQL approach is validated by a prototype and an extensive benchmark [KOL 16a].
One aspect of CloudMdsQL that may still be improved is that even in such a framework, reasoning about programs is still difficult. In particular, CloudMdsQL is not so much a unified query language than the juxtaposition of SQL’s SELECT statement, Python code and a myriad of ad-hoc foreign expressions (since every data manipulation language can be used inside quotations). A more unifying solution, from the point of view of the intermediate query representation, is the Hop.js framework of Serrano et al. [SER 16]. Hop.js is a multi-tier programming environment for Web application. From a single JavaScript source file, the framework deduces both the view (HTML code), the client code (client-side JavaScript code) and the server code (server-side JavaScript code with database calls), as well as automatically generating server/client communications in the form of asynchronous HTTP requests. More precisely, in [COU 15], Serrano et al. applied the work of Cheney et al. [CHE 13b, CHE 13a, CHE 14] on language-integrated queries to Hop.js. In [COU 15], list comprehension is used as a common query language to denote queries to different data stores. More precisely, borrowing the Array comprehension syntax2 of the Ecmascript 2017 proposal, the author can write queries as:
[ for ( x of table )
      if ( x.age >= 18 ) { name: x.name, age: x.age } ]
which mimics the mathematical notation of set comprehension:
In this framework, joins may be written as nested for loops. However, unlike the work of Curé et al., array comprehension is compiled into more efficient operations if the target back-end supports them.
Despite their variety of data models, execution strategies and query languages, NoSQL systems seem to agree on one point: their lack of support for schema! As is well-known, the lack of schema is detrimental to both readability and performance (for instance, see the experimental study of the performance and readability impact of data modeling in MongoDB by Gómez et al. [GÓM 16]). This might come as a surprise, database systems have a long tradition of taking types seriously (from the schema and constraints of RDBMS to the various schema standards for XML documents and the large body of work on type-checking XML programs). To tackle this problem, Benzaken et al. [BEN 13] proposed a core calculus of operators, dubbed filters. Filters reuse previous work on semantic sub-typing (developed in the context of static type checking of XML transformations [FRI 08]) and make it possible to: (i) model NoSQL databases using regular types and extensible records, (ii) give a formal semantics to NoSQL query languages and (iii) perform type checking of queries and programs accessing data. In particular, the work in [BEN 13] gives a formal semantics of the JaQL query language (originally introduced in [BEY 11] and now part of IBM BigInsights) as well as a precise type-checking algorithm for JaQL programs. In essence, JaQL programs are expressed as sets of mutually recursive functions and such functions are symbolically executed over the schema of the data to compute an output type of the query. The filter calculus was generic enough to encode not only JaQL but also MongoDB’s query language (see [HUS 14]). One of the downsides of the filter approach, however, is that to be generic enough to express any kind of operator, filters rely on low-level building blocks (such as recursive functions and pattern matching) which are not well-suited for efficient evaluation.