DBDM Project: mini SQL over CSV files
Table of Contents
The goal of this project is to implement a mini SQL engine that reads data from CSV files.
You can work either alone or with one person.
The project can be implemented in C++, Java, Python (2.7 or 3.6) or OCaml and should run either on Mac OS X, Ubuntu Linux (16.04 or 17.10) or Fedora Linux (25, 26 or 27). Whenever possible use "standard" build/dependency toolchains (CMake, Maven/Gradle, Pip, Opam+ocamlbuild).
It should be accompanied by a README file containing information on:
- the name of the student(s)
- the targeted runtime environment (and if necessary the packages to be installed)
- the build procedure
- any information about technical/design choices you made
The whole project is due for 2018-03-27 to Emmanuel Coquery (warning: there can be mail failures on 2018-03-28).
1 CSV MiniSQL Engine (8 points)
This part is due for 2018-03-14 by mail to Emmanuel Coquery.
1.1 Relational Algebra Engine Over CSV (4 points)
First, design a data structures to represent:
- Simple relational algebra operators and expressions (selection, project, cartesian product, relation (i.e. table), renaming, minus, union). You can use a simple language for selection conditions (e.g. boolean connectors and = and < comparisons, without computation of new values).
- Data to be processed (tuples, collections of tuples). You can also assume that all atomic data is of the same type (only strings or only numbers). Collections of tuples should be writable to CSV format (with a first header line)
Then, implement a simple evaluator for your relational algebra expressions:
- a relation is read from a csv file named after the relation (e.g.
R.csv
for a relation R) - cartesian product will use nested loops
- all operators will produce data in an eager way
Note that there are libraries in all proposed languages to read CSV data.
1.2 MiniSQL to Relational Algebra Compiler (4 point)
Write an MiniSQL to algebra compiler:
- MiniSQL grammar is given in the annexes
- You can start with a simple version where there is no nested queries.
The
IN
condition can be translated into a join by pushing theFROM
part of the nested query to theFROM
part of the englobing query and by:- merging the WHERE condition of the nested and the global query
- adding equalities between attributes of the
SELECT
of the nested query and attributes on the left part of theIN
keyword
Note that these operations may require a renaming of the relations of the subquery
If the
WHERE
condition is a conjunction of atomic conditions, theNOT IN
condition can be transformed into aMINUS
statement between the original (sub)query without theNOT IN
condition and a transformation of the original (sub)query where theNOT IN
condition is replaced by anIN
condition.If the original
WHERE
condition is more complex it can be transformed in Disjunctive Normal Form an then transformed into =UNION=s before applying the aforementioned transformation.
2 A Better Algebra Engine (6 points)
2.1 Collapsing algebra operations (2 points)
Implement 2 new operators that regroup existing ones:
- the ReadSelectProjectRename operator that reads a CSV file, filter lines and transform remaning tuples keeping only the proper attributes after renaming
- the JoinProjectRename operator that combine two relations in a θ-join and transform resulting tuples keeping only the proper attributes after renaming (the θ-join being a cartesian product followed by a selection)
Try to use algorithms that are more efficient (in cpu and/or memory) than what would habe been done using the simple relational algebra evaluator.
2.2 Simple algebra expression optimization (2+2 points)
First, write an optimizer that uses relational algebra equivalence rules to:
- push down selections
- add projection to discard unneeded attributes as soon as possible
- replace subexpressions using the two new introduced operators
Next, try to reorder cartesian products to minimize the number of cartesian products that are not immediately restricted by a selection (i.e. that don't correspond to θ-joins).
3 Extensions (6+ points)
You can implement some of the following extensions.
3.2 GROUP BY + Aggregation Functions (4+ points)
Add a GROUP BY
statement to MiniSQL as well as a some aggregation
functions (i.e. MIN
, MAX
, …) to MiniSQL.
Devise a relational algebra operator that correspond to a SELECT
... FROM Single_Relation GROUP BY ...
query and implement it.
Add the GROUP BY functionality to your engine.
You can make some restrictions on the positions where the GROUP BY
statement occurs.
You earn 2 additional points for managing subqueries with GROUP BY
that are related to the englobing query.
3.3 Advanced expressions on atomic values (2 points)
Manage two atomic data types in your MiniSQL: strings and numbers. Add some builtin functions as well as constants to represent and manipulate these values. Modify the renaming operator so it can also create new attributes computed using atomic value expressions.
3.4 Other ideas
- Better query optimizer using a cost model
- Factorizing joins over union/minus
- Various join implementations
- Computing meta-data on intermediate results for dynamically choose an appropriate join implementation
- Lazy evaluation (e.g. in the form of iterators) to reduce memory usage
- The possibility to use temporary files if an intermediate result needs to be materialized and doesn't fit in memory (either available memory or any limit you decide)
- Benchmarking various version of your query evaluator
- Any idea you have
4 Annexes
4.1 CSV file conventions
The following assumptions will be made on CSV files:
- the first line contains attribute names
- every row contains the same number of fields
4.2 MiniSQL Syntax
S ::= "SELECT" ATTS "FROM" RELS "WHERE" COND | "(" S ")" "MINUS" "(" S ")" | "(" S ")" "UNION" "(" S ")" ATTS ::= ATTD "," ATTS | ATTD ATTD ::= ATT | ATT "AS" ID ATT ::= ID "." ID ID ::= ... // matches [A-Za-z][A-Za-z0-9_]+ RELS ::= REL "," RELS | REL REL ::= FILENAME ID | "(" S ")" ID COND ::= AND_COND "OR" COND | AND_COND AND_COND ::= AT_COND "AND" AND_COND | AT_COND AT_COND ::= ATT "=" ATT | ATT "<" ATT | ATT "IN" "(" S ")" | ATT "NOT" "IN" "(" S ")" FILENAME ::= ... // matches \"[A-Za-z\-_0-9]+\.csv\"
4.3 Samples
Some sample files (csv and queries) are available in samples-2018.zip.