QR decomposition integration into DBMS

Size: px

Start display at page:

Download "QR decomposition integration into DBMS"

Candace Hunter
5 years ago
Views:

1 Department of Informatics, University of Zürich BSc Thesis QR decomposition integration into DBMS Dzmitry Katsiuba Matrikelnummer: July 2, 2017 supervised by Prof. Dr. M. Böhlen and Oksana Dolmatova

3 Acknowledgements Most of all, I would like to express my sincere gratitude to my supervisor Oksana Dolmatova for her patience, helpfulness and support. I am very thankful for the feedback and guidance she provided. Also, I would like to thank Prof. Dr. Michael Böhlen for giving me the opportunity to write my bachelor thesis at the Database Technology Group of the University of Zurich.

5 Abstract The demand to analyse data stored in DBMSs has increased significantly during the last few years. Since the analysis of scientific data is mostly based on statistical and linear algebra operations (e.g. vector multiplication, operations on matrices), the computation of the latter plays a big role in data processing. However, the current approach to deal with statistics is to export data from a DBMS to a math program, like R or MATLAB. This implies additional time and memory costs. At the same time the column-store approach has become popular and a number of hybrid or pure column-store systems, such as MonetDB, Apache Cassandra etc. are available. I investigate the benefits of incorporating a linear operation into a column-oriented DBMS. For this purpose, I integrate the QR decomposition in MonetDB, analyse the complexity of the implementation and empirically compare the performance with the existing R-solution (exporting the data with UDFs). The results of experimental evaluation show, that the embedded R solution works faster than the QR integration in MonetDB, when a virtualized environment used. On an unvirtualized host MonetDB has a significantly better performance, exceeding the results of R. The implemented Q QR function allows calculation on relations directly in the DBMS. It uses the SQL syntax, which makes the usage of the function easy and intuitive. The user doesn t require any additional skills, while writing of UDF functions for different sets of parameters means a certain programmer effort.

7 Zusammenfassung Die Nachfrage nach Datenanalyse der in Datenbanken gespeicherter Information ist in den letzten Jahren deutlich gewachsen. Die Auswertung der wissenschaftlichen Daten basiert meist auf statistischen und linearen Algebraoperationen (z.b. Vektorrechnungen, Matrizen-Operationen usw.). Dabei erhält die Möglichkeit, direkt in Datenbanken diese Operationen auszuführen, ein immer grösseres Gewicht. Die verbreitete Vorgehensweise besteht darin, Daten in ein mathematisches oder statistisches Programm, wie beispielsweise R oder MATLAB, zu exportieren. Das Problem dabei: Es verursacht zusätzlichen Zeit- und Speicheraufwand. Gleichzeitig werden spaltenorientierte Datenbanken immer populärer. Auch sind auf dem Markt reine oder hybride spaltenorientierte Datenbanken (wie MonetDB, Apache Cassandra) verfügbar. Meine Arbeit untersucht die Vorteile der Integration von linear algebraischen Operationen in einer spaltenorientierten Datenbank. Dazu integriere ich die QR-Zerlegung in MonetDB, analysiere die Komplexität der Implementierung und vergleiche empirisch ihre Performanz mit der existierenden R-Lösung (unter Anwendung von UDFs für den Datenexport). Die Ergebnisse meiner experimentellen Analyse zeigen, dass in einer virtuellen Umgebung die eingebettete R-Lösung eine grössere Schnelligkeit aufweist, als die QR-Integration in MonetDB. Anders verhält es sich in einer nicht virtuellen Umgebung: Da weist MonetDB eine signifikant bessere Performanz auf und überholt gar die R-Lösung. Die implementierte Q QR-Funktion ermöglicht die Berechnungen auf Relationen direkt in der Datenbank. Die dabei genutzte, übliche SQL-Syntax macht den Gebrauch dieser Funktion einfach und intuitiv. Dafür benötigt der Benutzer keine zusätzlichen Kenntnisse, während das Schreiben von UDF-Funktionen für verschiedene Parameterkombinationen einen zusätzlichen Programmieraufwand bedeutet.

9 Table of Contents 1 Introduction Motivation Problem definition Related Work QR Decomposition Matrix operations in a DBMS Task description 7 4 Approach MonetDB Implementation Symbol tree Relation tree Statement tree Translation to MAL-Plan Complexity Analysis Complexity of the Gram-Schmidt algorithm Complexity of the implementation Experimental evaluation Setup Test Case Selection and Metrics Data Set Results W ith ordering mode W ithout ordering mode Operating environment Change of complexity Optimization Summary and future work 43

10 x Table of Contents A Code 47 A.1 sql parser.y A.2 rel bin.c A.3 sql gencode.c B Relation-Plan 55 C MAL-Plan 57 C.1 MAL-Plan for with ordering mode C.2 MAL-Plan for without ordering mode D UDF 63 D.1 UDF for with ordering mode D.2 UDF for without ordering mode D.3 UDF without Q QR E Detailed test results 65 x

11 1 Introduction 1.1 Motivation The amount of data produced by scientists has increased significantly during the last decades. The demand for structured and practical approach to processing and storing these data is thus even higher. Database management systems (DBMSs) represent an efficient storage solution with integrated set of mathematical and statistical operations (e.g. minimum, maximum, average, count etc.). However, scientific data often require a more complicated evaluation and computations, such as linear algebra operations, which cannot always be processed directly in DBMSs. Math and statistics programs, like R or MATLAB, allow eluding the lack of statistical functionality in popular DBMSs. At the same time column-oriented DBMSs have attracted a lot of attention. These databases differ in the way information is stored, namely the attribute values belonging to the same column are stored separately, contiguously and densely packed, while the traditional row-oriented database systems store the entire records (rows) one by one [1]. The column-store approach has become popular and a number of hybrid or pure column-store systems, such as MonetDB, Kdb, VectorWise, Infobright, Exasol are now available. Column-oriented systems have some important advantages compared to rowstores, e.g. compression of the repeating values in columns, working with linear sets of values, especially when those sets are much larger than available RAM. Specifically due to these advantages column-oriented databases offer a great flexibility for analytical workloads. The goal of my thesis is to expand MonetDB functionality by integrating one of the most well-known linear algebra operations, the QR decomposition (QRD), and to investigate the benefits of incorporating a linear operation into a column-oriented DBMS. 1.2 Problem definition Many mathematical and statistical problems arising during scientific research are solved with the help of linear algebra operations, which require matrix and array operations. QRD is one of the popular operations. Calculations of eigenvalue algorithm, the QR algorithm, the linear least squares problem and many other problems are often solved

12 2 CHAPTER 1. INTRODUCTION using the QRD. Unfortunately, it doesn t belong to the standard set of database functions and additional steps for its execution are required. Moreover, the QRD, as any linear operation, is defined over matrices, but not relations. A relation should be adjusted in order to be suitable for the QRD. The current approach to perform this and other statistical calculation is to export data from a DBMS, perform the operation using a math tool and after that import the results back into the DBMS. This process implies additional time and memory costs. Moreover, it is a potential source of errors. The integration of the QRD in a column-oriented DBMS can be done by implementing a new function, which returns the matrix Q from the QRD (Q QR function). This approach allows doing the decomposition directly in the database, using existing relations, without additional export and import operations. It also saves resources and excludes possible non-calculation errors. 2

13 2 Related Work In this section I give an overview about QRD algorithms, describe DBMSs that allow working with linear algebra operations. I also mention the possibility of processing these operations on database relations. 2.1 QR Decomposition QR Decomposition (QRD) of a matrix, also known as the QR factorization, is a decomposition of matrix A into a product A = QR, where Q is an orthogonal matrix, and R is an upper triangular matrix. I assume, that the matrix A is of size mxn, where m is a number of rows, n is a number of columns and m n. QRD is used for solving several mathematical problems, such as: computation of matrix inversion, determination of the numerical rank of a given matrix, singular value decomposition, least squares problem, etc. Algorithm 1 Classical Gram-Schmidt algorithm [2] 1: for k:=1 to n do 2: for i:=1 to k-1 do 3: s := 0 4: for j:=1 to m do 5: s := s + a j,i a j,k 6: r i,k := s 7: for i:=1 to k-1 do 8: for j:=1 to m do 9: a j,k := a j,k a j,i r i,k 10: s := 0 11: for j:=1 to m do 12: s := s + a 2 j,k 13: r k,k := sqrt(s) 14: for j:=1 to m do 15: a j,k := a j,k /r k,k

14 4 CHAPTER 2. RELATED WORK There are two common algorithms for executing QR-decomposition: Householder transformations Gram-Schmidt algorithm (see Algorithm 1) These algorithms differ in terms of performance and possibility of parallelizing their execution. Transforming parts of the classical Gram-Schmidt algorithm into euclidean norm and dot product functions produces the vector-based version of the algorithm (see Algorithm 2). The algorithm normalizes the vectors in sequence, doing orthogonalization for the rest of the vectors after every normalization (reorthogonalizing them by already normalized vectors). Algorithm 2 Vector-based Gram-Schmidt algorithm 1: for k:=1 to n number of vectors do 2: r k,k := euclidean norm(a k ) normalization 3: q k := a k /r k,k 4: for j:=k+1 to n do 5: r := dot product(q k, a j ) orthogonalization 6: a j := a j r q k The reorthogonalization doubles the computational effort and for this reason the method of Householder is usually preferred. The advantage of Gram-Schmidt algorithm is that it is possible to perform the orthogonalization of one vector to as many other vectors as it is needed [2]. If, as according to the task (see Chapter 3), only the matrix Q is needed, then the modified Gram-Schmidt algorithm is much more efficient [9]. 2.2 Matrix operations in a DBMS Data-intensive research fields rely on the ability to efficiently process massive amounts of experimental data using database technologies. A DBMS works mostly with relations and not with arrays or matrices, as it is often necessary for scientific purposes. Outsourcing those computations to other programs represents a solution for this computation gap. The most DBMSs have only a limited set of algebra operations and functions, such as count, minimum, maximum, average etc. These are functions that are not dependent on the order of the values in the attributes. More complex operations that respect the order of attributes are not built-in and can only be performed using user-defined functions (UDF), embedded packages or by exporting the data into external solutions. One of the biggest relational database Oracle has an R Enterprise solution, which integrates R with Oracle Database [10]. MonetDB has an embedded R package, which also allows using the full functionality of R inside of MonetDB. However, this approach needs programming effort and involves writing of single UDF function for each set of input parameters (e.g by changing the number or data type of input columns, etc.). The 4

15 2.2. MATRIX OPERATIONS IN A DBMS 5 implemented Q QR function allows processing the linear algebra operation directly in DBMS, without exporting and re-importing of the data. To bridge the gap between the needs of the scientific world and the current DBMS technologies, the first SQL-based query language (SciQL) for scientific applications was introduced. SciQL s key innovation is the introduction of an array type, which works on the same level as a table. It uses both tables and arrays as first-class citizens and provides relational algebra-like operations on them [15]. The difference between these two types is their structure: tables are represented by a set of tuples, while an array is identified by attributes (dimensions) and their constraints. Combination of indeces (values of these attributes) denotes a cell, where a value of one non-dimensional attribute is stored. Thus, one tuple from a table is represented in arrays by a set of cell indices and the non-dimensional value in this cell. SciQL uses MonetDB as the target platform. Its column-store architecture corresponds well to the array representation, which significantly reduces the impedance mismatch between query language and array manipulation and, as a result, the development effort [16]. To address arrays in queries, the user applies the table syntax. That makes the transition from SQL to SciQL easier [6]. The implemented Q QR function allows working directly on relations, without paying attention to the array representation or learning a new query language. SciDB is an open-source analytical database that provides not only data management, but also complex analytics. In contrast to SciQL, SciDB was completely new developed. SciDB is oriented towards the data management needs of scientists, fulfilling their requirements regarding complex analytics and working with the large data sets [14]. As such it mixes statistical and linear algebra operations with data management operations (e.g. create, delete, update values, managing of constrains, etc.). SciDB takes natural nested multi-dimensional array data model as basis that eliminates the conversion between tables and arrays. For this goal a new functional (AFL) and a SQL-like query (AQL) languages were invented. SciDB provides an extensibility framework (for user-defined data types, functions, aggregates but also array operators). That makes it possible to apply highly specific algorithms in conjunction with some pre-processing handled by SciDB s built-in or user-defined array operators [13]. In contrast to SciDB for employing the new defined function in MonetDB, the user doesn t need to learn a new query language or deal with a new data model. The Q QR function runs directly on existing relations. 5

17 3 Task description In this chapter I describe the function that has to be implemented, including its structure, input parameters and constraints. I give a running example for the input relation, on which I show the defined parameters. Finally, using the running example, I define the expected results of the function. The goal of this thesis is to integrate the QR decomposition (QRD) into columnoriented DBMS by extending it with the new Q QR function. So that afterwards the following SQL query can be processed: SELECT F ROM Q QR (relation name ON on attributes ORDER BY order by attributes); (3.1) This query returns a relation with the same schema as the input relation and represents the Q matrix from the QRD. Since I want only the Q matrix as the result, QRD can be implemented according to the Gram-Schmidt Algorithm (see Chapter 2.1). Figure 3.1 illustrates the structure of the input relation. Descriptive part {}}{ Application part {}}{ o 1 o 2 o... o n d 1 d 2 d... d n a 1 a 2 a... a n } {{ } Order By part Figure 3.1: Structure of the input relation The application part (in (3.1) on attributes) includes the set of only numeric-domain attributes, which represents the matrix A for the QR decomposition (see Chapter 2.1). The relationship between the size of the application part and the whole number of attributes in the relation forms the ratio of the application part. Attributes, which are not in the application part, form the descriptive part of the input relation. The Order By part is a subset of descriptive part (in (3.1) order by attributes). It includes

18 8 CHAPTER 3. TASK DESCRIPTION the attributes, the tuples in the output relation and the corresponding matrix A for the QRD have to be sorted by. The attributes in Order By part can also be of non-numeric data types. The ratio of the Order By part is a relationship between its size and the number of attributes in the descriptive part. For the implementation I use MonetDB (Jun2016-SP2). The column-oriented architecture of MonetDB works with the attributes as arrays. It allows me to use the vector-based version of the algorithm (see Algorithm 2), which handles the values of one attribute together as one vector. The DBMS has to recognize ambiguous or not existing names of attributes and the input relation in the passed query, and, if applicable, it returns an appropriate error message. The input relation can be an existing relation or a result of an arbitrary select subquery. As a running example I take a relation r, which is displayed in Figure 3.2. Descriptive part {}}{ Application part {}}{ a b c x y z t t t t t t } {{ } Order By part Figure 3.2: Relation r An example SQL query for this relation is shown in Query (3.2) SELECT F ROM Q QR (r ON x, y, z ORDER BY a); (3.2) QRD is performed over 3 attributes (x, y, z) from the application part of r. After ordering the attributes by a, the relation r and the matrix A look as shown in Figure 3.3. a b c x y z t t t t t t A = Figure 3.3: Relation r ordered by attribute a and the corresponding matrix A 8

19 9 a b c x y z Figure 3.4: Results of the query (3.2) The Order By part in (3.2) is presented only by attribute a, the ratio of Order By part is thus 1/3, while the ratio of the application part is 50%. Figure 3.4 demonstrates what the results of the Q QR function applied on the relation r has to look like. 9

21 4 Approach In this chapter I give a description of the architecture and features of the used DBMS (MonetDB). Afterwards I describe the necessary implementation steps. 4.1 MonetDB MonetDB is a column-oriented database management system developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It is actively used in businesses such as health care, telecommunications as well as in sciences such as astronomy. Innovations at all layers of a DBMS such as storage model based on vertical fragmentation, a modern architecture of CPU-tuned query execution, different ways of optimizations, adaptive indexing and a modular software architecture allow MonetDB to achieve significant speed up compared to traditional designs [5]. It supports SQL:2003 standard on client side. Architecture MonetDB has a three-level software stack: SQL front-end: The top layer of the architecture provides the access point for the user. The task of front-end is to map the user s data to the MonetDB s internal structure (BAT) and to translate user queries to MonetDB Assembly Language, known as MAL. The user query is transformed from a SQL query to an internal relational algebra representation, which is then optimized using domain-specific rules and heuristics (e.g. reducing the size of intermediate results, by pushing selections down in the tree). This represents the strategic optimization of the parsed query [5]. At this level I need to extend the parser with the new syntax, and ensure that all internal transformations and the strategic optimizations are done. Tactical-optimizers: In the middle layer the resulting MAL-plan is going through a row of tactical optimizations. It is inspired more by the programming language than by classical database query optimization. Each optimizer, sprinkled with resource management or flow of control directives, takes a MAL plan and transforms it into a more efficient one. It provides facilities ranging from symbolic processing up to just-in-time

22 12 CHAPTER 4. APPROACH r a b c x y z OID a OID b OID c OID x OID y OID z o 1 7 o 1 8 o 1 3 o 1 0 o 1 2 o o 2 1 o 2 2 o 2 0 o 2 3 o 2 4 o = o 3 4 o 3 1 o 3 8 o 3 2 o 3 1 o o 4 2 o 4 4 o 4 1 o 4 8 o 4 9 o o 5 9 o 5 0 o 5 8 o 5 2 o 5 9 o o 6 4 o 6 5 o 6 2 o 6 1 o 6 0 o 6 1 } {{ } BAT s Figure 4.1: Internal (BAT) representation of relation r data distribution [5]. Columnar abstract-machine kernel: The bottom layer of MonetDB stores each attribute of a relation as columns (in fact as arrays) known as BAT. It includes the library of highly optimized implementations of the binary relational algebra operators, stored in corresponding BAT modules. Those operators have access to the whole metadata of BATs, it allows them to perform operational optimization. The interface is described in the MAL module sections. For the implementation of the task I can reuse already existing BAT algebra modules. Nothing additional needs to be done at this level for the purpose of optimization. BAT MonetDB stores each attribute of a relation in a separate table, called a BAT (Binary Association Table). Each BAT consist of two columns: the head column, where the object-identifiers (OID) are stored, and the tail column, where the actual attribute values are stored. Due to this approach every relation is internally represented as a collection of BATs. Physically BATs are represented as consecutive C arrays. They can be up to hundreds of megabytes. The operating system swaps them into memory and compresses on the disk upon need [3]. The example relation r consists of 6 attributes, for these attributes 6 BATs exist. Each BAT stores the respective attribute as (OID, value) pairs. The internal representation of r is illustrated in Figure 4.1. OIDs (o 1, o 2,...,o 6 ) are generated from the system and identify the tuple the value belongs to. The values from the same tuple are assigned the same OID (e.g. in 6 BATs that represent relation r values 7, 8, 3, 0, 2, 5 at the very top of each BAT have the same OID o 1. That means, that all of them belong to the same tuple t 1 ). The head OID column is not materialised, but rather implicitly given by the position of the value, so all values of one tuple will be on the same position in their respective column representation. The position, in turn, is determined by the inserting order of tuples [5]. Relational algebra plans are translated into simple BAT algebra operations and com- 12

23 4.1. MONETDB 13 piled to MAL programs. BAT algebra operators perform simple operations on an entire column of values ( bulk processing ). MAL MonetDB Assembly Language (MAL) is the language used for the abstract machine of the MonetDB kernel. It is the target language for front-end query compilers. Every SQL query generates an execution plan, consisting of simple MAL instructions. These instructions represent all actions, which are necessary in order to provide and deliver the results (e.g. actions to ensure binding data and transaction control, the instructions to produce the results, and the administration steps for preparing and delivering the resulting relation to the front-end). BAT algebra operators correspond to simple MAL instructions, thus building its core. MAL instructions have only BATs as input, on which corresponding simple BAT operations are performed. Complex operators can be broken down into a sequence of BAT algebra operators, which in their turn are mapped to simple operations on arrays. This approach allows avoiding additional expression interpretations. An example of such implementation of a select operator in MonetDB is shown in Figure 4.2 [5]. The select value (V) is compared to the values, stored in the tail part of the input BAT (B), which in its turn is stored as a C-Array. If the value is found, its OID is stored in the tail part of the resulting BAT (R). for ( i=j=0 ; i<n ; i++ ) select (B:bat:[:oid:int], V:int) if ( B.tail[i] == V) R.tail[j++] = i; Figure 4.2: Implementation of select operation The for-loops containing no external function calls provide high instruction locality. That enables better compiler optimization and leads to reduction in the number of instruction cache misses. MonetDB provides an internal function to show the MAL-Plan of a passed query. The operations in it are executed one at a time. That means, that subsequent operations are invoked only after the operation completed the calculation over its entire input data. The idea of BAT operations is always to have a materialized result after its execution. Query Processing Symbol tree: A query passed into the client is parsed in order to detect the specified keywords (tokens). The tokens help the Yacc parser generator to generate a shift-reduce parser. The input to Yacc is a grammar (sql parser.y) with snippets of C++ code ( actions ) attached to rules of the grammar. As soon as the rule is recognised, the parser 13

24 14 CHAPTER 4. APPROACH calls the code snippets associated with this rule. During the execution of these snippets nodes of different types are built and connected together in a symbol tree. Symbol tree represents only the structure of parsed query. Relation tree: The resulting symbol tree is used to build a relation tree. The relation tree consists of nodes, representing a single operation on input tables, containing necessary information for this operation. The nodes of the symbol tree are parsed, analysed, grouped together and converted into corresponding relation tree node, depending on tokens in them. Each node in the relation tree may have up to two children. Afterwards the strategic optimization is carried out. Statement tree: A statement tree is a special feature of MonetDB. It is an intermediate step between a relation tree and a low-level execution plan (MAL-Plan). Comparing to the resulting relation tree, which nodes represent operations on or between relations, each statement in the statement tree represents the sequence of operations on attributes and returns one BAT as a result. Each operation is represented by a MAL-expression that in its turn corresponds to simple BAT algebra operations. Putting it all together, MonetDB provides a good extensibility framework. That allows to extend its functionality, SQL syntax and to make changes at all levels of architecture. The BAT structure using the OID guarantees, that the values from the same tuple belong together, despite the order of the tuples in BATs. This mechanism allows algebra and statistical operations, for which the order and the relation between the values are essential. Breaking down the operations to the simple operations on arrays and the set of existing BAT algebra operations provide a good basis for implementation of sophisticated and complex calculations. 14

25 4.2. IMPLEMENTATION Implementation In this subchapter I describe the changes in MonetDB, which has to be done in order to implement the task query. I also show the examples of trees built by MonetDB during the processing of Q QR function, using the example SQL query (3.2) Symbol tree To parse the task query I expand the existing parser grammar by a new token Q QR. For processing the new token I define additional rules and snippets of code for these rules. The extended part of the grammar can be found in Appendix A.1. Figure 4.3 illustrates the symbol tree, which is produced after parsing and executing the corresponding code. SELECT * FROM Q QR (r ON x, y, z ORDER BY a); SelectNode F ROM Q QR r Application part ORDER BY x y z a Figure 4.3: Symbol tree for the query (3.2) The symbols are build according to the tokens recognized in the passed query. SelectN ode is a route symbol-node for all SELECT-queries. It is connected to all symbols containing the information, which is necessary to produce the results from the input relation and to deliver the needed result relation (e.g. symbols like W HERE, GROUP BY, ORDER BY, HAV ING etc.). The query (3.2) produces the SelectN ode connected only to one symbol namely one with the information about the source relation (F ROM). The new Q QR token also produces a symbol, connected to the symbols with input parameters: relation r, Application part and ORDER BY. The last two parameters are lists, containing the COLUMN-symbols. The Application part contains the columns (x, y, z) for the matrix used during the QRD and ORDER BY contains the column (a) for determining the order of values in the resulting relation. The information in COLU M N-symbols must correspond with the attributes in the input relation r. The result of the Q QR-symbol represents the only source relation for the F ROM-symbol. 15

26 16 CHAPTER 4. APPROACH Relation tree To implement the task I add a new function that deals with the new Q QR-symbol in the symbol tree. This function uses the Q QR-symbol as well as the related symbols with corresponding information to build a proper relation tree node for the Q QR function. The resulting relation tree has tree nodes, as shown in Figure 4.4. SELECT * FROM Q QR (r ON x, y, z ORDER BY a); π a,b,c,x,y,z q qr a x,y,z r Figure 4.4: Relation tree for the query (3.2) After conversion of symbol tree nodes into relation tree nodes the information about input relation r is stored in a separate node. For storing the lists of columns for QRD (x, y, z) and ordering (a) I define a new type of a relation tree node (q qr relation). The resulting relation tree corresponds to the relation plan, produced by the internal MonetDB function (see Appendix B) Statement tree To fulfil the translation of the new q qr relation into MAL-Plan and make the calculation of the QRD possible, I need to extend the existing translation function. It takes the information from the q qr relation and, following the steps of the vector-based Gram- Schmidt Algorithm, translates it in a sequence of statements. The pseudocode of the translation is shown in Algorithm 3. The extended part of the function can be found in Appendix A.2. The resulting sequence of statements can be also presented in a tree form. Figure 4.5 displays such a statement tree for the q qr relation of the query (3.2). It shows both the sequence and the origin of the data required for each particular statement. Each statement tree node is a single BAT, representing the results of the done statement calculations. At the beginning I use the information from the q qr relation in order to create two lists of BATs for the descriptive (a, b, c) and the application parts (x, y, z) of the input relation. Using the Order By part (a), I determine the order of tuples in the output relation. For this purpose I reuse the operations of the existing statements for ordering and reordering of data in BATs. These statements are translated without changes into 16

27 4.2. IMPLEMENTATION 17 Algorithm 3 Q QR 1: procedure Q QR(relation, on attrs, order by attrs) 2: attrs attributes of relation 3: descriptive part attrs on attrs 4: if order by attrs then 5: order by order of order by attrs 6: for descriptive attr in descriptive part do 7: sort descriptive attr by order by 8: add descriptive attr to list output 9: for on attr in on attrs do 10: sort on attr by order by 11: add on attr to on attrs sorted 12: rest attrs on attrs sorted 13: for on attr in on attrs sorted do 14: normalize on attr 15: add on attr to list output 16: rest attrs rest attrs on attr 17: for rest attr in rest attrs do 18: orthogonalize rest attr to on attr return list output low-level MAL-plan and represented through existing BAT algebra operations. The ordering statement takes the BAT, by which it must be sorted, as input and returns a BAT containing the determined order of tuples (O list order attributes in Figure 4.5 O a ). After that the attributes in the descriptive and the application parts are ordered by this BAT. As a result a new ordered BAT for each of the attributes is produced (O order by attrs (on attr) in Figure 4.5 O a (a), O a (b),..., O a (z)). The ordered results of the descriptive part are directly appended to the output list. The ordered application part is stored as an intermediate list that is used for the Q QR calculation afterwards. Q QR calculation is defined by a sequence of normalization and orthogonalization operations on attributes. The operations of the corresponding statements and their results are not implemented in MonetDB and have to be defined. Detailed description of how these two operations work and how the resulting BATs (N orm(attribute) and Orth attribute2 (attribute 1 ), e.g. in Figure 4.5 Norm(x) and Orth x (y) or Orth y (z) respectively) are produced is given in the Chapter I append the BAT, representing the normalized attribute, to the output list. After all operations have been done and all attributes have been appended, the output list is passed as an argument to the next node of the relation tree. The projection of the results to the user is processed by existing statements, and doesn t require changes at this step. The MAL-Plan constructed for the query (3.2) is shown in the Appendix C.1. 17

28 18 CHAPTER 4. APPROACH SELECT * FROM Q QR (r ON x, y, z ORDER BY a); a a b a c a x a y a z a Output Norm(z) Orth y(z) Norm(y) Q QR Orth x(y) Orth x(z) Norm(x) x a y a z a O a(a) O a(b) O a(c) O a(x) O a(y) O a(z) Ordering O a a b c x y z a b c x y z }{{} Order By part } {{ } Descriptive part } {{ } Application part Input Translation to MAL-Plan Figure 4.5: Statement tree for the query (3.2) Two additional statements, required for the implementation of Gram-Schmidt algorithm and used by me for the translation of the q qr relation into low-level MAL instructions, have to be implemented. These statements execute the operations of normalization and orthogonalization on attributes (see Algorithm 2). To get the results of these two statements, I use BAT operations from existing BAT modules, such as: algebra calculations (multiplication, exponentiation, division, subtraction) aggregate functions (sum) mathematical functions (square root). Both statements run over the ordered application part of the relation that is stored after the ordering operation in an intermediate list. 18

29 4.2. IMPLEMENTATION 19 Normalization: Normalization statement has one BAT as input, which has to be normalized. First the euclidean norm for the attribute is calculated. The calculation is performed by multiplying the corresponding BAT of the attribute by itself. The values in the resulting BAT are summed and the square root of the sum is taken. Each of the aggregate (SUM()) and mathematical (SQRT ()) functions returns a single number, stored as a separate temporal BAT. This BAT is used in subsequent calculations later. Division of each attribute value by its euclidean norm delivers the normalized form of the attribute. It is stored as a new BAT (Norm(x)) that is also appended to the output list of attributes. Figure 4.6 illustrates the processes of normalization statement, on the example of the attribute x from the query (3.2). List of ordered application part OID Oa(x) o 2 3 o 4 8 o 3 2 o 6 1 o 1 0 o 5 2 OID Oa(y) o 2 4 o 4 9 o 3 1 o 6 0 o 1 2 o 5 9 OID Oa(z) o 2 5 o 4 6 o 3 2 o 6 1 o 1 5 o 5 3 OID Oa(x) o 2 3 o 4 8 o 3 2 o 6 1 o 1 0 o 5 2 OID Oa(x) o 2 3 o 4 8 o 3 2 o 6 1 o 1 0 o 5 2 = OID temp1 o 2 9 o 4 64 o 3 4 o 6 1 o 1 0 o 5 4 Euclidean norm SUM( OID temp1 o 2 9 o 4 64 o 3 4 o 6 1 o 1 0 o 5 4 ) = OID temp2 o t2 82 SQRT ( OID temp2 o t2 82 ) = OID temp3 o t3 9,05539 Normalization operation OID Oa(x) o 2 3 o 4 8 o 3 2 o 6 1 o 1 0 o 5 2 / OID temp3 o t3 9,05539 = OID Norm(x) o 2 0,33129 o 4 0,88345 o 3 0,22086 o 6 0,11043 o 1 0 o 5 0,22086 Figure 4.6: Processes of normalization statement, on the example of the attribute x from the query (3.2) Orthogonalization: After each operation of normalization the rest of non-normalized attributes in the application part of the relation has to be orthogonolized to the last normalized vector. For this purpose the statement Orth attribute2 (attribute 1 ) takes two 19

30 20 CHAPTER 4. APPROACH attributes as an input: the attribute that has to be orthogonalized (attribute 1 ) and the attribute, which it has to be orthogonalized to (attribute 2 ). An example of orthogonalization procedure for the attribute y (orthogonalized to attribute x) is displayed in Figure 4.7. List of ordered application part OID Norm(x) o 2 0,33129 o 4 0,88345 o 3 0,22086 o 6 0,11043 o 1 0 o 5 0,22086 OID Oa(y) o 2 4 o 4 9 o 3 1 o 6 0 o 1 2 o 5 9 OID Oa(z) o 2 5 o 4 6 o 3 2 o 6 1 o 1 5 o 5 3 Dot product OID Norm(x) o 2 0,33129 o 4 0,88345 o 3 0,22086 o 6 0,11043 o 1 0 o 5 0,22086 OID Oa(y) o 2 4 o 4 9 o 3 1 o 6 0 o 1 2 o 5 9 = OID temp1 o 2 1,32518 o 4 7,95107 o 3 0,22086 o 6 0 o 1 0 o 5 1,98777 SUM( OID temp1 o 2 1,32518 o 4 7,95107 o 3 0,22086 o 6 0 o 1 0 o 5 1,98777 ) = OID temp2 o t2 11,48488 Orthogonalization process OID Norm(x) o 2 0,33129 o 4 0,88345 o 3 0,22086 o 6 0,11043 o 1 0 o 5 0,22086 OID o t2 temp2 11,48488 = OID Oa(y) o 2 4 o 4 9 o 3 1 o 6 0 o 1 2 o 5 9 OID temp3 o 2 3,80488 o 4 10,14634 o 3 2,53659 o 6 1,26829 o 1 0 o 5 2,53659 OID temp3 o 2 3,80488 o 4 10,14634 o 3 2,53659 o 6 1,26829 o 1 0 o 5 2,53659 = OID Orthx(y) o 2 0,19512 o 4-1,14634 o 3-1,53659 o 6-1,26829 o 1 2 o 5 6,46341 Figure 4.7: Processes of orthogonalization statement, on the example of the attribute y from the query (3.2) (orthogonalized to attribute x) After the orthogonalization of the non-normalized attributes is completed, as shown in Figure 4.8, the next attribute can be normalized and the orthogonalization procedure iterated until there is only one attribute left. After the normalization of the last attribute, the latter is appended to the output list, where it forms the output relation, together with the attributes from the descriptive part, as shown in Figure 3.4. One of the advantages of the application of Gram-Schmidt algorithm for the matrix 20

31 4.2. IMPLEMENTATION 21 OID Norm(x) o 2 0,33129 o 4 0,88345 o 3 0,22086 o 6 0,11043 o 1 0 o 5 0,22086 OID Orthx(y) o 2 0,19512 o 4-1,14634 o 3-1,53659 o 6-1,26829 o 1 2 o 5 6,46341 OID Orthx(z) o 2 2,29268 o 4-1,21951 o 3 0,19512 o 6 0,09756 o 1 5 o 5 1,19512 Figure 4.8: Stand of BATs after all attributes from the query (3.2) are orthogonalized to attribute x Q calculation is the possibility of parallelizing its execution. Since there are no dependencies between the attributes and no overwriting operations of them during the orthogonalization, the latter can be done concurrently. The described approach represents the processes and the order of their executions according to the used algorithm. However, during the low-level MAL-optimization of MonetDB, these processes are re-organized, so that the entire set of operations (the required number of orthogonalizations and the subsequent normalization) are performed consequently for each attribute. This is reflected in the produced MAL-Plan for Q QR function (see Appendix C). Such optimization helps to reduce the number of cache misses, since more information needed for all optimizations and the normalization is stored in the cache. That reduces the number of requests to the slow main memory and improves the runtime of the entire Q QR function. The code for the extended statements is shown in Appendix A.3. 21

33 5 Complexity Analysis In this chapter the complexity of the original Gram-Schmidt algorithm and the implementation are estimated and compared. 5.1 Complexity of the Gram-Schmidt algorithm The vector-based Gram-Schmidt algorithm, used for the implementation, comprises operations of normalization and orthogonalization. Using a matrix of size mxn, where m is the number of tuples and n is the number of attributes, the numbers of executions of the algorithm operations and their suboperations can be analyzed. Table 5.1 lists the estimated numbers of these operations. The normalization is executed only once for each vector, while orthogonalization is to be done for the rest (all non-normalized) vectors after every normalization. As a result, the execution number of orthogonalization equals to (n 1)n/2. Normalization Orthogonalization Operation Euclidian norm(e n ) Devision by e n Dot product Subtraction Sub -operation Number of executions n 2 n(m) + n(m-1) n(1) n / n(m) (n-1)n/2 ((n-1)n/2)(m) + ((n-1)n/2)(m-1) ((n-1)n/2) / ((n-1)n/2)(m) ((n-1)n/2)(m) Table 5.1: Operations in vector-based Gram-Schmidt algorithm and number of their executions

34 24 CHAPTER 5. COMPLEXITY ANALYSIS The total number of calculated QRD operations is summarized in Equation (n 1)n m + 2 (n 1)n (m 1) + 2 nm + n(m 1) + 1n + nm+ (n 1)n m + 2 (n 1)n m 2 (5.1) mn + 2mn n n (5.2) According to the simplified version of Equation (5.2), the algorithm has the total complexity of O(mn 2 ). 5.2 Complexity of the implementation According to the task, the output relation of the implemented Q QR function is ordered by the defined set of attributes. It means that the complexity of the Gram-Schmidt algorithm has to be extended by the complexity of additional operations of lists building, determination of order and ordering itself. In the first step I build the lists of attributes for the application and descriptive parts of the relation, which I need to define the matrix for the QRD. For this purpose I iterate through all attributes of the input relation, what has a complexity of O(n 2 ). The order of the attributes is to be determined according to the order by attributes passed in the query. If only one attribute is passed, the complexity is at least O(m log m). If there are more than one attribute in the order by attributes, the following attributes are only used in case, if the first order by attribute has at least two tuples with exactly same values. For ordering those tuples additional information is needed. That means, that the complexity of such additional ordering equals to O(k log k), where k is the number of tuples with same values and k m. The worst case of ordering is when all tuples of an order by attribute have the same value (k = m). The complexity of ordering in this case is at least O(sm log m), where s is the number of attributes with the same values. After the order is determined, all attributes in the relation have to be sorted according to it. This results in the ordering complexity of O(m log m n). The total complexity of the Q QR function is summarized in Equation(5.3). O(n 2 ) + O(m log m) + O(m log m n) + O(mn 2 ) (5.3) For the met assumption, that m n, the total complexity of the Q QR function equals to O(mn 2 ). If it is assumed, that m n, the effort for ordering overtakes the cost of the Q QR calculations, what leads to the resulting complexity of O(m log m n). 24

35 6 Experimental evaluation In this chapter I describe the experimental evaluation. It includes the explanation of the setup, test cases, data set, the results of the experiments and their discussion. The goal of the experimental evaluation is to analyse the complete runtime of the solution and to compare the performance of MonetDB with the existing R-solution empirically. Additionally to the complete runtime I analyse time spent on building the execution plan and for Q QR execution itself. 6.1 Setup The experiments are carried out on an Ubuntu (16.04 LTS) virtual machine having 5,5 GB RAM. The virtual machine (5.1.14) was running on MacOS Sierra ( ) having 2.70 GHz Intel Core i5-5257u CPU and 8 GB 1867 MHz DDR3 memory. To check the environmental dependency of MonetDB I use a server with 2 x Intel(R) Xeon(R), CPU E GH, 64GB main memory, hard disks 4x320GB, SATA, 15000rpm and OS: CentOS Test Case Selection and Metrics The implemented Q QR function has only one input relation and three additional parameters, which influence the execution performance and can be manipulated during the evaluation (see Figure 3.1): 1. number of tuples in the input relation (m) 2. number of attributes in the application part (n a ) 3. number of attributes in the descriptive part (n d ) 4. ratio of the Order By part in the descriptive part (ratio) The distribution of values is irrelevant. To get a different size of the Order By part of the relation I combine different Order By ratios with the size of the descriptive part.

36 26 CHAPTER 6. EXPERIMENTAL EVALUATION The input relation itself is an existing relation or a result of a passed sub-query. Since the processing of sub-query is not influenced during the implementation and it is not directly connected with the performance of the Q QR function, I refrain from manipulating this parameter. Experimental evaluation consists of two separate experiments: MonetDB vs. R: It is a comparison between the implementation in MonetDB and the existing R-solution. For this experiment, the embedded R package for MonetDB is used and the data is exported to R via UDF (see Appendix D). Internal: It provides the detailed information about the time consumption of the implementation function in MonetDB Detailed information on chosen values of input parameters for these experiments is presented in Table 6.1. Parameter Values m , , n a 20, 40, 60, 80, 100 n d 1 1, 10, 50 2 ratio 100%, 30%, 60%, 90% 3 Table 6.1: Values of the input parameters for experimental evaluation A combination of all four parameters represents a test case to be performed. For each of them a complete runtime is measured using the internal MonetDB timing function. For the Internal evaluation I record the time spent on creation of execution plan (preparation time), additionally to the complete runtime. The combination of complete runtime and preparation time provides the information about execution time. In order to evaluate time spent on ordering the relation, I run every defined test case in two modes (with and without ordering). For the without ordering mode I optimize the Q QR function so that the order is not determined and no ordering of attributes is performed. For each test case the difference between runtimes in two modes provides the information about time spent on ordering all attributes in the relation. 6.3 Data Set Every single test case has 3 runs that are performed one by one. The mean value is calculated based on the measurement results of each test case run. For each test case a new relation with m tuples and n attributes (n = n a + n d ) is generated. MonetDB and R solution use the same relation by each run. 1 Descriptive part with 1 attribute has always ratio of 100% 2 In the without ordering mode the descriptive part consists of only 1 attribute 3 In the without ordering mode the ratio is for all test cases 100% 26

37 6.3. DATA SET 27 By executing the same query more than one time, the execution plan is created only once and stored in the cache. To avoid its reusing for measuring the preparation time 3 runs are done on different relations. Test cases, where the Order By ratio parameter is manipulated, are performed on the same relation. Queries for creating and filling tables are generated using a python script program that takes random integers between 0 and

38 28 CHAPTER 6. EXPERIMENTAL EVALUATION 6.4 Results In this section I present the test results for both experiments. At first I present the results of the tests in with ordering mode, which are followed by the results of the without ordering mode. The measurements are evaluated and comparison and evaluation of the results is given. Detailed test results are listed in Appendix E W ith ordering mode In this part of the experimental evaluation I test the implemented solution in with ordering mode. The order of data in a relation can be essential for some statistical and linear algebra calculations, so the resulting relation is sorted by its Order By part. MonetDB vs. R The first set of test cases compares the performance of MonetDB with existing R- solution. That is done using UDF and embedded R package in MonetDB. The results of test cases with ordering by one attribute are illustrated in Figure 6.1. It is recognisable, that for R the number of tuples doesn t play a significant role, while for MonetDB it is a decisive parameter. Runtime(sec) MonetDB 100k tuples MonetDB 1M tuples R 100k tuples R 1M tuples Number of attributes in the application part Figure 6.1: Complete runtime in the MonetDB vs. R experiment (ordered by 1 attribute) The performance of R is over 4 times better than the performance of MonetDB for all combinations of the number of tuples and the size of the application part. And the more attributes in the application part are taken, the bigger is the absolute performance difference. By increasing the number of attributes in the application part and using the high number of tuples in the relation, the complete runtime in MonetDB changes 28

Optimization of Mixed Queries in MonetDB System

Department of Informatics, University of Zürich BSc Thesis Optimization of Mixed Queries in MonetDB System Alphonse Mariyagnanaseelan Matrikelnummer: 15-712-698 Email: alphonse.mariyagnanaseelan@uzh.ch