Database Systems: Lecture Notes

6.830/6.814 — Notes∗ for Lecture 1: Introduction to Database Systems Carlo A. Curino

September 10, 2010

2

Introduction

READING MATERIAL: Ramakrishnan and Gehrke Chapter 1

What is a database? A database is a collection of structured data. A database captures an abstract representation of the domain of an application.

• Typically organized as "records" (traditionally, large numbers, on disk)
• and relationships between records

This class is about database management systems (DBMS): systems for creating, manipulating, and accessing a database. A DBMS is a (usually complex) piece of software that sits in front of a collection of data and mediates applications' accesses to the data, guaranteeing many properties about the data and the accesses.

Why should you care? There are lots of applications that we don't offer classes on at MIT. Why are databases any different?

[Figure 1: two applications (APP1, APP2) access the DBMS ("a system to create, manipulate, access databases (mediate access to the data)"), which sits in front of the DB ("a collection of structured data").]

Figure 1: What is a database management system?

• Ubiquity (anywhere from your smartphone to Wikipedia)
• Real world impact: software market (roughly the same size as the OS market, roughly $20B/y). Web sites, big companies, and scientific projects all manage both day-to-day operations as well as business intelligence + data mining.
• You need to know about databases if you want to be happy!

The goal of a DBMS is to simplify the storing and accessing of data. To this purpose DBMSs provide facilities that serve the most common operations performed on data. The database community has devoted significant effort to formalizing a few key concepts that most applications exploit to manipulate data. This provides a formal ground for us to discuss the application requirements on data storage and access, and to compare ways for the DBMS to meet such requirements. This will provide you with powerful conceptual tools that go beyond the specific topics we tackle in this class, and are of general use for any application that needs to deal with data. Now we proceed with an example, showing how hard it is to do things without a DB; later we will introduce formal DB concepts and show how much easier things are using a DB.

3

Mafia Example

Today we cover the user perspective, trying to detail the many reasons we want to use a DBMS rather than organizing and accessing data directly, for example as files. Let us assume I am a Mafia Boss (Note: despite the accent this is not the case, but only hypothetical!) and I want to organize my group of "picciotti" (Sicilian for the criminals/bad guys working for the boss, a.k.a. the soldiers, see Figure 2) to achieve more efficiency in all our operations. I will also need a lot of book-keeping, security/privacy etc.. Note that my organization is very

large, so there is quite a bit going on at any moment (i.e., many people accessing the database to record or read information).

[Figure 2: Mafia hierarchy: the Boss at the top, advised by a Consigliere; below him the Underboss; then several Caporegimes, each commanding Soldiers; Associates at the bottom. Image by MIT OpenCourseWare.]

I need to store information about:

• people that work for me (soldiers, caporegime, etc..)
• organizations I do business with (police, 'Ndrangheta, politicians)
• completed and open operations:
  – protection rackets
  – arms trafficking
  – drug trafficking
  – loan sharking
  – control of contracting/politics
  – I need to avoid that any of my men is involved in burglary, mugging, kidnapping (too much police attention)
  – cover-up operations/businesses
  – money laundering and funds tracking
• assignment of soldiers to operations
• etc...

I will need to share some of this information with external organizations I work with, protecting some of the information. Therefore I need:

• the boss, underboss and consigliere should be able to access all the data and do any kind of operation (assign soldiers to operations, create or shut down operations, pay cops, check the total state of money movements, etc...)
• the accountants (20 of them) need access to perform money book-keeping (track money laundering operations, move money from bank to bank, report bribing expenses)

• the soldiers (5000) need to report daily misdeeds in a daily-log, and report money expenses and collections
• the semi-public interface accessible by other bosses I collaborate with (search for cops on our books, check areas we already cover, etc..)

[Figure 3: What data to store in my Mafia database: person (name, nickname, phone), log (log_id, author, title, summary), operation (name, desc, $$, coverup-name), organization (name, boss, rank), accounts (account-number, false-identity, balance), and the relationships involve and collaboration_with.]

3.1

An offer you cannot refuse

I make you an offer you cannot refuse: "you are hired to create my Mafia Information System; if you get it right you will have money, sexy cars, and a great life. If you get it wrong... well, you don't want to get it wrong". As a first attempt, you think about just using a file system:

1. What to represent: what are the key entities in the real world I need to represent? How many details?
2. How to store data: maybe we can use just files: people.txt, organizations.txt, operations.txt, money.txt, daily-log.txt. Each file contains a textual representation of the information, with one item per line.
3. Control access credentials at low granularity: accountants should know about money movements, but not the names and addresses of our soldiers. Soldiers should know about operations, but not access money information.
4. How to access data: we could write a separate procedural program opening one or more files, scanning through them, and reading/writing information in them.
5. Access patterns and performance: how to find the shop we haven't collected money from for the longest time (and at least 1 month)? Scan the huge operation file, sort by time, pick the oldest, measure time? (This needs to be timely or they will stop paying, and this gets the boss mad... you surely don't want that; and make sure no one else is accessing the file right now.) "Tony Schifezza" is a mole; we need to find all the operations and people he was involved in or knew about and shut them down... quick... like REAL quick!!!
6. Atomicity: when an accountant moves money from one place to another you need to guarantee that either the money is removed from account A and added to account B, or nothing at all happens... (You do not want to have money vanishing, unless you plan to vanish too!) A sketch of this failure mode follows at the end of this section.
7. Consistency: guarantee that the data are always in a valid state (e.g., there are no two operations with the same name).
8. Isolation: multiple soldiers need to add to daily-log.txt at the same time (the risk is that they override each other's work, and someone gets "fired" because not productive!!).
9. Durability: in case of a computer crash we need to make sure we don't lose any data, nor that data get scrambled (e.g., if the system says the payment of a cop went through, we must guarantee that after reboot the operation will be present in the system and completed. The risk is the police taking down our operation!).

Using the file system, you realize that most probably you will fail, and that can be very dangerous... Luckily you are enrolled in 6.830/6.814 and you just learned that: Databases address all of these issues!! You might have a chance! In fact, you might notice that the issues listed above are already related to the three concepts we mentioned before: 1-3 are problems related to the Data Model, 4-5 are problems related to the Query Language, and 6-9 are problems related to Transactions. So let's try to do the same with a "database" and get the boss what he needs.

3.2

More on fundamental concepts

Databases are a microcosm of computer science; their study covers: languages, theory, operating systems, concurrent programming, user interfaces, optimization, algorithms, artificial intelligence, system design, parallel and distributed systems, statistical techniques, dynamic programming. Some of the key concepts we will investigate are:

Representing Data. We need a consistent, structured way to represent data; this is important for consistency, sharing, and efficiency of access. From database theory we have the right concepts.

• Data Model: a set of constructs (or a paradigm) to describe the organization of data. For example tables (or more precisely relations), but we could also choose graphs, hierarchies, objects, triples, etc..

• Conceptual/Logical Schema: a description of a particular collection of data, using a given data model (e.g., the schema of our Mafia database).
• Physical Schema: the physical organization of the data (e.g., data and index files on disk for our Mafia database).

Declarative Querying and Query Processing. A high-level (typically declarative) language to describe operations on data (e.g., queries, updates). The goal is to guarantee data independence (logical and physical), by separating "what" you want to do with data from "how" to achieve that (more later).

• High level language for accessing data
• "Data Independence" (logical and physical)
• Optimization techniques for efficiently accessing data

Transactions.

• a way to group actions that must happen atomically (all or nothing)
• guarantees to move the DB content from one consistent state to another
• isolated from the parallel execution of other actions/transactions
• recoverable in case of failure (e.g., power goes out)

This provides the application with guarantees about a group of actions even in the presence of concurrency and failures. A transaction is a unit of access and manipulation of data, and it significantly simplifies the work of application developers. This course covers these concepts, and goes deep into the investigation of how modern DBMSs are designed to achieve all that. We will not cover the more artificial-intelligence / statistical / mining related areas that are also part of database research. Instead, we will explore some of the recent advanced topics in database research; see the class schedule to get an idea of the topics.

3.3

Back to our Mafia database

What features of our organization shall we store? How do we want to capture them? Choose a level of abstraction and describe only the relevant details (e.g., I don't care about the favorite movies of my soldiers, but I need to store their phone numbers). Let's focus on a subset:

• each person has a real name, nickname, phone number
• each operation has a name, description, economic value, cover-up name
• info about the persons involved in an operation and their role


We could represent this data according to many different data models:

• hierarchies
• objects
• graphs
• triples
• etc..

Let's try using an XML hierarchical file:





Operations are duplicated in each person; this might make updates very tricky (inconsistencies) and the representation very verbose and redundant. Otherwise we could organize things the other way around, with people inside operations, but then we would have people replicated. Another possibility is using a graph structure with people, names, nicknames, phones, operation names, etc. as nodes, and edges to represent the relationships between them. Or we could have objects and methods on them, or triples, etc.. Different data models are more suited for different problems. They have different expressive power and different strengths depending on what data you want to represent and how you need to access them. Let's choose the relational data model and represent this problem using "tables". Again there are many ways to structure the representation, i.e., different "conceptual/logical schemas" that could capture the reality we are modeling. For example we could have a single big table with all the info together... again, this is redundant and might slow down all access to the data. "Database design" is the art of capturing a set of real world concepts and their relations in the best possible organization in a database. A good representation is shown in Figure 4. It is not redundant and contains all the information we care about.


involved:

  pers_name | oper_name | role
  carlo     | snowflake | chief
  tony      | snowflake | sold
  mike      | chocolate | chief

person:

  name  | nickname | phone
  carlo | baffo    | 123
  mike  | lungo    | 456
  tony  | shifezza | 789

operation:

  title     | descr. | econ_val | coverup
  snowflake | ..     | $10M     | laundromat
  chocolate | ...    | $5M      | irish pub
  caffe     | ...    | $2M      | irish pub

Figure 4: Simple Logical Schema for a portion of our Mafia database.

What about the physical organization of the data? As a database user you can ignore the problem, thanks to physical independence! As a student of this class you will devote a lot of effort to learning how to best organize data physically to provide great performance when accessing data.

3.4

Accessing the data (transactionally)

As we introduced before, databases provide high-level declarative query languages. The key idea is that you describe "what" you want to access, rather than "how" to access it. Let's consider the following operations you want to do on data, and how we can represent them using the standard relational query language SQL:

• Which operations involve "Tony Schifezza"?

  SELECT oper_name FROM involved WHERE person = "tony";

• Given the "laundromat" operation, get the phone numbers of all the people involved in operations using it as a cover up.

  SELECT p.phone
  FROM person p, operation o, involve i
  WHERE p.name = i.person AND i.oper_name = o.name AND o.coverup_name = "laundromat";

• Reassign Tony's operations to Sam and remove Tony from the database (he was the mole).

  BEGIN
  UPDATE involved i SET pers_name="sam" WHERE pers_name="tony";
  DELETE FROM person WHERE name = "tony";
  COMMIT


• Create a new operation with "Sam Astuto" in charge of it.

  BEGIN
  INSERT INTO operation VALUES ('newop1', '', 0, 'Sam''s bakery');
  INSERT INTO involve VALUES ('newop1', 'sam', 'chief');
  COMMIT

Let us reconsider the procedural approach. You might organize data into files: one file per table, one record per line, and maybe sort the data by one of the fields. Now every different access to the data, i.e., every "query", becomes a different program opening the files, scanning them, reading or writing certain fields, and saving the files.
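For contrast with the declarative SQL above, here is a rough Python sketch (assuming hypothetical tab-separated files person.txt, involved.txt, operation.txt with the columns of Figure 4) of what the "phone numbers for the laundromat cover-up" query becomes in the procedural approach: a hand-written program that hard-codes one particular access strategy (three nested scans).

def rows(path):
    # each line is one record, fields separated by tabs (hypothetical format)
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n").split("\t")

def phones_for_coverup(coverup):
    result = []
    for title, descr, econ_val, cover in rows("operation.txt"):
        if cover != coverup:
            continue
        for pers_name, oper_name, role in rows("involved.txt"):
            if oper_name != title:
                continue
            for name, nickname, phone in rows("person.txt"):
                if name == pers_name:
                    result.append(phone)
    return result

print(phones_for_coverup("laundromat"))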

4

Extras

The two following concepts have been broadly mentioned but not discussed in detail in class.

Optimization. The goal of a DBMS is to provide a library of sophisticated techniques and strategies to store, access, and update data that also guarantees performance, atomicity, consistency, isolation, and durability. The DBMS automatically compiles the user's declarative queries into an execution plan (i.e., a strategy that applies various steps to compute the user's queries), looks for equivalent but more efficient ways to obtain the same result (query optimization), and executes it; see the example in Figure 5.

[Figure 5: a BASIC PLAN (scans of person, involved and operations combined by cross products, followed by the filters p.name=i.person, i.oper_name=o.name, o.coverup="laundromat" and a final project(p.phone)) next to an OPTIMIZED PLAN (projections pushed down onto the scans, the coverup="laundromat" predicate turned into a lookup on operations, and the remaining filters applied above the products).]

Figure 5: Two equivalent execution plans, a basic and an optimized one.


External schema. A set of views over the logical schema that prescribes how users see/access the data (e.g., a set of views for the accountants). It is often not physically materialized, but maintained as a view/query on top of the data. Let's try to show only the coverup names of operations worth less than or equal to $5M and the nicknames of all the people involved, using a view (see Figure 6):

CREATE VIEW nick-cover AS
SELECT nickname, coverup_name
FROM operation o, involved i, person p
WHERE p.name = i.person AND i.oper_name = o.name AND o.econ_val <= 5000000;

Select age
From animals
Where name = "Freddie"
And cid = A.cid

Find all pairs of animals cared for by the same keeper:

Select A.name, B.name
From Animals A
Where A.zid =
    Select B.zid
    From Animals B
    Where B.name != A.name

Requires refs from inside out and outside in. No obvious query processing strategy --> disallowed. Okay for an inner to outer reference, but not the other way around.

SQL solution (1976) -- multi-table blocks:

Select A.name, B.name
From Animals A, Animals B
Where A.zid = B.zid and A.name != B.name

Net result: horrible. 2 ways to express most queries, e.g. Freddie's keeper:

(nested)
select name from keepers where id in
select kid

from Animals

where name = “Freddie”

(flat)

Select k.name

From Animals A, Keepers K

Where A.kid = K.id and A.name = ‘Freddie’

Which one to choose?

1980’s:

Hierarchical got you inside-out evaluation

Flat got you a complete optimization --> obviously better!

Don’t use hierarchical representation!!!!

More recently, SQL engines rewrite hierarchical queries to flat ones, if they can. Big

pain for the implementation!!!

There are queries that cannot be flattened, e.g. ones with an "=" between the inner and outer block. There are ones that cannot be expressed in nested fashion, e.g. the "pairs of animals" query above. Two collections of queries, nested and flat: the Venn diagram is almost identical, but not quite. Awful language design.

Lessons:
Never too late to toss everything and start over -- gluing on warts is never a good idea
Simple is always good
Language design should be done by PL folks not DB folks – we are no good at it

**************

OODB (1980s): persistent C++

Motivation:

Programming language world (C++)

struct animals {
    string name;
    string feed_time;
    string species;
    int age;
};

struct keeper_data {
    string name;
    string address;
    animals charges[20];
} DBobject;

data base world:

Keepers (name, address)

Animals (name, feed_time, species, age, k_id)

There is an obvious impedance mismatch.

Query returns a table – not a struct. You have to convert DB return to data structures of

client program.

OODBs were focused on removing this impedance mismatch.

Data model: C++ data structures. Declare them to be persistent, e.g.:

    persistent keeper_data DB[500];

to query the data base:

    temp = DB[1].keeper_data.name

to do an update:

    DB[3].keeper_data.name = "Joseph"

Query language: none – use C++; appropriate for CAD-style apps

Assumes programming language Esperanto (C++ only)

Later adopted a QL (OQL)

Transaction systems very weak – big downside

Never found a sweet spot.

***********

Semi-structured data problem

Hobbies of employees

Sam: bicycle (brand, derailer, maximum slope hill, miles/week)

Mike: hike (number of 4000 footers, boot brand, speed) Bike (builder, frame-size, kind-of-seat) “semi-structured data”

Not well suited to RM (or any other model we have talked about so far)

XML ( hierarchical, self-describing tagged data)

<employee>
  <name>Sam</name>
  <hobbies>
    <bike>
      <brand>XXX</brand>
      . . .
    </bike>
  </hobbies>
</employee>

XML is anything you want. Document guys proposed this stuff as simplified SGML.

Adapted by DBMS guys. Always a bad idea to morph something to another purpose. If you want a structured collection, they proposed XML Schema: an XML representation for the structure of the XML document. Most complex thing on the planet…

Tables (RM)
Hierarchies (IMS style)
Refs (Codasyl style)
Set valued attributes (color = {red, green, brown})
Union types (value can be an X or a Y)

XQuery is one of the query languages (with XPath):

For $X in Employee

Where $x/name = Sam

Return $X/hobbies/bike

Huge language

Relational elephants have added (well behaved subsets) to their engines

getting some traction.

**************

Data base design

***************

Animals (name, species, age, feeding_time, cid, kid)

Cages (id, size)

Keepers (id, name, address)

Or

Animals (name, species, age, feeding_time)

Cages (id, size)

Keepers (id, name, address)

Lives_in (aname, cid)

Cared_for_by (aname, kid)

Data base design problem – which one to choose

Two ways to do data base design: a) Normalization, b) E-R models.

Normalization: start with an initial collection of tables, (for this example)

Animals (name, species, age, cid, cage_size)

Functional dependency: for any two collections of columns, the second set is determined by the first set; i.e. it is a function. Written A -> B.

In our example: (name) is a key. Everything is FD on this key. Plus cid -> cage_size.

Problems: 1) redundancy: cage_size repeated for each animal in the cage 2) cannot have an empty cage

Issue:

Cage_size is functionally dependent on cid which in turn is functionally dependent on

name. So called transitive dependency.

Solution: normalization:

Cage (id, cage_size)

Animals (name, species, age, c_id)
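As a small illustration of what a functional dependency like cid -> cage_size means, here is a Python sketch of a checker (the rows are made up): A -> B holds if no two rows agree on A but disagree on B.

def fd_holds(rows, lhs, rhs):
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        if key in seen and seen[key] != val:
            return False          # same lhs values, different rhs values: FD violated
        seen[key] = val
    return True

animals = [
    {"name": "freddie", "species": "lion",  "age": 7, "cid": 1, "cage_size": "big"},
    {"name": "tom",     "species": "tiger", "age": 3, "cid": 1, "cage_size": "big"},
    {"name": "lucy",    "species": "zebra", "age": 5, "cid": 2, "cage_size": "small"},
]

print(fd_holds(animals, ["cid"], ["cage_size"]))   # True: cid -> cage_size
print(fd_holds(animals, ["cage_size"], ["name"]))  # False: two animals share a cage size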

1NF (Codd) -- all tables "flat"
2NF (Codd) -- fully functional on the key
3NF (Codd) -- no transitive dependencies
BCNF
4NF
5NF
P-J NF
…. Theoreticians had a field day…. Totally worthless: 1) mere mortals can't understand FDs

2) have to have an initial set of tables – how to come up with these?

Users are clueless…

Plus, if you start with:

Animals (name, species, age, feeding_time)

Cages (id, size)

Keepers (id, name, address)

Lives_in (aname, cid)

Cared_for_by (aname, kid)

No way to get

Animals (name, species, age, feeding_time, cid, kid)

Cages (id, size)

Keepers (id, name, address)

******

Universal solution:

E-R model:

Entities (things with independent existence)

Keepers

Cages

Animals

Entities have attributes. Entities have a key. Animals have name (key), species, age, …

Entities participate in relationships:

Lives_in
Cared_for_by

Relationships are 1::1, 1::N or M::N (use crow's feet on the arcs to represent this visually). Relationships can have attributes.

Draw an E-R diagram:

[E-R diagram: Keepers (id (key), name, address) and Cages (id (key), size), each connected by a 1::N relationship down to Animals (name (key), species, age, feeding_time).]

Automatic algorithm generates 3NF (Wong and Katz 1979)

Each entity is a table with the key

M::N relationships are a table with their attributes

1::N relationships – add the key of the one side to the table on the N side, together with all of the relationship attributes

Generates:

Animals (name, species, age, feeding_time, cid, kid)

Cages (id, size)

Keepers (id, name, address)

Over the years has been extended with

Weak entities (no key – inherits the key of some other entity) (learn to drive)

Inheritance hierarchies (generalization) (student is a specialization of person)

Aggregation (attribute of all of the participants in a relationship) (e.g count)

More than binary relationships (e.g. marriage ceremony)

Details in Ramakrishnan…..

MIT OpenCourseWare http://ocw.mit.edu

6.830 / 6.814 Database Systems Fall 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

6.830/6.814 — Notes∗ for Lecture 4: Database Internals Overview Carlo A. Curino September 22, 2010

1

Announcements

• Problem Set 1 is due today! (For next time please submit in some open, non-esoteric format: .txt, .rtf, .pdf)
• Lab 1 is out today... start it right away, it is due in 8 days! Do not copy... we perform automatic checking for plagiarism... it is not a good gamble!
• Project ideas and rules are posted online.

2

Readings

For this class the suggested readings are:

• Joseph Hellerstein, Michael Stonebraker and James Hamilton. Architecture of a Database System. Online at: http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf

It is a rather long paper (don't be too scared by the 119 pages, the page format makes it look much longer than it is) that is in general worth reading; however, we only require you to read sections 1, 2 (skim through it), 3, 4 (up to subsection 4.5 included), and 5. You can also skim through section 6, which we will discuss later on. It probably doesn't all make sense right now – you should come back to this paper throughout the semester for context.

3

A bit of history

Complementing Mike’s historical overview...Projects ideas and rules are posted online. ∗ These notes are only meant to be a guide of the topics we touch in class. Future notes are likely to be more terse and schematic, and you are required to read/study the papers and book chapters we mention in class, do homeworks and Labs, etc.. etc..


1970’s : Several camps of proponents argue about merits of these competing systems while the theory of databases leads to mainstream research projects. Two main prototypes for relational systems were developed during 1974-77. • Ingres: Developed at UCB by (including guess who? Stonebraker and Wong). This ultimately led to Ingres Corp., Sybase, MS SQL Server, Britton-Lee, Wang’s PACE. This system used QUEL as query language. • System R: Developed at IBM San Jose (now Almaden) and led to IBM’s SQL/DS & DB2, Oracle, HP’s Allbase, Tandem’s Non-Stop SQL. This system used SEQUEL as query language (later SQL). Lots of Berkeley folks on the System R team, including Gray (1st CS PhD @ Berkeley), Bruce Lindsay, Irv Traiger, Paul McJones, Mike Blasgen, Mario Schkol­ nick, Bob Selinger , Bob Yost. Early 80’s : commercialization of relational systems • Ellison’s Oracle beats IBM to market by reading white papers. • IBM releases multiple RDBMSs, settles down to DB2. Gray (System R), Jerry Held (Ingres) and others join Tandem (Non-Stop SQL), Kapali Eswaran starts EsVal, which begets HP Allbase and Cullinet • Relational Technology Inc (Ingres Corp), Britton-Lee/Sybase, Wang PACE grow out of Ingres group • CA releases CA-Universe, a commercialization of Ingres • Informix started by Cal alum Roger Sippl (no pedigree to research). • Teradata started by some Cal Tech alums, based on proprietary network­ ing technology (no pedigree to software research) Mid 80’s : • SQL becomes ”intergalactic standard”. • DB2 becomes IBM’s flagship product. 1990’s: • Postgres project at UC Berkeley turned into successful open source project by a large community, mostly driven by a group in russia • Illustra (from Postgres) → Informix → IBM • MySQL


2000’s: • Postgres → Netezza, Vertica, Greenplum, EnterpriseDB... • MySQL → Infobright • Ingres → DATAllegro System R is generally considered the more influential of the two – you can see how many of the things they proposed are still in a database system today. However, Ingres probably had more ”impact” by virtue of training a bunch of grad students who went on to fund companies + build products (e.g., Berke­ leyDB, Postgres, etc.)

4

Introduction

Figure 1 shows the general architecture of a database.

[Figure 1: Architecture of a DBMS (after Hellerstein, Stonebraker and Hamilton): a Client Communications Manager (local and remote client protocols); a Process Manager (admission control, dispatch and scheduling); a Relational Query Processor (query parsing and authorization, query rewrite, query optimizer, plan executor, DDL and utility processing); a Transactional Storage Manager (access methods, buffer manager, lock manager, log manager); and Shared Components and Utilities (catalog manager, memory manager, administration/monitoring and utilities, replication and loading services, batch utilities). Image by MIT OpenCourseWare.]

Figure 1: Architecture of a DBMS

Today we will mainly look at the big picture and go through relational query rewriting and execution; the following lectures will focus on each of the pieces in more detail. Show the flow of a query.


5

Process Models

Parallelism is key to performance, in particular when I/O waits might stall computation. To maximize throughput you need to have enough going on in parallel to avoid waiting/stalling. Process models:

• Back in the day there was no good OS thread support; DBs pioneered this ground (also due to the need to support many OSs)
• Process per DBMS worker (needs shared memory [ASK: is it clear why we need to share across multiple workers?], context switches are expensive, easy to port, limited scalability)
• Thread per DBMS worker (great if there is good OS thread support, or using the DBMS's separate implementation of threads... pro: portability, cons: duplicate functionality)
• Process/Thread pool, and scheduling/allocation of DBMS workers to processes or threads.

6

Parallel Architecture

• Shared Memory: typically inside one machine; for large installations, high costs. All process models are applicable. Great for OLTP; many implementations, from almost every vendor.
• Shared Nothing: typically a cluster of nodes. Requires good partitioning, which is easier for OLAP workloads (Teradata, Greenplum, DB2 Parallel Edition, ...).
• Shared Disk: a cluster of nodes with a SAN. Simple model, because every node can access all the data, but requires cache-coherence protocols. Oracle RAC, DB2 SYSPLEX.
• NUMA: not that common, we will not discuss it. (Often the DBMS treats it as either shared nothing or shared memory, depending on how non-uniform it is.)

Different failure modes... Partial failure is good to have when possible. We will go back to parallel architectures later on.

7

Query Processing

Query parsing (correctness check)

Query admission control / authorization


7.1

Query Rewrite:

View Rewrite. Remember the schema from the other day:

involved:

  pers_name | oper_name | role
  carlo     | snowflake | chief
  tony      | snowflake | sold
  mike      | chocolate | chief

person:

  name  | nickname | phone
  carlo | baffo    | 123
  mike  | lungo    | 456
  tony  | shifezza | 789

operation:

  title     | descr. | econ_val | coverup
  snowflake | ..     | $10M     | laundromat
  chocolate | ...    | $5M      | irish pub
  caffe     | ...    | $2M      | irish pub

Figure 2: Simple Schema for a portion of our Mafia database.

What are views? A "named query", or a "virtual table" (sometimes materialized).

CREATE VIEW nick-cover AS
SELECT nickname, coverup_name
FROM operation o, involved i, person p
WHERE p.name = i.person AND i.oper_name = o.name AND o.econ_val <= 5000000;

A query over a view is rewritten by substituting the view definition into the query. Another rewrite step is constant folding and predicate simplification; for example:

WHERE a > 106 and a = b AND b < 108

becomes (after constant elimination and logical predicate manipulations):

WHERE a = 107 and a = b and b = 107

7.1.2

Subquery Flattening

As Mike mentioned in the last class, another key step is subquery flattening (not every optimizer will successfully do this, so you should always try to think of a non-nested query if you can find one):

SELECT nickname
FROM operation o, involved i, person p
WHERE p.name = i.person AND i.oper_name = o.name AND o.econ_val <= 5000000;

Consider now the cost of executing a query such as:

SELECT name, kidname
FROM emp, dept, kids
WHERE sal > 10k AND emp.dno = dept.dno AND e.eid = kids.eid;

Π_name,kidname ( dept ⋈_dno=dno ( σ_sal>10k (emp) ) ⋈_eid=eid kids )

More graphically:

[Plan tree: a nested-loops join of d (about 100 depts) with σ_sal>10k(e), whose output of roughly 1000 tuples feeds a second nested-loops join with k. Image by MIT OpenCourseWare.]

CPU operations:

• selection – 10,000 predicate ops
• 1st nested loops join – 100,000 predicate ops
• 2nd nested loops join – 3,000,000 predicate ops

Let's look at the number of disk I/Os, assuming LRU and no indices.

If d is outer:
• 1 scan of DEPT
• 100 consecutive scans of EMP (100 x 100 pg. reads) – the cache doesn't help since e doesn't fit
  – 1 scan of EMP: 1 seek + read of 1 MB = 10 ms + 1 MB / 100 MB/sec = 20 msec
  – 20 ms x 100 depts = 2 sec
• 10 msec to seek to the start of d and read it into memory
TOTAL: ~2.1 secs

If d is inner:
• read a page of e – 10 msec
• read all of d into RAM – 10 msec
• seek back to e – 10 msec
• scan the rest of e – 10 msec, joining with d in memory... because d fits into memory
TOTAL: just ~40 msec

No options if the plan is pipelined: k must be the inner:
• 1000 scans of 300 pages: 3 MB / 100 MB/sec = 30 msec + 10 msec seek = 40 msec; x 1000 = 40 sec

So how do we know what will be cached? That's the job of the buffer pool. What about indexes? The DBMS uses indexes when possible (i.e., when an index exists and it supports the required operation; e.g., hash indexes do not support range search).
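A quick Python sketch of the arithmetic behind these numbers, under the stated assumptions (10 ms per seek, 100 MB/sec transfer, 10 KB pages, EMP = 100 pages, DEPT read in one I/O, KIDS = 300 pages):

SEEK_MS = 10.0
TRANSFER_MS_PER_MB = 10.0     # 100 MB/sec
PAGE_MB = 0.01                # 10 KB pages

def scan_ms(pages):
    # one sequential scan = one seek plus the transfer time
    return SEEK_MS + pages * PAGE_MB * TRANSFER_MS_PER_MB

emp_pages, dept_pages, kids_pages = 100, 1, 300

# d outer: read d once, then re-scan all of e for each of the 100 depts
d_outer = scan_ms(dept_pages) + 100 * scan_ms(emp_pages)   # ~2.0 sec (the notes round up to 2.1)

# d inner: d fits in RAM, so e and d are each read once
d_inner = scan_ms(emp_pages) + scan_ms(dept_pages)         # ~30 ms (the notes count an extra seek: 40 ms)

# pipelined plan, k forced inner: re-scan kids for each of the 1000 outer tuples
k_inner = 1000 * scan_ms(kids_pages)                       # 40 ms x 1000 = 40 sec

print(d_outer, d_inner, k_inner)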

8

Buffer Management and Storage Subsystem

The Buffer Manager (or buffer pool) caches disk pages in memory... Why is it better if the DBMS does this instead of relying on OS-level caching? The DBMS knows more about the query workload, and can predict which pages will be accessed next. Moreover, it can avoid cases in which LRU fails. Also, explicit management of pages is helpful for the locking / logging necessary to guarantee ACID properties.

MIT OpenCourseWare http://ocw.mit.edu

6.830 / 6.814 Database Systems Fall 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

6.830/6.814 — Notes∗ for Lecture 6: Access Methods Carlo A. Curino September 29, 2010

1

Announcements

• Problem Set 2 is out today... due in two weeks... be aware that it is going to overlap with Lab 2.
• Project ideas and rules are posted online.

2

Readings

For this class the suggested readings are: • In Database Management Systems, read: – Pages 273-289. If you are using another book, this is the introduc­ tion to the Section on Storage and Indexing which discusses different access methods and their relative performance. – Pages 344-358. This is an in-depth discussion of the B+Tree data structure and its implementation. Most database books, as well as any algorithms text (such as CLR or Knuth) will provide an equiva­ lent discussion of B+Trees. – (was not assigned, but is an important reading) Pages 370-378 on Static Hashing and Extensible Hashing • ”The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles.” Beckmann et al, in The Red Book. ∗ These notes are only meant to be a guide of the topics we touch in class. Future notes are likely to be more terse and schematic, and you are required to read/study the papers and book chapters we mention in class, do homeworks and Labs, etc.. etc..


3

Recap

We presented a broad overview of the DBMS internals.

We point out how important coming up with a good plan is.

I claimed disk I/O can be very important.

Remember I mentioned this last time:

  CPU cost (# of instructions):  1 GHz == 1 billion instrs/sec  ->  1 nsec / instr
  RAM access:                    50 ns
  I/O cost (# of pages read, # of seeks; random I/O = page read + seek):
                                 100 MB/sec  ->  10 nsec / byte
                                 10 msec/seek  ->  100 seeks/sec

1 seek = 10M instructions!!!

Moreover there is a big difference between sequential and random I/O. Example:

• Read 10KB from a random location on disk: 10ms + 100µs = 10.1ms;
• Read 10KB+10KB from a random location on disk (the two blocks are next to each other): 10ms + 200µs = 10.2ms;
• Read 10KB+10KB from two random locations on disk: 10ms + 100µs + 10ms + 100µs = 20.2ms;

WOW! So saving disk I/O, and in particular random I/O, is VERY important!! DBs are usually well designed to: 1) avoid excessive disk I/O and try to be sequential, and 2) have a LOT of drives available (TPC-H competing machines: 144 cores AND 1296 disks!!). Today we study how to do a good job at reducing disk I/O by organizing stuff intelligently on disk; next lecture we study what we should keep in RAM.
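The arithmetic of these three examples as a small Python sketch, under the same assumptions (10 ms per seek, 100 MB/sec sequential transfer, so 10 KB takes roughly 100 µs):

SEEK_MS = 10.0
MS_PER_KB = 1000.0 / (100 * 1024)    # 100 MB/sec

def read_ms(chunks_kb, contiguous):
    if contiguous:
        # one seek, then everything streams sequentially
        return SEEK_MS + sum(chunks_kb) * MS_PER_KB
    # one seek per chunk
    return sum(SEEK_MS + kb * MS_PER_KB for kb in chunks_kb)

print(read_ms([10], True))          # ~10.1 ms: one random 10 KB read
print(read_ms([10, 10], True))      # ~10.2 ms: two adjacent 10 KB reads
print(read_ms([10, 10], False))     # ~20.2 ms: two random 10 KB reads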

4

Access Methods

Today we get into more detail on Access Methods, i.e., the portion of the DBMS in charge of managing the data on disk. We will show a bunch of organizations of data, and their performance in supporting the typical accesses we need to support queries. Next lecture we will study the Buffer Manager, which tries to reduce accesses to disk. What are the functionalities we need to support:

• scan
• search (equality)
• search (range)

• insert

• delete

Various access methods:

• heap file: unordered, typically implemented as a linked list of pages
• sorted file: ordered records, expensive to maintain
• index file: data + extra structures around it to quickly access the data
  – might contain the data (primary index)
  – or point at the data, often stored in a heapfile or another index (secondary index)
  – if the data are sorted in the same order as the indexed field, we say it is a clustered index (we will see this is good for scans since disk accesses are sequential)

Types of indexes:

• hash
• B+trees
• R*trees

4.1

Data organization within file

file organization: • pages • records (record ids: page id, slot id)

page layout:

• fixed length records: a page of slots, plus a free bitmap
• variable length records: a "slotted page" structure, i.e., a slot directory of (slot offset, len) pairs

What about big records? Hard to place, and they might overflow onto another page.

Tuple layout (similar story):

• fixed length (the structure is known from the system catalog, so we can predict where fields start)
• variable length: field slots, two options: delimiters, or a directory with pointers/offsets

What happens when a field size changes? Need to move stuff... so if we have a mix of fixed/variable fields, the fixed fields are best stored first. Null values? A good trick is using the pointers/offsets: if two adjacent offsets have the same value, it means the field in between is null... this makes storing nulls 0 space overhead.

5

Cost model (enhanced one, from the book)

• Heap Files: equality selection on key; exactly one match.
• Sorted Files: files compacted after deletions.
• Indexes:
  – Alternatives (2), (3): data entry size = 10% of record size
  – Hash: no overflow buckets; 80% page occupancy → file size = 1.25 × data size
  – Tree: 67% occupancy (this is typical) → file size = 1.5 × data size

We use:

• B: number of data pages
• R: number of records per page
• D: average time to read/write a page from disk
• C: average time to process a record (e.g., an equality check)

                   Scan           Equality             Range                        Insert               Delete
Heap               BD             0.5BD                BD                           2D                   Search + D
Sorted             BD             D log2 B             D log2 B + # matches         Search + BD          Search + BD
Clustered          1.5BD          D logF 1.5B          D logF 1.5B + # matches      Search + D           Search + D
Unclustered tree   BD (R + 0.15)  D (1 + logF 0.15B)   D logF 0.15B + # matches     D (3 + logF 0.15B)   Search + 2D
Unclustered hash   BD (R + 0.125) 2D                   BD                           4D                   Search + 2D

Image by MIT OpenCourseWare.
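A sketch of the cost formulas in the table above as Python functions (the fan-out F and the number of matching pages are parameters; the "Search" term reuses the equality cost, and the CPU cost C is ignored):

from math import log

def logF(x, base):
    return log(x) / log(base)

def heap(B, D):
    search = 0.5 * B * D                      # equality selection on a key
    return {"scan": B * D, "equality": search, "range": B * D,
            "insert": 2 * D, "delete": search + D}

def sorted_file(B, D, match_pages=1):
    search = D * logF(B, 2)
    return {"scan": B * D, "equality": search,
            "range": search + match_pages * D,
            "insert": search + B * D, "delete": search + B * D}

def clustered(B, D, F=100, match_pages=1):
    search = D * logF(1.5 * B, F)             # the index occupies 1.5x the data pages
    return {"scan": 1.5 * B * D, "equality": search,
            "range": search + match_pages * D,
            "insert": search + D, "delete": search + D}

# e.g. 1,000,000 records at 100 per page -> B = 10,000 pages, D = 10 ms
print(heap(10_000, 10), sorted_file(10_000, 10), clustered(10_000, 10))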

6

Extensible hashing

Good read in the book: pages 370-378.

[Figure: extendible hashing example from the book. A directory of global depth 2 (entries 00, 01, 10, 11) points to four buckets of local depth 2: Bucket A, Bucket B (1*, 5*, 21*, 13*), Bucket C (10*), Bucket D (15*, 7*, 19*). After an insert, Bucket A splits into Bucket A (32*, 16*) and its "split image" Bucket A2 (4*, 12*, 20*), both with local depth 3, and the directory doubles to global depth 3 (entries 000-111). Image by MIT OpenCourseWare.]

7

B+trees

Hierarchical indices are the most common type used—e.g., B+-Trees indices typically point from key values to records in the heap file. Shall we always have indexes? What are the pros and cons? (keep the index up-to-date, extra space required)

Figure 1: B+Tree graphical representation. Courtesy of Grundprinzip on Wikipedia.

A special case is a "clustered" index, i.e., when the order of the tuples on disk corresponds to the order in which they are stored in the index. What is it good for? (Range selections, scans.)
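A toy, single-level Python version of the idea (real B+Trees are paged, multi-level, and kept balanced under updates): sorted keys plus record ids, searched with binary search; a range lookup is cheap because matching keys are contiguous.

import bisect

class ToyIndex:
    def __init__(self, pairs):                 # pairs of (key, record_id)
        pairs = sorted(pairs)
        self.keys = [k for k, _ in pairs]
        self.rids = [r for _, r in pairs]

    def lookup(self, key):                     # equality search
        i = bisect.bisect_left(self.keys, key)
        while i < len(self.keys) and self.keys[i] == key:
            yield self.rids[i]
            i += 1

    def range(self, lo, hi):                   # keys in [lo, hi)
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_left(self.keys, hi)
        return self.rids[i:j]

idx = ToyIndex([(30, "r1"), (10, "r2"), (20, "r3"), (30, "r4")])
print(list(idx.lookup(30)), idx.range(10, 30))   # ['r1', 'r4'] ['r2', 'r3']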


MIT OpenCourseWare http://ocw.mit.edu

6.830 / 6.814 Database Systems Fall 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

6.830/6.814 — Notes∗ for Lecture 7: Buffer Management Carlo A. Curino September 30, 2010

1

Announcements • Lab 2 is going to go out today... Same as before... do not copy! (I mean it!) • Project teams were due last time. Anyone still without a team?

2

Readings

For this class the suggested readings are: • Hong-Tai Chou and David DeWitt. An Evaluation of Buffer Management Strategies for Relational Database Systems. VLDB, 1985. • If you are interested in simple current implementation of bufferpool man­ agement: http://dev.mysql.com/doc/refman/5.5/en/innodb-buffer-pool. html and

3

Recap

We have seen many access methods. None of them is uniformly better than the others across the board, but each has some particular case in which it is best. B+Trees (both clustered and unclustered) and heap files are the most commonly used. We made a big case about the fact that disk accesses are very expensive and that the DBMS has two ways to reduce their impact on performance: i) reduce the number of disk accesses by having smart access methods, ii) do a good job of caching data from disk in RAM. ∗ These notes are only meant to be a guide to the topics we touch in class. Future notes are likely to be more terse and schematic, and you are required to read/study the papers and book chapters we mention in class, do homeworks and Labs, etc.. etc..


Last lecture we assumed every access was off of disk, and we tried our best to minimize the number of pages accessed, and to maximize the sequentiality of the accesses (due to the large cost of seeks) by designing smart access methods. Today we get into the investigation of “what” to try to keep in RAM to further avoid Disk I/O. The assumption is that we can’t keep everything, since the DB is in general bigger than the available RAM.

4

Today’s topic

Why don't we just trust the OS? (The DBMS knows more about the data accesses, and thus can do a better job about what to keep in RAM and what to pre-fetch; given an execution plan it is rather clear what we are going to need next.) The DBMS manages its own memory: buffer management / buffer pool. The buffer pool is:

• a cache of recently used pages (and, more importantly, it plans ahead for which ones are likely to be accessed again, and what could be prefetched)
• a convenient "bottleneck" through which references to underlying pages go; useful when checking to see if locks can be acquired or not
• shared between all queries running on the system (important! the goal is to globally optimize the query workload)

The final goal is to achieve better overall system performance... often correlated with minimizing physical disk accesses (e.g., executing 1 query at a time guarantees the minimum number of accesses, but leads to poor throughput). It is a good place to keep locks:

Cache – so what is the best eviction policy? USE SAM NOTES FROM HERE ON...


5

What about MySQL?

(From the MySQL documentation online: http://dev.mysql.com/doc/refman/5.5/en/innodb-buffer-pool.html) A variation of the LRU algorithm operates as follows by default:

• 3/8 of the buffer pool is devoted to the old sublist.
• The midpoint of the list is the boundary where the tail of the new sublist meets the head of the old sublist.
• When InnoDB reads a block into the buffer pool, it initially inserts it at the midpoint (the head of the old sublist). A block can be read in because it is required for a user-specified operation such as a SQL query, or as part of a read-ahead operation performed automatically by InnoDB.
• Accessing a block in the old sublist makes it "young", moving it to the head of the buffer pool (the head of the new sublist). If the block was read in because it was required, the first access occurs immediately and the block is made young. If the block was read in due to read-ahead, the first access does not occur immediately (and might not occur at all before the block is evicted).
• As the database operates, blocks in the buffer pool that are not accessed "age" by moving toward the tail of the list. Blocks in both the new and old sublists age as other blocks are made new. Blocks in the old sublist also age as blocks are inserted at the midpoint. Eventually, a block that remains unused for long enough reaches the tail of the old sublist and is evicted.

You can control:

• innodb_old_blocks_pct: the portion of the pool devoted to the old sublist
• innodb_old_blocks_time: specifies how long in milliseconds (ms) a block inserted into the old sublist must stay there after its first access before it can be moved to the new sublist.
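A toy Python sketch of the midpoint-insertion idea (not InnoDB's actual code, and ignoring the innodb_old_blocks_time delay): new pages enter at the head of the old sublist, and only a page touched again while it sits there gets promoted into the new sublist, so a one-off scan cannot flush the hot pages.

from collections import OrderedDict

class MidpointLRU:
    def __init__(self, capacity, old_fraction=3 / 8):
        self.old_cap = max(1, int(capacity * old_fraction))
        self.new_cap = capacity - self.old_cap
        self.old = OrderedDict()   # ordered most-recent first
        self.new = OrderedDict()

    def access(self, page):
        if page in self.new:                 # already hot: move to the head of new
            self.new.move_to_end(page, last=False)
        elif page in self.old:               # second touch: promote to the new sublist
            del self.old[page]
            self._insert_new(page)
        else:                                # miss: insert at the head of the old sublist
            self.old[page] = True
            self.old.move_to_end(page, last=False)
            if len(self.old) > self.old_cap:
                self.old.popitem()           # evict from the tail of old

    def _insert_new(self, page):
        self.new[page] = True
        self.new.move_to_end(page, last=False)
        if len(self.new) > self.new_cap:
            # demote the coldest "new" page to the head of the old sublist
            demoted, _ = self.new.popitem()
            self.old[demoted] = True
            self.old.move_to_end(demoted, last=False)
            if len(self.old) > self.old_cap:
                self.old.popitem()

pool = MidpointLRU(capacity=8)
for p in range(1, 10):          # a long scan only churns the old sublist
    pool.access(p)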

6

LRU Cache misses in typical scenarios

From the paper "I/O Reference Behavior of Production Database Workloads and the TPC Benchmarks -- An Analysis at the Logical Level" by Windsor W. Hsu, Alan Jay Smith, and Honesty C. Young: we report the LRU miss ratios for increasingly large buffer pool sizes, for typical production databases and popular benchmarks. This should give you an idea of the fact that some portions of the DB are very "hot" while others are rather "cold"; thus throwing more and more RAM at the problem provides less and less return. On the other side, choosing the "right" things to keep in RAM is clearly vital!

MIT OpenCourseWare http://ocw.mit.edu

6.830 / 6.814 Database Systems Fall 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

LRU? What if a query just does one sequential scan of a file -- then putting it in the cache at all would be pointless. So you should only do LRU if you are going to access a page again, e.g., if it is in the inner loop of a NL join. For the inner loop of a nested loops join, is LRU always the best policy? No: if the inner doesn't fit into memory, then LRU is going to evict the needed page over and over. E.g., 3 pages of memory, repeatedly scanning a 4-page file with LRU:

  read:      1  2  3  4  1  2
  frame A:   1  1  1  4  4  4
  frame B:      2  2  2  1  1
  frame C:         3  3  3  2
  hit/miss:  m  m  m  m  m  m

Always misses. What would have been a better eviction policy? MRU!

  read:      1  2  3  4  1  2  3  4  1  2
  frame A:   1  1  1  1  1  1  1  1  1  2
  frame B:      2  2  2  2  2  3  3  3  3
  frame C:         3  4  4  4  4  4  4  4
  hit/miss:  m  m  m  m  h  h  m  h  h  m

Here, MRU hits 2/3 of the time.
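A minimal Python simulation of the two traces above (3 frames, repeated scans of a 4-page file); the MRU hit/miss pattern matches the table.

def simulate(reads, frames, policy):
    buf = []                      # ordered from least to most recently used
    hits = 0
    for p in reads:
        if p in buf:
            hits += 1
            buf.remove(p)
            buf.append(p)         # refresh recency on a hit
        else:
            if len(buf) == frames:
                victim = buf[0] if policy == "LRU" else buf[-1]
                buf.remove(victim)
            buf.append(p)
    return hits

trace = [1, 2, 3, 4] * 3          # three passes over a 4-page file
print(simulate(trace, 3, "LRU"))  # 0 hits: every access misses
print(simulate(trace, 3, "MRU"))  # hits on most re-reads, as in the table above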

DBMIN tries to do a better job of managing the buffer pool by 1) allocating buffer pools on a per-file-instance basis, rather than a single pool for all files, and 2) using different eviction policies per file.

What is a "file instance"? (An open instance of a file by some access method.) Each time a file is opened, assign it one of several access patterns, and use that pattern to derive a buffer management policy.

(What does a policy consist of?) The policy for a file consists of a number of pages to allocate as well as a page replacement policy.

(What are the different types of policies?) Policies vary according to the access patterns for pages. What are the different access patterns for pages in a database system?

SS - Straight Sequential (sequential scan)
CS - Clustered Sequential (merge join) (skip)
LS - Looping Sequential (nested loops)
SR - Straight Random (index scan through a secondary index)
CR - Clustered Random (index NL join with a secondary index on the inner, with repeated foreign keys on the outer) (skip)
SH - Straight Hierarchical (index lookup)
LH - Looping Hierarchical (repeated btree lookups)

So what's the right policy:

SS - 1 page, any access method

CS - size of cluster pages, LRU

LS - size of file pages, any policy, or MRU plus however many pages you can spare

SR - 1 page, any access method

CR - size of cluster pages, LRU

SH - 1 page, any access method

LH - top few pages, priority levels, any access method for bottom level

How do you know which policy to use?

(Not said, presumably the query parser/optimizer has a table and can figure this out.)

Multipage interactions. Diagram:

Buffer pool per file instance, with locality set for that instance, plus "global table" that contains all pages.

Each page is "owned" by a at most one query. Each query has a "locality set" of pages for each file instance it

is accessing as a part of its operation, and each locality set is managed according to one of the above

policies.

Also store current number of pages associated with a file instance (r) and the maximum number of pages

associated with it (l).

How do you determine the maximum number of pages?

Using numbers above.

What happens when the same page is accessed by multiple different queries?

1) Already in buffer pool and owned locally 2) Already in buffer pool, but not owned

a) If someone else owns, nothing to be done b) If no owner, requester becomes owner 3) Not in buffer pool - requester becomes owner, evict something from requester's memory

How do you avoid running out memory? Don't admit queries into the system that will make the total sum of all of the l_ij variables > total system memory. Metacomments about performance study. (It's good.) Interesting approach. What did they do?

Collect real access patterns and costs, use them to drive a simulation of the buffer pool.

(Why?) Real system would take a very long time to run, would be hard to control.

How much difference did they conclude this makes?

As much as a factor of 3 for workload with lots of concurrent queries and not much sharing. Seems to be

mostly due to admission control. With admission control, simple fifo is about 60% as good as DBMIN.

DBMIN is not used in practice. What is? (Love/hate hints). What's that? (When an operator finishes with a page, it declares its love or hate for it. The buffer pool preferentially evicts hated pages.) Not clear why (this would make a nice class project.) Perhaps love/hate hints perform almost as well as DBMIN and are a lot simpler. They don't capture the need for different buffer management policies for different types of files.

(What else might you want the buffer manager to do?)

Prefetch.

(Why does that matter.)

Sequential I/O is a lot faster. If you are doing a scan, you should keep scanning for awhile before servicing some other

request, even if the database hasn't yet requested the next page.

Depending on the access method, you may want to selectively enable prefetching.

Interaction with the operating system (What's the relationship between the database buffer manager and the operating system?)

Long history of tension between database designers and OS writers. These days databases are an important enough

application that some OSes have support for them.

(What can go wrong?)

- Double buffering -- both OS and database may have a page in memory, wasting RAM. - Failure to write back -- the OS may not write back a page the database has evicted, which can cause problems if, for example, the database tries to write a log page and then crashes. - Performance issues -- the OS may perform prefetching, for example, when the database knows it may not need it.

Disk controllers have similar issues (cache, performance tricks.)

(What are some possible solutions?)

- Add hooks to the OS to allow the database to tell it what to do.

- Modify the database to try to avoid caching things the OS is going to cache anyway.

In general, a tension in layered systems that can lead to performance anomalies.

MIT OpenCourseWare http://ocw.mit.edu

6.830 / 6.814 Database Systems Fall 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Lecture 9 10/8/09 Query Optimization

Lab 2 due next Thursday.

M pages memory

S and R, with |S| and |R| pages respectively; |S| > |R|

M > sqrt(|S|)

External Sort Merge

split |S| and |R| into memory sized runs

sort each

merge all runs simultaneously

total I/O: 3(|R| + |S|)   (read, write, read)

"Simple" hash:

given a hash function h(x), split the h(x) values into N ranges, N = ceiling(|R|/M)
for (i = 1…N)
    for r in R: if h(r) in range i, put it in hash table Hr; o.w. write it out
    for s in S: if h(s) in range i, look it up in Hr; o.w. write it out

total I/O: N (|R| + |S|)

Grace hash:

for each of N partitions, allocate one page per partition
hash r into partitions, flushing pages as they fill
hash s into partitions, flushing pages as they fill
for each partition p:
    build a hash table Hr on the r tuples in p
    hash s, look up in Hr

example:
R = 1, 4, 3, 6, 9, 14, 1, 7, 11
S = 2, 3, 7, 12, 9, 8, 4, 15, 6
h(x) = x mod 3

R1 = 3 6 9        R2 = 1 4 1 7    R3 = 14 11
S1 = 3 12 9 15 6  S2 = 7 4        S3 = 2 8

Now, join R1 with S1, R2 with S2, R3 with S3
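A compact Python sketch of Grace hash join on this example (an equality join on the values themselves, h(x) = x mod 3, and everything kept in memory rather than flushed to disk, so only the partition/build/probe structure is shown):

from collections import defaultdict

def partition(values, nparts, h):
    parts = defaultdict(list)
    for v in values:
        parts[h(v) % nparts].append(v)
    return parts

def grace_hash_join(R, S, nparts=3, h=lambda x: x):
    out = []
    r_parts, s_parts = partition(R, nparts, h), partition(S, nparts, h)
    for i in range(nparts):
        table = defaultdict(list)          # build phase on the R partition
        for r in r_parts[i]:
            table[r].append(r)
        for s in s_parts[i]:               # probe phase with the matching S partition
            for r in table.get(s, []):
                out.append((r, s))
    return out

R = [1, 4, 3, 6, 9, 14, 1, 7, 11]
S = [2, 3, 7, 12, 9, 8, 4, 15, 6]
print(grace_hash_join(R, S))               # matches: (3,3), (9,9), (6,6), (7,7), (4,4)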

Note -- need 1 page of memory per partition. Do we have enough memory?

We have |R| / M partitions

M ≥ sqrt(|R|)

worst case

|R| / sqrt(|R|) = sqrt(|R|) partitions Need sqrt(|R|) pages of memory b/c we need at least one page per partition as we write out (note that simple

hash doesn't have this requirement)

I/O:

read R+S (seq)

write R+S (semi-random)

read R+S (seq)

also 3(|R|+|S|) I/OS

What's hard about this?

When does grace outperform simple?

(When there are many partitions, since we avoid the cost of re-reading tuples from disk in building partitions )

When does simple outperform grace?

(When there are few partitions, since grace re-reads hash tables from disk ) So what does Hybrid do? M = sqrt(|R|) + E Make first partition of size E, do it on the fly (as in simple) Do remaining partitions as in grace. 70 I/O (relative to simple with |R| = M)

63

Grace Simple Hybrid

56 49 42 35 28 21 14 7 0

1

2

3

4

5

6

7

8

9

|R|/M

Why does grace/hybrid outperform sort-merge?

CPU Costs!

I/O costs are comparable

690 of the 1000 seconds in sort merge are due to the costs of sorting, versus 17.4 seconds of CPU for grace/hybrid! Will this still be true today? (Yes.)

Selinger

Famous paper. Pat Selinger was one of the early System R researchers; still active today. Lays the foundation for modern query optimization. Some things are weak but have since been improved upon.

Idea behind query optimization: (Find the query plan of minimum cost.) How to do this? (Need a way to measure the cost of a plan -- a cost model.)

Single table operations. How do I compute the cost of a particular predicate? Compute its "selectivity" -- the fraction F of tuples it passes. How does Selinger define these? Based on the type of predicate and the available statistics. What statistics does System R keep?

- NCARD: relation cardinality
- TCARD: # pages the relation occupies
- ICARD: # keys in an index
- NINDX: # pages occupied by an index

Estimating selectivity F:

col = val:     F = 1/ICARD(col's index); 1/10 o.w. (where does this come from?)
col > val:     F = (high key - value) / (high key - low key); 1/3 o.w.
col1 = col2:   F = 1/MAX(ICARD(col1's index), ICARD(col2's index)) (key-foreign key); 1/10 o.w.

ex: suppose emp has 10000 records and dept has 1000 records. The total number of records in the cross product is 10000 * 1000; the selectivity is 1/10000, so 1000 tuples are expected to pass the join. Note that selectivity is defined relative to the size of the cross product for joins!

p1 and p2:  F1 * F2
p1 or p2:   1 - (1-F1) * (1-F2)

Then, compute the access cost for scanning the relation. How is this defined? (In terms of the number of pages read.)

equality predicate with a unique index:  1 [btree lookup] + 1 [heapfile lookup] + W
    (W is the CPU cost per predicate evaluation, as a fraction of the time to read a page)
range scan, clustered index, boolean factors:    F(preds) * (NINDX + TCARD) + W*(tuples read)
range scan, unclustered index, boolean factors:  F(preds) * (NINDX + NCARD) + W*(tuples read), unless all pages fit in the buffer -- why?
seq (segment) scan:  TCARD + W*(NCARD)

Is an index always better than a segment scan? (No.)

Multi-table operations. How do I compute the cost of a particular join? Algorithms:

NL(A,B,pred):  C-outer(A) + NCARD(outer) * C-inner(B)

Note that the inner is always a relation; the cost to access it depends on the access methods for B; e.g.,

w/ index -- 1 + 1 + W

w/out index -- TCARD(B) + W*NCARD(B)

C-outer is cost of subtree under outer

How to estimate # NCARD(outer)? product of F factors of children, cardinalities of children example:

[Figure: a join tree where A (access cost C1) and B (access cost C2) are joined with predicates of selectivity F1 and F2, with a selection on top; the estimated output cardinality is F1 * F2 * NCARD(A) * NCARD(B). Image by MIT OpenCourseWare.]
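A small Python sketch of the selectivity rules and the cardinality estimate above (the 1/10 and 1/3 defaults apply when statistics are missing; the final lines redo the emp/dept example):

def sel_eq_const(icard=None):
    return 1.0 / icard if icard else 0.1            # col = val

def sel_gt_const(val=None, lo=None, hi=None):
    if None not in (val, lo, hi):
        return (hi - val) / (hi - lo)               # col > val, linear interpolation
    return 1.0 / 3.0

def sel_eq_cols(icard1=None, icard2=None):
    cards = [c for c in (icard1, icard2) if c]
    return 1.0 / max(cards) if cards else 0.1       # col1 = col2 (key-foreign key)

def sel_and(f1, f2):
    return f1 * f2

def sel_or(f1, f2):
    return 1 - (1 - f1) * (1 - f2)

def join_cardinality(ncards, fs):
    # product of the children cardinalities and the F factors of the predicates
    out = 1.0
    for n in ncards:
        out *= n
    for f in fs:
        out *= f
    return out

f = sel_eq_cols(icard1=10000, icard2=1000)          # 1/10,000
print(join_cardinality([10000, 1000], [f]))          # ~1000 tuples expected to pass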

Merge_Join_x(P,A,B), equality pred C-outer + C-inner + sort cost (Saw cost models for these last time) At time of paper, didn't believe hashing was a good idea Overall plan cost is just sum of costs of all access methods and join operators Then, need a way to enumerate plans

Iterate over plans, pick one of minimum cost.

Problem: huge number of plans. Example: suppose I am joining three relations, A, B, C. Can order them as: (AB)C A(BC) (AC)B A(CB) (BA)C B(AC) (BC)A B(AC) (CA)B C(AB) (CB)A C(BA). Is C(AB) different from (CA)B? Is (AB)C different from C(AB)? Yes: inner vs. outer.

n! orderings * # of parenthesizations. How many parenthesizations are there?

ABCD --> (AB)CD A(BC)D AB(CD)    3
         XCD    AXD    ABX      *2
                               === 6 --> (n-1)!

==> n! * (n-1)!

6 * 2 == 12 for 3 relations

Ok, so what does Selinger do?

Push down selections and projections to leaves

Now left with a bunch of joins to order.

Selinger simplifies using 2 heuristics? What are they?

- only left-deep plans; e.g., ABCD => (((AB)C)D) (show)
- ignore cross products; e.g., if A and B don't have a join predicate, don't consider joining them

Still n! orderings. Can we just enumerate all of them? 10! -- ~3.6 million; 20! -- 2.4 * 10^18

so how do we get around this?

Estimate cost by dynamic programming. Idea: if I compute join (ABC)DE -- I can find the best way to combine ABC and then consider all the ways to combine that with DE. I can remember the best way to compute (ABC), and then I don't have to re-evaluate it. The best way to do ABC may be ACB, BCA, etc. -- it doesn't matter for the purposes of this decision. Algorithm: compute the optimal way to generate every sub-join of size 1, size 2, ... n (in that order).

[Figure: a histogram over an attribute of relation R (25k tuples), with bucket boundaries at 0, 10k, 20k, 30k, 40k and bucket fractions .4, .1, .4, .1; the selectivity of a range predicate is estimated by summing the (partial) fractions of the buckets it covers, e.g. .2 + .1 = .3. Image by MIT OpenCourseWare.]

[Figure: example of a 2-D histogram over (Age, Salary), with Age up to 60 and Salary up to 80k, and per-cell fractions (.05, .1, .2, ...); the selectivity of a predicate like Salary > 1000*Age is estimated by summing the fractions of the cells in the area below the line. Image by MIT OpenCourseWare.]

MIT OpenCourseWare http://ocw.mit.edu

6.830 / 6.814 Database Systems Fall 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Selinger Optimizer
6.814/6.830 Lecture 10, October 07, 2010
(Slides kindly offered by Sam Madden)

The Problem

• How to order a series of N joins, e.g., A.a = B.b AND A.c = D.d AND B.e = C.f
  – N! ways to order the joins (e.g., ABCD, ACBD, ...)
  – (N-1)! plans per ordering (e.g., (((AB)C)D), ((AB)(CD)), ...)
  – Multiple implementations (e.g., hash, nested loops, etc.)
• Naive approach doesn't scale, e.g., for a 20-way join:
  – 10! x 9! = 1.3 x 10^12
  – 20! x 19! = 2.9 x 10^35

Selinger Optimizations
• Left-deep only (((AB)C)D) (eliminates the (N-1)! factor)
• Push-down selections
• Don't consider cross products
• Dynamic programming algorithm

Dynamic Programming
R = set of relations to join (e.g., ABCD)
For ∂ in {1...|R|}:
  for S in {all length-∂ subsets of R}:
    optjoin(S) = a join (S-a),
      where a is the single relation that minimizes:
        cost(optjoin(S-a)) +
        min. cost to join (S-a) to a +
        min. access cost for a
optjoin(S-a) is cached from the previous iteration
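A minimal Python sketch of this left-deep dynamic program; access_cost and join_cost stand in for the cost formulas above and are supplied by the caller:

from itertools import combinations

def optjoin(relations, access_cost, join_cost):
    # best[S] = (cost, left-deep plan) for each subset S of the relations
    best = {frozenset([r]): (access_cost(r), r) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in combinations(relations, size):
            S = frozenset(subset)
            candidates = []
            for a in S:                              # try each relation as the last one joined
                rest = S - {a}
                rest_cost, rest_plan = best[rest]    # cached from the previous iteration
                cost = rest_cost + join_cost(rest_plan, a) + access_cost(a)
                candidates.append((cost, (rest_plan, a)))
            best[S] = min(candidates)
    return best[frozenset(relations)]

# e.g. optjoin("ABCD", access_cost=lambda r: 100, join_cost=lambda plan, r: 10)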

Cache / Example: optjoin(ABCD) -- assume all joins are NL

  Subplan | Best choice | Cost
  A       | index       | 100
  B       | seq scan    | 50
  ...

∂=1:
  A = best way to access A (e.g., sequential scan, or predicate pushdown into index...)
  B = best way to access B
  C = best way to access C
  D = best way to access D
Total cost computations: choose(N,1), where N is the number of relations

Cache / Example: optjoin(ABCD)

  Subplan | Best choice | Cost
  A       | index       | 100
  B       | seq scan    | 50
  ...     |             |
  {A,B}   | BA          | 156
  {B,C}   | BC          | 98
  ...     |             |

∂=2:
  {A,B} = AB or BA (using the previously computed best way to access A and B)
  {B,C} = BC or CB
  {C,D} = CD or DC
  {A,C} = AC or CA
  {A,D} = AD or DA
  {B,D} = BD or DB
Total cost computations: choose(N,2) x 2

Cache / Example: optjoin(ABCD) -- already computed entries are looked up in the cache

  Subplan | Best choice | Cost
  A       | index       | 100
  B       | seq scan    | 50
  {A,B}   | BA          | 156
  {B,C}   | BC          | 98
  ...     |             |
  {A,B,C} | BCA         | 125
  {B,C,D} | BCD         | 115
  ...     |             |

∂=3:
  {A,B,C} = remove A, compare A({B,C}) to ({B,C})A
            remove B, compare B({A,C}) to ({A,C})B
            remove C, compare C({A,B}) to ({A,B})C
  {A,B,D} = remove A, compare A({B,D}) to ({B,D})A, ...
  {A,C,D} = ...
  {B,C,D} = ...
Total cost computations: choose(N,3) x 3 x 2

Cache / Example: optjoin(ABCD) -- already computed entries are looked up in the cache

  Subplan   | Best choice | Cost
  A         | index       | 100
  B         | seq scan    | 50
  {A,B}     | BA          | 156
  {B,C}     | BC          | 98
  {A,B,C}   | BCA         | 125
  ...       |             |
  {B,C,D}   | BCD         | 115
  {A,B,C,D} | ABCD        | 215

∂=4:
  {A,B,C,D} = remove A, compare A({B,C,D}) to ({B,C,D})A
              remove B, compare B({A,C,D}) to ({A,C,D})B
              remove C, compare C({A,B,D}) to ({A,B,D})C
              remove D, compare D({A,B,C}) to ({A,B,C})D

The final answer is the plan with the minimum cost of these four.
Total cost computations: choose(N,4) x 4 x 2

Complexity
choose(n,1) + choose(n,2) + ... + choose(n,n) total subsets considered
All subsets of a size-n set = power set of n = 2^n
Equivalent to computing all binary strings of size n: 000, 001, 010, 100, 011, 101, 110, 111
Each bit represents whether an item is in or out of the set

Complexity (continued)
For each subset:
  k ways to remove 1 join, k < n
  m ways to join 1 relation with the remainder
Total cost: O(n m 2^n) plan evaluations
n = 20, m = 2: 4.1 x 10^7

Interesting Orders
• Some queries need data in sorted order
  - Some plans produce sorted data (e.g., using an index scan or merge join)
• May be a non-optimal way to join the data, but the overall optimal plan
  - Avoids the final sort
• In the cache, maintain the best overall plan, plus the best plan for each interesting order
• At the end, compare the cost of (best plan + sort into order) to the best in-order plan
• Increases complexity by a factor of k+1, where k is the number of interesting orders

Example
SELECT A.f3, B.f2 FROM A, B WHERE A.f3 = B.f4 ORDER BY A.f3

  Subplan | Best choice | Cost | Best in A.f3 order | Cost
  A       | index       | 100  | index              | 100
  B       | seq scan    | 50   | seq scan           | 50
  {A,B}   | BA hash     | 156  | AB merge           | 180

compare: cost(sort(output)) + 156 to 180


Transactions

Model:
  Begin xact
    Sql-1
    Sql-2
    . . .
    Sql-n
  commit or abort

Concurrency control (Isolation)
Crash recovery (Atomic, Durable)

Example: move $100 from acct-A to acct-B
  Atomic: all or nothing
  Durable: once done, it stays done
  Isolation: produces the "right" answer in a concurrent world
  Consistent: acct cannot go below zero
Consistent deals with integrity constraints, which we are not going to talk about.

Concurrency control first. Consider:

  Bill    shoe  10K
  Sam     shoe  15K
  George  toy   12K
  Hugh    toy    8K
  Fred    shoe  14K

T1: Give a 10% raise to everybody in the shoe dept.
T2: Move George and Hugh into the shoe dept.

Easy to create a parallel schedule where one receives the raise and the other does not. Definition: want an outcome which is the same as doing T1 first and then T2, or vice versa. This is serializability. The gold standard is to ensure serializability for any commands and any amount of parallelism.

Gold standard mechanism:

Divide the data base into granules (bits, bytes, records, …)

Lock everything you touch.

Ensure that locking is 2-phase, i.e. a growing phase and then a shrinking phase.

(In practice grow phase is whole xact, shrink phase is at commit.)

Deep theorem: 2-phase locking => serializability

Easy generalization to Share (read) locks and exclusive (write) locks

Therefore, lock everything you touch, hold all locks to EOT.

Generally called two phase locking, or dynamic locking. Used by ALL major DBMSs.
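A minimal sketch of such a lock manager in Python -- shared/exclusive locks, all released together at end of transaction (class and method names are illustrative, and waiting is reduced to a boolean return):

# Strict two-phase locking: locks are only acquired during the transaction
# and all are released together at commit/abort (the shrink phase).
S, X = "S", "X"

class LockManager:
    def __init__(self):
        self.locks = {}                      # granule -> list of (xact, mode)

    def _compatible(self, granule, xact, mode):
        for holder, held in self.locks.get(granule, []):
            if holder == xact:
                continue
            if mode == X or held == X:       # S/S is the only compatible pair
                return False
        return True

    def acquire(self, xact, granule, mode):
        if not self._compatible(granule, xact, mode):
            return False                     # caller must wait (or deadlock-check)
        self.locks.setdefault(granule, []).append((xact, mode))
        return True

    def release_all(self, xact):             # called at EOT only
        for g in list(self.locks):
            self.locks[g] = [(h, m) for h, m in self.locks[g] if h != xact]

lm = LockManager()
lm.acquire("T1", "emp:Bill", X)              # T1 updates Bill's record
print(lm.acquire("T2", "emp:Bill", S))       # False: T2 must wait until T1 commits
lm.release_all("T1")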

Devil is in the details:

How big a granule to lock? Records (page level locking gets hammered)

However, what to do with

select avg (salary)

from emp

don’t want to set 10**7 locks.

Answer lock escalation to table level. If you set too many record locks in a table, then

trade it in for a table lock.

What to do if you can’t get a requested lock? Wait

What about deadlock? Can happen. Periodically look for a cycle in the “waits for”

graph. Pick a victim (one who has done less work) and kill him. I.e. run crash recovery

on his transaction (to be discussed next time).

Alternative: time-out and kill yourself.
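A minimal sketch of the periodic waits-for check described above, assuming the lock manager can produce the graph as a dictionary from each transaction to the set of transactions it waits for (illustrative):

# Detect a cycle in the waits-for graph with a depth-first search; any
# transaction on a cycle is a candidate victim to abort.
def find_cycle(waits_for):
    visited, on_stack = set(), []
    def dfs(t):
        if t in on_stack:
            return on_stack[on_stack.index(t):]      # the cycle
        if t in visited:
            return None
        visited.add(t)
        on_stack.append(t)
        for u in waits_for.get(t, ()):
            cycle = dfs(u)
            if cycle:
                return cycle
        on_stack.pop()
        return None
    for t in waits_for:
        cycle = dfs(t)
        if cycle:
            return cycle
    return None

print(find_cycle({"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}))   # ['T1', 'T2', 'T3']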

Possible to starve. Repeated execution, retry cycle….

Doesn’t happen in practice. In a well designed OLTP system, the average Xact does not

wait. If not true, then redesign app to make it true. Rule of thumb: probability of

waiting is .01 or less.

To avoid deadlock possibility:

All at once lock request (silly in practice)

Order all locks and request in order (silly)

Pre-emption (murder – not really done in practice)

What about auxiliary structures?
  Lock table: must use latches, semaphores, etc. to serialize
  Buffer pool: ditto
  System catalogs: (table table -- drop table does a delete in the table table; other xacts read a cached version -- generally finessed. Often the #blocks is in the table table. If updated, then do it with latches, not locks, ...)
  B-trees: too expensive to latch the whole tree. Hence, latch single blocks on access. Go to a descendent block, latch it, release the one above (called latch crabbing). Get to the page to update, and latch it for the update. Lehman & Yao have a scheme to avoid the latches (red book paper) -- at the expense of noticeable complexity.

Halloween problem (urban myth that the System R guys discovered this on Halloween). Aka the phantom problem.

T1: begin xact
      update emp set salary = 1.1 * salary
      where dept = 'shoe'
    end xact

T2: begin xact
      insert into emp values ('George', 20000, 'shoe')
      insert into emp values ('Hugh', 30000, 'shoe')
    end xact

At the beginning of the xacts, there are 3 shoe dept employees: Bill, Sam, and Fred. Suppose the emp table is sorted in storage on name, and holes are left for inserts. Suppose the query plan for T1 does a scan of emp. Consider the following sequence of operations:

T1: update Bill
T1: update Sam
T2: insert George
T2: insert Hugh
T2 commits and releases all locks

T1: update Hugh (!!!!!)
T1: update Fred
T1 commits

Both xacts obey the locking protocol -- but the result is not serializable. The issue: you lock the things you touch -- but you must also guarantee the non-existence of any new ones!!!

How? Predicate locking (tried -- doesn't work well -- need a theorem prover -- also, conflicts are data dependent). Or range locks in a B-tree index (assumes an index on dept). Otherwise, a table lock on the data.

Escrow transactions

Begin Xact
  Update flights set seats = seats - 1 where flight = 234
  . . .
End xact

Locks flight 234 for the duration of the transaction. Nobody else can get a seat. Two transactions can go on in parallel as long as they perform only increment and decrement operations. The xacts commute, so everything is ok. However, if a xact aborts, you have to be careful with the recovery logic. Forward pointer to the Aries discussion.


Consider

Begin Xact

Select avg (sal) from emp

End xact

Begin xact

Update emp set …

End xact

Lock escalation will occur;

Either reader will starve (without scheduling help)

Or

Writers will starve (while reader is doing a long query)

Both undesirable…… So what to do:

1) nearly always done – run analytics on a companion data warehouse

2) take the stall (rarely done)

3) run less than serializability

Hence:
  Degree 0: (read uncommitted) no locks (guarantees nothing). Can read uncommitted data.
  Degree 1: (read committed) X locks held till EOT; R locks obtained only temporarily. (Can only read committed data; however, subsequent reads of the same thing can produce different answers.)
  Degree 2: (repeatable read) X, S locks held till EOT -- serializable unless you squint.
  Degree 3: (serializable) S and X locks held till EOT, plus solve the phantom problem. Really do the right thing.
Choose what you want to pay for. Degree 0 solves the big-read problem -- at the expense of getting the wrong answer.

4) use a multi-version system
Each read is given a time stamp (TS). A write installs a new timestamp and keeps the old data for a while. A read is given the biggest TS less than his, i.e. the read is "as of a time".

Reads set no locks. However, you get a historical (consistent) answer. After a while you can garbage collect old values -- when they are no longer needed, i.e. there is no running xact older than the next guy in line.

Can be turned into a full concurrency-control system: MVCC. Give every xact a timestamp.

Have a read TS and a write TS for every “granule”

Read-only xact: get a TS.. Read whatever you want. If multiple versions, then read the

one written just earlier than your time stamp. If reading the current version, install your

TS if greater than the one that is there.

Update xact: given a TS. Read whatever you want, installing a read TS as above.

Write a new value with your timestamp, keeping the old value, as above. Do this only if

your timestamp greater than both ones there. Otherwise, commit suicide.
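A minimal sketch of these read/write timestamp rules for a single granule (the Granule class and Abort exception are illustrative, and commit and garbage collection are omitted):

# Timestamp-ordering MVCC for one granule: versions carry the writer's TS,
# and the current version also tracks the largest reader TS seen so far.
class Abort(Exception):
    pass

class Granule:
    def __init__(self, value):
        self.versions = [(0, value)]     # (write_ts, value), oldest first
        self.read_ts = 0                 # read TS installed on the current version

    def read(self, ts):
        # read the version written just earlier than your timestamp
        wts, value = max((v for v in self.versions if v[0] <= ts),
                         key=lambda v: v[0])
        if wts == self.versions[-1][0] and ts > self.read_ts:
            self.read_ts = ts            # install your TS if it is greater
        return value

    def write(self, ts, value):
        # only allowed if your TS is greater than both TSs on the current version
        if ts <= self.read_ts or ts <= self.versions[-1][0]:
            raise Abort("commit suicide and restart with a new TS")
        self.versions.append((ts, value))   # keep the old value for older readers

g = Granule("v0")
g.write(ts=10, value="v1")
print(g.read(ts=5))    # sees "v0", the version written just earlier than TS 5
print(g.read(ts=20))   # sees "v1"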

******

Locking is pessimistic -- i.e. ensure there is no conflict by assuming the worst case. Other approaches to concurrency control are more aggressive:

Optimistic concurrency control (Kung and Robinson -- late '70s)
Run the transaction to completion. Check at the end if there was a problem.
3 phases: read/write, validate, commit
  Read/write: do the logic normally. Any write goes to a private copy (think of it as an update list). Keep track of the read-set and write-set, R(Ti) and W(Ti).
At the end, enter the validate phase:

For all xacts Tj which committed after I started:
  check that W(Tj) intersect R(Ti) is empty; if the check fails, abort (and restart); if it succeeds for all such Tj, enter the commit phase.
Commit: install the updates.
Issues: one validator at a time! One committer at a time! Bottleneck.
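A minimal sketch of that validation test (the transaction-record fields are assumptions):

# Backward validation: Ti is safe to commit if no transaction that committed
# after Ti started wrote anything Ti read.
def validate(Ti, committed):
    """Ti: dict with 'start_ts' and 'read_set'.
       committed: list of dicts with 'commit_ts' and 'write_set'."""
    for Tj in committed:
        if Tj["commit_ts"] > Ti["start_ts"] and Tj["write_set"] & Ti["read_set"]:
            return False           # conflict: abort and restart Ti
    return True                    # enter the commit phase, install updates

Ti = {"start_ts": 5, "read_set": {"A", "B"}}
history = [{"commit_ts": 7, "write_set": {"B"}}]
print(validate(Ti, history))       # False: B was written after Ti started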

Issues: lose all the work on an abort.
Issues: starvation (cyclic restart).
Issues: a bit pessimistic -- possible to restart when there is not actually a conflict.

***********************************************************

So which one wins?

Several simulation studies in the 80’s. Most have Mike Carey as an author or co-author.

Variables:

Prob (contention)

# concurrent xacts

Resources available (disks, CPU)

Locking wins, except in corner cases. If no conflict, then nobody waits and it is a wash If lots of conflicts, then locking wastes less work

*****

Modern day OLTP:

Main memory problem

No-disk stalls in a Xact

Do not allow user-stalls in a Xact (Aunt Millie will go out for lunch)

--- hence no stalls.

A heavy Xact is 200 record touches – less than 1 Msec.

Why not run Xact to completion – single threaded! No latches, no issues, no nothing.

Basically TS order !!!

Problem: multiprocessor support – we will come back to this.



Ok to ask for Xact classes in advance (no ad-hoc updates in an OLTP system).

Look at the xacts classes…..

They might commute: if so run with no locking

They might never conflict – if so run with no locking.

Might be only two classes that conflict (Ti and Tj). Run everybody else with no controls.

Serialize Ti and Tj (with timestamp techniques or something else)

If a transaction is alive for nanoseconds (processor transactional memory) or

microseconds (modern OLTP), then interesting to rerun Carey simulations (which

assumed disk not main memory).

Contracts/Sagas

Vacation in San Diego:
  T1: get a plane ticket
  T2: get a hotel
  T3: get a rental car
  T4: tickets to the San Diego zoo
Oops -- get sick -- can't go. Want to "unwind" the whole "workflow". Want something bigger than a xact which can be reversed. Notion of Sagas and Contracts. Need compensation actions, which will reverse a xact. Can't abort after a commit.

Crash recovery

Never lose my data ever. Surest recipe to get fired on the spot.

Scenarios:

1) transaction aborts (back him out)

2) transaction deadlocks, and is picked as a victim (ditto)

3) transaction violates an integrity constraint (ditto)

OS fails (rule of thumb: MVS crashes once a year, Linux once a month, Windows once a week or more) -- reload the OS, reload the DBMS, undo losers, redo winners.

DBMS fails:
  Bohrbugs (repeatable). These are knocked out quickly by a good QA process. If you are buying a DBMS, get clear on how serious the vendor is about QA (typically not very) -- don't run a DBMS in production until it is "mature" -- like 1-2 years after release.
  Heisenbugs (not repeatable): timing problems, race conditions, ... Unbelievably hard to find. Usually put engineers on airplanes.

Disk crash: modern disks fail every 5 years or so. Usually you start to see disk errors first (redo reads or writes in advance). In any case, take a dump periodically, weekly full with daily partials. Must roll forward from the last partial, redoing history using a log.
Bad, bad, bad, bad: unrecoverable failures (corrupted the log) -- "up the creek"
App fails: not an issue in this class
Comm. failures: we will come back to these when we deal with multi-processor issues
Disaster -- machine room fails (fire, flood, earthquake, 9/11, ...)

1970's solution to disasters:
  Write a log (journal) of changes

Spool the log to tape

Hire Iron Mountain to put the tapes under the mountain
Buy IBM hardware -- they were heroic in getting you back up in small numbers of days

1980's solution:
  Hire Comdisco to put the tapes at their machine room in Atlanta
  Send your system programmers to Atlanta, restore the log tapes, divert comm to Atlanta
  Back up in small numbers of hours (the average CIO conducts a disaster drill more than once a year)

2000's solution (for some):
  Run a dedicated "hot standby" in Atlanta
  Fail over in seconds to a few minutes
  Driven by the plummeting cost of hardware and the increasing cost of downtime (thousands of dollars per minute)

In most cases, you tend to lose a few transactions. Too costly to lower the probability to zero. Write disgruntled users a check!!!

Disk-intact recovery:
  1) undo uncommitted transactions
  2) redo committed transactions
Write this stuff in a log. What goes in it depends on buffer pool tactics.
  The buffer pool implements "steal" -- it can write to disk a dirty page of an uncommitted xact when it needs the slot for something else. All systems do this. Requires the before image in a log to perform undo.
  The buffer pool implements "no force" -- it does not require dirty blocks to be forced at commit time; that takes too long. Everybody does this. Requires the after image to perform redo.
Hence write (before image, after image) in a log. Must write the log before writing the data. Otherwise screwed. Hence WAL.

Options for the log:
  Can use a physical log. Whenever the bits change, write a log record.

Insert could cause logging of 4K worth of data

Can use a logical log – record the SQL command (nobody does this – too slow -- and

won’t work for undo)

Can use something in between – e.g. insert (record) on page-X

Can log B-tree updates

Physical: means 8K bytes for a page splitter

Logical: do nothing – side effect of SQL

In between: insert (key, block – X)

Most of these have been used at one time or another.


WAL recovery

Options for the log:
  Can use a physical log. Whenever the bits change, write a log record. An insert could cause logging of 4K worth of data.
  Can use a logical log -- record the SQL command (nobody does this -- too slow -- and

won’t work for undo)

Can use something in between – e.g. insert (record) on page-X. We will assume for now

(insert, page#, slot#, bytes)

(delete, P#, slot#, bytes)

update, p#, slot#, old bytes, new bytes)

Can log B-tree updates

Physical: means 8K bytes for a page splitter

Logical: do nothing – side effect of SQL

In between: insert (key, block#)

We will assume for now – no logging of B-trees

One simple scheme.

Periodically take a checkpoint:
  Force all dirty blocks to disk
  Write a log record containing a list of active (uncommitted) xacts
  Do not log any information on B-trees

Logical (in between) data log -- per above

On a crash.

Start at the end of the log.

Look for commit and abort records; keep a list of finished xacts. For any log record that corresponds to an uncommitted or aborted xact, perform undo by logically undoing the operation (but only if the after image matches the bytes on the page), modifying any affected B-trees by searching the B-tree for the correct index key and fixing it, if necessary.

When you reach a checkpoint, compare checkpoint list of active xacts, with commit/abort

of finishers list. If checkpoint list in commit/abort list, then “far enough”. Otherwise,

keep going until you find such a checkpoint.

Turn around and go forward, redoing the effects of all committed xacts.

If you crash during recovery, do it all again.

Example done in class
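A minimal sketch of this scheme's undo-then-redo recovery pass, assuming a logical log represented as a list of records with lsn, xact, and kind fields (the record layout and the undo/redo helpers are illustrative):

# Log records (oldest to newest); kind is one of:
# 'update' (with page, before, after), 'commit', 'abort', 'checkpoint' (with active list).
def recover(log):
    finished, undone = set(), []
    # Backward pass: undo updates of xacts with no commit/abort record,
    # stopping at a checkpoint whose active list is covered by the finished set.
    for rec in reversed(log):
        if rec["kind"] in ("commit", "abort"):
            finished.add(rec["xact"])
        elif rec["kind"] == "update" and rec["xact"] not in finished:
            undo(rec)                          # logically reapply the before image
            undone.append(rec["lsn"])
        elif rec["kind"] == "checkpoint" and set(rec["active"]) <= finished:
            break                              # far enough
    # Forward pass: redo the effects of all committed xacts.
    committed = {r["xact"] for r in log if r["kind"] == "commit"}
    for rec in log:
        if rec["kind"] == "update" and rec["xact"] in committed:
            redo(rec)                          # logically reapply the after image
    return undone

def undo(rec): print("undo", rec["lsn"])       # placeholders for the logical operations
def redo(rec): print("redo", rec["lsn"])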

Problems:

  - have to force the buffer pool at a checkpoint -- expensive
  - all operations are logical; recovery may be slow -- have to do B-tree inserts and deletes
  - won't work for escrow xacts -- do example

Aries: more sophisticated, faster and way more complex.

We will simplify it somewhat – It is very complex.

Aries in a nutshell:

Cheap checkpoints

Physical redo

Logical undo

Redo-done first

Deals with escrow xacts (but we won’t)

Now the details

Every log record has a LSN (sequence#)

Dirty page table – block#, current-LSN (dirtiers log record)

Xact table (1st-log record, last log record)

When you write a log record, you set the current in the xact table to it. You also store in

each log record the previous current one -- i.e. a one-way linked list of log records.

On a crash:

Find the dirty page table and the xact table.

Start at the min (LSN in dirty page table)

Physical redo to the end of the log -- data plus B-trees – all xacts – whether committed or

not – brings data base to the state at the time of the crash

Look at xact table. Find max LSN. Get that record and undo it. Replace current in xact

table by it. Go backwards to the end – doing logical undo.

Crash during recovery – do it again.

CLRs are used for escrow xacts – cannot be undone multiple times.
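A simplified sketch of this redo-then-undo pass, assuming the dirty page table and xact table have already been recovered from the log (the record layout and helper functions are illustrative, and CLR writing is only noted in a comment):

# ARIES-style recovery, simplified: physical redo of everything from the
# earliest LSN in the dirty page table, then logical undo of loser xacts
# by following each xact's backward chain of log records.
def aries_recover(log_by_lsn, dirty_page_table, xact_table):
    # 1. Redo: repeat history (data pages plus B-trees, committed or not).
    start = min(dirty_page_table.values())          # min recLSN
    for lsn in sorted(l for l in log_by_lsn if l >= start):
        rec = log_by_lsn[lsn]
        if rec["kind"] == "update":
            physical_redo(rec)
    # 2. Undo: walk each loser xact backwards via prev-LSN pointers,
    #    always undoing the record with the largest LSN next.
    losers = {x: info["last_lsn"] for x, info in xact_table.items()
              if not info["committed"]}
    while losers:
        xact = max(losers, key=losers.get)
        rec = log_by_lsn[losers[xact]]
        if rec["kind"] == "update":
            logical_undo(rec)                       # would also write a CLR here
        if rec["prev_lsn"] is None:
            del losers[xact]                        # this xact is fully undone
        else:
            losers[xact] = rec["prev_lsn"]

def physical_redo(rec): print("redo", rec["lsn"])   # placeholders
def logical_undo(rec):  print("undo", rec["lsn"])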


Aries issues:

Crash during recovery – do it again.

CLRs are used for escrow xacts – cannot be undone multiple times. Do an example

Group commit – why required

Application errors: roll forward to a specific point in time, then undo backward – just not

to the present.

HA: standard wisdom; active-passive. Roll log forward at passive site. Failover, by

recovering. In flight transactions get aborted; not exactly HA. I.e. failover in seconds.

Active-active: 2 active sites, each does all xacts. No log. One is primary – other is

secondary. If primary crashes, then keep going from secondary. Best with a stored-

procedure interface.

To go fast: H-store data pie.

Buffer pool

Locking Threading WAL See times-10

See NoSQL.

Solution: main memory, one-xact at a time, single thread, no log – failover to a backup –

active-active.

Draw H-store picture.

Yabut; multicore

Yabut: multi-shard xacts – spec X

What about network partitions:

Primary can’t talk to secondary. Both up. Either:

Primary continues, secondary blocks (less availability)

Or

Both continue – no consisitency

Give up one or the other. Brewer has a CAP theorem – says you can’t have all 3.

Application errors, human errors, resource issues, [run out of mem, run out of disk, run

out of …] install new software, reprovision – these dwarf network partitions.

Byzantine failures.

Have to have 3 replicas and voting. Nobody worries about this – except theoreticians


6.830 2010 Lecture 15: C-Store (Sam Madden)

Why are we reading this paper?
  C-Store has a standard interface, but a very different design
  Helps us understand what choices standard DBs made
  Think about a different set of apps than OLTP
Paper status:
  Most individual techniques already existed; C-Store pulls them together
  Not just a limited special-purpose DB
    transparent -- SQL interface

read/write, not just r/o

consistency

transactional update

Paper doesn't describe complete system

design + partial implementation

Commercialized as Vertica

What's a data warehouse?
  big historical collection of data
  companies analyze it to spot trends &c
    what products are in? what's out? where is cherry coke popular? spot it early, order more

mostly r/o but must be updated, maybe continuously

Example: big chain of stores, e.g. Walmart
  each store records every sale
  uploads to a central DB at headquarters
  keep the last few years
Typical schema (logical):
  time(tid,year,mon,day,hour)

product(pid,type,color,supplier)

xact(tid,pid,sid,cid,price,discount,tax,coupon,&c)
store(sid,state,mgr,size)

customer(cid,city,zip)

called a "star schema" "fact table" in the middle

gigantic # of rows

might have 100s of columns

"dimension tables" typically much smaller How big is the data? 50K products (5 MB) 3K stores (1 MB) 5M customers (5 GB) 150K times (5 MB) (10 minute granularity) 350B xact rows (35 TB) (100 bytes/sale) 3000 stores * 10 registers * 20 items/min * 2 years example 1: total sales by store in Texas on Mondays join xact to time and store filter by day and state group by sid, aggregate example 2: average daily sales for Nikon cameras join xact to product, time filter by supplier group by day, aggregate

How long would queries take on traditional DB? probably have to look at every page of fact table

even if only 1% of records pass filter

means every block might have one relevant record

so index into fact table may not be very useful

joins to dimension tables pretty cheap

they fit in memory, fast hash lookups

how long to read the whole fact table?

35 TB, say 100 disks, 50 MB/sec/disk => 2 hours

ouch!

You can imagine building special setups e.g. maintain aggregates in real time -- pre-compute

know all the queries in advance

update aggregate answers as new data arrives

table of daily sales of Nikon cameras, &c

but then hard to run "ad-hoc" queries

C-Store

Why columns? Why store each column separately?
  avoid reading bytes from the fact table you don't need
Why "projections" of columns?
  you usually want more than one column, e.g. sid and price for example 1
Why is a projection sorted on one of the columns?
  to help aggregation: bring all data for a given store together
  or to help filtering, by bringing all data w/ a given col value together
    so you only have to read an interval of the column
What projection would help example 1?
  columns: sid, price, store.state, time.day
  note we are including columns from multiple logical tables
  note we are duplicating a lot of data, e.g. store.state
  note a projection must have every column you need -- can't consult the "original" row
    thus you don't need a notion of tupleID

note i'th row in each column comes from same xact row

order?

sid

state, sid

Why multiple overlapping projections? Why store the same column multiple times?
What projection would help example 2?
  columns: price, time.year, time.mon, time.day, product.supplier
  note we are not including join columns! e.g. pid
  order?

supplier, year, mon, day

year, mon, day, supplier

What if there isn't an appropriate projection for your query?
  You lose -> wait 2 hours
  Ask the DB administrator to add a projection
Could we get the same effect in a conventional DB?
  Keep heap files sorted ("clustered")? can only do it one way
  B+Trees for order and filtering?

have to avoid seeks into main heap file, so multi-key B+trees

copy data into many tables, one per projection

So yes, we could

But very manual

choose right table for each query

updating?

"materialized views" partially automates this for conventional DB

and Eval in Section 9 shows they make row store perform 10x better

but c-store still faster

Won't all this burn up huge quantities of disk space? How do they compress?
  Why does self-order vs foreign-order matter in Section 3.1?
  How to compress our example projections? (see the run-length-encoding sketch below)
    sid ordered by sid? price ordered by sid? store.state ordered by sid? time.day ordered by sid?
Won't it be slow to update if there are lots of copies?
  How does C-Store update efficiently?
  How does C-Store run consistent r/o queries despite updates?
Why segment across a cluster of servers?
  Parallel speedup: many disks, more memory, many CPUs
  How do they ensure good parallel speedup on a cluster?
  What is a "horizontal partition"? Why will that lead to good parallel speedup?
  Sorting allows filtering and aggregating to proceed in parallel
  (will talk about parallel DBs more later)
Evaluation? Section 9
  what are the main claims that need to be substantiated?

faster on data warehouse queries than a traditional row store

uses a reasonable amount of space
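A minimal sketch of the kind of compression a self-ordered column allows -- run-length encoding a sorted column into (value, start position, run length) triples (illustrative; the paper's actual encodings depend on the column type):

# Run-length encode a self-ordered (sorted) column; long runs of equal
# values collapse to one triple each.
def rle_encode(column):
    runs, start = [], 0
    for i in range(1, len(column) + 1):
        if i == len(column) or column[i] != column[start]:
            runs.append((column[start], start, i - start))
            start = i
    return runs

def rle_decode(runs):
    return [value for value, _, count in runs for _ in range(count)]

sid = [1, 1, 1, 2, 2, 3, 3, 3, 3]            # sid column, ordered by sid
print(rle_encode(sid))                        # [(1, 0, 3), (2, 3, 2), (3, 5, 4)]
assert rle_decode(rle_encode(sid)) == sid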

Experimental setup
  standard data-warehouse benchmark, TPC-H
  single machine, one disk, 2 GB RAM
  this is a little odd -- the original data is also about 2 GB
    a small reduction in memory requirement could give a huge boost in this setup
    but make no difference for larger data sets
TPC-H scale_10
  standard data warehouse benchmark; comes in different sizes ("scale") that define how many rows are in each table

customer: 1.5 M rows, abt 15 MB

orders: 15 M rows, abt 150 MB

lineitem: 60 M rows, abt 2.4 GB

results are spectacular! mostly > 100x faster than row store

Q4 is 400x faster on C-Store -- why?
  print o_orderdate, l_shipdate; group by o_orderdate; filter on l_orderkey, o_orderkey, o_orderdate
  must be using D2: o_orderdate, l_shipdate, l_suppkey | o_orderdate, l_suppkey
  D2 is missing o_orderkey and l_orderkey -- do we need them?

D2 already in good order to aggregate by o_orderdate

how much data is c-store scanning?

two columns with 60 M rows

o_orderdate probably compressed down to a bit or byte

l_shipdate might be 4 bytes

so 300 MB?

read from disk in 6 seconds; read from RAM in 0.3 seconds
actual performance is in between: 2 seconds
  maybe skipping due to o_orderdate > D? maybe some in mem, some on disk?
what is the row DB probably doing for 723 seconds?

would have to scan 2 GB LINEITEM table

if doesn't fit in RAM, 40 seconds at 50 MB/sec from disk

must join to ORDERS table, fits in memory, should be fast hash

then sort (or something) by o_orderdate

hard to guess why row DB takes 723 rather than 40+ seconds

Q2 is only 3x faster w/ C-Store
  needs l_suppkey, l_shipdate; filter by l_shipdate; group by l_suppkey
  probably uses D1: l* | l_shipdate, l_suppkey
  D1 lets C-Store only look at l_shipdate = D, needn't scan most of LINEITEM
  D1 sorted well for aggregation
  what would the row DB do? maybe it has a B+tree also keyed by l_shipdate, l_suppkey?

does not need to scan or seek into LINEITEM

They win by keeping multiple copies, tailored to different queries.
How much storage penalty for the queries in the Eval? Actually LESS storage! 2 GB vs 4.5 GB.
  Uncompressed data was also about 2 GB. Would be more for more queries.


6.830 2010 Lecture 16: Two-Phase Commit

last time we were talking about parallel DBs
  partitioned data across multiple servers
  we mostly discussed read-only queries; what about read/write queries?

high-level model
  a bunch of servers; rows are partitioned over the servers
  each server runs a complete DB on its partition: SQL, locks, logging
  an external client connects to one server: the Transaction Coordinator (TC)
    sends it commands for the whole system
    the TC farms them out to the correct "subordinate" server
    the TC collects results, returns them to the client
  TC and servers exchange messages over a LAN

example transaction:
  begin
  SELECT A ...
  SELECT B ...
  UPDATE A ...
  UPDATE B ...
  commit

diagram: A on S1, B on S2, client connects to S3
  S3 sends SELECT A to S1, gets result
  S3 sends SELECT B to S2, gets result
  S3 sends UPDATE A to S1
  S3 sends UPDATE B to S2
  S3 sends "transaction completed" reply to client

but wait, this is not enough! what about locking?
  each r/w acquires a lock on the server w/ the data
  so: S1/A/S, S2/B/S, S1/A/X, S2/B/X
  when should the system release the locks?
  remember we want strict two-phase locking for serializability and no cascading aborts

so can't release until after commit

so there must be at least one more message:

TC tells S1,S2 that the transaction is over can we get deadlock? yes, for example if we run two of this transaction in general a subordinate could block+deadlock at any read or write let's assume a global deadlock detector which notifies a transaction that it must abort

so a subordinate might be aborted at any step

more generally, a subordinate can fail for a number of reasons:
  deadlock
  integrity check (e.g. insert but primary key not unique)
  crash
  network failure
what if one subordinate fails before completing its update, and the other subordinate didn't fail?
  we want atomic transactions!

so the TC must detect this situation and tell the other subordinate to abort+UNDO
we need an "atomic commitment" protocol: all subordinates complete their tasks, or none
Two-Phase Commit is the standard atomic commitment protocol

2PC message flow for ordinary operation:

  Client          TC                      Subordinate
  -------------->
                  -- SQL cmds --------->  acquire locks
                                          if update, append to log, update blocks
                                          check deadlock, integrity, &c
                  -- PREPARE ---------->  [log prepare or abort]
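A minimal sketch of the coordinator side of this protocol under the usual 2PC rules -- collect votes after PREPARE, commit only if every subordinate votes yes (the subordinate interface and log list are assumptions):

# Two-phase commit, coordinator side. Each subordinate exposes
# prepare(xid) -> "yes"/"no", commit(xid), and abort(xid).
def two_phase_commit(xid, subordinates, log):
    # Phase 1: ask every subordinate to prepare (force-write its prepare record).
    votes = []
    for sub in subordinates:
        try:
            votes.append(sub.prepare(xid))
        except Exception:                 # crash or network failure counts as "no"
            votes.append("no")
    # Phase 2: the TC's commit/abort log record is the commit point.
    if all(v == "yes" for v in votes):
        log.append(("commit", xid))
        for sub in subordinates:
            sub.commit(xid)               # subordinates release locks after this
        return "committed"
    else:
        log.append(("abort", xid))
        for sub in subordinates:
            sub.abort(xid)
        return "aborted"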

names of GFS files that store the tablet (sec 5.3)

what tablet server serves it (guessing, paper doesn't say)

what properties of Chubby are important?
why a master AND Chubby? most systems integrate them; separation means Chubby can be reused
  Chubby is a generic fault-tolerant file and lock server
Chubby does three things for BigTable:
  stores the root of the METADATA table in a file

maintains master lock, so there's at most one master

tracks which tablet servers are alive (via locks)

key properties:

Chubby replicates METADATA and locks

Chubby keeps going even if one (two?) Chubby servers down

Chubby won't disagree with itself

example: network partition

you update Chubby replica in one partition

Chubby replica in other partition will *not* show stale data

what is the point of the master? after all, the METADATA is all in Chubby and GFS
  answer: there had better be only one entity assigning tablets to servers
  only the master writes METADATA

chubby locking ensures there's at most one master

even during network partitions

why isn't Chubby a bottleneck?
  clients cache METADATA; METADATA doesn't change often
  a tablet server will tell the client if it is talking to the wrong server

read/write processing inside a tablet server
  similar to C-Store: log for fast writes, SSTables for fast lookups
  [diagram: log in GFS, memtable, SSTables in GFS]
  SSTables in GFS
    compact ordered row/family/col/time data

compressed

index at the end

immutable -- why not mutable b+tree?

fast search, compact, compression, GFS not good at rand write

log in GFS

compaction
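A minimal sketch of this log + memtable + SSTable write/read path (class and method names are illustrative; compaction and compression are omitted):

# Log + memtable + immutable SSTables, in the spirit of the tablet server above.
class Tablet:
    def __init__(self, flush_threshold=4):
        self.log = []                 # would live in GFS for durability
        self.memtable = {}            # recent writes, kept sorted on flush
        self.sstables = []            # immutable sorted tables, newest last
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.log.append((key, value))            # durable first (WAL)
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}                   # fresh memtable; log can be trimmed

    def read(self, key):
        if key in self.memtable:                 # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):    # then SSTables, newest to oldest
            if key in table:
                return table[key]
        return None

t = Tablet()
for i in range(5):
    t.write(f"row{i}", i)
print(t.read("row0"), t.read("row4"))            # one from an SSTable, one from the memtable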

recovery from tablet server crashes
  key problem:

what if it was in the middle of some update when it crashed?

do we need to wait for it to reboot and recover from its log?

chubby notices server is dead (stops refreshing its lock)

and/or master notices it is dead?

even if tablet server is live but partitioned,

it won't be able to refresh its lock if Chubby thinks it is dead

so table server will know to stop serving

if master sees tablet server no longer has its lock:

picks another tablet server (preferably lightly loaded one)

tells it: "load that tablet from GFS"

  the new tablet server reads the crashed server's log from GFS!

recovery from BigTable master crashes
  Chubby takes away its lock
  some other machine(s) decide to be master; only one gets the Chubby lock
  recreate the old master's state:

read set of tablets from METADATA

ask Chubby for list of live tablet servers

ask tablet servers what they serve

Evaluation setup
  1700 GFS servers, N tablet servers, N clients, all using the same set of machines
  two-level LAN with gig-e
  each row has 1000 bytes

single-tablet-server random read (first row, first column of Figure 6)
  a single client reads random rows
  how can one server do 1212 random reads/second? you can't seek 1212 times per second!

answer: only 1 GB of data, split up over maybe 16 GFS servers

so all the data is in the GFS Linux kernel file cache

so why only 1212, if in memory?

that's only 1 megabyte/second!

each row read reads 64KB from GFS

78 MB / second, about all gig-e or TCP can do

single-tablet-server random write
  a single client writes random rows; traditionally a hard workload
  how could it write 8850 per second? each write must go to disk (the log, on GFS) for durability

log is probably in one GFS chunk (one triple of servers)

you cannot seek or rotate 8850 times per second!

presumably batching many log file writes, group commit

does that mean BigTable says "yes" to the client before the data is durable?

what about scaling? read across a row in Figure 6
  the per-server numbers go down
  so performance goes up w/ # tablet servers, but not linearly
  why not linear? the paper says load imbalance:

some BigTable servers have other stuff running on them

master doesn't hand out tablets 100% balanced

also network bottleneck, at least for random read

remember 64K xfer over LAN per 1000-byte row read

root of net only has about 100 gbit/second total

enough to keep only about 100 tablet servers busy

I like this paper's evaluation section: it shows good and bad aspects, explains the reasons for the results, and connects performance back to the design.


ORM

Problem: Impedance Mismatch (i.e., different languages for data and programming, need casting between types, makes analysis difficult)
Solution: Object-Relational Mapping middleware (provides a persistence abstraction for objects, and takes care of the transformation from/to the DB world)
"Everyone who is somebody has one! Either standard (e.g., Hibernate) or ad-hoc."
The idea is to provide:
  - pre-canned mapping between OO classes/fields and tables/columns
  - manually defined mappings
  - object persistency without looking at the DB
Good:
  - abstraction
  - ease of debug
Bad:
  - performance

Example: Hibernate

[Figure: Hibernate architecture -- application dialogs (Swing, SWT, web application) call application logic (Class Order, Class Customer); Hibernate maps an instance of the Customer class to a row of the Customer table in the database. Table Order: id, number, date, customer_id. Table Customer: id, firstname, lastname. Image by MIT OpenCourseWare.]

Example of Hibernate Mapping













Example of Hibernate Usage (many details are hidden)

Honey honey = new Honey();

honey.setName("forest honey");

honey.setTaste("very sweet");

… tx = session.beginTransaction();

session.save(honey);

tx.commit();

… tx = session.beginTransaction();

session.update(honey);

tx.commit();

… tx = session.beginTransaction();

List honeys = session.createQuery("select h from Honey as h").list();

tx.commit();

-----------------------------------------------------------------------------------------

Next we talk about DryadLINQ... it provides similar features but adds much more, in particular:
  - LINQ language integration
  - Batch-oriented
  - Cluster-oriented
  - More than SQL


DryadLINQ

11/24/09 -- Lecture 20

What is DryadLINQ? (A programming language and distributed execution framework for manipulating very large datasets spread across many computers.)
  LINQ -- programming language
  Dryad -- execution framework

What are the goals of Dryad?

Provide parallel database-style scalability in a more general purpose language.

Users write a single program, don't worry about how it is partitioned across many

machines.

Why not SQL?

Want to run arbitrary programs over data. Databases do allow users to write user

defined functions and create data types, but this can be awkward at times.

Want better integration with development framework and main (host) language.

Why not MapReduce? It has a very restricted communication pattern that often requires multiple map/reduce

phases to write complex tasks (e.g., joins.) No cross-phase optimization.

What types of programs are they trying to run?

They are not trying to provide transactions, and are focused more on chewing on a

bunch of big data files than applying a few small updates, etc.

What is the LINQ programming model?

SQL-like, except that it is embedded in C#, and can make use of C# functions. It is

compiled and type checked using C#.

Example:

Ex 1: using System;

using System.Linq;

using System.Collections.Generic;

class app {
    static void Main() {
        string[] names = { "Burke", "Connor", "Frank", "Everett",
                           "Albert", "George", "Harris", "David" };
        IEnumerable<string> query = from s in names
                                    where s.Length == 5
                                    orderby s
                                    select s.ToUpper();
        foreach (string item in query)
            Console.WriteLine(item);
    }
}

Ex 2: string[] names = { "Albert", "Burke", "Connor", "David", "Everett", "Frank", "George", "Harris"}; // group by length

var groups = names.GroupBy(s => s.Length, s => s[0]);

foreach (IGrouping group in groups) {

Console.WriteLine("Strings of length {0}", group.Key); foreach (char value in group) Console.WriteLine(" {0}", value); } This variation prints the following: Strings of length 6 A C G

H

Strings of length 5 B D F Strings of length 7 E

Ex 3: (In general, you can combine a sequence of operations together)

var adjustedScoreTriples =
    from d in scoreTriples
    join r in staticRank on d.docID equals r.key
    select new QueryScoreDocIDTriple(d, r);

var rankedQueries =
    from s in adjustedScoreTriples
    group s by s.query into g
    select TakeTopQueryResults(g);

Can think of this as a graph of operators, just like in SQL (except that SQL plans are mostly trees, whereas here they are DAGs)

[Figure: operator DAG for the example -- scoreTriples and staticRank feed a join; its output (adjustedScoreTriples) feeds a group-by followed by TakeTopQueryResults, producing rankedQueries.]

Operators: select, project, order by, join, group by

Why is that good? What are the advantages? Avoids the "Impedance Mismatch" where you run a SQL query, retrieve and typecheck the results, then pack the results back into a string, which you send to the SQL engine, which typechecks the inputs, etc....

Any disadvantages? Binds you to a specific language; complicates host language; may not support all of SQL. How does this get distributed in DryadLINQ?

(Show architecture diagram)

Label pieces -- note similarity to BigTable/MapReduce

Note that the output of each operator goes to disk

[Figure: DryadLINQ architecture -- a client submits a job to the manager (MGR), which uses a name server (NS) to place work on workers W1, W2, W3; the workers read from and write to a shared storage system. Image by MIT OpenCourseWare.]

Simple example: distributed sort

[Figure: input data is partitioned across nodes, just like in an RDBMS. Each worker (W1, W2, W3) samples its input partition (e.g., a,s,q,m and b,d,b,a), a histogram of the samples determines range boundaries (1: a-c, 2: d-z), the data is repartitioned by range and interleaved, and each node then sorts its range locally (e.g., producing a,a,b,b and d,m,q,s). Image by MIT OpenCourseWare.]

Example 2 -- Aggregate, e.g. Q2

var names = GetTable("file://names.txt");
// group by length
var groups = names.GroupBy(s => s.Length, s => s[0]);
var output = ToDryadTable(groups, "file://out.txt");

Very similar to how a DDBMS would do this. How many partitions to create?

Just say it "depends on the size of the input data"

Are there restrictions on the programs that can be run?

Says that they must be "side effect free"

What does that mean?

No modifications to any shared state.

How restrictive is that? Can make it tough to do some operations -- for example, you can't write a group by in

the language w/out special group by support.

Ex:

curgroup = None
saved = None

# per-tuple group-by aggregation over input sorted on the grouping key;
# keeps shared state across calls, so it cannot be auto-parallelized
def aggGbySorted(t, groupf, aggf):
    global curgroup, saved
    if saved is not None and groupf(t) != curgroup:
        emit(aggf(saved))      # key changed: emit the finished group
        saved = [t]            # start buffering the next group
    elif saved is None:
        saved = [t]            # very first tuple
    else:
        saved.append(t)        # same group: keep buffering
    curgroup = groupf(t)

(can only auto-parallelize user-defined funcs that are stateless)

Can I read shared state? Yes. How is that handled? Shipped to each processing node.

What if I can't figure out how to write my program with the above constraints or constructs?
  Apply(f, list) --> can perform arbitrary computation on the list, but will be run on a single node

What compiler optimizations are supported?
Static (e.g., at compile time):
  - Pipeline operators on the same node

    Ex: pipeline a sort with a partial aggregation on the same node (somewhat unclear how they choose what to combine)

  - Eager aggregation -- see the aggregate example above; perform partial aggregation before repartitioning
  - I/O reduction -- don't write intermediate data to files; instead, store it in a buffer or send it over a TCP pipe when possible
Dynamic (e.g., at run time):
  - Dynamic aggregation -- perform eager aggregation at the edges between data centers
  - Dynamic partitioning -- choose the number of partitions to create based on the amount of data (unclear how they do this)

  - Sort optimizations -- a specific plan for order by: sample the data to determine ranges, compute a histogram, split into ranges with equal #s of tuples, redistribute and sort on each node.

Optimizations are very specialized to specific programs -- e.g., they seem to have a specific optimization to make orderby, groupby, etc. run fast, rather than true general-purpose opts. Lots of talk about how they can "automatically apply" many different optimizations and infer properties like commutativity, etc., but it's a bit hard to figure out how these actually work. If you go look at their other papers they have more detail. In general, this means that programs with lots of custom ops may do many repartitionings -- e.g., after any selection that modifies tuples, repartitioning is needed.

Fault tolerance
  They don't really mention it -- though they claim the ability to restart slow nodes as a contribution (kind of an oversight). Presumably they do something like what MapReduce does -- if they notice that a particular part of a job fails, they can simply restart/rerun it. Less clear that this works as well as in MapReduce, where there is a well-defined notion of a "job" assigned to one or more nodes.

Debugging, etc.
  LINQ programs can compile to other execution frameworks, including a single-node one that makes debugging easy.

Performance -- Terasort (3.87 GB / node) (Sec 5.1/Table 1)
  1 node takes 119 secs
  4 striped disks should read/write sequentially at about ~200 MB/sec
    3.87 / .2 = ~20 sec; ~40 sec to read and write the data; so sorting 3.87 GB is about 80 secs?
  > 1 node -- each node ships (n-1)/n of the data
    Data compresses by a factor of 6
    Compressed 645 MB for n=2, sends 320 MB

1 gbit/sec switch ==> 2.5 sec to send

for n = 10, sends 580 MB ==> 4.6 sec to send

(so slight increase from 2-->10 seems to make sense)

So why is sorting so much slower when going from 1-->2?

Sampling? Dunno. Frustrating performance analysis.

MIT OpenCourseWare http://ocw.mit.edu

6.830 / 6.814 Database Systems Fall 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.