Notes on Relational Algebra

Relational algebra and relational calculus are formal languages associated with the relational model. Informally, relational algebra is a (high-level) procedural language and relational calculus a nonprocedural language. However, formally both are equivalent to one another.
Relational algebra operations work on one or more relations to define another relation without changing the original relations. Any operation on relations produces a relation. This is why we call this an algebra
A basic expression in the relational algebra consists of either a relation in the database or a constant relation
Both operands and results are relations, so output from one operation can become input to another operation.
Composition of relational algebra operations: Allows expressions to be nested, just as in arithmetic. This property is called closure.
Relational model is set-based (no duplicate tuples)
RA is not universal. It is impossible in relational algebra (or standard SQL) to compute the relation Answer(Ancestor, Descendant), e.g., traverse through a graph
Many relational databases use relational algebra operations for representing execution plans. Relatively easy to manipulate for query optimization

Each operation takes one or two relations as input. Produces another relation as output
Selection: a subset of rows. Sigma(predicate, R). Unary
Projection: a subset of columns. Pi(col1, … colN, R). Unary
- Result of PROJECT operation is a set of distinct tuples, i.e., maps to a DISTINCT in SQL
Cartesian product (X)
Union
Set Difference: R - S => items in R but not S. R and S must be union-compatible.
Intersection: R - (R - S)
Division: Defines a relation over the attributes C that consists of set of tuples from R that match combination of every tuple in S. Binary, similar to join
Unary RENAME operator: Sometimes we want to be able to give names to intermediate results of relational algebra operations

Equivalent to performing a Selection, using join predicate as selection formula, over Cartesian product of the two operand relations.
One of the most difficult operations to implement efficiently in an RDBMS and one reason why RDBMSs have intrinsic performance problems.
Inner join/Theta join: Defines a relation that contains tuples satisfying the predicate F from the Cartesian product of R and S
- Can rewrite Theta join using basic Selection and Cartesian product operations.
- If predicate F contains only equality (=), the term Equijoin is used.
Natural join: An Equijoin of the two relations R and S over all common attributes x. One occurrence of each common attribute is eliminated from the result.
- No join condition
- Equijoin on attributes having identical names followed by projection to remove duplicate (superfluous) attributes
- Often attribute(s) in foreign keys have identical name(s) to the corresponding primary keys
Outer join
Semi join: Defines a relation that contains the tuples of R that participate in the join of R with S.
- Can rewrite Semijoin using Projection and Join
To push a projection operation inside a join requires that the result of the projection contain the attributes used in the join.

F(grouping attributes, funciton list, R)
- grouping attributes are attributes of R
- function list is a list of (function, attribute) pairs
If no renaming occurs, the attributes of the resulting relation are named by concatenating the name of the function and the attribute.
Can append -distinct to any aggregate function to specify elimination of duplicates, e.g., count-distinct

Refer to a specific relation: A specific set of tuples, with a particular schema
A named relation is technically a relation variable

Results of relational operations are unnamed. Result has a schema, but the relation itself is unnamed
Used to resolve ambiguities within a specific relational algebra expression
- Allow a derived relation and its attributes to be referred to by enclosing relational-algebra operations
- Allow a base relation to be used multiple ways in one query, e.g., self-join
rho(x, E): x - new name, E - named relation, relation-variable, or an expression that produces a relation
does not create a new relation-variable
The new name is only visible to enclosing relational-algebra expression

Leaf nodes are base relations/operands — either variables standing for relations or particular, constant relations
Internal nodes are operators

SQL data model is a bag/multiset not a set, i.e. duplicates allowed.
Some operations, like projection, are more efficient on bags than sets.
SQL is much more on the “declarative” end of the spectrum (What you want). RA is procedural (how to do it)

Null signifies an unknown value or that a value does not exist.
The result of any arithmetic expression involving null is null.
Aggregate functions simply ignore null values (as in SQL)
For duplicate elimination and grouping, null is treated like any other value, and two nulls are assumed to be the same (as in SQL)
(unknown or true) = true, (unknown or false) = unknown, (unknown or unknown) = unknown