Linear Algebra and Learning from Data: Deep Learning and Neural Nets

2024-12-20 00:00:00 +0000

This sentence uses $ delimiters to show math inline: $\sqrt{3x-1}+(1+x)^2$

Slowly Changing Dimensions

2024-11-27 00:00:00 +0000

Motivation

  • A dimension does not change often. For example, suppliers rarely move to a different state, customers rarely update demographic info, and products rarely change their descriptions
  • Track both current and historical data over time. An aggregate table summarizing facts should continue to reflect the historical state

Type 1: Overwrite

  • No historical data

Type 2: add a new row

  • Multiple rows for a given natural key in the dimension table, with separate surrogate keys and/or different version numbers. Each row represents the full details at that version (a sketch of such a row appears after this list).
  • Ways to track versions:
    • A version number. Can use a new surrogate key for the new version
    • Start date + end date. The latest version uses a null end date or a sentinel value
    • Single effective date + current flag
  • If there are retroactive changes made to the contents of the dimension, or if new attributes are added to the dimension with different effective dates from those already defined, then existing transactions may need to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice if the dimensional model is subject to frequent change.
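A minimal sketch of what a Type 2 row could look like, using a hypothetical customer dimension in Go (field names are illustrative; time is from the standard library):

// Type2Row is a hypothetical Type 2 customer dimension row. The natural key
// (CustomerID) can repeat; each version gets its own surrogate key.
type Type2Row struct {
	SurrogateKey int64      // new key for every version
	CustomerID   string     // natural key, shared across versions
	State        string     // tracked attribute
	Version      int        // monotonically increasing per natural key
	EffectiveAt  time.Time  // start of validity
	ExpiredAt    *time.Time // nil (or a sentinel date) for the current version
	IsCurrent    bool       // optional current flag
}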

Type 3: previous value column

  • Only the previous value is stored, not the full history
  • No new row, and thus no change to the surrogate key

Type 4: historical table

  • The dimension table does not change, i.e., it holds only the latest version
  • The history table keeps track of create date

Skipping type 5 - 7 since they seem less popular

Support UNION in Squirrel

2024-11-03 00:00:00 +0000

Option: add UNION to selectData

  • PR 1
  • PR 2
  • PR 3. This PR is not feasible because it assumes
    • a single UNION only
    • no GROUP BY/JOIN when UNION is added to the builder

type selectData struct {
	PlaceholderFormat PlaceholderFormat
	RunWith           BaseRunner
	Prefixes          exprs
	Options           []string
	Columns           []Sqlizer
	From              Sqlizer
	Joins             []Sqlizer
	WhereParts        []Sqlizer
	GroupBys          []string
	HavingParts       []Sqlizer
	OrderBys          []string
	Limit             string
	Offset            string
	Union             []Sqlizer // same data type as Joins and From
	UnionAll          []Sqlizer
	Suffixes          exprs // previously used to simulate UNION
}
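As the comment on Suffixes hints, UNION was previously simulated through a suffix on the public API. A minimal sketch of that workaround, assuming Masterminds/squirrel is imported as sq and using made-up table/column names:

// unionViaSuffix simulates UNION ALL with the existing Suffix support,
// without touching selectData. Table and column names are illustrative.
func unionViaSuffix() (string, []interface{}, error) {
	return sq.Select("id", "name").
		From("users_2023").
		Suffix("UNION ALL SELECT id, name FROM users_2024 WHERE active = ?", true).
		ToSql()
	// -> "SELECT id, name FROM users_2023 UNION ALL SELECT id, name FROM users_2024 WHERE active = ?", args [true]
}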

Inside ToSql, attach UNION and UNION ALL after suffixes

	if len(d.Suffixes) > 0 {
		sql.WriteString(" ")
		args, _ = d.Suffixes.AppendToSql(sql, " ", args)
	}

	if len(d.Union) > 0 {
		sql.WriteString(" UNION ")
		args, err = appendToSql(d.Union, sql, " UNION ", args)
		if err != nil {
			return
		}
	}
	if len(d.UnionAll) > 0 {
		sql.WriteString(" UNION ALL ")
		args, err = appendToSql(d.UnionAll, sql, " UNION ALL ", args)
		if err != nil {
			return
		}
	}

	sqlStr = sql.String()
	return
  • UNION and UNION ALL should probably be emitted before the suffixes
  • UNION still needs to support LIMIT and ORDER BY
  • This code assumes UNION and UNION ALL are not mixed in the same SQL statement

Option: UnionBuilder


type unionSelect struct {
	op       string // e.g. "UNION"
	selector squirrel.SelectBuilder
}

type unionData struct {
	Selects           []*unionSelect
	Limit             string
	OrderBy           []string
	PlaceholderFormat squirrel.PlaceholderFormat
}

type UnionBuilder builder.Builder
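A rough sketch (not squirrel's actual API) of how the builder's data could be turned into SQL, reusing each child SelectBuilder's ToSql and the standard strings package; the parenthesization and placeholder handling are assumptions:

// toSQL stitches the child selects together with their UNION/UNION ALL
// operators, then appends the outer ORDER BY and LIMIT.
func (d *unionData) toSQL() (string, []interface{}, error) {
	var sql strings.Builder
	var args []interface{}
	for i, sel := range d.Selects {
		childSQL, childArgs, err := sel.selector.ToSql()
		if err != nil {
			return "", nil, err
		}
		if i > 0 {
			sql.WriteString(" " + sel.op + " ") // e.g. " UNION " or " UNION ALL "
		}
		sql.WriteString("(" + childSQL + ")") // keep each child's ORDER BY/LIMIT local
		args = append(args, childArgs...)
	}
	if len(d.OrderBy) > 0 {
		sql.WriteString(" ORDER BY " + strings.Join(d.OrderBy, ", "))
	}
	if d.Limit != "" {
		sql.WriteString(" LIMIT " + d.Limit)
	}
	return sql.String(), args, nil
}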

Notes on pg_stats

2024-08-16 00:00:00 +0000

  • To collect statistics, the analyzer randomly selects 300 × default_statistics_target rows (the default value is 100, so 30,000 rows in total).
  • Entries in pg_statistic are updated by the ANALYZE and VACUUM ANALYZE commands, and are always approximate even when freshly updated. After you complete an engine major version upgrade, you should run the ANALYZE operation to refresh the pg_statistic table

Statistics Used by the Planner

SELECT * FROM tenk1

  • The planner determines the cardinality of tenk1
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
  • Basic relation-level statistics are stored in the table pg_class in the system catalog.
  • reltuples: Relation’s row count
  • relpages: Relation’s size in pages

WHERE unique1 < 1000

  • The planner examines the WHERE clause condition and looks up the selectivity function for the operator < in pg_operator. This is held in the column oprrest, and the entry in this case is scalarltsel. The scalarltsel function retrieves the histogram for unique1 from pg_statistic.
  • For equality estimation the histogram is not useful; instead the list of most common values (MCVs) is used to determine the selectivity.

WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2

  • The restriction on tenk1, unique1 < 50, is evaluated before the nested-loop join
  • The restriction for the join is t2.unique2 = t1.unique2. The operator is just our familiar =, however the selectivity function is obtained from the oprjoin column of pg_operator, and is eqjoinsel. eqjoinsel looks up the statistical information for both tenk2 and tenk1, e.g., null_frac, n_distinct, most_common_vals
 selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
rows = (outer_cardinality * inner_cardinality) * selectivity
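Plugging in the numbers from the PostgreSQL docs’ tenk1/tenk2 example (assuming no NULLs in either column, n_distinct = 10000 on both sides, 50 outer rows after the unique1 < 50 filter, and 10000 inner rows):

 selectivity = (1 - 0) * (1 - 0) * min(1/10000, 1/10000) = 0.0001
 rows = (50 * 10000) * 0.0001 = 50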

pg_stats

  • pg_stats is designed to be more easily readable and is readable by all, whereas pg_statistic is only readable by a superuser.
  • For a read replica in Amazon RDS for PostgreSQL and for a reader node in Aurora PostgreSQL, these stats are the same as for the primary or writer. This is because they are stored in a relation (pg_statistic) on disk (physical blocks are the same on the replica in Amazon RDS for PostgreSQL, and in the case of Aurora the reader is reading from the same storage). This is also the reason why it isn’t allowed (and also not logical) to run ANALYZE on a replica or a reader node (both can read from the pg_statistic relation, but can’t update it).

ALTER TABLE SET STATISTICS

  • sets the per-column statistics-gathering target for subsequent ANALYZE operations. The target can be set in the range 0 to 10000; alternatively, set it to -1 to revert to using the system default statistics target (default_statistics_target)

correlation

  • When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it’s near 0

most_common_vals(MCV) and most_common_freqs

  • MCV lists are also used for selectivity estimations of inequalities: to find the selectivity of “column < value”, the planner searches most_common_vals for all the values lower than the given value, and then adds together their frequencies from most_common_freqs.
  • Common value statistics work best when the number of distinct values is low. The maximum size of the MCV arrays is defined by default_statistics_target, the same parameter that governs row sample size during analysis.

Histogram

  • When the number of distinct values grows too large to store them all in an array, the system starts using the histogram representation. A histogram employs several buckets to store values in. The number of buckets is limited by the same default_statistics_target parameter.
  • Histograms are used to estimate the selectivity of “greater than” and “less than” operations, together with MCV lists.

n_distinct

ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = ...)

The only defined per-attribute options are n_distinct and n_distinct_inherited, which override the number-of-distinct-values estimates made by subsequent ANALYZE operations, i.e., the n_distinct change will not take effect until you run ANALYZE again

How is pg_stats built

    FROM pg_statistic s
      JOIN pg_class c ON c.oid = s.starelid
      JOIN pg_attribute a ON c.oid = a.attrelid AND a.attnum = s.staattnum
      LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
  • tablename: pg_class.relname
  • attname: from pg_attribute, identified by oid + attnum
  • null_frac: stanullfrac
  • avg_width: stawidth
  • n_distinct: stadistinct

CREATE STATISTICS

  • Extended statistics metadata is stored in the pg_statistic_ext table in the system catalog, while the statistics data itself is stored in a separate table pg_statistic_ext_data
  • CREATE STATISTICS will create a new extended statistics object tracking data about the specified table, foreign table or materialized view. The statistics object will be created in the current database and will be owned by the user issuing the command.
  • The CREATE STATISTICS command has two basic forms. The first form allows univariate statistics for a single expression to be collected, providing benefits similar to an expression index without the overhead of index maintenance. This form does not allow the statistics kind to be specified, since the various statistics kinds refer only to multivariate statistics. The second form of the command allows multivariate statistics on multiple columns and/or expressions to be collected, optionally specifying which statistics kinds to include. This form will also automatically cause univariate statistics to be collected on any expressions included in the list.
  • If a schema name is given (for example, CREATE STATISTICS myschema.mystat …) then the statistics object is created in the specified schema. Otherwise it is created in the current schema. The name of the statistics object must be distinct from the name of any other statistics object in the same schema.
  • Extended statistics are not currently used by the planner for selectivity estimations made for table joins. This limitation will likely be removed in a future version of PostgreSQL.
  • Creation of such an object merely creates a catalog entry expressing interest in the statistics. Actual data collection is performed by ANALYZE (either a manual command, or background auto-analyze). The collected values can be examined in the pg_statistic_ext_data catalog.
  • ANALYZE computes extended statistics based on the same sample of table rows that it takes for computing regular single-column statistics.

Dependencies

  • By default, the statistics from ANALYZE are stored on a per-column, per-table basis, and therefore can’t capture any knowledge about cross-column correlation. It’s common to see slow queries running with bad query plans because multiple columns used in the query clauses are correlated. However, with the CREATE STATISTICS command, you can create extended statistics for correlated columns.

References

Notes on query plan optimization in RDBMS

2024-06-01 00:00:00 +0000

  • Steps in Query Processing: query -> parsing -> RA expressions -> optimizer with stats -> execution plan -> eval engine with data -> output
  • Query optimizers use equivalence rules to systematically generate expressions equivalent to the given expression, i.e., heuristic optimization and cost-based optimization
  • Frequently used approach: heuristic rewriting of nested block structure and aggregation. Followed by cost-based join-order optimization for each block
  • Optimizers often use simple heuristics for very cheap queries, and perform exhaustive enumeration for more expensive queries
  • Pushing down projections also reduces the number of pages, since it eliminates columns early
  • Heuristics
    • Consider only left-deep plans
    • Avoid Cartesian products
    • Don’t optimize the entire query at once. Break the query into query blocks and optimize one block at a time
      • A query block contains a single SELECT-FROM-WHERE expression as well as GROUP BY and HAVING clauses

References

Notes on Relational Algebra

2024-05-24 00:00:00 +0000

Overview

  • Relational algebra and relational calculus are formal languages associated with the relational model. Informally, relational algebra is a (high-level) procedural language and relational calculus a nonprocedural language. However, formally both are equivalent to one another.
  • Relational algebra operations work on one or more relations to define another relation without changing the original relations. Any operation on relations produces a relation. This is why we call this an algebra
  • A basic expression in the relational algebra consists of either a relation in the database or a constant relation
  • Both operands and results are relations, so output from one operation can become input to another operation.
  • Composition of relational algebra operations: Allows expressions to be nested, just as in arithmetic. This property is called closure.
  • Relational model is set-based (no duplicate tuples)
  • RA is not universal. It is impossible in relational algebra (or in SQL without recursive queries) to compute the relation Answer(Ancestor, Descendant), e.g., to traverse a graph
  • Many relational databases use relational algebra operations for representing execution plans. Relatively easy to manipulate for query optimization

Operations

Basic Operations

  • Each operation takes one or two relations as input. Produces another relation as output
  • Selection: a subset of rows. Sigma(predicate, R). Unary
  • Projection: a subset of columns. Pi(col1, … colN, R). Unary
    • Result of PROJECT operation is a set of distinct tuples, i.e., maps to a DISTINCT in SQL
  • Cartesian product (X)
  • Union
  • Set Difference: R - S => items in R but not S. R and S must be union-compatible.

  • Intersection: R - (R - S)
  • Division: Defines a relation over the attributes C that consists of set of tuples from R that match combination of every tuple in S. Binary, similar to join
  • Unary RENAME operator: Sometimes we want to be able to give names to intermediate results of relational algebra operations

Join

  • Equivalent to performing a Selection, using join predicate as selection formula, over Cartesian product of the two operand relations.
  • One of the most difficult operations to implement efficiently in an RDBMS and one reason why RDBMSs have intrinsic performance problems.
  • Inner join/Theta join: Defines a relation that contains tuples satisfying the predicate F from the Cartesian product of R and S
    • Can rewrite Theta join using basic Selection and Cartesian product operations.
    • If predicate F contains only equality (=), the term Equijoin is used.
  • Natural join: An Equijoin of the two relations R and S over all common attributes x. One occurrence of each common attribute is eliminated from the result.
    • No join condition
    • Equijoin on attributes having identical names followed by projection to remove duplicate (superfluous) attributes
    • Often attribute(s) in foreign keys have identical name(s) to the corresponding primary keys
  • Outer join
  • Semi join: Defines a relation that contains the tuples of R that participate in the join of R with S.
    • Can rewrite Semijoin using Projection and Join
  • To push a projection operation inside a join requires that the result of the projection contain the attributes used in the join.

Aggregate Functions & Grouping

  • F(grouping attributes, function list, R)
    • grouping attributes are attributes of R
    • function list is a list of (function, attribute) pairs
    • e.g., F(Dno, [(count, Ssn), (average, Salary)], EMPLOYEE) groups employees by department and computes the employee count and average salary per department
  • If no renaming occurs, the attributes of the resulting relation are named by concatenating the name of the function and the attribute.
  • Can append -distinct to any aggregate function to specify elimination of duplicates, e.g., count-distinct

Relation variables

  • Refer to a specific relation: A specific set of tuples, with a particular schema
  • A named relation is technically a relation variable

Assignment

  • relvar <- E. E is an expression that evaluates to a relation
  • assign a relation-value to a relation-variable
  • the name relvar persists in the database
  • Query evaluation becomes a sequence of steps, e.g., % operator

Rename

  • Results of relational operations are unnamed. Result has a schema, but the relation itself is unnamed
  • Used to resolve ambiguities within a specific relational algebra expression
    • Allow a derived relation and its attributes to be referred to by enclosing relational-algebra operations
    • Allow a base relation to be used multiple ways in one query, e.g., self-join
  • rho(x, E): x - new name, E - named relation, relation-variable, or an expression that produces a relation
  • does not create a new relation-variable
  • The new name is only visible to enclosing relational-algebra expression

Query tree

  • Leaf nodes are base relations/operands — either variables standing for relations or particular, constant relations
  • Internal nodes are operators

Relational algebra vs SQL

  • SQL data model is a bag/multiset not a set, i.e. duplicates allowed.
  • Some operations, like projection, are more efficient on bags than sets.
  • SQL is much more on the “declarative” end of the spectrum (What you want). RA is procedural (how to do it)

Null

  • Null signifies an unknown value or that a value does not exist.
  • The result of any arithmetic expression involving null is null.
  • Aggregate functions simply ignore null values (as in SQL)
  • For duplicate elimination and grouping, null is treated like any other value, and two nulls are assumed to be the same (as in SQL)
  • (unknown or true) = true, (unknown or false) = unknown, (unknown or unknown) = unknown

References

How does CockroachDB translate an AST into execution plans

2024-05-21 00:00:00 +0000

Phases

  • Parsing → Logical Planning & Optimization → Physical Planning → Execution
  • The SQL front-end is responsible for parsing, desugaring, free simplifications and semantic analysis. The SQL middle-end is responsible for logical planning and optimization. For efficiency and engineering reasons, the front-end and middle-end are grouped together in the code
  • If the architecture calls out a thing called e.g. “query runner”, which takes as input a logical query plan (a data structure) and outputs result rows (another data structure), you’d usually expect a thing in the source code called “query runner” that looks like a class whose instances would carry the execution’s internal state providing some methods that take a logical plan as input, and returning result rows as results.
  • It is common for SQL engines to separate processing of a query into two phases: preparation and execution. This is especially valuable because the work of preparation can be performed just once for multiple executions.

Parsing

  • CockroachDB uses an LALR parser that is generated by goyacc from a Yacc-like grammar file. This file specifies CockroachDB’s SQL dialect, and was originally copied from PostgreSQL and then modified.

// Statement is the result of parsing a single statement. It contains the AST
// node along with other information.
type Statement[T any] struct {
	// AST is the root of the AST tree for the parsed statement.
	// Note that it is NOT SAFE to access this currently with statement execution,
	// as unfortunately the AST is not immutable.
	// See issue https://github.com/cockroachdb/cockroach/issues/22847 for more
	// details on this problem.
	AST T

	// Comments is the list of parsed SQL comments.
	Comments []string

	// SQL is the original SQL from which the statement was parsed. Note that this
	// is not appropriate for use in logging, as it may contain passwords and
	// other sensitive data.
	SQL string

	// NumPlaceholders indicates the number of arguments to the statement (which
	// are referenced through placeholders). This corresponds to the highest
	// argument position (i.e. the x in "$x") that appears in the query.
	//
	// Note: where there are "gaps" in the placeholder positions, this number is
	// based on the highest position encountered. For example, for `SELECT $3`,
	// NumPlaceholders is 3. These cases are malformed and will result in a
	// type-check error.
	NumPlaceholders int

	// NumAnnotations indicates the number of annotations in the tree. It is equal
	// to the maximum annotation index.
	NumAnnotations tree.AnnotationIdx
}
  • Note that the AST only represents the syntax of the query, and says nothing about how, or even if, it can be executed. For example, it has no idea whether a table exists or what datatype a column has. Figuring that out is the job of the planner.
  • The entry point is mainly Parser.Parse() in sql/parser/parse.go; a rough usage sketch follows.
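A rough sketch (API details vary across CockroachDB versions; the AST and SQL fields are the ones shown in the Statement struct above, and fmt plus the parser package are assumed to be imported):

// parseOnly feeds a query through the parser and prints the resulting AST
// node types. No name resolution or type checking happens here.
func parseOnly(query string) error {
	stmts, err := parser.Parse(query)
	if err != nil {
		return err
	}
	for _, s := range stmts {
		fmt.Printf("%T: %s\n", s.AST, s.SQL)
	}
	return nil
}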

Logical Planning and Optimization

  • Query planning is the process of converting the SQL query AST into a tree of relational algebra operators that will produce the desired result.
EXPLAIN
    SELECT companies.name AS company, countries.name AS country, employees
    FROM companies JOIN countries ON companies.country = countries.id
    WHERE employees > 1e6
    ORDER BY employees DESC;
  • Each node in the plan is a relational operator, and results flow upwards from one node to the next. We can see that the optimizer has decided to do a full table scan of the companies table, filter the rows on employees, perform a lookup join with the countries table, and finally sort the results by employees in descending order.
  • Scalar expressions (algebraic expressions such as tax / total * 100) are originally represented by nested tree.Expr AST nodes and implement a visitor pattern for analysis and transformation.

Planner


// planner is the centerpiece of SQL statement execution combining session
// state and database state with the logic for SQL execution. It is logically
// scoped to the execution of a single statement, and should not be used to
// execute multiple statements. It is not safe to use the same planner from
// multiple goroutines concurrently.
//
// planners are usually created by using the newPlanner method on a Session.
// If one needs to be created outside of a Session, use makeInternalPlanner().
type planner struct {
	txn *kv.Txn

	// isInternalPlanner is set to true when this planner is not bound to
	// a SQL session.
	isInternalPlanner bool

	// Corresponding Statement for this query.
	stmt Statement

	instrumentation instrumentationHelper

	// Contexts for different stages of planning and execution.
	semaCtx         tree.SemaContext
	extendedEvalCtx extendedEvalContext

	// sessionDataMutator is used to mutate the session variables. Read
	// access to them is provided through evalCtx.
	sessionDataMutator *sessionDataMutator

	// execCfg is used to access the server configuration for the Executor.
	execCfg *ExecutorConfig

	preparedStatements preparedStatementsAccessor

	// avoidCachedDescriptors, when true, instructs all code that
	// accesses table/view descriptors to force reading the descriptors
	// within the transaction. This is necessary to read descriptors
	// from the store for:
	// 1. Descriptors that are part of a schema change but are not
	// modified by the schema change. (reading a table in CREATE VIEW)
	// 2. Disable the use of the table cache in tests.
	avoidCachedDescriptors bool

	// If set, the planner should skip checking for the SELECT privilege when
	// initializing plans to read from a table. This should be used with care.
	skipSelectPrivilegeChecks bool

	// autoCommit indicates whether we're planning for an implicit transaction.
	// If autoCommit is true, the plan is allowed (but not required) to commit the
	// transaction along with other KV operations. Committing the txn might be
	// beneficial because it may enable the 1PC optimization.
	//
	// NOTE: plan node must be configured appropriately to actually perform an
	// auto-commit. This is dependent on information from the optimizer.
	autoCommit bool

	// cancelChecker is used by planNodes to check for cancellation of the associated
	// query.
	cancelChecker *cancelchecker.CancelChecker

	// isPreparing is true if this planner is currently preparing.
	isPreparing bool

	// curPlan collects the properties of the current plan being prepared. This state
	// is undefined at the beginning of the planning of each new statement, and cannot
	// be reused for an old prepared statement after a new statement has been prepared.
	curPlan planTop

	// Avoid allocations by embedding commonly used objects and visitors.
	txCtx                 transform.ExprTransformContext
	nameResolutionVisitor schemaexpr.NameResolutionVisitor
	tableName             tree.TableName

	// Use a common datum allocator across all the plan nodes. This separates the
	// plan lifetime from the lifetime of returned results allowing plan nodes to
	// be pool allocated.
	alloc *rowenc.DatumAlloc

	// optPlanningCtx stores the optimizer planning context, which contains
	// data structures that can be reused between queries (for efficiency).
	optPlanningCtx optPlanningCtx

	// noticeSender allows the sending of notices.
	// Do not use this object directly; use the BufferClientNotice() method
	// instead.
	noticeSender noticeSender

	queryCacheSession querycache.Session

	// contextDatabaseID is the ID of a database. It is set during some name
	// resolution processes to disallow cross database references. In particular,
	// the type resolution steps will disallow resolution of types that have a
	// parentID != contextDatabaseID when it is set.
	contextDatabaseID descpb.ID
}
  • Optgen normalization rules are compiled into Go code along with a norm.Factory that builds normalized memo expressions.
  • Now that we’ve built a normalized memo, buildExecMemo() goes on to optimize the plan using xform.Optimizer. This explores possible query plans by applying Optgen transformation rules to generate equivalent memo groups, and tries to pick the best combination through cost-based optimization based on table statistics.
  • opt/optbuilder
    • in-order depth-first recursive traversal of the AST;
    • invokes semantics checks on the way;
    • constructs a tree of “relational expression nodes”. This tree is also called “the memo” because of the data structure it uses internally.
    • the resulting tree is the logical plan.

References

How does TiDB translate JOIN and GROUP BY into logical plans

2024-05-13 00:00:00 +0000

Join


// Join represents table join.
type Join struct {
	node
	resultSetNode

	// Left table can be TableSource or JoinNode.
	Left ResultSetNode
	// Right table can be TableSource or JoinNode or nil.
	Right ResultSetNode
	// Tp represents join type.
	Tp JoinType
	// On represents join on condition.
	On *OnCondition
	// Using represents join using clause.
	Using []*ColumnName
	// NaturalJoin represents join is natural join
	NaturalJoin bool
}


func (b *planBuilder) buildJoin(join *ast.Join) LogicalPlan {
	b.optFlag = b.optFlag | flagPredicatePushDown
	leftPlan := b.buildResultSetNode(join.Left)
	rightPlan := b.buildResultSetNode(join.Right)
	newSchema := expression.MergeSchema(leftPlan.Schema(), rightPlan.Schema())
	joinPlan := LogicalJoin{}.init(b.ctx)
	joinPlan.SetChildren(leftPlan, rightPlan)
	joinPlan.SetSchema(newSchema)

	// Merge sub join's redundantSchema into this join plan. When handle query like
	// select t2.a from (t1 join t2 using (a)) join t3 using (a);
	// we can simply search in the top level join plan to find redundant column.
	var lRedundant, rRedundant *expression.Schema
	if left, ok := leftPlan.(*LogicalJoin); ok && left.redundantSchema != nil {
		lRedundant = left.redundantSchema
	}
	if right, ok := rightPlan.(*LogicalJoin); ok && right.redundantSchema != nil {
		rRedundant = right.redundantSchema
	}
	joinPlan.redundantSchema = expression.MergeSchema(lRedundant, rRedundant)

	b.curClause = onClause
	onExpr, _, err := b.rewrite(join.On.Expr, joinPlan, nil, false)
	if err != nil {
		b.err = err
		return nil
	}
	onCondition := expression.SplitCNFItems(onExpr)
	joinPlan.attachOnConds(onCondition)
	joinPlan.JoinType = InnerJoin
	return joinPlan
}

Group by


// LogicalAggregation represents an aggregate plan.
type LogicalAggregation struct {
	logicalSchemaProducer

	AggFuncs     []*aggregation.AggFuncDesc
	GroupByItems []expression.Expression
	// groupByCols stores the columns that are group-by items.
	groupByCols []*expression.Column

	possibleProperties [][]*expression.Column
	inputCount         float64 // inputCount is the input count of this plan.
}

// AggFuncDesc describes an aggregation function signature, only used in planner.
type AggFuncDesc struct {
	// Name represents the aggregation function name.
	Name string
	// Args represents the arguments of the aggregation function.
	Args []expression.Expression
	// RetTp represents the return type of the aggregation function.
	RetTp *types.FieldType
	// Mode represents the execution mode of the aggregation function.
	Mode AggFunctionMode
	// HasDistinct represents whether the aggregation function contains distinct attribute.
	HasDistinct bool
}
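As a hypothetical illustration (field values are made up, not copied from TiDB source), COUNT(DISTINCT a) would be described by something like:

countDistinctA := &AggFuncDesc{
	Name:        "count",
	Args:        []expression.Expression{colA}, // the expression for column a (assumed to exist)
	HasDistinct: true,
	// RetTp and Mode are filled in by the planner.
}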

Rule-based optimization for SELECT in tidb

2024-05-12 00:00:00 +0000



func logicalOptimize(flag uint64, logic LogicalPlan) (LogicalPlan, error) {
    var err error
    for i, rule := range optRuleList {
        // The order of flags is same as the order of optRule in the list.
        // We use a bitmask to record which opt rules should be used. If the i-th bit is 1, it means we should
        // apply i-th optimizing rule.
        if flag&(1<<uint(i)) == 0 {
            continue
        }
        logic, err = rule.optimize(logic)
        if err != nil {
            return nil, errors.Trace(err)
        }
    }
    return logic, errors.Trace(err)
}
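A hypothetical illustration of the flag bitmask (the bit positions here are made up; only the flag names appear elsewhere in these notes): a plan built with flagPrunColumns|flagPredicatePushDown set would run rules 0 and 2 and skip rule 1.

const (
	flagPrunColumns         uint64 = 1 << 0 // rule 0: column pruning
	flagEliminateProjection uint64 = 1 << 1 // rule 1: projection elimination
	flagPredicatePushDown   uint64 = 1 << 2 // rule 2: predicate push-down
)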


Translate SQL into logical plan


select b from t1, t2 where t1.c = t2.c and t1.a > 5

  • Projection: select b
    • Selection: where t1.a > 5
      • Join: from t1, t2 where t1.c = t2.c
        • DataSource t1
        • DataSource t2

Example 2

select min(id) from t
  • Aggregation
    • TableScan
select id from t order by id asc limit 1
  • Limit
    • Sort
      • Projection
        • TableScan

Column pruning

  • Projection
func (p *LogicalProjection) PruneColumns(parentUsedCols []*expression.Column) {
	child := p.children[0]
	used := getUsedList(parentUsedCols, p.schema)
	for i := len(used) - 1; i >= 0; i-- {
		if !used[i] && !exprHasSetVar(p.Exprs[i]) {
			p.schema.Columns = append(p.schema.Columns[:i], p.schema.Columns[i+1:]...)
			p.Exprs = append(p.Exprs[:i], p.Exprs[i+1:]...)
		}
	}
	selfUsedCols := make([]*expression.Column, 0, len(p.Exprs))
	selfUsedCols = expression.ExtractColumnsFromExpressions(selfUsedCols, p.Exprs, nil)
	child.PruneColumns(selfUsedCols)
}
  • DataSource
// PruneColumns implements LogicalPlan interface.
func (ds *DataSource) PruneColumns(parentUsedCols []*expression.Column) {
	used := getUsedList(parentUsedCols, ds.schema)
	for i := len(used) - 1; i >= 0; i-- {
		if !used[i] {
			ds.schema.Columns = append(ds.schema.Columns[:i], ds.schema.Columns[i+1:]...)
			ds.Columns = append(ds.Columns[:i], ds.Columns[i+1:]...)
		}
	}
	for k, cols := range ds.schema.TblID2Handle {
		if ds.schema.ColumnIndex(cols[0]) == -1 {
			delete(ds.schema.TblID2Handle, k)
		}
	}
}
  • Aggregation
    • columns in group by
    • columns in the aggregation function
  • Sort
func (ls *LogicalSort) PruneColumns(parentUsedCols []*expression.Column) {
	child := ls.children[0]
	for i := len(ls.ByItems) - 1; i >= 0; i-- {
		cols := expression.ExtractColumns(ls.ByItems[i].Expr)
		if len(cols) == 0 {
			ls.ByItems = append(ls.ByItems[:i], ls.ByItems[i+1:]...)
		} else {
			parentUsedCols = append(parentUsedCols, expression.ExtractColumns(ls.ByItems[i].Expr)...)
		}
	}
        // add the used columns from parents to its own required list
	child.PruneColumns(parentUsedCols)
}
  • Selection
// PruneColumns implements LogicalPlan interface.
func (p *LogicalSelection) PruneColumns(parentUsedCols []*expression.Column) {
	child := p.children[0]
	parentUsedCols = expression.ExtractColumnsFromExpressions(parentUsedCols, p.Conditions, nil)
	child.PruneColumns(parentUsedCols)
}
  • Join
    • columns in the join

Remove unnecessary projection

  • Projection(A) -> Projection(A, B, C)
  • Aggregation(A) -> Projection(A, B, C)
// eliminate eliminates the redundant projection in a logical plan.
func (pe *projectionEliminater) eliminate(p LogicalPlan, replace map[string]*expression.Column, canEliminate bool) LogicalPlan {
	proj, isProj := p.(*LogicalProjection)
	childFlag := canEliminate
	if _, isUnion := p.(*LogicalUnionAll); isUnion {
		childFlag = false
	} else if _, isAgg := p.(*LogicalAggregation); isAgg || isProj {
		childFlag = true
	}
	for i, child := range p.Children() {
		p.Children()[i] = pe.eliminate(child, replace, childFlag)
	}

	switch x := p.(type) {
	case *LogicalJoin:
		x.schema = buildLogicalJoinSchema(x.JoinType, x)
	case *LogicalApply:
		x.schema = buildLogicalJoinSchema(x.JoinType, x)
	default:
		for _, dst := range p.Schema().Columns {
			resolveColumnAndReplace(dst, replace)
		}
	}
	p.replaceExprColumns(replace)

	if !(isProj && canEliminate && canProjectionBeEliminatedLoose(proj)) {
		return p
	}
	exprs := proj.Exprs
	for i, col := range proj.Schema().Columns {
		replace[string(col.HashCode())] = exprs[i].(*expression.Column)
	}
	return p.Children()[0]
}

How does TiDB translate an AST into execution plans

2024-05-10 00:00:00 +0000


// Compile compiles an ast.StmtNode to a physical plan.
func (c *Compiler) Compile(goCtx goctx.Context, stmtNode ast.StmtNode) (*ExecStmt, error) {
	infoSchema := GetInfoSchema(c.Ctx)

	// validation, name binding
	plan.Preprocess(c.Ctx, stmtNode, infoSchema, false)

	// build and optimize the query plan
	finalPlan := plan.Optimize(c.Ctx, stmtNode, infoSchema)

	return &ExecStmt{
		InfoSchema: infoSchema,
		Plan:       finalPlan,
		Expensive:  isExpensive,
		Cacheable:  plan.Cacheable(stmtNode),
		Text:       stmtNode.Text(),
		StmtNode:   stmtNode,
		Ctx:        c.Ctx,
	}, nil
}

// Optimize does optimization and creates a Plan.
// The node must be prepared first.
func Optimize(ctx context.Context, node ast.Node, is infoschema.InfoSchema) (Plan, error) {
	builder := &planBuilder{
		ctx:       ctx,
		is:        is,
		colMapper: make(map[*ast.ColumnNameExpr]int),
	}
	p := builder.build(node)

	// Assuming p is a LogicalPlan, which can be optimized
	return doOptimize(builder.optFlag, p)
}


func (b *planBuilder) build(node ast.Node) Plan {
	b.optFlag = flagPrunColumns
	switch x := node.(type) {
	case *ast.InsertStmt:
		return b.buildInsert(x)
	case *ast.SelectStmt:
		return b.buildSelect(x)
	}
	b.err = ErrUnsupportedType.Gen("Unsupported type %T", node)
	return nil
}

/*

Hierarchy: ResultSet <- GroupBy <- Selection <- Aggregation <- Projection

ResultSetNode interface has a ResultFields property, represents a Node that returns result set. Implementations include SelectStmt, SubqueryExpr, TableSource, TableName and Join.

*/



func (b *planBuilder) buildSelect(sel *ast.SelectStmt) LogicalPlan {
	var (
		p        LogicalPlan
		aggFuncs []*ast.AggregateFuncExpr
		totalMap map[*ast.AggregateFuncExpr]int
		gbyCols  []expression.Expression
	)
        //base plan node
	p = b.buildResultSetNode(sel.From.TableRefs)
	originalFields := sel.Fields.Fields
        //add children
	p, gbyCols = b.resolveGbyExprs(p, sel.GroupBy, sel.Fields.Fields)

        //add children
	p = b.buildSelection(p, sel.Where, nil)

	aggFuncs, totalMap = b.extractAggFuncs(sel.Fields.Fields)
	var aggIndexMap map[int]int
        //add children
	p, aggIndexMap = b.buildAggregation(p, aggFuncs, gbyCols)
	for k, v := range totalMap {
		totalMap[k] = aggIndexMap[v]
	}
	var oldLen int

	// add children
	p, oldLen = b.buildProjection(p, sel.Fields.Fields, totalMap)
	p = b.buildDistinct(p, oldLen)
	p = b.buildSort(p, sel.OrderBy.Items, orderMap)
	p = b.buildLimit(p, sel.Limit)
	sel.Fields.Fields = originalFields
	if oldLen != p.Schema().Len() {
		proj := LogicalProjection{Exprs: expression.Column2Exprs(p.Schema().Columns[:oldLen])}.init(b.ctx)
		proj.SetChildren(p)
		schema := expression.NewSchema(p.Schema().Clone().Columns[:oldLen]...)
		for _, col := range schema.Columns {
			col.FromID = proj.ID()
		}
		proj.SetSchema(schema)
		return proj
	}

	return p
}

func (b *planBuilder) buildResultSetNode(node ast.ResultSetNode) LogicalPlan {
	switch x := node.(type) {
	case *ast.Join:
		return b.buildJoin(x)
	case *ast.TableSource:
		var p LogicalPlan
		switch v := x.Source.(type) {
		case *ast.SelectStmt:
			p = b.buildSelect(v)
		case *ast.TableName:
			p = b.buildDataSource(v)
		}
		if v, ok := p.(*DataSource); ok {
			v.TableAsName = &x.AsName
		}
		for _, col := range p.Schema().Columns {
			col.OrigTblName = col.TblName
			if x.AsName.L != "" {
				col.TblName = x.AsName
				col.DBName = model.NewCIStr("")
			}
		}
		// Duplicate column name in one table is not allowed.
		// "select * from (select 1, 1) as a;" is duplicate
		dupNames := make(map[string]struct{}, len(p.Schema().Columns))
		for _, col := range p.Schema().Columns {
			name := col.ColName.O
			if _, ok := dupNames[name]; ok {
				b.err = ErrDupFieldName.GenByArgs(name)
				return nil
			}
			dupNames[name] = struct{}{}
		}
		return p
	case *ast.SelectStmt:
		return b.buildSelect(x)
	}
	return nil
}



// buildProjection returns a Projection plan and non-aux columns length.
func (b *planBuilder) buildProjection(p LogicalPlan, fields []*ast.SelectField, mapper map[*ast.AggregateFuncExpr]int) (LogicalPlan, int) {
	b.optFlag |= flagEliminateProjection
	b.curClause = fieldList
	proj := LogicalProjection{Exprs: make([]expression.Expression, 0, len(fields))}.init(b.ctx)
	schema := expression.NewSchema(make([]*expression.Column, 0, len(fields))...)
	oldLen := 0
	for _, field := range fields {
		newExpr, np, err := b.rewrite(field.Expr, p, mapper, true)
		p = np
		proj.Exprs = append(proj.Exprs, newExpr)

		col := b.buildProjectionField(proj.id, schema.Len()+1, field, newExpr)
		schema.Append(col)

		if !field.Auxiliary {
			oldLen++
		}
	}
	proj.SetSchema(schema)
	proj.SetChildren(p)
	return proj, oldLen
}

func (b *planBuilder) buildSelection(p LogicalPlan, where ast.ExprNode, AggMapper map[*ast.AggregateFuncExpr]int) LogicalPlan {
	b.optFlag = b.optFlag | flagPredicatePushDown
	if b.curClause != havingClause {
		b.curClause = whereClause
	}
	conditions := splitWhere(where)
	expressions := make([]expression.Expression, 0, len(conditions))
	selection := LogicalSelection{}.init(b.ctx)
	for _, cond := range conditions {
		expr, np, err := b.rewrite(cond, p, AggMapper, false)
		p = np
		if expr == nil {
			continue
		}
		cnfItems := expression.SplitCNFItems(expr)
		for _, item := range cnfItems {
			expressions = append(expressions, item)
		}
	}
	selection.Conditions = expressions
	selection.SetChildren(p)
	return selection
}

    // RecordSet is an abstract result set interface to help get data from Plan.
    type RecordSet interface {
        // Fields gets result fields.
        Fields() []*ResultField
        // Next returns the next row, nil row means there is no more to return.
        Next(ctx context.Context) (row types.Row, err error)
        // NextChunk reads records into chunk.
        NextChunk(ctx context.Context, chk *chunk.Chunk) error
        // NewChunk creates a new chunk with initial capacity.
        NewChunk() *chunk.Chunk
        // SupportChunk check if the RecordSet supports Chunk structure.
        SupportChunk() bool
        // Close closes the underlying iterator, call Next after Close will
        // restart the iteration.
        Close() error
    }
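A minimal sketch, based only on the interface above, of draining a RecordSet row by row (per the comment, a nil row signals the end):

func drainRecordSet(ctx context.Context, rs ast.RecordSet) error {
	defer rs.Close()
	for {
		row, err := rs.Next(ctx)
		if err != nil {
			return err
		}
		if row == nil { // nil row means there are no more rows
			return nil
		}
		// consume row ...
	}
}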


// Exec builds an Executor from a plan. If the Executor doesn't return result,
// like the INSERT, UPDATE statements, it executes in this function, if the Executor returns
// result, execution is done after this function returns, in the returned ast.RecordSet Next method.
func (a *ExecStmt) Exec(goCtx goctx.Context) (ast.RecordSet, error) {
	ctx := a.Ctx
	e := a.buildExecutor(ctx)

	e.Open(goCtx)

	var pi processinfoSetter
	if raw, ok := ctx.(processinfoSetter); ok {
		pi = raw
		sql := a.OriginText()
		// Update processinfo, ShowProcess() will use it.
		pi.SetProcessInfo(sql)
	}

	return &recordSet{
		executor:    e,
		stmt:        a,
		processinfo: pi,
		txnStartTS:  ctx.Txn().StartTS(),
	}, nil
}

type SelectStmt struct {
        dmlNode
        resultSetNode
        // SelectStmtOpts wraps around select hints and switches.
        *SelectStmtOpts
        // Distinct represents whether the select has distinct option.
        Distinct bool
        // From is the from clause of the query.
        From *TableRefsClause
        // Where is the where clause in select statement.
        Where ExprNode
        // Fields is the select expression list.
        Fields *FieldList
        // GroupBy is the group by expression list.
        GroupBy *GroupByClause
        // Having is the having condition.
        Having *HavingClause
        // OrderBy is the ordering expression list.
        OrderBy *OrderByClause
        // Limit is the limit clause.
        Limit *Limit
        // LockTp is the lock type
        LockTp SelectLockType
        // TableHints represents the level Optimizer Hint
        TableHints []*TableOptimizerHint
}

// LogicalSelection represents a where or having predicate.
type LogicalSelection struct {
	baseLogicalPlan

	// Originally the WHERE or ON condition is parsed into a single expression,
	// but after we converted to CNF(Conjunctive normal form), it can be
	// split into a list of AND conditions.
	Conditions []expression.Expression
}

Software Engineering at Google

2024-03-10 00:00:00 +0000

Notes from Software Engineering at Google

Differences between programming and software engineering

Time

  • On a software engineering project, engineers need to be more concerned with the passage of time and the eventual need for change.
Should I upgrade?
  • When you are fundamentally incapable of reacting to a change in underlying technology or product direction, you’re placing a high-risk bet on the hope that such a change never becomes critical.
  • For any project that didn’t plan for upgrades from the start, that transition is likely very painful.
  • You’re performing a task that hasn’t yet been done for this project; more hidden assumptions have been baked-in.
  • The engineers trying to do the upgrade are less likely to have experience in this sort of task.
  • The size of the upgrade is often larger than usual, doing several years’ worth of upgrades at once instead of a more incremental upgrade. And thus, after actually going through such an upgrade once (or giving up part way through), it’s pretty reasonable to overestimate the cost of doing a subsequent upgrade and decide “Never again.” Companies that come to this conclusion end up committing to just throwing things out and rewriting their code, or deciding to never upgrade again.
  • Expect the first upgrade for a codebase to be significantly more expensive than later upgrades, even controlling for other factors.

Scale

Hyrum’s Law

  • With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.
  • We cannot assume perfect adherence to published contracts or best practices.
  • In practice, the complexity and difficulty of a given change also depends on how useful a user finds some observable behavior of your API. If users cannot depend on such things, your API will be easy to change. Given enough time and enough users, even the most innocuous change will break something;

How to upgrade

A new Widget has been developed. The decision is made that everyone should use the new one and stop using the old one.

Not scalable
  • To motivate this, project leads say “We’ll delete the old Widget on August 15th; make sure you’ve converted to the new Widget.”
  • Teams depend on an ever-increasing number of Widgets, and a single build break can affect a growing percentage of the company.
  • Forcing users to respond to churn means that every affected team does a worse job ramping up, solves their immediate problem, and then throws away that now-useless knowledge.
Scalable
  • Instead of pushing migration work to customers, teams can internalize it themselves, with all the economies of scale that provides.
  • Infrastructure teams must do the work to move their internal users to new versions themselves or do the update in place, in a backward-compatible fashion.
  • Dependent projects are no longer spending progressively greater effort just to keep up.
  • Having a dedicated group of experts execute the change scales better than asking for more maintenance effort from every user: experts spend some time learning the whole problem in depth and then apply that expertise to every subproblem.
  • “If a product experiences outages or other problems as a result of infrastructure changes, but the issue wasn’t surfaced by tests in our Continuous Integration (CI) system, it is not the fault of the infrastructure change.”
    • “If you liked it, you should have put a CI test on it,”
    • Complicated, one-off bespoke tests that aren’t triggered by our common CI system do not count. Without this, an engineer on an infrastructure team could conceivably need to track down every team with any affected code and ask them how to run their tests.

Trade-offs

  • As software engineers, we are asked to make more complex decisions with higher-stakes outcomes, often based on imprecise estimates of time and growth.
  • We might sometimes defer maintenance changes, or even embrace policies that don’t scale well, with the knowledge that we’ll need to revisit those decisions. Those choices should be explicit and clear about the deferred costs.

Make decisions

  • It is important for there to be a decider for any topic and clear escalation paths when decisions seem to be wrong, but the goal is consensus, not unanimity. It’s fine and expected to see some instances of “I don’t agree with your metrics/valuation, but I see how you can come to that conclusion.”
  • Some of the quantities are subtle, or we don’t know how to measure them. Sometimes this manifests as “We don’t know how much engineer-time this will take.” Sometimes it is even more nebulous: how do you measure the engineering cost of a poorly designed API? Or the societal impact of a product choice?

How to Work Well on Teams

  • Ensuring that there is at least good documentation in addition to a primary and a secondary owner for each area of responsibility
  • Projects run into unpredictable design obstacles or political hazards, or we simply discover that things aren’t working as planned. Requirements morph unexpectedly. How do you get that feedback loop so that you know the instant your plans or designs need to change? Answer: by working in a team. Most engineers know the quote, “Many eyes make all bugs shallow,” but a better version might be, “Many eyes make sure your project stays relevant and on track.”
  • Group teams of four to eight people together in small rooms (or large offices) to make it easy (and non-embarrassing) for spontaneous conversation to happen.

Lose the ego

  • If you perform a root-cause analysis on almost any social conflict, you can ultimately trace it back to a lack of humility, respect, and/or trust.
  • Even if you know you’re the wisest person in the discussion, don’t wave it in people’s faces. For example, do you always feel like you need to have the first or last word on every subject? Do you feel the need to comment on every detail in a proposal or discussion?
  • “The appearance of conforming gets you a long way.” If you chose to assert your ego in any number of ways, “I am going to do it my way,” you pay a small steady price throughout the whole of your professional career. And this, over a whole lifetime, adds up to an enormous amount of needless trouble.

Learn to give and take criticism

  • For example, if you have an insecure collaborator, here’s what not to say: “Man, you totally got the control flow wrong on that method there. You should be using the standard xyzzy code pattern like everyone else.” This feedback is full of antipatterns: you’re telling someone they’re “wrong” (as if the world were black and white), demanding they change something, and accusing them of creating something that goes against what everyone else is doing (making them feel stupid). Your coworker will immediately be put on the offense, and their response is bound to be overly emotional.
  • A better way to say the same thing might be, “Hey, I’m confused by the control flow in this section here. I wonder if the xyzzy code pattern might make this clearer and easier to maintain?”

Knowledge sharing

  • Engineers have a tendency to reach for “this is bad!” far more quickly than is often warranted, especially for unfamiliar code, languages, or paradigms
  • The first time you learn something is the best time to see ways that the existing documentation and training materials can be improved

Challenges to Learning

A group of people that is split between people who know “everything” and novices, with little middle ground. This problem often reinforces itself if experts always do everything themselves and don’t take the time to develop new experts through mentoring or documentation. In this scenario, knowledge and responsibilities continue to accumulate on those who already have expertise, and new team members or novices are left to fend for themselves and ramp up more slowly.

Tribal and written knowledge complement each other. Even a perfectly expert team with perfect documentation needs to communicate with one another, coordinate with other teams, and adapt their strategies over time. No single knowledge-sharing approach is the correct solution for all types of learning,

Psychological Safety in Large Groups

  • No “well-actuallys”
    • Pedantic corrections that tend to be about grandstanding rather than precision.
  • No back-seat driving
    • Interrupting an existing discussion to offer opinions without committing to the conversation.

Leading Teams

2024-03-08 00:00:00 +0000

From How to Lead a Team

  • Managers and tech leads work together to ensure engineers’ tasks match their skill sets and skill levels. In smaller teams, both roles may be held by one person
    • Most TLs are ICs, and they should delegate tasks to team members, even if it means the work gets done more slowly. Otherwise, the team is hard to grow in size and capacity
  • Act quickly on difficult situations, e.g., around productivity, skill, and motivation, because these problems rarely work themselves out

Influence without authority

  • Identify a strategic need for the company and show how it is linked to existing priorities.
  • Automate your process to reduce friction
  • Working to build team consensus
  • If your team is moving quickly, sometimes it will voluntarily concede authority and direction to team leads. Even though this might look like a dictatorship or oligarchy, when it’s done voluntarily, it’s a form of consensus.

Moving from IC to a leadership role

  • “Above all, resist the urge to manage.”
  • The team’s social health is as important as its technical health, but harder to manage
    • At the end of each one-on-one meeting, “What do you need?”
  • Great managers worry about what things get done (and trust their team to figure out how).
  • Drive consensus and set directions, and let people put the product together to figure out how it should be done
  • You will not get everything right, nor will you have all the answers, and if you act like you do, you’ll quickly lose the respect of your team.
  • Your job is to inspire, but inspiration is a 24/7 job. Your visible attitude about absolutely everything, no matter how trivial, is unconsciously noticed and spreads infectiously to your team.

Leading at Scale

From Leading at Scale

  • Go “broad” rather than “deep”.
    • Loss of details
    • Engineering expertise becomes less relevant
    • Depends on technical intuition and the ability to move engineers in good directions
  • Your job became less proactive and more reactive. The higher up in leadership you go, the more escalations you receive. You are the “finally” clause in a long list of code blocks!
  • 95% observation and listening, and 5% making critical adjustments in just the right place. Listen to your leaders and skip-reports. Talk to your customers, who may not be end users out in the world, but your coworkers. Customers’ happiness requires just as much intense listening as your reports’ happiness.
  • Walk a fine line between “too rigid” and “too vague” team structures
  • A common mistake is to put a team in charge of a specific product rather than a general problem. A product is a solution to a problem. The life expectancy of solutions can be short, and products can be replaced by better solutions. However, a problem, if chosen well, can be evergreen.

Strategy

  • More on the high level strategy. Most of the decisions are about finding the correct set of trade-offs.
  • To defend against “analysis paralysis”, frame your process as continuous re-balancing of trade-offs
  • Guide people to solve difficult, ambiguous problems, i.e., problems with no obvious solution that may not even be solvable
    • See the forest, find a path to the important trees, and let engineers chop them down
  • Your strategy needs to cover not just overall technical direction, but an organizational strategy as well. You’re building a blueprint for how the ambiguous problem is solved and how your organization can manage the problem over time.

Delegation

  • Leave a trail of self-sufficient success in your wake.
  • Avoid the trap of “If you want something done right, do it yourself.”
  • Delegation is absolutely the most effective way to train them. You give them an assignment, let them fail, and then try again and try again
  • Unless the task is truly time sensitive and on fire, bite the bullet and assign the work to someone else, presumably someone who you know can do it but will probably take much longer to finish. You need to create opportunities for your leaders to grow; they need to learn to “level up” and do this work themselves so that you’re no longer in the critical path.

Minerva: Airbnb's Business Metrics Platform

2024-03-01 00:00:00 +0000

Pain points

  • New tables were created manually on top of core_data tables every other day, but there was no way to tell if similar tables already existed.
  • Data lineage became impossible to track. When a data issue upstream was discovered and fixed, there was no guarantee that the fix would propagate to all downstream jobs.
  • Different teams reported different numbers for very simple business questions, and there was no easy way to know which number was correct.
  • core_data standardized table consumption and allowed users to quickly identify which tables to build upon. On the other hand, it burdened the centralized data engineering with the impossible task of gate-keeping and onboarding an endless stream of new datasets into new and existing core tables. Furthermore, pipelines built downstream of core_data created a proliferation of duplicate and diverging metrics. We learned from this experience that table standardization was not enough and that standardization at the metrics level is key to enabling trustworthy consumption. After all, users do not consume tables; they consume metrics, dimensions, and reports.

Use cases

  • Minerva’s product vision is to allow users to “define metrics once, use them everywhere”. That is, a metric created in Minerva should be easily accessed in dashboards, our A/B testing framework, or our anomaly detection algorithms to spot business anomalies
  • Users simply ask for metrics and dimension cuts, and receive the answers without having to worry about the “where” or the “how”.
  • Data Catalog
    • When a user interfaces with the Dataportal and searches for a metric, it ranks Minerva metrics at the top of the search results. The Dataportal also surfaces contextual information, such as certification status, ownership, and popularity so that users can gauge the relative importance of metrics. For most non-technical users, the Dataportal is their first entry point to metrics in Minerva.
  • Data Exploration
    • Upon selecting a metric, users are redirected to Metric Explorer, a component of the Dataportal that enables out-of-the-box data exploration. On a metric page, users can see trends of a metric with additional slicing and drill down options such as Group By and Filter
  • A/B Testing
    • All base events for A/B tests are defined and sourced from Minerva.
  • Executive Reporting
  • Data Analysis
    • Minerva data is exposed to Airbnb’s custom R and Python clients through Minerva’s API. This allows data scientists to query Minerva data in a notebook environment with ease

Metrics

  • Simple metrics are composed of single materialized events (e.g., bookings)
  • Filtered metrics are composed of simple metrics filtered on a dimension value (e.g., bookings in China);
  • Derived metrics are composed of one or more non-derived metrics (e.g. search-to-book rate).
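To make the three types concrete, here is a minimal, hypothetical sketch of what such definitions could look like as configuration; the field names and structure are my own illustration, not Minerva's actual schema.

# Hypothetical metric definitions illustrating the three metric types.
# Field names are illustrative assumptions, not Minerva's actual config schema.

simple_metric = {
    "name": "bookings",
    "type": "simple",
    "event_source": "fact_bookings",   # assumed event source name
    "aggregation": "count",
}

filtered_metric = {
    "name": "bookings_china",
    "type": "filtered",
    "base_metric": "bookings",
    "filter": "dim_destination_country = 'China'",
}

derived_metric = {
    "name": "search_to_book_rate",
    "type": "derived",
    "expression": "bookings / searches",  # composed from non-derived metrics
}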

Sample Request

  • Average daily price (ADR), cut by destination region, excluding private rooms for the past 4 weeks in the month of August 2021.
{
      metric: 'price_per_night',
      groupby_dimension: 'destination_region',
      global_filter: 'dim_room_type != "private-room"',
      aggregation_granularity: 'W-SAT',
      start_date: 20210801,
      end_date: 20210901,
      truncate_incomplete_leading_data: true,
      truncate_incomplete_trailing_data: true,
}
  • Given that the price_per_night metric is a ratio metric (a special case of derived metric) that contains a numerator (gross_booking_value_stays) and a denominator (nights_booked), Minerva API breaks up this request into two sub-requests.
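A rough sketch of that split (not Minerva's actual API; the helper and field names simply mirror the sample request above):

# Hypothetical illustration of how the Minerva API might split a ratio metric
# request into numerator and denominator sub-requests. Field names mirror the
# sample request above; the splitting logic itself is an assumption.

request = {
    "metric": "price_per_night",          # ratio metric: numerator / denominator
    "groupby_dimension": "destination_region",
    "global_filter": 'dim_room_type != "private-room"',
    "aggregation_granularity": "W-SAT",
    "start_date": 20210801,
    "end_date": 20210901,
}

def split_ratio_request(req, numerator, denominator):
    """Produce one sub-request per component metric, keeping cuts and filters."""
    return [dict(req, metric=numerator), dict(req, metric=denominator)]

sub_requests = split_ratio_request(
    request,
    numerator="gross_booking_value_stays",
    denominator="nights_booked",
)
# The API would aggregate each sub-request, join on the group-by keys,
# and divide the numerator by the denominator to produce price_per_night.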

Minerva vs Shasta

Differences

  • Target audience: Minerva is mainly for internal users, while Shasta is mainly for external users, e.g., of the AdWords web UI.
  • Latency: Sub-second latency is P0 for Shasta, nice to have for Minerva
  • Storage: reading from different storage systems is a P0 for Shasta, and nice to have for Minerva
  • Consistent metric definition: P0 for Minerva, P1 for Shasta
  • Reverse ETL: Shasta reduces the need for reverse ETL by reading the OLTP storage directly; this use case is not a focus for Minerva

Similarities

  • Separation of dimensions and metrics.
  • Metrics often contain business logic
  • Declarative metric computation
  • Separation of dimension columns and metric columns

  • Minerva takes (normalized) fact and dimension tables as inputs, de-normalizes them, and serves the aggregated data to downstream applications.
  • It uses Airflow for workflow orchestration, Apache Hive and Apache Spark as the compute engine, and Presto and Apache Druid for consumption.
  • Minerva defines key business metrics, dimensions, and other metadata in a centralized GitHub repository
  • Minerva de-normalizes data efficiently by reusing data and intermediate joined results.
  • Minerva provides a unified data API to serve both aggregated and raw metrics on demand.
  • Users define the “what” and not the “how”. How the metrics are calculated, stored, or served is entirely abstracted away from end users.
  • Minerva focuses on metrics and dimensions as opposed to tables and columns.
  • Minerva requires metric authors to provide self-describing metadata, such as ownership, lineage, and metric descriptions.

  • At the heart of Minerva’s configuration system are event sources and dimension sources, which correspond to fact tables and dimension tables in a Star Schema design, respectively.
  • Event sources define the atomic events from which metrics are constructed, and dimension sources contain attributes and cuts that can be used in conjunction with the metrics
  • With Minerva, users can simply define a dimension set, an analysis-friendly dataset that is joined from Minerva metrics and dimensions.

computational flow

  • Ingestion Stage: Partition sensors wait for upstream data and data is ingested into Minerva.
  • Data Check Stage: Data quality checks are run to ensure that upstream data is not malformed.
  • Join Stage: Data is joined programmatically based on join keys to generate dimension sets.
  • Post-processing and Serving Stage: Joined outputs are further aggregated and derived data is virtualized for downstream use cases.

In the ingestion stage, Minerva waits for upstream tables to land and ingests the data into the static Minerva tables. This ingested data becomes the source-of-truth and requires modification to the Minerva configs in order to be changed.

Here are some typical checks that Minerva runs:

  • sources should not be empty
  • timestamps should not be NULL and should meet ISO standards
  • primary keys should be unique
  • dimension values are consistent with what is expected
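These checks map naturally onto simple SQL assertions. The sketch below is a generic illustration rather than Minerva's implementation; the table and column names (fact_bookings, ds, id_booking, dim_room_type) are assumptions.

# Generic data-quality checks in the spirit of the list above.
# Table and column names (fact_bookings, ds, id_booking) are assumptions.

checks = {
    "source_not_empty":
        "SELECT COUNT(*) > 0 FROM fact_bookings WHERE ds = '{ds}'",
    "timestamps_not_null":
        "SELECT COUNT(*) = 0 FROM fact_bookings "
        "WHERE ds = '{ds}' AND created_at IS NULL",
    "primary_key_unique":
        "SELECT COUNT(*) = COUNT(DISTINCT id_booking) "
        "FROM fact_bookings WHERE ds = '{ds}'",
    "dimension_values_expected":
        "SELECT COUNT(*) = 0 FROM fact_bookings WHERE ds = '{ds}' "
        "AND dim_room_type NOT IN ('entire-home', 'private-room', 'shared-room')",
}

def render_checks(ds: str) -> dict:
    """Bind the partition date into each check query."""
    return {name: sql.format(ds=ds) for name, sql in checks.items()}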

References

TAO: Facebook’s Distributed Data Store for the Social Graph

2024-02-12 00:00:00 +0000

Workload

  • 99.8% of the transactions are reads and only 0.2% are writes
  • Creation time locality dominates the workload — a data item is likely to be accessed if it has been recently created.
  • The content on a page is highly customizable depending on users’ privacy settings, and it is personalized for every user. This means that the data needs to be stored as-is and then filtered when it is being viewed/rendered.
  • A relatively large percentage of requests are for relations that do not exist — e.g., “Does this user like that story?” is false for most of the stories

Requirements

  • High probability that users always see their own updates. 99.99% of the results were the same as the ones that would have been served under a strict consistency model
  • It is tolerable to have some inconsistencies in the content that is presented to the user while it is not tolerable to have high latency or unavailability.

Sample queries to support

  • “Does this user like that story?”
  • “What are the 50 most recent comments on this piece of content?”
  • “Give me the most recent 10 comments about a check-in by Alice”
  • “How many likes did the comment from Cathy have?”

Pain points

  • Storing the list of edges as a single value
    • Every read loads all edges, even if the final result contained only a few edges or was empty.
    • Small incremental updates to the list invalidate the whole cached list
    • A native list type solves the edge-list lookup, but makes updating such a list tricky
  • MySQL assumes data is accessed with spatial locality. This assumption is not true for Facebook’s workload
    • The workload has creation-time locality instead; loading a whole block into the MySQL block cache pulls in mostly unneeded data, which just occupies the cache without getting hits
  • Hard to mitigate thundering herds with leases to Memcache clients
    • Reads and writes on the same popular objects cause misses in the cache and then go to the database
    • Hard to coordinate between clients since they don’t talk to each other.
  • Changing table schemas as products evolved required coordination between engineers and MySQL cluster operators.

Data model

“Alice checks in at the Golden Gate Bridge and tags Bob there, while Cathy comments on the check-in and David likes it”

Data items as nodes/objects

  • e.g., user, check-in, comment, and location are all nodes. Actions that can happen multiple times/be repeatable are modeled as nodes instead of edges
  • A typed object containing a dictionary of named fields, e.g., text of the comment will be field in the comment node.
  • Each node has a 64-bit id
  • New fields can be registered for an object type at any time and existing fields can be marked deprecated by editing that type’s schema. In most cases product engineers can change the schemas of their types without any operational work.
  • Each object_id contains a shard_id in it, reflecting the logical location of that object.
  • All the data belonging to objects is serialized and stored against the id.

Relationships between nodes as edges/associations

  • Models relationships that happen at most once or represent a state transition, e.g., “liked by”, “friend of”, and “accepts invitation”
  • grouped in association lists by their origin, ordered by time
  • Multiple associations may connect the same pair of objects as long as the types of all those associations are distinct.
  • Together objects and associations form a labeled directed multi-graph.
  • For every association type, a so-called inverse type can be specified, e.g., “likes” and “liked by”. TAO maintains both the original edge and its inverse when an edge is created or deleted.
    • The inverse association type is often not the same as the original type. For example, the Golden Gate location object is connected to a check-in object via the check-in association type, while the check-in object connects back to the location object via the location association type. By comparison, “friend” has a symmetric inverse type.
  • Every association has a special time attribute, commonly used to represent the creation time of the association. Creation-time locality depends on this timestamp.
  • Associations are stored similarly, with the id as the key and the data serialized into one column. The table has an index on (originating id (Object1), time-based sort order, association type)
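A minimal sketch of this object/association model; the class and field names are illustrative assumptions, not TAO's actual code.

# Sketch of TAO's data model as described above. Names are illustrative only.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TaoObject:
    id: int                      # 64-bit id; embeds a shard_id
    otype: str                   # e.g. "USER", "CHECK_IN", "COMMENT", "LOCATION"
    data: Dict[str, str] = field(default_factory=dict)  # named fields, serialized for storage

@dataclass
class TaoAssociation:
    id1: int                     # originating object
    atype: str                   # e.g. "LIKED_BY", "COMMENT", "CHECK_IN"
    id2: int                     # destination object
    time: int                    # creation time; drives creation-time locality
    data: Dict[str, str] = field(default_factory=dict)

# Associations are grouped into lists keyed by (id1, atype) and ordered by time,
# mirroring the storage index described above.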

API

  • Standard CRUD APIs for nodes and edges respectively
  • A point query on (node_id_1, edge_type, node_id_2)
  • Range queries find outgoing associations given an (id1, type) pair.
  • Count queries give the total number of outgoing associations for an (id1, type) pair. TAO optionally keeps track of counts as association lists grow and shrink, and can report them in constant time

Translate user query to TAO modes

  • Are two objects connected by an association?
    • A point query on (node_id_1, edge_type, node_id_2). This point query is also used to fetch the data stored on an edge
  • “What are the 50 most recent comments on this piece of content?”
    • Range queries find outgoing associations given an (id1, type) pair.
  • “Give me the most recent 10 comments about a check-in by Alice”
    • This can be modeled as assoc_range(CHECK_IN_ID, COMMENT, 0, 10).
  • “How many likes did the comment from Cathy have?”
    • assoc_count(COMMENT_ID, LIKED_BY)
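Expressed as code, these translations might look like the sketch below. assoc_range and assoc_count follow the calls named above; the client wrapper and the obj_get helper are assumptions for illustration.

# Sketch of the query translations listed above, using a hypothetical TAO client.
# assoc_range / assoc_count follow the calls named in the text; obj_get and the
# client wrapper itself are assumed for illustration.

def recent_comments_on_checkin(tao, check_in_id, limit=10):
    """'Give me the most recent N comments about a check-in.'"""
    edges = tao.assoc_range(check_in_id, "COMMENT", pos=0, limit=limit)
    return [tao.obj_get(edge.id2) for edge in edges]  # fetch the comment objects

def like_count(tao, comment_id):
    """'How many likes did this comment have?'"""
    return tao.assoc_count(comment_id, "LIKED_BY")

def does_user_like_story(tao, user_id, story_id):
    """Point query on (node_id_1, edge_type, node_id_2); most return empty."""
    return len(tao.assoc_get(story_id, "LIKED_BY", [user_id])) > 0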

Out of scope

  • NO operations for complex traversals or pattern matching on the graph. Executing such queries while responding to a user request is almost always a suboptimal design decision.
  • TAO does not offer a server-side set intersection primitive. Instead we provide a client library function. The lack of clustering in the data set virtually guarantees that having the client orchestrate the intersection through a sequence of simple point and range queries on associations will require about the same amount of network bandwidth and processing power as doing such intersections entirely on the server side.

Cluster topology

  • Nodes and edges are stored in separate clusters, since they have different workload patterns.
  • All objects and associations in the same shard are stored persistently in the same MySQL database, and are cached on the same set of servers in each caching cluster. Individual objects and associations can optionally be assigned to specific shards at creation time.
  • A single primary region per shard and rely on MySQL replication to propagate updates from the region where the shard is primary to all other regions (secondary regions). A secondary region cannot update the shard in its regional persistent store.
  • There are far more shards in the system than the number of hosts that host the mysql servers. So many shards are mapped onto a single host.

Failover

  • A slave database failure can be addressed by going to the leader in the master region.
  • The primary region can be switched to another region at any time. This is an automated procedure.

Removing hot spots

  • the hot spots can be alleviated by consistent hashing which makes addition of tiers easier without re-balancing caches a lot.
  • Shards can be migrated or cloned among servers in the same cluster to equalize the load and to smooth out load spikes
  • A shard can be hosted by a slave region in Asia that has a replica DB, followers and leaders.

Sequence of actions

Assuming the client is in Asia

SoA: Read

  • Client talks to the follower TAO cluster in the Asia slave region
  • If the cache misses, the follower fetches data from the leader cluster
    • Leader cluster fetches data from the MySql cluster in the region

SoA: Write

  • The client talks to the slave leader cluster in the Asia slave region.
    • The slave leader forwards the request to the leader in the master region synchronously
      • Master leader forwards to the master DB synchronously
        • Master DB replicates to the slave DB
      • Master leader updates the master cache
    • Slave leader updates the slave cache
    • Slave leader sends cache maintenance/invalidation messages to followers asynchronously. These messages let each follower know about changes made by other followers.
  • TAO maintains the write through cache, i.e., new data in cache won’t be visible until the write to storage is done. The write-through design of cache simplifies maintaining read-after-write consistency for writes that are made in a secondary region for the affected shard.
  • TAO clients may override the default policy at the expense of higher processing cost and potential loss of availability.
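A self-contained sketch of the write-through behavior described above; every class here is a stand-in for illustration, and real TAO components are not structured like this.

# Sketch of the write path: the cached copy is only updated after the write has
# been applied to persistent storage in the master region. All classes are
# illustrative stand-ins.

class MasterLeader:
    def __init__(self, master_db, master_cache):
        self.master_db, self.master_cache = master_db, master_cache

    def write_through(self, key, value):
        self.master_db[key] = value      # synchronous write to the master DB
        self.master_cache[key] = value   # then update the master cache

class Follower:
    def __init__(self):
        self.cache = {}

    def invalidate(self, key):
        self.cache.pop(key, None)        # maintenance message: drop the stale entry

class SlaveLeader:
    def __init__(self, master_leader, followers):
        self.master_leader = master_leader
        self.cache = {}
        self.followers = followers

    def write(self, key, value):
        # Forward the write synchronously to the master region first.
        self.master_leader.write_through(key, value)
        # Write-through: update the slave cache only after storage succeeded,
        # so readers in this region see their own writes.
        self.cache[key] = value
        # Asynchronously (here: just a loop) tell followers to invalidate.
        for follower in self.followers:
            follower.invalidate(key)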

References

How to Speak

2024-01-28 00:00:00 +0000

Notes from the video

How to start

  • Don’t start with a joke, because people are still zoning in
  • Start with an empowerment promise: what people will learn in the next hour
  • Cycle through your ideas as you give the talk, because at any given moment someone will be zoning out
  • Distinguish your ideas from someone else’s
  • Verbal punctuation, to get zoned out people back on track, e.g., review the outline and overall structure
  • ask a question and wait a bit for answers

On Removing Features

2024-01-15 00:00:00 +0000

Simplicity

  • 80% of the people use 20% of the features. So you convince yourself that you only need to implement 20% of the features, and you can still sell 80% as many copies. Unfortunately, it’s never the same 20%. Everybody uses a different set of features.
  • 20% products is an excellent bootstrapping strategy because you can create them with limited resources and build an audience.
  • What works for bootstrapping will not work as a good long term strategy, because there’s very little to prevent the next two-person startup from cloning your simple app, and because eventually you can’t fight human nature: “The people want the features,”
    • nothing we have ever done at Fog Creek has increased our revenue more than releasing a new version with more features.
  • When we tried Google ads, when we implemented various affiliate schemes, or when an article about FogBugz appears in the press, we could barely see the effect on the bottom line. When a new version comes out with new features, we see a sudden, undeniable, substantial, and permanent increase in revenue
  • If you think simplicity means “not very many features” or “does one thing and does it well,” then I applaud your integrity but you can’t go that far with a product that deliberately leaves features out.

Should I remove a feature used by only 0.1% of traffic?

  • In theory all users could be affected by the removed features. The 0.1% and 99.9% are not complements.
  • “Which percent contains the value of the product?”, because that might be the most important 0.1%.
  • Delete button may be rarely used. But if it is taken out, there will be a lot of support requests.
  • Sometimes, it’s not about the feature being used. It’s about the user’s confidence they won’t get stuck, or feel the product is supported adequately, or even, it’s a checkbox on a purchase requirements list.
  • People may be just playing the most popular games, but they want to browse the comprehensive list.

The Oberon Compiler

2024-01-04 00:00:00 +0000

Compilation of a program text proceeds by analyzing the text and thereby decomposing it recursively into its constructs according to the syntax. When a construct is identified, code is generated according to the semantic rule associated with the construct. The components of the identified construct supply parameters for the generated code. It follows that we distinguish between two kinds of actions: analyzing steps and code generating steps. In a rough approximation we may say that the former are source language dependent and target computer independent, whereas the latter are source language independent and target computer dependent.

The main module of the compiler is ORP (for Oberon to RISC Parser). It is primarily dedicated to syntactic analysis, i.e. parsing. Upon recognition of a syntactic construct, an appropriate procedure is called in the code generator module ORG (for Oberon to RISC Generator). Apart from parsing, ORP checks for type consistency of operands, and it computes the attributes of objects identified in declarations. Whereas ORP mirrors the source language and is independent of a target computer, ORG reflects the target computer, but is independent of the source language.

Each time the syntax analyzer (parser) proceeds to read the next symbol, it does this by calling procedure Get, which constitutes the so-called scanner residing in module ORS (for Oberon to RISC Scanner). It reads from the source text as many characters as are needed to recognize the next symbol.

The recognition of symbols within a character sequence is called lexical analysis.

Ideally the recognition of any syntactic construct, say A, consisting of subconstructs, say B1, B2, … , Bn, leads to the generation of code that depends only on (1) the semantic rules associated with A, and (2) on (attributes of) B1, B2, … , Bn. If this condition is satisfied, the construct is said to be context-free, and if all constructs of a language are context-free, then also the language is context-free.

An exception is embodied by the notion of declarations. The declaration of an identifier, say x, attaches permanent properties to x, such as the fact that x denotes a variable and that its type is T. These properties are “invisible” when parsing a statement containing x, because the declaration of x is not also part of the statement. The “meaning” of identifiers is thus inherently context-dependent. Context-dependence due to declarations is the immediate reason for the use of a global data structure which represents the declared identifiers and their properties (attributes). Since this concept stems from early assemblers where identifiers (then called symbols) were registered in a linear table, the term symbol table tends to persist for this structure, although in this compiler it is considerably more complex than an array.

A complication arises from the notion of exports and imports in Oberon. Its consequence is that the declaration of an identifier x may be in a module, say M, different from where x is referenced. If x is exported, the compiler includes x together with its attributes in the symbol file of the compiled module M. When compiling another module which imports M, that symbol file is read and its data are incorporated into the symbol table. Procedures for reading and writing symbol files are contained in module ORB, and no other module relies on information about the structure of symbol files.

The compiler uses the straightforward scheme of sequential allocation of consecutively declared variables. An address is a pair consisting of a base address (in a register) and an offset. Global variables are allocated in the module’s data section and the respective base address register is SB (Static Base, see Chapter 6). Local variables are allocated in a procedure activation record on the stack; the respective base register is SP (Stack Pointer). Offsets are positive integers.

The parser is designed according to the proven method of top-down, recursive descent parsing with a look-ahead of a single symbol. The last symbol read is represented by the global variable sym. Syntactic entities are mirrored by procedures of the same name. Their goal is to recognize the specified construct in the source text. The start symbol and corresponding procedure is Module.

The rule of parsing strictly based on a single-symbol look-ahead and without reference to context is violated in three places. The prominent violation occurs in statements. If the first symbol of a statement is an identifier, the decision of whether an assignment or a procedure call is to be recognized is based on contextual information, namely the class of the identified object. The second violation occurs in qualident; if the identifier x preceding a period denotes a module, it is recognized together with the subsequent identifier as a qualified identifier. Otherwise x supposedly denotes a record variable. The third violation is made in procedure selector; if an identifier is followed by a left parenthesis, the decision of whether a procedure call or a type guard is to be recognized is again made on the basis of contextual information, namely the mode of the identified object.

Besides the parsing of text, the Parser also performs the checking for type consistency of objects. This is based on type information held in the global table, gained during the processing of declarations, which is also handled by the routines which parse. Thereby an unjustifiably large number of very short procedures is avoided. However, the strict target-computer independence of the parser is violated slightly: Information about variable allocation strategy including alignment, and about the sizes of basic types is used in the parser module. Whereas the former violation is harmless, because the allocation strategy is hardly controversial, the latter case constitutes a genuine target-dependence embodied in a number of explicitly declared constants. Mostly these constants are contained in the respective type definitions, represented by records of type Type initialized by ORB.

An inherently nasty subject is the treatment of forward references in a single-pass compiler. In Oberon, there are two such cases.

The symbol table constitutes the context in which statements and expressions are parsed. Each procedure establishes a scope of visibility of local identifiers. The records registering identifiers belonging to a scope are linked as a linear list. They are of type Object. Each object has a type.

If a new identifier is to be added, procedure NewObj first searches the list, and if the identifier is already present, a double definition is diagnosed. Otherwise the new element is appended, thereby preserving the order given by the source text.

Procedures, and therefore also scopes, may be nested. Each scope is represented by the list of its declared identifiers, and the list of the currently visible scopes are again connected as a list. Procedure OpenScope appends an element and procedure CloseScope removes it. The list of scopes is anchored in the global variable topScope and linked by the field dsc. It is treated like a stack. It consists of elements of type Object, each one being the header (class = Head) of the list of declared entities.

A search of an identifier proceeds first through the scope list, and for each header its list of object records is scanned. This mirrors the scope rule of the language and guarantees that if several entities carry the same identifier, the most local one is selected. The linear list of objects represents the simplest implementation by far. A tree structure would in many cases be more efficient for searching, and would therefore seem more recommendable. Experiments have shown, however, that the gain in speed is marginal. The reason is that the lists are typically quite short. The superiority of a tree structure becomes manifest only when a large number of global objects is declared. We emphasize that when a tree structure is used for each scope, the linear lists must still be present, because the order of declarations is sometimes relevant in interpretation, e.g. in parameter lists.
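A compact transliteration of this scope handling into Python for illustration (the actual compiler is written in Oberon); the procedure names follow OpenScope, CloseScope, and NewObj from the text.

# Python transliteration of the scope handling described above; the real
# compiler implements this in Oberon with records of type Object.

class Obj:
    def __init__(self, name, cls, typ=None):
        self.name, self.cls, self.typ = name, cls, typ
        self.next = None          # next object in the same scope's list
        self.dsc = None           # for headers: link to the enclosing scope

top_scope = None                  # stack of scopes, linked via dsc

def open_scope():
    global top_scope
    head = Obj(name="", cls="Head")
    head.dsc = top_scope
    top_scope = head

def close_scope():
    global top_scope
    top_scope = top_scope.dsc

def new_obj(name, cls):
    """Append a new identifier to the current scope; diagnose double definitions."""
    prev, obj = top_scope, top_scope.next
    while obj is not None:
        if obj.name == name:
            raise ValueError(f"double definition of {name}")
        prev, obj = obj, obj.next
    prev.next = Obj(name, cls)
    return prev.next

def this_obj(name):
    """Search scopes from most local to global; return the first match."""
    scope = top_scope
    while scope is not None:
        obj = scope.next
        while obj is not None:
            if obj.name == name:
                return obj
            obj = obj.next
        scope = scope.dsc
    return None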

Not only procedures, but also record types establish their own local scope. The list of record fields is anchored in the type record’s field dsc, and it is searched by procedure thisField. If a record type R1 is an extension of R0, then R1’s field list contains only the fields of the extension proper. The base type R0 is referenced by the BaseTyp field of R1. Hence, a search for a field may have to proceed through the field lists of an entire sequence of record base types.

A symbol file is a linearized form of an excerpt of the symbol table containing descriptions of all exported (marked) objects.

Sample scope of Semantic analysis

This scope demonstrates most typical use cases, and the actual scope is way bigger

  • Detect a type conflict in an expression, i.e., in X OP Y, the types of either X or Y are incompatible with OP
  • Detect an illegal assignment, i.e., A := Expr, where A is not a variable
  • Detect a type conflict in an assignment
  • Detect an illegal procedure call
    • different number of arguments
    • argument type is not assignable or equivalent to parameter’s type.
    • the corresponding argument is not addressable (i.e., the argument must be a variable, not a constant, nor a literal, nor an expression, nor a procedure call).
  • Detect an illegal constant declaration, where the value of Expr is not known at compile time.

References

Shasta: Interactive reporting at scale

2023-12-03 00:00:00 +0000

Workload

  • Power the UI with both OLAP and OLTP functionality in the same view, e.g., the AdWords Web UI.
    • Users can select columns, filters, and segmentation in the UI, while still expecting sub-second latency.
  • Because users expect near real-time experience, queries have to go from the primary storage to user facing views.
    • Transactional data stores are optimized for writes.
    • Underlying OLAP data stores and OLTP data stores are diverse
  • Interfaces and APIs are often scoped to an individual user’s business data, so that typical Shasta queries process only a modest subset of data in underlying data stores.

Pain points

  • Reporting query is complex, often joining 50+ tables from different data stores. For instance, computing the main table in the AdWords UI “Campaigns” tab involves joining and aggregating approximately 20 F1 tables and 20 Mesa tables
    • Queries were way too large to be expressed gracefully in SQL. These complex SQLs are hard to reuse
  • Write-friendly schemas need significant transformation logic to be read-friendly. This “concept gap” tends to be particularly wide for OLAP functionality.
    • Classic ETL + pre-computation is not feasible due to freshness requirements
    • The necessary offline pipelines tend to make systems more stateful and operationally complex.
    • Transactionally updated materialized views on the other hand increase the cost of writes.

Architecture

  • Does not use pre-computation or materialized view of intermediate results.
  • Translate a client request to a single SQL and send the SQL to F1. F1 then uses its distributed query engine to join between different storage systems
  • RVL is SQL-like, with support for view params and UDFs
  • Added UDF server to F1 to support UDFs in RVL
  • Added in-memory, read-only TableCache between F1 storage and query engine. For other storages such as Mesa or BigTable, such cache had questionable RoI.

Sequence of actions

  • The FE server sends a query, which has the view name, columns to query, and view params such as user ID and feature ID. The query also contains a timestamp for snapshot versioning
  • The view gateway accepts the query, binds the view params, and generates a SQL query.
    • RVL compiler uses RVL code + table metadata to generate SQL
  • The request handler then executes the SQL query on F1 and forwards the query results back to the Shasta client.
    • The SQL is executed by the F1 query engine, which also contacts the UDF server and TableCache. The query engine contacts Mesa and BigTable storage directly.

RVL

  • RVL uses information in the schema to automatically aggregate data when columns are projected.
  • The support for implicit aggregation has parallels to the MDX language for querying OLAP data cubes.

Query Language

  • The syntax and semantics are similar to SQL, with one fundamental difference: RVL automatically determines how to aggregate query results.
  • the metadata of each column may optionally specify an implicit aggregation function. If a column has an implicit aggregation function, we refer to that column as an aggregatable column. Otherwise, it is a grouping column.
  • In the special case where all columns are grouping columns, the behavior of RVL is consistent with relational algebra on sets
  • RVL also supports syntax to modify or remove implicit aggregation functions assigned to the columns of a relation
  • SELECT, FROM, WHERE, JOIN, and UNION: The behavior of these operations is similar to SQL, except that implicit aggregation is applied after each operation to yield a set of unique rows
Sample 1
  • Employee table: EmpId, DeptId, BldgId, Salary[SUM]
  • Building table: BldgId, CityId, Capacity[SUM]
Q0 = SELECT *
      FROM Employee LEFT JOIN Building USING (BldgId);
Q1 = SELECT DeptId, Salary FROM Q0;

generates

 SELECT DeptId, SUM(Salary) FROM Employee
 GROUP BY DeptId;
  • The join with the Building table can be pruned since none of the Building columns are required.
Sample 2
 Q2 = SELECT CityId, Salary, Capacity FROM Q0;

generates


SELECT CityId, SUM(Salary) AS Salary,
        SUM(Capacity) AS Capacity
 FROM
   (SELECT BldgId, SUM(Salary) AS Salary
    FROM Employee GROUP BY BldgId)
   LEFT JOIN Building USING (BldgId)
 GROUP BY CityId;

  • We aggregate requested columns before joining
  • Building to employee is 1 to many, so we have to make sure each building capacity is summed once, instead of once per employee.

View Templates

  • RVL embeds its query language in view templates.
    • A view template specifies a dynamic query using replaceable parameters.
    • Within each template, a query can be built as a sequence of subquery assignment statements. Each subquery in a template may refer to template parameters and instantiate other view templates.
Motivations
  • The view parameters may change the tables used in joins or the placement of filters in the query. RVL needs to be more dynamic in order to capture this wide range of possible query structures.
  • A typical Shasta view would require 100s of lines of code, and expressing that as a single large query can be difficult to read and maintain.
  • Allow large queries to be constructed dynamically from smaller pieces, which can be composed and reused by multiple Shasta views. Idea similar to Spark SQL’s DataFrame API
  • RVL view templates may use control structures written as if/else blocks to dynamically choose between two or more subqueries.
Parameters
  • A view template may be referenced in the FROM clause of RVL queries by passing values for its input parameters
  • RVL text: A string containing valid RVL syntax can be bound to a view template parameter, and that parameter can be referenced in places where it would be valid to inject the RVL syntax string. For example, a template parameter bound to the string “X,Y” could be referenced in a SELECT clause, and the template parameter reference will behave exactly as if “X,Y” were written directly in the query.
    • RVL text values do not allow for arbitrary code injection
    • In order to make view templates less error-prone, an RVL text value is only allowed to contain a few specific syntactic forms, such as scalar expressions, expression lists, and base table names
  • Nested dictionary: A template parameter can be bound to a dictionary of name-value pairs, where the values can either be another dictionary, or RVL text. Intuitively, a nested dictionary is a collection of RVL text parameters with hierarchical names.
  • Subquery: A template parameter can be bound to an RVL subquery, and referenced anywhere a table can be referenced. A subquery value differs from RVL text values, in the sense that subquery values are substituted in a purely logical manner which is independent of the syntax used to create the subquery. In contrast, an RVL text value is purely a text injection, which allows any variables in the RVL text to be interpreted based on the context where the parameter is referenced.
 view FilterUnion<input_table, params> {
   T1 = SELECT $params.column_name FROM $input_table;
   T2 = SELECT $params.column_name FROM Employee;
   T = T1 UNION T2;
   return SELECT * FROM T
          WHERE $params.column_name >= $params.min_value;
}
  • input_table can be bound to a table name or subquery
  • params must be bound to a nested dictionary.
main view template
  • A view template can be designated as an entry point in the RVL code
  • RVL provides an API to invoke a main view template, with a nested dictionary as a parameter.

main OutputValues<params> {
   b = SELECT * FROM Building;
   all_values = SELECT * from FilterUnion<@b, $params>;
   output all_values AS result;
}

  • The output statement specifies a table to produce as the final result when the main view template is invoked, as well as an alias for that table.
  • If there are multiple output statements, the aliases must be unique so that the Shasta view gateway can distinguish the results. Multiple output statements can reference the same RVL subquery by name which is useful when applications need to display multiple pivots of the same shared view computation. To achieve consistency between different data pivots within a view query, RVL guarantees that each named subquery is only executed once.

Compiler

  • the RVL compiler first resolves references to view templates and named subqueries, producing an algebraic representation of an RVL query plan that includes all outputs of the invoked main view template. The RVL compiler performs some transformations to optimize and simplify the query plan before translating it to SQL
  • The RVL compiler optimizes query plans using a rule-based engine. Each rule uses a different strategy to simplify the plan based on algebraic structure, without estimating cost. In practice, rule-based optimization is sufficient because the only goal is to simplify the generated SQL, rather than determine all details of query execution. We avoid using cost-based optimization because a cost model would tie the RVL compiler to a specific SQL engine and make it less generic.
Column Pruning
  • When computing aggregate values after joining, column pruning can reorder join and aggregation steps, so that aggregations happens before joins whenever possible. This reordering simplifies the job for F1’s query optimizer.
SELECT CityId, Salary, Capacity
FROM Employee LEFT JOIN Building USING (BldgId);

Before column pruning

 SELECT CityId, SUM(Salary) AS Salary,
        SUM(Capacity) AS Capacity
 FROM
   (SELECT BldgId, CityId,
           SUM(Salary) AS Salary, Capacity
    FROM Employee LEFT JOIN Building USING (BldgId)
    GROUP BY BldgId, CityId, Capacity)
 GROUP BY CityId;

After

SELECT CityId, SUM(Salary) AS Salary,
        SUM(Capacity) AS Capacity
 FROM
   (SELECT BldgId, SUM(Salary) AS Salary
    FROM Employee GROUP BY BldgId)
   LEFT JOIN Building USING (BldgId)
 GROUP BY CityId;

Filter Pushdown
  • Filter pushdown can improve the effectiveness of the column pruning optimization. For example, if there is a filter on a column which is not part of the final result, that filter prevents column pruning from removing the column until the filter is applied; pushing the filter down means the column can be pruned earlier.
Left Join Pruning
  • A user can add many left joins to their view templates to fetch columns which might not be required, and they can be confident that the RVL compiler will know which joins can be skipped.
  • if a left join does not require any of the columns from its right input, the right input can be removed from the query plan

QUERY EXECUTION ON F1

  • Shasta represents a novel use of F1’s query engine as it places heavily distributed query execution at the heart of latency sensitive and business-critical applications. While Shasta queries are typically scoped to a modest subset of data in underlying databases (e.g., a single advertiser’s campaign configurations and performance metrics), the combination of remote input data, diverse data stores, data freshness requirements, and complexity of query logic makes it challenging to achieve low latencies reliably using centralized execution.
  • DAG-structured query plans
  • External Data Sources: Using a plugin framework for federated querying, the F1 query engine supports several data sources. Central to the plugin framework is the abstraction of a ScanOperator. For every supported data source, a ScanOperator implementation translates between the data source API and the F1 SQL runtime
  • RVL code can invoke user-defined functions (UDFs) written in procedural languages such as C++ or Java. Motivations include sharing critical procedural code between RVL definitions and other systems processing the same data, re-use of subtle legacy code, and expressivity.
  • In order to be able to register with the F1 query engine, UDF server binaries must implement a standard RPC API defined by F1.

References

On Negotiation

2023-11-22 00:00:00 +0000

Time to negotiate

  • They are keen to sell it and make you an easy offer, e.g., an offer at the first turn
  • You have a complaint. Turn that into negotiation, since you are in a strong position.
    • Dirty hotel room -> move to a better room
    • Dead mouse -> rejected a year’s worth of bread. What can we do to make you happy? Asked for something else.
  • As long as the parties don’t walk out, negotiation won’t cost you the deal.
    • Just a game to them. Doesn’t affect how they will like you. Does not make you like them more, if they slash the price.

Planning - most important part of negotiation

  • Set your walk away point. Walk away even if it is super close. When you set a limit, you mean it. Walking away empowers you, and let the other side know you are able to.
    • Your initial walkaway point may not be the actual one
    • Opening offer: external factors, e.g., market, going rates. Walkaway point: internal, personal, value to you.
    • Decide the walkaway point before checking the going rates
    • They may not be able to afford to walk away either, even when you cannot. See who crumbles first.
  • List tradable values before the talk starts. Use them to trade and get from the starting position to the final position
  • Opening offers - prefer face to face
    • Don’t open first. Ask for a budget. Pass it back if the other side doesn’t open first.
    • Less talking, more questions.
      • What they have been doing.
      • What other options
    • Open below best possible when buying, and above when selling. Find reasons to justify the low opening offers
      • even a 10% chance is worth a try
      • but blame it on yourself, or they may be upset
    • Don’t open with a round number; use 7s and 4s instead. It sounds like you thought about it
      • Most likely rounding down
    • Ambitious, outrageous opening offer is fine later
    • You must do a flinch. Look into their eyes, do they flinch? If they don’t, don’t move from your offer. This is different from poker
  • Their issues: break it apart, which part is less expensive for me to address?
  • Your issues: come with examples and stories so they can understand
  • Shared issues: do both

Trading

  • Show trade instead of coming down unilaterally, which appears dishonest
  • Win/win instead of win/lose. Negotiating on price directly is zero-sum
  • Either person can propose a win/win trade
  • Use “if you.., then I/we….” often. Much stronger than “how about giving you a cheaper price”.
  • Hide what is valuable. Don’t make a big deal of that or they will ask you to pay for that.
  • For things you are not bothered about, imply they are important and that you are ready to trade them for something else.
  • Move in small steps. People often don’t do this enough
    • Implying you are near the limit
    • Most companies operate on a 10% profit margin
    • Brain prefers continuous installments rather than 1 big one
  • If we can meet these pressure points, what is the least you could drop the price to?
    • Silent close: repeat this low number and then shut up.
    • The first person to speak loses, or at least reveals more info
    • and then this becomes the bracket of negotiation
  • Labelling concession
    • open about the cost of making concessions
    • identify the value to them
    • Legitimize why you held previous position
    • and then claim some value
  • Dealing with the inexperienced dictator, when their ego is hurt
    • give them small wins
    • “They didn’t go for your proposal as requested, but I was able to do some alternatives, enough for you to carry things forward”
  • Framing the issue to resolve as a discussion
    • I am concerned on X, let’s discuss - interest
    • I want to resolve X. - issue - then why is it important to you? - interest
  • Avoid reluctant yes or unhelpful no. Looking for a third option, e.g., quote a higher price, different time
  • Crumbling is always an option; it is just a game that may not work out sometimes

Closing

  • Avoid saying “final offer”.
    • Gives up too much information
    • Neither side can move
    • If you present your offer in a take it or leave it fashion, you may be met with a reluctance to deal with your hostility. You can make the same outrageous requests, but if instead you allow your opponent to reject your proposition, you influence the decision toward agreement because they see that they can win with you. You will still get what you want and get the opponent to do what you want, depending on how you frame your offer.
  • Don’t split the difference. It is the last feeble attempt before giving up.
    • Counter
      • I really can’t, it’s gotta be
      • Fallback if they are walking away: split the difference of the difference in your favor
  • In the middle, are there any extras I need to know? Don’t ask this after agreeing on prices
    • Go for a counter nibble
    • Nibble is dishonest and should not let them get away with it
    • Similar counter for quivering pen
  • Salami: ask a lot of different things on many different fronts
    • Counter salami
      • Ask things back
      • If you want this, I want something else or more price
      • Don’t be afraid to start again as a last resort

Tactics

  • the vice: buyer says “you have to do better than that”
    • “still needs to get better than that”
    • “exactly how much better” and then flinch
  • Knocking the product: “not the color I liked, I would get it if it were cheaper”, dents
    • Counter: That is why the price is so reasonable/already priced-in
  • Reluctant buyer: “would prefer something else, but”
    • Reluctant seller, less common. “I suppose I could fit your request”
    • Counter: most likely a game
    • “Listen, if you decide to lower your price. I’d like to be the first one to know”
  • Hardball
    • analytical: rush them. Panic them
    • controller: make quick decisions on facts
      • Give them facts but don’t give them enough time. Bore + rush
    • amiable: bully them
      • Push them. “Obviously you are gonna buy them”
      • “I will help you do it”
    • enthusiast: feelings. Bore them with details
      • Have to go back to the process, which will take this long
  • Discount should be on the invoice to set expectations baseline
  • Appeal to higher authority/bad cop
    • If the answer is maybe, just tell me no. If we were to create a proposal that met all requirements, is there any reason you couldn’t give me an answer today?
  • The other side is claiming current deal is giving him too much trouble. I will take it back and review, if we go with other options, should I go back to talk to you again?

References

  • Successful Negotiation: Master Your Negotiating Skills

Configuring Clickhouse's cleanup thread for MergeTree

2023-08-13 00:00:00 +0000

max_bytes_to_merge_at_max_space_in_pool

  • This setting caps the total size, in bytes, of source parts that can be merged into a single part when the background merge pool has the maximum amount of free space.

  • This setting helps control the size of each merge operation and prevents merging a large amount of data at once.

old_parts_lifetime

  • In ClickHouse, data is stored in immutable parts, much like the levels of an LSM tree. Over time, as new data is ingested and merges occur, the source parts of completed merges become obsolete and are no longer needed. The old_parts_lifetime parameter defines how long such inactive parts are kept before they are considered for cleanup and removal.
  • The cleanup process removes the disk files associated with these old data parts, freeing up storage space and maintaining optimal storage utilization.
  • You can set a shorter lifetime to trigger cleanup more frequently.

cleanup_delay_period

  • When a partition is marked for deletion in ClickHouse, rather than immediately triggering the cleanup process to remove the associated data files, the system introduces a delay based on the value set for cleanup_delay_period. This delay allows for a grace period during which the deleted data can still be accessed or recovered if needed.
  • By default, ClickHouse sets a certain duration for cleanup_delay_period, typically a few minutes, to account for any accidental deletions or potential data recovery scenarios. During this delay period, the partition is still accessible for queries and the associated data files are not immediately removed.

Replacing vs dropping a partition

Replace a partition

  • New Partition Creation: ClickHouse creates a new partition to replace the existing partition. This new partition contains the updated data or modifications.
  • Background Merge: ClickHouse runs a background merge job to gradually move data from the old partition to the new partition. It performs this merge operation incrementally, in small batches, and asynchronously in the background.
  • Continuous Query Processing: While the merge job is running, ClickHouse continues to process queries and serve data from both the old and new partitions. ClickHouse automatically combines the data from both partitions to provide query results, ensuring that queries are not affected by the ongoing partition replacement.
  • Merge Completion: Once the merge job is completed, ClickHouse removes the old partition, and the data is fully consolidated in the new partition.
  • During a partition replace operation, ClickHouse ensures that the replacement of the existing partition with the new partition containing updated data is performed as an atomic operation. This guarantees that the replacement is either completed successfully or not performed at all, without leaving any intermediate or inconsistent state.
  • During a partition replace operation, the original partition is effectively replaced with the new partition containing updated data. This does not involve explicitly marking the original partition for deletion.

Dropping a partition

  • Command Execution: When you execute the DROP PARTITION command, ClickHouse receives the request to drop the specified partition(s).
  • Scheduling: ClickHouse schedules the deletion of the partition(s) in the background. The actual deletion process occurs asynchronously and does not block or pause the execution of subsequent queries or operations.
  • Continuous Query Processing: ClickHouse continues to serve queries and perform operations while the partition deletion process is ongoing. Queries can still access data from the remaining partitions and unaffected parts of the table.
  • Deletion Completion: Eventually, in the background and depending on the size and complexity of the partition, ClickHouse completes the deletion of the specified partition(s). The timing of deletion completion varies based on various factors, such as the size of the partition, system load, and available resources.
  • While dropping a partition is asynchronous, it’s important to note that once the deletion process begins, the affected partition and its data will be marked for deletion and subsequent queries will no longer include the dropped partition’s data in the results. However, depending on the size of the partition and the system load, it may take some time for the deletion process to complete.

cleanup_thread_priority

Sets the priority of the cleanup thread. A higher priority may result in more frequent cleanup but can also impact the performance of other operations. Adjust this parameter based on system resources and workload requirements.

parts_to_delay_insert

If the number of active parts in a single partition of a MergeTree table exceeds parts_to_delay_insert, ClickHouse artificially slows down new inserts to that table so that background merges can catch up and reduce the part count. This back-pressure prevents inserts from producing an excessive number of small parts, which would hurt merge and query performance.

clear_old_parts_interval_seconds

In ClickHouse’s MergeTree engine, as newer data is ingested and merges occur, older parts that are no longer required become eligible for removal. The cleanup process periodically checks for these old parts based on the clear_old_parts_interval_seconds value. If an eligible part is found, it is removed during the cleanup.
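To tie a few of these settings together, here is a minimal sketch that builds a CREATE TABLE statement pinning down MergeTree settings at table-creation time. Only old_parts_lifetime, parts_to_delay_insert, and max_bytes_to_merge_at_max_space_in_pool are used; verify the names and defaults against your ClickHouse version before relying on them.

# Minimal sketch: building a CREATE TABLE statement that pins down a few of the
# MergeTree settings discussed above. The table schema is made up; check the
# ClickHouse documentation for your version before relying on any setting.

settings = {
    "old_parts_lifetime": 300,                                 # seconds before inactive old parts are removed
    "parts_to_delay_insert": 150,                              # active parts per partition before inserts slow down
    "max_bytes_to_merge_at_max_space_in_pool": 50 * 1024**3,   # cap on the size of a merged part
}

ddl = """
CREATE TABLE events
(
    event_date Date,
    event_id   UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, event_id)
SETTINGS {settings}
""".format(settings=", ".join(f"{k} = {v}" for k, v in settings.items()))

print(ddl)  # feed this DDL to clickhouse-client or your driver of choice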

Monitoring

SELECT
    task_type,
    current_state,
    progress
FROM
    system.background_tasks --including the cleanup tasks.
WHERE
    task_type = 'Merge' AND current_state = 'Executing';
  • Please note that the progress value in the system.background_tasks table may not correspond directly to the total number of parts or the amount of data cleaned up. It represents the internal progress of the current merge task, and the actual cleanup progress might involve multiple merge tasks over time.
  • system.mutations: This table provides information about mutations, including partition deletions that trigger the cleanup process.