Pain points

  • New tables were created manually on top of core_data tables every other day, but there was no way to tell if similar tables already existed.
  • Data lineage became impossible to track. When a data issue upstream was discovered and fixed, there was no guarantee that the fix would propagate to all downstream jobs.
  • Different teams reported different numbers for very simple business questions, and there was no easy way to know which number was correct.
  • core_data standardized table consumption and allowed users to quickly identify which tables to build upon. On the other hand, it burdened the centralized data engineering team with the impossible task of gate-keeping and onboarding an endless stream of new datasets into new and existing core tables. Furthermore, pipelines built downstream of core_data created a proliferation of duplicate and diverging metrics. We learned from this experience that table standardization was not enough, and that standardization at the metrics level is key to enabling trustworthy consumption. After all, users do not consume tables; they consume metrics, dimensions, and reports.

Use cases

  • Minerva’s product vision is to allow users to “define metrics once, use them everywhere”. That is, a metric created in Minerva should be easily accessible in dashboards, in our A/B testing framework, or in our anomaly detection algorithms for spotting business anomalies.
  • Users simply ask for metrics and dimension cuts, and receive the answers without having to worry about the “where” or the “how”.
  • Data Catalog
    • When a user searches for a metric in the Dataportal, Minerva metrics are ranked at the top of the search results. The Dataportal also surfaces contextual information, such as certification status, ownership, and popularity, so that users can gauge the relative importance of metrics. For most non-technical users, the Dataportal is their first entry point to metrics in Minerva.
  • Data Exploration
    • Upon selecting a metric, users are redirected to Metric Explorer, a component of the Dataportal that enables out-of-the-box data exploration. On a metric page, users can see trends of a metric with additional slicing and drill down options such as Group By and Filter
  • A/B Testing
    • All base events for A/B tests are defined and sourced from Minerva.
  • Executive Reporting
  • Data Analysis
    • Minerva data is exposed to Airbnb’s custom R and Python clients through Minerva’s API. This allows data scientists to query Minerva data in a notebook environment with ease
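
As a rough illustration of what a notebook query might look like, here is a minimal sketch that posts a request to a hypothetical HTTP endpoint and loads the result into a DataFrame; the endpoint URL, payload field names, and response shape are assumptions, not Airbnb's actual client interface.

import pandas as pd
import requests

# Hypothetical endpoint and payload; field names mirror the Sample Request
# section below and are illustrative only.
payload = {
    "metric": "nights_booked",
    "groupby_dimension": "destination_region",
    "start_date": 20210801,
    "end_date": 20210901,
}
resp = requests.post("https://minerva.internal.example/api/v1/query", json=payload)
df = pd.DataFrame(resp.json()["rows"])  # assumed response shape
print(df.head())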

Metrics

  • Simple metrics are composed of single materialized events (e.g., bookings);
  • Filtered metrics are composed of simple metrics filtered on a dimension value (e.g., bookings in China);
  • Derived metrics are composed of one or more non-derived metrics (e.g., search-to-book rate).
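
A minimal sketch of how these three metric types might be declared, using plain Python dicts to stand in for Minerva's configuration format; every field name here (metric_type, event_source, base_metric, numerator, and so on) is an assumption for illustration, not Minerva's actual schema.

# Illustrative metric definitions; field names are assumptions, not Minerva's schema.
bookings = {
    "name": "bookings",
    "metric_type": "simple",
    "event_source": "bookings_events",   # a single materialized event
    "owner": "core-data-team",           # self-describing metadata
    "description": "Count of confirmed bookings.",
}

bookings_china = {
    "name": "bookings_china",
    "metric_type": "filtered",
    "base_metric": "bookings",
    "filter": "dim_country = 'CN'",      # simple metric filtered on a dimension value
}

search_to_book_rate = {
    "name": "search_to_book_rate",
    "metric_type": "derived",            # composed of non-derived metrics
    "numerator": "bookings",
    "denominator": "searches",
}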

Sample Request

  • Average daily price (ADR), cut by destination region, excluding private rooms, for the four weeks of August 2021.
{
      metric: 'price_per_night',
      groupby_dimension: 'destination_region',
      global_filter: 'dim_room_type != "private-room"',
      aggregation_granularity: 'W-SAT',
      start_date: 20210801,
      end_date: 20210901,
      truncate_incomplete_leading_data: true,
      truncate_incomplete_trailing_data: true,
}
  • Given that the price_per_night metric is a ratio metric (a special case of derived metric) that contains a numerator (gross_booking_value_stays) and a denominator (nights_booked), Minerva API breaks up this request into two sub-requests.
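
A minimal sketch of that split, assuming the request is a plain Python dict and that the ratio is recomputed after both sub-requests return; the helper name and request shape are illustrative, not Minerva's actual implementation.

# Illustrative only: split a ratio-metric request into numerator and
# denominator sub-requests, then recombine the results.
def split_ratio_request(request, numerator, denominator):
    num_req = dict(request, metric=numerator)    # same cuts, numerator metric
    den_req = dict(request, metric=denominator)  # same cuts, denominator metric
    return num_req, den_req

request = {
    "metric": "price_per_night",
    "groupby_dimension": "destination_region",
    "global_filter": "dim_room_type != 'private-room'",
    "aggregation_granularity": "W-SAT",
    "start_date": 20210801,
    "end_date": 20210901,
}
num_req, den_req = split_ratio_request(
    request, "gross_booking_value_stays", "nights_booked"
)

# After both sub-requests are served, the ratio is recomputed per region and week:
# price_per_night = gross_booking_value_stays / nights_booked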

Minerva vs Shasta

Differences

  • Target audience: Minerva is mainly for internal users, while Shasta mainly serves external users of Google Ads Manager.
  • Latency: Sub-second latency is a P0 for Shasta and a nice-to-have for Minerva.
  • Storage: Reading from different storage systems is a P0 for Shasta and a nice-to-have for Minerva.
  • Consistent metric definition: P0 for Minerva, P1 for Shasta.
  • Reverse ETL: Shasta focuses on reducing reverse ETL by reading OLTP storage directly; this use case is not a focus for Minerva.

Similarities

  • Separation of dimensions and metrics (dimension columns vs. metric columns)
  • Metrics often contain business logic
  • Declarative metric computation

  • Minerva takes (normalized) fact and dimension tables as inputs, denormalizes them, and serves the aggregated data to downstream applications.
  • It uses Airflow for workflow orchestration, Apache Hive and Apache Spark as the compute engine, and Presto and Apache Druid for consumption.
  • Minerva defines key business metrics, dimensions, and other metadata in a centralized GitHub repository.
  • Minerva denormalizes data efficiently by reusing data and intermediate joined results.
  • Minerva provides a unified data API to serve both aggregated and raw metrics on demand.
  • Users define the “what” and not the “how”. How the metrics are calculated, stored, or served is entirely abstracted away from end users.
  • Minerva focuses on metrics and dimensions as opposed to tables and columns.
  • Minerva requires metric authors to provide self-describing metadata, such as ownership, lineage, and metric description.

  • At the heart of Minerva’s configuration system are event sources and dimension sources, which correspond to fact tables and dimension tables in a Star Schema design, respectively.
  • Event sources define the atomic events from which metrics are constructed, and dimension sources contain attributes and cuts that can be used in conjunction with the metrics
  • With Minerva, users can simply define a dimension set, an analysis-friendly dataset that is joined from Minerva metrics and dimensions.
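
A minimal sketch of these three building blocks, again using Python dicts to stand in for the configuration format; all field and table names are illustrative assumptions, not Minerva's actual schema.

# Illustrative configs; names are assumptions, not Minerva's actual schema.
event_source = {
    "name": "bookings_events",            # fact table in the Star Schema
    "table": "core_data.fct_bookings",
    "timestamp_column": "booking_ts",
}

dimension_source = {
    "name": "listing_dimensions",         # dimension table in the Star Schema
    "table": "core_data.dim_listings",
    "primary_key": "id_listing",
    "dimensions": ["destination_region", "dim_room_type"],
}

dimension_set = {
    "name": "bookings_by_listing",        # analysis-friendly joined dataset
    "metrics": ["nights_booked", "gross_booking_value_stays"],
    "dimensions": ["destination_region", "dim_room_type"],
    "join_key": "id_listing",             # joined programmatically at build time
}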

Computational flow

  • Ingestion Stage: Partition sensors wait for upstream data and data is ingested into Minerva.
  • Data Check Stage: Data quality checks are run to ensure that upstream data is not malformed.
  • Join Stage: Data is joined programmatically based on join keys to generate dimension sets.
  • Post-processing and Serving Stage: Joined outputs are further aggregated and derived data is virtualized for downstream use cases.
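
A minimal sketch of these four stages expressed as an Airflow DAG (Airflow 2.x assumed); the DAG name, task IDs, and placeholder callables are illustrative, not Airbnb's actual pipeline code.

# Illustrative Airflow DAG mirroring the four Minerva stages; task bodies are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest(): ...        # wait for upstream partitions and ingest into Minerva tables
def run_checks(): ...    # data quality checks on the ingested sources
def join_sources(): ...  # join event and dimension sources into dimension sets
def post_process(): ...  # aggregate and virtualize outputs for serving

with DAG("minerva_flow_sketch", start_date=datetime(2021, 8, 1), catchup=False) as dag:
    ingestion = PythonOperator(task_id="ingestion", python_callable=ingest)
    checks = PythonOperator(task_id="data_checks", python_callable=run_checks)
    join = PythonOperator(task_id="join", python_callable=join_sources)
    serve = PythonOperator(task_id="post_processing_and_serving", python_callable=post_process)

    ingestion >> checks >> join >> serve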

In the ingestion stage, Minerva waits for upstream tables to land and ingests the data into static Minerva tables. This ingested data becomes the source of truth and can only be changed by modifying the Minerva configs.

Here are some typical checks that Minerva runs:

  • sources should not be empty
  • timestamps should not be NULL and should meet ISO standards
  • primary keys should be unique
  • dimension values should be consistent with what is expected
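
A minimal sketch of such checks over an ingested source, written as pandas assertions; the column names (booking_ts, id_booking, dim_room_type) are assumptions for illustration, not Minerva's actual schema.

# Illustrative data-quality checks on an ingested source; column names are assumptions.
import pandas as pd

def run_source_checks(df: pd.DataFrame, expected_room_types: set) -> None:
    assert len(df) > 0, "source should not be empty"
    assert df["booking_ts"].notna().all(), "timestamps should not be NULL"
    pd.to_datetime(df["booking_ts"], errors="raise")  # must parse as valid (ISO 8601) timestamps
    assert df["id_booking"].is_unique, "primary keys should be unique"
    unexpected = set(df["dim_room_type"].dropna()) - expected_room_types
    assert not unexpected, f"unexpected dimension values: {unexpected}"

Checks like these run in the Data Check Stage, before the join stage, so malformed upstream data does not propagate into dimension sets.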
