(DataExpert.io) Bootcamp - Day 3 - Lecture

Posted on Nov 19, 2024

Today’s lecture is about dimensional additivity and how to build a flexible data model ready for graph database consumption

Index

Additive VS non-additive dimensions
The power of Enums
When should you use flexible data types ?
Graph data modeling

Additive vs Non-additive dimensions

What makes a dimension additive ?

Additivity refers to whether numerical facts (measures) in a fact table can be meaningfully aggregated across different dimensions.

If you take all the sub-totals and sum them up you should have the total

Example : The sum of the count of people of each age gives the total population

An additive dimension is one that don’t “double count”

Example : The total number of driver doesn’t equals the sum of each car brand drivers because some people owns two cars

You should ask yourself if a user can be two of these at the same time on a given day.

The essential nature of additivity

A dimension is additive over a specific time window, if and only if, the grain of that window can only be one value at a time.

In a time window of one second, one driver can’t drive two cars
In a time window of one day, one driver can drive many cars

How does additivity helps ?

You don’t need to use COUNT(DISTINCT) on preaggregated dimensions.

Note : Non-additive dimensions are usually only non-additive with respect to COUNT aggregations but not SUM aggregations.

The power of Enums

When should you use enums ?

Enums are great for low-to-medium cardinality

Example : countries (200) might be to much for an enum.

If it’s less than 50 it might be a good enum

Why should you use enums ?

Built in data quality

The Enum guarantees the set of values a field can take.
This prevents invalid values at the database level
You can’t insert ‘pending ’ (with space) or ‘PENDING’ (wrong case)
More reliable than CHECK constraints on VARCHAR columns

Built in static fields

The enum values are stored as internal constants in the database
More storage efficient than VARCHAR
PostgreSQL stores enum values as integers internally but presents them as strings to users

-- More efficient than storing the full string each time
SELECT COUNT(*) FROM orders GROUP BY order_status;

Built in documentation

The enum definition serves as documentation in the database schema
Developers can see all valid values by checking the enum type:

# Postgres SQL command :
\dT+ status

Enumerations and subpartitions

Enumerations makes amazing subpartitions because :

You have an exausthive list
They chunk up the big data problem into manageable pieces

Thrift is a tool to manage schemas in your logging as well as you ETLs

https://en.wikipedia.org/wiki/Apache_Thrift

Zach created a framework named “Little book of pipelines”

https://github.com/EcZachly/little-book-of-pipelines

The enum pattern is useful whenever you have tons of sources mapping to a shared schema.

How do you model data from disparate sources into a shared schema ?

You don’t want to bring all the columns from 40 tables into one table with 500 columns

Most columns are going to be NULL all the time
It’s not really usable table

You want to use a “flexible schema”

Use a lot of map types
Overlaps a lot with graph database
- The enumerated list of things is similar to a “Vertex type”

Flexible schemas

If you need to add more things you just add it to the map

If new columns appears you can add them to the map

Limit : Maps in Spark can have up to 65,000 keys

Benefits

No ALTER TABLE commands
You can manage a lot more columns
You don’t have a ton of NULL columns
You can use an “other_properties”
- Carry rarely used but needed column
- Allows you to ignore them during modelisation

Drawbacks

Compression is terrible
- Map is one of the worst compression of all data types
- The only thing worse is JSON
- The column headers becomes map keys so they are part of the data instead of the schema and duplicated in each row

How is graph modelling different ?

Graph modelling is Relationship focused, not Entity focused.

You can do a very poor job at modeling entities

Zach super secret sauce for Graph data modelling :

He always met the same schema when doing graph databases

We don’t care about columns

Usually for an entity in a graph the model is :

Identifier: STRING
Type: STRING
Properties: MAP<STRING, STRING>

The relationships are modeled a little bit more in depth;

Usually the model looks like :

subject_identifier: STRING
-- Example: player_name
subject_type: VERTEX_TYPE 
-- Example: player
object_identifier: STRING
-- Example: team_name
object_type: VERTEX_TYPE 
-- Example: team
edge_type: EDGE_TYPE
-- Example: "PLAYS WITH" , "PLAYS AGAINST" (an action)
properties: MAP<STRING, STRING>