(DataExpert.io) Bootcamp - Day 3 - Lecture
Today’s lecture is about dimensional additivity and how to build a flexible data model ready for graph database consumption
Index
- Additive VS non-additive dimensions
- The power of Enums
- When should you use flexible data types ?
- Graph data modeling
Additive vs Non-additive dimensions
What makes a dimension additive ?
Additivity refers to whether numerical facts (measures) in a fact table can be meaningfully aggregated across different dimensions.
If you take all the sub-totals and sum them up you should have the total
Example : The sum of the count of people of each age gives the total population
An additive dimension is one that don’t “double count”
Example : The total number of driver doesn’t equals the sum of each car brand drivers because some people owns two cars
You should ask yourself if a user can be two of these at the same time on a given day.
The essential nature of additivity
A dimension is additive over a specific time window, if and only if, the grain of that window can only be one value at a time.
- In a time window of one second, one driver can’t drive two cars
- In a time window of one day, one driver can drive many cars
How does additivity helps ?
You don’t need to use COUNT(DISTINCT)
on preaggregated dimensions.
Note : Non-additive dimensions are usually only non-additive with respect to
COUNT
aggregations but notSUM
aggregations.
The power of Enums
When should you use enums ?
Enums are great for low-to-medium cardinality
Example : countries (200) might be to much for an enum.
If it’s less than 50 it might be a good enum
Why should you use enums ?
Built in data quality
- The Enum guarantees the set of values a field can take.
- This prevents invalid values at the database level
- You can’t insert ‘pending ’ (with space) or ‘PENDING’ (wrong case)
- More reliable than
CHECK
constraints onVARCHAR
columns
Built in static fields
- The enum values are stored as internal constants in the database
- More storage efficient than VARCHAR
- PostgreSQL stores enum values as integers internally but presents them as strings to users
-- More efficient than storing the full string each time
SELECT COUNT(*) FROM orders GROUP BY order_status;
Built in documentation
- The enum definition serves as documentation in the database schema
- Developers can see all valid values by checking the enum type:
# Postgres SQL command :
\dT+ status
Enumerations and subpartitions
Enumerations makes amazing subpartitions because :
- You have an exausthive list
- They chunk up the big data problem into manageable pieces
Thrift is a tool to manage schemas in your logging as well as you ETLs
Zach created a framework named “Little book of pipelines”
The enum pattern is useful whenever you have tons of sources mapping to a shared schema.
How do you model data from disparate sources into a shared schema ?
You don’t want to bring all the columns from 40 tables into one table with 500 columns
- Most columns are going to be NULL all the time
- It’s not really usable table
You want to use a “flexible schema”
- Use a lot of map types
- Overlaps a lot with graph database
- The enumerated list of things is similar to a “Vertex type”
Flexible schemas
If you need to add more things you just add it to the map
If new columns appears you can add them to the map
Limit : Maps in Spark can have up to 65,000 keys
Benefits
- No
ALTER TABLE
commands - You can manage a lot more columns
- You don’t have a ton of
NULL
columns - You can use an “other_properties”
- Carry rarely used but needed column
- Allows you to ignore them during modelisation
Drawbacks
- Compression is terrible
- Map is one of the worst compression of all data types
- The only thing worse is JSON
- The column headers becomes map keys so they are part of the data instead of the schema and duplicated in each row
How is graph modelling different ?
Graph modelling is Relationship focused, not Entity focused.
You can do a very poor job at modeling entities
Zach super secret sauce for Graph data modelling :
He always met the same schema when doing graph databases
We don’t care about columns
Usually for an entity in a graph the model is :
Identifier: STRING
Type: STRING
Properties: MAP<STRING, STRING>
The relationships are modeled a little bit more in depth;
Usually the model looks like :
subject_identifier: STRING
-- Example: player_name
subject_type: VERTEX_TYPE
-- Example: player
object_identifier: STRING
-- Example: team_name
object_type: VERTEX_TYPE
-- Example: team
edge_type: EDGE_TYPE
-- Example: "PLAYS WITH" , "PLAYS AGAINST" (an action)
properties: MAP<STRING, STRING>