Skip to content

anahoria/lego

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

lego

Data storytelling how-to from lego data.

Data

We are going to create a relational database to store data from

rebrickable

wich is an open source database about all sets of lego sold in the world from the beggining. You can finde all data stored in *csv files and they share the data model with the relationships between tables.

alt text

For this use case we are going to create an identical datamodel in duckDB database.

The files corresponding to ecah table are in the url

rawDataRebrickable

and in this repository folder

legoDB/raw

Database

The chosen database for this proyect is

duckDB

and all the DDLs to crecreate the model are stored in this repository

legoDB/DDLs

Hands on the model, understanding the tables

There is 1 table with the global inventory which contains a single row for each inventory item. This table relates to other three with the detail about the parts, the sets and the minifig.

Tabla items por categoría

Following the list in order of volume, there is a tree that clasifies the inventory items in 8 categories.

Árbol items por categoría

Only sets and kits

First study will focus only in sets. In paralell another study of kits is made.

The main tables are

$process/models/sets/sets_detailed.sql$

and

$process/models/kits/kits_detailed.sql$.

sets_and_bigger_by_year

Focusing in the kpi table about sets, the next query extracts the main information about volume of sets and volume of pieces by set over years.

The size of the balls has been re-normalized to a 0-1000 scale using the function $y = 13 + 0.09x$

This is te query over the table to extract the csv file data/sets_and_bigger_by_year.csv

select 
	year,
	n_sets_year,
	(13+0.09*bigger_set) as sc_bigger_set,
	bigger_set,
	case when n_sets_year_parent > 100
	then parent_theme_name 
	else 'rest' end as parent_theme_name 
from main.sets_kpis where version = max_version and num_parts = bigger_set and bigger_set >= 100;

and the subsequent figure img/sets_and_bigger_by_year.svg

Number of sets by year and bigger set by year

Then I identified manually the permanente as the maxumum set ./data/sets_and_bigger_by_year_top_permanence

Permanence in the top

Only themes

First of all, I constructed the family tree of the themes. It came up with tthee levels of hierarchy, grandparents, parents, childs. The results are stored in the table themes_family.tree.sql

Hierarchy 2 levels Hierarchy 3 levels

Only colors

The table with all the information about the sets by year can be completed with the colors of all the pieces present in the sets not counting the spare ones. Better perform a cumsum over the years. It is also needed to have the tags of the colors to plot them in flourish.

When creating the table, it is important to fill every color every year with zeros to avoid inconvenient jumps in the table.

cumsum_colors_by_year.sql

flourish_colors.sql

There is also an interesting thing to shoe, regardin the colors, we can group then into 19 groups

colors_simplified.sql

colors in groups

but regarding the total number of pieces presents in sets along the years, the image is very diferent

colors in groups for sets over the years

There is another interesting representation wich is the diference between showing the larger amount of pieces by color and by group color

max color and max group

max color and max group aligned

About

Data storytelling how-to from lego data.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published