-
There is this, which looks sorta complex and I wouldn't want to maintain, but perhaps they would be open to turning it into a library: https://github.com/hms-dbmi/upset-altair-notebook
-
Interesting - I agree those charts look very useful. This is also pertinent to the work on this PR, where I'm trying to autodetect
One aspect of this work is that I'm trying to find a cost function that can evaluate and compare 'how good' different sets of blocking rules are. See here. Such a cost function would have different weights depending on whether you're looking for training or prediction blocking rules. At the moment this cost function does not take account of the size of the intersection, but ideally it would.
The biggest challenge here is that we have a new function (name TBC, with PR here) that is very fast at finding the number of comparisons generated by a blocking rule. But the speed comes from avoiding generating the comparisons, and hence you also can't use it to compute intersections.
At the moment, one of the main things the cost function is looking for is how much freedom different columns get from the set of blocking rules.
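To make the speed/intersection trade-off concrete, here's a minimal sketch of how a count can be obtained without ever generating the comparisons. It is not the function from the PR: the `count_comparisons` helper and the toy records are hypothetical, and it assumes a dedupe-only link with a blocking rule that is plain equality on one or more columns. Under those assumptions a block of n records sharing a key contributes n(n-1)/2 pairs, so a group-by count is all that's needed.

```python
from collections import Counter


def count_comparisons(records, block_cols):
    """Count pairs produced by an equality blocking rule without building them.

    Hypothetical helper: assumes dedupe_only and exact equality on block_cols.
    """
    # Size of each block, keyed by the tuple of blocking column values
    block_sizes = Counter(
        tuple(r[c] for c in block_cols)
        for r in records
        if all(r[c] is not None for c in block_cols)  # nulls never match
    )
    # A block of n records yields n * (n - 1) / 2 unordered pairs
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


# Made-up records, purely for illustration
records = [
    {"id": 1, "first_name": "ann", "surname": "lee"},
    {"id": 2, "first_name": "ann", "surname": "lee"},
    {"id": 3, "first_name": "ann", "surname": "ray"},
    {"id": 4, "first_name": "bob", "surname": "ray"},
    {"id": 5, "first_name": None, "surname": "ray"},
]

print(count_comparisons(records, ["first_name"]))
print(count_comparisons(records, ["surname"]))
print(count_comparisons(records, ["first_name", "surname"]))
```

Because only block sizes are kept, nothing here says which pairs two different rules have in common, which is why a count like this can't give you the intersection.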
Re: the upset.app charts, I was actually working on an inferior but somewhat similar version of these charts the other day (one that deals with the combinations, but not intersections). Here's the script that makes that chart:
```python
import pandas as pd
from splink.duckdb.linker import DuckDBLinker
import altair as alt
import itertools
df = pd.read_parquet("/Users/robinlinacre/Documents/data_linking/splink/synthetic_50k_clean.parquet")
df = pd.read_parquet(...)  # arguments elided

settings = {"link_type": "dedupe_only"}

num_rows = len(df)
if len(df) > 1e6:
    threshold_maximum = 20e6

results = []
...  # code that fills `results` elided

results_df = pd.DataFrame(results)
tick_values = [10**i for i in range(1, 15)]

# Creating the bar chart
chart = (...)  # chart definition elided

# Add the text on top of the bar
text = chart.mark_text(align="left", baseline="middle", dx=3).encode(...)  # encodings elided

# Render the chart
(chart + text)

# Can we plot a grid on the rhs with ticks and crosses according to the blocking vars?
df_melted = results_df.melt(...)  # arguments elided

# Create the chart
chart2 = (...)  # chart definition elided

# Display the chart
c = (chart2 | (chart + text))
```
-
When I'm looking at blocking rules, I might be curious how many pairs are
In the implementation, the order of the blocking rules doesn't matter, since we do a DISTINCT on them. But when visualizing them, the order in which you specify the rules matters. Thus, if you re-order your rules you can get vastly different impressions of which rules generate a lot of records, and which ones generate only a few.
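Here's a small sketch of that ordering effect, using made-up sets of record-pair IDs for three hypothetical rules (the `marginal_counts` helper is illustrative, not part of splink). Each rule is only credited with the pairs that no earlier rule has already produced, which is roughly what a chart of per-rule counts after deduplication would show.

```python
def marginal_counts(rules):
    """Return {rule_name: number of pairs not already produced by earlier rules}."""
    seen = set()
    counts = {}
    for name, pairs in rules:
        counts[name] = len(pairs - seen)  # only pairs that are new at this point
        seen |= pairs
    return counts


# Hypothetical pair sets; each pair is (smaller_id, larger_id)
first_name = {(1, 2), (1, 3), (2, 3), (4, 5)}
surname = {(1, 2), (2, 3), (6, 7)}
dob = {(1, 2), (8, 9)}

order_a = marginal_counts(
    [("first_name", first_name), ("surname", surname), ("dob", dob)]
)
order_b = marginal_counts(
    [("dob", dob), ("surname", surname), ("first_name", first_name)]
)

print(order_a)
print(order_b)
```

The two orderings credit the rules very differently, yet both sum to the same total, because the union of the pairs doesn't depend on order.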
Really what we are interested in is visualizing how much a lot of different sets intersect. It looks like https://upset.app/ is made for this. Not immediately needed, but this would be a cool goal to work towards I think. Not sure what the python/altair story is here.
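As a rough sketch of the data such a chart would need, the snippet below counts pairs by the exact combination of rules that produce them, which is the order-independent breakdown an UpSet-style plot displays. The rule names, pair sets and the `exclusive_intersections` helper are all hypothetical, and it assumes the pairs for each rule have actually been generated.

```python
from collections import Counter


def exclusive_intersections(rule_pairs):
    """Count pairs by the exact set of rules that produce them."""
    membership = {}
    for name, pairs in rule_pairs.items():
        for pair in pairs:
            membership.setdefault(pair, set()).add(name)
    # One bucket per distinct combination of rules
    return Counter(frozenset(names) for names in membership.values())


# Hypothetical pair sets; each pair is (smaller_id, larger_id)
rule_pairs = {
    "first_name": {(1, 2), (1, 3), (2, 3), (4, 5)},
    "surname": {(1, 2), (2, 3), (6, 7)},
    "dob": {(1, 2), (8, 9)},
}

counts = exclusive_intersections(rule_pairs)
# Largest combinations first, ties broken by rule names
for combo, n in sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(combo), n)
```

Every pair lands in exactly one bucket, so the buckets sum to the number of distinct pairs regardless of rule order. The catch is that this needs the pairs themselves, so it's expensive on large inputs.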