Skip to content

Commit 27c3025

Browse files
authored
Feature/support-model-list-json (#10)
1 parent a8a5434 commit 27c3025

20 files changed

+15101
-158
lines changed

.github/workflows/check_version.yml

+47
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,47 @@
1+
name: Check Version Update
2+
3+
on:
4+
pull_request:
5+
branches:
6+
- main
7+
paths:
8+
- 'py_package/setup.py'
9+
10+
jobs:
11+
check-version:
12+
runs-on: ubuntu-latest
13+
steps:
14+
- name: Checkout PR branch
15+
uses: actions/checkout@v4
16+
with:
17+
fetch-depth: 0
18+
19+
- name: Set up Python
20+
uses: actions/setup-python@v4
21+
with:
22+
python-version: '3.x'
23+
24+
- name: Extract PR version
25+
id: pr_version
26+
run: |
27+
PR_VERSION=$(grep -oP "version='\K[^']+" py_package/setup.py)
28+
echo "pr_version=$PR_VERSION" >> $GITHUB_OUTPUT
29+
echo "PR Version: $PR_VERSION"
30+
31+
- name: Extract main version
32+
id: main_version
33+
run: |
34+
git fetch origin main
35+
MAIN_VERSION=$(git show origin/main:py_package/setup.py | grep -oP "version='\K[^']+")
36+
echo "main_version=$MAIN_VERSION" >> $GITHUB_OUTPUT
37+
echo "Main Version: $MAIN_VERSION"
38+
39+
- name: Compare versions
40+
id: compare_versions
41+
run: |
42+
if [ "${{ steps.pr_version.outputs.pr_version }}" == "${{ steps.main_version.outputs.main_version }}" ]; then
43+
echo "::error::Version in setup.py must be updated from '${{ steps.main_version.outputs.main_version }}' when making a PR to main branch"
44+
exit 1
45+
else
46+
echo "Version was updated from '${{ steps.main_version.outputs.main_version }}' to '${{ steps.pr_version.outputs.pr_version }}'"
47+
fi

.github/workflows/publish.yml

+37
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
name: Publish Python Package
2+
3+
on:
4+
push:
5+
branches:
6+
- main
7+
paths:
8+
- 'py_package/**'
9+
10+
jobs:
11+
deploy:
12+
runs-on: ubuntu-latest
13+
defaults:
14+
run:
15+
working-directory: ./py_package
16+
17+
steps:
18+
- uses: actions/checkout@v3
19+
20+
- name: Set up Python
21+
uses: actions/setup-python@v4
22+
with:
23+
python-version: '3.x'
24+
25+
- name: Install dependencies
26+
run: |
27+
python -m pip install --upgrade pip
28+
pip install setuptools wheel twine
29+
30+
- name: Build package
31+
run: python setup.py sdist bdist_wheel
32+
33+
- name: Publish to PyPI
34+
env:
35+
TWINE_USERNAME: __token__
36+
TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
37+
run: twine upload dist/*

.github/workflows/test.yml

+52
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,52 @@
1+
name: Python Tests
2+
3+
on:
4+
pull_request:
5+
branches:
6+
- main
7+
8+
jobs:
9+
test:
10+
runs-on: ubuntu-latest
11+
strategy:
12+
matrix:
13+
python-version: ["3.8"]
14+
15+
steps:
16+
- uses: actions/checkout@v4
17+
18+
- name: Set up Python ${{ matrix.python-version }}
19+
uses: actions/setup-python@v4
20+
with:
21+
python-version: ${{ matrix.python-version }}
22+
23+
- name: Install build dependencies
24+
run: |
25+
python -m pip install --upgrade pip
26+
python -m pip install build setuptools wheel
27+
28+
- name: Install package dependencies
29+
run: |
30+
pip install pytest
31+
pip install -r py_package/requirements.txt
32+
33+
- name: Install package in editable mode
34+
run: |
35+
cd py_package
36+
pip install -e .
37+
cd ..
38+
39+
- name: Setup test data
40+
run: |
41+
mkdir -p tests/test_data/inputs
42+
cp -r examples/inputs/* tests/test_data/inputs/
43+
44+
- name: Verify test data
45+
run: |
46+
ls -la tests/test_data/inputs/
47+
echo "Test data files:"
48+
find tests/test_data/inputs/ -type f -exec ls -lh {} \;
49+
50+
- name: Run tests with pytest and coverage
51+
run: |
52+
python -m pytest -v tests/ --capture=no

.gitignore

+2
Original file line numberDiff line numberDiff line change
@@ -155,3 +155,5 @@ cython_debug/
155155
.DS_Store
156156
lineage_to_direct_children.json
157157
lineage_to_direct_parents.json
158+
/MagicMock
159+
/outputs

docs/model_selection_syntax.md

+59
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,59 @@
1+
# Model Selection Syntax
2+
3+
This document explains the dbt-style model selection syntax supported by the DBT Column Lineage Extractor tool.
4+
5+
## Node Types
6+
- Regular models: `model_name` or `model.package.model_name`
7+
- Source nodes: `source.schema.table`
8+
9+
## Graph Operators
10+
- `+model_name` - Include the model and all its ancestors (upstream models)
11+
- `model_name+` - Include the model and all its descendants (downstream models)
12+
- `+model_name+` - Include the model, all its ancestors, and all its descendants (entire lineage)
13+
14+
## Set Operators
15+
- `model1 model2` - Select models that match either selector (union)
16+
- `model1,model2` - Select models that match both selectors (intersection)
17+
18+
## Resource Selectors
19+
- `tag:daily` - Select models with the tag "daily"
20+
- `path:models/finance` - Select models in the specified path
21+
- `package:marketing` - Select models from the specified package
22+
23+
## Examples
24+
25+
```bash
26+
# Select all finance models and their dependencies
27+
dbt_column_lineage_direct --model "+tag:finance"
28+
29+
# Select order models and their downstream dependencies
30+
dbt_column_lineage_direct --model "orders+"
31+
32+
# Select models that are both daily and in the finance package
33+
dbt_column_lineage_direct --model "tag:daily,package:finance"
34+
35+
# Select models in the core package or downstream from customers
36+
dbt_column_lineage_direct --model "package:core customers+"
37+
38+
# Select a specific source
39+
dbt_column_lineage_direct --model "source.raw.customers"
40+
41+
# Select a source and its downstream dependencies
42+
dbt_column_lineage_direct --model "source.raw.customers+"
43+
44+
# Select a source and its entire lineage (both upstream and downstream)
45+
dbt_column_lineage_direct --model "+source.raw.customers+"
46+
```
47+
48+
You can also specify models from a JSON file using the `--model-list-json` parameter:
49+
```bash
50+
dbt_column_lineage_direct --manifest ./inputs/manifest.json --catalog ./inputs/catalog.json --model-list-json ./models.json
51+
```
52+
Where `models.json` is a JSON file containing a list of model names:
53+
```json
54+
[
55+
"model.jaffle_shop.customers",
56+
"model.jaffle_shop.orders",
57+
"source.raw.customers"
58+
]
59+
```

examples/main_step_1_direct.py

+5
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,11 @@
1313
# "model.dbt_project_1.model_2"
1414
]
1515

16+
# Alternative: You can also load models from a JSON file
17+
# import json
18+
# with open('models.json', 'r') as f:
19+
# li_selected_model = json.load(f)
20+
1621
extractor = DbtColumnLineageExtractor(
1722
manifest_path=manifest_path,
1823
catalog_path=catalog_path,

examples/readme.md

+38-1
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,10 @@
33
1. Place your dbt `manifest.json` and `catalog.json` files in the `inputs` directory.
44
2. **Customization**:
55
- Set your dialect (only tested with `snowflake` so far) in the `main_step_1_direct.py` script.
6-
- You can specify the scope of the models you want to extract column lineage for by adding them to the `li_selected_model` list, or leave it empty to process all models (recommended).
6+
- You can specify the scope of the models you want to extract column lineage for by adding them to the `li_selected_model` list, or leave it empty to process all models.
7+
- When specifying models, you can use dbt-style selectors like `+model_name` (ancestors), `model_name+` (descendants), `+model_name+` (entire lineage), `tag:my_tag` (tag filtering), etc.
8+
- Both models and sources are supported in selectors (e.g., `source.schema.table`).
9+
- Alternatively, you can create a JSON file with a list of models and use the `--model-list-json` parameter when running the CLI.
710

811
3. Run the `main_step_1_direct.py` script to extract direct column lineage:
912
```bash
@@ -14,6 +17,40 @@
1417
- `lineage_to_direct_parents.json`
1518
- `lineage_to_direct_children.json`
1619

20+
#### Model Selection with dbt-style Syntax
21+
22+
When specifying models using Python code, you can use dbt-style selectors just like in the CLI:
23+
24+
```python
25+
# Example model selectors
26+
li_selected_model = [
27+
# Include orders and all its ancestors
28+
"+orders",
29+
30+
# Include all models with "finance" tag
31+
"tag:finance",
32+
33+
# Include models that are both daily-tagged AND in the core package
34+
"tag:daily,package:core",
35+
36+
# Include a specific source
37+
"source.raw.customers",
38+
39+
# Include a source and all its downstream dependencies
40+
"source.raw.orders+",
41+
42+
# Get the entire lineage (upstream and downstream) of a source
43+
"+source.raw.payments+"
44+
]
45+
46+
extractor = DbtColumnLineageExtractor(
47+
manifest_path="./inputs/manifest.json",
48+
catalog_path="./inputs/catalog.json",
49+
selected_models=li_selected_model,
50+
dialect="snowflake"
51+
)
52+
```
53+
1754
#### Analyze Recursive Column Lineage
1855

1956
1. With the output from the direct column lineage step, run the `main_step_2_recursive.py` script to analyze recursive column lineage:

py_package/dbt_column_lineage_extractor/cli_direct.py

+76-28
Original file line numberDiff line numberDiff line change
@@ -7,44 +7,92 @@ def main():
77
parser.add_argument('--manifest', default='./inputs/manifest.json', help='Path to the manifest.json file, default to ./inputs/manifest.json')
88
parser.add_argument('--catalog', default='./inputs/catalog.json', help='Path to the catalog.json file, default to ./inputs/catalog.json')
99
parser.add_argument('--dialect', default='snowflake', help='SQL dialect to use, default is snowflake, more dialects at https://github.com/tobymao/sqlglot/tree/v25.24.5/sqlglot/dialects')
10-
parser.add_argument('--model', nargs='*', default=[], help='List of models to extract lineage for, default to all models')
10+
parser.add_argument(
11+
'--model',
12+
nargs='*',
13+
default=[],
14+
help='''List of models to extract lineage for using dbt-style selectors:
15+
- Simple model names: model_name
16+
- Include ancestors: +model_name (include upstream/parent models)
17+
- Include descendants: model_name+ (include downstream/child models)
18+
- Union (either): "model1 model2" (models matching either selector)
19+
- Intersection (both): "model1,model2" (models matching both selectors)
20+
- Tag filtering: tag:my_tag (models with specific tag)
21+
- Path filtering: path:models/finance (models in specific path)
22+
- Package filtering: package:my_package (models in specific package)
23+
Default behavior extracts lineage for all models.'''
24+
)
25+
parser.add_argument('--model-list-json', help='Path to a JSON file containing a list of models to extract lineage for. If specified, this takes precedence over --model')
1126
parser.add_argument('--output-dir', default='./outputs', help='Directory to write output json files, default to ./outputs')
1227
parser.add_argument('--show-ui', action='store_true', help='Flag to show lineage outputs in the console')
28+
parser.add_argument('--continue-on-error', action='store_true', help='Continue processing even if some models fail')
1329

1430
args = parser.parse_args()
1531

16-
# utils.clear_screen()
17-
18-
extractor = DbtColumnLineageExtractor(
19-
manifest_path=args.manifest,
20-
catalog_path=args.catalog,
21-
selected_models=args.model,
22-
dialect=args.dialect,
23-
)
32+
try:
33+
selected_models = args.model
34+
if args.model_list_json:
35+
try:
36+
selected_models = utils.read_json(args.model_list_json)
37+
if not isinstance(selected_models, list):
38+
raise ValueError("The JSON file must contain a list of model names")
39+
except Exception as e:
40+
print(f"Error reading model list from JSON file: {e}")
41+
return 1
2442

25-
lineage_map = extractor.build_lineage_map()
26-
lineage_to_direct_parents = extractor.get_columns_lineage_from_sqlglot_lineage_map(lineage_map)
27-
lineage_to_direct_children = (
28-
extractor.get_lineage_to_direct_children_from_lineage_to_direct_parents(
29-
lineage_to_direct_parents
43+
extractor = DbtColumnLineageExtractor(
44+
manifest_path=args.manifest,
45+
catalog_path=args.catalog,
46+
selected_models=selected_models,
47+
dialect=args.dialect,
3048
)
31-
)
3249

33-
utils.write_dict_to_file(
34-
lineage_to_direct_parents, f"{args.output_dir}/lineage_to_direct_parents.json"
35-
)
50+
print(f"Processing {len(extractor.selected_models)} models after selector expansion")
51+
52+
try:
53+
lineage_map = extractor.build_lineage_map()
54+
55+
if not lineage_map:
56+
print("Warning: No valid lineage was generated. Check for errors above.")
57+
if not args.continue_on_error:
58+
return 1
59+
60+
lineage_to_direct_parents = extractor.get_columns_lineage_from_sqlglot_lineage_map(lineage_map)
61+
lineage_to_direct_children = (
62+
extractor.get_lineage_to_direct_children_from_lineage_to_direct_parents(
63+
lineage_to_direct_parents
64+
)
65+
)
3666

37-
utils.write_dict_to_file(
38-
lineage_to_direct_children, f"{args.output_dir}/lineage_to_direct_children.json"
39-
)
67+
utils.write_dict_to_file(
68+
lineage_to_direct_parents, f"{args.output_dir}/lineage_to_direct_parents.json"
69+
)
70+
71+
utils.write_dict_to_file(
72+
lineage_to_direct_children, f"{args.output_dir}/lineage_to_direct_children.json"
73+
)
4074

41-
if args.show_ui:
42-
print("===== Lineage to Direct Parents =====")
43-
utils.pretty_print_dict(lineage_to_direct_parents)
44-
print("===== Lineage to Direct Children =====")
45-
utils.pretty_print_dict(lineage_to_direct_children)
75+
if args.show_ui:
76+
print("===== Lineage to Direct Parents =====")
77+
utils.pretty_print_dict(lineage_to_direct_parents)
78+
print("===== Lineage to Direct Children =====")
79+
utils.pretty_print_dict(lineage_to_direct_children)
4680

47-
print("Lineage extraction complete. Output files written to output directory.")
81+
print("Lineage extraction complete. Output files written to output directory.")
82+
return 0
83+
84+
except Exception as e:
85+
print(f"Error during lineage extraction: {str(e)}")
86+
if not args.continue_on_error:
87+
raise
88+
return 1
89+
90+
except Exception as e:
91+
print(f"Error: {str(e)}")
92+
import traceback
93+
traceback.print_exc()
94+
return 1
4895

4996
if __name__ == '__main__':
50-
main()
97+
import sys
98+
sys.exit(main())

0 commit comments

Comments
 (0)