-
-
Notifications
You must be signed in to change notification settings - Fork 129
Description
Overview
While working on #4134 and adding new SEC 10-K tables to PUDL, some strange behavior cropped up. Some of it we'd seen before.
- New fields with ENUM constraints were unable to load data with NA values.
- Schema changes affecting only PK constraints weren't getting updated by Alembic.
- Materializing the new assets via the Dagster UI worked fine, but failed with SQLite errors in the integration tests.
To try and understand the SQLite insert error and why it was behaving differently in different contexts, I opened up pudl.sqlite
from pytest, and from dagster to see if they were different, and they were very different:
Dagster Schema (also nightly builds)
CREATE TABLE IF NOT EXISTS "core_sec10k__quarterly_company_information" (
filename_sec10k TEXT NOT NULL,
central_index_key TEXT,
phone_number TEXT,
company_name TEXT,
film_number TEXT,
fiscal_year_end TEXT,
sec_act TEXT,
filing_number_sec TEXT,
industry_id_sic TEXT,
industry_name_sic TEXT,
taxpayer_id_irs TEXT,
filer_count INTEGER NOT NULL,
incorporation_state TEXT,
sec10k_type TEXT,
business_street_address TEXT,
business_street_address_2 TEXT,
business_city TEXT,
business_state TEXT,
business_zip_code TEXT,
business_zip_code_4 TEXT,
business_postal_code TEXT,
mail_street_address TEXT,
mail_street_address_2 TEXT,
mail_city TEXT,
mail_state TEXT,
mail_zip_code TEXT,
mail_zip_code_4 TEXT,
mail_postal_code TEXT,
CONSTRAINT pk_core_sec10k__quarterly_company_information PRIMARY KEY (filename_sec10k, filer_count)
);
pytest
Schema
CREATE TABLE core_sec10k__quarterly_company_information (
filename_sec10k TEXT NOT NULL CONSTRAINT "ck_core_sec10k__quarterly_company_information_`869797815423838214`" CHECK (TYPEOF("filename_sec10k") = 'text'),
filer_count INTEGER NOT NULL CONSTRAINT "ck_core_sec10k__quarterly_company_information_`645929632982996875`" CHECK (TYPEOF("filer_count") = 'integer'),
central_index_key TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-8131591961600470204`" CHECK ("central_index_key" IS NULL OR TYPEOF("central_index_key") = 'text'),
company_name TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-6055054790339972579`" CHECK ("company_name" IS NULL OR TYPEOF("company_name") = 'text'),
fiscal_year_end TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`6883342265760783397`" CHECK ("fiscal_year_end" REGEXP '^(?:(?NULL[1-9]|1[0-2])(?NULL[1-9]|1\d|2\d|3[01])|(?NULL[13-9]|1[0-2])(?NULL|30)|(?NULL[13578]|1[02])31)$') CONSTRAINT "ck_core_sec10k__quarterly_company_information_`5865496341343288976`" CHECK ("fiscal_year_end" IS NULL OR TYPEOF("fiscal_year_end") = 'text'),
taxpayer_id_irs TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`543403320458096829`" CHECK ("taxpayer_id_irs" REGEXP '^\d{2}-\d{7}$') CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-4102673625268701631`" CHECK ("taxpayer_id_irs" IS NULL OR TYPEOF("taxpayer_id_irs") = 'text'),
incorporation_state TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`4859554518323869125`" CHECK ("incorporation_state" IS NULL OR TYPEOF("incorporation_state") = 'text'),
industry_name_sic TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`3959947840676262577`" CHECK ("industry_name_sic" IS NULL OR TYPEOF("industry_name_sic") = 'text'),
industry_id_sic TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-1049020856102920516`" CHECK ("industry_id_sic" IS NULL OR TYPEOF("industry_id_sic") = 'text') CONSTRAINT "ck_core_sec10k__quarterly_company_information_`393160964873686241`" CHECK ("industry_id_sic" REGEXP '^\d{4}$'),
film_number TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-2112028239114759727`" CHECK ("film_number" IS NULL OR TYPEOF("film_number") = 'text'),
sec10k_type TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-2407677542113752064`" CHECK ("sec10k_type" IS NULL OR TYPEOF("sec10k_type") = 'text'),
sec_act TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-7600126519029964726`" CHECK ("sec_act" IS NULL OR TYPEOF("sec_act") = 'text'),
filing_number_sec TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-5371354177211245036`" CHECK ("filing_number_sec" IS NULL OR TYPEOF("filing_number_sec") = 'text'),
phone_number TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`7246694621579180064`" CHECK ("phone_number" IS NULL OR TYPEOF("phone_number") = 'text'),
business_street_address TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`5484274367470430808`" CHECK ("business_street_address" IS NULL OR TYPEOF("business_street_address") = 'text'),
business_street_address_2 TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-4784933923516920717`" CHECK ("business_street_address_2" IS NULL OR TYPEOF("business_street_address_2") = 'text'),
business_city TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`9188667654242910404`" CHECK ("business_city" IS NULL OR TYPEOF("business_city") = 'text'),
business_state TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`5039280021878742990`" CHECK ("business_state" IS NULL OR TYPEOF("business_state") = 'text'),
business_zip_code TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`5925569707734702120`" CHECK ("business_zip_code" IS NULL OR TYPEOF("business_zip_code") = 'text') CONSTRAINT "ck_core_sec10k__quarterly_company_information_`9167138883428811174`" CHECK ("business_zip_code" REGEXP '^\d{5}$'),
business_zip_code_4 TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`5787026919065795578`" CHECK ("business_zip_code_4" IS NULL OR TYPEOF("business_zip_code_4") = 'text') CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-5663903969516670354`" CHECK ("business_zip_code_4" REGEXP '^\d{4}$'),
business_postal_code TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-6043389851641362179`" CHECK ("business_postal_code" IS NULL OR TYPEOF("business_postal_code") = 'text'),
mail_street_address TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-426568826900999257`" CHECK ("mail_street_address" IS NULL OR TYPEOF("mail_street_address") = 'text'),
mail_street_address_2 TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-1474752677648676221`" CHECK ("mail_street_address_2" IS NULL OR TYPEOF("mail_street_address_2") = 'text'),
mail_city TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-3464743352250393159`" CHECK ("mail_city" IS NULL OR TYPEOF("mail_city") = 'text'),
mail_state TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-5711654700478405267`" CHECK ("mail_state" IS NULL OR TYPEOF("mail_state") = 'text'),
mail_zip_code TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-7793121904295891811`" CHECK ("mail_zip_code" IS NULL OR TYPEOF("mail_zip_code") = 'text') CONSTRAINT "ck_core_sec10k__quarterly_company_information_`7446180905356083089`" CHECK ("mail_zip_code" REGEXP '^\d{5}$'),
mail_zip_code_4 TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-1794365427123600825`" CHECK ("mail_zip_code_4" IS NULL OR TYPEOF("mail_zip_code_4") = 'text') CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-4660163769868745038`" CHECK ("mail_zip_code_4" REGEXP '^\d{4}$'),
mail_postal_code TEXT CONSTRAINT "ck_core_sec10k__quarterly_company_information_`-2654227477147437306`" CHECK ("mail_postal_code" IS NULL OR TYPEOF("mail_postal_code") = 'text'),
CONSTRAINT pk_core_sec10k__quarterly_company_information PRIMARY KEY (filename_sec10k, filer_count)
);
Notes
- The pytest
CREATE
statement is more what I expect to see -- we programmatically define a bunch of SQL constraints based on our field definitions, and those look more or less like the ones we're defining. - The lack of
IS NULL or TYPEOF()
constraints in the Dagster outputs could explain why my attempts to add ENUM constraints on columns that contain NA values was failing. - I have no idea what is up with the back-tick quoted random numbers in the constraint names. It seems like those are probably supposed to ensure that all the constraints are named and unique, but the fact that they're in backticks and some of the numbers are negative seems suspicious.
- So, what's different about how these two systems create the database, and how can we make them both look more like what's happening in pytest currently?
- Given all of the Alembic migrations that have been in flux on the branch I'm working on, I'm pretty sure I just removed the PUDL DB and re-created it with
alembic upgrade head
before populating it using Dagster. So maybe Alembic isn't actually getting all of the constraints that we've specified? Or maybe it is unwilling to work with these weird backtick negative numbers?
Questions
- How is the database initially created in pytest?
- If we start without any database, how is it created by Dagster?
- How is the database created in the nightly builds?
- What constraints are making it into the Pandera asset checks?
Named constraints
It looks like we're the ones adding the backticks when we specify the naming conventions for the SQLite metadata outputs, but it seems like SQLAlchemy is generating the constraint_name
value, including negative numbers. Here in the pudl.metadata.classes.Package.to_sql()
method. This bit was added by @jdangerx 20 months ago so that we could have Alembic refer to specific constraints.
I thought we had fixed the weird PK issue that I kept running into in this PR by adding these names. But if the names aren't making it into the database / the information used by Alembic that would explain why they've been problematic again.
def to_sql(
self,
check_types: bool = True,
check_values: bool = True,
) -> sa.MetaData:
"""Return equivalent SQL MetaData."""
metadata = sa.MetaData(
naming_convention={
"ix": "ix_%(column_0_label)s",
"uq": "uq_%(table_name)s_%(column_0_name)s",
"ck": "ck_%(table_name)s_`%(constraint_name)s`",
"fk": "fk_%(table_name)s_%(column_0_name)s_%(referred_table_name)s",
"pk": "pk_%(table_name)s",
}
)
for resource in self.resources:
if resource.create_database_schema:
_ = resource.to_sql(
metadata,
check_types=check_types,
check_values=check_values,
)
return metadata
Pandera asset checks
We definitely seem to intend to add all these constraints into the Pandera schema checks.
def to_pandera_checks(self) -> list[pr.Check]:
"""Convert these constraints to pandera Column checks."""
checks = []
if self.min_length is not None:
checks.append(pr.Check.str_length(min_value=self.min_length))
if self.max_length is not None:
checks.append(pr.Check.str_length(max_value=self.max_length))
if self.minimum is not None:
checks.append(pr.Check.ge(self.minimum))
if self.maximum is not None:
checks.append(pr.Check.le(self.maximum))
if self.pattern is not None:
checks.append(pr.Check.str_matches(self.pattern))
if self.enum:
checks.append(pr.Check.isin(self.enum))
pytest PUDL DB initialization
In contest.py
we explicitly create the PUDL DB from scratch using SQLAlchemy:
logger.info("setting up the pudl_engine fixture")
if not live_dbs:
# Create the database and schemas
engine = sa.create_engine(PudlPaths().pudl_db)
md = PUDL_PACKAGE.to_sql()
md.create_all(engine)
# Run the ETL and generate a new PUDL SQLite DB for testing:
execute_result = pudl_etl_job_factory(base_job="etl_fast")().execute_in_process(
run_config={
"resources": {
"dataset_settings": {
"config": dataset_settings_config,
},
"datastore": {
"config": pudl_datastore_config,
},
},
},
)
Alembic PUDL DB initialization
In migrations/env.py
we're also generating an SQLAlchemy MetaData
object by using PUDL_PACKAGE.to_sql()
so it seems like it should be getting all the same metadata. I don't see the equivalent of md.create_all(engin)
in there though so I don't know how the MetaData
is actually being used to create the DB. Maybe it's invoked differently?
By default Package.to_sql()
includes the type and value checks, though they can be disabled if you want to.
Looking at aee9c15c7394_wipe_and_reset.py
-- the oldest migration we have, which creates the bulk of the DB -- I see that it doesn't add any of the field content constraints.
An, and indeed it turns out that alembic autogenerate can't handle check constraints of which we have many.
But the Pandera based asset schema checks should still be aware of these constraints and checking for them, but they aren't failing in CI or in the Dagster ETL.
Resources
- SQLAlchemy: Configuring Constraint Naming Conventions
- Alembic: The Importance of Naming Constraints
SQLite Insert Error
def do_executemany(self, cursor, statement, parameters, context=None):
> cursor.executemany(statement, parameters)
E sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) user-defined function raised exception
E [SQL: INSERT INTO core_sec10k__quarterly_company_information (filename_sec10k, filer_count, central_index_key, company_name, fiscal_year_end, taxpayer_id_irs, incorporation_state, industry_name_sic, industry_id_sic, film_number, sec10k_type, sec_act, filing_number_sec, phone_number, business_street_address, business_street_address_2, business_city, business_state, business_zip_code, business_zip_code_4, business_postal_code, mail_street_address, mail_street_address_2, mail_city, mail_state, mail_zip_code, mail_zip_code_4, mail_postal_code) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)]
E [parameters: [('1000015/0000912057-00-014793', 0, '0001000015', 'meta group inc', '1231', '06-0971675', 'DE', 'services-engineering, accounting, research, management', '8700', '585471', '10-k', None, '000-27280', '2039736700', '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None, '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None), ('1000015/0000912057-01-506445', 0, '0001000015', 'meta group inc', '1231', '06-0971675', 'DE', 'services-engineering, accounting, research, management', '8700', '1591442', '10-k', None, '000-27280', '2039736700', '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None, '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None), ('1000015/0000912057-02-012956', 0, '0001000015', 'meta group inc', '1231', '06-0971675', 'DE', 'services-engineering, accounting, research, management', '8700', '02596928', '10-k405', '1934 act', '000-27280', '2039736700', '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None, '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None), ('1000015/0001000015-97-000011', 0, '0001000015', 'meta group inc', '1231', '06-0971675', 'DE', 'services-engineering, accounting, research, management', '8700', '97576343', '10-k', '1934 act', '000-27280', '2039736700', '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None, '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None), ('1000015/0001000015-98-000009', 0, '0001000015', 'meta group inc', '1231', '06-0971675', 'DE', 'services-engineering, accounting, research, management', '8700', '98580215', '10-k', None, '000-27280', '2039736700', '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None, '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None), ('1000015/0001000015-99-000011', 0, '0001000015', 'meta group inc', '1231', '06-0971675', 'DE', 'services-engineering, accounting, research, management', '8700', '99580768', '10-k', None, '000-27280', '2039736700', '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None, '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None), ('1000015/0001047469-03-010821', 0, '0001000015', 'meta group inc', '1231', '06-0971675', 'DE', 'services-engineering, accounting, research, management', '8700', '03624296', '10-k', '1934 act', '000-27280', '2039736700', '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None, '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None), ('1000015/0001047469-03-035439', 0, '0001000015', 'meta group inc', '1231', '06-0971675', 'DE', 'services-engineering, accounting, research, management', '8700', '03969490', '10-k/a', '1934 act', '000-27280', '2039736700', '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None, '208 harbor dr', None, 'stamford', 'CT', '06912', '0061', None) ... displaying 10 of 100000 total bound parameter sets ... ('1133818/0001016193-10-000006', 0, '0001133818', 'bio-path holdings inc', '1231', '87-0652870', 'UT', 'retail-miscellaneous shopping goods stores', '5940', '10715118', '10-k', '1934 act', '000-53404', '8013995500', '3293 harrison boulevard ste 230', None, 'ogden', 'UT', '84403', None, None, '3293 harrison boulevard ste 230', None, 'ogden', 'UT', '84403', None, None), ('1133818/0001104659-20-029711', 0, '0001133818', 'bio-path holdings inc', '1231', '87-0652870', 'DE', 'pharmaceutical preparations', '2834', '20690929', '10-k', '1934 act', '001-36333', '(832) 742-1357', '4710 bellaire boulevard', 'suite 210', 'bellaire', 'TX', '77401', None, None, '4710 bellaire boulevard', 'suite 210', 'bellaire', 'TX', '77401', None, None)]]
E (Background on this error at: https://sqlalche.me/e/20/e3q8)
To Do
- It seems like the field constraints in
pudl.sqlite
should be the same regardless of whether the DB is initially created by alembic, dagster, or pytest. But it sounds like Alembic can't deal with field check constraints. However, in CI and in the nightly builds we're always starting from scratch, so we should be able to use SQLAlchemy directly to create the empty database (as happens now in CI) to ensure that all of the checks are there. Can we do the same thing when we're wiping out the DB and starting from scratch in our local setups? - Verify that Pandera is doing field constraint checks that look at the values of the data. If this is working, then we should at least be alerted when data isn't satisfying the constraints.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status