
Commit 21a96fb

Authored Dec 7, 2018
Provide hooks for logging prediction (#121)
* Helper for creating prediction dataframe
* Helper for logging predictions
* Store latest predictions on every predict
* Util function for converting df columns to json
* Create Mock model for unit test
* Create test for prediction logging
* Integrate relevant changes from montana/model_store
* Add metadata DB
* Add class method to get_or_create instance
* Change schema for metadata
* Instrument model base class for metadata logging
* Update fitting schema to include model uploads
* Ignore commit data for now
* Add memoized property to utils
* Add basic unit test for fit metadata
* Change metadata schema:
  1. Remove fitting and snapshot status
  2. Change fitting_name to fitting_num
* Add additional imports
* Modify model fit and save for metadata logging
* Save best estimator as fitting with hyper_parameter_search
* Fix paths so model upload works
* Refactor uploading/downloading code
* Modify last_fitting to get correct fitting name
* Modify predictions metadata
* Log predictions
* Modify unit tests for logging
* Add additional columns to metadata
* Change prediction logging to use custom_data column
* Save model URL on upload
* Raise error if no fittings found for model in metadata
* Add test for prediction logging
* Fix last_fitting function
* Fix bug where fitting name was not being set properly on download
* Define model_name as property
* Temp commit
* hunt down sql(alchemy+ite) bug (#134)
* bulk insert support for snowflake (#122)
* bulk insert support for snowflake
* always use slices
* cleanup shadowing slice
* Fix issue where copying between different file systems would break data retrieval (#125): `os.rename` only works if the source and destination paths are on the same file system; copying with `shutil.copy` and then removing the source file manually fixes the issue (see the copy-then-remove sketch after this list).
Traceback:

```
11:21:36.644 ERROR root:293 => Exception: Traceback (most recent call last):
  File "/home/thomas/.pyenv/versions/3.6.4/envs/my_app/bin/lore", line 11, in <module>
    sys.exit(main())
  File "/home/thomas/.pyenv/versions/3.6.4/envs/my_app/lib/python3.6/site-packages/lore/__main__.py", line 331, in main
    known.func(known, unknown)
  File "/home/thomas/.pyenv/versions/3.6.4/envs/my_app/lib/python3.6/site-packages/lore/__main__.py", line 483, in fit
    model.fit(score=parsed.score, test=parsed.test, **fit_args)
  File "/home/thomas/.pyenv/versions/3.6.4/envs/my_app/lib/python3.6/site-packages/lore/models/base.py", line 49, in fit
    x=self.pipeline.encoded_training_data.x,
  File "/home/thomas/.pyenv/versions/3.6.4/envs/my_app/lib/python3.6/site-packages/lore/pipelines/holdout.py", line 132, in encoded_training_data
    self._encoded_training_data = self.observations(self.training_data)
  File "/home/thomas/.pyenv/versions/3.6.4/envs/my_app/lib/python3.6/site-packages/lore/pipelines/holdout.py", line 110, in training_data
    self._split_data()
  File "/home/thomas/.pyenv/versions/3.6.4/envs/my_app/lib/python3.6/site-packages/lore/util.py", line 210, in wrapper
    return func(*args, **kwargs)
  File "/home/thomas/.pyenv/versions/3.6.4/envs/my_app/lib/python3.6/site-packages/lore/pipelines/holdout.py", line 234, in _split_data
    self._data = self.get_data()
  File "/home/thomas/code/my_app/my_app/pipelines/product_popularity.py", line 20, in get_data
    lore.io.download(url, cache=True, extract=True)
  File "/home/thomas/.pyenv/versions/3.6.4/envs/my_app/lib/python3.6/site-packages/lore/io/__init__.py", line 124, in download
    os.rename(temp_path, local_path)
OSError: [Errno 18] Invalid cross-device link: '/tmp/tmpwl6lvhon' -> '/home/thomas/code/my_app/data/instacart_online_grocery_shopping_2017_05_01.tar.gz'
```

* documentation checkpoint (#129)
* Create Naive Estimator (#127) (see the naive-estimator sketch after this list)
* Create Naive estimator: a naive estimator just predicts the mean of the response variable; it is useful for benchmarking models
* Create simple base class for Naive model
* Add predict_proba method for xgboost
* Add predict_proba method to base class
* Add unit tests for naive model
* Test for XGBoost predict_proba
* Return probabilities for both classes à la sklearn/xgboost
* Add documentation for naive estimator
* Generalize documentation for multi-class classification
* Use numpy.full instead of numpy.ones
* Add basic documentation and `predict_proba` to SKLearn Estimator (#130)
* Add basic documentation for sklearn estimators
* Expose predict_proba method for sklearn BinaryClassifier
* Add documentation for predict_proba; this should probably be done in a DRY fashion, but it is done this way for now
* Improve OneHot encoder (#131)
* Fix names for OneHot encoded columns
* Add option to drop first level: useful for algorithms like linear regression, which do not like singular matrices
* Test for drop_first
* Add percent_occurrences to OneHot
* Add documentation for OneHot
* Version bump
* [Lore] Add exception handling for unauthenticated snowflake connections (#132)
* [Lore] Added stricter error handling for expired snowflake connection renewal
* [Lore] Added test case for unauthenticated snowflake connection error
* [Lore] Disable failing tests
* Fix tasks invocation (#133)
* python2 compatibility for tests
* Use env-aware default metadata database; use a workaround to re-enable watermarking with sqlite; clean up test output
* Use in-memory database
* Test batch mode in CI
* Go go postgres in CI
* Prevent database schema caching
* Prediction log testing is in metadata tests
* Add get() to return classes by key
* Support loading legacy models in lore
* Bump version
* Add Ganesh
* Let's make this a Release Candidate before launching broadly.
* Merge master
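The cross-device link fix above (#125) boils down to replacing `os.rename` with copy-then-remove when the temp file and the destination live on different file systems. A minimal sketch of that pattern, independent of lore's actual `lore.io.download` implementation (the `temp_path`/`local_path` names simply mirror the traceback):

```python
import os
import shutil


def move_across_filesystems(temp_path, local_path):
    """Move a downloaded temp file into place even when /tmp and the
    destination are on different file systems; os.rename would raise
    OSError: [Errno 18] Invalid cross-device link in that case."""
    shutil.copy(temp_path, local_path)  # plain copy works across devices
    os.remove(temp_path)                # then drop the temp file manually
```

`shutil.move` would also handle the cross-device case transparently, but the explicit copy-plus-remove form matches the description in the commit message.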
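For the naive estimator work (#127), the message describes an estimator that predicts the mean of the response and, for classification, returns probabilities for both classes in the sklearn/xgboost shape using `numpy.full`. A rough, standalone illustration of that idea — not lore's actual `lore.estimators` code; the class and attribute names here are hypothetical:

```python
import numpy


class NaiveBinaryClassifier(object):
    """Predicts the training-set mean of the response for every row;
    useful as a benchmarking baseline."""

    def fit(self, x, y):
        self.mean_ = numpy.mean(y)
        return self

    def predict(self, x):
        # numpy.full lets us fill with the mean directly (vs numpy.ones * mean)
        return numpy.full(len(x), self.mean_)

    def predict_proba(self, x):
        # Return probabilities for both classes, a la sklearn/xgboost:
        # column 0 = P(class 0), column 1 = P(class 1)
        positive = numpy.full(len(x), self.mean_)
        return numpy.column_stack((1 - positive, positive))
```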
1 parent 6d64c07 commit 21a96fb

17 files changed (+629, -89 lines)
 

‎.circleci/config.yml

+1

```diff
@@ -69,6 +69,7 @@ jobs:
 - /opt/circleci/.pyenv
 - /opt/circleci/python
 
+- run: dropdb lore_test --if-exists
 - run: createdb lore_test
 - run: LORE_PYTHON_VERSION=2.7.15 lore test
 - run: LORE_PYTHON_VERSION=3.6.5 lore test
```

‎config/database.cfg

+10

```diff
@@ -0,0 +1,10 @@
+[MAIN]
+url: postgres://localhost/lore_test
+use_batch_mode: True
+
+[MAIN_TWO]
+url: postgres://localhost/lore_test
+use_batch_mode: True
+
+[METADATA]
+url: sqlite:///data/metadata.sqlite
```
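The file above is standard INI syntax, so it can be inspected with Python's `configparser`; a minimal standalone sketch (a sanity check, not how lore itself loads this file):

```python
from configparser import ConfigParser

config = ConfigParser()
config.read('config/database.cfg')

for section in config.sections():  # MAIN, MAIN_TWO, METADATA
    url = config.get(section, 'url')
    # METADATA has no use_batch_mode key, so fall back to False
    batch = config.getboolean(section, 'use_batch_mode', fallback=False)
    print(section, url, batch)
```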

‎config/test/database.cfg

+4

```diff
@@ -5,3 +5,7 @@ use_batch_mode: True
 [MAIN_TWO]
 url: postgres://localhost/lore_test
 use_batch_mode: True
+
+[METADATA]
+url: postgres://localhost/lore_test
+use_batch_mode: True
```

‎lore/__init__.py

+2 -2

```diff
@@ -15,9 +15,9 @@
 
 __author__ = 'Montana Low and Jeremy Stanley'
 __copyright__ = 'Copyright © 2018, Instacart'
-__credits__ = ['Montana Low', 'Jeremy Stanley', 'Emmanuel Turlay', 'Shrikar Archak']
+__credits__ = ['Montana Low', 'Jeremy Stanley', 'Emmanuel Turlay', 'Shrikar Archak', 'Ganesh Krishnan']
 __license__ = 'MIT'
-__version__ = '0.6.27'
+__version__ = '0.7.0rc1'
 __maintainer__ = 'Montana Low'
 __email__ = 'montana@instacart.com'
 __status__ = 'Development Status :: 4 - Beta'
```

‎lore/encoders.py

+2 -1

```diff
@@ -65,7 +65,8 @@ def __str__(self):
         return self.name
 
     def __repr__(self):
-        return self.name
+        properties = ['%s=%s' % (key, value) for key, value in self.__dict__]
+        return '<%s(%s)>' % (self.name, ', '.join(properties))
 
     def __setstate__(self, dict):
         self.__dict__ = dict
```
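A side note on the new `__repr__`: iterating `self.__dict__` directly yields only the keys, so the `key, value` unpacking in the added line would fail at runtime; the intent is presumably `self.__dict__.items()`. A corrected sketch of the same formatting logic:

```python
def __repr__(self):
    # iterate key/value pairs explicitly; bare self.__dict__ yields keys only
    properties = ['%s=%s' % (key, value) for key, value in self.__dict__.items()]
    return '<%s(%s)>' % (self.name, ', '.join(properties))
```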

‎lore/env.py

+6 -2

```diff
@@ -432,8 +432,12 @@ def set_python_version(python_version):
 if os.environ.get('LANG', None):
     UNICODE_LOCALE = False
 else:
-    locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
-    UNICODE_UPGRADED = True
+    try:
+        locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
+        UNICODE_UPGRADED = True
+    except StandardError:
+        UNICODE_LOCALE = False
+
 
 # -- Load Environment --------------------------------------------------------
 ENV_FILE = os.environ.get('ENV_FILE', '.env') #: environment variables will be loaded from this file first
```
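One caveat on the hunk above: `StandardError` is a Python 2 builtin that does not exist under Python 3, so unless it is aliased elsewhere in `env.py`, the new except clause would itself raise `NameError` on Python 3. A version-agnostic sketch of the same guard (the default values and the choice of `locale.Error` are assumptions for illustration, not necessarily how lore resolves it):

```python
import locale

UNICODE_LOCALE = True    # assumed defaults for this standalone sketch
UNICODE_UPGRADED = False

try:
    locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
    UNICODE_UPGRADED = True
except locale.Error:
    # the en_US.UTF-8 locale is not installed on this system
    UNICODE_LOCALE = False
```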

‎lore/io/__init__.py

+3

```diff
@@ -34,6 +34,9 @@
     else:
         vars()[section.lower()] = Connection(name=section.lower(), **options)
 
+if 'metadata' not in vars():
+    vars()['metadata'] = Connection('sqlite:///%s/metadata.sqlite' % lore.env.DATA_DIR)
+
 redis_config = lore.env.REDIS_CONFIG
 if redis_config:
     require(lore.dependencies.REDIS)
```

‎lore/io/connection.py

+16 -14

```diff
@@ -78,7 +78,7 @@ class Connection(object):
     UNLOAD_PREFIX = os.path.join(lore.env.NAME, 'unloads')
     IAM_ROLE = os.environ.get('IAM_ROLE', None)
 
-    def __init__(self, url, name='connection', **kwargs):
+    def __init__(self, url, name='connection', watermark=True, **kwargs):
         if not sqlalchemy:
             raise lore.env.ModuleNotFoundError('No module named sqlalchemy. Please add it to requirements.txt.')
 
@@ -101,6 +101,7 @@ def __init__(self, url, name='connection', **kwargs):
             del kwargs['__name__']
         if 'echo' not in kwargs:
             kwargs['echo'] = False
+        logger.info("Creating engine: %s %s" % (url, kwargs))
         self._engine = sqlalchemy.create_engine(url, **kwargs).execution_options(autocommit=True)
         self._metadata = None
         self.name = name
@@ -110,21 +111,22 @@ def __init__(self, url, name='connection', **kwargs):
         @event.listens_for(self._engine, "before_cursor_execute", retval=True)
         def comment_sql_calls(conn, cursor, statement, parameters, context, executemany):
             conn.info.setdefault('query_start_time', []).append(datetime.now())
-            stack = inspect.stack()[1:-1]
-            if sys.version_info.major == 3:
-                stack = [(x.filename, x.lineno, x.function) for x in stack]
-            else:
-                stack = [(x[1], x[2], x[3]) for x in stack]
+            if watermark:
+                stack = inspect.stack()[1:-1]
+                if sys.version_info.major == 3:
+                    stack = [(x.filename, x.lineno, x.function) for x in stack]
+                else:
+                    stack = [(x[1], x[2], x[3]) for x in stack]
 
-            paths = [x[0] for x in stack]
-            origin = next((x for x in paths if x.startswith(lore.env.ROOT)), None)
-            if origin is None:
-                origin = next((x for x in paths if 'sqlalchemy' not in x), None)
-            if origin is None:
-                origin = paths[0]
-            caller = next(x for x in stack if x[0] == origin)
+                paths = [x[0] for x in stack]
+                origin = next((x for x in paths if x.startswith(lore.env.ROOT)), None)
+                if origin is None:
+                    origin = next((x for x in paths if 'sqlalchemy' not in x), None)
+                if origin is None:
+                    origin = paths[0]
+                caller = next(x for x in stack if x[0] == origin)
 
-            statement = "/* %s | %s:%d in %s */\n" % (lore.env.APP, caller[0], caller[1], caller[2]) + statement
+                statement = "/* %s | %s:%d in %s */\n" % (lore.env.APP, caller[0], caller[1], caller[2]) + statement
             return statement, parameters
 
         @event.listens_for(self._engine, "after_cursor_execute")
```
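The new `watermark` flag gates the stack inspection that prefixes every emitted statement with a `/* app | file:line in function */` comment. A hedged usage sketch, assuming the relevant database drivers are installed; the URLs and names are illustrative, and only the `watermark` keyword itself comes from the diff above:

```python
from lore.io.connection import Connection

# Default behaviour: SQL issued through this connection is prefixed with a
# caller watermark comment identifying the app, file, line, and function.
main = Connection(url='postgres://localhost/lore_test', name='main')

# Passing watermark=False skips the per-query stack inspection and leaves
# statements unmodified, e.g. for connections where the comment is unwanted.
metadata = Connection(url='sqlite:///data/metadata.sqlite', name='metadata',
                      watermark=False)
```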
