Commit 7341084

Initial commit of source code
1 parent 2607f67 commit 7341084

16 files changed: +65566 -5 lines changed

.gitignore

+2
@@ -0,0 +1,2 @@
+*.ipynb_checkpoints
+.idea

Dockerfile

+9
@@ -0,0 +1,9 @@
+FROM python:3.8-slim-buster
+
+RUN pip3 install pandas==1.1.4 numpy==1.19.4 scikit-learn==0.23.2 scipy==1.5.4 boto3==1.17.12
+
+WORKDIR /home
+
+COPY src/* /home/
+
+ENTRYPOINT ["python3", "drift_detector.py"]
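
Since the image's entrypoint is `drift_detector.py` (copied in from `src/`), the container can be built and published with the `docker_utils.py` helper committed below. A minimal sketch, assuming local Docker and AWS credentials are configured; the repository name is illustrative, not from this commit:

# Sketch: build this Dockerfile and push the image to ECR using the helper
# module from this commit. 'custom-model-monitor' is an example name.
from docker_utils import build_and_push_docker_image

ecr_image = build_and_push_docker_image(repository_name='custom-model-monitor')
print(ecr_image)  # <account>.dkr.ecr.<region>.amazonaws.com/custom-model-monitor:latest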

README.md

+12-5
@@ -1,11 +1,18 @@
-## My Project
+# Bring your own container to project model accuracy drift with Amazon SageMaker Model Monitor
 
-TODO: Fill this README out!
+The world we live in is constantly changing, and so is the data collected to build models. A problem frequently seen in production environments is that a deployed model does not behave the same way it did during the training phase. This concept is generally called *data drift* or *dataset shift*, and it can be caused by many factors, such as sampling bias that affects the feature or label data, the non-stationary nature of time series data, or changes in the data pipeline. Because machine learning models are not deterministic, it is important to minimize variance in the production environment by periodically monitoring the deployment environment for model drift, sending alerts, and, if necessary, triggering retraining of the models on new data.
 
-Be sure to:
+[Amazon SageMaker](https://aws.amazon.com/sagemaker/) is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy ML models at any scale. After you train an ML model, you can deploy it on SageMaker endpoints that are fully managed and can serve inferences in real time with low latency. After you deploy your model, you can use Amazon SageMaker Model Monitor to continuously monitor the quality of your ML model in real time. You can also configure alerts to notify and trigger actions if any drift in model performance is observed. Early and proactive detection of these deviations enables you to take corrective actions, such as collecting new ground truth training data, retraining models, and auditing upstream systems, without having to manually monitor models or build additional tooling.
 
-* Change the title in this README
-* Edit your repository description on GitHub
+In this repository, we present techniques to detect covariate drift and demonstrate how to incorporate your own custom drift detection algorithms and visualizations with SageMaker Model Monitor.
+
+## Contents
+* `sm_model_monitor.ipynb`: The main SageMaker notebook, which ties together the data sources and scripts listed below.
+* `Dockerfile`: The Dockerfile for the custom Model Monitor container.
+* `src`: Files used to detect model drift with custom algorithms in SageMaker Model Monitor.
+* `data`: We have chosen the [Census Income dataset](https://archive.ics.uci.edu/ml/datasets/Adult) from the UCI Machine Learning Repository. The dataset consists of income data and several attributes describing the demographics of the population; the task is to predict whether a person makes above or below $50,000 per year. It contains both categorical and integer attributes and has several missing values. This folder contains the training and test datasets, as well as the data used during inference.
+* `model`: The XGBoost model trained with the `sm_train_xgb.ipynb` notebook.
+* `script`: Scripts used during model inference.
 
 ## Security
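
The covariate drift detection mentioned in the README can be illustrated with a two-sample Kolmogorov-Smirnov test, using the scipy version pinned in the Dockerfile. A minimal sketch, not part of this commit; the header-less CSV layout, the column index, and the significance threshold are illustrative assumptions:

# Sketch: flag covariate drift in one numeric feature by comparing its
# training distribution against incoming inference data. header=None,
# column 0, and the 0.05 threshold are illustrative assumptions.
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_csv('data/train.csv', header=None)
infer = pd.read_csv('data/infer.csv', header=None)

stat, p_value = ks_2samp(train[0].dropna(), infer[0].dropna())
if p_value < 0.05:
    print('Possible covariate drift in column 0: KS=%.3f, p=%.4f' % (stat, p_value))
else:
    print('No drift detected in column 0: KS=%.3f, p=%.4f' % (stat, p_value))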

data/infer.csv

+10,001
Large diffs are not rendered by default.

data/test.csv

+4,886
Large diffs are not rendered by default.

data/train.csv

+48,843
Large diffs are not rendered by default.

docker_utils.py

+215
@@ -0,0 +1,215 @@
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.

from __future__ import absolute_import

import base64
import contextlib
import shlex
import shutil
import subprocess
import sys
import tempfile

import boto3

IMAGE_TEMPLATE = "{account}.dkr.ecr.{region}.amazonaws.com/{image_name}:{version}"


def build_and_push_docker_image(repository_name, dockerfile='Dockerfile', build_args=None):
    """Build a docker image from the specified dockerfile, and push it to
    ECR. Handles things like ECR login and creating the repository.

    Returns the name of the created docker image in ECR.
    """
    base_image = _find_base_image_in_dockerfile(dockerfile)
    _ecr_login_if_needed(base_image)
    _build_from_dockerfile(repository_name, dockerfile, build_args)
    ecr_tag = push(repository_name)
    return ecr_tag


def _build_from_dockerfile(repository_name, dockerfile='Dockerfile', build_args=None):
    # Avoid a mutable default argument; treat None as "no build args".
    build_args = build_args or {}
    build_cmd = ['docker', 'build', '-t', repository_name, '-f', dockerfile, '.']
    for k, v in build_args.items():
        build_cmd += ['--build-arg', '%s=%s' % (k, v)]

    print("Building docker image %s from %s" % (repository_name, dockerfile))
    _execute(build_cmd)
    print("Done building docker image %s" % repository_name)


def _find_base_image_in_dockerfile(dockerfile):
    # The base image is whatever follows the first "FROM " instruction.
    with open(dockerfile) as f:
        dockerfile_lines = f.readlines()
    from_line = list(filter(lambda line: line.startswith("FROM "), dockerfile_lines))[0].rstrip()
    base_image = from_line[5:]
    return base_image


def push(tag, aws_account=None, aws_region=None):
    """
    Push the built tag to ECR.

    Args:
        tag (string): tag which you named your algo
        aws_account (string): aws account of the ECR repo
        aws_region (string): aws region where the repo is located

    Returns:
        (string): ECR repo image that was pushed
    """
    session = boto3.Session()
    aws_account = aws_account or session.client("sts").get_caller_identity()['Account']
    aws_region = aws_region or session.region_name
    try:
        repository_name, version = tag.split(':')
    except ValueError:  # split failed because there is no ':' in the tag
        repository_name = tag
        version = "latest"
    ecr_client = session.client('ecr', region_name=aws_region)

    _create_ecr_repo(ecr_client, repository_name)
    _ecr_login(ecr_client, aws_account)
    ecr_tag = _push(aws_account, aws_region, tag)

    return ecr_tag


def _push(aws_account, aws_region, tag):
    ecr_repo = '%s.dkr.ecr.%s.amazonaws.com' % (aws_account, aws_region)
    ecr_tag = '%s/%s' % (ecr_repo, tag)
    _execute(['docker', 'tag', tag, ecr_tag])
    print("Pushing docker image to ECR repository %s/%s\n" % (ecr_repo, tag))
    _execute(['docker', 'push', ecr_tag])
    print("Done pushing %s" % ecr_tag)
    return ecr_tag


def _create_ecr_repo(ecr_client, repository_name):
    """
    Create the repo if it doesn't already exist.
    """
    try:
        ecr_client.create_repository(repositoryName=repository_name)
        print("Created new ECR repository: %s" % repository_name)
    except ecr_client.exceptions.RepositoryAlreadyExistsException:
        print("ECR repository already exists: %s" % repository_name)


def _ecr_login(ecr_client, aws_account):
    auth = ecr_client.get_authorization_token(registryIds=[aws_account])
    authorization_data = auth['authorizationData'][0]

    # The decoded token has the form "AWS:<password>". Split off the user
    # name rather than using str.strip, which strips *characters* and can
    # corrupt a password that ends in 'A', 'W', 'S', or ':'.
    raw_token = base64.b64decode(authorization_data['authorizationToken'])
    token = raw_token.decode('utf-8').split(':', 1)[1]
    ecr_url = authorization_data['proxyEndpoint']

    cmd = ['docker', 'login', '-u', 'AWS', '-p', token, ecr_url]
    _execute(cmd, quiet=True)
    print("Logged into ECR")


def _ecr_login_if_needed(image):
    ecr_client = boto3.client('ecr')

    # Only ECR images need login
    if not ('dkr.ecr' in image and 'amazonaws.com' in image):
        return

    # Do we already have the image locally?
    if _check_output('docker images -q %s' % image).strip():
        return

    aws_account = image.split('.')[0]
    _ecr_login(ecr_client, aws_account)


@contextlib.contextmanager
def _tmpdir(suffix='', prefix='tmp', dir=None):  # type: (str, str, str) -> None
    """Create a temporary directory with a context manager. The directory is deleted when the context exits.

    The prefix, suffix, and dir arguments are the same as for mkstemp().

    Args:
        suffix (str): If suffix is specified, the directory name will end with that suffix;
            otherwise there will be no suffix.
        prefix (str): If prefix is specified, the directory name will begin with that prefix;
            otherwise, a default prefix is used.
        dir (str): If dir is specified, the directory will be created in that directory;
            otherwise, a default directory is used.
    Returns:
        str: path to the directory
    """
    tmp = tempfile.mkdtemp(suffix=suffix, prefix=prefix, dir=dir)
    try:
        yield tmp
    finally:
        # Clean up even if the block using the directory raises.
        shutil.rmtree(tmp)


def _execute(command, quiet=False):
    if not quiet:
        print("$ %s" % ' '.join(command))
    process = subprocess.Popen(command,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT)
    try:
        _stream_output(process)
    except RuntimeError as e:
        # _stream_output() doesn't have the command line. We will handle the exception
        # which contains the exit code and append the command line to it.
        msg = "Failed to run: %s, %s" % (command, str(e))
        raise RuntimeError(msg)


def _stream_output(process):
    """Stream the output of a process to stdout.

    This function takes an existing process that will be polled for output. Only stdout
    will be polled and sent to sys.stdout.

    Args:
        process (subprocess.Popen): a process that has been started with
            stdout=PIPE and stderr=STDOUT

    Returns (int): process exit code
    """
    exit_code = None

    while exit_code is None:
        stdout = process.stdout.readline().decode("utf-8")
        sys.stdout.write(stdout)
        exit_code = process.poll()

    if exit_code != 0:
        raise RuntimeError("Process exited with code: %s" % exit_code)


def _check_output(cmd, *popenargs, **kwargs):
    if isinstance(cmd, str):
        cmd = shlex.split(cmd)

    success = True
    try:
        output = subprocess.check_output(cmd, *popenargs, **kwargs)
    except subprocess.CalledProcessError as e:
        output = e.output
        success = False

    output = output.decode("utf-8")
    if not success:
        print("Command output: %s" % output)
        raise Exception("Failed to run %s" % ",".join(cmd))

    return output
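
The image URI returned by `build_and_push_docker_image` is what SageMaker Model Monitor ultimately consumes. Below is a sketch of wiring it into a monitoring schedule with the SageMaker Python SDK; the authoritative setup lives in `sm_model_monitor.ipynb`, and the role, endpoint name, and S3 paths are placeholders:

# Sketch: attach the custom container to a Model Monitor schedule using the
# SageMaker Python SDK (not shown in this commit; see sm_model_monitor.ipynb
# for the repo's actual configuration). All names below are placeholders.
from sagemaker.model_monitor import (CronExpressionGenerator, ModelMonitor,
                                     MonitoringOutput)

monitor = ModelMonitor(
    role='<execution-role-arn>',
    image_uri=ecr_image,  # value returned by build_and_push_docker_image
    instance_count=1,
    instance_type='ml.m5.xlarge',
)
monitor.create_monitoring_schedule(
    endpoint_input='<endpoint-name>',
    output=MonitoringOutput(source='/opt/ml/processing/output',
                            destination='s3://<bucket>/monitor-output'),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)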

model/model.tar.gz

33.7 KB
Binary file not shown.

script/inference.py

+61
@@ -0,0 +1,61 @@
# Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License").
# You may not use this file except in compliance with the License.
# A copy of the License is located at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# or in the "license" file accompanying this file. This file is distributed
# on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.

import os
import pickle
import pathlib

from io import StringIO

import pandas as pd

import sagemaker_xgboost_container.encoder as xgb_encoders

# Load the fitted preprocessing pipeline that ships alongside this script,
# so raw CSV records can be transformed before inference.
script_path = pathlib.Path(__file__).parent.absolute()
with open(f'{script_path}/preprocess.pkl', 'rb') as f:
    preprocess = pickle.load(f)


def input_fn(request_body, content_type):
    """
    The SageMaker XGBoost model server receives the request data body and the content type,
    and invokes the `input_fn`.

    Return a DMatrix (an object that can be passed to predict_fn).
    """
    if content_type == "text/csv":
        df = pd.read_csv(StringIO(request_body), header=None)
        X = preprocess.transform(df)

        # Serialize the transformed features back to CSV. Stripping the
        # trailing newline assumes a single record per request.
        X_csv = StringIO()
        pd.DataFrame(X).to_csv(X_csv, header=False, index=False)
        req_transformed = X_csv.getvalue().replace('\n', '')

        return xgb_encoders.csv_to_dmatrix(req_transformed)
    else:
        raise ValueError(
            "Content type {} is not supported.".format(content_type)
        )


def model_fn(model_dir):
    """
    Deserialize and return fitted model.
    """
    model_file = "xgboost-model"
    with open(os.path.join(model_dir, model_file), "rb") as f:
        booster = pickle.load(f)

    return booster
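
With the model deployed behind a SageMaker endpoint that uses this script, a raw CSV record can be sent directly; `input_fn` applies the pickled preprocessing before prediction. A minimal sketch with a placeholder endpoint name and record:

# Sketch: invoke the deployed endpoint with one unprocessed CSV record.
# The endpoint name and record below are placeholders, not repo values.
import boto3

runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='<endpoint-name>',
    ContentType='text/csv',  # routed to the text/csv branch of input_fn
    Body='<one,raw,csv,record>',
)
print(response['Body'].read().decode('utf-8'))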

script/preprocess.pkl

3.87 KB
Binary file not shown.
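
`preprocess.pkl` is the fitted transformer that `script/inference.py` unpickles at load time. Below is a sketch of how such an artifact is typically produced with scikit-learn (the version pinned in the Dockerfile); the actual pipeline comes from the training notebook, and the transformers and column indices are purely illustrative:

# Sketch: producing a preprocessing artifact like script/preprocess.pkl.
# The real pipeline is built in the training notebook; the transformers
# and column indices below are illustrative assumptions.
import pickle

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ('num', StandardScaler(), [0, 2]),                        # example numeric columns
    ('cat', OneHotEncoder(handle_unknown='ignore'), [1, 3]),  # example categorical columns
])
# preprocess.fit(train_df)  # fit on the training features before persisting
with open('script/preprocess.pkl', 'wb') as f:
    pickle.dump(preprocess, f)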
