Releases: ONSdigital/research-and-development

First version

01 Jun 09:21
e717ec3
Pre-release

Release Notes - v0.1.0

Released on May 16, 2023.

Changes in .github/pull_request_template.md - Pull Request Template

  • Modified the pull request template to provide more detailed information.
  • Added bullet points for changes and instructions to help the reviewer.
  • Included a section to specify the ticket(s) being closed with the pull request.

The updated pull request template includes a set of checks that the submitter must complete before requesting a review. These checks are represented as checkboxes and help ensure that the pull request meets the required standards for code, documentation, data, testing, and peer review.

Changes in /.github/workflows/pytest-action.yaml - GitHub Action

These changes introduce a new GitHub Action workflow that runs pytest with code coverage analysis on every pull request to the develop branch. It removes the pydoop dependency, sets up the Miniconda environment, and generates a coverage report. Additionally, it adds a detailed coverage report comment to the pull request, including a coverage badge and test summary.

Changes in .gitignore

  • Added the following patterns to the .gitignore file:
    • logs/* to ignore log files.
    • !logs/.gitkeep to include an empty logs directory by using a .gitkeep file.
    • !logs/logs.md to include a specific logs.md file within the logs directory.

These changes ensure that Git ignores everything in the logs directory except two files: the empty .gitkeep placeholder, which keeps the otherwise empty directory in the repository, and logs.md.

Changes in .pre-commit-config.yaml

  • Added the following hooks to the .pre-commit-config.yaml file:
    • detect-secrets hook: Detects secrets in staged code using the detect-secrets tool. Excludes files in the tests directory and the .cruft.json file.
    • restricted-filenames hook: Checks commits for restricted file extensions and fails any commit that contains a file with one of them. The restricted file extensions are: .csv, .feather, .xlsx, .zip, .hdf5, .h5, .txt, .json, .xml, .hd, and .parquet.

These changes add additional pre-commit hooks to enhance code quality and security. The detect-secrets hook helps identify potential secrets in staged code, while the restricted-filenames hook enforces restrictions on certain file extensions in commits.
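The extension check performed by the restricted-filenames hook can be sketched in Python. This is only an illustration of the logic, not the hook's actual implementation, which is defined in .pre-commit-config.yaml and may work differently. The function names find_restricted and check_commit are hypothetical.

```python
from pathlib import PurePosixPath

# Extensions listed in these release notes for the restricted-filenames hook.
RESTRICTED_EXTENSIONS = {
    ".csv", ".feather", ".xlsx", ".zip", ".hdf5", ".h5",
    ".txt", ".json", ".xml", ".hd", ".parquet",
}


def find_restricted(filenames):
    """Return the filenames whose extension is restricted (case-insensitive)."""
    return [
        name for name in filenames
        if PurePosixPath(name).suffix.lower() in RESTRICTED_EXTENSIONS
    ]


def check_commit(filenames):
    """Return 1 (fail) if any restricted file is present, otherwise 0 (pass)."""
    offenders = find_restricted(filenames)
    for name in offenders:
        print(f"Restricted file type: {name}")
    return 1 if offenders else 0
```

A non-zero return value mirrors how a pre-commit hook signals failure through its exit code.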

Changes in config/userconfig.toml

  • Added a new TOML configuration file userconfig.toml to allow users to set parameters before running the pipeline.
  • The configuration file includes the following sections:
    • period: Specifies the period to be considered, with start_period and end_period values in the format YYYY-MM-DD.
    • source_file: Sets up the source file location and file name.
    • output_location: Specifies the output location, including the Hive database name and table name.
    • outlier_correction: Provides the location, file name, and a boolean flag for an outlier correction file, if needed.

These changes enable users to customize the pipeline by setting various parameters in the userconfig.toml file, allowing for more flexibility and adaptability.
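As a rough sketch of how such a file might be read, the example below parses a TOML string and extracts the period. Only the section names, start_period, and end_period come from these notes. All other key names, all values, and the functions read_config and select_period are assumptions for illustration; the project's own user_config_reader and period_select may behave differently.

```python
from datetime import datetime

try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # older interpreters: assumes the tomli backport is installed
    import tomli as tomllib

# Hypothetical file contents; the real key names and values may differ.
EXAMPLE_CONFIG = """
[period]
start_period = "2022-01-01"
end_period = "2022-12-31"

[source_file]
location = "/hypothetical/input/dir"
file_name = "snapshot.json"

[output_location]
hive_db = "example_db"
hive_table = "example_table"

[outlier_correction]
location = "/hypothetical/outliers"
file_name = "outliers.csv"
apply = false
"""


def read_config(text):
    """Parse TOML text into a nested dictionary."""
    return tomllib.loads(text)


def select_period(config):
    """Return (start, end) as date objects parsed from YYYY-MM-DD strings."""
    fmt = "%Y-%m-%d"
    period = config["period"]
    start = datetime.strptime(period["start_period"], fmt).date()
    end = datetime.strptime(period["end_period"], fmt).date()
    if start > end:
        raise ValueError("start_period must not be after end_period")
    return start, end
```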

Changes in environment.yml

  • Updated the dependencies in the environment configuration file.
  • Added new dependencies including arrow, cookiecutter, detect-secrets, myst-parser, pre-commit, python-dotenv, table_logger, pandas, numpy, pydoop, setuptools, and pytest.
  • Removed specific version numbers for some dependencies.
  • Upgraded pre-commit to version 2.17.0.

These changes ensure that the project has the necessary dependencies for development and testing.

Changes in hdfs.py

  • Added a new file hdfs.py to the project.
  • This file contains code for exporting data to HDFS using the pydoop library.
  • The code creates a pandas.DataFrame named data and saves it as a CSV file in HDFS.

Changes in logs directory

  • Added a new file .gitkeep to the logs directory to ensure it is tracked by Git.
  • Added a new file logs.md to the logs directory, providing instructions and guidelines for using log files.
  • Updated .gitignore to exclude log files from being committed to the repository.

Changes in main.py

  • Added a new file main.py to the project.
  • The code in main.py reloads the src module and runs the pipeline using the run_pipeline function.
  • This allows any changes made to the src module to be implemented without restarting the entire script.

Together, the hdfs.py, logs directory, and main.py changes enable data export to HDFS, provide guidelines for managing log files, and let edits to the src module take effect without restarting the script.
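The reload behaviour described for main.py relies on importlib.reload from the standard library. The sketch below demonstrates the mechanism with a temporary stand-in module, demo_pipeline, rather than the project's src package. The module name and the function demo_reload are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_reload():
    """Import a module, edit its source on disk, reload it, and return both results."""
    previous_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # avoid a cached .pyc masking the quick edit
    with tempfile.TemporaryDirectory() as tmp:
        module_path = Path(tmp) / "demo_pipeline.py"
        module_path.write_text("def run_pipeline():\n    return 'version 1'\n")
        sys.path.insert(0, tmp)
        try:
            module = importlib.import_module("demo_pipeline")
            first = module.run_pipeline()

            # Simulate a developer editing the module while the session is open.
            module_path.write_text("def run_pipeline():\n    return 'version 2 (edited)'\n")
            importlib.invalidate_caches()
            module = importlib.reload(module)
            second = module.run_pipeline()
        finally:
            # Leave the interpreter state as it was found.
            sys.path.remove(tmp)
            sys.modules.pop("demo_pipeline", None)
            sys.dont_write_bytecode = previous_flag
    return first, second
```

Calling demo_reload() returns the result from before and after the edit, showing that the reloaded module picks up the new source within the same interpreter session.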

Changes in src/main.py

  • Added a new file main.py in the src directory, containing the main pipeline logic. This file includes functions for running the pipeline, collecting logging parameters, manipulating data, and creating runlog files. We intend to rename one of the two main.py files, as having two files with the same name is not best practice.

Changes in src/utils/hdfs_mods.py

  • Added a new file hdfs_mods.py in the src/utils directory. This file contains general functions for the HDFS file system, such as reading and writing CSV files and loading JSON data.

Changes in src/utils/helpers.py

  • Added a new file helpers.py in the src/utils directory. This file defines helper functions that wrap regularly-used functions. It includes functions for reading user configuration files in TOML format, retrieving configuration settings, and selecting periods.

Changes in pyproject.toml

  • Modified the pyproject.toml file to include additional options for the pytest command. The --ignore option is now set to tests/test_utils/test_hdfs_mods.py, excluding this test file from pytest execution.

Changes in src/_version.py

  • Modified the src/_version.py file to update the version number to "0.1.0".

Changes in src/data_ingest/loading.py

  • Modified the src/data_ingest/loading.py file to include data loading logic from HDFS using the hdfs_load_json function. It loads JSON data from the specified snapshot path, creates Pandas DataFrames from the loaded data, and prints the first few rows of the DataFrames.

Changes in src/utils/helpers.py

  • Updated the src/utils/helpers.py file to include additional functions and classes. Changes include defining a Config_settings class, implementing logic to create a directory in the Hadoop Distributed File System (HDFS) using hdfs.mkdir, and modifying existing functions like user_config_reader and period_select to use the new class.

Changes in src/utils/runlog.py

  • Added a new file runlog.py in the src/utils directory. The file defines a RunLog class responsible for creating a runlog instance for the pipeline. It includes methods for generating a unique run ID, recording the time taken for the pipeline to run, retrieving pipeline logs, creating runlog files, and writing the runlog to files specified in the config.
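To illustrate the kind of bookkeeping a runlog involves, here is a minimal sketch. It is not the project's RunLog class: the class name RunLogSketch, its methods, and the use of a UUID as the run ID are all assumptions, and the real class may generate IDs and write files quite differently.

```python
import time
import uuid
from datetime import datetime, timezone


class RunLogSketch:
    """Hypothetical stand-in for a run log; not the project's RunLog class."""

    def __init__(self, version):
        self.run_id = uuid.uuid4().hex  # assumption: a random ID, unique per instance
        self.version = version
        self.started_at = datetime.now(timezone.utc)
        self._start = time.perf_counter()
        self.time_taken = None

    def record_time_taken(self):
        """Store and return the seconds elapsed since the instance was created."""
        self.time_taken = time.perf_counter() - self._start
        return self.time_taken

    def as_row(self):
        """Return one flat record, suitable for appending to a runlog file."""
        return {
            "run_id": self.run_id,
            "version": self.version,
            "timestamp": self.started_at.isoformat(),
            "time_taken": self.time_taken,
        }
```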

Changes in src/utils/wrappers.py

  • Added a new file wrappers.py in the src/utils directory. This file contains several function wrappers used for logging, timing, and exception handling:
    • logger_creator(global_config): A function that sets up logging configurations based on global settings.
    • time_logger_wrap(func): A decorator that logs the time taken to execute a function.
    • starting_time(): A function that records the start time of a function.
    • finishing_time(func, enter_time): A function that records the end time of a function and logs its runtime.
    • start_finish_wrapper(func): A decorator that logs the start and finish of a function.
    • log_start(func): A function that logs the start of a function at the INFO level.
    • log_finish(func): A function that logs the end of a function at the INFO level.
    • method_start_finish_wrapper(func): A decorator that logs the start time and end time of a function.
    • meth_log_start(func): A function that logs the start of a function at the DEBUG level.
    • meth_log_finish(func): A function that logs the end of a function at the DEBUG level.
    • exception_wrap(func): A decorator that logs all exceptions that occur in a function.
    • df_change_wrap(func): A decorator that logs changes in a DataFrame's columns and rows.
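Two of the decorators listed above can be sketched as follows. The names time_logger_wrap and exception_wrap come from these notes, but the bodies are assumptions about how such wrappers are commonly written, not the project's code. In particular, whether the real exception_wrap re-raises after logging is not stated. The logger name and the divide function are hypothetical.

```python
import functools
import logging
import time

logger = logging.getLogger("wrappers_sketch")  # hypothetical logger name


def time_logger_wrap(func):
    """Log how long the wrapped function takes to run."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logger.info("%s took %.4f seconds", func.__name__, elapsed)
        return result
    return wrapper


def exception_wrap(func):
    """Log any exception raised by the wrapped function, then re-raise it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            logger.exception("Exception raised in %s", func.__name__)
            raise
    return wrapper


@time_logger_wrap
@exception_wrap
def divide(a, b):
    """Example function used to exercise both decorators."""
    return a / b
```

Using functools.wraps keeps the wrapped function's name and docstring intact, so log messages and stacked decorators still refer to divide rather than wrapper.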

Changes in tests/test_main.py

  • Added a new file test_main.py in the tests directory. This file contains test cases for the add(), user_config_reader(), and period_select() functions.
  • test_add(): Tests the add() function by asserting its results for various input values, including positive and negative cases.
  • test_user_config_reader(): Tests the user_config_reader() function by asserting its result type and comparing it with an expected value.
  • test_period_select(): Tests the period_select() function by asserting its result type and comparing it with an expected value.

Changes in tests/test_utils/test_hdfs_mods.py

  • Added a new file test_hdfs_mods.py in the tests/test_utils direct...