Releases: ONSdigital/research-and-development
First version
Release Notes - v0.1.0
Released on May 16, 2023.
Changes in .github/pull_request_template.md
- Pull Request Template
- Modified the pull request template to provide more detailed information.
- Added bullet points for changes and instructions to help the reviewer.
- Included a section to specify the ticket(s) being closed with the pull request.
The updated pull request template includes a set of checks that the submitter must complete before requesting a review. These checks are represented as checkboxes and help ensure that the pull request meets the required standards for code, documentation, data, testing, and peer review.
Changes in .github/workflows/pytest-action.yaml
- GitHub Action
These changes introduce a new GitHub Action workflow that runs pytest with code coverage analysis on every pull request to the develop branch. It removes the pydoop dependency, sets up the Miniconda environment, and generates a coverage report. Additionally, it adds a detailed coverage report comment to the pull request, including a coverage badge and test summary.
Changes in .gitignore
- Added the following patterns to the .gitignore file:
  - logs/* to ignore log files.
  - !logs/.gitkeep to include an empty logs directory by using a .gitkeep file.
  - !logs/logs.md to include a specific logs.md file within the logs directory.
These changes ensure that log files are ignored by Git, while the logs directory itself stays tracked through its empty .gitkeep file and its logs.md file.
Changes in .pre-commit-config.yaml
- Added the following hooks to the .pre-commit-config.yaml file:
  - detect-secrets hook: Detects secrets in staged code using the detect-secrets tool. Excludes files in the tests directory and the .cruft.json file.
  - restricted-filenames hook: Checks commits for restricted file extensions. If any commits contain files with the restricted extensions, it will fail the commit. The restricted file extensions include: .csv, .feather, .xlsx, .zip, .hdf5, .h5, .txt, .json, .xml, .hd, and .parquet.
These changes add additional pre-commit hooks to enhance code quality and security. The detect-secrets hook helps identify potential secrets in staged code, while the restricted-filenames hook enforces restrictions on certain file extensions in commits.
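As a rough illustration of the kind of check the restricted-filenames hook performs, the sketch below flags staged filenames by extension. It is not the hook's actual implementation, which lives in .pre-commit-config.yaml. The function names find_restricted and main are hypothetical; only the extension list comes from the notes above.

```python
from pathlib import Path

# Extensions listed in the release notes for the restricted-filenames hook.
RESTRICTED_EXTENSIONS = {
    ".csv", ".feather", ".xlsx", ".zip", ".hdf5", ".h5",
    ".txt", ".json", ".xml", ".hd", ".parquet",
}


def find_restricted(filenames):
    """Return the filenames whose extension is restricted (hypothetical helper)."""
    return [
        name for name in filenames
        if Path(name).suffix.lower() in RESTRICTED_EXTENSIONS
    ]


def main(filenames):
    """Return 1 (fail the commit) if any restricted file is present, else 0."""
    offending = find_restricted(filenames)
    for name in offending:
        print(f"Restricted file extension: {name}")
    return 1 if offending else 0
```

The comparison is case-insensitive here, which is an assumption; the real hook may match extensions differently.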
Changes in config/userconfig.toml
- Added a new TOML configuration file userconfig.toml to allow users to set parameters before running the pipeline.
- The configuration file includes the following sections:
  - period: Specifies the period to be considered, with start_period and end_period values in the format YYYY-MM-DD.
  - source_file: Sets up the source file location and file name.
  - output_location: Specifies the output location, including the Hive database name and table name.
  - outlier_correction: Provides the location, file name, and a boolean flag for an outlier correction file, if needed.
These changes enable users to customize the pipeline by setting various parameters in the userconfig.toml file, allowing for more flexibility and adaptability.
Changes in environment.yml
- Updated the dependencies in the environment configuration file.
- Added new dependencies including arrow, cookiecutter, detect-secrets, myst-parser, pre-commit, python-dotenv, table_logger, pandas, numpy, pydoop, setuptools, and pytest.
- Removed specific version numbers for some dependencies.
- Upgraded pre-commit to version 2.17.0.
These changes ensure that the project has the necessary dependencies for development and testing.
Changes in hdfs.py
- Added a new file hdfs.py to the project.
- This file contains code for exporting data to HDFS using the pydoop library.
- The code creates a pandas.DataFrame, data, and saves it as a CSV file in HDFS.
Changes in logs directory
- Added a new file .gitkeep to the logs directory to ensure it is tracked by Git.
- Added a new file logs.md to the logs directory, providing instructions and guidelines for using log files.
- Updated .gitignore to exclude log files from being committed to the repository.
Changes in main.py
- Added a new file main.py to the project.
- The code in main.py reloads the src module and runs the pipeline using the run_pipeline function.
- This allows any changes made to the src module to be implemented without restarting the entire script.
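The reload behaviour described above can be illustrated with importlib.reload from the standard library. The sketch below is self-contained and does not use the project's src module. It writes a throwaway module named demo_pipeline (a hypothetical name) to a temporary directory, imports it, edits the file, and reloads it so the edited run_pipeline is picked up without restarting the interpreter.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip bytecode caches so the edited source is always recompiled on reload.
sys.dont_write_bytecode = True


def demo_reload():
    """Import a throwaway module, edit its source, and reload it."""
    with tempfile.TemporaryDirectory() as tmp:
        module_path = Path(tmp) / "demo_pipeline.py"
        module_path.write_text(
            "def run_pipeline():\n    return 'first version'\n"
        )
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            demo_pipeline = importlib.import_module("demo_pipeline")
            before = demo_pipeline.run_pipeline()

            # Simulate a developer editing the module between runs.
            module_path.write_text(
                "def run_pipeline():\n    return 'second version, edited'\n"
            )
            importlib.reload(demo_pipeline)
            after = demo_pipeline.run_pipeline()
        finally:
            # Clean up so the demo leaves no trace in the interpreter.
            sys.path.remove(tmp)
            sys.modules.pop("demo_pipeline", None)
    return before, after
```

Note that importlib.reload re-executes only the module passed to it. Whether the project's main.py also reloads submodules of src is not stated in these notes.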
These changes enhance the functionality of the project by enabling data export to HDFS and providing guidelines for managing log files.
src/main.py
- Added a new file main.py in the src directory, containing the main pipeline logic. This file includes functions for running the pipeline, collecting logging parameters, manipulating data, and creating runlog files. We intend to rename one of the two main.py files, as having two files with the same name is not best practice.
src/utils/hdfs_mods.py
- Added a new file hdfs_mods.py in the src/utils directory. This file contains general functions for the HDFS file system, such as reading and writing CSV files and loading JSON data.
Changes in src/utils/helpers.py
- Added a new file helpers.py in the src/utils directory. This file defines helper functions that wrap regularly-used functions. It includes functions for reading user configuration files in TOML format, retrieving configuration settings, and selecting periods.
Changes in pyproject.toml
- Modified the pyproject.toml file to include additional options for the pytest command. The --ignore option is now set to tests/test_utils/test_hdfs_mods.py, excluding this test file from pytest execution.
Changes in src/_version.py
- Modified the src/_version.py file to update the version number to "0.1.0".
Changes in src/data_ingest/loading.py
- Modified the src/data_ingest/loading.py file to include data loading logic from HDFS using the hdfs_load_json function. It loads JSON data from the specified snapshot path, creates pandas DataFrames from the loaded data, and prints the first few rows of the DataFrames.
Changes in src/utils/helpers.py
- Updated the src/utils/helpers.py file to include additional functions and classes. Changes include defining a Config_settings class, implementing logic to create a directory in the Hadoop Distributed File System (HDFS) using hdfs.mkdir, and modifying existing functions like user_config_reader and period_select to use the new class.
Changes in src/utils/runlog.py
- Added a new file runlog.py in the src/utils directory. The file defines a RunLog class responsible for creating a runlog instance for the pipeline. It includes methods for generating a unique run ID, recording the time taken for the pipeline to run, retrieving pipeline logs, creating runlog files, and writing the runlog to files specified in the config.
Changes in src/utils/wrappers.py
- Added a new file wrappers.py in the src/utils directory. This file contains several function wrappers used for logging, timing, and exception handling:
  - logger_creator(global_config): A function that sets up logging configurations based on global settings.
  - time_logger_wrap(func): A decorator that logs the time taken to execute a function.
  - starting_time(): A function that records the start time of a function.
  - finishing_time(func, enter_time): A function that records the end time of a function and logs its runtime.
  - start_finish_wrapper(func): A decorator that logs the start and finish of a function.
  - log_start(func): A function that logs the start of a function at the INFO level.
  - log_finish(func): A function that logs the end of a function at the INFO level.
  - method_start_finish_wrapper(func): A decorator that logs the start time and end time of a function.
  - meth_log_start(func): A function that logs the start of a function at the DEBUG level.
  - meth_log_finish(func): A function that logs the end of a function at the DEBUG level.
  - exception_wrap(func): A decorator that logs all exceptions that occur in a function.
  - df_change_wrap(func): A decorator that logs changes in a DataFrame's columns and rows.
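For readers unfamiliar with this pattern, the sketch below shows one way decorators like time_logger_wrap and exception_wrap could be written with the standard library. The decorator names match the list above, but the bodies are assumptions, not the project's code. In particular, the logger name, the log message wording, and the choice to re-raise after logging are all hypothetical, and divide is a made-up example function.

```python
import functools
import logging
import time

# Hypothetical logger name; the project configures logging via logger_creator.
logger = logging.getLogger("pipeline_sketch")


def time_logger_wrap(func):
    """Log how long the wrapped function takes to run."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        enter_time = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - enter_time
        logger.info("%s took %.4f seconds", func.__name__, elapsed)
        return result
    return wrapper


def exception_wrap(func):
    """Log any exception raised by the wrapped function, then re-raise it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # logger.exception records the message together with the traceback.
            logger.exception("Exception raised in %s", func.__name__)
            raise
    return wrapper


@time_logger_wrap
@exception_wrap
def divide(a, b):
    """Made-up example function used to exercise both decorators."""
    return a / b
```

Using functools.wraps keeps the wrapped function's name and docstring intact, so log messages and debugging tools still refer to divide rather than wrapper.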
Changes in tests/test_main.py
- Added a new file test_main.py in the tests directory. This file contains test cases for the add(), user_config_reader(), and period_select() functions.
  - test_add(): Tests the add() function by asserting its results for various input values, including positive and negative cases.
  - test_user_config_reader(): Tests the user_config_reader() function by asserting its result type and comparing it with an expected value.
  - test_period_select(): Tests the period_select() function by asserting its result type and comparing it with an expected value.
Changes in tests/test_utils/test_hdfs_mods.py
- Added a new file test_hdfs_mods.py in the tests/test_utils direct...