
0.10.0

@github-actions github-actions released this 19 Sep 19:42
· 35 commits to main since this release

New PySUS version 0.10.0 (2023-09-19) has been released!

This release mainly introduces new classes for interacting with the DATASUS FTP server, intended to replace the classes and functionality in pysus.online_data.__init__. These classes, located in the pysus.ftp module, are the building blocks for retrieving data files, which can be viewed as DataFrames once they have been parsed locally into parquet. An additional class, pysus.data.local.ParquetSet, has been introduced to work with DBC or DBF files downloaded from the server; it is responsible for converting them into parquet files.

pysus.online_data Deprecation

This release aims to provide a better interface to the DATASUS FTP server. The classes and methods that live mainly in __init__.py are therefore intended to be deprecated in future releases. To keep usage compatible with previous versions, the methods of the former databases have been updated to use the new FTP interface, while the methods in the modules remain similar until the FTP_Inspect and FTP_Downloader classes become completely obsolete.

pysus.ftp module

The classes introduced in this module fall into two groups. The first is the base FTP classes, found in the __init__.py file: File, Directory and Database. The second is the Databases themselves, which represent DATASUS groups of data on the server.

pysus.ftp.File & pysus.ftp.Directory classes

An FTP File is the output class when listing DATASUS FTP content with PySUS. A File's main methods are download() and async_download(), and it can also display the information retrieved from the FTP server through File.info.

An FTP Directory is responsible for actually parsing the FTP content into Files or other Directories. When instantiated, a Directory CWDs into the given path and loads itself and its parent into CACHE, but not yet its content. The content of a directory has to be loaded explicitly with Directory.load(), which parses all the FTP content into Files/Directories and stores them in the Directory's own content. The CACHE matters here because, when a child directory is loaded, it can be linked to a Directory instance that has already been loaded.
Note that only Directories are stored in the cache.
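
The lazy-loading and caching behaviour described above can be illustrated with a small standalone sketch. It is not the PySUS implementation: LazyDirectory, get_directory, FAKE_LISTING and the /example paths are hypothetical stand-ins, and a dictionary replaces the real FTP listing.

```python
import posixpath

# Hypothetical cache: normalized path -> LazyDirectory (only directories are stored)
CACHE = {}

# Stand-in for a remote FTP listing: directory path -> [(entry name, is_directory)]
FAKE_LISTING = {
    "/": [("example", True)],
    "/example": [("a.dbc", False), ("sub", True)],
    "/example/sub": [("b.dbf", False)],
}


class LazyDirectory:
    """Illustrative directory whose content is only filled in by load()."""

    def __init__(self, path):
        self.path = path
        self.loaded = False
        self.content = {}

    def load(self):
        # Turn each listed entry into a cached directory or a plain file path
        for name, is_dir in FAKE_LISTING.get(self.path, []):
            child = posixpath.join(self.path, name)
            self.content[name] = get_directory(child) if is_dir else child
        self.loaded = True
        return self


def get_directory(path):
    """Return the cached directory for path, registering it and its parents."""
    path = posixpath.normpath(path)
    if path not in CACHE:
        CACHE[path] = LazyDirectory(path)
        parent = posixpath.dirname(path)
        if parent != path:  # stop at the root
            get_directory(parent)
    return CACHE[path]
```

In this sketch, asking for a directory registers it and its parents in the cache without listing anything; only load() fills content, and a child directory found during load() resolves to the same cached instance if it was requested earlier.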

Database classes

PySUS FTP Databases are the reason File and Directory exist. A Database consists of specific Directories in DATASUS holding Files (DBC or DBF) that follow a specific name format. When extracted from DATASUS to a local machine, these files are parsed into ParquetSets (parquet format) so that they can be read as pandas DataFrames.
All databases implemented so far can be found in the pysus.ftp.databases directory. Each Database has its own specifications, but they all share the same main attributes and methods:

  • name: <ABBREVIATION> - <Long Name>
  • paths: A Directory or list of Directories in DATASUS where the Database's Files are searched for
  • metadata: A dictionary with detailed information about the Database
  • groups: The Database's specific groups of data found on the FTP server
  • content: The loaded content (Files/Directories) of the Database
  • files: Its content, filtered by Files only
  • load(): Loads a Directory's content into the Database's own content. Defaults to the Directories in paths
  • describe(): Displays information about a File (specific to its Database) in a human-readable format
  • format(): Extracts a File's information into a tuple
  • get_files(): Filters its content based on its specifications
  • download(): Downloads a list of Files and returns them as ParquetSets
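
To make the relationship between content, files, format() and get_files() more concrete, here is a toy sketch that borrows those names. Everything else is an assumption made for illustration: ToyFile, ToyDatabase, the file naming scheme (two-letter group, two-letter state, two-digit year) and the get_files() parameters are hypothetical and do not describe the actual PySUS signatures.

```python
from typing import NamedTuple


class ToyFile(NamedTuple):
    # Stand-in for a remote file entry; only the name matters in this sketch
    name: str


class ToyDatabase:
    """Illustrative only: reuses the attribute names listed above, not PySUS code."""

    name = "TOY - Toy Database"
    # Hypothetical groups, keyed by the first two letters of a file name
    groups = {"AA": "Group A", "BB": "Group B"}

    def __init__(self):
        self.content = []

    def load(self, entries):
        # The real classes read from FTP; here the entries are passed in directly
        self.content.extend(entries)
        return self

    @property
    def files(self):
        # Content filtered down to files only
        return [e for e in self.content if isinstance(e, ToyFile)]

    def format(self, file):
        # Split "AASP21.dbc" into (group, state, year) under the assumed scheme
        stem = file.name.split(".")[0]
        return stem[:2], stem[2:4], stem[4:6]

    def get_files(self, group=None, state=None, year=None):
        # Keep files whose parsed fields match every filter that was given
        wanted = (group, state, year)
        return [
            f for f in self.files
            if all(w is None or w == got for w, got in zip(wanted, self.format(f)))
        ]
```

The point of the sketch is the division of labour: format() knows how to read a file name, and get_files() filters content by reusing format() instead of parsing names again.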

pysus.data.local.ParquetSet

ParquetSets are the output class when retrieving Database files to a local machine. They represent the final file format after parsing DBC -> DBF -> parquet. The parquet format splits the data into smaller chunks, which makes it easier to manage when loaded into memory. In general, a ParquetSet can load all of its chunks into memory and expose the data as a DataFrame with to_dataframe(). Be aware, however, that large parquet sets may fill the entire memory.
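
One way to act on the memory warning is to look at the chunk sizes on disk before loading everything. The sketch below uses only the standard library and makes two assumptions that are not taken from PySUS: that the chunks sit as *.parquet files inside one directory, and that a guessed expansion factor is an acceptable rough bound for how much compressed data grows once in memory. The helper names chunk_sizes and fits_budget are hypothetical.

```python
from pathlib import Path


def chunk_sizes(parquet_dir):
    """Return (file name, size in bytes) for each *.parquet chunk, sorted by name."""
    chunks = sorted(Path(parquet_dir).glob("*.parquet"))
    return [(p.name, p.stat().st_size) for p in chunks]


def fits_budget(parquet_dir, budget_bytes, expansion=5):
    """Rough check that the chunks would fit in memory.

    expansion is a guessed compressed-to-in-memory growth factor, not a
    measured value; tune it for the data at hand.
    """
    total = sum(size for _, size in chunk_sizes(parquet_dir))
    return total * expansion <= budget_bytes
```

If the check fails, processing the chunks one at a time is usually a safer route than materializing the whole set as a single DataFrame.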

Features

  • databases: create CACHE structure to ftp Directories & add CNES database (#152) (b99dd38)
  • pbar: include a progress bar to download and parsing data (8cd691c)
  • struc: database modularization and code improvement (#137) (d7e6d27)

Bug Fixes

  • pyreaddbc: update pyreaddbc to fix dbc parsing bug (#153) (4c8315a)
  • release: include main branch on workflow_dispatch (#155) (8f9367d)
  • release: update .releaserc.json (#156) (f9fc7f4)
  • release: update branch from master to main on release file (#154) (4600137)