0.10.2
0.10.2 (2023-09-25)
Bug Fixes
Please read information about Release 0.10:
New PySUS version 0.10.0 (2023-09-19) has been released!
This release mainly introduces new classes for interacting with the DATASUS FTP server; they are intended to replace the classes and functionality in pysus.online_data.__init__. These classes, located in the pysus.ftp module, are the building blocks for retrieving data files, which can then be viewed as DataFrames once they have been parsed locally into parquet files. An additional class, pysus.data.local.ParquetSet, has been introduced to work with DBC or DBF files downloaded from the server; it is responsible for transforming them into parquet files.
pysus.online_data Deprecation
This release aims to provide a better interface to the DATASUS FTP server. The classes and methods in pysus.online_data, mainly those in __init__.py, are therefore intended to be deprecated in future releases. To keep usage compatible with previous versions, the methods of the former databases have been updated to use the new FTP interface, but the methods in the modules remain similar until the FTP_Inspect and FTP_Downloader classes become completely obsolete.
pysus.ftp module
The classes introduced in this module fall into two groups. The first group is the base FTP classes, found in the __init__.py file: File, Directory and Database. The second group is the Databases themselves, which represent DATASUS groups of data on the server.
pysus.ftp.File & pysus.ftp.Directory classes
An FTP File is the output class when listing DATASUS FTP content with PySUS. A File's main methods are download() and async_download(), and it can also display the information retrieved from the FTP server through File.info.
An FTP Directory is responsible for actually parsing the FTP content into Files or other Directories. When instantiated, a Directory CWDs into the given path and loads itself and its parent into CACHE, but not yet its content. The content of a directory has to be loaded explicitly with Directory.load(), which parses all the FTP content into Files/Directories and stores them in the Directory's own content. The CACHE matters here because, when a child directory is loaded, it can be linked to a Directory instance that has already been loaded.
Note that only Directories are stored in cache.
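The lazy loading and caching behaviour described above can be sketched with a toy, in-memory model. This is not the PySUS implementation: FAKE_SERVER, open_directory() and the paths used here are hypothetical names made up for illustration, and no FTP connection is involved.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the FTP server: directory path -> entry names.
# Names ending in "/" are sub-directories, everything else is a file.
FAKE_SERVER = {
    "/": ["dissemin/"],
    "/dissemin": ["publicos/", "README.txt"],
    "/dissemin/publicos": ["a.dbc", "b.dbf"],
}

CACHE = {}  # path -> Directory; Files are never stored here


@dataclass
class File:
    path: str  # directory that holds the file
    name: str


@dataclass
class Directory:
    path: str
    content: list = field(default_factory=list)
    loaded: bool = False

    def load(self):
        """Parse the listing of this directory into Files and Directories."""
        entries = []
        for name in FAKE_SERVER.get(self.path, []):
            if name.endswith("/"):
                child_path = f"{self.path.rstrip('/')}/{name[:-1]}"
                # Reuses the cached instance if the child was opened before.
                entries.append(open_directory(child_path))
            else:
                entries.append(File(self.path, name))
        self.content = entries
        self.loaded = True
        return self


def open_directory(path):
    """Return the cached Directory for `path`, creating it (and its parent) if needed."""
    path = "/" + path.strip("/")
    if path not in CACHE:
        CACHE[path] = Directory(path)  # content stays empty until load()
        if path != "/":
            open_directory(path.rsplit("/", 1)[0])  # register the parent too
    return CACHE[path]
```

In this sketch, opening "/dissemin/publicos" registers that directory and its ancestors in CACHE without listing any of them. Calling load() on "/dissemin" afterwards links its child entry to the instance that was already created.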
Database classes
PySUS FTP Databases are the reason File and Directory exist. A Database consists of specific Directories in DATASUS that hold Files (DBC or DBF) with a specific name format. When extracted from DATASUS to a local machine, these files are parsed into ParquetSets (parquet format) so that they can be read as pandas DataFrames.
A list of all databases implemented so far can be found in the pysus.ftp.databases directory. Each Database has its own specifications, but they all share the same main attributes and methods:
name: <ABBREVIATION> - <Long Name>
paths: A Directory or a list of Directories in DATASUS where the Database's Files will be searched for
metadata: A dictionary with detailed information about the Database
groups: The Database's specific groups of data found on the FTP server
content: The loaded content (Files/Directories) of the Database
files: Its content, filtered to Files only
load(): Loads a Directory's content into its own content. The default Directories are its paths
describe(): Displays the information of a File (specific to its Database) in a human-readable format
format(): Extracts a File's information into a tuple
get_files(): Filters its content based on its specifications
download(): Downloads a list of Files and returns them in ParquetSet format
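To illustrate how format() and get_files() could relate to each other, here is a minimal sketch that parses file names into a tuple and then filters a listing by those fields. The naming pattern (two-letter group, two-letter state, four-digit year) and the function names format_name() and get_files() are assumptions chosen for the example. Each real Database defines its own pattern, and the actual PySUS signatures may differ.

```python
import re

# Assumed naming scheme, for illustration only: GG + UF + YYYY + .dbc/.dbf
NAME_PATTERN = re.compile(
    r"^(?P<group>[A-Z]{2})(?P<uf>[A-Z]{2})(?P<year>\d{4})\.(?:dbc|dbf)$",
    re.IGNORECASE,
)


def format_name(filename):
    """Extract (group, state, year) from a file name, in the spirit of format()."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename!r}")
    return match["group"].upper(), match["uf"].upper(), int(match["year"])


def get_files(content, group=None, uf=None, year=None):
    """Keep only the names that match every filter that was given."""
    selected = []
    for name in content:
        try:
            file_group, file_uf, file_year = format_name(name)
        except ValueError:
            continue  # not a data file of this (hypothetical) database
        if group is not None and group != file_group:
            continue
        if uf is not None and uf != file_uf:
            continue
        if year is not None and year != file_year:
            continue
        selected.append(name)
    return selected
```

With a listing such as ["DOSP2020.dbc", "DORJ2020.dbc", "notes.txt"], calling get_files(listing, uf="SP") keeps only the first entry, because the second belongs to another state and the third does not match the pattern.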
pysus.data.local.ParquetSet
ParquetSets are the output class when retrieving Database files to a local machine. They represent the final file format after the DBC -> DBF -> parquet conversion. The parquet format splits the data into smaller chunks, so that it can be managed more easily when loaded into memory for viewing. In general, a ParquetSet can load all of its chunks into memory and display the data as a DataFrame with to_dataframe(). Be aware, however, that large parquet sets may fill the entire memory.
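When a parquet set is too large to load at once, one option is to process it chunk by chunk. The sketch below assumes that a parquet set is stored on disk as a directory of *.parquet chunk files. That layout, and the helper names iter_chunks() and count_rows(), are assumptions for this example and not part of the PySUS API. The default reader relies on pandas.read_parquet, which requires pandas and a parquet engine such as pyarrow.

```python
from pathlib import Path


def _read_with_pandas(chunk):
    # Imported lazily so the helpers below can be used with another reader.
    import pandas as pd

    return pd.read_parquet(chunk)


def iter_chunks(parquet_dir, read_chunk=_read_with_pandas):
    """Yield one chunk at a time so only a single chunk is held in memory."""
    for chunk in sorted(Path(parquet_dir).glob("*.parquet")):
        yield read_chunk(chunk)


def count_rows(parquet_dir, read_chunk=_read_with_pandas):
    """Example aggregation: total number of rows across all chunks."""
    return sum(len(frame) for frame in iter_chunks(parquet_dir, read_chunk))
```

The same loop can carry any per-chunk aggregation (filtering, counting, summing a column), keeping only the reduced result instead of the full DataFrame.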