This application is designed to extract, process, and analyze data related to public companies from both structured and unstructured sources. It provides insights through calculated financial ratios, sentiment analysis, and trend identification, culminating in a summary via a Streamlit dashboard and chatbot.
Ensure you have the following installed:

- Python 3.x
- Pip (Python package installer)

Then set up the project:

- Clone the repository:

  ```bash
  git clone https://github.com/kyle-cassidy/company-data-etl
  ```

- Navigate to the project directory and create a virtual environment:

  ```bash
  python3 -m venv venv
  ```

- Activate the virtual environment:

  ```bash
  source venv/bin/activate
  ```

- Install the required dependencies:

  ```bash
  pip3 install -r requirements.txt
  ```

- Note: if you would like to use Jupyter Notebook to run the interactive cells, install the Jupyter package in your venv as well:

  ```bash
  pip3 install jupyter
  ```
The pipeline currently collects, transforms, and translates structured data, providing insights through calculated financial ratios, sentiment analysis, and trend identification, summarized on the Streamlit dashboard.

Update: a demonstration chatbot is live on the Streamlit dashboard. It is a simple chatbot that can answer questions about the Uber and Lyft 10-K financial documents, demonstrating the potential for a chatbot to provide insights and answer questions about unstructured data.
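The `llm` and `storage` directories in the repo layout suggest an ingest-then-persist flow. Below is a minimal sketch of such a flow using `llama-index`; the library choice is an assumption about the stack, not a description of the repo's exact code:

```python
# A minimal sketch of a persisted RAG index over the seed 10-K PDFs, assuming
# the llama-index library; the repo's llm/ module may be implemented differently.
import os

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "storage"  # mirrors the dashboard's persisted-datastore directory

if os.path.exists(PERSIST_DIR):
    # Reload the previously persisted index instead of re-ingesting.
    index = load_index_from_storage(StorageContext.from_defaults(persist_dir=PERSIST_DIR))
else:
    # Ingest the seed PDFs (e.g. the Uber and Lyft 10-Ks) and persist the index.
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)

print(index.as_query_engine().query("What risks do Uber and Lyft both report?"))
```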
Set your `OPENAI_API_KEY` environment variable in `.env` (located at the root level), or run the following commands in your terminal, replacing the placeholder with your API key.

- On a Mac:

  ```bash
  echo "export OPENAI_API_KEY=<enter-your-key-here>" >> ~/.zshrc
  ```

- Update the shell with the new variable (or restart your shell):

  ```bash
  source ~/.zshrc
  ```

- On Windows:

  ```bash
  setx OPENAI_API_KEY "<enter-your-key-here>"
  ```

Detailed instructions can be found here: https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
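If you use the `.env` route, here is a minimal sketch of how the key can be read at startup, assuming the `python-dotenv` package is installed:

```python
# Minimal sketch of loading the key from .env, assuming python-dotenv.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env at the project root into the process environment
api_key = os.environ["OPENAI_API_KEY"]  # raises KeyError if the key is not set
```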
Notebook analysis can be found in the `analysis_sqlite.ipynb` notebook at the root of the project.
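For a flavor of that analysis, here is a hypothetical query of the kind the notebook might run; the database file name, table name, and column names below are assumptions, not the repo's actual schema:

```python
# Hypothetical example of querying the SQLite store and computing a ratio;
# "app.db", "balance_sheets", and the column names are assumed, not actual.
import sqlite3

import pandas as pd

conn = sqlite3.connect("app.db")  # assumed SQLite file name
df = pd.read_sql_query(
    "SELECT ticker, total_assets, total_liabilities FROM balance_sheets",
    conn,
)
df["debt_ratio"] = df["total_liabilities"] / df["total_assets"]
print(df.head())
conn.close()
```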
To run the Flask API backend and the Streamlit dashboard frontend together, open your preferred terminal or command prompt, navigate to the root of the project, and run the following command to launch everything at once:

```bash
python3 main.py
```
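For reference, one plausible way `main.py` could orchestrate both processes; this is a sketch under that assumption, not a copy of the repo's actual launcher:

```python
# A sketch of launching the Flask backend and Streamlit frontend together;
# an assumption about main.py's behavior, not its actual implementation.
import subprocess
import sys

backend = subprocess.Popen([sys.executable, "run_backend.py"])
frontend = subprocess.Popen([sys.executable, "run_frontend.py"])

try:
    frontend.wait()        # keep the launcher alive while the dashboard runs
finally:
    backend.terminate()    # shut the API down when the dashboard exits
```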
Alternatively, you can run the backend and frontend separately:

- Run the Flask application:

  ```bash
  python3 run_backend.py
  ```

- Run the Streamlit application:

  ```bash
  python3 run_frontend.py
  ```
Note: the Streamlit application is still in development, but it is functional and contains basic visualizations and a chatbot. It is designed to serve as a dashboard for the Flask application and beyond; there is still lots of potential here.
Note: the repo has been restructured to better organize the codebase. The following is a brief overview of the new structure:
```
.
├── README.md
├── requirements.txt
├── main.py
├── src
│   ├── frontend
│   │   ├── dashboard        # contains the Streamlit application and the chatbot
│   │   │   ├── data         # seed PDF for the chatbot demo
│   │   │   ├── llm          # LLM chatbot logic: ingestion, tokenization, and response generation
│   │   │   └── storage      # persisted indexed datastore
│   │   ├── API              # empty directory for future development and code organization
│   │   └── __init__.py      # initializes the Streamlit application and sets up routes
│   └── backend
│       ├── server.py        # imports the Flask application and runs it
│       └── app
│           ├── api          # contains the Flask application, database models, and API clients
│           ├── data
│           │   ├── FMP-API              # JSON files containing data from our fmpAPIclient
│           │   │   └── FMP-client-output
│           │   ├── migrations           # SQL scripts for creating the database schema
│           │   │   ├── phase_1          # simplified schema used in the initial development and analysis phase
│           │   │   └── phase_2          # database schema that adheres to third normal form
│           │   └── seed                 # data used for seeding the database
│           │       └── SEC-EDGAR
│           │           └── companyfacts
│           ├── clients      # classes for making requests to external APIs and retrieving data from
│           │                # external sources such as the Financial Modeling Prep API and SEC EDGAR
│           ├── utils
│           │   └── adapters # classes for adapting, transforming, and orchestrating the population
│           │                # of the data retrieved by the clients
│           ├── models       # ORM model classes that represent the tables in the database and
│           │                # include methods for interacting with the database
│           ├── __init__.py  # initializes the Flask application, sets up routes, and configures
│           │                # the database connection
│           └── db
│               └── db.py    # DEPRECATED: old ORM functions for database operations such as connecting
│                            # to the database, finding records, and building model instances
```
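For context, here is a hypothetical sketch of what a client under `src/backend/app/clients` might look like; the class name, method names, and `FMP_API_KEY` environment variable are assumptions, not the repo's actual implementation:

```python
# Hypothetical illustration of a Financial Modeling Prep client; the class and
# method names are assumptions, not the repo's actual code.
import os
from typing import Any

import requests


class FMPClient:
    """Thin wrapper around the Financial Modeling Prep v3 REST API."""

    BASE_URL = "https://financialmodelingprep.com/api/v3"

    def __init__(self, api_key: str = ""):
        # Assumed environment variable name; adjust to your configuration.
        self.api_key = api_key or os.environ["FMP_API_KEY"]

    def _get(self, endpoint: str, **params: Any) -> Any:
        params["apikey"] = self.api_key
        response = requests.get(f"{self.BASE_URL}/{endpoint}", params=params, timeout=30)
        response.raise_for_status()
        return response.json()

    def income_statement(self, ticker: str, limit: int = 5) -> Any:
        # FMP returns a JSON list with one object per fiscal period.
        return self._get(f"income-statement/{ticker}", limit=limit)


if __name__ == "__main__":
    client = FMPClient()
    print(client.income_statement("AAPL", limit=1))
```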
## What's Next?

- Develop the chatbot to be more robust and capable of answering more complex questions: agentic behavior and a more robust RAG architecture.
- Data preprocessing and feature engineering for the financial data; clean data is the foundation of any good model.
- Automate the scraping, API calls, and data ingestion.
- Deploy the phase 2 schema designed to third normal form.
- Set up XML-to-Markdown conversion so the chatbot can access company reports and press releases in real time (see the sketch after this list).
- Develop a more robust and interactive dashboard with more visualizations and insights.
- Refactor the dashboard: pull functions and classes out into separate files and directories.
- There is always more to do; the sky is the limit...
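Since the XML-to-Markdown step is still planned, here is only a rough sketch of the idea; the `<section>`, `<title>`, and `<p>` tag names are hypothetical and would need to match the actual filing format:

```python
# A rough sketch of the planned XML-to-Markdown conversion; tag names here are
# hypothetical placeholders, not the structure of real SEC filings.
import xml.etree.ElementTree as ET


def xml_to_markdown(xml_text: str) -> str:
    """Flatten a simple XML document into Markdown headings and paragraphs."""
    root = ET.fromstring(xml_text)
    lines = []
    for section in root.iter("section"):       # hypothetical <section> elements
        title = section.findtext("title", "")  # hypothetical <title> child
        if title:
            lines.append(f"## {title}")
        for para in section.iter("p"):         # hypothetical <p> paragraphs
            if para.text:
                lines.append(para.text.strip())
    return "\n\n".join(lines)


print(xml_to_markdown(
    "<doc><section><title>Risk Factors</title><p>Example paragraph.</p></section></doc>"
))
```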
## License
This project is licensed under the MIT License - see the `LICENSE.md` file for details.
## Acknowledgments
- SEC EDGAR for providing company reports and press releases.
- Financial Modeling Prep API for providing financial data.
## Contact
For any queries or further assistance, please contact the repository owner.
## Docker

To build the image and run an interactive container:

```bash
docker build -t company-data-etl:dev -f Dockerfile .
docker run -it --entrypoint /bin/bash --env-file=.flaskenv --name company-data-etl-prod company-data-etl:dev
```