Data of an Unusual Size: A practical guide to analysis and interactive visualization of massive datasets (90 min version)
"Big data" refers to any data that is too large to handle comfortably with your current tools and infrastructure. As the leading language for data science, Python has many mature options that allow you to work with datasets that are orders of magnitudes larger than what can fit into a typical laptop's memory. In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets with real-world examples on actual powerful machines on a public cloud -- starting from how the data is stored and read, to how it is processed and visualized.
By the end, you will be able to answer:
- What makes some data formats more efficient at scale?
- Why, how, and when (and when not) should you leverage parallel and distributed computation (primarily with Dask) for your work? (See the Dask sketch below.)
- How do you manage cloud storage, resources, and costs effectively?
- How can interactive visualization (primarily with hvPlot) make large and complex data more understandable? (See the hvPlot sketch below.)
- How can you collaborate comfortably on data science projects with your entire team?
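To give a flavor of the Dask portion of the tutorial, here is a minimal sketch (not taken from the tutorial materials) of lazily reading a partitioned Parquet dataset and aggregating it in parallel. The file path and column names are hypothetical placeholders:

```python
import dask.dataframe as dd

# Lazily point Dask at a directory of Parquet files; nothing is read into memory yet.
# "data/trips/*.parquet", "passenger_count", and "fare" are placeholder names.
df = dd.read_parquet("data/trips/*.parquet")

# Build a task graph for a groupby aggregation over all partitions...
mean_fare = df.groupby("passenger_count")["fare"].mean()

# ...and only now execute it, in parallel across the available cores or workers.
print(mean_fare.compute())
```

Because both the read and the aggregation are lazy, only the columns and row groups that are actually needed ever leave storage, which is part of why columnar formats like Parquet pay off at scale.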
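Similarly, here is a minimal sketch of how hvPlot can make the same kind of dataset explorable interactively; again, the dataset and column names are illustrative assumptions, not the tutorial's data:

```python
import dask.dataframe as dd
import hvplot.dask  # noqa: F401 -- registers the .hvplot accessor on Dask objects

# Placeholder dataset, as in the sketch above.
df = dd.read_parquet("data/trips/*.parquet")

# An interactive (pan/zoom/hover) histogram rendered with Bokeh;
# hvPlot drives the underlying Dask computation behind the scenes.
fare_hist = df["fare"].hvplot.hist(bins=50)
fare_hist  # displays as an interactive plot in a Jupyter notebook
```

hvPlot mirrors the familiar pandas `.plot` API, so moving from static to interactive plots requires very little code change.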
This tutorial is designed to run in the cloud.
Live presentations will run on a Nebari (JupyterHub) instance; check out the introduction notebook for details. Presentations of this tutorial include:
- PyData Global 2023 (Upcoming!)
- PyData NYC 2023
You can check out the repository tags for previous versions of this tutorial.
This repository is covered by the Nebari Code of Conduct and is licensed under the BSD 3-Clause license.