College football fans, rejoice! Or sob uncontrollably? No matter your disposition or current mental health state, sports statistics provide abundant opportunities for data nerds to get lost in the numbers. And sometimes it's an added benefit (or curse!) if it is the team you happen to support.
Data is usually scattered across the internet and sports media, making it difficult to get a full picture of the question you are trying to answer. So where does that leave us, the college football faithful? With Sports Reference, which "...democratizes data, so our users enjoy, understand, and share the sports they love."
The repo is going to be evolving and will include further applications and aspects of all things data. Keep following to see where this goes!
The script utilizes the following libraries:
- Splinter - Splinter Documentation
- BeautifulSoup - BeautifulSoup Documentation
- pandas - pandas Documentation
- NumPy - NumPy Documentation
- SQLite - SQLite Documentation
- Scikit Learn - Scikit Learn Documentation
Google Chrome is recommended for web scraping. If another browser is used, please remember to import the appropriate dependencies in the first cell.
The article explores college football data through web scraping, emphasizing the importance of sports data in storytelling. Using Python, the author provides a script for generating a list of football programs from Sports Reference. The script, which uses Splinter and BeautifulSoup, ensures accurate data extraction by inspecting HTML elements. A user script prompts for a school and years, then generates URLs for scraping offensive statistics. The resulting data is stored in a pandas DataFrame and a CSV file. The article concludes by hinting at upcoming sections on data cleaning.
The article explores the data pipeline, starting with extraction using Python scripts for team statistics. The complexities of the data are illustrated, leading to the Transform phase, where pandas, NumPy, and SQLite are used to clean and manipulate the data. The article details the handling of missing values, formatting issues, and data type conversions, and emphasizes the importance of thorough checks, highlighting the need to drop incomplete rows. The Load phase exports the cleaned data to CSV files and an SQLite database.
Delve into the nuances of data cleaning and analysis in Python with this insightful article. The author presents two unique cases, illustrating the challenges and solutions encountered when dealing with missing data in different datasets. The importance of subject matter expertise in data engineering is emphasized, showcasing the significance of understanding the context behind the data.
Embark on a data analysis journey delving into Nebraska Football from 2000–2022 using SQL. The article covers essential data engineering tasks, including source data extraction, cleaning, transformation, and storage. The focus then shifts to SQL queries for insightful analysis. Explore aspects like game counts, offensive and defensive performances, turnovers, penalties, and team rankings. The author utilizes SQL commands to reveal trends, strengths, and weaknesses, offering a comprehensive view of Nebraska's football performance over the years.
This article delves into the performance of the Nebraska Cornhuskers' football team through data analysis and K-means clustering. Using Greek mythology and existential philosophy, the author introduces the concept of the Absurd to make sense of the team's inconsistent trajectory. Python and Scikit-learn are employed for clustering, revealing three distinct performance groups across coaching eras. The analysis sheds light on Nebraska's position, despite historical success, and examines the challenges faced by coaches, including Scott Frost. The article concludes by suggesting a reassessment of fan expectations and offering a nuanced perspective on Nebraska's football journey.
The `school_list` and `school_year` scripts do not need to be run until after the season, to include any new schools competing this year and the 2023 season.
The `school_stats` script should be the first stop for selecting the school and years you want to analyze. After you have scraped your data, run the corresponding ETL notebook to clean and load the data.
All web scraping scripts generate .csv files in the `resources` directory in addition to DataFrames. The `school_stats_script` notebook needs user inputs for school name, initial year, and final year. Currently, school names need to be capitalized. Examples are below:
- Nebraska
- Iowa State
- Texas A&M
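As a rough illustration of how the three inputs could become one URL per season, here is a minimal sketch. The function names and the URL pattern are assumptions made for this example and are not taken from the notebook, so verify the pattern against Sports Reference before relying on it.

```python
import re

# Assumed URL pattern (not confirmed by this repo); check it against the site.
URL_TEMPLATE = "https://www.sports-reference.com/cfb/schools/{slug}/{year}/gamelog/"


def school_slug(name):
    """Lowercase the name, drop punctuation such as '&', and hyphenate spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.strip().lower())
    return re.sub(r"\s+", "-", cleaned)


def build_urls(school, first_year, last_year):
    """Return one URL per season from first_year through last_year, inclusive."""
    if first_year > last_year:
        raise ValueError("initial year must not be later than final year")
    slug = school_slug(school)
    return [
        URL_TEMPLATE.format(slug=slug, year=year)
        for year in range(first_year, last_year + 1)
    ]


urls = build_urls("Texas A&M", 2020, 2022)
```

Under these assumptions, "Texas A&M" becomes the slug `texas-am` and a 2020 to 2022 range yields three URLs.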
Every ETL script produces a .csv file and updates the database. Pay special attention to the `standings_transform_loading` script, as dropping NaN columns in the incorrect order will eliminate valid data. Please follow the script in its entirety.
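To see why the order matters, consider a small sketch with made-up rows. The column names here are hypothetical, and the actual steps in the notebook may differ. If a column is entirely empty, dropping incomplete rows first removes every row, while dropping the empty column first keeps the valid ones.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not real standings.
standings = pd.DataFrame(
    {
        "school": ["Team A", "Team B", None],
        "wins": [4.0, 7.0, np.nan],
        "notes": [np.nan, np.nan, np.nan],  # entirely empty column
    }
)

# Wrong order: every row has a NaN in "notes", so all rows are dropped.
wrong = standings.dropna(axis="index", how="any")

# Safer order: remove all-empty columns first, then drop incomplete rows.
right = standings.dropna(axis="columns", how="all").dropna(axis="index", how="any")

print(len(wrong), len(right))
```

Here `wrong` ends up with no rows at all, while `right` keeps the two complete rows.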
The data pipeline is structured as follows:
Run the entire notebook to produce two outputs:
- Plot of user selected statistics
- Dataset that includes k-means cluster groups (.csv file)
The analysis covers every team that participated in college football from 2000 through 2022 and groups teams based on wins, losses, rank, strength of schedule, and points. The graphs also allow an individual school to be analyzed to determine which cluster each of its seasons belongs to. For a use case, please visit this Medium post.
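The grouping described above can be sketched with scikit-learn as follows. This is a minimal example on made-up rows, not the notebook's code: the column names, the feature scaling step, and the choice of three clusters are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy seasons for illustration only; column names are hypothetical.
standings = pd.DataFrame(
    {
        "school": ["Team A", "Team B", "Team C", "Team D", "Team E", "Team F"],
        "wins": [12, 11, 7, 6, 2, 1],
        "losses": [1, 2, 5, 6, 10, 11],
        "rank": [2, 5, 40, 45, 110, 120],
        "sos": [5.0, 4.5, 1.0, 0.5, -3.0, -3.5],
        "points": [480, 450, 330, 310, 200, 180],
    }
)

features = ["wins", "losses", "rank", "sos", "points"]

# Scale the features so that large-valued columns such as points do not dominate.
scaled = StandardScaler().fit_transform(standings[features])

# Assumed cluster count; fix the seed so the labels are reproducible.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
standings["cluster"] = model.fit_predict(scaled)

# Write the labeled dataset to a .csv file if needed, e.g.:
# standings.to_csv("clusters.csv", index=False)
```

The cluster labels themselves are arbitrary integers, so inspect the feature averages per cluster before naming a group "strong" or "weak".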
This section highlights a use case for the project focusing on the Nebraska Cornhuskers. For the most in-depth analysis, check out the Medium blog posts for specific examples and further analysis.
A SQL script is included with ad hoc code snippets for generating statistics and random facts for the Medium blog posts.
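A query of this kind might look like the sketch below, run through Python's built-in `sqlite3` module against an in-memory database. The table name, columns, and rows are invented for the example and do not reflect the project's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table and toy rows; the real database schema may differ.
conn.execute(
    "CREATE TABLE games ("
    "school TEXT, year INTEGER, points_for INTEGER, points_against INTEGER)"
)
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?, ?)",
    [
        ("Team A", 2021, 28, 21),
        ("Team A", 2021, 10, 17),
        ("Team A", 2022, 35, 14),
    ],
)

# Games played, wins, and average points scored per season.
query = """
SELECT year,
       COUNT(*) AS games,
       SUM(points_for > points_against) AS wins,
       AVG(points_for) AS avg_points
FROM games
WHERE school = ?
GROUP BY year
ORDER BY year
"""
rows = conn.execute(query, ("Team A",)).fetchall()
conn.close()

print(rows)  # [(2021, 2, 1, 19.0), (2022, 1, 1, 35.0)]
```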
K-means clustering analysis for the Nebraska Cornhuskers, highlighting which coaches had the most impact on the team. The dataset used is the 2000-2022 team standings in the `standings.csv` file.
Regression analysis to predict outcomes based on individual game statistics. This script can be run on any team dataset produced by the `school_stats` web scraping script. The regression model uses both offensive and defensive datasets and currently predicts with ~70% accuracy.
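The README does not say which model the script uses, so the following is only a sketch of one plausible approach: a logistic regression classifier predicting win or loss from one offensive and one defensive feature. The data is synthetic and generated on the spot, the feature names are hypothetical, and the resulting accuracy says nothing about the project's reported figure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic game-level data for illustration only.
rng = np.random.default_rng(0)
n_games = 200
offense_yards = rng.normal(400, 80, n_games)   # yards gained (hypothetical feature)
defense_yards = rng.normal(380, 80, n_games)   # yards allowed (hypothetical feature)

# Synthetic label: a win is more likely when yards gained exceed yards allowed.
noise = rng.normal(0, 60, n_games)
won = (offense_yards - defense_yards + noise > 0).astype(int)

X = np.column_stack([offense_yards, defense_yards])
X_train, X_test, y_train, y_test = train_test_split(
    X, won, test_size=0.25, random_state=0
)

# Scale the features, then fit the classifier (an assumed model choice).
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"held-out accuracy: {accuracy:.2f}")
```

Evaluating on a held-out split, as here, is what makes an accuracy figure meaningful; accuracy measured on the training rows would overstate it.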
I welcome contributions and collaboration from the college football community and data enthusiasts. If you'd like to get involved in making this project even better, here's how you can contribute:
- Fork this repository to your GitHub account.
- Clone the forked repository to your local machine.
- Create a new branch for your feature or bug fix: `git checkout -b your-feature`.
- Make your changes and commit them: `git commit -m 'Added a new feature'`.
- Push the changes to your fork: `git push origin your-feature`.
- Open a Pull Request (PR) to the `main` branch of this repository. Please provide a clear and descriptive title and a summary of your changes.
If you encounter any issues or have suggestions for improvement, please open an issue on the GitHub repository. Be sure to include a detailed description of the problem or your idea for enhancement.