This respository focuses in the analysis of the renowned Fisher's Iris data set as part of the Programming and Scripting module for the Higher Diploma in Data Analytics at ATU.
Below is an overview of the content of this respository:
-
README File: This document provides an introduction about the respository. It includes the purpose, contents and the analysis made using Python of Fisher's Iris data set. Also, It includes the methodology employed and a list of all URL references used.
-
iris.csv: The data set investigated is downloaded into the respository in a .csv document. It is a simple text file to store data in a tabular format.
-
analysis.py: this file contains all Python code, a language programing, which is uses to make the analysis of the data set. All statictis measures, plots, correlation between other is done by Python code and is organized in this document step by step.
-
summary_variable: It is a text file which shows an a concise and clear summary of information from data set. It is provides a brief of the relevant points or main statistical measures of variables in Fisher's Iris data set.
-
plots.png: Finally, in this repository you can see .png files. These are use for storing a variety of plots/ graphics such as Histograms, Scatter plots and Heat map for correlation matrix.
The Iris flower data set or Fisher's Iris data set is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in 1936 paper as described on Wikipedia.
This project has the aim to analyze the Fisher's Iris data set, particularly, the length and width of sepal and petal in just three types of species of Iris flowers knowned as Setosa, Versicolor and Virginica.
Therefore, there are five variables in total in this investigation classified as follows:
-
Categorical: Species (Setosa, Virginica and Versicolor). 1 variable.
-
Numerical: Sepal length, sepal width, petal length and petal width. 4 variables.
Image 2. Giri, A. (2022, June 16). IRIS Flowers Classification using Machine Learning. Machinelearning4you.
In this analysis project, Python is used as the main method or tool. The reason why is due has powerful libraries for facilitate data manipulation and frames, statistical operations, support for mathematical functions and visualization through graphics. The libraries are imported in the analysis.py file.
Some basic libraries imported are:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
Also, the raw data in the project is collected from seaborn-data.
df = pd.read_csv ("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")
Image 3. by ScienceDirect.com | Science, health and medical journals, full text articles and books. (n.d.).
The iris data set is balanced, containing an equal number of Iris flower of each species, with 50 samples each (50 Setosa, 50 Versicolor, 50 Virginica). There are 150 Iris flower in total to investigate.
In this table is observed that Sepal Length (7.9 cm) has the maximun cm measure and Petal Width the minimum measure (0.10 cm). In general, in these Iris flowers species, Sepal are bigger size than Petal.
In relation to metrics such as mean and standard desviation there is a variability of results. For example, Petal Lenght has the higher standar desviation (1.7) meaning that there is a higher dispersion of the data from the mean. On the other side, Sepal width (0.43) keeps the data points closer to the mean which is less disperse. Petal length is a higher variety in the flowers measurements.
Next, you can observe the following histograms with data distribution of these numerical variables:
Below are graphs to show with detail characteristics of each pair of variables taking into account the different iris flower species.
In this study, the interpretation of result is:
-
Iris Setosa usually shows to have shorter and wider sepals and smaller and shorter petals if is compared to Versicolor and Virginica.
-
Iris Virginica tends to be the specie with longest and wider petals and sepal.
In this project is analyzed as well the correlation between each pair of variables , excluding the categorical variable 'Species'.
It is used a correlation matrix, which is defined as a matrix which shows the correlations between all the variables giving us a value between -1 and 1. It helps to summarize all data. For more info:CorrelationMatrix.
The correlation values range between -1 and 1:
-
Value of 1: stronger and perfect positive linear relationship.When one variable increases so does the value of the other.
-
Value of 0: indicates there is no association between the two variables.
-
Value of -1: indictes negative association. When one variable increases, the other one decreases.
In this study, the correlation is interpreted as follows:
-
Negative and weak correlation Sepal length vs sepal width is the weakest and negative linear relationship (-0.12). This means that if the sepal length gets bigger its width won't increase its size too. Also, petal length vs sepal width (-0.43) and sepal width vs petal width (-037).
-
Positive and strong correlation Petal length vs petal width (0.96) has the strongest and positive correlation. This means that as longer are the petals as longer is its width. Also, sepal length vs petal width (0.82) and sepal length vs petal length (0.87).
To start using Python programing language for a data analysis, first of all is to import all libraries and loaded the data set as previously detailed. Then you can start using the following codes to explore the data and outcomes some visualizations as .png.
To explore the data and calculate key statistical measures is used the following code:
iris_data.min()
iris_data.max()
iris_data.mean()
iris_data.median()
iris_data.std()
Importing a package that is not installed:
python -m pip install {package_name}
To save plots or figures:
plt.savefig('.png')
To save a summary in a text file:
numerical_summary = df.describe(include=[np.number])
categorical_summary = df.describe(include=[object])
To plot a visualization:
plt.hist()
plt.scatter()
sns.heatmap()
To obtained correlation matrix:
correlation_matrix = numeric_df.corr()
To sum up, the analysis of Fisher's Iris data set reveals key insights:
-
Strong and positive correlation between petal dimensions: Petal length and petal width (0.96). It means that as longer are the petals on Iris flowers as much is increasing the petal wide.
-
Weak and negative correlation with sepal width: sepal length and sepal width (-0.12). It is not matter how much bigger or length has the iris sepal , Its wide won't increase its size too. It is not correlated.
-
Species differentation: Iris setosa displays more distinct measurements compared to Iris Versicolor and Virginica. Iris Setosa tends to have shorter and wider sepals. Also, smaller and shorter petals. Iris Versicolor and Virginica shows some overlap in some scatter plots. They are more coincidences.
You can submit a pull request regarding my code if you discover an error or if It should be updated.
I am Noemi Diaz and I am currently studying Science in Computing in Data Analytics at ATU.
-
Analytics vidhya: Scatter Plot Visualization in Python using matplotlib. https://www.analyticsvidhya.com/blog/2024/02/scatter-plot-visualization-in-python-using-matplotlib/
-
Builtin: Correlatin matrix. https://builtin.com/data-science/correlation-matrix
-
Datagy: Rounding correlation values. https://datagy.io/python-correlation-matrix/
-
Datagy: How to plot a Heat map https://datagy.io/python-correlation-matrix/
-
Flexiple: Heat map for correlation visualization. https://flexiple.com/python/exploratory-data-analysis-on-iris-dataset
-
Giri, A. (2022, June 16). IRIS Flowers Classification using Machine Learning. Machinelearning4you. https://machinelearning4ya.blogspot.com/2022/04/iris-flowers-classification-using.html
-
GitHub Docs: README Format. https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax
-
Kaggle: Create scatter plot. https://www.kaggle.com/code/holfyuen/tutorial-scatter-plots-in-python
-
Learn python: Statistical measures. https://learnpython.com/blog/how-to-summarize-data-in-python/
-
Machine Learning +: Python scatter plot. https://www.machinelearningplus.com/plots/python-scatter-plot/
-
Matplotlib: Compute and plot a histogram. https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html
-
Matplotlib: Types of markers for plots. https://matplotlib.org/stable/api/markers_api.html
-
Matplotlib: Choosing colormaps. https://matplotlib.org/stable/users/explain/colors/colormaps.html
-
Pandas: Make a histogram. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
-
Pandas: Exclude non-numeric varibales for matrix correlation. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html
-
Pierrian Training: Export and save plot in png file. https://pieriantraining.com/python-tutorial-how-to-export-and-save-a-seaborn-plot/#:~:text=PNG%20(Portable%20Network%20Graphics)%20is,pyplot%20library.
-
Real Python: Create a Scatter plot. https://realpython.com/visualizing-python-plt-scatter/
-
Science direct: Image. https://www.sciencedirect.com/topics/mathematics/virginica
-
Stack over flow: Summaries variables to a text file https://stackoverflow.com/questions/30139243/saving-a-variable-in-a-text-file
-
Statology: Correlation matrix in python https://www.statology.org/correlation-matrix-python/
-
Visual Studio Code: Import resolve failure. https://code.visualstudio.com/docs/python/editing#_importresolvefailure
-
Wikipedia: Information about Iris flower data set https://en.wikipedia.org/wiki/Iris_flower_data_set#:~:text=The%20Iris%20flower%20data%20set,example%20of%20linear%20discriminant%20analysis.
-
W3 Schools: https://www.w3schools.com/python/default.asp