Project author: IdealisticINTJ

Project description:
Data visualization and classification of three Iris species (setosa, virginica and versicolor) using Fisher's Iris data set.
Language: Jupyter Notebook
Project URL: git://github.com/IdealisticINTJ/Iris-Dataset-Analysis.git
Created: 2021-03-16T09:04:00Z
Project community: https://github.com/IdealisticINTJ/Iris-Dataset-Analysis

License: GNU General Public License v3.0


Iris Dataset Analysis

An exploratory data analysis using statistics and data visualisation, and a very basic illustration of how this dataset can be used in machine learning.

About the Dataset

Iris Species

The dataset contains 3 classes (3 different Iris species) with 50 samples each. Every sample is described by four numeric features: Sepal Length, Sepal Width, Petal Length, and Petal Width.
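As a quick check of the description above, here is a minimal sketch that loads the copy of the dataset bundled with scikit-learn and counts the samples per class. Using `load_iris` is an assumption for illustration; the notebook may read the data from a CSV file instead.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of Fisher's Iris data is used here.
iris = load_iris()
X, y = iris.data, iris.target

print("Feature matrix shape:", X.shape)        # (samples, features)
print("Feature names:", iris.feature_names)

# Count how many samples belong to each species.
class_counts = np.bincount(y)
for name, count in zip(iris.target_names, class_counts):
    print(f"{name}: {count} samples")
```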

One species, Iris Setosa, is “linearly separable” from the other two. This means that we can draw a line (or a hyperplane in higher-dimensional spaces) between Iris Setosa samples and samples corresponding to the other two species.
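The separability claim can be illustrated with a single feature. The sketch below assumes scikit-learn's bundled copy of the data and uses petal length only; the midpoint threshold is a choice made for this example, not something taken from the notebook.

```python
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
petal_length = X[:, 2]          # column 2 is petal length (cm)
is_setosa = (y == 0)            # class 0 is setosa in scikit-learn's encoding

# Largest setosa petal length vs. smallest petal length of the other species.
setosa_max = petal_length[is_setosa].max()
others_min = petal_length[~is_setosa].min()

# Any threshold strictly between the two values separates setosa perfectly;
# the midpoint is used here purely for illustration.
threshold = (setosa_max + others_min) / 2
predicted_setosa = petal_length < threshold

print("setosa max petal length:", setosa_max)
print("others min petal length:", others_min)
print("perfect separation:", bool((predicted_setosa == is_setosa).all()))
```

A single threshold on one axis is the one-dimensional case of the separating hyperplane described above.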

Machine Learning

This is a supervised learning problem as the example provides both input (iris measurements) and output (iris species) pairs. The information from these pairings should ideally allow us to create a model that can accurately predict a species of iris when presented with new data inputs.
Classification is a type of supervised machine learning problem where the target (response) variable is categorical.

The steps in building a supervised machine learning program are listed below:

  • Data collection: import libraries and load the dataset
  • Train-test split
  • Exploratory data analysis
  • Choose a model
  • Train the model
  • Evaluate the model
  • Make predictions

Choosing a Model

Scikit-learn is a popular Python library used for creating machine learning models. There are several algorithms available in this library that can be used to build a machine learning model for the Iris Dataset. However, the ones I used this time around are:

  • Logistic Regression
  • K-Nearest Neighbors
  • Support Vector Machines
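The following is a minimal sketch of how these three model families could be trained and compared with scikit-learn. The 80/20 stratified split, the `random_state`, the feature scaling step, and all hyperparameters are assumptions for illustration; they are not taken from the notebook, so the accuracies it prints may differ from the ones reported there.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Each model is wrapped in a pipeline so that scaling is fitted on the
# training data only (hyperparameters are illustrative defaults).
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "K-Nearest Neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                    # train
    scores[name] = model.score(X_test, y_test)     # evaluate: test accuracy
    print(f"{name}: test accuracy = {scores[name]:.3f}")

# Make a prediction for one new, made-up measurement (in cm).
new_sample = [[5.1, 3.5, 1.4, 0.2]]
predicted_class = models["Logistic Regression"].predict(new_sample)[0]
print("Predicted class index:", predicted_class)
```

With only 30 test samples, each misclassification moves the accuracy by about 3.3 percentage points, so small differences between models on a single split should be read with caution.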

Conclusion

Through the exploratory data analysis performed on this dataset, many intriguing inferences could be drawn.

Takeaways

Which predictor(s) can effectively help with the predictions & conclusions? Which of the petal/sepal measurements are more useful to look at?

A) Petal length and petal width are strongly positively correlated with each other, while sepal length and sepal width are essentially uncorrelated.
It is also worth noting that the petal features have a relatively high correlation with sepal_length, but not with sepal_width.
Together with the class separation seen in the visualisations, this suggests that petal measurements separate the species better than sepal measurements do.
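These correlations can be reproduced with a short sketch. It assumes pandas is installed and uses scikit-learn's bundled copy of the data, whose column names (for example `petal length (cm)`) differ from the `sepal_length` style used in the notebook.

```python
from sklearn.datasets import load_iris

# as_frame=True returns the features as a pandas DataFrame.
features = load_iris(as_frame=True).data

# Pearson correlation between every pair of the four measurements.
corr = features.corr()
print(corr.round(2))

petal_corr = corr.loc["petal length (cm)", "petal width (cm)"]
sepal_corr = corr.loc["sepal length (cm)", "sepal width (cm)"]
print(f"petal length vs petal width: {petal_corr:.2f}")
print(f"sepal length vs sepal width: {sepal_corr:.2f}")
```

Note that these are correlations over all 150 samples pooled together; within a single species the values can look quite different.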

Furthermore, all three models achieved a test accuracy above 95% (approximately 97%).