How to create your first Machine Learning project with Python

April 08, 2022

Tags: Technologies, python

 

Python is one of the most popular programming languages in the world, not only among software developers and engineers, but also among mathematicians, data analysts, scientists, and even accountants, due to its great ease of use.

 

People in these different fields use Python for many tasks, such as data analysis and visualization, creating artificial intelligence, and machine learning, with the latter perhaps being its most popular use.

 

Python is also popular among non-developers for automation: copying and pasting files and folders, uploading them to a server, and even automating tasks involving Excel, PDF, and CSV files.

 

Among all these uses of Python, the most popular at the moment is machine learning, and in this blog, we are going to teach you how to start your first machine learning project.

 

Machine learning project with Python

 

1. Download and install Python and SciPy

 

Download and install Python and SciPy. Once you have both installed, install the following libraries: scipy, numpy, matplotlib, pandas, and scikit-learn (imported in code as sklearn). If you have any questions, the SciPy installation page has excellent step-by-step instructions for installing these libraries.
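
If you use pip, installing everything in one go usually works (a typical command, shown here as an assumption; adjust it for your operating system, environment, or package manager):

python -m pip install scipy numpy matplotlib pandas scikit-learn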

 

After everything is installed, run Python and check the versions. To do this, open a command line and start the Python interpreter, then paste the following script:

 

# Check the versions of libraries
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

 

It should print something like this (your exact versions will likely differ, and newer versions are fine):

 

Python: 3.6.11 (default, Jun 29 2020, 13:22:26)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)]
scipy: 1.5.2
numpy: 1.19.1
matplotlib: 3.3.0
pandas: 1.1.0
sklearn: 0.23.2

 

2. Load the data

 

We are going to use the iris flower dataset. This dataset is famous because it is used by virtually everyone as the "hello world" dataset in machine learning and statistics.

 

First, let's import the modules, functions, and objects that we'll use in this tutorial:

 

# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
...

 

Next, load the dataset. We will load it directly from a URL (a GitHub mirror of the iris data from the UCI Machine Learning Repository).

 

...
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
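
If the URL is not reachable from your machine, you can build the same DataFrame from the copy of iris that ships with scikit-learn (a fallback sketch; load_iris stores the class as integers, so we map them back to the label names used above):

# Fallback: build the same DataFrame from scikit-learn's bundled copy
from sklearn.datasets import load_iris
from pandas import DataFrame
iris = load_iris()
dataset = DataFrame(iris.data, columns=names[:4])
# map integer targets (0, 1, 2) back to the species names
dataset['class'] = ['Iris-' + iris.target_names[i] for i in iris.target]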

 

3. Summarize the data set

 

This is the step where we take a first look at the data. We will do this in three ways: checking the dimensions of the dataset, peeking at the data itself, and reviewing a statistical summary of the attributes.

 

Data set dimensions

 

...
# shape
print(dataset.shape)

 

It should print (150, 5): 150 instances (rows) and 5 attributes (columns).

 

Look at the data

 

...
# head
print(dataset.head(20))

 

It should return this result:

 

sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa

 

Statistical Summary

 

...
# descriptions
print(dataset.describe())

 

We can see that all numerical values have the same scale (centimeters) and similar ranges, between 0 and 8 centimeters.
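
It is also worth checking how many rows belong to each class. This is an extra step beyond the original summary, using a standard pandas groupby:

...
# class distribution
print(dataset.groupby('class').size())

Each of the three species should appear exactly 50 times, so the classes are perfectly balanced.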

 

4. Data visualization

 

Now that we have a basic idea of the data, we can expand on it with visualizations. We will start with some univariate plots to better understand each attribute. Type:

 

...
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()
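
Note that the scatter_matrix function imported in step 2 has not been used yet. To go beyond univariate plots, you can also draw histograms of each attribute and a scatter-plot matrix to spot pairwise relationships:

...
# histograms of each attribute
dataset.hist()
pyplot.show()
# scatter plot matrix (multivariate view)
scatter_matrix(dataset)
pyplot.show()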

 

5. Evaluate some algorithms

 

Create a validation data set

 

We need a way to check that the models we build are actually good. We're going to split the loaded dataset in two: 80% will be used to train, evaluate, and select between our models, and 20% will be held back as a validation dataset.

 

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
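
As a quick, optional sanity check, confirm the sizes of the two pieces:

...
# expect 120 training rows and 30 validation rows
print(X_train.shape, X_validation.shape)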

 

Build models

 

We don't know in advance which algorithm will work best on this problem or which configuration to use, so we are going to test six different algorithms: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Gaussian Naive Bayes (NB), and Support Vector Machines (SVM):

 

...
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn with 10-fold cross-validation
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

 

Running these algorithms, you should get results similar to these (exact numbers may vary slightly with library versions):

 

LR: 0.960897 (0.052113)
LDA: 0.973974 (0.040110)
KNN: 0.957191 (0.043263)
CART: 0.957191 (0.043263)
NB: 0.948858 (0.056322)
SVM: 0.983974 (0.032083)

 

Complete the example

 

# compare algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn with 10-fold cross-validation
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
# Compare Algorithms
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()
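
Based on the cross-validation results above, SVM has the highest estimated accuracy, so it is a reasonable final choice. As a closing step (a sketch building on the listing above, not part of the original tutorial code), fit it on the training data and evaluate it on the held-out validation set:

# Make predictions on the validation dataset with the best model (SVM)
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
# Evaluate predictions against the held-out labels
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))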

 
