How to create your first Machine Learning project with Python

April 08, 2022

Tags: Technologies, python

 

Python is one of the most popular programming languages in the world, not only among software developers and engineers, but also among mathematicians, data analysts, scientists, and even accountants, due to its great ease of use.

 

People in these different fields use Python for many tasks, such as data analysis and visualization, creating artificial intelligence, and machine learning, with the latter perhaps being its most popular use.

 

Python is also popular among non-developers for automation: copying and pasting files and folders, uploading them to a server, and even automating tasks involving Excel, PDF, and CSV files.

 

Among all these uses of Python, the most popular at the moment is machine learning, and in this blog, we are going to teach you how to start your first machine learning project.

 

Machine learning project with Python

 

1. Download and install Python and SciPy

 

Download and install Python and SciPy. Once you have both installed, install the following libraries: scipy, numpy, matplotlib, pandas, and scikit-learn (imported in code as sklearn). If you have any questions, the SciPy installation page has excellent step-by-step instructions for installing these libraries.
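
If you use pip, installing everything in one go usually works (a typical command, shown here as an assumption; adjust it for your operating system, environment, or package manager):

python -m pip install scipy numpy matplotlib pandas scikit-learn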

 

After everything is installed, run Python and check the versions. To do this, open a command line and start the Python interpreter, then paste the following script:

 

# Check the versions of libraries
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

 

It should print something like this (your exact versions will likely differ, and newer versions are fine):

 

Python: 3.6.11 (default, Jun 29 2020, 13:22:26)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)]
scipy: 1.5.2
numpy: 1.19.1
matplotlib: 3.3.0
pandas: 1.1.0
sklearn: 0.23.2

 

2. Load the data

 

We are going to use the iris flower dataset. This dataset is famous because it is used by virtually everyone as the "hello world" dataset in machine learning and statistics.

 

First, let's import the modules, functions, and objects that we'll use in this tutorial:

 

# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
...

 

Next, load the dataset. We will load it directly from a URL (a GitHub mirror of the iris data from the UCI Machine Learning Repository).

 

...
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
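
If the URL is not reachable from your machine, you can build the same DataFrame from the copy of iris that ships with scikit-learn (a fallback sketch; load_iris stores the class as integers, so we map them back to the label names used above):

# Fallback: build the same DataFrame from scikit-learn's bundled copy
from sklearn.datasets import load_iris
from pandas import DataFrame
iris = load_iris()
dataset = DataFrame(iris.data, columns=names[:4])
# map integer targets (0, 1, 2) back to the species names
dataset['class'] = ['Iris-' + iris.target_names[i] for i in iris.target]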

 

3. Summarize the data set

 

This is the step where we take a first look at the data. We will do this in three ways: checking the dimensions of the dataset, peeking at the data itself, and reviewing a statistical summary of the attributes.

 

Data set dimensions

 

...
# shape
print(dataset.shape)

 

It should print (150, 5): 150 instances (rows) and 5 attributes (columns).

 

Look at the data

 

...
# head
print(dataset.head(20))

 

It should return this result:

 

sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa

 

Statistical Summary

 

...
# descriptions
print(dataset.describe())

 

We can see that all numerical values have the same scale (centimeters) and similar ranges, between 0 and 8 centimeters.
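
It is also worth checking how many rows belong to each class. This is an extra step beyond the original summary, using a standard pandas groupby:

...
# class distribution
print(dataset.groupby('class').size())

Each of the three species should appear exactly 50 times, so the classes are perfectly balanced.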

 

4. Data visualization

 

Now that we have a basic idea of the data, we can expand on it with visualizations. We will start with some univariate plots to better understand each attribute. Type:

 

...
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()
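
Note that the scatter_matrix function imported in step 2 has not been used yet. To go beyond univariate plots, you can also draw histograms of each attribute and a scatter-plot matrix to spot pairwise relationships:

...
# histograms of each attribute
dataset.hist()
pyplot.show()
# scatter plot matrix (multivariate view)
scatter_matrix(dataset)
pyplot.show()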

 

5. Evaluate some algorithms

 

Create a validation data set

 

We need a way to check that the models we build are actually good. We're going to split the loaded dataset in two: 80% will be used to train, evaluate, and select between our models, and 20% will be held back as a validation dataset.

 

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
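
As a quick, optional sanity check, confirm the sizes of the two pieces:

...
# expect 120 training rows and 30 validation rows
print(X_train.shape, X_validation.shape)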

 

Build models

 

We don't know in advance which algorithm will work best on this problem or which configuration to use, so we are going to test six different algorithms: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Gaussian Naive Bayes (NB), and Support Vector Machines (SVM):

 

...
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn with 10-fold cross-validation
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

 

Running these algorithms, you should get results similar to these (exact numbers may vary slightly with library versions):

 

LR: 0.960897 (0.052113)
LDA: 0.973974 (0.040110)
KNN: 0.957191 (0.043263)
CART: 0.957191 (0.043263)
NB: 0.948858 (0.056322)
SVM: 0.983974 (0.032083)

 

Complete the example

 

# compare algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn with 10-fold cross-validation
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
# Compare Algorithms
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()
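
Based on the cross-validation results above, SVM has the highest estimated accuracy, so it is a reasonable final choice. As a closing step (a sketch building on the listing above, not part of the original tutorial code), fit it on the training data and evaluate it on the held-out validation set:

# Make predictions on the validation dataset with the best model (SVM)
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
# Evaluate predictions against the held-out labels
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))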

 
