Python is one of the most popular programming languages in the world, not only among software developers and engineers but also among mathematicians, data analysts, scientists, and even accountants, largely because it is so easy to use.
People in these different fields use Python for many tasks, such as data analysis and visualization, building artificial intelligence, and automating all kinds of processes.
Python is also put to work by people who are not developers: for example, to automate copying and pasting files and folders, or uploading them to a server. You can even use Python to automate your tasks in Excel, PDF, and CSV files.
Among all these uses of Python, the most popular at the moment is machine learning, and in this blog we are going to show you how to start your first machine learning project.
Download and install Python and SciPy. You will need the following libraries: scipy, numpy, matplotlib, pandas, and scikit-learn (imported as sklearn). The scipy installation page has excellent step-by-step instructions for installing these libraries if you have any questions.
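If you are wondering how, one common route (an assumption here; conda or your system package manager works just as well) is to install everything in one go with pip from a terminal:

pip install scipy numpy matplotlib pandas scikit-learn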
After everything is installed, run Python and check the versions. To do this, open a command line and start the Python interpreter, then paste the following script:
# Check the versions of libraries
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
It should print something like this (your exact versions may differ):
Python: 3.6.11 (default, Jun 29 2020, 13:22:26)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)]
scipy: 1.5.2
numpy: 1.19.1
matplotlib: 3.3.0
pandas: 1.1.0
sklearn: 0.23.2
We are going to use the iris flower dataset: 150 observations of iris flowers, with four measurements each, across three species. This dataset is famous because virtually everyone uses it as the "hello world" dataset of machine learning and statistics.
First, let's import the modules, functions, and objects that we'll use in this tutorial:
# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
...
Next, load the dataset. The data originally comes from the UCI Machine Learning Repository; here we load it directly from a CSV copy hosted on GitHub.
...
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
In this step we take a first look at the data in three different ways: the dimensions of the dataset, the data itself, and a statistical summary of the attributes.
Dataset dimensions
...
# shape
print(dataset.shape)
It should return (150, 5): 150 instances and 5 attributes.
Look at the data
...
# head
print(dataset.head(20))
It should return this result:
    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa
Statistical Summary
...
# descriptions
print(dataset.describe())
We can see that all the numerical values share the same units (centimeters) and similar ranges, between 0 and 8 centimeters.
Now that we have a basic idea of the data, let's extend it with visualizations. We will start with some univariate plots to better understand each attribute. Type:
...
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()
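If you also want to see the shape of each attribute's distribution, a histogram per variable is a natural companion plot. This small addition is not part of the original script, but it reuses the same dataset object and pyplot import:

...
# histograms of each input variable
dataset.hist()
pyplot.show()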
Create a validation dataset
We need to verify that the created model is good. We're going to split the loaded dataset in two: 80% we'll use to train, evaluate, and select between our models, and 20% we'll hold as a validation dataset.
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
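As a quick sanity check (an extra step, not strictly required), you can confirm the split sizes: 80% of the 150 instances is 120 rows for training, leaving 30 for validation.

# confirm the 80/20 split
print(X_train.shape, X_validation.shape)  # (120, 4) (30, 4)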
Build models
We don't know in advance which algorithm will work best on this problem or which configuration to use, so we are going to test six different algorithms: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Gaussian Naive Bayes (NB), and Support Vector Machines (SVM):
...
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
Running these algorithms, we should get results like these (exact numbers can vary slightly between library versions):
LR: 0.960897 (0.052113)
LDA: 0.973974 (0.040110)
KNN: 0.957191 (0.043263)
CART: 0.957191 (0.043263)
NB: 0.948858 (0.056322)
SVM: 0.983974 (0.032083)

In this case, SVM has the largest estimated accuracy score, at about 98%.
Complete example

For reference, here is the full script in one piece:
# compare algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

# Compare Algorithms
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()
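Note that the held-out 20% has not been used yet. A natural final step, sketched here as one way to finish (picking SVM because it had the best cross-validation score above), is to fit that model on the whole training split and evaluate it once on the validation set:

# Make predictions on the validation dataset with the best-scoring model
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.svm import SVC

model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

# Compare the predictions to the held-out labels
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))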