Matplotlib - The ultimate data science tool

Understanding Matplotlib: A Comprehensive Guide for absolute beginners

Matplotlib is a widely used Python library for data visualization. Whether you're creating basic plots or advanced visualizations, Matplotlib is an important tool for any data analyst or researcher. Let's get started.

Introduction to Matplotlib

Have you heard of Matplotlib? It's a cool Python tool that lets you embed all kinds of visual graphs into your apps. 📊 With it, you can create things like plots, histograms, bar charts, and even some advanced stuff like power spectra and error charts. 📉📈 It's like the ultimate ninjutsu of data visualization! 🪓

The brain behind it is John D. Hunter 🧠. And the best part is that it's all open source, which means we can use it without spending a dime 💰. While it's mainly written in Python, there are bits in C, Objective-C, and even JavaScript to make sure it plays well with other platforms 🖥️📱.

Installing Matplotlib

You can install Matplotlib using pip:

pip install matplotlib

As always, let's learn more by doing.

Line Plot

A basic plot can be created using pyplot, a module in Matplotlib.

import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]

plt.plot(x, y)
plt.title('Basic Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

This code produces a simple line plot of y = x^2 with labelled axes and a title. The curve shows how y grows faster and faster as x increases.

Let's break down the code:

  1. import matplotlib.pyplot as plt:

    • This imports the pyplot module from the matplotlib library under the alias plt, so every later call can be written as plt.something().
  2. x = [0, 1, 2, 3, 4] and y = [0, 1, 4, 9, 16]:

    • Two lists, x and y, are defined. They represent the x-coordinates and y-coordinates of points, respectively. Essentially, the pairs (0,0), (1,1), (2,4), (3,9), and (4,16) will be plotted. These points represent the curve y = x^2.
  3. plt.plot(x, y):

    • This function plots the x and y values. It will connect the points with a line, producing a line graph.
  4. plt.title('Basic Line Plot'):

    • This sets the title of the plot to "Basic Line Plot".
  5. plt.xlabel('X-axis') and plt.ylabel('Y-axis'):

    • These functions label the x-axis and y-axis of the plot, respectively.
  6. plt.show():

    • This displays the figure. It's telling the program to render and show the plot with all the defined settings and data.
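Typing the y-values by hand works for five points, but they can also be computed from x. The sketch below is one way to do that with a list comprehension; the range of 0 to 10 is an arbitrary choice for illustration, not something the example above requires.

```python
import matplotlib.pyplot as plt

x = list(range(11))                # 0, 1, ..., 10
y = [value ** 2 for value in x]    # square each x instead of typing the results

plt.plot(x, y)
plt.title('y = x^2 for x from 0 to 10')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
# Call plt.show() here to display the figure.
```

The first five computed points match the hand-written lists used earlier.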

We used a very important function here - plot(). Let's understand this in more detail.

Matplotlib's plot() Function

plot() is a function in Matplotlib's pyplot module that allows you to create line plots. Line plots are ideal for visualizing data points in a sequence or showing how a value changes over time.

Basic Usage

Using plot() is straightforward. At its simplest, it needs two lists: one for the x-axis and one for the y-axis:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.plot(x, y)
plt.show()

This code will plot the curve y = x^2.

Customizing the Line

The beauty of plot() lies in its customizability. Want a red dashed line with circle markers? No problem!

plt.plot(x, y, 'ro--')

Here, 'r' stands for red, 'o' for a circle marker, and '--' for a dashed line.

The plot() function accepts a wide range of options for color, line style, and markers. Let's break them down:

Colors:

The following are some of the basic color abbreviations:

  • 'b': Blue

  • 'g': Green

  • 'r': Red

  • 'c': Cyan

  • 'm': Magenta

  • 'y': Yellow

  • 'k': Black

  • 'w': White

You can also specify colors in many other ways, like full names ('green'), hex strings ('#FFDD44'), RGB tuples ((1,0,0) for red), and more.
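As a rough illustration of those alternative color formats, the sketch below draws the same data three times, once per format. The offsets added to y are arbitrary and only keep the lines from overlapping; to_hex from matplotlib.colors is used to check what each specification resolves to.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# Same curve drawn three times, shifted up so the lines do not overlap.
line_name, = plt.plot(x, y, color='green')                     # full color name
line_hex, = plt.plot(x, [v + 2 for v in y], color='#FFDD44')    # hex string
line_rgb, = plt.plot(x, [v + 4 for v in y], color=(1, 0, 0))    # RGB tuple, each value in 0..1

# to_hex() converts any valid color specification to a lowercase hex string.
resolved = [to_hex(line.get_color()) for line in (line_name, line_hex, line_rgb)]
print(resolved)  # ['#008000', '#ffdd44', '#ff0000']
# Call plt.show() here to display the figure.
```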

Line Styles:

Here are the basic line styles you can use:

  • '-': Solid line (default)

  • '--': Dashed line

  • '-.': Dash-dot line

  • ':': Dotted line

Markers:

In addition to line styles, you can also specify markers to denote points. Here are some of the common ones:

  • '.': Point marker

  • 'o': Circle marker

  • 's': Square marker

  • '^': Upward-pointing triangle marker

  • 'v': Downward-pointing triangle marker

  • '<': Left-pointing triangle marker

  • '>': Right-pointing triangle marker

  • 'p': Pentagon marker

  • '*': Star marker

  • 'h': Hexagon marker

  • '+': Plus marker

  • 'x': X marker

  • 'D': Diamond marker

You can combine color, marker, and line style into a single string argument in the plot() function. For instance, 'ro-' will give you a red solid line with circle markers.

This provides an incredible amount of flexibility and allows you to customize your plots to fit your exact needs or preferences.
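The format string is shorthand. The same styling can also be spelled out with the color, marker, and linestyle keyword arguments, which some readers find easier to scan. A minimal sketch comparing the two forms:

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# Format-string shorthand: red, circle markers, dashed line.
short_line, = plt.plot(x, y, 'ro--')

# The same styling spelled out with keyword arguments.
long_line, = plt.plot(x, y, color='red', marker='o', linestyle='--')

# Compare the resolved style of the two lines.
same_style = (
    to_rgba(short_line.get_color()) == to_rgba(long_line.get_color())
    and short_line.get_marker() == long_line.get_marker()
    and short_line.get_linestyle() == long_line.get_linestyle()
)
print(same_style)  # True
```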

Multiple Lines in One Plot

Yes, you can plot multiple lines in one go. Just call plot() multiple times before calling show():

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
y2 = [1, 2, 3, 4]
plt.plot(x, y, 'r--', label='y = x^2')
plt.plot(x, y2, 'g-', label='y = x')
plt.legend()

plt.grid()

plt.show()

The legend() function displays the legend on the plot. The legend will show a small sample of the line style (red dashed for y = x^2 and green solid for y = x) next to their respective labels.

The grid() function adds a grid to the graph for better visual understanding.

grid() looks interesting

You can customize grid() in several ways: show only the vertical or only the horizontal grid lines, change the grid color, and change the line style. Let's have a look:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
y2 = [1, 2, 3, 4]
z = [1, 8, 27, 64]
plt.plot(x, y, 'rs--', label='y = x^2')
plt.plot(x, y2, 'g.-', label='y = x')
plt.plot(x, z, 'bo-.', label='y = x^3')
plt.legend()

plt.grid(color = 'green', linestyle = '--', linewidth = 0.5)

plt.show()

In the above code, we set the grid color to green, the line style to '--' (dashed), and the line width to 0.5.
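Showing only one direction of grid lines is done with the axis parameter of grid(). The sketch below keeps just the horizontal lines; the gray dotted styling is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.plot(x, y, 'rs--', label='y = x^2')
plt.legend()

# axis='y' draws only the horizontal grid lines (one per y tick).
# Use axis='x' for vertical lines only, or axis='both' (the default) for a full grid.
plt.grid(True, axis='y', color='gray', linestyle=':', linewidth=0.8)
# Call plt.show() here to display the figure.
```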

Scatter Plot

A scatter plot visualizes individual data points on a two-dimensional plane, using Cartesian coordinates. It's an important tool when you want to observe relationships, patterns, or clusters between two variables.

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 20, 30, 40, 45, 47, 49, 46]

plt.scatter(x, y, color='red')
plt.title('Basic Scatter Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()


Let's dig deeper into customizing the scatter plot.

Customization in Scatter plot

Customization is at the heart of Matplotlib, which is why it is a must-have tool in every data analyst's toolkit. Let's learn more about customizing a scatter plot.

import matplotlib.pyplot as plt

# Data
x1 = [1, 2, 3, 4]
y1 = [10, 20, 30, 40]

x2 = [1.5, 2.5, 3.5, 4.5]
y2 = [15, 25, 35, 45]

# Customizations for the first dataset
sizes1 = [50, 100, 150, 200]
colors1 = [10, 20, 30, 40]  # Values for colormap
plt.scatter(x1, y1, 
            c=colors1, 
            cmap='viridis', 
            s=sizes1, 
            alpha=0.6, 
            marker='o', 
            edgecolor='black', 
            linewidth=1.5, 
            label='Dataset 1')

# Customizations for the second dataset
sizes2 = [60, 110, 160, 210]
colors2 = [15, 25, 35, 45]  # Values for colormap
plt.scatter(x2, y2, 
            c=colors2, 
            cmap='plasma', 
            s=sizes2, 
            alpha=0.8, 
            marker='^', 
            edgecolor='blue', 
            linewidth=2, 
            label='Dataset 2')

# Adding other plot elements
plt.xlabel('X Values')
plt.ylabel('Y Values')
plt.title('Enhanced Scatter Plot')
plt.legend()  # Display legend to differentiate the datasets
plt.colorbar()  # Colorbar for the most recent scatter (Dataset 2, 'plasma')

# Display the plot
plt.show()

Scatter plots are usually not as colorful as the one above; customization normally aims at a simple, streamlined view of the data. The colorful version is only here to demonstrate the options. Let's understand the code:

  1. Dataset:

    • The first dataset (x1, y1) represents four points: (1,10), (2,20), (3,30), and (4,40).

    • The second dataset (x2, y2) represents another set of four points: (1.5,15), (2.5,25), (3.5,35), and (4.5,45).

  2. Customization for the Dataset:

    • c=colors1: This assigns specific colors from the colormap to the markers based on the values in colors1.

    • cmap='viridis': This sets the colormap to 'viridis', which runs from dark purple through blue and green to yellow.

    • s=sizes1: This sets the size of each marker, given as an area in points squared.

    • alpha=0.6: Sets the transparency of the markers. It ranges from 0 (fully transparent) to 1 (fully opaque).

    • marker='o': Specifies the shape of the marker as a circle.

    • edgecolor='black' & linewidth=1.5: Sets the edge color of the markers to black with an edge width of 1.5 points.

    • label='Dataset 1': This is the label for this data in the legend.

  3. Other plot elements

    • xlabel(), ylabel(), and title(): These functions set the labels for the x-axis, y-axis, and the title for the whole plot, respectively.

    • legend(): Displays a legend to differentiate between the two datasets.

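    • colorbar(): Adds a color bar showing the color scale of a scatter plot. Called without arguments, it describes the most recently drawn scatter, which here is Dataset 2 with the 'plasma' colormap.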
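Because plt.colorbar() without arguments describes only one scatter, a figure with two colormaps may need two color bars. One possible approach, sketched below, is to keep the object that scatter() returns and pass it to colorbar(). The colorbar labels are made up for this illustration.

```python
import matplotlib.pyplot as plt

x1, y1 = [1, 2, 3, 4], [10, 20, 30, 40]
x2, y2 = [1.5, 2.5, 3.5, 4.5], [15, 25, 35, 45]

# Keep the object returned by each scatter() call so it can be passed to colorbar().
sc1 = plt.scatter(x1, y1, c=[10, 20, 30, 40], cmap='viridis', label='Dataset 1')
sc2 = plt.scatter(x2, y2, c=[15, 25, 35, 45], cmap='plasma', marker='^', label='Dataset 2')

# One colorbar per dataset, each tied explicitly to its own scatter.
cbar1 = plt.colorbar(sc1, label='Dataset 1 values')
cbar2 = plt.colorbar(sc2, label='Dataset 2 values')

plt.legend()
# Call plt.show() here to display the figure.
```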

As I mentioned, the scatter plot above is completely fictional and not what a scatter plot usually looks like. To get a better idea of how scatter plots look in a real use case, observe the code below and its scatter plot.

import matplotlib.pyplot as plt
import random

# Generate random scores for two classes
class_a_scores = [random.randint(50, 100) for _ in range(20)]
class_b_scores = [random.randint(50, 100) for _ in range(20)]

# Student numbers (for x-axis)
students = list(range(1, 21))

# Scatter plot for Class A scores
plt.scatter(students, class_a_scores, color='blue', label='Class A')

# Scatter plot for Class B scores
plt.scatter(students, class_b_scores, color='red', label='Class B')

# Adding labels and title
plt.xlabel('Student Number')
plt.ylabel('Exam Score')
plt.title('Exam Scores of Class A vs. Class B')
plt.legend()

# Display the plot
plt.show()

In this plot:

  • Each student (from 1 to 20) is plotted on the x-axis.

  • Their respective scores in the exam are plotted on the y-axis.

  • Scores from Class A are shown in blue, while those from Class B are shown in red.

  • A legend differentiates the scores from the two classes.
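Since the scores come from random.randint, the plot changes on every run. If a repeatable figure is wanted, one option is to seed the generator. The sketch below wraps that in a small helper; make_scores and the seed values 1 and 2 are hypothetical names and numbers chosen for this illustration.

```python
import random
import matplotlib.pyplot as plt

def make_scores(seed, n_students=20):
    """Return n_students random exam scores; the same seed gives the same list."""
    rng = random.Random(seed)  # independent generator, leaves the global random state alone
    return [rng.randint(50, 100) for _ in range(n_students)]

students = list(range(1, 21))
class_a_scores = make_scores(seed=1)   # seed values are arbitrary
class_b_scores = make_scores(seed=2)

plt.scatter(students, class_a_scores, color='blue', label='Class A')
plt.scatter(students, class_b_scores, color='red', label='Class B')
plt.legend()
# Call plt.show() here to display the figure.
```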

Histograms

import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)

plt.hist(data, bins=30, edgecolor='black')
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

  • numpy is used for its capability to handle and generate numerical data efficiently.

  • data = np.random.randn(1000): This function is a convenient way to generate an array of 1000 random numbers sampled from a standard normal distribution (mean = 0, standard deviation = 1). Using pure Python to achieve this would be more verbose and slower.

  • plt.hist(data, bins=30, edgecolor='black'): bins=30 divides the entire range of values in data into 30 bins or intervals.

  • edgecolor='black' sets the color of the edges of the bins to black.

More about Histograms

Normalization

We can use the density parameter to normalize the histogram so that the total area of the bars equals 1.

plt.hist(data, bins=30, density=True)
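To see what "area equals 1" means in practice, the sketch below multiplies each bar height by its bin width and adds up the results. The seeded NumPy generator is an assumption made here so the run is repeatable; any data would give the same total.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)          # seeded so the run is repeatable
data = rng.standard_normal(1000)

# With density=True the bar heights are probability densities, not counts.
heights, edges, _ = plt.hist(data, bins=30, density=True, edgecolor='black')

# Area of each bar = height * bin width; the areas add up to 1.
total_area = float(np.sum(heights * np.diff(edges)))
print(round(total_area, 6))  # 1.0
```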

Cumulative Histogram

With the cumulative parameter, you can display the cumulative distribution rather than the frequency.

plt.hist(data, bins=30, cumulative=True)

Setting Histogram Range

The range parameter sets the lower and upper limits of the bins; values outside this range are ignored. For example, to bin only the values between -2 and 2:

plt.hist(data, bins=30, range=(-2, 2))

Logarithmic Scale

For datasets with wide ranges, a logarithmic scale can be beneficial. Use the log parameter for this.

plt.hist(data, bins=30, log=True)

Stacked Histogram

When plotting multiple datasets, you can stack them on top of each other using the stacked parameter.

data1 = np.random.randn(1000)
data2 = np.random.randn(1000)
plt.hist([data1, data2], bins=30, stacked=True)

Histogram Type

The histtype parameter can take values such as 'bar', 'barstacked', 'step', and 'stepfilled' to provide different visualizations.

plt.hist(data, bins=30, histtype='step')

Orientation

Use the orientation parameter to create horizontal histograms.

plt.hist(data, bins=30, orientation='horizontal')

Returning Values

plt.hist() not only plots the histogram but also returns the frequency counts, the bin edges, and patches, which can be useful for further computations.

counts, bins, patches = plt.hist(data, bins=30)
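As one example of such a further computation, the sketch below checks the shape of the returned values and estimates where the data peaks. The bin edges are stored under the name edges here instead of bins, purely for readability.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)          # seeded so the run is repeatable
data = rng.standard_normal(1000)

counts, edges, patches = plt.hist(data, bins=30, edgecolor='black')

# 30 bins are delimited by 31 edges, and every sample lands in exactly one bin.
n_bins = len(counts)
n_edges = len(edges)
total = int(counts.sum())

# Midpoint of the tallest bin: a rough estimate of where the data peaks.
centers = (edges[:-1] + edges[1:]) / 2
peak_center = float(centers[np.argmax(counts)])
print(n_bins, n_edges, total)  # 30 31 1000
```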

Advanced Histogram Customization:

  1. Fill:

    You can control the fill properties using parameters like fc (short for facecolor, the fill color) and ec (short for edgecolor).

     plt.hist(data, bins=30, fc='lightblue', ec='black')
    
  2. Alpha:

    The alpha parameter controls the transparency of the histogram bars. It ranges from 0 (transparent) to 1 (opaque).

     plt.hist(data, bins=30, alpha=0.7)
    
  3. Edge Customization:

    You can use the linewidth and linestyle parameters to customize the histogram's edges.

     plt.hist(data, bins=30, edgecolor='black', linewidth=1.2, linestyle='--')
    

Matplotlib's histogram capabilities are vast and varied. By understanding and utilizing these advanced features, you can create visually appealing and insightful visualizations tailored to your specific needs. Dive in, experiment, and extract deeper insights from your data!

Subplots

Subplotting allows users to have multiple individual plots within a single window, organizing them in a grid-like structure. By utilizing subplots, data scientists and analysts can view multiple visualizations for more effective comparative analysis. Whether you are aiming to contrast similar data across different scenarios, display progression through time, or simply showcase various data points, subplots provide a compact means to present your findings. With simple commands, users can specify the number of rows and columns in the grid, enabling precise control over the layout. Furthermore, each subplot retains its individual properties, like titles, x-axis, and y-axis labels, ensuring clarity in representation. Let's dive deeper into the concept.

Observe the following code where you can plot two charts in a single instance.

import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 1)  # 2 rows, 1 column

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
y2 = [1, 8, 27, 64]

# First subplot
axs[0].plot(x, y)
axs[0].set_title('y = x^2')

# Second subplot
axs[1].plot(x, y2, 'r--')  # 'r--' means red color with dashed line style
axs[1].set_title('y = x^3')

plt.tight_layout()  # Adjust the space between plots
plt.show()

Let's decode it:

  • Creating a Figure and Axes:

      fig, axs = plt.subplots(2, 1)  # 2 rows, 1 column
    

    This line creates a figure (fig) and a 2x1 grid of subplots (axs). In simpler terms, the plotting area is divided into 2 rows and 1 column, giving you two separate plots (or subplots) vertically stacked on top of each other. axs is an array that allows you to access each of these plots.

  • Defining the Data:

      x = [1, 2, 3, 4]
      y = [1, 4, 9, 16]
      y2 = [1, 8, 27, 64]
    

    Here, you have defined the x-values and y-values for two different functions:

    • (y = x^2)

    • (y = x^3)

  • First Subplot:

      axs[0].plot(x, y)
      axs[0].set_title('y = x^2')
    

    These lines plot the first function (y = x^2) on the first subplot (top plot). axs[0] accesses the first subplot. The set_title function sets the title for this subplot.

  • Second Subplot:

      axs[1].plot(x, y2, 'r--')  # 'r--' means red color with dashed line style
      axs[1].set_title('y = x^3')
    

    These lines plot the second function (y = x^3) on the second subplot (bottom plot). Here, the line is styled as a red dashed line using the 'r--' style string. axs[1] accesses the second subplot, and its title is set accordingly.

  • Layout Adjustment:

      plt.tight_layout()  # Adjust the space between plots
    

    This function adjusts the space between the subplots to make sure they don't overlap and are displayed in a neat manner.

  • In summary, the code creates a figure with two vertically stacked subplots. The top subplot shows a plot of the function (y = x^2) and the bottom one shows a plot of (y = x^3) in a red dashed line style.
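The same idea extends to grids with more than one column. In the sketch below, plt.subplots(2, 2) returns a 2x2 array of axes, indexed as axs[row, column]. The fourth curve, y = 2^x, is added here only to fill the grid and is not part of the example above.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
# Four simple functions of x, one per subplot.
curves = {
    'y = x': [v for v in x],
    'y = x^2': [v ** 2 for v in x],
    'y = x^3': [v ** 3 for v in x],
    'y = 2^x': [2 ** v for v in x],
}

fig, axs = plt.subplots(2, 2)  # 2 rows, 2 columns, so axs is a 2x2 array

# axs.flat walks the grid row by row, so it pairs naturally with the dict items.
for ax, (title, y) in zip(axs.flat, curves.items()):
    ax.plot(x, y)
    ax.set_title(title)

fig.tight_layout()
# Call plt.show() here to display the figure.
```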

Bar Plot

A bar plot or bar chart represents categorical data using rectangular bars with heights or lengths proportional to the values they represent. Bar plots are typically used to compare one measure across several categories.

Now, let's start by plotting a basic bar chart.

import matplotlib.pyplot as plt

categories = ['A', 'B', 'C', 'D']
values = [50, 30, 70, 40]

plt.bar(categories, values)
plt.show()

This piece of code will render a bar plot with categories 'A', 'B', 'C', and 'D', each having respective values.

Customizing the Bar Plot

Colors:

You can modify the color of the bars using the color parameter.

plt.bar(categories, values, color=['red', 'green', 'blue', 'cyan'])

Bar Width:

The width parameter adjusts the width of the bars. The default value is 0.8.

plt.bar(categories, values, width=0.5)

Aligning Bars:

With the align parameter, you can align bars to either the center (default) or the edge.

plt.bar(categories, values, align='edge')

More about Bar Plots

Horizontal Bar Plots:

If vertical bars are not your thing, Matplotlib supports horizontal bars with barh().

plt.barh(categories, values, color='skyblue')

Stacked Bar Plots:

For datasets that need comparative analysis across multiple categories, stacked bar plots are an excellent choice. Here's a simple example:

import matplotlib.pyplot as plt
categories = ['A', 'B', 'C', 'D']
values1 = [50, 30, 70, 40]
values2 = [60, 20, 50, 30]

plt.bar(categories, values1, label='Group 1', color='blue')
plt.bar(categories, values2, label='Group 2', color='cyan', bottom=values1)

plt.legend()
plt.show()

The bottom parameter tells where the bar should start, thus stacking values2 on top of values1.
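With more than two groups, each new layer has to start at the running total of the layers beneath it. A minimal sketch, assuming NumPy arrays so the totals can be added element by element; values3 is a hypothetical third group invented for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

categories = ['A', 'B', 'C', 'D']
values1 = np.array([50, 30, 70, 40])
values2 = np.array([60, 20, 50, 30])
values3 = np.array([10, 25, 15, 20])   # hypothetical third group

plt.bar(categories, values1, label='Group 1', color='blue')
plt.bar(categories, values2, label='Group 2', color='cyan', bottom=values1)

# Each new layer starts where the previous layers end, so its bottom is their running sum.
bars3 = plt.bar(categories, values3, label='Group 3', color='orange',
                bottom=values1 + values2)
plt.legend()

# Top edge of each finished stack = bottom of the last layer + its height.
tops = [float(bar.get_y() + bar.get_height()) for bar in bars3]
print(tops)  # [120.0, 75.0, 135.0, 90.0]
```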

Grouped Bar Plots:

To draw multiple datasets side-by-side, you can adjust the position and width of bars.

import matplotlib.pyplot as plt
import numpy as np

labels = ['A', 'B', 'C', 'D']
values1 = [50, 30, 70, 40]
values2 = [30, 60, 40, 60]

barWidth = 0.35
r1 = np.arange(len(values1))       # x positions of the Group 1 bars
r2 = [x + barWidth for x in r1]    # Group 2 bars sit one bar width to the right

plt.bar(r1, values1, color='blue', width=barWidth, label='Group 1')
plt.bar(r2, values2, color='red', width=barWidth, label='Group 2')

plt.xlabel('Categories', fontweight='bold')
# Put each tick label midway between the two bars of its group
plt.xticks([r + barWidth / 2 for r in range(len(values1))], labels)

plt.legend()
plt.show()

Error Bars:

Error bars provide a graphical representation of the variability of data and are used on graphs to indicate the error or uncertainty in a reported measurement.

import matplotlib.pyplot as plt
import numpy as np

# Sample data: Categories and their values
categories = ['A', 'B', 'C', 'D']
values = [10, 15, 20, 25]

# Simulated standard errors for each bar (can be any measure of variability)
errors = [0.5, 1.2, 1.5, 2.0]

# Create bar plot with error bars
plt.bar(categories, values, yerr=errors, align='center', alpha=0.7, ecolor='black', capsize=10, color='skyblue')

plt.ylabel('Values')
plt.title('Bar plot with error bars')
plt.show()

In this code:

  • yerr specifies the vertical error for the bars.

  • ecolor sets the color of the error bars.

  • capsize specifies the size of the caps at the end of the error bars.

  • alpha gives the bars a slight transparency.
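In real work the error values usually come from the data itself. The sketch below shows one common choice, the standard error of the mean, computed from repeated measurements. The samples dictionary is hypothetical data invented for this illustration, picked so the means match the bar heights used above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical repeated measurements for each category (illustrative numbers only).
samples = {
    'A': [9, 10, 11, 10],
    'B': [13, 15, 17, 15],
    'C': [18, 20, 22, 20],
    'D': [22, 25, 28, 25],
}

categories = list(samples)
means = [float(np.mean(v)) for v in samples.values()]
# Standard error of the mean = sample standard deviation / sqrt(n).
errors = [float(np.std(v, ddof=1) / np.sqrt(len(v))) for v in samples.values()]

plt.bar(categories, means, yerr=errors, capsize=10, color='skyblue', ecolor='black')
plt.ylabel('Values')
# Call plt.show() here to display the figure.
```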

Difference between Bar plot and Histograms

Bar plots and histograms are both graphical displays of data that utilize bars, but they serve different purposes and represent data in distinct ways. Here are the main differences between the two:

  1. Type of Data:

    • Bar Plot: Represents categorical data. Each bar represents a category and is used to compare different sets of data.

    • Histogram: Represents numerical data. It shows the distribution of a continuous variable, divided into bins or intervals.

  2. Axes:

    • Bar Plot: The x-axis represents different categories, while the y-axis shows the value (count, percentage, etc.) for each category.

    • Histogram: The x-axis represents the bins (ranges of values) while the y-axis represents the frequency or count of observations within each bin.

  3. Order of Bars:

    • Bar Plot: The bars can be arranged in any order, depending on the categorical data.

    • Histogram: The bars are always in ascending order of the variable value.

  4. Spacing:

    • Bar Plot: There are spaces between bars, unless they are grouped or stacked bars.

    • Histogram: Consecutive bars are adjacent with no space in between because it's a continuous scale.

  5. Purpose:

    • Bar Plot: Used for comparing different groups or to show data changes over time for a discrete variable.

    • Histogram: Used to visualize the underlying frequency distribution of a continuous set of data, helping to determine the shape and spread of continuous data.

  6. Appearance:

    • Bar Plot: Typically, each bar is of uniform width, with the height of each bar corresponding to the data value.

    • Histogram: The area of each bar represents the frequency of data values in the respective bin, meaning bars can have different widths if bins have different size ranges.

While both bar plots and histograms use bars to represent data, they're best suited for different types of data and questions. Bar plots are great for categorical data, allowing for direct comparisons between categories. Histograms provide insight into the distribution and spread of continuous data.
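The contrast can be sketched side by side. In the example below, the fruits and heights_cm lists and the bin edges are hypothetical values made up for illustration: the bar plot counts how often each category occurs, while the histogram counts how many numeric values fall into each bin.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical data: one categorical variable and one continuous variable.
fruits = ['apple', 'banana', 'apple', 'cherry', 'banana', 'apple']
heights_cm = [150.5, 162.0, 171.3, 158.8, 165.2, 180.1, 169.9, 174.4]

fig, (ax_bar, ax_hist) = plt.subplots(1, 2)

# Bar plot: one bar per category, height = how often the category occurs.
fruit_counts = Counter(fruits)
ax_bar.bar(list(fruit_counts.keys()), list(fruit_counts.values()))
ax_bar.set_title('Bar plot (categories)')

# Histogram: the numeric range is cut into bins, height = samples per bin.
bin_edges = [150, 160, 170, 180, 190]
counts, edges, _ = ax_hist.hist(heights_cm, bins=bin_edges, edgecolor='black')
ax_hist.set_title('Histogram (binned values)')

fig.tight_layout()
# Call plt.show() here to display the figure.
```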

Conclusion

Matplotlib stands as a cornerstone in the world of Python data visualization. Throughout our discussion, we delved deep into its varied functionalities, from basic line plotting to the detailing of scatter plots and histograms. We explored the customizability of the plot() function, enabling a variety of visual styles. We discovered the power of scatter plots, and their use for visualizing individual data points and their extensive customizable features, including marker sizes, transparency, and color mapping. Our journey also led us to histograms, which offer an insightful glimpse into the distribution and frequency of continuous data sets. Furthermore, we touched upon the utility of bar plots and their distinct nature compared to histograms. The subplotting feature of Matplotlib is an essential tool for crafting multiple plots in a single window, which is perfect for comparative data visualization. Matplotlib's versatility and expansive features make it an indispensable tool for anyone seeking to convey complex data in a visually appealing manner.
