Understanding Matplotlib: A Comprehensive Guide for Absolute Beginners
Matplotlib is a widely used Python library for data visualization. Whether you're creating basic plots or advanced visualizations, Matplotlib is an important tool for any data analyst or researcher. Let's get started.
Introduction to Matplotlib
Have you heard of Matplotlib? It's a cool Python tool that lets you embed all kinds of visual graphs into your apps. With it, you can create things like plots, histograms, bar charts, and even some advanced stuff like power spectra and error charts. It's like the ultimate ninjutsu of data visualization!
The brains behind this is John D. Hunter. And the best part is that it's all open source, which means we can use it without spending a dime. While it's mainly written in Python, there are bits in C, Objective-C, and even JavaScript to make sure it plays well with other platforms.
Installing Matplotlib
You can install Matplotlib using pip:
pip install matplotlib
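To confirm that the installation worked, one quick check is to import the package and print its version. This is only a sanity check; the version string you see depends on your environment.

```python
import matplotlib

# Print the installed version; the exact value depends on your environment.
print(matplotlib.__version__)
```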
As always, let's learn more by doing.
Line Plot
A basic plot can be created using pyplot, a module in Matplotlib.
import matplotlib.pyplot as plt
x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]
plt.plot(x, y)
plt.title('Basic Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
This code will produce a simple line plot of y = x^2. When you run the code, you will get a line plot with the curve of y = x^2, labelled axes, and a title. The plot visualizes how y values increase as x values are squared.
Let's break down the code:
import matplotlib.pyplot as plt: This imports the pyplot module from the matplotlib library and renames it to plt for easier reference.
x = [0, 1, 2, 3, 4] and y = [0, 1, 4, 9, 16]: Two lists, x and y, are defined. They represent the x-coordinates and y-coordinates of points, respectively. Essentially, the pairs (0,0), (1,1), (2,4), (3,9), and (4,16) will be plotted. These points represent the curve y = x^2.
plt.plot(x, y): This function plots the x and y values. It will connect the points with a line, producing a line graph.
plt.title('Basic Line Plot'): This sets the title of the plot to "Basic Line Plot".
plt.xlabel('X-axis') and plt.ylabel('Y-axis'): These functions label the x-axis and y-axis of the plot, respectively.
plt.show(): This displays the figure. It's telling the program to render and show the plot with all the defined settings and data.
We used a very important function here: plot(). Let's understand this in more detail.
Matplotlib's plot() Function
plot() is a function in Matplotlib's pyplot module that allows you to create line plots. Line plots are ideal for visualizing data points in a sequence or showing how a value changes over time.
Basic Usage
Using plot() is straightforward. In its most common form, it takes two lists: one for the x-axis and one for the y-axis:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
plt.plot(x, y)
plt.show()
This code will plot the curve y = x^2.
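plot() also accepts a single sequence. In that case the values are treated as y, and Matplotlib fills in 0, 1, 2, ... as the x positions. The sketch below inspects the line object that plot() returns to show this; the Agg backend line is an assumption made so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line and call plt.show() to view the figure
import matplotlib.pyplot as plt

y = [1, 4, 9, 16]

# With a single sequence, plot() treats it as y and generates x = 0, 1, 2, ...
(line,) = plt.plot(y)

print(list(line.get_xdata()))  # the x positions Matplotlib generated
print(list(line.get_ydata()))  # the y values we passed in
```

Note that the x values now start at 0 rather than 1, so this is not the same picture as the two-list example above.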
Customizing the Line
The beauty of plot() lies in its customizability. Want a red dashed line with circle markers? No problem!
plt.plot(x, y, 'ro--')
Here, 'r' stands for red color, 'o' stands for circle marker, and '--' specifies a dashed line.
The plot() function in Matplotlib provides a wide array of customization options for both color and line type. Let's break them down:
Colors:
The following are some of the basic color abbreviations:
'b': Blue
'g': Green
'r': Red
'c': Cyan
'm': Magenta
'y': Yellow
'k': Black
'w': White
You can also specify colors in many other ways, like full names ('green'), hex strings ('#FFDD44'), RGB tuples ((1,0,0) for red), and more.
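As a small illustration of these options, the sketch below draws four lines, each colored with a different kind of specification, and uses to_hex() from matplotlib.colors to show that the abbreviation, the full name, and the RGB tuple all describe the same red. The y values are made up for the example, and the Agg backend line is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line and call plt.show() to view the figure
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex

x = [1, 2, 3, 4]

# Four ways of asking for a color; the first three all describe pure red.
plt.plot(x, [1, 2, 3, 4], color='r')         # single-letter abbreviation
plt.plot(x, [2, 3, 4, 5], color='red')       # full name
plt.plot(x, [3, 4, 5, 6], color=(1, 0, 0))   # RGB tuple with values between 0 and 1
plt.plot(x, [4, 5, 6, 7], color='#FFDD44')   # hex string (a yellow tone)

# to_hex() converts any valid color specification into a hex string
print(to_hex('r'), to_hex('red'), to_hex((1, 0, 0)))
```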
Line Styles:
Here are the basic line styles you can use:
'-': Solid line (default)
'--': Dashed line
'-.': Dash-dot line
':': Dotted line
Markers:
In addition to line styles, you can also specify markers to denote points. Here are some of the common ones:
'.': Point marker
'o': Circle marker
's': Square marker
'^': Upward-pointing triangle marker
'v': Downward-pointing triangle marker
'<': Left-pointing triangle marker
'>': Right-pointing triangle marker
'p': Pentagon marker
'*': Star marker
'h': Hexagon marker
'+': Plus marker
'x': X marker
'D': Diamond marker
You can combine color, marker, and line style into a single string argument in the plot() function. For instance, 'ro-' will give you a red solid line with circle markers.
This provides an incredible amount of flexibility and allows you to customize your plots to fit your exact needs or preferences.
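The combined format string is a shorthand. The same styling can be spelled out with the color, marker, and linestyle keyword arguments, which some people find easier to read. Here is a minimal sketch comparing the two; the second line is shifted up by 2 only so that both are visible, and the Agg backend line is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line and call plt.show() to view the figure
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# Compact format string: red, circle markers, dashed line
(short,) = plt.plot(x, y, 'ro--')

# The same styling with explicit keyword arguments (shifted up so both lines are visible)
(long_form,) = plt.plot(x, [v + 2 for v in y], color='red', marker='o', linestyle='--')

# Both lines end up with the same marker and line style
print(short.get_marker(), short.get_linestyle())
print(long_form.get_marker(), long_form.get_linestyle())
```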
Multiple Lines in One Plot
Yes, you can plot multiple lines in one go. Just call plot() multiple times before calling show():
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
y2 = [1, 2, 3, 4]
plt.plot(x, y, 'r--', label='y = x^2')
plt.plot(x, y2, 'g-', label='y = x')
plt.legend()
plt.grid()
plt.show()
The legend() function displays the legend on the plot. The legend will show a small sample of the line style (red dashed for y = x^2 and green solid for y = x) next to their respective labels.
The grid() function adds a grid to the graph for better visual understanding.
grid() looks interesting
You can customize the grid() function in many ways. You can choose to show just the vertical or just the horizontal lines of the grid. You can also customize the color and the linestyle of the grid. That's a lot of customization. Let's have a look:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
y2 = [1, 2, 3, 4]
z = [1, 8, 27, 64]
plt.plot(x, y, 'rs--', label='y = x^2')
plt.plot(x, y2, 'g.-', label='y = x')
plt.plot(x, z, 'bo-.', label='y = x^3')
plt.legend()
plt.grid(color = 'green', linestyle = '--', linewidth = 0.5)
plt.show()
In the above code, we have set the grid color to green and the linestyle to '--' (dashed lines). We also set the linewidth to 0.5.
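To show only the horizontal or only the vertical grid lines, grid() takes an axis argument ('x', 'y', or 'both'). The sketch below keeps just the horizontal lines; the gray dotted styling is an arbitrary choice for the example, and the Agg backend line is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line and call plt.show() to view the figure
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.plot(x, y, 'rs--', label='y = x^2')

# axis='y' draws only the horizontal grid lines; use axis='x' for the vertical ones
plt.grid(axis='y', color='gray', linestyle=':', linewidth=0.8)
plt.legend()
```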
Scatter Plot
A scatter plot visualizes individual data points on a two-dimensional plane, using Cartesian coordinates. It's an important tool when you want to observe relationships, patterns, or clusters between two variables.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 20, 30, 40, 45, 47, 49, 46]
plt.scatter(x, y, color='red')
plt.title('Basic Scatter Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
Let's dig deeper into customizing the scatter plot.
Customization in Scatter plot
Customization is at the heart of Matplotlib, and that's why it is a must-have tool in every data analyst's toolkit. Let's learn more about customization in scatter plots.
import matplotlib.pyplot as plt
# Data
x1 = [1, 2, 3, 4]
y1 = [10, 20, 30, 40]
x2 = [1.5, 2.5, 3.5, 4.5]
y2 = [15, 25, 35, 45]
# Customizations for the first dataset
sizes1 = [50, 100, 150, 200]
colors1 = [10, 20, 30, 40] # Values for colormap
plt.scatter(x1, y1,
            c=colors1,
            cmap='viridis',
            s=sizes1,
            alpha=0.6,
            marker='o',
            edgecolor='black',
            linewidth=1.5,
            label='Dataset 1')
# Customizations for the second dataset
sizes2 = [60, 110, 160, 210]
colors2 = [15, 25, 35, 45] # Values for colormap
plt.scatter(x2, y2,
            c=colors2,
            cmap='plasma',
            s=sizes2,
            alpha=0.8,
            marker='^',
            edgecolor='blue',
            linewidth=2,
            label='Dataset 2')
# Adding other plot elements
plt.xlabel('X Values')
plt.ylabel('Y Values')
plt.title('Enhanced Scatter Plot')
plt.legend() # Display legend to differentiate the datasets
plt.colorbar() # Display colorbar for the colormaps
# Display the plot
plt.show()
Scatter plots are generally not as colorful as the one above; the purpose of customization is usually to make the data simpler and clearer to read. The colorful version is shown here only to demonstrate the options. Let's walk through the code:
Dataset:
The first dataset (x1, y1) represents four points: (1,10), (2,20), (3,30), and (4,40).
The second dataset (x2, y2) represents another set of four points: (1.5,15), (2.5,25), (3.5,35), and (4.5,45).
Customization for the Dataset:
c=colors1: This assigns colors from the colormap to the markers based on the values in colors1.
cmap='viridis': This sets the colormap to 'viridis', which runs from dark purple through blue and green to yellow.
s=sizes1: This determines the size of each marker.
alpha=0.6: Sets the transparency of the markers. It ranges from 0 (fully transparent) to 1 (fully opaque).
marker='o': Specifies the shape of the marker as a circle.
edgecolor='black' and linewidth=1.5: Sets the edge color of the markers to black with a width of 1.5 units.
label='Dataset 1': This is the label for this data in the legend.
Other plot elements:
xlabel(), ylabel(), and title(): These functions set the labels for the x-axis, the y-axis, and the title for the whole plot, respectively.
legend(): Displays a legend to differentiate between the two datasets.
colorbar(): Adds a color bar to the plot, which shows the color scale used in the scatter plot. Called without arguments, it describes the most recently drawn scatter (here, Dataset 2).
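Because plt.colorbar() without arguments only describes the most recent scatter, one way to show a scale for each dataset is to keep the object that scatter() returns and pass it to colorbar() explicitly. The sketch below does this through the figure and axes objects; the colorbar labels are made up for the example, and the Agg backend line is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line and call plt.show() to view the figure
import matplotlib.pyplot as plt

x1, y1 = [1, 2, 3, 4], [10, 20, 30, 40]
x2, y2 = [1.5, 2.5, 3.5, 4.5], [15, 25, 35, 45]

fig, ax = plt.subplots()

# Keep the object each scatter() call returns so a colorbar can be tied to it
sc1 = ax.scatter(x1, y1, c=[10, 20, 30, 40], cmap='viridis', label='Dataset 1')
sc2 = ax.scatter(x2, y2, c=[15, 25, 35, 45], cmap='plasma', marker='^', label='Dataset 2')

# One colorbar per dataset, each bound explicitly to its own scatter
cbar1 = fig.colorbar(sc1, ax=ax, label='Dataset 1 values')
cbar2 = fig.colorbar(sc2, ax=ax, label='Dataset 2 values')

ax.legend()
```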
As I mentioned, the scatter plot above is purely illustrative and is not how a scatter plot generally looks. To get a better idea of how scatter plots look in a real use case, observe the code below and its scatter plot.
import matplotlib.pyplot as plt
import random
# Generate random scores for two classes
class_a_scores = [random.randint(50, 100) for _ in range(20)]
class_b_scores = [random.randint(50, 100) for _ in range(20)]
# Student numbers (for x-axis)
students = list(range(1, 21))
# Scatter plot for Class A scores
plt.scatter(students, class_a_scores, color='blue', label='Class A')
# Scatter plot for Class B scores
plt.scatter(students, class_b_scores, color='red', label='Class B')
# Adding labels and title
plt.xlabel('Student Number')
plt.ylabel('Exam Score')
plt.title('Exam Scores of Class A vs. Class B')
plt.legend()
# Display the plot
plt.show()
In this plot:
Each student (from 1 to 20) is plotted on the x-axis.
Their respective scores in the exam are plotted on the y-axis.
Scores from Class A are shown in blue, while those from Class B are shown in red.
A legend differentiates the scores from the two classes.
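Since the scores are random, the plot changes on every run. If you want the same picture each time, one option is to draw the scores from a seeded generator. The helper make_scores and the seed values below are hypothetical, introduced only for this sketch.

```python
import random

def make_scores(seed, n=20):
    """Return n random exam scores between 50 and 100 from a seeded generator."""
    rng = random.Random(seed)  # the seed value is an arbitrary choice
    return [rng.randint(50, 100) for _ in range(n)]

class_a_scores = make_scores(seed=1)
class_b_scores = make_scores(seed=2)

# The same seed always reproduces the same scores
print(make_scores(seed=1) == class_a_scores)  # True
```

These lists can be passed to plt.scatter() exactly like the unseeded ones in the code above.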
Histograms
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randn(1000)
plt.hist(data, bins=30, edgecolor='black')
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
numpy is used for its capability to handle and generate numerical data efficiently.
data = np.random.randn(1000): This function is a convenient way to generate an array of 1000 random numbers sampled from a standard normal distribution (mean = 0, standard deviation = 1). Using pure Python to achieve this would be more verbose and slower.
plt.hist(data, bins=30, edgecolor='black'): bins=30 divides the entire range of values in data into 30 bins or intervals. edgecolor='black' sets the color of the edges of the bars to black.
More about Histograms
Normalization
We can use the density parameter to normalize the histogram such that the total area under the histogram will sum up to 1.
plt.hist(data, bins=30, density=True)
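A quick way to convince yourself of the "area sums to 1" claim is to multiply each bar's height by its bin width and add the results. The sketch below does that with the values plt.hist() returns; the Agg backend line is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line and call plt.show() to view the figure
import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(1000)

# With density=True the returned heights are densities, not raw counts
heights, edges, _ = plt.hist(data, bins=30, density=True)

# Area of each bar = height * bin width; together the areas add up to 1
area = np.sum(heights * np.diff(edges))
print(round(area, 6))
```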
Cumulative Histogram
With the cumulative parameter, you can display the cumulative distribution rather than the frequency.
plt.hist(data, bins=30, cumulative=True)
Setting Histogram Range
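The range parameter allows you to set the minimum and maximum range for the bins. Here, min_value and max_value stand for the bounds you choose; data points outside them are ignored.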
plt.hist(data, bins=30, range=(min_value, max_value))
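For a concrete version, the sketch below picks the bounds -2 and 2 (an arbitrary choice for this example) and checks that the histogram only counts the points that fall inside that interval. The Agg backend line is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line and call plt.show() to view the figure
import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(1000)

# Only values between -2 and 2 are binned; the bounds are an arbitrary choice
counts, edges, _ = plt.hist(data, bins=30, range=(-2, 2))

# Number of data points that actually lie inside the chosen range
inside = np.sum((data >= -2) & (data <= 2))

print(int(counts.sum()), int(inside))  # the two numbers match
print(edges[0], edges[-1])             # first and last bin edge
```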
Logarithmic Scale
For datasets with wide ranges, a logarithmic scale can be beneficial. Use the log parameter for this.
plt.hist(data, bins=30, log=True)
Stacked Histogram
When plotting multiple datasets, you can stack them on top of each other using the stacked parameter.
data1 = np.random.randn(1000)
data2 = np.random.randn(1000)
plt.hist([data1, data2], bins=30, stacked=True)
Histogram Type
The histtype parameter can take values such as 'bar', 'barstacked', 'step', and 'stepfilled' to provide different visualizations.
plt.hist(data, bins=30, histtype='step')
Orientation
Use the orientation parameter to create horizontal histograms.
plt.hist(data, bins=30, orientation='horizontal')
Returning Values
plt.hist() not only plots the histogram but also returns the frequency counts, the bin edges, and patches, which can be useful for further computations.
counts, bins, patches = plt.hist(data, bins=30)
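As an example of such a further computation, the sketch below uses the returned values to find the most populated bin. It assumes the same kind of random data as above; the Agg backend line is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line and call plt.show() to view the figure
import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(1000)
counts, bins, patches = plt.hist(data, bins=30)

# 30 bins give 30 counts, 31 bin edges, and 30 bar patches
print(len(counts), len(bins), len(patches))

# Locate the bin that holds the most data points
tallest = int(np.argmax(counts))
print(f"Most populated bin: {bins[tallest]:.2f} to {bins[tallest + 1]:.2f}")

# Every data point lands in exactly one bin
print(int(counts.sum()))
```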
Advanced Histogram Customization:
Fill:
You can control the fill properties using parameters like fc (face color, the fill of the bars) and ec (edge color).
plt.hist(data, bins=30, fc='lightblue', ec='black')
Alpha:
The alpha parameter controls the transparency of the histogram bars. It ranges from 0 (transparent) to 1 (opaque).
plt.hist(data, bins=30, alpha=0.7)
Edge Customization:
You can use the linewidth and linestyle parameters to customize the histogram's edges.
plt.hist(data, bins=30, edgecolor='black', linewidth=1.2, linestyle='--')
Matplotlib's histogram capabilities are vast and varied. By understanding and utilizing these advanced features, you can create visually appealing and insightful visualizations tailored to your specific needs. Dive in, experiment, and extract deeper insights from your data!
Subplots
Subplotting allows users to have multiple individual plots within a single window, organizing them in a grid-like structure. By utilizing subplots, data scientists and analysts can view multiple visualizations for more effective comparative analysis. Whether you are aiming to contrast similar data across different scenarios, display progression through time, or simply showcase various data points, subplots provide a compact means to present your findings. With simple commands, users can specify the number of rows and columns in the grid, enabling precise control over the layout. Furthermore, each subplot retains its individual properties, like titles, x-axis, and y-axis labels, ensuring clarity in representation. Let's dive deeper into the concept.
Observe the following code where you can plot two charts in a single instance.
import matplotlib.pyplot as plt
fig, axs = plt.subplots(2, 1) # 2 rows, 1 column
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
y2 = [1, 8, 27, 64]
# First subplot
axs[0].plot(x, y)
axs[0].set_title('y = x^2')
# Second subplot
axs[1].plot(x, y2, 'r--') # 'r--' means red color with dashed line style
axs[1].set_title('y = x^3')
plt.tight_layout() # Adjust the space between plots
plt.show()
Let's decode it:
Creating a Figure and Axes:
fig, axs = plt.subplots(2, 1) # 2 rows, 1 column
This line creates a figure (fig) and a 2x1 grid of subplots (axs). In simpler terms, the plotting area is divided into 2 rows and 1 column, giving you two separate plots (or subplots) vertically stacked on top of each other. axs is an array that allows you to access each of these plots.
Defining the Data:
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
y2 = [1, 8, 27, 64]
Here, you have defined the x-values and y-values for two different functions: y = x^2 and y = x^3.
First Subplot:
axs[0].plot(x, y)
axs[0].set_title('y = x^2')
These lines plot the first function (y = x^2) on the first subplot (top plot). axs[0] accesses the first subplot. The set_title function sets the title for this subplot.
Second Subplot:
axs[1].plot(x, y2, 'r--') # 'r--' means red color with dashed line style
axs[1].set_title('y = x^3')
These lines plot the second function (y = x^3) on the second subplot (bottom plot). Here, the line is styled as a red dashed line using the 'r--' style string. axs[1] accesses the second subplot, and its title is set accordingly.
Layout Adjustment:
plt.tight_layout() # Adjust the space between plots
This function adjusts the space between the subplots to make sure they don't overlap and are displayed in a neat manner.
In summary, the code creates a figure with two vertically stacked subplots. The top subplot shows a plot of the function y = x^2, and the bottom one shows a plot of y = x^3 in a red dashed line style.
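The same idea extends to grids with more than one column. With 2 rows and 2 columns, axs becomes a two-dimensional array indexed as axs[row, column]. The sketch below fills a 2x2 grid with the first four powers of x; the choice of functions is just for illustration, and the Agg backend line is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line and call plt.show() to view the figure
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
powers = [1, 2, 3, 4]

fig, axs = plt.subplots(2, 2)  # axs is a 2x2 array: axs[row, column]

# axs.flat walks the grid row by row, so each power gets its own subplot
for ax, p in zip(axs.flat, powers):
    ax.plot(x, [v ** p for v in x])
    ax.set_title(f'y = x^{p}')

fig.tight_layout()  # keep titles and tick labels from overlapping
```

Here axs[0, 0] is the top-left plot and axs[1, 1] is the bottom-right one.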
Bar Plot
A bar plot or bar chart is a chart or graph that represents categorical data using rectangular bars with heights or lengths proportional to the values they represent. Typically, bar plots are used to compare a single category of data between individual sub-items.
Now, let's start by plotting a basic bar chart.
import matplotlib.pyplot as plt
categories = ['A', 'B', 'C', 'D']
values = [50, 30, 70, 40]
plt.bar(categories, values)
plt.show()
This piece of code will render a bar plot with categories 'A', 'B', 'C', and 'D', each having respective values.
Customizing the Bar Plot
Colors:
You can modify the color of the bars using the color parameter.
plt.bar(categories, values, color=['red', 'green', 'blue', 'cyan'])
Bar Width:
The width parameter adjusts the width of the bars. The default value is 0.8.
plt.bar(categories, values, width=0.5)
Aligning Bars:
With the align parameter, you can align bars to either the center (default) or the edge.
plt.bar(categories, values, align='edge')
More about Bar Plots
Horizontal Bar Plots:
If vertical bars are not your thing, Matplotlib supports horizontal bars with barh().
plt.barh(categories, values, color='skyblue')
Stacked Bar Plots:
For datasets that need comparative analysis across multiple categories, stacked bar plots are an excellent choice. Here's a simple example:
import matplotlib.pyplot as plt
categories = ['A', 'B', 'C', 'D']
values1 = [50, 30, 70, 40]
values2 = [60, 20, 50, 30]
plt.bar(categories, values1, label='Group 1', color='blue')
plt.bar(categories, values2, label='Group 2', color='cyan', bottom=values1)
plt.legend()
plt.show()
The bottom parameter tells where the bar should start, thus stacking values2 on top of values1.
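With more than two groups, each new layer has to start where the previous layers end. One way to handle this is to keep a running total and pass it as bottom. In the sketch below, values3 is a hypothetical third group added for illustration, and the Agg backend line is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line and call plt.show() to view the figure
import matplotlib.pyplot as plt
import numpy as np

categories = ['A', 'B', 'C', 'D']
values1 = np.array([50, 30, 70, 40])
values2 = np.array([60, 20, 50, 30])
values3 = np.array([10, 25, 15, 20])  # hypothetical third group

# Running total of the bar heights drawn so far, one entry per category
bottom = np.zeros(len(categories))

for values, label in [(values1, 'Group 1'), (values2, 'Group 2'), (values3, 'Group 3')]:
    plt.bar(categories, values, bottom=bottom, label=label)
    bottom = bottom + values  # the next layer starts on top of this one

plt.legend()
```

Using NumPy arrays here makes the running total a simple element-wise addition.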
Grouped Bar Plots:
To draw multiple datasets side-by-side, you can adjust the position and width of bars.
import numpy as np
import matplotlib.pyplot as plt
labels = ['A', 'B', 'C', 'D']
values1 = [50, 30, 70, 40]
values2 = [30, 60, 40, 60]
barWidth = 0.35
# x positions: Group 1 at 0, 1, 2, 3 and Group 2 shifted right by one bar width
r1 = np.arange(len(values1))
r2 = r1 + barWidth
plt.bar(r1, values1, color='blue', width=barWidth, label='Group 1')
plt.bar(r2, values2, color='red', width=barWidth, label='Group 2')
plt.xlabel('categories', fontweight='bold')
# Place each tick midway between the two bars of its group
plt.xticks(r1 + barWidth / 2, labels)
plt.legend()
plt.show()
Error Bars:
Error bars provide a graphical representation of the variability of data and are used on graphs to indicate the error or uncertainty in a reported measurement.
import matplotlib.pyplot as plt
import numpy as np
# Sample data: Categories and their values
categories = ['A', 'B', 'C', 'D']
values = [10, 15, 20, 25]
# Simulated standard errors for each bar (can be any measure of variability)
errors = [0.5, 1.2, 1.5, 2.0]
# Create bar plot with error bars
plt.bar(categories, values, yerr=errors, align='center', alpha=0.7, ecolor='black', capsize=10, color='skyblue')
plt.ylabel('Values')
plt.title('Bar plot with error bars')
plt.show()
In this code:
yerr specifies the vertical error for the bars.
ecolor sets the color of the error bars.
capsize specifies the size of the caps at the end of the error bars.
alpha gives the bars a slight transparency.
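In practice, the error values usually come from the data itself. As one possible approach, the sketch below computes the mean and the standard error of the mean for each category from raw measurements. The measurements are simulated (30 made-up samples per category), so treat this as an illustration under those assumptions, not as real data. The Agg backend line is also an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line and call plt.show() to view the figure
import matplotlib.pyplot as plt
import numpy as np

categories = ['A', 'B', 'C', 'D']

# Hypothetical raw measurements: 30 simulated samples for each of the 4 categories
rng = np.random.default_rng(0)
samples = rng.normal(loc=[10, 15, 20, 25], scale=[2, 3, 4, 5], size=(30, 4))

# Bar height = mean of each column; error = standard error of that mean
values = samples.mean(axis=0)
errors = samples.std(axis=0, ddof=1) / np.sqrt(samples.shape[0])

bars = plt.bar(categories, values, yerr=errors, capsize=10,
               color='skyblue', ecolor='black', alpha=0.7)
plt.ylabel('Values')
plt.title('Bar plot with computed error bars')
```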
Difference between Bar plot and Histograms
Bar plots and histograms are both graphical displays of data that utilize bars, but they serve different purposes and represent data in distinct ways. Here are the main differences between the two:
Type of Data:
Bar Plot: Represents categorical data. Each bar represents a category and is used to compare different sets of data.
Histogram: Represents numerical data. It shows the distribution of a continuous variable, divided into bins or intervals.
Axes:
Bar Plot: The x-axis represents different categories, while the y-axis shows the value (count, percentage, etc.) for each category.
Histogram: The x-axis represents the bins (ranges of values) while the y-axis represents the frequency or count of observations within each bin.
Order of Bars:
Bar Plot: The bars can be arranged in any order, depending on the categorical data.
Histogram: The bars are always in ascending order of the variable value.
Spacing:
Bar Plot: There are spaces between bars, unless they are grouped or stacked bars.
Histogram: Consecutive bars are adjacent with no space in between because it's a continuous scale.
Purpose:
Bar Plot: Used for comparing different groups or to show data changes over time for a discrete variable.
Histogram: Used to visualize the underlying frequency distribution of a continuous set of data, helping to determine the shape and spread of continuous data.
Appearance:
Bar Plot: Typically, each bar is of uniform width, with the height of each bar corresponding to the data value.
Histogram: The area of each bar represents the frequency of data values in the respective bin, meaning bars can have different widths if bins have different size ranges.
While both bar plots and histograms use bars to represent data, they're best suited for different types of data and questions. Bar plots are great for categorical data, allowing for direct comparisons between categories. Histograms provide insight into the distribution and spread of continuous data.
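To see the difference side by side, here is a minimal sketch that draws a bar plot of categorical values next to a histogram of continuous values. The data reuses the kind of values from the earlier examples, the bin count of 20 is an arbitrary choice, and the Agg backend line is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line and call plt.show() to view the figure
import matplotlib.pyplot as plt
import numpy as np

# Categorical data: one value per named category
categories = ['A', 'B', 'C', 'D']
values = [50, 30, 70, 40]

# Continuous data: 1000 samples from a standard normal distribution
data = np.random.randn(1000)

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Bar plot: one bar per category, with gaps between the bars
ax_bar.bar(categories, values)
ax_bar.set_title('Bar plot: one bar per category')

# Histogram: adjacent bars, one per range of values
ax_hist.hist(data, bins=20, edgecolor='black')
ax_hist.set_title('Histogram: one bar per value range')

fig.tight_layout()
```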
Conclusion
Matplotlib stands as a cornerstone in the world of Python data visualization. Throughout our discussion, we delved deep into its varied functionalities, from basic line plotting to the detailing of scatter plots and histograms. We explored the customizability of the plot() function, enabling a variety of visual styles. We discovered the power of scatter plots, and their use for visualizing individual data points and their extensive customizable features, including marker sizes, transparency, and color mapping. Our journey also led us to histograms, which offer an insightful glimpse into the distribution and frequency of continuous data sets. Furthermore, we touched upon the utility of bar plots and their distinct nature compared to histograms. The subplotting feature of Matplotlib is an essential tool for crafting multiple plots in a single window, which is perfect for comparative data visualization. Matplotlib's versatility and expansive features make it an indispensable tool for anyone seeking to convey complex data in a visually appealing manner.