First we have to discuss how to represent data, both abstractly and in Python.
Definition 1.1 A quantitative value is one that is numerical and supports meaningful comparison and arithmetic operations.
Quantitative data is further divided into continuous and discrete types. The difference is the same as between real numbers and integers.
Example 1.1 Some continuous quantitative data sources:

- physical measurements, such as temperature, mass, and voltage
- elapsed time

Examples of discrete quantitative data:

- counts, such as the number of students in a class
- the outcome of a die roll
Sometimes there may be room for interpretation or context. For example, the retail price of a gallon of milk might be regarded as discrete data, since it technically represents a whole number of pennies. But in finance, transactions are regularly computed to much higher precision, so it might make more sense to interpret prices as continuous values. As a rule of thumb, if there are many possible values, or the values are the result of a measurement, the continuous interpretation is usually more appropriate.
Not all numerical values represent truly quantitative data. ZIP codes (postal codes) in the U.S. are 5-digit numbers, and while there is some logic to how they were assigned, there is no clearly meaningful interpretation of averaging them, for instance.
For both continuous and discrete quantities, it makes sense to order different values, compute averages of them, etc. (However, averages of discrete quantities are continuous.)
Mathematically, the real and integer number sets are infinite, but computers are finite machines. Integers are represented exactly within some range that is determined by how many binary bits are dedicated. The computational analog of real numbers are floating-point numbers, or more simply, floats. These are bounded in range as well as discretized. The details are complicated, but essentially, the floating-point numbers have about 16 significant decimal digits by default—which is virtually always far more precision than real data offers.
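We can check this precision claim by asking NumPy for the machine epsilon, the spacing between 1 and the next larger float; its size, about \(2.2\times 10^{-16}\), reflects the roughly 16 significant digits:

import numpy as np
# spacing between 1.0 and the next representable float
print( np.finfo(float).eps )
2.220446049250313e-16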
Example 1.2 If you enter an integer, it is interpreted as an int:
# This is an integer (int type)
print(3, "is a", type(3))
3 is a <class 'int'>
To get the floating-point version, you add a decimal point:
# This is a real number (float type)
print(3.0, "is a", type(3.0))
3.0 is a <class 'float'>
You can also use scientific notation:
print(6.02e23, "is Avogadro's number")
print(6.62607e-34, "is Planck's constant in Joule-seconds")
6.02e+23 is Avogadro's number
6.62607e-34 is Planck's constant in Joule-seconds
There are also explicit converter functions. The conversion of an int to a float is generally exact:
# Convert int to float (generally no change to numerical value)
print( "float(3) creates",float(3) )
float(3) creates 3.0
But the conversion of a float to an int truncates the number, which can cause subtle bugs:
# Truncate float to int
print( "int(3.14) creates", int(3.14) )
int(3.14) creates 3
Because operations on floats are not exact, results can be surprising:
print(0.2 - 0.1 == 0.1) # as expected
print(3.1 - 3.0 == 0.1) # uh-oh
True
False
The NumPy function isclose is a safer way to compare floats:
import numpy as np
print( np.isclose(3.1 - 3.0, 0.1) )
True
Be careful about comparing floating-point values. Exact comparisons may fail due to rounding errors.
There are two additional quasi-numerical float values to be aware of as well.
For numerical work in Python, the NumPy package is indispensable. We will use it often, and it is also loaded and used by most other scientifically oriented packages.
The value inf in NumPy stands for infinity.
By definition, every operation that involves a NaN value results in a NaN.
Example 1.3 Infinity is greater than every number, and some arithmetic operations with infinity are well defined:
import numpy as np
print( "(np.inf + 5) is", np.inf + 5 )
print( "(np.inf + np.inf) is", np.inf + np.inf )
print( "(5 - np.inf) is", 5 - np.inf )
(np.inf + 5) is inf
(np.inf + np.inf) is inf
(5 - np.inf) is -inf
However, in calculus you learned that some expressions with infinity are considered to be undefined without additional information (e.g., L’Hôpital’s Rule):
print( "np.inf / np.inf is", np.inf / np.inf )
np.inf / np.inf is nan
The result nan above stands for Not a Number. It is the result of indeterminate arithmetic operations, like \(\infty/\infty\) and \(\infty - \infty\). It is also used sometimes as a placeholder for missing data.
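For instance, adding to or multiplying a NaN just produces another NaN:

print( "np.nan + 1 is", np.nan + 1 )
print( "0 * np.nan is", 0 * np.nan )
np.nan + 1 is nan
0 * np.nan is nan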
One notorious trap of using NaN is that nan == nan is not True!
print( "np.nan == np.nan is", np.nan == np.nan )
np.nan == np.nan is False
Instead, you should use np.isnan to check for NaN values:
print( "np.isnan(np.nan) is", np.isnan(np.nan) )
np.isnan(np.nan) is True
Handling times and dates can be tricky. Aside from headaches such as time zones and leap years, there are many different ways people and machines represent dates, and with varying amounts of precision. Python has its own inbuilt system for handling dates and times, but we will show the facilities provided by the NumPy package, which is more comprehensive.
There are two basic types:
Definition 1.2 A datetime is a representation of an instant in time. A time delta is a representation of a duration; i.e., a difference between two datetimes.
Native Python uses a datetime type, while NumPy uses datetime64.
Example 1.4 You can use many natural date formats within strings:
import numpy as np
"2020-01-17") # YYYY-MM-DD
np.datetime64("1969-07-20T20:17") # YYYY-MM-DDThh:mm np.datetime64(
numpy.datetime64('1969-07-20T20:17')
# Current date and time, down to the second
np.datetime64("now")
numpy.datetime64('2024-09-13T18:16:39')
A time delta in NumPy indicates its units (i.e., its granularity).
"1969-07-20T20:17") - np.datetime64("today") np.datetime64(
numpy.timedelta64(-29006143,'m')
Most interesting phenomena are characterized and influenced by more than one factor. Collections of values therefore play a central role in data science. The workhorse type for collections in base Python is the list. However, we’re going to need some more powerful metaphors and tools as well, beginning with vectors.
Definition 1.3 A vector is a collection of values called elements, all of the same type, indexed by consecutive integers.
In math, vector indexes usually begin with 1. In Python, they begin with 0.
A vector with \(n\) elements is often referred to as an \(n\)-vector, and we say that \(n\) is the length of the vector. In math we often use \(\mathbb{R}^n\) to denote the set of all \(n\)-vectors with real-valued elements.
Example 1.5 The usual way to work with vectors in Python is through NumPy:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
x
array([1, 2, 3, 4, 5])
A vector has a data type for its elements:
x.dtype
dtype('int64')
Any float values in the vector cause the data type of the entire vector to be float:
y = np.array([1.0, 2, 3, 4, 5])
y.dtype
dtype('float64')
Use len to determine the length of a vector:
len(x)
5
You can create a special type of vector called a range that has equally spaced elements:
np.arange(0, 5)
array([0, 1, 2, 3, 4])
np.arange(1, 3, 0.5)
array([1. , 1.5, 2. , 2.5])
The syntax here is (start, stop, step).
In base Python and in NumPy, the last element of a range is omitted. This is guaranteed to cause confusion if you are used to just about any other computer language.
There are functions for creating vectors of ones and zeros:
print(np.ones(5))
print(np.zeros(4))
[1. 1. 1. 1. 1.]
[0. 0. 0. 0.]
The elements of a vector can be accessed directly using square brackets notation.
Example 1.6 Use square brackets to refer to an element of a vector, remembering that the first element has index 0:
x[0]
1
x[4]
5
Negative values for the index are counted from the end. The last element of a vector always has index -1, and more-negative values move backward through the elements:
x[-1]
5
x[-3]
3
x[-len(x)]
1
Element references can also be on the left side of an assignment:
x[2] = -3
x
array([ 1, 2, -3, 4, 5])
Note, however, that once the data type of a vector is set, it can’t be changed:
x[0] = 1.234
x    # the float was truncated to int, without warning!
array([ 1, 2, -3, 4, 5])
You can also use a list in square brackets to access multiple elements at once:
x[ [0, 2, 4] ]
array([ 1, -3, 5])
The result of a list reference is a new vector, not a number. Note the difference here:
x[0]
1
x[ [0] ]
array([1])
You can also access multiple elements of a vector by using a list or range as the index. In the latter case, this is called slicing.
As with ranges in base Python, the last element of a slice range is omitted. This causes many headaches and bugs.
Example 1.7 The syntax of a slice is start:stop.
print(x[0:3])
print(x[-3:-1])
[ 1 2 -3]
[-3 4]
If you want to use a range that skips over elements, you can use a third argument in the form start:stop:step:
x[0:2:5]
array([1])
When the start of the range is omitted, it means “from the beginning”, and when stop is omitted, it means “through to the end.” Hence, [:k] means “first \(k\) elements” and [-k:] means “last \(k\) elements”:
x[:3]
array([ 1, 2, -3])
x[-3:]
array([-3, 4, 5])
And we also have this idiom:
x[::-1]    # reverse the vector
array([ 5, 4, -3, 2, 1])
NumPy will happily allow a slice to extend past the valid indexes. It just returns as much as is available, without warning or error:
x[:10]
array([ 1, 2, -3, 4, 5])
A vector is a special case of a more general construct.
Definition 1.4 An array is a collection of values called elements, all of the same type, indexed by one or more sets of consecutive integers. The number of indexes needed to specify a value is the dimension of the array.
A dimension is called an axis in NumPy and related packages.
The term matrix is often used simply to mean a 2D array. Technically, though, a matrix should have only numerical values, and matrices obey certain properties that make them important mathematical objects. These properties and their consequences are studied in linear algebra.
For us, arrays will mostly be the results of importing or working on data. But it’s occasionally useful to know how to build them from scratch.
Example 1.8 One way to construct an array is by a list comprehension:
A = np.array([[j - i for j in range(6)] for i in range(4)])
A
array([[ 0, 1, 2, 3, 4, 5],
[-1, 0, 1, 2, 3, 4],
[-2, -1, 0, 1, 2, 3],
[-3, -2, -1, 0, 1, 2]])
The shape of an array is what we would often call the size:
A.shape
(4, 6)
There is no difference between a vector and a 1D array:
x.shape
(5,)
There is also no difference between a 2D array and a vector of vectors giving the rows of the array:
R = np.array([[1, 2, 3], [4, 5, 6]])
R
array([[1, 2, 3],
[4, 5, 6]])
Here are some other common ways to construct arrays.
np.ones(5)
array([1., 1., 1., 1., 1.])
np.zeros((3, 6))
array([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]])
np.repeat(np.pi, 3)
array([3.14159265, 3.14159265, 3.14159265])
You can also stack arrays vertically or horizontally to create new arrays.
np.hstack(( np.ones((2, 2)), np.zeros((2, 3)) ))
array([[1., 1., 0., 0., 0.],
[1., 1., 0., 0., 0.]])
np.vstack(( range(5), range(5, 0, -1) ))
array([[0, 1, 2, 3, 4],
[5, 4, 3, 2, 1]])
One way to think about a 2D array is as a vector of rows, each a vector of the same length. That idea is reflected in NumPy syntax.
Example 1.9 We can use successive brackets to refer to an element, row then column:
R[1][2]    # second row, third column
6
It’s equivalent, and typically more convenient, to use a single bracket set with indexes separated by commas:
R[1, 2]    # second row, third column
6
You can use slicing ranges in each dimension individually.
R[:1, -2:]    # first row, last two columns
array([[2, 3]])
The result above is another 2D array. Note how this result is subtly different:
R[0, -2:]
array([2, 3])
Because we accessed an individual row, not a slice, the result is one dimension lower: a vector. Finally, a : in one slice position means to keep everything in that dimension.
A[:, :2]    # all rows, first 2 columns
array([[ 0, 1],
[-1, 0],
[-2, -1],
[-3, -2]])
A common task is to reduce an array along one dimension, called an axis in NumPy, resulting in an array of one less dimension.
Example 1.10 The first dimension of an array is axis number 0, the second is axis number 1, etc.
A
array([[ 0, 1, 2, 3, 4, 5],
[-1, 0, 1, 2, 3, 4],
[-2, -1, 0, 1, 2, 3],
[-3, -2, -1, 0, 1, 2]])
np.sum(A, axis=0)    # sum along the rows
array([-6, -2, 2, 6, 10, 14])
Even when we sum along the column dimension, axis=1
, the result is a vector that we would interpret as a row, not a column.
np.sum(A, axis=1)    # sum along the columns
array([15, 9, 3, -3])
If you don’t specify an axis, the reduction occurs over all dimensions at once, resulting in a single number.
np.sum(A)
24
You can also do reductions with np.max, np.min, np.mean, etc.
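Here is a quick illustration of those reducers on a small made-up array:

B = np.array([[1, 2], [3, 4]])
print( np.max(B) )            # largest element overall
print( np.min(B, axis=0) )    # smallest element in each column
print( np.mean(B, axis=1) )   # average of each row
4
[1 2]
[1.5 3.5]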
A qualitative value is one that is not quantitative. There are numerous types, but we will consider only a few of them here. Ultimately, we need to be able to represent qualitative data in a quantitative way for use in algorithms.
Definition 1.5 Categorical data has values drawn from a finite set \(S\) of categories. If the members of \(S\) support meaningful ordering comparisons, then the data is ordinal; otherwise, it is nominal.
Example 1.11 Examples of ordinal categorical data:

- letter grades
- survey responses on a scale from “strongly disagree” to “strongly agree”

Examples of nominal categorical data:

- species of animal
- ZIP codes
There are nuanced cases. For instance, letter grades are themselves ordinal categorical data. However, schools convert them to discrete quantitative data and then compute a continuous quantitative GPA.
One way to quantify ordinal categorical data is to assign integer values to the categories in a manner that preserves ordering. This approach can be suspect, though, when it comes to operations such as averaging or computing a distance between values.
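For instance, here is a minimal sketch, using made-up clothing sizes, of an order-preserving integer coding:

values = ["small", "large", "medium", "small"]
coding = {"small": 0, "medium": 1, "large": 2}    # preserves small < medium < large
print( [coding[v] for v in values] )
[0, 2, 1, 0]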
Another means of quantifying categorical data is called dummy variables in classical statistics and one-hot encoding in much of machine learning. Suppose a variable \(x\) has values in a category set that has \(m\) members, i.e., \(S = \{c_1,\ldots,c_m\}\). There are two ways to replace a value \(v\) with dummy variables. The first uses \(m\) binary variables \(x_1,\ldots,x_m\), defined by

\[ x_i = \begin{cases} 1, & v = c_i, \\ 0, & v \neq c_i. \end{cases} \]

The second uses only \(m-1\) of these variables, so that the omitted category is encoded by a row of all zeros.
Example 1.12 Suppose that the stooge variable can take the values Moe, Larry, Curly, or Shemp, and we have the dataset
"Curly", "Moe", "Curly", "Larry", "Shemp", "Moe"] [
['Curly', 'Moe', 'Curly', 'Larry', 'Shemp', 'Moe']
If we use 4 dummy variables, the data would be replaced by the array
|   | Curly | Larry | Moe | Shemp |
|---|---|---|---|---|
| 0 | True | False | False | False |
| 1 | False | False | True | False |
| 2 | True | False | False | False |
| 3 | False | True | False | False |
| 4 | False | False | False | True |
| 5 | False | False | True | False |
If we use only 3 dummy variables, we would get
|   | Larry | Moe | Shemp |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | True | False |
| 2 | False | False | False |
| 3 | True | False | False |
| 4 | False | False | True |
| 5 | False | True | False |
The all-zero rows above correspond to the Curly values.
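Both encodings above can be produced by the pandas function get_dummies; in this sketch, passing drop_first=True omits the first category in sorted order, which here is Curly:

import pandas as pd
stooges = ["Curly", "Moe", "Curly", "Larry", "Shemp", "Moe"]
dummies4 = pd.get_dummies(stooges)                    # 4 dummy variables
dummies3 = pd.get_dummies(stooges, drop_first=True)   # 3 dummy variables; Curly rows are all zeros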
Text is a ubiquitous data source. One way to quantify text is to use a dictionary of interesting keywords \(w_1,\ldots,w_n\). Given a collection of documents \(d_1,\ldots,d_m\), we can define an \(m\times n\) document–term matrix \(T\) by letting \(T_{ij}\) be the number of times term \(j\) appears in document \(i\).
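Here is a minimal sketch of a document–term matrix, using a made-up corpus and term list and counting only exact word matches:

import numpy as np
docs = ["the cat sat", "the cat saw the dog", "dog and cat"]
terms = ["cat", "dog", "sat"]
T = np.array([[d.split().count(w) for w in terms] for d in docs])
print(T)
[[1 0 1]
 [1 1 0]
 [1 1 0]]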
The most straightforward way to represent an image is as a 3D array of values giving the intensities of red, green, and blue in each pixel. Sometimes it might be preferable to represent the image by a vector of statistics about these values, or by the presence or absence of detected objects, etc.
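For example, a tiny all-red image could be built as a 3D array like this (a sketch; real images would be loaded from a file):

import numpy as np
img = np.zeros((2, 2, 3), dtype=np.uint8)   # rows x columns x (red, green, blue)
img[:, :, 0] = 255                          # maximum intensity in the red channel
print(img.shape)
(2, 2, 3)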
The most popular Python package for manipulating and analyzing data is pandas. We will use the paradigm it presents, which is fairly well understood throughout data science.
Definition 1.6 A series is a vector that is indexed by a finite ordered set.
In NumPy, we always index vectors by absolute position. In pandas, the elements still have an underlying ordering, but we can index them by anything, including strings. Since names are more memorable and intuitive than arbitrary integers, using them makes code easier to write, read, and debug.
Example 1.13 Some data that can be viewed as series:

- daily high temperatures, indexed by date
- populations, indexed by country name
Example 1.14 Here is a small pandas series. Note that we give the values first and then the index:
import pandas as pd
= ["Marge", "Homer", "Bart", "Lisa", "Maggie"]
names = pd.Series( ["March", "May", "February", "May", "January"], index=names)
months months
Marge March
Homer May
Bart February
Lisa May
Maggie January
dtype: object
We can access the values by choosing a name in the index:
"Lisa"] months[
'May'
If we access multiple values, we get a series that is a subset of the original:
"Homer", "Marge"] ] months[ [
Homer May
Marge March
dtype: object
We can also use the iloc property to access the underlying vector by NumPy slicing:
months.iloc[-3:]
Bart February
Lisa May
Maggie January
dtype: object
Here is a series of years based on the same index:
years = pd.Series([1955, 1956, 1979, 1981, 1988], index=names)
years
Marge 1955
Homer 1956
Bart 1979
Lisa 1981
Maggie 1988
dtype: int64
We can also start with the series and separate it into a vector of values and its index:
print("Values:")
print(years.values)
print("\nIndex:")
print(years.index)
Values:
[1955 1956 1979 1981 1988]
Index:
Index(['Marge', 'Homer', 'Bart', 'Lisa', 'Maggie'], dtype='object')
The following table lists some common operations on pandas series. There is an exhaustive list in the pandas documentation.
| Description | Syntax | Result |
|---|---|---|
| First or last entries | s.head(), s.tail() | Series |
| Length | len(s) | integer |
| All of the values | s.values | array |
| Convert values to list | list(s) | list |
| Index | s.index | Index |
| Unique values | s.unique() | array |
| Appearance frequencies | s.value_counts() | Series |
| Extreme values | s.min(), s.max() | number |
| Sort by values | s.sort_values() | Series |
| Sum | s.sum() | number |
| Comparison | s > 1, s == "foo" | boolean Series |
| Locate missing | s.isna() | boolean Series |
| Arithmetic | s + 1, s * t | Series |
| Delete one or more rows | s.drop() | Series |
As always in Python, you need to pay attention to the difference between applying a function, like foo(bar), accessing a property, like foo.bar, and calling an object method, foo.bar(). Extra or missing parentheses groups can cause errors.
Example 1.15 We continue with the definitions in Example 1.14. Here are all the unique values of the months:
months.unique()
array(['March', 'May', 'February', 'January'], dtype=object)
Here are the frequencies of their appearance:
months.value_counts()
May 2
March 1
February 1
January 1
Name: count, dtype: int64
Note that we can do a logical comparison and get a series of boolean values:
== "May" months
Marge False
Homer True
Bart False
Lisa True
Maggie False
dtype: bool
Since the years are numerical, we can do arithmetic operations on them:
print("Minimum year is", years.min(), "and maximum year is", years.max())
print("The sum of the years is", years.sum())
Minimum year is 1955 and maximum year is 1988
The sum of the years is 9859
It can be useful to know the index where the minimum or maximum value occurs:
print("The oldest is", years.idxmin(), "and the youngest is", years.idxmax())
The oldest is Marge and the youngest is Maggie
Definition 1.7 A data frame is a collection of series that all share the same index set.
Essentially, a data frame is a 2D array with a common index for the rows, where the constituent series are the columns of the array. Those columns can also be indexed by names (strings). Since names are much more memorable and intuitive than arbitrarily assigned integers, data frames make code easier to write, read, and reason about.
Curly braces { } are used to construct a dictionary in Python.
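For example:

d = {"apple": 1, "banana": 2}   # keys mapped to values
print( d["banana"] )
2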
Example 1.16 We continue with the definitions of months and years in Example 1.14, and use a dictionary to specify the columns of a data frame:
= pd.DataFrame( {"month": months, "date": [19, 12, 23, 9, 14], "year": years } )
simpsons simpsons
|   | month | date | year |
|---|---|---|---|
| Marge | March | 19 | 1955 |
| Homer | May | 12 | 1956 |
| Bart | February | 23 | 1979 |
| Lisa | May | 9 | 1981 |
| Maggie | January | 14 | 1988 |
Notice above that the date column was given as just a list of numbers; the index for it was inherited from the months series.
If we give the name of a column as an index into the frame, we get back a series for that column:
"year"] simpsons[
Marge 1955
Homer 1956
Bart 1979
Lisa 1981
Maggie 1988
Name: year, dtype: int64
If we give a list of column names, we get back a data frame with just those columns:
"month", "date"] ] simpsons[ [
|   | month | date |
|---|---|---|
| Marge | March | 19 |
| Homer | May | 12 |
| Bart | February | 23 |
| Lisa | May | 9 |
| Maggie | January | 14 |
We can even index a nonexistent column on the left side of an assignment, and pandas will create it for us:
"weekday"] = ["Saturday", "Saturday", "Friday", "Saturday", "Thursday"]
simpsons[ simpsons
|   | month | date | year | weekday |
|---|---|---|---|---|
| Marge | March | 19 | 1955 | Saturday |
| Homer | May | 12 | 1956 | Saturday |
| Bart | February | 23 | 1979 | Friday |
| Lisa | May | 9 | 1981 | Saturday |
| Maggie | January | 14 | 1988 | Thursday |
We can access a row by using brackets with the loc property of the frame, getting a series indexed by the column names of the frame:
"Maggie"] simpsons.loc[
month January
date 14
year 1988
weekday Thursday
Name: Maggie, dtype: object
We are also free to strip away the index and get an ordinary array:
simpsons.to_numpy()
array([['March', 19, 1955, 'Saturday'],
['May', 12, 1956, 'Saturday'],
['February', 23, 1979, 'Friday'],
['May', 9, 1981, 'Saturday'],
['January', 14, 1988, 'Thursday']], dtype=object)
Another way to create a data frame is to give a list of rows, plus (optionally) the index and the names of the columns:
pd.DataFrame(
    [ ("white", 5), ("brown", 9), ("pink", 8) ],
    columns=["Color", "Rating"],
    index=["vanilla", "chocolate", "strawberry"]
)
|   | Color | Rating |
|---|---|---|
| vanilla | white | 5 |
| chocolate | brown | 9 |
| strawberry | pink | 8 |
The following table shows common operations specific to a data frame. But since the columns are series, the operations in the preceding table can be applied too, and they will act columnwise.
| Description | Syntax | Result |
|---|---|---|
| Summary information | df.info(), df.describe() | DataFrame |
| First or last row(s) | df.head(), df.tail() | DataFrame |
| Number of rows | len(df) | integer |
| Number of rows and columns | df.shape | tuple |
| Row index | df.index | Index |
| Column names | list(df) | list |
| Column names | df.columns | Index |
| Access by column name(s) | df["name"], df[["col1", "col2"]] | Series or DataFrame |
| Access by index/name | df.loc[rows, "name"] | (varies) |
| Access by position | df.iloc[i, j] | (varies) |
| Delete a column | df.drop(name, axis=1) | DataFrame |
| Sort by column values | df.sort_values("column") | DataFrame |
Example 1.17 We continue with the definition of simpsons in Example 1.16:
simpsons.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, Marge to Maggie
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 month 5 non-null object
1 date 5 non-null int64
2 year 5 non-null int64
3 weekday 5 non-null object
dtypes: int64(2), object(2)
memory usage: 372.0+ bytes
simpsons.min()
month February
date 9
year 1955
weekday Friday
dtype: object
Summing numbers does what you expect, but summing strings concatenates them:
simpsons.sum()
month MarchMayFebruaryMayJanuary
date 77
year 9859
weekday SaturdaySaturdayFridaySaturdayThursday
dtype: object
We can sort the entire frame by its values in a column:
"date") simpsons.sort_values(
|   | month | date | year | weekday |
|---|---|---|---|---|
| Lisa | May | 9 | 1981 | Saturday |
| Homer | May | 12 | 1956 | Saturday |
| Maggie | January | 14 | 1988 | Thursday |
| Marge | March | 19 | 1955 | Saturday |
| Bart | February | 23 | 1979 | Friday |
For a more thorough demonstration of operations on data frames, look at the pandas user guide. There is also an exhaustive list of relevant functions in the pandas documentation.
Datasets can be presented in many forms. The simplest and most universal are comma-separated value (CSV) files. Such files can be read by pandas locally or over the web.
Example 1.18 The pandas function we use to load a dataset from CSV is read_csv. Here we use it to read a file that is available over the web:
= pd.read_csv("https://raw.githubusercontent.com/tobydriscoll/ds1book/master/advertising.csv")
ads 6) # show the first 6 rows ads.head(
|   | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 12.0 |
| 3 | 151.5 | 41.3 | 58.5 | 16.5 |
| 4 | 180.8 | 10.8 | 58.4 | 17.9 |
| 5 | 8.7 | 48.9 | 75.0 | 7.2 |
Note above that we used head(6) to see just the first 6 rows.
It’s a good idea to check what you’ve just loaded:
ads.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TV 200 non-null float64
1 Radio 200 non-null float64
2 Newspaper 200 non-null float64
3 Sales 200 non-null float64
dtypes: float64(4)
memory usage: 6.4 KB
Here is one thing to watch for when loading numerical data: if a column has anything non-numeric in it, like “N/A” or “unknown”, pandas will load that column as a string.
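Here is a small illustration using a made-up CSV string. (Note that pandas already treats a few standard markers, such as "N/A", as missing values, so the example uses a nonstandard one.)

import io
import pandas as pd
csv = "x\n1.5\n2.0\nunknown\n"
df = pd.read_csv(io.StringIO(csv))
print( df["x"].dtype )    # object (strings), because of the "unknown" entry
object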
While it’s possible to import datasets directly from the web, links change and disappear frequently. If storing the dataset locally is not a problem, you may want to download your own copy before working on it.
It’s also possible to import data from most other general-purpose formats you might encounter, such as Excel spreadsheets (though possibly requiring one-time installation of additional libraries). There are many functions whose names start with pd.read_, covering the formats pandas understands.
In a data frame, we access columns by giving a string name, or a list of names, in square brackets. We can access a row by the loc property with an index value, or iloc with an absolute row number starting from zero.
Example 1.19 Here’s a local file that contains daily weather summaries from Newark, Delaware:
= pd.read_csv("_datasets/ghcn_newark.csv")
weather weather.head()
|   | STATION | DATE | LATITUDE | LONGITUDE | ELEVATION | NAME | PRCP | PRCP_ATTRIBUTES | SNOW | SNOW_ATTRIBUTES | ... | WT08 | WT08_ATTRIBUTES | WT11 | WT11_ATTRIBUTES | WT14 | WT14_ATTRIBUTES | WT16 | WT16_ATTRIBUTES | WT18 | WT18_ATTRIBUTES |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USC00076410 | 1894-04-01 | 39.6682 | -75.74569 | 32.3 | NEWARK AG FARM, DE US | 0.0 | P,,6, | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | USC00076410 | 1894-04-02 | 39.6682 | -75.74569 | 32.3 | NEWARK AG FARM, DE US | 0.0 | P,,6, | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | USC00076410 | 1894-04-03 | 39.6682 | -75.74569 | 32.3 | NEWARK AG FARM, DE US | 0.0 | P,,6, | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | USC00076410 | 1894-04-04 | 39.6682 | -75.74569 | 32.3 | NEWARK AG FARM, DE US | 165.0 | ,,6, | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | USC00076410 | 1894-04-05 | 39.6682 | -75.74569 | 32.3 | NEWARK AG FARM, DE US | 0.0 | P,,6, | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

5 rows × 56 columns
There are lots of columns here. We will work with a small subset of them:
= ["DATE", "PRCP", "SNOW", "TMAX", "TMIN"]
columns = weather[columns]
weather weather.head()
|   | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|
| 0 | 1894-04-01 | 0.0 | NaN | 189.0 | 111.0 |
| 1 | 1894-04-02 | 0.0 | NaN | 133.0 | 22.0 |
| 2 | 1894-04-03 | 0.0 | NaN | 89.0 | -61.0 |
| 3 | 1894-04-04 | 165.0 | NaN | 117.0 | 17.0 |
| 4 | 1894-04-05 | 0.0 | NaN | 156.0 | 56.0 |
The first row has absolute position zero:
weather.iloc[0]
DATE 1894-04-01
PRCP 0.0
SNOW NaN
TMAX 189.0
TMIN 111.0
Name: 0, dtype: object
The result above is a series, with the column names of the original data frame serving as the index. The following result is subtly different:
weather.iloc[:1]
|   | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|
| 0 | 1894-04-01 | 0.0 | NaN | 189.0 | 111.0 |
The slicing syntax [:1] means “the first 1 rows.” The result is a one-row data frame, not a series. To access the last 4 rows, we can use a negative index:
weather.iloc[-4:]
|   | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|
| 29513 | 2022-12-17 | 0.0 | NaN | 72.0 | 11.0 |
| 29514 | 2022-12-18 | 0.0 | NaN | 39.0 | -22.0 |
| 29515 | 2023-01-06 | 0.0 | NaN | 89.0 | 33.0 |
| 29516 | 2023-01-07 | 0.0 | NaN | 72.0 | 17.0 |
In Example 1.19, we did not specify an index for the data frame, so the row numbers were the index. If we give a data frame a different index, we can use loc to access rows by the index value.
Example 1.20 A natural candidate for an index in the weather data is the DATE column. This can be specified when the file is read, but here we do it after the fact:
= weather.set_index("DATE")
dated_weather dated_weather.head()
| DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|
| 1894-04-01 | 0.0 | NaN | 189.0 | 111.0 |
| 1894-04-02 | 0.0 | NaN | 133.0 | 22.0 |
| 1894-04-03 | 0.0 | NaN | 89.0 | -61.0 |
| 1894-04-04 | 165.0 | NaN | 117.0 | 17.0 |
| 1894-04-05 | 0.0 | NaN | 156.0 | 56.0 |
Notice above that the leftmost column of dated_weather is the index, which still has the column name DATE. We can access a row by giving a date string to loc:
"1979-03-28"] dated_weather.loc[
PRCP 0.0
SNOW 0.0
TMAX 111.0
TMIN -33.0
Name: 1979-03-28, dtype: float64
We can extract a row and column simultaneously by giving both to loc:
"1979-03-28", "TMAX"] dated_weather.loc[
111.0
When the index is a date, as is the case here, we can even use a range of dates:
"1979-03-01":"1979-03-06"] dated_weather.loc[
| DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|
| 1979-03-01 | 0.0 | 0.0 | 122.0 | -22.0 |
| 1979-03-02 | 3.0 | 0.0 | 128.0 | 33.0 |
| 1979-03-03 | 0.0 | 0.0 | 117.0 | 22.0 |
| 1979-03-04 | 0.0 | 0.0 | NaN | NaN |
| 1979-03-05 | 61.0 | 0.0 | NaN | NaN |
| 1979-03-06 | 173.0 | 0.0 | 128.0 | 83.0 |
Frequently we want to select rows from a data frame based on criteria applied to the data itself. We can use relational operators such as >, <, ==, etc. to select rows based on the values in a column.
Example 1.21 Here is how we can select all the rows in which the value of PRCP is greater than 1200:
"PRCP"] > 1200] weather.loc[weather[
|   | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|
| 9396 | 1952-07-09 | 1377.0 | 0.0 | 261.0 | 200.0 |
| 12382 | 1960-09-12 | 1331.0 | 0.0 | 222.0 | 167.0 |
| 16352 | 1971-08-27 | 1207.0 | 0.0 | 256.0 | 194.0 |
| 26543 | 1999-09-16 | 2202.0 | 0.0 | 206.0 | 156.0 |
| 27938 | 2004-07-12 | 1580.0 | 0.0 | 244.0 | 189.0 |
| 27986 | 2004-09-29 | 1450.0 | 0.0 | 244.0 | 178.0 |
There are two steps above. The expression inside the square brackets produces a series of Boolean values:
= (weather["PRCP"] > 1200)
is_prcp is_prcp.head()
0 False
1 False
2 False
3 False
4 False
Name: PRCP, dtype: bool
This series can be used to select rows with loc. Actually, it works without loc as well, because pandas knows it is a row selector:
print("There are",is_prcp.sum(),"rows with PRCP > 1200.")
weather[is_prcp]
There are 6 rows with PRCP > 1200.
|   | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|
| 9396 | 1952-07-09 | 1377.0 | 0.0 | 261.0 | 200.0 |
| 12382 | 1960-09-12 | 1331.0 | 0.0 | 222.0 | 167.0 |
| 16352 | 1971-08-27 | 1207.0 | 0.0 | 256.0 | 194.0 |
| 26543 | 1999-09-16 | 2202.0 | 0.0 | 206.0 | 156.0 |
| 27938 | 2004-07-12 | 1580.0 | 0.0 | 244.0 | 189.0 |
| 27986 | 2004-09-29 | 1450.0 | 0.0 | 244.0 | 178.0 |
We can use logical operators & (AND), | (OR), and ~ (NOT) on these Boolean series in order to combine criteria:
"TMAX"] < 0) & (weather["SNOW"] > 0)] weather[(weather[
|   | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|
| 272 | 1895-01-29 | 102.0 | 102.0 | -22.0 | -94.0 |
| 282 | 1895-02-08 | 127.0 | 178.0 | -139.0 | -189.0 |
| 692 | 1896-03-24 | 33.0 | 51.0 | -22.0 | -100.0 |
| 943 | 1896-11-30 | 107.0 | 76.0 | -6.0 | -33.0 |
| 966 | 1896-12-23 | 48.0 | 51.0 | -6.0 | -89.0 |
| ... | ... | ... | ... | ... | ... |
| 25224 | 1996-01-08 | 3.0 | 25.0 | -28.0 | -156.0 |
| 25249 | 1996-02-02 | 20.0 | 25.0 | -22.0 | -56.0 |
| 25250 | 1996-02-03 | 155.0 | 147.0 | -33.0 | -111.0 |
| 25594 | 1997-01-12 | 13.0 | 13.0 | -33.0 | -100.0 |
| 25623 | 1997-03-10 | 178.0 | 8.0 | -17.0 | 167.0 |

131 rows × 5 columns
The pandas user guide has a handy section on selections.
For practice with pandas fundamentals, try the Kaggle course.
Sometimes you want to combine data from different sources. There are multiple ways to do this in pandas, depending on the structure of the data you want to combine and the result you want to receive.
The concat function is used to stack frames or series on top of each other.
Example 1.22 Here we create two frames with three rows each and some commonly named columns:
birds = pd.DataFrame(
    {
        "Name": ["Eagles", "Crows", "Falcons"],
        "Predatory": [True, False, True],
        "Family": ["Accipitridae", "Corvidae", "Falconidae"],
        "Color": ["Brown", "Black", "Blue–Gray"]
    }
)
birds
|   | Name | Predatory | Family | Color |
|---|---|---|---|---|
| 0 | Eagles | True | Accipitridae | Brown |
| 1 | Crows | False | Corvidae | Black |
| 2 | Falcons | True | Falconidae | Blue–Gray |
teams = pd.DataFrame(
    {
        "Name": ["Falcons", "Eagles", "Bears"],
        "Home": ["Atlanta", "Philadelphia", "Chicago"],
        "Color": ["Red", "Green", "Blue"]
    }
)
teams
|   | Name | Home | Color |
|---|---|---|---|
| 0 | Falcons | Atlanta | Red |
| 1 | Eagles | Philadelphia | Green |
| 2 | Bears | Chicago | Blue |
We can use pd.concat() to stack these frames on top of each other:
pd.concat([birds, teams])
|   | Name | Predatory | Family | Color | Home |
|---|---|---|---|---|---|
| 0 | Eagles | True | Accipitridae | Brown | NaN |
| 1 | Crows | False | Corvidae | Black | NaN |
| 2 | Falcons | True | Falconidae | Blue–Gray | NaN |
| 0 | Falcons | NaN | NaN | Red | Atlanta |
| 1 | Eagles | NaN | NaN | Green | Philadelphia |
| 2 | Bears | NaN | NaN | Blue | Chicago |
Notice above that any value that wasn’t defined in the source frames ends up as a missing value in the result.
We can also concatenate the frames side by side by using the axis keyword:
="columns") pd.concat([birds, teams], axis
|   | Name | Predatory | Family | Color | Name | Home | Color |
|---|---|---|---|---|---|---|---|
| 0 | Eagles | True | Accipitridae | Brown | Falcons | Atlanta | Red |
| 1 | Crows | False | Corvidae | Black | Eagles | Philadelphia | Green |
| 2 | Falcons | True | Falconidae | Blue–Gray | Bears | Chicago | Blue |
The result is just a dumb pasting together of the source frames. There is no attempt to make the entries of the Name column match up, even though there were some overlapping values in it.
Concatenation is occasionally useful, but more often data frames should be joined horizontally using the more flexible merge function described in the next section.
The merge function is used to merge frames based on the values within the columns. Generically, each source frame can have a key column that is used to match up rows.
Example 1.23 Continuing with the data frames from Example 1.22:
="Name") pd.merge(birds, teams, on
|   | Name | Predatory | Family | Color_x | Home | Color_y |
|---|---|---|---|---|---|---|
| 0 | Eagles | True | Accipitridae | Brown | Philadelphia | Green |
| 1 | Falcons | True | Falconidae | Blue–Gray | Atlanta | Red |
The on keyword specifies the column that is the key for the merge. In the above merge, the result is a frame with only the rows that have a common value in the Name column. That’s called an inner join, which is the default. An alternative is an outer join, which includes all rows from both frames:
="Name", how="outer") pd.merge(birds, teams, on
|   | Name | Predatory | Family | Color_x | Home | Color_y |
|---|---|---|---|---|---|---|
| 0 | Bears | NaN | NaN | NaN | Chicago | Blue |
| 1 | Crows | False | Corvidae | Black | NaN | NaN |
| 2 | Eagles | True | Accipitridae | Brown | Philadelphia | Green |
| 3 | Falcons | True | Falconidae | Blue–Gray | Atlanta | Red |
Note also above that the column Color that appeared in both of the source frames was renamed to Color_x and Color_y in the result, since otherwise there would be a conflict.
It’s also possible to do a left join or a right join, which use one of the original frames as the basis for the result:
print("Left merge:")
print(pd.merge(birds, teams, on="Name", how="left"))
print("\nRight merge:")
print(pd.merge(birds, teams, on="Name", how="right"))
Left merge:
Name Predatory Family Color_x Home Color_y
0 Eagles True Accipitridae Brown Philadelphia Green
1 Crows False Corvidae Black NaN NaN
2 Falcons True Falconidae Blue–Gray Atlanta Red
Right merge:
Name Predatory Family Color_x Home Color_y
0 Falcons True Falconidae Blue–Gray Atlanta Red
1 Eagles True Accipitridae Brown Philadelphia Green
2 Bears NaN NaN NaN Chicago Blue
The difference between outer join and inner join is the same as between the union and intersection of sets; the sets here are the contents of the "on" column.
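We can check that analogy directly on the Name columns of the frames from Example 1.22:

inner_keys = set(birds["Name"]) & set(teams["Name"])   # intersection: keys of the inner join
outer_keys = set(birds["Name"]) | set(teams["Name"])   # union: keys of the outer join
print( sorted(inner_keys) )
print( sorted(outer_keys) )
['Eagles', 'Falcons']
['Bears', 'Crows', 'Eagles', 'Falcons']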
Example 1.24 Here we load in weather data for Newark, DE, being sure that the column representing the date is parsed as dates:
= pd.read_csv("_datasets/ghcn_newark.csv", parse_dates=["DATE"])
weather = weather[["DATE", "PRCP", "SNOW", "TMAX", "TMIN"]]
weather weather.head()
|   | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|
| 0 | 1894-04-01 | 0.0 | NaN | 189.0 | 111.0 |
| 1 | 1894-04-02 | 0.0 | NaN | 133.0 | 22.0 |
| 2 | 1894-04-03 | 0.0 | NaN | 89.0 | -61.0 |
| 3 | 1894-04-04 | 165.0 | NaN | 117.0 | 17.0 |
| 4 | 1894-04-05 | 0.0 | NaN | 156.0 | 56.0 |
Here is another data frame for the S&P 500 stock market index:
= pd.read_csv("_datasets/SP500_daily.csv", parse_dates=["Date"])
sp500 "Gain"] = sp500["Close/Last"] - sp500["Open"]
sp500[ sp500.head()
|   | Date | Close/Last | Open | High | Low | Gain |
|---|---|---|---|---|---|---|
| 0 | 2024-03-05 | 5078.65 | 5110.52 | 5114.54 | 5056.82 | -31.87 |
| 1 | 2024-03-04 | 5130.95 | 5130.99 | 5149.67 | 5127.18 | -0.04 |
| 2 | 2024-03-01 | 5137.08 | 5098.51 | 5140.33 | 5094.16 | 38.57 |
| 3 | 2024-02-29 | 5096.27 | 5085.36 | 5104.99 | 5061.89 | 10.91 |
| 4 | 2024-02-28 | 5069.76 | 5067.20 | 5077.37 | 5058.35 | 2.56 |
Now we can merge the weather data with the stock data according to the date. Note that here, the key is named differently in the two frames:
merged = pd.merge(sp500, weather, left_on="Date", right_on="DATE")
merged.head()
|   | Date | Close/Last | Open | High | Low | Gain | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-06 | 3895.08 | 3823.37 | 3906.19 | 3809.56 | 71.71 | 2023-01-06 | 0.0 | NaN | 89.0 | 33.0 |
| 1 | 2022-11-30 | 4080.11 | 3957.18 | 4080.11 | 3938.58 | 122.93 | 2022-11-30 | 97.0 | NaN | 128.0 | 6.0 |
| 2 | 2022-11-29 | 3957.63 | 3964.19 | 3976.77 | 3937.65 | -6.56 | 2022-11-29 | 0.0 | NaN | 83.0 | -22.0 |
| 3 | 2022-11-28 | 3963.94 | 4005.36 | 4012.27 | 3955.77 | -41.42 | 2022-11-28 | 0.0 | NaN | 122.0 | 17.0 |
| 4 | 2022-11-25 | 4026.12 | 4023.34 | 4034.02 | 4020.76 | 2.78 | 2022-11-25 | 13.0 | NaN | 150.0 | 6.0 |
Since we did an inner join, only the rows with a common date in both frames are included:
print("There are", len(sp500), "rows in the S&P 500 data.")
print("There are", len(weather), "rows in the weather data.")
print("There are", len(merged), "rows in the inner-merged data.")
There are 1259 rows in the S&P 500 data.
There are 29517 rows in the weather data.
There are 820 rows in the inner-merged data.
If the key you want to merge on is the index column, the syntax of the merge changes a little. Here is an example:
= weather.set_index("DATE")
weather weather.head()
| DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|
| 1894-04-01 | 0.0 | NaN | 189.0 | 111.0 |
| 1894-04-02 | 0.0 | NaN | 133.0 | 22.0 |
| 1894-04-03 | 0.0 | NaN | 89.0 | -61.0 |
| 1894-04-04 | 165.0 | NaN | 117.0 | 17.0 |
| 1894-04-05 | 0.0 | NaN | 156.0 | 56.0 |
result = pd.merge(sp500, weather, left_on="Date", right_index=True)
result.head()
|   | Date | Close/Last | Open | High | Low | Gain | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|---|---|---|---|---|
| 290 | 2023-01-06 | 3895.08 | 3823.37 | 3906.19 | 3809.56 | 71.71 | 0.0 | NaN | 89.0 | 33.0 |
| 315 | 2022-11-30 | 4080.11 | 3957.18 | 4080.11 | 3938.58 | 122.93 | 97.0 | NaN | 128.0 | 6.0 |
| 316 | 2022-11-29 | 3957.63 | 3964.19 | 3976.77 | 3937.65 | -6.56 | 0.0 | NaN | 83.0 | -22.0 |
| 317 | 2022-11-28 | 3963.94 | 4005.36 | 4012.27 | 3955.77 | -41.42 | 0.0 | NaN | 122.0 | 17.0 |
| 318 | 2022-11-25 | 4026.12 | 4023.34 | 4034.02 | 4020.76 | 2.78 | 13.0 | NaN | 150.0 | 6.0 |
Raw data often needs to be manipulated into a usable format before algorithms can be applied. Preprocessing data so that it is suitable for machine analysis is known as data wrangling.
There are a number of frequently encountered tasks in data wrangling, including:

- handling missing or invalid values,
- cleaning up strings and converting them to numbers,
- encoding categorical variables numerically,
- merging and reshaping tables.
These are tasks that graphical and AI assistive tools can help with a great deal. When using such tools, have them generate code that reproduces their steps. That way, you have a definitive record of what was done and can reproduce or modify it later.
One of the most common snags in real datasets is rows that have missing values. There is no universal way that such values are presented. A missing value may be indicated by a zero or nonsensical value, or with a string such as “n/a”, or just an empty cell. Within Python, it might be natural to represent missing values as None, but that is not a valid value for a NumPy array. It’s more flexible within pandas to use np.nan, which stands for “not a number”.
Most algorithms will fail when presented with missing values, so you must either remove or replace them.
For removal, the dropna() method deletes all rows that have missing values. By default, it does this to a new copy of the data frame, so you must assign the result to a variable. Alternatively, you can use the argument inplace=True to modify the original frame. When applying dropna(), you may want to limit its scope to a subset of columns by using the subset argument with a list of columns.
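As a minimal sketch with a made-up frame, the subset argument restricts which columns are checked for missing values:

import numpy as np
import pandas as pd
df = pd.DataFrame( {"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]} )
complete = df.dropna()            # drops both rows that have any missing value
has_a = df.dropna(subset=["a"])   # drops only the row where column a is missing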
Replacement of missing values is called imputation. The fillna() method makes it easy to use the mean, median, or mode of the non-missing values in a column. Doing so allows you to avoid data loss, but you should be cautious about imputation, because it can easily misrepresent the underlying data. At the very least, you need to document how many values have been changed.
Example 1.25 Here is a well-known demonstration data set often referred to as Palmer Penguins:
import seaborn as sns
import numpy as np
= sns.load_dataset("penguins")
penguins penguins.head()
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
Note above that the fourth row of the frame is missing measurements. We can discover where such values are using isna, which creates a Boolean data frame:
penguins.isna()
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False |
| 3 | False | False | True | True | True | True | True |
| 4 | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | False | False | True | True | True | True | True |
| 340 | False | False | False | False | False | False | False |
| 341 | False | False | False | False | False | False | False |
| 342 | False | False | False | False | False | False | False |
| 343 | False | False | False | False | False | False | False |

344 rows × 7 columns
The number of missing values in each column is
penguins.isna().sum()
species 0
island 0
bill_length_mm 2
bill_depth_mm 2
flipper_length_mm 2
body_mass_g 2
sex 11
dtype: int64
To drop the rows with missing values, you can use dropna():
dropped = penguins.dropna()
print("Original has", len(penguins), "rows.")
print("After removals, there are", len(dropped), "rows.")
Original has 344 rows.
After removals, there are 333 rows.
If you want to impute missing values, you can use fillna(). Here, for example, we replace missing values in any quantitative column with the mean of the column:
filled = penguins.copy()
for col in penguins.select_dtypes(np.number).columns:
    filled[col].fillna(penguins[col].mean(), inplace=True)
filled.head()
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.10000 | 18.70000 | 181.000000 | 3750.000000 | Male |
| 1 | Adelie | Torgersen | 39.50000 | 17.40000 | 186.000000 | 3800.000000 | Female |
| 2 | Adelie | Torgersen | 40.30000 | 18.00000 | 195.000000 | 3250.000000 | Female |
| 3 | Adelie | Torgersen | 43.92193 | 17.15117 | 200.915205 | 4201.754386 | NaN |
| 4 | Adelie | Torgersen | 36.70000 | 19.30000 | 193.000000 | 3450.000000 | Female |
Note that, for example, the categorical column for sex was left alone and still contains missing values.
A common pitfall is that data that seems numerical to you might be parsed as strings by pandas. This happens when there are non-numeric characters in the data, such as dollar signs or percent signs, and any row with such a character will cause the whole column to be interpreted as strings.
Example 1.26 To demonstrate algorithms in later sections, we will be using a dataset describing loans made on the crowdfunding site LendingClub. For this example, we’ll keep only a few columns.
= pd.read_csv("_datasets/loans.csv")
loans = loans[["loan_amnt", "funded_amnt", "int_rate", "title"]]
loans 8) loans.head(
|   | loan_amnt | funded_amnt | int_rate | title |
|---|---|---|---|---|
| 0 | 5000 | 5000 | 10.65% | Computer |
| 1 | 2500 | 2500 | 15.27% | bike |
| 2 | 2400 | 2400 | 15.96% | real estate business |
| 3 | 10000 | 10000 | 13.49% | personel |
| 4 | 3000 | 3000 | 12.69% | Personal |
| 5 | 5000 | 5000 | 7.90% | My wedding loan I promise to pay back |
| 6 | 7000 | 7000 | 15.96% | Loan |
| 7 | 3000 | 3000 | 18.64% | Car Downpayment |
The int_rate column, which gives the interest rate on the loan, has been interpreted as strings due to the percent sign. We’d like to strip out those percent signs and convert the rest to numeric types. We can use the str property of the column to access string methods:
= loans["int_rate"].str.strip('%') int_rate
Then, we can convert the result to numbers:
"int_rate"] = pd.to_numeric( int_rate )
loans[ loans.head()
|   | loan_amnt | funded_amnt | int_rate | title |
|---|---|---|---|---|
| 0 | 5000 | 5000 | 10.65 | Computer |
| 1 | 2500 | 2500 | 15.27 | bike |
| 2 | 2400 | 2400 | 15.96 | real estate business |
| 3 | 10000 | 10000 | 13.49 | personel |
| 4 | 3000 | 3000 | 12.69 | Personal |
The title column contains strings that are haphazardly formatted. We might want to convert everything in that column to lowercase:
"title"] = loans["title"].str.lower()
loans[8) loans.head(
|   | loan_amnt | funded_amnt | int_rate | title |
|---|---|---|---|---|
| 0 | 5000 | 5000 | 10.65 | computer |
| 1 | 2500 | 2500 | 15.27 | bike |
| 2 | 2400 | 2400 | 15.96 | real estate business |
| 3 | 10000 | 10000 | 13.49 | personel |
| 4 | 3000 | 3000 | 12.69 | personal |
| 5 | 5000 | 5000 | 7.90 | my wedding loan i promise to pay back |
| 6 | 7000 | 7000 | 15.96 | loan |
| 7 | 3000 | 3000 | 18.64 | car downpayment |
You can look at the pandas user guide for a more thorough demonstration of string methods in pandas.
Most algorithms require numerical data. If you have qualitative data, you need to convert it to a numerical form. This is called encoding.
For a category with ordered values, it’s usually best to replace the categories with integers that reflect the ordering.
Example 1.27 This dataset contains information about diamonds:
import seaborn as sns
= sns.load_dataset("diamonds")
diamonds diamonds.head()
|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
As you can see above, some of the features (cut, color, clarity) have string designations. We could convert cut to an ordinal categorical variable like so:
= ["Fair", "Good", "Very Good", "Premium", "Ideal"]
cuts = pd.Categorical(diamonds["cut"], categories=cuts, ordered=True) # ordinal
cuts_cat cuts_cat
['Ideal', 'Premium', 'Good', 'Premium', 'Good', ..., 'Ideal', 'Good', 'Very Good', 'Premium', 'Ideal']
Length: 53940
Categories (5, object): ['Fair' < 'Good' < 'Very Good' < 'Premium' < 'Ideal']
Replacing the strings with categories in the cut column makes no apparent change:
"cut"] = cuts_cat
diamonds[ diamonds.head()
|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
However, we could convert those values to integers with the same ordering:
"cut"] = cuts_cat.codes
diamonds[ diamonds.head()
|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 4 | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | 3 | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | 1 | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | 3 | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | 1 | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
For nominal categories that have no natural ordering, it’s typically best to encode via dummy variables rather than integers.
Example 1.28 Here is a dataset of NBA player statistics from 2023–2024:
= pd.read_csv("_datasets/NBA player stats 2023-24.csv")
nba nba.head()
|   | RANK | NAME | TEAM | POS | AGE | GP | MPG | USG% | TO% | FTA | ... | APG | SPG | BPG | TPG | P+R | P+A | P+R+A | VI | ORtg | DRtg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | Joel Embiid | Phi | C | 30.2 | 6 | 41.4 | 35.7 | 15.8 | 78 | ... | 5.7 | 1.2 | 1.5 | 4.2 | 43.8 | 38.7 | 49.5 | 12.2 | 117.1 | 108.0 |
| 1 | NaN | Jalen Brunson | Nyk | G | 27.8 | 13 | 39.8 | 36.4 | 9.3 | 120 | ... | 7.5 | 0.8 | 0.2 | 2.7 | 35.7 | 39.8 | 43.2 | 9.3 | 114.8 | 114.7 |
| 2 | NaN | Damian Lillard | Mil | G | 33.9 | 4 | 39.1 | 31.4 | 10.0 | 38 | ... | 5.0 | 1.0 | 0.0 | 2.3 | 34.5 | 36.3 | 39.5 | 8.2 | 127.6 | 115.7 |
| 3 | NaN | Shai Gilgeous-Alexander | Okc | G | 25.9 | 10 | 39.9 | 32.3 | 8.9 | 81 | ... | 6.4 | 1.3 | 1.7 | 2.2 | 37.4 | 36.6 | 43.8 | 11.2 | 118.3 | 106.9 |
| 4 | NaN | Tyrese Maxey | Phi | G | 23.6 | 6 | 44.6 | 28.1 | 8.6 | 28 | ... | 6.8 | 0.8 | 0.3 | 2.2 | 35.0 | 36.7 | 41.8 | 9.1 | 120.9 | 113.3 |

5 rows × 29 columns
For this example, we will narrow the focus to the NAME and POS columns:
= nba[["NAME", "POS"]]
nba 10) nba.head(
|   | NAME | POS |
|---|---|---|
| 0 | Joel Embiid | C |
| 1 | Jalen Brunson | G |
| 2 | Damian Lillard | G |
| 3 | Shai Gilgeous-Alexander | G |
| 4 | Tyrese Maxey | G |
| 5 | Donovan Mitchell | G |
| 6 | Luka Doncic | G |
| 7 | Nikola Jokic | C |
| 8 | LeBron James | F |
| 9 | Anthony Davis | C |
The POS column is a categorical variable (the player’s position):
"POS"].unique() nba[
array(['C', 'G', 'F', 'G-F', 'F-G', 'C-F', 'F-C'], dtype=object)
We can use pd.get_dummies to convert this column to dummy variables:
= pd.get_dummies(nba["POS"])
dum1 10) dum1.head(
|   | C | C-F | F | F-C | F-G | G | G-F |
|---|---|---|---|---|---|---|---|
| 0 | True | False | False | False | False | False | False |
| 1 | False | False | False | False | False | True | False |
| 2 | False | False | False | False | False | True | False |
| 3 | False | False | False | False | False | True | False |
| 4 | False | False | False | False | False | True | False |
| 5 | False | False | False | False | False | True | False |
| 6 | False | False | False | False | False | True | False |
| 7 | True | False | False | False | False | False | False |
| 8 | False | False | True | False | False | False | False |
| 9 | True | False | False | False | False | False | False |
As you can see, the POS column was replaced by 7 columns, one for each of the unique positions in the dataset. But this result is a little awkward, because a position value such as "C-F" means both center and forward. We can use a different encoding scheme to handle this more sensibly:
= nba["POS"].str.get_dummies("-") # split on hyphen
dum2 10) dum2.head(
|   | C | F | G |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 0 | 1 |
| 6 | 0 | 0 | 1 |
| 7 | 1 | 0 | 0 |
| 8 | 0 | 1 | 0 |
| 9 | 1 | 0 | 0 |
Now we can manually paste together the original data frame with the dummy variables:
nba = pd.concat([nba, dum2], axis="columns")
nba.head(10)
|   | NAME | POS | C | F | G |
|---|---|---|---|---|---|
| 0 | Joel Embiid | C | 1 | 0 | 0 |
| 1 | Jalen Brunson | G | 0 | 0 | 1 |
| 2 | Damian Lillard | G | 0 | 0 | 1 |
| 3 | Shai Gilgeous-Alexander | G | 0 | 0 | 1 |
| 4 | Tyrese Maxey | G | 0 | 0 | 1 |
| 5 | Donovan Mitchell | G | 0 | 0 | 1 |
| 6 | Luka Doncic | G | 0 | 0 | 1 |
| 7 | Nikola Jokic | C | 1 | 0 | 0 |
| 8 | LeBron James | F | 0 | 1 | 0 |
| 9 | Anthony Davis | C | 1 | 0 | 0 |
Here’s how it looks for a player who has multiple positions:
"NAME"] == "Kelly Oubre Jr."] nba[nba[
|   | NAME | POS | C | F | G |
|---|---|---|---|---|---|
| 54 | Kelly Oubre Jr. | F-G | 0 | 1 | 1 |
Exercise 1.1 For each type of data, classify it as discrete quantitative, continuous quantitative, categorical, or other.
Exercise 1.2 Give the length of each vector or series.
Exercise 1.3 Describe a scheme for creating dummy variables for the days of the week. Use your scheme to encode the vector:
[Tuesday, Sunday, Friday, Tuesday, Monday]