First we have to discuss how to represent data, both abstractly and in Python.
Definition 1.1 A quantitative value is one that is numerical and supports meaningful comparison and arithmetic operations.
Quantitative data is further divided into continuous and discrete types. The difference is the same as between real numbers and integers.
Example 1.1 Some continuous quantitative data sources:

- physical measurements, such as temperature, mass, and voltage
- elapsed time

Examples of discrete quantitative data:

- counts, such as the number of students in a class
- the outcome of a die roll
Sometimes there may be room for interpretation or context. For example, the retail price of a gallon of milk might be regarded as discrete data, since it technically represents a whole number of pennies. But in finance, transactions are regularly computed to much higher precision, so it might make more sense to interpret prices as continuous values. As a rule of thumb, if there are many possible values, or the values are the result of a measurement, the continuous interpretation is usually more appropriate.
Not all numerical values represent truly quantitative data. ZIP codes (postal codes) in the U.S. are 5-digit numbers, and while there is some logic to how they were assigned, there is no clearly meaningful interpretation of averaging them, for instance.
For both continuous and discrete quantities, it makes sense to order different values, compute averages of them, etc. (However, averages of discrete quantities are continuous.)
Mathematically, the real and integer number sets are infinite, but computers are finite machines. Integers are represented exactly within some range that is determined by how many binary bits are dedicated. The computational analog of real numbers are floating-point numbers, or more simply, floats. These are bounded in range as well as discretized. The details are complicated, but essentially, the floating-point numbers have about 16 significant decimal digits by default—which is virtually always far more precision than real data offers.
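We can check this precision claim by asking NumPy for the machine epsilon, the spacing between 1 and the next larger float; its size, about \(2.2\times 10^{-16}\), reflects the roughly 16 significant digits:

import numpy as np
# spacing between 1.0 and the next representable float
print( np.finfo(float).eps )
2.220446049250313e-16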
Example 1.2 If you enter an integer, it is interpreted as an int:
# This is an integer (int type)
print(3, "is a", type(3))
3 is a <class 'int'>
To get the floating-point version, you add a decimal point:
# This is a real number (float type)
print(3.0, "is a", type(3.0))
3.0 is a <class 'float'>
You can also use scientific notation:
print(6.02e23, "is Avogadro's number")
print(6.62607e-34, "is Planck's constant in Joule-seconds")
6.02e+23 is Avogadro's number
6.62607e-34 is Planck's constant in Joule-seconds
There are also explicit converter functions. The conversion of an int to a float is generally exact:
# Convert int to float (generally no change to numerical value)
print( "float(3) creates",float(3) )
float(3) creates 3.0
But the conversion of a float to an int truncates the number, which can cause subtle bugs:
# Truncate float to int
print( "int(3.14) creates", int(3.14) )
int(3.14) creates 3
Because operations on floats are not exact, results can be surprising:
print(0.2 - 0.1 == 0.1) # as expected
print(3.1 - 3.0 == 0.1) # uh-oh
True
False
The NumPy function isclose is a safer way to compare floats:
import numpy as np
print( np.isclose(3.1 - 3.0, 0.1) )
True
Be careful about comparing floating-point values. Exact comparisons may fail due to rounding errors.
There are two additional quasi-numerical float values to be aware of as well.
For numerical work in Python, the NumPy package is indispensable. We will use it often, and it is also loaded and used by most other scientifically oriented packages.
The value inf in NumPy stands for infinity.
By definition, every operation that involves a NaN value results in a NaN.
Example 1.3 Infinity is greater than every number, and some arithmetic operations with infinity are well defined:
import numpy as np
print( "(np.inf + 5) is", np.inf + 5 )
print( "(np.inf + np.inf) is", np.inf + np.inf )
print( "(5 - np.inf) is", 5 - np.inf )
(np.inf + 5) is inf
(np.inf + np.inf) is inf
(5 - np.inf) is -inf
However, in calculus you learned that some expressions with infinity are considered to be undefined without additional information (e.g., L’Hôpital’s Rule):
print( "np.inf / np.inf is", np.inf / np.inf )
np.inf / np.inf is nan
The result nan above stands for Not a Number. It is the result of indeterminate arithmetic operations, like \(\infty/\infty\) and \(\infty - \infty\). It is also used sometimes as a placeholder for missing data.
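For instance, adding to or multiplying a NaN just produces another NaN:

print( "np.nan + 1 is", np.nan + 1 )
print( "0 * np.nan is", 0 * np.nan )
np.nan + 1 is nan
0 * np.nan is nan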
One notorious trap of using NaN is that nan == nan is not True!
print( "np.nan == np.nan is", np.nan == np.nan )
np.nan == np.nan is False
Instead, you should use np.isnan to check for NaN values:
print( "np.isnan(np.nan) is", np.isnan(np.nan) )
np.isnan(np.nan) is True
Handling times and dates can be tricky. Aside from headaches such as time zones and leap years, there are many different ways people and machines represent dates, and with varying amounts of precision. Python has its own inbuilt system for handling dates and times, but we will show the facilities provided by the NumPy package, which is more comprehensive.
There are two basic types:
Definition 1.2 A datetime is a representation of an instant in time. A time delta is a representation of a duration; i.e., a difference between two datetimes.
Native Python uses a datetime type, while NumPy uses datetime64.
Example 1.4 You can use many natural date formats within strings:
import numpy as np
"2020-01-17") # YYYY-MM-DD
np.datetime64("1969-07-20T20:17") # YYYY-MM-DDThh:mm np.datetime64(
numpy.datetime64('1969-07-20T20:17')
# Current date and time, down to the second
np.datetime64("now")
numpy.datetime64('2024-09-13T18:16:39')
A time delta in NumPy indicates its units (i.e., its granularity).
"1969-07-20T20:17") - np.datetime64("today") np.datetime64(
numpy.timedelta64(-29006143,'m')
Most interesting phenomena are characterized and influenced by more than one factor. Collections of values therefore play a central role in data science. The workhorse type for collections in base Python is the list. However, we’re going to need some more powerful metaphors and tools as well, beginning with vectors.
Definition 1.3 A vector is a collection of values called elements, all of the same type, indexed by consecutive integers.
In math, vector indexes usually begin with 1. In Python, they begin with 0.
A vector with \(n\) elements is often referred to as an \(n\)-vector, and we say that \(n\) is the length of the vector. In math we often use \(\mathbb{R}^n\) to denote the set of all \(n\)-vectors with real-valued elements.
Example 1.5 The usual way to work with vectors in Python is through NumPy:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
x
array([1, 2, 3, 4, 5])
A vector has a data type for its elements:
x.dtype
dtype('int64')
Any float values in the vector cause the data type of the entire vector to be float:
y = np.array([1.0, 2, 3, 4, 5])
y.dtype
dtype('float64')
Use len to determine the length of a vector:
len(x)
5
You can create a special type of vector called a range that has equally spaced elements:
np.arange(0, 5)
array([0, 1, 2, 3, 4])
np.arange(1, 3, 0.5)
array([1. , 1.5, 2. , 2.5])
The syntax here is (start, stop, step).
In base Python and in NumPy, the last element of a range is omitted. This is guaranteed to cause confusion if you are used to just about any other computer language.
There are functions for creating vectors of ones and zeros:
print(np.ones(5))
print(np.zeros(4))
[1. 1. 1. 1. 1.]
[0. 0. 0. 0.]
The elements of a vector can be accessed directly using square brackets notation.
Example 1.6 Use square brackets to refer to an element of a vector, remembering that the first element has index 0:
x[0]
1
x[4]
5
Negative values for the index are counted from the end. The last element of a vector always has index -1, and more-negative values move backward through the elements:
x[-1]
5
x[-3]
3
x[-len(x)]
1
Element references can also be on the left side of an assignment:
x[2] = -3
x
array([ 1, 2, -3, 4, 5])
Note, however, that once the data type of a vector is set, it can’t be changed:
x[0] = 1.234
x    # the float was truncated to int, without warning!
array([ 1, 2, -3, 4, 5])
You can also use a list in square brackets to access multiple elements at once:
x[ [0, 2, 4] ]
array([ 1, -3, 5])
The result of a list reference is a new vector, not a number. Note the difference here:
x[0]
1
x[ [0] ]
array([1])
You can also access multiple elements of a vector by using a list or range as the index. In the latter case, this is called slicing.
As with ranges in base Python, the last element of a slice range is omitted. This causes many headaches and bugs.
Example 1.7 The syntax of a slice is start:stop.
print(x[0:3])
print(x[-3:-1])
[ 1 2 -3]
[-3 4]
If you want to use a range that skips over elements, you can use a third argument in the form start:stop:step:
x[0:2:5]
array([1])
When the start of the range is omitted, it means “from the beginning”, and when stop is omitted, it means “through to the end.” Hence, [:k] means “first \(k\) elements” and [-k:] means “last \(k\) elements”:
x[:3]
array([ 1, 2, -3])
x[-3:]
array([-3, 4, 5])
And we also have this idiom:
x[::-1]    # reverse the vector
array([ 5, 4, -3, 2, 1])
NumPy will happily allow a slice to extend past the valid indexes. It just returns as much as is available, without warning or error:
x[:10]
array([ 1, 2, -3, 4, 5])
A vector is a special case of a more general construct.
Definition 1.4 An array is a collection of values called elements, all of the same type, indexed by one or more sets of consecutive integers. The number of indexes needed to specify a value is the dimension of the array.
A dimension is called an axis in NumPy and related packages.
The term matrix is often used simply to mean a 2D array. Technically, though, a matrix should have only numerical values, and matrices obey certain properties that make them important mathematical objects. These properties and their consequences are studied in linear algebra.
For us, arrays will mostly be the results of importing or working on data. But it’s occasionally useful to know how to build them from scratch.
Example 1.8 One way to construct an array is by a list comprehension:
A = np.array([[j - i for j in range(6)] for i in range(4)])
A
array([[ 0, 1, 2, 3, 4, 5],
[-1, 0, 1, 2, 3, 4],
[-2, -1, 0, 1, 2, 3],
[-3, -2, -1, 0, 1, 2]])
The shape of an array is what we would often call the size:
A.shape
(4, 6)
There is no difference between a vector and a 1D array:
x.shape
(5,)
There is also no difference between a 2D array and a vector of vectors giving the rows of the array:
R = np.array([[1, 2, 3], [4, 5, 6]])
R
array([[1, 2, 3],
[4, 5, 6]])
Here are some other common ways to construct arrays.
np.ones(5)
array([1., 1., 1., 1., 1.])
np.zeros((3, 6))
array([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]])
np.repeat(np.pi, 3)
array([3.14159265, 3.14159265, 3.14159265])
You can also stack arrays vertically or horizontally to create new arrays.
np.hstack(( np.ones((2, 2)), np.zeros((2, 3)) ))
array([[1., 1., 0., 0., 0.],
[1., 1., 0., 0., 0.]])
np.vstack(( range(5), range(5, 0, -1) ))
array([[0, 1, 2, 3, 4],
[5, 4, 3, 2, 1]])
One way to think about a 2D array is as a vector of rows, each a vector of the same length. That idea is reflected in NumPy syntax.
Example 1.9 We can use successive brackets to refer to an element, row then column:
R[1][2]    # second row, third column
6
It’s equivalent, and typically more convenient, to use a single bracket set with indexes separated by commas:
R[1, 2]    # second row, third column
6
You can use slicing ranges in each dimension individually.
R[:1, -2:]    # first row, last two columns
array([[2, 3]])
The result above is another 2D array. Note how this result is subtly different:
R[0, -2:]
array([2, 3])
Because we accessed an individual row, not a slice, the result is one dimension lower: a vector. Finally, a : in one slice position means to keep everything in that dimension.
A[:, :2]    # all rows, first 2 columns
array([[ 0, 1],
[-1, 0],
[-2, -1],
[-3, -2]])
A common task is to reduce an array along one dimension, called an axis in NumPy, resulting in an array of one less dimension.
Example 1.10 The first dimension of an array is axis number 0, the second is axis number 1, etc.
A
array([[ 0, 1, 2, 3, 4, 5],
[-1, 0, 1, 2, 3, 4],
[-2, -1, 0, 1, 2, 3],
[-3, -2, -1, 0, 1, 2]])
np.sum(A, axis=0)    # sum along the rows
array([-6, -2, 2, 6, 10, 14])
Even when we sum along the column dimension, axis=1
, the result is a vector that we would interpret as a row, not a column.
np.sum(A, axis=1)    # sum along the columns
array([15, 9, 3, -3])
If you don’t specify an axis, the reduction occurs over all dimensions at once, resulting in a single number.
np.sum(A)
24
You can also do reductions with np.max, np.min, np.mean, etc.
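Here is a quick illustration of those reducers on a small made-up array:

B = np.array([[1, 2], [3, 4]])
print( np.max(B) )            # largest element overall
print( np.min(B, axis=0) )    # smallest element in each column
print( np.mean(B, axis=1) )   # average of each row
4
[1 2]
[1.5 3.5]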
A qualitative value is one that is not quantitative. There are numerous types, but we will consider only a few of them here. Ultimately, we need to be able to represent qualitative data in a quantitative way for use in algorithms.
Definition 1.5 Categorical data has values drawn from a finite set \(S\) of categories. If the members of \(S\) support meaningful ordering comparisons, then the data is ordinal; otherwise, it is nominal.
Example 1.11 Examples of ordinal categorical data:

- letter grades
- survey responses on a scale from “strongly disagree” to “strongly agree”

Examples of nominal categorical data:

- species of animal
- ZIP codes
There are nuanced cases. For instance, letter grades are themselves ordinal categorical data. However, schools convert them to discrete quantitative data and then compute a continuous quantitative GPA.
One way to quantify ordinal categorical data is to assign integer values to the categories in a manner that preserves ordering. This approach can be suspect, though, when it comes to operations such as averaging or computing a distance between values.
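For instance, here is a minimal sketch, using made-up clothing sizes, of an order-preserving integer coding:

values = ["small", "large", "medium", "small"]
coding = {"small": 0, "medium": 1, "large": 2}    # preserves small < medium < large
print( [coding[v] for v in values] )
[0, 2, 1, 0]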
Another means of quantifying categorical data is called dummy variables in classical statistics and one-hot encoding in much of machine learning. Suppose a variable \(x\) has values in a category set that has \(m\) members, i.e., \(S = \{c_1,\ldots,c_m\}\). There are two ways to replace a value \(v\) with dummy variables. The first uses \(m\) binary variables \(x_1,\ldots,x_m\), defined by

\[ x_i = \begin{cases} 1, & v = c_i, \\ 0, & v \neq c_i. \end{cases} \]

The second uses only \(m-1\) of these variables, so that the omitted category is encoded by a row of all zeros.
Example 1.12 Suppose that the stooge variable can take the values Moe, Larry, Curly, or Shemp, and we have the dataset
"Curly", "Moe", "Curly", "Larry", "Shemp", "Moe"] [
['Curly', 'Moe', 'Curly', 'Larry', 'Shemp', 'Moe']
If we use 4 dummy variables, the data would be replaced by the array
|   | Curly | Larry | Moe | Shemp |
|---|---|---|---|---|
| 0 | True | False | False | False |
| 1 | False | False | True | False |
| 2 | True | False | False | False |
| 3 | False | True | False | False |
| 4 | False | False | False | True |
| 5 | False | False | True | False |
If we use only 3 dummy variables, we would get
|   | Larry | Moe | Shemp |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | True | False |
| 2 | False | False | False |
| 3 | True | False | False |
| 4 | False | False | True |
| 5 | False | True | False |
The all-zero rows above correspond to the Curly values.
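Both encodings above can be produced by the pandas function get_dummies; in this sketch, passing drop_first=True omits the first category in sorted order, which here is Curly:

import pandas as pd
stooges = ["Curly", "Moe", "Curly", "Larry", "Shemp", "Moe"]
dummies4 = pd.get_dummies(stooges)                    # 4 dummy variables
dummies3 = pd.get_dummies(stooges, drop_first=True)   # 3 dummy variables; Curly rows are all zeros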
Text is a ubiquitous data source. One way to quantify text is to use a dictionary of interesting keywords \(w_1,\ldots,w_n\). Given a collection of documents \(d_1,\ldots,d_m\), we can define an \(m\times n\) document–term matrix \(T\) by letting \(T_{ij}\) be the number of times term \(j\) appears in document \(i\).
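Here is a minimal sketch of a document–term matrix, using a made-up corpus and term list and counting only exact word matches:

import numpy as np
docs = ["the cat sat", "the cat saw the dog", "dog and cat"]
terms = ["cat", "dog", "sat"]
T = np.array([[d.split().count(w) for w in terms] for d in docs])
print(T)
[[1 0 1]
 [1 1 0]
 [1 1 0]]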
The most straightforward way to represent an image is as a 3D array of values giving the intensities of red, green, and blue in each pixel. Sometimes it might be preferable to represent the image by a vector of statistics about these values, or by the presence or absence of detected objects, etc.
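For example, a tiny all-red image could be built as a 3D array like this (a sketch; real images would be loaded from a file):

import numpy as np
img = np.zeros((2, 2, 3), dtype=np.uint8)   # rows x columns x (red, green, blue)
img[:, :, 0] = 255                          # maximum intensity in the red channel
print(img.shape)
(2, 2, 3)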
The most popular Python package for manipulating and analyzing data is pandas. We will use the paradigm it presents, which is fairly well understood throughout data science.
Definition 1.6 A series is a vector that is indexed by a finite ordered set.
In NumPy, we always index vectors by absolute position. In pandas, the elements still have an underlying ordering, but we can index them by anything, including strings. Since names are more memorable and intuitive than arbitrary integers, using them makes code easier to write, read, and debug.
Example 1.13 Some data that can be viewed as series:

- daily high temperatures, indexed by date
- populations, indexed by country name
Example 1.14 Here is a small pandas series. Note that we give the values first and then the index:
import pandas as pd
= ["Marge", "Homer", "Bart", "Lisa", "Maggie"]
names = pd.Series( ["March", "May", "February", "May", "January"], index=names)
months months
Marge March
Homer May
Bart February
Lisa May
Maggie January
dtype: object
We can access the values by choosing a name in the index:
"Lisa"] months[
'May'
If we access multiple values, we get a series that is a subset of the original:
"Homer", "Marge"] ] months[ [
Homer May
Marge March
dtype: object
We can also use the iloc property to access the underlying vector by NumPy slicing:
months.iloc[-3:]
Bart February
Lisa May
Maggie January
dtype: object
Here is a series of years based on the same index:
years = pd.Series([1955, 1956, 1979, 1981, 1988], index=names)
years
Marge 1955
Homer 1956
Bart 1979
Lisa 1981
Maggie 1988
dtype: int64
We can also start with the series and separate it into a vector of values and its index:
print("Values:")
print(years.values)
print("\nIndex:")
print(years.index)
Values:
[1955 1956 1979 1981 1988]
Index:
Index(['Marge', 'Homer', 'Bart', 'Lisa', 'Maggie'], dtype='object')
The following table lists some common operations on pandas series. There is an exhaustive list in the pandas documentation.
| Description | Syntax | Result |
|---|---|---|
| First or last entries | s.head(), s.tail() | Series |
| Length | len(s) | integer |
| All of the values | s.values | array |
| Convert values to list | list(s) | list |
| Index | s.index | Index |
| Unique values | s.unique() | array |
| Appearance frequencies | s.value_counts() | Series |
| Extreme values | s.min(), s.max() | number |
| Sort by values | s.sort_values() | Series |
| Sum | s.sum() | number |
| Comparison | s > 1, s == "foo" | boolean Series |
| Locate missing | s.isna() | boolean Series |
| Arithmetic | s + 1, s * t | Series |
| Delete one or more rows | s.drop() | Series |
As always in Python, you need to pay attention to the difference between applying a function, like foo(bar), accessing a property, like foo.bar, and calling an object method, foo.bar(). Extra or missing parentheses groups can cause errors.
Example 1.15 We continue with the definitions in Example 1.14. Here are all the unique values of the months:
months.unique()
array(['March', 'May', 'February', 'January'], dtype=object)
Here are the frequencies of their appearance:
months.value_counts()
May 2
March 1
February 1
January 1
Name: count, dtype: int64
Note that we can do a logical comparison and get a series of boolean values:
== "May" months
Marge False
Homer True
Bart False
Lisa True
Maggie False
dtype: bool
Since the years are numerical, we can do arithmetic operations on them:
print("Minimum year is", years.min(), "and maximum year is", years.max())
print("The sum of the years is", years.sum())
Minimum year is 1955 and maximum year is 1988
The sum of the years is 9859
It can be useful to know the index where the minimum or maximum value occurs:
print("The oldest is", years.idxmin(), "and the youngest is", years.idxmax())
The oldest is Marge and the youngest is Maggie
Definition 1.7 A data frame is a collection of series that all share the same index set.
Essentially, a data frame is a 2D array with a common index for the rows, where the constituent series are the columns of the array. Those columns can also be indexed by names (strings). Since names are much more memorable and intuitive than arbitrarily assigned integers, data frames make code easier to write, read, and reason about.
Curly braces { } are used to construct a dictionary in Python.
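For example:

d = {"apple": 1, "banana": 2}   # keys mapped to values
print( d["banana"] )
2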
Example 1.16 We continue with the definitions of months and years in Example 1.14, and use a dictionary to specify the columns of a data frame:
= pd.DataFrame( {"month": months, "date": [19, 12, 23, 9, 14], "year": years } )
simpsons simpsons
|   | month | date | year |
|---|---|---|---|
| Marge | March | 19 | 1955 |
| Homer | May | 12 | 1956 |
| Bart | February | 23 | 1979 |
| Lisa | May | 9 | 1981 |
| Maggie | January | 14 | 1988 |
Notice above that the date column was given as just a list of numbers; the index for it was inherited from the months series.
If we give the name of a column as an index into the frame, we get back a series for that column:
"year"] simpsons[
Marge 1955
Homer 1956
Bart 1979
Lisa 1981
Maggie 1988
Name: year, dtype: int64
If we give a list of column names, we get back a data frame with just those columns:
"month", "date"] ] simpsons[ [
|   | month | date |
|---|---|---|
| Marge | March | 19 |
| Homer | May | 12 |
| Bart | February | 23 |
| Lisa | May | 9 |
| Maggie | January | 14 |
We can even index a nonexistent column on the left side of an assignment, and pandas will create it for us:
"weekday"] = ["Saturday", "Saturday", "Friday", "Saturday", "Thursday"]
simpsons[ simpsons
|   | month | date | year | weekday |
|---|---|---|---|---|
| Marge | March | 19 | 1955 | Saturday |
| Homer | May | 12 | 1956 | Saturday |
| Bart | February | 23 | 1979 | Friday |
| Lisa | May | 9 | 1981 | Saturday |
| Maggie | January | 14 | 1988 | Thursday |
We can access a row by using brackets with the loc property of the frame, getting a series indexed by the column names of the frame:
"Maggie"] simpsons.loc[
month January
date 14
year 1988
weekday Thursday
Name: Maggie, dtype: object
We are also free to strip away the index and get an ordinary array:
simpsons.to_numpy()
array([['March', 19, 1955, 'Saturday'],
['May', 12, 1956, 'Saturday'],
['February', 23, 1979, 'Friday'],
['May', 9, 1981, 'Saturday'],
['January', 14, 1988, 'Thursday']], dtype=object)
Another way to create a data frame is to give a list of rows, plus (optionally) the index and the names of the columns:
pd.DataFrame(
    [ ("white", 5), ("brown", 9), ("pink", 8) ],
    columns=["Color", "Rating"],
    index=["vanilla", "chocolate", "strawberry"]
)
|   | Color | Rating |
|---|---|---|
| vanilla | white | 5 |
| chocolate | brown | 9 |
| strawberry | pink | 8 |
The following table shows common operations specific to a data frame. But since the columns are series, the operations in the preceding table can be applied too, and they will act columnwise.
| Description | Syntax | Result |
|---|---|---|
| Summary information | df.info(), df.describe() | DataFrame |
| First or last row(s) | df.head(), df.tail() | DataFrame |
| Number of rows | len(df) | integer |
| Number of rows and columns | df.shape | tuple |
| Row index | df.index | Index |
| Column names | list(df) | list |
| Column names | df.columns | Index |
| Access by column name(s) | df["name"], df[["col1", "col2"]] | Series or DataFrame |
| Access by index/name | df.loc[rows, "name"] | (varies) |
| Access by position | df.iloc[i, j] | (varies) |
| Delete a column | df.drop(name, axis=1) | DataFrame |
| Sort by column values | df.sort_values("column") | DataFrame |
Example 1.17 We continue with the definition of simpsons in Example 1.16:
simpsons.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, Marge to Maggie
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 month 5 non-null object
1 date 5 non-null int64
2 year 5 non-null int64
3 weekday 5 non-null object
dtypes: int64(2), object(2)
memory usage: 372.0+ bytes
simpsons.min()
month February
date 9
year 1955
weekday Friday
dtype: object
Summing numbers does what you expect, but summing strings concatenates them:
simpsons.sum()
month MarchMayFebruaryMayJanuary
date 77
year 9859
weekday SaturdaySaturdayFridaySaturdayThursday
dtype: object
We can sort the entire frame by its values in a column:
"date") simpsons.sort_values(
|   | month | date | year | weekday |
|---|---|---|---|---|
| Lisa | May | 9 | 1981 | Saturday |
| Homer | May | 12 | 1956 | Saturday |
| Maggie | January | 14 | 1988 | Thursday |
| Marge | March | 19 | 1955 | Saturday |
| Bart | February | 23 | 1979 | Friday |
For a more thorough demonstration of operations on data frames, look at the pandas user guide. There is also an exhaustive list of relevant functions in the pandas documentation.
Datasets can be presented in many forms. The simplest and most universal are comma-separated value (CSV) files. Such files can be read by pandas locally or over the web.
Example 1.18 The pandas function we use to load a dataset from CSV is read_csv. Here we use it to read a file that is available over the web:
= pd.read_csv("https://raw.githubusercontent.com/tobydriscoll/ds1book/master/advertising.csv")
ads 6) # show the first 6 rows ads.head(
|   | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 12.0 |
| 3 | 151.5 | 41.3 | 58.5 | 16.5 |
| 4 | 180.8 | 10.8 | 58.4 | 17.9 |
| 5 | 8.7 | 48.9 | 75.0 | 7.2 |
Note above that we used head(6) to see just the first 6 rows.
It’s a good idea to check what you’ve just loaded:
ads.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TV 200 non-null float64
1 Radio 200 non-null float64
2 Newspaper 200 non-null float64
3 Sales 200 non-null float64
dtypes: float64(4)
memory usage: 6.4 KB
Here is one thing to watch for when loading numerical data: if a column has anything non-numeric in it, like “N/A” or “unknown”, pandas will load that column as a string.
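Here is a small illustration using a made-up CSV string. (Note that pandas already treats a few standard markers, such as "N/A", as missing values, so the example uses a nonstandard one.)

import io
import pandas as pd
csv = "x\n1.5\n2.0\nunknown\n"
df = pd.read_csv(io.StringIO(csv))
print( df["x"].dtype )    # object (strings), because of the "unknown" entry
object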
While it’s possible to import datasets directly from the web, links change and disappear frequently. If storing the dataset locally is not a problem, you may want to download your own copy before working on it.
It’s also possible to import data from most other general-purpose formats you might encounter, such as Excel spreadsheets (though possibly requiring one-time installation of additional libraries). There are many functions whose names start with pd.read_, covering the formats pandas understands.
In a data frame, we access columns by giving a string name, or a list of names, in square brackets. We can access a row by the loc property with an index value, or iloc with an absolute row number starting from zero.
Example 1.19 Here’s a local file that contains daily weather summaries from Newark, Delaware:
= pd.read_csv("_datasets/ghcn_newark.csv")
weather weather.head()
|   | STATION | DATE | LATITUDE | LONGITUDE | ELEVATION | NAME | PRCP | PRCP_ATTRIBUTES | SNOW | SNOW_ATTRIBUTES | ... | WT08 | WT08_ATTRIBUTES | WT11 | WT11_ATTRIBUTES | WT14 | WT14_ATTRIBUTES | WT16 | WT16_ATTRIBUTES | WT18 | WT18_ATTRIBUTES |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USC00076410 | 1894-04-01 | 39.6682 | -75.74569 | 32.3 | NEWARK AG FARM, DE US | 0.0 | P,,6, | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | USC00076410 | 1894-04-02 | 39.6682 | -75.74569 | 32.3 | NEWARK AG FARM, DE US | 0.0 | P,,6, | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | USC00076410 | 1894-04-03 | 39.6682 | -75.74569 | 32.3 | NEWARK AG FARM, DE US | 0.0 | P,,6, | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | USC00076410 | 1894-04-04 | 39.6682 | -75.74569 | 32.3 | NEWARK AG FARM, DE US | 165.0 | ,,6, | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | USC00076410 | 1894-04-05 | 39.6682 | -75.74569 | 32.3 | NEWARK AG FARM, DE US | 0.0 | P,,6, | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

5 rows × 56 columns
There are lots of columns here. We will work with a small subset of them:
= ["DATE", "PRCP", "SNOW", "TMAX", "TMIN"]
columns = weather[columns]
weather weather.head()
|   | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|
| 0 | 1894-04-01 | 0.0 | NaN | 189.0 | 111.0 |
| 1 | 1894-04-02 | 0.0 | NaN | 133.0 | 22.0 |
| 2 | 1894-04-03 | 0.0 | NaN | 89.0 | -61.0 |
| 3 | 1894-04-04 | 165.0 | NaN | 117.0 | 17.0 |
| 4 | 1894-04-05 | 0.0 | NaN | 156.0 | 56.0 |
The first row has absolute position zero:
weather.iloc[0]
DATE 1894-04-01
PRCP 0.0
SNOW NaN
TMAX 189.0
TMIN 111.0
Name: 0, dtype: object
The result above is a series, with the column names of the original data frame serving as the index. The following result is subtly different:
weather.iloc[:1]
|   | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|
| 0 | 1894-04-01 | 0.0 | NaN | 189.0 | 111.0 |
The slicing syntax [:1] means “the first 1 rows.” The result is a one-row data frame, not a series. To access the last 4 rows, we can use a negative index:
weather.iloc[-4:]
|   | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|
| 29513 | 2022-12-17 | 0.0 | NaN | 72.0 | 11.0 |
| 29514 | 2022-12-18 | 0.0 | NaN | 39.0 | -22.0 |
| 29515 | 2023-01-06 | 0.0 | NaN | 89.0 | 33.0 |
| 29516 | 2023-01-07 | 0.0 | NaN | 72.0 | 17.0 |
In Example 1.19, we did not specify an index for the data frame, so the row numbers were the index. If we give a data frame a different index, we can use loc to access rows by the index value.
Example 1.20 A natural candidate for an index in the weather data is the DATE column. This can be specified when the file is read, but here we do it after the fact:
= weather.set_index("DATE")
dated_weather dated_weather.head()
| DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|
| 1894-04-01 | 0.0 | NaN | 189.0 | 111.0 |
| 1894-04-02 | 0.0 | NaN | 133.0 | 22.0 |
| 1894-04-03 | 0.0 | NaN | 89.0 | -61.0 |
| 1894-04-04 | 165.0 | NaN | 117.0 | 17.0 |
| 1894-04-05 | 0.0 | NaN | 156.0 | 56.0 |
Notice above that the leftmost column of dated_weather is the index, which still has the column name DATE. We can access a row by giving a date string to loc:
"1979-03-28"] dated_weather.loc[
PRCP 0.0
SNOW 0.0
TMAX 111.0
TMIN -33.0
Name: 1979-03-28, dtype: float64
We can extract a row and column simultaneously by giving both to loc:
"1979-03-28", "TMAX"] dated_weather.loc[
111.0
When the index is a date, as is the case here, we can even use a range of dates:
"1979-03-01":"1979-03-06"] dated_weather.loc[
| DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|
| 1979-03-01 | 0.0 | 0.0 | 122.0 | -22.0 |
| 1979-03-02 | 3.0 | 0.0 | 128.0 | 33.0 |
| 1979-03-03 | 0.0 | 0.0 | 117.0 | 22.0 |
| 1979-03-04 | 0.0 | 0.0 | NaN | NaN |
| 1979-03-05 | 61.0 | 0.0 | NaN | NaN |
| 1979-03-06 | 173.0 | 0.0 | 128.0 | 83.0 |
Frequently we want to select rows from a data frame based on criteria applied to the data itself. We can use relational operators such as >, <, ==, etc. to select rows based on the values in a column.
Example 1.21 Here is how we can select all the rows in which the value of PRCP is greater than 1200:
"PRCP"] > 1200] weather.loc[weather[
|   | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|
| 9396 | 1952-07-09 | 1377.0 | 0.0 | 261.0 | 200.0 |
| 12382 | 1960-09-12 | 1331.0 | 0.0 | 222.0 | 167.0 |
| 16352 | 1971-08-27 | 1207.0 | 0.0 | 256.0 | 194.0 |
| 26543 | 1999-09-16 | 2202.0 | 0.0 | 206.0 | 156.0 |
| 27938 | 2004-07-12 | 1580.0 | 0.0 | 244.0 | 189.0 |
| 27986 | 2004-09-29 | 1450.0 | 0.0 | 244.0 | 178.0 |
There are two steps above. The expression inside the square brackets produces a series of Boolean values:
= (weather["PRCP"] > 1200)
is_prcp is_prcp.head()
0 False
1 False
2 False
3 False
4 False
Name: PRCP, dtype: bool
This series can be used to select rows with loc. Actually, it works without loc as well, because pandas knows it is a row selector:
print("There are",is_prcp.sum(),"rows with PRCP > 1200.")
weather[is_prcp]
There are 6 rows with PRCP > 1200.
|   | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|
| 9396 | 1952-07-09 | 1377.0 | 0.0 | 261.0 | 200.0 |
| 12382 | 1960-09-12 | 1331.0 | 0.0 | 222.0 | 167.0 |
| 16352 | 1971-08-27 | 1207.0 | 0.0 | 256.0 | 194.0 |
| 26543 | 1999-09-16 | 2202.0 | 0.0 | 206.0 | 156.0 |
| 27938 | 2004-07-12 | 1580.0 | 0.0 | 244.0 | 189.0 |
| 27986 | 2004-09-29 | 1450.0 | 0.0 | 244.0 | 178.0 |
We can use logical operators & (AND), | (OR), and ~ (NOT) on these Boolean series in order to combine criteria:
"TMAX"] < 0) & (weather["SNOW"] > 0)] weather[(weather[
|   | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|
| 272 | 1895-01-29 | 102.0 | 102.0 | -22.0 | -94.0 |
| 282 | 1895-02-08 | 127.0 | 178.0 | -139.0 | -189.0 |
| 692 | 1896-03-24 | 33.0 | 51.0 | -22.0 | -100.0 |
| 943 | 1896-11-30 | 107.0 | 76.0 | -6.0 | -33.0 |
| 966 | 1896-12-23 | 48.0 | 51.0 | -6.0 | -89.0 |
| ... | ... | ... | ... | ... | ... |
| 25224 | 1996-01-08 | 3.0 | 25.0 | -28.0 | -156.0 |
| 25249 | 1996-02-02 | 20.0 | 25.0 | -22.0 | -56.0 |
| 25250 | 1996-02-03 | 155.0 | 147.0 | -33.0 | -111.0 |
| 25594 | 1997-01-12 | 13.0 | 13.0 | -33.0 | -100.0 |
| 25623 | 1997-03-10 | 178.0 | 8.0 | -17.0 | 167.0 |

131 rows × 5 columns
The pandas user guide has a handy section on selections.
For practice with pandas fundamentals, try the Kaggle course.
Sometimes you want to combine data from different sources. There are multiple ways to do this in pandas, depending on the structure of the data you want to combine and the result you want to receive.
The concat function is used to stack frames or series on top of each other.
Example 1.22 Here we create two frames with three rows each and some commonly named columns:
birds = pd.DataFrame(
    {
        "Name": ["Eagles", "Crows", "Falcons"],
        "Predatory": [True, False, True],
        "Family": ["Accipitridae", "Corvidae", "Falconidae"],
        "Color": ["Brown", "Black", "Blue–Gray"]
    }
)
birds
|   | Name | Predatory | Family | Color |
|---|---|---|---|---|
| 0 | Eagles | True | Accipitridae | Brown |
| 1 | Crows | False | Corvidae | Black |
| 2 | Falcons | True | Falconidae | Blue–Gray |
teams = pd.DataFrame(
    {
        "Name": ["Falcons", "Eagles", "Bears"],
        "Home": ["Atlanta", "Philadelphia", "Chicago"],
        "Color": ["Red", "Green", "Blue"]
    }
)
teams
|   | Name | Home | Color |
|---|---|---|---|
| 0 | Falcons | Atlanta | Red |
| 1 | Eagles | Philadelphia | Green |
| 2 | Bears | Chicago | Blue |
We can use pd.concat() to stack these frames on top of each other:
pd.concat([birds, teams])
|   | Name | Predatory | Family | Color | Home |
|---|---|---|---|---|---|
| 0 | Eagles | True | Accipitridae | Brown | NaN |
| 1 | Crows | False | Corvidae | Black | NaN |
| 2 | Falcons | True | Falconidae | Blue–Gray | NaN |
| 0 | Falcons | NaN | NaN | Red | Atlanta |
| 1 | Eagles | NaN | NaN | Green | Philadelphia |
| 2 | Bears | NaN | NaN | Blue | Chicago |
Notice above that any value that wasn’t defined in the source frames ends up as a missing value in the result.
We can also concatenate the frames side by side by using the axis keyword:
="columns") pd.concat([birds, teams], axis
|   | Name | Predatory | Family | Color | Name | Home | Color |
|---|---|---|---|---|---|---|---|
| 0 | Eagles | True | Accipitridae | Brown | Falcons | Atlanta | Red |
| 1 | Crows | False | Corvidae | Black | Eagles | Philadelphia | Green |
| 2 | Falcons | True | Falconidae | Blue–Gray | Bears | Chicago | Blue |
The result is just a dumb pasting together of the source frames. There is no attempt to make the entries of the Name column match up, even though there were some overlapping values in it.
Concatenation is occasionally useful, but more often data frames should be joined horizontally using the more flexible merge function described in the next section.
The merge function is used to merge frames based on the values within the columns. Generically, each source frame can have a key column that is used to match up rows.
Example 1.23 Continuing with the data frames from Example 1.22:
="Name") pd.merge(birds, teams, on
|   | Name | Predatory | Family | Color_x | Home | Color_y |
|---|---|---|---|---|---|---|
| 0 | Eagles | True | Accipitridae | Brown | Philadelphia | Green |
| 1 | Falcons | True | Falconidae | Blue–Gray | Atlanta | Red |
The on keyword specifies the column that is the key for the merge. In the above merge, the result is a frame with only the rows that have a common value in the Name column. That’s called an inner join, which is the default. An alternative is an outer join, which includes all rows from both frames:
="Name", how="outer") pd.merge(birds, teams, on
|   | Name | Predatory | Family | Color_x | Home | Color_y |
|---|---|---|---|---|---|---|
| 0 | Bears | NaN | NaN | NaN | Chicago | Blue |
| 1 | Crows | False | Corvidae | Black | NaN | NaN |
| 2 | Eagles | True | Accipitridae | Brown | Philadelphia | Green |
| 3 | Falcons | True | Falconidae | Blue–Gray | Atlanta | Red |
Note also above that the column Color that appeared in both of the source frames was renamed to Color_x and Color_y in the result, since otherwise there would be a conflict.
It’s also possible to do a left join or a right join, which use one of the original frames as the basis for the result:
print("Left merge:")
print(pd.merge(birds, teams, on="Name", how="left"))
print("\nRight merge:")
print(pd.merge(birds, teams, on="Name", how="right"))
Left merge:
Name Predatory Family Color_x Home Color_y
0 Eagles True Accipitridae Brown Philadelphia Green
1 Crows False Corvidae Black NaN NaN
2 Falcons True Falconidae Blue–Gray Atlanta Red
Right merge:
Name Predatory Family Color_x Home Color_y
0 Falcons True Falconidae Blue–Gray Atlanta Red
1 Eagles True Accipitridae Brown Philadelphia Green
2 Bears NaN NaN NaN Chicago Blue
The difference between outer join and inner join is the same as between the union and intersection of sets; the sets here are the contents of the "on" column.
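We can check that analogy directly on the Name columns of the frames from Example 1.22:

inner_keys = set(birds["Name"]) & set(teams["Name"])   # intersection: keys of the inner join
outer_keys = set(birds["Name"]) | set(teams["Name"])   # union: keys of the outer join
print( sorted(inner_keys) )
print( sorted(outer_keys) )
['Eagles', 'Falcons']
['Bears', 'Crows', 'Eagles', 'Falcons']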
Example 1.24 Here we load in weather data for Newark, DE, being sure that the column representing the date is parsed as dates:
= pd.read_csv("_datasets/ghcn_newark.csv", parse_dates=["DATE"])
weather = weather[["DATE", "PRCP", "SNOW", "TMAX", "TMIN"]]
weather weather.head()
|   | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|
| 0 | 1894-04-01 | 0.0 | NaN | 189.0 | 111.0 |
| 1 | 1894-04-02 | 0.0 | NaN | 133.0 | 22.0 |
| 2 | 1894-04-03 | 0.0 | NaN | 89.0 | -61.0 |
| 3 | 1894-04-04 | 165.0 | NaN | 117.0 | 17.0 |
| 4 | 1894-04-05 | 0.0 | NaN | 156.0 | 56.0 |
Here is another data frame for the S&P 500 stock market index:
= pd.read_csv("_datasets/SP500_daily.csv", parse_dates=["Date"])
sp500 "Gain"] = sp500["Close/Last"] - sp500["Open"]
sp500[ sp500.head()
|   | Date | Close/Last | Open | High | Low | Gain |
|---|---|---|---|---|---|---|
| 0 | 2024-03-05 | 5078.65 | 5110.52 | 5114.54 | 5056.82 | -31.87 |
| 1 | 2024-03-04 | 5130.95 | 5130.99 | 5149.67 | 5127.18 | -0.04 |
| 2 | 2024-03-01 | 5137.08 | 5098.51 | 5140.33 | 5094.16 | 38.57 |
| 3 | 2024-02-29 | 5096.27 | 5085.36 | 5104.99 | 5061.89 | 10.91 |
| 4 | 2024-02-28 | 5069.76 | 5067.20 | 5077.37 | 5058.35 | 2.56 |
Now we can merge the weather data with the stock data according to the date. Note that here, the key is named differently in the two frames:
merged = pd.merge(sp500, weather, left_on="Date", right_on="DATE")
merged.head()
|   | Date | Close/Last | Open | High | Low | Gain | DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-06 | 3895.08 | 3823.37 | 3906.19 | 3809.56 | 71.71 | 2023-01-06 | 0.0 | NaN | 89.0 | 33.0 |
| 1 | 2022-11-30 | 4080.11 | 3957.18 | 4080.11 | 3938.58 | 122.93 | 2022-11-30 | 97.0 | NaN | 128.0 | 6.0 |
| 2 | 2022-11-29 | 3957.63 | 3964.19 | 3976.77 | 3937.65 | -6.56 | 2022-11-29 | 0.0 | NaN | 83.0 | -22.0 |
| 3 | 2022-11-28 | 3963.94 | 4005.36 | 4012.27 | 3955.77 | -41.42 | 2022-11-28 | 0.0 | NaN | 122.0 | 17.0 |
| 4 | 2022-11-25 | 4026.12 | 4023.34 | 4034.02 | 4020.76 | 2.78 | 2022-11-25 | 13.0 | NaN | 150.0 | 6.0 |
Since we did an inner join, only the rows with a common date in both frames are included:
print("There are", len(sp500), "rows in the S&P 500 data.")
print("There are", len(weather), "rows in the weather data.")
print("There are", len(merged), "rows in the inner-merged data.")
There are 1259 rows in the S&P 500 data.
There are 29517 rows in the weather data.
There are 820 rows in the inner-merged data.
If the key you want to merge on is the index column, the syntax of the merge changes a little. Here is an example:
= weather.set_index("DATE")
weather weather.head()
| DATE | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|
| 1894-04-01 | 0.0 | NaN | 189.0 | 111.0 |
| 1894-04-02 | 0.0 | NaN | 133.0 | 22.0 |
| 1894-04-03 | 0.0 | NaN | 89.0 | -61.0 |
| 1894-04-04 | 165.0 | NaN | 117.0 | 17.0 |
| 1894-04-05 | 0.0 | NaN | 156.0 | 56.0 |
result = pd.merge(sp500, weather, left_on="Date", right_index=True)
result.head()
|   | Date | Close/Last | Open | High | Low | Gain | PRCP | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|---|---|---|---|---|
| 290 | 2023-01-06 | 3895.08 | 3823.37 | 3906.19 | 3809.56 | 71.71 | 0.0 | NaN | 89.0 | 33.0 |
| 315 | 2022-11-30 | 4080.11 | 3957.18 | 4080.11 | 3938.58 | 122.93 | 97.0 | NaN | 128.0 | 6.0 |
| 316 | 2022-11-29 | 3957.63 | 3964.19 | 3976.77 | 3937.65 | -6.56 | 0.0 | NaN | 83.0 | -22.0 |
| 317 | 2022-11-28 | 3963.94 | 4005.36 | 4012.27 | 3955.77 | -41.42 | 0.0 | NaN | 122.0 | 17.0 |
| 318 | 2022-11-25 | 4026.12 | 4023.34 | 4034.02 | 4020.76 | 2.78 | 13.0 | NaN | 150.0 | 6.0 |
Raw data often needs to be manipulated into a usable format before algorithms can be applied. Preprocessing data so that it is suitable for machine analysis is known as data wrangling.
There are a number of frequently encountered tasks in data wrangling, including:

- handling missing or invalid values,
- cleaning up strings and converting them to numbers,
- encoding categorical variables numerically,
- merging and reshaping tables.
These are tasks that graphical and AI assistive tools can help with a great deal. When using such tools, have them generate code that reproduces their steps. That way, you have a definitive record of what was done and can reproduce or modify it later.
One of the most common snags in real datasets is rows that have missing values. There is no universal way that such values are presented. A missing value may be indicated by a zero or nonsensical value, or with a string such as “n/a”, or just an empty cell. Within Python, it might be natural to represent missing values as None, but that is not a valid value for a NumPy array. It’s more flexible within pandas to use np.nan, which stands for “not a number”.
Most algorithms will fail when presented with missing values, so you must either remove or replace them.
For removal, the dropna() method deletes all rows that have missing values. By default, it does this to a new copy of the data frame, so you must assign the result to a variable. Alternatively, you can use the argument inplace=True to modify the original frame. When applying dropna(), you may want to limit its scope to a subset of columns by using the subset argument with a list of columns.
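As a minimal sketch with a made-up frame, the subset argument restricts which columns are checked for missing values:

import numpy as np
import pandas as pd
df = pd.DataFrame( {"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]} )
complete = df.dropna()            # drops both rows that have any missing value
has_a = df.dropna(subset=["a"])   # drops only the row where column a is missing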
Replacement of missing values is called imputation. The fillna() method makes it easy to use the mean, median, or mode of the non-missing values in a column. Doing so allows you to avoid data loss, but you should be cautious about imputation, because it can easily misrepresent the underlying data. At the very least, you need to document how many values have been changed.
Example 1.25 Here is a well-known demonstration data set often referred to as Palmer Penguins:
import seaborn as sns
import numpy as np
= sns.load_dataset("penguins")
penguins penguins.head()
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
Note above that the fourth row of the frame is missing measurements. We can discover where such values are using isna, which creates a Boolean data frame:
penguins.isna()
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False |
| 3 | False | False | True | True | True | True | True |
| 4 | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | False | False | True | True | True | True | True |
| 340 | False | False | False | False | False | False | False |
| 341 | False | False | False | False | False | False | False |
| 342 | False | False | False | False | False | False | False |
| 343 | False | False | False | False | False | False | False |

344 rows × 7 columns
The number of missing values in each column is
penguins.isna().sum()
species 0
island 0
bill_length_mm 2
bill_depth_mm 2
flipper_length_mm 2
body_mass_g 2
sex 11
dtype: int64
To drop the rows with missing values, you can use dropna():
dropped = penguins.dropna()
print("Original has", len(penguins), "rows.")
print("After removals, there are", len(dropped), "rows.")
Original has 344 rows.
After removals, there are 333 rows.
If you want to impute missing values, you can use fillna(). Here, for example, we replace missing values in any quantitative column with the mean of the column:
filled = penguins.copy()
for col in penguins.select_dtypes(np.number).columns:
    filled[col].fillna(penguins[col].mean(), inplace=True)
filled.head()
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.10000 | 18.70000 | 181.000000 | 3750.000000 | Male |
| 1 | Adelie | Torgersen | 39.50000 | 17.40000 | 186.000000 | 3800.000000 | Female |
| 2 | Adelie | Torgersen | 40.30000 | 18.00000 | 195.000000 | 3250.000000 | Female |
| 3 | Adelie | Torgersen | 43.92193 | 17.15117 | 200.915205 | 4201.754386 | NaN |
| 4 | Adelie | Torgersen | 36.70000 | 19.30000 | 193.000000 | 3450.000000 | Female |
Note that, for example, the categorical column for sex was left alone and still contains missing values.
A common pitfall is that data that seems numerical to you might be parsed as strings by pandas. This happens when there are non-numeric characters in the data, such as dollar signs or percent signs, and any row with such a character will cause the whole column to be interpreted as strings.
Example 1.26 To demonstrate algorithms in later sections, we will be using a dataset describing loans made on the crowdfunding site LendingClub. For this example, we’ll keep only a few columns.
= pd.read_csv("_datasets/loans.csv")
loans = loans[["loan_amnt", "funded_amnt", "int_rate", "title"]]
loans 8) loans.head(
|   | loan_amnt | funded_amnt | int_rate | title |
|---|---|---|---|---|
| 0 | 5000 | 5000 | 10.65% | Computer |
| 1 | 2500 | 2500 | 15.27% | bike |
| 2 | 2400 | 2400 | 15.96% | real estate business |
| 3 | 10000 | 10000 | 13.49% | personel |
| 4 | 3000 | 3000 | 12.69% | Personal |
| 5 | 5000 | 5000 | 7.90% | My wedding loan I promise to pay back |
| 6 | 7000 | 7000 | 15.96% | Loan |
| 7 | 3000 | 3000 | 18.64% | Car Downpayment |
The int_rate column, which gives the interest rate on the loan, has been interpreted as strings due to the percent sign. We’d like to strip out those percent signs and convert the rest to numeric types. We can use the str property of the column to access string methods:
= loans["int_rate"].str.strip('%') int_rate
Then, we can convert the result to numbers:
"int_rate"] = pd.to_numeric( int_rate )
loans[ loans.head()
|   | loan_amnt | funded_amnt | int_rate | title |
|---|---|---|---|---|
| 0 | 5000 | 5000 | 10.65 | Computer |
| 1 | 2500 | 2500 | 15.27 | bike |
| 2 | 2400 | 2400 | 15.96 | real estate business |
| 3 | 10000 | 10000 | 13.49 | personel |
| 4 | 3000 | 3000 | 12.69 | Personal |
The title column contains strings that are haphazardly formatted. We might want to convert everything in that column to lowercase:
"title"] = loans["title"].str.lower()
loans[8) loans.head(
|   | loan_amnt | funded_amnt | int_rate | title |
|---|---|---|---|---|
| 0 | 5000 | 5000 | 10.65 | computer |
| 1 | 2500 | 2500 | 15.27 | bike |
| 2 | 2400 | 2400 | 15.96 | real estate business |
| 3 | 10000 | 10000 | 13.49 | personel |
| 4 | 3000 | 3000 | 12.69 | personal |
| 5 | 5000 | 5000 | 7.90 | my wedding loan i promise to pay back |
| 6 | 7000 | 7000 | 15.96 | loan |
| 7 | 3000 | 3000 | 18.64 | car downpayment |
You can look at the pandas user guide for a more thorough demonstration of string methods in pandas.
Most algorithms require numerical data. If you have qualitative data, you need to convert it to a numerical form. This is called encoding.
For a category with ordered values, it’s usually best to replace the categories with integers that reflect the ordering.
Example 1.27 This dataset contains information about diamonds:
import seaborn as sns
= sns.load_dataset("diamonds")
diamonds diamonds.head()
|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
As you can see above, some of the features (cut, color, clarity) have string designations. We could convert cut to an ordinal categorical variable like so:
= ["Fair", "Good", "Very Good", "Premium", "Ideal"]
cuts = pd.Categorical(diamonds["cut"], categories=cuts, ordered=True) # ordinal
cuts_cat cuts_cat
['Ideal', 'Premium', 'Good', 'Premium', 'Good', ..., 'Ideal', 'Good', 'Very Good', 'Premium', 'Ideal']
Length: 53940
Categories (5, object): ['Fair' < 'Good' < 'Very Good' < 'Premium' < 'Ideal']
Replacing the strings with categories in the cut column makes no apparent change:
"cut"] = cuts_cat
diamonds[ diamonds.head()
|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
However, we could convert those values to integers with the same ordering:
"cut"] = cuts_cat.codes
diamonds[ diamonds.head()
|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 4 | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | 3 | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | 1 | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | 3 | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | 1 | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
For nominal categories that have no natural ordering, it’s typically best to encode via dummy variables rather than integers.
Example 1.28 Here is a dataset of NBA player statistics from 2023–2024:
= pd.read_csv("_datasets/NBA player stats 2023-24.csv")
nba nba.head()
|   | RANK | NAME | TEAM | POS | AGE | GP | MPG | USG% | TO% | FTA | ... | APG | SPG | BPG | TPG | P+R | P+A | P+R+A | VI | ORtg | DRtg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | Joel Embiid | Phi | C | 30.2 | 6 | 41.4 | 35.7 | 15.8 | 78 | ... | 5.7 | 1.2 | 1.5 | 4.2 | 43.8 | 38.7 | 49.5 | 12.2 | 117.1 | 108.0 |
| 1 | NaN | Jalen Brunson | Nyk | G | 27.8 | 13 | 39.8 | 36.4 | 9.3 | 120 | ... | 7.5 | 0.8 | 0.2 | 2.7 | 35.7 | 39.8 | 43.2 | 9.3 | 114.8 | 114.7 |
| 2 | NaN | Damian Lillard | Mil | G | 33.9 | 4 | 39.1 | 31.4 | 10.0 | 38 | ... | 5.0 | 1.0 | 0.0 | 2.3 | 34.5 | 36.3 | 39.5 | 8.2 | 127.6 | 115.7 |
| 3 | NaN | Shai Gilgeous-Alexander | Okc | G | 25.9 | 10 | 39.9 | 32.3 | 8.9 | 81 | ... | 6.4 | 1.3 | 1.7 | 2.2 | 37.4 | 36.6 | 43.8 | 11.2 | 118.3 | 106.9 |
| 4 | NaN | Tyrese Maxey | Phi | G | 23.6 | 6 | 44.6 | 28.1 | 8.6 | 28 | ... | 6.8 | 0.8 | 0.3 | 2.2 | 35.0 | 36.7 | 41.8 | 9.1 | 120.9 | 113.3 |

5 rows × 29 columns
For this example, we will narrow the focus to the NAME and POS columns:
= nba[["NAME", "POS"]]
nba 10) nba.head(
|   | NAME | POS |
|---|---|---|
| 0 | Joel Embiid | C |
| 1 | Jalen Brunson | G |
| 2 | Damian Lillard | G |
| 3 | Shai Gilgeous-Alexander | G |
| 4 | Tyrese Maxey | G |
| 5 | Donovan Mitchell | G |
| 6 | Luka Doncic | G |
| 7 | Nikola Jokic | C |
| 8 | LeBron James | F |
| 9 | Anthony Davis | C |
The POS column is a categorical variable (the player’s position):
"POS"].unique() nba[
array(['C', 'G', 'F', 'G-F', 'F-G', 'C-F', 'F-C'], dtype=object)
We can use pd.get_dummies to convert this column to dummy variables:
= pd.get_dummies(nba["POS"])
dum1 10) dum1.head(
|   | C | C-F | F | F-C | F-G | G | G-F |
|---|---|---|---|---|---|---|---|
| 0 | True | False | False | False | False | False | False |
| 1 | False | False | False | False | False | True | False |
| 2 | False | False | False | False | False | True | False |
| 3 | False | False | False | False | False | True | False |
| 4 | False | False | False | False | False | True | False |
| 5 | False | False | False | False | False | True | False |
| 6 | False | False | False | False | False | True | False |
| 7 | True | False | False | False | False | False | False |
| 8 | False | False | True | False | False | False | False |
| 9 | True | False | False | False | False | False | False |
As you can see, the POS column was replaced by 7 columns, one for each of the unique positions in the dataset. But this result is a little awkward, because a position value such as "C-F" means both center and forward. We can use a different encoding scheme to handle this more sensibly:
= nba["POS"].str.get_dummies("-") # split on hyphen
dum2 10) dum2.head(
|   | C | F | G |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 0 | 1 |
| 6 | 0 | 0 | 1 |
| 7 | 1 | 0 | 0 |
| 8 | 0 | 1 | 0 |
| 9 | 1 | 0 | 0 |
Now we can manually paste together the original data frame with the dummy variables:
nba = pd.concat([nba, dum2], axis="columns")
nba.head(10)
|   | NAME | POS | C | F | G |
|---|---|---|---|---|---|
| 0 | Joel Embiid | C | 1 | 0 | 0 |
| 1 | Jalen Brunson | G | 0 | 0 | 1 |
| 2 | Damian Lillard | G | 0 | 0 | 1 |
| 3 | Shai Gilgeous-Alexander | G | 0 | 0 | 1 |
| 4 | Tyrese Maxey | G | 0 | 0 | 1 |
| 5 | Donovan Mitchell | G | 0 | 0 | 1 |
| 6 | Luka Doncic | G | 0 | 0 | 1 |
| 7 | Nikola Jokic | C | 1 | 0 | 0 |
| 8 | LeBron James | F | 0 | 1 | 0 |
| 9 | Anthony Davis | C | 1 | 0 | 0 |
Here’s how it looks for a player who has multiple positions:
"NAME"] == "Kelly Oubre Jr."] nba[nba[
|   | NAME | POS | C | F | G |
|---|---|---|---|---|---|
| 54 | Kelly Oubre Jr. | F-G | 0 | 1 | 1 |
Exercise 1.1 For each type of data, classify it as discrete quantitative, continuous quantitative, categorical, or other.
Exercise 1.2 Give the length of each vector or series.
Exercise 1.3 Describe a scheme for creating dummy variables for the days of the week. Use your scheme to encode the vector:
[Tuesday, Sunday, Friday, Tuesday, Monday]