1  Representation of data

First we have to discuss how to represent data, both abstractly and in Python.

1.1 Quantitative data

Definition 1.1 A quantitative value is one that is numerical and supports meaningful comparison and arithmetic operations.

Quantitative data is further divided into continuous and discrete types. The difference is the same as between real numbers and integers.

Example 1.1 Some continuous quantitative data sources:

  • Temperature at noon at a given airport
  • Your height
  • Voltage across the terminals of a battery

Examples of discrete quantitative data:

  • The number of shoes you own
  • Number of people at a restaurant table
  • Score on an exam
Note

Sometimes there may be room for interpretation or context. For example, the retail price of a gallon of milk might be regarded as discrete data, since it technically represents a whole number of pennies. But in finance, transactions are regularly computed to much higher precision, so it might make more sense to interpret prices as continuous values. As a rule of thumb, if there are many possible values, or the values are the result of a measurement, the continuous interpretation is usually more appropriate.

Caution

Not all numerical values represent truly quantitative data. ZIP codes (postal codes) in the U.S. are 5-digit numbers, and while there is some logic to how they were assigned, there is no clearly meaningful interpretation of averaging them, for instance.

For both continuous and discrete quantities, it makes sense to order different values, compute averages of them, and so on. (Note, however, that the average of discrete values is usually not an integer, so averaging produces a continuous quantity.)

Mathematically, the real and integer number sets are infinite, but computers are finite machines. Integers are represented exactly within some range that is determined by how many binary bits are dedicated to them. The computational analog of a real number is a floating-point number, or more simply, a float. Floats are bounded in range as well as discretized. The details are complicated, but essentially, floating-point numbers carry about 16 significant decimal digits by default, which is virtually always far more precision than real data offers.
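To see this precision limit in action, here is a quick check (a sketch; the exact trailing digits may vary by platform):

print(f"{1/3:.25f}")    # digits beyond roughly the 16th are meaningless
print(0.1 + 0.2)        # prints 0.30000000000000004, not 0.3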

Example 1.2 If you enter an integer, it is interpreted as an int:

# This is an integer (int type)
print(3, "is a", type(3))
3 is a <class 'int'>

To get the floating-point version, you add a decimal point:

# This is a real number (float type)
print(3.0, "is a", type(3.0))
3.0 is a <class 'float'>

You can also use scientific notation:

print(6.02e23, "is Avogadro's number")
print(6.62607e-34, "is Planck's constant in Joule-seconds")
6.02e+23 is Avogadro's number
6.62607e-34 is Planck's constant in Joule-seconds

There are also explicit converter functions. The conversion of an int to a float is generally exact:

# Convert int to float (generally no change to numerical value)
print( "float(3) creates",float(3) )
float(3) creates 3.0

But the conversion of a float to an int truncates the fractional part, which can cause subtle bugs:

# Truncate float to int
print( "int(3.14) creates", int(3.14) )
int(3.14) creates 3

Because operations on floats are not exact, results can be surprising:

print(0.2 - 0.1 == 0.1)    # as expected
print(3.1 - 3.0 == 0.1)    # uh-oh
True
False

The NumPy function isclose is a safer way to compare floats:

import numpy as np
print( np.isclose(3.1 - 3.0, 0.1) )
True



Warning

Be careful about comparing floating-point values. Exact comparisons may fail due to rounding errors.

1.1.1 Inf and NaN

There are two additional quasi-numerical float values to be aware of.

Important

For numerical work in Python, the NumPy package is indispensable. We will use it often, and it is also loaded and used by most other scientifically oriented packages.

The value inf in NumPy stands for infinity.

Warning

By definition, every arithmetic operation that involves a NaN value results in a NaN.

Example 1.3 Infinity is greater than every number, and some arithmetic operations with infinity are well defined:

import numpy as np
print( "(np.inf + 5) is", np.inf + 5 )
print( "(np.inf + np.inf) is", np.inf + np.inf )
print( "(5 - np.inf) is", 5 - np.inf )
(np.inf + 5) is inf
(np.inf + np.inf) is inf
(5 - np.inf) is -inf

However, in calculus you learned that some expressions with infinity are considered to be undefined without additional information (e.g., L’Hôpital’s Rule):

print( "np.inf / np.inf is", np.inf / np.inf )
np.inf / np.inf is nan

The result nan above stands for Not a Number. It is the result of indeterminate arithmetic operations, like \(\infty/\infty\) and \(\infty - \infty\). It is also used sometimes as a placeholder for missing data.

One notorious trap of using NaN is that nan==nan is not True!

print( "np.nan == np.nan is", np.nan == np.nan )
np.nan == np.nan is False

Instead, you should use np.isnan to check for NaN values:

print( "np.isnan(np.nan) is", np.isnan(np.nan) )
np.isnan(np.nan) is True

1.1.2 Dates and times

Handling times and dates can be tricky. Aside from headaches such as time zones and leap years, there are many different ways people and machines represent dates, with varying amounts of precision. Python has its own built-in system for handling dates and times, but we will show the facilities provided by the NumPy package, which is more comprehensive.

There are two basic types:

Definition 1.2 A datetime is a representation of an instant in time. A time delta is a representation of a duration; i.e., a difference between two datetimes.

Important

Native Python uses a datetime type, while NumPy uses datetime64.

Example 1.4 You can use many natural date formats within strings:

import numpy as np
np.datetime64("2020-01-17")    # YYYY-MM-DD 
np.datetime64("1969-07-20T20:17")    # YYYY-MM-DDThh:mm
numpy.datetime64('1969-07-20T20:17')
# Current date and time, down to the second
np.datetime64("now")
numpy.datetime64('2024-08-26T14:54:31')

A time delta in NumPy indicates its units (i.e., its granularity).

np.datetime64("1969-07-20T20:17") - np.datetime64("today")
numpy.timedelta64(-28980223,'m')
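If you need a duration in particular units, you can divide by a one-unit time delta. Here is a sketch using the Apollo 11 landing timestamp above; the launch time 1969-07-16T13:32 is an assumed value for illustration:

delta = np.datetime64("1969-07-20T20:17") - np.datetime64("1969-07-16T13:32")
print( delta / np.timedelta64(1, "h") )    # duration in hours: 102.75
print( delta / np.timedelta64(1, "D") )    # duration in days: 4.28125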

1.2 Vectors

Most interesting phenomena are characterized and influenced by more than one factor. Collections of values therefore play a central role in data science. The workhorse type for collections in base Python is the list. However, we’re going to need some more powerful metaphors and tools as well, beginning with vectors.

Definition 1.3 A vector is a collection of values called elements, all of the same type, indexed by consecutive integers.

Important

In math, vector indexes usually begin with 1. In Python, they begin with 0.

A vector with \(n\) elements is often referred to as an \(n\)-vector, and we say that \(n\) is the length of the vector. In math we often use \(\mathbb{R}^n\) to denote the set of all \(n\)-vectors with real-valued elements.

Example 1.5 The usual way to work with vectors in Python is through NumPy:

import numpy as np

x = np.array( [1, 2, 3, 4, 5] )
x
array([1, 2, 3, 4, 5])

A vector has a data type for its elements:

x.dtype
dtype('int64')

Any float values in the vector cause the data type of the entire vector to be float:

y = np.array( [1.0, 2, 3, 4, 5] )
y.dtype
dtype('float64')

Use len to determine the length of a vector:

len(x)
5

You can create a vector of equally spaced elements, analogous to a range in base Python:

np.arange(0, 5)
array([0, 1, 2, 3, 4])
np.arange(1, 3, 0.5)
array([1. , 1.5, 2. , 2.5])

The syntax here is (start, stop, step).

Caution

In base Python and in NumPy, the stop value of a range is excluded from the result. This is guaranteed to cause confusion if you are used to just about any other computer language.

There are functions for creating vectors of ones and zeros:

print(np.ones(5))
print(np.zeros(4))
[1. 1. 1. 1. 1.]
[0. 0. 0. 0.]



1.2.1 Element access

The elements of a vector can be accessed directly using square-bracket notation.

Example 1.6 Use square brackets to refer to an element of a vector, remembering that the first element has index 0:

x[0]
1
x[4]
5

Negative values for the index are counted from the end. The last element of a vector always has index -1, and more-negative values move backward through the elements:

x[-1]
5
x[-3]
3
x[-len(x)]
1

Element references can also be on the left side of an assignment:

x[2] = -3
x
array([ 1,  2, -3,  4,  5])

Note, however, that once the data type of a vector is set, it can’t be changed:

x[0] = 1.234
x    # float was truncated to int, without warning!!!
array([ 1,  2, -3,  4,  5])

You can also use a list in square brackets to access multiple elements at once:

x[ [0, 2, 4] ]
array([ 1, -3,  5])

The result of a list reference is a new vector, not a number. Note the difference here:

x[0]
1
x[ [0] ]
array([1])

1.2.2 Slicing

You can also access multiple elements of a vector by using a list or range as the index. In the latter case, this is called slicing.

Warning

As with ranges in base Python, the stop index of a slice is excluded from the result. This causes many headaches and bugs.

Example 1.7 The syntax of a slice is start:stop.

print(x[0:3])
print(x[-3:-1])
[ 1  2 -3]
[-3  4]

If you want to use a range that skips over elements, you can use a third argument in the form start:stop:step:

x[0:5:2]
array([ 1, -3,  5])

When the start of the range is omitted, it means “from the beginning”, and when stop is omitted, it means “through to the end.” Hence, [:k] means “first \(k\) elements” and [-k:] means “last \(k\) elements”:

x[:3]
array([ 1,  2, -3])
x[-3:]
array([-3,  4,  5])

And we also have this idiom:

x[::-1]   # reverse the vector
array([ 5,  4, -3,  2,  1])

NumPy will happily allow a slice range that extends past the end of the vector. It just returns as much as is available, without warning or error:

x[:10]
array([ 1,  2, -3,  4,  5])

1.3 Arrays

A vector is a special case of a more general construct.

Definition 1.4 An array is a collection of values called elements, all of the same type, indexed by one or more sets of consecutive integers. The number of indexes needed to specify a value is the dimension of the array.

Tip

A dimension is called an axis in NumPy and related packages.

Note

The term matrix is often used simply to mean a 2D array. Technically, though, a matrix should have only numerical values, and matrices obey certain properties that make them important mathematical objects. These properties and their consequences are studied in linear algebra.

For us, arrays will mostly be the results of importing or working on data. But it’s occasionally useful to know how to build them from scratch.

Example 1.8 One way to construct an array is by a list comprehension:

A = np.array([ [j-i for j in range(6)] for i in range(4) ])
A
array([[ 0,  1,  2,  3,  4,  5],
       [-1,  0,  1,  2,  3,  4],
       [-2, -1,  0,  1,  2,  3],
       [-3, -2, -1,  0,  1,  2]])

The shape of an array is what we would often call the size:

A.shape
(4, 6)

There is no difference between a vector and a 1D array:

x.shape
(5,)

There is also no difference between a 2D array and a vector of vectors giving the rows of the array.

R = np.array( [ [1, 2, 3], [4, 5, 6] ])
R
array([[1, 2, 3],
       [4, 5, 6]])

Here are some other common ways to construct arrays.

np.ones(5)
array([1., 1., 1., 1., 1.])
np.zeros( (3, 6) )
array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])
np.repeat(np.pi, 3)
array([3.14159265, 3.14159265, 3.14159265])

You can also stack arrays vertically or horizontally to create new arrays.

np.hstack( ( np.ones((2, 2)), np.zeros((2, 3)) ) )
array([[1., 1., 0., 0., 0.],
       [1., 1., 0., 0., 0.]])
np.vstack( (range(5), range(5, 0, -1)) )
array([[0, 1, 2, 3, 4],
       [5, 4, 3, 2, 1]])

1.3.1 Indexing and slicing

One way to think about a 2D array is as a vector of row vectors, all of the same length. That idea is reflected in NumPy syntax.

Example 1.9 We can use successive brackets to refer to an element, row then column:

R[1][2]    # second row, third column
6

It’s equivalent, and typically more convenient, to use a single bracket set with indexes separated by commas:

R[1, 2]    # second row, third column
6

You can use slicing ranges in each dimension individually.

R[:1, -2:]    # first row, last two columns
array([[2, 3]])

The result above is another 2D array. Note how this result is subtly different:

R[0, -2:]
array([2, 3])

Because we accessed an individual row, not a slice, the result is one dimension lower—a vector. Finally, a : in one slice position means to keep everything in that dimension.

A[:, :2]    # all rows, first 2 columns
array([[ 0,  1],
       [-1,  0],
       [-2, -1],
       [-3, -2]])



1.3.2 Reductions

A common task is to reduce an array along one dimension (an axis, in NumPy terms), resulting in an array of one less dimension.

Example 1.10 The first dimension of an array is axis number 0, the second is axis number 1, etc.

A
array([[ 0,  1,  2,  3,  4,  5],
       [-1,  0,  1,  2,  3,  4],
       [-2, -1,  0,  1,  2,  3],
       [-3, -2, -1,  0,  1,  2]])
np.sum(A, axis=0)    # sum along the rows
array([-6, -2,  2,  6, 10, 14])

Even when we sum along the column dimension, axis=1, the result is a vector that we would interpret as a row, not a column.

np.sum(A, axis=1)    # sum along the columns
array([15,  9,  3, -3])

If you don’t specify an axis, the reduction occurs over all dimensions at once, resulting in a single number.

np.sum(A)
24



Tip

You can also do reductions with np.max, np.min, np.mean, etc.
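For instance, applied to the array A from Example 1.10 (a quick sketch; results shown as comments):

print( np.max(A, axis=1) )    # largest value in each row: [5 4 3 2]
print( np.min(A, axis=0) )    # smallest value in each column: [-3 -2 -1  0  1  2]
print( np.mean(A) )           # mean over all 24 elements: 1.0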

1.4 Qualitative data

A qualitative value is one that is not quantitative. There are numerous types, but we will consider only a few of them here. Ultimately, we need to be able to represent qualitative data in a quantitative way for use in algorithms.

1.4.1 Categorical

Definition 1.5 Categorical data has values drawn from a finite set \(S\) of categories. If the members of \(S\) support meaningful ordering comparisons, then the data is ordinal; otherwise, it is nominal.

Example 1.11 Examples of ordinal categorical data:

  • Seat classes on a commercial airplane (e.g., economy, business, first)
  • Letters of the alphabet

Examples of nominal categorical data:

  • Yes/No responses
  • Marital status
  • Make of a car

There are nuanced cases. For instance, letter grades are themselves ordinal categorical data. However, schools convert them to discrete quantitative data and then compute a continuous quantitative GPA.

One way to quantify ordinal categorical data is to assign integer values to the categories in a manner that preserves ordering. This approach can be suspect, though, when it comes to operations such as averaging or computing a distance between values.

1.4.1.1 Dummy variables

Another means of quantifying categorical data is called dummy variables in classical statistics and one-hot encoding in much of machine learning. Suppose a variable \(x\) has values in a category set that has \(m\) members, i.e., \(S = \{c_1,\ldots,c_m\}\). There are two ways to replace the values with dummy variables.

  1. Introduce \(m\) variables \(x_1,\ldots,x_m\), where for a given categorical value \(v\) we have

\[ x_i = \begin{cases} 1, & v = c_i, \\ 0, & v \neq c_i. \end{cases} \]

  2. Introduce only \(m-1\) variables, leaving out \(x_m\). If \(x_1=x_2=\cdots=x_{m-1}=0\), then we know that the value is \(c_m\). (This variant is important in statistics because otherwise \(x_m\) is completely determined by, and therefore correlated with, the other \(x_i\).)

Example 1.12 Suppose that the stooge variable can take the values Moe, Larry, Curly, or Shemp, and we have the dataset

["Curly", "Moe", "Curly", "Larry", "Shemp", "Moe"]
['Curly', 'Moe', 'Curly', 'Larry', 'Shemp', 'Moe']

If we use 4 dummy variables, the data would be replaced by the array

Curly Larry Moe Shemp
0 True False False False
1 False False True False
2 True False False False
3 False True False False
4 False False False True
5 False False True False

If we use only 3 dummy variables, we would get

Larry Moe Shemp
0 False False False
1 False True False
2 False False False
3 True False False
4 False False True
5 False True False

The all-zero rows above correspond to the Curly values.
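Tables like the ones above can be generated with the get_dummies function from the pandas package (introduced in Section 1.5). Here is a sketch; the drop_first=True option produces the \(m-1\) variant by dropping the first category in sorted order, which here is Curly:

import pandas as pd
stooge = pd.Series(["Curly", "Moe", "Curly", "Larry", "Shemp", "Moe"])
print( pd.get_dummies(stooge) )                   # four dummy variables
print( pd.get_dummies(stooge, drop_first=True) )  # three; all-False rows mean Curly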

1.4.2 Text

Text is a ubiquitous data source. One way to quantify text is to use a dictionary of interesting keywords \(w_1,\ldots,w_n\). Given a collection of documents \(d_1,\ldots,d_m\), we can define an \(m\times n\) document–term matrix \(T\) by letting \(T_{ij}\) be the number of times term \(j\) appears in document \(i\).
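Here is a minimal sketch of building such a matrix for a tiny made-up dictionary and corpus; a real application would use a proper tokenizer, but the counting logic is the same:

import numpy as np
terms = ["data", "vector", "python"]
docs = ["python is good with data",
        "a vector holds data",
        "python python python"]
# T[i, j] counts the appearances of term j in document i
T = np.array([ [doc.split().count(t) for t in terms] for doc in docs ])
print(T)    # rows are documents, columns are term counts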

1.4.3 Images

The most straightforward way to represent an image is as a 3D array of values representing the intensities of red, green, and blue in each pixel. Sometimes it might be preferable to represent the image by a vector of statistics about these values, or by the presence or absence of detected objects, etc.
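For instance, here is a sketch of a tiny 2×2 image built directly as an array, with red, green, blue, and white pixels:

import numpy as np
img = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],   # red pixel, green pixel
    [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]],   # blue pixel, white pixel
])
print(img.shape)    # (2, 2, 3): rows, columns, color channels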

1.5 Working in pandas

The most popular Python package for manipulating and analyzing data is pandas. We will use the paradigm it presents, which is fairly well understood throughout data science.

1.5.1 Series

Definition 1.6 A series is a vector that is indexed by a finite ordered set.

In NumPy, we always index vectors by absolute position. In pandas, the elements still have an underlying ordering, but we can index them by anything, including strings. Since names are more memorable and intuitive than arbitrary integers, using them makes code easier to write, read, and debug.

Example 1.13 Some data that can be viewed as series:

  • The length, width, and height of a box can be expressed as a 3-vector of positive real numbers. If we index the vector by the names of the measurements, it becomes a series.
  • The number of steps taken by an individual over the course of a week can be expressed as a 7-vector of nonnegative integers. We could index it by the integers 1–7, or in a series by the names of the days of the week.
  • The bid prices of a stock at the end of each trading day can be represented as a time series, in which the index is drawn from timestamps.

Example 1.14 Here is a small pandas series. Note that we give the values first and then the index:

import pandas as pd

names = ["Marge", "Homer", "Bart", "Lisa", "Maggie"]
months = pd.Series( ["March", "May", "February", "May", "January"], index=names)
months
Marge        March
Homer          May
Bart      February
Lisa           May
Maggie     January
dtype: object

We can access the values by choosing a name in the index:

months["Lisa"]
'May'

If we access multiple values, we get a series that is a subset of the original:

months[ ["Homer", "Marge"] ]
Homer      May
Marge    March
dtype: object

We can also use the iloc property to access the underlying vector by NumPy slicing:

months.iloc[-3:]
Bart      February
Lisa           May
Maggie     January
dtype: object

Here is a series of years based on the same index:

years = pd.Series([1955, 1956, 1979, 1981, 1988], index=names)
years
Marge     1955
Homer     1956
Bart      1979
Lisa      1981
Maggie    1988
dtype: int64

We can also start with the series and separate it into a vector of values and its index:

print("Values:")
print(years.values)
print("\nIndex:")
print(years.index)
Values:
[1955 1956 1979 1981 1988]

Index:
Index(['Marge', 'Homer', 'Bart', 'Lisa', 'Maggie'], dtype='object')

Table 1.1 lists some common operations on pandas series. There is an exhaustive list in the pandas documentation.

Table 1.1: Operations on pandas series
Description                Syntax                  Result
First or last entries      s.head(), s.tail()      Series
Length                     len(s)                  integer
All of the values          s.values                array
Convert values to list     list(s)                 list
Index                      s.index                 Index
Unique values              s.unique()              array
Appearance frequencies     s.value_counts()        Series
Extreme values             s.min(), s.max()        number
Sort by values             s.sort_values()         Series
Sum                        s.sum()                 number
Comparison                 s > 1, s == "foo"       boolean Series
Locate missing             s.isna()                boolean Series
Arithmetic                 s + 1, s * t            Series
Delete one or more rows    s.drop()                Series

Caution

As always in Python, you need to pay attention to the difference between applying a function, like foo(bar), accessing a property, like foo.bar, and calling an object method, foo.bar(). Extra or missing parentheses can cause errors.
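For example, all three patterns appear when working with a series (a quick sketch):

s = pd.Series([3, 1, 2])
print( len(s) )            # function applied to s
print( s.index )           # property access: no parentheses
print( s.sort_values() )   # method call: parentheses required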

Example 1.15 We continue with the definitions in Example 1.14. Here are all the unique values of the months:

months.unique()
array(['March', 'May', 'February', 'January'], dtype=object)

Here are the frequencies of their appearance:

months.value_counts()
May         2
March       1
February    1
January     1
Name: count, dtype: int64

Note that we can do a logical comparison and get a series of boolean values:

months == "May"
Marge     False
Homer      True
Bart      False
Lisa       True
Maggie    False
dtype: bool

Since the years are numerical, we can do arithmetic operations on them:

print("Minimum year is", years.min(), "and maximum year is", years.max())
print("The sum of the years is", years.sum())
Minimum year is 1955 and maximum year is 1988
The sum of the years is 9859

It can be useful to know the index where the minimum or maximum value occurs:

print("The oldest is", years.idxmin(), "and the youngest is", years.idxmax())
The oldest is Marge and the youngest is Maggie

1.5.2 Data frames

Definition 1.7 A data frame is a collection of series that all share the same index set.

Essentially, a data frame is a 2D array with a common index for the rows, where the constituent series are the columns of the array. Those columns can also be indexed by names (strings). Since names are much more memorable and intuitive than arbitrarily assigned integers, data frames make code easier to write, read, and reason about.

Tip

Curly braces { } are used to construct a dictionary in Python.

Example 1.16 We continue with the definitions of months and years in Example 1.14, and use a dictionary to specify the columns of a data frame:

simpsons = pd.DataFrame( {"month": months, "date":  [19, 12, 23, 9, 14], "year": years } )
simpsons
month date year
Marge March 19 1955
Homer May 12 1956
Bart February 23 1979
Lisa May 9 1981
Maggie January 14 1988

Notice above that the date column was given as just a list of numbers; the index for it was inherited from the months series.

If we give the name of a column as an index into the frame, we get back a series for that column:

simpsons["year"]
Marge     1955
Homer     1956
Bart      1979
Lisa      1981
Maggie    1988
Name: year, dtype: int64

If we give a list of column names, we get back a data frame with just those columns:

simpsons[ ["month", "date"] ]
month date
Marge March 19
Homer May 12
Bart February 23
Lisa May 9
Maggie January 14

We can even index a nonexistent column on the left side of an assignment, and pandas will create it for us:

simpsons["weekday"] = ["Saturday", "Saturday", "Friday", "Saturday", "Thursday"]
simpsons
month date year weekday
Marge March 19 1955 Saturday
Homer May 12 1956 Saturday
Bart February 23 1979 Friday
Lisa May 9 1981 Saturday
Maggie January 14 1988 Thursday

We can access a row by using brackets with the loc property of the frame, getting a series indexed by the column names of the frame:

simpsons.loc["Maggie"]
month       January
date             14
year           1988
weekday    Thursday
Name: Maggie, dtype: object

We are also free to strip away the index and get an ordinary array:

simpsons.to_numpy()
array([['March', 19, 1955, 'Saturday'],
       ['May', 12, 1956, 'Saturday'],
       ['February', 23, 1979, 'Friday'],
       ['May', 9, 1981, 'Saturday'],
       ['January', 14, 1988, 'Thursday']], dtype=object)

Another way to create a data frame is to give a list of rows, plus (optionally) the index and the names of the columns:

pd.DataFrame( 
    [ ("white", 5), ("brown", 9), ("pink", 8) ], 
    columns=["Color", "Rating"],
    index=["vanilla", "chocolate", "strawberry"]
    )
Color Rating
vanilla white 5
chocolate brown 9
strawberry pink 8



1.5.3 Common operations

Table 1.2 shows common operations specific to a data frame. But since the columns are series, the operations in Table 1.1 can be applied too, and they will act columnwise.

Table 1.2: Operations on pandas data frames
Description                   Syntax                              Result
Summary information           df.info(), df.describe()            DataFrame
First or last row(s)          df.head(), df.tail()                DataFrame
Number of rows                len(df)                             integer
Number of rows and columns    df.shape                            tuple
Row index                     df.index                            Index
Column names                  list(df)                            list
Column names                  df.columns                          Index
Access by column name(s)      df["name"], df[["col1", "col2"]]    Series or DataFrame
Access by index/name          df.loc[rows, "name"]                (varies)
Access by position            df.iloc[i, j]                       (varies)
Delete a column               df.drop(name, axis=1)               DataFrame
Sort by column values         df.sort_values("column")            DataFrame

Example 1.17 We continue with the definitions of simpsons in Example 1.16:

simpsons.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, Marge to Maggie
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   month    5 non-null      object
 1   date     5 non-null      int64 
 2   year     5 non-null      int64 
 3   weekday  5 non-null      object
dtypes: int64(2), object(2)
memory usage: 372.0+ bytes
simpsons.min()
month      February
date              9
year           1955
weekday      Friday
dtype: object

Summing numbers does what you expect, but summing strings concatenates them:

simpsons.sum()
month                  MarchMayFebruaryMayJanuary
date                                           77
year                                         9859
weekday    SaturdaySaturdayFridaySaturdayThursday
dtype: object

We can sort the entire frame by its values in a column:

simpsons.sort_values("date")
month date year weekday
Lisa May 9 1981 Saturday
Homer May 12 1956 Saturday
Maggie January 14 1988 Thursday
Marge March 19 1955 Saturday
Bart February 23 1979 Friday

For a more thorough demonstration of operations on data frames, look at the pandas user guide. There is also an exhaustive list of relevant functions in the pandas documentation.

1.5.4 Loading from files

Datasets can be presented in many forms. The simplest and most universal are comma-separated value (CSV) files. Such files can be read by pandas locally or over the web.

Example 1.18 The pandas function we use to load a dataset from CSV is read_csv. Here we use it to read a file that is available over the web:

ads = pd.read_csv("https://raw.githubusercontent.com/tobydriscoll/ds1book/master/advertising.csv")
ads.head(6)    # show the first 6 rows
TV Radio Newspaper Sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 12.0
3 151.5 41.3 58.5 16.5
4 180.8 10.8 58.4 17.9
5 8.7 48.9 75.0 7.2

Note above that we used head(6) to see just the first 6 rows.

It’s a good idea to check what you’ve just loaded:

ads.info()   
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB



Tip

Here is one thing to watch for when loading numerical data: if a column has anything non-numeric in it, like “N/A” or “unknown”, pandas will load that column as a string.
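One way to recover afterward, sketched here with a made-up column, is pd.to_numeric with errors="coerce", which turns unparseable entries into NaN:

s = pd.Series(["4.5", "3.2", "unknown"])
print( pd.to_numeric(s, errors="coerce") )    # "unknown" becomes NaN; dtype float64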

Tip

While it’s possible to import datasets directly from the web, links change and disappear frequently. If storing the dataset locally is not a problem, you may want to download your own copy before working on it.

It’s also possible to import data from most other general-purpose formats you might encounter, such as Excel spreadsheets (though possibly requiring one-time installation of additional libraries). There are many functions starting with pd.read_ showing the formats pandas understands.

1.6 Selecting rows and columns

In a data frame, we access columns by giving a string name, or a list of names, in square brackets. We can access a row by the loc property with an index value, or iloc with an absolute row number starting from zero.

Example 1.19 Here’s a local file that contains daily weather summaries from Newark, Delaware:

weather = pd.read_csv("_datasets/ghcn_newark.csv")
weather.head()
STATION DATE LATITUDE LONGITUDE ELEVATION NAME PRCP PRCP_ATTRIBUTES SNOW SNOW_ATTRIBUTES ... WT08 WT08_ATTRIBUTES WT11 WT11_ATTRIBUTES WT14 WT14_ATTRIBUTES WT16 WT16_ATTRIBUTES WT18 WT18_ATTRIBUTES
0 USC00076410 1894-04-01 39.6682 -75.74569 32.3 NEWARK AG FARM, DE US 0.0 P,,6, NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 USC00076410 1894-04-02 39.6682 -75.74569 32.3 NEWARK AG FARM, DE US 0.0 P,,6, NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 USC00076410 1894-04-03 39.6682 -75.74569 32.3 NEWARK AG FARM, DE US 0.0 P,,6, NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 USC00076410 1894-04-04 39.6682 -75.74569 32.3 NEWARK AG FARM, DE US 165.0 ,,6, NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 USC00076410 1894-04-05 39.6682 -75.74569 32.3 NEWARK AG FARM, DE US 0.0 P,,6, NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 56 columns

There are lots of columns here. We will work with a small subset of them:

columns = ["DATE", "PRCP", "SNOW", "TMAX", "TMIN"]
weather = weather[columns]
weather.head()
DATE PRCP SNOW TMAX TMIN
0 1894-04-01 0.0 NaN 189.0 111.0
1 1894-04-02 0.0 NaN 133.0 22.0
2 1894-04-03 0.0 NaN 89.0 -61.0
3 1894-04-04 165.0 NaN 117.0 17.0
4 1894-04-05 0.0 NaN 156.0 56.0

The first row has absolute position zero:

weather.iloc[0]
DATE    1894-04-01
PRCP           0.0
SNOW           NaN
TMAX         189.0
TMIN         111.0
Name: 0, dtype: object

The result above is a series, with the column names of the original data frame serving as the index. The following result is subtly different:

weather.iloc[:1]
DATE PRCP SNOW TMAX TMIN
0 1894-04-01 0.0 NaN 189.0 111.0

The slicing syntax [:1] means “the first 1 rows”. The result is a one-row data frame—not a series. To access the last 4 rows, we can use a negative index:

weather.iloc[-4:]
DATE PRCP SNOW TMAX TMIN
29513 2022-12-17 0.0 NaN 72.0 11.0
29514 2022-12-18 0.0 NaN 39.0 -22.0
29515 2023-01-06 0.0 NaN 89.0 33.0
29516 2023-01-07 0.0 NaN 72.0 17.0

In Example 1.19, we did not specify an index for the data frame, so the row numbers were the index. If we give a data frame a different index, we can use loc to access rows by the index value.

Example 1.20 A natural candidate for an index in the weather data is the DATE column. This can be specified when the file is read, but here we do it after the fact:

dated_weather = weather.set_index("DATE")
dated_weather.head()
PRCP SNOW TMAX TMIN
DATE
1894-04-01 0.0 NaN 189.0 111.0
1894-04-02 0.0 NaN 133.0 22.0
1894-04-03 0.0 NaN 89.0 -61.0
1894-04-04 165.0 NaN 117.0 17.0
1894-04-05 0.0 NaN 156.0 56.0

Notice above that the leftmost column of dated_weather is the index, which still has the column name DATE. We can access a row by giving a date string to loc:

dated_weather.loc["1979-03-28"]
PRCP      0.0
SNOW      0.0
TMAX    111.0
TMIN    -33.0
Name: 1979-03-28, dtype: float64

We can extract a row and column simultaneously by giving both to loc:

dated_weather.loc["1979-03-28", "TMAX"]
111.0

When the index is a date, as is the case here, we can even use a range of dates:

dated_weather.loc["1979-03-01":"1979-03-06"]
PRCP SNOW TMAX TMIN
DATE
1979-03-01 0.0 0.0 122.0 -22.0
1979-03-02 3.0 0.0 128.0 33.0
1979-03-03 0.0 0.0 117.0 22.0
1979-03-04 0.0 0.0 NaN NaN
1979-03-05 61.0 0.0 NaN NaN
1979-03-06 173.0 0.0 128.0 83.0

Frequently we want to select rows from a data frame based on criteria applied to the data itself. We can use relational operators such as >, <, ==, etc. to select rows based on the values in a column.

Example 1.21 Here is how we can select all the rows in which the value of PRCP is greater than 1200:

weather.loc[weather["PRCP"] > 1200]
DATE PRCP SNOW TMAX TMIN
9396 1952-07-09 1377.0 0.0 261.0 200.0
12382 1960-09-12 1331.0 0.0 222.0 167.0
16352 1971-08-27 1207.0 0.0 256.0 194.0
26543 1999-09-16 2202.0 0.0 206.0 156.0
27938 2004-07-12 1580.0 0.0 244.0 189.0
27986 2004-09-29 1450.0 0.0 244.0 178.0

There are two steps above. The expression inside the square brackets produces a series of Boolean values:

is_prcp = (weather["PRCP"] > 1200)
is_prcp.head()
0    False
1    False
2    False
3    False
4    False
Name: PRCP, dtype: bool

This series can be used to select rows with loc. Actually, it works without loc as well, because pandas knows it is a row selector:

print("There are",is_prcp.sum(),"rows with PRCP > 1200.")
weather[is_prcp]
There are 6 rows with PRCP > 1200.
DATE PRCP SNOW TMAX TMIN
9396 1952-07-09 1377.0 0.0 261.0 200.0
12382 1960-09-12 1331.0 0.0 222.0 167.0
16352 1971-08-27 1207.0 0.0 256.0 194.0
26543 1999-09-16 2202.0 0.0 206.0 156.0
27938 2004-07-12 1580.0 0.0 244.0 189.0
27986 2004-09-29 1450.0 0.0 244.0 178.0

We can use logical operators & (AND), | (OR), and ~ (NOT) on these Boolean series in order to combine criteria:

weather[(weather["TMAX"] < 0) & (weather["SNOW"] > 0)]
DATE PRCP SNOW TMAX TMIN
272 1895-01-29 102.0 102.0 -22.0 -94.0
282 1895-02-08 127.0 178.0 -139.0 -189.0
692 1896-03-24 33.0 51.0 -22.0 -100.0
943 1896-11-30 107.0 76.0 -6.0 -33.0
966 1896-12-23 48.0 51.0 -6.0 -89.0
... ... ... ... ... ...
25224 1996-01-08 3.0 25.0 -28.0 -156.0
25249 1996-02-02 20.0 25.0 -22.0 -56.0
25250 1996-02-03 155.0 147.0 -33.0 -111.0
25594 1997-01-12 13.0 13.0 -33.0 -100.0
25623 1997-03-10 178.0 8.0 -17.0 167.0

131 rows × 5 columns



Tip

The pandas user guide has a handy section on selections.

Tip

For practice with pandas fundamentals, try the Kaggle course.

1.7 Combining data frames

Sometimes you want to combine data from different sources. There are multiple ways to do this in pandas, depending on the structure of the data you want to combine and the result you want to receive.

1.7.1 Concatenation

The concat function is used to stack frames or series on top of each other.

Example 1.22 Here we create two frames with three rows each and some commonly named columns:

birds = pd.DataFrame(
    {
        "Name": ["Eagles", "Crows", "Falcons"],
        "Predatory": [True, False, True],
        "Family": ["Accipitridae", "Corvidae", "Falconidae"],
        "Color": ["Brown", "Black", "Blue–Gray"]
    }
)
birds
Name Predatory Family Color
0 Eagles True Accipitridae Brown
1 Crows False Corvidae Black
2 Falcons True Falconidae Blue–Gray
teams = pd.DataFrame(
    {
        "Name": ["Falcons", "Eagles", "Bears"],
        "Home": ["Atlanta", "Philadelphia", "Chicago"],
        "Color": ["Red", "Green", "Blue"]
    }
)
teams
Name Home Color
0 Falcons Atlanta Red
1 Eagles Philadelphia Green
2 Bears Chicago Blue

We can use pd.concat() to stack these frames on top of each other:

pd.concat([birds, teams])
Name Predatory Family Color Home
0 Eagles True Accipitridae Brown NaN
1 Crows False Corvidae Black NaN
2 Falcons True Falconidae Blue–Gray NaN
0 Falcons NaN NaN Red Atlanta
1 Eagles NaN NaN Green Philadelphia
2 Bears NaN NaN Blue Chicago

Notice above that any value that wasn’t defined in the source frames ends up as a missing value in the result.

We can also concatenate the frames side by side by using the axis keyword:

pd.concat([birds, teams], axis="columns")
Name Predatory Family Color Name Home Color
0 Eagles True Accipitridae Brown Falcons Atlanta Red
1 Crows False Corvidae Black Eagles Philadelphia Green
2 Falcons True Falconidae Blue–Gray Bears Chicago Blue

The result is just a dumb pasting together of the source frames. There is no attempt to make the entries of the Name column match up, even though there were some overlapping values in it.

Concatenation is occasionally useful, but more often data frames should be joined horizontally using the more flexible merge method in the next section.

1.7.2 Merging data

The merge function is used to merge frames based on the values within the columns. Generically, each source frame can have a key column that is used to match up rows.

Example 1.23 Continuing with the data frames from Example 1.22:

pd.merge(birds, teams, on="Name")
Name Predatory Family Color_x Home Color_y
0 Eagles True Accipitridae Brown Philadelphia Green
1 Falcons True Falconidae Blue–Gray Atlanta Red

The on keyword specifies the column that is the key for the merge. In the above merge, the result is a frame with only the rows that have a common value in the Name column. That’s called an inner join, which is the default. An alternative is an outer join, which includes all rows from both frames:

pd.merge(birds, teams, on="Name", how="outer")
Name Predatory Family Color_x Home Color_y
0 Bears NaN NaN NaN Chicago Blue
1 Crows False Corvidae Black NaN NaN
2 Eagles True Accipitridae Brown Philadelphia Green
3 Falcons True Falconidae Blue–Gray Atlanta Red

Note also above that the column Color that appeared in both of the source frames was renamed to Color_x and Color_y in the result, since otherwise there would be a conflict.

It’s also possible to do a left join or a right join, which use one of the original frames as the basis for the result:

print("Left merge:")
print(pd.merge(birds, teams, on="Name", how="left"))
print("\nRight merge:")
print(pd.merge(birds, teams, on="Name", how="right"))
Left merge:
      Name  Predatory        Family    Color_x          Home Color_y
0   Eagles       True  Accipitridae      Brown  Philadelphia   Green
1    Crows      False      Corvidae      Black           NaN     NaN
2  Falcons       True    Falconidae  Blue–Gray       Atlanta     Red

Right merge:
      Name Predatory        Family    Color_x          Home Color_y
0  Falcons      True    Falconidae  Blue–Gray       Atlanta     Red
1   Eagles      True  Accipitridae      Brown  Philadelphia   Green
2    Bears       NaN           NaN        NaN       Chicago    Blue



Note

The difference between outer join and inner join is the same as the union and intersection of sets—the sets here being the "on" column contents.
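You can verify this with Python sets applied to the key columns (a quick sketch using the frames above):

inner_keys = set(birds["Name"]) & set(teams["Name"])   # intersection: {'Eagles', 'Falcons'}
outer_keys = set(birds["Name"]) | set(teams["Name"])   # union: adds 'Crows' and 'Bears'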

Example 1.24 Here we load in weather data for Newark, DE, being sure that the column representing the date is parsed as dates:

weather = pd.read_csv("_datasets/ghcn_newark.csv", parse_dates=["DATE"])
weather = weather[["DATE", "PRCP", "SNOW", "TMAX", "TMIN"]]
weather.head()
DATE PRCP SNOW TMAX TMIN
0 1894-04-01 0.0 NaN 189.0 111.0
1 1894-04-02 0.0 NaN 133.0 22.0
2 1894-04-03 0.0 NaN 89.0 -61.0
3 1894-04-04 165.0 NaN 117.0 17.0
4 1894-04-05 0.0 NaN 156.0 56.0

Here is another data frame for the S&P 500 stock market index:

sp500 = pd.read_csv("_datasets/SP500_daily.csv", parse_dates=["Date"])
sp500["Gain"] = sp500["Close/Last"] - sp500["Open"]
sp500.head()
Date Close/Last Open High Low Gain
0 2024-03-05 5078.65 5110.52 5114.54 5056.82 -31.87
1 2024-03-04 5130.95 5130.99 5149.67 5127.18 -0.04
2 2024-03-01 5137.08 5098.51 5140.33 5094.16 38.57
3 2024-02-29 5096.27 5085.36 5104.99 5061.89 10.91
4 2024-02-28 5069.76 5067.20 5077.37 5058.35 2.56

Now we can merge the weather data with the stock data according to the date. Note that here, the key is named differently in the two frames:

merged = pd.merge(sp500, weather, left_on="Date", right_on="DATE")
merged.head()
Date Close/Last Open High Low Gain DATE PRCP SNOW TMAX TMIN
0 2023-01-06 3895.08 3823.37 3906.19 3809.56 71.71 2023-01-06 0.0 NaN 89.0 33.0
1 2022-11-30 4080.11 3957.18 4080.11 3938.58 122.93 2022-11-30 97.0 NaN 128.0 6.0
2 2022-11-29 3957.63 3964.19 3976.77 3937.65 -6.56 2022-11-29 0.0 NaN 83.0 -22.0
3 2022-11-28 3963.94 4005.36 4012.27 3955.77 -41.42 2022-11-28 0.0 NaN 122.0 17.0
4 2022-11-25 4026.12 4023.34 4034.02 4020.76 2.78 2022-11-25 13.0 NaN 150.0 6.0

Since we did an inner join, only the rows with a common date in both frames are included:

print("There are", len(sp500), "rows in the S&P 500 data.")
print("There are", len(weather), "rows in the weather data.")
print("There are", len(merged), "rows in the inner-merged data.")
There are 1259 rows in the S&P 500 data.
There are 29517 rows in the weather data.
There are 820 rows in the inner-merged data.

If the key you want to merge on is the index column, the syntax of the merge changes a little. Here is an example:

weather = weather.set_index("DATE")
weather.head()
PRCP SNOW TMAX TMIN
DATE
1894-04-01 0.0 NaN 189.0 111.0
1894-04-02 0.0 NaN 133.0 22.0
1894-04-03 0.0 NaN 89.0 -61.0
1894-04-04 165.0 NaN 117.0 17.0
1894-04-05 0.0 NaN 156.0 56.0
result = pd.merge(sp500, weather, left_on="Date", right_index=True)
result.head()
Date Close/Last Open High Low Gain PRCP SNOW TMAX TMIN
290 2023-01-06 3895.08 3823.37 3906.19 3809.56 71.71 0.0 NaN 89.0 33.0
315 2022-11-30 4080.11 3957.18 4080.11 3938.58 122.93 97.0 NaN 128.0 6.0
316 2022-11-29 3957.63 3964.19 3976.77 3937.65 -6.56 0.0 NaN 83.0 -22.0
317 2022-11-28 3963.94 4005.36 4012.27 3955.77 -41.42 0.0 NaN 122.0 17.0
318 2022-11-25 4026.12 4023.34 4034.02 4020.76 2.78 13.0 NaN 150.0 6.0

1.8 Data wrangling

Raw data often needs to be manipulated into a usable format before algorithms can be applied. Preprocessing data so that it is suitable for machine analysis is known as data wrangling.

There are a number of frequently encountered tasks in data wrangling, including:

Cleaning
Handling missing values, removing duplicates, correcting errors, and ensuring that data is in the right format.
Encoding
Converting categorical variables to numerical form (e.g., defining dummy variables).
Transforming
Normalizing and scaling values.

These are tasks that graphical and AI assistive tools can help with a great deal. When using such tools, have them generate code that reproduces their steps. That way, you have a definitive record of what was done and can reproduce or modify it later.

1.8.1 Missing values

One of the most commonly encountered snags in real data sets is rows that have missing values. There is no universal way that such values are represented. A missing value may be indicated by a zero or nonsensical value, by a string such as “n/a”, or by just an empty cell. Within Python, it might be natural to represent missing values as None, but that is not a valid value in a numerical NumPy array. It’s more flexible within pandas to use np.nan, which stands for “not a number”.
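For instance, pandas converts None to NaN automatically when it constructs a numerical series (a quick check):

s = pd.Series([1.0, None, 3.0])
print( s.isna() )    # False, True, False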

Most algorithms will fail when presented with missing values, so you must either remove or replace them.

For removal, the dropna() method deletes all rows that have missing values. By default, it does this to a new copy of the data frame, so you must assign the result to a variable. Alternatively, you can use the argument inplace=True to modify the original frame. When applying dropna(), you may want to limit its scope to a subset of columns by using the subset argument with a list of columns.

Replacement of missing values is called imputation. The fillna() method makes it easy to use the mean, median, or mode of the non-missing values in a column. Doing so allows you to avoid data loss, but you should be cautious about imputation, because it can easily misrepresent the underlying data. At the very least, you need to document how many values have been changed.

Example 1.25 Here is a well-known demonstration data set often referred to as Palmer Penguins:

import seaborn as sns
import numpy as np
penguins = sns.load_dataset("penguins")
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female

Note above that the fourth row of the frame is missing measurements. We can discover where values are missing by using isna, which creates a Boolean data frame:

penguins.isna()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False True True True True True
4 False False False False False False False
... ... ... ... ... ... ... ...
339 False False True True True True True
340 False False False False False False False
341 False False False False False False False
342 False False False False False False False
343 False False False False False False False

344 rows × 7 columns

The number of missing values in each column is:

penguins.isna().sum()
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

To drop the rows with missing values, you can use dropna():

dropped = penguins.dropna()
print("Original has", len(penguins), "rows.")
print("After removals, there are", len(dropped), "rows.")
Original has 344 rows.
After removals, there are 333 rows.
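To limit the removal to particular columns, use the subset argument mentioned above. Here is a sketch that drops only rows missing a bill length, tolerating missing values elsewhere:

partial = penguins.dropna(subset=["bill_length_mm"])
print( len(partial) )    # 342, since only 2 rows lack this measurement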

If you want to impute missing values, you can use fillna(). Here, for example, we replace missing values in any quantitative column with the mean of the column:

filled = penguins.copy()
# replace missing values in each numerical column with that column's mean
for col in penguins.select_dtypes(np.number).columns:
    filled[col] = filled[col].fillna(penguins[col].mean())
filled.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.10000 18.70000 181.000000 3750.000000 Male
1 Adelie Torgersen 39.50000 17.40000 186.000000 3800.000000 Female
2 Adelie Torgersen 40.30000 18.00000 195.000000 3250.000000 Female
3 Adelie Torgersen 43.92193 17.15117 200.915205 4201.754386 NaN
4 Adelie Torgersen 36.70000 19.30000 193.000000 3450.000000 Female

Note that, for example, the categorical column for sex was left alone and still contains missing values.



1.8.2 Strings

A common pitfall is that data that seems numerical to you might be parsed as strings by pandas. This happens when there are non-numeric characters in the data, such as dollar signs or percent signs; a single entry with such a character causes the whole column to be interpreted as strings.

Example 1.26 To demonstrate algorithms in later sections, we will be using a dataset describing loans made on the crowdfunding site LendingClub. For this example, we’ll keep only a few columns.

loans = pd.read_csv("_datasets/loans.csv")
loans = loans[["loan_amnt", "funded_amnt", "int_rate", "title"]]
loans.head(8)
loan_amnt funded_amnt int_rate title
0 5000 5000 10.65% Computer
1 2500 2500 15.27% bike
2 2400 2400 15.96% real estate business
3 10000 10000 13.49% personel
4 3000 3000 12.69% Personal
5 5000 5000 7.90% My wedding loan I promise to pay back
6 7000 7000 15.96% Loan
7 3000 3000 18.64% Car Downpayment

The int_rate column, which gives the interest rate on the loan, has been interpreted as strings due to the percent sign. We’d like to strip out those percent signs and convert the rest to numeric types. We can use the str property of the column to access string methods:

int_rate = loans["int_rate"].str.strip('%')

Then, we can convert the result to numbers:

loans["int_rate"] = pd.to_numeric( int_rate )
loans.head()
loan_amnt funded_amnt int_rate title
0 5000 5000 10.65 Computer
1 2500 2500 15.27 bike
2 2400 2400 15.96 real estate business
3 10000 10000 13.49 personel
4 3000 3000 12.69 Personal

The title column contains strings that are haphazardly formatted. We might want to convert everything in that column to lowercase:

loans["title"] = loans["title"].str.lower()
loans.head(8)
loan_amnt funded_amnt int_rate title
0 5000 5000 10.65 computer
1 2500 2500 15.27 bike
2 2400 2400 15.96 real estate business
3 10000 10000 13.49 personel
4 3000 3000 12.69 personal
5 5000 5000 7.90 my wedding loan i promise to pay back
6 7000 7000 15.96 loan
7 3000 3000 18.64 car downpayment

You can look at the pandas user guide for a more thorough demonstration of string methods in pandas.

1.8.3 Encoding qualitative data

Most algorithms require numerical data. If you have qualitative data, you need to convert it to a numerical form. This is called encoding.

For a category with ordered values, it’s usually best to replace the categories with integers that reflect the ordering.

Example 1.27 This dataset contains information about diamonds:

import seaborn as sns
diamonds = sns.load_dataset("diamonds")
diamonds.head()
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75

As you can see above, some of the features (cut, color, clarity) have string designations. We could convert cut, for example, to a categorical variable like so:

pd.Categorical(diamonds["cut"])
['Ideal', 'Premium', 'Good', 'Premium', 'Good', ..., 'Ideal', 'Good', 'Very Good', 'Premium', 'Ideal']
Length: 53940
Categories (5, object): ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']

However, these values actually have a specific ordering in this context that is lost in the nominal variable above. Instead, we can replace the strings with integers that reflect the ordering:

cuts = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
diamonds["cut"].replace(cuts, range(1, 6), inplace=True)
diamonds.head()
carat cut color clarity depth table price x y z
0 0.23 5 E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 4 E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 2 E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 4 I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 2 J SI2 63.3 58.0 335 4.34 4.35 2.75

For nominal categories that have no natural ordering, it’s common to encode via dummy variables.

Example 1.28 Here is a dataset of NBA player statistics from 2023–2024:

nba = pd.read_csv("_datasets/NBA player stats 2023-24.csv")
nba.head()
RANK NAME TEAM POS AGE GP MPG USG% TO% FTA ... APG SPG BPG TPG P+R P+A P+R+A VI ORtg DRtg
0 NaN Joel Embiid Phi C 30.2 6 41.4 35.7 15.8 78 ... 5.7 1.2 1.5 4.2 43.8 38.7 49.5 12.2 117.1 108.0
1 NaN Jalen Brunson Nyk G 27.8 13 39.8 36.4 9.3 120 ... 7.5 0.8 0.2 2.7 35.7 39.8 43.2 9.3 114.8 114.7
2 NaN Damian Lillard Mil G 33.9 4 39.1 31.4 10.0 38 ... 5.0 1.0 0.0 2.3 34.5 36.3 39.5 8.2 127.6 115.7
3 NaN Shai Gilgeous-Alexander Okc G 25.9 10 39.9 32.3 8.9 81 ... 6.4 1.3 1.7 2.2 37.4 36.6 43.8 11.2 118.3 106.9
4 NaN Tyrese Maxey Phi G 23.6 6 44.6 28.1 8.6 28 ... 6.8 0.8 0.3 2.2 35.0 36.7 41.8 9.1 120.9 113.3

5 rows × 29 columns

For this example, we will narrow the focus to the NAME and POS columns:

nba = nba[["NAME", "POS"]]
nba.head(10)
NAME POS
0 Joel Embiid C
1 Jalen Brunson G
2 Damian Lillard G
3 Shai Gilgeous-Alexander G
4 Tyrese Maxey G
5 Donovan Mitchell G
6 Luka Doncic G
7 Nikola Jokic C
8 LeBron James F
9 Anthony Davis C

The POS column is a categorical variable (the player’s position):

nba["POS"].unique()
array(['C', 'G', 'F', 'G-F', 'F-G', 'C-F', 'F-C'], dtype=object)

We can use pd.get_dummies to convert this column to dummy variables:

dum1 = pd.get_dummies(nba["POS"])
dum1.head(10)
C C-F F F-C F-G G G-F
0 True False False False False False False
1 False False False False False True False
2 False False False False False True False
3 False False False False False True False
4 False False False False False True False
5 False False False False False True False
6 False False False False False True False
7 True False False False False False False
8 False False True False False False False
9 True False False False False False False

As you can see, the POS values were expanded into 7 columns, one for each of the unique positions in the dataset. But this result is a little awkward, because a position value such as "C-F" means both center and forward. We can use a different encoding scheme to handle this more sensibly:

dum2 = nba["POS"].str.get_dummies("-")  # split on hyphen
dum2.head(10)
C F G
0 1 0 0
1 0 0 1
2 0 0 1
3 0 0 1
4 0 0 1
5 0 0 1
6 0 0 1
7 1 0 0
8 0 1 0
9 1 0 0

Now we can manually paste together the original data frame with the dummy variables:

nba = pd.concat([nba, dum2], axis="columns")
nba.head(10)
NAME POS C F G
0 Joel Embiid C 1 0 0
1 Jalen Brunson G 0 0 1
2 Damian Lillard G 0 0 1
3 Shai Gilgeous-Alexander G 0 0 1
4 Tyrese Maxey G 0 0 1
5 Donovan Mitchell G 0 0 1
6 Luka Doncic G 0 0 1
7 Nikola Jokic C 1 0 0
8 LeBron James F 0 1 0
9 Anthony Davis C 1 0 0

Here’s how it looks for a player who has multiple positions:

nba[nba["NAME"] == "Kelly Oubre Jr."]
NAME POS C F G
54 Kelly Oubre Jr. F-G 0 1 1

Exercises

Exercise 1.1 For each type of data, classify it as discrete quantitative, continuous quantitative, categorical, or other.

  1. How many students are enrolled at a university
  2. Your favorite day of the week
  3. How many inches of rain fall at an airport during one day
  4. Weight of a motor vehicle
  5. Manufacturer of a motor vehicle
  6. Text of all Yelp reviews for a restaurant
  7. Star ratings from all Yelp reviews for a restaurant
  8. Size of the living area of an apartment
  9. DNA nucleotide sequence of a cell

Exercise 1.2 Give the length of each vector or series.

  1. Morning waking times every day for a week
  2. Number of siblings (max 12) for each student in a class of 30
  3. Position and momentum of a roller coaster car

Exercise 1.3 Describe a scheme for creating dummy variables for the days of the week. Use your scheme to encode the vector:

[Tuesday, Sunday, Friday, Tuesday, Monday]