Manipulating Data with Labels in pandas
pandas is probably the most widely used Python package for handling tabular data. Although I hated using pandas for quite some time (fun fact) due to its cumbersome usage (for example, get a flavor of it from pandas user guide: miscellaneous indexing faq and real python: SettingWithCopyWarning) compared to the straightforward transformations achievable with NumPy arrays, my perspective shifted over time. I came to recognize its strengths, particularly when confronted with CSV files. From my perspective, pandas’ defining feature lies in its use of labels. Unlike NumPy arrays, where manipulating elements requires numeric indices devoid of semantic meaning, pandas utilizes row and column labels that provide explicit meaning to the data, thereby preventing users from getting lost in mere numbers.
While I maintain a preference for NumPy to execute array transformations, the utilization of pandas becomes essential when working with labels. This post serves as a repository for the practical applications that I have collected while working with pandas labels, offering a convenient reference for future.
pandas.Index
pandas.Index is the data structure to store a series of labels in pandas. For row labels, the pandas.Index
object is called index
, and for column labels, it’s called columns
(as shown in the code example below). Note that index becomes an overloaded term here. To differentiate them clearly, this post adds pandas.
prefix explicitly when referring to the class underlying both row and column labels, and use index directly when referring to row labels.
In : df = pd.DataFrame({"unit": ["a", "a", "b", "c", "c"],
"time": [0, 1, 1, 0, 1],
"value": [10, 15, 5, 8, 12]})
In : df # df will be also used in code examples later
Out:
unit time value
0 a 0 10
1 a 1 15
2 b 1 5
3 c 0 8
4 c 1 12
In : df.index
Out: RangeIndex(start=0, stop=5, step=1)
In : df.columns
Out: Index(['unit', 'time', 'value'], dtype='object')
Subclass. In certain use cases, subclasses of pandas.Index
are available to optimize memory usage and improve computing speed. Notable examples include RangeIndex
for monotonic integer labels, and MultiIndex
for multi-level labels.
Attributes. Member variables in pandas.Index
class include:
name
: str, the name (category) of all labelsnames
: pandasFrozenList
of str, the name (category) of all labels in each level. The length of names equals the number of levels.values
: NumPy ndarray, the actual labels
In : print(df.columns.name)
None
In : df.columns.names
Out: FrozenList([None])
In : df.columns.values
Out: array(['unit', 'time', 'value'], dtype=object)
Like the code example above, name
and names
are usually empty, unless explicitly set in order to provide more semantic meaning. An example with concrete names be found latter in MultiIndex section.
data selection by labels
loc
uses labels to access data. Contrarily, iloc
uses positions to access data. See details in pandas user guide: indexing and selecting data.
index
This section expands further for row labels specifically.
When creating a DataFrame
(2-dimensional data) or a Series
(1-dimensional data), if the input data contains no indexing information and index is not explicitly provided either, its index defaults to RangeIndex
.
Set Index. The index of a DataFrame can be changed by DataFrame.set_index(keys)
. Providing a column name instructs using an existing column as index. The old index gets completely replaced (unless specify append=True
in which case the new index are combined with the old index forming a multi-level index).
In : df.set_index("unit")
Out:
time value
unit
a 0 10
a 1 15
b 1 5
c 0 8
c 1 12
In : multi_index_df = df.set_index(["unit", "time"])
In : multi_index_df # multi_index_df will be also used in code examples later
Out:
value
unit time
a 0 10
1 15
b 1 5
c 0 8
1 12
Note that even though index is generally expected to be unique for efficient data retrieval, pandas.Index
does not enforce the uniqueness constraint (as shown in the first example above). It’s important to understand what you are doing and pre-process the column as needed before setting it as index, because removing duplicated index in a DataFrame afterwards is not very straightforward (ref).
In : df.set_index("unit", verify_integrity=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Index has duplicate keys: Index(['a', 'c'], dtype='object', name='unit')
Reindex. If the new index does not come directly from an existing column name, explicitly specify the new index to replace the old index by assigning DataFrame.index
directly or calling DataFrame.reindex(index)
. The benefit of the latter one is it allows different lengths between the old and the new index with its filling logic to add missing values, while the first one throws ValueError complaining length mismatch. An example of DataFrame.reindex(index)
can be found later in the MultiIndex section, where a multi-level index is expanded to be the full cartesian product of labels from each level of it, and missing values are labelled NA (by default) for the added labels.
Reset Index. To change back to the default sequential numerical index RangeIndex
, DataFrame.reset_index()
is used. The default adds the old index as a column with the index name (if the index name is empty, the new column name is set to “index”). In use cases such as numerical index becomes nuisance after filtering or reordering the rows, set drop=True
parameter to completely drop the old index.
In : df[df["time"] == 1] # index is useless after filtering rows
Out:
unit time value
1 a 1 15
2 b 1 5
4 c 1 12
In : df[df["time"] == 1].reset_index() # index becomes a column by default
Out:
index unit time value
0 1 a 1 15
1 2 b 1 5
2 4 c 1 12
In : df[df["time"] == 1].reset_index(drop=True)
Out:
unit time value
0 a 1 15
1 b 1 5
2 c 1 12
In : df.sort_values(by="value", ignore_index=True) # ignore_index introduced in pandas version >1.0
# equivalent to df.sort_values(by="value").reset_index(drop=True)
Out:
unit time value
0 b 1 5
1 c 0 8
2 a 0 10
3 c 1 12
4 a 1 15
MultiIndex
MultiIndex
is the subclass of pandas.Index
for multi-level labels. It can be used for both row and column labels.
To create a DataFrame with multi-level index, you can use set_index
to set multiple columns as the index. To directly construct a MultiIndex
object, you can use its class functions to construct it from arrays, tuples, and products (see pandas user guide: MultiIndex / advanced indexing).
Following the example of multi_index_df
from set_index
section:
In : multi_index_df
Out:
value
unit time
a 0 10
1 15
b 1 5
c 0 8
1 12
In : multi_index_df.index.levels # levels are member variable in MultiIndex but not in Index
Out: FrozenList([['a', 'b', 'c'], [0, 1]])
In : multi_index_df.index.values
Out: array([('a', 0), ('a', 1), ('b', 1), ('c', 0), ('c', 1)], dtype=object)
In : expanded_index = pd.MultiIndex.from_product(multi_index_df.index.levels, names=["unit", "time"])
In : expanded_index.levels
Out: FrozenList([['a', 'b', 'c'], [0, 1]])
In : expanded_index.values # label ('b', 0) is newly added
Out:
array([('a', 0), ('a', 1), ('b', 0), ('b', 1), ('c', 0), ('c', 1)], dtype=object)
In : expanded_df = multi_index_df.reindex(expanded_index)
In : expanded_df
Out:
value
unit time
a 0 10.0
1 15.0
b 0 NaN
1 5.0
c 0 8.0
1 12.0
To get the labels for all data at one specific level, use MultiIndex.get_level_values(level)
by providing its integer position or its level name. Note that the result verbosely displays all labels exactly as the order they appear in the data, while MultiIndex.levels[level_idx]
shows a de-duplicated list and does not necessarily show labels according to the order that they appear in the data. Therefore, slicing tricks may be needed on the verbose list of labels in order to keep the correct order in the de-duplicated list (as shown in the last block in the below code example).
In : import numpy as np
In : unsorted_expanded_df = pd.DataFrame(
{"unit": ["c", "c", "b", "b", "a", "a"],
"time": [0, 1, 0, 1, 0, 1],
"value": [8, 12, np.nan, 5, 10, 15]}
).set_index(["unit", "time"])
In : unsorted_expanded_df
Out:
value
unit time
c 0 8.0
1 12.0
b 0 NaN
1 5.0
a 0 10.0
1 15.0
In : unsorted_expanded_df.index.levels[0]
Out: Index(['a', 'b', 'c'], dtype='object', name='unit')
In : unsorted_expanded_df.index.get_level_values(0)
Out: Index(['c', 'c', 'b', 'b', 'a', 'a'], dtype='object', name='unit')
In : unsorted_expanded_df.index.get_level_values(0)[::2] # 2 is len(unsorted_expanded_df.index.levels[1])
Out: Index(['c', 'b', 'a'], dtype='object', name='unit')
Note that after slicing a MultiIndex or slicing a DataFrame with MultiIndex, the MultiIndex levels does not re-compute and keep the same defined labels (as defined by pandas FrozenList
) (see pandas user guide: ), so there could be useless labels that the data subset does not need. In this case, use MultiIndex.remove_unused_levels()
to clean up the multi-level labels (the func name is a little confusing to me, since it actually removes unused labels in each level - stored in the member variable levels, rather than reduces the number of levels).
In : sliced = multi_index_df.loc[("a", 1):("b", 1)]
In : sliced
Out:
value
unit time
a 1 15
b 1 5
In : sliced.index.levels # the defined labels in each level keep unchanged after slicing
Out: FrozenList([['a', 'b', 'c'], [0, 1]])
In : sliced.index = sliced.index.remove_unused_levels()
In : sliced.index.levels
Out: FrozenList([['a', 'b'], [1]])
Data Shape Transformations
- long to wide: pivot
wide_data = long_data.pivot(index="unit", columns="period")
- wide to long: stack
long_data = wide_data.stack().reset_index()