Pandas is great for dealing with both numerical and text data. In most projects you’ll need to clean up and verify your data before analysing or using it for anything useful. Data might be delivered via databases, CSV or other file formats, web scraping results, or even manual entry. Once you have loaded data into pandas, you’ll likely need to convert it to the type that makes the most sense for what you are trying to accomplish. In this post, I’m going to review the basic datatypes in pandas and how to safely and accurately convert data.
DataFrame and Series
First, let’s review the basic container types in pandas: Series and DataFrame. A Series is a one-dimensional labeled array of data, backed by a NumPy array. A DataFrame is a two-dimensional structure that consists of multiple Series columns that share an index. A Series has a data type, referenced as its dtype, and all elements in that Series will share the same type.
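As a quick illustration (a minimal sketch; the values and column names here are arbitrary):

import pandas as pd

# a Series is a labeled 1-dimensional array with a single dtype
prices = pd.Series([10, 20, 30])
print(prices.dtype)   # int64

# a DataFrame is a set of Series columns sharing one index;
# each column can have its own dtype
frame = pd.DataFrame({'price': [1.5, 2.0], 'label': ['a', 'b']})
print(frame.dtypes)   # price is float64, label is object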
But what types?
The data type can be a core NumPy datatype, which means it could be a numerical type or a Python object. But the type can also be a pandas extension type, known as an ExtensionDtype. Without getting into too much detail, just know that two very common examples are the CategoricalDtype and, in pandas 1.0+, the StringDtype. For now, what’s important to remember is that all elements in a Series share the same type.
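For example, here is a quick look at both of those extension types (the string dtype example assumes pandas 1.0 or later):

import pandas as pd

# a categorical Series: values drawn from a fixed set of categories
colors = pd.Series(['red', 'green', 'red'], dtype='category')
print(colors.dtype)   # category

# the dedicated string dtype, new in pandas 1.0
names = pd.Series(['Alice', 'Bob'], dtype='string')
print(names.dtype)    # string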
What’s important to realize is that when constructing a Series or a DataFrame, pandas will pick the datatype that can represent all values in the Series (or DataFrame). Let’s look at an example to make this clearer. Note, this example was run using pandas version 1.1.4.
>>> import pandas as pd
>>> s = pd.Series([1.0, 'N/A', 2])
>>> s
0    1.0
1    N/A
2      2
dtype: object
As you can see, pandas has chosen the object type for my Series, since it can represent values that are floating point numbers, strings, and integers. The individual items in this Series are all of a different type in this case, but can be represented as objects.
>>> print(type(s[0]))
<class 'float'>
>>> print(type(s[1]))
<class 'str'>
>>> print(type(s[2]))
<class 'int'>
So, what’s the problem?
The problem with using object for everything is that you rarely want to work with your data this way. Looking at this first example, if you had imported this data from a text file you’d most likely want it to be treated as numerical, and perhaps calculate some statistical values from it.
>>> try:
...     s.mean()
... except Exception as ex:
...     print(ex)
...
unsupported operand type(s) for +: 'float' and 'str'
It’s clear here that the mean function fails because it’s trying to add up the values in the Series and cannot add the ‘N/A’ to the running sum of values.
So how do we fix this?
Well, we could inspect the values and convert them by hand or using some other logic, but luckily pandas gives us a few options to do this in a sensible way. Let’s go through them all.
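Just to show what that by-hand approach might look like before we move on, here is a rough sketch using a hypothetical clean_value helper (the name and the fallback-to-None behavior are my own choices, not pandas built-ins):

import pandas as pd

def clean_value(value):
    # hypothetical helper: anything unconvertible becomes None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

s = pd.Series([1.0, 'N/A', 2])
print(s.map(clean_value))
# 0    1.0
# 1    NaN
# 2    2.0
# dtype: float64

This works, but it is easy to get wrong and slow for large data, which is why the built-in conversion functions below are usually a better choice.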
astype
First, you can try to use astype to convert values. astype is limited, however, because if it cannot convert a value it will either raise an error or return the original value. Because of this, it cannot completely help us in this situation.
>>> try:
...     s.astype('float')
... except Exception as ex:
...     print(ex)
...
could not convert string to float: 'N/A'
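The "return the original value" behavior mentioned above comes from the optional errors parameter. A small sketch (errors='ignore' is supported in pandas 1.x, though newer pandas versions have deprecated it):

# errors='ignore' hands back the original Series untouched on failure
result = s.astype('float', errors='ignore')
print(result.dtype)  # still object; nothing was converted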
But astype is very useful, so before moving on, let’s look at a few examples where you would use it. First, if your data was all convertible between types, it would do just what you want.
>>> s2 = pd.Series([1, "2", "3.4", 5.5])
>>> print(s2)
0      1
1      2
2    3.4
3    5.5
dtype: object
>>> print(s2.astype('float'))
0    1.0
1    2.0
2    3.4
3    5.5
dtype: float64
Second, astype is useful for saving space in Series and DataFrames, especially when you have repeated values that can be expressed as categoricals. Categoricals can save memory and also make data a little more readable during analysis, since they will tell you all the possible values. For example:
>>> s3 = pd.Series(["Black", "Red"] * 1000)
>>> s3.astype('category')
0       Black
1         Red
2       Black
3         Red
4       Black
        ...
1995      Red
1996    Black
1997      Red
1998    Black
1999      Red
Length: 2000, dtype: category
Categories (2, object): ['Black', 'Red']
>>> print("String:", s3.memory_usage())
String: 16128
>>> print("Category:", s3.astype('category').memory_usage())
Category: 2224
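The savings come from how categoricals are stored: each value is replaced by a small integer code that points into a single shared table of categories. You can peek at that encoding yourself (a quick sketch):

c = s3.astype('category')
print(c.cat.categories)    # Index(['Black', 'Red'], dtype='object')
print(c.cat.codes.head())  # 0, 1, 0, 1, 0: compact int8 codes, one per row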
You can also save space by using smaller NumPy types.
>>> s4 = pd.Series([22000, 3, 1, 9])
>>> s4.memory_usage()
160
>>> s4.astype('int8').memory_usage()
132
But note there is an error above! astype will happily convert numbers that don’t fit in the new type without reporting the error to you.
>>> s4.astype('int8')
0   -16
1     3
2     1
3     9
dtype: int8
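If you want to guard against this while still using astype, one option is to check the bounds of the target type first. A minimal sketch using NumPy's iinfo:

import numpy as np

info = np.iinfo('int8')  # bounds of the target type: -128 to 127
if s4.between(info.min, info.max).all():
    small = s4.astype('int8')
else:
    print("values out of range for int8")  # this branch runs for s4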
Note that you can also use astype on DataFrames, even specifying different values for each column.
>>> df = pd.DataFrame({'a': [1, 2, 3.3, 4], 'b': [4, 5, 2, 3], 'c': ["4", 5.5, "7.09", 1]})
>>> df.astype('float')
     a    b     c
0  1.0  4.0  4.00
1  2.0  5.0  5.50
2  3.3  2.0  7.09
3  4.0  3.0  1.00
>>> df.astype({'a': 'uint', 'b': 'float16'})
   a    b     c
0  1  4.0     4
1  2  5.0   5.5
2  3  2.0  7.09
3  4  3.0     1
to_numeric (or to_datetime or to_timedelta)
There are a few better options available in pandas for converting one-dimensional data (i.e. one Series at a time). These methods provide better error handling than astype, through the optional errors and downcast parameters. Take a look at how to_numeric deals with the first Series created in this post. Passing 'coerce' for errors will turn any conversion errors into NaN. Passing in 'ignore' will get the same behavior we had available in astype, returning our original input. Likewise, passing in 'raise' will raise an exception.
>>> pd.to_numeric(s, errors='coerce')
0    1.0
1    NaN
2    2.0
dtype: float64
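For completeness, the other two error modes look like this (a small sketch; s is still the object Series from the start of the post):

# errors='ignore' gives back the original object Series
print(pd.to_numeric(s, errors='ignore').dtype)  # object

# errors='raise' (the default) fails loudly
try:
    pd.to_numeric(s, errors='raise')
except ValueError as ex:
    print(ex)  # the message complains about 'N/A'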
And if we want to save some space, we can safely downcast to the minimum size that will hold our data without errors (here we get int16 instead of the int64 we’d have if we didn’t downcast).
>>> pd.to_numeric(s4, downcast='integer')
0    22000
1        3
2        1
3        9
dtype: int16
>>> pd.to_numeric(s4).dtype
dtype('int64')
The to_datetime and to_timedelta methods behave similarly, but for dates and timedeltas.
>>> pd.to_timedelta(['2 days', '5 min', '-3s', '4M', '1 parsec'], errors='coerce')
TimedeltaIndex(['2 days 00:00:00', '0 days 00:05:00', '-1 days +23:59:57',
                '0 days 00:04:00', NaT],
               dtype='timedelta64[ns]', freq=None)
>>> pd.to_datetime(['11/1/2020', 'Jan 4th 1919', '20200930 08:00:31'])
DatetimeIndex(['2020-11-01 00:00:00', '1919-01-04 00:00:00',
               '2020-09-30 08:00:31'],
              dtype='datetime64[ns]', freq=None)
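The errors parameter works the same way for these functions, and to_datetime also accepts an explicit format string when you know the layout in advance, which avoids ambiguous parses. A small sketch:

# unparseable dates become NaT instead of raising
print(pd.to_datetime(['11/1/2020', 'not a date'], errors='coerce'))

# an explicit format removes day-first vs. month-first ambiguity
print(pd.to_datetime(['01/11/2020'], format='%d/%m/%Y'))  # November 1st, 2020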
Since these functions are all for one-dimensional data, you will need to use apply on a DataFrame. For instance, to downcast all the values to the smallest possible floating point size, use the downcast parameter.
>>> from functools import partial
>>> df.apply(partial(pd.to_numeric, downcast='float')).dtypes
a    float32
b    float32
c    float32
dtype: object
infer_objects
If you happen to have a pandas object that consists of objects that haven’t been converted yet, both Series and DataFrame have a method that will attempt to convert those objects to the most sensible type. To see this, you have to use a somewhat contrived example, because pandas will attempt to convert objects when you create them. For example:
>>> pd.Series([1, 2, 3, 4], dtype='object').infer_objects().dtype
int64
>>> pd.Series([1, 2, 3, '4'], dtype='object').infer_objects().dtype
object
>>> pd.Series([1, 2, 3, 4]).dtype
int64
You can see here that if the Series happens to have all numerical types (in this case integers) but they are stored as objects, it can figure out how to convert these to integers. But it doesn’t know how to convert the ‘4’ to an integer. For that, you need to use one of the techniques from above.
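The same method exists on DataFrame. The pandas documentation uses a similar trick to produce an unconverted object column: build a mixed column, then slice away the non-numeric value:

df2 = pd.DataFrame({'A': ['a', 1, 2, 3]})
df2 = df2.iloc[1:]                 # drop the string; the ints stay stored as objects
print(df2.dtypes)                  # A    object
print(df2.infer_objects().dtypes)  # A    int64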
convert_dtypes
This method is new in pandas 1.0, and can convert to the best possible dtype that supports pd.NA. Note that this will be the pandas dtype rather than the NumPy dtype (i.e. Int64 instead of int64).
>>> pd.Series([1, 2, 3, 4], dtype='object').convert_dtypes().dtype
Int64
>>> pd.Series([1, 2, 3, '4'], dtype='object').convert_dtypes().dtype
object
>>> pd.Series([1, 2, 3, 4]).convert_dtypes().dtype
Int64
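Where this really pays off is with missing values. A plain integer Series with a missing entry gets silently promoted to float64, while convert_dtypes keeps it as a nullable integer holding pd.NA:

s5 = pd.Series([1, 2, None])
print(s5.dtype)                   # float64, because the NaN forced a promotion
print(s5.convert_dtypes().dtype)  # Int64, with the missing value held as pd.NA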
What should you use most often then?
What I recommend doing is looking at your raw data once it is imported. Depending on your data source, it may already be in the dtype that you want. But once you need to convert it, you have all the tools you need to do this correctly. For numeric types, the pd.to_numeric method is best suited for doing this conversion in a safe way, and with wise use of the downcast parameter you can also save space. Consider using astype("category") when you have repeated data, to save some space as well. The convert_dtypes and infer_objects methods are not going to be that helpful in most cases, unless you somehow have data stored as objects that is readily convertible to another type. Remember, there’s no magic function in pandas that will ensure you have the best data type for every case; you need to examine and understand your own data to use or analyze it correctly. But knowing the best way to do that conversion is a great start.
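To close, here is how those recommendations might look end to end on a small messy frame (a sketch; the data and column names are made up):

raw = pd.DataFrame({
    'count': ['1', '2', 'N/A', '4'],          # numbers stored as text
    'color': ['Red', 'Black', 'Red', 'Red'],  # repeated labels
})

clean = raw.assign(
    count=pd.to_numeric(raw['count'], errors='coerce', downcast='float'),
    color=raw['color'].astype('category'),
)
print(clean.dtypes)  # count: float32 (the N/A became NaN), color: category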