10 Pandas functions that you will always find helpful
Data is messy. In fact, the richest and most interesting data can be extremely messy. Luckily for us, the pandas data preprocessing library can help us to feature engineer the messiest of data.
Pandas is the Swiss army knife of dataframe creation and transformation. One drawback to pandas, however, is that it's a MASSIVE library. First-time users tend to get overwhelmed upon encountering the myriad of functions and methods that pandas has available. Hold on a second, do we really have to bog ourselves down trying to learn ALL of pandas' functions? I promise you we don't. In fact, in a typical machine learning workflow, you will find yourself repeatedly using a subset of pandas functions. In this article, I am going to go over 10 of these pandas functions. These are common functions that you will utilize in almost every data project.
Before we jump in, however, what exactly is Pandas? No, not that panda in the cover photo (although he is very cute). Pandas is an open-source software library created by Wes McKinney to perform data analysis and transformation on tabular datasets. Pandas is often the workhorse of a critical step in the machine learning lifecycle: feature engineering. During the feature engineering process, it is important to check and perform the necessary transformations on each of your model's features. This process ensures that when we train our algorithm on the clean dataset, each feature contributes proportionately to the model's overall predictive power. In other words, feature engineering ensures that we extract the most value out of every pertinent feature in our dataset by transforming it into a format that the machine learning algorithm can readily use to generate insights.
Now, pandas is a HUGE library. If you have not done so yet, I'd encourage you to check out the documentation here. It is by far one of the most carefully documented libraries you will find in the machine learning ecosystem. Despite being so well documented, beginners are often thrown off by the myriad of functions and methods that this library has. In this guide, I will show you ten key aspects of Pandas that will take your data from a hot mess to a beautiful, organized, and insightful dataset.
1. Check your Data
After loading your dataset, the first thing you should do is check it out. Let's see what we're working with here. The three most common pandas functions for doing this are head, tail, and sample. Let's see them in action. For the purposes of this demo, we will use a dataset from Kaggle that provides feedback scores for over 120,000 airline customers. Get the dataset here.
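Here's a minimal sketch of what that first look might involve, assuming the Kaggle file has been downloaded locally as train.csv (adjust the filename to match your copy):

```python
import pandas as pd

# Load the airline passenger feedback dataset (hypothetical local filename)
df = pd.read_csv('train.csv')

df.head()     # first 5 rows
df.tail()     # last 5 rows
df.sample(5)  # 5 randomly chosen rows
```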
2. Check the data about your data
Checking out the rows in your dataset is important. It is also important to get some general information about the dataset, such as the total number of rows, the total number of columns, the data types, and so on. To answer these questions, we will use pandas' shape and info methods.
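Both are one-liners:

```python
df.shape   # a (rows, columns) tuple
df.info()  # column names, non-null counts and data types for every column
```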
3. Check the Summary Statistics
If you're working with numerical data, another easy sanity check is to find out how the data is characterized statistically. Luckily for us, pandas has a method that can do this in just one line of code: the describe method. The describe method returns the following for each numerical variable: count, mean, standard deviation, minimum, maximum, and the quartiles. Because I don't particularly like to see a trail of digits behind every figure, I typically chain pandas' round(2) method after the describe method to get results rounded to two decimal places.
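For example:

```python
# Summary statistics for every numerical column, rounded to two decimal places
df.describe().round(2)
```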
4. Check Cardinality of Categorical Variables
Another useful feature that pandas provides is the ability to check the cardinality of categorical variables. Cardinality may be defined as the number of member types within a given set. To check the members within a particular categorical variable, we will use pandas' value_counts() method. A related method that I like to use alongside value_counts(), to get the total number of member types within a given categorical variable, is nunique().
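A quick sketch, assuming the dataset contains a categorical column named 'Class' (swap in whichever column you want to inspect):

```python
# Frequency of each member type within the 'Class' column (assumed column name)
df['Class'].value_counts()

# Total number of distinct member types, i.e. the cardinality
df['Class'].nunique()
```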
5. Check, Fill and/or Drop Missing Data
More often than not, we will encounter a dataset with missing values (unless of course you're using a Kaggle dataset, haha). Okay, jokes aside, most real-world datasets will have missing values, and it's important to know how to: 1) check for them, 2) impute the missing values, and 3) drop the missing values. Pandas has three functions that perform those tasks, and they're very easy to remember: isna, fillna, and dropna, which are used to check, impute, and drop missing values respectively. Here are the functions in action:
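A sketch of all three, assuming the dataset has a numerical column named 'Arrival Delay in Minutes' with some gaps (any column with missing values will do):

```python
# 1) Check: count the missing values in each column
df.isna().sum()

# 2) Impute: fill the gaps in an assumed numerical column with its median
df['Arrival Delay in Minutes'] = df['Arrival Delay in Minutes'].fillna(
    df['Arrival Delay in Minutes'].median()
)

# 3) Drop: remove any rows that still contain missing values
df = df.dropna()
```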
6. Select a specific data type
Oftentimes, we will want to separate our numerical variables from our categorical variables (or any other data type for that matter). Before we can separate them, however, we need to be able to select them. We can do this in pandas by using the select_dtypes method. This method allows us to select features according to a specific data type. In the example below, we will select only the features that are objects, that is, the categorical variables.
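For example:

```python
# Keep only the object (categorical) columns
categorical_df = df.select_dtypes(include='object')
categorical_df.head()
```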
7. Transform Objects into Dummy Variables
At this point, you might be wondering, why would I want to select only object data types, crazy lady? Aha, well, your question is perfectly timed, my friend, because our next go-to pandas function is get_dummies(). get_dummies(), as its name implies, allows us to convert categorical variables into 1's and 0's. Before we can train our model, we need to ensure that all our object-type variables have been converted to a numerical representation. Pandas' get_dummies() function does so with ease, as illustrated in the example below. We've also passed the drop_first=True parameter to prevent multicollinearity among the dummy variables (otherwise known as the dummy variable trap):
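A sketch, continuing with the categorical_df we selected above:

```python
# One-hot encode the object columns, dropping the first level of each
# category to avoid the dummy variable trap
dummies_df = pd.get_dummies(categorical_df, drop_first=True)
dummies_df.head()
```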
8. Create Pivot Tables
Pivot tables are a great way to compare features within a large dataset based on an aggregate measure of another feature (typically, the target feature). In this example, we use pandas' groupby method to illustrate airline passengers' level of satisfaction based on their evaluation of key aspects of the flight. groupby is accompanied by an aggregate measure such as COUNT, MIN, MAX, MEAN, etc. In the example below, I've computed the mean rating for each in-flight factor, grouped by the passengers' overall satisfaction level. (SPOILER ALERT: better leg room service seems to be associated with a higher level of in-flight satisfaction; no surprise there, right?)
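Here is one way such a summary might be built with groupby; the 'satisfaction', 'Leg room service', and 'Seat comfort' column names are assumptions, so check the exact spellings in your copy of the dataset:

```python
# Mean rating of a few in-flight factors, grouped by overall satisfaction level
# (column names below are assumed; adjust them to match the actual dataset)
df.groupby('satisfaction')[['Leg room service', 'Seat comfort']].mean().round(2)
```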
9. Combine two datasets
Oftentimes, we will want to combine two datasets. Remember our example earlier where we transformed all the categorical variables into dummy variables? Well, what if we now wanted to incorporate these feature-engineered values back into our original dataset? You guessed it! Pandas has a function for that, and it's called concat. Here's how it works:
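A sketch, reusing the dummies_df we created earlier:

```python
# Join the dummy columns onto the numerical part of the original dataframe,
# column-wise (axis=1), so the encoded features sit alongside the rest
numerical_df = df.select_dtypes(exclude='object')
combined_df = pd.concat([numerical_df, dummies_df], axis=1)
combined_df.head()
```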
10. Check the Correlation of the Features
Last, but certainly not least, checking the correlation among features is extremely important for a number of reasons. Here are two: 1) it's always good to assess the pairwise association between your key features and your target; for example, do ice cream sales increase as temperatures rise? 2) it's always good to check for multicollinearity among your features, that is, whether two or more of your independent variables (features) are explaining the same phenomenon in different ways (possibly telling the same piece of the story, just using a different measure). It is vital that we mitigate against this, because multicollinear features make it difficult to determine how much each feature individually contributes to the model's predictions. Pandas provides us with an easy way to assess this: the corr() method. By default, corr() uses the Pearson correlation coefficient, where -1 represents a perfect negative (inverse) association, 0 represents no association, and +1 represents a perfect positive association. Here is the function in action:
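For example (on older pandas versions the numeric_only argument may not be needed):

```python
# Pairwise Pearson correlation between the numerical features
df.corr(numeric_only=True).round(2)
```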
Conclusion
The pandas functions and methods we discussed in this guide are some of the most helpful that I've encountered while working on my machine learning projects. Pandas is a workhorse library, and it comes in especially handy during the feature engineering stage of the machine learning workflow. Oftentimes, beginners are daunted by the wide array of functions and methods that the pandas framework has. This is why it's very important not to get bogged down trying to learn everything, but instead to establish your personal workflow; in other words, find the way of doing things that makes you most comfortable, efficient, and productive. However, if you're just starting out, please feel free to copy my workflow and apply it to your needs as necessary. If you found this guide helpful, please give it a clap and follow me for future guides similar to this one. Blessed Love!