Pandas DataFrame Basics: How To Perform Indexing And Slicing

1. Introduction

Pandas is an open source Python library for data analysis. It gives Python the ability to work with spreadsheet-like data for fast data loading, manipulating, aligning, and merging, among other functions.

To give Python these enhanced features, Pandas introduces two new data types to Python: Series and DataFrame.

Series: A Pandas Series is a one-dimensional, labeled set of values, like a single row or a single column of a table. If we select only one row or only one column from a DataFrame, the result is a Series.

Dataframe: The DataFrame represents your entire spreadsheet or rectangular data, whereas the Series is a single column of the DataFrame.

A Pandas DataFrame can also be thought of as a dictionary or collection of Series objects.

Dataframe example:
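As a minimal sketch, here is a small DataFrame built by hand. The column names country, continent, and year appear later in this article; the value column and all the cell values are placeholders made up for illustration.

```python
import pandas as pd

# Placeholder values, for illustration only (not real data)
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["X", "X", "Y", "Y"],
    "year": [2000, 2000, 2005, 2005],
    "value": [1.5, 2.5, 3.5, 4.5],
})

# Rows get the default integer labels 0, 1, 2, 3
print(df)
```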

Series example: If we select the row at index 3 from a DataFrame, the result is a Series. Similarly, if we select only one column from a DataFrame, that too is a Series.

Filtering a row:

Slicing a column:
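The two cases can be sketched as follows, assuming a small DataFrame with placeholder values (the value column and the cell contents are made up for illustration):

```python
import pandas as pd

# Placeholder values, for illustration only
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["X", "X", "Y", "Y"],
    "year": [2000, 2000, 2005, 2005],
    "value": [1.5, 2.5, 3.5, 4.5],
})

row = df.loc[3]       # the row whose index label is 3
col = df["country"]   # a single column

print(type(row))
print(type(col))
```

Both results are Series objects: the row is labeled by the column names, and the column is labeled by the row index.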

2. Creating your own data:

2.a. Creating a Series: The easiest way to create a Series is to pass in a Python list. If we pass in a list of mixed types, the most common representation of both will be used.

Typically the dtype will be object.
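A short sketch of both cases; the list contents are arbitrary examples:

```python
import pandas as pd

# A list of one type keeps a specific dtype
numbers = pd.Series([1, 2, 3])

# A list of mixed types (a string and an integer) falls back to object
mixed = pd.Series(["banana", 42])

print(numbers.dtype)
print(mixed.dtype)
```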

2.b. Creating a Dataframe: A DataFrame can be thought of as a dictionary of Series objects. This is why dictionaries are the most common way of creating a DataFrame. The key represents the column name, and the values are the contents of the column.

If we want to use a column of names as the row index instead of the default integer labels, we can pass the names through the index parameter.
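Here is a minimal sketch of both forms. The names, occupations, and ages are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Hypothetical people, for illustration only
people = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "occupation": ["Chemist", "Statistician"],
    "age": [37, 61],
})

# Same data, but the names become the row labels via the index parameter
people_by_name = pd.DataFrame(
    data={"occupation": ["Chemist", "Statistician"], "age": [37, 61]},
    index=["Alice", "Bob"],
)

print(people)
print(people_by_name.loc["Bob"])
```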

3. Dataframe Explained in Detail

3.a. Load your dataset:

Loading a dataset starts with importing the pandas library, followed by calling the read_csv function to load data stored in CSV format. The head() method then displays the first 5 rows.
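The article's data file is not named here, so the sketch below reads a small in-memory CSV instead, which keeps it runnable on its own. The CSV contents are placeholders; with a real file you would pass its path to read_csv.

```python
import io
import pandas as pd

# Placeholder CSV text standing in for a file on disk
csv_text = """country,continent,year,value
A,X,2000,1.5
B,X,2000,2.5
C,Y,2005,3.5
D,Y,2005,4.5
E,Z,2010,5.5
F,Z,2010,6.5
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default
print(df.head())
```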

3.b. Get the number of rows and columns:

The shape attribute returns a tuple in which the first value is the number of rows and the second value is the number of columns. From the preceding results, we see our data set has 1704 rows and 6 columns. Since shape is an attribute of the DataFrame, not a function or method, it is written without parentheses: df.shape, not df.shape().
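A quick sketch on a small placeholder DataFrame (not the 1704-row data set described above):

```python
import pandas as pd

# Placeholder values, for illustration only
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["X", "X", "Y", "Y"],
    "year": [2000, 2000, 2005, 2005],
    "value": [1.5, 2.5, 3.5, 4.5],
})

# shape is a plain tuple: (rows, columns)
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# Calling it like a method fails, because a tuple is not callable
try:
    df.shape()
except TypeError:
    print("shape is an attribute, not a method")
```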

3.c. Get the column names:

The df.columns attribute can be used to get the column names of a DataFrame. Like shape, it is an attribute rather than a method, so it takes no parentheses.
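For instance, on a small placeholder DataFrame:

```python
import pandas as pd

# Placeholder values, for illustration only
df = pd.DataFrame({
    "country": ["A", "B"],
    "continent": ["X", "Y"],
    "year": [2000, 2005],
})

# columns is an Index object; wrap it in list() for a plain Python list
print(df.columns)
column_names = list(df.columns)
print(column_names)
```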

3.d. Get the dtype of each column:
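The dtypes attribute reports the type of each column. A minimal sketch on placeholder data:

```python
import pandas as pd

# Placeholder values, for illustration only
df = pd.DataFrame({
    "country": ["A", "B"],
    "year": [2000, 2005],
    "value": [1.5, 2.5],
})

# dtypes is a Series: column names as labels, dtypes as values
print(df.dtypes)
```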

3.e. Get more information about our data:
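The info() method prints a summary that includes the index range, each column's non-null count and dtype, and the memory usage. A minimal sketch on placeholder data:

```python
import pandas as pd

# Placeholder values, for illustration only
df = pd.DataFrame({
    "country": ["A", "B"],
    "year": [2000, 2005],
    "value": [1.5, 2.5],
})

# info() prints its summary and returns None
result = df.info()
```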

3.f. Pandas Types Versus Python Types:
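As a rough sketch of how common Python types show up as Pandas dtypes: integers become int64, floats become float64, and dates become a datetime64 type. Text columns are typically reported as object, though newer Pandas versions may report a dedicated string dtype instead. The column names below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["a", "b"],          # Python str
    "whole": [1, 2],             # Python int
    "decimal": [1.5, 2.5],       # Python float
    "when": pd.to_datetime(["2020-01-01", "2020-01-02"]),  # datetime values
})

print(df.dtypes)
```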

4. Looking at Columns, Rows, and Cells:

Now that we’re able to load a data file, we want to be able to inspect its contents. We could print out the contents of the DataFrame, but with today’s data, there are often too many cells to make sense of. Instead, the best way to look at our data is to inspect it in parts by looking at various subsets of the data.

4.1 Subsetting Columns:

If we want to examine multiple columns, we can specify them by names, positions, or ranges.

4.1.a. Subsetting Columns by name: If we want only a specific column from our data, we can access the data using square brackets.

# Looking at country, continent, and year:
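A minimal sketch, using placeholder data with the three column names mentioned above plus a made-up value column:

```python
import pandas as pd

# Placeholder values, for illustration only
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["X", "X", "Y", "Y"],
    "year": [2000, 2000, 2005, 2005],
    "value": [1.5, 2.5, 3.5, 4.5],
})

# One name in square brackets returns a Series
country = df["country"]

# A list of names (note the double brackets) returns a DataFrame
subset = df[["country", "continent", "year"]]

print(country.head())
print(subset.head())
```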

4.1.b. Subsetting Columns by Index Position (No Longer Available Since Pandas v0.20):

As of pandas v0.20, you are no longer able to pass in a list of integers in the square brackets to subset columns.

For example, df[[1]], df[[0, -1]], and df[list(range(5))] no longer work.

4.1.c. Subsetting Columns by Range: You can use the built-in range function to create a range of values in Python. This way you can specify beginning and end values, and Python will automatically create a range of values in between.

# Similarly, create a range from 0 to 5 inclusive, every other integer: range(0, 6, 2), which yields 0, 2, and 4.
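A sketch of both ranges, with the second one used to pick columns through iloc (plain square brackets no longer accept positions, as noted above). The six single-letter column names are placeholders.

```python
import pandas as pd

# Placeholder 6-column DataFrame, for illustration only
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
    columns=list("abcdef"),
)

# range stops before its end value, so range(5) is 0..4
first_five = list(range(5))
print(first_five)   # [0, 1, 2, 3, 4]

# start=0, stop=6, step=2 gives every other integer from 0 to 5
every_other = list(range(0, 6, 2))
print(every_other)  # [0, 2, 4]

# Use the positions with iloc to subset columns
subset = df.iloc[:, every_other]
print(subset)
```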

4.2 Subsetting Rows:

Rows can be subset in multiple ways, by row name or row index. The sections below give a quick overview of the various methods.

4.2.a. Subset Rows by Index Label: loc

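# Get the 100th row: with the default labels starting at 0, this is df.loc[99].

# Get the last row: df.loc[-1] raises an error because no row has the label -1. Instead, use the number of rows minus one as the label, for example df.loc[df.shape[0] - 1].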

Alternatively, we can use the tail method to return the last 1 row, instead of the default 5.
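These row lookups can be sketched on a placeholder DataFrame with 150 rows and the default integer labels (the value column is made up for illustration):

```python
import pandas as pd

# 150 placeholder rows, labeled 0..149 by default
df = pd.DataFrame({"value": range(1000, 1150)})

# Labels start at 0, so label 99 is the 100th row
row_100 = df.loc[99]

# The last label is the number of rows minus one
last_label = df.shape[0] - 1
last_row = df.loc[last_label]

# tail(n=1) also returns the last row, but as a one-row DataFrame
last_as_frame = df.tail(n=1)

# loc looks up labels, and no row is labeled -1
try:
    df.loc[-1]
except KeyError:
    print("df.loc[-1] raised KeyError")
```

Note the difference in return types: loc with a single label gives a Series, while tail gives a DataFrame.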

4.2.b. Subset Rows by Row Number: iloc. iloc does the same thing as loc but subsets by row position rather than by label. In our current example, iloc and loc behave in exactly the same way, since the index labels are the row numbers. However, keep in mind that the index labels do not necessarily have to be row numbers.
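To see the difference, here is a sketch with index labels that are deliberately not the row numbers. The labels 10, 20, 30 and the values are placeholders.

```python
import pandas as pd

# Labels 10, 20, 30 sit at positions 0, 1, 2
df = pd.DataFrame({"value": [1.5, 2.5, 3.5]}, index=[10, 20, 30])

by_label = df.loc[10]      # row labeled 10
by_position = df.iloc[0]   # row at position 0 (the same row)

# Unlike loc, iloc understands negative positions
last = df.iloc[-1]

# loc[0] fails here, because no row carries the label 0
try:
    df.loc[0]
except KeyError:
    print("no row is labeled 0")
```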

4.2.c. Subsetting Rows With ix (Deprecated Since Pandas v0.20): The ix attribute mixed label-based and position-based lookups, which could be confusing. It was deprecated in Pandas v0.20 and removed in v1.0, so use loc or iloc instead.

4.3 Mixing it up: Subsetting Multiple Rows and Columns

The loc and iloc attributes can be used to obtain subsets of columns, rows, or both. The general syntax for loc and iloc uses square brackets with a comma. The part to the left of the comma is the row values to subset; the part to the right of the comma is the column values to subset. That is, df.loc[[rows], [columns]] or df.iloc[[rows], [columns]].

4.3.a. Subsetting columns using loc: (This was discussed above, but is repeated here for continuity.) The Python slicing syntax uses a colon, :. A colon on its own refers to everything. So, to select columns with the loc or iloc syntax, we can write something like df.loc[:, [columns]], which keeps all rows and subsets only the column(s).
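A minimal sketch with loc, on placeholder data:

```python
import pandas as pd

# Placeholder values, for illustration only
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["X", "X", "Y", "Y"],
    "year": [2000, 2000, 2005, 2005],
    "value": [1.5, 2.5, 3.5, 4.5],
})

# The colon keeps every row; the list picks columns by name, in that order
subset = df.loc[:, ["year", "country"]]
print(subset)
```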

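4.3.b. Subsetting columns using iloc: iloc works the same way but takes integer positions instead of names, so -1 selects the last column.

We will get an error if we mix the two up, for example by passing integer positions to loc when the column labels are strings, or by passing column names to iloc.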
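A sketch of the iloc form, followed by one of the mistakes described above. The data is placeholder, and the KeyError shown is what current Pandas versions raise for missing labels.

```python
import pandas as pd

# Placeholder values, for illustration only
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["X", "X", "Y", "Y"],
    "year": [2000, 2000, 2005, 2005],
    "value": [1.5, 2.5, 3.5, 4.5],
})

# Positions 2, 0, and -1: the third, first, and last columns
subset = df.iloc[:, [2, 0, -1]]
print(subset)

# loc expects labels, and no column is labeled 2, 0, or -1
try:
    df.loc[:, [2, 0, -1]]
except KeyError:
    print("loc does not accept positions here")
```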

4.3.c. Slicing Columns: Python’s slicing syntax, :, is similar to the range syntax. Instead of a function that specifies start, stop, and step values delimited by a comma, we separate the values with the colon. If you understand what was going on with the range function earlier, then slicing can be seen as a shorthand means to the same thing.

A selection built with range and the equivalent slice produce the same output. This holds both for a plain range and for a range with a step.
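The equivalence can be checked with a short sketch on a placeholder 6-column DataFrame:

```python
import pandas as pd

# Placeholder 6-column DataFrame, for illustration only
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
    columns=list("abcdef"),
)

# First three columns: range(3) versus the slice :3
with_range = df.iloc[:, list(range(3))]
with_slice = df.iloc[:, :3]
print(with_range.equals(with_slice))

# Every other column: range(0, 6, 2) versus the slice 0:6:2
stepped_range = df.iloc[:, list(range(0, 6, 2))]
stepped_slice = df.iloc[:, 0:6:2]
print(stepped_range.equals(stepped_slice))
```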

4.3.d. Subsetting rows and columns: We can combine the row and column subsetting syntax with the multiple-row and multiple-column subsetting syntax to get various slices of our data.

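For example, a list of row labels can be paired with a list of column names in a single loc call, or a list of row positions with a list of column positions in a single iloc call.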
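A minimal sketch of combined row and column subsetting, on placeholder data:

```python
import pandas as pd

# Placeholder values, for illustration only
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["X", "X", "Y", "Y"],
    "year": [2000, 2000, 2005, 2005],
    "value": [1.5, 2.5, 3.5, 4.5],
})

# One row label and one column name give a single cell
cell = df.loc[2, "country"]

# Lists on both sides give a smaller DataFrame
block = df.loc[[0, 2], ["country", "year"]]

# The same block by position: rows 0 and 2, columns 0 and 2
same_block = df.iloc[[0, 2], [0, 2]]

print(cell)
print(block)
```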

You can also use the slicing syntax on the row portion of the loc and iloc attributes.
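One detail worth seeing in a sketch: a loc slice includes its end label, while an iloc slice stops before its end position, like ordinary Python slicing. The data below is placeholder.

```python
import pandas as pd

# Placeholder values, for illustration only
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["X", "X", "Y", "Y"],
    "year": [2000, 2000, 2005, 2005],
    "value": [1.5, 2.5, 3.5, 4.5],
})

# Labels 1, 2, and 3: the end label is included
by_label = df.loc[1:3, ["country", "year"]]

# Positions 1 and 2: the end position is excluded
by_position = df.iloc[1:3, [0, 2]]

print(by_label)
print(by_position)
```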

Conclusion:

Pandas has many features that are easy to use yet quite powerful, which is why it is so popular among data scientists. Furthermore, the more you practice, the more you learn.

Happy Coding !!!

For further reading, please refer to:

  1. https://www.oreilly.com/library/view/pandas-for-everyone/9780134547046/
  2. https://pandas.pydata.org/

