Pandas DataFrame-1

A Pandas DataFrame is a two-dimensional data structure with rows and columns. It can hold heterogeneous data, i.e. columns of different data types, and both of its axes (rows and columns) are labelled. In simple words, a Pandas DataFrame is similar to a table in a database or an Excel sheet.

In this tutorial we are going to cover

  • Creating DataFrame
  • Accessing DataFrame

Creating a DataFrame

A Pandas DataFrame can be created from a list, dictionary, NumPy array, tuple, Pandas Series, etc. It can also be created from a CSV (comma-separated values) file, an Excel file, a database, etc. We will see a few examples below.

Before creating a Pandas DataFrame, let's have a look at some parameters of the DataFrame constructor:

pandas.DataFrame(data, index, columns, dtype)

data: Here we pass the data source from which we want to create a DataFrame. Data can take various forms like list, tuple, array, dictionary, etc. It can also be another DataFrame, since a DataFrame can be created from another DataFrame.

index: Here we pass the row labels. If we do not pass any value, the default labels are used, i.e. integers starting from 0.

columns: Here we pass the column labels. If we do not pass any value, the default labels are integers starting from 0.

dtype: The data type to force on the data. Only a single dtype can be passed; if it is omitted, pandas infers the type of each column.
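To make the four parameters concrete, here is a minimal sketch that passes all of them at once. The sample values and the labels row1, row2, A and B are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample data: two rows, two numeric columns.
df_params = pd.DataFrame(
    data=[[1, 2], [3, 4]],
    index=["row1", "row2"],  # row labels
    columns=["A", "B"],      # column labels
    dtype=float,             # a single dtype applied to every column
)

print(df_params)
print(df_params.dtypes)
```

Because all the values here are numeric, forcing dtype=float works for every column.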

DataFrame From List

DataFrames can be created from a single list or from a list of lists. To create a DataFrame we first have to import the pandas library. Below is an example.

Example

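 import pandas as pd
  
 # Creating a list of lists as input
 data_list = [['Alpha', 32], ['Bravo', 26], ['Charlie', 27]]
  
 # Creating the DataFrame. Passing dtype=float to the constructor raises an
 # error on recent pandas versions because 'Name' holds strings, so only the
 # 'Age' column is converted to float here.
 df_list = pd.DataFrame(data_list, columns=['Name', 'Age'])
 df_list['Age'] = df_list['Age'].astype(float)
 print(df_list)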

Output

       Name   Age
 0    Alpha  32.0
 1    Bravo  26.0
 2  Charlie  27.0 

In the above example, we have given the column labels, but the rows use the default index, which starts from 0.
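The example above uses a list of lists. For completeness, here is a small sketch of the single-list case, where each list element becomes one row of a one-column DataFrame. The variable names are chosen for illustration.

```python
import pandas as pd

# A single flat list: each element becomes one row.
names = ["Alpha", "Bravo", "Charlie"]

df_single = pd.DataFrame(names, columns=["Name"])
print(df_single)
```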

DataFrame From Dict

When creating a DataFrame from a dict, the keys are treated as column labels and the values as data. When the values are lists, tuples or arrays, they must all have the same length, and if an index is passed its length must match the length of the values.

Example

 import pandas as pd
  
 # Creating a dict of lists
 data = {"Name": ["Alpha", "Bravo", "Charlie"],
         "Age": [32, 27, 26],
         "Height": [175, 169, 180]}
  
 # Creating a DataFrame from the above dict
 df = pd.DataFrame(data)
 print(df)

Output

      Name  Age  Height
0    Alpha   32     175
1    Bravo   27     169
2  Charlie   26     180 

In the above example, we can see that the key of each key-value pair is used as the column name and the value, i.e. the list, is used as the data. We can also see the default row index, starting from 0.
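The index rule mentioned earlier can be sketched as follows: a custom index works when its length matches the lists, and a mismatched length is rejected with a ValueError. The labels a, b and c are arbitrary.

```python
import pandas as pd

data = {"Name": ["Alpha", "Bravo", "Charlie"], "Age": [32, 27, 26]}

# Index length (3) matches the length of each list, so this works.
df_idx = pd.DataFrame(data, index=["a", "b", "c"])
print(df_idx)

# Index length (2) does not match the lists (3), so pandas raises ValueError.
mismatch_raised = False
try:
    pd.DataFrame(data, index=["a", "b"])
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```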

Creating a DataFrame from CSV file

To create a DataFrame from a CSV file, we first have to set our current working directory to the folder where the file is located, or place the file in our current working directory, or we can directly pass the full path while reading the file. We can check the type of the resulting object with the built-in type() function.

Example

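 import pandas as pd
  
 # Raw string (r"...") so the backslashes in the Windows path are not treated as escapes
 df2 = pd.read_csv(r"E:\Blog\Python\Dataset\iris.csv")
  
 print(df2.head())
 print(type(df2))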

Output

   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
<class 'pandas.core.frame.DataFrame'> 

In the above example, we read a CSV file using the read_csv function available in Pandas. The head() method returns the first 5 rows of the DataFrame. The header row of the CSV file is used as the column names, and the rows get the default index.
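If the iris file is not at hand, the same behaviour can be tried without any file on disk. The sketch below feeds read_csv a small in-memory CSV through io.StringIO; the CSV content is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; the first line is the header row.
csv_text = "Name,Age\nAlpha,32\nBravo,26\n"

# read_csv accepts file-like objects as well as paths.
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv)
print(type(df_csv))
```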

Accessing Data From DataFrame

A DataFrame is a 2-dimensional data structure that stores data in tabular format. Operations like selecting, updating, deleting, renaming, etc. can be performed on both rows and columns.

Below are a few basic commands that are helpful while dealing with DataFrames.

df.head()

df.head() by default returns the first 5 rows of the DataFrame. To view more or fewer rows, pass the number to head(), e.g. df.head(10) for the first 10 records.

df.tail()

df.tail() by default returns the last 5 rows of the DataFrame. To view more or fewer rows, pass the number to tail(), e.g. df.tail(10) for the last 10 records.

df.describe()

df.describe() gives us a statistical summary of the data set, excluding NaN (missing) values. It works on both numeric and object Series; for a DataFrame with mixed column types, only the numeric columns are summarized by default. The summary includes values like count, mean, standard deviation, minimum, median (the 50% row), maximum, etc.

Example: head, tail, describe

 import pandas as pd
  
 # Raw string so the backslashes in the path are kept as-is
 df2 = pd.read_csv(r"E:\Blog\Python\Dataset\iris.csv")
 print("First 3 Records")
 print(df2.head(3))  # print first 3 records
 print("\n Last 4 Records")
 print(df2.tail(4))  # print last 4 records
  
 print("\n Summary of the dataset")
 print(df2.describe())  # print summary of the data set

Output

 First 3 Records
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
 
 Last 4 Records
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica
 
 Summary of the dataset
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000 
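Note that the Species column is missing from the summary above because it is not numeric. As a small sketch on made-up data, passing include="all" to describe() also summarizes non-numeric columns, adding rows such as unique, top and freq.

```python
import pandas as pd

# Hypothetical mixed-type data: one text column, one numeric column.
df_mixed = pd.DataFrame({"Name": ["Alpha", "Bravo", "Charlie"],
                         "Age": [32, 27, 26]})

numeric_summary = df_mixed.describe()            # numeric columns only
full_summary = df_mixed.describe(include="all")  # every column

print(numeric_summary)
print(full_summary)
```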

Column Wise Operation

Select

We can select a single column or multiple columns by passing the column name, or a list of column names, in square brackets.

Example

 import pandas as pd
  
 df2 = pd.read_csv(r"E:\Blog\Python\Dataset\iris.csv")
 df3 = df2.head()
 # Selecting a single column (returns a Series)
 print(df3["Sepal.Length"])
  
 # Selecting multiple columns (returns a DataFrame)
 print(df3[["Sepal.Width", "Petal.Length", "Sepal.Length"]])

Output

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: Sepal.Length, dtype: float64
   Sepal.Width  Petal.Length  Sepal.Length
0          3.5           1.4           5.1
1          3.0           1.4           4.9
2          3.2           1.3           4.7
3          3.1           1.5           4.6
4          3.6           1.4           5.0 

Adding a column

Below we can see two ways of adding a column to a DataFrame.

Example

 import pandas as pd
  
 d = {'Car' : pd.Series(["Mercedes","Bentley","Ferrari"], index=['a', 'b', 'c']),
     'Mileage' : pd.Series([20, 21, 22], index=['a', 'b', 'c'])}
  
 df = pd.DataFrame(d)
  
 # Adding a new column by passing a Series
 print("Adding a new column by passing as Series:")
 df['Price'] = pd.Series([100000, 100050, 99955], index=['a', 'b', 'c'])
 print(df)
  
 # Adding a new column computed from existing columns
 print("\n Adding a new column Random Value using the existing columns in DataFrame")
 df['Random Value'] = df['Mileage'] + df['Price']
  
 print(df)
   

Output

Adding a new column by passing as Series:
        Car  Mileage   Price
a  Mercedes       20  100000
b   Bentley       21  100050
c   Ferrari       22   99955
Adding a new column Random Value using the existing columns in DataFrame
        Car  Mileage   Price  Random Value
a  Mercedes       20  100000        100020
b   Bentley       21  100050        100071
c   Ferrari       22   99955         99977 
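Besides bracket assignment, pandas offers assign() and insert() for adding columns. This is only a sketch on made-up data, with the Id column invented for illustration: assign() returns a new DataFrame, while insert() changes the existing one and lets us choose the position.

```python
import pandas as pd

cars = pd.DataFrame({"Car": ["Mercedes", "Bentley", "Ferrari"],
                     "Mileage": [20, 21, 22]})

# assign() returns a new DataFrame; 'cars' itself is not changed.
cars_with_price = cars.assign(Price=[100000, 100050, 99955])

# insert() modifies 'cars' in place and puts the column at position 0.
cars.insert(0, "Id", [1, 2, 3])

print(cars_with_price)
print(cars)
```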

Deleting a Column

In a DataFrame, we can delete a column by using the del statement or the pop() method.

Example

 import pandas as pd
  
 df2 = pd.read_csv(r"E:\Blog\Python\Dataset\iris.csv")
  
 # Work on a copy so the deletions do not touch df2
 df3 = df2.head().copy()
  
 print("Original DataFrame")
 print(df3)
  
 # Deleting a column using the del statement
 print("\n Our DataFrame after Deleting Sepal.Width")
 del df3["Sepal.Width"]
 print(df3)
  
 # Deleting a column using the pop() method
 print("\n Our DataFrame after popping out Petal.Width")
 df3.pop("Petal.Width")
 print(df3)

Output

Original DataFrame
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
 
 Our DataFrame after Deleting Sepal.Width
   Sepal.Length  Petal.Length  Petal.Width Species
0           5.1           1.4          0.2  setosa
1           4.9           1.4          0.2  setosa
2           4.7           1.3          0.2  setosa
3           4.6           1.5          0.2  setosa
4           5.0           1.4          0.2  setosa
 
 Our DataFrame after popping out Petal.Width
   Sepal.Length  Petal.Length Species
0           5.1           1.4  setosa
1           4.9           1.4  setosa
2           4.7           1.3  setosa
3           4.6           1.5  setosa
4           5.0           1.4  setosa 
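Both del and pop() change the DataFrame in place. As a sketch on made-up data, drop(columns=...) is an alternative that returns a new DataFrame and leaves the original untouched. The example also shows that pop() returns the removed column.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# drop() returns a new DataFrame without column B; 'frame' keeps all columns.
frame_without_b = frame.drop(columns=["B"])

# pop() removes column C from 'frame' and returns it as a Series.
popped = frame.pop("C")

print(frame_without_b)
print(frame)
print(popped)
```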

Row Wise Operations

Select

In a DataFrame, we can select rows by passing the row label to the loc[] indexer, or by passing the integer position to the iloc[] indexer.

Example

 import pandas as pd
  
 d = {'Car' : pd.Series(["Mercedes","Bentley","Ferrari"], index=['a', 'b', 'c']),
     'Mileage' : pd.Series([20, 21, 22], index=['a', 'b', 'c'])}
  
 df = pd.DataFrame(d)
 print(df)
  
 print("\n loc[] function")
 # Selecting a row by label with loc[]
 print(df.loc["b"])
  
 print("\n iloc[] function")
 # Selecting a row by integer position with iloc[]
 print(df.iloc[2])

Output

        Car  Mileage
a  Mercedes       20
b   Bentley       21
c   Ferrari       22
 
 loc[] function
Car        Bentley
Mileage         21
Name: b, dtype: object
 
 iloc[] function
Car        Ferrari
Mileage         22
Name: c, dtype: object
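loc[] and iloc[] also accept lists and slices, which is a common source of confusion: a label slice with loc[] includes its end label, while a position slice with iloc[] excludes its end position. A short sketch on the same car data:

```python
import pandas as pd

cars = pd.DataFrame({"Car": ["Mercedes", "Bentley", "Ferrari"],
                     "Mileage": [20, 21, 22]},
                    index=["a", "b", "c"])

by_labels = cars.loc[["a", "c"]]     # list of labels
label_slice = cars.loc["a":"b"]      # label slice: end label 'b' is included
position_slice = cars.iloc[0:2]      # position slice: position 2 is excluded
cell = cars.loc["b", "Mileage"]      # single value by row and column label

print(by_labels)
print(label_slice)
print(position_slice)
print(cell)
```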

Adding rows

In a DataFrame, we can add new rows at the end by concatenating it with another DataFrame using the pd.concat function. The older DataFrame.append method was removed in pandas 2.0, so pd.concat is used here.

Example

 import pandas as pd
  
 d = {'Car' : pd.Series(["Mercedes","Bentley","Ferrari"], index=['a', 'b', 'c']),
     'Mileage' : pd.Series([20, 21, 22], index=['a', 'b', 'c'])}
 df = pd.DataFrame(d)
  
 # DataFrame 2
 d1 = {'Car' : pd.Series(["Hyundai","Suzuki"], index=['a', 'b']),
     'Mileage' : pd.Series([21, 22], index=['a', 'b'])}
 df1 = pd.DataFrame(d1)
  
 # Adding the rows of df1 after the rows of df
 df2 = pd.concat([df, df1])
  
 print(df2)

Output

         Car  Mileage
 a  Mercedes       20
 b   Bentley       21
 c   Ferrari       22
 a   Hyundai       21
 b    Suzuki       22 
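Notice that the row labels a and b appear twice in the output above, because the original labels of both DataFrames are kept. If unique labels are needed, one option is pd.concat with ignore_index=True, which renumbers the rows from 0. A minimal sketch with the same data:

```python
import pandas as pd

first = pd.DataFrame({"Car": ["Mercedes", "Bentley", "Ferrari"],
                      "Mileage": [20, 21, 22]},
                     index=["a", "b", "c"])
second = pd.DataFrame({"Car": ["Hyundai", "Suzuki"],
                       "Mileage": [21, 22]},
                      index=["a", "b"])

# ignore_index=True discards the old labels and assigns 0..n-1.
combined = pd.concat([first, second], ignore_index=True)
print(combined)
```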

Deleting rows

In a DataFrame, we can delete a row by using the drop() method, passing the row label (or a list of labels) to drop. drop() returns a new DataFrame and leaves the original unchanged.

Example

 import pandas as pd
  
 d = {'Car' : pd.Series(["Mercedes","Bentley","Ferrari"], index=['a', 'b', 'c']),
     'Mileage' : pd.Series([20, 21, 22], index=['a', 'b', 'c']),
     'Price' : pd.Series([100000, 100050, 99955], index=['a', 'b', 'c'])}
  
 df = pd.DataFrame(d)
  
 print(df)
  
 # Deleting a row by its label
 # The row with label 'a' will be dropped
 B = df.drop("a")
 print("\n DataFrame after dropping 'a'")
 print(B)

Output

         Car  Mileage   Price
 a  Mercedes       20  100000
 b   Bentley       21  100050
 c   Ferrari       22   99955
  
  DataFrame after dropping 'a'
        Car  Mileage   Price
 b  Bentley       21  100050
 c  Ferrari       22   99955
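drop() works with labels, so to drop a row by its position we can first look up the label through df.index. The sketch below, on the same car data, also drops several rows at once and confirms that the original DataFrame is left unchanged.

```python
import pandas as pd

cars = pd.DataFrame({"Car": ["Mercedes", "Bentley", "Ferrari"],
                     "Mileage": [20, 21, 22]},
                    index=["a", "b", "c"])

# Drop the first row by position: cars.index[0] gives its label ('a').
without_first = cars.drop(cars.index[0])

# Drop several rows at once by passing a list of labels.
without_a_and_c = cars.drop(["a", "c"])

print(without_first)
print(without_a_and_c)
print(cars)  # the original still has all three rows
```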
   


About the author

Gaurav Tiwari

My name is Gaurav Tiwari. I have been working in the IT industry for over 3.5 years. I completed my B.E. from Mumbai University in 2015, and since then I have been working with Accenture Solutions PVT. LTD. as a Data Analyst.
I started writing blogs as a hobby.
