Data Frame in python pandas ~ Computers At School

Data Frame in python pandas

Pandas provide data structure like Series, DataFrame, Panels. Of all these data structure, DataFrame is the most widely used. DataFrame object of pandas allows to store 2 Dimensional data I.e in rows and columns just like in spreadsheet application like MS Excel.

Note:

1. DataFrame are 2- Dimesnional

2. Size and Data Mutable

3. Heterogeneous tabular data structure

4. Data is identified by row and column index

5. Column are referred as axis=1 and row are referred as axis=0

6. Index can be numbers, letters or Strings

7. Same column should have data of same type but different column can be of different data types

8. Plots can be made from DataFrames directly without importing matplotlib library

9. Can be thought of dictionary of Series Object

10. pandas library has to be imported to create a DataFrame

Creating DataFrame Object:

General syntax for DataFrame Object creation

Pandas.DataFrame(data=None, index=None, columns=None, dtype=None)

Arguments:

data: This argument can be Series, ndarray, Dictionary, list, and even DataFrame

index: List of values for Labeling the Rows of DataFrame and can be ndarray also

columns: List of values for labeling the columns of DataFrame

dtype: allows to set the data types of the dataframe. If none then type is inferred form data items

Code Example	Remarks
Creating a DataFrame from Dictionary
import pandas as pd dt={‘sl’:[1,2,3,4,5], ‘name’:[‘ram’,’sayam’,’dinesh’,’ganesh’,’suresh’], ‘age’:[21,17,22,32,19]} df=pd.DataFrame(data=dt) print(df) OUTPUT: sl name age 0 1 ram 21 1 2 sayam 17 2 3 dinesh 22 3 4 ganesh 32 4 5 suresh 19 dt={'sl':[1,2,3,4,5], 'name':['ram','sayam','dinesh','ganesh','suresh', ], 'age':[21,17,22,32,19]} df=pd.DataFrame(data=dt, index=['i','ii','iii','iv','v']) print(df) OUTPUT: sl name age i 1 ram 21 ii 2 sayam 17 iii 3 dinesh 22 iv 4 ganesh 32 v 5 suresh 19	Here only the data argument is given so the dictionary keys are used as column names and each key-value pair makes the column name and column data respectively. Index values are provided so as to name the row of the dataFrame with roman values
Creating a DataFrame object from 2-D ndarray
import numpy as np nparr=np.array([[12,13,14,15],[22,23,24,25]], dtype=np.int32) df=pd.DataFrame(nparr, index=[], columns=[]) print(df) OUTPUT: 0 1 2 3 0 12 13 14 15 1 22 23 24 25 OR df=pd.DataFrame((np.arange(20)).reshape(5,4)) print(df) 0 1 2 3 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 3 12 13 14 15 4 16 17 18 19	This can be used to create a dataframe object when an program outputs the data in a ndarray Creating a numpy array using the arange function and then converting it into a dataframe
Creating a DataFrame object from Series objects(Storing them in dictionary)
srObj1=pd.Series(['ram','sayam','dinesh','ganesh','suresh']) srObj2=pd.Series([12,23,14,17,26]) dc={'name':srObj1, 'age':srObj2} df=pd.DataFrame(dc) print(df) OUTPUT: name age 0 ram 12 1 sayam 23 2 dinesh 14 3 ganesh 17 4 suresh 26	Now Series object is 1 Dimensional so many Series object is stored in a dictionary and that dictionary is used to create a multi-column DataFrame object
Creating a DataFrame from another DataFrame
df2=pd.DataFrame(df) print(df2) OUTPUT: name age 0 ram 12 1 sayam 23 2 dinesh 14 3 ganesh 17 4 suresh 26	This is help full when we want to copy a dataframe and manipulate it
Creating a DataFrame by importing data form CSV
df=pd.read_csv(‘data.csv’) print(df) OUTPUT: sl name age 0 1 ganesh 23 1 2 ramesh 24 2 3 suresh 21 3 4 dinesh 17 4 5 sayam 18	In real life the most of the large data is loaded from a CSV file into a data frame , and analysis is done on it.

DataFrame Object Attributes:

Syntax:

DataFrame_Object.<attribute>

List of the common attributes

Attributes	Description
index	Returns the index labels of the DataFrame in an array object
columns	Returns the columns labels of the DataFrame
axes	Return a list representing the axes of the DataFrame.
dtypes	Returns the data type of data in the DataFrame
size	Returns the number of element in the DataFrame
shape	Returns the number of rows and column of the DataFrame
values	Returns a Numpy represenation of the DataFrame
empty	Returns True if DataFrame is entirely empty (no items)
ndim	Return an int representing the number of axes
T	Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa . i.e Trasnpose the DataFrame

Accessing data in DataFrame Object:

Code Example	Remarks
Display the complete DataFrame object
df=pd.DataFrame((np.arange(20)).reshape(5,4)) print(df) OUTPUT: 0 1 2 3 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 3 12 13 14 15 4 16 17 18 19	DataFrame object is created form the numpy array
Accessing the Column of the DataFrame Object
df=pd.DataFrame((np.arange(20)).reshape(5,4), columns=['c1','c2','c3','c4']) print(df['c2']) #passing single column print(df[['c2','c4']]) #passing multiple column OUTPUT: 0 1 1 5 2 9 3 13 4 17 Name: c2, dtype: int64 c2 c4 0 1 3 1 5 7 2 9 11 3 13 15 4 17 19	Printing the data of column c2
Accessing individual Element of DataFrame
import pandas as pd import numpy as np df=pd.DataFrame((np.arange(20)).reshape(5,4), columns=['c1','c2','c3','c4']) print(df['c2'][1]) # dataFrameObject[<columnLabel>][<rowLabel>] print(df.c2[1]) # dataFrameObject.<columnLabel>[<rowLabel>] OUTPUT: 5	Both the query works in the same way if the columns contain string names and not integer values else the second function gives error
Accessing the Subset of a DataFrame(This method is used most often)
df=pd.DataFrame((np.arange(20)).reshape(5,4), columns=['c1','c2','c3','c4']) #USING LOC #SYNTAX: dataframeObject.loc[<startRow>:<endRow>:<rowStepValue>, <startColumn:<endColumn>:<ColumnStepValue> df1=df.loc[1:3,'c1':'c3'] print(df1) df2=df.loc[0:4:2,'c1':'c4':2] print(df2) OUTPUT: c1 c2 c3 1 4 5 6 2 8 9 10 3 12 13 14 c1 c3 0 0 2 2 8 10 4 16 18	Except of loc, iloc function also exists

Operation on DataFrame Object

Code Example	Remarks
Assigning/Modification of the data in the DataFrame Object
df=pd.DataFrame((np.arange(20)).reshape(5,4), columns=['c1','c2','c3','c4']) print(df) #Original DataFrame Created from a numpy array OUTPUT: c1 c2 c3 c4 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 3 12 13 14 15 4 16 17 18 19 df['c2'][1]=60 # Changing a single element of the DataFrame but data has to be of same type print(df) OUTPUT: c1 c2 c3 c4 0 0 1 2 3 1 4 60 6 7 2 8 9 10 11 3 12 13 14 15 4 16 17 18 19
df['c3']=['a','b','c','d','e'] # Change an entire column of data in the DataFrame print(df) OUTPUT: c1 c2 c3 c4 0 0 1 a 3 1 4 60 b 7 2 8 9 c 11 3 12 13 d 15 4 16 17 e 19 df.loc[1:1,:]=['aa','bb','cc','dd'] # Changing a complete row of data in DataFrame print(df) OUTPUT: c1 c2 c3 c4 0 0 1 a 3 1 aa bb cc dd 2 8 9 c 11 3 12 13 d 15 4 16 17 e 19
Adding Columns in DataFrame
df=pd.DataFrame((np.arange(15)).reshape(5,3)) print(df) OUTPUT: 0 1 2 0 0 1 2 1 3 4 5 2 6 7 8 3 9 10 11 4 12 13 14 df['c5']=np.NaN # As column ‘c5’ is not present so it gets created and added print(df) OUTPUT: 0 1 2 c5 0 0 1 2 NaN 1 3 4 5 NaN 2 6 7 8 NaN 3 9 10 11 NaN 4 12 13 14 NaN	If the column does not exists then the new column gets added NOTE: Value have to be assigned or NaN if the values are not known else it will through error.
Deleting Columns in DataFrame
print(df) OUTPUT: 0 1 2 c5 0 0 1 2 5 1 3 4 5 5 2 6 7 8 5 3 9 10 11 5 4 12 13 14 5 del df['c5'] # del command can be used to remove column from the DataFrame print(df) OUTPUT: 0 1 2 0 0 1 2 1 3 4 5 2 6 7 8 3 9 10 11 4 12 13 14 df=df.drop([1,2], axis=1) # Remove multiple columns form the DataFrame using drop function print(df) OUTPUT: 0 0 0 1 3 2 6 3 9 4 12
Deleting Rows in DataFrame
df=pd.DataFrame((np.arange(15)).reshape(5,3)) print(df) OUTPUT: 0 1 2 0 0 1 2 1 3 4 5 2 6 7 8 3 9 10 11 4 12 13 14 df=df.drop([1,2], axis=0) print(df) OUTPUT: 0 1 2 0 0 1 2 3 9 10 11 4 12 13 14
Iterating over a DataFrame
df=pd.DataFrame((np.arange(15)).reshape(5,3), columns=['c1','c2','c3']) print(df) OUTPUT: c1 c2 c3 0 0 1 2 1 3 4 5 2 6 7 8 ## Iteraing over a DataFrame using Iterrows() - ROW DATA SET for (rowIndex,rowSeries) in df.iterrows(): print("Row index", rowIndex) print("Data Containing \n", rowSeries) OUTPUT: Row index 0 Data Containing c1 0 c2 1 c3 2 Name: 0, dtype: int32 Row index 1 Data Containing c1 3 c2 4 c3 5 Name: 1, dtype: int32 Row index 2 Data Containing c1 6 c2 7 c3 8 Name: 2, dtype: int32 ##Iteraing over a DataFrame using iteritems() - COLUMN DATA SET for (colIndex,colSeries) in df.iteritems(): print("Column Index", rowIndex) print("Column Data \n", rowSeries) Column Index 2 Column Data c1 6 c2 7 c3 8 Name: 2, dtype: int32 Column Index 2 Column Data c1 6 c2 7 c3 8 Name: 2, dtype: int32 Column Index 2 Column Data c1 6 c2 7 c3 8 Name: 2, dtype: int32	Two function are very commonly used iterrow and iteritem to run a for loop over a DataFrame object dataFrameObject.iterrow() access the dataFrame as horizontal subsets of Series Object dataFrameObject.iteritems() access the dataFrame as vertical subsets of Series Object
Arithmetic operation on DataFrame Object
df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3']) df2=pd.DataFrame((np.arange(6,12)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3']) print('DataFrame 1:',df1, sep='\n') print('DataFrame 2:',df2, sep='\n') OUTPUT: DataFrame 1: c1 c2 r1 0 1 r2 2 3 r3 4 5 DataFrame 2: c1 c2 r1 6 7 r2 8 9 r3 10 11 ## ---------------ADDITION -----------------## print(df1+df2) OUTPUT: c1 c2 r1 6 8 r2 10 12 r3 14 16 print(df1.add(df2)) OUTPUT: c1 c2 r1 6 8 r2 10 12 r3 14 16 ##-----------SUBTRACTION-------------## print(df2-df1) OUTPUT: c1 c2 r1 6 6 r2 6 6 r3 6 6 print(df1.sub(df2)) OUTPUT: c1 c2 r1 -6 -6 r2 -6 -6 r3 -6 -6 ##-----------MULTIPLICATION---------## print(df1*df2) OUTPUT: c1 c2 r1 0 7 r2 16 27 r3 40 55 print(df1.mul(df2)) OUTPUT: c1 c2 r1 0 7 r2 16 27 r3 40 55 ##---------DIVISION------------## print(df1/df2) OUTPUT: c1 c2 r1 0.00 0.142857 r2 0.25 0.333333 r3 0.40 0.454545 print(df1.div(df2)) OUTPUT: c1 c2 r1 0.00 0.142857 r2 0.25 0.333333 r3 0.40 0.454545 ##-----------APPLYING NUMPY/MATHS FUNCTION---------## print(np.square(df1)) OUTPUT: c1 c2 r1 0 1 r2 4 9 r3 16 25 print(np.sqrt(df1)) OUTPUT: c1 c2 r1 0.000000 1.000000 r2 1.414214 1.732051 r3 2.000000 2.236068 print(np.sin(df1)) OUTPUT: c1 c2 r1 0.000000 0.841471 r2 0.909297 0.141120 r3 -0.756802 -0.958924	Arithmetic operation are vectorized in DataFrame
Comparison of DataFrame Object
df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3']) df2=pd.DataFrame((np.arange(6,12)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3']) print('DataFrame 1:',df1, sep='\n') print('DataFrame 2:',df2, sep='\n') OUTPUT: DataFrame 1: c1 c2 r1 0 1 r2 2 3 r3 4 5 DataFrame 2: c1 c2 r1 6 7 r2 8 9 r3 10 11 df1+df2 == df1.add(df2) c1 c2 r1 True True r2 True True r3 True True	Note: 1. The two DataFrame must match in row and columns for the comparison 2. NaN values cannot be compared using comparison operator 3. equals() result in True while comparing two NaN values

Commonly used Function on DataFrame Object:

Function	Code Example
dataFrameObj.info() Note: This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.	print(df1.info()) OUTPUT: <class 'pandas.core.frame.DataFrame'> Index: 3 entries, r1 to r3 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 c1 3 non-null int32 1 c2 3 non-null int32 dtypes: int32(2) memory usage: 36.0+ bytes None
dataFrameObj.describe() Note: Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding ``NaN`` values.	print(df1.describe()) OUTPUT: c1 c2 count 3.0 3.0 mean 2.0 3.0 std 2.0 2.0 min 0.0 1.0 25% 1.0 2.0 50% 2.0 3.0 75% 3.0 4.0 max 4.0 5.0
dataFrameObject.head() NOTE: This function returns the first `n` rows for the object based on position. dataFrameObject.tail() NOTE: This function returns last `n` rows from the object based on position.	print(df) OUTPUT: c1 c2 c3 0 0 1 2 1 3 4 5 2 6 7 8 print(df.head(2)) OUTPUT: c1 c2 c3 0 0 1 2 1 3 4 5 print(df.tail(2)) OUTPUT: c1 c2 c3 1 3 4 5 2 6 7 8
dataFrameObject.any() NOTE: While performing a comparing operation on a DataFrame object, this function returns True if any operation is True over requested axis.	import pandas as pd import numpy as np df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3']) print(df1>3) print('Any function returned ') print((df1>3).any()) OUTPUT: c1 c2 r1 False False r2 False False r3 True True Any function returned c1 True c2 True dtype: bool
dataFrameObject.all() NOTE: While performing a comparison operation on a DataFrame object, this function return True if all the operation is True over requested axis.	import pandas as pd import numpy as np df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3']) print(df1>3) print('Any function returned ') print((df1>3).all()) OUTPUT: c1 c2 r1 False False r2 False False r3 True True Any function returned c1 False c2 False dtype: bool
dataFrameObject1.combine_first(dataFrameObject2) NOTE: The combine_first method combines the two dataframes using patching the data method. Patching the data means, id a dataframe a certain cell has missing data and corresponding cell(the one with same index and column id) in other dataframe has some valid data then, this method will pick the valid data from the second dataframe and patch it in the first dataframe.	import pandas as pd import numpy as np df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3']) df2=pd.DataFrame((np.arange(7,16)).reshape(3,3), columns=['c1','c2','c3'], index=['r1','r2','r3']) df1['c3']=np.NaN print('DataFrame1',df1,sep='\n') print('DataFrame2',df2,sep='\n') OUTPUT: DataFrame1 c1 c2 c3 r1 0 1 NaN r2 2 3 NaN r3 4 5 NaN DataFrame2 c1 c2 c3 r1 7 8 9 r2 10 11 12 r3 13 14 15 print(df1.combine_first(df2)) OUTPUT: c1 c2 c3 r1 0 1 9.0 r2 2 3 12.0 r3 4 5 15.0
pandas.concat([dataFrame1, dataFrame2]) NOTE: The concat method can concatenate two dataframe along the rows or along the columns. This works when the two dataframe has same structure.	df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3']) df2=pd.DataFrame((np.arange(7,16)).reshape(3,3), columns=['c1','c2','c3'], index=['r1','r2','r3']) df1['c3']=np.NaN print('DataFrame1',df1,sep='\n') print('DataFrame2',df2,sep='\n') OUTPUT: DataFrame1 c1 c2 c3 r1 0 1 NaN r2 2 3 NaN r3 4 5 NaN DataFrame2 c1 c2 c3 r1 7 8 9 r2 10 11 12 r3 13 14 15 print(pd.concat([df1,df2])) #column concatenation OUTPUT: c1 c2 c3 r1 0 1 NaN r2 2 3 NaN r3 4 5 NaN r1 7 8 9.0 r2 10 11 12.0 r3 13 14 15.0 print(pd.concat([df1,df2], axis=1)) #row concatenation OUTPUT: c1 c2 c3 c1 c2 c3 r1 0 1 NaN 7 8 9 r2 2 3 NaN 10 11 12 r3 4 5 NaN 13 14 15
pandas.merge(df1, df2, on=<keyword>) NOTE: This method can be used to combine dataFrame based on rows with same common values are merged In this example the matching is done on the roll column and only those data is fetched, merged that has same value for the roll column	import pandas as pd nameDF=pd.DataFrame({'roll':[1,2,3,4], 'name': ['sayam','dinesh','suresh','ganesh']}) marksDF=pd.DataFrame({'roll':[1,2,3,4,5,6], 'marks':[78,89,56,78,87,92]}) print(nameDF,'\n\n\n', marksDF) OUTPUT: roll name 0 1 sayam 1 2 dinesh 2 3 suresh 3 4 ganesh roll marks 0 1 78 1 2 89 2 3 56 3 4 78 4 5 87 5 6 92 print(pd.merge(nameDF,marksDF, on='roll')) OUTPUT: roll name marks 0 1 sayam 78 1 2 dinesh 89 2 3 suresh 56 3 4 ganesh 78

Matching and Broadcasting

Matching: When operation is performed between two DataFrames, the data is first aligned on the basis of matching index and then the operation is performed on them. This feature of data alignment on the basis of matching index is called as Matching.

Broadcasting: When performing operation between two DataFrame but there is a size miss match(I.e row,column mismatch) then the existing data is replicated/duplicated so as to perform matching and perform operation with out giving NaN values.

Example:

import pandas as pd

import numpy as np

df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])

print('DataFrame is: ',df1,sep='\n')

print('Size of DataFrame: ',df1.shape,sep='\n')

OUTPUT:

DataFrame is:

c1 c2

r1 0 1

r2 2 3

r3 4 5

Size of DataFrame

(3, 2)

row=df1.loc['r1',:]

print( 'DataFrame is: ',row, sep='\n')

print( 'Size of DataFrame: ',row.shape, sep='\n')

OUTPUT:

DataFrame is:

c1 0

c2 1

Name: r1, dtype: int32

Size of DataFrame:

(2,)

print(df1 + row)

OUTPUT:
c1 c2

r1 0 2

r2 2 4

r3 4 6

Note:

In order to broadcast, the size of the trailing axes for both arrays in an operation must either be the same size or one of them must be one.

Handling Missing Data

Pandas offer mechanism to handle missing data in a DataFrame. The missing data needs to be handled as it may lead to unwanted results when some analysis is performed on it

Sunday, May 3, 2020