Data Frame in python pandas
Pandas provide data structure like Series, DataFrame, Panels. Of all these data structure, DataFrame is the most widely used. DataFrame object of pandas allows to store 2 Dimensional data I.e in rows and columns just like in spreadsheet application like MS Excel.
Note:
1. DataFrame are 2- Dimesnional
2. Size and Data Mutable
3. Heterogeneous tabular data structure
4. Data is identified by row and column index
5. Column are referred as axis=1 and row are referred as axis=0
6. Index can be numbers, letters or Strings
7. Same column should have data of same type but different column can be of different data types
9. Can be thought of dictionary of Series Object
10. pandas library has to be imported to create a DataFrame
Creating DataFrame Object:
General syntax for DataFrame Object creation
Pandas.DataFrame(data=None, index=None, columns=None, dtype=None)
Arguments:
data: This argument can be Series, ndarray, Dictionary, list, and even DataFrame
index: List of values for Labeling the Rows of DataFrame and can be ndarray also
columns: List of values for labeling the columns of DataFrame
dtype: allows to set the data types of the dataframe. If none then type is inferred form data items
Code Example
|
Remarks
|
Creating a DataFrame from Dictionary
| |
import pandas as pd
dt={‘sl’:[1,2,3,4,5],
‘name’:[‘ram’,’sayam’,’dinesh’,’ganesh’,’suresh’],
‘age’:[21,17,22,32,19]}
df=pd.DataFrame(data=dt)
print(df)
OUTPUT:
sl name age
0 1 ram 21
1 2 sayam 17
2 3 dinesh 22
3 4 ganesh 32
4 5 suresh 19
dt={'sl':[1,2,3,4,5],
'name':['ram','sayam','dinesh','ganesh','suresh', ],
'age':[21,17,22,32,19]}
df=pd.DataFrame(data=dt, index=['i','ii','iii','iv','v'])
print(df)
OUTPUT:
sl name age
i 1 ram 21
ii 2 sayam 17
iii 3 dinesh 22
iv 4 ganesh 32
v 5 suresh 19
|
Here only the data argument is given so the dictionary keys are used as column names and each key-value pair makes the column name and column data respectively.
Index values are provided so as to name the row of the dataFrame with roman values
|
Creating a DataFrame object from 2-D ndarray
| |
import numpy as np
nparr=np.array([[12,13,14,15],[22,23,24,25]], dtype=np.int32)
df=pd.DataFrame(nparr, index=[], columns=[])
print(df)
OUTPUT:
0 1 2 3
0 12 13 14 15
1 22 23 24 25
OR
df=pd.DataFrame((np.arange(20)).reshape(5,4))
print(df)
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
|
This can be used to create a dataframe object when an program outputs the data in a ndarray
Creating a numpy array using the arange function and then converting it into a dataframe
|
Creating a DataFrame object from Series objects(Storing them in dictionary)
| |
srObj1=pd.Series(['ram','sayam','dinesh','ganesh','suresh'])
srObj2=pd.Series([12,23,14,17,26])
dc={'name':srObj1,
'age':srObj2}
df=pd.DataFrame(dc)
print(df)
OUTPUT:
name age
0 ram 12
1 sayam 23
2 dinesh 14
3 ganesh 17
4 suresh 26
|
Now Series object is 1 Dimensional so many Series object is stored in a dictionary and that dictionary is used to create a multi-column DataFrame object
|
Creating a DataFrame from another DataFrame
| |
df2=pd.DataFrame(df)
print(df2)
OUTPUT:
name age
0 ram 12
1 sayam 23
2 dinesh 14
3 ganesh 17
4 suresh 26
|
This is help full when we want to copy a dataframe and manipulate it
|
Creating a DataFrame by importing data form CSV
| |
df=pd.read_csv(‘data.csv’)
print(df)
OUTPUT:
sl name age
0 1 ganesh 23
1 2 ramesh 24
2 3 suresh 21
3 4 dinesh 17
4 5 sayam 18
|
In real life the most of the large data is loaded from a CSV file into a data frame , and analysis is done on it.
|
DataFrame Object Attributes:
Syntax:
DataFrame_Object.<attribute>
List of the common attributes
Attributes
|
Description
|
index
|
Returns the index labels of the DataFrame in an array object
|
columns
|
Returns the columns labels of the DataFrame
|
axes
|
Return a list representing the axes of the DataFrame.
|
dtypes
|
Returns the data type of data in the DataFrame
|
size
|
Returns the number of element in the DataFrame
|
shape
|
Returns the number of rows and column of the DataFrame
|
values
|
Returns a Numpy represenation of the DataFrame
|
empty
|
Returns True if DataFrame is entirely empty (no items)
|
ndim
|
Return an int representing the number of axes
|
T
|
Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa . i.e Trasnpose the DataFrame
|
Accessing data in DataFrame Object:
Code Example
|
Remarks
|
Display the complete DataFrame object
| |
df=pd.DataFrame((np.arange(20)).reshape(5,4))
print(df)
OUTPUT:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
|
DataFrame object is created form the numpy array
|
Accessing the Column of the DataFrame Object
| |
df=pd.DataFrame((np.arange(20)).reshape(5,4), columns=['c1','c2','c3','c4'])
print(df['c2']) #passing single column
print(df[['c2','c4']]) #passing multiple column
OUTPUT:
0 1
1 5
2 9
3 13
4 17
Name: c2, dtype: int64
c2 c4
0 1 3
1 5 7
2 9 11
3 13 15
4 17 19
|
Printing the data of column c2
|
Accessing individual Element of DataFrame
| |
import pandas as pd
import numpy as np
df=pd.DataFrame((np.arange(20)).reshape(5,4), columns=['c1','c2','c3','c4'])
print(df['c2'][1]) # dataFrameObject[<columnLabel>][<rowLabel>]
print(df.c2[1]) # dataFrameObject.<columnLabel>[<rowLabel>]
OUTPUT:
5
|
Both the query works in the same way if the columns contain string names and not integer values else the second function gives error
|
Accessing the Subset of a DataFrame(This method is used most often)
| |
df=pd.DataFrame((np.arange(20)).reshape(5,4), columns=['c1','c2','c3','c4'])
#USING LOC
#SYNTAX: dataframeObject.loc[<startRow>:<endRow>:<rowStepValue>, <startColumn:<endColumn>:<ColumnStepValue>
df1=df.loc[1:3,'c1':'c3']
print(df1)
df2=df.loc[0:4:2,'c1':'c4':2]
print(df2)
OUTPUT:
c1 c2 c3
1 4 5 6
2 8 9 10
3 12 13 14
c1 c3
0 0 2
2 8 10
4 16 18
|
Except of loc, iloc function also exists
|
Operation on DataFrame Object
Code Example
|
Remarks
|
Assigning/Modification of the data in the DataFrame Object
| |
df=pd.DataFrame((np.arange(20)).reshape(5,4), columns=['c1','c2','c3','c4'])
print(df) #Original DataFrame Created from a numpy array
OUTPUT:
c1 c2 c3 c4
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
df['c2'][1]=60 # Changing a single element of the DataFrame but data has to be of same type
print(df)
OUTPUT:
c1 c2 c3 c4
0 0 1 2 3
1 4 60 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
| |
df['c3']=['a','b','c','d','e'] # Change an entire column of data in the DataFrame
print(df)
OUTPUT:
c1 c2 c3 c4
0 0 1 a 3
1 4 60 b 7
2 8 9 c 11
3 12 13 d 15
4 16 17 e 19
df.loc[1:1,:]=['aa','bb','cc','dd'] # Changing a complete row of data in DataFrame
print(df)
OUTPUT:
c1 c2 c3 c4
0 0 1 a 3
1 aa bb cc dd
2 8 9 c 11
3 12 13 d 15
4 16 17 e 19
| |
Adding Columns in DataFrame
| |
df=pd.DataFrame((np.arange(15)).reshape(5,3))
print(df)
OUTPUT:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
df['c5']=np.NaN # As column ‘c5’ is not present so it gets created and added
print(df)
OUTPUT:
0 1 2 c5
0 0 1 2 NaN
1 3 4 5 NaN
2 6 7 8 NaN
3 9 10 11 NaN
4 12 13 14 NaN
|
If the column does not exists then the new column gets added
NOTE: Value have to be assigned or NaN if the values are not known else it will through error.
|
Deleting Columns in DataFrame
| |
print(df)
OUTPUT:
0 1 2 c5
0 0 1 2 5
1 3 4 5 5
2 6 7 8 5
3 9 10 11 5
4 12 13 14 5
del df['c5'] # del command can be used to remove column from the DataFrame
print(df)
OUTPUT:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
df=df.drop([1,2], axis=1) # Remove multiple columns form the DataFrame using drop function
print(df)
OUTPUT:
0
0 0
1 3
2 6
3 9
4 12
| |
Deleting Rows in DataFrame
| |
df=pd.DataFrame((np.arange(15)).reshape(5,3))
print(df)
OUTPUT:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
df=df.drop([1,2], axis=0)
print(df)
OUTPUT:
0 1 2
0 0 1 2
3 9 10 11
4 12 13 14
| |
Iterating over a DataFrame
| |
df=pd.DataFrame((np.arange(15)).reshape(5,3), columns=['c1','c2','c3'])
print(df)
OUTPUT:
c1 c2 c3
0 0 1 2
1 3 4 5
2 6 7 8
## Iteraing over a DataFrame using Iterrows() - ROW DATA SET
for (rowIndex,rowSeries) in df.iterrows():
print("Row index", rowIndex)
print("Data Containing \n", rowSeries)
OUTPUT:
Row index 0
Data Containing
c1 0
c2 1
c3 2
Name: 0, dtype: int32
Row index 1
Data Containing
c1 3
c2 4
c3 5
Name: 1, dtype: int32
Row index 2
Data Containing
c1 6
c2 7
c3 8
Name: 2, dtype: int32
##Iteraing over a DataFrame using iteritems() - COLUMN DATA SET
for (colIndex,colSeries) in df.iteritems():
print("Column Index", rowIndex)
print("Column Data \n", rowSeries)
Column Index 2
Column Data
c1 6
c2 7
c3 8
Name: 2, dtype: int32
Column Index 2
Column Data
c1 6
c2 7
c3 8
Name: 2, dtype: int32
Column Index 2
Column Data
c1 6
c2 7
c3 8
Name: 2, dtype: int32
|
Two function are very commonly used iterrow and iteritem to run a for loop over a DataFrame object
dataFrameObject.iterrow() access the dataFrame as horizontal subsets of Series Object
dataFrameObject.iteritems() access the dataFrame as vertical subsets of Series Object
|
Arithmetic operation on DataFrame Object
| |
df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
df2=pd.DataFrame((np.arange(6,12)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
print('DataFrame 1:',df1, sep='\n')
print('DataFrame 2:',df2, sep='\n')
OUTPUT:
DataFrame 1:
c1 c2
r1 0 1
r2 2 3
r3 4 5
DataFrame 2:
c1 c2
r1 6 7
r2 8 9
r3 10 11
## ---------------ADDITION -----------------##
print(df1+df2)
OUTPUT:
c1 c2
r1 6 8
r2 10 12
r3 14 16
print(df1.add(df2))
OUTPUT:
c1 c2
r1 6 8
r2 10 12
r3 14 16
##-----------SUBTRACTION-------------##
print(df2-df1)
OUTPUT:
c1 c2
r1 6 6
r2 6 6
r3 6 6
print(df1.sub(df2))
OUTPUT:
c1 c2
r1 -6 -6
r2 -6 -6
r3 -6 -6
##-----------MULTIPLICATION---------##
print(df1*df2)
OUTPUT:
c1 c2
r1 0 7
r2 16 27
r3 40 55
print(df1.mul(df2))
OUTPUT:
c1 c2
r1 0 7
r2 16 27
r3 40 55
##---------DIVISION------------##
print(df1/df2)
OUTPUT:
c1 c2
r1 0.00 0.142857
r2 0.25 0.333333
r3 0.40 0.454545
print(df1.div(df2))
OUTPUT:
c1 c2
r1 0.00 0.142857
r2 0.25 0.333333
r3 0.40 0.454545
##-----------APPLYING NUMPY/MATHS FUNCTION---------##
print(np.square(df1))
OUTPUT:
c1 c2
r1 0 1
r2 4 9
r3 16 25
print(np.sqrt(df1))
OUTPUT:
c1 c2
r1 0.000000 1.000000
r2 1.414214 1.732051
r3 2.000000 2.236068
print(np.sin(df1))
OUTPUT:
c1 c2
r1 0.000000 0.841471
r2 0.909297 0.141120
r3 -0.756802 -0.958924
|
Arithmetic operation are vectorized in DataFrame
|
Comparison of DataFrame Object
| |
df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
df2=pd.DataFrame((np.arange(6,12)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
print('DataFrame 1:',df1, sep='\n')
print('DataFrame 2:',df2, sep='\n')
OUTPUT:
DataFrame 1:
c1 c2
r1 0 1
r2 2 3
r3 4 5
DataFrame 2:
c1 c2
r1 6 7
r2 8 9
r3 10 11
df1+df2 == df1.add(df2)
c1 c2
r1 True True
r2 True True
r3 True True
|
Note:
1. The two DataFrame must match in row and columns for the comparison
2. NaN values cannot be compared using comparison operator
3. equals() result in True while comparing two NaN values
|
Commonly used Function on DataFrame Object:
Function
|
Code Example
|
dataFrameObj.info()
Note:
This method prints information about a DataFrame including
the index dtype and column dtypes, non-null values and memory usage.
|
print(df1.info())
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, r1 to r3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 c1 3 non-null int32
1 c2 3 non-null int32
dtypes: int32(2)
memory usage: 36.0+ bytes
None
|
dataFrameObj.describe()
Note:
Descriptive statistics include those that summarize the central
tendency, dispersion and shape of a dataset's distribution, excluding ``NaN`` values.
|
print(df1.describe())
OUTPUT:
c1 c2
count 3.0 3.0
mean 2.0 3.0
std 2.0 2.0
min 0.0 1.0
25% 1.0 2.0
50% 2.0 3.0
75% 3.0 4.0
max 4.0 5.0
|
dataFrameObject.head()
NOTE:
This function returns the first `n` rows for the object based
on position.
dataFrameObject.tail()
NOTE:
This function returns last `n` rows from the object based on
position.
|
print(df)
OUTPUT:
c1 c2 c3
0 0 1 2
1 3 4 5
2 6 7 8
print(df.head(2))
OUTPUT:
c1 c2 c3
0 0 1 2
1 3 4 5
print(df.tail(2))
OUTPUT:
c1 c2 c3
1 3 4 5
2 6 7 8
|
dataFrameObject.any()
NOTE:
While performing a comparing operation on a DataFrame object, this function returns True if any operation is True over requested axis.
|
import pandas as pd
import numpy as np
df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
print(df1>3)
print('Any function returned ')
print((df1>3).any())
OUTPUT:
c1 c2
r1 False False
r2 False False
r3 True True
Any function returned
c1 True
c2 True
dtype: bool
|
dataFrameObject.all()
NOTE:
While performing a comparison operation on a DataFrame object, this function return True if all the operation is True over requested axis.
|
import pandas as pd
import numpy as np
df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
print(df1>3)
print('Any function returned ')
print((df1>3).all())
OUTPUT:
c1 c2
r1 False False
r2 False False
r3 True True
Any function returned
c1 False
c2 False
dtype: bool
|
dataFrameObject1.combine_first(dataFrameObject2)
NOTE:
The combine_first method combines the two dataframes using patching the data method. Patching the data means, id a dataframe a certain cell has missing data and corresponding cell(the one with same index and column id) in other dataframe has some valid data then, this method will pick the valid data from the second dataframe and patch it in the first dataframe.
|
import pandas as pd
import numpy as np
df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
df2=pd.DataFrame((np.arange(7,16)).reshape(3,3), columns=['c1','c2','c3'], index=['r1','r2','r3'])
df1['c3']=np.NaN
print('DataFrame1',df1,sep='\n')
print('DataFrame2',df2,sep='\n')
OUTPUT:
DataFrame1
c1 c2 c3
r1 0 1 NaN
r2 2 3 NaN
r3 4 5 NaN
DataFrame2
c1 c2 c3
r1 7 8 9
r2 10 11 12
r3 13 14 15
print(df1.combine_first(df2))
OUTPUT:
c1 c2 c3
r1 0 1 9.0
r2 2 3 12.0
r3 4 5 15.0
|
pandas.concat([dataFrame1, dataFrame2])
NOTE:
The concat method can concatenate two dataframe along the rows or along the columns. This works when the two dataframe has same structure.
|
df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
df2=pd.DataFrame((np.arange(7,16)).reshape(3,3), columns=['c1','c2','c3'], index=['r1','r2','r3'])
df1['c3']=np.NaN
print('DataFrame1',df1,sep='\n')
print('DataFrame2',df2,sep='\n')
OUTPUT:
DataFrame1
c1 c2 c3
r1 0 1 NaN
r2 2 3 NaN
r3 4 5 NaN
DataFrame2
c1 c2 c3
r1 7 8 9
r2 10 11 12
r3 13 14 15
print(pd.concat([df1,df2])) #column concatenation
OUTPUT:
c1 c2 c3
r1 0 1 NaN
r2 2 3 NaN
r3 4 5 NaN
r1 7 8 9.0
r2 10 11 12.0
r3 13 14 15.0
print(pd.concat([df1,df2], axis=1)) #row concatenation
OUTPUT:
c1 c2 c3 c1 c2 c3
r1 0 1 NaN 7 8 9
r2 2 3 NaN 10 11 12
r3 4 5 NaN 13 14 15
|
pandas.merge(df1, df2, on=<keyword>)
NOTE:
This method can be used to combine dataFrame based on rows with same common values are merged
In this example the matching is done on the roll column and only those data is fetched, merged that has same value for the roll column |
import pandas as pd
nameDF=pd.DataFrame({'roll':[1,2,3,4], 'name': ['sayam','dinesh','suresh','ganesh']}) marksDF=pd.DataFrame({'roll':[1,2,3,4,5,6], 'marks':[78,89,56,78,87,92]}) print(nameDF,'\n\n\n', marksDF) OUTPUT: roll name 0 1 sayam 1 2 dinesh 2 3 suresh 3 4 ganesh roll marks 0 1 78 1 2 89 2 3 56 3 4 78 4 5 87 5 6 92 print(pd.merge(nameDF,marksDF, on='roll'))
OUTPUT:
roll name marks 0 1 sayam 78 1 2 dinesh 89 2 3 suresh 56 3 4 ganesh 78 |
Matching and Broadcasting
Matching: When operation is performed between two DataFrames, the data is first aligned on the basis of matching index and then the operation is performed on them. This feature of data alignment on the basis of matching index is called as Matching.
Broadcasting: When performing operation between two DataFrame but there is a size miss match(I.e row,column mismatch) then the existing data is replicated/duplicated so as to perform matching and perform operation with out giving NaN values.
Example:
import pandas as pd
import numpy as np
df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
print('DataFrame is: ',df1,sep='\n')
print('Size of DataFrame: ',df1.shape,sep='\n')
OUTPUT:
DataFrame is:
c1 c2
r1 0 1
r2 2 3
r3 4 5
Size of DataFrame
(3, 2)
row=df1.loc['r1',:]
print( 'DataFrame is: ',row, sep='\n')
print( 'Size of DataFrame: ',row.shape, sep='\n')
OUTPUT:
DataFrame is:
c1 0
c2 1
Name: r1, dtype: int32
Size of DataFrame:
(2,)
print(df1 + row)
OUTPUT:
c1 c2
c1 c2
r1 0 2
r2 2 4
r3 4 6
Note:
In order to broadcast, the size of the trailing axes for both arrays in an operation must either be the same size or one of them must be one.
Handling Missing Data
Pandas offer mechanism to handle missing data in a DataFrame. The missing data needs to be handled as it may lead to unwanted results when some analysis is performed on it