Sunday, May 3, 2020

Data Frame in python pandas

Data Frame in python pandas
Pandas provide data structure like Series, DataFrame, Panels. Of all these data structure, DataFrame is the most widely used. DataFrame object of pandas allows to store 2 Dimensional data I.e in rows and columns just like in spreadsheet application like MS Excel.


Note:
1. DataFrame are 2- Dimesnional
2. Size and Data Mutable
3. Heterogeneous tabular data structure
4. Data is identified by row and column index
5. Column are referred as axis=1 and row are referred as axis=0
6. Index can be numbers, letters or Strings
7. Same column should have data of same type but different column can be of different data types
8. Plots can be made from DataFrames directly without importing matplotlib library
9. Can be thought of dictionary of Series Object
10. pandas library has to be imported to create a DataFrame



Creating DataFrame Object:
General syntax for DataFrame Object creation
Pandas.DataFrame(data=None, index=None, columns=None, dtype=None)
Arguments:
data: This argument can be Series, ndarray, Dictionary, list, and even DataFrame
index: List of values for Labeling the Rows of DataFrame and can be ndarray also
columns: List of values for labeling the columns of DataFrame
dtype: allows to set the data types of the dataframe. If none then type is inferred form data items

Code Example
Remarks
Creating a DataFrame from Dictionary

import pandas as pd
dt={‘sl’:[1,2,3,4,5],
‘name’:[‘ram’,’sayam’,’dinesh’,’ganesh’,’suresh’],
‘age’:[21,17,22,32,19]}
df=pd.DataFrame(data=dt)
print(df)

OUTPUT:
   sl    name  age
0   1     ram   21
1   2   sayam   17
2   3  dinesh   22
3   4  ganesh   32
4   5  suresh   19



dt={'sl':[1,2,3,4,5],
   'name':['ram','sayam','dinesh','ganesh','suresh', ],
   'age':[21,17,22,32,19]}
df=pd.DataFrame(data=dt, index=['i','ii','iii','iv','v'])
print(df)

OUTPUT:
     sl    name  age
i     1     ram   21
ii    2     sayam   17
iii   3     dinesh   22
iv    4    ganesh   32
v     5   suresh   19
Here only the data argument is given so the dictionary keys are used as column names and each key-value pair makes the column name and column data respectively.














Index values are provided so as to name the row of the dataFrame with roman values
Creating a DataFrame object from 2-D ndarray

import numpy as np
nparr=np.array([[12,13,14,15],[22,23,24,25]], dtype=np.int32)
df=pd.DataFrame(nparr, index=[], columns=[])
print(df)

OUTPUT:
    0   1   2   3
0  12  13  14  15
1  22  23  24  25

OR

df=pd.DataFrame((np.arange(20)).reshape(5,4))
print(df)

  0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
This can be used to create a dataframe object when an program outputs the data in a ndarray









Creating a numpy array using the arange function and then converting it into a dataframe
Creating a DataFrame object from Series objects(Storing them in dictionary)

srObj1=pd.Series(['ram','sayam','dinesh','ganesh','suresh'])
srObj2=pd.Series([12,23,14,17,26])
dc={'name':srObj1,
   'age':srObj2}
df=pd.DataFrame(dc)
print(df)

OUTPUT:
     name  age
0     ram   12
1   sayam   23
2  dinesh   14
3  ganesh   17
4  suresh   26
Now Series object is 1 Dimensional so many Series object is stored in a dictionary and that dictionary is used to create a multi-column DataFrame object
Creating a DataFrame from another DataFrame

df2=pd.DataFrame(df)
print(df2)

OUTPUT:
     name  age
0     ram   12
1   sayam   23
2  dinesh   14
3  ganesh   17
4  suresh   26
This is help full when we want to copy a dataframe and manipulate it
Creating a DataFrame by importing data form CSV

df=pd.read_csv(‘data.csv’)
print(df)

OUTPUT:
sl name age
0 1 ganesh 23
1 2 ramesh 24
2 3 suresh 21
3 4 dinesh 17
4 5 sayam 18
In real life the most of the large data is loaded from a CSV file into a data frame , and analysis is done on it.

DataFrame Object Attributes:
Syntax:
DataFrame_Object.<attribute>

List of the common attributes
Attributes
Description
index
Returns the index labels of the DataFrame in an array object
columns
Returns the columns labels of the DataFrame
axes
Return a list representing the axes of the DataFrame.
dtypes
Returns the data type of data in the DataFrame
size
Returns the number of element in the DataFrame
shape
Returns the number of rows and column of the DataFrame
values
Returns a Numpy represenation of the DataFrame
empty
Returns True if DataFrame is entirely empty (no items)
ndim
Return an int representing the number of axes
T
Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa . i.e Trasnpose the DataFrame


Accessing data in DataFrame Object:

Code Example
Remarks
Display the complete DataFrame object

df=pd.DataFrame((np.arange(20)).reshape(5,4))
print(df)

OUTPUT:
  0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
DataFrame object is created form the numpy array
Accessing the Column of the DataFrame Object

df=pd.DataFrame((np.arange(20)).reshape(5,4), columns=['c1','c2','c3','c4'])
print(df['c2'])           #passing single column
print(df[['c2','c4']])     #passing multiple column

OUTPUT:

0     1
1     5
2     9
3    13
4    17
Name: c2, dtype: int64

  c2 c4
0 1 3
1 5 7
2 9 11
3 13 15
4 17 19

Printing the data of column c2
Accessing individual Element of DataFrame

import pandas as pd
import numpy as np
df=pd.DataFrame((np.arange(20)).reshape(5,4), columns=['c1','c2','c3','c4'])
print(df['c2'][1])  #  dataFrameObject[<columnLabel>][<rowLabel>]
print(df.c2[1])    #  dataFrameObject.<columnLabel>[<rowLabel>]
OUTPUT:
5
Both the query works in the  same way if the columns contain string names and not integer values else the second function gives error
Accessing the Subset of a DataFrame(This method is used most often)

df=pd.DataFrame((np.arange(20)).reshape(5,4), columns=['c1','c2','c3','c4'])

#USING LOC
#SYNTAX: dataframeObject.loc[<startRow>:<endRow>:<rowStepValue>, <startColumn:<endColumn>:<ColumnStepValue>

df1=df.loc[1:3,'c1':'c3']
print(df1)

df2=df.loc[0:4:2,'c1':'c4':2]
print(df2)


OUTPUT:

   c1  c2  c3
1   4   5   6
2   8   9  10
3  12  13  14

   c1  c3
0   0   2
2   8  10
4  16  18

Except of loc, iloc function also exists




Operation on DataFrame Object

Code Example
Remarks
Assigning/Modification of the data in the DataFrame Object

df=pd.DataFrame((np.arange(20)).reshape(5,4), columns=['c1','c2','c3','c4'])
print(df) #Original DataFrame Created from a numpy array

OUTPUT:
   c1  c2  c3  c4
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

df['c2'][1]=60   # Changing a single element of the DataFrame but data has to be of same type
print(df)

OUTPUT:
   c1  c2  c3  c4
0   0   1   2   3
1   4  60   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19


df['c3']=['a','b','c','d','e']  # Change an entire column of data in the DataFrame
print(df)

OUTPUT:
   c1  c2  c3  c4
0   0   1  a   3
1   4  60  b   7
2   8   9  c  11
3  12  13  d  15
4  16  17  e  19

df.loc[1:1,:]=['aa','bb','cc','dd']   # Changing a complete row of data in DataFrame
print(df)

OUTPUT:
   c1  c2  c3  c4
0   0   1   a   3
1  aa  bb  cc  dd
2   8   9   c  11
3  12  13   d  15
4  16  17   e  19

Adding Columns in DataFrame

df=pd.DataFrame((np.arange(15)).reshape(5,3))
print(df)

OUTPUT:
    0   1   2
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14

df['c5']=np.NaN      # As column ‘c5’ is not present so it gets created and added
print(df)

OUTPUT:

    0   1   2  c5
0   0   1   2 NaN
1   3   4   5 NaN
2   6   7   8 NaN
3   9  10  11 NaN
4  12  13  14 NaN
If the column does not exists then the new column gets added

NOTE: Value have to be assigned or NaN if the values are not known else it will through error.
Deleting Columns in DataFrame

print(df)

OUTPUT:
    0   1   2  c5
0   0   1   2   5
1   3   4   5   5
2   6   7   8   5
3   9  10  11   5
4  12  13  14   5

del df['c5']  # del command can be used to remove column from the DataFrame
print(df)


OUTPUT:
    0   1   2
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14

df=df.drop([1,2], axis=1)  # Remove multiple columns form the DataFrame using drop function
print(df)

OUTPUT:
    0
0   0
1   3
2   6
3   9
4  12



Deleting Rows in DataFrame

df=pd.DataFrame((np.arange(15)).reshape(5,3))
print(df)

OUTPUT:
    0   1   2
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14

df=df.drop([1,2], axis=0)
print(df)

OUTPUT:
    0   1   2
0   0   1   2
3   9  10  11
4  12  13  14

Iterating over a DataFrame

df=pd.DataFrame((np.arange(15)).reshape(5,3), columns=['c1','c2','c3'])
print(df)

OUTPUT:
   c1  c2  c3
0   0   1   2
1   3   4   5
2   6   7   8

## Iteraing over a DataFrame using Iterrows() - ROW DATA SET

for (rowIndex,rowSeries)  in  df.iterrows():
    print("Row index", rowIndex)
print("Data Containing \n", rowSeries)

OUTPUT:
Row index 0
Data Containing
 c1    0
c2    1
c3    2
Name: 0, dtype: int32
Row index 1
Data Containing
 c1    3
c2    4
c3    5
Name: 1, dtype: int32
Row index 2
Data Containing
 c1    6
c2    7
c3    8
Name: 2, dtype: int32


##Iteraing over a DataFrame using iteritems() - COLUMN DATA SET

for (colIndex,colSeries) in df.iteritems():
    print("Column Index", rowIndex)
print("Column Data \n", rowSeries)

Column Index 2
Column Data
 c1    6
c2    7
c3    8
Name: 2, dtype: int32
Column Index 2
Column Data
 c1    6
c2    7
c3    8
Name: 2, dtype: int32
Column Index 2
Column Data
 c1    6
c2    7
c3    8
Name: 2, dtype: int32
Two function are very commonly used iterrow and iteritem to run a for loop over a DataFrame object

dataFrameObject.iterrow() access the dataFrame  as horizontal subsets of Series Object

dataFrameObject.iteritems() access the dataFrame as vertical subsets of Series Object
Arithmetic operation on DataFrame Object


df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
df2=pd.DataFrame((np.arange(6,12)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
print('DataFrame 1:',df1, sep='\n')
print('DataFrame 2:',df2, sep='\n')

OUTPUT:

DataFrame 1:
    c1  c2
r1   0   1
r2   2   3
r3   4   5

DataFrame 2:
    c1  c2
r1   6   7
r2   8   9
r3  10  11

## ---------------ADDITION -----------------##
print(df1+df2)

OUTPUT:
    c1  c2
r1   6   8
r2  10  12
r3  14  16

print(df1.add(df2))

OUTPUT:
    c1  c2
r1   6   8
r2  10  12
r3  14  16

##-----------SUBTRACTION-------------##

print(df2-df1)

OUTPUT:
    c1  c2
r1   6   6
r2   6   6
r3   6   6

print(df1.sub(df2))

OUTPUT:
    c1  c2
r1  -6  -6
r2  -6  -6
r3  -6  -6

##-----------MULTIPLICATION---------##

print(df1*df2)

OUTPUT:
    c1  c2
r1   0   7
r2  16  27
r3  40  55

print(df1.mul(df2))

OUTPUT:
    c1  c2
r1   0   7
r2  16  27
r3  40  55

##---------DIVISION------------##

print(df1/df2)

OUTPUT:
      c1        c2
r1  0.00  0.142857
r2  0.25  0.333333
r3  0.40  0.454545
print(df1.div(df2))

OUTPUT:
      c1        c2
r1  0.00  0.142857
r2  0.25  0.333333
r3  0.40  0.454545

##-----------APPLYING NUMPY/MATHS FUNCTION---------##
print(np.square(df1))

OUTPUT:
    c1  c2
r1   0   1
r2   4   9
r3  16  25

print(np.sqrt(df1))

OUTPUT:
          c1        c2
r1  0.000000  1.000000
r2  1.414214  1.732051
r3  2.000000  2.236068

print(np.sin(df1))

OUTPUT:
          c1        c2
r1  0.000000  0.841471
r2  0.909297  0.141120
r3  -0.756802  -0.958924




















Arithmetic operation are vectorized in DataFrame
Comparison of DataFrame Object

df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
df2=pd.DataFrame((np.arange(6,12)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
print('DataFrame 1:',df1, sep='\n')
print('DataFrame 2:',df2, sep='\n')

OUTPUT:

DataFrame 1:
    c1  c2
r1   0   1
r2   2   3
r3   4   5
DataFrame 2:
    c1  c2
r1   6   7
r2   8   9
r3  10  11

df1+df2 == df1.add(df2)

c1 c2
r1 True True
r2 True True
r3 True True
Note:
1. The two DataFrame must match in row and columns for the comparison
2. NaN values cannot be compared using comparison operator
3. equals() result in True while comparing two NaN values




Commonly used Function on DataFrame Object:

Function
Code Example
dataFrameObj.info()

Note:
This method prints information about a DataFrame including
the index dtype and column dtypes, non-null values and memory usage.
print(df1.info())

OUTPUT:
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, r1 to r3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   c1      3 non-null      int32
 1   c2      3 non-null      int32
dtypes: int32(2)
memory usage: 36.0+ bytes
None
dataFrameObj.describe()

Note:
Descriptive statistics include those that summarize the central
tendency, dispersion and shape of a dataset's distribution, excluding ``NaN`` values.
print(df1.describe())

OUTPUT:
        c1   c2
count  3.0   3.0
mean   2.0  3.0
std    2.0   2.0
min    0.0  1.0
25%    1.0  2.0
50%    2.0  3.0
75%    3.0  4.0
max    4.0  5.0
dataFrameObject.head()

NOTE:
This function returns the first `n` rows for the object based
on position.

dataFrameObject.tail()

NOTE:
This function returns last `n` rows from the object based on
position.



print(df)

OUTPUT:
   c1  c2  c3
0   0   1   2
1   3   4   5
2   6   7   8

print(df.head(2))

OUTPUT:
   c1  c2  c3
0   0   1   2
1   3   4   5

print(df.tail(2))

OUTPUT:
   c1  c2  c3
1   3   4   5
2   6   7   8
dataFrameObject.any()

NOTE:
While performing a comparing operation on a DataFrame object, this function returns True if any operation is True over requested axis.
import pandas as pd
import numpy as np
df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
print(df1>3)
print('Any function returned ')
print((df1>3).any())

OUTPUT:
       c1     c2
r1  False  False
r2  False  False
r3   True   True
Any function returned
c1    True
c2    True
dtype: bool

dataFrameObject.all()

NOTE:
While performing a comparison operation on a DataFrame object, this function return True if all the operation is True over requested axis.
import pandas as pd
import numpy as np
df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
print(df1>3)
print('Any function returned ')
print((df1>3).all())

OUTPUT:
       c1     c2
r1  False  False
r2  False  False
r3   True   True
Any function returned
c1    False
c2    False
dtype: bool
dataFrameObject1.combine_first(dataFrameObject2)

NOTE:
The combine_first method combines the two dataframes using patching the data method. Patching the data means, id a dataframe a certain cell has missing data and corresponding cell(the one with same index and column id) in other dataframe has some valid data then, this method will pick the valid data from the second dataframe and patch it in the first dataframe.
import pandas as pd
import numpy as np
df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
df2=pd.DataFrame((np.arange(7,16)).reshape(3,3), columns=['c1','c2','c3'], index=['r1','r2','r3'])
df1['c3']=np.NaN
print('DataFrame1',df1,sep='\n')
print('DataFrame2',df2,sep='\n')

OUTPUT:
DataFrame1
    c1  c2  c3
r1   0   1 NaN
r2   2   3 NaN
r3   4   5 NaN
DataFrame2
    c1  c2  c3
r1   7   8   9
r2  10  11  12
r3  13  14  15

print(df1.combine_first(df2))

OUTPUT:
    c1  c2    c3
r1   0   1   9.0
r2   2   3  12.0
r3   4   5  15.0
pandas.concat([dataFrame1, dataFrame2])

NOTE:
The concat method can concatenate two dataframe along the rows or along the columns. This works when the two dataframe has same structure.
df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
df2=pd.DataFrame((np.arange(7,16)).reshape(3,3), columns=['c1','c2','c3'], index=['r1','r2','r3'])
df1['c3']=np.NaN
print('DataFrame1',df1,sep='\n')
print('DataFrame2',df2,sep='\n')

OUTPUT:
DataFrame1
    c1  c2  c3
r1   0   1 NaN
r2   2   3 NaN
r3   4   5 NaN
DataFrame2
    c1  c2  c3
r1   7   8   9
r2  10  11  12
r3  13  14  15

print(pd.concat([df1,df2]))  #column concatenation

OUTPUT:
    c1  c2    c3
r1   0   1   NaN
r2   2   3   NaN
r3   4   5   NaN
r1   7   8   9.0
r2  10  11  12.0
r3  13  14  15.0

print(pd.concat([df1,df2], axis=1))  #row concatenation

OUTPUT:
    c1  c2  c3  c1  c2  c3
r1   0   1 NaN   7   8   9
r2   2   3 NaN  10  11  12
r3   4   5 NaN  13  14  15
pandas.merge(df1, df2, on=<keyword>)

NOTE:
This method can be used to combine dataFrame based on rows with same common values are merged

In this example the matching is done on the roll column and only those data is fetched, merged that has same value for the roll column
import pandas as pd

nameDF=pd.DataFrame({'roll':[1,2,3,4], 'name':                  ['sayam','dinesh','suresh','ganesh']})

marksDF=pd.DataFrame({'roll':[1,2,3,4,5,6],                       'marks':[78,89,56,78,87,92]})

print(nameDF,'\n\n\n', marksDF)

OUTPUT:
   roll    name
0     1   sayam
1     2  dinesh
2     3  suresh
3     4  ganesh 

    roll  marks
0     1     78
1     2     89
2     3     56
3     4     78
4     5     87
5     6     92

print(pd.merge(nameDF,marksDF, on='roll'))

OUTPUT:
    roll   name    marks
0     1   sayam     78
1     2  dinesh     89
2     3  suresh     56
3     4  ganesh     78


Matching and Broadcasting

Matching: When operation is performed between two DataFrames, the data is first aligned on the basis of matching index and then the operation is performed on them. This feature of data alignment on the basis of matching index is called as Matching.

Broadcasting: When performing operation between two DataFrame but there is a size miss match(I.e row,column mismatch) then the existing data is replicated/duplicated  so as to perform matching and perform operation with out giving NaN values.

Example:

import pandas as pd
import numpy as np
df1=pd.DataFrame((np.arange(0,6)).reshape(3,2), columns=['c1','c2'], index=['r1','r2','r3'])
print('DataFrame is: ',df1,sep='\n')
print('Size of DataFrame: ',df1.shape,sep='\n')

OUTPUT:
DataFrame is:
    c1  c2
r1   0   1
r2   2   3
r3   4   5
Size of DataFrame
(3, 2)

row=df1.loc['r1',:]
print( 'DataFrame is: ',row, sep='\n')
print( 'Size of DataFrame: ',row.shape, sep='\n')

OUTPUT:
DataFrame is:
c1    0
c2    1
Name: r1, dtype: int32
Size of DataFrame:
(2,)

print(df1 + row)

OUTPUT:
    c1  c2
r1   0   2
r2   2   4
r3   4   6

Note:
In order to broadcast, the size of the trailing axes for both arrays in an operation must either be the same size or one of them must be one.




Handling Missing Data
Pandas offer mechanism to handle missing data in a DataFrame. The missing data needs to be handled as it may lead to unwanted results when some analysis is performed on it  



Share: