Data Structures:
Series: One-dimensional labeled array capable of holding any data type. The axis labels are collectively known as the index.
Data can be a dict, an ndarray, or a scalar value.
import numpy as np
import pandas as pd
data=np.random.rand(5)
index=['a','b','c','d','e']
s=pd.Series(data, index=index)
Using a dictionary:
d={'a':1,'b':2,'c':3,'d':4}
s=pd.Series(d)
** The keys become the index and the values become the data. If an index is also passed, the values in data corresponding to the labels in the index are pulled out, and labels with no matching key get NaN.
s=pd.Series(d,index=['a','b','c','e','f'])
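A minimal sketch of the label lookup described above, reusing the dictionary from the example. The printed checks are only illustrative; the point is that labels missing from the dictionary come back as NaN, which also changes the dtype to float64.

```python
import pandas as pd

d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

# Only the labels listed in index are kept: 'd' is dropped,
# while 'e' and 'f' have no matching key and become NaN.
s = pd.Series(d, index=['a', 'b', 'c', 'e', 'f'])

print(s)
print(s.dtype)  # float64, because NaN is a float value
```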
A pandas Series has a single data type:
s.dtype
Access a Series value using an index label:
s['a']
Set a value using an index label:
s['f']=7
Check whether a label exists in the Series index:
'f' in s
'h' not in s
If we access a value with a label that is not in the Series, it raises a KeyError. The get method does not raise an error; it returns None instead.
s.get('i')
s.get('i',9)  # returns 9 if 'i' is not in the Series; otherwise returns the stored value
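The difference between bracket access and get can be sketched as follows. The small Series and the variable names (missing, fallback, present) are made up for illustration.

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3})

# Bracket access with an unknown label raises KeyError.
try:
    s['i']
except KeyError:
    print("'i' is not in the index")

missing = s.get('i')       # no error, returns None
fallback = s.get('i', 9)   # returns the supplied default
present = s.get('a', 9)    # label exists, default is ignored
```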
Vectorized Operations:
ex: series1+series2
An operation between two Series automatically aligns the data by label. The result of an operation between unaligned Series has the union of the indexes involved. If a label is not found in one Series or the other, the result is marked as missing (NaN).
a=pd.Series((1,2,3,4))
b=pd.Series((3,4,6,7,8))
a+b  # 4.0, 6.0, 9.0, 11.0, NaN
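A short sketch of label alignment, using the two Series from the example. The extra Series x and y, and the use of Series.add with fill_value, are additions for illustration and not part of the original notes.

```python
import pandas as pd

a = pd.Series((1, 2, 3, 4))
b = pd.Series((3, 4, 6, 7, 8))

# Labels 0..3 exist in both; label 4 exists only in b, so it becomes NaN.
total = a + b

# Series.add with fill_value treats a label missing on one side as 0.
filled = a.add(b, fill_value=0)

# Alignment is by label, not by position.
x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
y = pd.Series([10, 20, 30], index=['c', 'b', 'a'])
z = x + y  # 'a' pairs 1 with 30, 'c' pairs 3 with 10
```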
DataFrame: Two-dimensional labeled data structure with columns of potentially different types.
It can be thought of as a spreadsheet or an SQL table. Along with the data, we can optionally pass index (row labels) and columns (column labels).
It accepts many different kinds of input:
Dict of 1D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
Structured or record ndarray
A Series
Another DataFrame
From dict of Series: Passing a specific index along with a dict of Series discards all data that does not match the passed index.
df=pd.DataFrame(d,index=["a","b","d","e"])  # d is assumed to be a dict of Series here
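The notes do not show a dict of Series, so the following is a sketch with made-up column names ("one", "two") and values. It shows how a passed index both drops rows and introduces NaN.

```python
import pandas as pd

# Hypothetical dict of Series with partly overlapping labels.
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# Row "c" is discarded because it is not in the passed index.
# Row "e" is all NaN because no Series has that label.
df = pd.DataFrame(d, index=["a", "b", "d", "e"])
print(df)
```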
From dict of ndarrays or lists: The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays.
import pandas as pd
import numpy as np
d={"one":[1,2,3,4,5],"two":[1,2,3,4,5]}
df=pd.DataFrame(d)
df1=pd.DataFrame(d,index=["a","b","c","d","e"])
df1
From list of dicts:
import pandas as pd
import numpy as np
d=[{"a":1,"b":2},{"a":5,"b":10,"c":10}]
df=pd.DataFrame(d,index=["first","second"])
df
From dict of tuples:
import pandas as pd
import numpy as np
pd.DataFrame({
("a","b"):{("A","B"):1,("A","C"):2},
("a","a"):{("A","C"):3,("A","B"):4},
("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
("b", "b"): {("A", "D"): 9, ("A", "B"): 10}
})
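With tuple keys, pandas builds a MultiIndex on both the columns and the rows. A reduced sketch using the first two columns of the example above:

```python
import pandas as pd

df = pd.DataFrame({
    ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
    ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
})

# Outer dict keys become column tuples, inner dict keys become row tuples.
print(type(df.columns).__name__)
print(type(df.index).__name__)

# A full (row tuple, column tuple) pair selects a single value.
value = df.loc[("A", "B"), ("a", "b")]
```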
From Series:
import pandas as pd
import numpy as np
ser=pd.Series(range(3),index=["a","b","c"])
df=pd.DataFrame(ser)
df
Viewing Data:
df.head()
df.tail()
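df.index       # attribute, not a method
df.columns     # attribute, not a method
df.dtypes      # attribute, not a method
df.describe()  # summary statistics of the numeric columns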
DataFrame.sort_index() sorts by the labels along an axis; DataFrame.sort_values() sorts by the values of a column:
df.sort_index(axis=1, ascending=False)
df.sort_values(by="A")
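A small sketch of the two sort calls. The frame below is made up so that the effect of each call is visible; both methods return a new DataFrame and leave the original unchanged by default.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 1], "B": [7, 8, 9]}, index=["c", "a", "b"])

by_cols = df.sort_index(axis=1, ascending=False)  # column labels, descending
by_rows = df.sort_index()                         # row labels, ascending
by_vals = df.sort_values(by="A")                  # values of column A, ascending
```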
Data Selection:
print(df[:])
print(df[2:3])
print(df.loc[:,"A"])
print(df.loc[["a","c"],["A"]]) # loc selects by label
print(df.loc["a":"c",["A"]])
print(df.iloc[1:3])
print(df.iloc[1:4,:])
print(df.iloc[1:5][:])
print(df.iloc[1,0])
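The main difference between the selectors above is that loc works on labels and includes the end of a slice, while iloc works on integer positions and excludes it. A sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4, 5], "B": [10, 20, 30, 40, 50]},
    index=["a", "b", "c", "d", "e"],
)

by_label = df.loc["a":"c", ["A"]]  # rows a, b, c (end label included)
by_pos = df.iloc[1:3]              # rows at positions 1 and 2 (end excluded)
row_slice = df[2:3]                # plain slicing also works by position
scalar = df.iloc[1, 0]             # row position 1, column position 0
```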
Boolean Indexing:
df[df["A"]>2]
Selects the rows where df["A"] is greater than 2.
df[df>2]
Keeps the values greater than 2; all other values become NaN.
isin Method:
df[df["A"].isin([1,2])]
Selects the rows where df["A"] is 1 or 2.
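The three filters behave differently: a boolean Series selects rows, a boolean DataFrame masks individual values, and isin tests membership. A sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [4, 3, 2, 1]})

rows = df[df["A"] > 2]              # keeps only rows 2 and 3
masked = df[df > 2]                 # same shape; values not > 2 become NaN
members = df[df["A"].isin([1, 2])]  # rows where A is 1 or 2
```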