Intro to Pandas¶
Pandas is a Python library that provides a sophisticated data structure for organization and analysis. It excels at categorizing data and at time series operations.
Check out: https://pandas.pydata.org/pandas-docs/stable/10min.html https://www.youtube.com/watch?v=F6kmIpWWEdU
Advantages of Pandas
- Can read many files: .xls, .csv, any delimited text file, as well as JSON and SQL.
- Very smart with identifying data types, e.g. text, integer, float, or even dates.
- Allows data indexing along row, column or subrow or subcolumn.
- Contains many built-in analysis functions for operating on DataFrames and Series.
- Handles missing or empty data.
Dataframes use the dict data type.¶
dict is a way to structure data that allow one to organize the data with names (text).
# Python example of a dict array. Note the {} curly brackets and the use of colons.
typo = {'spelling':[2,3,4],'counting':['a','b','c']}
# Using the label 'spelling' I can access the data regardless of the
typo['spelling']
# Dict arrays are easy to Index:
typo['spelling'][1:]
# You can even replace values with this easy indexing.
typo['spelling'][1:] = [114,45,66,77]
NOTE:: Indexing is not quite as easy in Pandas, but still pretty easy.
(c)dataquest.io
The Pandas DataFrame allows us to load a complicated file into Python and perform multiple data manipulations efficiently, and without having to recast the data.
pd.Series() is analogous to a 1D array, pd.DataFrame() is analogous to a 2D array.
By using indexes to help categorize data and provide a hierarchy of categories. Index applies to row categories.
The column categories are just columns.
import pandas as pd
freight = pd.read_excel('ScienceInventory_Ioffe.xlsx')
#freight = pd.read_excel('ScienceInventory_Ioffe.xlsx',index_col='Institution')
freight.head() # Plot the first five rows of the dataframe
# Pandas automatically fills in empty data with NaNs.
freight.tail() # Plot the last five rows of the dataframe
# Pandas is fairly clever about determining the right data type for each column.
# If it can't determine, the default is text and the dtype is "object".
freight.info()
freight.values # You can extract a generic array from the Pandas dataframe, using the .values attribute.
# Change the row index to the 'Institution', instead of the numeric row number.
# This provides another level of organization and data access, a little bit like an SQL relational database.
freight = freight.set_index('Institution')
freight.head()
freight['Weight (lbs)']['U. Illinois Chicago'] # Report the weights column for UIC freight.
freight['Weight (lbs)']['U. Illinois Chicago'].sum() # Do arithmetic on on the UIC freight.
# This will fail. Note the reversal of indices.
freight['U. Illinois Chicago']['Weight (lbs)'].sum() # Pandas specifies column indices first, then row indices.
# Read in a text file that is delimited by whitespace. '\s+' allows multiple whitespace characters to exist
# between each entry.
# If file has a header, you need to tell read_csv() how many rows to skip, or it will misinterpret the shape of the
# dataframe.
ts = pd.read_csv('20151028 blank test.asc',sep='\t',skiprows=7,parse_dates=[0,3])
print(ts.info())
ts.head()
# This is timeseries data, so we should utilize a datetime data type as the row index.
# First, we have to create datetime.
fname = 'timeseries2.txt'
# Read in the data and specify columns 3 and 4 - date and time as string data type.
ts = pd.read_csv(fname,sep=',',skiprows=1,header=None,dtype={4:'str',5:'str'})
# Add the strings together, convert them to a datetime and save the result in column 3.
ts[4] = pd.to_datetime(ts[4]+' '+ts[5])
ts.info()
# Set the index to be the datetime column.
ts.set_index(ts[4],inplace=True)
ts.head()
# Pandas provides capabilities to manipulate the time base in your timeseries dataframe. These are possible, once
# you specify a datetime or time column as the row index, as we did above.
# Upsample, but don't replace the values by any interpolation method.
#ts.resample('1min').asfreq()
# Upsample, but replace the values by the constant forward fill method.
#ts.resample('1min').pad()
# Upsample, but replace the values by a linear interpolation.
# ts.resample('1min').interpolate()
typo = {'spelling':[2,3,4],'counting':['a','b','c']}
typo
typo['spelling']
typo['spelling'].append(5)
freight['Number'][10:15] = [114,45,66,77,101]
typo['spelling'][1:] = [114,45,66,77]