Intro to Dataframes¶
Dataframes in R provide a sophisticated data structure for organization and analysis including with time series operations.
Check out: https://www.youtube.com/watch?v=edifwfMEL_I
(c)dataquest.io
The DataFrame allows us to load a complicated file into Python and perform multiple data manipulations efficiently, and without having to recast the data.
In [ ]:
library(readxl) # Read in Excel files
library(tidyverse)
library(lubridate) # Create and manipulate date objects and time objects.
library(xts) # adds ability to merge data frames on date.
library(dplyr)
library(zoo) # time series features for data frames.
library(patchwork) #
In [ ]:
# Read the Excel file into an R data frame
freight <- read_excel("ScienceInventory_Ioffe.xlsx")
In [ ]:
head(freight, n = 10) # Displays the first 10 rows
In [ ]:
tail(freight, n = 10) # Displays the last 10 rows
In [ ]:
# Summary gives some insight into the data type of each column. Also note it is shown in the header when you
# type tail or head
summary(freight)
In [ ]:
as.matrix(freight) # You can extract a generic array
In [ ]:
# Read in a text file that is delimited by whitespace. '\s+' allows multiple whitespace characters to exist
# between each entry.
# If file has a header, you need to tell read_csv() how many rows to skip, or it will misinterpret the shape of the
# dataframe.
ts = read.csv('20151028 blank test.asc',sep='\t',skip=7)
head(ts,n=10)
In [ ]:
# This is timeseries data, so we should utilize a datetime data type as the row index.
# First, we have to create datetime.
fname = 'timeseries2.txt'
# Read in the data and specify all columns as character arrays. Otherwise we lose leading zeros in the time column.
ts = read.csv(fname,sep=',',skip=1,header=FALSE,colClasses="character")
# Add the strings together, convert them to a datetime and save the result in column 3.
head(ts,n=10)
# specifying the format
format <- "%Y%m%d %H%M"
# combining date and time into single object and put it into a new column in the data frame.
ts$datetime = as.POSIXct(paste(ts$V5, ts$V6), format=format)
Trim bad or unwanted data¶
In [ ]:
# Use boolean arrays to shorten a dataframe before a time interval.
# Create a begin and end datetime object.
fin <- as.POSIXct('2020-01-03 23:55:00',format='%Y-%m-%d %H:%M:%S')
beg <- as.POSIXct('2020-01-01 10:55:00',format='%Y-%m-%d %H:%M:%S')
#beg <- as.POSIXct('2024-10-22 00:00:00',format='%Y-%m-%d %H:%M:%S')
# Compare the datetime data in ts and get rid of all the data outside of the limits set by beg and fin.
f_ts <- ts %>% filter(ts$datetime >= beg & ts$datetime <= fin)
Fill in missing data.¶
In [ ]:
# Use na.locf() which stands for "last observation carried forward" to fill in the missing data.
# NOTE xts() can coerce your data into characters if there are non-numeric values in the series.
xts_data <- xts(ts, order.by = ts$datetime)
hr_data <- na.locf(xts_data)
Resample the dataset¶
In [ ]:
# apply.daily, apply.monthly, apply.weekly, apply.quarterly
hr_day <- apply.daily(hr_data,last)
head(hr_day,n=10)
In [ ]: