Project2_main.Rmd

---
title: "ST558- Project 2"
author: "Rohan Prabhune, Naman Goel"
output: 
  github_document:
    toc: true
    toc_depth: 4
always_allow_html: true
---
```{r Temp1,echo=FALSE,eval=TRUE,message=FALSE,warning=FALSE}
knitr::opts_chunk$set(fig.align = 'center') 
```
# Interacting with APIs: Financial Market Data
This vignette shows how to work with API. This will demonstrate how to fetch data from multiple API endpoints and read it in a tibble(data frame). This is followed by some basic exploratory data analysis (EDA) to produce some plots to derive insights from the data fetched.  
The API that we have chosen for this project is [Financial Market Data](https://polygon.io/docs/stocks).The Polygon.io Stocks API provides REST endpoints that let you query the latest market data from all US stock exchanges. 

## Requirements
We used the following packages in the creation of the vignette:

* `httr`: This is used to access the REST API endpoint.  
* `jsonlite`: This is used to parse the fetched data into a data frame.  
* `tidyverse`: This provides two important packages `dplyr` and `ggplot` which are used for data manipulation and plotting respectively.  
* `kableExtra`: This provides better printing properties for contingency tables in markdown.  

To get started, install(if these are not installed already) and load the following packages:
```{r 1_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
library(httr)
library(jsonlite)
library(tidyverse)
library(kableExtra)
```
<!--*************************************************************************-->
## API Interaction Functions  
This section describes the functions created by us to interact with the API endpoints to fetch data as well as some metadata required for making the plots more descriptive.  

### [Aggregates (Bars)](https://polygon.io/docs/stocks/get_v2_aggs_ticker__stocksticker__range__multiplier___timespan___from___to) Endpoint  
Get financial data for a stock within a given time frame

#### `get_stocks_agg`
This function has four modifications from the user. The user can provide the following inputs to the functions:  

* **ticker** and **company_name**: The ticker symbol and registered name of the company. If the user does not have this information, this can be fetched using `get_ticker` function described ahead.  
* **start_date**: The start of the time window (A date with the format YYYY-MM-DD).  
* **end_date**: The end of the time window (A date with the format YYYY-MM-DD).  
* **limit**: Limits the number of entries queried to create the aggregate results.  

This function returns a data frame with the close price, open price, highest price, lowest price etc for the stock over the given date range.  
```{r 1_2_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
get_stocks_agg <- function(ticker,company_name,start_date="2022-01-01",
                           end_date="2022-08-31",limit=50){
  url=paste0("https://api.polygon.io/v2/aggs/ticker/",ticker,
             "/range/1/day/",start_date,"/",end_date,
             "?adjusted=true&sort=asc&limit=",limit,"&apiKey=EdkA7_m2JhjS5POrGuXJbVlA4AjSl_4F")
  response_obj <- GET(url)
  parsed <- fromJSON(rawToChar(response_obj$content))
  df <- as_tibble(parsed$results)
  df <- df %>% rename(close_price=c,highest_price=h,lowest_price=l,num_transactions=n,
                      open_price=o,timestamp=t,vol=v,weighted_avg_price=vw) 
  df$ticker <- rep(ticker,limit)
  df$company_name <- rep(company_name,limit)
  df$start_date <- start_date
  df$end_date <- end_date
  df <- df %>% select(ticker,company_name,everything())
  return(df)
}
```

### [Tickers](https://polygon.io/docs/stocks/get_v3_reference_tickers) Endpoint  
Query all ticker symbols which are supported by Polygon.io. This API currently includes Stocks/Equities, Cryptocurrencies, and Currencies/Foreign Exchange.  

#### `get_ticker`
This function searches only the stock market. The user can provide the name of the company which they would like the ticker information for For example: If a user wants the ticker, the ticker information for Apple, the user can call the function as `get_ticker(name="Apple")`. This function will return the ticker symbol **AAPL** and the registered company name **Apple Inc.** as a list. The user can pass on the contents of this list to `get_stocks_agg` function mentioned above to get the aggregate bars over a date range. If there are multiple matches for a given name, the function returns the first ticker information from the list of matches.  
```{r 1_2_2,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
get_ticker <- function(name){
  url=paste0("https://api.polygon.io/v3/reference/tickers?market=stocks&search=",name,"&active=true&sort=ticker&order=asc&limit=1000&apiKey=EdkA7_m2JhjS5POrGuXJbVlA4AjSl_4F")
  response_obj <- GET(url)
  parsed<- fromJSON(rawToChar(response_obj$content))
  df <- as_tibble(parsed$results)
  return(list(df$ticker[[1]],df$name[[1]]))
}
```

#### `get_ticker_info`  
This function provides details of the ticker for a given market. The user can provide the value for **market** to this function. The possible values of market can be **stocks**, **crypto**, **fx** or **otc**. This function returns the primary exchange and type of ticker information for each security in each market.  
```{r 1_2_3,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
get_ticker_info <- function(market){
  url=paste0("https://api.polygon.io/v3/reference/tickers?market=",market,"&active=true&sort=ticker&order=asc&limit=1000&apiKey=EdkA7_m2JhjS5POrGuXJbVlA4AjSl_4F")
  response_obj <- GET(url)
  parsed<- fromJSON(rawToChar(response_obj$content))
  df <- as_tibble(parsed$results) %>% select(ticker,name,primary_exchange,type)
  return(df)
}
```

### [Grouped Daily (Bars)](https://polygon.io/docs/stocks/get_v2_aggs_grouped_locale_us_market_stocks__date) Endpoint  
Get the daily open, high, low, and close (OHLC) for the entire universe of stocks

#### `get_grouped_daily`
This function takes in **date** as an input from the user and returns the open, high, low, and close (OHLC) for the entire stocks markets for that particular date.  
```{r 1_2_4,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
get_grouped_daily <- function(date="2020-10-14"){
  url=paste0("https://api.polygon.io/v2/aggs/grouped/locale/us/market/stocks/",date,"?adjusted=true&include_otc=true&apiKey=EdkA7_m2JhjS5POrGuXJbVlA4AjSl_4F")
  response_obj <- GET(url)
  parsed <- fromJSON(rawToChar(response_obj$content))
  df <- as_tibble(parsed$results)
  df <- df %>% rename(Ticker=T,volume=v,weighted_avg_price=vw,open_price=o,
                      close_price=c,highest_price=h,lowest_price=l,
                      num_transactions=n,timestamp=t)
  df$date <- date
  return(df)
}
```

### [Ticker Types](https://polygon.io/docs/stocks/get_v3_reference_tickers_types) Endpoint  

#### `get_ticker_type_details`
This function is used to get the metadata information of all the ticker types that Polygon.io has data for.
```{r 1_2_5,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
get_ticker_type_details <- function(){
  response_obj <- GET("https://api.polygon.io/v3/reference/tickers/types?apiKey=EdkA7_m2JhjS5POrGuXJbVlA4AjSl_4F")
  parsed <- fromJSON(rawToChar(response_obj$content))
  df <- as_tibble(parsed$results) 
  return(df)
}
```

### [Exchanges](https://polygon.io/docs/stocks/get_v3_reference_exchanges) Endpoint  

#### `get_exchanges_details`
This function is used to get metadata information of all the stock exchanges that Polygon.io has data for.
```{r 1_2_6,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
get_exchanges_details <- function(){
  response_obj <- GET("https://api.polygon.io/v3/reference/exchanges?asset_class=stocks&apiKey=EdkA7_m2JhjS5POrGuXJbVlA4AjSl_4F")
  parsed <- fromJSON(rawToChar(response_obj$content))
  df <- as_tibble(parsed$results)  
  return(df)
}
```
<!--*************************************************************************-->
# Exploratory Data Analysis (EDA)
## Combining data from API Calls
Here we have called the data of 3 stocks namely Apple, Tesla and Nvidia from 1 Jan 2022 to 31 August 2022.  
To find the stock information for Apple, we have passed "Apple" as an input argument to `get_ticker` function. This function returns a list `ticker_symbol1` which consists of ticker symbol "AAPL" and name of the company which is "Apple Inc.". The same was repeated for the other 2 symbols as well.  

This information along with start date and end date is passed to the function `get_stocks_agg`. The limit argument is not passed, so the function takes the default value of 50. This function fetches the stock information for Apple in the given date range and returns a data frame `df1`. Similarly, this is done to get the stocks information for Tesla and Nvidia in `df2` and `df3` respectively.  
```{r 2_1_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
ticker_symbol1 <- get_ticker(name="Apple")
df1 <- get_stocks_agg(ticker_symbol1[[1]],ticker_symbol1[[2]],start_date="2022-01-01",end_date="2022-08-31")
df1

ticker_symbol2 <- get_ticker(name="Tesla")
df2 <- get_stocks_agg(ticker_symbol2[[1]],ticker_symbol2[[2]],start_date="2022-01-01",end_date="2022-08-31")
df2

ticker_symbol3 <- get_ticker(name="Nvidia")
df3 <- get_stocks_agg(ticker_symbol3[[1]],ticker_symbol3[[2]],start_date="2022-01-01",end_date="2022-08-31")
df3
```

Here we have combined df1, df2 and df3 into vertically into a data frame `df_combined`. This gives us all the stock information for the 3 companies in a single data frame. This data frame is further used to plot the **close_price** for the 3 companies in a given date range.   
For this the time stamp on x-axis is in Unix Msec. We tried to convert it into Human readable datetime format using multiple ways but we were unable to do it due deadline for the project. We are sure we would have gotten a breakthrough had we worked more on this.
```{r 2_1_2,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE,fig.width = 10}
df_combined <- bind_rows(df1, df2, df3)
start_date = as.character(unique(df_combined$start_date))
end_date = as.character(unique(df_combined$end_date))

#Plot
ggplot(df_combined,aes(x=timestamp,y=close_price)) + 
  geom_line(aes(color=company_name),size=1) + 
  labs(x="Time",y="Closing price",
       title="Closing stock price over time for Apple, Nvidia and Tesla") +  
  scale_color_discrete(name = "Company Name")+
  theme(plot.title = element_text(hjust = 0.5))
```
From the plot we can see the stock price Tesla has dropped the most but it is still having higher price than Nvidia and Apple between 1 Jan 2022 to 31 August 2022. Elon Musk's deal with Twitter falling out can be one of the factor for this sink. But overall, this can be attributed to the bearish trend that is being observed currently in the markets currently due to high inflation, tapering by the FED, geopolitical tensions and other macroeconomic factors which is a clear correlation can be observed between the price performance of the 2 stocks.  

<!--*************************************************************************-->
## Creation of new variables
Here we have called `get_grouped_daily` function to get the open, high, low, and close (OHLC) for the entire stocks markets on 16 Nov 2020 (a random date which user can select).  

We have added a new variable **percent_change** which is the percent rise/decline in the stock price throughout that day (Considered open_price and close_price for calculating this). We have arranged the data frame in the descending order of percent_change, and hence at the top of the data frame we have tickers which have the maximum gain in stock price and at the bottom we have tickers which have the maximum loss.  
We have also added **percent_change_chr** variable, where we have coerced percent_change as character which used to make the plot ahead more descriptive.  
```{r 2_2_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
df_grouped <- get_grouped_daily("2020-11-16")
df_grouped <- df_grouped %>% 
  mutate(percent_change = round(((close_price-open_price)/open_price)*100,2)) %>% 
  arrange(desc(percent_change))

df_grouped$percent_change_chr <- paste(as.character(df_grouped$percent_change),'%')
df_grouped %>% select(Ticker,open_price,close_price,percent_change_chr,everything())
```

### Plot for new variable

Here we have used `head()` and `tail()` to get the stock information of 10 tickers having the highest percent gain in `df_top10` and 10 tickers having the highest percent loss in `df_bottom10`. This information is plotted using `geom_col()`.  
As mentioned above **percent_change_chr** is used to add text on top of the columns using `geom_text()`.  

On the x-axis we could not map the ticker symbol to the company name (which would have made the plot clearer) because there is a max limit of 1000 on the [Ticker Endpoint](https://polygon.io/docs/stocks/get_v3_reference_tickers). So we do not get all the data and hence there is a possibility that we have a ticker symbol in `df_grouped` for which we have not been able to fetch the company name.  
```{r 2_2_2,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE,fig.width = 10}
#Top-10 gains
df_top10 <- head(df_grouped, 10)
df_top10$Ticker <- factor(df_top10$Ticker, 
                          levels=df_top10$Ticker[order(-df_top10$percent_change)])

# Plot
date <- unique(df_top10$date)
ggplot(df_top10, aes(x=Ticker, y=percent_change)) + 
geom_col(width=0.3, color='steelblue', fill='steelblue') + 
theme(axis.text.x=element_text(angle=90), text=element_text(size=12), 
      plot.title = element_text(hjust = 0.5)) + 
labs(y="Percent increase", x ="Stock ticker",
     title = paste0("Highest stock price increase on ",date)) + 
geom_text(aes(label = percent_change_chr), vjust = -0.5, size=3)

#Top-10 losses
df_bottom10 <- tail(df_grouped, 10)
df_bottom10$percent_change <- abs(df_bottom10$percent_change)
df_bottom10$Ticker <- factor(df_bottom10$Ticker, 
                             levels=df_bottom10$Ticker[order(-df_bottom10$percent_change)])

# Plot
date = unique(df_bottom10$date)
ggplot(df_bottom10, aes(x=Ticker, y=percent_change)) + 
geom_col(width=0.3, color='red', fill='red') + 
theme(axis.text.x = element_text(angle=90),text = element_text(size=12),
      plot.title = element_text(hjust = 0.5)) + 
labs(y="Percent decrease", x ="Stock ticker", 
     title = paste0("Highest stock price decrease on ",date)) + 
geom_text(aes(label=percent_change_chr), vjust =-0.5, size=3)
```

From the plot above we can see that on 16th Nov 2020, the stock price for ZXZZT(NASDAQ TEST STOCK) noticed maximum gain of 93.91%. After that, the highest gain was noticed by AIRTW (Air T, Inc.) which was 51.34% and WWR(Westwater Resources, Inc.) which was 45.48%.  
On similar lines, the stock price for KTOVW(Kitov Pharma Ltd. Warrants) noticed highest loss of -43.93%. 
Thus this plot allows the user to find out the biggest gainers and losers on any given day.

<!--*************************************************************************-->
## Contingency tables
Here we have used `get_ticker_info` function to get ticker information such as its type and its primary exchange for all the tickers supported by Polygon.io. in the **stock** market. Hence the input argument to the function call is "stocks". 

### One-way  
Here we have created a contingency table using ticker type information for the stock market. When fetching data from the [Ticker Endpoint](https://polygon.io/docs/stocks/get_v3_reference_tickers) in `df_info`, we get the abbrevations of ticker types. In order to fetch the descriptions of ticker type (For example: **CS** means **Common Stocks**) we have fetched data from the [Ticker Types Endpoint](https://polygon.io/docs/stocks/get_v3_reference_tickers_types) in `df_tickertype_metadata`. Then we took the `left_join` of the two tibbles in order to get all the information in a single tibble which is then used to create the contingency table.  
```{r 2_3_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
df_info <- get_ticker_info("stocks")
Sys.sleep(5)
df_tickertype_metadata <- get_ticker_type_details() %>% select(code,description)
Sys.sleep(5)
df_tables <- left_join(df_info,df_tickertype_metadata,by=c("type"="code"))

tab1 <- table(df_tables$description,dnn=c("Ticker Types"))
tab1 %>%
  kbl(caption="Table for Ticker Types") %>%
  kable_classic(full_width = F)
```
From the table we can see that the number of CS (Common Stock) is 561 and the number of ARDC (American Depository Receipt Common) is 40. 

### Two-way
Similarly here we have created a two way contingency table for the number of ticker types for each stock exchange. When fetching data from the [Ticker Endpoint](https://polygon.io/docs/stocks/get_v3_reference_tickers) in `df_info`, we get the abbrevations stock exchanges. Here we tried to use [Exchange Endpoint](https://polygon.io/docs/stocks/get_v3_reference_exchanges) to get the full name of the stock exchanges (For example: **XNYS** is **NYSE American, LLC**) but there we multiple exchange information for each exchange which made the table very complicated to understand. For that reason we let the exchange abbrevations be.
```{r 2_3_2,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
tab2 <- table(df_tables$description,df_info$primary_exchange,dnn=c("Ticker Types","Exchanges"))
tab2 %>%
  kbl(caption="Table for Ticker Types and Exchanges") %>%
  kable_classic(full_width = F)
```
From the table above we can see that there are 359 **Common Stock** type tickers in the **XNYS** (New York stock exchange). And similarly this table helps us to identify the total number of each ticker type in each exchange.

<!--*************************************************************************-->
## Numerical summaries
Here we have created summary Statistics for **open_price** and **close_price** for each company between `r start_date` and `r end_date`. The summary statistics includes minimum, maximum, median, mean, quartile and standard deviation of prices for the 3 stocks.   
```{r 2_4_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
# Open Price
df_combined_open <- df_combined %>% 
  group_by(company_name) %>% 
  summarise("Min." = min(open_price),
            "1st Quartile" = quantile(open_price,0.25),
            "Median." = median(open_price),
            "Mean."=mean(open_price),
            "3rd Quartile" = quantile(open_price,0.75),
            "Max."= max(open_price),
            "Std. Dev." = sd(open_price))

# Close Price
df_combined_close <- df_combined %>% 
  group_by(company_name) %>% 
  summarise("Min." = min(close_price),
            "1st Quartile" = quantile(close_price,0.25),
            "Median." = median(close_price),
            "Mean."=mean(close_price),
            "3rd Quartile" = quantile(close_price,0.75),
            "Max."= max(close_price),
            "Std. Dev." = sd(close_price))

df_combined_open %>%
  kbl(caption=paste0("Summary Statistics for Open Price per Company between ",start_date," and ",end_date)) %>%
  kable_classic()

df_combined_close %>%
  kbl(caption=paste0("Summary Statistics for Close Price per Company between ",start_date," and ",end_date)) %>%
  kable_classic()
```

<!--*************************************************************************-->

## Box plots
The above numerical summaries can be visualized using a box plot. Here we have created a box plot for **highest stock price** and **lowest stock *price** for each company between `r start_date` and `r end_date`.
```{r 2_5_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
ggplot(df_combined, aes(x=company_name, y=highest_price)) + 
  geom_boxplot(color="blue",fill="grey") + 
  labs(y="Highest price", x ="Company Name", 
       title=paste0("Boxplot for highest stock price between ",start_date," and ",end_date)) 
```
From the above plot we can see that mean highest price for Apple is lowest followed by Nvidia and then Tesla. From the width of the box plot we can also infer that highest price for Tesla has varied the most in the given date range where as Apple's highest price has varied the least.

```{r 2_5_2,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
ggplot(df_combined, aes(x=company_name, y=lowest_price)) + 
  geom_boxplot(color="red",fill="grey") + 
  labs(y="Lowest price", x ="Company Name", 
       title=paste0("Boxplot for lowest stock price between ",start_date," and ",end_date)) 
```
The above plot for lowest price is analogous the box plot for the highest price. It follows the same trend as seen in the plot for highest price.  

<!--*************************************************************************-->

## Histogram  
Here we have used the `df_combined` to create a histogram for the **weighted average stock price** for each company.
```{r 2_6_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE,fig.width = 10}
wrapper <- function(x, ...) 
{
  paste(strwrap(x, ...), collapse = "\n")
}
my_title <- paste0("Histogram of weighted average price per company between ",start_date," and ",end_date)

ggplot(df_combined, aes(x=weighted_avg_price)) + 
  geom_histogram(aes(fill=company_name),binwidth=8) + 
  labs(x ="Weighted average price") + 
  scale_fill_discrete(name = "Company Name") + 
  ggtitle(wrapper(my_title, width=80))
```
Consistent with the trend from the previous plots, we can see that weighted average price for Apple is at the lower end while for Tesla it as at higher end. We can also infer from the plots that weighted average price for Apple and Tesla roughly follow Gaussian distribution. While the deviation for Apple around the mean is the least, the deviation for Tesla around its mean is the most. Probably because of Elon's Tweets which contribute to volatiltiy to the stock price!  

<!--*************************************************************************-->

## Bar plot 
Here we have used the one way contingency table created in above sections to create a bar plot with the type of tickers on the x axis.   
```{r 2_7_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE,fig.height=10}
ggplot(df_tables, aes(x=description)) + 
  geom_bar(fill="steelblue") + 
  labs(x ="Type of tickers",
       title="Bar plot for number of stock tickers for each type") + 
  theme(text=element_text(size=14), plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90)) + 
  geom_text(aes(label = ..count..), stat = "count", vjust = -0.5)
```
The above plot provides a good visualization of the one way contingency table. We can infer that number of tickers of type "Common Stock" are the highest with 561 and number of tickers of type "Exchange Traded Note" are the least with 5.  
<!--*************************************************************************-->  

## Scatter plot  
The scatter plot has been used to plot the percentage change in returns for the stocks Apple, Tesla and Nvidia for a date range and can be differentiated based on the different colours. From the plot, we can see how varied returns have been on a daily basis but if we look carefully we can see that there is a certain correlation in the returns and on most days the movement for all 3 technology stocks is in the same direction of either gaining or losing
```{r 2_8_1, echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE,fig.width = 10}
df_comb <- df_combined %>% mutate(percent_change = round(((close_price-open_price)/open_price)*100,2))
ggplot(df_comb, aes(x=timestamp, y=percent_change,color=company_name)) + 
geom_point()+
facet_grid(cols = vars(company_name)) + 
scale_color_discrete(name = "Company Name")
```
<!--*************************************************************************-->

# Wrap- Up
This vignette mainly focuses on communicating with the REST API Endpoints, creating functions in R to fetch data from the endpoints and basic data exploration- including numerical summaries, contingency tables and plots.