-
Notifications
You must be signed in to change notification settings - Fork 1
/
Project2_main.Rmd
368 lines (319 loc) · 23.3 KB
/
Project2_main.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
---
title: "ST558- Project 2"
author: "Rohan Prabhune, Naman Goel"
output:
github_document:
toc: true
toc_depth: 4
always_allow_html: true
---
```{r Temp1,echo=FALSE,eval=TRUE,message=FALSE,warning=FALSE}
knitr::opts_chunk$set(fig.align = 'center')
```
# Interacting with APIs: Financial Market Data
This vignette shows how to work with API. This will demonstrate how to fetch data from multiple API endpoints and read it in a tibble(data frame). This is followed by some basic exploratory data analysis (EDA) to produce some plots to derive insights from the data fetched.
The API that we have chosen for this project is [Financial Market Data](https://polygon.io/docs/stocks).The Polygon.io Stocks API provides REST endpoints that let you query the latest market data from all US stock exchanges.
## Requirements
We used the following packages in the creation of the vignette:
* `httr`: This is used to access the REST API endpoint.
* `jsonlite`: This is used to parse the fetched data into a data frame.
* `tidyverse`: This provides two important packages `dplyr` and `ggplot` which are used for data manipulation and plotting respectively.
* `kableExtra`: This provides better printing properties for contingency tables in markdown.
To get started, install(if these are not installed already) and load the following packages:
```{r 1_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
library(httr)
library(jsonlite)
library(tidyverse)
library(kableExtra)
```
<!--*************************************************************************-->
## API Interaction Functions
This section describes the functions created by us to interact with the API endpoints to fetch data as well as some metadata required for making the plots more descriptive.
### [Aggregates (Bars)](https://polygon.io/docs/stocks/get_v2_aggs_ticker__stocksticker__range__multiplier___timespan___from___to) Endpoint
Get financial data for a stock within a given time frame
#### `get_stocks_agg`
This function has four modifications from the user. The user can provide the following inputs to the functions:
* **ticker** and **company_name**: The ticker symbol and registered name of the company. If the user does not have this information, this can be fetched using `get_ticker` function described ahead.
* **start_date**: The start of the time window (A date with the format YYYY-MM-DD).
* **end_date**: The end of the time window (A date with the format YYYY-MM-DD).
* **limit**: Limits the number of entries queried to create the aggregate results.
This function returns a data frame with the close price, open price, highest price, lowest price etc for the stock over the given date range.
```{r 1_2_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
get_stocks_agg <- function(ticker,company_name,start_date="2022-01-01",
end_date="2022-08-31",limit=50){
url=paste0("https://api.polygon.io/v2/aggs/ticker/",ticker,
"/range/1/day/",start_date,"/",end_date,
"?adjusted=true&sort=asc&limit=",limit,"&apiKey=EdkA7_m2JhjS5POrGuXJbVlA4AjSl_4F")
response_obj <- GET(url)
parsed <- fromJSON(rawToChar(response_obj$content))
df <- as_tibble(parsed$results)
df <- df %>% rename(close_price=c,highest_price=h,lowest_price=l,num_transactions=n,
open_price=o,timestamp=t,vol=v,weighted_avg_price=vw)
df$ticker <- rep(ticker,limit)
df$company_name <- rep(company_name,limit)
df$start_date <- start_date
df$end_date <- end_date
df <- df %>% select(ticker,company_name,everything())
return(df)
}
```
### [Tickers](https://polygon.io/docs/stocks/get_v3_reference_tickers) Endpoint
Query all ticker symbols which are supported by Polygon.io. This API currently includes Stocks/Equities, Cryptocurrencies, and Currencies/Foreign Exchange.
#### `get_ticker`
This function searches only the stock market. The user can provide the name of the company which they would like the ticker information for For example: If a user wants the ticker, the ticker information for Apple, the user can call the function as `get_ticker(name="Apple")`. This function will return the ticker symbol **AAPL** and the registered company name **Apple Inc.** as a list. The user can pass on the contents of this list to `get_stocks_agg` function mentioned above to get the aggregate bars over a date range. If there are multiple matches for a given name, the function returns the first ticker information from the list of matches.
```{r 1_2_2,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
get_ticker <- function(name){
url=paste0("https://api.polygon.io/v3/reference/tickers?market=stocks&search=",name,"&active=true&sort=ticker&order=asc&limit=1000&apiKey=EdkA7_m2JhjS5POrGuXJbVlA4AjSl_4F")
response_obj <- GET(url)
parsed<- fromJSON(rawToChar(response_obj$content))
df <- as_tibble(parsed$results)
return(list(df$ticker[[1]],df$name[[1]]))
}
```
#### `get_ticker_info`
This function provides details of the ticker for a given market. The user can provide the value for **market** to this function. The possible values of market can be **stocks**, **crypto**, **fx** or **otc**. This function returns the primary exchange and type of ticker information for each security in each market.
```{r 1_2_3,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
get_ticker_info <- function(market){
url=paste0("https://api.polygon.io/v3/reference/tickers?market=",market,"&active=true&sort=ticker&order=asc&limit=1000&apiKey=EdkA7_m2JhjS5POrGuXJbVlA4AjSl_4F")
response_obj <- GET(url)
parsed<- fromJSON(rawToChar(response_obj$content))
df <- as_tibble(parsed$results) %>% select(ticker,name,primary_exchange,type)
return(df)
}
```
### [Grouped Daily (Bars)](https://polygon.io/docs/stocks/get_v2_aggs_grouped_locale_us_market_stocks__date) Endpoint
Get the daily open, high, low, and close (OHLC) for the entire universe of stocks
#### `get_grouped_daily`
This function takes in **date** as an input from the user and returns the open, high, low, and close (OHLC) for the entire stocks markets for that particular date.
```{r 1_2_4,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
get_grouped_daily <- function(date="2020-10-14"){
url=paste0("https://api.polygon.io/v2/aggs/grouped/locale/us/market/stocks/",date,"?adjusted=true&include_otc=true&apiKey=EdkA7_m2JhjS5POrGuXJbVlA4AjSl_4F")
response_obj <- GET(url)
parsed <- fromJSON(rawToChar(response_obj$content))
df <- as_tibble(parsed$results)
df <- df %>% rename(Ticker=T,volume=v,weighted_avg_price=vw,open_price=o,
close_price=c,highest_price=h,lowest_price=l,
num_transactions=n,timestamp=t)
df$date <- date
return(df)
}
```
### [Ticker Types](https://polygon.io/docs/stocks/get_v3_reference_tickers_types) Endpoint
#### `get_ticker_type_details`
This function is used to get the metadata information of all the ticker types that Polygon.io has data for.
```{r 1_2_5,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
get_ticker_type_details <- function(){
response_obj <- GET("https://api.polygon.io/v3/reference/tickers/types?apiKey=EdkA7_m2JhjS5POrGuXJbVlA4AjSl_4F")
parsed <- fromJSON(rawToChar(response_obj$content))
df <- as_tibble(parsed$results)
return(df)
}
```
### [Exchanges](https://polygon.io/docs/stocks/get_v3_reference_exchanges) Endpoint
#### `get_exchanges_details`
This function is used to get metadata information of all the stock exchanges that Polygon.io has data for.
```{r 1_2_6,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
get_exchanges_details <- function(){
response_obj <- GET("https://api.polygon.io/v3/reference/exchanges?asset_class=stocks&apiKey=EdkA7_m2JhjS5POrGuXJbVlA4AjSl_4F")
parsed <- fromJSON(rawToChar(response_obj$content))
df <- as_tibble(parsed$results)
return(df)
}
```
<!--*************************************************************************-->
# Exploratory Data Analysis (EDA)
## Combining data from API Calls
Here we have called the data of 3 stocks namely Apple, Tesla and Nvidia from 1 Jan 2022 to 31 August 2022.
To find the stock information for Apple, we have passed "Apple" as an input argument to `get_ticker` function. This function returns a list `ticker_symbol1` which consists of ticker symbol "AAPL" and name of the company which is "Apple Inc.". The same was repeated for the other 2 symbols as well.
This information along with start date and end date is passed to the function `get_stocks_agg`. The limit argument is not passed, so the function takes the default value of 50. This function fetches the stock information for Apple in the given date range and returns a data frame `df1`. Similarly, this is done to get the stocks information for Tesla and Nvidia in `df2` and `df3` respectively.
```{r 2_1_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
ticker_symbol1 <- get_ticker(name="Apple")
df1 <- get_stocks_agg(ticker_symbol1[[1]],ticker_symbol1[[2]],start_date="2022-01-01",end_date="2022-08-31")
df1
ticker_symbol2 <- get_ticker(name="Tesla")
df2 <- get_stocks_agg(ticker_symbol2[[1]],ticker_symbol2[[2]],start_date="2022-01-01",end_date="2022-08-31")
df2
ticker_symbol3 <- get_ticker(name="Nvidia")
df3 <- get_stocks_agg(ticker_symbol3[[1]],ticker_symbol3[[2]],start_date="2022-01-01",end_date="2022-08-31")
df3
```
Here we have combined df1, df2 and df3 into vertically into a data frame `df_combined`. This gives us all the stock information for the 3 companies in a single data frame. This data frame is further used to plot the **close_price** for the 3 companies in a given date range.
For this the time stamp on x-axis is in Unix Msec. We tried to convert it into Human readable datetime format using multiple ways but we were unable to do it due deadline for the project. We are sure we would have gotten a breakthrough had we worked more on this.
```{r 2_1_2,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE,fig.width = 10}
df_combined <- bind_rows(df1, df2, df3)
start_date = as.character(unique(df_combined$start_date))
end_date = as.character(unique(df_combined$end_date))
#Plot
ggplot(df_combined,aes(x=timestamp,y=close_price)) +
geom_line(aes(color=company_name),size=1) +
labs(x="Time",y="Closing price",
title="Closing stock price over time for Apple, Nvidia and Tesla") +
scale_color_discrete(name = "Company Name")+
theme(plot.title = element_text(hjust = 0.5))
```
From the plot we can see the stock price Tesla has dropped the most but it is still having higher price than Nvidia and Apple between 1 Jan 2022 to 31 August 2022. Elon Musk's deal with Twitter falling out can be one of the factor for this sink. But overall, this can be attributed to the bearish trend that is being observed currently in the markets currently due to high inflation, tapering by the FED, geopolitical tensions and other macroeconomic factors which is a clear correlation can be observed between the price performance of the 2 stocks.
<!--*************************************************************************-->
## Creation of new variables
Here we have called `get_grouped_daily` function to get the open, high, low, and close (OHLC) for the entire stocks markets on 16 Nov 2020 (a random date which user can select).
We have added a new variable **percent_change** which is the percent rise/decline in the stock price throughout that day (Considered open_price and close_price for calculating this). We have arranged the data frame in the descending order of percent_change, and hence at the top of the data frame we have tickers which have the maximum gain in stock price and at the bottom we have tickers which have the maximum loss.
We have also added **percent_change_chr** variable, where we have coerced percent_change as character which used to make the plot ahead more descriptive.
```{r 2_2_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
df_grouped <- get_grouped_daily("2020-11-16")
df_grouped <- df_grouped %>%
mutate(percent_change = round(((close_price-open_price)/open_price)*100,2)) %>%
arrange(desc(percent_change))
df_grouped$percent_change_chr <- paste(as.character(df_grouped$percent_change),'%')
df_grouped %>% select(Ticker,open_price,close_price,percent_change_chr,everything())
```
### Plot for new variable
Here we have used `head()` and `tail()` to get the stock information of 10 tickers having the highest percent gain in `df_top10` and 10 tickers having the highest percent loss in `df_bottom10`. This information is plotted using `geom_col()`.
As mentioned above **percent_change_chr** is used to add text on top of the columns using `geom_text()`.
On the x-axis we could not map the ticker symbol to the company name (which would have made the plot clearer) because there is a max limit of 1000 on the [Ticker Endpoint](https://polygon.io/docs/stocks/get_v3_reference_tickers). So we do not get all the data and hence there is a possibility that we have a ticker symbol in `df_grouped` for which we have not been able to fetch the company name.
```{r 2_2_2,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE,fig.width = 10}
#Top-10 gains
df_top10 <- head(df_grouped, 10)
df_top10$Ticker <- factor(df_top10$Ticker,
levels=df_top10$Ticker[order(-df_top10$percent_change)])
# Plot
date <- unique(df_top10$date)
ggplot(df_top10, aes(x=Ticker, y=percent_change)) +
geom_col(width=0.3, color='steelblue', fill='steelblue') +
theme(axis.text.x=element_text(angle=90), text=element_text(size=12),
plot.title = element_text(hjust = 0.5)) +
labs(y="Percent increase", x ="Stock ticker",
title = paste0("Highest stock price increase on ",date)) +
geom_text(aes(label = percent_change_chr), vjust = -0.5, size=3)
#Top-10 losses
df_bottom10 <- tail(df_grouped, 10)
df_bottom10$percent_change <- abs(df_bottom10$percent_change)
df_bottom10$Ticker <- factor(df_bottom10$Ticker,
levels=df_bottom10$Ticker[order(-df_bottom10$percent_change)])
# Plot
date = unique(df_bottom10$date)
ggplot(df_bottom10, aes(x=Ticker, y=percent_change)) +
geom_col(width=0.3, color='red', fill='red') +
theme(axis.text.x = element_text(angle=90),text = element_text(size=12),
plot.title = element_text(hjust = 0.5)) +
labs(y="Percent decrease", x ="Stock ticker",
title = paste0("Highest stock price decrease on ",date)) +
geom_text(aes(label=percent_change_chr), vjust =-0.5, size=3)
```
From the plot above we can see that on 16th Nov 2020, the stock price for ZXZZT(NASDAQ TEST STOCK) noticed maximum gain of 93.91%. After that, the highest gain was noticed by AIRTW (Air T, Inc.) which was 51.34% and WWR(Westwater Resources, Inc.) which was 45.48%.
On similar lines, the stock price for KTOVW(Kitov Pharma Ltd. Warrants) noticed highest loss of -43.93%.
Thus this plot allows the user to find out the biggest gainers and losers on any given day.
<!--*************************************************************************-->
## Contingency tables
Here we have used `get_ticker_info` function to get ticker information such as its type and its primary exchange for all the tickers supported by Polygon.io. in the **stock** market. Hence the input argument to the function call is "stocks".
### One-way
Here we have created a contingency table using ticker type information for the stock market. When fetching data from the [Ticker Endpoint](https://polygon.io/docs/stocks/get_v3_reference_tickers) in `df_info`, we get the abbrevations of ticker types. In order to fetch the descriptions of ticker type (For example: **CS** means **Common Stocks**) we have fetched data from the [Ticker Types Endpoint](https://polygon.io/docs/stocks/get_v3_reference_tickers_types) in `df_tickertype_metadata`. Then we took the `left_join` of the two tibbles in order to get all the information in a single tibble which is then used to create the contingency table.
```{r 2_3_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
df_info <- get_ticker_info("stocks")
Sys.sleep(5)
df_tickertype_metadata <- get_ticker_type_details() %>% select(code,description)
Sys.sleep(5)
df_tables <- left_join(df_info,df_tickertype_metadata,by=c("type"="code"))
tab1 <- table(df_tables$description,dnn=c("Ticker Types"))
tab1 %>%
kbl(caption="Table for Ticker Types") %>%
kable_classic(full_width = F)
```
From the table we can see that the number of CS (Common Stock) is 561 and the number of ARDC (American Depository Receipt Common) is 40.
### Two-way
Similarly here we have created a two way contingency table for the number of ticker types for each stock exchange. When fetching data from the [Ticker Endpoint](https://polygon.io/docs/stocks/get_v3_reference_tickers) in `df_info`, we get the abbrevations stock exchanges. Here we tried to use [Exchange Endpoint](https://polygon.io/docs/stocks/get_v3_reference_exchanges) to get the full name of the stock exchanges (For example: **XNYS** is **NYSE American, LLC**) but there we multiple exchange information for each exchange which made the table very complicated to understand. For that reason we let the exchange abbrevations be.
```{r 2_3_2,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
tab2 <- table(df_tables$description,df_info$primary_exchange,dnn=c("Ticker Types","Exchanges"))
tab2 %>%
kbl(caption="Table for Ticker Types and Exchanges") %>%
kable_classic(full_width = F)
```
From the table above we can see that there are 359 **Common Stock** type tickers in the **XNYS** (New York stock exchange). And similarly this table helps us to identify the total number of each ticker type in each exchange.
<!--*************************************************************************-->
## Numerical summaries
Here we have created summary Statistics for **open_price** and **close_price** for each company between `r start_date` and `r end_date`. The summary statistics includes minimum, maximum, median, mean, quartile and standard deviation of prices for the 3 stocks.
```{r 2_4_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
# Open Price
df_combined_open <- df_combined %>%
group_by(company_name) %>%
summarise("Min." = min(open_price),
"1st Quartile" = quantile(open_price,0.25),
"Median." = median(open_price),
"Mean."=mean(open_price),
"3rd Quartile" = quantile(open_price,0.75),
"Max."= max(open_price),
"Std. Dev." = sd(open_price))
# Close Price
df_combined_close <- df_combined %>%
group_by(company_name) %>%
summarise("Min." = min(close_price),
"1st Quartile" = quantile(close_price,0.25),
"Median." = median(close_price),
"Mean."=mean(close_price),
"3rd Quartile" = quantile(close_price,0.75),
"Max."= max(close_price),
"Std. Dev." = sd(close_price))
df_combined_open %>%
kbl(caption=paste0("Summary Statistics for Open Price per Company between ",start_date," and ",end_date)) %>%
kable_classic()
df_combined_close %>%
kbl(caption=paste0("Summary Statistics for Close Price per Company between ",start_date," and ",end_date)) %>%
kable_classic()
```
<!--*************************************************************************-->
## Box plots
The above numerical summaries can be visualized using a box plot. Here we have created a box plot for **highest stock price** and **lowest stock *price** for each company between `r start_date` and `r end_date`.
```{r 2_5_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
ggplot(df_combined, aes(x=company_name, y=highest_price)) +
geom_boxplot(color="blue",fill="grey") +
labs(y="Highest price", x ="Company Name",
title=paste0("Boxplot for highest stock price between ",start_date," and ",end_date))
```
From the above plot we can see that mean highest price for Apple is lowest followed by Nvidia and then Tesla. From the width of the box plot we can also infer that highest price for Tesla has varied the most in the given date range where as Apple's highest price has varied the least.
```{r 2_5_2,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE}
ggplot(df_combined, aes(x=company_name, y=lowest_price)) +
geom_boxplot(color="red",fill="grey") +
labs(y="Lowest price", x ="Company Name",
title=paste0("Boxplot for lowest stock price between ",start_date," and ",end_date))
```
The above plot for lowest price is analogous the box plot for the highest price. It follows the same trend as seen in the plot for highest price.
<!--*************************************************************************-->
## Histogram
Here we have used the `df_combined` to create a histogram for the **weighted average stock price** for each company.
```{r 2_6_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE,fig.width = 10}
wrapper <- function(x, ...)
{
paste(strwrap(x, ...), collapse = "\n")
}
my_title <- paste0("Histogram of weighted average price per company between ",start_date," and ",end_date)
ggplot(df_combined, aes(x=weighted_avg_price)) +
geom_histogram(aes(fill=company_name),binwidth=8) +
labs(x ="Weighted average price") +
scale_fill_discrete(name = "Company Name") +
ggtitle(wrapper(my_title, width=80))
```
Consistent with the trend from the previous plots, we can see that weighted average price for Apple is at the lower end while for Tesla it as at higher end. We can also infer from the plots that weighted average price for Apple and Tesla roughly follow Gaussian distribution. While the deviation for Apple around the mean is the least, the deviation for Tesla around its mean is the most. Probably because of Elon's Tweets which contribute to volatiltiy to the stock price!
<!--*************************************************************************-->
## Bar plot
Here we have used the one way contingency table created in above sections to create a bar plot with the type of tickers on the x axis.
```{r 2_7_1,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE,fig.height=10}
ggplot(df_tables, aes(x=description)) +
geom_bar(fill="steelblue") +
labs(x ="Type of tickers",
title="Bar plot for number of stock tickers for each type") +
theme(text=element_text(size=14), plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90)) +
geom_text(aes(label = ..count..), stat = "count", vjust = -0.5)
```
The above plot provides a good visualization of the one way contingency table. We can infer that number of tickers of type "Common Stock" are the highest with 561 and number of tickers of type "Exchange Traded Note" are the least with 5.
<!--*************************************************************************-->
## Scatter plot
The scatter plot has been used to plot the percentage change in returns for the stocks Apple, Tesla and Nvidia for a date range and can be differentiated based on the different colours. From the plot, we can see how varied returns have been on a daily basis but if we look carefully we can see that there is a certain correlation in the returns and on most days the movement for all 3 technology stocks is in the same direction of either gaining or losing
```{r 2_8_1, echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE,fig.width = 10}
df_comb <- df_combined %>% mutate(percent_change = round(((close_price-open_price)/open_price)*100,2))
ggplot(df_comb, aes(x=timestamp, y=percent_change,color=company_name)) +
geom_point()+
facet_grid(cols = vars(company_name)) +
scale_color_discrete(name = "Company Name")
```
<!--*************************************************************************-->
# Wrap- Up
This vignette mainly focuses on communicating with the REST API Endpoints, creating functions in R to fetch data from the endpoints and basic data exploration- including numerical summaries, contingency tables and plots.