Commit 554bfae

add news retriever (run-llama#13934)
* add news retriever
* Integrate News API to YouRetriever
* Update You Retriever notebook
* 🎨 Rename `endpoint_type` to `endpoint`
* Return news metadata
* fixup! 🎨 Rename `endpoint_type` to `endpoint`
* ⬆️ Bump package version

---------

Co-authored-by: Christopher Tee <[email protected]>
1 parent 00e118c commit 554bfae

3 files changed, +155 -25 lines changed

docs/docs/examples/retrievers/you_retriever.ipynb

+59 -16
@@ -36,44 +36,88 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "import os\n",
     "from llama_index.retrievers.you import YouRetriever"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "bda2c5e0",
+   "metadata": {},
+   "source": [
+    "### Retrieve from You.com's Search API"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "id": "a38b87b3-c94e-4311-8335-86c6b0f32463",
    "metadata": {},
    "outputs": [],
    "source": [
-    "you_api_key = \"\" or os.environ[\"YOU_API_KEY\"]\n",
+    "you_api_key = \"\" or os.environ[\"YDC_API_KEY\"]\n",
     "\n",
-    "retriever = YouRetriever(api_key=you_api_key)"
+    "retriever = YouRetriever(endpoint=\"search\", api_key=you_api_key) # default"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "id": "bbfc0fe3-7c64-4d5d-8190-f80e31d35b4c",
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The beaches and underwater world off the coast of Florida provide endless opportunities of play in the ocean. ... Glacier Bay is a living laboratory with ongoing research and study by scientists on a wide range of ocean-related issues. ... A picture of coastal life, Fire Island offers rich biological diversity and the beautiful landscapes that draw us all to the ocean.\n",
+      "A military veteran, Jose Sarria also became a prominent advocate for Latinos, immigrants, and the LGBTQ community in San Francisco. ... Explore the history of the LGBTQ community on Governors Island and Henry Gurber's work in protecting gay rights.\n",
+      "Explore the history of the LGBTQ community on Governors Island and Henry Gurber's work in protecting gay rights. ... Sylvia Rivera was an advocate for transgender rights and LGBTQ+ communities, and was an active participant of the Stonewall uprising.\n"
+     ]
+    }
+   ],
    "source": [
-    "retrieved_results = retriever.retrieve(\"national parks in the US\")"
+    "retrieved_results = retriever.retrieve(\"national parks in the US\")\n",
+    "print(retrieved_results[0].get_content())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "069c4adb",
+   "metadata": {},
+   "source": [
+    "### Retrieve from You.com's News API"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "3142a3af-d9a0-4fc1-a6a4-f42eb11a9099",
+   "id": "47a7c7d3",
    "metadata": {},
    "outputs": [],
    "source": [
-    "print(retrieved_results[0].get_content())\n",
-    "\n",
-    "from llama_index.core.response.notebook_utils import display_source_node\n",
+    "you_api_key = \"\" or os.environ[\"YDC_API_KEY\"]\n",
     "\n",
-    "# for n in retrieved_results:\n",
-    "# display_source_node(n)"
+    "retriever = YouRetriever(endpoint=\"news\", api_key=you_api_key)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f9eedea5",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "But seven months after the October announcement, the Fed's key interest rate — the federal funds rate — is still stuck at 5.25% to 5.5%, where it has been since July 2023. U.S. interest rates are built with the fed funds rate as the foundation.\n"
+     ]
+    }
+   ],
+   "source": [
+    "retrieved_results = retriever.retrieve(\"Fed interest rates\")\n",
+    "print(retrieved_results[0].get_content())"
    ]
   },
   {
@@ -93,9 +137,8 @@
    "source": [
     "from llama_index.core.query_engine import RetrieverQueryEngine\n",
     "\n",
-    "query_engine = RetrieverQueryEngine.from_args(\n",
-    " retriever,\n",
-    ")"
+    "retriever = YouRetriever()\n",
+    "query_engine = RetrieverQueryEngine.from_args(retriever)"
    ]
   },
   {
@@ -108,7 +151,7 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "The United States has 63 national parks, which are protected areas operated by the National Park Service. These parks are designated for their natural beauty, unique geological features, diverse ecosystems, and recreational opportunities. They are typically larger and more popular destinations compared to other units of the National Park System. National monuments, on the other hand, are also protected for their historical or archaeological significance. Some national parks are paired with national preserves, which have different levels of protection but are administered together. The national parks in the United States cover a total area of approximately 52.4 million acres.\n"
+      "There are 63 national parks in the United States, each established to preserve unique landscapes, wildlife, and historical sites for the enjoyment of present and future generations. These parks are managed by the National Park Service, which aims to conserve the scenery and natural and historic objects within the parks. National parks offer a wide range of activities such as hiking, camping, wildlife viewing, and learning about the natural world. Some of the most visited national parks include Great Smoky Mountains, Yellowstone, and Zion, while others like Gates of the Arctic see fewer visitors due to their remote locations. Each national park has its own distinct features and attractions, contributing to the diverse tapestry of protected lands across the country.\n"
      ]
     }
    ],
@@ -120,9 +163,9 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "llama_index_v2",
+   "display_name": "you-llamaindex",
    "language": "python",
-   "name": "llama_index_v2"
+   "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {
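
The notebook now reads `YDC_API_KEY`, while the retriever constructor in base.py still honours the older `YOU_API_KEY` before falling back to it. A minimal sketch of that lookup order as a standalone function follows. The name `resolve_api_key` and its `env` parameter are hypothetical, introduced here only so the order can be exercised without touching the real environment; they are not part of the package.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the first usable key: explicit argument, then YOU_API_KEY, then YDC_API_KEY.

    Hypothetical helper mirroring the expression in the constructor:
        api_key or os.getenv("YOU_API_KEY") or os.environ["YDC_API_KEY"]
    """
    # `env` is an illustration-only hook; by default the process environment is used.
    env = os.environ if env is None else env
    # An empty string is falsy, so `"" or ...` falls through to the environment,
    # which is the idiom the notebook cell relies on.
    # Indexing (rather than .get) raises KeyError when YDC_API_KEY is absent too.
    return api_key or env.get("YOU_API_KEY") or env["YDC_API_KEY"]
```

Because the legacy variable is checked first, an environment that defines both variables would resolve to `YOU_API_KEY`, which may be worth keeping in mind while the older name is still supported.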

llama-index-integrations/retrievers/llama-index-retrievers-you/llama_index/retrievers/you/base.py

+95 -8
@@ -2,9 +2,11 @@
 
 import logging
 import os
-from typing import List, Optional
+import warnings
+from typing import Any, Dict, List, Literal, Optional
 
 import requests
+
 from llama_index.core.base.base_retriever import BaseRetriever
 from llama_index.core.callbacks.base import CallbackManager
 from llama_index.core.schema import NodeWithScore, QueryBundle, TextNode
@@ -13,24 +15,109 @@
 
 
 class YouRetriever(BaseRetriever):
-    """You retriever."""
+    """
+    Retriever for You.com's Search and News API.
+
+    [API reference](https://documentation.you.com/api-reference/)
+
+    Args:
+        api_key: you.com API key, if `YDC_API_KEY` is not set in the environment
+        endpoint: you.com endpoints
+        num_web_results: The max number of web results to return, must be under 20
+        safesearch: Safesearch settings, one of "off", "moderate", "strict", defaults to moderate
+        country: Country code, ex: 'US' for United States, see API reference for more info
+        search_lang: (News API) Language codes, ex: 'en' for English, see API reference for more info
+        ui_lang: (News API) User interface language for the response, ex: 'en' for English, see API reference for more info
+        spellcheck: (News API) Whether to spell check query or not, defaults to True
+    """
 
     def __init__(
         self,
         api_key: Optional[str] = None,
         callback_manager: Optional[CallbackManager] = None,
+        endpoint: Literal["search", "news"] = "search",
+        num_web_results: Optional[int] = None,
+        safesearch: Optional[Literal["off", "moderate", "strict"]] = None,
+        country: Optional[str] = None,
+        search_lang: Optional[str] = None,
+        ui_lang: Optional[str] = None,
+        spellcheck: Optional[bool] = None,
     ) -> None:
         """Init params."""
-        self._api_key = api_key or os.environ["YOU_API_KEY"]
+        # Should deprecate `YOU_API_KEY` in favour of `YDC_API_KEY` for standardization purposes
+        self._api_key = api_key or os.getenv("YOU_API_KEY") or os.environ["YDC_API_KEY"]
         super().__init__(callback_manager)
 
+        if endpoint not in ("search", "news"):
+            raise ValueError('`endpoint` must be either "search" or "news"')
+
+        # Raise warning if News API-specific fields are set but endpoint is not "news"
+        if endpoint != "news":
+            news_api_fields = (search_lang, ui_lang, spellcheck)
+            for field in news_api_fields:
+                if field:
+                    warnings.warn(
+                        (
+                            f"News API-specific field '{field}' is set but `{endpoint=}`. "
+                            "This will have no effect."
+                        ),
+                        UserWarning,
+                    )
+
+        self.endpoint = endpoint
+        self.num_web_results = num_web_results
+        self.safesearch = safesearch
+        self.country = country
+        self.search_lang = search_lang
+        self.ui_lang = ui_lang
+        self.spellcheck = spellcheck
+
+    def _generate_params(self, query: str) -> Dict[str, Any]:
+        params = {"safesearch": self.safesearch, "country": self.country}
+
+        if self.endpoint == "search":
+            params.update(
+                query=query,
+                num_web_results=self.num_web_results,
+            )
+        elif self.endpoint == "news":
+            params.update(
+                q=query,
+                count=self.num_web_results,
+                search_lang=self.search_lang,
+                ui_lang=self.ui_lang,
+                spellcheck=self.spellcheck,
+            )
+
+        # Remove `None` values
+        return {k: v for k, v in params.items() if v is not None}
+
     def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
         """Retrieve."""
         headers = {"X-API-Key": self._api_key}
-        results = requests.get(
-            f"https://api.ydc-index.io/search?query={query_bundle.query_str}",
+        params = self._generate_params(query_bundle.query_str)
+        response = requests.get(
+            f"https://api.ydc-index.io/{self.endpoint}",
+            params=params,
             headers=headers,
-        ).json()
+        )
+        response.raise_for_status()
+        results = response.json()
+
+        nodes: List[TextNode] = []
+        if self.endpoint == "search":
+            for hit in results["hits"]:
+                nodes.append(
+                    TextNode(
+                        text="\n".join(hit["snippets"]),
+                    )
+                )
+        else:  # news endpoint
+            for article in results["news"]["results"]:
+                node = TextNode(
+                    text=article["description"],
+                    extra_info={"url": article["url"], "age": article["age"]},
+                )
+                nodes.append(node)
 
-        search_hits = ["\n".join(hit["snippets"]) for hit in results["hits"]]
-        return [NodeWithScore(node=TextNode(text=s), score=1.0) for s in search_hits]
+        return [NodeWithScore(node=node, score=1.0) for node in nodes]
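
The parameter mapping in `_generate_params` can be exercised without an API key or network access. The sketch below restates it as a free function using only the standard library. The names `build_params` and `unused_news_fields` are hypothetical and not part of the package; the query-string keys (`query`/`num_web_results` for search, `q`/`count` for news) are copied from the diff above.

The second helper is a possible variation, not the committed behaviour. As written in the diff, the warning loop iterates over the field values, so the message interpolates the value (for example `'en'`) instead of the field name, and `spellcheck=False` is falsy and would not trigger a warning. Checking `is not None` by name avoids both.

```python
from typing import Any, Dict, List, Optional


def build_params(
    endpoint: str,
    query: str,
    num_web_results: Optional[int] = None,
    safesearch: Optional[str] = None,
    country: Optional[str] = None,
    search_lang: Optional[str] = None,
    ui_lang: Optional[str] = None,
    spellcheck: Optional[bool] = None,
) -> Dict[str, Any]:
    """Hypothetical free-function mirror of `YouRetriever._generate_params`."""
    if endpoint not in ("search", "news"):
        raise ValueError('`endpoint` must be either "search" or "news"')

    # Options shared by both endpoints.
    params: Dict[str, Any] = {"safesearch": safesearch, "country": country}

    if endpoint == "search":
        params.update(query=query, num_web_results=num_web_results)
    else:  # news uses different key names for the query and the result count
        params.update(
            q=query,
            count=num_web_results,
            search_lang=search_lang,
            ui_lang=ui_lang,
            spellcheck=spellcheck,
        )

    # Drop unset options so they are not sent; False is kept because it is not None.
    return {k: v for k, v in params.items() if v is not None}


def unused_news_fields(endpoint: str, **news_fields: Any) -> List[str]:
    """Return the names of News-only options that are set while endpoint is not "news"."""
    if endpoint == "news":
        return []
    return [name for name, value in news_fields.items() if value is not None]
```

For instance, `unused_news_fields("search", ui_lang="en", spellcheck=False)` reports both names, which a caller could then pass to `warnings.warn` one at a time.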

llama-index-integrations/retrievers/llama-index-retrievers-you/pyproject.toml

+1 -1
@@ -27,7 +27,7 @@ exclude = ["**/BUILD"]
 license = "MIT"
 name = "llama-index-retrievers-you"
 readme = "README.md"
-version = "0.1.2"
+version = "0.1.3"
 
 [tool.poetry.dependencies]
 python = ">=3.8.1,<4.0"