Commit a9cd252: add external integration docs (#19)

README.md

For use of the built-in HTML parser (via the `ScrapeApiResponse.selector` property), additional dependencies are required.
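
For illustration, a minimal sketch of querying a scrape result through `ScrapeApiResponse.selector` (assuming the parsel-backed selector; the URL and CSS query here are hypothetical examples):

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="Your Scrapfly API key")
api_response = client.scrape(ScrapeConfig(url="https://web-scraping.dev/products"))

# .selector exposes the scraped HTML as a parsel Selector,
# so CSS and XPath queries run directly on the result
print(api_response.selector.css("h3 a::text").getall())  # hypothetical query
```
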
For reference of usage or examples, please check out the `/examples` folder in this repository.

## Integrations

Scrapfly Python SDKs are integrated with [LlamaIndex](https://www.llamaindex.ai/) and [LangChain](https://www.langchain.com/). Both frameworks allow training Large Language Models (LLMs) using augmented context.

This augmented context is built by training LLMs on private or domain-specific data for common use cases:

- Question-Answering Chatbots (commonly referred to as RAG systems, which stands for "Retrieval-Augmented Generation")
- Document Understanding and Extraction
- Autonomous Agents that can perform research and take actions

In the context of web scraping, web page data can be extracted as text or Markdown using [Scrapfly's format feature](https://scrapfly.io/docs/scrape-api/specification#api_param_format) and used to train LLMs with the scraped data.
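
For illustration, a minimal sketch of this with the bare SDK (assuming `ScrapeConfig` accepts the API's `format` parameter value directly; the URL is just an example):

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="Your Scrapfly API key")

# Request the page content as markdown instead of raw HTML,
# mirroring the API's `format` parameter
result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/products",
    format="markdown",
))
print(result.content)  # markdown text, ready to feed an LLM pipeline
```
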
### LlamaIndex

#### Installation

Install `llama-index`, `llama-index-readers-web`, and `scrapfly-sdk` using pip:

```shell
pip install llama-index llama-index-readers-web scrapfly-sdk
```

#### Usage

Scrapfly is available in LlamaIndex as a [data connector](https://docs.llamaindex.ai/en/stable/module_guides/loading/connector/), known as a `Reader`. This reader gathers web page data into a `Document` representation, which can be used with the LLM directly. Below is an example of building a RAG system using LlamaIndex and scraped data. See the [LlamaIndex use cases](https://docs.llamaindex.ai/en/stable/use_cases/) for more.

```python
import os

from llama_index.readers.web import ScrapflyReader
from llama_index.core import VectorStoreIndex

# Initiate ScrapflyReader with your Scrapfly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"]
)

# After creating the documents, index and query them with an LLM
# LlamaIndex uses OpenAI by default; other options can be found in the examples directory:
# https://docs.llamaindex.ai/en/stable/examples/llm/openai/

# Add your OpenAI key (a paid subscription must exist) from: https://platform.openai.com/api-keys/
os.environ['OPENAI_API_KEY'] = "Your OpenAI Key"
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What is the flavor of the dark energy potion?")
print(response)
"The flavor of the dark energy potion is bold cherry cola."
```

The `load_data` function accepts a `ScrapeConfig` object to use the desired Scrapfly API parameters:

```python
from llama_index.readers.web import ScrapflyReader

# Initiate ScrapflyReader with your Scrapfly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"],
    scrape_config=scrapfly_scrape_config,  # Pass the scrape config
    scrape_format="markdown",  # The scrape result format, either `markdown` (default) or `text`
)
```

### LangChain

#### Installation

Install `langchain`, `langchain-community`, and `scrapfly-sdk` using pip:

```shell
pip install langchain langchain-community scrapfly-sdk
```

#### Usage

Scrapfly is available in LangChain as a [document loader](https://python.langchain.com/v0.2/docs/concepts/#document-loaders), known as a `Loader`. This loader gathers web page data into a `Document` representation, which can be used with the LLM after a few operations. Below is an example of building a RAG system with LangChain using scraped data; see the [LangChain tutorials](https://python.langchain.com/v0.2/docs/tutorials/) for further use cases.

```python
import os

from langchain import hub  # pip install langchainhub
from langchain_chroma import Chroma  # pip install langchain_chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings, ChatOpenAI  # pip install langchain_openai
from langchain_text_splitters import RecursiveCharacterTextSplitter  # pip install langchain_text_splitters
from langchain_community.document_loaders import ScrapflyLoader


scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()

# This example uses OpenAI. For more see: https://python.langchain.com/v0.2/docs/integrations/platforms/
os.environ["OPENAI_API_KEY"] = "Your OpenAI key"

# Create a retriever
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(documents)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

model = ChatOpenAI()
prompt = hub.pull("rlm/rag-prompt")

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

response = rag_chain.invoke("What is the flavor of the dark energy potion?")
print(response)
"The flavor of the Dark Energy Potion is bold cherry cola."
```

To use the full set of Scrapfly features with LangChain, pass a `ScrapeConfig` object to the `ScrapflyLoader`:

```python
from langchain_community.document_loaders import ScrapflyLoader

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}

scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
    scrape_config=scrapfly_scrape_config,  # Pass the scrape_config object
    scrape_format="markdown",  # The scrape result format, either `markdown` (default) or `text`
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()
print(documents)
```

## Get Your API Key

You can create a free account on [Scrapfly](https://scrapfly.io/register) to get your API Key.
