For usage references and examples, please check out the `/examples` folder in this repository.
## Integrations
Scrapfly Python SDKs are integrated with [LlamaIndex](https://www.llamaindex.ai/) and [LangChain](https://www.langchain.com/). Both frameworks allow training Large Language Models (LLMs) using augmented context.
This augmented context is obtained by training LLMs on top of private or domain-specific data, enabling common use cases:
- Question-Answering Chatbots (commonly referred to as RAG systems, which stands for "Retrieval-Augmented Generation")
- Document Understanding and Extraction
- Autonomous Agents that can perform research and take actions
<br>
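To make the retrieval step concrete, here is a framework-free sketch of the idea behind RAG: rank documents by relevance to a question and prepend the best match to the LLM prompt as context. The term-overlap scoring and the sample product strings below are purely illustrative (LlamaIndex and LangChain use vector embeddings for this step, not word counting):

```python
from collections import Counter

def score(query: str, doc: str) -> int:
    # Toy relevance score: how many times query terms appear in the document
    doc_terms = Counter(doc.lower().split())
    return sum(doc_terms[t] for t in query.lower().split())

def retrieve(query: str, docs: list[str]) -> str:
    # Return the single most relevant document for the query
    return max(docs, key=lambda d: score(query, d))

docs = [
    "Box of Chocolate Candy: gourmet assorted chocolates, perfect for gifts.",
    "Hiking Boots: rugged waterproof boots for mountain trails.",
]
question = "which chocolate product is sold?"
context = retrieve(question, docs)

# The retrieved document is prepended to the prompt as augmented context
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The frameworks below automate exactly this loop: loading documents, indexing them, retrieving the relevant ones, and filling the prompt.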
In the context of web scraping, web page data can be extracted as Text or Markdown using [Scrapfly's format feature](https://scrapfly.io/docs/scrape-api/specification#api_param_format) to train LLMs with the scraped data.
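Using this SDK directly, that could look like the following sketch (the `format` parameter follows the linked specification; the exact accepted values are documented there):

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="Your Scrapfly API key")

# Ask the Scrape API to convert the page to markdown before returning it
result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/products",
    format="markdown",
))
print(result.content)  # markdown text, ready for LLM ingestion
```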
### LlamaIndex
#### Installation
Install `llama-index`, `llama-index-readers-web`, and `scrapfly-sdk` using pip:
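```shell
pip install llama-index llama-index-readers-web scrapfly-sdk
```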
#### Usage

Scrapfly is available on LlamaIndex as a [data connector](https://docs.llamaindex.ai/en/stable/module_guides/loading/connector/), known as a `Reader`. This reader gathers web page data into a `Document` representation, which can be used with the LLM directly. Below is an example of building a RAG system using LlamaIndex and scraped data. See the [LlamaIndex use cases](https://docs.llamaindex.ai/en/stable/use_cases/) for more.
```python
import os

from llama_index.readers.web import ScrapflyReader
from llama_index.core import VectorStoreIndex

# Initiate ScrapflyReader with your Scrapfly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"]
)

# After creating the documents, index them to query with an LLM
# LlamaIndex uses OpenAI by default; other options can be found in the examples directory
os.environ["OPENAI_API_KEY"] = "Your OpenAI API key"

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What products are available?")
print(response)
```
### LangChain

Scrapfly is available on LangChain as a [document loader](https://python.langchain.com/v0.2/docs/concepts/#document-loaders), known as a `Loader`. This loader gathers web page data into a `Document` representation, which can be used with the LLM after a few operations. Below is an example of building a RAG system with LangChain using scraped data. See the [LangChain tutorials](https://python.langchain.com/v0.2/docs/tutorials/) for further use cases.
```python
import os

from langchain import hub  # pip install langchainhub
from langchain_chroma import Chroma  # pip install langchain_chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings, ChatOpenAI  # pip install langchain_openai
from langchain_text_splitters import RecursiveCharacterTextSplitter  # pip install langchain_text_splitters
from langchain_community.document_loaders import ScrapflyLoader

scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()

# This example uses OpenAI. For more see: https://python.langchain.com/v0.2/docs/integrations/platforms/
os.environ["OPENAI_API_KEY"] = "Your OpenAI API key"

# Split the documents and embed them into a vector store for retrieval
splits = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(documents)
retriever = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings()).as_retriever()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Build the RAG chain: retrieve context, fill the prompt, query the model
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | hub.pull("rlm/rag-prompt")
    | ChatOpenAI()
    | StrOutputParser()
)

response = rag_chain.invoke("What products are available?")
print(response)
```