Reperio /reˈpe.ri.oː/, [rɛˈpɛrioː], to discover.
Reperio is a simple, lightweight library to parse and scrap html pages.
yarn add reperio
Benchmarching is the time it takes to do the following actions:
Action | Time (ms) |
---|---|
new Parser(20lines) | 0.49 ms |
new Parser(20lines).extractUrls() | 0.49 ms |
new Parser(20000lines) | 5.61 ms |
new Parser(2000lines).extractUrls() | 6.11 ns |
There are two ways to invoke a parser:
- Pass a string payload to the constructor
const parser = new Parser(`
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Reperio Website</title>
<script type="text/javascript">
console.log("Me awesome script")
</script>
</head>
<body>
<h1>Welcome to the website</h1>
<p>Welcome to the reperio test website</p>
</body>
<footer>
<p>Burlet Mederic</p>
</footer>
</html>
`);
console.log(parser.parsedPage.title);
// Reperio Website
- Pass a URL to the
parserFromUrl
function
parserFromUrl
returns a promise of the following format.
parserFromUrl("https://scrapeme.live/shop/").then(({ error, parser }) => {
if (parser) {
console.log(parser.parsedPage.title);
// Products - ScrapeMe
}
});
Once the parser is returned you can access the following components
title: string
The title of the page
head: string
Everything in between <head></head>
body: string
Everything in between <body></body>
footer: string
Everything in between <footer></footer>
meta: MetaHTMLTag[]
Will return an array of MetaHTMLTag
for each <meta>
tag.
export interface MetaHTMLTag extends HTMLTag {
attribute: HTMLTagName.meta;
charset: string | undefined;
content: string | undefined;
name: string | undefined;
}
media.images: ImgHTMLTag[]
Will return an array of ImgHTMLTag
for each <img>
tag.
export interface ImgHTMLTag extends HTMLTag {
attribute: HTMLTagName.img;
src: string | undefined;
alt: string | undefined;
height: string | undefined;
width: string | undefined;
body?: undefined;
}
media.videos: VideoHTMLTag[]
Will return an array of VideoHTMLTag
for each <video>
tag.
export interface VideoHTMLTag extends HTMLTag {
attribute: HTMLTagName.video;
autoplay: string | undefined;
controls: string | undefined;
loop: string | undefined;
poster: string | undefined;
src: string | undefined;
height: string | undefined;
width: string | undefined;
}
links.links: LinkHTMLTag[]
Will return an array of LinkHTMLTag
for each <link>
tag.
export interface LinkHTMLTag extends HTMLTag {
attribute: HTMLTagName.link;
href: string | undefined;
crossorigin: string | undefined;
rel: string | undefined;
type: string | undefined;
body?: undefined;
}
links.anchors: AnchorHTMLTag[]
Will return an array of AnchorHTMLTag
for each <a>
tag.
export interface AnchorHTMLTag extends HTMLTag {
attribute: HTMLTagName.a;
download: string | undefined;
href: string | undefined;
target: string | undefined;
type: string | undefined;
}
styles: StyleHTMLTag[]
Will return an array of StyleHTMLTag
for each <style>
tag.
export interface StyleHTMLTag extends HTMLTag {
attribute: HTMLTagName.style;
type: string | undefined;
body: string | undefined;
}
scripts: ScriptHTMLTag[]
Will return an array of ScriptHTMLTag
for each <script>
tag.
export interface ScriptHTMLTag extends HTMLTag {
attribute: HTMLTagName.script;
async: string | undefined;
crossorigin: string | undefined;
defer: string | undefined;
integrity: string | undefined;
src: string | undefined;
type: string | undefined;
body: string | undefined;
}
tables: TableHTMLTag[]
Will return an array of TableHTMLTag
for each <table>
tag. Tables are parsed into headers and rows, with content extraction for each cell.
export interface TableHTMLTag extends HTMLTag {
attribute: HTMLTagName.table;
body: string | undefined;
headers: TableRow[];
rows: TableRow[];
rawHtml: string | undefined;
}
export interface TableRow {
cells: TableCell[];
}
export interface TableCell {
content: string;
originalHtml: string;
text: string;
colSpan?: number;
rowSpan?: number;
}
This function will extract all the urls found in:
- images
- videos
- links
- anchors
- scripts
By default the function remove duplicates; you can set the removeDuplicates
flag to false.
This function converts HTML tables into an array of JavaScript objects, where table headers are used as property names and cell values as property values.
- If no
tableIndex
is provided and there's only one table, it returns an array of objects for that table - If no
tableIndex
is provided and there are multiple tables, it returns an array of arrays (one array per table) - If a
tableIndex
is provided, it returns an array of objects for the specified table
Example:
// For a table with headers: First Name, Last Name, Age
// And rows with data: John/Doe/25, Jane/Smith/30
const tableData = parser.extractTablesToObject();
console.log(tableData);
/* Output:
[
{"First Name": "John", "Last Name": "Doe", "Age": "25"},
{"First Name": "Jane", "Last Name": "Smith", "Age": "30"}
]
*/
This function will download all images to the specified folder in downloadLocation
By default the function remove duplicates; you can set the removeDuplicates
flag to false.
This function will return all the sentences that have the matching term.
const para = `As she did so, a most extraordinary thing happened. Some random sentence with flung in it. The bed-clothes gathered themselves together, leapt up suddenly into a sort of peak, and then jumped headlong over the bottom rail. It was exactly as if a hand had clutched them in the centre and flung them aside. Immediately after, .........`;
const foundSentences = findSentenceWithWord(para, "flung");
console.log(foundSentences);
/*[
"Some random sentence with flung in it.",
"It was exactly as if a hand had clutched them in the centre and flung them aside."
]*/
All the functions used for parsing a payload are available for individual use.
removeWhitespace(payload: string)
Will remove all new lines and double spaces.
parseTitle
Extracts the content of the <title></title>
tag.
parseHead
Extracts the content of the <head></head>
tag.
parseBody
Extracts the content of the <body></body>
tag.
parseFooter
Extracts the content of the <footer></footer>
tag.
parseMeta
Extracts all the <meta></meta>
tags
parseImages
Extracts all the <img>
tags
parseVideos
Extracts all the <video></video>
tags
parseLinks
Extracts all the <link>
tags
parseAnchors
Extracts all the <a></a>
tags
parseStyles
Extracts all the <style>
tags
parseScripts
Extracts all the <script></script>
tags
parseTables
Extracts all the <table></table>
tags, including their structure (headers and rows)
Please look at any open issues for submitting PRs
Follow established code principles
Update tests in src/__tests__
.
To publish an update to npm, follow these steps:
-
Update the version in
package.json
following semantic versioning principles:- MAJOR version for incompatible API changes
- MINOR version for added functionality in a backward compatible manner
- PATCH version for backward compatible bug fixes
-
Build the package:
npm run build
-
Run tests to ensure everything works correctly:
npm test
-
Create a new git tag for the version:
git tag -a v1.x.x -m "Version 1.x.x"
-
Push the tag to GitHub:
git push origin v1.x.x
-
Publish to npm:
npm publish
If you're publishing for the first time or after logging out:
npm login npm publish