index outgoing links to create backlink index
alexkrolick committed Sep 4, 2020
1 parent 5a0840f commit e668f61
Showing 10 changed files with 165 additions and 43 deletions.
4 changes: 3 additions & 1 deletion .prettierignore
@@ -1 +1,3 @@
**/search.json
**/search.json
.*ignore
*.mmd
16 changes: 16 additions & 0 deletions .vscode/launch.json
@@ -0,0 +1,16 @@
{
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"type": "pwa-node",
"request": "launch",
"name": "Launch Program",
"skipFiles": ["<node_internals>/**"],
"program": "${workspaceFolder}/lib/index.js",
"args": ["-d", "test/notes", "linksTo:*"]
}
]
}
31 changes: 26 additions & 5 deletions README.md
@@ -31,6 +31,7 @@ Search took 0.017599736 seconds
- [Features](#features)
- [Metadata search](#metadata-search)
- [Tag search](#tag-search)
- [Link search](#link-search)
- [CLI](#cli)
- [Caching the search index](#caching-the-search-index)
- [Without Cache](#without-cache)
@@ -88,6 +89,22 @@ Hashtags are #indexed so you can #query them
search-notes tags:something
```

### Link search

Outgoing links are indexed, so you can find "backlinks" (incoming links) by searching for what links to a file. This works for any link target, not just local documents; e.g., you can find every document that links to Wikipedia.

```md
# Find all documents that link to a page

search-notes -d test/notes "linksTo:SomePage"
```

```md
# See all links

search-notes -d test/notes "linksTo:\*"
```
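
The backlink lookup described above amounts to inverting each note's outgoing links. In the tool itself lunr handles this by indexing the `linksTo` field; the sketch below shows the same idea without lunr. The helper name `buildBacklinks`, the `{ file, linksTo }` document shape, and the `reading-list.md` entry are assumptions made for illustration, not part of search-notes.

```javascript
// Hypothetical sketch: invert outgoing links into a backlink lookup.
const path = require("path");

function buildBacklinks(documents) {
  const backlinks = new Map();
  for (const doc of documents) {
    for (const href of doc.linksTo || []) {
      // Leave absolute URLs alone; normalize "./note.md" to "note.md".
      const isUrl = /^[a-z][a-z0-9+.-]*:/i.test(href);
      const target = isUrl ? href : path.posix.normalize(href);
      if (!backlinks.has(target)) backlinks.set(target, []);
      backlinks.get(target).push(doc.file);
    }
  }
  return backlinks;
}

const documents = [
  { file: "prince-charlie.md", linksTo: ["./james-stuart.md"] },
  { file: "james-stuart.md", linksTo: ["./pretender.md"] },
  // Hypothetical note with an external link, for illustration only.
  { file: "reading-list.md", linksTo: ["https://en.wikipedia.org/wiki/Pretender"] },
];

const backlinks = buildBacklinks(documents);
console.log(backlinks.get("james-stuart.md")); // [ 'prince-charlie.md' ]
```

Keying the map by link target is what lets external URLs be looked up the same way as local files.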

## CLI

```
@@ -120,6 +137,7 @@ Examples:
search-notes "tags:stuart -france" negate term
search-notes "tags:stuart +france" boolean AND
search-notes "britain^2 france^1" boost term relevance
search-notes "linksTo:filename" incoming links
search-notes -w re-index folder and save cache to disk
search-notes -c index.json query specify index cache file
```
@@ -181,10 +199,15 @@ Search took 0.021347405 seconds

- [x] index/reindex command
- [x] search command
- [x] index links between notes (`"linksTo:somewhere.md linkedFrom:elsewhere.md"`)
- [ ] backlink visualizer (node graph)
- [ ] static output
- [ ] web page output
- [ ] filter-then-display
- [ ] use mermaid-cli
- [ ] index nested folders
- [ ] more sensible search defaults (see elasticlunr)
- [ ] more output formatting options
- [ ] index links between notes (`"linksTo:somewhere.md linkedFrom:elsewhere.md"`)
- [ ] print snippet of file around hits (like `grep -n`)
- [ ] add more remarkable plugins out of the box (LaTeX formula rendering, etc)
- [ ] extract core modules from CLI, to enable re-use
@@ -193,11 +216,9 @@ Search took 0.021347405 seconds
- [ ] incremental index update (not supported by lunr)
- [ ] command to set up git to treat index file as binary (see .gitattributes)
- [ ] webcomponent for embedding search in markdown
- [ ] add .searchignore file
- [ ] add json output mode
- [ ] numeric data types for metadata ("rating > 4") _hard_
- [ ] backlink visualizer (node graph)
- [ ] static output
- [ ] web page output
- [ ] filter-then-display
- [ ] package as a binary instead of nodejs library
- [ ] electron app that acts as a container for background processes and wraps CLI

70 changes: 50 additions & 20 deletions lib/create-index.js
@@ -1,9 +1,11 @@
const fs = require("fs");
const path = require("path");
const Remarkable = require("remarkable").Remarkable;
const meta = require("remarkable-meta");
const RemarkableMeta = require("remarkable-meta");
const lunr = require("lunr");
const debug = require("./debug");
const flatMap = require("lodash/fp/flatMap");
const flatten = require("lodash/fp/flatten");

// TODO: find a good way to compose processing pipeline for perf & maintainability
// File processing pipeline context object:
@@ -47,41 +49,67 @@ function createSearchIndex({ directory, filename, write = false }) {
function processFile({ directory, filename }) {
// Ignore non-markdown files
if (!filename.endsWith(".md")) return;
const f = path.resolve(directory, filename)
debug(f)
const f = path.resolve(directory, filename);
debug(f);
const raw = fs.readFileSync(f, "utf8");

// Parse content
const md = new Remarkable();
md.use(meta);
md.use(RemarkableMeta);
const html = md.render(raw);

// Extract YML frontmatter as structured data
const frontmatter = md.meta;
const meta = md.meta;

// Add hashtags to metadata using regex search in markdown body
// TODO: this might also catch #id in-page links, is that good?
// Dedupe and merge with tags field from frontmatter, if present
const tags = new Set(
(raw.match(/#\w+\b/gi) || []).map(match => {
(raw.match(/#\w+\b/gi) || []).map((match) => {
// take off the "#" symbol
return match.slice(1);
}),
);
// Dedupe and merge with tags field from frontmatter, if present
(frontmatter.tags || []).forEach(tag => tags.add(tag));
frontmatter.tags = [...tags]; // convert set to array
const fields = Object.keys(frontmatter);
// Merge yml frontmatter "tags" field with #hashtag list
if (typeof meta.tags === "string") {
meta.tags.split(",").forEach((tag) => tags.add(tag.trim())); // trim so "concept, definition" does not yield " definition"
} else if (Array.isArray(meta.tags)) {
meta.tags.forEach((tag) => tags.add(tag));
}
meta.tags = [...tags]; // convert set to array

const docNodes = md.parse(raw, {});

// Add outgoing links to metadata
const isLinkNode = (n) => n.type === "link_open";
const getLinks = (nodes) =>
nodes.reduce((result, node) => {
if (isLinkNode(node)) {
result.push(node.href);
} else if (node.children) {
const childNodes = getLinks(node.children);
if (childNodes) {
result = result.concat(childNodes);
}
}
return result;
}, []);
meta.linksTo = getLinks(docNodes);

// TODO: Add title/name/document to metadata using first h1 tag, yml key, or filename?

// SIDE-EFFECT: Update pipeline's field index
const fields = Object.keys(meta);
for (const field of fields) {
pipeline.metadataFields.add(field);
}

return { filename, html, fields, raw, frontmatter };
return { filename, html, fields, raw, frontmatter: meta };
}
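
The link extraction in `processFile` is a depth-first walk over the parser's token tree. The sketch below reproduces that walk on a hand-written token list, so it runs without Remarkable. The token shapes (inline tokens nested under `children`, `href` on `link_open`) are taken from how the diff reads them, and the name `collectLinks` is hypothetical.

```javascript
// Hypothetical token tree, shaped the way getLinks in the diff expects:
// block tokens carry inline tokens in `children`, and `link_open`
// tokens carry the target in `href`.
const tokens = [
  { type: "paragraph_open" },
  {
    type: "inline",
    children: [
      { type: "text", content: "elder son of " },
      { type: "link_open", href: "./james-stuart.md" },
      { type: "text", content: "James Francis Edward Stuart" },
      { type: "link_close" },
    ],
  },
  { type: "paragraph_close" },
];

// Depth-first walk that gathers every href in document order.
function collectLinks(nodes) {
  return nodes.flatMap((node) => {
    if (node.type === "link_open") return [node.href];
    return node.children ? collectLinks(node.children) : [];
  });
}

console.log(collectLinks(tokens)); // [ './james-stuart.md' ]
```

Using `flatMap` instead of `reduce` with `concat` is only a stylistic choice here; both return a flat array of hrefs.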

function indexDocuments() {
// Transform data to support query pattern we want
const documents = pipeline.parsedFiles.map(f => ({
const documents = pipeline.parsedFiles.map((f) => ({
...f.frontmatter,
file: f.filename,
body: f.raw,
@@ -91,30 +119,32 @@ function indexDocuments() {
// Create index
// TODO: maybe use elasticlunr instead of raw lunr
// elasticlunr allows incremental index updates and more elasticsearch-like userspace APIs
const idx = lunr(function() {
const index = lunr(function () {
// Create schema:
this.ref("id");
for (const field of pipeline.metadataFields) {
this.field(field);
}

documents.forEach(function(doc) {
documents.forEach(function (doc) {
this.add(doc);
}, this);
});

// TODO: Optimize index size https://github.com/olivernn/lunr.js/issues/316
const indexContents = JSON.stringify(idx)

const cacheFileContents = JSON.stringify({ index, documents });

// Save index to disk, if write option is enabled
if (pipeline.createCache && pipeline.cacheLocation) {
const f = path.resolve(pipeline.directory, pipeline.cacheLocation)
debug(f)
fs.writeFileSync(f, indexContents);
const f = path.resolve(pipeline.directory, pipeline.cacheLocation);
debug(f);
fs.writeFileSync(f, cacheFileContents);
}

return JSON.parse(indexContents); // serialize and de-serialize objects
// serialize and de-serialize objects to recast lunr objects
const normalObject = JSON.parse(cacheFileContents);
return normalObject;
}
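
The "serialize and de-serialize" step above relies on `JSON.stringify` calling an object's `toJSON()` method, so parsing the string back yields plain objects rather than class instances. A minimal sketch of that round trip, using a stand-in class (`FakeIndex` is hypothetical and only mimics an object that exposes `toJSON()`):

```javascript
// Stand-in for a class instance that knows how to serialize itself.
class FakeIndex {
  constructor(fields) {
    this.fields = fields;
  }
  // JSON.stringify calls this to obtain the serializable form.
  toJSON() {
    return { version: "sketch", fields: this.fields };
  }
}

const documents = [{ file: "pretender.md", tags: ["concept", "definition"] }];
const cacheFileContents = JSON.stringify({
  index: new FakeIndex(["tags", "linksTo"]),
  documents,
});

// Parsing the string back yields plain objects and arrays only.
const schema = JSON.parse(cacheFileContents);
console.log(schema.index instanceof FakeIndex); // false
console.log(schema.index.fields); // [ 'tags', 'linksTo' ]
```

Returning the parsed value means callers see the same shape whether the schema was just built or read from the cache file.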

module.exports = {
45 changes: 31 additions & 14 deletions lib/index.js
@@ -11,7 +11,7 @@ const argv = yargs
.command(
"$0 [query]",
"Search for notes using structured data and full text index, with fuzzy matching.",
yargs => {
(yargs) => {
yargs
.positional("query", {
type: "string",
@@ -27,6 +27,7 @@ const argv = yargs
.example('$0 "tags:stuart -france"', "negate term")
.example('$0 "tags:stuart +france"', "boolean AND")
.example('$0 "britain^2 france^1"', "boost term relevance")
.example('$0 "linksTo:filename"', "incoming links")
.example("$0 -w", "re-index folder and save cache to disk")
.example("$0 -c index.json query", "specify index cache file");
},
@@ -58,9 +59,12 @@ const argv = yargs
.version()
.help().argv;

function searchNotes({ query, directory, cache, writeCache, explain }) {
function getSearchSchema({
directory,
cache,
writeCache,
}) /*: {index, documents} */ {
let schema;
const time = process.hrtime();
if (cache && !writeCache) {
try {
// The way the default flags are set means we always load the cache file if it exists;
@@ -77,16 +81,29 @@ function searchNotes({ query, directory, cache, writeCache, explain }) {
} else if (cache && writeCache) {
// Re-index the directory and save the result to the cache location
debug("Creating search index cache...");
const timeStart = process.hrtime();
schema = createSearchIndex({ directory, write: true, filename: cache });
const diff = process.hrtime(time);
console.log(`Updated index file in ${(diff[0] * 1e9 + diff[1])/1e9} seconds`);
const timeDiff = process.hrtime(timeStart);
console.log(
`Updated index file in ${
(timeDiff[0] * 1e9 + timeDiff[1]) / 1e9
} seconds`,
);
}
const diff = process.hrtime(time);
return schema;
}

function searchNotes({ query, directory, cache, writeCache, explain }) {
const timeStart = process.hrtime();
const schema = getSearchSchema({ directory, cache, writeCache });
const timeDiff = process.hrtime(timeStart);
const idx = lunr.Index.load(schema.index);
const results = idx.search(query);
if (explain) {
console.log(`Search took ${(diff[0] * 1e9 + diff[1])/1e9} seconds`);
console.log(
`Search took ${(timeDiff[0] * 1e9 + timeDiff[1]) / 1e9} seconds`,
);
}
const idx = lunr.Index.load(schema);
const results = idx.search(query);
debug(results);
return results;
}
@@ -105,16 +122,16 @@ Search results shape:
function displayResults(searchResults) {
if (argv.explain) {
console.table(
searchResults.map(r => ({
searchResults.map((r) => ({
File: r.ref,
Score: r.score.toFixed(3),
Hits:
Object.entries(r.matchData.metadata).map(([key, val]) => `"${key}" (${Object.keys(val).join(', ')})`).join(', ')

Hits: Object.entries(r.matchData.metadata)
.map(([key, val]) => `"${key}" (${Object.keys(val).join(", ")})`)
.join(", "),
})),
);
} else {
searchResults.map(r => console.log(r.ref));
searchResults.map((r) => console.log(r.ref));
}
}
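
The timing arithmetic `(diff[0] * 1e9 + diff[1]) / 1e9` appears twice in this file; it converts the `[seconds, nanoseconds]` tuple from `process.hrtime()` into fractional seconds. A small sketch of that conversion as a reusable helper follows (the names `hrtimeToSeconds` and `timed` are hypothetical). Note that in the diff, `timeDiff` is taken before `lunr.Index.load` and `idx.search` run, so the reported "Search took" figure appears to cover loading the schema rather than the query itself.

```javascript
// Convert a [seconds, nanoseconds] tuple from process.hrtime() to seconds.
function hrtimeToSeconds([seconds, nanoseconds]) {
  return (seconds * 1e9 + nanoseconds) / 1e9;
}

// Hypothetical wrapper: run `fn` and report how long it took.
function timed(fn) {
  const start = process.hrtime();
  const result = fn();
  const elapsed = hrtimeToSeconds(process.hrtime(start));
  return { result, elapsed };
}

const { result, elapsed } = timed(() => [1, 2, 3].map((n) => n * 2));
console.log(result, `took ${elapsed} seconds`);
```

Wrapping the measured work in a callback makes it harder to stop the clock before the work being reported has actually run.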

1 change: 1 addition & 0 deletions test/notes/.searchignore
@@ -0,0 +1 @@
graph.md
18 changes: 18 additions & 0 deletions test/notes/graph.mmd
@@ -0,0 +1,18 @@
flowchart LR;
james-stuart.md -.- '#pretender';
prince-charlie.md --> james-stuart.md;
prince-charlie.md -.- '#stuart';
prince-charlie.md -.- '#royal';
prince-charlie.md -.- '#scotland';
prince-charlie.md -.- '#france';
prince-charlie.md -.- '#pretender';
queen-anne.md -.- '#stuart';
queen-anne.md -.- '#royal';
queen-anne.md -.- '#britain';
queen-anne.md -.- '#scotland';
subgraph Famous Scots
william-wallace.md([William Wallace]);
queen-anne.md([Queen Anne]);
james-stuart.md([James Stuart]);
prince-charlie.md([Bonnie Prince Charlie]);
end
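
A graph file like the one above could in principle be generated from the indexed documents. The commit does not show how `graph.mmd` was produced, so the following is only a sketch under that assumption: `toMermaid` is a hypothetical helper that copies the edge syntax used in the file (`-->` for links, `-.-` for tags) and has not been checked against mermaid-cli.

```javascript
// Hypothetical helper: render { file, linksTo, tags } documents as
// flowchart lines in the same style as graph.mmd.
function toMermaid(documents) {
  const lines = ["flowchart LR;"];
  for (const doc of documents) {
    for (const target of doc.linksTo || []) {
      lines.push(`  ${doc.file} --> ${target};`);
    }
    for (const tag of doc.tags || []) {
      lines.push(`  ${doc.file} -.- '#${tag}';`);
    }
  }
  return lines.join("\n");
}

const graph = toMermaid([
  { file: "prince-charlie.md", linksTo: ["james-stuart.md"], tags: ["stuart"] },
  { file: "james-stuart.md", linksTo: [], tags: ["pretender"] },
]);
console.log(graph);
```

The subgraph block with display names is left out of the sketch, since nothing in the indexed metadata shown here says how notes are grouped.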
8 changes: 8 additions & 0 deletions test/notes/james-stuart.md
@@ -0,0 +1,8 @@
---
name: James Francis Edward Stuart
rating: 2
---

James Francis Edward Stuart (10 June 1688 – 1 January 1766), nicknamed The Old _[Pretender](./pretender.md)_ by Whigs, was the son of King James II and VII of England, Scotland and Ireland, and his second wife, Mary of Modena. He was Prince of Wales from July 1688 until, just months after his birth, his Catholic father was deposed and exiled in the Glorious Revolution of 1688. James II's Protestant elder daughter (the prince's half-sister), Mary II, and her husband, William III, became co-monarchs and the Bill of Rights 1689 and Act of Settlement 1701 excluded Catholics from the English then, subsequently, the British throne.

#pretender
5 changes: 5 additions & 0 deletions test/notes/pretender.md
@@ -0,0 +1,5 @@
---
tags: concept, definition
---

A pretender is one who maintains or is able to maintain a claim that they are entitled to a position of honour or rank, which may be occupied by an incumbent (usually more recognised), or whose powers may currently be exercised by another person or authority. Most often, it refers to a former monarch, or descendant thereof, whose throne is occupied, claimed by a rival or has been abolished.
10 changes: 7 additions & 3 deletions test/notes/prince-charlie.md
@@ -1,10 +1,14 @@
---
name: Bonnie Prince Charlie
rating: 3
tags:
- monarch
- highlands
- pretender
---

# Prince Charles III
# Bonnie Prince Charlie

Charles Edward Louis John Casimir Sylvester Severino Maria Stuart (31 December 1720 – 31 January 1788) was the elder son of James Francis Edward Stuart, grandson of James II and VII, and the Stuart claimant to the throne of Great Britain after 1766 as "Charles III". During his lifetime, he was also known as "the Young Pretender" and "the Young Chevalier"; in popular memory, he is "Bonnie Prince Charlie". He is best remembered for his role in the 1745 rising; his defeat at Culloden in April 1746 effectively ended the Stuart cause, and subsequent attempts failed to materialise, such as a planned French invasion in 1759. His escape from Scotland after the uprising led to his portrayal as a romantic figure of heroic failure.
Charles Edward Louis John Casimir Sylvester Severino Maria Stuart (31 December 1720 – 31 January 1788) was the elder son of [James Francis Edward Stuart](./james-stuart.md), grandson of James II and VII, and the Stuart claimant to the throne of Great Britain after 1766 as "Charles III". During his lifetime, he was also known as "the Young Pretender" and "the Young Chevalier"; in popular memory, he is "Bonnie Prince Charlie". He is best remembered for his role in the 1745 rising; his defeat at Culloden in April 1746 effectively ended the Stuart cause, and subsequent attempts failed to materialise, such as a planned French invasion in 1759. His escape from Scotland after the uprising led to his portrayal as a romantic figure of heroic failure.

#stuart #royal #scotland #france
#stuart #royal #scotland #france
