Description
A database of markdown files so that you can quickly access the metadata and content you want.
- All metadata including frontmatter, links, tags, tasks etc
- Auto-reloading
- Super simple JavaScript API
Bonus
- Can generate SQLite so you get full SQL access (if you want)
Non-features
- Does not index the full-text content
Re Flowershow: Use this to replace contentlayer.dev.
See https://datahub.io/notes/markdowndb
Acceptance aka Roadmap
- POC covering basic extraction etc [epic] MarkdownDB v0.1 #6
- Research Obsidian dataview approach to a markdown db #5
- [epic] MarkdownDB plugin system #2 - specifically parser plugins
Feature list
Marketing
Features
Index a folder of files - create a "DB" index from a folder of markdown files (and other files including images)
- Index a folder and get JS/TS objects
- Index a folder and get json output
- Index multiple folders (with support for configuring, e.g. prefixing in some way, e.g. I have all my blog files in this separate folder over here) #129
- Command line tool for indexing: Create a markdowndb (index) on the command line
- Index a folder and get SQLite
Extract structured data like:
- Frontmatter metadata: Extract markdown frontmatter and add in a metadata field
- Tags: Extracts tags in markdown pages
- Extract tags in frontmatter
- Extract tags in body like `#abc`. Tags extraction from body #49
- Links: links between files like `[hello](abc.md)` or wikilink style `[[xyz]]` so we can compute backlinks or dead links etc (see [parse] Extract Links #4)
- Tasks: extract tasks like this `- [ ] this is a task` (See Obsidian Dataview) #60
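To make the extraction ideas above concrete, here is a minimal TypeScript sketch using regular expressions. The `Extracted` shape and the `extractFromBody` helper are hypothetical names for this example, not MarkdownDB's actual API. A real implementation would more likely walk a markdown AST, so that tags or links inside code blocks are not picked up.

```typescript
// Hypothetical result shape, for illustration only.
interface Task { description: string; checked: boolean }
interface Extracted { tags: string[]; links: string[]; tasks: Task[] }

// Collect one capture group from every match of a global regex.
function collect(re: RegExp, text: string, group: number): string[] {
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    out.push(m[group]);
  }
  return out;
}

function extractFromBody(body: string): Extracted {
  // Body tags such as "#abc": a "#" at the start of the text or after
  // whitespace, followed by a letter (so "# Heading" is not a tag).
  const tags = collect(/(?:^|\s)#([A-Za-z][\w\/-]*)/g, body, 1);

  // Markdown links such as [hello](abc.md); the link target is captured.
  // Note: image links like ![alt](x.png) are also matched by this sketch.
  const markdownLinks = collect(/\[[^\]]*\]\(([^)\s]+)\)/g, body, 1);

  // Wikilinks such as [[xyz]] or [[xyz|alias]]; the target is captured.
  const wikiLinks = collect(/\[\[([^\]|]+)(?:\|[^\]]*)?\]\]/g, body, 1);

  // Tasks such as "- [ ] this is a task" or "- [x] done".
  const tasks: Task[] = [];
  const taskRe = /^[ \t]*[-*] \[([ xX])\] (.+)$/gm;
  let m: RegExpExecArray | null;
  while ((m = taskRe.exec(body)) !== null) {
    tasks.push({ description: m[2], checked: m[1] !== " " });
  }

  return { tags, links: markdownLinks.concat(wikiLinks), tasks };
}
```

With the link targets collected per file, backlinks and dead links could then be computed by comparing each target against the list of indexed file paths.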
Data types, data enhancement and validation
- Computed fields: add new metadata properties based on existing metadata e.g. a slug field computed from title field; or, adding a title based on the first h1 heading in a doc; or, a type field based on the folder of the file (e.g. these are blog posts). cf https://www.contentlayer.dev/docs/reference/source-files/define-document-type#computedfields. Computed metadata fields #54
- Data validation and Document Types: validate metadata against a schema/type so that I know the data in the database is "valid" (Meta)Data Validation and Document Types #55
- Deal with casting types e.g. string, number so that we can query in useful ways e.g. find me all blog posts before date X
- BYOT (bring your own types): I want to create my own types ... so that when I get an object out it is cast to the right TypeScript type
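The computed-field and casting ideas above could look roughly like the following. This is only a sketch under assumptions: `FileInfo`, `ComputedField`, `applyComputedFields` and the individual field functions are hypothetical names invented for this example, not an existing MarkdownDB interface.

```typescript
// Hypothetical types for illustration; not an existing MarkdownDB interface.
type Metadata = Record<string, unknown>;
interface FileInfo { path: string; body: string; metadata: Metadata }
type ComputedField = (file: FileInfo) => void;

function slugify(title: string): string {
  return title
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, "-") // runs of non-alphanumerics become one dash
    .replace(/^-+|-+$/g, "");    // strip leading and trailing dashes
}

// Add a title from the first h1 heading when frontmatter has none.
const titleFromFirstHeading: ComputedField = (file) => {
  if (typeof file.metadata["title"] === "string") return;
  const m = /^# +(.+)$/m.exec(file.body);
  if (m) file.metadata["title"] = m[1].trim();
};

// Add a slug computed from the title.
const slugFromTitle: ComputedField = (file) => {
  const title = file.metadata["title"];
  if (typeof title === "string" && file.metadata["slug"] === undefined) {
    file.metadata["slug"] = slugify(title);
  }
};

// Add a type based on the top-level folder of the file.
const typeFromFolder: ComputedField = (file) => {
  const parts = file.path.split("/");
  if (parts.length > 1 && file.metadata["type"] === undefined) {
    file.metadata["type"] = parts[0];
  }
};

// Cast a string date to a Date so "before date X" queries can compare values.
const castDate: ComputedField = (file) => {
  const raw = file.metadata["date"];
  if (typeof raw === "string") {
    const parsed = new Date(raw);
    if (!isNaN(parsed.getTime())) file.metadata["date"] = parsed;
  }
};

// Fields run in order, so a field may depend on one computed before it.
function applyComputedFields(file: FileInfo, fields: ComputedField[]): FileInfo {
  fields.forEach((field) => field(file));
  return file;
}
```

In this sketch the order of the list matters: `titleFromFirstHeading` has to run before `slugFromTitle` for the slug to be derived from a heading-based title.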
Inbox
Marketing
Sections on front page about major features
- Have a section on front page about links feature
- Have a section for tags
- etc
💤
- Refactor: improve our interfaces, do something similar to CachedMetadata and CachedFile
- "multi-thread" support for fast indexing #128
Misc
- ➕ 2023-03-15 Add `layout` e.g. `layout: blog` as a rule in markdown db loading rather than in `getStaticPaths` for rendering blogs (follow up to work in datopian/datahub-next#51) ⛔2023-03-17 on having markdowndb support for rules
Rufus random notes
- how can we get type stuff like contentlayer has e.g. a given type in markdown frontmatter leads to use of X typescript type/interface
- check out astro-build - how do they do type stuff?
Notes
Questions
- What is a ContentBase / ContentDB? ✅2023-03-07 a database (index) of content e.g. of text files on disk, images etc. DB need not store content of files but it "indexes" them i.e. has a list of them, with associated metadata etc.
- Why do we need one? ✅2023-03-07 a) to replace this (basic) functionality in ContentLayer.dev so we can replace ContentLayer.dev b) so we can do richer things like get files with all tags etc
- What contentlayer.dev API calls do we need to replace? **✅2023-03-07 ~8 of them. quite simple. see below.**
- What is the difference between a Content Layer (API) and a ContentBase?
- What are the key technical components of a ContentBase ✅2023-03-07 see diagram
- What is MarkdownDB? ✅2023-03-07 It is a ContentBase whose text files are in markdown format
- What information do we index about markdown files in ContentBase? ✅2023-03-07
- frontmatter
- list of all blocks and their types?
- tags?
- What is the unique identifier for files?
- What are the job stories that the MarkdownDB needs to support? 🔥
- What about assets other than markdown files? e.g. images and pngs? ✅2023-03-07 these should also get processed.
- Does something like this already exist and how does it work?
- How does https://github.com/simonw/markdown-to-sqlite do it? 🚧2023-03-07 see https://github.com/simonw/markdown-to-sqlite/blob/main/markdown_to_sqlite/cli.py - main work is sqlite_utils i suspect
- How big will the sqlite db get? (i.e. per 1k documents indexed) NB: we aren't storing the text ... (though perhaps we could ...) 🚧2023-03-07 guess metadata is ~1 KB per file, so 1k files = 1 MB and 100k files = 100 MB, so seems ok for memory
- What happens if the sqlite file gets really big? ✅2023-03-07 we'd probably have to store it somewhere in cloud etc
- What DB should we use e.g. IndexedDB or sqlite? ✅2023-03-07 propose sqlite3 b/c you get sql etc and now pretty much supported in browser if we ever need that
- See Proposal: sqlite3-wasm as presistent storage blacksmithgu/datacore#6 for interesting discussion on this.
- How do we handle the indexing of remote files, such as files in GitHub repos? ✅2023-03-07 ❌ kind of invalid question. we can index the remote files easily and then cache that locally. We aren't indexing on the fly.
- Do we just store a reference to that file?
- What's a minimal viable API? 🚧2023-03-08 see section below
Notes on obsidian dataview API
blacksmithgu/obsidian-dataview#1811
How to handle document types 2023-03-09
I'm not sure how we want to handle types, since having it as a frontmatter field might not be the most ideal way because if we had a blog folder we'd have to add the type metadata to all the files individually.
On contentlayer.dev it uses a `filePathPattern` for that:
const Blog = defineDocumentType(() => ({
name: "Blog",
filePathPattern: `${siteConfig.blogDir}/!(index)*.md*`,
contentType: "mdx",
fields: {
...
I believe that's a good way of handling this. The caveat is that the path of a file is now determining its type and therefore folders with mixed types are impossible, although we could apply the pattern as something like `*.blog.md*`.
The use case I'm imagining is something like (there are probably better examples than blog):
blogs
my-first-post.blog.mdx // Blog type
my-second-post.blog.mdx // Blog type
index.mdx // Generic page type
about-our-authors.mdx // Generic page type
write-for-us.contact.mdx // Generic contact type
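A suffix-based rule like the one in this example could be resolved with a small function. The sketch below is an assumption about how that might work, not contentlayer's or MarkdownDB's behaviour: `documentTypeFor`, the `knownTypes` list and the `"page"` fallback are all hypothetical.

```typescript
// Resolve a document type from a file name of the form "<name>.<type>.md(x)".
// Files without a recognised type segment fall back to a generic type.
function documentTypeFor(
  filePath: string,
  knownTypes: string[] = ["blog", "contact"], // hypothetical list of types
  fallback: string = "page"                   // hypothetical generic type
): string {
  const fileName = filePath.split("/").pop() || "";
  // Capture the segment between the last two dots before ".md" or ".mdx".
  const m = /^.+\.([a-z0-9-]+)\.mdx?$/i.exec(fileName);
  if (m === null) return fallback;
  const candidate = m[1].toLowerCase();
  // Only accept segments declared as types, so a file such as
  // "v1.2-notes.md" is not mistaken for a typed document.
  return knownTypes.indexOf(candidate) !== -1 ? candidate : fallback;
}
```

Checking against a declared list of types keeps folders with mixed types possible while avoiding false positives from file names that happen to contain extra dots.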
How could we index frontmatter into our db? 2023-03-09
My idea is to have another table for frontmatter, something like:
file_id | field | value | (maybe) type: array or string |
---|---|---|---|
d9fc09 | title | My new post | string |
`file_id` should be a foreign key pointing to `file._id`.
To increase performance, since we are going to have many more rows now, we can create a DB index on this table (using the `file_id` field).
If done this way we will be able to query mdx files using frontmatter fields, e.g. (may not be exactly this):
MyMdDb.query({ tags: ['economy'], frontmatter: { author: 'João' } })
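The frontmatter table idea can be tried out in memory before committing to a schema. The sketch below is an assumption-laden illustration: `FrontmatterRow`, `toRows` and `findFileIds` are hypothetical names, array values are stored as one row per element (one possible reading of the "type" column), and a plain array stands in for the SQLite table and its index.

```typescript
// One row per (file, field, value), mirroring the proposed table columns.
type FrontmatterValue = string | string[];
interface FrontmatterRow {
  file_id: string;
  field: string;
  value: string;
  type: "string" | "array";
}

// Flatten one file's frontmatter into rows; arrays yield one row per element.
function toRows(
  fileId: string,
  frontmatter: Record<string, FrontmatterValue>
): FrontmatterRow[] {
  const rows: FrontmatterRow[] = [];
  Object.keys(frontmatter).forEach((field) => {
    const value = frontmatter[field];
    if (Array.isArray(value)) {
      value.forEach((item) => {
        rows.push({ file_id: fileId, field, value: item, type: "array" });
      });
    } else {
      rows.push({ file_id: fileId, field, value, type: "string" });
    }
  });
  return rows;
}

// Return ids of files that have a matching row for every requested field.
function findFileIds(
  rows: FrontmatterRow[],
  where: Record<string, string>
): string[] {
  const fields = Object.keys(where);
  const ids: string[] = [];
  rows.forEach((row) => {
    if (ids.indexOf(row.file_id) === -1) ids.push(row.file_id);
  });
  return ids.filter((id) =>
    fields.every((field) =>
      rows.some(
        (row) =>
          row.file_id === id && row.field === field && row.value === where[field]
      )
    )
  );
}
```

For example, `findFileIds(rows, { author: "João", tags: "economy" })` would return the ids of files whose frontmatter has that author and includes that tag. In a real database the scan in `findFileIds` would be replaced by a query against the table.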