[epic] MarkdownDB Index and Library v1 #3

Closed
@rufuspollock

A database of markdown files so that you can quickly access the metadata and content you want.

  • All metadata including frontmatter, links, tags, tasks etc
  • Auto-reloading
  • Super simple javascript API
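To make the metadata list above concrete, here is a rough sketch of what extracting frontmatter, links, tags and tasks from one markdown file could look like. Everything in it is an assumption for illustration: `FileMetadata`, `allMatches` and `extractMetadata` are hypothetical names, not the MarkdownDB API, and the frontmatter handling only covers flat `key: value` lines rather than full YAML.

```typescript
// Hypothetical shape of one index entry; all names here are assumptions.
interface FileMetadata {
  frontmatter: Record<string, string>;
  links: string[];
  tags: string[];
  tasks: { description: string; checked: boolean }[];
}

// Collect every match of a global (/g) regex.
// Each pattern used below consumes at least one character, so the loop terminates.
function allMatches(re: RegExp, text: string): RegExpExecArray[] {
  const out: RegExpExecArray[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) out.push(m);
  return out;
}

function extractMetadata(source: string): FileMetadata {
  const frontmatter: Record<string, string> = {};
  let body = source;

  // Frontmatter: a block fenced by `---` lines at the very top of the file.
  // Only flat `key: value` lines are handled; real YAML needs a proper parser.
  const fm = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?/.exec(source);
  if (fm) {
    body = source.slice(fm[0].length);
    for (const line of fm[1].split(/\r?\n/)) {
      const sep = line.indexOf(":");
      if (sep > 0) frontmatter[line.slice(0, sep).trim()] = line.slice(sep + 1).trim();
    }
  }

  // Inline links of the form [text](target).
  const links = allMatches(/\[[^\]]*\]\(([^)\s]+)\)/g, body).map((m) => m[1]);
  // Tags of the form #tag, at the start of the text or after whitespace.
  const tags = allMatches(/(?:^|\s)#([A-Za-z][\w-]*)/g, body).map((m) => m[1]);
  // Task list items: "- [ ] todo" or "- [x] done".
  const tasks = allMatches(/^\s*[-*] \[([ xX])\] (.+)$/gm, body).map((m) => ({
    description: m[2].trim(),
    checked: m[1].toLowerCase() === "x",
  }));

  return { frontmatter, links, tags, tasks };
}
```

Note that the body text itself is not kept in the result, in line with the "does not index the full-text content" non-feature.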

Bonus

  • Can generate sqlite so you get full sql access (if you want)

Non-features

  • Does not index the full-text content

Re Flowershow: Use this to replace contentlayer.dev.

See https://datahub.io/notes/markdowndb

Acceptance aka Roadmap

Feature list

Marketing

Features

Index a folder of files - create a "DB" index from a folder of markdown files (and other files including images)

Extract structured data like:

Data types, data enhancement and validation


Inbox

Marketing

Sections on front page about major features

  • Have a section on front page about links feature
  • Have a section for tags
  • etc

💤

Misc

  • ➕ 2023-03-15 Add layout e.g. layout: blog as a rule in markdown db loading rather than in getStaticPaths for rendering blogs (follow up to work in datopian/datahub-next#51) ⛔2023-03-17 on having markdowndb support for rules
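One way the "layout as a rule at load time" idea could look, sketched under assumptions: `LoadedFile`, `LoadRule`, `applyRules` and `blogLayoutRule` are hypothetical names, and the `blog/` path prefix is only an example. Rules supply defaults, so a value set explicitly in a file's frontmatter still wins.

```typescript
// Hypothetical types: a file as seen during loading, and a rule that adds default metadata.
interface LoadedFile {
  path: string;
  metadata: Record<string, unknown>;
}

interface LoadRule {
  appliesTo: (file: LoadedFile) => boolean;
  defaults: Record<string, unknown>;
}

// Apply every matching rule; existing metadata takes precedence over rule defaults.
function applyRules(file: LoadedFile, rules: LoadRule[]): LoadedFile {
  let metadata = file.metadata;
  for (const rule of rules) {
    if (rule.appliesTo(file)) metadata = { ...rule.defaults, ...metadata };
  }
  return { ...file, metadata };
}

// Example rule (assumed folder name): everything under blog/ gets layout: blog.
const blogLayoutRule: LoadRule = {
  appliesTo: (file) => file.path.indexOf("blog/") === 0,
  defaults: { layout: "blog" },
};
```

With this shape the rendering code (e.g. getStaticPaths) would only read `layout` from the index instead of deciding it itself.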

Rufus random notes

  • how can we get type stuff like contentlayer has e.g. a given type in markdown frontmatter leads to use of X typescript type/interface
  • check out astro-build - how do they do type stuff?

Notes

Questions

  • What is a ContentBase / ContentDB? ✅2023-03-07 a database (index) of content e.g. of text files on disk, images etc. DB need not store content of files but it "indexes" them i.e. has a list of them, with associated metadata etc.
  • Why do we need one? ✅2023-03-07 a) to replace this (basic) functionality in ContentLayer.dev so we can replace ContentLayer.dev b) so we can do richer things like get files with all tags etc
    • What contentlayer.dev API calls do we need to replace? ✅2023-03-07 ~8 of them. Quite simple. See below.
  • What is the difference between a Content Layer (API) and a ContentBase?
  • What are the key technical components of a ContentBase ✅2023-03-07 see diagram
  • What is MarkdownDB? ✅2023-03-07 It is a ContentBase whose text files are in markdown format
  • What information do we index about markdown files in ContentBase? ✅2023-03-07
    • frontmatter
    • list of all blocks and their types?
    • tags?
  • What is the unique identifier for files?
  • What are the job stories that the MarkdownDB needs to support? 🔥
  • What about assets other than markdown files? e.g. images and pngs? ✅2023-03-07 these should also get processed.
  • Does something like this already exist and how does it work?
  • How big will the sqlite db get? (i.e. per 1k documents indexed) NB: we aren't storing the text ... (though perhaps we could ...) 🚧2023-03-07 guess metadata is ~1 KB per file, so 1k files = 1 MB and 100k files = 100 MB, so seems ok for memory
  • What happens if the sqlite file gets really big? ✅2023-03-07 we'd probably have to store it somewhere in the cloud etc
  • What DB should we use e.g. IndexedDB or sqlite? ✅2023-03-07 propose sqlite3 b/c you get sql etc and now pretty much supported in browser if we ever need that
  • How do we handle the indexing of remote files, such as files in GitHub repos? ✅2023-03-07 ❌ kind of invalid question. we can index the remote files easily and then cache that locally. We aren't indexing on the fly.
    • Do we just store a reference to that file?
  • What's a minimal viable API? 🚧2023-03-08 see section below
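The sqlite size guess in the list above is simple arithmetic, sketched here so it can be re-run with other numbers. `BYTES_PER_FILE` and `estimateIndexSizeMB` are hypothetical names, and the figure of roughly 1 KB of metadata per file (taken as 1,000 bytes) is the guess from the notes, not a measurement.

```typescript
// Assumption from the notes: roughly 1 KB (taken here as 1,000 bytes) of metadata per indexed file.
const BYTES_PER_FILE = 1000;

// Estimated index size in megabytes (1 MB = 1,000,000 bytes) for a given number of files.
function estimateIndexSizeMB(fileCount: number, bytesPerFile: number = BYTES_PER_FILE): number {
  return (fileCount * bytesPerFile) / 1000000;
}

console.log(estimateIndexSizeMB(1000)); // 1k files
console.log(estimateIndexSizeMB(100000)); // 100k files
```

If the text content were stored as well, `bytesPerFile` would need to be replaced with a measured average file size.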

Notes on obsidian dataview API

blacksmithgu/obsidian-dataview#1811

How to handle document types 2023-03-09

I'm not sure how we want to handle types. Having the type as a frontmatter field might not be ideal: if we had a blog folder, we'd have to add the type metadata to every file individually.

On contentlayer.dev it uses a filePathPattern for that:

const Blog = defineDocumentType(() => ({
  name: "Blog",
  filePathPattern: `${siteConfig.blogDir}/!(index)*.md*`,
  contentType: "mdx",
  fields: {
  ...

I believe that's a good way of handling this. The caveat is that the path of a file is now determining its type and therefore folders with mixed types are impossible, although we could apply the pattern as something like *.blog.md*.

The use case I'm imagining is something like (there are probably better examples than blog):

blogs
  my-first-post.blog.mdx    // Blog type
  my-second-post.blog.mdx     // Blog type 
  index.mdx    // Generic page type 
  about-our-authors.mdx    // Generic page type
  write-for-us.contact.mdx    // Generic contact type                   
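A minimal sketch of how a type could be read off the filename under the `*.blog.md*` convention above. `inferDocumentType` is a hypothetical helper, and using `"page"` as the fallback type is an assumption based on the "Generic page type" entries in the listing.

```typescript
// Hypothetical convention: "<name>.<type>.md" or "<name>.<type>.mdx" carries the type in the filename.
// Files without a type segment fall back to a generic type (assumed here to be "page").
function inferDocumentType(filePath: string, fallback: string = "page"): string {
  const fileName = filePath.split("/").pop() ?? filePath;
  const match = /^[^.]+\.([a-z0-9-]+)\.mdx?$/i.exec(fileName);
  return match ? match[1].toLowerCase() : fallback;
}

console.log(inferDocumentType("blogs/my-first-post.blog.mdx"));
console.log(inferDocumentType("blogs/index.mdx"));
console.log(inferDocumentType("blogs/write-for-us.contact.mdx"));
```

This keeps mixed-type folders possible, at the cost of putting the type into every filename.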

How could we index frontmatter into our db? 2023-03-09

My idea is to have another table for frontmatter, something like:

| file_id | field | value       | (maybe) type: array or string |
|---------|-------|-------------|-------------------------------|
| d9fc09  | title | My new post | string                        |

file_id should be a foreign key pointing to file._id.

To increase performance, since we are going to have many more rows now, we can create a DB index on this table (using the file_id field)
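Roughly, the table and index described above could look like the sketch below. The table name `frontmatter`, the index name, the column types, and the helper `toFrontmatterRows` are all assumptions; only `file_id`, `field`, `value`, `type` and the reference to `file._id` come from the notes. Array values are stored as JSON text here, which is one option among several.

```typescript
// Assumed SQLite schema for the frontmatter table (not executed here).
const FRONTMATTER_SCHEMA = `
CREATE TABLE frontmatter (
  file_id TEXT NOT NULL REFERENCES file(_id),
  field   TEXT NOT NULL,
  value   TEXT,
  type    TEXT NOT NULL
);
CREATE INDEX idx_frontmatter_file_id ON frontmatter(file_id);
`;

type FrontmatterValue = string | string[];

interface FrontmatterRow {
  file_id: string;
  field: string;
  value: string;
  type: "string" | "array";
}

// Flatten one file's frontmatter into one row per field.
function toFrontmatterRows(
  fileId: string,
  frontmatter: Record<string, FrontmatterValue>
): FrontmatterRow[] {
  return Object.keys(frontmatter).map((field): FrontmatterRow => {
    const raw = frontmatter[field];
    return Array.isArray(raw)
      ? { file_id: fileId, field, value: JSON.stringify(raw), type: "array" }
      : { file_id: fileId, field, value: raw, type: "string" };
  });
}

console.log(toFrontmatterRows("d9fc09", { title: "My new post" }));
```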

If done this way we are going to be able to query mdx files using frontmatter fields. E.g: (may not be exactly this)

MyMdDb.query({ tags: ['economy'], frontmatter: { author: 'João' } })
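To show what such a query might mean, here is an in-memory sketch of the matching logic. `IndexedFile`, `Query` and `queryFiles` are hypothetical names and this is not the MarkdownDB API; a real implementation would presumably translate the same conditions into SQL against the frontmatter table. A file matches when it has every requested tag and every requested frontmatter field has exactly the given value.

```typescript
// Hypothetical in-memory view of indexed files.
interface IndexedFile {
  _id: string;
  path: string;
  tags: string[];
  frontmatter: Record<string, string>;
}

interface Query {
  tags?: string[];
  frontmatter?: Record<string, string>;
}

// Return the files that carry all requested tags and match all requested frontmatter fields.
function queryFiles(files: IndexedFile[], q: Query): IndexedFile[] {
  const wantedTags = q.tags ?? [];
  const wantedFields = q.frontmatter ?? {};
  return files.filter(
    (file) =>
      wantedTags.every((tag) => file.tags.indexOf(tag) !== -1) &&
      Object.keys(wantedFields).every((key) => file.frontmatter[key] === wantedFields[key])
  );
}

const exampleFiles: IndexedFile[] = [
  { _id: "d9fc09", path: "blog/my-new-post.md", tags: ["economy"], frontmatter: { author: "João" } },
  { _id: "a1b2c3", path: "blog/other.md", tags: ["economy"], frontmatter: { author: "Ana" } },
];

console.log(queryFiles(exampleFiles, { tags: ["economy"], frontmatter: { author: "João" } }));
```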
