[epic] MarkdownDB Index and Library v1

A database of markdown files so that you can quickly access the metadata and content you want.

- All metadata including frontmatter, links, tags, tasks etc
- Auto-reloading
- Super simple javascript API

Bonus

- Can generate SQLite so you get full SQL access (if you want)

Non-features

- Does not index the full-text content

Re Flowershow: Use this to replace contentlayer.dev.

See https://datahub.io/notes/markdowndb

## Acceptance aka Roadmap

- [x] POC covering basic extraction etc datopian/markdowndb#6 
- [x] #5 
- [ ] #2 - specifically parser plugins

## Feature list

### Marketing

- [x] #1 

## Features

**Index a folder of files** - create a "DB" index from a folder of markdown files (and other files including images)

- [x] Index a folder and get JS/TS objects
- [x] Index a folder and get json output
- [ ] #129
- [x] **Command line tool for indexing**: Create a markdowndb (index) on the command line
- [x] Index a folder and get SQLite

Extract structured data like:

- [x] **Frontmatter metadata**: Extract markdown frontmatter and add in a metadata field
- [x] **Tags**: Extracts tags in markdown pages
  - [x] Extract tags in frontmatter
  - [x] Extract tags in body like `#abc` #49 
- [x] **Links**: links between files like `[hello](abc.md)` or wikilink style `[[xyz]]` so we can compute backlinks or deadlinks etc (see #4)
- [x] #60
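
As a rough illustration of how backlinks and dead links could be derived from the extracted links, here is a minimal sketch. The `Link` shape and the function names are hypothetical, chosen for the example; they are not the MarkdownDB API.

```typescript
// Hypothetical shape: one record per link found in a file.
interface Link {
  from: string; // path of the file containing the link
  to: string; // path of the link target
}

// Invert the link list: for each target, collect the files that link to it.
function computeBacklinks(links: Link[]): Map<string, string[]> {
  const backlinks = new Map<string, string[]>();
  for (const { from, to } of links) {
    const sources = backlinks.get(to) ?? [];
    if (sources.indexOf(from) === -1) sources.push(from);
    backlinks.set(to, sources);
  }
  return backlinks;
}

// A dead link points at a path that is not in the index.
function findDeadLinks(links: Link[], indexedPaths: Set<string>): Link[] {
  return links.filter((link) => !indexedPaths.has(link.to));
}

// Made-up sample data.
const links: Link[] = [
  { from: "index.md", to: "abc.md" },
  { from: "blog/post.md", to: "abc.md" },
  { from: "index.md", to: "missing.md" },
];
const indexed = new Set(["index.md", "abc.md", "blog/post.md"]);

const backlinks = computeBacklinks(links);
const deadLinks = findDeadLinks(links, indexed);
```

Here `backlinks.get("abc.md")` lists both files that link to `abc.md`, and `deadLinks` holds the single link to `missing.md`.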

Data types, data enhancement and validation

- [x] **Computed fields**: add new metadata properties based on existing metadata e.g. a slug field computed from title field; or, adding a title based on the first h1 heading in a doc; or, a type field based on the folder of the file (e.g. these are blog posts). cf https://www.contentlayer.dev/docs/reference/source-files/define-document-type#computedfields. #54
- [x] **Data validation and Document Types**: validate metadata against a schema/type so that I know the data in the database is "valid" #55 
  - [ ] deal with casting types e.g. string, number so that we can query in useful ways e.g. find me all blog posts before date X
  - [ ] BYOT (bring your own types): I want to create my own types ... so that when I get an object out it is cast to the right TypeScript type
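
To make the computed fields idea concrete, here is a minimal sketch covering the three examples above (title from the first h1, slug from title, type from folder). The `FileRecord` shape and the `ComputedField` signature are assumptions for illustration, not the actual MarkdownDB interfaces.

```typescript
// Hypothetical file record; field names are assumptions for illustration.
interface FileRecord {
  path: string; // e.g. "blog/my-first-post.md"
  body: string; // raw markdown without frontmatter
  metadata: Record<string, unknown>; // parsed frontmatter
}

// A computed field reads the record and may add metadata properties to it.
type ComputedField = (file: FileRecord) => void;

// Title falls back to the first h1 heading when frontmatter has none.
const titleFromHeading: ComputedField = (file) => {
  if (typeof file.metadata.title === "string") return;
  const match = /^#\s+(.+)$/m.exec(file.body);
  if (match) file.metadata.title = match[1].trim();
};

// Slug derived from the title: lowercase, non-alphanumerics collapsed to "-".
const slugFromTitle: ComputedField = (file) => {
  const title = file.metadata.title;
  if (typeof title !== "string") return;
  file.metadata.slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
};

// Type derived from the folder the file lives in.
const typeFromFolder: ComputedField = (file) => {
  if (file.path.startsWith("blog/")) file.metadata.type = "blog";
};

// Fields run in order, so later ones can build on earlier ones.
function applyComputedFields(file: FileRecord, fields: ComputedField[]): FileRecord {
  for (const field of fields) field(file);
  return file;
}

const post = applyComputedFields(
  { path: "blog/my-first-post.md", body: "# My First Post\n\nHello.", metadata: {} },
  [titleFromHeading, slugFromTitle, typeFromFolder],
);
```

Order matters in this sketch: `slugFromTitle` only works after `titleFromHeading` has filled in a missing title.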

---

## Inbox

### Marketing

Sections on front page about major features

- [x] Have a section on front page about links feature
- [x] Have a section for tags
- [ ] etc

### 💤

- [ ] Refactor: improve our interfaces, do something similar to CachedMetadata and CachedFile
- [ ] #128

Misc

- [x] ➕ 2023-03-15 Add `layout` e.g. `layout: blog` as a rule in markdown db loading rather than in `getStaticPaths` for rendering blogs (follow up to work in datopian/datahub-next#51) **⛔2023-03-17 on having markdowndb support for rules**

### Rufus random notes

- how can we get type stuff like contentlayer has e.g. a given type in markdown frontmatter leads to use of X typescript type/interface
- check out astro-build - how do they do type stuff?

## Notes

### Questions

- [x] What is a ContentBase / ContentDB? **✅2023-03-07 a database (index) of content e.g. of text files on disk, images etc. DB need not store content of files but it "indexes" them i.e. has a list of them, with associated metadata etc.**
- [x] Why do we need one? **✅2023-03-07 a) to replace this (basic) functionality in ContentLayer.dev so we can replace ContentLayer.dev b) so we can do richer things like get files with all tags etc**
  - [x] What contentlayer.dev API calls do we need to replace **✅2023-03-07 ~8 of them. quite simple. see below.**
- [ ] What is the difference between a Content Layer (API) and a ContentBase
- [x] What are the key technical components of a ContentBase **✅2023-03-07 see diagram**
- [x] What is MarkdownDB? **✅2023-03-07 It is a ContentBase whose text files are in markdown format**
- [x] What information do we index about markdown files in ContentBase? **✅2023-03-07**
  - frontmatter
  - list of all blocks and their types?
  - tags?
- [ ] What is the unique identifier for files?
- [x] What are the job stories that the MarkdownDB needs to support? 🔥
- [x] What about assets other than markdown files? e.g. images and pngs? **✅2023-03-07 these should also get processed.**
- [x] Does something like this already exist and how does it work?
  - [x] How does https://github.com/simonw/markdown-to-sqlite do it? **🚧2023-03-07 see https://github.com/simonw/markdown-to-sqlite/blob/main/markdown_to_sqlite/cli.py - main work is sqlite_utils i suspect**
- [x] How big will the sqlite db get? (i.e. per 1k documents indexed) NB: we aren't storing the text ... (though perhaps we could ...) **🚧2023-03-07 guess metadata is ~1KB per file. so 1k files = 1MB and 100k files = 100MB so seems ok for memory**
- [x] What happens if the sqlite file gets really big? **✅2023-03-07 we'd probably have to store it somewhere in cloud etc**
- [x] What DB should we use e.g. IndexedDB or sqlite? **✅2023-03-07 propose sqlite3 b/c you get sql etc and now pretty much supported in browser if we ever need that**
  - See https://github.com/blacksmithgu/datacore/issues/6 for interesting discussion on this.
- [x] How do we handle the indexing of remote files, such as files in GitHub repos? **✅2023-03-07 ❌ kind of invalid question. we can index the remote files easily and then cache that locally. We aren't indexing on the fly.**
  - [ ] ~~Do we just store a reference to that file?~~
- [x] What's a minimal viable API? **🚧2023-03-08 see section below**

### Notes on obsidian dataview API

https://github.com/blacksmithgu/obsidian-dataview/discussions/1811

### How to handle document types 2023-03-09

I'm not sure how we want to handle types. Having the type as a frontmatter field might not be ideal: if we had a blog folder, we'd have to add the type metadata to every file individually.

On `contentlayer.dev` it uses a `filePathPattern` for that:

```typescript
const Blog = defineDocumentType(() => ({
  name: "Blog",
  filePathPattern: `${siteConfig.blogDir}/!(index)*.md*`,
  contentType: "mdx",
  fields: {
  ...
```

I believe that's a good way of handling this. The caveat is that a file's path now determines its type, so folders with mixed types are impossible, although we could work around that with a pattern like `*.blog.md*`.

The use case I'm imagining is something like (there are probably better examples than blog):

```
blogs
  my-first-post.blog.mdx    // Blog type
  my-second-post.blog.mdx     // Blog type 
  index.mdx    // Generic page type 
  about-our-authors.mdx    // Generic page type
  write-for-us.contact.mdx    // Generic contact type                   
```
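
A minimal sketch of how such path-based typing could work for the folder above. The `TypeRule` shape, the `resolveType` helper and the regular expressions are assumptions for illustration; contentlayer itself uses glob patterns via `filePathPattern`.

```typescript
// Hypothetical rule: a document type plus a pattern tested against the file path.
interface TypeRule {
  type: string;
  pattern: RegExp;
}

// Rules are checked in order; the first match wins, so the generic rule goes last.
const rules: TypeRule[] = [
  { type: "Blog", pattern: /\.blog\.mdx?$/ },
  { type: "Contact", pattern: /\.contact\.mdx?$/ },
  { type: "Page", pattern: /\.mdx?$/ }, // fallback for any other markdown file
];

// Returns the type of the first matching rule, or undefined for non-markdown files.
function resolveType(filePath: string, typeRules: TypeRule[]): string | undefined {
  return typeRules.find((rule) => rule.pattern.test(filePath))?.type;
}

const blogType = resolveType("blogs/my-first-post.blog.mdx", rules);
const pageType = resolveType("blogs/index.mdx", rules);
const contactType = resolveType("blogs/write-for-us.contact.mdx", rules);
const imageType = resolveType("blogs/logo.png", rules);
```

With suffix-based rules like these, one folder can hold mixed types, at the cost of encoding the type in every file name.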

### How could we index frontmatter into our db? 2023-03-09

My idea is to have another table for frontmatter, something like:

| file_id | field | value | (maybe) type: array or string |
| -------- | ----- | ------- | -------------------------------------- |
| d9fc09 | title | My new post | string |

`file_id` should be a foreign key pointing to `file._id`.

To increase performance, since we are going to have many more rows now, we can create a DB index on this table (on the `file_id` field).

If done this way we are going to be able to query mdx files using frontmatter fields. E.g. (may not be exactly this):

```typescript
// Proposed query shape (not a final API): filter by tag and by a frontmatter field
MyMdDb.query({ tags: ["economy"], frontmatter: { author: "João" } })
```
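
To illustrate the proposed table, here is a minimal in-memory sketch that flattens frontmatter into one row per field and then looks files up by a field value. The row shape follows the table above; JSON-encoding array values, the helper names, and the second sample file are assumptions made for the example.

```typescript
// One row per frontmatter field, mirroring the proposed table.
interface FrontmatterRow {
  file_id: string;
  field: string;
  value: string; // arrays are stored JSON-encoded (an assumption)
  type: "array" | "string";
}

// Flatten one file's frontmatter into rows.
function toRows(fileId: string, frontmatter: Record<string, unknown>): FrontmatterRow[] {
  return Object.keys(frontmatter).map((field): FrontmatterRow => {
    const raw = frontmatter[field];
    return Array.isArray(raw)
      ? { file_id: fileId, field, value: JSON.stringify(raw), type: "array" }
      : { file_id: fileId, field, value: String(raw), type: "string" };
  });
}

// Return ids of files whose (string) frontmatter field equals the given value.
function findFileIds(rows: FrontmatterRow[], field: string, value: string): string[] {
  return rows
    .filter((row) => row.field === field && row.type === "string" && row.value === value)
    .map((row) => row.file_id);
}

// Made-up sample data; "a1b2c3" and its fields are hypothetical.
const rows = [
  ...toRows("d9fc09", { title: "My new post", author: "João", tags: ["economy"] }),
  ...toRows("a1b2c3", { title: "Another post", author: "Ana" }),
];
const byAuthor = findFileIds(rows, "author", "João");
```

In SQLite the same lookup would be a `WHERE field = ? AND value = ?` query over this table, joined back to the files table through `file_id`.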
