documentloaders: implement simple client for apache tika #1002

Open
ricardomaraschini wants to merge 1 commit into tmc:main from ricardomaraschini:tika-client
@ricardomaraschini ricardomaraschini commented Sep 4, 2024

Implements a very simple client to use Apache Tika for document parsing. An example usage has been added:

package main

import (
        "context"
        "net/http"

        "github.com/tmc/langchaingo/documentloaders"
        "github.com/tmc/langchaingo/textsplitter"
)

// To run this example you need to run a Tika server and then set the address
// on the TikaURL constant. The easiest way of running a Tika server is by
// using Docker:
//
// $ docker run -d -p 9998:9998 apache/tika
//
// Tika will be listening on http://localhost:9998; you then just need to adjust
// the TikaURL constant.

const TikaURL = "http://localhost:9998"

func main() {
        resp, err := http.Get("https://www.golang-book.com/public/pdf/gobook.pdf")
        if err != nil {
                panic(err)
        }
        defer resp.Body.Close()

        splitter := textsplitter.NewRecursiveCharacter()
        tika := documentloaders.NewTika(TikaURL, resp.Body)
        docs, err := tika.LoadAndSplit(context.Background(), splitter)
        if err != nil {
                panic(err)
        }

        _ = docs
}
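
For readers curious what a client like this does on the wire, here is a minimal sketch, not the code from this PR. It assumes the Tika server's documented `PUT /tika` endpoint with an `Accept: text/plain` header, which returns the extracted text. The helper names `tikaEndpoint` and `extractText` are hypothetical and exist only for illustration. A stub server stands in for Tika in `main` so the sketch runs without Docker.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// tikaEndpoint joins a Tika server base URL with the /tika extraction path.
// Hypothetical helper, not part of langchaingo.
func tikaEndpoint(base string) string {
	return strings.TrimRight(base, "/") + "/tika"
}

// extractText is a hypothetical helper: it PUTs the raw document to the
// server and returns the plain-text body of the response.
func extractText(ctx context.Context, base string, doc io.Reader) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, tikaEndpoint(base), doc)
	if err != nil {
		return "", err
	}
	// Ask for plain text rather than the default markup output.
	req.Header.Set("Accept", "text/plain")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("tika: unexpected status %s", resp.Status)
	}
	text, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(text), nil
}

func main() {
	// Stub standing in for a Tika server (assumption for this sketch only):
	// it echoes the uploaded bytes back as if they were the extracted text.
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPut || r.URL.Path != "/tika" {
			http.NotFound(w, r)
			return
		}
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "text/plain")
		w.Write(body)
	}))
	defer stub.Close()

	text, err := extractText(context.Background(), stub.URL, strings.NewReader("hello tika"))
	if err != nil {
		panic(err)
	}
	fmt.Println(text)
}
```

To try it against a real server, pass the address of a running Tika instance (for example the `TikaURL` constant above) instead of `stub.URL`, and a PDF body instead of the string reader.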

Feel free to just close this PR if this isn't necessary here. I am using it so I decided to contribute it back, no harm done.

PR Checklist

  • Read the Contributing documentation.
  • Read the Code of conduct documentation.
  • Name your Pull Request title clearly, concisely, and prefixed with the name of the primarily affected package you changed according to Good commit messages (such as memory: add interfaces for X, Y or util: add whizzbang helpers).
  • Check that there isn't already a PR that solves the problem the same way to avoid creating a duplicate.
  • Provide a description in this PR that addresses what the PR is solving, or reference the issue that it solves (e.g. Fixes #123).
  • Describes the source of new concepts.
  • References existing implementations as appropriate.
  • Contains test coverage for new functions.
  • Passes all golangci-lint checks.

@ricardomaraschini (Author)

Yeah, lint is complaining. I will get back to this only if there is interest in this code.

implements a very simple client to use apache tika for document parsing.
@ricardomaraschini (Author)

> Yeah, lint is complaining. I will get back to this only if there is interest in this code.

Oh well, once in hell I may as well give Satan a hug. That was an easy fix.
