
Import/export API #482

Open
jroper opened this issue Nov 16, 2023 · 5 comments

Comments

jroper commented Nov 16, 2023

It would be great if Akka persistence r2dbc offered an import/export API. Getting data in or out of Akka persistence r2dbc currently requires understanding exactly how the data is stored, including concepts like slices, serialization manifests, etc. It would be more convenient if Akka persistence r2dbc provided an API that worked with streams of high level representations of events/snapshots/durable state.
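For illustration, a purely hypothetical shape of such a high level representation (none of these names exist in akka-persistence-r2dbc today; this just shows the kind of fields the API would surface instead of raw table rows):

```scala
// Hypothetical record type for an exported event, for illustration only.
// Field names are assumptions, not an existing akka-persistence-r2dbc API.
final case class ExportedEvent(
  persistenceId: String,
  seqNr: Long,
  slice: Int,                    // normally derived from the persistence id
  serializerId: Int,
  serializerManifest: String,
  payload: Array[Byte],          // serialized event bytes
  writeTimestamp: java.time.Instant)
```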

While INSERT/SELECT may be a convenient option to support for import/export operations, the most efficient way to get data into or out of postgres is the COPY TO/FROM statements (this is what pg_dump/pg_restore use). So the best way to support this, I believe, would be to consume/produce streams of CSV data rather than connecting to postgres directly. In a cloud scenario, a user could then use Alpakka to stream these files out to S3/GCS etc, and then use the RDS/Cloud SQL specific mechanisms for importing/exporting from those stores.
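For the cloud scenario, a rough sketch of the Alpakka side, assuming the Alpakka S3 connector and a CSV file already produced by the db-side COPY TO (bucket, key and file path are placeholders):

```scala
import java.nio.file.Paths

import akka.actor.ActorSystem
import akka.stream.alpakka.s3.scaladsl.S3
import akka.stream.scaladsl.FileIO

object ExportToS3 extends App {
  implicit val system: ActorSystem = ActorSystem("export")

  // Stream the CSV produced by `COPY ... TO ... WITH (FORMAT csv)` up to S3,
  // from where the RDS / Cloud SQL import tooling can pick it up.
  FileIO
    .fromPath(Paths.get("/tmp/event_journal.csv"))
    .runWith(S3.multipartUpload("my-backup-bucket", "exports/event_journal.csv"))
}
```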

My reason for suggesting CSV over the default text format that Postgres uses is that Google Cloud SQL only supports import from GCS using CSV; it doesn't appear to support the Postgres text (or binary) format. Supporting those other formats in addition may still be an option to consider. The Postgres binary format, while efficient, may be complex, though it should be well documented, as I believe it's essentially the same format Postgres speaks over the wire.

It could even have an option to output in pg_dump format, i.e. include schema creation statements (noting that index creation statements should always come after the data).

@patriknw (Member)

Sounds good to me. CSV and the pg_dump text format should be fairly similar to produce. Would we also support importing from the CSV format, or would import always be done with the db tooling? ("Google Cloud SQL only supports import from GCS using CSV")

There is already a migration tool: https://doc.akka.io/docs/akka-persistence-r2dbc/current/migration.html
It uses the ordinary eventsByPersistenceId query, so it can migrate from any Akka Persistence plugin. However, we might go more low-level here so that we don't have to deserialize the payloads. Anyway, it could be a starting point for inspiration. I think that kind of durable progress tracking is important so that it can resume in case of errors.
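For reference, a rough sketch of the kind of query the existing migration tool is built around (the read journal plugin id and persistence id below are placeholders):

```scala
import akka.actor.ActorSystem
import akka.persistence.query.PersistenceQuery
import akka.persistence.query.scaladsl.CurrentEventsByPersistenceIdQuery

object MigrationQuerySketch extends App {
  implicit val system: ActorSystem = ActorSystem("migration")

  // Read journal of the *source* plugin, whatever that happens to be.
  val readJournal = PersistenceQuery(system)
    .readJournalFor[CurrentEventsByPersistenceIdQuery]("jdbc-read-journal")

  readJournal
    .currentEventsByPersistenceId("some-persistence-id", 0L, Long.MaxValue)
    .runForeach(env => println(s"${env.persistenceId} seqNr=${env.sequenceNr}"))
}
```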

jroper (Author) commented Nov 16, 2023

The actual import/export to the database would always be done with DB tooling, and wouldn't involve the tooling I'm talking about providing here. This is about preparing the import/export from a higher level abstraction that we can offer to users, allowing them to migrate to/from anything, including non-Akka-Persistence stores.

Since there's no DB involved on our side (we're just writing to/from files), I don't think it's necessary for us to provide any resumption capabilities. The operation should be fairly stable, should be very fast (at least tens of MB per second), and will always be safe to restart from the beginning. The user's source/destination for their data might be a DB, but it would be up to them whether to implement some sort of resumption; we can't make any assumptions about their system since we have no idea what it is.

patriknw (Member) commented Nov 16, 2023

It would be something like this:

Export:

  • the db tool will export the db table to a CSV file, which is in the internal db format
  • that CSV is run through this tool to convert it to the public high level format (could be protobuf)

Import:

  • the public high level format is run through this tool to create a CSV file in the internal db format
  • the db tool will import the CSV

jroper (Author) commented Nov 16, 2023

Not quite. I don't expect that there will be a "public high level format" defined by this tool. What I do expect is that this tool will be a library that defines a case class representation of an event/snapshot/durable state, and a Flow that converts a stream of that case class to the CSV file and back.
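A minimal sketch of what that could look like, assuming Alpakka CSV for the framing (the record shape and the CSV column layout are made up for illustration and are not the real internal table format):

```scala
import akka.NotUsed
import akka.stream.alpakka.csv.scaladsl.{ CsvFormatting, CsvParsing }
import akka.stream.scaladsl.Flow
import akka.util.ByteString

// Illustrative record type, not the real schema.
final case class ExportedEvent(persistenceId: String, seqNr: Long, payload: Array[Byte])

object EventCsv {

  // records -> CSV bytes that db tooling can load with `COPY ... FROM ... WITH (FORMAT csv)`
  val toCsv: Flow[ExportedEvent, ByteString, NotUsed] =
    Flow[ExportedEvent]
      .map(e =>
        List(
          e.persistenceId,
          e.seqNr.toString,
          java.util.Base64.getEncoder.encodeToString(e.payload)))
      .via(CsvFormatting.format())

  // CSV bytes produced by `COPY ... TO` -> records
  val fromCsv: Flow[ByteString, ExportedEvent, NotUsed] =
    Flow[ByteString]
      .via(CsvParsing.lineScanner())
      .map(_.map(_.utf8String))
      .map(c => ExportedEvent(c(0), c(1).toLong, java.util.Base64.getDecoder.decode(c(2))))
}
```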

So, it will be up to the user what to do with that case class. They could convert it to their own protobuf/json/whatever schema, and read/write that out. Or, they could interface directly with the database they are importing the data from, or exporting the data to, and write it directly there.

So, this is what it will look like:

Export:

  • the db tool will export the db table to a CSV file, which is in the internal db format
  • the user will write their own tool that uses this library to consume that CSV file as a stream, which they can then export to wherever they want, such as their own protobuf/json format, or directly to a live running database

Import:

  • the user will write their own tool that reads data from wherever they want, such as their own protobuf/json format or a live running database, and that tool uses this library to output that data as CSV in the internal db format (see the sketch after this list)
  • the db tool will import the CSV
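Building on the hypothetical EventCsv.toCsv flow sketched a couple of comments up, the import direction could then look something like this (the file path and the single example record are placeholders for the user's own source of data):

```scala
import java.nio.file.Paths

import akka.actor.ActorSystem
import akka.stream.scaladsl.{ FileIO, Source }

object PrepareImport extends App {
  implicit val system: ActorSystem = ActorSystem("prepare-import")

  // In practice this Source would read from the user's own store/format.
  val records = Source.single(ExportedEvent("pid-1", 1L, "payload".getBytes("UTF-8")))

  records
    .via(EventCsv.toCsv) // hypothetical library-provided flow
    .runWith(FileIO.toPath(Paths.get("/tmp/event_journal.csv")))
  // ...and then the db tooling loads it with COPY ... FROM ... WITH (FORMAT csv).
}
```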

jroper (Author) commented Nov 16, 2023

The case classes that we use to represent the records could well be protobuf classes. This would give us a public high level format, while still giving users the ability to skip outputting that format and go directly to wherever they want, whether that be their own format or a live database.
