
ERROR: pgvecto.rs: IPC connection is closed unexpected (v0.2.1) #500

Open
nothing2obvi opened this issue Jun 11, 2024 · 5 comments

@nothing2obvi

nothing2obvi commented Jun 11, 2024

Hello. I was using the pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0 image as instructed by Immich, but I started getting this error:

2024-06-11 10:38:40.602 UTC [16293] ERROR:  pgvecto.rs: IPC connection is closed unexpected.
	ADVICE: The error is raisen by background worker errors. Please check the full PostgreSQL log to get more information. Please read `https://docs.pgvecto.rs/admin/configuration.html`.
2024-06-11 10:38:40.602 UTC [16293] STATEMENT:  
	          SELECT idx_status
	          FROM pg_vector_index_stat
	          WHERE indexname = $1

I switched to pgvecto-rs:pg14-v0.2.1, as seemed to be suggested in #409 and #376, but I still get the same error.

What can I do? Thanks.

@VoVAllen
Member

Can you try manually deleting all the files under pgdata/pgvecto_rs and reindexing all the vector columns?
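
For example, a rough sketch of what that could look like (the directory name varies by version, and the table, column, and index names below are placeholders rather than anything from this thread):

-- 1. With PostgreSQL stopped, remove the extension's on-disk index data, e.g. the
--    contents of pgdata/pg_vectors/indexes/ (or pgdata/pgvecto_rs/, depending on version).
-- 2. Start PostgreSQL again and rebuild each index that covers a vector column:
REINDEX INDEX my_embedding_idx;
-- or drop and recreate it:
DROP INDEX IF EXISTS my_embedding_idx;
CREATE INDEX my_embedding_idx ON my_table USING vectors (embedding vector_l2_ops);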

@nothing2obvi
Author

Sorry for the late reply. I don't have a pgdata/pgvecto_rs directory, but I do have pgdata/pgvectors. Is that what you mean?

I also don't understand what you mean by "reindex all the vector columns".

Thanks!

@cho-thinkfree-com

cho-thinkfree-com commented Aug 29, 2024

I am currently testing a PostgreSQL setup in a container (Kubernetes) with an 8-core, 32GB pod, where I have installed the pgvectors extension (image tag pg16-v0.3.0). I have 1,000 tables, each with both a dense-vector and a sparse-vector column, and each table holds approximately 30,000 rows. Additionally, each vector column has an index created using HNSW (Hierarchical Navigable Small World). Currently, the size of the /pgdata directory is around 710GB on AWS EBS (gp3, 1.5TB, 16,000 IOPS, 500MB/s throughput).
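
For reference, each table looks roughly like the sketch below (illustrative only, not the actual schema; names, dimensions, and operator classes are placeholders, assuming the pgvecto.rs vector/svector types and its TOML index options):

CREATE TABLE docs_0001 (
    id     bigserial PRIMARY KEY,
    dense  vector(1024),    -- dense embedding column
    sparse svector(30000)   -- sparse embedding column
);
-- one HNSW index per vector column:
CREATE INDEX ON docs_0001 USING vectors (dense vector_l2_ops)
    WITH (options = $$
[indexing.hnsw]
$$);
CREATE INDEX ON docs_0001 USING vectors (sparse svector_dot_ops)
    WITH (options = $$
[indexing.hnsw]
$$);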

While continuously inserting data into the tables, I noticed that although the PostgreSQL pod itself did not terminate, its logs indicated that it had shut down. It restarted shortly afterward, but for a long time, the following log messages were repeatedly output, causing the service that was inserting data to stop functioning:

LOG: Find directory "pg_vectors/indexes/0000000000000000000000000000000066cc2320a86d802700004000016383bd".

After this, a considerable number of connection errors occurred.

After waiting for a longer period, the log messages stating "LOG: Find directory pg_vectors/indexes/xxxxx" stopped appearing, and I confirmed that the service could resume inserting data again.

Below is a portion of the PostgreSQL logs:

2024-08-29 08:19:55.832 UTC [30530] LOG:  00000: checkpoint complete: wrote 71542 buffers (6.8%); 0 WAL file(s) added, 29 removed, 0 recycled; write=1725.803 s, sync=0.681 s, total=1734.645 s; sync files=176, longest=0.448 s, average=0.004 s; distance=476360 kB, estimate=500901 kB; lsn=CB/E87E6710, redo lsn=CB/992040D8
2024-08-29 08:19:55.832 UTC [30530] LOCATION:  LogCheckpointEnd, xlog.c:6329
2024-08-29 08:19:55.832 UTC [30530] LOG:  00000: checkpoint starting: wal time
2024-08-29 08:19:55.832 UTC [30530] LOCATION:  LogCheckpointStart, xlog.c:6240
2024-08-29 08:35:02.230 UTC [1] LOG:  00000: background worker "vectors" (PID 30589) was terminated by signal 9: Killed
2024-08-29 08:35:02.230 UTC [1] LOCATION:  LogChildExit, postmaster.c:3681
2024-08-29 08:35:02.230 UTC [1] LOG:  00000: terminating any other active server processes
2024-08-29 08:35:02.230 UTC [1] LOCATION:  HandleChildCrash, postmaster.c:3489
2024-08-29 08:35:02.283 UTC [42108] FATAL:  57P03: the database system is in recovery mode
2024-08-29 08:35:02.283 UTC [42108] LOCATION:  ProcessStartupPacket, postmaster.c:2356
2024-08-29 08:35:02.283 UTC [42109] FATAL:  57P03: the database system is in recovery mode
....
2024-08-29 08:35:02.440 UTC [42120] LOCATION:  ProcessStartupPacket, postmaster.c:2356
2024-08-29 08:35:02.440 UTC [42122] FATAL:  57P03: the database system is in recovery mode
2024-08-29 08:35:02.440 UTC [42122] LOCATION:  ProcessStartupPacket, postmaster.c:2356
2024-08-29 08:35:02.560 UTC [1] LOG:  00000: all server processes terminated; reinitializing
2024-08-29 08:35:02.560 UTC [1] LOCATION:  PostmasterStateMachine, postmaster.c:3949
2024-08-29 08:35:03.451 UTC [42123] LOG:  00000: database system was interrupted; last known up at 2024-08-29 08:19:48 UTC
2024-08-29 08:35:03.451 UTC [42123] LOCATION:  StartupXLOG, xlog.c:5111
2024-08-29 08:35:03.452 UTC [42126] FATAL:  57P03: the database system is in recovery mode
2024-08-29 08:35:03.452 UTC [42126] LOCATION:  ProcessStartupPacket, postmaster.c:2356
....
2024-08-29 08:35:05.384 UTC [42151] FATAL:  57P03: the database system is in recovery mode
2024-08-29 08:35:05.384 UTC [42151] LOCATION:  ProcessStartupPacket, postmaster.c:2356
2024-08-29 08:35:05.414 UTC [42123] LOG:  00000: database system was not properly shut down; automatic recovery in progress
2024-08-29 08:35:05.414 UTC [42123] LOCATION:  InitWalRecovery, xlogrecovery.c:924
2024-08-29 08:35:05.426 UTC [42152] FATAL:  57P03: the database system is in recovery mode
2024-08-29 08:35:05.426 UTC [42152] LOCATION:  ProcessStartupPacket, postmaster.c:2356
2024-08-29 08:35:05.428 UTC [42123] LOG:  00000: redo starts at CB/992040D8
2024-08-29 08:35:05.428 UTC [42123] LOCATION:  PerformWalRecovery, xlogrecovery.c:1687
2024-08-29 08:35:05.509 UTC [42153] FATAL:  57P03: the database system is not yet accepting connections
2024-08-29 08:35:05.509 UTC [42153] DETAIL:  Consistent recovery state has not been yet reached.
2024-08-29 08:35:05.509 UTC [42153] LOCATION:  ProcessStartupPacket, postmaster.c:2340
2024-08-29 08:35:05.518 UTC [42154] FATAL:  57P03: the database system is not yet accepting connections
2024-08-29 08:35:05.518 UTC [42154] DETAIL:  Consistent recovery state has not been yet reached.
....
2024-08-29 08:35:14.365 UTC [42174] LOCATION:  ProcessStartupPacket, postmaster.c:2340
2024-08-29 08:35:14.581 UTC [42175] FATAL:  57P03: the database system is not yet accepting connections
2024-08-29 08:35:14.581 UTC [42175] DETAIL:  Consistent recovery state has not been yet reached.
2024-08-29 08:35:14.581 UTC [42175] LOCATION:  ProcessStartupPacket, postmaster.c:2340
2024-08-29 08:35:15.430 UTC [42123] LOG:  00000: redo in progress, elapsed time: 10.00 s, current LSN: CB/F49E8628
2024-08-29 08:35:15.430 UTC [42123] LOCATION:  PerformWalRecovery, xlogrecovery.c:1701
2024-08-29 08:35:15.483 UTC [42176] FATAL:  57P03: the database system is not yet accepting connections
2024-08-29 08:35:15.483 UTC [42176] DETAIL:  Consistent recovery state has not been yet reached.
2024-08-29 08:35:15.483 UTC [42176] LOCATION:  ProcessStartupPacket, postmaster.c:2340
2024-08-29 08:35:18.265 UTC [42177] FATAL:  57P03: the database system is not yet accepting connections
2024-08-29 08:35:18.265 UTC [42177] DETAIL:  Consistent recovery state has not been yet reached.
...
2024-08-29 08:35:19.584 UTC [42181] FATAL:  57P03: the database system is not yet accepting connections
2024-08-29 08:35:19.584 UTC [42181] DETAIL:  Consistent recovery state has not been yet reached.
2024-08-29 08:35:19.584 UTC [42181] LOCATION:  ProcessStartupPacket, postmaster.c:2340
2024-08-29 08:35:20.171 UTC [42123] LOG:  00000: invalid record length at CC/26439E48: expected at least 24, got 0
2024-08-29 08:35:20.171 UTC [42123] LOCATION:  ReadRecord, xlogrecovery.c:3136
2024-08-29 08:35:20.171 UTC [42123] LOG:  00000: redo done at CC/26439DA0 system usage: CPU: user: 4.68 s, system: 2.01 s, elapsed: 14.74 s
2024-08-29 08:35:20.171 UTC [42123] LOCATION:  PerformWalRecovery, xlogrecovery.c:1824
2024-08-29 08:35:20.196 UTC [42124] LOG:  00000: checkpoint starting: end-of-recovery immediate wait
2024-08-29 08:35:20.196 UTC [42124] LOCATION:  LogCheckpointStart, xlog.c:6240
2024-08-29 08:35:20.485 UTC [42182] FATAL:  57P03: the database system is not yet accepting connections
2024-08-29 08:35:20.485 UTC [42182] DETAIL:  Consistent recovery state has not been yet reached.
2024-08-29 08:35:20.485 UTC [42182] LOCATION:  ProcessStartupPacket, postmaster.c:2340
2024-08-29 08:35:23.111 UTC [42124] LOG:  00000: checkpoint complete: wrote 228960 buffers (21.8%); 0 WAL file(s) added, 78 removed, 63 recycled; write=2.636 s, sync=0.009 s, total=2.917 s; sync files=155, longest=0.006 s, average=0.001 s; distance=2312407 kB, estimate=2312407 kB; lsn=CC/26439E48, redo lsn=CC/26439E48
2024-08-29 08:35:23.111 UTC [42124] LOCATION:  LogCheckpointEnd, xlog.c:6329
2024-08-29 08:35:23.195 UTC [1] LOG:  00000: database system is ready to accept connections
2024-08-29 08:35:23.195 UTC [1] LOCATION:  process_pm_child_exit, postmaster.c:3109
2024-08-29 08:35:23.208 [42185] LOG:  Find directory "pg_vectors/indexes/0000000000000000000000000000000066cc2320a86d802700004000016383bd".
2024-08-29 08:35:23.208 [42185] LOG:  Find directory "pg_vectors/indexes/0000000000000000000000000000000066cc2320a86d8027000040000162b615".
2024-08-29 08:35:23.208 [42185] LOG:  Find directory "pg_vectors/indexes/0000000000000000000000000000000066cc2320a86d802700004000006a3206".
2024-08-29 08:35:23.208 [42185] LOG:  Find directory "pg_vectors/indexes/0000000000000000000000000000000066cc2320a86d80270000400000007d93".
....
2024-08-29 08:35:23.254 [42185] LOG:  Find directory "pg_vectors/indexes/0000000000000000000000000000000066cc2320a86d8027000040000000ada1".
2024-08-29 08:35:23.257 [42185] LOG:  Find directory "pg_vectors/indexes/0000000000000000000000000000000066cc2320a86d80270000400001620253/segments/919e6167-a1a0-40b8-a16c-712feb957b2f".
2024-08-29 08:35:24.334 UTC [42191] ERROR:  XX000: pgvecto.rs: IPC connection is closed unexpectedly.
	ADVICE: Visit `https://github.com/tensorchord/pgvecto.rs/issues` for help.
2024-08-29 08:35:24.334 UTC [42191] LOCATION:  vectors::error::check_connection, error.rs:114
2024-08-29 08:35:24.334 UTC [42191] STATEMENT:  UPDATE vc SET searchable = false WHERE content_id = $1  AND to_be_deleted = false
2024-08-29 08:35:24.369 [42185] LOG:  Find directory "pg_vectors/indexes/0000000000000000000000000000000066cc2320a86d8027000040000160637b/segments/834decd0-ebc0-4d57-a59f-fda0011a7757".
2024-08-29 08:35:24.466 [42185] LOG:  Find directory "pg_vectors/indexes/0000000000000000000000000000000066cc2320a86d80270000400001608274/segments/c442fbad-03d9-4c1b-ba31-4f2752e6ec06".
....

postgresql.conf

include_if_exists '/var/lib/postgresql/data/pgdata/postgresql.conf'
shared_preload_libraries = 'pg_stat_statements, vectors'
search_path = '"$user", public, vectors'
max_connections = 200
autovacuum_naptime = '30s'
autovacuum_vacuum_threshold = 5000
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_threshold = 2500
autovacuum_analyze_scale_factor = 0.05
log_min_error_statement = error
log_error_verbosity = verbose
log_statement = 'none'
work_mem = '10MB'
shared_buffers = '8GB'
maintenance_work_mem = '2GB'
effective_cache_size = '16GB'
max_parallel_workers_per_gather = 2
max_worker_processes = 5
max_parallel_workers = 5

While I am proficient in software development, my knowledge of databases (PostgreSQL) is limited. I suspect that the issue could be due to attempting to insert such a large amount of data with too little CPU and memory.

However, is there any way to determine whether "there is a problem with PostgreSQL"? Alternatively, is there a way to know if pgvectors is in the process of indexing, has completed the indexing, and is now ready to accept more data?
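
(For reference, the only status source I'm aware of is the view queried in the error statement at the top of this issue; I'm not sure whether its idx_status column is enough to tell "still indexing" from "ready":)

SELECT indexname, idx_status
FROM pg_vector_index_stat;
-- or, for a single index (name is a placeholder):
SELECT idx_status
FROM pg_vector_index_stat
WHERE indexname = 'my_embedding_idx';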

Alternatively, is there a general guideline on the amount of CPU and RAM needed to handle the volume of data I mentioned? I would appreciate even a simple response like "It depends on the situation, but it seems too small" or "Generally, it should be sufficient."

Since I lack expertise, I plan to use Citus or StackGres for sharding to handle large volumes of data. However, if a single instance can manage sufficiently large data, I will only set up an HA (High Availability) configuration.

Thank you for reading my novice question.

@VoVAllen
Member

VoVAllen commented Sep 2, 2024

@cho-thinkfree-com Sorry for the problem. We haven't tested tables at such a large scale with indexes so far. The problem you met might be due to insufficient memory: 30M dense vectors + 30M sparse vectors need about 160GB for 1024-dim vectors, and the index needs some extra memory for its structures. In the latest version, it should be able to accept new data immediately after the CREATE INDEX command finishes.
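
(For scale, assuming single-precision floats: 30M * 1024 dimensions * 4 bytes is roughly 123 GB for the dense vectors alone, before the sparse vectors and the HNSW graph structures are counted, which is how an estimate on the order of 160 GB arises.)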

@cho-thinkfree-com

Thank you so very much for your kind and detailed response.

You mentioned that approximately 160GB is required to handle 1024-dimensional vectors when processing 30M dense vectors and 30M sparse vectors.

Could you please clarify whether this 160GB refers to memory (RAM) or disk size?

Also, does the 160GB estimate come from calculating the size of dense vectors (30M * 1024 dim * 4 bytes) + sparse vectors (...), or is there another method for this calculation?

Is it necessary to load all vectors into memory in order to create the index? (I apologize if this is a basic question, but I have limited knowledge of databases and indexing.)

Perhaps I'm attempting something unrealistic! :-)

Thank you once again for your kind response.
