Commit 1bc472f

Ingestion service documentation complete

1 parent bcd7d00 commit 1bc472f

File tree

2 files changed: +137 -9 lines changed

docs/brainkbui.md (+2 -1)

@@ -1,5 +1,6 @@
 # User Interface Overview
-
+This section offers a detailed overview of the UI, including its layout, design elements, and functionality. It provides insights into how users can navigate the interface, interact with various components, and utilize its features effectively to achieve their goals.
+Additionally, it highlights key elements that enhance user experience, such as responsiveness, accessibility, and ease of use.
 ## Overview
 
 The BrainKB UI (user interface), accessible at [beta.brainkb.org](https://beta.brainkb.org), is a user-centric interface designed to interact with the BrainKB knowledge graph infrastructure. It enables neuroscientists, researchers, and practitioners to explore, search, analyze, and visualize neuroscience knowledge effectively. The platform integrates a range of tools and features that facilitate evidence-based decision-making, making it an essential resource for advancing neuroscience research.

docs/ingestion_service.md (+135 -8)
@@ -1,5 +1,6 @@
 # Ingestion service
-
+This section provides information about the ingestion service, one of the service components of BrainKB.
+## Overview
 {numref}`brainkb_intestion_architecture_figure` illustrates the architecture of the ingestion service, which follows the producer-consumer pattern and leverages RabbitMQ for scalable data ingestion. The service is composed of two main components: (i) the producer and (ii) the consumer.
 The producer component exposes API endpoints (see {numref}`brainkb_ingestion_service_api_endpoints`) that allow clients or users to ingest data. Currently, it supports the ingestion of KGs represented in JSON-LD and Turtle formats. Users can ingest raw JSON-LD data as well as upload files, either individually or in batches. At present, the ingestion of other file types, such as PDF, text, and JSON, has been disabled due to the incomplete implementation of the required functionalities.
 The consumer retrieves ingested data from RabbitMQ, processes it, and forwards it to the query service via API endpoints. The query service then inserts the processed data into the graph database.
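For illustration, a client-side ingestion request might be constructed as follows. This is a sketch only: the endpoint URL and payload are hypothetical, and the actual routes are those listed in the API-endpoints table referenced above.

```python
import urllib.request

# Hypothetical producer endpoint; the real routes are listed in the
# API-endpoints table ({numref}`brainkb_ingestion_service_api_endpoints`).
ENDPOINT = "https://ingest.example.org/api/ingest-ttl"

ttl_payload = "@prefix ex: <http://example.org/> .\nex:a ex:b ex:c ."

request = urllib.request.Request(
    ENDPOINT,
    data=ttl_payload.encode("utf-8"),
    headers={"Content-Type": "text/turtle"},
    method="POST",
)
# urllib.request.urlopen(request) would submit the data; omitted here
# because the endpoint above is illustrative.
```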
@@ -16,10 +17,13 @@ Currently Enabled API Endpoints.
 ```
 
 
+## Sequence Diagram
 
+This sequence diagram illustrates the data ingestion pipeline, showing how a client submits data that is subsequently validated, processed, and stored in the graph database.
+
+### **Producer Workflow Overview**
+The diagram below highlights the producer in this pipeline; each step is described below.
 
-## Sequence Diagram
-
 ```{mermaid}
 sequenceDiagram
 %% Client/User on the left
@@ -29,12 +33,12 @@ sequenceDiagram
 
 box Thistle Producer
 participant API as Producer API
-participant Validator as Shared.py
+participant Validator as Shared
 participant Publisher as RabbitMQ Publisher
 end
 
 box LightGoldenRodYellow RabbitMQ
 participant RabbitMQ as RabbitMQ Queue
 end
 
 %% Client submits data
@@ -60,6 +64,27 @@ sequenceDiagram
 deactivate API
 ```

#### **Receiving Data from the Client**
- The **client** initiates the ingestion process by submitting a `POST` request to the **Producer API**.
- The request contains structured data, typically in formats like **JSON, JSON-LD, or TTL (Turtle)**.

_Note:_ Support for additional formats, such as PDF and text, will be enabled once the necessary functionalities are fully developed (see {ref}`table_sourcecodes`) and integrated.

#### **Validation & Preprocessing**
- The **Producer API** passes the received data to the **Shared** component (`shared.py`, which implements shared functionality), which performs essential validation checks:
  - Ensuring that the **named graph** exists in the database and that the ingested data is in the correct format, e.g., valid JSON-LD.
  - Note that to proceed with ingestion, the client must either register a new named graph IRI (using the query service API endpoint) or select an existing one. This approach enables versioning, ensuring efficient data management and traceability.
- If validation **fails**, the system returns a `400 Bad Request` to the client.
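The validation step above can be sketched roughly as follows. This is illustrative only: the function and error messages are hypothetical, and a real implementation would use an RDF library rather than these minimal checks.

```python
import json

def validate_payload(payload: str, content_type: str,
                     known_graphs: set[str], named_graph_iri: str) -> tuple[bool, str]:
    """Rough stand-in for the validation checks described above."""
    # The target named graph must already be registered via the query service.
    if named_graph_iri not in known_graphs:
        return False, "unknown named graph IRI"  # would surface as 400 Bad Request
    if content_type == "application/ld+json":
        try:
            doc = json.loads(payload)
        except json.JSONDecodeError:
            return False, "payload is not valid JSON"  # 400 Bad Request
        if not isinstance(doc, dict) or "@context" not in doc:
            return False, "JSON-LD payload is missing @context"
    return True, "ok"
```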

#### **Publishing Data to RabbitMQ**
- Once validated, the **Producer RabbitMQ Publisher** formats the data for ingestion.
- The formatted data is published to **RabbitMQ**, which acts as a message broker to decouple producers and consumers.
- A successful message publication triggers a **publish confirmation**, which is sent back to the **Producer API**.
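A minimal sketch of the publish-with-confirmation pattern, using an in-memory queue as a stand-in for RabbitMQ (in the actual service a RabbitMQ client library with publisher confirms would be used; all names here are illustrative):

```python
import json
import queue

broker = queue.Queue()  # in-memory stand-in for a RabbitMQ queue

def publish(named_graph_iri: str, payload: str) -> bool:
    """Format validated data as a message, publish it, and return the
    publish confirmation that is relayed back to the Producer API."""
    message = json.dumps({"named_graph": named_graph_iri, "data": payload})
    broker.put(message)  # hand the message to the broker
    return True          # the broker accepted the message

confirmed = publish("https://example.org/graph/v1",
                    "@prefix ex: <http://example.org/> .")
```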

### **Consumer Workflow Overview**
The diagram below highlights the consumer in this pipeline; each step is described below.
 ```{mermaid}
 sequenceDiagram
 %% Client/User on the left
@@ -70,15 +95,15 @@ sequenceDiagram
 
 box HoneyDew Consumer
 participant Consumer as Listener
-participant Processor as Shared.py
+participant Processor as Shared
 end
 
 box AliceBlue Query Service
 participant QueryService as Query Service
 end
 
 box Wheat Graph Database
-participant GraphDB as Graph Database
+participant GraphDB as Oxigraph
 end
 
@@ -113,4 +138,106 @@ sequenceDiagram
 deactivate Processor
 Consumer-->>RabbitMQ: 6. Acknowledge message
 deactivate Consumer
 ```

#### **Message Consumption from RabbitMQ**
- The **RabbitMQ Queue** holds messages published by the **Producer**.
- The **Consumer Listener** picks up an available message from the RabbitMQ queue.

#### **Adding Provenance Metadata**
- The **Consumer** forwards the message to the **Shared** component (`shared.py`, which implements shared functionality) for further handling, e.g., adding provenance information.
150+
- Example: Consider the following ingested data in TTL representation.

```turtle
@prefix NCBIAssembly: <https://www.ncbi.nlm.nih.gov/assembly/> .
@prefix NCBIGene: <http://identifiers.org/ncbigene/> .
@prefix bican: <https://identifiers.org/brain-bican/vocab/> .
@prefix biolink: <https://w3id.org/biolink/vocab/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema1: <http://schema.org/> .

bican:000015fd3d6a449b47e75651210a6cc74fca918255232c8af9e46d077034c84d a bican:GeneAnnotation ;
    rdfs:label "LOC106504536" ;
    schema1:identifier "106504536" ;
    bican:molecular_type "protein_coding" ;
    bican:referenced_in bican:d5c45501b3b8e5d8b5b5ba0f4d72750d8548515c1b00c23473a03a213f15360a ;
    biolink:category bican:GeneAnnotation ;
    biolink:in_taxon bican:7d54dfcbd21418ea26d9bfd51015414b6ad1d3760d09672afc2e1e4e6c7da1dd ;
    biolink:in_taxon_label "Sus scrofa" ;
    biolink:symbol "LOC106504536" ;
    biolink:xref NCBIGene:106504536 .
```
A new property, `prov:wasInformedBy`, is added to the initial TTL data, linking it to the triples that carry the provenance information, as illustrated below. Note that provenance is attached to all ingested triples: for example, if you upload a TTL file containing 30 triples, all 30 will have provenance information attached.

```turtle
<https://identifiers.org/brain-bican/vocab/ingestionActivity/e4db1e0b-98ff-497c-88b1-afb4a6d7ee14> a prov:Activity,
        bican:IngestionActivity ;
    prov:generatedAtTime "2025-01-31T16:52:22.061674+00:00"^^xsd:dateTime ;
    prov:wasAssociatedWith bican:000015fd3d6a449b47e75651210a6cc74fca918255232c8af9e46d077034c84d,
        bican:00027255beed5c235eaedf534ac72ffc67ed597821a5b5c0f35709d5eb93bd47,
        <https://identifiers.org/brain-bican/vocab/agent/testuser> .

<https://identifiers.org/brain-bican/vocab/provenance/e4db1e0b-98ff-497c-88b1-afb4a6d7ee14> a prov:Entity ;
    dcterms:provenance "Data posted by testuser on 2025-01-31T16:52:22.061674Z" ;
    prov:generatedAtTime "2025-01-31T16:52:22.061674+00:00"^^xsd:dateTime ;
    prov:wasAttributedTo <https://identifiers.org/brain-bican/vocab/agent/testuser> ;
    prov:wasGeneratedBy <https://identifiers.org/brain-bican/vocab/ingestionActivity/e4db1e0b-98ff-497c-88b1-afb4a6d7ee14> .
```
Final data after adding provenance information:

```turtle
@prefix NCBIGene: <http://identifiers.org/ncbigene/> .
@prefix bican: <https://identifiers.org/brain-bican/vocab/> .
@prefix biolink: <https://w3id.org/biolink/vocab/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema1: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

bican:000015fd3d6a449b47e75651210a6cc74fca918255232c8af9e46d077034c84d a bican:GeneAnnotation ;
    rdfs:label "LOC106504536" ;
    schema1:identifier "106504536" ;
    prov:wasInformedBy <https://identifiers.org/brain-bican/vocab/provenance/e4db1e0b-98ff-497c-88b1-afb4a6d7ee14> ; # this links to the new provenance information
    bican:molecular_type "protein_coding" ;
    bican:referenced_in bican:d5c45501b3b8e5d8b5b5ba0f4d72750d8548515c1b00c23473a03a213f15360a ;
    biolink:category bican:GeneAnnotation ;
    biolink:in_taxon bican:7d54dfcbd21418ea26d9bfd51015414b6ad1d3760d09672afc2e1e4e6c7da1dd ;
    biolink:in_taxon_label "Sus scrofa" ;
    biolink:symbol "LOC106504536" ;
    biolink:xref NCBIGene:106504536 .

# Added new provenance information regarding the ingestion activity. The
# <https://identifiers.org/brain-bican/vocab/ingestionActivity/e4db1e0b-98ff-497c-88b1-afb4a6d7ee14> pattern may need to be updated; to be discussed and done later.
<https://identifiers.org/brain-bican/vocab/ingestionActivity/e4db1e0b-98ff-497c-88b1-afb4a6d7ee14> a prov:Activity,
        bican:IngestionActivity ;
    prov:generatedAtTime "2025-01-31T16:52:22.061674+00:00"^^xsd:dateTime ;
    prov:wasAssociatedWith bican:000015fd3d6a449b47e75651210a6cc74fca918255232c8af9e46d077034c84d,
        bican:00027255beed5c235eaedf534ac72ffc67ed597821a5b5c0f35709d5eb93bd47,
        <https://identifiers.org/brain-bican/vocab/agent/testuser> .

<https://identifiers.org/brain-bican/vocab/provenance/e4db1e0b-98ff-497c-88b1-afb4a6d7ee14> a prov:Entity ;
    dcterms:provenance "Data posted by testuser on 2025-01-31T16:52:22.061674Z" ;
    prov:generatedAtTime "2025-01-31T16:52:22.061674+00:00"^^xsd:dateTime ;
    prov:wasAttributedTo <https://identifiers.org/brain-bican/vocab/agent/testuser> ;
    prov:wasGeneratedBy <https://identifiers.org/brain-bican/vocab/ingestionActivity/e4db1e0b-98ff-497c-88b1-afb4a6d7ee14> .
```
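The provenance attachment shown above can be sketched roughly as follows. This is illustrative only (the actual `shared.py` implementation may differ); the helper name and its parameters are assumptions, and a fixed activity ID and timestamp are passed in for reproducibility.

```python
import uuid
from datetime import datetime, timezone

BICAN = "https://identifiers.org/brain-bican/vocab/"

def build_provenance_triples(subject_ids, user, activity_id=None, timestamp=None):
    """Generate TTL for an ingestion activity and its provenance entity,
    mirroring the example above; returns the TTL and the provenance entity
    IRI that each ingested subject's prov:wasInformedBy should point at."""
    activity_id = activity_id or str(uuid.uuid4())
    timestamp = timestamp or datetime.now(timezone.utc).isoformat()
    activity = f"<{BICAN}ingestionActivity/{activity_id}>"
    entity = f"<{BICAN}provenance/{activity_id}>"
    agent = f"<{BICAN}agent/{user}>"
    associated = ",\n        ".join([f"bican:{s}" for s in subject_ids] + [agent])
    ttl = (
        f"{activity} a prov:Activity, bican:IngestionActivity ;\n"
        f'    prov:generatedAtTime "{timestamp}"^^xsd:dateTime ;\n'
        f"    prov:wasAssociatedWith {associated} .\n"
        f"\n"
        f"{entity} a prov:Entity ;\n"
        f'    dcterms:provenance "Data posted by {user} on {timestamp}" ;\n'
        f'    prov:generatedAtTime "{timestamp}"^^xsd:dateTime ;\n'
        f"    prov:wasAttributedTo {agent} ;\n"
        f"    prov:wasGeneratedBy {activity} .\n"
    )
    return ttl, entity
```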

#### **Sending Processed Data to the Query Service**
- The **Consumer** forwards the processed data to the **Query Service**.
- The **Query Service** acts as an interface to the **Graph Database**, enabling operations such as querying and inserting data.

#### **Storing Data in the Graph Database**
- The **Query Service** sends the structured data to the **Graph Database**, which in our case is Oxigraph.
- The **Graph Database** reports whether the storage operation succeeded.
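The insertion into a named graph can be pictured as constructing a SPARQL Update request; the sketch below only builds the update string, and the helper name, graph IRI, and triples are hypothetical (the query service's actual endpoint and request format are not shown in this section).

```python
def build_insert_query(named_graph_iri: str, ttl_triples: str) -> str:
    """Wrap already-validated triples in a SPARQL INSERT DATA update
    targeting the client's registered named graph (illustrative)."""
    return (
        f"INSERT DATA {{\n"
        f"  GRAPH <{named_graph_iri}> {{\n"
        f"{ttl_triples}\n"
        f"  }}\n"
        f"}}"
    )

query = build_insert_query(
    "https://example.org/graph/v1",  # hypothetical named graph IRI
    '    <http://example.org/a> <http://example.org/b> "c" .',
)
```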

#### **Acknowledging Message Processing**
- The **Query Service** sends a confirmation back to the **Processor**.
- The **Processor** notifies the **Consumer** that processing is complete.
- Finally, the **Consumer acknowledges the message to RabbitMQ**, ensuring that the message is marked as processed and removed from the queue.
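The consumer steps above can be sketched end-to-end with an in-memory stand-in for RabbitMQ. This is illustrative only: a real consumer would use a RabbitMQ client library and acknowledge messages through the broker, and every name below is an assumption.

```python
import json
import queue

broker = queue.Queue()  # stand-in for the RabbitMQ queue
broker.put(json.dumps({
    "named_graph": "https://example.org/graph/v1",
    "data": '<http://example.org/a> <http://example.org/b> "c" .',
}))

processed, acked = [], []

def query_service_insert(named_graph: str, triples: str) -> bool:
    """Stand-in for the query service call that stores data in the graph database."""
    processed.append((named_graph, triples))
    return True  # storage confirmation

def consume_one() -> None:
    message = broker.get()                   # 1. pick up a message from the queue
    payload = json.loads(message)            # 2. process it (provenance would be added here)
    ok = query_service_insert(payload["named_graph"], payload["data"])  # 3-4. forward & store
    if ok:
        acked.append(message)                # 5-6. acknowledge so the message is marked processed
        broker.task_done()

consume_one()
```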
