The Snapshot API allows processing of PDF documents through a verification system aligned with the OSI (Open Science Indicators). This project provides a Node.js REST API that implements JWT authentication and integrates with the DataSeer AI "GenShare" API for PDF processing. It features user-specific rate limiting, script-based user management, secure token handling, S3 storage for request data, and Google Sheets integration for summary logging.
- Features
- Prerequisites
- Installation
- Usage
- Health Monitoring
- Error Handling
- GenShare Response
- Authentication
- S3 Storage
- Google Sheets Integration
- Project Structure
- Configuration Files
- Rate Limiting
- Logging System
- Security Considerations
- Contributing
- License
- JWT-based authentication for all routes
- PDF processing via Genshare API integration
- User-specific rate limiting
- Script-based user and permissions management
- Secure token handling
- Health monitoring for all dependent services
- Route-specific access control with allow/block lists
- S3 storage for request and response data
- Google Sheets integration for summary logging
- Node.js (v14+ recommended)
- npm (comes with Node.js)
- Clone the repository:

  ```bash
  git clone https://github.com/DataSeer/snapshot-api.git
  cd snapshot-api
  ```
- Build the image:

  ```bash
  docker build -t snapshot-api .
  ```
- Run the container:

  ```bash
  # using default conf & env files
  docker run -d -it -p 3000:3000 --network host --name snapshot-api-instance snapshot-api

  # using custom conf & env files
  docker run -d -it -p 3000:3000 --network host --name snapshot-api-instance \
    -v $(pwd)/.env:/usr/src/app/.env \
    -v $(pwd)/conf:/usr/src/app/conf \
    -v $(pwd)/log:/usr/src/app/log \
    snapshot-api
  ```
- Interact with the container:

  ```bash
  docker exec -it snapshot-api-instance /bin/bash
  ```
- Clone the repository:

  ```bash
  git clone https://github.com/DataSeer/snapshot-api.git
  cd snapshot-api
  ```
- Install dependencies:

  ```bash
  npm install
  ```
- Set up configuration:

  - Create service configuration files in the `conf` directory:

    ```json
    // conf/genshare.json
    {
      "processPDF": {
        "url": "http://localhost:5000/process/pdf",
        "method": "POST",
        "apiKey": "your_genshare_api_key"
      },
      "health": {
        "url": "http://localhost:5000/health",
        "method": "GET"
      }
    }
    ```

    ```json
    // conf/grobid.json
    {
      "health": {
        "url": "http://localhost:8070/health",
        "method": "GET"
      }
    }
    ```

    ```json
    // conf/datastet.json
    {
      "health": {
        "url": "http://localhost:8080/health",
        "method": "GET"
      }
    }
    ```

  - The `conf/users.json` file will be created automatically when you add users.
- Set environment variables:

  - `PORT`: The port on which the server will run (default: 3000)
  - `JWT_SECRET`: Secret key for JWT token generation and validation
To start the server in production mode:

```bash
npm start
```
Use the following command to manage users:

```bash
npm run manage-users <command> [userId] [options]
```
Commands:

- `add [userId] [rateLimit]`: Add a new user
- `remove <userId>`: Remove a user
- `refresh-token <userId>`: Refresh a user's token
- `update-limit <userId> <rateLimit>`: Update a user's rate limit
- `list`: List all users
Examples:

```bash
# Add a new user with a custom rate limit
npm run manage-users add user123 '{"max": 200, "windowMs": 900000}'

# Refresh a user's token
npm run manage-users refresh-token user123

# Update a user's rate limit
npm run manage-users update-limit user123 '{"max": 300}'

# List all users
npm run manage-users list

# Remove a user
npm run manage-users remove user123
```
Rate limits are specified as a JSON object with `max` (maximum number of requests) and `windowMs` (time window in milliseconds) properties. If not specified when adding a user, the limit defaults to 100 requests per 15-minute window.
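The defaulting described above can be sketched as follows. This is an illustrative helper, not the project's actual code; `normalizeRateLimit` is a hypothetical name.

```javascript
// Sketch: normalize a user-supplied rate-limit object, applying the
// documented defaults (100 requests per 15-minute window = 900000 ms).
function normalizeRateLimit(input) {
  const DEFAULT = { max: 100, windowMs: 15 * 60 * 1000 };
  if (!input || typeof input !== 'object') return { ...DEFAULT };
  return {
    max: typeof input.max === 'number' ? input.max : DEFAULT.max,
    windowMs: typeof input.windowMs === 'number' ? input.windowMs : DEFAULT.windowMs,
  };
}
```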
The API includes a route-specific permissions system managed through a dedicated script:
```bash
npm run manage-permissions <command> [options]
```
Commands:

- `add <path> <method> [allowed] [blocked]`: Add route permissions
- `remove <path> <method>`: Remove route permissions
- `allow <path> <method> <userId>`: Grant user access
- `block <path> <method> <userId>`: Revoke user access
- `list`: List all route permissions
Examples:

```bash
# Add new route permissions
npm run manage-permissions add /processPDF POST '["user1","user2"]' '["user3"]'

# Allow user access
npm run manage-permissions allow /processPDF POST user4

# Block user access
npm run manage-permissions block /processPDF POST user3

# List all permissions
npm run manage-permissions list

# Remove route permissions
npm run manage-permissions remove /processPDF POST
```
Access Control Rules:
- Empty allowed list + empty blocked list: all authenticated users have access
- Populated allowed list: only listed users have access
- Blocked list: listed users are denied access (overrides allowed list)
- Users not in either list: allowed if allowed list is empty, blocked if it's not
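The access control rules above can be expressed as a small decision function. This is a sketch of the documented behavior; the function and parameter names are illustrative, not the project's actual API.

```javascript
// Sketch of the documented allow/block decision rules.
function isAllowed(userId, allowed, blocked) {
  if (blocked.includes(userId)) return false; // block list overrides allow list
  if (allowed.length === 0) return true;      // empty allow list: all authenticated users
  return allowed.includes(userId);            // otherwise, only listed users
}
```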
All API endpoints require authentication using a JWT token.
- `GET /`: Get information about available API routes
- `POST /processPDF`: Process a PDF file
  - Form data:
    - `file`: PDF file
    - `options`: JSON string of processing options (must be a valid JSON object; if it is not well-formed or not a JSON object, the API returns a 400 Bad Request error)
For all requests, include the JWT token in the Authorization header:

```
Authorization: Bearer <your_token>
```
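Extracting the token from that header can be sketched as below; `extractBearerToken` is an illustrative helper, not the project's actual middleware.

```javascript
// Sketch: pull the bearer token out of an Authorization header value.
// Returns null when the header is missing or malformed.
function extractBearerToken(headerValue) {
  if (typeof headerValue !== 'string') return null;
  const match = headerValue.match(/^Bearer\s+(\S+)$/);
  return match ? match[1] : null;
}
```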
The API provides comprehensive health monitoring for all dependent services (GenShare, GROBID, and DataStet).
- `GET /ping`: Check the health status of all services
  - Returns aggregated health status of all services
  - Response includes timestamp and detailed service status
- `GET /genshare/health`: Direct health check proxy to the GenShare service
  - Returns the raw response from GenShare's health endpoint
  - Returns 500 with message "GenShare health check failed" if the request fails
- `GET /grobid/health`: Direct health check proxy to the GROBID service
  - Returns the raw response from GROBID's health endpoint
  - Returns 500 with message "Grobid health check failed" if the request fails
- `GET /datastet/health`: Direct health check proxy to the DataStet service
  - Returns the raw response from DataStet's health endpoint
  - Returns 500 with message "Datastet health check failed" if the request fails
These individual health endpoints act as direct proxies to their respective services, forwarding the raw response data when successful. In case of failure, they return a 500 status code with a service-specific error message.
The `/ping` endpoint returns:
```json
{
  "status": "healthy" | "unhealthy" | "error",
  "timestamp": "2024-12-02T12:00:00.000Z",
  "services": {
    "genshare": {
      "err": null,
      "request": "GET http://genshare-service/health",
      "response": {
        "status": 200,
        "data": { /* service-specific health data */ }
      }
    },
    "grobid": {
      "err": null,
      "request": "GET http://grobid-service/health",
      "response": {
        "status": 200,
        "data": { /* service-specific health data */ }
      }
    },
    "datastet": {
      "err": null,
      "request": "GET http://datastet-service/health",
      "response": {
        "status": 200,
        "data": { /* service-specific health data */ }
      }
    }
  }
}
```
Individual service endpoints return raw response from their respective health endpoints.
- 200: Service(s) healthy
- 503: One or more services unhealthy
- 500: Error occurred while checking service health
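Mapping the per-service results to an aggregate status and HTTP code might look like the sketch below. The input shape is assumed from the `/ping` example above; this is not the service's actual implementation.

```javascript
// Sketch: aggregate per-service health results into the documented
// status values and HTTP codes (200 healthy, 503 unhealthy, 500 error).
function aggregateHealth(services) {
  const results = Object.values(services);
  if (results.some((s) => s.err)) {
    return { status: 'error', httpCode: 500 };
  }
  const allHealthy = results.every((s) => s.response && s.response.status === 200);
  return allHealthy
    ? { status: 'healthy', httpCode: 200 }
    : { status: 'unhealthy', httpCode: 503 };
}
```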
Each service requires health check configuration in its respective config file:
```json
{
  "health": {
    "url": "http://service-url/health",
    "method": "GET"
  }
}
```
HTTP 400: 'Required "file" missing' (parameter not set)

```bash
curl -X POST -H "Authorization: Bearer <your_token>" \
  -F 'options={"key":"value","anotherKey":123}' \
  http://localhost:3000/processPDF

# HTTP 400 Bad Request
Required "file" missing
```
HTTP 400: 'Required "file" invalid. Must have mimetype "application/pdf".' (file with incorrect mimetype)

```bash
curl -X POST -H "Authorization: Bearer <your_token>" \
  -F "file=@path/to/your/file.xml" \
  -F 'options={"key":"value","anotherKey":123}' \
  http://localhost:3000/processPDF

# HTTP 400 Bad Request
Required "file" invalid. Must have mimetype "application/pdf".
```
HTTP 400: 'Required "options" missing.' (parameter not set)

```bash
curl -X POST -H "Authorization: Bearer <your_token>" \
  -F "file=@path/to/your/file.pdf" \
  http://localhost:3000/processPDF

# HTTP 400 Bad Request
Required "options" missing.
```
HTTP 400: 'Required "options" invalid. Must be a valid JSON object.' (value is not a valid JSON object)

```bash
curl -X POST -H "Authorization: Bearer <your_token>" \
  -F "file=@path/to/your/file.pdf" \
  -F 'options="key value anotherKey 123"' \
  http://localhost:3000/processPDF

# HTTP 400 Bad Request
Required "options" invalid. Must be a valid JSON object.
```
HTTP 400: 'Required "options" invalid. Must be a JSON object.' (value is valid JSON but not an object)

```bash
curl -X POST -H "Authorization: Bearer <your_token>" \
  -F "file=@path/to/your/file.pdf" \
  -F 'options=["key","value","anotherKey",123]' \
  http://localhost:3000/processPDF

# HTTP 400 Bad Request
Required "options" invalid. Must be a JSON object.
```
The API uses JSON Web Tokens (JWT) for authentication.
- `TokenManager`: Handles token storage and validation
- `UserManager`: Manages user data and updates
- Tokens are stored in `conf/users.json`, separate from user data, for security
- Creation: Tokens are generated using:

  ```bash
  npm run manage-users add <userId>
  ```

- Usage: Include the token in requests:

  ```bash
  curl -H "Authorization: Bearer <your_token>" http://localhost:3000/endpoint
  ```

- Validation: Each request is authenticated by:

  - Extracting the token from the Authorization header
  - Verifying the JWT signature
  - Looking up the associated user
  - Checking rate limits

- Renewal: Refresh expired tokens using:

  ```bash
  npm run manage-users refresh-token <userId>
  ```
```bash
# Generate new user with token
npm run manage-users add user123

# List all users and their tokens
npm run manage-users list

# Refresh token for existing user
npm run manage-users refresh-token user123

# Remove user and invalidate token
npm run manage-users remove user123
```
- JWT tokens are signed with a secret key (the `JWT_SECRET` environment variable)
- Tokens are stored separately from user data
- Rate limiting is tied to authentication
- Invalid tokens return 401 Unauthorized
- Missing tokens return 403 Forbidden
The API implements comprehensive storage of all request and response data in Amazon S3, providing complete traceability and audit capabilities for all PDF processing operations.
Files are stored in S3 using the following path structure:
```
{s3Folder}/{userId}/{requestId}/
├── file.pdf              # Original uploaded PDF
├── file.metadata.json    # File metadata (name, size, MD5, etc.)
├── options.json          # Processing options
├── process.json          # Process metadata
├── process.log           # Detailed process log
└── response.json         # GenShare API response
```
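Building the object keys for that layout is straightforward; `s3Keys` below is an illustrative helper following the documented path structure, not the project's actual code.

```javascript
// Sketch: build the documented S3 object keys for one request.
function s3Keys(s3Folder, userId, requestId) {
  const prefix = `${s3Folder}/${userId}/${requestId}`;
  return {
    pdf: `${prefix}/file.pdf`,
    fileMetadata: `${prefix}/file.metadata.json`,
    options: `${prefix}/options.json`,
    process: `${prefix}/process.json`,
    processLog: `${prefix}/process.log`,
    response: `${prefix}/response.json`,
  };
}
```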
- File Data
  - Original PDF file
  - File metadata including:
    - Original filename
    - File size
    - MD5 hash
    - MIME type
- Process Information
  - Process metadata:
    - Start date/time
    - End date/time
    - Process duration
    - File presence indicator
  - Detailed process log with timestamped entries
  - Processing options used
- Response Data
  - Complete GenShare API response
  - Response status and headers
  - Response data including decision tree path
Configure S3 storage in `conf/aws.s3.json`:

```json
{
  "accessKeyId": "your_access_key",
  "secretAccessKey": "your_secret_key",
  "region": "your_region",
  "bucketName": "your_bucket",
  "s3Folder": "your_folder"
}
```
Each PDF processing request creates a ProcessingSession that manages:
- Unique request ID generation
- Log accumulation
- File handling
- S3 upload operations
- Process metadata tracking
The API logs summary information for each PDF processing request to a Google Sheets spreadsheet, providing easy access to processing history and analytics.
The summary sheet includes the following columns:
- Request ID (with S3 link)
- Error status
- Processing date
- User ID
- PDF filename
- Response data columns:
- Article ID
- DAS presence
- Data availability requirements
- DAS sharing statement
- Data generalist info
- Warrant generalist info
- Data specialist info
- Warrant specialist info
- Non-standard info
- Computer-generated content
- Computer-generated SI
- Computer-generated online
- Warrants code online
- Path information (decision tree path)
- Set up Google Sheets credentials:

  - Place service account credentials in `conf/googleSheets.credentials.json`
  - Configure sheet details in `conf/googleSheets.json`:

    ```json
    {
      "spreadsheetId": "your_spreadsheet_id",
      "sheetName": "your_sheet_name"
    }
    ```

- Required permissions:

  - Google Sheets API enabled
  - Service account with appropriate permissions
  - Spreadsheet shared with the service account email
The API automatically logs:
- Every PDF processing request
- Processing outcomes
- Error statuses
- GenShare response data
- Decision tree paths
- `src/`: Contains the main application code
  - `server.js`: Entry point
  - `config.js`: Configuration management
  - `middleware/`: Custom middleware (e.g., authentication)
  - `routes/`: API route definitions
  - `controllers/`: Request handling logic
  - `utils/`: Utility functions and classes
- `scripts/`: Contains the user management script
- `conf/`: Configuration files
  - `genshare.json`: GenShare API configuration
  - `users.json`: User data storage (managed by scripts)
- `tmp/`: Folder containing temporary files
The application uses several configuration files:

- `conf/genshare.json`: Configuration for the GenShare API integration
- `conf/grobid.json`: Configuration for the GROBID service
- `conf/datastet.json`: Configuration for the DataStet service
- `conf/users.json`: User data, including tokens and rate limits
- `conf/permissions.json`: Route-specific access permissions
- `conf/aws.s3.json`: S3 storage configuration
- `conf/googleSheets.json`: Google Sheets configuration
- `conf/googleSheets.credentials.json`: Service account credentials for Google Sheets
Make sure to keep these files secure and do not commit them to version control.
This API implements user-specific rate limiting:

- Rate limits are customized for each user and stored in their user data
- Unauthenticated requests are limited to 100 requests per 15-minute window
- Authenticated requests use the limits specified in the user's data:
  - `max`: Maximum number of requests allowed in the time window
  - `windowMs`: Time window in milliseconds
  - If not specified, defaults to 100 requests per 15-minute window
- To give a user unlimited requests, set `windowMs` to 0 when adding or updating the user
Rate limiting is implemented in `src/utils/rateLimiter.js` and can be further customized as needed.
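The semantics above, including the `windowMs === 0` "unlimited" case, can be sketched as a simple check. This is an illustrative function, not the contents of `src/utils/rateLimiter.js`.

```javascript
// Sketch: decide whether a request should be rejected, given how many
// requests the user has already made in the current window. Defaults
// follow the documented 100-requests-per-15-minutes limit.
function isRateLimited(requestsInWindow, { max = 100, windowMs = 900000 } = {}) {
  if (windowMs === 0) return false; // windowMs of 0 means unlimited
  return requestsInWindow >= max;
}
```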
The API implements comprehensive logging using Winston and Morgan.
- Each log entry contains:
- IP address
- User ID (or 'unauthenticated')
- Timestamp
- HTTP method and URL
- Status code
- Response size
- Referrer
- User agent
- Request success status
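As a rough illustration, a log line carrying the fields listed above could be composed as follows. The field order and separator are assumptions for the sketch; the service's actual Winston/Morgan format will differ.

```javascript
// Sketch: compose a single log line from the documented fields.
function formatLogEntry(e) {
  return [
    e.ip,
    e.userId || 'unauthenticated',
    e.timestamp,
    `${e.method} ${e.url}`,
    e.status,
    e.responseSize,
    e.referrer || '-',
    e.userAgent,
    e.success ? 'success' : 'failure',
  ].join(' | ');
}
```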
The project includes a log analysis script that provides detailed statistics about API usage:

```bash
# analyze the log/combined.log file
npm run analyze-logs

# analyze a given log file
node scripts/analyze_logs.js [path/to/logfile]
```
The analyzer provides:
- Per-user statistics:
- Total requests
- Successful requests
- Success rate
- URL-specific breakdown
- Per-IP statistics:
- Similar metrics as per-user
- Useful for tracking unauthenticated requests
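The per-user statistics described above reduce to a simple aggregation over parsed log entries. The entry shape (`{ userId, url, success }`) and function name are assumptions for this sketch, not the analyzer's actual internals.

```javascript
// Sketch: aggregate parsed log entries into per-user totals, success
// counts, success rates (rounded to 2 decimals), and per-URL breakdowns.
function userStats(entries) {
  const stats = {};
  for (const { userId, url, success } of entries) {
    if (!stats[userId]) stats[userId] = { total: 0, ok: 0, urls: {} };
    const u = stats[userId];
    u.total += 1;
    if (success) u.ok += 1;
    if (!u.urls[url]) u.urls[url] = { total: 0, ok: 0 };
    u.urls[url].total += 1;
    if (success) u.urls[url].ok += 1;
  }
  for (const u of Object.values(stats)) {
    u.successRate = u.total === 0 ? 0 : Number(((100 * u.ok) / u.total).toFixed(2));
  }
  return stats;
}
```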
Example output:

```
Request Statistics:

USERS Statistics:
  User: user123
    Total Requests: 150
    Successful Requests: 145
    Overall Success Rate: 96.67%

    URL Breakdown:
      URL: /processPDF
        Total Requests: 120
        Successful Requests: 118
        Success Rate: 98.33%

IPS Statistics:
  IP: 192.168.1.1
    Total Requests: 75
    Successful Requests: 70
    Overall Success Rate: 93.33%
...
```
- All routes require JWT authentication
- Route-specific access control through permissions system
- JWT tokens are stored separately and managed through a dedicated script
- The main application can only read tokens, not modify them
- Rate limiting is implemented to prevent API abuse
- Sensitive configuration files (`users.json`, `permissions.json`, `genshare.json`, `grobid.json`, `datastet.json`) are not committed to version control
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
This project is licensed under the MIT License - see the LICENSE.md file for details.