A Node.js REST API for processing PDF documents through the OSI (Open Science Indicators) verification system. It integrates with the DataSeer AI "GenShare" API and provides JWT authentication, user-specific rate limiting, S3 storage for request data, and Google Sheets integration for summary logging.
- Features
- Prerequisites
- Installation
- Configuration
- API Architecture
- Usage
- Scripts
- Development
- Deployment
- Security
- Dependencies
- PDF document processing via GenShare integration
- JWT-based authentication system
- Role-based access control
- User-specific rate limiting
- AWS S3 storage integration
- Google Sheets logging integration
- Health monitoring for all services
- Comprehensive logging system
- Version synchronization
- Complete request traceability
- Node.js (>= 20.18.0)
- Docker for containerization
- AWS Account (ECR & S3)
- Google Cloud Account (Sheets API)
- Access to:
- GROBID service
- DataStet service
- GenShare service
# Build image
docker build -t snapshot-api .
# Run with default configuration
docker run -d -p 3000:3000 --name snapshot-api snapshot-api
# Run with custom configuration
docker run -d -p 3000:3000 \
-v $(pwd)/.env:/usr/src/app/.env \
-v $(pwd)/conf:/usr/src/app/conf \
-v $(pwd)/log:/usr/src/app/log \
snapshot-api
# Clone repository
git clone https://github.com/DataSeer/snapshot-api.git
cd snapshot-api
# Install dependencies
npm install
# Copy configuration files
cp .env.default .env
for f in conf/*.default; do cp "$f" "${f%.default}"; done
JWT_SECRET=your_jwt_secret_key
PORT=3000
- Service Configuration:
// conf/genshare.json
{
"processPDF": {
"url": "http://localhost:5000/process/pdf",
"method": "POST",
"apiKey": "your_genshare_api_key"
},
"health": {
"url": "http://localhost:5000/health",
"method": "GET"
}
}
- AWS S3 (conf/aws.s3.json):
{
"accessKeyId": "YOUR_ACCESS_KEY",
"secretAccessKey": "YOUR_SECRET_KEY",
"region": "YOUR_REGION",
"bucketName": "YOUR_BUCKET_NAME",
"s3Folder": "YOUR-FOLDER-NAME"
}
- Additional Configurations:
  - conf/grobid.json: GROBID service settings
  - conf/datastet.json: DataStet service settings
  - conf/permissions.json: Route permissions
  - conf/users.json: User management
  - conf/googleSheets.json & conf/googleSheets.credentials.json: Google Sheets integration
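The conf/genshare.json example above maps each operation to a `url` and a `method`. As a minimal sketch, assuming the other service files follow a similar shape, a startup check could catch missing fields early. The helper name `validateServiceConfig` is hypothetical and not part of the project.

```javascript
// Hypothetical helper: checks that every entry in a service config
// (shaped like conf/genshare.json) has a usable url and HTTP method.
function validateServiceConfig(config) {
  const errors = [];
  for (const [name, entry] of Object.entries(config)) {
    if (entry === null || typeof entry !== "object") {
      errors.push(`${name}: entry must be an object`);
      continue;
    }
    if (typeof entry.url !== "string" || !/^https?:\/\//.test(entry.url)) {
      errors.push(`${name}: "url" must be an http(s) URL`);
    }
    if (!["GET", "POST"].includes(entry.method)) {
      errors.push(`${name}: "method" must be GET or POST`);
    }
  }
  return errors;
}

// Same values as the conf/genshare.json example.
const genshare = {
  processPDF: {
    url: "http://localhost:5000/process/pdf",
    method: "POST",
    apiKey: "your_genshare_api_key",
  },
  health: { url: "http://localhost:5000/health", method: "GET" },
};

console.log(validateServiceConfig(genshare)); // prints []
```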
GET / - List available API routes
GET /versions - Get version information
POST /processPDF - Process a PDF file
GET /ping - Health check all services
GET /genshare/health - Check GenShare service health
GET /grobid/health - Check GROBID service health
GET /datastet/health - Check DataStet service health
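GET /ping is described as a health check across all services. One way such an aggregate check could work is sketched below; this is an illustration under assumptions, not the project's actual handler. The function name `pingAll` and the report shape are hypothetical, and the stub checks stand in for real HTTP calls to the three health endpoints.

```javascript
// Hypothetical aggregator: runs one health check per service in parallel
// and reports each result without letting one failure hide the others.
async function pingAll(checks) {
  const names = Object.keys(checks);
  // The async wrapper turns synchronous throws into rejections as well.
  const settled = await Promise.allSettled(
    names.map(async (name) => checks[name]())
  );
  const services = {};
  settled.forEach((result, i) => {
    services[names[i]] =
      result.status === "fulfilled"
        ? { ok: true }
        : { ok: false, error: String(result.reason?.message ?? result.reason) };
  });
  return { ok: Object.values(services).every((s) => s.ok), services };
}

// Stub checks: two services respond, one is unreachable.
pingAll({
  genshare: async () => "up",
  grobid: async () => "up",
  datastet: async () => {
    throw new Error("connection refused");
  },
}).then((report) => console.log(JSON.stringify(report, null, 2)));
```

Using `Promise.allSettled` rather than `Promise.all` means the report always lists every service, even when one of them fails.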
- Token Generation:
npm run manage-users -- add user123
# Output: User user123 added with token: eyJhbGciOiJ...
- Request Authentication:
curl -H "Authorization: Bearer <your-token>" http://localhost:3000/endpoint
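To make the token flow above more concrete, here is a minimal sketch of how an HS256 JWT can be signed, verified, and read from an `Authorization` header using only Node's built-in `crypto` module. It is for illustration only: the project lists `jsonwebtoken` as a dependency and presumably uses it for real token handling. The HS256 algorithm, the `id` payload field, and all function names are assumptions.

```javascript
const crypto = require("node:crypto");

// base64url-encode a string or Buffer.
const b64url = (input) => Buffer.from(input).toString("base64url");

// Sign a payload as an HS256 JWT (illustration only).
function signHS256(payload, secret) {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const body = b64url(JSON.stringify(payload));
  const signature = crypto
    .createHmac("sha256", secret)
    .update(`${header}.${body}`)
    .digest("base64url");
  return `${header}.${body}.${signature}`;
}

// Verify the signature and return the payload, or null if the token is invalid.
function verifyHS256(token, secret) {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header, body, signature] = parts;
  const expected = crypto
    .createHmac("sha256", secret)
    .update(`${header}.${body}`)
    .digest();
  const actual = Buffer.from(signature, "base64url");
  // Constant-time comparison; lengths must match before calling it.
  if (actual.length !== expected.length) return null;
  if (!crypto.timingSafeEqual(actual, expected)) return null;
  return JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
}

// Extract the token from an "Authorization: Bearer <token>" header value.
function bearerToken(headerValue) {
  const match = /^Bearer (\S+)$/.exec(headerValue || "");
  return match ? match[1] : null;
}

// Hypothetical payload: the real token contents are not documented here.
const token = signHS256({ id: "user123" }, "your_jwt_secret_key");
console.log(verifyHS256(bearerToken(`Bearer ${token}`), "your_jwt_secret_key"));
```

A token signed with one `JWT_SECRET` fails verification under any other secret, which is why changing the secret invalidates previously issued tokens.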
{
"max": 100, // Maximum requests
"windowMs": 900000 // Time window (15 minutes)
}
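The two settings above read as "at most `max` requests per `windowMs` milliseconds". The project lists `express-rate-limit` as a dependency; the sketch below does not reproduce that library and only illustrates the meaning of the two values with a simple in-memory, per-user fixed window. The name `createRateLimiter` is hypothetical.

```javascript
// Hypothetical in-memory fixed-window limiter keyed by user ID.
// `now` is injectable so the behaviour can be checked without waiting.
function createRateLimiter({ max, windowMs }) {
  const windows = new Map(); // userId -> { start, count }
  return function allow(userId, now = Date.now()) {
    let current = windows.get(userId);
    // Start a fresh window on first use or once the old one has expired.
    if (!current || now - current.start >= windowMs) {
      current = { start: now, count: 0 };
      windows.set(userId, current);
    }
    if (current.count >= max) return false;
    current.count += 1;
    return true;
  };
}

const allow = createRateLimiter({ max: 2, windowMs: 900000 });
console.log(allow("user123", 0));      // true  (1st request in window)
console.log(allow("user123", 1000));   // true  (2nd request in window)
console.log(allow("user123", 2000));   // false (limit of 2 reached)
console.log(allow("user123", 900000)); // true  (new 15-minute window)
```

Because the counters are keyed by user ID, one user exhausting their quota does not affect any other user.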
S3 Storage Structure:
{s3Folder}/{userId}/{requestId}/
├── file.pdf # Original PDF
├── file.metadata.json # File metadata
├── options.json # Processing options
├── process.json # Process metadata
├── process.log # Process log
└── response.json # API response
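The layout above can be expressed as a small key-building helper. This is a sketch rather than the project's code: `buildRequestKeys` is a hypothetical name, and the only facts taken from this document are the `{s3Folder}/{userId}/{requestId}/` prefix and the six file names.

```javascript
// File names stored for each request, as listed in the structure above.
const REQUEST_FILES = [
  "file.pdf",
  "file.metadata.json",
  "options.json",
  "process.json",
  "process.log",
  "response.json",
];

// Hypothetical helper: returns a map of file name -> full S3 object key.
function buildRequestKeys(s3Folder, userId, requestId) {
  const prefix = [s3Folder, userId, requestId]
    .map((part) => String(part).replace(/^\/+|\/+$/g, "")) // trim stray slashes
    .join("/");
  return Object.fromEntries(
    REQUEST_FILES.map((name) => [name, `${prefix}/${name}`])
  );
}

const keys = buildRequestKeys("YOUR-FOLDER-NAME", "user123", "req-001");
console.log(keys["response.json"]);
// prints YOUR-FOLDER-NAME/user123/req-001/response.json
```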
- File Errors:
# Missing file
curl -X POST -H "Authorization: Bearer <token>" \
-F 'options={"key":"value"}' \
http://localhost:3000/processPDF
# Response: HTTP 400 "Required 'file' missing"
# Invalid file type
curl -X POST -H "Authorization: Bearer <token>" \
-F "file=@<path-to-non-pdf-file>" \
-F 'options={"key":"value"}' \
http://localhost:3000/processPDF
# Response: HTTP 400 "Required 'file' invalid"
- Options Errors:
# Missing options
curl -X POST -H "Authorization: Bearer <token>" \
-F "file=@<path-to-pdf-file>" \
http://localhost:3000/processPDF
# Response: HTTP 400 "Required 'options' missing"
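The three error cases above suggest a validation step that runs before any processing starts. A minimal sketch of such a check follows. It assumes the uploaded file is represented as an object with a `mimetype` field and that a PDF is recognised by the `application/pdf` MIME type; both are assumptions, as are the function name and the order of the checks. Only the three messages come from this document.

```javascript
// Hypothetical validator mirroring the HTTP 400 responses shown above.
// Returns null when the request looks valid, or { status, message } otherwise.
function validateProcessPdfRequest({ file, options }) {
  if (!file) {
    return { status: 400, message: "Required 'file' missing" };
  }
  // Assumption: a PDF upload is detected by its MIME type.
  if (file.mimetype !== "application/pdf") {
    return { status: 400, message: "Required 'file' invalid" };
  }
  if (options === undefined || options === "") {
    return { status: 400, message: "Required 'options' missing" };
  }
  return null;
}

console.log(validateProcessPdfRequest({ options: '{"key":"value"}' }));
// prints { status: 400, message: "Required 'file' missing" }
```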
curl -X POST http://localhost:3000/processPDF \
-H "Authorization: Bearer <token>" \
-F "file=@<path-to-pdf-file>" \
-F 'options={"option1": "value1"}'
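The same request can be sent from Node.js without curl. The sketch below relies on the global `fetch`, `FormData`, and `Blob` available in Node.js 18 and later; the function names, the default file name, and the decision to return the raw response are illustrative choices, not part of the API.

```javascript
// Build the multipart body for POST /processPDF: a "file" part holding the
// PDF bytes and an "options" part holding a JSON string.
function buildProcessPdfForm(pdfBytes, options, filename = "document.pdf") {
  const form = new FormData();
  form.append("file", new Blob([pdfBytes], { type: "application/pdf" }), filename);
  form.append("options", JSON.stringify(options));
  return form;
}

// Send the request. baseUrl and token are supplied by the caller; the raw
// Response is returned so the caller decides how to read the body.
async function processPdf(baseUrl, token, pdfBytes, options) {
  const response = await fetch(`${baseUrl}/processPDF`, {
    method: "POST",
    headers: { Authorization: `Bearer ${token}` },
    body: buildProcessPdfForm(pdfBytes, options),
  });
  if (!response.ok) {
    throw new Error(`processPDF failed with HTTP ${response.status}`);
  }
  return response;
}

// Inspect the form without contacting a server ("%PDF" as placeholder bytes).
const form = buildProcessPdfForm(new Uint8Array([37, 80, 68, 70]), { option1: "value1" });
console.log(form.get("options")); // prints {"option1":"value1"}
```

In a real call, `pdfBytes` could come from `fs.promises.readFile`, and `processPdf("http://localhost:3000", token, pdfBytes, { option1: "value1" })` would mirror the curl command above. The `Content-Type` header is left unset on purpose so that `fetch` can add the multipart boundary itself.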
# Add user with custom rate limit
npm run manage-users -- add user123 '{"max": 200, "windowMs": 900000}'
# List users
npm run manage-users -- list
# Update rate limit
npm run manage-users -- update-limit user123 '{"max": 300}'
# Refresh token
npm run manage-users -- refresh-token user123
# Add route permission
npm run manage-permissions -- add /processPDF POST '["user1","user2"]' '["user3"]'
# Allow user
npm run manage-permissions -- allow /processPDF POST user4
# List permissions
npm run manage-permissions -- list
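The permission commands above take a route, an HTTP method, and lists of users, which suggests a per-route, per-method lookup. The sketch below shows one way such a check could behave. The table layout, the `allowed` and `blocked` field names, and the rule that a blocked user is always denied are all assumptions; the real conf/permissions.json may be structured differently.

```javascript
// Hypothetical permission table, reflecting the state after the
// "add" and "allow" commands above.
const permissions = {
  "/processPDF": {
    POST: { allowed: ["user1", "user2", "user4"], blocked: ["user3"] },
  },
};

// Assumed rule: a blocked user is always denied; otherwise the user must
// appear in the allowed list. Unknown routes or methods are denied.
function isAllowed(table, route, method, userId) {
  const rule = table[route]?.[method];
  if (!rule) return false;
  if (rule.blocked.includes(userId)) return false;
  return rule.allowed.includes(userId);
}

console.log(isAllowed(permissions, "/processPDF", "POST", "user4")); // true
console.log(isAllowed(permissions, "/processPDF", "POST", "user3")); // false
console.log(isAllowed(permissions, "/processPDF", "GET", "user1"));  // false
```

Denying by default for unknown routes and methods keeps the check consistent with the least-privilege principle mentioned under Security.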
# Start the server
npm run start
# Add new user with default rate limit
npm run manage-users -- add user123
# Output: User user123 added with token: eyJhbGciOiJ...
# Add user with custom rate limit
npm run manage-users -- add user123 '{"max": 200, "windowMs": 900000}'
# Output: User user123 added with rate limit: {"max":200,"windowMs":900000}
# List all users
npm run manage-users -- list
# Output: Lists all users with their tokens and rate limits
# Refresh user token
npm run manage-users -- refresh-token user123
# Output: Token refreshed for user user123. New token: eyJhbGciOiJ...
# Update user rate limit
npm run manage-users -- update-limit user123 '{"max": 300}'
# Output: Rate limit updated for user user123: {"max":300,"windowMs":900000}
# Remove user
npm run manage-users -- remove user123
# Output: User user123 removed
# Add new route with permissions
npm run manage-permissions -- add /api/route GET '["user1"]' '["user2"]'
# Output: Route /api/route [GET] added with permissions
# Allow user access to route
npm run manage-permissions -- allow /api/route GET user3
# Output: User user3 allowed on route /api/route [GET]
# Block user from route
npm run manage-permissions -- block /api/route GET user2
# Output: User user2 blocked from route /api/route [GET]
# List all route permissions
npm run manage-permissions -- list
# Output: Displays all routes and their permissions
# Analyze default log file
npm run analyze-logs
# Output: Shows usage statistics from log/combined.log
# Analyze specific log file
npm run analyze-logs -- /path/to/custom.log
# Output: Shows usage statistics for specified log file
# Analysis includes:
# - Per-user statistics
# - Per-IP statistics
# - URL-specific breakdown
# - Success rates
# Run ESLint check
npm run lint
# Output: Shows any code style violations
# Fix auto-fixable ESLint issues
npm run lint:fix
# Output: Fixes and shows remaining issues
# Install Git hooks (automatic on npm install)
npm run prepare
# Output: Husky hooks installed
# Sync version with git tags
npm run sync-version
# Output: Updates package.json version to match latest git tag
# Update changelog
npm run version
# Effect: Updates CHANGELOG.md with recent changes
# Note: Part of npm version lifecycle, rarely used directly
# Push version changes
npm run post-version
# Effect: Pushes commits and tags to repository
# Note: Usually part of release process
# Create new release
npm run release
# Effect:
# 1. Bumps version based on conventional commits
# 2. Updates CHANGELOG.md
# 3. Creates version commit
# 4. Creates git tag
# 5. Pushes changes and tags
Notes:
- All user and permission management commands require appropriate permissions
- Log analysis can be memory-intensive for large log files
- Version management commands affect git repository state
- Always run lint before committing changes
- The release process is automated but can be done manually if needed
.
├── src/
│ ├── server.js # Entry point
│ ├── config.js # Configuration
│ ├── middleware/ # Custom middleware
│ ├── routes/ # API routes
│ ├── controllers/ # Request handlers
│ └── utils/ # Utility functions
├── scripts/ # Management scripts
├── conf/ # Configuration files
└── tmp/ # Temporary files
- Clone and setup:
git clone https://github.com/DataSeer/snapshot-api.git
cd snapshot-api
npm install
- Configure development environment:
cp .env.default .env
for f in conf/*.default; do cp "$f" "${f%.default}"; done
Follow Conventional Commits specification:
# Format
<type>: <description>
# Examples
feat: add new PDF processing option
fix: resolve rate limiting issue
docs: update API documentation
refactor: improve error handling
Supported types (from .versionrc.json):
- feat: Features
- fix: Bug Fixes
- docs: Documentation
- style: Styling
- refactor: Code Refactoring
- perf: Performance Improvements
- test: Tests
- build: Build System
- ci: CI Implementation
- chore: Maintenance
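The commit format and type list above can be checked with a short regular expression. The project relies on commitlint for this; the sketch below is not a replacement for it and only illustrates the `<type>: <description>` rule. The function name is hypothetical, and the optional `(scope)` and `!` markers are accepted because the Conventional Commits specification permits them.

```javascript
// Types taken from the list above.
const COMMIT_TYPES = [
  "feat", "fix", "docs", "style", "refactor",
  "perf", "test", "build", "ci", "chore",
];

// Check the first line of a commit message against "<type>: <description>".
function isConventionalHeader(header) {
  const pattern = new RegExp(
    `^(${COMMIT_TYPES.join("|")})(\\([^)]+\\))?!?: \\S.*$`
  );
  return pattern.test(header);
}

console.log(isConventionalHeader("feat: add new PDF processing option")); // true
console.log(isConventionalHeader("feature: add new option"));             // false
console.log(isConventionalHeader("fix:resolve rate limiting issue"));     // false
```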
# Create new version
npm run release
# Manual version tag
git tag v1.1.0
git push origin v1.1.0
- JWT-based authentication required for all routes
- Route-specific access control
- User-specific rate limiting
- Secure token storage and management
- Request logging and monitoring
- S3 storage for complete request traceability
- Automated security scanning in CI/CD
- Regular token rotation recommended
- Access logs monitoring for suspicious activity
- Principle of least privilege for AWS IAM
{
"aws-sdk": "^2.1692.0",
"axios": "^1.7.7",
"express": "^4.17.1",
"express-rate-limit": "^5.2.6",
"googleapis": "^144.0.0",
"jsonwebtoken": "^9.0.2",
"winston": "^3.15.0"
}
{
"@commitlint/cli": "^19.6.1",
"@commitlint/config-conventional": "^19.6.0",
"eslint": "^8.56.0",
"husky": "^8.0.3",
"standard-version": "^9.5.0"
}