Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve sql obfuscation regular expressions #2861

Closed
jsumners-nr opened this issue Jan 3, 2025 · 5 comments · Fixed by #2911
Closed

Improve sql obfuscation regular expressions #2861

jsumners-nr opened this issue Jan 3, 2025 · 5 comments · Fixed by #2911
Assignees
Labels
points: 3 A few days

Comments

@jsumners-nr
Copy link
Contributor

jsumners-nr commented Jan 3, 2025

// eslint-disable-next-line sonarjs/slow-regex
const singleQuote = /'(?:''|[^'])*?(?:\\'.*|'(?!'))/
// eslint-disable-next-line sonarjs/slow-regex
const doubleQuote = /"(?:[^"]|"")*?(?:\\".*|"(?!"))/
const dollarQuote = /(\$(?!\d)[^$]*?\$).*?(?:\1|$)/
const oracleQuote = /q'\[.*?(?:\]'|$)|q'\{.*?(?:\}'|$)|q'<.*?(?:>'|$)|q'\(.*?(?:\)'|$)/
// eslint-disable-next-line sonarjs/slow-regex
const comment = /(?:#|--).*?(?=\r|\n|$)/

// ( ` database` . ` table ` )
// eslint-disable-next-line sonarjs/slow-regex
const CLEANER = /^\(?(?:([`'"]?)(.*?)\1\.)?([`'"]?)(.*?)\3\)?$/

These regular expressions are triggering sonar-js's "slow regex" rule. We need to investigate what the regular expressions do, and see if we can improve them.

@workato-integration
Copy link

@jsumners-nr
Copy link
Contributor Author

These may already be solved if we agree that #2714 is an acceptable change.

@kmudduluru kmudduluru moved this from Triage Needed: Unprioritized Features to Prioritized in Node.js Engineering Board Jan 17, 2025
@kmudduluru kmudduluru moved this from Prioritized to To do: In current sprint in Node.js Engineering Board Jan 21, 2025
@kmudduluru kmudduluru added the points: 3 A few days label Jan 21, 2025
@jsumners-nr
Copy link
Contributor Author

I wanted an easier to understand benchmark in order to compare our current implementation with the one I have proposed. I put it up at https://github.com/jsumners-nr/sql-parser-bench. The benchmark runs 10,000 iterations for each parser to parse all of our cross agent SQL parsing tests. This still obscures the finer details of each algorithm, but makes it easier to see the difference. The results are pretty consistently close to:

original parser duration: 144 ms
replacement parser duration: 1,519 ms
original is faster by: 1,374 ms

I want to call out that the replacement parser supports common table expressions (CTEs) whereas our current one does not.

If we instead run a 10,000 iteration test of a singular SQL statement that both support:

select
  foo,
  bar,
  baz
from some_db.some_table as a_table
where
    foo = 'foo'

We get a result like:

original parser duration: 5 ms
replacement parser duration: 44 ms
original is faster by: 38 ms

I think this is more representative of what would be encountered during normal operation.

I also think we could improve either algorithm with an LRU cache.

@jsumners-nr
Copy link
Contributor Author

To be clear: I don't think there is anything we can do to the regular expressions to satisfy the linter rule. The linter rule is keying off "zero or more" and "one or more" operators. Those are necessary for the type of parsing being done by these rules. So we either solidify the exceptions with an explanation, or come up with a different way to do the parsing (which is what I have proposed and benchmarked).

@jsumners-nr
Copy link
Contributor Author

jsumners-nr commented Jan 24, 2025

Doing some more research, I have discovered:

  1. Our sql parser and sql "obfuscator" are mutually exclusive tools.
  2. My comments should be attached to https://github.com/newrelic/node-newrelic/blob/cb16516e90a3dc2cefb98e6131a7243412aefbfc/lib/db/query-parsers/sql.js
  3. Obfuscation is only performed if a query meets our "slow query" threshold.
  4. .Net does not use these regular expressions to obfuscate SQL statements. Instead, the parse the string character by character to do the replacement. At least in JavaScript, this would not be an ideal approach because values to redact could contain multibyte characters that would need to be detected. https://github.com/newrelic/newrelic-dotnet-agent/blob/4309938c5d0593029398d7658a37b5ebe78c283c/src/Agent/NewRelic/Agent/Core/Database/SqlObfuscator.cs?plain=1

@jsumners-nr jsumners-nr self-assigned this Jan 29, 2025
@jsumners-nr jsumners-nr moved this from To do: In current sprint to In progress: Issues being worked on in Node.js Engineering Board Jan 29, 2025
@github-project-automation github-project-automation bot moved this from In progress: Issues being worked on to Done: Issues recently completed in Node.js Engineering Board Jan 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
points: 3 A few days
Projects
Status: Done: Issues recently completed
Development

Successfully merging a pull request may close this issue.

2 participants