NEW (RegexEngine) @W-15985046@ Implement regex engine with trailing whitespace rule #34

ravipanguluri · 2024-06-27T17:07:44Z

In this PR:

Implement rule that checks for trailing whitespaces at the end of lines within an Apex class
runRules() method will run the trailing whitespace rule when pointed to a file and/or directory
The engine will emit appropriate error/logging messages when applicable

packages/code-analyzer-core/src/workspace.ts

packages/code-analyzer-regex-engine/src/RegexEnginePlugin.ts

stephen-carter-at-sf · 2024-06-27T19:05:44Z

packages/code-analyzer-regex-engine/src/executor.ts

+        for (const file of allFiles) {
+            const fileData = fs.statSync(file)
+            if (fileData.isFile()) {
+                codeLocations = codeLocations.concat(await this.scanFile(file));


Oh interesting use of codeLocations. We originally planned to have codeLocations be plural for path based violations... where 1 violation is associated with a path through the code. But I haven't considered having 1 violation (in this case for the 1 rule) be associated with multiple code locations.

Instead I would have thought we would want 1 violation per code location.

The question here becomes how VSCode in the future will use this. Note that the primaryLocationIndex was intended to be the single place that vscode could mark the violation for the user to see. This is another reason why I think we want 1 violation per code location in this case... so that vscode can show each of these locations separately as individual violations.

Thoughts? @jag-j

Oh and I just noticed... you aren't even creating one violation per file... you literally are creating 1 violation for all files in all locations. Interesting. This is a good use case of what external engine-api users might think.

But yeah, the CLI, and things like our HTML output, etc... might trim these locations thinking that multiple locations are for non-standard rule types. This makes me think I should add a validation step in Core: if an engine returns a violation for a Standard rule type that has multiple code locations, then Core will error to inform users to complain to the engine author that they did something wrong.

So for now, please change so that every regex match ends up with 1 code location per violation since this regex engine is returning standard rule types.

Does that mean I should increment (or somehow update) the primaryLocationIndex for each violation then? Or just leave them all to be the same? And yes, my interpretation was based on the fact that code locations was a list, so I thought that meant for each rule you collect all the code locations where there was a violation. I am changing it to have a single code location for every violation now.

No the primaryLocationIndex will be 0 (for 0 based indexing) when you have 1 code location for each violation.

For path based rules, the violation will be associated with a path and the primaryLocationIndex could be the first, the last, or somewhere in the middle to help clients like vscode know where to report the violation associated with the path. We'll see how that turns out when we eventually migrate over the salesforce graph engine.
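The "1 violation per code location" shape discussed in this thread could look roughly like the sketch below. The `CodeLocation` and `Violation` interfaces here are simplified, hypothetical stand-ins (the real engine-api types may carry more fields), and `toViolations` is an illustrative helper name, not something from the PR.

```typescript
// Hypothetical, simplified shapes; the real engine-api types may differ.
interface CodeLocation {
    file: string;
    startLine: number;
    startColumn: number;
}

interface Violation {
    ruleName: string;
    message: string;
    codeLocations: CodeLocation[];
    primaryLocationIndex: number;
}

// Wrap every matched location in its own violation instead of
// collecting all locations under one violation for the rule.
function toViolations(ruleName: string, message: string, locations: CodeLocation[]): Violation[] {
    return locations.map((location: CodeLocation): Violation => ({
        ruleName,
        message,
        codeLocations: [location],
        primaryLocationIndex: 0 // a single location is always the primary one
    }));
}
```

With this shape a client such as vscode can mark each match separately, since every violation points at exactly one place.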

stephen-carter-at-sf · 2024-06-27T19:08:43Z

packages/code-analyzer-regex-engine/src/executor.ts

+                message: "",
+                resourceUrls: [""]


resourceUrls are optional here. On the rule description we encourage them... but here they are optional (to be appended to the final list of urls from the rule description that the core framework gives back to the client for the violation). So remove resourceUrls altogether (delete line 26).

But we definitely want to fill in some sort of message here (even if it ends up being the same message as the rule itself).

stephen-carter-at-sf · 2024-06-27T19:15:56Z

packages/code-analyzer-regex-engine/src/executor.ts

+        const fileType: string = path.extname(fileName)
+        let codeLocations: CodeLocation[] = [];
+        if (fileType === APEX_CLASS_FILE_EXT){
+            codeLocations = await this.getViolationCodeLocations(fileName)
+        }
+
+        return codeLocations;


This could be:
return path.extname(fileName) === APEX_CLASS_FILE_EXT ? await this.getViolationCodeLocations(fileName) : [];

stephen-carter-at-sf · 2024-06-27T19:16:26Z

packages/code-analyzer-regex-engine/src/executor.ts

+
+    private async getViolationCodeLocations(fileName: string): Promise<CodeLocation[]> {
+        const codeLocations: CodeLocation[] = [];
+        const data: string = fs.readFileSync(fileName, {encoding: 'utf8', flag: 'r'})


Is 'r' the default of flag? If so, I don't think you need to specify it.
Also, instead of data, maybe say fileContents.

stephen-carter-at-sf · 2024-06-27T19:17:30Z

packages/code-analyzer-regex-engine/src/executor.ts

+
+        split_data.forEach((line: string, lineNum: number) => {
+            let codeLocation: CodeLocation;
+            regex = /(?<=[^ \t\r\n\f])[ \t]+$/


Since you are hard coding this, can't you move it to line 48 so you can make regex a constant?

Also given your line by line approach... do we care about (?<=[^ \t\r\n\f]) ?

That is... if a line is just a line of spaces... shouldn't it also be flagged as a violation?

Wouldn't /\s+$/ be sufficient for line by line approach?

I was under the impression that whitespace lines could be allowed in between lines of code for stylistic purposes. But I can change it to the simpler regex expression if that wasn't the intent.

Hmm... that's a good question. If a line contains nothing but whitespace characters (excluding the single newline case), is that considered trailing whitespace or not? Might want to ask the others what they think. It really comes down to the nature of the rule, which is to help folks save on the number of characters they have. So I would think we would want to detect this.
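To make the difference between the two expressions in this thread concrete, here is a small sketch that runs both against the same lines. The sample lines are made up for illustration; the two regular expressions are the ones quoted above.

```typescript
// The regex from the diff: trailing spaces/tabs only count when they
// follow a non-whitespace character on the same line.
const strictRegex: RegExp = /(?<=[^ \t\r\n\f])[ \t]+$/;
// The simpler suggestion: any run of whitespace at the end of the line.
const simpleRegex: RegExp = /\s+$/;

// Illustrative input: code with trailing spaces, a whitespace-only line, clean code.
const sampleLines: string[] = ["foo();  ", "    ", "bar();"];

const strictHits: boolean[] = sampleLines.map((line: string) => strictRegex.test(line));
const simpleHits: boolean[] = sampleLines.map((line: string) => simpleRegex.test(line));
```

Only the whitespace-only line differs: the lookbehind version skips it, while `/\s+$/` flags it.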

stephen-carter-at-sf · 2024-06-27T19:19:29Z

packages/code-analyzer-regex-engine/src/executor.ts

+    private async getViolationCodeLocations(fileName: string): Promise<CodeLocation[]> {
+        const codeLocations: CodeLocation[] = [];
+        const data: string = fs.readFileSync(fileName, {encoding: 'utf8', flag: 'r'})
+        const split_data: string[] = data.split("\n")


Is this what we want to split on? I would think windows would have \r characters left behind and thus triggering your regex to flag every single line.
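A quick sketch of the concern: with Windows line endings, splitting on "\n" leaves a "\r" at the end of each line, and a whitespace-based regex then matches it. Splitting on `/\r?\n/` is one possible alternative (an assumption here, not necessarily what the PR ended up doing). The sample text is made up.

```typescript
// Illustrative file contents with Windows (CRLF) line endings;
// only the second line has real trailing whitespace.
const windowsText: string = "a\r\nb  \r\nc";
const trailingWhitespace: RegExp = /\s+$/;

// Splitting on "\n" keeps the "\r", which \s then matches.
const naiveLines: string[] = windowsText.split("\n");
const naiveCount: number = naiveLines.filter((line: string) => trailingWhitespace.test(line)).length;

// Splitting on an optional "\r" followed by "\n" removes both characters.
const portableLines: string[] = windowsText.split(/\r?\n/);
const portableCount: number = portableLines.filter((line: string) => trailingWhitespace.test(line)).length;
```

Here `naiveCount` counts the clean first line as a violation because of its leftover "\r", while `portableCount` only counts the line with real trailing spaces.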

stephen-carter-at-sf · 2024-06-27T19:29:36Z

packages/code-analyzer-regex-engine/src/executor.ts

+        let regex: RegExp;
+        let match: RegExpExecArray | null;
+
+        split_data.forEach((line: string, lineNum: number) => {


Instead of looping line by line, did you consider applying a regular expression to the entire file?

Basically, thinking ahead with how we would add in more regular expression rules to this engine... we don't want the restriction to be that the regular expressions that we make are subject to just a single line do we?

So maybe we just search for something like /\s+$/m on the entire file... and then only if violations are found do we postprocess to find the line numbers (by using the position of the match and counting the number of newlines between the start and that match, etc). Just a suggestion that will then open up the possibilities for a generalized regular expression rule system.

Also... I wonder about the case where we could have a ton of newline characters at the end of a file... is that considered trailing whitespace?

Maybe something like /[ \t]+\r?\n|\r?\n{2,}$/g; instead? Can this be used to get all the locations of the trailing whitespaces?

I talked through the approach that I took to solving the problem with @jag-j, and one thing I'm worried about is that it could be somewhat complicated for users to make rules if we ingest all the regex expressions the same way. I also proposed to Jag that a delimiter be something that the user specifies in the config file as we expand, because there are also other cases, like finding a certain regex pattern across a block of code, that may require a different splitting procedure. However, I also do see the merit of having all the rules be composed the same way, so I am completely open to changing my implementation. Further, I also see that splitting on the "\n" is problematic.
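A minimal sketch of the whole-file approach suggested in this thread: run one global, multiline regex over the full contents, then derive the line and column from each match position by looking at the newlines before it. The pattern `/[ \t]+$/gm` is swapped in here instead of the suggested `/\s+$/m`, because `\s` can also consume newline characters and so a match may span lines. `MatchLocation` and `findTrailingWhitespace` are hypothetical names.

```typescript
// Hypothetical result shape; line and column are 1-based.
interface MatchLocation {
    line: number;
    column: number;
}

function findTrailingWhitespace(fileContents: string): MatchLocation[] {
    // "g" finds every match; "m" lets "$" match at the end of each line.
    const regex: RegExp = /[ \t]+$/gm;
    const locations: MatchLocation[] = [];
    let match: RegExpExecArray | null = regex.exec(fileContents);
    while (match !== null) {
        const textBefore: string = fileContents.slice(0, match.index);
        locations.push({
            // One more than the number of newlines before the match.
            line: textBefore.split("\n").length,
            // Distance from the previous newline (-1 if there is none).
            column: match.index - textBefore.lastIndexOf("\n")
        });
        match = regex.exec(fileContents);
    }
    return locations;
}
```

Because the position lookup only happens for actual matches, files without violations pay for a single regex pass and nothing else.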

jag-j · 2024-07-02T15:50:46Z

packages/code-analyzer-regex-engine/src/executor.ts

+    private async scanFile(fileName: string): Promise<Violation[]> {
+        const violations: Violation[] = [];
+        if (path.extname((fileName)) === APEX_CLASS_FILE_EXT) {
+            const fileContents: string = fs.readFileSync(fileName, {encoding: 'utf8'})


@ravipanguluri Not sure if you already discussed this with @stephen-carter-at-sf. Can this be async readFile instead?

I see ravi already submitted. But yeah, this probably should have been an async read for sure.

I thought you had said we were going to leave it as sync for now, which is what I told @jag-j during our 1-1, but I realize that was a fair bit prior to this point. I can make a new short work item for it and change it to async if you want. It shouldn't take too long. My bad on this.
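For reference, an async version of the read could be sketched with Node's `fs.promises.readFile`, as below. The `.cls` value for `APEX_CLASS_FILE_EXT` is an assumption (the constant's value is not shown in this PR), and `readApexFile` and `demo` are hypothetical helpers added only to make the sketch self-contained.

```typescript
import { promises as fs } from "fs";
import * as os from "os";
import * as path from "path";

// Assumed value; the PR references this constant but does not show it.
const APEX_CLASS_FILE_EXT: string = ".cls";

// Reads an Apex class file without blocking the event loop.
// Files with any other extension are skipped and yield undefined.
async function readApexFile(fileName: string): Promise<string | undefined> {
    if (path.extname(fileName) !== APEX_CLASS_FILE_EXT) {
        return undefined;
    }
    return fs.readFile(fileName, { encoding: "utf8" });
}

// Usage: write a temporary Apex file, then read it back asynchronously.
async function demo(): Promise<string | undefined> {
    const dir: string = await fs.mkdtemp(path.join(os.tmpdir(), "regex-engine-"));
    const file: string = path.join(dir, "Sample.cls");
    await fs.writeFile(file, "public class Sample {}  \n", "utf8");
    return readApexFile(file);
}
```

Since `scanFile` is already declared `async` in the diff, swapping `fs.readFileSync` for an awaited read would not change its signature.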

jag-j · 2024-07-02T15:55:21Z

packages/code-analyzer-regex-engine/src/executor.ts

+    private updateNewlineIndices(fileContents: string): void {
+        const newlineRegex: RegExp = new RegExp(this.lineSep, "g")
+        const matches = fileContents.matchAll(newlineRegex);
+        this.newlineIndexes = []


Nitpick - This is fine, but I would prefer that this method return newlineIndexes rather than keeping it as a class member. That makes testing a lot easier, since you can test each of these individual methods (updateNewlineIndices, getColumnNumber, getLineNumber) on its own.
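One way to read this suggestion is as a set of pure functions that take the newline indexes as an argument, sketched below. This is not the merged implementation: the function names and signatures are assumptions, and for simplicity the sketch searches for "\n" directly instead of building a RegExp from `this.lineSep` as the diff does.

```typescript
// Returns the index of every "\n" in the file contents.
function getNewlineIndexes(fileContents: string): number[] {
    const indexes: number[] = [];
    let position: number = fileContents.indexOf("\n");
    while (position !== -1) {
        indexes.push(position);
        position = fileContents.indexOf("\n", position + 1);
    }
    return indexes;
}

// 1-based line number: one more than the number of newlines before charIndex.
function getLineNumber(charIndex: number, newlineIndexes: number[]): number {
    let newlinesBefore: number = 0;
    while (newlinesBefore < newlineIndexes.length && newlineIndexes[newlinesBefore] < charIndex) {
        newlinesBefore++;
    }
    return newlinesBefore + 1;
}

// 1-based column number: offset from the first character of the line.
function getColumnNumber(charIndex: number, newlineIndexes: number[]): number {
    const line: number = getLineNumber(charIndex, newlineIndexes);
    const lineStart: number = line === 1 ? 0 : newlineIndexes[line - 2] + 1;
    return charIndex - lineStart + 1;
}
```

Because nothing here depends on instance state, each function can be unit tested with a literal string and a literal index array.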

jag-j

As discussed, push your pending changes and merge the PR.

salesforce-cla bot added the cla:signed label Jun 27, 2024

ravipanguluri marked this pull request as draft June 27, 2024 17:30

stephen-carter-at-sf reviewed Jun 27, 2024

WIP

17bb748

ravipanguluri force-pushed the rp/W-15985046 branch from f5c2b90 to 17bb748 Compare June 28, 2024 19:41

ravipanguluri added 3 commits July 1, 2024 14:36

WIP: refactor of code to handle newlines

c3ef066

WIP: regex that catches trailing whitespaces on lines of code

e5a7f18

NEW @W-15985046@ get OS specific line separator to pass windows tests

da76550

ravipanguluri marked this pull request as ready for review July 1, 2024 22:56

NEW @W-15985046@ cleanup unneeded tests/files

9e31057

jag-j reviewed Jul 2, 2024

jag-j approved these changes Jul 2, 2024

NEW @W-15985046@ changed newlineIndexes to not be class member

862278e

ravipanguluri merged commit 1f269c1 into dev Jul 2, 2024
5 checks passed

ravipanguluri deleted the rp/W-15985046 branch July 2, 2024 17:28
