-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TypeScript] Add methods for BufferedTokenStream #4569
Open
Phlosioneer
wants to merge
3
commits into
antlr:dev
Choose a base branch
from
Phlosioneer:add-token-stream-functions
base: dev
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 1 commit
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,8 +1,26 @@ | ||
import { TokenStream } from './TokenStream'; | ||
import { Lexer } from "./Lexer"; | ||
import { TokenSource } from './TokenSource'; | ||
import { Token } from './Token'; | ||
|
||
export declare class BufferedTokenStream extends TokenStream { | ||
|
||
tokenSource: Lexer; | ||
tokenSource: TokenSource; | ||
tokens: Token[]; | ||
fetchedEof: boolean; | ||
|
||
constructor(source: TokenSource); | ||
setTokenSource(tokenSource: TokenSource): void; | ||
mark(): number; | ||
release(marker: number): void; | ||
reset(): void; | ||
seek(index: number): void; | ||
consume(): void; | ||
sync(i: number): boolean; | ||
fetch(n: number): number; | ||
LB(k: number): Token; | ||
nextTokenOnChannel(i: number, channel: number): number; | ||
previousTokenOnChannel(i: number, channel: number): number; | ||
|
||
protected lazyInit(): void; | ||
protected adjustSeekIndex(i: number): number; | ||
} |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need these protected members ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adjustSeekIndex
is required for subclassingBufferedTokenStream
, and it only exists for subclasses to use. SeeCommonTokenStream::adjustSeekIndex
for an example.lazyInit
is needed if the subclass tries to access any of the fields/properties safely, possibly before one of the BufferedTokenStream methods is called. Technically moreprotected
functions are available, likesetup
, but these two were the minimum required to correctly subclass the stream. Technically all the fields should be markedprotected
but that felt too restrictive, because the missing underscore convention (_index
,_tokens
) indicates the fields are public in JS.I've actually made a mistake, there should be an additional
index: number
line. I'll push that to the PR branch tomorrow.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why do you need to subclass
BufferedTokenStream
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I needed a token stream that did extra processing/rewriting, between the lexer and parser steps. It's a normal/supported thing to do, the Parser accepts any TokenStream as input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that normally done by a TokenStreamRewriter, designed just for that purpose ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TokenStreamRewriter is made with a C-style preprocessor in mind. Its output is text after rewrites have happened. It's not able to e.g. transform tokens, merge them together or split them apart, add context to tokens (technically subclass, but JS makes that very blurry), and generally fails to act as a stream transformer between the lexer and parser. You would need to run a file through a lexer, then the token stream rewriter, then through a different lexer, then through a parser. I'm not dealing with a preprocessor, I need a token stream transformer that outputs a stream of tokens.
I think this is a supported use case; otherwise there would be no need for a distinction between CommonTokenStream and BufferedTokenStream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To ground this discussion, we're discussing two
protected
annotations in a typescript definition file as a companion to the already-implemented, already-publicly-accessible javascript functions. The ship has already sailed about whether these functions were a "good" interface or if the buffer is a "good" extension point. These functions are already part of the public API on every other language runtime except typescript.Also, it should be noted that other runtimes put
lazyInit
asprotected
, neverprivate
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't take me wrong, I'm just trying to understand your use case, notably in the context of the antlr5 rewrite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't know antlr was getting another rewrite. I'll detail my use case, perhaps it will help. However, yesterday I rewrote the whole parser & lexer from scratch and achieved much better results than antlr was able to provide. This is for a personal project. Read only if you're interested, nothing here directly bears on the current protected member discussion.
The DSL I made has identifiers / names of the form
foo.bar.baz
, dot-separated paths. It also allows a shorthand starting with a dot,.bar.baz
, that meant the identifier was relative to some other identifier. There are block statements likewith foo { ... .bar.baz ... }
that should result infoo.bar.baz
at the end of parsing.OK, all fairly normal so far. The issue is that
with
blocks could be anywhere in the file. Inside any block or at the top level or nested within otherwith
statements. I tried to spec this out in the parser's .g4 but my simple ~60 line parser ballooned into a few hundred lines, to account for all the possible places wherewith
could happen; and the parser-listener became similarly complex. And that was still with a subset of the intended DSL.What I really needed was something like
with_generic<inner_parser>: 'with' (IDENT | REL_IDENT) '{' inner_parser* '}';
but that didn't seem possible. So I tried to move the handling ofwith
into its own parser / transformer. The rule is extremely simple if stated as its own grammar; it's nearly find-and-replace. I tried to use the provided stream-rewriter, but the translation back to text destroyed metadata that I was tracking in the tokens, particularly the IDENT / REL_IDENT tokens, about the path they represent. I can see the utility of that rewriter for large complex transformations, but this task is extremely simple.So I wrote a class that took in a token stream, located
with
blocks, and emitted correctedIDENT
tokens. No new tokens were added to the stream, so the line, column, and character indexes of each token were still valid. The text was also still valid; the relative-to information was stored as a separate field. But then I ran into the above issue, where the Parser would not accept just any input stream, only a buffered input stream; and the buffered input stream wouldn't accept an arbitrary token stream, just a Lexer. So I made adjustments until it worked.The DSL is something I made and could adjust, but it's being used as a game/puzzle element directly, so I couldn't just remove or dramatically change how the
with
blocks worked. What eventually drove me to rewrite the lexer & parser was frustration over the lexer not tokenizing some IDENT tokens (and not understanding how to properly handle error messages). Turns out, the grammar was actually ambiguous to lex, but I only understood that after rewriting. The lexer and parser ended up being less code, though; the full power and flexibility of antlr wasn't needed. The DSL is LL(1) with finite depth, so the state machine is extremely simple. And with access to generic functions, thewith
situation could be handled simply in the parser with a single rule-function (parseWith<T, U>(innerParser: (data: U) => T, data: U): void
) invoked within the generic block rule-function (parseBlock<T, U>(statementParser: (data: U) => T, data: U): void
which corresponds to antlr pseudocodeparse_block<inner>: '{' (with_block<inner> | inner)* '}';
). The most complex thing I had to do was implementing a prefix tree for lexing keywords. I probably could have just done a hash map with string keys but it was a fun challenge.I took a look at the antlr5 repo. I like the concept. Though webassembly seems to be splitting in two with the WASI stuff, even two very similar runtimes is far better than 10 unrelated ones. I'll poke around a bit, maybe try to get involved.