
@rogpeppe commented Sep 20, 2025

NOTE: this PR is targeted at the stdlib-iterator branch as a potential improved v2 API, not current tip.

nerdsnipe alert: I'm still in relatively early phases of exploring the iterator idiom, so I couldn't resist having a go at simplifying `SplitTokensIntoLines`, which is now fully streaming (even though none of the call sites look like they can take advantage of that currently). Incidentally, it's now O(n) rather than O(n^2) too (that `for strings.Contains` loop looked expensive).

I removed the `EOF` variable, replacing it with an `IsZero` method which can be used to check whether a `Token` is valid or not. I don't see anywhere that seems to rely on yielding an EOF value: all the tests pass.

Aside: one convention from the stdlib that I realise I've absorbed but never actually expressed: iterators in Go are generally named as "sequences", so perhaps use `chroma.Seq` rather than `chroma.Iterator` (which also has the advantage of being shorter), or just inline `iter.Seq[chroma.Token]` everywhere - it's not that long and it's super-clear - and remove `Iterator` entirely.

Signed-off-by: Roger Peppe <[email protected]>
@rogpeppe
Copy link
Author

One other thought: a possibly simpler way to write `delegatingLexer.Tokenise`:

```go
func (d *delegatingLexer) Tokenise(options *TokeniseOptions, text string) (Iterator, error) {
	tokens, err := Coalesce(d.language).Tokenise(options, text)
	if err != nil {
		return nil, err
	}
	// Because we can't yield an error in the iterator, try
	// creating a root lexer, on the supposition that any
	// error it returns is more likely to be to do with the
	// iterator itself or the options than with the text
	// we'll pass it.
	if _, err := d.root.Tokenise(options, ""); err != nil {
		return nil, err
	}
	return func(yield func(Token) bool) {
		for t := range tokens {
			if t.Type != Other {
				if !yield(t) {
					return
				}
				continue
			}
			rootTokens, err := d.root.Tokenise(options, t.Value)
			if err != nil {
				continue // What else could we do? Panic?
			}
			for t := range rootTokens {
				if !yield(t) {
					return
				}
			}
		}
	}, nil
}
```

I don't think I've got the semantics quite right (the tests fail), but it seems conceptually simpler than merging two token streams. Also, with the original approach, ISTM there was no way for the root lexer to distinguish between two independent sections, because all their text is mashed together into one run of `Other` tokens.

Aside: idiomatic Go tends to use American spelling, so I'm used to typing `Tokenize` rather than `Tokenise`, even though I'm a Brit. Maybe consider aligning the spelling with convention to reduce API surprises?
