Merged
158 changes: 158 additions & 0 deletions lldb/include/lldb/ValueObject/DILLexer.h
@@ -0,0 +1,158 @@
//===-- DILLexer.h ----------------------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLDB_VALUEOBJECT_DILLEXER_H_
#define LLDB_VALUEOBJECT_DILLEXER_H_
Collaborator:

I believe our headers don't have the ending underscore.

Suggested change
#ifndef LLDB_VALUEOBJECT_DILLEXER_H_
#define LLDB_VALUEOBJECT_DILLEXER_H_
#ifndef LLDB_VALUEOBJECT_DILLEXER_H
#define LLDB_VALUEOBJECT_DILLEXER_H


#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/iterator_range.h"
#include "llvm/Support/Error.h"
#include <cstdint>
#include <limits.h>
Collaborator:

It fits in nicer with the other headers (though maybe you don't need it if you remove the UINT_MAX thing below)

Suggested change
#include <limits.h>
#include <climits>

#include <memory>
#include <string>
#include <vector>

namespace lldb_private {

namespace dil {
Collaborator:

Suggested change
namespace lldb_private {
namespace dil {
namespace lldb_private::dil {


/// Class defining the tokens generated by the DIL lexer and used by the
/// DIL parser.
class Token {
public:
enum Kind {
coloncolon,
eof,
identifier,
invalid,
kw_namespace,
l_paren,
none,
r_paren,
unknown,
Collaborator:

It looks like the only remaining use of the "unknown" token is to construct "empty" tokens in the tests. I think all of those cases could be avoided by just declaring the Token variable later in the code.

Contributor Author:

That works for this PR, but the parser has a token member variable, where it stores the current token it's working on. When I create the parser I have to initialize the member variable token to some value; the only one that makes sense is "unknown".

Collaborator:

Maybe that variable should be optional<Token> then? Though I'm not sure why it's needed, as things could just call lexer.GetCurrentToken() instead. In either case, I'd like to remove this from this patch as it's not needed here. If it turns out to be the best solution to the parser's needs, we can add the extra type then.

Contributor Author:

I've just thought of an alternative implementation I think I can use that will make this work. :-) I'll test that and if it works I'll get rid of the 'unknown' token type.

Collaborator:

The unknown token is still here. Are you sure you uploaded the right version of the patch?

};

Token(Kind kind, std::string spelling, uint32_t start)
: m_kind(kind), m_spelling(spelling), m_start_pos(start) {}
Collaborator:

Suggested change
: m_kind(kind), m_spelling(spelling), m_start_pos(start) {}
: m_kind(kind), m_spelling(std::move(spelling)), m_start_pos(start) {}


Token() : m_kind(Kind::none), m_spelling(""), m_start_pos(0) {}

void SetKind(Kind kind) { m_kind = kind; }
Collaborator:

Could we get rid of these (and of the none token kind)? Ideally, I'd like to treat the token as a value type (so you can assign and copy it, but not mess with it), and one that is always valid (no uninitialized (none) state).

Contributor Author (@cmtice, Jan 28, 2025):

I can get rid of the constructor on line 44, and I can get rid of the none token, but I need to keep SetKind, in order to separate ">>" into two ">"s, in the case where we have determined that ">>" is actually two single angle brackets closing a template, such as std::vector<std::pair<int,int>>. In that case, I'll also need to insert a token into the middle of the tokens vector -- not great, but may not happen that often, and the vector shouldn't be that long.

Collaborator:

That doesn't remove your ability to modify a token, just how you do it (you update the token as a whole, instead of doing it piecewise). So, instead of something like token.SetKind(right_angle); token.SetSpelling(">"); you'd do token = Token(right_angle, ">", token.GetPosition()).

That said, modification of the token stream and multi-pass (tentative) parsing doesn't sound like an ideal combination. How do you ensure that the modification does not leak into subsequent actions if the tentative parse is aborted?

Even if you can, the fact that you have to consider this makes me uneasy. Would it be possible to do this without modifying the token stream? So like, when the parser (I'm assuming this happens in the parser, as you need more context to disambiguate these) encounters a ">>" token and the current context allows you to treat this as a double closing template bracket, you act as if you encountered two ">" tokens, but without actually updating the tokens in the lexer.


Kind GetKind() const { return m_kind; }

std::string GetSpelling() const { return m_spelling; }

uint32_t GetLength() const { return m_spelling.size(); }

bool Is(Kind kind) const { return m_kind == kind; }

bool IsNot(Kind kind) const { return m_kind != kind; }

bool IsOneOf(Kind kind1, Kind kind2) const { return Is(kind1) || Is(kind2); }

template <typename... Ts> bool IsOneOf(Kind kind, Ts... Ks) const {
return Is(kind) || IsOneOf(Ks...);
}

uint32_t GetLocation() const { return m_start_pos; }

static llvm::StringRef GetTokenName(Kind kind);

private:
Kind m_kind;
std::string m_spelling;
uint32_t m_start_pos; // within entire expression string
};

/// Class for doing the simple lexing required by DIL.
class DILLexer {
public:
DILLexer(llvm::StringRef dil_expr) : m_expr(dil_expr) {
m_cur_pos = m_expr.begin();
// Use UINT_MAX to indicate invalid/uninitialized value.
m_tokens_idx = UINT_MAX;
m_invalid_token = Token(Token::invalid, "", 0);
}

llvm::Expected<bool> LexAll();

/// Return the lexed token N+1 positions ahead of the 'current' token
Collaborator:

I feel like there are too many ways to navigate the token stream here. You can either call GetCurrentToken+IncrementTokenIdx, or GetNextToken (which I guess increments the index automatically), or LookAhead+AcceptLookAhead.

I think it would be better to start with something simple (we can add more or revamp the existing API if it turns out to be clunky). What would you say to something like:

const Token &LookAhead(uint32_t N /* add `=1` if you want*/);
const Token &GetCurrentToken() { return LookAhead(0); } // just a fancy name for a look ahead of zero
void Advance(uint32_t N = 1); // advance the token stream

Contributor Author:

The parser really needs a way to save & restore/reset the token index, because there are places in the parser where it does tentative parsing & then decides to rollback. It does so by saving the current token index, then doing the tentative parsing (which can advance the index some number of tokens), and then (for rolling back) setting the current token index back to the saved value.

So I don't think the simple API you've outlined above would be sufficient.

Collaborator:

I'm fine with tentative parse and roll back APIs. I'm commenting on the other APIs which advance through the token stream linearly (but in a very baroque fashion). IOW, my proposal was to replace GetCurrentToken, IncrementTokenIdx, GetNextToken, LookAhead and AcceptLookAhead with the three functions above (exact names TBD), and keep GetCurrentTokenIdx and ResetTokenIdx as they are.

/// being handled by the DIL parser.
const Token &LookAhead(uint32_t N);

const Token &AcceptLookAhead(uint32_t N);

const Token &GetNextToken();

/// Return the index for the 'current' token being handled by the DIL parser.
uint32_t GetCurrentTokenIdx() { return m_tokens_idx; }

/// Return the current token to be handled by the DIL parser.
const Token &GetCurrentToken() { return m_lexed_tokens[m_tokens_idx]; }

uint32_t NumLexedTokens() { return m_lexed_tokens.size(); }

/// Update the index for the 'current' token, to point to the next lexed
/// token.
bool IncrementTokenIdx() {
Member:

Shouldn't Lex() do this automatically?

Contributor Author:

Not really. Lex() is called from LookAhead, when we definitely do not want to automatically increment the token index.

if (m_tokens_idx >= m_lexed_tokens.size() - 1)
return false;

m_tokens_idx++;
return true;
}

/// Set the index for the 'current' token (to be handled by the parser)
/// to a particular position. Used for either committing 'look ahead' parsing
/// or rolling back tentative parsing.
bool ResetTokenIdx(uint32_t new_value) {
if (new_value > m_lexed_tokens.size() - 1)
return false;

m_tokens_idx = new_value;
return true;
}
Collaborator:

Suggested change
bool ResetTokenIdx(uint32_t new_value) {
if (new_value > m_lexed_tokens.size() - 1)
return false;
m_tokens_idx = new_value;
return true;
}
void ResetTokenIdx(uint32_t new_value) {
assert(new_value < m_lexed_tokens.size());
m_tokens_idx = new_value;
}

(AIUI, the only usage of this function will be to restore a previous (and valid) position, so any error here is definitely a bug)


uint32_t GetLocation() { return m_cur_pos - m_expr.begin(); }
Collaborator:

should this be private?


private:
llvm::Expected<Token> Lex();

llvm::iterator_range<llvm::StringRef::iterator> IsWord();

/// Update 'result' with the other parameter values, create a
/// duplicate token, and push the duplicate token onto the vector of
/// lexed tokens.
void UpdateLexedTokens(Token &result, Token::Kind tok_kind,
std::string tok_str, uint32_t tok_pos);

// The input string we are lexing & parsing.
llvm::StringRef m_expr;

// The current position of the lexer within m_expr (the character position,
// within the string, of the next item to be lexed).
llvm::StringRef::iterator m_cur_pos;

// Holds all of the tokens lexed so far.
std::vector<Token> m_lexed_tokens;

// Index into m_lexed_tokens; indicates which token the DIL parser is
// currently trying to parse/handle.
uint32_t m_tokens_idx;

// "invalid" token; to be returned by lexer when 'look ahead' fails.
Token m_invalid_token;
};

} // namespace dil

} // namespace lldb_private

#endif // LLDB_VALUEOBJECT_DILLEXER_H_
1 change: 1 addition & 0 deletions lldb/source/ValueObject/CMakeLists.txt
@@ -1,4 +1,5 @@
add_lldb_library(lldbValueObject
DILLexer.cpp
ValueObject.cpp
ValueObjectCast.cpp
ValueObjectChild.cpp
Expand Down
189 changes: 189 additions & 0 deletions lldb/source/ValueObject/DILLexer.cpp
@@ -0,0 +1,189 @@
//===-- DILLexer.cpp ------------------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
// This implements the lexer for the Data Inspection Language (DIL), and
// its helper functions, which will eventually underlie the 'frame variable'
// command. The language that this lexer recognizes is described in
// lldb/docs/dil-expr-lang.ebnf
//
//===----------------------------------------------------------------------===//

#include "lldb/ValueObject/DILLexer.h"
#include "lldb/Utility/Status.h"
#include "llvm/ADT/StringSwitch.h"

namespace lldb_private {

namespace dil {

llvm::StringRef Token::GetTokenName(Kind kind) {
switch (kind) {
case Kind::coloncolon:
return "coloncolon";
case Kind::eof:
return "eof";
case Kind::identifier:
return "identifier";
case Kind::invalid:
return "invalid";
case Kind::kw_namespace:
return "namespace";
case Kind::l_paren:
return "l_paren";
case Kind::none:
return "none";
case Kind::r_paren:
return "r_paren";
case Kind::unknown:
return "unknown";
}
}

static bool IsLetter(char c) {
return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z');
}

static bool IsDigit(char c) { return '0' <= c && c <= '9'; }

// A word starts with a letter, underscore, or dollar sign, followed by
// letters ('a'..'z','A'..'Z'), digits ('0'..'9'), and/or underscores.
llvm::iterator_range<llvm::StringRef::iterator> DILLexer::IsWord() {
llvm::StringRef::iterator start = m_cur_pos;
bool dollar_start = false;

// Must not start with a digit.
if (m_cur_pos == m_expr.end() || IsDigit(*m_cur_pos))
return llvm::make_range(m_cur_pos, m_cur_pos);

// First character *may* be a '$', for a register name or convenience
// variable.
if (*m_cur_pos == '$') {
dollar_start = true;
++m_cur_pos;
}

// Contains only letters, digits or underscores
for (; m_cur_pos != m_expr.end(); ++m_cur_pos) {
char c = *m_cur_pos;
if (!IsLetter(c) && !IsDigit(c) && c != '_')
break;
}

// If first char is '$', make sure there's at least one more char, or it's
// invalid.
if (dollar_start && (m_cur_pos - start <= 1)) {
m_cur_pos = start;
return llvm::make_range(start, start); // Empty range
}

return llvm::make_range(start, m_cur_pos);
}

void DILLexer::UpdateLexedTokens(Token &result, Token::Kind tok_kind,
std::string tok_str, uint32_t tok_pos) {
Token new_token(tok_kind, tok_str, tok_pos);
result = new_token;
m_lexed_tokens.push_back(std::move(new_token));
}

llvm::Expected<bool> DILLexer::LexAll() {
bool done = false;
while (!done) {
auto tok_or_err = Lex();
if (!tok_or_err)
return tok_or_err.takeError();
Token token = *tok_or_err;
if (token.GetKind() == Token::eof) {
done = true;
}
}
return true;
}

llvm::Expected<Token> DILLexer::Lex() {
Token result;

// Skip over whitespace (spaces).
while (m_cur_pos != m_expr.end() && *m_cur_pos == ' ')
m_cur_pos++;

// Check to see if we've reached the end of our input string.
if (m_cur_pos == m_expr.end()) {
UpdateLexedTokens(result, Token::eof, "", (uint32_t)m_expr.size());
return result;
}

uint32_t position = m_cur_pos - m_expr.begin();
llvm::StringRef::iterator start = m_cur_pos;
llvm::iterator_range<llvm::StringRef::iterator> word_range = IsWord();
if (!word_range.empty()) {
uint32_t length = word_range.end() - word_range.begin();
llvm::StringRef word(m_expr.substr(position, length));
// We will be adding more keywords here in the future...
Token::Kind kind = llvm::StringSwitch<Token::Kind>(word)
.Case("namespace", Token::kw_namespace)
.Default(Token::identifier);
UpdateLexedTokens(result, kind, word.str(), position);
return result;
}

m_cur_pos = start;
llvm::StringRef remainder(m_expr.substr(position, m_expr.end() - m_cur_pos));
std::vector<std::pair<Token::Kind, const char *>> operators = {
{Token::l_paren, "("},
{Token::r_paren, ")"},
{Token::coloncolon, "::"},
};
for (auto [kind, str] : operators) {
if (remainder.consume_front(str)) {
m_cur_pos += strlen(str);
UpdateLexedTokens(result, kind, str, position);
return result;
}
}

// Unrecognized character(s) in string; unable to lex it.
Status error("Unable to lex input string");
return error.ToError();
}
Collaborator:

Sorry for rewriting this for you, but I figured it's easier than explaining everything in the abstract:

The main things I wanted to achieve by this are:

  • no half-initialized state (object constructed, but LexAll not called). The object is always constructed fully parsed. It's basically what's described here, but even better because there isn't even a privately-visible half-initialized state. (Since the only state of the lexer is basically "the remainder of the string", I figured it's easier to pass it as arguments and construct the lexer only when it's done. This also lets us get rid of the `m_cur_pos` member, which is only used in the initialization stage.)
  • I doubled down on the StringRef representation. I see you've partially used it, but that still meant that there were some awkward conversions between position-in-the-string and StringRef representations. Now they're gone. I also realized that iterator_range<StringRef::iterator> is just an (unnecessarily) fancy name for StringRef, so I just use that throughout.
  • no more UpdateLexedTokens. Just using Token as a value type. The overall programming style is also more functional - fewer side effects, more return values.

The thing I did not do (but I still think would be better) is to replace the std::vector<std::pair<>> keyword representation with the "constexpr array of pairs" I had in my original suggestion. I think that's better because the vector means you'll be constructing a new vector object every time you call this function. That's going to impact performance more (although it will still probably be unnoticeable) than any StringSwitch usage, as it causes a memory allocation. If you think the use of a C array is obsolete, you can also use a constexpr std::initializer_list<std::pair<>>, but I find that just adds an unnecessary level of boilerplate.

Suggested change
llvm::Expected<bool> DILLexer::LexAll() {
bool done = false;
while (!done) {
auto tok_or_err = Lex();
if (!tok_or_err)
return tok_or_err.takeError();
Token token = *tok_or_err;
if (token.GetKind() == Token::eof) {
done = true;
}
}
return true;
}
llvm::Expected<Token> DILLexer::Lex() {
Token result;
// Skip over whitespace (spaces).
while (m_cur_pos != m_expr.end() && *m_cur_pos == ' ')
m_cur_pos++;
// Check to see if we've reached the end of our input string.
if (m_cur_pos == m_expr.end()) {
UpdateLexedTokens(result, Token::eof, "", (uint32_t)m_expr.size());
return result;
}
uint32_t position = m_cur_pos - m_expr.begin();
llvm::StringRef::iterator start = m_cur_pos;
llvm::iterator_range<llvm::StringRef::iterator> word_range = IsWord();
if (!word_range.empty()) {
uint32_t length = word_range.end() - word_range.begin();
llvm::StringRef word(m_expr.substr(position, length));
// We will be adding more keywords here in the future...
Token::Kind kind = llvm::StringSwitch<Token::Kind>(word)
.Case("namespace", Token::kw_namespace)
.Default(Token::identifier);
UpdateLexedTokens(result, kind, word.str(), position);
return result;
}
m_cur_pos = start;
llvm::StringRef remainder(m_expr.substr(position, m_expr.end() - m_cur_pos));
std::vector<std::pair<Token::Kind, const char *>> operators = {
{Token::l_paren, "("},
{Token::r_paren, ")"},
{Token::coloncolon, "::"},
};
for (auto [kind, str] : operators) {
if (remainder.consume_front(str)) {
m_cur_pos += strlen(str);
UpdateLexedTokens(result, kind, str, position);
return result;
}
}
// Unrecognized character(s) in string; unable to lex it.
Status error("Unable to lex input string");
return error.ToError();
}
llvm::Expected<DILLexer> DILLexer::Create(llvm::StringRef expr) {
std::vector<Token> tokens;
llvm::StringRef remainder = expr;
do {
if (llvm::Expected<Token> t = Lex(expr, remainder))
tokens.push_back(std::move(*t));
else
return t.takeError();
} while (tokens.back().GetKind() != Token::eof);
return DILLexer(std::move(tokens)); // calling a private constructor
}
static llvm::Expected<Token> Lex(llvm::StringRef expr, llvm::StringRef &remainder) {
// Skip over whitespace.
remainder = remainder.ltrim();
size_t position = remainder.data()-expr.data();
// Check to see if we've reached the end of our input string.
if (remainder.empty())
return Token(Token::eof, "", position);
llvm::StringRef word = IsWord(remainder); // automatically updates `remainder`, you may be able to use things like `StringRef::drop_while` in the implementation
if (!word.empty()) {
// We will be adding more keywords here in the future...
Token::Kind kind = llvm::StringSwitch<Token::Kind>(word)
.Case("namespace", Token::kw_namespace)
.Default(Token::identifier);
return Token(kind, word.str(), position);
}
std::vector<std::pair<Token::Kind, const char *>> operators = {
{Token::l_paren, "("},
{Token::r_paren, ")"},
{Token::coloncolon, "::"},
};
for (auto [kind, str] : operators) {
if (remainder.consume_front(str))
return Token(kind, str, position);
}
return llvm::createStringError("Unable to lex input string");
}


const Token &DILLexer::LookAhead(uint32_t N) {
if (m_tokens_idx + N + 1 < m_lexed_tokens.size())
Collaborator:

You didn't say anything about what you make of my suggestion to use an "infinite" stream of eof tokens at the end. The current implementation uses an infinite stream of invalid tokens, but I doubt that anything cares whether it's working with an invalid or eof token (if you're looking ahead, chances are you know what you're expecting to find, and you just want to know whether your token matches that expectation).

That would allow us to get rid of another pseudo-token and the m_invalid_token member.

Contributor Author:

I think the "infinite" stream of eof tokens at the end is ok; I will give it a try. :-)

return m_lexed_tokens[m_tokens_idx + N + 1];

return m_invalid_token;
}

const Token &DILLexer::AcceptLookAhead(uint32_t N) {
if (m_tokens_idx + N + 1 > m_lexed_tokens.size())
return m_invalid_token;

m_tokens_idx += N + 1;
return m_lexed_tokens[m_tokens_idx];
}

const Token &DILLexer::GetNextToken() {
if (m_tokens_idx == UINT_MAX)
m_tokens_idx = 0;
else
m_tokens_idx++;

// Return the next token in the vector of lexed tokens.
if (m_tokens_idx < m_lexed_tokens.size())
return m_lexed_tokens[m_tokens_idx];

// We're already at/beyond the end of our lexed tokens. If the last token
// is an eof token, return it.
if (m_lexed_tokens[m_lexed_tokens.size() - 1].GetKind() == Token::eof)
return m_lexed_tokens[m_lexed_tokens.size() - 1];

// Return the invalid token.
return m_invalid_token;
}

} // namespace dil

} // namespace lldb_private
1 change: 1 addition & 0 deletions lldb/unittests/ValueObject/CMakeLists.txt
@@ -1,5 +1,6 @@
add_lldb_unittest(LLDBValueObjectTests
DumpValueObjectOptionsTests.cpp
DILLexerTests.cpp

LINK_LIBS
lldbValueObject
Expand Down