[LLDB] Add Lexer (with tests) for DIL (Data Inspection Language). #123521
base: main
Conversation
This adds the basic lexer, with unittests, for the Data Inspection Language (DIL) -- see https://discourse.llvm.org/t/rfc-data-inspection-language/69893. This version of the lexer only handles local variables and namespaces, and is designed to work with llvm#120971.
@llvm/pr-subscribers-lldb

Author: None (cmtice)

Changes: This adds the basic lexer, with unittests, for the Data Inspection Language (DIL) -- see https://discourse.llvm.org/t/rfc-data-inspection-language/69893. This version of the lexer only handles local variables and namespaces, and is designed to work with #120971.

Full diff: https://github.com/llvm/llvm-project/pull/123521.diff

4 Files Affected:
diff --git a/lldb/include/lldb/ValueObject/DILLexer.h b/lldb/include/lldb/ValueObject/DILLexer.h
new file mode 100644
index 00000000000000..45c506b2f4106d
--- /dev/null
+++ b/lldb/include/lldb/ValueObject/DILLexer.h
@@ -0,0 +1,156 @@
+//===-- DILLexer.h ----------------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLDB_VALUEOBJECT_DILLEXER_H_
+#define LLDB_VALUEOBJECT_DILLEXER_H_
+
+#include "llvm/ADT/StringRef.h"
+#include <cstdint>
+#include <limits.h>
+#include <memory>
+#include <string>
+#include <vector>
+
+namespace lldb_private {
+
+namespace dil {
+
+enum class TokenKind {
+ coloncolon,
+ eof,
+ identifier,
+ invalid,
+ kw_namespace,
+ l_paren,
+ none,
+ r_paren,
+ unknown,
+};
+
+/// Class defining the tokens generated by the DIL lexer and used by the
+/// DIL parser.
+class DILToken {
+public:
+ DILToken(dil::TokenKind kind, std::string spelling, uint32_t start)
+ : m_kind(kind), m_spelling(spelling), m_start_pos(start) {}
+
+ DILToken() : m_kind(dil::TokenKind::none), m_spelling(""), m_start_pos(0) {}
+
+ void setKind(dil::TokenKind kind) { m_kind = kind; }
+ dil::TokenKind getKind() const { return m_kind; }
+
+ std::string getSpelling() const { return m_spelling; }
+
+ uint32_t getLength() const { return m_spelling.size(); }
+
+ bool is(dil::TokenKind kind) const { return m_kind == kind; }
+
+ bool isNot(dil::TokenKind kind) const { return m_kind != kind; }
+
+ bool isOneOf(dil::TokenKind kind1, dil::TokenKind kind2) const {
+ return is(kind1) || is(kind2);
+ }
+
+ template <typename... Ts> bool isOneOf(dil::TokenKind kind, Ts... Ks) const {
+ return is(kind) || isOneOf(Ks...);
+ }
+
+ uint32_t getLocation() const { return m_start_pos; }
+
+ void setValues(dil::TokenKind kind, std::string spelling, uint32_t start) {
+ m_kind = kind;
+ m_spelling = spelling;
+ m_start_pos = start;
+ }
+
+ static const std::string getTokenName(dil::TokenKind kind);
+
+private:
+ dil::TokenKind m_kind;
+ std::string m_spelling;
+ uint32_t m_start_pos; // within entire expression string
+};
+
+/// Class for doing the simple lexing required by DIL.
+class DILLexer {
+public:
+ DILLexer(llvm::StringRef dil_expr) : m_expr(dil_expr.str()) {
+ m_cur_pos = m_expr.begin();
+ // Use UINT_MAX to indicate invalid/uninitialized value.
+ m_tokens_idx = UINT_MAX;
+ }
+
+ bool Lex(DILToken &result, bool look_ahead = false);
+
+ bool Is_Word(std::string::iterator start, uint32_t &length);
+
+ uint32_t GetLocation() { return m_cur_pos - m_expr.begin(); }
+
+  /// Update 'result' with the other parameter values, create a
+ /// duplicate token, and push the duplicate token onto the vector of
+ /// lexed tokens.
+ void UpdateLexedTokens(DILToken &result, dil::TokenKind tok_kind,
+ std::string tok_str, uint32_t tok_pos);
+
+ /// Return the lexed token N+1 positions ahead of the 'current' token
+ /// being handled by the DIL parser.
+ const DILToken &LookAhead(uint32_t N);
+
+ const DILToken &AcceptLookAhead(uint32_t N);
+
+ /// Return the index for the 'current' token being handled by the DIL parser.
+ uint32_t GetCurrentTokenIdx() { return m_tokens_idx; }
+
+ /// Return the current token to be handled by the DIL parser.
+ DILToken &GetCurrentToken() { return m_lexed_tokens[m_tokens_idx]; }
+
+ /// Update the index for the 'current' token, to point to the next lexed
+ /// token.
+ bool IncrementTokenIdx() {
+ if (m_tokens_idx >= m_lexed_tokens.size() - 1)
+ return false;
+
+ m_tokens_idx++;
+ return true;
+ }
+
+ /// Set the index for the 'current' token (to be handled by the parser)
+ /// to a particular position. Used for either committing 'look ahead' parsing
+ /// or rolling back tentative parsing.
+ bool ResetTokenIdx(uint32_t new_value) {
+ if (new_value > m_lexed_tokens.size() - 1)
+ return false;
+
+ m_tokens_idx = new_value;
+ return true;
+ }
+
+private:
+ // The input string we are lexing & parsing.
+ std::string m_expr;
+
+ // The current position of the lexer within m_expr (the character position,
+ // within the string, of the next item to be lexed).
+ std::string::iterator m_cur_pos;
+
+ // Holds all of the tokens lexed so far.
+ std::vector<DILToken> m_lexed_tokens;
+
+ // Index into m_lexed_tokens; indicates which token the DIL parser is
+ // currently trying to parse/handle.
+ uint32_t m_tokens_idx;
+
+ // "invalid" token; to be returned by lexer when 'look ahead' fails.
+ DILToken m_invalid_token;
+};
+
+} // namespace dil
+
+} // namespace lldb_private
+
+#endif // LLDB_VALUEOBJECT_DILLEXER_H_
diff --git a/lldb/source/ValueObject/DILLexer.cpp b/lldb/source/ValueObject/DILLexer.cpp
new file mode 100644
index 00000000000000..4c2b0b1813bb96
--- /dev/null
+++ b/lldb/source/ValueObject/DILLexer.cpp
@@ -0,0 +1,205 @@
+//===-- DILLexer.cpp ------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+// This implements the lexer for the Data Inspection Language (DIL), and its
+// helper functions, which will eventually underlie the 'frame variable'
+// command. The language that this lexer recognizes is described in
+// lldb/docs/dil-expr-lang.ebnf
+//
+//===----------------------------------------------------------------------===//
+
+#include "lldb/ValueObject/DILLexer.h"
+#include "llvm/ADT/StringMap.h"
+
+namespace lldb_private {
+
+namespace dil {
+
+// For fast keyword lookup. More keywords will be added later.
+const llvm::StringMap<dil::TokenKind> Keywords = {
+ {"namespace", dil::TokenKind::kw_namespace},
+};
+
+const std::string DILToken::getTokenName(dil::TokenKind kind) {
+ switch (kind) {
+ case dil::TokenKind::coloncolon:
+ return "coloncolon";
+ case dil::TokenKind::eof:
+ return "eof";
+ case dil::TokenKind::identifier:
+ return "identifier";
+ case dil::TokenKind::kw_namespace:
+ return "namespace";
+ case dil::TokenKind::l_paren:
+ return "l_paren";
+ case dil::TokenKind::r_paren:
+ return "r_paren";
+ case dil::TokenKind::unknown:
+ return "unknown";
+ default:
+ return "token_name";
+ }
+}
+
+static bool Is_Letter(char c) {
+ if (('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z'))
+ return true;
+ return false;
+}
+
+static bool Is_Digit(char c) { return ('0' <= c && c <= '9'); }
+
+// A word starts with a letter, underscore, or dollar sign, followed by
+// letters ('a'..'z','A'..'Z'), digits ('0'..'9'), and/or underscores.
+bool DILLexer::Is_Word(std::string::iterator start, uint32_t &length) {
+ bool done = false;
+ bool dollar_start = false;
+
+ // Must not start with a digit.
+ if (m_cur_pos == m_expr.end() || Is_Digit(*m_cur_pos))
+ return false;
+
+ // First character *may* be a '$', for a register name or convenience
+ // variable.
+ if (*m_cur_pos == '$') {
+ dollar_start = true;
+ ++m_cur_pos;
+ length++;
+ }
+
+ // Contains only letters, digits or underscores
+ for (; m_cur_pos != m_expr.end() && !done; ++m_cur_pos) {
+ char c = *m_cur_pos;
+ if (!Is_Letter(c) && !Is_Digit(c) && c != '_') {
+ done = true;
+ break;
+ } else
+ length++;
+ }
+
+ if (dollar_start && length > 1) // Must have something besides just '$'
+ return true;
+
+ if (!dollar_start && length > 0)
+ return true;
+
+ // Not a valid word, so re-set the lexing position.
+ m_cur_pos = start;
+ return false;
+}
+
+void DILLexer::UpdateLexedTokens(DILToken &result, dil::TokenKind tok_kind,
+ std::string tok_str, uint32_t tok_pos) {
+ DILToken new_token;
+ result.setValues(tok_kind, tok_str, tok_pos);
+ new_token = result;
+ m_lexed_tokens.push_back(std::move(new_token));
+}
+
+bool DILLexer::Lex(DILToken &result, bool look_ahead) {
+ bool retval = true;
+
+ if (!look_ahead) {
+ // We're being asked for the 'next' token, and not a part of a LookAhead.
+ // Check to see if we've already lexed it and pushed it onto our tokens
+ // vector; if so, return the next token from the vector, rather than doing
+ // more lexing.
+ if ((m_tokens_idx != UINT_MAX) &&
+ (m_tokens_idx < m_lexed_tokens.size() - 1)) {
+ result = m_lexed_tokens[m_tokens_idx + 1];
+ return retval;
+ }
+ }
+
+ // Skip over whitespace (spaces).
+ while (m_cur_pos != m_expr.end() && *m_cur_pos == ' ')
+ m_cur_pos++;
+
+ // Check to see if we've reached the end of our input string.
+ if (m_cur_pos == m_expr.end()) {
+ UpdateLexedTokens(result, dil::TokenKind::eof, "", m_expr.length());
+ return retval;
+ }
+
+ uint32_t position = m_cur_pos - m_expr.begin();
+ ;
+ std::string::iterator start = m_cur_pos;
+ uint32_t length = 0;
+ if (Is_Word(start, length)) {
+ dil::TokenKind kind;
+ std::string word = m_expr.substr(position, length);
+ auto iter = Keywords.find(word);
+ if (iter != Keywords.end())
+ kind = iter->second;
+ else
+ kind = dil::TokenKind::identifier;
+
+ UpdateLexedTokens(result, kind, word, position);
+ return true;
+ }
+
+ switch (*m_cur_pos) {
+ case '(':
+ m_cur_pos++;
+ UpdateLexedTokens(result, dil::TokenKind::l_paren, "(", position);
+ return true;
+ case ')':
+ m_cur_pos++;
+ UpdateLexedTokens(result, dil::TokenKind::r_paren, ")", position);
+ return true;
+ case ':':
+ if (position + 1 < m_expr.size() && m_expr[position + 1] == ':') {
+ m_cur_pos += 2;
+ UpdateLexedTokens(result, dil::TokenKind::coloncolon, "::", position);
+ return true;
+ }
+ break;
+ default:
+ break;
+ }
+ // Empty Token
+ result.setValues(dil::TokenKind::none, "", m_expr.length());
+ return false;
+}
+
+const DILToken &DILLexer::LookAhead(uint32_t N) {
+ uint32_t extra_lexed_tokens = m_lexed_tokens.size() - m_tokens_idx - 1;
+
+ if (N + 1 < extra_lexed_tokens)
+ return m_lexed_tokens[m_tokens_idx + N + 1];
+
+ uint32_t remaining_tokens =
+ (m_tokens_idx + N + 1) - m_lexed_tokens.size() + 1;
+
+ bool done = false;
+ bool look_ahead = true;
+ while (!done && remaining_tokens > 0) {
+ DILToken tok;
+ Lex(tok, look_ahead);
+ if (tok.getKind() == dil::TokenKind::eof)
+ done = true;
+ remaining_tokens--;
+ };
+
+ if (remaining_tokens > 0) {
+ m_invalid_token.setValues(dil::TokenKind::invalid, "", 0);
+ return m_invalid_token;
+ }
+
+ return m_lexed_tokens[m_tokens_idx + N + 1];
+}
+
+const DILToken &DILLexer::AcceptLookAhead(uint32_t N) {
+ if (m_tokens_idx + N + 1 > m_lexed_tokens.size())
+ return m_invalid_token;
+
+ m_tokens_idx += N + 1;
+ return m_lexed_tokens[m_tokens_idx];
+}
+
+} // namespace dil
+
+} // namespace lldb_private
diff --git a/lldb/unittests/ValueObject/CMakeLists.txt b/lldb/unittests/ValueObject/CMakeLists.txt
index 8fcc8d62a79979..952f5411a98057 100644
--- a/lldb/unittests/ValueObject/CMakeLists.txt
+++ b/lldb/unittests/ValueObject/CMakeLists.txt
@@ -1,5 +1,6 @@
add_lldb_unittest(LLDBValueObjectTests
DumpValueObjectOptionsTests.cpp
+ DILLexerTests.cpp
LINK_LIBS
lldbValueObject
diff --git a/lldb/unittests/ValueObject/DILLexerTests.cpp b/lldb/unittests/ValueObject/DILLexerTests.cpp
new file mode 100644
index 00000000000000..ec6ff86b64d36b
--- /dev/null
+++ b/lldb/unittests/ValueObject/DILLexerTests.cpp
@@ -0,0 +1,193 @@
+//===-- DILLexerTests.cpp --------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "lldb/ValueObject/DILLexer.h"
+#include "llvm/ADT/StringRef.h"
+#include "gtest/gtest.h"
+#include <string>
+
+using llvm::StringRef;
+
+TEST(DILLexerTests, SimpleTest) {
+ StringRef dil_input_expr("simple_var");
+ uint32_t tok_len = 10;
+ lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
+ lldb_private::dil::DILToken dil_token;
+ dil_token.setKind(lldb_private::dil::TokenKind::unknown);
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::unknown);
+ dil_lexer.Lex(dil_token);
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+ EXPECT_EQ(dil_token.getSpelling(), "simple_var");
+ EXPECT_EQ(dil_token.getLength(), tok_len);
+ dil_lexer.Lex(dil_token);
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::eof);
+}
+
+TEST(DILLexerTests, TokenKindTest) {
+ StringRef dil_input_expr("namespace");
+ lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
+ lldb_private::dil::DILToken dil_token;
+ dil_token.setKind(lldb_private::dil::TokenKind::unknown);
+
+ dil_lexer.Lex(dil_token);
+ EXPECT_EQ(dil_lexer.GetCurrentTokenIdx(), UINT_MAX);
+ dil_lexer.ResetTokenIdx(0);
+
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::kw_namespace);
+ EXPECT_TRUE(dil_token.isNot(lldb_private::dil::TokenKind::identifier));
+ EXPECT_FALSE(dil_token.is(lldb_private::dil::TokenKind::l_paren));
+ EXPECT_TRUE(dil_token.isOneOf(lldb_private::dil::TokenKind::eof,
+ lldb_private::dil::TokenKind::kw_namespace));
+ EXPECT_FALSE(dil_token.isOneOf(lldb_private::dil::TokenKind::l_paren,
+ lldb_private::dil::TokenKind::r_paren,
+ lldb_private::dil::TokenKind::coloncolon,
+ lldb_private::dil::TokenKind::eof));
+
+ dil_token.setKind(lldb_private::dil::TokenKind::identifier);
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+}
+
+TEST(DILLexerTests, LookAheadTest) {
+ StringRef dil_input_expr("(anonymous namespace)::some_var");
+ lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
+ lldb_private::dil::DILToken dil_token;
+ dil_token.setKind(lldb_private::dil::TokenKind::unknown);
+ uint32_t expect_loc = 23;
+
+ dil_lexer.Lex(dil_token);
+ EXPECT_EQ(dil_lexer.GetCurrentTokenIdx(), UINT_MAX);
+ dil_lexer.ResetTokenIdx(0);
+
+ // Current token is '('; check the next 4 tokens, to make
+ // sure they are the identifier 'anonymous', the namespace keyword,
+ // ')' and '::', in that order.
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::l_paren);
+ EXPECT_EQ(dil_lexer.LookAhead(0).getKind(),
+ lldb_private::dil::TokenKind::identifier);
+ EXPECT_EQ(dil_lexer.LookAhead(0).getSpelling(), "anonymous");
+ EXPECT_EQ(dil_lexer.LookAhead(1).getKind(),
+ lldb_private::dil::TokenKind::kw_namespace);
+ EXPECT_EQ(dil_lexer.LookAhead(2).getKind(),
+ lldb_private::dil::TokenKind::r_paren);
+ EXPECT_EQ(dil_lexer.LookAhead(3).getKind(),
+ lldb_private::dil::TokenKind::coloncolon);
+ // Verify we've advanced our position counter (lexing location) in the
+  // input 23 characters (the length of '(anonymous namespace)::').
+ EXPECT_EQ(dil_lexer.GetLocation(), expect_loc);
+
+ // Our current index should still be 0, as we only looked ahead; we are still
+ // officially on the '('.
+ EXPECT_EQ(dil_lexer.GetCurrentTokenIdx(), 0);
+
+ // Accept the 'lookahead', so our current token is '::', which has the index
+ // 4 in our vector of tokens (which starts at zero).
+ dil_token = dil_lexer.AcceptLookAhead(3);
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::coloncolon);
+ EXPECT_EQ(dil_lexer.GetCurrentTokenIdx(), 4);
+
+ // Lex the final variable name in the input string
+ dil_lexer.Lex(dil_token);
+ dil_lexer.IncrementTokenIdx();
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+ EXPECT_EQ(dil_token.getSpelling(), "some_var");
+ EXPECT_EQ(dil_lexer.GetCurrentTokenIdx(), 5);
+
+ dil_lexer.Lex(dil_token);
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::eof);
+}
+
+TEST(DILLexerTests, MultiTokenLexTest) {
+ StringRef dil_input_expr("This string has several identifiers");
+ lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
+ lldb_private::dil::DILToken dil_token;
+ dil_token.setKind(lldb_private::dil::TokenKind::unknown);
+
+ dil_lexer.Lex(dil_token);
+ EXPECT_EQ(dil_lexer.GetCurrentTokenIdx(), UINT_MAX);
+ dil_lexer.ResetTokenIdx(0);
+
+ EXPECT_EQ(dil_token.getSpelling(), "This");
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+ dil_lexer.Lex(dil_token);
+ dil_lexer.IncrementTokenIdx();
+
+ EXPECT_EQ(dil_token.getSpelling(), "string");
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+ dil_lexer.Lex(dil_token);
+ dil_lexer.IncrementTokenIdx();
+
+ EXPECT_EQ(dil_token.getSpelling(), "has");
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+ dil_lexer.Lex(dil_token);
+ dil_lexer.IncrementTokenIdx();
+
+ EXPECT_EQ(dil_token.getSpelling(), "several");
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+ dil_lexer.Lex(dil_token);
+ dil_lexer.IncrementTokenIdx();
+
+ EXPECT_EQ(dil_token.getSpelling(), "identifiers");
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+ dil_lexer.Lex(dil_token);
+ dil_lexer.IncrementTokenIdx();
+
+ EXPECT_EQ(dil_token.getSpelling(), "");
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::eof);
+}
+
+TEST(DILLexerTests, IdentifiersTest) {
+ std::vector<std::string> valid_identifiers = {
+ "$My_name1",
+ "$pc",
+ "abcd",
+ "ab cd",
+ "_",
+ "_a",
+ "_a_",
+ "a_b",
+ "this",
+ "self",
+ "a",
+ "MyName"
+ };
+ std::vector<std::string> invalid_identifiers = {
+ "234",
+ "2a",
+ "2",
+ "$",
+ "1MyName",
+ "",
+ "namespace"
+ };
+
+ // Verify that all of the valid identifiers come out as identifier tokens.
+ for (auto str : valid_identifiers) {
+ StringRef dil_input_expr(str);
+ lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
+ lldb_private::dil::DILToken dil_token;
+ dil_token.setKind(lldb_private::dil::TokenKind::unknown);
+
+ dil_lexer.Lex(dil_token);
+ EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+ }
+
+ // Verify that none of the invalid identifiers come out as identifier tokens.
+ for (auto str : invalid_identifiers) {
+ StringRef dil_input_expr(str);
+ lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
+ lldb_private::dil::DILToken dil_token;
+ dil_token.setKind(lldb_private::dil::TokenKind::unknown);
+
+ dil_lexer.Lex(dil_token);
+ EXPECT_TRUE(dil_token.isNot(lldb_private::dil::TokenKind::identifier));
+ EXPECT_TRUE(dil_token.isOneOf(lldb_private::dil::TokenKind::unknown,
+ lldb_private::dil::TokenKind::none,
+ lldb_private::dil::TokenKind::eof,
+ lldb_private::dil::TokenKind::kw_namespace));
+ }
+}
Update CMakeLists.txt to build DILLexer.cpp.
You can test this locally with the following command:

git-clang-format --diff 23a239267e8a1d20ed10d3545feaf2a2bb70b085 5e2ee55f800726910ad6e56a192554375f61bfb8 --extensions h,cpp -- lldb/include/lldb/ValueObject/DILLexer.h lldb/source/ValueObject/DILLexer.cpp lldb/unittests/ValueObject/DILLexerTests.cpp

View the diff from clang-format here.

diff --git a/lldb/unittests/ValueObject/DILLexerTests.cpp b/lldb/unittests/ValueObject/DILLexerTests.cpp
index 137013e40d..c830dbaa79 100644
--- a/lldb/unittests/ValueObject/DILLexerTests.cpp
+++ b/lldb/unittests/ValueObject/DILLexerTests.cpp
@@ -156,28 +156,10 @@ TEST(DILLexerTests, MultiTokenLexTest) {
TEST(DILLexerTests, IdentifiersTest) {
std::vector<std::string> valid_identifiers = {
- "$My_name1",
- "$pc",
- "abcd",
- "ab cd",
- "_",
- "_a",
- "_a_",
- "a_b",
- "this",
- "self",
- "a",
- "MyName"
- };
+ "$My_name1", "$pc", "abcd", "ab cd", "_", "_a",
+ "_a_", "a_b", "this", "self", "a", "MyName"};
std::vector<std::string> invalid_identifiers = {
- "234",
- "2a",
- "2",
- "$",
- "1MyName",
- "",
- "namespace"
- };
+ "234", "2a", "2", "$", "1MyName", "", "namespace"};
// Verify that all of the valid identifiers come out as identifier tokens.
for (auto &str : valid_identifiers) {
/// Class defining the tokens generated by the DIL lexer and used by the
/// DIL parser.
class DILToken {
Since we already have a dil:: namespace, maybe we can drop the DIL prefix?
/// Class for doing the simple lexing required by DIL.
class DILLexer {
public:
  DILLexer(llvm::StringRef dil_expr) : m_expr(dil_expr.str()) {
I think if we accept StringRef as input, we shouldn't copy the data and should work with the provided string view (and assume we don't outlive it). If you need the lexer to own the text (and you probably don't), then accept std::string and move from it.
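For illustration only, here is a minimal sketch of the two options that comment describes; the class and member names are made up, not part of this patch.

#include "llvm/ADT/StringRef.h"
#include <string>
#include <utility>

// Option 1 (sketch): non-owning view; the caller's string must outlive the
// lexer. No copy is made.
class ViewLexer {
public:
  explicit ViewLexer(llvm::StringRef expr) : m_expr(expr) {}

private:
  llvm::StringRef m_expr;
};

// Option 2 (sketch): the lexer owns the text; accept std::string by value
// and move from it, so callers can hand the string over without a copy.
class OwningLexer {
public:
  explicit OwningLexer(std::string expr) : m_expr(std::move(expr)) {}

private:
  std::string m_expr;
};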
/// Update the index for the 'current' token, to point to the next lexed
/// token.
bool IncrementTokenIdx() {
Shouldn't Lex() do this automatically?
Not really. Lex() is called from LookAhead, when we definitely do not want to automatically increment the token index.
/// being handled by the DIL parser.
const DILToken &LookAhead(uint32_t N);

const DILToken &AcceptLookAhead(uint32_t N);
I think this API could be made simpler. The lexer doesn't actually need to re-lex; the results will always be the same. We only need to roll back occasionally, and we'll always be processing the same sequence of tokens the second time.

So the lexer can always add the tokens to the m_lexed_tokens vector, and we only need GetCurrentTokenIdx() and ResetTokenIdx() to do the rollback.
That would definitely be nice, and if it's true, I'd consider taking this even further: given that we're going to be processing about a line of text at most, we might be able to just eagerly lex the whole string and avoid the whole on-demand parsing business. If so, maybe we don't even need the lexer class and the lexing could consist of a single function like std::vector<Token> Lex(string)?
Re werat's comment: This is already what the lexer does. It never re-lexes anything. It stores the lexed tokens in a vector as it lexes them, and keeps track, via index into the vector, which token is the 'current' one (as far as the parser is concerned). Accepting the lookahead basically involves just updating the index.
That's good to hear, but it still makes the implementation complicated. Separating lexing from the traversal through the lexed tokens would make things much easier to follow. I understand traditional lexers can't do that, but they are parsing megabytes of text. We're going to be parsing approximately one line of code here.
Yesterday I also thought of another advantage of early/eager lexing -- we can report errors (things like invalid tokens and such) early. Then the other functions ("get next token" and stuff) shouldn't have any failure modes except running off the end of the token stream (which I think we could handle by pretending the stream ends with an infinite number of EOF tokens).
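To make that concrete, here is a rough sketch of the eager approach being discussed; every type and name below is an illustrative assumption, not part of this patch.

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sketch only: one up-front lexing pass produces a vector of tokens, and a
// small stream type handles traversal, look-ahead, and rollback.
enum class TokenKind { identifier, kw_namespace, l_paren, r_paren, coloncolon, eof };

struct Token {
  TokenKind kind;
  std::string spelling;
  size_t position;
};

class TokenStream {
public:
  // Takes the output of a single "lex the whole string" pass; the vector is
  // expected to end with an eof token.
  explicit TokenStream(std::vector<Token> tokens)
      : m_tokens(std::move(tokens)), m_idx(0) {}

  // Peeking past the end just keeps returning the trailing eof token.
  const Token &LookAhead(size_t n) const {
    size_t i = m_idx + n;
    return i < m_tokens.size() ? m_tokens[i] : m_tokens.back();
  }

  const Token &Current() const { return LookAhead(0); }

  void Advance() {
    if (m_idx + 1 < m_tokens.size())
      ++m_idx;
  }

  // Rollback support for tentative parsing: save the index, restore it later.
  size_t GetIdx() const { return m_idx; }
  void ResetIdx(size_t idx) {
    m_idx = idx < m_tokens.size() ? idx : m_tokens.size() - 1;
  }

private:
  std::vector<Token> m_tokens; // never empty; last element is eof
  size_t m_idx;
};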
lldb/source/ValueObject/DILLexer.cpp
Outdated
}

uint32_t position = m_cur_pos - m_expr.begin();
;
;
lldb/source/ValueObject/DILLexer.cpp
Outdated
bool look_ahead = true;
while (!done && remaining_tokens > 0) {
  DILToken tok;
  Lex(tok, look_ahead);
Should we check the return value here?
TEST(DILLexerTests, SimpleTest) {
  StringRef dil_input_expr("simple_var");
  uint32_t tok_len = 10;
  lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
IMO we can drop the dil_ prefix here. All these tests are in the context of DIL, so it's clear what it's all about.
lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
lldb_private::dil::DILLexer lexer(input_expr);
namespace dil {

enum class TokenKind {
Optional idea: Make this a non-class enum inside the (DIL)Token class so that it can be referred to as Token::coloncolon
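For illustration, a sketch of that optional idea; the names below are made up and not part of the patch.

#include <cstdint>
#include <string>
#include <utility>

// Sketch only: a plain (non-class) enum nested inside the token class, so
// callers can write Token::coloncolon without an extra TokenKind:: qualifier.
class Token {
public:
  enum Kind {
    coloncolon,
    eof,
    identifier,
    invalid,
    kw_namespace,
    l_paren,
    r_paren,
    unknown,
  };

  Token(Kind kind, std::string spelling, uint32_t start)
      : m_kind(kind), m_spelling(std::move(spelling)), m_start_pos(start) {}

  bool Is(Kind kind) const { return m_kind == kind; }

private:
  Kind m_kind = unknown;
  std::string m_spelling;
  uint32_t m_start_pos = 0;
};

// Usage: if (token.Is(Token::coloncolon)) { ... }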
void setValues(dil::TokenKind kind, std::string spelling, uint32_t start) {
  m_kind = kind;
  m_spelling = spelling;
  m_start_pos = start;
}
could we use the assignment operator instead (token = Token(kind, spelling, start))?
  m_tokens_idx = UINT_MAX;
}

bool Lex(DILToken &result, bool look_ahead = false);
bool Lex(DILToken &result, bool look_ahead = false);
std::optional<DILToken> Lex(bool look_ahead = false);
/// Update 'result' with the other parameter values, create a
/// duplicate token, and push the duplicate token onto the vector of
/// lexed tokens.
void UpdateLexedTokens(DILToken &result, dil::TokenKind tok_kind,
Should all of these APIs really be public?
A few of them can be private (I'll take care of that). Many of them could be protected (mostly just called from the parser, which I can designate as a friend), except that if I make them protected then my unittest, which needs to access them, can't access them. I've been trying and trying to find a way to allow my unit tests to access these methods when they're protected, but I can't seem to make it work. Do you know the magic secret for this?
There are workarounds, but no magic secrets. If something is called by the parser (which is the only user of the class anyway), then it should be public (btw, friends can see even private members, not just the public ones). And unit tests should generally test using the public APIs (as that's what the users will use). There are exceptions to that, but they usually involve situations where the public API requires arguments that would be difficult/impossible to mock in a test, which shouldn't be the case here.
Separating the public API from the private implementation details would really help this class, as I really don't know which one of these functions is meant to be called from the outside.
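For reference, the closest thing gtest itself offers to a "magic secret" here is the FRIEND_TEST macro from gtest_prod.h, which befriends one named test. It is not used in this patch, and testing through the public API, as suggested above, is usually preferable; the class and test names below are hypothetical.

#include "gtest/gtest_prod.h"

// Sketch only: FRIEND_TEST grants a single, named test access to non-public
// members. The test must live in the same namespace as the class for the
// generated friend declaration to match.
class DILLexer {
  FRIEND_TEST(DILLexerTests, IsWordTest); // hypothetical test name

protected:
  bool IsWord(); // hypothetical protected helper the test wants to call
};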
lldb/source/ValueObject/DILLexer.cpp
Outdated
// Empty Token
result.setValues(dil::TokenKind::none, "", m_expr.length());
return false;
What is the meaning of an empty token?
Roughly the same as unknown. I will change the TokenKind above to unknown.
};

// Verify that all of the valid identifiers come out as identifier tokens.
for (auto str : valid_identifiers) {
StringRef dil_input_expr(str);
lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
StringRef dil_input_expr(str);
lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
SCOPED_TRACE(str);
lldb_private::dil::DILLexer dil_lexer(str);
the explicit string ref construction is unnecessary. SCOPED_TRACE ensures that in case of a test failure, the error message will contain the string which caused it to fail.
}

TEST(DILLexerTests, MultiTokenLexTest) {
  StringRef dil_input_expr("This string has several identifiers");
I think this would read better if there were a helper function that lexed the string and returned the lexing result in a nicely-comparable form. E.g., with a function like std::vector<std::pair<TokenKind, std::string>> Lex(StringRef input), this whole test would boil down to
EXPECT_THAT(Lex("This string has several identifiers"),
            testing::ElementsAre(testing::Pair(TokenKind::identifier, "This"),
                                 testing::Pair(TokenKind::identifier, "string"),
                                 testing::Pair(TokenKind::identifier, "has"),
                                 testing::Pair(TokenKind::identifier, "several"),
                                 testing::Pair(TokenKind::identifier, "identifiers"),
                                 testing::Pair(TokenKind::eof, "")));
Making it easy to write tests like this is important, as I expect most of the lexer tests to be of this form, and I want to encourage people to write more tests like this for the various corner cases, of which there are many in a lexer implementation (for example, test that a colon at the end of a string does not access beyond the end of the string when it checks whether it's a part of ::).
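A sketch of what such a helper might look like against the lexer API in this patch; this is illustrative only and would need adjusting as the API evolves.

#include "lldb/ValueObject/DILLexer.h"
#include "llvm/ADT/StringRef.h"
#include <string>
#include <utility>
#include <vector>

// Sketch only: lex an entire input string and collect (kind, spelling)
// pairs for easy comparison in tests.
static std::vector<std::pair<lldb_private::dil::TokenKind, std::string>>
Lex(llvm::StringRef input) {
  lldb_private::dil::DILLexer lexer(input);
  std::vector<std::pair<lldb_private::dil::TokenKind, std::string>> result;
  lldb_private::dil::DILToken token;
  do {
    lexer.Lex(token);
    result.emplace_back(token.getKind(), token.getSpelling());
  } while (token.isNot(lldb_private::dil::TokenKind::eof) &&
           token.isNot(lldb_private::dil::TokenKind::none));
  return result;
}

The EXPECT_THAT / testing::ElementsAre / testing::Pair matchers used above come from gmock, so the test file would also need to include "gmock/gmock.h".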
lldb/source/ValueObject/DILLexer.cpp
Outdated
  return true;
}

switch (*m_cur_pos) {
An iterator is sort of a natural representation of a position in a string, but it also makes it hard to use many of the nice (safe) string APIs. For example, if we had a StringRef m_rest to represent the unparsed portion of the string, then this block could be written as:
constexpr std::pair<TokenKind, const char *> operators[] = {
    {TokenKind::l_paren, "("},
    {TokenKind::r_paren, ")"},
    {TokenKind::coloncolon, "::"},
    {TokenKind::colon, ":"},
};

size_t position = m_rest.data() - m_input.data();
for (auto [kind, str] : operators) {
  if (m_rest.consume_front(str)) {
    UpdateLexedTokens(result, kind, str, position);
    return true;
  }
}
(Notice how I do not have to worry about running beyond the end of the string, how the "current position" is updated automatically, and how the tokens which are prefixes of one another are handled by putting the longer string first.)
There are also other places where this could be useful. E.g. skipping over whitespace could be implemented as m_rest = m_rest.ltrim() and similar.
@@ -0,0 +1,156 @@
//===-- DILLexer.h ----------------------------------------------*- C++ -*-===// |
Also, this file uses a combination of camelCase (llvm style) and UpperCamel (lldb style) for method names. I'd be fine with either, but please be consistent.
- Remove "DIL" prefix from DILTokenKind and DILToken. - Change the token kind from an enum class to an enum inside the Token class. - Use CamelCase for all the method names. - Replace Token::SetValues method with assignments. - Use a StringRef, not std::string, to hold the input string in the lexer. - Update the lexer to lex all the tokens at one time. Added two new methods for this: LexAll and GetNextToken. - Made some of the Lexer methods private. - Replaces StringMap with StringSwitch for fast keyword lookups. - Updated GetTokenName to directly return StringRefs; removed default case from switch statement. - Cleaned up code format in IsLetter & IsDigit. - Updated IsWord too return an iterator range containing the word (if any). - Updated Lex function (now called by LexAll) to return an llvm::Expected token; removed look_ahead checks; changed the operator lexing to use a vector of operators (as suggested). - Cleaned up LookAhead method, now that we know all tokens have already been lexed. - Added helper function to unittests, to help check a sequence of tokens. - Generally cleaned up the tests to deal with all the code changes.
I think I have addressed all the comments & requests. Please review this again. Thanks!