Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LLDB] Add Lexer (with tests) for DIL (Data Inspection Language). #123521

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

cmtice
Copy link
Contributor

@cmtice cmtice commented Jan 19, 2025

This adds the basic lexer, with unittests, for the Data Inspection Language (DIL) -- see
https://discourse.llvm.org/t/rfc-data-inspection-language/69893

This version of the lexer only handles local variables and namespaces, and is designed to work with
#120971.

This adds the basic lexer, with unittests, for the Data Inspection
Language (DIL) -- see
https://discourse.llvm.org/t/rfc-data-inspection-language/69893

This version of the lexer only handles local variables and namespaces,
and is designed to work  with
llvm#120971.
@cmtice cmtice requested a review from JDevlieghere as a code owner January 19, 2025 17:21
@llvmbot llvmbot added the lldb label Jan 19, 2025
@llvmbot
Copy link
Member

llvmbot commented Jan 19, 2025

@llvm/pr-subscribers-lldb

Author: None (cmtice)

Changes

This adds the basic lexer, with unittests, for the Data Inspection Language (DIL) -- see
https://discourse.llvm.org/t/rfc-data-inspection-language/69893

This version of the lexer only handles local variables and namespaces, and is designed to work with
#120971.


Full diff: https://github.com/llvm/llvm-project/pull/123521.diff

4 Files Affected:

  • (added) lldb/include/lldb/ValueObject/DILLexer.h (+156)
  • (added) lldb/source/ValueObject/DILLexer.cpp (+205)
  • (modified) lldb/unittests/ValueObject/CMakeLists.txt (+1)
  • (added) lldb/unittests/ValueObject/DILLexerTests.cpp (+193)
diff --git a/lldb/include/lldb/ValueObject/DILLexer.h b/lldb/include/lldb/ValueObject/DILLexer.h
new file mode 100644
index 00000000000000..45c506b2f4106d
--- /dev/null
+++ b/lldb/include/lldb/ValueObject/DILLexer.h
@@ -0,0 +1,156 @@
+//===-- DILLexer.h ----------------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLDB_VALUEOBJECT_DILLEXER_H_
+#define LLDB_VALUEOBJECT_DILLEXER_H_
+
+#include "llvm/ADT/StringRef.h"
+#include <cstdint>
+#include <limits.h>
+#include <memory>
+#include <string>
+#include <vector>
+
+namespace lldb_private {
+
+namespace dil {
+
+enum class TokenKind {
+  coloncolon,
+  eof,
+  identifier,
+  invalid,
+  kw_namespace,
+  l_paren,
+  none,
+  r_paren,
+  unknown,
+};
+
+/// Class defining the tokens generated by the DIL lexer and used by the
+/// DIL parser.
+class DILToken {
+public:
+  DILToken(dil::TokenKind kind, std::string spelling, uint32_t start)
+      : m_kind(kind), m_spelling(spelling), m_start_pos(start) {}
+
+  DILToken() : m_kind(dil::TokenKind::none), m_spelling(""), m_start_pos(0) {}
+
+  void setKind(dil::TokenKind kind) { m_kind = kind; }
+  dil::TokenKind getKind() const { return m_kind; }
+
+  std::string getSpelling() const { return m_spelling; }
+
+  uint32_t getLength() const { return m_spelling.size(); }
+
+  bool is(dil::TokenKind kind) const { return m_kind == kind; }
+
+  bool isNot(dil::TokenKind kind) const { return m_kind != kind; }
+
+  bool isOneOf(dil::TokenKind kind1, dil::TokenKind kind2) const {
+    return is(kind1) || is(kind2);
+  }
+
+  template <typename... Ts> bool isOneOf(dil::TokenKind kind, Ts... Ks) const {
+    return is(kind) || isOneOf(Ks...);
+  }
+
+  uint32_t getLocation() const { return m_start_pos; }
+
+  void setValues(dil::TokenKind kind, std::string spelling, uint32_t start) {
+    m_kind = kind;
+    m_spelling = spelling;
+    m_start_pos = start;
+  }
+
+  static const std::string getTokenName(dil::TokenKind kind);
+
+private:
+  dil::TokenKind m_kind;
+  std::string m_spelling;
+  uint32_t m_start_pos; // within entire expression string
+};
+
+/// Class for doing the simple lexing required by DIL.
+class DILLexer {
+public:
+  DILLexer(llvm::StringRef dil_expr) : m_expr(dil_expr.str()) {
+    m_cur_pos = m_expr.begin();
+    // Use UINT_MAX to indicate invalid/uninitialized value.
+    m_tokens_idx = UINT_MAX;
+  }
+
+  bool Lex(DILToken &result, bool look_ahead = false);
+
+  bool Is_Word(std::string::iterator start, uint32_t &length);
+
+  uint32_t GetLocation() { return m_cur_pos - m_expr.begin(); }
+
+  /// Update 'result' with the other paremeter values, create a
+  /// duplicate token, and push the duplicate token onto the vector of
+  /// lexed tokens.
+  void UpdateLexedTokens(DILToken &result, dil::TokenKind tok_kind,
+                         std::string tok_str, uint32_t tok_pos);
+
+  /// Return the lexed token N+1 positions ahead of the 'current' token
+  /// being handled by the DIL parser.
+  const DILToken &LookAhead(uint32_t N);
+
+  const DILToken &AcceptLookAhead(uint32_t N);
+
+  /// Return the index for the 'current' token being handled by the DIL parser.
+  uint32_t GetCurrentTokenIdx() { return m_tokens_idx; }
+
+  /// Return the current token to be handled by the DIL parser.
+  DILToken &GetCurrentToken() { return m_lexed_tokens[m_tokens_idx]; }
+
+  /// Update the index for the 'current' token, to point to the next lexed
+  /// token.
+  bool IncrementTokenIdx() {
+    if (m_tokens_idx >= m_lexed_tokens.size() - 1)
+      return false;
+
+    m_tokens_idx++;
+    return true;
+  }
+
+  /// Set the index for the 'current' token (to be handled by the parser)
+  /// to a particular position. Used for either committing 'look ahead' parsing
+  /// or rolling back tentative parsing.
+  bool ResetTokenIdx(uint32_t new_value) {
+    if (new_value > m_lexed_tokens.size() - 1)
+      return false;
+
+    m_tokens_idx = new_value;
+    return true;
+  }
+
+private:
+  // The input string we are lexing & parsing.
+  std::string m_expr;
+
+  // The current position of the lexer within m_expr (the character position,
+  // within the string, of the next item to be lexed).
+  std::string::iterator m_cur_pos;
+
+  // Holds all of the tokens lexed so far.
+  std::vector<DILToken> m_lexed_tokens;
+
+  // Index into m_lexed_tokens; indicates which token the DIL parser is
+  // currently trying to parse/handle.
+  uint32_t m_tokens_idx;
+
+  // "invalid" token; to be returned by lexer when 'look ahead' fails.
+  DILToken m_invalid_token;
+};
+
+} // namespace dil
+
+} // namespace lldb_private
+
+#endif // LLDB_VALUEOBJECT_DILLEXER_H_
diff --git a/lldb/source/ValueObject/DILLexer.cpp b/lldb/source/ValueObject/DILLexer.cpp
new file mode 100644
index 00000000000000..4c2b0b1813bb96
--- /dev/null
+++ b/lldb/source/ValueObject/DILLexer.cpp
@@ -0,0 +1,205 @@
+//===-- DILLexer.cpp ------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+// This implements the recursive descent parser for the Data Inspection
+// Language (DIL), and its helper functions, which will eventually underlie the
+// 'frame variable' command. The language that this parser recognizes is
+// described in lldb/docs/dil-expr-lang.ebnf
+//
+//===----------------------------------------------------------------------===//
+
+#include "lldb/ValueObject/DILLexer.h"
+#include "llvm/ADT/StringMap.h"
+
+namespace lldb_private {
+
+namespace dil {
+
+// For fast keyword lookup. More keywords will be added later.
+const llvm::StringMap<dil::TokenKind> Keywords = {
+    {"namespace", dil::TokenKind::kw_namespace},
+};
+
+const std::string DILToken::getTokenName(dil::TokenKind kind) {
+  switch (kind) {
+  case dil::TokenKind::coloncolon:
+    return "coloncolon";
+  case dil::TokenKind::eof:
+    return "eof";
+  case dil::TokenKind::identifier:
+    return "identifier";
+  case dil::TokenKind::kw_namespace:
+    return "namespace";
+  case dil::TokenKind::l_paren:
+    return "l_paren";
+  case dil::TokenKind::r_paren:
+    return "r_paren";
+  case dil::TokenKind::unknown:
+    return "unknown";
+  default:
+    return "token_name";
+  }
+}
+
+static bool Is_Letter(char c) {
+  if (('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z'))
+    return true;
+  return false;
+}
+
+static bool Is_Digit(char c) { return ('0' <= c && c <= '9'); }
+
+// A word starts with a letter, underscore, or dollar sign, followed by
+// letters ('a'..'z','A'..'Z'), digits ('0'..'9'), and/or  underscores.
+bool DILLexer::Is_Word(std::string::iterator start, uint32_t &length) {
+  bool done = false;
+  bool dollar_start = false;
+
+  // Must not start with a digit.
+  if (m_cur_pos == m_expr.end() || Is_Digit(*m_cur_pos))
+    return false;
+
+  // First character *may* be a '$', for a register name or convenience
+  // variable.
+  if (*m_cur_pos == '$') {
+    dollar_start = true;
+    ++m_cur_pos;
+    length++;
+  }
+
+  // Contains only letters, digits or underscores
+  for (; m_cur_pos != m_expr.end() && !done; ++m_cur_pos) {
+    char c = *m_cur_pos;
+    if (!Is_Letter(c) && !Is_Digit(c) && c != '_') {
+      done = true;
+      break;
+    } else
+      length++;
+  }
+
+  if (dollar_start && length > 1) // Must have something besides just '$'
+    return true;
+
+  if (!dollar_start && length > 0)
+    return true;
+
+  // Not a valid word, so re-set the lexing position.
+  m_cur_pos = start;
+  return false;
+}
+
+void DILLexer::UpdateLexedTokens(DILToken &result, dil::TokenKind tok_kind,
+                                 std::string tok_str, uint32_t tok_pos) {
+  DILToken new_token;
+  result.setValues(tok_kind, tok_str, tok_pos);
+  new_token = result;
+  m_lexed_tokens.push_back(std::move(new_token));
+}
+
+bool DILLexer::Lex(DILToken &result, bool look_ahead) {
+  bool retval = true;
+
+  if (!look_ahead) {
+    // We're being asked for the 'next' token, and not a part of a LookAhead.
+    // Check to see if we've already lexed it and pushed it onto our tokens
+    // vector; if so, return the next token from the vector, rather than doing
+    // more lexing.
+    if ((m_tokens_idx != UINT_MAX) &&
+        (m_tokens_idx < m_lexed_tokens.size() - 1)) {
+      result = m_lexed_tokens[m_tokens_idx + 1];
+      return retval;
+    }
+  }
+
+  // Skip over whitespace (spaces).
+  while (m_cur_pos != m_expr.end() && *m_cur_pos == ' ')
+    m_cur_pos++;
+
+  // Check to see if we've reached the end of our input string.
+  if (m_cur_pos == m_expr.end()) {
+    UpdateLexedTokens(result, dil::TokenKind::eof, "", m_expr.length());
+    return retval;
+  }
+
+  uint32_t position = m_cur_pos - m_expr.begin();
+  ;
+  std::string::iterator start = m_cur_pos;
+  uint32_t length = 0;
+  if (Is_Word(start, length)) {
+    dil::TokenKind kind;
+    std::string word = m_expr.substr(position, length);
+    auto iter = Keywords.find(word);
+    if (iter != Keywords.end())
+      kind = iter->second;
+    else
+      kind = dil::TokenKind::identifier;
+
+    UpdateLexedTokens(result, kind, word, position);
+    return true;
+  }
+
+  switch (*m_cur_pos) {
+  case '(':
+    m_cur_pos++;
+    UpdateLexedTokens(result, dil::TokenKind::l_paren, "(", position);
+    return true;
+  case ')':
+    m_cur_pos++;
+    UpdateLexedTokens(result, dil::TokenKind::r_paren, ")", position);
+    return true;
+  case ':':
+    if (position + 1 < m_expr.size() && m_expr[position + 1] == ':') {
+      m_cur_pos += 2;
+      UpdateLexedTokens(result, dil::TokenKind::coloncolon, "::", position);
+      return true;
+    }
+    break;
+  default:
+    break;
+  }
+  // Empty Token
+  result.setValues(dil::TokenKind::none, "", m_expr.length());
+  return false;
+}
+
+const DILToken &DILLexer::LookAhead(uint32_t N) {
+  uint32_t extra_lexed_tokens = m_lexed_tokens.size() - m_tokens_idx - 1;
+
+  if (N + 1 < extra_lexed_tokens)
+    return m_lexed_tokens[m_tokens_idx + N + 1];
+
+  uint32_t remaining_tokens =
+      (m_tokens_idx + N + 1) - m_lexed_tokens.size() + 1;
+
+  bool done = false;
+  bool look_ahead = true;
+  while (!done && remaining_tokens > 0) {
+    DILToken tok;
+    Lex(tok, look_ahead);
+    if (tok.getKind() == dil::TokenKind::eof)
+      done = true;
+    remaining_tokens--;
+  };
+
+  if (remaining_tokens > 0) {
+    m_invalid_token.setValues(dil::TokenKind::invalid, "", 0);
+    return m_invalid_token;
+  }
+
+  return m_lexed_tokens[m_tokens_idx + N + 1];
+}
+
+const DILToken &DILLexer::AcceptLookAhead(uint32_t N) {
+  if (m_tokens_idx + N + 1 > m_lexed_tokens.size())
+    return m_invalid_token;
+
+  m_tokens_idx += N + 1;
+  return m_lexed_tokens[m_tokens_idx];
+}
+
+} // namespace dil
+
+} // namespace lldb_private
diff --git a/lldb/unittests/ValueObject/CMakeLists.txt b/lldb/unittests/ValueObject/CMakeLists.txt
index 8fcc8d62a79979..952f5411a98057 100644
--- a/lldb/unittests/ValueObject/CMakeLists.txt
+++ b/lldb/unittests/ValueObject/CMakeLists.txt
@@ -1,5 +1,6 @@
 add_lldb_unittest(LLDBValueObjectTests
   DumpValueObjectOptionsTests.cpp
+  DILLexerTests.cpp
 
   LINK_LIBS
     lldbValueObject
diff --git a/lldb/unittests/ValueObject/DILLexerTests.cpp b/lldb/unittests/ValueObject/DILLexerTests.cpp
new file mode 100644
index 00000000000000..ec6ff86b64d36b
--- /dev/null
+++ b/lldb/unittests/ValueObject/DILLexerTests.cpp
@@ -0,0 +1,193 @@
+//===-- DILLexerTests.cpp --------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "lldb/ValueObject/DILLexer.h"
+#include "llvm/ADT/StringRef.h"
+#include "gtest/gtest.h"
+#include <string>
+
+using llvm::StringRef;
+
+TEST(DILLexerTests, SimpleTest) {
+  StringRef dil_input_expr("simple_var");
+  uint32_t tok_len = 10;
+  lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
+  lldb_private::dil::DILToken dil_token;
+  dil_token.setKind(lldb_private::dil::TokenKind::unknown);
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::unknown);
+  dil_lexer.Lex(dil_token);
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+  EXPECT_EQ(dil_token.getSpelling(), "simple_var");
+  EXPECT_EQ(dil_token.getLength(), tok_len);
+  dil_lexer.Lex(dil_token);
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::eof);
+}
+
+TEST(DILLexerTests, TokenKindTest) {
+  StringRef dil_input_expr("namespace");
+  lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
+  lldb_private::dil::DILToken dil_token;
+  dil_token.setKind(lldb_private::dil::TokenKind::unknown);
+
+  dil_lexer.Lex(dil_token);
+  EXPECT_EQ(dil_lexer.GetCurrentTokenIdx(), UINT_MAX);
+  dil_lexer.ResetTokenIdx(0);
+
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::kw_namespace);
+  EXPECT_TRUE(dil_token.isNot(lldb_private::dil::TokenKind::identifier));
+  EXPECT_FALSE(dil_token.is(lldb_private::dil::TokenKind::l_paren));
+  EXPECT_TRUE(dil_token.isOneOf(lldb_private::dil::TokenKind::eof,
+                                lldb_private::dil::TokenKind::kw_namespace));
+  EXPECT_FALSE(dil_token.isOneOf(lldb_private::dil::TokenKind::l_paren,
+                                 lldb_private::dil::TokenKind::r_paren,
+                                 lldb_private::dil::TokenKind::coloncolon,
+                                 lldb_private::dil::TokenKind::eof));
+
+  dil_token.setKind(lldb_private::dil::TokenKind::identifier);
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+}
+
+TEST(DILLexerTests, LookAheadTest) {
+  StringRef dil_input_expr("(anonymous namespace)::some_var");
+  lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
+  lldb_private::dil::DILToken dil_token;
+  dil_token.setKind(lldb_private::dil::TokenKind::unknown);
+  uint32_t expect_loc = 23;
+
+  dil_lexer.Lex(dil_token);
+  EXPECT_EQ(dil_lexer.GetCurrentTokenIdx(), UINT_MAX);
+  dil_lexer.ResetTokenIdx(0);
+
+  // Current token is '('; check the next 4 tokens, to make
+  // sure they are the identifier 'anonymous', the namespace keyword,
+  // ')' and '::', in that order.
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::l_paren);
+  EXPECT_EQ(dil_lexer.LookAhead(0).getKind(),
+            lldb_private::dil::TokenKind::identifier);
+  EXPECT_EQ(dil_lexer.LookAhead(0).getSpelling(), "anonymous");
+  EXPECT_EQ(dil_lexer.LookAhead(1).getKind(),
+            lldb_private::dil::TokenKind::kw_namespace);
+  EXPECT_EQ(dil_lexer.LookAhead(2).getKind(),
+            lldb_private::dil::TokenKind::r_paren);
+  EXPECT_EQ(dil_lexer.LookAhead(3).getKind(),
+            lldb_private::dil::TokenKind::coloncolon);
+  // Verify we've advanced our position counter (lexing location) in the
+  // input 23 characters (the length of '(anonymous namespace)::'.
+  EXPECT_EQ(dil_lexer.GetLocation(), expect_loc);
+
+  // Our current index should still be 0, as we only looked ahead; we are still
+  // officially on the '('.
+  EXPECT_EQ(dil_lexer.GetCurrentTokenIdx(), 0);
+
+  // Accept the 'lookahead', so our current token is '::', which has the index
+  // 4 in our vector of tokens (which starts at zero).
+  dil_token = dil_lexer.AcceptLookAhead(3);
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::coloncolon);
+  EXPECT_EQ(dil_lexer.GetCurrentTokenIdx(), 4);
+
+  // Lex the final variable name in the input string
+  dil_lexer.Lex(dil_token);
+  dil_lexer.IncrementTokenIdx();
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+  EXPECT_EQ(dil_token.getSpelling(), "some_var");
+  EXPECT_EQ(dil_lexer.GetCurrentTokenIdx(), 5);
+
+  dil_lexer.Lex(dil_token);
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::eof);
+}
+
+TEST(DILLexerTests, MultiTokenLexTest) {
+  StringRef dil_input_expr("This string has several identifiers");
+  lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
+  lldb_private::dil::DILToken dil_token;
+  dil_token.setKind(lldb_private::dil::TokenKind::unknown);
+
+  dil_lexer.Lex(dil_token);
+  EXPECT_EQ(dil_lexer.GetCurrentTokenIdx(), UINT_MAX);
+  dil_lexer.ResetTokenIdx(0);
+
+  EXPECT_EQ(dil_token.getSpelling(), "This");
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+  dil_lexer.Lex(dil_token);
+  dil_lexer.IncrementTokenIdx();
+
+  EXPECT_EQ(dil_token.getSpelling(), "string");
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+  dil_lexer.Lex(dil_token);
+  dil_lexer.IncrementTokenIdx();
+
+  EXPECT_EQ(dil_token.getSpelling(), "has");
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+  dil_lexer.Lex(dil_token);
+  dil_lexer.IncrementTokenIdx();
+
+  EXPECT_EQ(dil_token.getSpelling(), "several");
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+  dil_lexer.Lex(dil_token);
+  dil_lexer.IncrementTokenIdx();
+
+  EXPECT_EQ(dil_token.getSpelling(), "identifiers");
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+  dil_lexer.Lex(dil_token);
+  dil_lexer.IncrementTokenIdx();
+
+  EXPECT_EQ(dil_token.getSpelling(), "");
+  EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::eof);
+}
+
+TEST(DILLexerTests, IdentifiersTest) {
+  std::vector<std::string> valid_identifiers = {
+    "$My_name1",
+    "$pc",
+    "abcd",
+    "ab cd",
+    "_",
+    "_a",
+    "_a_",
+    "a_b",
+    "this",
+    "self",
+    "a",
+    "MyName"
+  };
+  std::vector<std::string> invalid_identifiers = {
+    "234",
+    "2a",
+    "2",
+    "$",
+    "1MyName",
+    "",
+    "namespace"
+  };
+
+  // Verify that all of the valid identifiers come out as identifier tokens.
+  for (auto str : valid_identifiers) {
+    StringRef dil_input_expr(str);
+    lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
+    lldb_private::dil::DILToken dil_token;
+    dil_token.setKind(lldb_private::dil::TokenKind::unknown);
+
+    dil_lexer.Lex(dil_token);
+    EXPECT_EQ(dil_token.getKind(), lldb_private::dil::TokenKind::identifier);
+  }
+
+  // Verify that none of the invalid identifiers come out as identifier tokens.
+  for (auto str : invalid_identifiers) {
+    StringRef dil_input_expr(str);
+    lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
+    lldb_private::dil::DILToken dil_token;
+    dil_token.setKind(lldb_private::dil::TokenKind::unknown);
+
+    dil_lexer.Lex(dil_token);
+    EXPECT_TRUE(dil_token.isNot(lldb_private::dil::TokenKind::identifier));
+    EXPECT_TRUE(dil_token.isOneOf(lldb_private::dil::TokenKind::unknown,
+                                  lldb_private::dil::TokenKind::none,
+                                  lldb_private::dil::TokenKind::eof,
+                                  lldb_private::dil::TokenKind::kw_namespace));
+  }
+}

Update CMakeLists.txt to build DILLexer.cpp.
Copy link

github-actions bot commented Jan 19, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 23a239267e8a1d20ed10d3545feaf2a2bb70b085 5e2ee55f800726910ad6e56a192554375f61bfb8 --extensions h,cpp -- lldb/include/lldb/ValueObject/DILLexer.h lldb/source/ValueObject/DILLexer.cpp lldb/unittests/ValueObject/DILLexerTests.cpp
View the diff from clang-format here.
diff --git a/lldb/unittests/ValueObject/DILLexerTests.cpp b/lldb/unittests/ValueObject/DILLexerTests.cpp
index 137013e40d..c830dbaa79 100644
--- a/lldb/unittests/ValueObject/DILLexerTests.cpp
+++ b/lldb/unittests/ValueObject/DILLexerTests.cpp
@@ -156,28 +156,10 @@ TEST(DILLexerTests, MultiTokenLexTest) {
 
 TEST(DILLexerTests, IdentifiersTest) {
   std::vector<std::string> valid_identifiers = {
-    "$My_name1",
-    "$pc",
-    "abcd",
-    "ab cd",
-    "_",
-    "_a",
-    "_a_",
-    "a_b",
-    "this",
-    "self",
-    "a",
-    "MyName"
-  };
+      "$My_name1", "$pc", "abcd", "ab cd", "_", "_a",
+      "_a_",       "a_b", "this", "self",  "a", "MyName"};
   std::vector<std::string> invalid_identifiers = {
-    "234",
-    "2a",
-    "2",
-    "$",
-    "1MyName",
-    "",
-    "namespace"
-  };
+      "234", "2a", "2", "$", "1MyName", "", "namespace"};
 
   // Verify that all of the valid identifiers come out as identifier tokens.
   for (auto &str : valid_identifiers) {


/// Class defining the tokens generated by the DIL lexer and used by the
/// DIL parser.
class DILToken {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we already have a dill:: namespace, maybe we can drop DIL prefix?

/// Class for doing the simple lexing required by DIL.
class DILLexer {
public:
DILLexer(llvm::StringRef dil_expr) : m_expr(dil_expr.str()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we accept StringRef as input, we shouldn't copy the data and work with the provided string view (and assume we don't outlive it).
If you need the lexer to own the text (and you probably don't), then accept std::string and move from it.


/// Update the index for the 'current' token, to point to the next lexed
/// token.
bool IncrementTokenIdx() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't Lex() do this automatically?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. Lex() is called from LookAhead, when we definitely do not want to automatically increment the token index.

/// being handled by the DIL parser.
const DILToken &LookAhead(uint32_t N);

const DILToken &AcceptLookAhead(uint32_t N);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this API might be simpler. The lexer doesn't actually need to re-lex, the results will always be the same. We only need to rollback occasionally, but we'll always be process the same sequence of tokens the second time.

So the lexer can always add the tokens to the m_lexed_tokens vector and we only need GetCurrentTokenIdx() and ResetTokenIdx() to do the rollback.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would definitely be nice, and if its true, I'd consider taking this even further: given that we're going to be processing about a line of text at most, we might be able to just eagerly lex the whole string and avoid the whole on-demand parsing business. If so, maybe we don't even need the lexer class and the lexing could consist of a single function like std::vector<Token> Lex(string) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re werat's comment: This is already what the lexer does. It never re-lexes anything. It stores the lexed tokens in a vector as it lexes them, and keeps track, via index into the vector, which token is the 'current' one (as far as the parser is concerned). Accepting the lookahead basically involves just updating the index.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good to hear, but it still makes the implementation complicated. Separating lexing from the traversal through the lexed tokens would make things much easier to follow. I understand traditional lexers can't do that, but hey are parsing megabytes of text. We're going to be parsing approximately one line of code here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yesterday I also thought of another advantage of early/eager parsing -- we can report errors (things like invalid tokens and such) early. Then the other functions ("get next token" and stuff) shouldn't have any failure modes except running off the end of token stream (which I think we could handle by pretending the stream ends with an infinite amount of EOF tokens)

}

uint32_t position = m_cur_pos - m_expr.begin();
;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
;

bool look_ahead = true;
while (!done && remaining_tokens > 0) {
DILToken tok;
Lex(tok, look_ahead);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check the return value here?

TEST(DILLexerTests, SimpleTest) {
StringRef dil_input_expr("simple_var");
uint32_t tok_len = 10;
lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO can drop dil_ here. All these tests are in the context of DIL, so it's clear what it's all about.

Suggested change
lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
lldb_private::dil::DILLexer lexer(input_expr);


namespace dil {

enum class TokenKind {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional idea: Make this a non-class enum inside the (DIL)Token class so that it can be referred to as Token::coloncolon

Comment on lines 65 to 69
void setValues(dil::TokenKind kind, std::string spelling, uint32_t start) {
m_kind = kind;
m_spelling = spelling;
m_start_pos = start;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we use the assignment operator instead (token = Token(kind, spelling, start)) ?

m_tokens_idx = UINT_MAX;
}

bool Lex(DILToken &result, bool look_ahead = false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool Lex(DILToken &result, bool look_ahead = false);
std::optional<DILToken> Lex(bool look_ahead = false);

/// being handled by the DIL parser.
const DILToken &LookAhead(uint32_t N);

const DILToken &AcceptLookAhead(uint32_t N);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would definitely be nice, and if its true, I'd consider taking this even further: given that we're going to be processing about a line of text at most, we might be able to just eagerly lex the whole string and avoid the whole on-demand parsing business. If so, maybe we don't even need the lexer class and the lexing could consist of a single function like std::vector<Token> Lex(string) ?

/// Update 'result' with the other paremeter values, create a
/// duplicate token, and push the duplicate token onto the vector of
/// lexed tokens.
void UpdateLexedTokens(DILToken &result, dil::TokenKind tok_kind,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should all of these APIs really be public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few of them can be private (I'll take care of that). Many of them could be protected (mostly just called from the parser, which I can designate as a friend), except that if I make them protected then my unittest, which needs to access them, can't access them. I've been trying and trying to find a way to allow my unit tests access these methods when they're protected, but I can't seem to make it work. Do you know the magic secret for this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are workarounds, but no magic secrets. If something is called by the parser (which is the only user of the class anyway), then it should be public (btw, friends can see even private members, not just the public ones). And unit tests should generally test using the public APIs (as that's what the users will use). There are exceptions to that, but they usually involve situations where the public API requires arguments that would be difficult/impossible to mock in a test, which shouldn't be the case here.

Separating the public API from the private implementation details would really help this class, as I really don't know which one of these functions is meant to be called from the outside.

Comment on lines 163 to 165
// Empty Token
result.setValues(dil::TokenKind::none, "", m_expr.length());
return false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of an empty token?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roughly the same as unknown. I will change the TokenKind above to unknown.

};

// Verify that all of the valid identifiers come out as identifier tokens.
for (auto str : valid_identifiers) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 170 to 171
StringRef dil_input_expr(str);
lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
StringRef dil_input_expr(str);
lldb_private::dil::DILLexer dil_lexer(dil_input_expr);
SCOPED_TRACE(str);
lldb_private::dil::DILLexer dil_lexer(str);

the explicit string ref construction is unnecessary. SCOPED_TRACE ensures that in case of a test failure, the error message will contain the string which caused it to fail.

}

TEST(DILLexerTests, MultiTokenLexTest) {
StringRef dil_input_expr("This string has several identifiers");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would read better if there was helper function that would lex the string and then return the lexing result in a nicely-comparable for. E.g., with a function like
std::vector<std::pair<TokenKind, std::string>> Lex(StringRef input), this whole test would boil down to

EXPECT_THAT(Lex("This string has several identifiers"),
 testing::ElementsAre(testing::Pair(TokenKind::identifier, "This"), 
testing::Pair(TokenKind::identifier, "string"), 
testing::Pair(TokenKind::identifier, "has"),
 testing::Pair(TokenKind::identifier, "several"),
 testing::Pair(TokenKind::identifier, "identifiers"), 
testing::Pair(TokenKind::eof, "")));

Making it easy to write test like this is important, as I expect most of the lexer tests to be of this form, and I want to encourage people to write more tests like this for various corner cases, of which there are many in a lexer implementation (for example, test that a colon at the of a string does not access beyond the end of the string when it checks whether it's a part of ::)

return true;
}

switch (*m_cur_pos) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An iterator is sort of a natural representation of a position in a string, but it also makes it hard to use many of the nice (safe) string APIs. For example, if we had a StringRef m_rest to represent the unparsed portion of the string, then this block could be written as:

constexpr std::pair<TokenKind, const char *> operators = {
  {TokenKind::l_paren, "("},
  {TokenKind::r_paren, ")"},
  {TokenKind::coloncolon, "::"},
  {TokenKind::colon, ":"},
};
size_t position = m_rest.data() - m_input.data();
for (auto [kind, str]: operators) {
  if (m_rest.consume_front(str)) {
    UpdateLexedTokens(result, kind, str, position);
    return true;
  }
}

(Notice how I do not have to worry about running beyond the end of the string, how the "current position" is updated automatically, and how the tokens which are prefixes of one another are handled by putting the longer string first.)

There are also other places where this could be useful. E.g. skipping over whitespace could be implemented as m_rest = m_rest.ltrim() and similar.

@@ -0,0 +1,156 @@
//===-- DILLexer.h ----------------------------------------------*- C++ -*-===//
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this file uses a combination of cammelCase (llvm style) and UpperCamel (lldb style) for method names. I'd be fine with either, but please be consistent.

- Remove "DIL" prefix from DILTokenKind	and DILToken.
- Change the token kind	from an	enum class to an enum inside the Token
  class.
- Use CamelCase	for all	the method names.
- Replace Token::SetValues method with assignments.
- Use a	StringRef, not std::string,  to hold the input string in the lexer.
- Update the lexer to lex all the tokens at one	time. Added two	new methods
  for this: LexAll and GetNextToken.
- Made some of the Lexer methods private.
- Replaces StringMap with StringSwitch for fast keyword lookups.
- Updated GetTokenName to directly return StringRefs; removed default case from
  switch statement.
- Cleaned up code format in IsLetter	& IsDigit.
- Updated IsWord too return an iterator	range containing the word (if any).
- Updated Lex function (now called by LexAll) to return	an llvm::Expected
  token; removed look_ahead checks; changed the	operator lexing	to use
  a vector of operators	(as suggested).
- Cleaned up LookAhead method, now that	we know	all tokens have	already	been
  lexed.
- Added	helper function	to unittests, to help check a sequence of tokens.
- Generally cleaned up the tests to deal with all the code changes.
@cmtice
Copy link
Contributor Author

cmtice commented Jan 26, 2025

I think I have addressed all the comments & requests. Please review this again. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants