Structured Typesetting (STS) generation #9

Open
wants to merge 14 commits into base: master

@tirix tirix commented Dec 24, 2021

I've tried to make as few changes as possible. The changes are in the following source files:

  • metamath.c for the changes to the SHOW STATEMENT command and the new VERIFY STS command, as well as the output post-processing,
  • mmcmdl.c for the new HELP command options, and the new commands,
  • mmcmds.c for the changes to the SHOW STATEMENT command at top level (distinct and dummy variables, syntax hints),
  • mmdata.c for some utility functions,
  • mmhlpa.c and mmhlpb.c for the new built-in HELP options,
  • mmhtbl.c, a new file with a hash table implementation,
  • mmwsts.c, a new file with the main STS implementation,
  • mmwtex.c, for the main hook into the STS formula output and for in-line comment formulas.

This also adds a .gitignore file to ignore object files.

Contributor

@wlammen wlammen left a comment

Why did you add the makefile? The readme instructions note several ways to compile metamath, including gcc m*.c -o metamath, and using automake.

Contributor

@wlammen wlammen left a comment

metamath.c line 3032 duplicated code from line 2393

Contributor

@wlammen wlammen left a comment

Style, readability: command-line option STS. Norm always used verbose forms; consider expanding it to something like STRUCTURED_TYPESETTING.

Contributor

@wlammen wlammen left a comment

built-in help does not cover new option

@tirix
Author

tirix commented Dec 25, 2021

Thank you Wolf for your review!

built-in help does not cover new option

It actually does, see changes in mmhlpa.c and mmhlpb.c.

Style, readability: command-line option STS. Norm always used verbose forms; consider expanding it to something like STRUCTURED_TYPESETTING.

That's very verbose, though. Maybe just "STRUCTURED" ?

Why did you add the makefile? The readme instructions note several ways to compile metamath, including gcc m*.c -o metamath, and using automake.

Sure, I can remove the makefile.

@benjub
Collaborator

benjub commented Dec 25, 2021

How do you render this "structured typesetting"? Is this MathML, MathJax or something like this ?

For texts, and in particular help texts, Norm used the double-space convention between sentences, and I admit I like that.

@tirix
Author

tirix commented Dec 25, 2021

How do you render this "structured typesetting"? Is this MathML, MathJax or something like this ?

Yes, the set-mathml.mmts file contains instructions about how to display set.mm formulas as MathML. Then, the MathML result is sent to MathJax for rendering.
The actual MathJax command used is included in that file, in the $c ... $. instruction towards the end. The metamath-exe program reads that instruction and executes the command, sending the generated output to the command's standard input.

For texts, and in particular help texts, Norm used the double-space convention between sentences, and I admit I like that.

Your wish has come true with the last commit!

mmhlpb.c Outdated
H("Syntax: VERIFY STS <format>");
H("");
H("This command error-checks that the STS rules definition covers all syntax");
H("defined in the Metamath source file loaded. It runs through all non");
Collaborator

Double space here ! And I would write "non-definitional" with a hyphen, or even "nondefinitional", since "non" is not a word.

@benjub
Collaborator

benjub commented Dec 25, 2021

Thanks. So maybe you can use the option "/ MATHML" ? I'm fine with "/ STRUCTURED" too. I cannot approve this PR since I am not fluent in C :-(

@wlammen
Contributor

wlammen commented Dec 25, 2021

The option STS or /STRUCTURED is the second alternative to /HTML and /ALT_HTML. Are you able to roughly tell in what way these options differ by just looking at the name? If not, people will have to guess what the results of invocations are. Do you expect this option to be typed by users of metamath directly? Or will it likely be part of a script that rarely changes? If you type the option frequently, a short name is handy; otherwise even a very long option name will hardly annoy anyone, and it is easily understandable to casual readers of the script.
If I understand your code right, embedded snippets are translated via a sort of grammar file into true HTML. This process is then best expressed in your option tag.

I am still busy with Xmas, have just skimmed your PR. I currently cannot look into this further.

@tirix
Author

tirix commented Dec 25, 2021

Thanks. So maybe you can use the option "/ MATHML" ? I'm fine with "/ STRUCTURED" too.

I wanted to keep it generic because one could generate anything with it, not just MathML.
The set-mathml.mmts file just happens to contain rules to generate MathML; the Metamath.exe functionality is completely unaware of what it is (actually, I had written another file to generate LaTeX with the same method).

@tirix
Author

tirix commented Dec 25, 2021

The option STS or /STRUCTURED is the second alternative to /HTML and /ALT_HTML.

Yes, and it is itself followed by the name of the production to use. A command would typically be:
SHOW STATEMENT syl / STS mathml
which instructs the program to read and execute the mathml instruction file (set-mathml.mmts).

If you had, say, a graphviz-dot instruction file, the command:
SHOW STATEMENT syl / STS graphviz-dot
would search a set-graphviz-dot.mmts file, and presumably output some graphs instead of formulas (that might be an interesting experiment!).
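
The lookup described above could be sketched as follows. How the program actually derives the `set` prefix is not stated in this thread, so `base` is an assumed parameter and `buildStsFileName` is a hypothetical helper, not a function from the PR:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "<base>-<format>.mmts" into out.  Returns 0 on success, -1 if the
   format is empty or the name does not fit.  "base" (e.g. "set") is an
   assumed parameter; the thread does not say where the prefix comes from. */
int buildStsFileName(char *out, size_t outSize,
                     const char *base, const char *format) {
  assert(out != NULL && base != NULL && format != NULL);
  if (strlen(format) == 0) return -1;   /* a format name is required */
  int n = snprintf(out, outSize, "%s-%s.mmts", base, format);
  return (n >= 0 && (size_t)n < outSize) ? 0 : -1;
}
```

With `base` set to `"set"`, the formats `mathml` and `graphviz-dot` map to the two file names mentioned in the comment.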

If I understand your code right, embedded snippets are translated via a sort of grammar file into true HTML. This process is then best expressed in your option tag.

Yes, that's roughly the process. So if I follow you, we could use e.g. /STS_HTML as an option tag?

@wlammen
Contributor

wlammen commented Dec 25, 2021

What about EXPAND_HTML? Or HTML_EXPAND?
Note that structured is an adjective (usually used to describe a state/property) while expand is a verb (describing a process/activity). If you prefer an adjective here, consider (EMBEDDED_)ENCODING.
I wonder whether the proposed technique is limited to typesetting. If not, the suggestions here are generic enough to even cover exotic applications.

@benjub
Collaborator

benjub commented Dec 25, 2021

When running metamath, it is enough to type any non-ambiguous prefix, so here, MM> show statement syl /S mathml would suffice. In other words, there is no big inconvenience in having a verbose option. As for option names, I always thought the other two (HTML, ALT_HTML) were not optimal. I think more explicit choices would be /UNICODE and /GIF.

Contributor

@wlammen wlammen left a comment

.gitignore: OK, see here (https://stackoverflow.com/questions/6626136/best-practice-for-adding-gitignore-to-repo)
Maybe one should put this info into the README?

Contributor

@wlammen wlammen left a comment

metamath.c line 60 suggest a version number and add a changelog entry.

Contributor

@wlammen wlammen left a comment

naming consistency mmwtex.h line 33: stsFlag (and possibly following) should be prefixed with g_ (g_stsFlag, matching g_altHtmlFlag for example), see changelog 0.187 metamath.c, line 88

Contributor

@wlammen wlammen left a comment

optimization metamath.c line 2381: The function switchPos("/ STS") is called three times more or less in succession. Consider evaluating the result once and use a local variable to recall the result within the if conditions.

Contributor

@wlammen wlammen left a comment

parsetSTSRules type: why parset...? Looks like a typo

Contributor

@wlammen wlammen left a comment

metamath.c line 2977: Suspicious value of i. The other options / TIME or / NO_VERSIONING add 2 to the number of options, seemingly counting the slash and the tag as different entries. Why is this not done for / STS? If the count is correct, IMO a comment should clarify the underlying logic.

to be continued...

@tirix
Author

tirix commented Dec 26, 2021

@wlammen I'm going to copy your remarks directly into a code review, it creates a sub-thread per remark and I think it's much easier to follow like that.

@tirix
Author

tirix commented Dec 26, 2021

metamath.c line 60 suggest a version number and add a changelog entry

Ok, I've updated the history and proposed a version 0.199.
This kind of history entry can only really be finalized at the merge time (imagine another PR comes before) and typically will create merge conflicts, though.

.gitignore Outdated
@@ -0,0 +1,2 @@
*.o
metamath
Author

@wlammen writes:

OK, see here (https://stackoverflow.com/questions/6626136/best-practice-for-adding-gitignore-to-repo)
Maybe one should put this info into the README?

I'm not sure which info you would like to add to the README.
There is a standard .gitignore file for C projects here, we could give it a try.
Maybe better in a separate PR?

Contributor

@wlammen wlammen Dec 26, 2021

The reason I added this comment is that the metamath sources are distributed via a tar archive as well. Obviously you won't need the .gitignore in that case. I think some kind of distribution how-to is finally called for. Those are my thoughts. I was curious why you added the file. A separate PR is fine with me.

mmwtex.h Outdated
Comment on lines 32 to 35
/* 19-Jul-2017 tar Added for STS/MathML output */
extern flag stsFlag; /* STS output (for "structural typesetting") */
extern vstring stsOutput; /* output mode chosen for STS (follows STS flag) */
extern vstring postProcess; /* command to pipe the output into (used for MathJax prerendering) */
Author

@wlammen wrote:

stsFlag (and possibly following) should be prefixed with g_ (g_stsFlag, matching g_altHtmlFlag for example), see changelog 0.187 metamath.c, line 88

Yes, this global variable naming convention was introduced after I programmed the STS module. As an improvement, I could retrofit the module to follow it too.

Contributor

@wlammen wlammen Dec 26, 2021

These are the current rules.

Comment on lines +2381 to +2382
if (switchPos("/ ALT_HTML") != 0 || switchPos("/ STS") != 0 ) {
print2("?Please specify only one of / HTML , / ALT_HTML and / STS.\n");
Author

@wlammen wrote:

optimization: The function switchPos("/ STS") is called three times more or less in succession. Consider evaluating the result once and use a local variable to recall the result within the if conditions.

I don't think this is called 3 times. There are 3 else..if branches, and it is called one time in each, to ensure only one HTML output formatting option is chosen.

Contributor

You're right. It still looks awkward, since the call and its parameter are coded several times.
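
A minimal sketch of the single-evaluation idea discussed in this sub-thread, not the code in metamath.c. Here `switchPosIn` is a hypothetical stand-in for `switchPos` that takes the command line as a parameter, and `atMostOneOutputSwitch` is an invented helper name. Each switch is looked up once, and the mutual-exclusion check becomes a simple count:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for switchPos(): 1-based position of the switch
   text in the command line, or 0 if it does not occur. */
static long switchPosIn(const char *commandLine, const char *sw) {
  const char *hit = strstr(commandLine, sw);
  return hit ? (long)(hit - commandLine) + 1 : 0;
}

/* Returns 1 if at most one of / HTML, / ALT_HTML and / STS is present. */
int atMostOneOutputSwitch(const char *commandLine) {
  assert(commandLine != NULL);
  /* Each switch is looked up exactly once and the result is remembered. */
  const int hasHtml = switchPosIn(commandLine, "/ HTML") != 0;
  const int hasAltHtml = switchPosIn(commandLine, "/ ALT_HTML") != 0;
  const int hasSts = switchPosIn(commandLine, "/ STS") != 0;
  return hasHtml + hasAltHtml + hasSts <= 1;
}
```

The three flags can then drive both the error message and the later branch selection without repeating the string literals.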

mmwsts.h Outdated
#include "mmdata.h"

/* Parse an STS file */
int parsetSTSRules(vstring format);
Author

@wlammen wrote:

parsetSTSRules type: why parset...? Looks like a typo

Yes, it's a typo. I'll fix that!

metamath.c Outdated
Comment on lines 2976 to 2977
/* 7-Jul-2017 added MathML/STS */
if (switchPos("/ STS")) i = i + 1;
Author

@wlammen wrote:

Suspicious value of i. The other options / TIME or / NO_VERSIONING add 2 to the number of options, seemingly counting the slash and the tag as different entries. Why is this not done for / STS? If the count is correct, IMO a comment should clarify the underlying logic.

Interesting. So, a typical SHOW STATEMENT command looks like this:
SHOW STATEMENT syl / HTML
That would be g_rawArgs = 5, which already includes 2 arguments for the / HTML.
As you correctly guessed, for / NO_VERSIONING and / TIME, there are two more arguments, therefore the +2.
In the case of STS, the / STS would already be accounted for in the 5 (taking the place of the / HTML), so that's not what the +1 is for. Rather, in the case of STS, there is one more argument, namely the name of the output processing, e.g. mathml. That is what the +1 is for.

I'll add a comment to make that clearer.
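
The counting rule explained above might be written down as follows. This is only an illustration: `expectedRawArgs` and its flag parameters are hypothetical, and the base count of 5 is taken from the `SHOW STATEMENT syl / HTML` example in this reply.

```c
/* Hypothetical sketch of the expected raw-argument count for
   SHOW STATEMENT <label> / <HTML|ALT_HTML|STS> plus optional extras. */
long expectedRawArgs(int hasTime, int hasNoVersioning, int hasSts) {
  long i = 5;                   /* SHOW STATEMENT <label> / <output switch> */
  if (hasTime) i += 2;          /* "/" and "TIME" */
  if (hasNoVersioning) i += 2;  /* "/" and "NO_VERSIONING" */
  if (hasSts) i += 1;           /* format name after STS, e.g. "mathml" */
  return i;
}
```

Under these assumptions, `SHOW STATEMENT syl / STS mathml` counts 6 arguments: the usual 5 plus the format name.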

Contributor

I am subject to a learning curve as well while reviewing, with respect to both your source code and the review style.

if (lastArgMatches("STS")) {
i++;
if (strlen(stsOutput)) {
if (!getFullArg(i,cat("* Using which output mode <",stsOutput,">? ",NULL)))
Contributor

@wlammen wlammen Dec 27, 2021

Incorrect indentation

print2("?No source file has been read in. Use READ first.\n");
goto pclbad;
}
if (strlen(stsOutput)) {
Contributor

Duplicated code from line 586 (with i == 2)

mmcmds.c Outdated
@@ -371,7 +373,9 @@ void typeStatement(long showStmt,
}
htmlDistinctVarsCommaFlag = 1;
let(&str2, "");
str2 = tokenToTex(g_MathToken[nmbrTmpPtr2[k]].tokenName, showStmt);
/* 27 Jul 2017 tar For MathML/STS */
if(stsFlag) str2 = stsToken(nmbrTmpPtr2[k], showStmt);
Contributor

Duplicated code from line 357

mmcmds.c Outdated
/* tokenToTex allocates str2; we must deallocate it */
let(&str1, cat(str1, " &nbsp; ", str2, NULL));
let(&str2, "");
str2 = tokenToTex(g_MathToken[nmbrTmpPtr2[k]].tokenName, showStmt);
/* 27 Jul 2017 tar For MathML/STS */
if(stsFlag) str2 = stsToken(nmbrTmpPtr2[k], showStmt);
Contributor

Duplicated code from line 377

mmdata.c Outdated
long i;
long hash = 0;
i = -1;
while (i < 13 && s[i] != -1) {
Contributor

@wlammen wlammen Dec 27, 2021

Suspicious i in first loop: -1, which is at least out of bounds for salt, likely for s, too.

Author

I'll answer this one now as it is interesting, useful for the rest of the review... and definitely deserves more information in the comments!

Actually, indices -1, -2 and -3 are valid and used in nmbrString:

  • Index -1 is the length of the number string. See nmbrLen (mmdata.c line 1023)
  • Index -2 is the allocated length, i.e. how many numbers are available totally (could be more than the actual current length of the string). See nmbrAllocLen (mmdata.c line 1032)
  • Index -3 is the location in memUsedPool (for memory management)

In this specific case, I wanted to include the length of the string in the hash, which I think makes sense but should have been explained.

In any case, you clearly spotted a mistake here because in the salt, index -1 is clearly invalid!

Member

Any use of negative indices should be wrapped behind a macro to provide more safety and clarity. (That is, there should be a macro like #define nmbrLen(p) p[-1], where p has type nmbrString, which is a typedef for int* or what have you.)
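
A sketch of what such accessor macros could look like, assuming the three header slots described above. The names `NMBR_LEN`, `NMBR_ALLOC_LEN`, `NMBR_POOL_LOC` and the `nmbrSketch*` helpers are invented for illustration, and the real layout in mmdata.c may differ (for example in element type or units).

```c
#include <assert.h>
#include <stdlib.h>

typedef long nmbrStringSketch;  /* hypothetical stand-in for nmbrString */

/* Header slots stored in front of element 0, as described in the thread. */
#define NMBR_LEN(p)       ((p)[-1])  /* current length */
#define NMBR_ALLOC_LEN(p) ((p)[-2])  /* allocated length */
#define NMBR_POOL_LOC(p)  ((p)[-3])  /* location in the memory pool */
#define NMBR_HEADER 3

/* Allocates header plus allocLen elements; returns a pointer to element 0. */
nmbrStringSketch *nmbrSketchNew(long allocLen) {
  assert(allocLen >= 0);
  long *raw = malloc((size_t)(NMBR_HEADER + allocLen) * sizeof *raw);
  if (!raw) return NULL;
  nmbrStringSketch *p = raw + NMBR_HEADER;
  NMBR_LEN(p) = 0;
  NMBR_ALLOC_LEN(p) = allocLen;
  NMBR_POOL_LOC(p) = -1;  /* not tracked by any pool in this sketch */
  return p;
}

/* Frees the block, including the header in front of element 0. */
void nmbrSketchFree(nmbrStringSketch *p) {
  if (p) free(p - NMBR_HEADER);
}
```

With the header hidden behind named macros, a stray `s[-1]` elsewhere in the code stands out as a likely bug rather than a legitimate length access.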

mmdata.c Outdated
static long salt[] = { 4938, 48977, 6897, 7293, 2663, 7925, 2999, 12238,
40033, 14038, 10699, 29746, 56108, 34526, 63576, 52053, 61949, 41177, 43740, 22822
};
long i;
Contributor

@wlammen wlammen Dec 27, 2021

Declaration is Initialization rule (section 4.5 in https://docs.oracle.com/cd/E17984_01/doc.898/e14699/variables_data_structs.htm) Initialize with value from line 1113

mmdata.c Outdated
/* This simply computes a XOR of the first numbers */
int nmbrHash(nmbrString *s)
{
static long salt[] = { 4938, 48977, 6897, 7293, 2663, 7925, 2999, 12238,
Contributor

@wlammen wlammen Dec 27, 2021

Only 13 (?) values actually needed. Remove/comment out unneeded ones.

Where is the hash algorithm explained? Add a comment/link.

to be continued...

mmdata.c Outdated
long i;
if (sstart - 1 + len > nmbrLen(s)
|| tstart - 1 + len > nmbrLen(t)) return 0;
for (i = 0; s[sstart-1+i] == t[tstart-1+i] && i<len; i++);
Contributor

Suspicious index in first loop: -1, if sstart == 0. (1) Out of bounds access to s and t, or (2) Parameters sstart and tstart must be > 0.

Either provide parameter checks (cf. line 1211), or (minimum) state limitations in comment line 1122
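
A sketch of option (2) enforced by parameter checks, written against plain arrays with explicit lengths rather than `nmbrString`. The name `rangeEq` and its signature are hypothetical. It also tests `i < len` before touching the arrays, which the loop in the snippet above does in the opposite order.

```c
/* Returns 1 if s[sstart .. sstart+len-1] equals t[tstart .. tstart+len-1],
   with 1-based start positions; returns 0 if they differ or if a range is
   invalid.  slen and tlen are the lengths of s and t. */
int rangeEq(const long *s, long slen, long sstart,
            const long *t, long tlen, long tstart, long len) {
  if (sstart < 1 || tstart < 1 || len < 0) return 0;  /* reject bad input */
  if (sstart - 1 + len > slen || tstart - 1 + len > tlen) return 0;
  for (long i = 0; i < len; i++) {  /* bound is checked before each access */
    if (s[sstart - 1 + i] != t[tstart - 1 + i]) return 0;
  }
  return 1;
}
```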

mmdata.c Outdated
long nmbrInstrN(long start_position, long occ, nmbrString *string1,
nmbrString *string2, long start2, long length2)
{
if (start_position < 1) start_position = 1;
Contributor

This means: if garbage is provided in parameter start_position, then I sanitize it to something more useful, hoping for the best. This kind of fault tolerance encourages sloppy programming on the caller's side. Better to throw an exception, or call bug().

mmdata.c Outdated
@@ -1170,6 +1203,34 @@ nmbrString *nmbrSpace(long n)
return (sout);
}

/* Search for the nth occurrence of string2 in string1 */
Contributor

Explain parameters (semantics and limitations) in a comment. For example, what does occ mean? n would match the function's name.

mmdata.c Outdated
start_position--;
for(; occ > 0;occ--) {
long ls1, i, j;
ls1 = nmbrLen(string1);
Contributor

Pull constant evaluations out of the loop.

mmdata.c Outdated
if (start_position < 1) start_position = 1;
start_position--;
for(; occ > 0;occ--) {
long ls1, i, j;
Contributor

declare i and j in for commands

mmdata.c Outdated
}
}
if (found) {
start_position = i+1;
Contributor

incorrect indentation

mmdata.c Outdated
for (i = start_position - 1; i <= ls1 - length2; i++) {
flag found = 1;
for (j = 0; j < length2; j++) {
if (string1[i+j] != string2[start2-1+j]) {
Contributor

This condition is usually part of the for-command.
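
The remarks on this function (reject bad parameters, hoist loop invariants, declare loop variables where they are used, put the match test in the loop condition) could be combined roughly as below. `findNth` is a hypothetical plain-array analogue of `nmbrInstrN`, not a drop-in replacement; `assert` stands in for the project's `bug()`, and overlapping occurrences are counted, which may or may not match the original.

```c
#include <assert.h>

/* Returns the 1-based position of the n-th occurrence of pat[0..patLen-1]
   in text[0..textLen-1], searching from 1-based position start, or 0 if
   there is no such occurrence.  Requires start >= 1, n >= 1, patLen >= 1. */
long findNth(const long *text, long textLen, long start, long n,
             const long *pat, long patLen) {
  assert(start >= 1 && n >= 1 && patLen >= 1);  /* stand-in for bug() */
  const long last = textLen - patLen;           /* hoisted loop invariant */
  for (long i = start - 1; i <= last; i++) {
    long j = 0;
    while (j < patLen && text[i + j] == pat[j]) j++;  /* match test is the loop condition */
    if (j == patLen && --n == 0) return i + 1;
  }
  return 0;
}
```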

mmdata.c Outdated
/* Add a single number to start of a nmbrString - faster than nmbrCat */
nmbrString *nmbrUnshiftElement(nmbrString *g, long element)
{
long length;
Contributor

declaration should be initialization

mmdata.c Outdated
@@ -1548,6 +1609,21 @@ nmbrString *nmbrAddElement(nmbrString *g, long element)
}


/* Add a single number to start of a nmbrString - faster than nmbrCat */
Contributor

The return value is pushed on an internal variable stack with an implicit memory management. This important detail is not mentioned in the comment.

mmdata.h Outdated
@@ -281,6 +281,10 @@ long nmbrLen(nmbrString *s);
long nmbrAllocLen(nmbrString *s);
void nmbrZapLen(nmbrString *s, long length);

/* Search for the nth occurrence of string2 in string1 */
Contributor

@wlammen wlammen Dec 28, 2021

Should explain parameters and specify their limitations.

mmhtbl.c Outdated
#define NO_LINKEDITEM -1
linked *linkedItems;
int free_linkedItem;
flag htinit_done = 0;
Contributor

This is private to htinit() so declare it there as a static variable

mmhtbl.c Outdated
/* Static buffer for the linked lists */
#define NB_LINKEDITEMS 50000L
#define NO_LINKEDITEM -1
linked *linkedItems;
Contributor

It is safer to initialize this to NULL and 0

mmhtbl.c Outdated

/* Create and fill the structure */
hashtable ht;
ht.name = name;
Contributor

@wlammen wlammen Dec 28, 2021

The problem here is that you copy a pointer, not its value, from a parameter to the result. The caller handles the parameter's contents with a series of let() instructions, finally freeing it. All these operations are done without ht.name in mind. This opens up all kinds of memory allocation/access failures. Even if you know this won't happen in the foreseeable future, you need a guarantee, or a contract, here to decouple caller and callee.
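
One way to provide that guarantee is for the table to own a private copy of the name. A minimal sketch using standard C allocation instead of the project's `let()`; `tableSketch` and its functions are hypothetical names, not the mmhtbl.c API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  char *name;  /* owned by the table, independent of the caller's buffer */
} tableSketch;

/* Copies name into storage owned by ht.  Returns 0 on success, -1 if
   memory could not be allocated. */
int tableSketchInit(tableSketch *ht, const char *name) {
  assert(ht != NULL && name != NULL);
  size_t n = strlen(name) + 1;
  ht->name = malloc(n);
  if (!ht->name) return -1;
  memcpy(ht->name, name, n);
  return 0;
}

/* Releases the owned copy. */
void tableSketchFree(tableSketch *ht) {
  assert(ht != NULL);
  free(ht->name);
  ht->name = NULL;
}
```

After `tableSketchInit`, the caller may modify or free its own string without affecting the table.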

mmhtbl.c Outdated
/* Found it, free the object and remove it from the chain */
hashtable->freeFunc(&linkedItems[*pli].key, &linkedItems[*pli].object);
int old_free = free_linkedItem;
free_linkedItem = *pli;
Contributor

@wlammen wlammen Dec 28, 2021

DRY duplicated code from line 76

to be continued...

@tirix
Author

tirix commented Dec 30, 2021

@wlammen thank you very much for your careful review!
I think you are not quite done yet: please give me a heads up when you are, and I will try to address all remarks at once.

@wlammen
Contributor

wlammen commented Dec 30, 2021

New Year's Eve is closing in. I need some time for it.

mmwsts.c Outdated
/* The structure containing information about the STS variable tokens */
struct stsVar_struct {
long stsType; /* type of the token in the STS (must be a constant tokenId) */
long stsSchemeId; /* number of the schemed in which this variable is defined + 1. */
Contributor

"schemed" typo

mmwsts.c Outdated
};

/* Current output format for STS */
vstring stsFormat = "";
Contributor

g_ prefix missing from global variables (here and elsewhere in this file)

mmwsts.c Outdated
/* Math symbol comparison for bsearch */
/* Here, key is pointer to a character string. */
/* Here we search only the global tokens, those
* which endStatement is the last statement */
Contributor

English: where endStatement...

mmwsts.c Outdated
/* Here, key is pointer to a character string. */
/* Here we search only the global tokens, those
* which endStatement is the last statement */
int mathSrchGlbCmp(const void *key, const void *data)
Contributor

is there a reason to avoid specific C types in parameter declarations? And so dodge C type checking?
key: char const *
data: mathToken_struct const *
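
For context, the standard `bsearch` requires a comparator of type `int (*)(const void *, const void *)`, so the parameters themselves cannot be more specific; a common compromise is to convert to typed pointers on the first lines of the function. A sketch under that assumption, with a hypothetical token table and index array (`tokenNames`, `sortedKeys`, `cmpKeyToToken` and `lookupToken` are not from the PR):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token table and an index array sorted by token name. */
static const char *tokenNames[] = { "(", ")", "->", "ph", "ps" };
static const long sortedKeys[] = { 0, 1, 2, 3, 4 };

/* bsearch comparator: key is a token name, data is an element of
   sortedKeys.  Typed views are taken immediately, so the rest of the
   body is checked by the compiler. */
static int cmpKeyToToken(const void *key, const void *data) {
  const char *name = key;
  const long *index = data;
  return strcmp(name, tokenNames[*index]);
}

/* Returns the token index for name, or -1 if it is not in the table. */
long lookupToken(const char *name) {
  assert(name != NULL);
  const long *hit = bsearch(name, sortedKeys,
                            sizeof sortedKeys / sizeof sortedKeys[0],
                            sizeof sortedKeys[0], cmpKeyToToken);
  return hit ? *hit : -1;
}
```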

mmwsts.c Outdated
if(g_MathToken[ *((long *)data) ].endStatement == g_statements) return 0;

/* Find the direction in which the target token is */
for(long *ptr = (long*)data; !strcmp(key, g_MathToken[ *ptr ].tokenName); ptr++)
Contributor

Optimization: The first loop seems to check element data again, and the outcome is known to be 0. So I suggest to initialize with long* ptr = (long*) data + 1;

mmwsts.c Outdated
/* Cache to speed up conversions */
hashtable stsCache;

/* Math symbol comparison for bsearch */
Contributor

@wlammen wlammen Jan 2, 2022

The comment should explain the result, which is limited to -1, 0 and +1, and what is delivered when.
It seems possible that the token list contains the same key multiple times in succession. Since this is part of a binary search, data may point somewhere in the middle of such a series. Is it guaranteed to hold just one global element (there is a suspicious active flag in the structure that may allow a disabled and an enabled element in the same series)? Is it guaranteed that a global element is always present? If either assumption fails, the binary search may fail.

Clarify the so-called pre-/postconditions in a comment.
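
If runs of equal keys are possible, one alternative is to keep the comparison purely lexical and resolve the run after the search. The sketch below swaps the comparator-based approach for a lower-bound binary search followed by a short linear scan; `tokSketch`, `isGlobal` and `findGlobal` are hypothetical names, and the pre- and postconditions are stated in the comment.

```c
#include <assert.h>
#include <string.h>

typedef struct {
  const char *name;
  int isGlobal;  /* hypothetical flag: token is visible at the end of the file */
} tokSketch;

/* Precondition: toks[0..n-1] is sorted by name (strcmp order); equal names
   may repeat.  Postcondition: returns the index of the first element named
   name with isGlobal set, or -1 if there is none. */
long findGlobal(const tokSketch *toks, long n, const char *name) {
  assert(toks != NULL && name != NULL && n >= 0);
  long lo = 0, hi = n;
  while (lo < hi) {  /* lower bound: first element not less than name */
    long mid = lo + (hi - lo) / 2;
    if (strcmp(toks[mid].name, name) < 0) lo = mid + 1; else hi = mid;
  }
  /* Scan the run of equal names for the global entry. */
  for (long i = lo; i < n && strcmp(toks[i].name, name) == 0; i++) {
    if (toks[i].isGlobal) return i;
  }
  return -1;
}
```

Because the search always lands on the first element of a run, the result does not depend on which element a generic binary search happens to probe.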

mmwsts.c Outdated
long i;
char *fbPtr;
long textLen, tokenLen_;
long *g_mathKeyPtr; /* bsearch returned value */
Contributor

incorrect use of prefix g_

mmwsts.c Outdated
/* Make sure that g_mathTokens has been initialized */
if (!g_mathTokens) bug(1717);

textLen = (long)strlen(text);
Contributor

Definition is initialization: long textLen = (long) strlen(text); same for wrklen etc.

mmwsts.c Outdated
#include "mmvstr.h"
#include "mmdata.h"
#include "mminou.h"
#include "mmpars.h" /* For rawSourceError and mathSrchCmp and lookupLabel */
Contributor

and whiteSpaceLen

mmwsts.c Outdated
wrkNmbrPtr[mathStringLen] = *g_mathKeyPtr;
mathStringLen++;
fbPtr = fbPtr + tokenLen_ + 1; /* Move on to next token */
if(fbPtr >= text + textLen) break;
Contributor

This should be the while condition

mmwsts.c Outdated
return NULL_NMBRSTRING;
}
wrkNmbrPtr[mathStringLen] = *g_mathKeyPtr;
mathStringLen++;
Contributor

tip: wrkNmbrPtr[mathStringLen++] = ... saves the following line and can reduce code size
same for fbPtr += tokenLen_ + 1;

mmwsts.c Outdated
return mathString;
}

/* Store a couple key/object into the cache */
Contributor

@wlammen wlammen Jan 2, 2022

Document Pre/Postconditions

mmwsts.c Outdated
}

/* Dump a couple key/object from the cache */
void stsDumpCache(nmbrString *key, vstring object) {
Contributor

suspicious cast to eqFunc in line 235: The signature of this function does not match that of eqFunc: int (eqFunc)(void *, void *)

mmwsts.c Outdated
if(stsUseCache) {
stsCache = htcreate(format, STS_CACHE_BUCKETS, "", (hashFunc*)&nmbrHash, (eqFunc*)&nmbrEq,
(letFunc*)&stsStoreCache, (freeFunc*)&stsFreeCache,
(eqFunc*)&stsDumpCache);
Contributor

suspicious cast: signature of stsDumpCache does not match eqFunc.

mmwsts.c Outdated
}

/* Parse a file containing the structured typesetting rules. */
int parseSTSRules(vstring format) {
Contributor

@wlammen wlammen Jan 3, 2022

Coding style: This function is way too long. It covers more than 300 lines. This exceeds the recommended max length of 20 lines by far, and its code is easily broken down into steps that can be moved into helper functions. https://stackoverflow.com/questions/475675/when-is-a-function-too-long

document pre/postconditions: example: stsFormat is both an input and an output variable, but that is not easily seen.

mmwsts.c Outdated
g_outputToString = 0;

/* If the same format was already parsed, nothing to do. */
if(strcmp(stsFormat, format) == 0) {
Contributor

move the early out code to the beginning of the function where parameter checks usually take place
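
A minimal sketch of that layout, with hypothetical names (loadFormat, currentFormat, parseCount) standing in for the real function and its state:

```c
#include <string.h>

static char currentFormat[32] = "";  /* last format that was parsed */
static int parseCount = 0;           /* how often the expensive path ran */

/* Hypothetical sketch: returns 1 if parsing was performed, 0 if skipped. */
static int loadFormat(const char *format) {
  /* Parameter checks and early outs come first. */
  if (format == NULL || strlen(format) >= sizeof currentFormat) return 0;
  if (strcmp(currentFormat, format) == 0) return 0;  /* already parsed */

  /* Expensive setup and parsing would start only here. */
  strcpy(currentFormat, format);
  parseCount++;
  return 1;
}
```

Placing the checks at the top means no state is touched before the function knows it has work to do.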

mmhtbl.c Outdated
/* Dumpts the whole table */
void htdump(hashtable *hashtable) {
print2("Hashtable %s:\n", hashtable->name);
//for(int bucket=0;bucket<hashtable->bucket_count;bucket++) {
Contributor

This type of comment is not ANSI C compatible (see section 3.1.9). Use /* ... */ instead

Member

I thought we were agreed on C99 now? I don't mind seeing // creep in, even if we don't do any bulk conversions.

Contributor

Have we? Can you point me to where the decision took place?

Member

@digama0 digama0 Jan 5, 2022

I'm thinking of #8 (comment). In short: we already compile only on C99, even before taking into account the recent refactors. (I'm not opposed to pushing the minimum beyond C99 (i.e. C11), but I don't think we should consider ANSI C (C89) any more.)

Contributor

In C99 this issue is not relevant and can be ignored.

mmdata.c Outdated
{
if (start_position < 1) start_position = 1;
start_position--;
for(; occ > 0;occ--) {
Contributor

Optimization: Use a dedicated loop variable in for loops. A compiler will allocate a CPU register for that.
for (long o = occ + 1; --o > 0;) {... (replace occ with o in loop)...}
is how I would write it.

Info: There is a tiny semantic change in my example: if the machine uses sign-magnitude representation instead of two's complement AND occ is LONG_MAX, then (occ + 1) - 1 == occ may not hold. We can safely ignore this nowadays.

Member

long is a signed type, and signed overflow is UB, so I think that this change is legal by the spec.

Contributor

@digama0 Exactly. Because occ+1 is UB when occ==LONG_MAX, --o may yield anything, and the loop starts with an arbitrary value. This cannot happen with the original code. We can safely ignore this semantic difference here.
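
For comparison, a sketch of a dedicated loop counter that never evaluates occ + 1 and therefore has no overflow case. The function countIterations is a hypothetical stand-in for the loop in nmbrInstrN; this is a variant of the suggestion above, not the form proposed in the review.

```c
/* Hypothetical stand-in for the loop under discussion: runs the body
   `occ` times using a dedicated counter. Starting at occ (not occ + 1)
   avoids the signed overflow that occ + 1 would cause for LONG_MAX. */
static long countIterations(long occ) {
  long iterations = 0;
  for (long o = occ; o > 0; o--) {
    iterations++;  /* placeholder for the search step */
  }
  return iterations;
}
```

Like the original `for(; occ > 0; occ--)`, this runs zero times for occ <= 0, and it leaves the parameter itself unchanged.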

mmdata.c Outdated
@@ -1098,6 +1131,34 @@ nmbrString *nmbrSpace(long n)
return sout;
}

/* Search for the nth occurrence of string2 in string1 */
long nmbrInstrN(long start_position, long occ, nmbrString *string1,
Contributor

Why is occ defined as long? Do you really expect more than two billion occurrences (or even 32000, should int be only 16 bits wide) of a substring in a string?

mmdata.c Outdated
}
}
if (found) {
start_position = i+1;
Contributor

check indentation

mmhtbl.c Outdated
#define NO_LINKEDITEM -1
linked *linkedItems;
int free_linkedItem;
flag htinit_done = 0;
Contributor

Make this a static variable within htinit(). It is a completely private implementation detail of this function.
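
A minimal sketch of the pattern, with hypothetical names (tableInit, initCalls) in place of the real htinit() code:

```c
static int initCalls = 0;  /* for demonstration only: counts real setups */

/* Hypothetical init routine guarded by a function-local flag. */
static void tableInit(void) {
  static int done = 0;  /* keeps its value across calls; invisible outside */
  if (done) return;
  done = 1;
  initCalls++;          /* the one-time setup would go here */
}
```

The flag has static storage duration like a file-scope variable, but no other function can read or reset it by accident.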

mmwsts.c Outdated
return mmlLine;
}


Contributor

change this to
#if 0
test code
#endif
to comply with ANSI C
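
A short sketch of how that looks in practice; answer and answerTest are hypothetical names, not code from mmwsts.c:

```c
static int answer(void) {
  return 42;
}

#if 0
/* Test code kept in the file but excluded by the preprocessor.
   Unlike a block comment, this region may itself contain comments. */
static int answerTest(void) {
  return answer() == 42;
}
#endif
```

Changing the 0 to 1 re-enables the test code without touching any of its lines.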

Contributor

wlammen commented Jan 5, 2022

I think there are already lots of ideas and issues for refactoring the source. In particular, extracting code from long functions into auxiliary sub-functions and documenting pre/postconditions can help with further review, since the source then becomes a lot more readable.

In addition, the merge conflicts have to be resolved.

Author

tirix commented Jan 8, 2022

Thank you very much @wlammen for your efforts reviewing my code!

Indeed, before this can be merged, I'll have to resolve the conflicts with all the refactoring that's going on.

Member

digama0 commented Jan 8, 2022

I took care of merging this with master. @tirix, you should double check the last commit, which fixes a few warnings I was getting with the original version.

Author

tirix commented Jan 8, 2022

Thank you Mario!
This looks good to me!

4 participants