midichef
Contributor

@midichef midichef commented Oct 9, 2025

Some HTML tables have no header row of column names; their first row already holds data values. The loader then puts those data values in the column headers. Here's an example, where the columns get the names 1 and 1.1:

<table><tbody>
    <tr> <td>1</td><td>1.1</td> </tr>
    <tr> <td>2</td><td>2.2</td> </tr>
    <tr> <td>3</td><td>3.3</td><td>extra</td> </tr>
    <tr> <td>4</td><td>4.4</td> </tr>
    <tr> <td>5</td><td>5.5</td> </tr>
</tbody></table>

It would be useful in these situations to use --header=0. That way I can interactively change the header value to 0 and reload the sheet with ^R to move the data down by one row. This PR adds that ability to the HTML loader.
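To illustrate the intended effect, here is a minimal sketch of how a header count could split parsed table rows into header rows and data rows. It is not the loader's actual code; `split_header` is a hypothetical helper, and the rows are the example table above written as plain lists.

```python
def split_header(rows, header=1):
    """Split rows into (header_rows, data_rows).

    Hypothetical helper for illustration only; `header` plays the
    role of the --header option (number of leading header rows).
    """
    header_rows = rows[:header]
    data_rows = rows[header:]
    return header_rows, data_rows


# The example table from this PR description, one list per <tr>.
rows = [
    ["1", "1.1"],
    ["2", "2.2"],
    ["3", "3.3", "extra"],
    ["4", "4.4"],
    ["5", "5.5"],
]

# header=1: the first data row is consumed as column names.
hdr1, data1 = split_header(rows, header=1)

# header=0: no row is consumed, so all five rows stay as data.
hdr0, data0 = split_header(rows, header=0)
```

With `header=0` nothing is taken from the table, so the row holding 1 and 1.1 remains a data row.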

One note: the behavior differs from the tsv loader when options.header > 1. The tsv loader joins the header row values with the empty string '', but the HTML loader joins them with '_'. I didn't want to spend the time to fix that rare use case.
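The difference in joining can be sketched as follows. This is an illustration under assumptions, not code from either loader: `join_header_rows` is a hypothetical helper, the sample header rows are made up, and skipping empty cells is my own choice for the sketch.

```python
from itertools import zip_longest


def join_header_rows(header_rows, sep="_"):
    """Combine several header rows into one list of column names.

    Hypothetical helper: cells in the same column position are joined
    with `sep`; missing or empty cells are skipped (an assumption).
    """
    names = []
    for parts in zip_longest(*header_rows, fillvalue=""):
        names.append(sep.join(p for p in parts if p))
    return names


# Made-up two-row header, as with options.header == 2.
header_rows = [["size", "size", "name"], ["min", "max"]]

html_style = join_header_rows(header_rows, sep="_")  # joined with '_'
tsv_style = join_header_rows(header_rows, sep="")    # joined with ''
```

Here `html_style` is `['size_min', 'size_max', 'name']` and `tsv_style` is `['sizemin', 'sizemax', 'name']`, which is the mismatch described above.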

else:
    it = []
    for _ in range(self.options.header):
        r = list(list(x) for x in self.rows.pop(0))
Owner

Okay, makes sense.

Owner

@saulpw saulpw left a comment

Okay, makes sense.
