preprocessing omission in sample code 6.2 for "The War of the Worlds" #102

XueWenSYan · 2022-02-03T21:49:21Z

Hi, perhaps the [gutenbergr] sources have changed since the Chapter 6.2 code was posted. Chapters in 'The War of the Worlds' aren't separated by a heading starting with something like 'Chapter'; instead, Roman numerals (I., II., etc.) mark the chapters. Hence the code in the book doesn't produce errors, but it only identifies chapters for the other three books in the example.

I've personally tried something like the code below to identify the chapters. I think one general issue here is that we should inspect the data before preprocessing and before we proceed with the analysis. The book is great at showing applications of the available packages, but the examples do assume some prior knowledge of the structure of the text data (e.g., knowing that some lines start with 'chapter', 'Chapter', ' Chapter', 'CHAPTER', etc., which helps us separate the chapters; small details such as whether there is a space before the word 'Chapter' also matter). In practice, it is usually this kind of nitty-gritty contextual knowledge that separates successful from erroneous text processing. The book does an excellent job of covering preprocessing in more depth in the case studies towards the end. It might be even more helpful to include something on the importance of getting to know your data (a few lines of warnings or comments, or a dedicated section) in the early chapters of the book too.

books %>%
  filter(title == 'The War of the Worlds') %>%
  mutate(chapter = cumsum(str_detect(
    text,
    # Roman numeral followed by a literal period, e.g. "I." or "XIV."
    regex('^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\\.$')
  )))
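As a rough illustration of the "inspect the data first" point, one option is to count how many lines per book match each candidate heading pattern before committing to one. This is only a sketch: the `books` tibble below is a small hypothetical stand-in (not real Project Gutenberg text), and the simplified Roman numeral pattern `^[IVXLC]+\\.$` is an assumption about what the headings look like.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical stand-in for the `books` data frame; the lines are made up
books <- tribble(
  ~title,                  ~text,
  "The War of the Worlds", "I.",
  "The War of the Worlds", "THE EVE OF THE WAR.",
  "The War of the Worlds", "II.",
  "Great Expectations",    "Chapter I",
  "Great Expectations",    "My father's family name being Pirrip",
  "Great Expectations",    "Chapter II"
)

# Count candidate chapter headings per book under each pattern
heading_counts <- books %>%
  group_by(title) %>%
  summarise(
    n_chapter_word = sum(str_detect(text, regex("^chapter ", ignore_case = TRUE))),
    n_roman        = sum(str_detect(text, "^[IVXLC]+\\.$"))
  )

heading_counts
```

A book that shows zero matches under the pattern you planned to use is a sign that its headings are formatted differently and the regex needs adjusting.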


aliaamiri · 2022-02-04T05:24:15Z

I totally agree. But a simpler chunk of code worked for me to solve this problem:

books %>%
  filter(title == "The War of the Worlds") %>%
  mutate(chapter = cumsum(str_detect(text, "^[IVX]+\\.$")))
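If all the books are processed in one pipeline, the two heading styles could be combined so that both are detected. The following is a minimal sketch under stated assumptions: the `books` tibble is a small hypothetical example rather than the real downloaded texts, and the `is_heading` column is a name introduced here for illustration.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical stand-in for the `books` data frame; the lines are made up
books <- tribble(
  ~title,                  ~text,
  "The War of the Worlds", "I.",
  "The War of the Worlds", "THE EVE OF THE WAR.",
  "The War of the Worlds", "II.",
  "The War of the Worlds", "THE FALLING STAR.",
  "Great Expectations",    "Chapter I",
  "Great Expectations",    "some text",
  "Great Expectations",    "Chapter II"
)

by_chapter <- books %>%
  group_by(title) %>%
  mutate(
    # A heading is either a line starting with "chapter " (any case)
    # or a line consisting only of a Roman numeral and a period
    is_heading = str_detect(text, regex("^chapter ", ignore_case = TRUE)) |
      str_detect(text, "^[IVXLC]+\\.$"),
    # Running count of headings within each book gives the chapter number
    chapter = cumsum(is_heading)
  ) %>%
  ungroup() %>%
  filter(chapter > 0)

by_chapter
```

The Roman numeral pattern is kept case-sensitive on purpose, so that it does not pick up lowercase lines; it is still worth checking the matched lines by eye, since a broader pattern can produce false positives in other texts.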

juliasilge · 2022-02-07T16:45:38Z

This is related to #85

For the book itself, we use a version of these texts that we downloaded at a certain point in time and saved. We did that because there are often changes like this in resources from the internet.

If you would like to step through the code exactly as in the book, I suggest cloning the repo locally and using the data files we saved: https://github.com/dgrtwo/tidy-text-mining/tree/master/data
If you would like to use updated texts from Project Gutenberg, then yep, you'll need to adjust the regex.

aliaamiri · 2022-02-07T16:56:12Z

Thank you for your enlightening comment 🙌.

juliasilge · 2022-06-09T16:28:51Z

Let us know if you have further questions!

juliasilge closed this as completed Jun 9, 2022
