preprocessing omission in sample code 6.2 for "The War of the Worlds" #102
Comments
I totally agree. But a simpler chunk of code worked for me to solve this problem:
This is related to #85. For the book itself, we use a version of these texts that we downloaded and saved at a fixed point in time, because resources from the internet often change like this.
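A minimal sketch of that "download once, then reuse the saved copy" approach, in base R. The helper name `load_or_fetch` and the file name `books.rds` are hypothetical and not taken from the book's source; the commented-out `gutenberg_download()` call is only an assumed example of what the fetch step might be.

```r
# Hypothetical helper: reuse a saved snapshot if one exists,
# otherwise fetch the data once and save it for later runs.
load_or_fetch <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  data <- fetch()
  saveRDS(data, path)
  data
}

# Assumed usage (needs network access, so it is left commented out):
# books <- load_or_fetch("books.rds", function() {
#   gutenbergr::gutenberg_download(36, meta_fields = "title")
# })
```

With a snapshot like this, later changes to the upstream text no longer affect the results, though the saved copy can then differ from what a fresh download returns.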
Thank you for your enlightening comment 🙌.
Let us know if you have further questions!
Hi, perhaps the [gutenbergr] sources have changed since the Chapter 6.2 code was posted. The chapters in 'The War of the Worlds' aren't separated by a title starting with something like 'Chapter'; instead, Roman numerals (I., II., etc.) mark the chapters. Hence the code in the book doesn't produce errors, but it only identifies the chapters for the other three books in the example.
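One way to spot this kind of silent miss is to count, per title, how many lines match the heading pattern before relying on it. The sketch below uses a tiny made-up data frame (`books_sample`, hypothetical) in place of the real downloaded texts, and a simplified `^chapter ` pattern that is an assumption rather than the book's exact code.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the downloaded books: one row per line of text
books_sample <- tibble(
  title = c(rep("Pride and Prejudice", 3), rep("The War of the Worlds", 3)),
  text  = c("Chapter 1", "It is a truth universally acknowledged", "Chapter 2",
            "I.", "THE EVE OF THE WAR.", "II.")
)

# Count the lines per title that look like a "Chapter ..." heading
heading_counts <- books_sample %>%
  group_by(title) %>%
  summarise(
    n_chapter_lines = sum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))
  )

print(heading_counts)
```

A title with a count of zero signals that the pattern does not fit that text, even though no error is raised.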
I've personally tried something like the code below to identify the chapters. I think one general issue here is that we should inspect the data before preprocessing and before proceeding with the analysis. The book is great at showing applications of the available packages, but the examples assume some prior knowledge of the structure of the text data (e.g., knowing that some lines contain 'chapter', 'Chapter', ' Chapter', 'CHAPTER', etc., which can help us separate the chapters; small details such as whether there's a space before the word 'Chapter' also matter). In practice, it is usually this kind of nitty-gritty contextual knowledge that separates successful from erroneous text processing. The book does an excellent job of dealing more thoroughly with preprocessing in the case studies towards the end. It would be even more helpful to also have some content in the early chapters on the importance of getting to know your data (either a few lines of warnings or comments, or a dedicated section).
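library(dplyr)
library(stringr)

# `books` is the data frame from section 6.2, with `title` and `text` columns
books %>%
  filter(title == 'The War of the Worlds') %>%
  # a line holding only a Roman numeral plus a period starts a new chapter;
  # the period is escaped and the lookahead rejects an empty numeral
  mutate(chapter = cumsum(str_detect(text, regex('^(?=[MDCLXVI])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\\.$'))))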
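A Roman-numeral pattern like this is easy to get subtly wrong, so it may help to try it on a few strings first. The sketch below checks a stricter variant, with the final period escaped and a lookahead requiring at least one numeral character; the sample strings are made up for illustration and are not lines from the actual text.

```r
library(stringr)

# Stricter variant: "\\." matches a literal period, and the lookahead
# (?=[MDCLXVI]) prevents the all-optional numeral from matching nothing.
roman_heading <- regex(
  "^(?=[MDCLXVI])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\\.$"
)

# Hypothetical sample lines: real headings plus some near misses
candidates <- c("I.", "IV.", "XVII.", ".", "It", "I", "THE EVE OF THE WAR.", "")

is_heading <- str_detect(candidates, roman_heading)
print(data.frame(line = candidates, is_heading = is_heading))
```

Only the first three strings are flagged as headings. With an unescaped `.` and no lookahead, a two-letter line such as "It" or any single-character line would also count as a chapter start.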