WhatsParser is a tool for parsing .txt chat files generated by the WhatsApp messaging app. It is intended to make the move from WhatsApp data to a pandas DataFrame as quick as possible. Reading and parsing the .txt file is done like this:
from whatsparser import WhatsParser
messages = WhatsParser('./chat.txt')
Once the file has been parsed, all messages are stored as dictionaries with three keys: datetime, author and content. You can use indexing to access individual messages:
len(messages) # Get how many messages there are
>> 3590
messages[35] # Get a message
>> {'datetime': datetime.datetime(2017, 9, 15, 19, 10, 2),
    'author': 'Agustin Rodriguez',
    'content': 'Hi! this is a Whatsapp message'}
The datetime key stores a datetime object; all other keys have string values.
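Because each message is a plain dictionary, standard-library tools can work on the messages directly. The following is a minimal sketch, assuming a hand-written `sample_messages` list (hypothetical data standing in for a parsed chat) that uses the three keys described above:

```python
import datetime
from collections import Counter

# Hypothetical sample data with the same keys as parsed messages.
sample_messages = [
    {'datetime': datetime.datetime(2017, 9, 15, 19, 10, 2),
     'author': 'Agustin Rodriguez',
     'content': 'Hi! this is a Whatsapp message'},
    {'datetime': datetime.datetime(2017, 9, 15, 19, 12, 40),
     'author': 'Jane Doe',
     'content': 'Hello!'},
    {'datetime': datetime.datetime(2017, 9, 16, 8, 3, 11),
     'author': 'Agustin Rodriguez',
     'content': 'Good morning'},
]

# Count messages per author using the string-valued 'author' key.
messages_per_author = Counter(m['author'] for m in sample_messages)

# Select the messages sent on one day using the datetime-valued key.
day = datetime.date(2017, 9, 15)
sent_that_day = [m for m in sample_messages if m['datetime'].date() == day]
```

With a real chat, the same expressions should work on the parsed object in place of `sample_messages`, since it supports iteration.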
Convert all messages into a pandas DataFrame so you can use your favorite tools for data analysis:
df = messages.to_dataframe() # Returns a pandas dataframe
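As an illustration of what the DataFrame makes possible, here is a short sketch. It assumes the DataFrame has one column per message key (datetime, author, content); the rows below are hypothetical and built by hand so the example runs without a chat file.

```python
import datetime
import pandas as pd

# Hypothetical rows standing in for the result of messages.to_dataframe().
# Assumption: the columns are named after the three message keys.
df = pd.DataFrame([
    {'datetime': datetime.datetime(2017, 9, 15, 19, 10, 2),
     'author': 'Agustin Rodriguez',
     'content': 'Hi! this is a Whatsapp message'},
    {'datetime': datetime.datetime(2017, 9, 15, 19, 12, 40),
     'author': 'Jane Doe',
     'content': 'Hello!'},
    {'datetime': datetime.datetime(2017, 9, 16, 8, 3, 11),
     'author': 'Agustin Rodriguez',
     'content': 'Good morning'},
])

# Number of messages written by each author.
messages_per_author = df['author'].value_counts()

# Number of messages sent on each calendar day.
messages_per_day = df.groupby(df['datetime'].dt.date).size()

# Average message length, in characters, per author.
mean_length = df['content'].str.len().groupby(df['author']).mean()
```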
WhatsParser also lets you iterate over the object with various functions. Iterating over messages makes a copy of all stored messages, so the iteration and any changes apply to that copy. To change the data stored inside messages, assign the result of the iteration to messages.data.
def find_long_messages(message):
    if len(message['content']) > 100:
        return True
    return False
messages.data = list(filter(find_long_messages, messages))
# Now, messages contains only those messages with a length greater than 100 characters.
# Note: get_emoji_regexp is only available in older releases of the emoji package (before 2.0)
from emoji import get_emoji_regexp
def remove_emojis(message):
    message['content'] = get_emoji_regexp().sub(r'', message['content'])
    return message
messages.data = [remove_emojis(message) for message in messages]
# All messages have had their emojis removed from the text
# The same operation, using map
def remove_emojis(message):
    message['content'] = get_emoji_regexp().sub(r'', message['content'])
    return message
messages.data = list(map(remove_emojis, messages))
Iterate over messages.data to change the stored data in place; otherwise, just iterate over messages.
# For changing data
for message in messages.data:
    message['content'] = 'NEW CONTENT'
# Without changing the data
for message in messages:
    print(message['author'])
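The difference between the two loops can be illustrated with a small stand-in class. This is only a sketch of the behaviour described above, not WhatsParser's actual code: `MessageStore` is a hypothetical name, and the use of a deep copy during iteration is an assumption.

```python
import copy

class MessageStore:
    """Hypothetical stand-in for the parsed object (not WhatsParser's code)."""

    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]

    def __iter__(self):
        # Assumption: iteration hands out deep copies, so edits made
        # inside the loop never reach self.data.
        return iter(copy.deepcopy(self.data))

store = MessageStore([{'author': 'A', 'content': 'original'}])

# Looping over the object edits a copy; the stored data is untouched.
for message in store:
    message['content'] = 'changed on a copy'
content_after_copy_loop = store[0]['content']  # still 'original'

# Looping over .data edits the stored dictionaries directly.
for message in store.data:
    message['content'] = 'changed in place'
content_after_data_loop = store[0]['content']  # now 'changed in place'
```

Under this assumption, assigning the result of an iteration back to `data` is what makes the changes permanent.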
This project is licensed under the MIT License - see the LICENSE file for details.