
remove single quotes around words while preserving apostrophes #159

Open
wants to merge 1 commit into
base: develop

ejdweck commented Nov 4, 2018

While using the sentiment library to analyze headlines, I noticed that words wrapped in single quotes are not tokenized properly.

For example, take this news headline from cnn.com:

Abrams: Trump is 'wrong,' I am qualified to be Georgia's governor

Here 'wrong' should be tokenized as wrong, without the surrounding single quotes.

In its current state, the library strips double quotes from words but not single quotes. My guess is that this is to preserve apostrophes: if you add a ' to the .replace regex, every single quote is removed, including the apostrophe in Georgia's.
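To illustrate the distinction between quoting marks and apostrophes, here is a minimal sketch. It assumes a quote only counts as a quoting mark when it sits at a word edge (after whitespace or the start of the text, or before whitespace, punctuation, or the end of the text). The helper name stripWordQuotes is hypothetical and is not part of the sentiment API, and this is not necessarily the regex used in the commit.

```javascript
// Hypothetical helper for illustration only; not part of the sentiment library.
function stripWordQuotes(text) {
    return text
        // Opening quote: a single quote at the start of the text or right after whitespace.
        .replace(/(^|\s)'/g, '$1')
        // Closing quote: a single quote followed by whitespace, punctuation, or the end of the text.
        .replace(/'(?=\s|[.,!?;:]|$)/g, '');
}

const headline = "Abrams: Trump is 'wrong,' I am qualified to be Georgia's governor";
console.log(stripWordQuotes(headline));
// prints: Abrams: Trump is wrong, I am qualified to be Georgia's governor
```

The apostrophe in Georgia's survives because it has a letter on both sides. A known limitation of this approach is that a trailing plural possessive such as users' would lose its apostrophe, since it looks the same as a closing quote.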

Here is some code to reproduce the issue:

const Sentiment = require('sentiment');
const sentiment = new Sentiment();

// The same headline with no quotes, single quotes, and double quotes around "wrong".
const noQuotes = "Abrams: Trump is wrong, I am qualified to be Georgia's governor";
const singleQuotes = "Abrams: Trump is 'wrong', I am qualified to be Georgia's governor";
const doubleQuotes = "Abrams: Trump is \"wrong,\" I am qualified to be Georgia's governor";

const noQuotesResult = sentiment.analyze(noQuotes);
const doubleQuotesResult = sentiment.analyze(doubleQuotes);
const singleQuotesResult = sentiment.analyze(singleQuotes);

console.log(noQuotesResult);
console.log(doubleQuotesResult);
console.log(singleQuotesResult);

Output, in the order logged (no quotes, double quotes, single quotes):
{ score: -2,
  comparative: -0.18181818181818182,
  tokens:
   [ 'abrams',
     'trump',
     'is',
     'wrong',
     'i',
     'am',
     'qualified',
     'to',
     'be',
     'georgia\'s',
     'governor' ],
  words: [ 'wrong' ],
  positive: [],
  negative: [ 'wrong' ] }
{ score: -2,
  comparative: -0.18181818181818182,
  tokens:
   [ 'abrams',
     'trump',
     'is',
     'wrong',
     'i',
     'am',
     'qualified',
     'to',
     'be',
     'georgia\'s',
     'governor' ],
  words: [ 'wrong' ],
  positive: [],
  negative: [ 'wrong' ] }
{ score: 0,
  comparative: 0,
  tokens:
   [ 'abrams',
     'trump',
     'is',
     '\'wrong\'',
     'i',
     'am',
     'qualified',
     'to',
     'be',
     'georgia\'s',
     'governor' ],
  words: [],
  positive: [],
  negative: [] }

…strophes + 3 unit tests to verify code works as expected.