Python: Negative / Positive Analysis: Twitter Negative / Positive Analysis Using RNN-Part 1

What is RNN?

With a polarity dictionary alone, it is difficult to judge negative and positive by reading the context of a sentence, and the analysis may not be reliable.

Here, we learn how to perform negative / positive analysis from the flow of a sentence using a recurrent neural network (RNN).

Because an RNN can memorize and learn from previously computed information, it can predict the probability of the word that comes next in a sentence, which is why it is also used for machine translation.

Of course, this can also be done in Japanese, but because of the data available, this time we analyze English text.

Twitter Negative / Positive Analysis

Twitter limits posts to 140 characters, so users write in short sentences.

Because a large amount of data can be collected, Twitter is used for many kinds of natural language processing, including negative / positive analysis.

This time, we train on Twitter data about US airlines, using the Airline Twitter sentiment dataset distributed by Figure Eight in the United States.

See the dataset's distribution page for the license.

import pandas as pd

#Load the Tweet data and keep only the text and sentiment columns
Tweet = pd.read_csv('./6020_negative_positive_data/data/Airline-Sentiment-2-w-AA.csv', encoding='cp932')
tweetData = Tweet.loc[:,['text', 'airline_sentiment']]
print(tweetData)


Creating a database

Removing frequent words

Because the RNN analyzes how words relate to one another, very frequently occurring words need to be removed.

Frequently used words such as "I" and "what" are called stop words. Google's search engine, for example, excludes stop words from its index so that the remaining words carry more relevance.

On Twitter, "@", which marks a reply, appears very frequently, and in this airline dataset so does the word "flight". We create the data with those words removed.
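
For reference, here is a quick way to inspect the stop word list that NLTK provides. This snippet is not part of the original tutorial and assumes the stopwords corpus has already been downloaded (see the note in the code below).

from nltk.corpus import stopwords

#Show how many English stop words there are, and the first few of them
print(len(stopwords.words("english")))
print(stopwords.words("english")[:10])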

import nltk
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
#If loading the stopwords raises an error, run:
#nltk.download('stopwords')

#Load Tweet data
Tweet = pd.read_csv('./6020_negative_positive_data/data/Airline-Sentiment-2-w-AA.csv', encoding='cp932')
tweetData = Tweet.loc[:,['text','airline_sentiment']]

#Performs morphological analysis of the English tweets
def tweet_to_words(raw_tweet):

    #Keep only the letters a-z/A-Z and '@', then lowercase and split into a list of words
    letters_only = re.sub("[^a-zA-Z@]", " ", raw_tweet)
    words = letters_only.lower().split()

    #Remove stop words and words starting with '@' or 'flight'
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops and not re.match("^[@]", w) and not re.match("flight", w)]
    return " ".join(meaningful_words)

cleanTweet = tweetData['text'].apply(lambda x: tweet_to_words(x))
print(cleanTweet)


Create a database of words

To find out which words influence the negative / positive judgement, we first create a database of all the words.

This database is used to tag each word with a number based on its frequency and to relate the words to the negative / positive labels.

import nltk
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords

Tweet = pd.read_csv('./6020_negative_positive_data/data/Airline-Sentiment-2-w-AA.csv', encoding='cp932')
tweetData = Tweet.loc[:,['text','airline_sentiment']]

def tweet_to_words(raw_tweet):

    #Keep only the letters a-z/A-Z and '@', then lowercase and split into a list of words
    letters_only = re.sub("[^a-zA-Z@]", " ", raw_tweet)
    words = letters_only.lower().split()

    #Remove stop words and words starting with '@' or 'flight'
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops and not re.match("^[@]", w) and not re.match("flight", w)]
    return " ".join(meaningful_words)

cleanTweet = tweetData['text'].apply(lambda x: tweet_to_words(x)) 

#Create a database
all_text = ' '.join(cleanTweet)
words = all_text.split()
print(words)


Digitize words

Each word is given a numerical tag based on how many times it occurs. We then create a new list by converting the cleanTweet strings used for training into these numbers.

import nltk
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from collections import Counter

Tweet = pd.read_csv('./6020_negative_positive_data/data/Airline-Sentiment-2-w-AA.csv', encoding='cp932')
tweetData = Tweet.loc[:,['text','airline_sentiment']]

def tweet_to_words(raw_tweet):

    #Keep only the letters a-z/A-Z and '@', then lowercase and split into a list of words
    letters_only = re.sub("[^a-zA-Z@]", " ", raw_tweet)
    words = letters_only.lower().split()

    #Remove stop words and words starting with '@' or 'flight'
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops and not re.match("^[@]", w) and not re.match("flight", w)]
    return " ".join(meaningful_words)

cleanTweet = tweetData['text'].apply(lambda x: tweet_to_words(x))

#Create a database
all_text = ' '.join(cleanTweet)
words = all_text.split()

#Count the number of times a word appears
counts = Counter(words)
#Sort the words in descending order of frequency
vocab = sorted(counts, key=counts.get, reverse=True)
#Assign IDs starting from 1 (0 is reserved for padding later)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}
#print(vocab_to_int)
#Stores the digitized string in a new list
tweet_ints = []
for each in cleanTweet:
    tweet_ints.append([vocab_to_int[word] for word in each.split()])
    
print(tweet_ints)
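
As a small illustration (with made-up words, not taken from the dataset), the numbering scheme above assigns 1 to the most frequent word, 2 to the next most frequent, and so on:

from collections import Counter

#Toy example with hypothetical words
toy_words = ["late", "late", "bag", "thanks", "late", "bag"]
toy_counts = Counter(toy_words)
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: ii for ii, word in enumerate(toy_vocab, 1)}
print(toy_vocab_to_int)   # {'late': 1, 'bag': 2, 'thanks': 3}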


Negative / Positive Quantification

Next, quantify the negative / positive label given to each sentence. This time we convert negative to 0, positive to 1, and neutral to 2.

These numbers are used as the labels when training on the sentences built from the words.

import numpy as np
import pandas as pd

#Load Tweet data
Tweet = pd.read_csv('./6020_negative_positive_data/data/Airline-Sentiment-2-w-AA.csv', encoding='cp932')
tweetData = Tweet.loc[:,['text','airline_sentiment']]

#Convert the tweet's negative / positive string label to a number
labels = np.array([0 if each == 'negative' else 1 if each == 'positive' else 2 for each in tweetData['airline_sentiment'][:]]) 

print(labels)
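
As a quick sanity check (not part of the original code), np.bincount can be used to count how many tweets fall into each class:

#Index 0 = negative, 1 = positive, 2 = neutral
print(np.bincount(labels))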

Align the number of columns

Each entry in the tweet_ints list we created holds a different number of words per tweet.

For training, the number of columns in the list must be aligned. Rows whose word count dropped to 0 during the cleanTweet processing are also removed from each list.

from collections import Counter

#This is the code content of the previous section-------------------------
import numpy as np
import pandas as pd

#Load Tweet data
Tweet = pd.read_csv('./6020_negative_positive_data/data/Airline-Sentiment-2-w-AA.csv', encoding='cp932')
tweetData = Tweet.loc[:,['text','airline_sentiment']]

#Convert the tweet's negative / positive string label to a number
labels = np.array([0 if each == 'negative' else 1 if each == 'positive' else 2 for each in tweetData['airline_sentiment'][:]])

# ----------------------------------------

#Stores the digitized string in a new list
#(cleanTweet and vocab_to_int are the objects created in the earlier sections)
tweet_ints = []
for each in cleanTweet:
    tweet_ints.append([vocab_to_int[word] for word in each.split()])

#Find out the number of words in Tweet
tweet_len = Counter([len(x) for x in tweet_ints])
print(tweet_len)
seq_len = max(tweet_len)
print("Zero-length reviews: {}".format(tweet_len[0]))
print("Maximum review length: {}".format(max(tweet_len)))

#Remove from each list the rows whose word count became 0 after the cleanTweet processing
tweet_idx  = [idx for idx,tweet in enumerate(tweet_ints) if len(tweet) > 0]
labels = labels[tweet_idx]
tweetData = tweetData.iloc[tweet_idx]
tweet_ints = [tweet for tweet in tweet_ints if len(tweet) > 0]

#Create a zero-filled array and write each row's word IDs from the right so that every row has the same number of columns
features = np.zeros((len(tweet_ints), seq_len), dtype=int)
for i, row in enumerate(tweet_ints):
    features[i, -len(row):] = np.array(row)[:seq_len]

print(features)
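
To make the left-padding concrete, here is a toy example with made-up word IDs (both the IDs and the sequence length are hypothetical, not values from the dataset):

import numpy as np

toy_seq_len = 6
toy_row = [12, 4, 7]                       #word IDs for one short, cleaned tweet
padded = np.zeros(toy_seq_len, dtype=int)
padded[-len(toy_row):] = np.array(toy_row)[:toy_seq_len]
print(padded)                              #[ 0  0  0 12  4  7]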


Summary

Creating a dataset from Tweet data

1. Loading the Tweet data

2. Morphological analysis of the Tweet data

3. Creating a database of words

4. Creating the features (database)

import nltk
import numpy as np
import pandas as pd
import re
from collections import Counter
from nltk.corpus import stopwords

#Load Tweet data
Tweet = pd.read_csv('./6020_negative_positive_data/data/Airline-Sentiment-2-w-AA.csv', encoding='cp932')
tweetData = Tweet.loc[:,['text','airline_sentiment']]

#Performs morphological analysis of the English tweets
def tweet_to_words(raw_tweet):

    #Keep only the letters a-z/A-Z and '@', then lowercase and split into a list of words
    letters_only = re.sub("[^a-zA-Z@]", " ", raw_tweet)
    words = letters_only.lower().split()

    #Remove stop words and words starting with '@' or 'flight'
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops and not re.match("^[@]", w) and not re.match("flight", w)]
    return " ".join(meaningful_words)

cleanTweet = tweetData['text'].apply(lambda x: tweet_to_words(x))

#Create a database
all_text = ' '.join(cleanTweet)
words = all_text.split()

#Count the number of times a word appears
counts = Counter(words)

#Sort the words in descending order of frequency
vocab = sorted(counts, key=counts.get, reverse=True)
#Assign IDs starting from 1 (0 is reserved for padding later)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

#Stores the digitized string in a new list
tweet_ints = []
for each in cleanTweet:
    tweet_ints.append([vocab_to_int[word] for word in each.split()])

#Find out the number of words in Tweet
tweet_len = Counter([len(x) for x in tweet_ints])
seq_len = max(tweet_len)
print("Zero-length reviews: {}".format(tweet_len[0]))
print("Maximum review length: {}".format(max(tweet_len)))

#Remove from each list the rows whose word count became 0 after the cleanTweet processing
#(tweet_idx can be used to filter the labels and tweetData rows in the same way as shown earlier)
tweet_idx  = [idx for idx,tweet in enumerate(tweet_ints) if len(tweet) > 0]
tweet_ints = [tweet for tweet in tweet_ints if len(tweet) > 0]

#Create a zero-filled array and write each row's word IDs from the right so that every row has the same number of columns
features = np.zeros((len(tweet_ints), seq_len), dtype=int)
for i, row in enumerate(tweet_ints):
    features[i, -len(row):] = np.array(row)[:seq_len]
print(features)
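
As a preview of how this dataset could be fed to an RNN, here is a minimal sketch. It assumes TensorFlow/Keras is installed and reuses vocab_to_int, seq_len, features, and the labels array (filtered with tweet_idx as shown earlier); the layer sizes and training settings are illustrative only, not necessarily the configuration used in Part 2.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

vocab_size = len(vocab_to_int) + 1             #+1 because word IDs start at 1 and 0 is used for padding
model = Sequential([
    Embedding(vocab_size, 32),                 #word IDs -> dense vectors
    SimpleRNN(32),                             #reads each tweet from left to right
    Dense(3, activation='softmax')             #negative / positive / neutral
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
#model.fit(features, labels, epochs=3, batch_size=64, validation_split=0.2)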
