Prefect Dataflow Orchestration for Twitter Sentiment Analysis
This project uses Prefect for dataflow orchestration and NLTK's SentimentIntensityAnalyzer to analyze Twitter data, providing insight into public sentiment and trends.
Introduction
This project uses Prefect, a dataflow orchestration tool, to manage a sentiment analysis workflow over a Twitter dataset from Kaggle. Prefect runs the data pipeline from ingestion through processing and improves the reliability and error handling of each run. The analysis itself is performed with NLTK's SentimentIntensityAnalyzer, which scores the emotional tone of each Twitter post. Together, the two show how data orchestration and natural language processing can be combined to extract public sentiment trends from social media data.
Read keywords from user
The workflow begins by collecting keywords from the user. Instead of a frontend form, we use a publicly accessible Google spreadsheet in which users enter their search terms, one per line. Each keyword is then assigned 100 randomly selected tweets. The fetch task is tagged 10_at_a_time so that up to 10 workers run it in parallel.
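import pandas as pd
from prefect import task

# util is the project's helper, defined elsewhere in the repository.

@task
def read_search_terms():
    # Load the user's search terms from the CSV source.
    search_terms_df = pd.read_csv(util.search_terms_csv)
    search_terms = list(search_terms_df['search_term'])
    return search_terms

@task(tags=["10_at_a_time"])
def fetch_tweets(search_term: str):
    return util.fetch_tweets(search_term=search_term)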
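As a rough illustration of this step outside Prefect, the sketch below parses a one-column CSV of search terms, assigns each term up to 100 randomly sampled tweets, and caps parallelism at 10. It uses only the standard library: ThreadPoolExecutor stands in for Prefect's tagged task runs, and the names sample_tweets and fetch_all, along with the in-memory tweet list, are assumptions made for the example rather than the project's util code.

```python
import csv
import io
import random
from concurrent.futures import ThreadPoolExecutor


def read_search_terms(csv_text: str) -> list[str]:
    """Return the non-empty values of the 'search_term' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    terms = []
    for row in reader:
        term = (row.get("search_term") or "").strip()
        if term:
            terms.append(term)
    return terms


def sample_tweets(search_term: str, tweets: list[str], k: int = 100, seed=None) -> dict:
    """Pair a search term with up to k tweets drawn at random (hypothetical helper)."""
    rng = random.Random(seed)
    chosen = rng.sample(tweets, min(k, len(tweets)))
    # Same dict shape the analysis stage reads: "search_term" and "tweets".
    return {"search_term": search_term, "tweets": chosen}


def fetch_all(search_terms: list[str], tweets: list[str], max_workers: int = 10) -> list[dict]:
    """Run sample_tweets for every term with at most max_workers in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of search_terms in its results.
        return list(pool.map(lambda term: sample_tweets(term, tweets), search_terms))
```

In the Prefect version, the cap of 10 would come from the concurrency limit attached to the 10_at_a_time tag rather than from a thread pool size.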
Analyze sentiment from text
At this stage, sentiment analysis is run on each individual tweet to determine its polarity score. A compound score of 0.05 or higher is labeled Positive, a score of -0.05 or lower is labeled Negative, and anything in between is Neutral. The analysis task is tagged 1000_at_a_time so that up to 1000 Prefect workers can run it simultaneously.
# Created once in the helper class's initializer.
self.sia = SentimentIntensityAnalyzer()

@task(tags=["1000_at_a_time"])
def analyze_sentiment_from_text(search_term: str, tweet: str):
    return util.analyze_sentiment_from_text(search_term=search_term, text=tweet)
#---
def perform_sentiment_analysis(list_of_tweet_dicts: list):
    # Submit one task run per tweet, then wait for all results.
    senti_analysis = list()
    for tweets_data in list_of_tweet_dicts:
        search_term = tweets_data["search_term"]
        tweets = tweets_data["tweets"]
        for tweet in tweets:
            senti_analysis.append(
                analyze_sentiment_from_text.submit(search_term, tweet)
            )
    return [senti.result() for senti in senti_analysis]
#---
def analyze_sentiment_from_text(self, search_term, text):
    sentiment = self.sia.polarity_scores(text=text)
    compound_score = sentiment["compound"]
    resulting_sentiment = ""
    if compound_score >= 0.05:
        resulting_sentiment = "Positive"
    elif compound_score <= -0.05:
        resulting_sentiment = "Negative"
    else:
        resulting_sentiment = "Neutral"
    # The excerpt stops here; the value returned to the caller is not shown.
#---
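To make the labeling rule and the submit-then-collect pattern easier to try in isolation, here is a minimal standard-library sketch. It is not the project's implementation: score_fn is a hypothetical stand-in for a call such as SentimentIntensityAnalyzer().polarity_scores(text)["compound"], ThreadPoolExecutor replaces Prefect's task runs, and the returned dict shape is an assumption, since the original method's return value is not shown.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_compound(compound_score: float) -> str:
    """Map a compound score to a label using the +/-0.05 thresholds from the excerpt."""
    if compound_score >= 0.05:
        return "Positive"
    if compound_score <= -0.05:
        return "Negative"
    return "Neutral"


def analyze(search_term: str, text: str, score_fn) -> dict:
    """Score one tweet and return a record (assumed shape, for illustration only)."""
    compound = score_fn(text)
    return {
        "search_term": search_term,
        "text": text,
        "compound": compound,
        "sentiment": classify_compound(compound),
    }


def perform_sentiment_analysis(list_of_tweet_dicts: list, score_fn, max_workers: int = 8) -> list:
    """Submit one job per tweet, then gather results in submission order."""
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for tweets_data in list_of_tweet_dicts:
            for tweet in tweets_data["tweets"]:
                futures.append(
                    pool.submit(analyze, tweets_data["search_term"], tweet, score_fn)
                )
        return [future.result() for future in futures]
```

Collecting futures first and calling result() afterwards mirrors the original loop: all jobs are queued before any result blocks, so the work can overlap up to the configured limit.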
Sentiment Analysis report
The final output of the process is a CSV file of sentiment analysis results in which each tweet's sentiment is paired with the user-specified keyword. The file can be loaded into data visualization platforms or libraries to summarize the findings.
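A report of this kind could be written and summarized roughly as follows. The column names (search_term, text, sentiment) and the helper names write_report and summarize are assumptions for the example; the project's actual CSV layout is not shown here.

```python
import csv
import io
from collections import Counter

# Assumed column layout for the report (hypothetical).
FIELDNAMES = ["search_term", "text", "sentiment"]


def write_report(rows: list, out_file) -> None:
    """Write one CSV row per analyzed tweet, ignoring any extra keys."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


def summarize(rows: list) -> Counter:
    """Count tweets per (search_term, sentiment) pair, e.g. as input for a chart."""
    return Counter((row["search_term"], row["sentiment"]) for row in rows)


if __name__ == "__main__":
    rows = [
        {"search_term": "prefect", "text": "nice, tool", "sentiment": "Positive"},
        {"search_term": "prefect", "text": "meh", "sentiment": "Neutral"},
    ]
    buffer = io.StringIO()
    write_report(rows, buffer)
    print(buffer.getvalue())
    print(summarize(rows))
```

Passing an open file object instead of the StringIO buffer writes the same content to disk; open it with newline="" as the csv module documentation recommends.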
Conclusion
In conclusion, this project demonstrates how data orchestration with Prefect can be combined with natural language processing for sentiment analysis. Prefect manages the dataflow and NLTK's SentimentIntensityAnalyzer scores the Twitter data, giving a pipeline designed to handle large volumes of social media data. The output, a sentiment analysis report in CSV format, links each tweet to its user-supplied keyword and is ready for use with visualization tools to explore public sentiment trends.
GitHub Repository
The GitHub repository with all code deliverables is available at https://github.com/shivrajd6/BDAT1011.Group2.FinalPresentation