Using Apache Spark for data analysis on Twitter data (schema: https://github.com/episod/twitter-api-fields-as-crowdsourced/wiki)
-
How does @PrezOno’s tweet length compare to the average of all others? What is his average tweet length, and what is the average for all other users?
File - https://github.uc.edu/ravindsn/Spark---P4/blob/master/avgTweetLength.py
shell script - avgTweetLength.sh
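A minimal sketch of how this comparison could be computed with the PySpark RDD API. It is not the contents of avgTweetLength.py. It assumes the input is a text file with one JSON tweet per line and that each tweet carries the `text` and `user.screen_name` fields from the schema linked above. The function names, the "others" group label, and the app name are hypothetical.

```python
import json

TARGET = "PrezOno"  # screen name from the question, without the leading "@"


def parse_tweet(line):
    """Return (screen_name, text) for one JSON line, or None if unusable."""
    try:
        tweet = json.loads(line)
        return tweet["user"]["screen_name"], tweet["text"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON or a record without user/text (e.g. a delete notice).
        return None


def to_keyed_length(pair):
    """Map (screen_name, text) to (group, (total_length, tweet_count))."""
    screen_name, text = pair
    group = TARGET if screen_name == TARGET else "others"
    return group, (len(text), 1)


def merge(a, b):
    """Add two (total_length, tweet_count) pairs."""
    return a[0] + b[0], a[1] + b[1]


def mean(total_and_count):
    """Turn a (total_length, tweet_count) pair into an average."""
    total, count = total_and_count
    return total / float(count)


def average_lengths(lines):
    """lines: RDD of JSON strings -> dict of {group: average tweet length}."""
    return (lines.map(parse_tweet)
                 .filter(lambda pair: pair is not None)
                 .map(to_keyed_length)
                 .reduceByKey(merge)
                 .mapValues(mean)
                 .collectAsMap())


def run(path):
    """Hypothetical entry point: path is the tweet file to analyze."""
    # Imported here so the helpers above can be exercised without Spark.
    from pyspark import SparkContext
    sc = SparkContext(appName="avgTweetLengthSketch")
    try:
        print(average_lengths(sc.textFile(path)))
    finally:
        sc.stop()
```

Carrying (total, count) pairs through `reduceByKey` keeps the reduction associative, so Spark can combine partial sums on each partition before shuffling. Averaging per-partition averages directly would give the wrong answer when partitions differ in size.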
Detect the proportion of bad words in a tweet. Plot bad word proportion by hour for all 24 hours.
File - https://github.uc.edu/ravindsn/Spark---P4/blob/master/BadWordStats.py
shell script - BadWordStats.sh
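A minimal sketch of one way to get the hourly bad word proportion with PySpark. It is not the contents of BadWordStats.py. It assumes one JSON tweet per line, a `created_at` value shaped like "Wed Aug 27 13:08:45 +0000 2008" (so hours are in UTC), and it reads "proportion" as bad words divided by all words within each hour. The `BAD_WORDS` set is a hypothetical placeholder, and the tokenizer is a simple assumption.

```python
import json
import re

BAD_WORDS = {"darn", "heck"}  # hypothetical placeholder list
WORD_RE = re.compile(r"[a-z']+")  # simple tokenizer: runs of letters/apostrophes


def hour_and_counts(line):
    """Return (hour, (bad_word_count, word_count)) for a JSON line, or None."""
    try:
        tweet = json.loads(line)
        # Fourth whitespace-separated token is "HH:MM:SS"; keep the hour.
        hour = int(tweet["created_at"].split()[3][:2])
        words = WORD_RE.findall(tweet["text"].lower())
    except (ValueError, KeyError, TypeError, IndexError, AttributeError):
        # Malformed JSON, missing fields, or an unexpected timestamp shape.
        return None
    bad = sum(1 for w in words if w in BAD_WORDS)
    return hour, (bad, len(words))


def merge(a, b):
    """Add two (bad_word_count, word_count) pairs."""
    return a[0] + b[0], a[1] + b[1]


def proportion(bad_and_total):
    """Bad words as a fraction of all words; 0.0 when there are no words."""
    bad, total = bad_and_total
    return bad / float(total) if total else 0.0


def bad_word_proportion_by_hour(lines):
    """lines: RDD of JSON strings -> list of 24 proportions, index = hour."""
    by_hour = (lines.map(hour_and_counts)
                    .filter(lambda kv: kv is not None)
                    .reduceByKey(merge)
                    .mapValues(proportion)
                    .collectAsMap())
    # Fill in hours with no tweets so the plot always has 24 points.
    return [by_hour.get(h, 0.0) for h in range(24)]
```

The returned list has one value per hour, so it can be handed to a plotting library on the driver, for example matplotlib's `plt.bar(range(24), proportions)`.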
Note: Please run chmod u+x <sh file name>.sh before executing the shell scripts.
Command for executing the programs using PySpark:
spark-submit avgTweetLength.py
Command for executing the programs on the Hadoop cluster:
spark-submit --master yarn-client avgTweetLength.py