Development of Real Time Analytics of Movies Review Data using PySpark
Prakash K. Aithal1, Dinesh Acharya U.2, Geetha M.3
1Prakash K. Aithal, Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal (Karnataka), India.
2Dinesh Acharya U., Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal (Karnataka), India.
3Geetha M., Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal (Karnataka), India.
Manuscript received on 10 February 2019 | Revised Manuscript received on 23 February 2019 | Manuscript Published on 04 March 2019 | PP: 542-545 | Volume-7 Issue-5S2 January 2019 | Retrieval Number: ES2097017519/19©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Data play a vital role in every organization. Data can be divided into structured, semi-structured, and unstructured. Unstructured data cannot be processed in real time using an RDBMS or Hadoop. Spark extends the Hadoop architecture and combines the strengths of both Hadoop and Storm. Spark supports languages such as Scala, Java, Python, and R. The proposed method uses PySpark to analyze, in real time, a movie review dataset of 50000 reviews written by 36409 people for 1539 movies. Since movie reviews are written by many users in real time, real-time analysis of the data is necessary. The method finds the users who are most active in writing movie reviews. This analysis may be used to give incentives to active reviewers. Further, the analysis identifies the more popular movies based on the number of reviews. To achieve these tasks, basic map, reduce, and filter operations have been applied. The analysis shows that the movie with code B002VL2PTU has been reviewed by the largest number of people, and that the largest number of reviews by a single user, 112, was written by the user with code A3LZGLA88K0LA0. A frequency count of the words in the movie reviews is also computed, and the sentiment of a user can be analyzed using unigrams.
Keywords: Real-time Analytics; Big Data; PySpark.
Scope of the Article: Real-Time Information Systems
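The abstract describes the analysis only as basic map, reduce, and filter operations, and the authors' code is not shown here. The following is a minimal sketch of that pattern in plain Python, standing in for the PySpark RDD operations `map`, `flatMap`, `filter`, and `reduceByKey`. The sample records, the helper `count_by_key`, and the `min_reviews` threshold are hypothetical and chosen only for illustration; they are not taken from the paper's dataset.

```python
from functools import reduce

# Hypothetical sample records: (user_id, movie_id, review_text).
reviews = [
    ("U1", "M1", "great movie great cast"),
    ("U2", "M1", "boring plot"),
    ("U1", "M2", "great fun"),
    ("U3", "M1", "not great"),
]


def count_by_key(pairs):
    """Sum the values that share a key, similar in spirit to reduceByKey(add)."""
    def merge(acc, pair):
        key, value = pair
        acc[key] = acc.get(key, 0) + value
        return acc
    return reduce(merge, pairs, {})


# map: emit (key, 1) for every review; reduce: sum the ones per key.
reviews_per_user = count_by_key(map(lambda r: (r[0], 1), reviews))
reviews_per_movie = count_by_key(map(lambda r: (r[1], 1), reviews))

# filter: keep users with at least min_reviews reviews (assumed threshold).
min_reviews = 2
active_users = dict(
    filter(lambda kv: kv[1] >= min_reviews, reviews_per_user.items())
)

# The movie with the largest review count.
most_reviewed_movie = max(reviews_per_movie.items(), key=lambda kv: kv[1])

# Unigram frequency: split every review into lowercase words
# (the flatMap step), then count each word.
words = (word for _, _, text in reviews for word in text.lower().split())
word_counts = count_by_key(map(lambda w: (w, 1), words))

print(active_users)
print(most_reviewed_movie)
print(word_counts["great"])
```

In PySpark the same steps would typically be chained on an RDD, with `reduceByKey` taking the place of the hypothetical `count_by_key` helper, so that the counting is distributed across the cluster rather than done in a single process.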