The aim of this assignment is to apply Big Data analysis and create visualizations by performing real-time data analysis. A publicly accessible dataset was chosen, and descriptive and predictive analyses were performed to extract useful insights from the data. The data was cleaned and normalized before the analysis was performed. Visualization tools were then used to represent the analysis results in an understandable manner.
Because handling Big Data comes with challenges, the PySpark platform is used for development: it offers Hadoop-supported big-data processing capability, and its coding style is similar to Python. This report documents the complete data analysis process, from data collection through preprocessing and normalization to the analysis methods. It contains enough information about the system setup and the code required for the data analysis that this work can be reproduced.
In this assignment, data is collected from the Kaggle website and contains details about bike renting in the city of Seoul. From the analysis, useful insights into the demand for the bike-rental service are extracted. A data-driven predictive model is implemented using regression (Mishra, 2021) to predict future bike demand based on the state of the parameters that influence demand. The next segment provides information about the dataset.
The dataset chosen for this assignment is taken from the Kaggle website. Kaggle is a reliable and authentic data store. The link to the dataset on which the analysis is done is . This dataset has 14 categories (columns) and roughly 8,000 records (rows). It is the Seoul Bike Sharing Demand Prediction dataset, containing records from December 2017 to December 2018. The dataset has several important attributes that may affect the demand for bikes in the city of Seoul: rainfall, temperature, visibility, wind speed, snowfall, and season. It contains all common datatypes: Boolean, String, Double/Float, Character, Date, Integer, and Category.
The service of renting bikes is quite popular. This research predicts the demand for bikes in the city of Seoul using machine learning. Similar projects have been done in this domain. Wang designed a system for predicting the number of bikes available in the docking stations of Suzhou, China, using Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Random Forest models. All three models performed well, with some differences: the training time of Random Forest was shorter, whereas LSTM performed better for long-term prediction. From this paper it is understood that LSTM is a good choice for long-term prediction model design. (Wang, 2019)
Wang and Kim use regression models to predict the number of bike users. First, they use data analysis to find out how different factors such as weather, season, humidity, and wind speed affect the number of bike users. They then build the prediction system with regression models: ELM (Extreme Learning Machine), DELM (where D stands for decremental), and a Neural Network model. Among these, the ELM model performs best, having the least error. The factors used for building the prediction model are selected using AIC (Akaike information criterion). From this paper it is understood that the ELM regression algorithm works well for bike-rental prediction. (Wang & Kim, 2018)
Ranganathan has used the PySpark framework for real-time anomaly detection. Using this framework, predictions can be made with minimal delay. According to this paper, PySpark's data-handling capability, fault tolerance, and flexibility are much better than those of comparable frameworks. The accuracy achieved using PySpark was much higher than with STORM or Scala-Spark, and PySpark has lower analysis and processing latencies. From this paper it is learnt that the PySpark framework is a good choice for designing a machine learning prediction system. (G., 2020)
Ashofteh has used the PySpark framework for personal credit evaluation. The paper compares logistic regression, random forest, decision tree, support vector machine, and neural network models to find the best algorithm under the PySpark framework. The results show that logistic regression has a lower false-positive rate than the others and also higher accuracy: its AUC value was 0.909 and its F1 score was 0.815. The worst among the models compared was the support vector machine. It is understood from this paper that logistic regression is a good algorithm for detection models. (Ashofteh, 2022)
Federer and Joubert state that Tableau is an easy-to-use business intelligence tool that can create customized, meaningful, and interactive visualisations, including biomedical and scientific visualisations. With Tableau, dashboards can be created that update live. The paper provides evidence of Tableau being used for contact tracking and epidemic surveillance.
Performing data analysis in Tableau does not require complicated code, so it is a better alternative to tools like R and Python. It is also better than simpler tools like Excel because Tableau is more interactive and flexible. It can be concluded from this paper that Tableau is a suitable tool for performing data analysis. (Federer & Joubert, 2018)
This part of the report addresses three main aspects of the data analysis process: i) the data loading process in the PySpark and Tableau environments, ii) the preprocessing requirements of the data, and iii) the system setup process. These three pieces of information are very important for any data analysis work, as they help to reproduce or extend the work in the future.
Data Loading Process
The data loading process is completely different for the two selected data analysis tools. In the PySpark environment, data loading is done through commands, whereas in Tableau it is done using the GUI-based browse option.
PySpark Data Loading
To load the data in the PySpark environment, the read CSV function from the PySpark library is used. In this process it is necessary to specify the path (relative or absolute) of the data source. Before this step, the Spark session must be created in the Jupyter environment. The data loading process is shown in the following screenshot.
Data Preprocessing and Exploration
Before starting the data analysis, a set of preprocessing steps has been performed. For the two different tools, two different sets of preprocessing have been applied; details are given in this segment. Some data exploration has also been done using the PySpark tool, and it is documented here as well.
Tableau Data pre-processing
The dataset has fields for the Date and Hour of the booking. Through preprocessing, a Time field, a DateTime field, and a Week_day name have been derived. Tableau provides built-in features for generating derived attributes from existing attributes. Some of the derived-attribute creation formulas are shown below.
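The same derived fields can be sketched in plain Python as a cross-check on the Tableau formulas. The field names and the dd/mm/yyyy date format are assumptions for illustration, not taken from the Tableau workbook itself.

```python
from datetime import datetime

def derive_fields(date_str, hour):
    """Derive DateTime, Time, and Week_day from the Date and Hour fields.

    Assumes dates arrive as dd/mm/yyyy strings and hour as an integer 0-23.
    """
    d = datetime.strptime(date_str, "%d/%m/%Y").replace(hour=hour)
    return {
        "DateTime": d,              # full timestamp of the booking hour
        "Time": d.strftime("%H:%M"),   # derived Time field
        "Week_day": d.strftime("%A"),  # derived weekday name
    }

print(derive_fields("01/12/2017", 18))
```

Tableau's DATEPART and DATENAME calculated fields play the role of `strftime` here.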
This section of the report addresses the details of the experimental process. Details about the regression model development, the descriptive analysis process, and the predictive analysis are covered here.
Regression Model Development
The phases associated with the regression model development are given here. The complete code required to implement each phase is given in the Appendix of the document.
- PySpark session creation, so that data can be loaded into the system and the regression model can be created.
- Data preprocessing to create new derived attributes from the existing data.
- Transformation of categorical variables to numerical levels using string indexing.
- Data exploration to identify missing values and the correlation among data attributes.
- Selection of the most suitable attributes for the regression process.
- Conversion of the selected attributes into a vector format.
- Creation of multiple regression models using multiple regression algorithms and data formats (Logistic Regression model with all attributes, Logistic Regression model with the top-5 correlated attributes, Gradient-boosted model with all attributes, Gradient-boosted model with the top-5 correlated attributes).
- Computation of the RMSE (Root Mean Square Error) and R-squared value for each regression model.
For this assignment, descriptive analysis has been performed using the Tableau tool. This analysis extracts meaningful insights from the data and helps in understanding past and current trends in bike-rental demand in the city of Seoul. To understand the variation in parameters, time series analysis has been done. The visualisations created through this process are provided in the Result Analysis section; the types of visualisations created are described here.
- Bar Graph: For comparison of multiple data streams
- Box Plot: For showing statistical variations present in the attributes
- Packed Bubble Chart: For showing quantitative variations in multiple data streams
- Time Series/Line Graph: For showing the temporal changes of attributes.
In this assignment, predictive analysis has been done to forecast the future demand for bike-rental bookings. The time series analysis performed during the descriptive analysis was extended for the predictive analysis. The visualisations created in this analysis are provided in the Result Analysis section.
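The idea of extending a time series into a forecast can be illustrated with a simple trailing moving average. This is only a framework-free stand-in for the forecasting step, not Tableau's actual forecasting method, and the series values are invented for illustration.

```python
def moving_average_forecast(series, window=3, steps=2):
    """Extend a series by `steps` points, each the mean of the last `window`.

    Each forecast point is fed back into the history so later steps can
    build on earlier forecasts, as in an extended time series.
    """
    history = list(series)
    out = []
    for _ in range(steps):
        nxt = sum(history[-window:]) / window
        out.append(nxt)
        history.append(nxt)
    return out

# Illustrative hourly booking counts extended two steps into the future.
print(moving_average_forecast([100, 120, 140, 160, 180], window=3, steps=2))
```

A real forecast would instead use a seasonal model fitted to the full year of bookings, since demand varies strongly by hour and season.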
The conclusion that can be drawn from this assignment is that Big Data analysis and data visualization methods can provide data-driven solutions to real-world problems. This assignment is one such example, in which the demand for rented bikes in the city of Seoul is analysed. A publicly accessible dataset was chosen; descriptive and predictive analyses were performed, followed by regression model development. PySpark is used for model development because it can manage data with Big Data characteristics, and the Tableau software is used for data analysis, from which meaningful insights have been extracted. The complete code and the system setup information are provided so that this work can be reproduced and extended in the future.
Ashofteh, A. (2022). Big Data for Credit Risk Analysis: Efficient Machine Learning Models Using PySpark. https://doi.org/
Federer, L., & Joubert, D. (2018). Providing library support for interactive scientific and biomedical visualizations with tableau. Journal of EScience Librarianship, 7(1).
G., D. R. (2020). Real time anomaly detection techniques using PySpark framework. March 2020, 2(1), 20–30.
Mishra, A. (2021). Development of n-variable Regression Model. Journal Of Advanced Research In Applied Mathematics And Statistics, 06(1&2), 1-3.
Wang, B., & Kim, I. (2018). Short-term prediction for bike-sharing service using machine learning. Transportation Research Procedia, 34, 171–178.
Wang, Z. (2019). Regression model for bike-sharing service by using machine learning. Asian Journal of Social Science Studies, 4(4), 16.