Discovering, Analyzing, Visualizing, and Presenting Data

This week's reading assignment is chapters 8 and 9 of the course textbook (EMC Education Services (Eds.). (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing, and Presenting Data. Indianapolis, IN: John Wiley & Sons, Inc.). Text mining may be defined as the process of examining text to extract valuable information. Also known as text data mining, it draws on algorithms from data mining, machine learning, statistics, and natural language processing to extract high-quality, useful information from unstructured formats. Recent years have seen a marked increase in the adoption of text mining for business applications, driven by growing awareness of text mining and the lower price points at which text mining tools are now available. Text analytics can help businesses listen to the right stories by extracting insights from free text written by or about customers, combining it with existing feedback data, and identifying patterns and trends. Manual analysis alone cannot capture this level of insight because of the sheer volume and complexity of the available data.

Author (Last Name, First Name)
Burroughs, Edgar Rice (418)
Carroll, Lewis (1044)
Chesterton, G. K. (Gilbert Keith) (437)
Conrad, Joseph (788)
Dickens, Charles (2326)
Dostoyevsky, Fyodor (899)
Doyle, Arthur Conan (1457)
Dumas, Alexandre (572)
Bierce, Ambrose (382)
Hawthorne, Nathaniel (589)
James, Henry (529)
Defoe, Daniel (11152)
Wharton, Edith (12127)
Kipling, Rudyard (432)
Brontë, Charlotte (392)
Melville, Herman (698)
Poe, Edgar Allan (618)
Pope, Alexander (529)
Russell, Bertrand (431)
Shakespeare, William (987)
Wilde, Oscar (1409)
Stevenson, Robert Louis (1023)
Swift, Jonathan (804)
Tolstoy, Leo, graf (666)
Twain, Mark (1702)
Verne, Jules (477)
Wells, H. G. (Herbert George) (795)
https://www.gutenberg.org/browse/scores/top
Note: Students, select your author from the list. Good luck!
School of Computer &
Information Sciences
ITS 836 Data Science and Big Data Analytics
Text Analysis Part 1
Submission Guidelines
• All students are assigned an author from:
– https://www.gutenberg.org/browse/scores/top
– ITS836-46_Week12 Authors for Text Analysis.xlsx
• Submit the document with your name, id
– Clearly mark the question # for all answers text and figures
– Submit the code in a separate text file marking Question #
Week 12 Homework Text Analysis Part 1
https://www.tidytextmining.com/tidytext.html
“Tidy Text Format” – Question 1
a) You have been assigned “your author”* in:

ITS836-46_Week12 Authors for Text Analysis.xlsx
b) Identify books for the author: www.gutenberg.org
http://www.gutenberg.org/browse/authors/a
c) Compare word frequencies as in Figure 1.3
of Jane Austen, the Brontë sisters, and “your author”
*You can choose another author: https://www.gutenberg.org/browse/scores/top
Make sure it is not on the list for anyone else
https://www.tidytextmining.com/sentiment.html
“Sentiment analysis with tidy data” Question 2
a) Analyze the sentiment through multiple works (minimum 2) belonging to “your author” as in Fig 2.2
b) Compare the three sentiment lexicons as in Fig 2.3:
– AFINN from Finn Årup Nielsen,
– bing from Bing Liu and collaborators, and
– nrc from Saif Mohammad and Peter Turney.
c) Plot words that contribute to positive and negative sentiment for your author’s works as in Fig 2.4
d) Create a word cloud of the most common words for your author’s works as in Fig 2.5
https://www.tidytextmining.com/tfidf.html
“Analyzing word and document frequency: tf-idf”
Question 3
a) Analyze TF distribution in your author’s works as in Fig 3.1
b) Plot Zipf’s law for your author’s works as in Fig 3.2
c) Plot highest tf-idf words in each of your author’s works as in Fig 3.4
https://www.tidytextmining.com/ngrams.html
“n-grams and correlations” Question 4
a) Plot the bigrams with the highest tf-idf from each of your author’s works as in Fig 4.1
b) Plot the words preceded by ‘not’ that had the greatest contribution to sentiment scores, in either a positive or negative direction, in your author’s works as in Fig 4.2
c) Plot common bigrams in your author’s works as in Fig 4.4
https://www.tidytextmining.com/dtm.html
“to and from non-tidy formats” Question 5
• As explained in Section 5.2, cast the tidy text data for
one of your author’s works into a matrix
https://www.tidytextmining.com/topicmodeling.html
“Topic modeling” Question 6
a) For your author’s works create a topic model with
the terms that are most common within each topic
using the LDA method as in Fig 6.4
Advanced Analytics – Theory and Methods
Copyright © 2014 EMC Corporation. All Rights Reserved.
Module 4: Analytics Theory/Methods
Time Series Analysis
During this lesson the following topics are covered:
• Time Series Analysis and its applications in forecasting
• ARMA and ARIMA Models
• Implementing the Box-Jenkins Methodology using R
• Reasons to Choose (+) and Cautions (-) with Time Series Analysis
The topics covered in this lesson are listed. ARIMA and Box-Jenkins methodology are explained
in the following slides.
Time Series Analysis
• Time Series: Ordered sequence of equally spaced values over time
• Time Series Analysis: Accounts for the internal structure of observations taken over time
  – Trend
  – Seasonality
  – Cycles
  – Random
• Goals
  – To identify the internal structure of the time series
  – To forecast future events
    – Example: Based on sales history, what will next December sales be?
• Method: Box-Jenkins (ARMA)
Businesses perform sales forecasting to look ahead in order to plan their investments, launch
new products, decide when to close or withdraw products, etc. The sales forecasting process is
a critical one for most businesses. Part of the sales forecasting process is to examine the past.
How well did we do in the last few months or what were our sales in the same time period for
the last few years? Time Series Analysis provides a scientific methodology for sales forecasting.
Time Series Analysis is the analysis of sequential data across equally spaced units of time. Time
Series is a basic research methodology in which data for one or more variables are collected for
many observations at different time periods. The main objectives in Time Series Analysis are:
• To understand the underlying structure of the time series by breaking it down into its
components.
• To fit a mathematical model and then use it to forecast the future.
The time periods are usually regularly spaced, and the observations may be either univariate or
multivariate. A univariate time series measures only one variable over time, whereas a
multivariate time series measures multiple variables simultaneously. The internal structure of
the data may show a trend, seasonality, or cycles.
Box-Jenkins Method: What is it?
• Models historical behavior to forecast the future
• Applies ARMA (Autoregressive Moving Average) models
  – Input: Time series
  – Accounting for trend and seasonality components
  – Output: Expected future value of the time series
The Box-Jenkins methodology, developed by Professors G.E.P. Box and G.M. Jenkins, enables
forecasting from time series data with both high accuracy and low computational requirements.
The technique may be applied to quickly produce forecasts that are as uncomplicated in form
as the simple smoothing methods, or that involve a number of economic variables. In either
case, it makes efficient use of the other predictive information contained in the data, and it
aims for the highest forecasting accuracy attainable from the variables on which the forecast
is based.
The input to the model is the trend- and seasonality-adjusted time series, and the output is the
expected future value of the time series.
The Box-Jenkins methodology applies autoregressive moving average (ARMA) models to find the
best fit of a time series to its own past values, in order to make forecasts.
Use Cases
Forecast:
• Next month’s sales
• Tomorrow’s stock price
• Hourly power demand
The key application of Time Series Analysis is in forecasting. Economic and business planning,
inventory and production control of industrial processes are some of the key applications in
which time series analysis is deployed.
Time Series data provide useful information about the physical, biological, social or economic
systems generating the time series, such as:
• Economics/Finance: share prices, profits, imports, exports, stock exchange indices
• Sociology: school enrollments, unemployment, crime rate
• Environment: amount of pollutants, such as suspended particulate matter (SPM), in the environment
• Meteorology: rainfall, temperature, wind speed
• Epidemiology: number of SARS cases over time
• Medicine: blood pressure measurements over time for evaluating drugs to control hypertension
Modeling a Time Series
• Let’s model the time series as
  Yt = Tt + St + Rt,  t = 1, …, n.
• Tt: The trend term
  – Air travel steadily increased over the last few years
• St: The seasonal term
  – Air travel fluctuates in a regular pattern over the course of a year
• Rt: The random component
  – To be modeled with ARMA
We present a simple model for the time series with the trend, seasonality and a random
fluctuation. There is sometimes a low frequency cyclic term as well, but we are ignoring that
for simplicity.
Examples of trend and seasonality are also given on the slide.
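The additive model Yt = Tt + St + Rt can be made concrete with a small synthetic series. The sketch below is only an illustration: the function name `synthetic_series`, the straight-line trend, the sine-shaped seasonal term, the Gaussian noise, and every default value are assumptions chosen for the example, not part of the lesson.

```python
import math
import random


def synthetic_series(n, slope=0.5, intercept=10.0, period=12,
                     amplitude=3.0, noise_sd=1.0, seed=0):
    """Build Yt = Tt + St + Rt for t = 1..n and return (Y, T, S, R) as lists."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    t_values = range(1, n + 1)
    trend = [slope * t + intercept for t in t_values]            # Tt: linear trend
    seasonal = [amplitude * math.sin(2 * math.pi * t / period)   # St: repeats every `period` steps
                for t in t_values]
    noise = [rng.gauss(0.0, noise_sd) for _ in t_values]         # Rt: random component
    y = [tr + s + r for tr, s, r in zip(trend, seasonal, noise)]
    return y, trend, seasonal, noise


# Four "years" of monthly data under the assumed parameters.
y, trend, seasonal, noise = synthetic_series(48)
print(len(y), round(sum(seasonal[:12]), 6))
```

Keeping the three components separate makes it easy to check later that a de-trending or seasonal-adjustment step recovers what was put in.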
Stationary Sequences
• Box-Jenkins methodology assumes the random component is a stationary sequence
  – Constant mean
  – Constant variance
  – Autocorrelation does not change over time
    – Constant correlation of a variable with itself at different times
• In practice, to obtain a stationary sequence, the data must be:
  – De-trended
  – Seasonally adjusted
A stationary sequence is a random sequence whose joint probability distribution does not vary
over time. In particular, the mean, variance, and autocorrelations do not change over the
course of the sequence.
To render a sequence stationary, we need to remove the effects of trend and seasonality. The
ARIMA model (fitted with the Box-Jenkins methodology) uses differencing to render the data
stationary.
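As a rough illustration of the "constant mean, constant variance" conditions, one can compare summary statistics of the first and second halves of a series. This split-half comparison is an informal check introduced here for illustration, not a step of the Box-Jenkins methodology; the helper name `half_summary` and both sample series are made up.

```python
from statistics import mean, pvariance


def half_summary(series):
    """Return (first-half, second-half) pairs of mean and variance for a series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return {
        "mean": (mean(first), mean(second)),
        "variance": (pvariance(first), pvariance(second)),
    }


trending = [0.5 * t for t in range(1, 41)]   # mean drifts upward over time
flat = [(-1.0) ** t for t in range(1, 41)]   # alternates around a fixed level

print(half_summary(trending)["mean"])
print(half_summary(flat))
```

A large gap between the two halves suggests the series still carries a trend; similar values in both halves are consistent with, but do not prove, stationarity.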
De-trending
• In this example, we see a linear trend, so we fit a linear model
  – Tt = m·t + b
• The de-trended series is then
  – Y1t = Yt – Tt
• In some cases, we may have to fit a non-linear model
  – Quadratic
  – Exponential
Trend in a time series is a slow, gradual change in some property of the series over the whole
interval under investigation.
De-trending is a pre-processing step that prepares a time series for analysis by methods that
assume stationarity.
A simple linear trend can be removed by subtracting a least-squares-fit straight line. In the
example shown, we fit a linear model and take the difference. The graph shown next is a
de-trended time series.
More complicated trends might require different procedures, such as fitting a non-linear model
(for example, a quadratic or an exponential model).
Use a Linear Trend Model if the first differences are more or less constant:
(y2 – y1) = (y3 – y2) = … = (yn – yn-1).
Use a Quadratic Trend Model if the second differences are more or less constant:
(y3 – y2) – (y2 – y1) = … = (yn – yn-1) – (yn-1 – yn-2).
Use an Exponential Trend Model if the percentage differences are more or less constant:
((y2 – y1) / y1) × 100% = … = ((yn – yn-1) / yn-1) × 100%.
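The linear de-trending step described above (fit Tt = m·t + b by least squares, then subtract it) can be sketched in a few lines. This is a minimal illustration that assumes observations at t = 1, …, n; the function names `fit_line` and `detrend` are hypothetical.

```python
def fit_line(y):
    """Least-squares slope m and intercept b for Tt = m*t + b, with t = 1..n."""
    n = len(y)
    t = list(range(1, n + 1))
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    m = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
         / sum((ti - t_bar) ** 2 for ti in t))
    b = y_bar - m * t_bar
    return m, b


def detrend(y):
    """Return the de-trended series Y1t = Yt - Tt using the fitted line."""
    m, b = fit_line(y)
    return [yi - (m * ti + b) for ti, yi in enumerate(y, start=1)]


# A noiseless line is removed completely, leaving residuals of (numerically) zero.
series = [2.0 * t + 5.0 for t in range(1, 11)]
print(fit_line(series))
print(max(abs(v) for v in detrend(series)))
```

For a quadratic or exponential trend, the same subtract-the-fit idea applies, but the fitted curve would need to change accordingly.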
Seasonal Adjustment
• Plotting the de-trended series identifies seasons
  – For CO2 concentration, we can model the period as being a year, with variation at the month level
• Simple ad-hoc adjustment: take several years of data, calculate the average value for each month, and subtract that from Y1t
  Y2t = Y1t – St
Unlike the trend and cyclical components, seasonal components theoretically occur with
similar magnitude during the same time period each year.
The holiday sales spike is an example of seasonality. The seasonal component typically makes a
series harder to interpret, so removing it makes it easier to focus on the other components.
A simple adjustment for seasonality is to take several years of data, calculate the average
value for each month, and subtract that average from the actual values.
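The ad-hoc seasonal adjustment just described (average each calendar month across years, then subtract) might look like the following sketch. It assumes the de-trended series starts at the first position of a season and covers at least one full period; the function names are hypothetical.

```python
def seasonal_means(y1, period=12):
    """Average value at each position within the period (e.g. each calendar month)."""
    sums = [0.0] * period
    counts = [0] * period
    for i, value in enumerate(y1):
        sums[i % period] += value
        counts[i % period] += 1
    # Assumes len(y1) >= period, so every position has at least one observation.
    return [s / c for s, c in zip(sums, counts)]


def seasonally_adjust(y1, period=12):
    """Return Y2t = Y1t - St, where St is the mean for that position in the period."""
    s = seasonal_means(y1, period)
    return [value - s[i % period] for i, value in enumerate(y1)]


# Three repetitions of a four-step pattern: the pattern is the seasonal term,
# so the adjusted series is all zeros.
pattern = [1.0, -1.0, 2.0, -2.0]
print(seasonal_means(pattern * 3, period=4))
print(seasonally_adjust(pattern * 3, period=4))
```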
ARMA(p, q) Model
• The simplest Box-Jenkins Model
  – Yt is de-trended and seasonally adjusted
• Combination of two process models
  – Autoregressive: Yt is a linear combination of its last p values
  – Moving average: Yt is a constant value plus the effects of a dampened white noise process over the last q time values (lags)
Autoregressive (AR) models can be coupled with moving average (MA) models to form a
general and useful class of time series models called Autoregressive Moving Average (ARMA)
models. This is the simplest Box-Jenkins model.
An AR model predicts Yt as a linear combination of its last p values. An autoregressive model
is simply a linear regression of the current value of the series on one or more prior values of
the same series. Several options are available for fitting autoregressive models, including
standard linear least squares techniques, and the resulting models have a straightforward
interpretation. A time series Yt of this form is called an autoregressive process of order p,
denoted AR(p).
A moving average (MA) model adds to Yt the effects of a dampened white noise process over
the last q steps. It should not be confused with the simple moving average, which is one of the
most basic forecasting methods: moving backwards in time (minus 1, minus 2, minus 3, and so
forth) until we have n data points, sum those points and divide by n, and the result is the
forecast for the next period. This is called a single moving average or simple moving average.
The forecast is simply a constant value projected onto the next time period, and n is the
order of the moving average.
Note: an MA(q) process sums only the last q white noise terms, whereas a random walk
(discrete Brownian motion) accumulates all past noise terms.
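Two of the ideas in these notes are small enough to sketch directly: the simple moving average forecast, and an autoregressive fit by least squares. The AR part below is restricted to order 1 and assumes a zero-mean (already de-trended and seasonally adjusted) series with no intercept; both function names are hypothetical, and this is not a full ARMA estimator.

```python
def sma_forecast(y, n):
    """Forecast the next period as the mean of the last n observations."""
    if not 1 <= n <= len(y):
        raise ValueError("n must be between 1 and len(y)")
    return sum(y[-n:]) / n


def fit_ar1(y):
    """Least-squares estimate of phi in Yt = phi * Yt-1 + noise (zero-mean series assumed)."""
    numerator = sum(current * previous for current, previous in zip(y[1:], y[:-1]))
    denominator = sum(previous * previous for previous in y[:-1])
    return numerator / denominator


print(sma_forecast([10, 12, 14, 16], n=2))   # mean of the last two points
print(fit_ar1([8.0, 4.0, 2.0, 1.0, 0.5]))    # each value is half the previous one
```

In practice an ARMA(p, q) model would be fitted with a statistics package rather than by hand; the point here is only that the AR part is an ordinary regression on lagged values.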
ARIMA(p, d, q) Model
• ARIMA adds a differencing term, d, to the ARMA model
  – Autoregressive Integrated Moving Average
  – Includes the de-trending as part of the model
    – A linear trend can be removed by d=1
    – A quadratic trend by d=2
    – And so on for higher order trends
• The general non-seasonal model is known as ARIMA (p, d, q):
  – p is the number of autoregressive terms
  – d is the number of differences
  – q is the number of moving average terms
ARMA models can be used when the series is weakly stationary; in other words, the series has
a constant variance around a constant mean, and its autocovariance depends only on the lag.
This class of models can be extended to non-stationary series by allowing the differencing of
the data series. These are called Autoregressive Integrated Moving Average (ARIMA) models.
There is a large variety of ARIMA models.
ARIMA: difference Yt d times to “induce stationarity”; d is usually 1 or 2. The “I” stands for
integrated: the outputs of the model are summed up (or “integrated”) to recover Yt.
The general ARIMA (p, d, q) model gives a tremendous variety of patterns in the ACF and PACF,
so it is not practical to state rules for identifying general ARIMA models. In practice, it is seldom
necessary to deal with values of p, d, or q other than 0, 1, or 2. It is remarkable that such a
small range of values for p, d, or q can cover such a large range of practical forecasting
situations.
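The role of d can be seen by differencing simple sequences by hand: one difference flattens a linear trend, two flatten a quadratic one, and a cumulative sum ("integration") undoes a difference. The helpers below are an illustrative sketch with hypothetical names, not the internals of any ARIMA implementation.

```python
def difference(y, d=1):
    """Apply first differencing d times; each pass shortens the series by one."""
    for _ in range(d):
        y = [later - earlier for earlier, later in zip(y, y[1:])]
    return y


def integrate(diffs, first_value):
    """Undo one round of differencing, given the first value of the original series."""
    out = [first_value]
    for step in diffs:
        out.append(out[-1] + step)
    return out


linear = [3 * t + 2 for t in range(1, 8)]
quadratic = [t * t for t in range(1, 8)]

print(difference(linear, d=1))       # constant, so d=1 removes a linear trend
print(difference(quadratic, d=2))    # constant, so d=2 removes a quadratic trend
print(integrate(difference(linear), linear[0]) == linear)
```

Undoing d = 2 would take two calls to `integrate`, each supplied with the first value of the series at that level of differencing.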
ACF & PACF
• Auto Correlation Function (ACF)
  – Correlation of the values of the time series with itself
  – Autocorrelation “carries over”
  – Helps to determine the order, q, of a MA model
    – Where does ACF go to zero?
• Partial Auto Correlation Function (PACF)
  – An autocorrelation calculated after removing the linear dependence of the previous terms
  – Helps to determine the order, p, of an AR model
    – Where does PACF go to zero?
A common assumption in many time series techniques is that the time series is stationary. A
stationary process has the property that the mean, variance and autocorrelation structure do
not change over time.
An ACF plot provides an indication of the stationarity of the data. If the time series is not
stationary, we can often transform it to stationarity with the simple technique of differencing.
It should be noted that the autocorrelation carries over; if Yt is correlated with Yt-1, it is also
correlated with Yt-2 (though to a lesser degree).
PACF – The partial autocorrelation at lag k is the autocorrelation between Yt and Yt-k that is not
accounted for by lags 1 through k-1.
One looks for the point on the plot where the partial autocorrelations for all higher lags are
essentially zero.
We will look into ACF and PACF graphs in the next Lab.
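For readers who want to see what the ACF and PACF plots are built from, the sketch below computes the sample autocorrelations and then derives partial autocorrelations from them with the Durbin-Levinson recursion. That recursion is one standard way to obtain the PACF and is an addition here, not something these notes prescribe; the function names and the sample data are made up.

```python
def acf(y, max_lag):
    """Sample autocorrelations r_0..r_max_lag (r_0 is always 1)."""
    n = len(y)
    y_bar = sum(y) / n
    dev = [value - y_bar for value in y]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]


def pacf(y, max_lag):
    """Partial autocorrelations at lags 1..max_lag via the Durbin-Levinson recursion."""
    r = acf(y, max_lag)
    phi_prev = []   # AR coefficients of the order k-1 fit
    out = []
    for k in range(1, max_lag + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new last one.
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
    return out


data = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 8.0]
print(acf(data, 3))
print(pacf(data, 3))
```

At lag 1 the partial autocorrelation equals the ordinary autocorrelation, since there are no intermediate lags to account for.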
Model Selection
• Based on the data, the Data Scientist selects p, d and q
  – An “art form” that requires domain knowledge, modeling experience, and a few iterations
  – Use a simple model when possible
    – AR model (q = 0)
    – MA model (p = 0)
• Multiple models need to be built and compared
  – Using ACF and PACF
Identification of the most appropriate model is the most important part of the process, where
it becomes as much ‘art’ as ‘science’.
The first step is to determine whether the time series is stationary. This can be done with a
correlogram, that is, plots of the ACF and PACF. If the time series is not stationary, it needs
to be first-differenced (and it may need to be differenced again to induce stationarity).
The next stage is to determine p and q in the ARIMA (p, d, q) model (d refers to how many
times the data needs to be differenced to produce a stationary series).
In the diagnostic stage, we assess the model’s adequacy by checking whether the model
assumptions are satisfied. If the model is inadequate, this stage provides information that
helps us re-identify the model. We also check the residuals for normality, constant variance,
and independence.
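One common way to check the independence of residuals is the Ljung-Box Q statistic, Q = n(n+2) Σ r_k² / (n−k) summed over lags k = 1, …, h. The notes do not name a specific test, so treat the choice of Ljung-Box as an assumption of this sketch; the function name is hypothetical.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag for a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [value - mean for value in residuals]
    c0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0  # lag-k autocorrelation
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total


# Strongly alternating "residuals" are far from independent, so Q is large.
alternating = [1.0, -1.0] * 10
print(ljung_box_q(alternating, max_lag=1))
```

Under the hypothesis of independent residuals, Q is approximately chi-square distributed, so a value well above the relevant critical value (about 3.84 for one degree of freedom at the 5% level) suggests the model has left structure unexplained. When the residuals come from a fitted ARMA model, the degrees of freedom are usually reduced by the number of fitted parameters.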
Time Series Analysis – Reasons to Choose (+) & Cautions (-)
Reasons to Choose (+)
Minimal data collection
Only have to collect the series
itself
Do not need to input drivers
Designed to handle the inherent
autocorrelatio…
