Coming from an economics and statistics background, I can never have too many forecasting packages. Besides the earlier forecast package, this new prophet package turns out to be quite powerful as well. The package was covered by Synced (机器之心), and I spent a few hours over the weekend trying it out. For a basic introduction, see Synced's article 《Industry | Facebook open-sources large-scale forecasting tool Prophet: supports Python and R》.
This post isn't a theoretical analysis; it's a hands-on case study. I keep the prose to a minimum and try to be simple and direct!
Official website :https://facebookincubator.github.io/prophet/
github website :https://github.com/facebookincubator/prophet
The paper: 《Forecasting at Scale》, Sean J. Taylor and Benjamin Letham
Case data download :http://download.csdn.net/detail/sinat_26917383/9764537
At the end I'll add some of the theory Facebook published behind the package.
After playing with it, here are the features that impressed me most:
prophet seems to be exactly what I've been looking for: the best forecasting tool I've seen so far for analyzing marketing campaigns, and a blessing for web analytics and advertising analysis. If you try the methods in this post, please share whatever experience you gain with them~
# install.packages('prophet')
library(prophet)
library(dplyr)
history <- data.frame(ds = seq(as.Date('2015-01-01'), as.Date('2016-01-01'), by = 'd'),
                      y = sin(1:366/200) + rnorm(366)/10)
m <- prophet(history, growth = "linear")
Note that when building the data, it is best to name your two columns ds (the time column) and y (which must be numeric). This example is a single series plus a time column. The data looks like this:
prophet() is the model-fitting step; the resulting object m carries a lot of parameters that are worth studying further.
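As a hedged sketch of a few of those parameters (the argument names below are the ones used later in this post; the values shown are the usual defaults, but check ?prophet for your installed version):
# A minimal sketch of some prophet() arguments worth knowing about
m <- prophet(history,
             growth = "linear",              # "linear" or "logistic" trend
             n.changepoints = 25,            # number of potential trend changepoints
             changepoint.prior.scale = 0.05, # trend flexibility
             holidays = NULL,                # optional data.frame of holiday dates
             interval.width = 0.80)          # width of the uncertainty intervals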
# Generate future dates
future <- make_future_dataframe(m, periods = 365)
tail(future)
# forecast
forecast <- predict(m, future)
tail(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
# Plot the linear-trend forecast
plot(m, forecast)
# Trend decomposition
prophet_plot_components(m, forecast)
make_future_dataframe() is an interesting date-generating function: the original ds ran from 2015-01-01 to 2016-01-01, and this call produces a sequence from 2015-01-01 to 2016-12-30, i.e. one extra year to forecast over. Whether you forecast by day or by week is controlled flexibly through the freq parameter.
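For example, a hedged sketch of a weekly horizon (assuming your version of the R package accepts freq = 'week', as recent ones do):
# 52 weekly future timestamps instead of 365 daily ones
future_weekly <- make_future_dataframe(m, periods = 52, freq = 'week')
tail(future_weekly)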
In predict()'s output, ds is the time, yhat is the forecast, and yhat_lower / yhat_upper are the bounds of the uncertainty interval.
Here's what the plot looks like:
prophet_plot_components() is the decomposition function: it splits the series into a trend component, a weekly component and a yearly component, which is the default configuration.
What is a logistic trend? If you're not sure, a quick search (Baidu/Google) will help.
# Data generation stage
history <- data.frame(ds = seq(as.Date('2015-01-01'), as.Date('2016-01-01'), by = 'd'),
                      y = sin(1:366/200) + rnorm(366)/10,
                      cap = sin(1:366/200) + rnorm(366)/10 + rep(0.3, 366))
# cap is the growth ceiling: the carrying capacity that y can at most reach
# Model generation
m <- prophet(history, growth = "logistic")
future <- make_future_dataframe(m, periods = 1826)
future$cap <- sin(1:2192/200) + rnorm(2192)/10 + rep(0.3, 2192)
# Prediction stage
fcst <- predict(m, future)
plot(m, fcst)
To fit a logistic trend in prophet, all you need is an extra cap column, the upper bound of y (for example, the maximum market size). If y follows a logistic trend but you give it no ceiling, the forecast easily "hits the roof"; cap makes the prediction less fragile…
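In the more common setting the ceiling is a known constant (say, a market size). A minimal sketch under that assumption (the value 1.2 is made up):
history$cap <- 1.2          # assume a known, constant carrying capacity (made-up value)
m <- prophet(history, growth = "logistic")
future <- make_future_dataframe(m, periods = 365)
future$cap <- 1.2           # the cap must also be supplied for the future dates
fcst <- predict(m, future)
plot(m, fcst)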
Now let's look at a case where the logistic fit fails:
We can also model holiday and post-holiday effects. Let's see how the paper explains the holiday effect (paper linked above):
In other words, the holiday effect h(t) is built from two parts: Z(t), a set of indicator functions, and a parameter vector κ that follows a Normal(0, ν²) distribution. Put differently, each holiday is treated as a normally distributed bump: the event date is the crest, and lower_window / upper_window spread the effect to neighbouring days.
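Roughly, in the paper's notation (my transcription):
h(t) = Z(t)κ,  where  Z(t) = [1(t ∈ D_1), …, 1(t ∈ D_L)]  and  κ ~ Normal(0, ν²)
with D_i the set of dates (event date plus window) belonging to holiday i.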
# Generate the data: the regular series
history <- data.frame(ds = seq(as.Date('2015-01-01'), as.Date('2016-01-01'), by = 'd'),
                      y = sin(1:366/200) + rnorm(366)/10,
                      cap = sin(1:366/200) + rnorm(366)/10 + rep(0.3, 366))
# Generate the data: the holiday dates
library(dplyr)
playoffs <- data_frame(
holiday = 'playoff',
ds = as.Date(c('2008-01-13', '2009-01-03', '2010-01-16',
'2010-01-24', '2010-02-07', '2011-01-08',
'2013-01-12', '2014-01-12', '2014-01-19',
'2014-02-02', '2015-01-11', '2016-01-17',
'2016-01-24', '2016-02-07')),
lower_window = 0,
upper_window = 1
)
superbowls <- data_frame(
holiday = 'superbowl',
ds = as.Date(c('2010-02-07', '2014-02-02', '2016-02-07')),
lower_window = 0,
upper_window = 1
)
holidays <- bind_rows(playoffs, superbowls)
# Forecast
m <- prophet(history, holidays = holidays)
future <- make_future_dataframe(m, periods = 365)
forecast <- predict(m, future)
# Holiday effects
forecast %>%
select(ds, playoff, superbowl) %>%
filter(abs(playoff + superbowl) > 0) %>%
tail(10)
# Trend component
prophet_plot_components(m, forecast);
Two datasets have to be prepared at the data-generation stage: one is the regular series (e.g. traffic), and the other holds the holiday dates.
Here lower_window and upper_window can be understood as extending the holiday in time; New Year's Day and National Day obviously behave differently, so this is very user-friendly. For example, to cover the two days Christmas Eve + Christmas, set (lower_window = -1, upper_window = 1). The window scale is in days, so if your data is weekly/quarterly, settings like -7/+7 are more reasonable. Here is how to set it in Python (with a weekly series):
c3_4 = pd.DataFrame({
    'holiday': 'c1',
    'ds': pd.to_datetime(['2017/2/26',
                          '2017/3/5']),
    'lower_window': -7,
    'upper_window': 7,
})
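For the Christmas Eve + Christmas example mentioned above, a hedged R sketch (the dates are only illustrative):
christmas <- data_frame(
  holiday = 'christmas',
  ds = as.Date(c('2015-12-25', '2016-12-25')),
  lower_window = -1,   # pull in Christmas Eve
  upper_window = 1     # and the following day, matching the (-1, 1) setting above
)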
lower_window and upper_window are the essence of the holiday effect: in most cases, the contrast between the event period and the surrounding -7/+7 window is enough to express most of a holiday's (roughly normal-shaped) effect.
The holiday data looks like this:
holiday ds lower_window upper_window
<chr> <date> <dbl> <dbl>
1 playoff 2008-01-13 0 1
2 playoff 2009-01-03 0 1
3 playoff 2010-01-16 0 1
4 playoff 2010-01-24 0 1
5 playoff 2010-02-07 0 1
At the forecasting stage, remember to switch holidays on via prophet(history, holidays = holidays). Now let's look at the holiday effect:
ds playoff superbowl
1 2015-01-11 0.012300004 0
2 2015-01-12 -0.008805914 0
3 2016-01-17 0.012300004 0
4 2016-01-18 -0.008805914 0
5 2016-01-24 0.012300004 0
6 2016-01-25 -0.008805914 0
7 2016-02-07 0.012300004 0
8 2016-02-08 -0.008805914 0
From the output you can see overlapping dates: when the Super Bowl and a playoff game fall on the same day, the holiday effects accumulate.
You can also see that the playoff-day effect is more pronounced while the Super Bowl barely mattered that day; of course, I made up this data, so for a real effect see xxx.
In the component plot here, besides the trend, weekly and yearly components, there is now an extra holidays component. Did you spot it?
.
In some cases the model overfits the holidays; you can then use the holidays.prior.scale parameter to smooth the holiday effect. (I misread this at first and thought it was about post-holiday effects…)
# Dampen the holiday effect with holidays.prior.scale
m <- prophet(history, holidays = holidays, holidays.prior.scale = 1)
forecast <- predict(m, future)
forecast %>%
select(ds, playoff, superbowl) %>%
filter(abs(playoff + superbowl) > 0) %>%
tail(10)
This is done mainly through holidays.prior.scale, whose default is 10. Because I scrambled the data, the effect doesn't show well here, so I'm pasting the output from the official docs. In the official example, lowering the parameter weakens the Super Bowl effect on the day itself, accounting for how the run-up to the holiday influences that day.
Likewise, beyond the holidays, the seasonal component can be regularised in the same way, via the seasonality.prior.scale parameter (seasonality_prior_scale in Python); see the sketch after the table below.
DS PLAYOFF SUPERBOWL
2190 2014-02-02 1.362312 0.693425
2191 2014-02-03 2.033471 0.542254
2532 2015-01-11 1.362312 0.000000
2533 2015-01-12 2.033471 0.000000
2901 2016-01-17 1.362312 0.000000
2902 2016-01-18 2.033471 0.000000
2908 2016-01-24 1.362312 0.000000
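As promised above, a hedged sketch of dampening the seasonal component the same way (the R argument is seasonality.prior.scale, default 10):
# Regularise the seasonality analogously to the holiday effect
m <- prophet(history, holidays = holidays, seasonality.prior.scale = 1)
forecast <- predict(m, future)
prophet_plot_components(m, forecast)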
From this section on we mainly play with the example datasets. If the data doesn't ship with the R package, it can be downloaded from the case-data link at the top of this post.
A time series is likely to contain changepoints, for example the impact of certain holidays. Prophet detects these changepoints automatically and adjusts for them, but the automatic judgement can fail in two ways: not adjusting for a changepoint at all, or over-adjusting for it. If you know a changepoint really exists, you can also control it through the function's parameters.
Prophet first detects candidate changepoints; in the figure (self-detected by Prophet) the dashed vertical lines mark the detected changepoints. 25 are found by default, and Prophet then behaves like an L1 penalty, "pretending" not to see most of them.
Its own way of detecting changepoints is a bit like inspecting the cut-off and tailing-off of the autocorrelation / partial autocorrelation functions in ARIMA.
Manual intervention via changepoint.prior.scale (changepoint_prior_scale in Python):
df <- read.csv('../examples/example_wp_peyton_manning.csv')
m <- prophet(df, changepoint.prior.scale = 0.5)
future <- make_future_dataframe(m, periods = 366)   # one-year horizon
forecast <- predict(m, future)
plot(m, forecast)
Let's compare changepoint.prior.scale = 0.05 with 0.5:
You can think of changepoint.prior.scale as a flexibility dial: the larger the value, the more the trend reacts to fluctuations and outliers, and so the more volatile the fit, as with 0.5.
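A quick side-by-side sketch of that comparison (0.05 is the default):
# Stiffer vs. more flexible trend
m_default  <- prophet(df, changepoint.prior.scale = 0.05)
m_flexible <- prophet(df, changepoint.prior.scale = 0.5)
plot(m_default,  predict(m_default,  make_future_dataframe(m_default,  periods = 366)))
plot(m_flexible, predict(m_flexible, make_future_dataframe(m_flexible, periods = 366)))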
When you know the data contains a changepoint, and you know when it happened, you can pass it explicitly via the changepoints argument. (No plot posted.)
df <- read.csv('../examples/example_wp_peyton_manning.csv')
m <- prophet(df, changepoints = c(as.Date('2014-01-01')))
future <- make_future_dataframe(m, periods = 366)
forecast <- predict(m, future)
plot(m, forecast)
That heading is pure clickbait, scary enough, haha~ The three approaches in the previous section were all about eliminating changepoints before forecasting.
But in reality changepoints are real, and some of them are meaningful, for example festivals like Double 11 and Double 12. You can't just strip these changepoints out, yet if you don't remove them they distort the forecast. Here Prophet offers another trick: specify, when the model is generated, how strongly the series should respond to such fluctuations (similar to changepoint.prior.scale above, except that here the flexibility is set at the model-generation stage).
At the model-generation stage we can adjust from two angles:
(1) the trend;
(2) the seasonality.
df <- read.csv('../examples/example_wp_peyton_manning.csv')
m <- prophet(df, interval.width = 0.95)
future <- make_future_dataframe(m, periods = 366)
forecast <- predict(m, future)
At the prophet model-generation stage, adding interval.width = 0.95 widens the uncertainty intervals around the whole series, trend included, from the default 80% to 95%.
For a retailer, say, there are bound to be seasonal fluctuations, so we want to keep the seasonal swings and still forecast with them, with uncertainty attached. Seasonal uncertainty is the trickier part: prophet has to run full Bayesian sampling first, via the mcmc.samples parameter, which defaults to 0.
m <- prophet(df, mcmc.samples = 500)
forecast <- predict(m, future)
prophet_plot_components(m, forecast);
Turning on mcmc.samples switches the estimation from MAP to MCMC sampling; training takes much longer, maybe 10 times as long as before. For the final result, see the chart on the official site.
Outliers are not the same as changepoints, and outliers can have a large influence on the forecast.
df <- read.csv('../examples/example_wp_R_outliers1.csv')
df$y <- log(df$y)
m <- prophet(df)
future <- make_future_dataframe(m, periods = 1096)
forecast <- predict(m, future)
plot(m, forecast);
They distort the result badly, and the prediction intervals get inflated several-fold. One of prophet's advantages is that it accepts missing values (NA), so it's perfectly fine to delete these outliers or set them to NA.
# Set the outliers to NA, then forecast
outliers <- (as.Date(df$ds) > as.Date('2010-01-01')
& as.Date(df$ds) < as.Date('2011-01-01'))
df$y[outliers] = NA
m <- prophet(df)
forecast <- predict(m, future)
plot(m, forecast);
Of course, you can also delete a whole block of affected data; in particular the effects of natural or man-made disasters can be lasting, in which case you can drop the entire stretch. Here's what it looks like: a batch of data around June 2015 that is entirely outliers.
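A hedged sketch of dropping that whole stretch (the file name follows the official examples; adjust to your own data):
df <- read.csv('../examples/example_wp_R_outliers2.csv')   # file name assumed from the official examples
df$y <- log(df$y)
# blank out the whole affected stretch around June 2015
outliers2 <- (as.Date(df$ds) > as.Date('2015-06-01')
              & as.Date(df$ds) < as.Date('2015-06-30'))
df$y[outliers2] <- NA
m <- prophet(df)
forecast <- predict(m, make_future_dataframe(m, periods = 1096))
plot(m, forecast)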
As mentioned in the outlier section above, prophet can handle missing values. So we can do the following: if your data is incomplete and intermittent (say you only have 20 days of data per month), you can still forecast with prophet and even get daily results back. In other words:
prophet = forecasting over missing values + interpolation
df <- read.csv('../examples/example_retail_sales.csv')
m <- prophet(df)
future <- make_future_dataframe(m, periods = 3652)
fcst <- predict(m, future)
plot(m, fcst);
The source data looks like this :
ds y
1 1992-01-01 146376
2 1992-02-01 147079
3 1992-03-01 159336
4 1992-04-01 163669
5 1992-05-01 170068
That is, even though you only have monthly data, the forecast above is produced at daily granularity; it works, but the daily forecasts carry a fairly large error. So you can instead set freq in make_future_dataframe, and the forecast becomes monthly:
future <- make_future_dataframe(m, periods = 120, freq = 'm')
fcst <- predict(m, future)
plot(m, fcst)
I tried this on Linux and ran into quite a few installation problems.
pip install fbprophet
The official site says: Make sure compilers (gcc, g++) and Python development tools (python-dev) are installed. If you are using a VM, be aware that you will need at least 2GB of memory to run PyStan.
You also need to install the pystan package beforehand.
Then, importing with from fbprophet import Prophet
threw an error, because the latest version on GitHub doesn't match what the official docs say… what a trap.
It should be: from forecaster import Prophet
Simulating one of the simplest holiday-effect examples:
from forecaster import Prophet   # per the import note above; the official docs say fbprophet

# df: a pandas DataFrame with 'ds' and 'y' columns; holidays: built as in the R example above
m = Prophet(holidays=holidays, holidays_prior_scale=20)
m.fit(df)
future = m.make_future_dataframe(periods=1, freq='w')
forecast = m.predict(future)
forecast
forecast contains all the information; it is a DataFrame holding the predicted y (yhat), the trend component, the seasonal components, the holiday effects, and so on.
The freq argument can be adjusted as you like, and plot_components gives the component decomposition.
m.plot_components(forecast)
Prophet's back end is the probabilistic programming language Stan, which means Prophet can exploit many of the advantages of Bayesian methods, for instance:
a simple, interpretable seasonal structure;
forecasts that come with uncertainty intervals derived from the full posterior distribution, i.e. Prophet provides a data-driven risk assessment.
In the study below, the researchers had Prophet forecast two datasets; because the back end is a probabilistic programming language, readers can see some of the details of how Stan is used.
Prophet uses a general time-series model that suits the data seen at Facebook and has three characteristics: piecewise trends, multiple seasonality, and floating holidays.
Prophet frames time-series forecasting as a curve-fitting exercise, in which the dependent variable is the combined contribution of growth, seasonality, and holidays.
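In the paper's notation this decomposition reads (my transcription):
y(t) = g(t) + s(t) + h(t) + ε_t
where g(t) is the (piecewise linear or logistic) trend, s(t) the seasonality, h(t) the holiday effects, and ε_t the noise term.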
- Growth (growth)
Growth is modeled either with a time-varying logistic growth model, for nonlinear saturating growth, or, for linear growth, with a piecewise trend built from a piecewise-constant growth rate.
Changepoints are modeled with a vector of rate adjustments, one per changepoint, each tied to a specific point in time. The rate adjustments are given a Laplace prior with location parameter 0 (written out after this list).
- Seasonality (periodic seasonality)
Seasonality uses a standard Fourier series; the yearly and weekly seasonalities are approximated with 20 and 6 parameters respectively, and the seasonal component is kept smooth with a normal prior.
- Holidays (holiday)
Holidays are modeled with indicator functions.
Users can adjust the spread parameter to control how much of the historical seasonal variation is projected into the future.
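As referenced under Growth above, in the paper each changepoint j carries a rate adjustment δ_j with a sparse prior (my transcription):
δ_j ~ Laplace(0, τ)
so most adjustments shrink towards zero unless the data calls for them.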