R + Python: Prophet, Facebook's tool for large-scale time series forecasting (code and plots throughout)

Understanding oneself 2020-11-13 10:08:55

As someone who majored in economics and statistics, I can never get enough of forecasting packages. On top of the forecast package I used before, prophet now looks very powerful too. I learned about it from a report by Heart of the Machine (Jiqizhixin) and spent a few hours over the weekend trying it out. For a basic introduction, see their article 《Industry | Facebook open-sources Prophet, a large-scale forecasting tool supporting Python and R》.

I am not fond of theoretical analysis, so this post is case-driven and light on words. I try to keep it simple and direct.

Official website: https://facebookincubator.github.io/prophet/
GitHub repository: https://github.com/facebookincubator/prophet
Paper: 《Forecasting at Scale》, Sean J. Taylor and Benjamin Letham
Case data download: http://download.csdn.net/detail/sinat_26917383/9764537

At the end I add some of the theory behind Facebook's approach.

After trying it out, these are the features I like most:

  • 1. Large-scale, fine-grained data. This does not mean a huge volume of data, but that the time granularity can be very fine. The forecasting exercises we did at school mostly used yearly or monthly granularity, while this package handles daily or hourly data; see the cases below. The forecasting speed, however,
    can be summed up as: slow!
  • 2. Trend forecasting + trend decomposition, the most eye-catching module.
    Two kinds of trend can be fitted: linear and logistic. The decomposition has several components: trend, weekly, yearly, seasonal, and holidays, and you can also see the effects during and after a holiday.
  • 3. Changepoint detection + adjustment. There are several ways to resist and tune changepoints.
  • 4. Outlier detection along the time dimension. Changepoints and outliers are similar, but not the same.
  • 5. Handling missing data. You may be missing some time slices. The usual approach is to interpolate first and then forecast (some models do not allow gaps). Here the missing values are taken into account while forecasting. Being able to handle missing data directly is great.

prophet may be what I have been looking for: the best forecasting tool I have seen so far for analyzing marketing campaigns, and good news for website analytics and ad campaign analysis. If you try the methods in this post and learn something from using them, please share as much as you can.

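 # install.packages('prophet')
 library(prophet)
 library(dplyr)  # provides %>%, select, filter, bind_rows and data_frame used below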


One: trend forecasting + trend decomposition

1. Case 1: linear trend + trend decomposition

  • Data generation + modeling stage
history <- data.frame(ds = seq(as.Date('2015-01-01'), as.Date('2016-01-01'), by = 'd'),
y = sin(1:366/200) + rnorm(366)/10)
m <- prophet(history,growth = "linear")

When generating the data, it is best to name the two columns ds (the time column) and y (which must be numeric). This case is a single series plus a time column. The data looks like this:
[Figure: preview of the data]
prophet() is the model-fitting step. The fitted object m contains many parameters, which I leave for later study.

  • Forecasting stage
# Build the data frame of dates to forecast
future <- make_future_dataframe(m, periods = 365)
# Forecast
forecast <- predict(m, future)
tail(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
# Plot the forecast
plot(m, forecast)
# Trend decomposition
prophet_plot_components(m, forecast)

make_future_dataframe: an interesting date-generating function. The original ds ran from 2015-01-01 to 2016-01-01; the function returns those dates plus 365 more days, reaching the end of December 2016, so that there is a year to forecast. The freq parameter controls whether the new dates step by day, by week, and so on.
predict: the forecasting step. In the result, ds is the time, yhat is the forecast, and yhat_lower and yhat_upper bound the uncertainty interval.
Here is the plot:
[Figure: forecast plot]
prophet_plot_components is the trend decomposition function. It splits the forecast into trend, weekly, and yearly components, which is the default configuration.
[Figure: component plots]
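
To make the date arithmetic concrete, here is a small plain-Python sketch of what a "future dates" helper does. It is not prophet's implementation; the name make_future_dates and its parameters are made up for illustration, and only the standard library is used.

```python
from datetime import date, timedelta

def make_future_dates(last_date, periods, step_days=1, history=()):
    """Return the history dates followed by `periods` new dates after last_date."""
    future = [last_date + timedelta(days=step_days * i) for i in range(1, periods + 1)]
    return list(history) + future

# History: daily dates from 2015-01-01 to 2016-01-01 (366 rows, as in the example).
start = date(2015, 1, 1)
history = [start + timedelta(days=i) for i in range(366)]

# 365 extra daily dates, history included (analogous to periods = 365).
daily = make_future_dates(history[-1], periods=365, history=history)
# 52 extra weekly dates, history not included (analogous to a weekly freq).
weekly = make_future_dates(history[-1], periods=52, step_days=7)

print(len(daily), daily[-1])   # total rows and last forecast date
print(weekly[0], weekly[-1])   # first and last weekly date
```

The history plus 365 daily steps gives 731 rows in total, which is why the forecast data frame is roughly twice as long as the input.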


2. Case 2: logistic trend + trend decomposition

What is logistic growth? If you are not familiar with it, please look it up.

# Data generation stage
history <- data.frame(ds = seq(as.Date('2015-01-01'), as.Date('2016-01-01'), by = 'd'),
                      y = sin(1:366/200) + rnorm(366)/10,
                      cap = sin(1:366/200) + rnorm(366)/10 + rep(0.3, 366))
# cap is the carrying capacity: the ceiling that y can grow toward
# Model fitting
m <- prophet(history, growth = "logistic")
future <- make_future_dataframe(m, periods = 1826)
# The future data frame also needs a cap column (366 + 1826 = 2192 rows)
future$cap <- sin(1:2192/200) + rnorm(2192)/10 + rep(0.3, 2192)
# Forecasting stage
fcst <- predict(m, future)
plot(m, fcst)

To fit a logistic trend in prophet you only need one extra variable, cap, which is the upper limit of y (for example, the total market size). If y follows a logistic trend and no bound is given, the forecast can easily overshoot the ceiling, so cap makes the forecast less "fragile".
Let's look at a failed logistic fit:
[Figure: a failed logistic fit]
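
For readers who skipped the lookup, the role of cap can be seen in a few lines of plain Python. This is only the basic logistic curve with a fixed capacity, not prophet's fitted model; the values of cap, k and m are assumptions chosen for illustration.

```python
import math

def logistic_trend(t, cap, k, m):
    """Logistic growth: approaches `cap` as t grows and equals cap / 2 at t == m."""
    return cap / (1.0 + math.exp(-k * (t - m)))

cap = 10.0            # assumed carrying capacity (e.g. total market size)
k, m = 0.05, 100.0    # assumed growth rate and midpoint

times = list(range(0, 400, 50))
values = [logistic_trend(t, cap, k, m) for t in times]
for t, v in zip(times, values):
    print(t, round(v, 3))
```

However far out you forecast, the curve keeps rising but never crosses cap, which is the behaviour the cap column asks for.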


Two: holiday effects

We can examine the effects during and after a holiday. Let's see how the paper explains the holiday effect (see the paper listed above):
[Figure: the holiday-effect formula from the paper]

In other words, the holiday effect function h(t) has two parts: Z(t), a set of indicator functions, and the parameter κ, which follows a normal prior with mean 0 and scale ν. Loosely speaking, you can think of a holiday as a bump: the event date is the peak, and lower_window and upper_window spread it over the surrounding days.
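
Written out, the holiday component described in the Taylor and Letham paper takes the form below, where D_i is the set of past and future dates of holiday i and L is the number of holidays. The notation is given as I recall it from the paper and should be checked against the original.

```latex
Z(t) = [\mathbf{1}(t \in D_1), \ldots, \mathbf{1}(t \in D_L)], \qquad
h(t) = Z(t)\,\kappa, \qquad
\kappa \sim \mathrm{Normal}(0, \nu^2)
```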

1. The holiday effect

# Generate the regular data
history <- data.frame(ds = seq(as.Date('2015-01-01'), as.Date('2016-01-01'), by = 'd'),
                      y = sin(1:366/200) + rnorm(366)/10,
                      cap = sin(1:366/200) + rnorm(366)/10 + rep(0.3, 366))
# Generate the holiday data
playoffs <- data_frame(
  holiday = 'playoff',
  ds = as.Date(c('2008-01-13', '2009-01-03', '2010-01-16',
                 '2010-01-24', '2010-02-07', '2011-01-08',
                 '2013-01-12', '2014-01-12', '2014-01-19',
                 '2014-02-02', '2015-01-11', '2016-01-17',
                 '2016-01-24', '2016-02-07')),
  lower_window = 0,
  upper_window = 1
)
superbowls <- data_frame(
  holiday = 'superbowl',
  ds = as.Date(c('2010-02-07', '2014-02-02', '2016-02-07')),
  lower_window = 0,
  upper_window = 1
)
holidays <- bind_rows(playoffs, superbowls)
# Fit and forecast (future is the data frame built in case 2 above)
m <- prophet(history, holidays = holidays)
forecast <- predict(m, future)
# Holiday effects on the affected days
forecast %>%
  select(ds, playoff, superbowl) %>%
  filter(abs(playoff + superbowl) > 0) %>%
  tail(10)
# Component plot
prophet_plot_components(m, forecast)

Two data sets are needed at the data generation stage: the regular data (such as traffic) and the holiday dates.
lower_window and upper_window can be understood as extending the holiday around its date. National Day and New Year's Day certainly differ in length, so this is very user-friendly. For example, to cover the day before Christmas through the day after, set lower_window = -1, upper_window = 1. The windows are measured in days, so if your data is weekly or coarser, -7/+7 is more reasonable. Here is how to set it in Python (for a weekly series):

import pandas as pd

c3_4 = pd.DataFrame({
    'holiday': 'c1',
    'ds': pd.to_datetime(['2017/2/26']),  # the date list is cut off in the original; add further dates as needed
    'lower_window': -7,
    'upper_window': 7,
})

lower_window and upper_window are the essence of the holiday effect. In general, the days within -7/+7 behave differently from the event days themselves, and the windows are what let the model express the bump-shaped effect around a holiday.
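
The window logic can be sketched in plain Python. As far as I understand, each day inside a holiday's window gets its own indicator, so the holiday itself and the day after can carry different effects. The function expand_holiday and the label format are hypothetical, not prophet's internal naming.

```python
from datetime import date, timedelta

def expand_holiday(name, days, lower_window, upper_window):
    """Map each affected date to feature labels such as 'playoff_+1'."""
    features = {}
    for day in days:
        for offset in range(lower_window, upper_window + 1):
            label = f"{name}_{offset:+d}"
            features.setdefault(day + timedelta(days=offset), []).append(label)
    return features

playoffs = expand_holiday("playoff", [date(2016, 1, 17), date(2016, 2, 7)], 0, 1)
superbowl = expand_holiday("superbowl", [date(2016, 2, 7)], 0, 1)

# Merge: a date hit by several holidays carries several indicators,
# so their effects add up on that date.
merged = {}
for table in (playoffs, superbowl):
    for day, labels in table.items():
        merged.setdefault(day, []).extend(labels)

for day in sorted(merged):
    print(day, merged[day])
```

With lower_window = 0 and upper_window = 1 every holiday date touches two days, and 2016-02-07 is covered by both holidays.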
The data looks like this:

  holiday ds         lower_window upper_window
  <chr>   <date>            <dbl>        <dbl>
1 playoff 2008-01-13            0            1
2 playoff 2009-01-03            0            1
3 playoff 2010-01-16            0            1
4 playoff 2010-01-24            0            1
5 playoff 2010-02-07            0            1

At the forecasting stage, remember to pass holidays, as in prophet(history, holidays = holidays). Now let's look at the holiday effect:

          ds      playoff superbowl
1 2015-01-11  0.012300004         0
2 2015-01-12 -0.008805914         0
3 2016-01-17  0.012300004         0
4 2016-01-18 -0.008805914         0
5 2016-01-24  0.012300004         0
6 2016-01-25 -0.008805914         0
7 2016-02-07  0.012300004         0
8 2016-02-08 -0.008805914         0

In the data there are overlapping dates: when the Super Bowl and a playoff game fall on the same day, the holiday effects add up.
You can see that the effect on playoff days is more visible, while the Super Bowl shows almost no effect on the day. Of course, I made up the data; for a real effect, see xxx.
In the decomposition there is now one more panel for holidays, besides trend, weekly, and yearly. Did you spot it?

[Figure: component plots including the holidays panel]


2. Tuning the holiday effect (prior scale for holidays and seasonality)

In some cases the holidays are overfit. You can then use the holidays.prior.scale parameter to smooth them. (I did not know this at first and thought it was about the post-holiday effect...)

# Smooth the holiday effect with holidays.prior.scale
m <- prophet(history, holidays = holidays, holidays.prior.scale = 1)
forecast <- predict(m, future)
forecast %>%
  select(ds, playoff, superbowl) %>%
  filter(abs(playoff + superbowl) > 0) %>%
  tail(10)

This is done mainly through holidays.prior.scale, whose default is 10. Because my data is made up, the effect does not show here, so I paste the data from the official site instead. In the official example, the adjustment weakens the Super Bowl effect on that night, taking into account the influence of the days before the holiday.
Besides holidays, there is also a seasonal effect, which is tuned through the parameter seasonality_prior_scale (seasonality.prior.scale in R).

             ds  playoff superbowl
2190 2014-02-02 1.362312  0.693425
2191 2014-02-03 2.033471  0.542254
2532 2015-01-11 1.362312  0.000000
2533 2015-01-12 2.033471  0.000000
2901 2016-01-17 1.362312  0.000000
2902 2016-01-18 2.033471  0.000000
2908 2016-01-24 1.362312  0.000000
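
Why does a smaller prior scale dampen the holiday effect? A simplified analogy, not prophet's actual fitting procedure: if an effect has a Normal(0, prior_scale^2) prior and is observed a few times with Gaussian noise, its posterior mean is the sample mean shrunk toward zero. The function name, the noise level and the numbers below are assumptions for illustration.

```python
def shrunk_effect(observed, noise_sd, prior_scale):
    """Posterior mean of an effect with a Normal(0, prior_scale^2) prior."""
    n = len(observed)
    mean = sum(observed) / n
    # Weight on the data: close to 1 for a wide prior, close to 0 for a tight one.
    weight = n * prior_scale**2 / (n * prior_scale**2 + noise_sd**2)
    return weight * mean

lifts = [1.2, 0.9, 1.5]   # made-up holiday-day residuals
effects = {scale: shrunk_effect(lifts, noise_sd=1.0, prior_scale=scale)
           for scale in (10.0, 1.0, 0.1)}
for scale, effect in effects.items():
    print(scale, round(effect, 4))
```

A scale of 10 leaves the estimate almost at the raw mean, while 0.1 pulls it most of the way to zero.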


Three: changepoint tuning, gaps, and outliers

From this section on we mainly use the data from the official examples. If the example data is not included in your R package, it can be downloaded from here.


1. Prophet: automatic changepoint detection

A time series is likely to contain changepoints, for example from the impact of certain holidays. Prophet detects them automatically and adjusts accordingly, but automatic judgment can go wrong in two ways: a changepoint is not adjusted for, or it is over-adjusted. If there really is a changepoint, you can also tune this through the function's parameters.

Prophet detects changepoints by itself. In the figure below, the dashed vertical lines mark the changepoints it found, 25 in total. Prophet then applies something like L1 regularization, so it effectively "pretends not to see" most of them.
[Figure: detected changepoints shown as dashed vertical lines]
Its own way of inspecting the changepoints looks a bit like reading the cut-off and tailing patterns in ARIMA autocorrelation and partial autocorrelation plots:
[Figure]
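
A trend with changepoints can be sketched as a line whose slope changes at given times while the curve stays continuous. The sketch below follows the piecewise linear form described in the paper as I understand it; the changepoint times and slope changes are made-up values, and the function name is hypothetical.

```python
def piecewise_linear_trend(t, k, m, changepoints, deltas):
    """Continuous trend whose slope changes by deltas[j] at changepoints[j]."""
    rate, offset = k, m
    for s, delta in zip(changepoints, deltas):
        if t >= s:
            rate += delta
            offset -= s * delta   # keeps the line continuous at s
    return rate * t + offset

changepoints = [10.0, 20.0]
deltas = [0.5, -1.0]   # assumed slope changes, for illustration only
trend = [piecewise_linear_trend(t, k=1.0, m=0.0,
                                changepoints=changepoints, deltas=deltas)
         for t in range(0, 31, 5)]
print(trend)
```

A changepoint whose delta is zero leaves the line unchanged, which is why shrinking most deltas to zero amounts to ignoring most candidate changepoints.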

2. Manual changepoint control: flexibility

Intervene manually through changepoint.prior.scale (changepoint_prior_scale in Python).

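df <- read.csv('../examples/example_wp_peyton_manning.csv')
m <- prophet(df, changepoint.prior.scale = 0.5)
# Rebuild future for this data set; the 365-day horizon is an assumed value
future <- make_future_dataframe(m, periods = 365)
forecast <- predict(m, future)
plot(m, forecast)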

Compare changepoint.prior.scale = 0.05 and 0.5:
[Figure]
[Figure]
You can think of changepoint.prior.scale as a flexibility setting: the larger the value, the more the trend is affected by unusual values, and so the more the fitted trend fluctuates, as with 0.5.
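
The effect of the prior scale can be illustrated with a one-dimensional analogy; this is not prophet's fitting code. Under a Laplace(0, prior_scale) prior and Gaussian noise, the MAP estimate of a single slope change is the raw value soft-thresholded, and a smaller scale means a larger threshold. The noise level and raw values below are assumptions.

```python
import math

def map_rate_change(observed, noise_sd, prior_scale):
    """MAP estimate of one rate change under a Laplace(0, prior_scale) prior.

    Minimises (observed - d)^2 / (2 * noise_sd^2) + |d| / prior_scale,
    which is soft-thresholding with threshold noise_sd^2 / prior_scale.
    """
    threshold = noise_sd**2 / prior_scale
    magnitude = max(abs(observed) - threshold, 0.0)
    return math.copysign(magnitude, observed) if magnitude else 0.0

raw_changes = [0.015, -0.15, 0.6, -0.01]   # made-up raw slope changes
estimates = {scale: [map_rate_change(d, noise_sd=0.1, prior_scale=scale)
                     for d in raw_changes]
             for scale in (0.05, 0.5)}
for scale, values in estimates.items():
    print(scale, [round(v, 3) for v in values])
```

With the tighter prior only the largest change survives, while the looser prior keeps two of them, so the fitted trend is allowed to bend more often.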

3. Manual changepoint control: specifying changepoints

When you know the data has a changepoint and you know when it happened, you can use the changepoints argument. No figure posted for this one.

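df <- read.csv('../examples/example_wp_peyton_manning.csv')
m <- prophet(df, changepoints = c(as.Date('2014-01-01')))
# Rebuild future for this data set; the 365-day horizon is an assumed value
future <- make_future_dataframe(m, periods = 365)
forecast <- predict(m, future)
plot(m, forecast)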


4. Changepoint forecasting

The title sounds scary enough, haha. The previous three subsections were about damping changepoints before forecasting.
But in reality changepoints are real, and some are meaningful, such as shopping festivals like Double 11 and Double 12. You cannot simply remove them, yet leaving them in affects the forecast. Prophet offers another trick here: when generating the model you can set how much the series is allowed to be affected by them (similar to changepoint_prior_scale above, but given as a flexibility value at the model-generation stage).
This can be adjusted from two angles when generating the model:
(1) the trend;
(2) the seasonality.

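  • (1) Trend adaptation
df <- read.csv('../examples/example_wp_peyton_manning.csv')
m <- prophet(df, interval.width = 0.95)
# Rebuild future for this data set; the 365-day horizon is an assumed value
future <- make_future_dataframe(m, periods = 365)
forecast <- predict(m, future)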

Adding interval.width at the model-generation stage sets the width of the uncertainty interval reported around the forecast. A value of 0.95 asks for a 95% interval, so about 5% of points are expected to fall outside it (the default is 0.80).

  • (2) Seasonal adaptation

For manufacturers there are bound to be seasonal fluctuations, so we want to keep the seasonal swings and forecast them. Seasonal adaptation is more troublesome: prophet has to do full Bayesian sampling first, through the mcmc.samples parameter, which defaults to 0.

m <- prophet(df, mcmc.samples = 500)
forecast <- predict(m, future)
prophet_plot_components(m, forecast);

Turning on mcmc.samples switches the fit from MAP estimation to MCMC sampling. Training takes much longer, perhaps 10 times as long as before. The final result, in a figure borrowed from the official site:
[Figure: component plots with seasonal uncertainty, from the official site]


5. Outliers

Outliers are different from changepoints, and they have a large influence on the forecast.

df <- read.csv('../examples/example_wp_R_outliers1.csv')
df$y <- log(df$y)
m <- prophet(df)
future <- make_future_dataframe(m, periods = 1096)
forecast <- predict(m, future)
plot(m, forecast);

[Figure: forecast with the outliers left in]

The impact on the result is large, and the uncertainty interval of the forecast widens many times over. This is where prophet shines: it accepts missing values (NA), so you can either delete the outliers or set them to NA, and either way works.

# Set the outliers to NA, then forecast
outliers <- (as.Date(df$ds) > as.Date('2010-01-01')
& as.Date(df$ds) < as.Date('2011-01-01'))
df$y[outliers] = NA
m <- prophet(df)
forecast <- predict(m, future)
plot(m, forecast);

Of course, you can also delete a whole stretch of affected data, especially when the effects of a natural or man-made disaster are permanent. In the figure below, a batch of data around June 2015 is all outliers.
[Figure: series with a block of outliers around June 2015]
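
The masking step itself is simple. Here is a plain-Python sketch of the same idea as the R snippet above: values whose date falls strictly inside a range are replaced by a missing marker (None here) and the rows are kept. The helper name mask_range and the toy series are assumptions for illustration.

```python
from datetime import date, timedelta

def mask_range(rows, start, end):
    """Return rows with y set to None for dates strictly between start and end."""
    return [(ds, None if start < ds < end else y) for ds, y in rows]

# Toy daily series with made-up values, starting on 2009-12-30.
rows = [(date(2009, 12, 30) + timedelta(days=i), float(i)) for i in range(6)]
masked = mask_range(rows, date(2010, 1, 1), date(2011, 1, 1))
for ds, y in masked:
    print(ds, y)
```

Keeping the rows and blanking only y means the dates still appear in the history, so a forecast is still produced for them.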


Four: handling missing values and gaps in time + forecasting

As mentioned at the end of part three, prophet can handle missing values. So if your data is incomplete and intermittent, for example only 20 days of data in a month, you can still forecast with prophet and get results for every day. The following achieves that:

# prophet = forecasting with missing values + interpolation
df <- read.csv('../examples/example_retail_sales.csv')
m <- prophet(df)
future <- make_future_dataframe(m, periods = 3652)
fcst <- predict(m, future)
plot(m, fcst);

[Figure: daily forecast from monthly data]

The source data looks like this:

          ds      y
1 1992-01-01 146376
2 1992-02-01 147079
3 1992-03-01 159336
4 1992-04-01 163669
5 1992-05-01 170068

That is, you only have monthly data, but the forecast above is daily. It works, but the error of the daily forecast is rather large. So you can set freq in make_future_dataframe so that the forecast is monthly as well:

future <- make_future_dataframe(m, periods = 120, freq = 'm')
fcst <- predict(m, future)
plot(m, fcst)

[Figure: monthly forecast]

Five: time series forecasting with prophet in Python

1. Installation

I worked on Linux and ran into many problems during installation.

pip install fbprophet

The official site says: Make sure compilers (gcc, g++) and Python development tools (python-dev) are installed. If you are using a VM, be aware that you will need at least 2GB of memory to run PyStan.
You also need to install the pystan package beforehand.
Also, when I called from fbprophet import Prophet it raised an error, because the latest version on GitHub did not match the statement in the official docs... what a pit.
In my environment it had to be: from forecaster import Prophet
(Since release 1.0 the package is published as prophet, and the import is from prophet import Prophet.)


2. A practical example

A minimal example simulating the holiday effect:

from forecaster import Prophet   # import path used in my environment, see above
m = Prophet(holidays=holidays, holidays_prior_scale=20)
m.fit(df)   # df: history with columns ds and y; the model must be fitted before predicting
future = m.make_future_dataframe(periods=1, freq='W')
forecast = m.predict(future)

forecast contains all the information as a DataFrame: the predicted y, the trend component, the seasonal components, events, and so on.
You can adjust freq yourself. plot_components does the trend decomposition.

[Figure]

Extension 1: Prophet, Facebook's forecasting tool, and Bayesian inference

What are the advantages of Prophet, Facebook's forecasting tool? Bayesian inference explains them.

Prophet makes its forecasts with a back end written in the probabilistic programming language Stan. This means Prophet can draw on many advantages of Bayesian methods, for instance:

A simple model with an interpretable periodic structure;
Forecasts that include uncertainty intervals derived from the full posterior distribution, so Prophet provides a data-driven risk assessment.
In the study below, the researchers have Prophet forecast two data sets, using the probabilistic programming language on the back end, so readers can see some working details of using Stan.

Prophet uses a general time series model that suits the data seen at Facebook, and it has three features: piecewise trends, multiple periods, and floating holidays.

Prophet turns time series forecasting into a curve-fitting exercise. In this curve, the dependent variable is the combination of growth, periodic, and holiday components.

- Growth (growth)

This part uses a logistic growth model that changes over time, which is nonlinear growth; linear growth is modeled with a simple function whose rate is piecewise constant.
Changepoints are modeled with a vector of rate adjustments, each corresponding to a specific point in time. The rate adjustments are given a Laplace distribution with location parameter 0.
- Periodic seasonality (periodic seasonality)

A standard Fourier series is used. The yearly and weekly seasonalities use 20 and 6 approximation terms respectively, and the seasonal component is smoothed with a normal prior.
- Holidays (holiday)
Modeled with an indicator function.
The user can adjust the spread parameter to model how much of the historical seasonal variation is expected in the future.
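
The Fourier-series seasonality above can be sketched in plain Python. The orders 10 (yearly) and 3 (weekly) are what I understand the paper to use, giving 20 and 6 terms; the period of 365.25 days, the function name and the weights are assumptions for illustration.

```python
import math

def fourier_features(t, period, order):
    """Return sin/cos pairs for harmonics 1..order of `period` at time t (in days)."""
    feats = []
    for n in range(1, order + 1):
        angle = 2.0 * math.pi * n * t / period
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats

yearly = fourier_features(t=45.0, period=365.25, order=10)
weekly = fourier_features(t=45.0, period=7.0, order=3)
print(len(yearly), len(weekly))

# The seasonal value is a weighted sum of the features. The weights are what
# the model fits; here they are arbitrary illustrative numbers.
beta = [0.1] * len(weekly)
seasonal = sum(b * x for b, x in zip(beta, weekly))
print(round(seasonal, 4))
```

Because every term repeats with the period, the weekly features at day 45 and day 52 are identical, which is what makes the component periodic.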


This article was written by [Understanding oneself]. Please include a link to the original when reposting. Thank you.
