I'm writing to give feedback on the draft as it appeared on 24 June 2021. The results are informative and use a promising methodology that seems useful for future evaluations of Wikimedia data. With that said, the reliability of the conclusions is tempered by methodological concerns and by a lack of detail which prevents reproduction.
The paper uses the Prophet model, trained on 69 data points, to predict 5 data points into the future. Comparisons of the Prophet model against other predictive models suggest this is too small a data set for accurate conclusions. Papastefanopoulos, Linardatos, and Kotsiantis (2020) compared the Prophet model to five other models on how well they predicted the number of active COVID-19 infections. They trained these models on 72 data points (more than this paper) and evaluated them on 25 (also more than this paper). The authors found that the Prophet model was among the worst at predicting the data. In similar findings, Papacharalampous and Tyralis (2018) compared three different Prophet models against three other modeling methods, including a naive model, on how well they predicted stream flow rate over time. In the naive model, the prediction at time k+j is simply the last observed value, i.e. ŷ(k+j) = y(k). The authors find that the Prophet model fitted to a limited time span (30 days) performs worse than just using the naive model. They also find that the other two Prophet models (trained on daily data spanning 32 years, roughly 11,688 data points) perform no better than linear, random forest, or naive models, and only begin to perform on par with those models when predicting at a lag greater than 4. Taken together, these results suggest that this paper's data set of 69 observations and 5 predicted data points is insufficient to produce reliable results.
This could be remedied by a number of methodological changes. Other models, such as a random walk, linear regression, linear mixed models, or alternative generalized additive models, could be fit and compared to the Prophet model in terms of predictive accuracy. Additionally, the Prophet model's accuracy could be improved by using a finer time resolution, on the order of weeks or days rather than months. This would increase both the number of observations and the number of points predicted, which previous work suggests improves the model's accuracy.
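To make the comparison concrete, here is a minimal sketch of how one could hold out the last five months and score Prophet against a naive (random-walk) baseline. The file and column names ("ptwiki_monthly_edits.csv", "month", "edits") are hypothetical placeholders, and this is my own sketch rather than the author's pipeline:

```python
import numpy as np
import pandas as pd
from prophet import Prophet  # pip install prophet

# Hypothetical input: one row per month with the total number of edits
df = pd.read_csv("ptwiki_monthly_edits.csv", parse_dates=["month"])
df = df.rename(columns={"month": "ds", "edits": "y"}).sort_values("ds")

train, test = df.iloc[:-5], df.iloc[-5:]  # hold out the last 5 months

# Prophet forecast for the held-out months
m = Prophet()
m.fit(train)
future = m.make_future_dataframe(periods=5, freq="MS")
prophet_pred = m.predict(future)["yhat"].to_numpy()[-5:]

# Naive / random-walk baseline: every future value equals the last observed value
naive_pred = np.repeat(train["y"].iloc[-1], 5)

def mae(pred):
    return np.mean(np.abs(pred - test["y"].to_numpy()))

print("Prophet MAE:", mae(prophet_pred))
print("Naive MAE:  ", mae(naive_pred))
```

A linear regression or random forest could be scored the same way; if Prophet cannot beat the naive baseline on this hold-out, its predictions should not carry the weight the paper places on them.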
I have further concerns about the Prophet model given its documentation and the data here. The Prophet documentation on outliers details the effect that outliers have on the model, most notably on the seasonality estimate: "extreme outliers...mess up the seasonality estimate, so their effect reverberates into the future forever". For this reason, and especially given the limited training data, we must be wary of outliers. As the author states: "The five-year trend is not purely dominated by yearly or monthly seasonality patterns. It indicated some other factors are impacting the edits", which raises concerns about the effect the seasonality parameter has on predictions. If there is no clear pattern and the seasonality parameter is overfit, then the confidence intervals of the predictions will not only be wide, but will widen as the number of predictions increases. As the intervals widen, statistical power goes down, making the article's conclusion that there is "no conclusive evidence" predetermined from the start. To illustrate just how wide the error margins of this model are: the width of the confidence interval on a single prediction is larger than nearly every month-to-month change in the data. I don't have the tabular data so I cannot rank it definitively, but it would certainly be among the 10 largest changes ever seen.
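With the tabular data this would be straightforward to verify. A sketch of the check I have in mind, again with the hypothetical file and column names from above:

```python
import pandas as pd
from prophet import Prophet  # pip install prophet

df = pd.read_csv("ptwiki_monthly_edits.csv", parse_dates=["month"])  # hypothetical
df = df.rename(columns={"month": "ds", "edits": "y"}).sort_values("ds")

m = Prophet()
m.fit(df.iloc[:-5])  # hold out the 5 forecast months
forecast = m.predict(m.make_future_dataframe(periods=5, freq="MS")).iloc[-5:]

interval_width = forecast["yhat_upper"] - forecast["yhat_lower"]
monthly_change = df["y"].diff().abs().dropna()

# Share of all historical month-to-month changes that are smaller than the
# model's average prediction interval width
print((monthly_change < interval_width.mean()).mean())
```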
The documentation suggests this can be remedied by removing the outliers, but this brings other concerns. The effects might also be mitigated with a finer time resolution, as discussed above.
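For completeness, the approach the Prophet documentation describes is to keep the dates in the history but blank out the y values for the outlying months, so they do not feed the trend and seasonality estimates. A sketch; the masked window below is purely illustrative, not a recommendation of which months to drop:

```python
import pandas as pd
from prophet import Prophet  # pip install prophet

df = pd.read_csv("ptwiki_monthly_edits.csv", parse_dates=["month"])  # hypothetical
df = df.rename(columns={"month": "ds", "edits": "y"})

# Illustrative mask over the mid-2020 spike; Prophet tolerates missing y values
# as long as the dates remain in the frame.
spike = (df["ds"] >= "2020-04-01") & (df["ds"] <= "2020-08-01")
df.loc[spike, "y"] = None

Prophet().fit(df)
```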
The author points out that "edits on all wikipedias increased by 13.5%" from October 2020 to January 2021 when compared to the same period in 2019 to 2020 (the drafting is unclear, so I'm not certain of these dates, but I believe the following point stands even if we shift the window slightly). The reason for this increase is quite obvious: the global COVID-19 pandemic, whose worldwide lockdowns beginning in March 2020 left many people with more free time. That fact is important for interpreting the data and modelling. Firstly, the data do not differentiate between established editors and IPs (or new editors). This is important because the increase in the data could have occurred without any increase in IP or new editing. Put another way, the impact of the COVID-19 pandemic confounds the data on editor behavior in a way that the author does not account for. Secondly, this fact suggests that a single treatment is inappropriate, at least without comparison to other models. The author notes that most edits to ptwiki come from Portugal and Brazil, which gives us quite obvious alternative treatments: when did Portugal and Brazil institute COVID-19 lockdowns? Given the author's own data and our a priori knowledge, we would expect those dates to have a similar if not greater impact on editing rates, yet they are not accounted for.
This could be remedied by additional models and comparisons. For example, one could fit a double-treatment model with treatments corresponding to the relevant COVID lockdowns as well as the end of IP editing, and an alternative single-treatment model for the COVID lockdowns alone, without regard to the IP-editing revocation. Comparing the present model to those would allow an assessment of how useful this model is relative to other events we know impact the data.
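One way to sketch this within Prophet itself is with step-function extra regressors, one per treatment. The dates below are placeholders (a lockdown step in March 2020 and the IP-editing cutoff in October 2020); the real analysis should use the documented dates, and the file name is again hypothetical:

```python
import pandas as pd
from prophet import Prophet  # pip install prophet

df = pd.read_csv("ptwiki_monthly_edits.csv", parse_dates=["month"])  # hypothetical
df = df.rename(columns={"month": "ds", "edits": "y"})

# Step indicators for the two candidate treatments (placeholder dates)
df["covid_lockdown"] = (df["ds"] >= "2020-03-01").astype(int)
df["ip_editing_off"] = (df["ds"] >= "2020-10-01").astype(int)

# Double-treatment model: lockdown and IP-editing cutoff
double = Prophet()
double.add_regressor("covid_lockdown")
double.add_regressor("ip_editing_off")
double.fit(df)

# Alternate single-treatment model: lockdown only, ignoring the IP change
single = Prophet()
single.add_regressor("covid_lockdown")
single.fit(df[["ds", "y", "covid_lockdown"]])
```

Note that forecasting with these models requires supplying the step indicators for the future dates as well; comparing their held-out error (as in the earlier sketch) would show how much explanatory work the IP-editing treatment is actually doing.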
The model assumes the treatment occurred instantaneously, which I find dubious for sociological reasons, though I admit that more information on the community's history is needed to assess the impact. The ptwiki vote to end IP editing took place over the course of a month and began in September. While some IPs are simply readers making casual edits, some number are experienced Wikipedians who either choose not to use accounts or who edit while logged out for some reason. These editors may well have been aware of the vote, and perhaps of its likely outcome. That awareness of the vote and the pending denial of IP editing may have had a chilling effect, causing those editors to abandon editing (or forgo simple fixes that would now require logging in) earlier than the actual implementation. We would see this as a sharp decline in September with relatively little change in October. This is consistent with the data: in figures 1 and 2 we see a steep increase in editing starting in April 2020 leading to a peak in August (it should be noted that this correlates with the COVID pandemic), followed by a steep drop-off in September and a less extreme decline going into October and November, after which editing stabilizes. While this is an untested hypothesis, it is supported by the data and points to the potential harm of an analysis that does not show an understanding of the context that produced the data.
This could likely be remedied by testing alternate treatment dates or letting the model impute a treatment date and evaluating how well it agrees with a priori expectations.
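A rough way to do the latter without committing to Prophet is to fit a simple interrupted time-series model (trend plus step) for each candidate month and see which break date the data support best. This is a sketch under my own assumptions: the file and column names are hypothetical and the candidate window (August through December 2020) is illustrative:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("ptwiki_monthly_edits.csv", parse_dates=["month"])  # hypothetical
df = df.sort_values("month").reset_index(drop=True)
df["t"] = range(len(df))

# Candidate treatment dates: first of each month, Aug 2020 through Dec 2020
for date in pd.date_range("2020-08-01", "2020-12-01", freq="MS"):
    step = (df["month"] >= date).astype(int)
    X = sm.add_constant(pd.DataFrame({"t": df["t"], "step": step}))
    fit = sm.OLS(df["edits"], X).fit()
    print(date.date(), "AIC:", round(fit.aic, 1))

# The date with the lowest AIC is the best-supported break point under this
# (deliberately simple) model; it can then be checked against the September
# chilling-effect hypothesis and the October implementation date.
```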
Finally, I want to demonstrate that the problems I raise above have a material impact on the conclusions and are not merely nitpicks. The model predicted the number of edits for 5 months, meaning that it predicted 4 rates of change (i.e. October to November, November to December, etc.). Of those 4, the model predicted the wrong direction in three. Historical data show that January editing is always higher than December, usually by 10k to 25k edits. This showed up in the seasonality parameter (figure 3), suggesting that the model correctly identified this pattern in the training data. Indeed, it learned it so well that it predicted the December-to-January rate of change would be the highest on record. However, January 2021 is the only January on record which did not see a significantly positive rate of change from December. Similarly, December usually sees a drop in editing from November, and while the model predicts that this pattern would continue in 2020, the actual results show an increase in editing from November to December, only the second on record (the other being in 2018, with a smaller magnitude). Third, editing tends to increase from October to November (2017 is an exception), and the model reasonably predicts a (slight) increase in 2020 based on that data, but is again incorrect in direction. While not dispositive, the model predicted four rates of change and got the direction wrong in all but one, suggesting that it is a bad predictor; if we chose the direction at random (i.e. a random walk) we would, on average, perform better than this model.
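For anyone with the tabular data, here is a small helper for rechecking that direction count. The numbers in the example call are toy values to show the usage, not the ptwiki figures:

```python
import numpy as np

def directions_correct(predicted, actual):
    """Count how many consecutive month-over-month changes the forecast
    gets in the right direction (sign of the change)."""
    pred_sign = np.sign(np.diff(predicted))
    true_sign = np.sign(np.diff(actual))
    return int((pred_sign == true_sign).sum()), len(pred_sign)

# Toy values only -- substitute the model's 5 forecasts and the 5 observed totals
hits, total = directions_correct([100, 105, 95, 120, 110], [100, 98, 102, 99, 104])
print(f"{hits} of {total} directions correct")
```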
While the paper uses the actual predictions as a point of comparison, that is incorrect for a number of reasons. For example, the model estimates that the number of edits in February 2021 will fall somewhere between its second-highest all-time value and its lowest all-time value. The model estimates that January 2021 will fall between its second-highest all-time value and its lowest value. The model estimates that December 2020 will fall between its third-highest all-time value and 50k edits below its all-time low. The model predicts November 2020 will fall between its second-highest all-time value and its lowest all-time value. The same goes for October. It does not take a generalized additive model to predict that a data point will likely fall between its all-time high and its all-time low. Even if the confidence intervals were narrower, comparing the values of predictions would still be incorrect.
As the paper rightly points out, the data are auto-correlated at a time lag of one, but the paper seems unaware of why that might be or of its consequences. Auto-correlation of time series data is so common as to be almost guaranteed. The null hypothesis for most time series data is a random walk: Y(k) = Y(k-1) + Normal(0, σ). We assume that the next data point is related to the previous one, and that it is the rate of change which is normally distributed, not the data or the errors. For this reason, time series data are usually transformed to destroy this auto-correlation by differencing: Ŷ(k) = Y(k) - Y(k-1) (≈ the first derivative). This paper does not do that, and it falls victim to a common trap detailed more extensively by Fawcett (2015) and Jones (2015). Two time series will almost always show a correlation because they share a common cause that needs to be controlled for: time. As I showed above, when we control for the correlation caused by time by comparing the directions of change rather than the values of the predictions, the model performed worse than a random walk. Just to drive this point home, I have provided an R script at User:Wugapodes/Correlations between time series that will allow anyone to visualize the increased error rate of uncontrolled time series correlations, using data I estimated from figure 1.
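Separately from that R script, here is a minimal Python sketch of the differencing step, using the same hypothetical CSV as in the earlier sketches; any correlation or accuracy comparison should be computed on the differenced series, not on the raw totals:

```python
import pandas as pd

df = pd.read_csv("ptwiki_monthly_edits.csv", parse_dates=["month"])  # hypothetical
raw = df.sort_values("month")["edits"]
diffed = raw.diff().dropna()  # Ŷ(k) = Y(k) - Y(k-1)

print("lag-1 autocorrelation, raw:        ", round(raw.autocorr(lag=1), 2))
print("lag-1 autocorrelation, differenced:", round(diffed.autocorr(lag=1), 2))
```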
I understand this work is a draft, and I look forward to further improvements. I provide the above review not to dismiss the author or their findings, and I hope it is clear that I would not take the time to be this thorough if I did not value the author's contribution. The role of IP editing is an important issue for wikis and one of personal interest to me, so I am passionate that our research on the subject be rigorous and sound. I appreciate the author's contributions and learned a great deal reading this draft. Despite my criticism of the Prophet model's implementation, I believe it holds promise for evaluating editing behavior and I look forward to using it myself. To that end, I would appreciate the author making their code available, or at least providing a more comprehensive methods section. I look forward to further developments and am willing to expand on anything that isn't clear; just let me know.