User:Dantman/CodeReviewSystem

Ideas for a new git based review system written from scratch.

~~Roan wants any backend written in C++ and the front end to be PHP.~~
- Front end PHP code may reuse some of the classes we use in MediaWiki (WebRequest, PathRouter, etc...) this could become a framework. eg: We might make them {SomeFrameworkPrefix}WebRequest, etc... and then move the generic stuff there while making WebRequest, etc... in MediaWiki extend them.
- It turns out to be more advantageous to write the system in python. Daniel Friesen (Dantman) (talk) 18:17, 27 March 2013 (UTC)
We should use personal public remotes instead of push hacks and auth.
The review system controls it's own repo; No one has the capability to push to this.
The review system can be told to fetch changes from any number of public repos (ie: users' public repos)
Using web hooks (github supports them) and push hooks on a wm hosted repo system the review system will be automatically notified to fetch from a repo (Of course you can also tell it to manually fetch)
- GitHub's default web service hook subscribes to every hook instead of just the push one so it's a little excessive. The services GitHub lists in its service hooks are defined in an open repo. We could technically add our own for Gareth.
- However besides web hooks GitHub supports PubSubHubbub — well actually something based on PubSubHubub, it violates some of the standard's rules but is still useful. Instead of using service hooks this is more useful. This requires no manual configuration or effort on the user's behalf. When a remote is added which we detect as a GitHub repo we can automatically subscribe to push events from GitHub on our own.
After it fetches commits the review system will show you commits that are not yet part of the repo or a review branch. You will then be able to convert a selection of them into a collection of commits for review and they will be put into a review branch. While manual is an option, we may code some conventions into the system which will trigger it to automatically create review branches and collections. Or automatically make suggestions or multiple choice options that the user will confirm to quickly create what is desired.
Sets of commits for review will be review branches instead of changesets.
- Instead of using --amend on anything you'll make new commits and have the review system pull them into the review branch.
- When the review branch is finally ready for the mainline it'll be merged in as-is. No squashing or rebasing creating false histories.
  - ((You may want to squash some of them. For instance when someone detects a typo in the comment added. There's no reason to keep that in the history. ~Platonides))
    - ((You can amend and rebase yourself. Creating forks and having those merged. Just as we do now with pushing a new changeset. This is just noting that the system itself won't force any rebases in ways that rewrite the history and create potential conflicts and make working commits break after being committed. Daniel Friesen (Dantman) (talk) 00:29, 9 April 2012 (UTC)))
- In the event of a merge conflict you'll be told to make a commit merging master into the branch and handle the conflicts yourself. You'll then push that commit to your public repo and have the review system pull it into the review branch. After that it'll be merged into master.
To avoid branch name conflicts and also to avoid having people doing plain checkouts ending up pulling every review branch we'll use special refs for the review branches:
- refs/review-heads/<sha1> - These will be the heads of review branches. The <sha1> is the internal identifier that the review system will use for each review branch. This branch name is always unique. If you re-do the same review branch it'll come out with a new <sha1>. When new commits are pulled into a review branch the review-head will update.
- refs/review/<name> - These will be symbolic-refs (see git help symbolic-ref). Each one of these will point to a refs/review-heads/<sha1>. The <name> is the human readable name given to a review branch. These are basically aliases for review-heads. If a review branch is abandoned (or if it's merged we may also let you re-use names) this name may disappear from the server and it may be re-used for a later review branch. (These may end up closer to topics; eg: We may end up making this name point to the latest in a series of review-heads such that say of 5 collections of commits to be reviewed the refs/review/* points to only the latest one, while all the earlier ones will have no refs/review/*)
If someone fixes something in a review branch in a way you don't like the review branch may be forked:
- When you fork a review branch a new review-head is created.
- We may create fork review aliases in the form refs/review/<name>/<n> such that forking refs/review/fubar leads to creating refs/review/fubar/1.
- A forked review branch will be linked to the previous review branch in a way that lets people easily find it. Notifications may also be made notifying that the review branch was forked. And the new review branch will copy or share some other things from the previous review branch. Such as people watching to the review branch.
- If you checkout a review branch; reset so that you are on an earlier commit in the same review branch and create a new set of commits (or --amend an existing commit); Then push to your public repo; When the review system fetches these it may notice you based this off of an existing review branch but newer commits have been discarded and will automatically create a fork of that review branch for you.
Review branches may be split:
- If the series of commits made into a review branch are too many and you wish to approve only a collection of them a review branch can be split.
- In the review UI you can select a point at which to split the review branch and then submit.
- The review system will create one ore more new review-heads and point the previous portions of the review branch to those. Tweaks will be made in the database to create new collections for review.
- The end result shown in the system will be that one collection of commits for review will become two or more collections of commits.
  - These automatically will depend on each other as if you had pushed a set of commits creating a collection for review. And then added more commits pushing them and
Review branch collections may be combined:
- If you accidentally split up or pushed a fix that ended up creating a new collection a collection of commits in a review branch can be merged with the collection of commits it is based on.
- The review system will drop ~~one of the review-heads~~ a review-head and update things so everything is merged together.
- We may base what one to drop based on how much data is attached. eg: If you push a collection then right away combine it then the newer review-head may be dropped since you really wanted to add things to the other; While if you push a collection, and then push another collection which gets comments from users. And then do a combine it'll drop the older review-head since the newer one is what everyone is using; (Or maybe we'll let you pick in the ui; or perhaps we'll create a brand new review-head for the new collection )
- The dropped review head won't actually be deleted from the database.
  - It'll be kept around to redirect people from the previous commit head to the new one.
  - We could update the refs/review-heads/* so that both review heads point to the same sha. Or use a symbolic-ref to point the dropped one to the new one.
Comments will probably have some sort of connection tying them to a specific commit rather than a whole collection of commits for review. This may or may not be reflected in the ui. This will let us attempt to intelligently handle comments when we split or fork a commit. In some cases we may make some sort of phantom references to the comments in another collection of commits.
A collection of commits depending on other unreviewed collections of commits will still have a review button. Using it and confirming the warning it gives you will automatically review all the collections it depends on along with the collection.
A user will have profile of their own:
- They will be able to control a list of emails to attribute to them like in gerrit.
- The profile will allow them to control a list of their public repos which the review system should pull from.
- They will have a watchlist they can control.
We should try and ensure that any diff comparison we have is at least as good as gerrit's is.
Commenting, giving a +1 -1, and verifying collections
Since things would be reviewed in terms of collections of commits the page should have an intuitive list of those commits. GitHub does this quite well https://github.com/dantman/mediawiki-core/compare/master...skinrewrite-private
We should look into seeing if there are any libraries that let us output listed graphs into web pages the way desktop git viewers do. Or perhaps write something. We'll probably need to use svg+js if we want it to be sane.
While we may include gitweb since it's standard we should look into other git viewers to use as the primary viewer. Especially if that viewer includes line graphs like desktop git guis do.
Notification system should be pluggable (email, irc, etc... will be implemented as separate classes plugging themselves in) and be smart in it's implementation:
- When a collections is merged we don't need a separate notification for any comments or review notes made at the same time.
- When mass-reviewing (eg: Using the option that reviews all dependent collections) we only need one notification
Authz/authn system should be pluggable. At the very least, we need a plugin for LDAP authentication/authorization.
- We probably want to layer this so we can support built-in password auth, taking accounts and passwords from an auth system like LDAP, and using a single SSO using an external page. As well as perhaps arbitrary 3rd party auths like Open ID, various OAuth sites, GitHub, etc...
- Registration url should be arbitrarily override able.

Gareth should use a STOMP server
- Using a PUBLISH/SUBSCRIBE model Gareth publishes a message on every new review, update, fetch, etc...
- A set of code dedicated to WebSocket handling subscribes to a portion of these and relays them to clients that ask for specific ones (provided they have permission)
- This would power the fetch progress bar, a live "Most recent review branches" list on the frontage, etc...
- Internally Gareth would actually subscribe to the STOMP server inside of a thread/process dedicated to email notifications. Instead of doing it right away.
- For with a digest preference it would actually queue things up and wait for a timer to finish. Allowing messages to be grouped together.
- Similarly IRC notifications would be handled by a thread/process dedicated to IRC that has subscribed to certain types of STOMP messages.
- https://code.google.com/p/stomppy/
- http://stomp.github.com/implementations.html
- https://activemq.apache.org/apollo/
- https://code.google.com/p/coilmq/

Gareth should have an API:
- eg: /api/1/projects
- Project list, User list, Remotes list, etc...
Gareth should also have search & filter functionality:
- Filters like project:mediawiki/extensions/* should work.
- User filters should accept username, realname, and email. (Link these all to identity?)
- Consider Lucene indexing commit messages, whole file contents (latest), and commit changes.
- Search should have an option to switch between review, commit, project, etc... searching.
- Search box should use contentEditable to create an intuitive interactive search input.
  - Autocomplete things like project:, owner:, as well as actual input.
  - Individual items like project:mediawiki/core should turn into whole bubbles.
- Search functionality should be available via the API.
Urls like /p/mediawiki/extensions/* should work to view the project list page with a filtered list of projects.
- We may want to have actual labelled groups for project repos to allow the projects page to have multiple sections of repos.

Support the Activity Streams and OStatus protocols alongside Atom (PubSub too?), etc... so not only can feed readers get live updates about new review sets, comments, etc... but one can also subscribe using things like Status.Net.
- For parts of the API with some type of event stream, support an Atom serialization with PubSubHubbub as a method of watching it.
- Build our PubSubHubbub hub directly into Gareth:
  - Hub url will be a path in the Gareth install.
  - Only topic urls pointing to known streams within Gareth will be permitted, others will be rejected.
  - hub.mode=publish will be rejected.
  - Internally STOMP will be used. Event gets sent over stomp. Workers pick up on this and start delivering updates.
  - Then again it might be easier to bundle an existing PubSubHubub implementation. Add some whitelisting of topic urls to it. Then deliver hub.mode=publish to it like normal.

Creation of new project repos is done using the same kind of review sets as for commits.
- Anyone can create one of these from a "Create project" button on the projects page.
- They are attached to an anonymous repo some actions can be taken on.
  - A remote can still be attached to this like normal. So the create project request can either create an empty repo or simultaneously also be a review of some (or all) of the commits already in the repo.
- Besides "Cancel" and "Submit for review" (or just "Submit"?) when creating a review set you have the rights to self review (especially for project creation) there's also a "Submit & Accept" button which will bypass the normal "Create review set then accept with the 'Reject' and 'Accept' buttons" pattern and result in immediate review and creation of the project.
- Likewise any review set can be marked as belonging to any branch, including those that don't exist. So having the commit reviewed will result in the branch being created.
An ACL system handles permissions.
- View, submit (for review), accept, self-merge, and create project.
- Individual ACL rules can apply to everyone, specific users, and groups.
- View and create project rights can only be filtered by project name.
- Accept and self-merge rights are filtered by project and branch.
- Accept rights also allow you to create any branch with a pattern you have the rights to accept something inside.
- Filters for applying ACL rules can be made as strict comparisons, simple patterns like foo/*, anchored regular expressions.
Gareth will probably need storage for keys and logins of it's own to access 3rd party repos. Primarily to support private repos. Since a user's own public remote will have to require credentials to view. Support for hosting personal repos and forks may be necessary to make support for private repos have sane usability.
We should have a proper group system.
- Internal groups can be edited and have specific users (or even other groups) belong to them.
- Should also support smart groups as types that can be implemented by plugins. Such as an LdapGroup smart group implemented by the LDAP auth plugin. Which would take some options to define a pattern of LDAP users that belong to the group.

Starred projects (or some other term)
Various degrees of watch settings [Not watching (Notifications when reviewing), Digest (maybe not), Watching (notifications for all commits or all matching a pattern), Ignoring (no notifications, even if you're reviewing)]. For repos. And certain patterns in those repos (like modifications to certain files).
Reviewers reviewing other people's change sets could use something like an 'Archive' button that would leave them as a reviewer and retain the watchlist. But drop the change set off their dashboard until someone actually responds on it.
Dashboard (with filtered/per-project dashboard versions) with your change sets, ones you're reviewing, and helpful extra boxes like "Recently merged" and "Recently abandoned or rejected". As well as starred commits and projects.
Named tagging for commits.
Instead of -1 and +1 we should give each of the community two of each, like a (-2, -1, +1, +2) — though we might have reason to not include +2. The difference between -1 and 2 (we won't use those terms) essentially being "There's something wrong with the current state of the review branch" which only applies to the current state and requires re-review later vs. "I'm opposed to the fundamental idea of this change." which usually won't change no matter how many new commits are made.
- Gerrit just kills off old review statuses. We have one more permanent version of -1. But we should also not completely kill non-permanent reviews. Instead we should display them out in a greyed out format. To indicate that someone did in fact -1 (or +1) the code before for a bug in it. And it may still apply or may be fixed. And the user needs to re-review the commit.
Embedded Lua for custom control over some parts of the system? Gerrit also uses Prolog for some things. Perhaps we should also investigate whether some of the areas we'd give control over could be beneficial to have Prolog support as well.

Gareth -> GitHub mirroring can be implemented in a plugin instead of a bot.
- Handle setup and using OAuth. Have someone create an organization and register a user. Then add that user to the organization and use OAuth in a special interface in Gareth to setup Gareth with the rights. After that it'll do most of the configuration itself.
  - Add pubkeys for Gareth to push with automatically.
  - New repositories can be added by create api.
- When repos are created homepage can be the url of the project in Gareth. has_issues and has_wiki should be set to false. description can be copied from the one used in the Gareth repo.
- commit comments, pull requests, and pull request comments can be listened for preferably using GitHub's PubSubHubbub or by configuring a service hook.
- When a pull request appears create a Gareth review branch for it then close the GitHub pull request with a comment.
  - The comment can be something like "This pull request has been moved to the Gareth review branch {URL} please log in there with your GitHub account to continue the review."
  - If a relevant user and remote don't already exist in Gareth a user will be created into Gareth for that GitHub user but not authed yet. A remote pointing to the head.repo.clone_url will be created for the user.
  - head.sha will be used to create the review branch. Not sure if base.sha will be needed.
- The GitHub Mirroring plugin can hook into the plugin we'll use for integrating Jenkins (or maybe the STOMP notifications generated by it instead of a direct hook). From there it can use GitHub's Repo Statuses API to mirror statuses.
Jenkins support will be part of a plugin.
- No hack like verified and comments.
- Every commit will have a possible state added to it for tests. Jenkins will update these. Indicators for them will be added to the UI.
  - The commit list will have indicators showing the status of each commit that was tested.
  - An extra indicator in a more primary location will show the status of the latest commit in the review set.
  - Instead of spamming the comments there will be an actual box for the detailed status information like links, etc... maybe we'll make old ones a popover from the indicator.
  - Considering the various Continuous Integration systems most of the common stuff including UI will probably be built-in. Either that or we'll have an abstract CI plugin that all CI plugins will depend on and extend from. Assuming we can make our UIs flexible enough for plugins to add their own areas that work in all themes without each theme having to implementing support for the CI plugin.
- The method of accepting a review set won't change, there will be no +2 of review sets instead of pressing the accept button.
- The merging commits into branches part of accepting review branches will have a hook which CI plugs into. This hook will allow the CI plugin to postpone the final update of the ref, wait for the CI of the latest commit in the review branch to finish, and then either abort or resume the ref update.
Consider some optional support for git flow in a plugin.
- Allow git flow settings to be configured in repositories.
- Automatically create review branches for any branch following the git flow feature branch pattern, perhaps also the release and hotfix pattern.
- It might be a little difficult to automatically handle the release branch pattern of automatically merging into master and tagging considering the potential merge conflicts for the merge back into develop which shouldn't be done till the actual "Finish" ie: "Accept" of the branch.

There's a good chance that the filesystem will become a bottleneck to all the operations of fetching and listing commits me may have to consider indexing or caching of the commits in a repository.
- For indexing, rather than implementing around one storage system we should create an abstract class for the basic storage operations we'll use and some extra methods for patterns like some types of listing/iterating of commits.
- We'll write implementations for various kv stores so the best can be decided and decided by those making production setups:
  - ZOPE and ZOPE+ZEO
  - CodernityDB
  - Berkeley DB
  - Redis
  - Riak
- Besides indexing ZOPE has RAMCache feature. To a limited degree we can use this to vastly improve performance of fetches for commonly accessed stuff.

Notes

Git's fetching can actually be controlled with refspecs, if you do this (assuming https://git.wikimedia.org/r/mediawiki/core.git is MediaWiki Core's git repo; and we use refs/review/* symrefs pointing to refs/review-heads/* in the review system)

$ git clone https://git.wikimedia.org/r/mediawiki/core.git
$ cd core
$ git remote add review https://git.wikimedia.org/r/mediawiki/core.git
$ git config remote.review.fetch +refs/review/*:refs/remotes/review/*

You will have a git repo with the remote 'origin' pointing to mediawiki/core where every remote branch is a branch on the remote. And another remote review pointing to the same mediawiki/core but this time where every remote branch is actually an unmerged review branch you can easily start tracking in a branch of your own to work on that, without needing to use a long pull line each time.

Implementation tips

git for-each-ref --sort='-committerdate' refs/remotes/$REMOTE/ seems to be one useful way to get a list of heads for a specific remote and sort them by which one was updated most recently.
git rev-list $REMOTE/$REMOTE_HEAD --not --branches --branches seems to be one useful way to get a list of commits that are in a remote head but not any of the branches or tags. --glob can also be used to add things like our refs/review... refs. --remote may also be able to help with cleaning up lists for remote branches that are based on other remote branches.

Idea for separating frontend from the python code

One extra idea would be to implement Gareth's frontend in Node.js and make the python side code an API. The Node.js API would interact with the Python core using the HTTP API and listening to STOMP events.

Pros:

We can use better template systems to build the front end UI and we can even share this between the client and server. Implementing some portions of the site in ways that will work without client-side JS will even be easier.
Node.js already has a stable SockJS implementation so we won't have to help develop SockJS-gevent just to use SockJS.
As the frontend will be directly using the API Gareth will definitely have a production level API to use from clients, etc...
Assuming we replace Django's web frontend with something lighter weight we could find something that supports gevent better.

Cons:

This would involve replacing some already implemented components; ie: A partial rewrite.
- The frontend UI would have to be partially rewritten for Node.
- Ideally the bulky Django web code would be replaced with something lighter. This itself wouldn't be a big issue since we'd already inherently be dropping the existing frontend oriented routes and replacing them with ones for an API.
- When Django web stuff is dropped it's hard to not drop other Django stuff with it. Some work will be needed to replace Django's ORM with SQLAlchemy or some other API.

Links

Very rough project start https://github.com/dantman/gareth

These things may have some use in the project:

GitLabs/GitHub style look could be a good starting point for themeing: http://gitlabhq.com/demo.html
CGit has an ASCII tree graph: http://hjemli.net/git/cgit/log/ We really want a graphical one though.
Visualization libraries may come in handy:
~~There is a LDAP user provider for Symfony which may be useful later on https://github.com/opensky/LdapBundle~~