Wikimedia Cloud Services team/Onboarding Hieu/Sessions
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date.
2019-12-12
modules/puppetmaster/files/production.hiera.yaml
$data = lookup('whatever')
$data = "other" <--- fails!! (a Puppet variable cannot be reassigned within the same scope)
modules/puppetmaster/files/labs.hiera.yaml
https://gerrit.wikimedia.org/r/admin/projects/cloud/instance-puppet
modules/profile/manifests/wmcs/monitoring.pp
2019-12-05
gerrit:554853 (use var from hiera)
hieradata/labs.yaml
git grep profile::mediawiki::scap_client
git grep profile::mediawiki::common
https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/
Catalog ran successfully: https://puppet-compiler.wmflabs.org/compiler1003/19814/mwmaint1002.eqiad.wmnet/
mwmaint1002.eqiad.wmnet
hieradata/hosts/cloudservices1003.yaml:
- prometheus-pdns-exporter is scraped by labmons
- prometheus-node-exporter by prod servers
prometheus_nodes:
- labmon1001.eqiad.wmnet
- cloudmetrics1002.eqiad.wmnet
- prometheus1003.eqiad.wmnet
- prometheus1004.eqiad.wmnet
2019-11-28
https://office.wikimedia.org/wiki/Pwstore
cumin1001.eqiad.wmnet
aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "hieu reimaging server" --hours 1 labmon1002*
aborrero@cumin1001:~ $ sudo install_console labmon1002.mgmt.eqiad.wmnet
- or ssh root@labmon1002.mgmt.eqiad.wmnet
cd pw
../pwstore/pwd ed management
</>hpiLO-> vsp
Virtual Serial Port Active: COM2
Starting virtual serial port. Press 'ESC (' to return to the CLI Session.
Debian GNU/Linux 8 labmon1002 ttyS1
labmon1002 login:
- merge patches (dns and puppet)
https://gerrit.wikimedia.org/r/c/operations/puppet/+/553441 https://gerrit.wikimedia.org/r/c/operations/dns/+/553467
ssh ns1.wikimedia.org
aborrero@authdns2001:~ $ sudo authdns-update
- run puppet on install servers
aborrero@cumin1001:~ $ sudo cumin A:installserver run-puppet-agent
- run the script:
phamhi@cumin1001:~$ sudo -i wmf-auto-reimage-host --rename cloudmetrics1002.eqiad.wmnet --rename-mgmt cloudmetrics1002.mgmt.eqiad.wmnet -p T224585 labmon1002.eqiad.wmnet labmon1002.mgmt.eqiad.wmnet
15:59:02 | labmon1002.eqiad.wmnet | REIMAGE START | To monitor the full log and cumin output:
sudo tail -F /var/log/wmf-auto-reimage/201911281559_phamhi_182318_labmon1002_eqiad_wmnet.log
sudo tail -F /var/log/wmf-auto-reimage/201911281559_phamhi_182318_labmon1002_eqiad_wmnet_cumin.out
IPMI Password:
15:59:12 | labmon1002.eqiad.wmnet | Validated host
15:59:13 | labmon1002.eqiad.wmnet | Downtimed on Icinga
15:59:18 | labmon1002.eqiad.wmnet | Removed from Puppet
15:59:18 | labmon1002.eqiad.wmnet | Removed from Debmonitor
15:59:18 | labmon1002.eqiad.wmnet | Set Boot Device to pxe
15:59:18 | labmon1002.eqiad.wmnet | Power cycling
15:59:18 | labmon1002.eqiad.wmnet | Chassis Power Control: Cycle
16:03:27 | cloudmetrics1002.eqiad.wmnet | Still waiting for reboot after 5.0 minutes
16:03:29 | cloudmetrics1002.eqiad.wmnet | Uptime checked
16:03:29 | cloudmetrics1002.eqiad.wmnet | Host up (Debian installer)
16:08:08 | cloudmetrics1002.eqiad.wmnet | Still waiting for reboot after 5.0 minutes
16:13:23 | cloudmetrics1002.eqiad.wmnet | Still waiting for reboot after 10.0 minutes
16:13:25 | cloudmetrics1002.eqiad.wmnet | Uptime checked
16:13:25 | cloudmetrics1002.eqiad.wmnet | Host up
16:13:33 | cloudmetrics1002.eqiad.wmnet | Puppet CSR generated, fingerprint is: 06:02:32:2F:0E:80:B8:CA:8E:74:34:9B:63:EA:94:41:EF:B3:0E:B3:DF:D1:4B:84:F4:B3:73:66:B9:78:16:D5
16:13:33 | cloudmetrics1002.eqiad.wmnet | Polling until a Puppet sign request appears
16:13:37 | cloudmetrics1002.eqiad.wmnet | Signed Puppet cert
16:13:39 | cloudmetrics1002.eqiad.wmnet | Validated host
16:13:39 | cloudmetrics1002.eqiad.wmnet | Scheduled delayed downtime on Icinga
16:13:41 | cloudmetrics1002.eqiad.wmnet | Started first puppet run (sit back, relax, and enjoy the wait)
START - Cookbook sre.hosts.downtime
Forcing a Puppet run on the Icinga server
Running Puppet with args --quiet --attempts 30 on 1 hosts: icinga1001.wikimedia.org
Downtiming 1 hosts and all their services for 2:00:00: cloudmetrics1002.eqiad.wmnet
Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: cloudmetrics1002.eqiad.wmnet
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
16:20:24 | cloudmetrics1002.eqiad.wmnet | First Puppet run completed
16:20:25 | cloudmetrics1002.eqiad.wmnet | WARNING: unable to verify that BIOS boot parameters are back to normal, got:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0004000000
Boot Flags :
- Boot Flag Invalid
- Options apply to only next boot
- BIOS PC Compatible (legacy) boot
- Boot Device Selector : Force PXE
- Console Redirection control : System Default
- BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
- BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST
16:20:50 | cumin1001.eqiad.wmnet | Puppet run completed
16:20:51 | cloudmetrics1002.eqiad.wmnet | Rebooted host
16:24:00 | cloudmetrics1002.eqiad.wmnet | Uptime checked
16:24:00 | cloudmetrics1002.eqiad.wmnet | Host up
16:24:00 | cloudmetrics1002.eqiad.wmnet | Polling the completion of a Puppet run
16:26:04 | cloudmetrics1002.eqiad.wmnet | Puppet run checked
16:26:04 | cloudmetrics1002.eqiad.wmnet | Reimage completed
16:26:04 | cloudmetrics1002.eqiad.wmnet | REIMAGE END | retcode=0
- AFTER operations: cleanup DNS entries
- AFTER operations: cleanup stale file in puppet:
hieradata/hosts/labmon1002.yaml
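The reimage steps above can be collected into a dry-run checklist. This is only a minimal sketch, not an official tool: `reimage_plan` is a hypothetical helper that prints the commands recorded in this session, in the order they were run, and executes nothing. The `eqiad.wmnet` and `mgmt.eqiad.wmnet` suffixes are copied from the session and are assumed to apply to other hosts; the downtime reason text is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (never execute) the reimage commands from the notes.
# Usage: reimage_plan OLD_NAME NEW_NAME PHAB_TASK
reimage_plan() {
    local old="$1" new="$2" task="$3"
    local dom="eqiad.wmnet" mgmt="mgmt.eqiad.wmnet"   # assumed site suffixes

    # 1. Downtime the old host in Icinga before touching it.
    echo "sudo cookbook sre.hosts.downtime -r \"reimaging server\" --hours 1 ${old}*"
    # 2. Once the DNS and puppet patches are merged, deploy DNS from an authdns host.
    echo "sudo authdns-update"
    # 3. Run puppet on the install servers so they pick up the merged change.
    echo "sudo cumin A:installserver run-puppet-agent"
    # 4. Reimage and rename in one step.
    echo "sudo -i wmf-auto-reimage-host --rename ${new}.${dom} --rename-mgmt ${new}.${mgmt} -p ${task} ${old}.${dom} ${old}.${mgmt}"
}

reimage_plan labmon1002 cloudmetrics1002 T224585
```

Printing rather than running keeps the sketch safe to try anywhere; each line still has to be run by hand on the right host (cumin1001 or the authdns server).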
2019-11-27
- files within puppet repo with the keyword "labmon"
- ------------------------------------------------------------------------------------------
hieradata/hosts/cloudservices1003.yaml hieradata/hosts/cloudservices2002-dev.yaml hieradata/hosts/cloudservices1004.yaml
hieradata/hosts/cloudcontrol1003.yaml hieradata/hosts/cloudcontrol2001-dev.yaml hieradata/hosts/cloudcontrol2003-dev.yaml hieradata/hosts/cloudcontrol1004.yaml
hieradata/labs/cloudinfra/host/cloud-puppetmaster-01.yaml hieradata/labs/cloudinfra/host/cloud-puppetmaster-02.yaml hieradata/labs/cloudinfra/host/cloud-puppetmaster-03.yaml hieradata/labs/cloudinfra/host/cloud-puppetmaster-04.yaml
modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200 hieradata/role/common/wmcs/monitoring.yaml hieradata/common/profile/openstack/eqiad1.yaml
modules/install_server/files/autoinstall/preseed.cfg modules/install_server/files/autoinstall/netboot.cfg
modules/profile/templates/cumin/aliases.yaml.erb manifests/site.pp
- ------------------------------------------------------------------------------------------
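A list like the one above can be regenerated whenever the tree changes. The notes do not record which command produced it; the sketch below assumes a plain recursive `grep`, and `files_mentioning` is a hypothetical helper name. Inside a git checkout, `git grep -l labmon` (the same `git grep` used in the 2019-12-05 session) would likely give an equivalent result.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list files under DIR whose content mentions KEYWORD.
# Usage: files_mentioning KEYWORD [DIR]
files_mentioning() {
    local keyword="$1" dir="${2:-.}"
    # -r: recurse, -l: print file names only, -F: treat KEYWORD as a literal string.
    # --exclude-dir keeps git metadata out of the results.
    grep -rlF --exclude-dir=.git -- "$keyword" "$dir" | sort
}
```

For example, running `files_mentioning labmon .` from the top of a puppet checkout prints one matching path per line.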
2019-11-21
labmon migration
https://phabricator.wikimedia.org/T224585 https://gerrit.wikimedia.org/r/c/operations/puppet/+/552107
labmon1001 (primary) labmon1002 (backup)
- how to switch active to standby
things to be backed up and restored
- /var/lib/grafana (dashboard data is not in puppet)
- labmon name change -> cloudmetrics1001 & cloudmetrics1002
- disable puppet agent on labmon
- do 1002 (standby) first
- shutdown 1002
- change puppet to change hostname
- turn back on 1002 to ensure hostname change is correct
- (remember to update netbox)
- reimage to buster
- make 1002 the primary
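For the backup-and-restore item in the plan above, a tar archive is one simple option. The sketch below is an assumption about how this could be done, not the procedure the team actually used; `backup_dir` and `restore_dir` are hypothetical helpers and every path is passed in as a parameter.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the "back up and restore" step; nothing is hard-coded.

# Archive directory SRC into ARCHIVE (gzip-compressed tar), keeping permissions.
backup_dir() {
    local src="$1" archive="$2"
    # -C switches to the parent directory so the archive stores a relative path.
    tar -czpf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
}

# Unpack ARCHIVE below DEST_PARENT, recreating the archived directory there.
restore_dir() {
    local archive="$1" dest_parent="$2"
    mkdir -p "$dest_parent"
    tar -xzpf "$archive" -C "$dest_parent"
}
```

On the real host this would look like `backup_dir /var/lib/grafana /some/safe/place/grafana.tar.gz` run as root before the reimage, followed by `restore_dir` with `/var/lib` as the destination afterwards; the archive location here is a placeholder.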
https://gerrit.wikimedia.org/r/admin/projects/operations/dns (another puppet repository for DNS)
2019-11-14
https://wikitech.wikimedia.org/wiki/Incident_documentation
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron
network bonding/network teaming? multiple network switches
2019-11-05
- Services we offer / stuff we have / dependency assessment
- IaaS (CloudVPS)
- PaaS (Toolforge)
- DaaS (Wiki-replicas, toolsdb)
- Others (LDAP, etc)
- For each service we offer, what is the current status from the availability and continuity point of view. Identify SPOF.
- IaaS
- hardware level (NICs, switches, RAID storage, racks, disk backups? etc)
- software level (openstack services in HA, which are not, provisioning/bootstrap, puppet etc)
- PaaS
- hardware level (this uses our own IaaS as hardware)
- software level (grid, k8s, docker registry, services, NFS, and other Toolforge key components, puppet, etc)
- DaaS
- hardware level (this uses both our own IaaS as hardware and physical hardware)
- software level (simple cold-standby setups, dbproxies, puppet, etc)
- Others
- For each service we offer, things to improve in both short term and long term. Do we need them? Are they cost-effective?
- IaaS
- hardware level:
- storage (ceph)
- NIC redundancy
- Racking scheme (not everything in row B eqiad)
- etc
- software level:
- glance in HA
- neutron DVR (distributed virtual routing)
- automatic bootstrapping / provisioning
- etc
- PaaS
- hardware level:
- automatic provisioning / bootstrapping
- offline backups?
- software level:
- anything?
- DaaS
- hardware level:
- etc
- software level:
- etc
- Others
2019-10-27
- tools-webservice
- labmon migration
- documentation (https://wikitech.wikimedia.org/wiki/Systems_and_Service_Continuity)
- https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals reallocating page!
- not ready for review yet
- https://phabricator.wikimedia.org/T218461
- cloud-cumin-01.cloudinfra.eqiad.wmflabs
- https://tools.wmflabs.org/openstack-browser/project/cloudinfra
$ sudo cumin "project:tools" "apt-cache policy toollabs-webservice"
sudo cumin "O{project:tools name:tools-sgebastion-08}" "apt-cache policy toollabs-webservice"
aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true"
aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true && apt-get install toollabs-webservice -s"
Real installation:
aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true && apt-get install toollabs-webservice"
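One detail of the commands above is worth spelling out. In shell, `&&` and `||` have equal precedence and are evaluated left to right, so `A || true && B` is parsed as `(A || true) && B`: `B` runs whether or not `A` succeeded. The `grep install` check therefore only reports the current state and does not gate the `apt-get install` step (which `-s` turns into a simulation in the earlier variant). The sketch below shows the behaviour with stand-in functions; `check_installed` and `do_install` are hypothetical names and no package is touched. Whether the ungated behaviour was intended in the session is not recorded.

```shell
#!/usr/bin/env bash
# Stand-ins for the real commands (hypothetical names, nothing is installed).
check_installed() { return "$1"; }    # exit status 0 = "installed", 1 = "not installed"
do_install()      { echo "install ran"; }

# Same shape as the cumin command in the notes: parsed as (check || true) && install,
# so the install step runs for either outcome of the check.
ungated() { check_installed "$1" || true && do_install; }

# Gated variant: the install step runs only when the check succeeded.
gated() { if check_installed "$1"; then do_install; fi; }

ungated 0   # prints "install ran"
ungated 1   # prints "install ran"
gated 0     # prints "install ran"
gated 1     # prints nothing
```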
2019-10-10
- status of things:
- working on reliability documentation
- labmon project externally blocked https://phabricator.wikimedia.org/T224585
wmcs_puppet_tree_clean() {
    cd /var/lib/git/operations/puppet
    sudo git clean -fd
    sudo git checkout -f
    cd -
    sudo git-sync-upstream
}
https://wikitech.wikimedia.org/wiki/User:Arturo_Borrero_Gonzalez#wmf-export-puppet-patch.sh
2019-09-26
- kubernetes ingress etc
- Q2 goal labmon https://phabricator.wikimedia.org/T224585
- some explanations of the servers
- some puppet tree pointers
2019-08-08
- multiple LDAP accounts: https://phabricator.wikimedia.org/T230126
- not in the LDAP group?
- cloud-wide root https://gerrit.wikimedia.org/r/admin/projects/labs/private
- generate a patch to add a new SSH key (cloud VPS root)
https://wikitech.wikimedia.org/wiki/LDAP
https://gerrit.wikimedia.org/r/c/operations/puppet/+/519398
- puppet workflow:
+2 verified
+2 code-review
then the merge button will appear -> git-gerrit (not yet in infra)
https://gerrit.wikimedia.org/r/c/operations/puppet/+/519398
puppetmaster1001.eqiad.wmnet
sudo puppet-merge (fetch change from gerrit to puppet master)
hpham@puppetmaster1001:~$ sudo puppet-merge
Checking for pending merges in /labs/private
Fetching new commits from https://gerrit.wikimedia.org/r/labs/private
No changes to merge.
Fetching new commits from https://gerrit.wikimedia.org/r/operations/puppet
No changes to merge.
https://github.com/wikimedia/puppet/ (mirror)
https://github.com/wikimedia/puppet/tree/production/modules/role/manifests
https://wikitech.wikimedia.org/wiki/Puppet_coding
manifests hold the code; hiera supplies the configuration data
https://github.com/wikimedia/puppet/blob/production/manifests/site.pp
- Possible initial tasks:
- Set up tools-buster repository in aptly to allow toolforge servers to be installed on buster https://phabricator.wikimedia.org/T229237
- WMCS: migrate python2 scripts to python3 https://phabricator.wikimedia.org/T229920
- Migrate labmon* to Stretch (or Buster, better yet!) https://phabricator.wikimedia.org/T224585
- Commit first patch to puppet
sudo easy_install pip
sudo pip install -U setuptools
pip install --user git-review
export PATH=$PATH:$HOME/Library/Python/2.7/bin
- clone with commit-msg hook
- https://gerrit.wikimedia.org/r/admin/projects/operations/puppet
git clone "ssh://phamhi@gerrit.wikimedia.org:29418/operations/puppet" && scp -p -P 29418 phamhi@gerrit.wikimedia.org:hooks/commit-msg "puppet/.git/hooks/"
git config --global --add gitreview.username "phamhi"
git config --global --add gitreview.email "hpham@wikimedia.org"
git review -s
- Creating a git remote called 'gerrit' that maps to:
- ssh://phamhi@gerrit.wikimedia.org:29418/operations/puppet.git
- make the change
git commit -a # add comment
git review
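The commit-msg hook copied during the clone step is what adds the `Change-Id:` trailer that Gerrit uses to track a change across amended commits. As a rough sanity check before running `git review`, the sketch below tests whether a commit message carries such a trailer. `has_change_id` is a hypothetical helper name, not part of git-review or Gerrit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the commit message on stdin contains a
# Gerrit Change-Id trailer ("I" followed by 40 lowercase hex digits).
has_change_id() {
    grep -Eq '^Change-Id: I[0-9a-f]{40}$'
}

# Typical use inside the clone, checking the most recent commit:
#   git log -1 --format=%B | has_change_id && echo "ready for git review"
```

If the check fails, the hook was probably not installed in `puppet/.git/hooks/` when the commit was made; amending the commit after installing the hook should add the trailer.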