Content translation/Deployments/How-to/TPA
This how-to describes how to update the Template Parameter Alignment (TPA) database in cxserver.
Connect to stat100x
ssh -N stat100X -L 8880:127.0.0.1:8880
Then open http://localhost:8880/ in a browser.
This opens JupyterHub, which requires your LDAP password to log in.
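If the page does not load, it can help to confirm that the SSH tunnel is actually listening on the local port. The following is a minimal sketch, not part of the original procedure; the helper name `port_is_open` is hypothetical, and the default port 8880 is taken from the ssh command above.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 8880, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

Calling `port_is_open()` while `ssh -N` is running should return True; if it returns False, the tunnel is probably down and the ssh command needs to be restarted.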
Starting notebook
First check when your Kerberos authentication expires. The default timeout is currently 48 hours.
klist
Extend it by running kinit:
kinit
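The same check can be done from a notebook cell before starting a long run. This is a minimal sketch, assuming the MIT Kerberos `klist`, whose `-s` option prints nothing and exits with a non-zero status when the credentials cache is unreadable or expired. The function name `has_valid_ticket` and its injectable `run` parameter are hypothetical and exist only to keep the example testable.

```python
import subprocess
from typing import Callable


def has_valid_ticket(run: Callable = subprocess.run) -> bool:
    """Return True if `klist -s` reports a readable, unexpired credentials cache."""
    try:
        result = run(["klist", "-s"], check=False)
    except FileNotFoundError:
        # klist is not installed on this machine.
        return False
    return result.returncode == 0
```

If `has_valid_ticket()` returns False, run `kinit` in a terminal before starting the notebooks.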
Running scripts
- Open a terminal and clone:
https://gitlab.wikimedia.org/dsaez/templatesAlignment
- Update config.json with the language pairs for which template parameter alignments need to be generated.
- Run all notebooks in order.
00ExtractNamedTempates.ipynb overwrites its existing output files when it is run again, so save the produced JSON files (e.g. templates-articles_xx.json and templates-summary_xx.json) in another directory to avoid losing data. For large languages such as en, these files can be reused if the process is rerun within a few days, which saves time.
- While running 02alignmentsSpark.ipynb, make sure that the Wikidata partition is up to date.
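Saving the extraction outputs before a rerun can be scripted. The sketch below is one possible approach, not part of the repository: it copies files matching the two name patterns mentioned above into a new timestamped folder. The function name `backup_outputs` and both directory arguments are hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path

# File name patterns taken from the outputs named in the steps above.
PATTERNS = ("templates-articles_*.json", "templates-summary_*.json")


def backup_outputs(work_dir: str, backup_root: str) -> list:
    """Copy the extraction outputs into a new timestamped folder and return the copies."""
    target = Path(backup_root) / datetime.now().strftime("%Y%m%dT%H%M%S")
    target.mkdir(parents=True, exist_ok=True)
    copies = []
    for pattern in PATTERNS:
        for src in sorted(Path(work_dir).glob(pattern)):
            # copy2 preserves timestamps, which helps judge how fresh a file is.
            copies.append(Path(shutil.copy2(src, target)))
    return copies
```

For example, `backup_outputs(".", "../tpa-backups")` run from the cloned repository would keep one copy per run, so an earlier en extraction can be restored instead of regenerated.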
Updating database
Run scripts/prepare-template-mapping.sh from cxserver, pointing it at all the files generated by the process.
This produces an updated templatemapping.db in the same folder. Use the sqldiff command (available in the sqlite3-tools package on Linux) to see the differences between the old and new databases.
Copy it to config/templatemapping.db and submit a patch for review. The database can be opened with the sqlite command to check the number of template parameters updated,
e.g. sqlite> select count(*) from templates where source_lang='en' and target_lang='vec';
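The same count can be taken from Python, which makes it easy to compare the old and new databases for several language pairs. This is a minimal sketch using the standard sqlite3 module. It assumes only what the query above shows (a `templates` table with `source_lang` and `target_lang` columns); the function name `count_mappings` is hypothetical.

```python
import sqlite3


def count_mappings(db_path: str, source_lang: str, target_lang: str) -> int:
    """Count rows in the templates table for one language pair."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT COUNT(*) FROM templates WHERE source_lang = ? AND target_lang = ?",
            (source_lang, target_lang),
        ).fetchone()
    finally:
        conn.close()
    return row[0]
```

Calling it once on the old config/templatemapping.db and once on the newly generated file, with the same language pair, shows whether the update added mappings for that pair.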
Notes
1. 02alignmentsSpark.ipynb needs the fastText_multilingual module, which must be installed manually in the conda environment. It is available at: https://github.com/babylonhealth/fastText_multilingual
a. Find the conda environment directory using conda list (the first line of its output shows the environment path).
b. Copy the module into the environment manually, e.g. to /home/kartik/.conda/envs/2023-06-08T01.31.46_kartik/lib/python3.10/site-packages/fastText_multilingual
2. 03ProduceAlignments.py requires the fastText Python module from https://github.com/facebookresearch/fastText/tree/master/python instead of the version provided by pip.
3. 03ProduceAlignments.py might throw the error IndexError: list index out of range when a language has no {{Cite web}} template, or when that template is not linked to Wikidata. Try fixing the Wikidata entry. If that does not help, skip that language.
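The manual copy described in note 1 can also be done from a notebook running inside the target conda environment. This is a minimal sketch, assuming the repository was cloned into a local directory named fastText_multilingual. The function name `install_module_copy` is hypothetical, and `sysconfig.get_paths()["purelib"]` is used to locate the site-packages directory of the running interpreter.

```python
import shutil
import sysconfig
from pathlib import Path
from typing import Optional


def install_module_copy(module_src: str, site_packages: Optional[str] = None) -> Path:
    """Copy a module directory into site-packages and return the destination path."""
    src = Path(module_src)
    if not src.is_dir():
        raise NotADirectoryError(str(src))
    # Default to the site-packages directory of the interpreter running this code.
    dest_root = Path(site_packages or sysconfig.get_paths()["purelib"])
    dest = dest_root / src.name
    # dirs_exist_ok lets a rerun refresh an earlier copy (Python 3.8+).
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

For example, `install_module_copy("fastText_multilingual")` copies the cloned directory next to the other installed packages, so the notebook can import it after a kernel restart.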
Useful resources
- All about the Conda environment: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda
- Issues related to Kerberos access: https://wikitech.wikimedia.org/wiki/SWAP#Access_and_infrastructure
- Jupyter at Wikitech contains useful information: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter