recup sources

parent 86622a19ea
commit 65fe2a35f9
155 changed files with 50969 additions and 0 deletions

wiki_compare/.gitignore (vendored, new file, 3 lines)
@@ -0,0 +1,3 @@
*.json
.env
*.png
wiki_compare/CHANGES.md (new file, 103 lines)
@@ -0,0 +1,103 @@
# Implemented changes

This document summarises the changes and new features implemented as part of the update to the OSM wiki page management system.

## 1. Tracking recent OSM wiki changes

### Added features
- Created a `fetch_recent_changes.py` script that fetches recent changes in the French namespace of the OSM wiki
- Added a new `/wiki/recent-changes` route in the WikiController controller
- Created a `wiki_recent_changes.html.twig` template to display recent changes
- Updated the navigation to include a link to the recent changes page

### Usage
- Recent changes are fetched automatically every hour
- The page lists the recently modified pages with links to those pages

## 2. Heading hierarchy validation

### Added features
- Implemented logic to detect incorrect heading hierarchies, for example an h4 directly under an h2 with no intermediate h3 (see the sketch after the usage notes below)
- Added visual indicators (badges) to flag incorrect hierarchies in section lists
- Updated the `wiki_compare.html.twig` template to display these indicators

### Usage
- Incorrect hierarchies are detected automatically when wiki pages are compared
- A red badge with an exclamation mark is displayed next to headings whose hierarchy is incorrect
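
A minimal sketch of the check described above, assuming the headings of a page are available as a flat list of levels. The function name `find_hierarchy_errors` is illustrative only; the real check lives in `detectHeadingHierarchyErrors()` in `WikiController.php` (PHP), not in this Python form:

```python
def find_hierarchy_errors(heading_levels):
    """Return the indexes of headings that skip a level, e.g. an h4 right after an h2."""
    errors = []
    previous = None
    for index, level in enumerate(heading_levels):
        # Flag a heading that is more than one level deeper than the heading before it.
        if previous is not None and level > previous + 1:
            errors.append(index)
        previous = level
    return errors

print(find_hierarchy_errors([2, 3, 3, 2, 4]))  # [4]: the h4 follows an h2 with no h3 in between
```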

## 3. Local group verification

### Added features
- Updated the `fetch_osm_fr_groups.py` script to fetch local group data from Framacalc
- Added a check for whether a wiki page exists for each group
- Updated the `wiki_osm_fr_groups.html.twig` template to display the verification results
- Added filters to make browsing the groups easier

### Usage
- Groups are displayed with badges indicating their source (wiki or Framacalc)
- Groups without a wiki page are highlighted with a red badge
- The filters show only the groups in a given category (all, wiki, Framacalc, with a wiki page, without a wiki page)

## Known limitations

1. **Access to external data**: depending on the execution environment, the scripts may have trouble reaching external data sources (OSM wiki, Framacalc).

2. **Hierarchy detection**: incorrect-hierarchy detection relies only on heading levels and does not take content or semantics into account.

3. **Group matching**: Framacalc groups are matched to wiki pages by an approximate comparison of their names, which can occasionally produce inaccurate results (see the sketch below).
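
One way such an approximate comparison can work is a normalised similarity ratio; this is a sketch under that assumption. The helper name `names_match` and the 0.8 threshold are illustrative and are not taken from `fetch_osm_fr_groups.py`:

```python
from difflib import SequenceMatcher
import unicodedata

def normalise(name):
    """Lower-case the name and strip accents so 'Périgueux' and 'perigueux' compare equal."""
    stripped = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
    return stripped.lower().strip()

def names_match(framacalc_name, wiki_title, threshold=0.8):
    """Treat two group names as the same group when they are at least `threshold` similar."""
    ratio = SequenceMatcher(None, normalise(framacalc_name), normalise(wiki_title)).ratio()
    return ratio >= threshold

# Example: compare a Framacalc group name with a candidate wiki page title.
if names_match("Groupe local de Nantes", "FR:France/Nantes"):
    print("Assumed to be the same group")
```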

## Future maintenance

### Python scripts
- The Python scripts live in the `wiki_compare/` directory
- They can be run manually or through cron jobs
- The `--dry-run` option tests a script without modifying any files
- The `--force` option forces a refresh even when the cache is still fresh

### Twig templates
- The templates live in the `templates/admin/` directory
- `wiki_recent_changes.html.twig`: displays recent changes
- `wiki_compare.html.twig`: compares wiki pages with hierarchy validation
- `wiki_osm_fr_groups.html.twig`: displays local groups with wiki page verification

### Controller
- The `WikiController.php` controller contains all the routes and processing logic
- The `detectHeadingHierarchyErrors()` method can be adjusted to change the hierarchy validation rules
- The data refresh methods (`refreshRecentChangesData()`, etc.) can be modified to adjust the update frequency

# Recent changes - 2025-08-22

## Improvements to the "Pages missing in French" page

- Added a button to copy English page titles in MediaWiki format
- Implemented client-side scraping in JavaScript to extract the titles
- Added a per-page variable decay score
- Displayed the decay score as a coloured progress bar

## Fixes to the "OpenStreetMap Wiki recent changes" page

- Updated the HTML parsing logic to cope with the different wiki page structures
- Made the script more robust by trying several selectors for each element
- Added fallback methods for extracting change information

## Technical details

### Decay score

The decay score is now computed individually for each page from a hash of the page title (a sketch of the idea follows the list below). This guarantees that:
- Each page gets a different score
- English pages generally get a higher score (higher priority)
- Scores are consistent across runs of the script
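
A minimal sketch of that idea, assuming the score is derived from a stable hash of the title and nudged upward for English pages. The exact formula used by the project's scripts is not shown here, so the range and the English bonus below are illustrative:

```python
import hashlib

def decay_score(title, is_english):
    """Derive a stable pseudo-random score in [0, 100] from the page title."""
    digest = hashlib.sha256(title.encode('utf-8')).hexdigest()
    base = int(digest[:8], 16) % 71          # 0-70, identical on every run for the same title
    bonus = 20 if is_english else 0          # English pages are pushed towards higher priority
    return base + bonus

print(decay_score("En:Key:building:colour", True))        # same value on every execution
print(decay_score("De:Tag:highway=residential", False))
```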

### Copying titles in MediaWiki format

The "Copy titles in MediaWiki format" button:
- Extracts all English page titles from the section
- Formats them as MediaWiki list items (`* [[Title]]`)
- Copies them to the clipboard for easy reuse (a formatting sketch follows)
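
The formatting step itself is simple; the sketch below shows it in Python for brevity, although the page does this client-side in JavaScript, as noted above:

```python
def to_mediawiki_list(titles):
    """Turn page titles into a MediaWiki bullet list of wiki links."""
    return "\n".join(f"* [[{title}]]" for title in titles)

print(to_mediawiki_list(["En:Key:building:colour", "En:Tag:amenity=fountain"]))
# * [[En:Key:building:colour]]
# * [[En:Tag:amenity=fountain]]
```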

### More robust recent-changes detection

The recent-changes detection script was improved to:
- Try several HTML selectors, falling back from one to the next (see the sketch below), so it can adapt to changes in the wiki's page structure
- Extract change information more robustly
- Handle different versions of the recent changes page
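
A minimal sketch of that fallback pattern with BeautifulSoup, assuming a list of candidate CSS selectors; the selectors shown are examples, not necessarily the ones used in `fetch_recent_changes.py`:

```python
from bs4 import BeautifulSoup

def select_first(soup, selectors):
    """Return the first element matched by any of the candidate selectors, or None."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element
    return None

html = "<ul class='special'><li class='mw-changeslist-line'>FR:Key:building</li></ul>"
soup = BeautifulSoup(html, "html.parser")
line = select_first(soup, [".mw-changeslist .mw-changeslist-line", "ul.special li"])
print(line.get_text() if line else "no match")
```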

wiki_compare/README.md (new file, 577 lines)
@@ -0,0 +1,577 @@
# OSM Wiki Compare

This project contains scripts that analyse OpenStreetMap wiki pages, identify those that need updating or translating, and post suggestions on Mastodon to encourage the community to contribute.

## Overview

The project consists of eleven main scripts:

1. **wiki_compare.py**: fetches the 50 most-used OSM keys, compares their English and French wiki pages, and identifies those that need updating.
2. **post_outdated_page.py**: randomly selects an out-of-date French wiki page and posts a message on Mastodon suggesting that it be updated.
3. **suggest_translation.py**: identifies English wiki pages that have no French translation and posts a translation suggestion on Mastodon.
4. **propose_translation.py**: selects a wiki page (by default the first one) and uses Ollama with the "mistral:7b" model to propose a translation, which is saved to the outdated_pages.json file.
5. **suggest_grammar_improvements.py**: selects a French wiki page (by default the first one) and uses grammalecte to check its grammar and propose improvements, which are saved to the outdated_pages.json file.
6. **detect_suspicious_deletions.py**: analyses recent OSM wiki changes to detect suspicious deletions (more than 20 characters) and records them in a JSON file for display on the website.
7. **fetch_proposals.py**: fetches the OSM tag proposals currently being voted on and the recently modified proposals, and records them in a JSON file for display on the website. The data is cached for one hour to avoid hitting the wiki server too often.
8. **find_untranslated_french_pages.py**: identifies French wiki pages that have no English translation and records them in a JSON file for display on the website. The data is cached for one hour.
9. **find_pages_unavailable_in_french.py**: scrapes the category of pages not available in French, follows the pagination to collect every page, groups the pages by language prefix and prioritises those starting with "En:". The data is cached for one hour.
10. **fetch_osm_fr_groups.py**: fetches information about OSM-FR working groups and local groups from the #Pages_des_groupes_locaux section and records it in a JSON file for display on the website. The data is cached for one hour.
11. **fetch_recent_changes.py**: fetches recent OSM wiki changes for the French namespace, detects newly created pages that were previously on the list of pages unavailable in French, and records them in a JSON file for display on the website. The data is cached for one hour.

## Installation

### Prerequisites

- Python 3.6 or later
- Pip (the Python package manager)

### Dependencies

Install the required dependencies:

```bash
pip install requests beautifulsoup4
```

To use the propose_translation.py script, you also need Ollama:

1. Install Ollama by following the instructions at [ollama.ai](https://ollama.ai/)
2. Download the "mistral:7b" model:

```bash
ollama pull mistral:7b
```

To use the suggest_grammar_improvements.py script, you need grammalecte:

```bash
pip install grammalecte
```

## Configuration

### Mastodon API

To post on Mastodon you need to:

1. Create an account on a Mastodon instance
2. Create an application in your account settings to obtain an access token
3. Configure the scripts with your instance and access token

Edit the following constant in the `post_outdated_page.py` and `suggest_translation.py` scripts:

```python
MASTODON_API_URL = "https://mastodon.instance/api/v1/statuses"  # Replace with your instance
```

### Environment variables

Set the following environment variable for Mastodon authentication:

```bash
export MASTODON_ACCESS_TOKEN="your_access_token"
```
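
For reference, posting a status with that token boils down to a single authenticated POST to the endpoint configured above. This is a minimal sketch assuming `MASTODON_API_URL` points at your instance's `/api/v1/statuses` endpoint; it does not reproduce the message formatting used by `post_outdated_page.py`:

```python
import os
import requests

MASTODON_API_URL = "https://mastodon.instance/api/v1/statuses"  # replace with your instance

def post_status(text):
    """Publish a status on Mastodon using the token from the environment."""
    token = os.environ["MASTODON_ACCESS_TOKEN"]
    response = requests.post(
        MASTODON_API_URL,
        headers={"Authorization": f"Bearer {token}"},
        data={"status": text},
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    post_status("The FR:Key:building page looks out of date - updates welcome!")
```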

## Usage

### Analyse the wiki pages

To analyse the wiki pages and generate the data files:

```bash
./wiki_compare.py
```

This produces:

- `top_keys.json`: the 10 most-used OSM keys
- `wiki_pages.csv`: information about each wiki page
- `outdated_pages.json`: pages that need updating
- Console output listing the 10 wiki pages that need updating

### Post an update suggestion

To randomly select an out-of-date French page and post a suggestion on Mastodon:

```bash
./post_outdated_page.py
```

To simulate the post without actually publishing to Mastodon (test mode):

```bash
./post_outdated_page.py --dry-run
```

### Suggest a translation

To identify an English page without a French translation and post a suggestion on Mastodon:

```bash
./suggest_translation.py
```

To simulate the post without actually publishing to Mastodon (test mode):

```bash
./suggest_translation.py --dry-run
```

### Propose a translation with Ollama

To select a wiki page (by default the first one in the outdated_pages.json file) and generate a translation proposal with Ollama:

```bash
./propose_translation.py
```

To translate a specific page by its key:

```bash
./propose_translation.py --page type
```

Note: this script requires Ollama to be installed and running locally with the "mistral:7b" model available. To install Ollama, follow the instructions at [ollama.ai](https://ollama.ai/). To download the "mistral:7b" model, run:

```bash
ollama pull mistral:7b
```

The script stores the proposed translation in the "proposed_translation" property of the matching entry in the outdated_pages.json file.

### Suggest grammar improvements with grammalecte

To select a French wiki page (by default the first one that has a French version) and generate grammar improvement suggestions with grammalecte:

```bash
./suggest_grammar_improvements.py
```

To check a specific page by its key:

```bash
./suggest_grammar_improvements.py --page type
```

Note: this script requires grammalecte to be installed. To install it, run:

```bash
pip install grammalecte
```

The script stores the grammar suggestions in the "grammar_suggestions" property of the matching entry in the outdated_pages.json file. Symfony then uses these suggestions in the template to display possible corrections for the French version of the page in a dedicated section.

### Detect suspicious deletions

To analyse recent OSM wiki changes and detect suspicious deletions:

```bash
./detect_suspicious_deletions.py
```

To display the detected deletions without saving them to a file (test mode):

```bash
./detect_suspicious_deletions.py --dry-run
```

### Fetch tag proposals

To fetch the OSM tag proposals currently being voted on and those recently modified:

```bash
./fetch_proposals.py
```

To force a data refresh even when the cache is still fresh:

```bash
./fetch_proposals.py --force
```

To display the proposals without saving them to a file (test mode):

```bash
./fetch_proposals.py --dry-run
```

### Find French pages without an English translation

To identify French wiki pages that have no English translation:

```bash
./find_untranslated_french_pages.py
```

To force a data refresh even when the cache is still fresh:

```bash
./find_untranslated_french_pages.py --force
```

To display the pages without saving them to a file (test mode):

```bash
./find_untranslated_french_pages.py --dry-run
```

### Find pages unavailable in French

To identify wiki pages that have no French translation, grouped by source language:

```bash
./find_pages_unavailable_in_french.py
```

To force a data refresh even when the cache is still fresh:

```bash
./find_pages_unavailable_in_french.py --force
```

To display the pages without saving them to a file (test mode):

```bash
./find_pages_unavailable_in_french.py --dry-run
```

### Fetch the OSM-FR groups

To fetch information about the OSM-FR working groups and local groups:

```bash
./fetch_osm_fr_groups.py
```

To force a data refresh even when the cache is still fresh:

```bash
./fetch_osm_fr_groups.py --force
```

To display the groups without saving them to a file (test mode):

```bash
./fetch_osm_fr_groups.py --dry-run
```

## Automation

You can automate these scripts with cron to post update and translation suggestions regularly and to keep the data shown on the website up to date.

Example cron configuration for posting suggestions and refreshing the data:

```
# Post suggestions on Mastodon
0 10 * * 1 cd /path/to/wiki_compare && ./wiki_compare.py && ./post_outdated_page.py
0 10 * * 4 cd /path/to/wiki_compare && ./wiki_compare.py && ./suggest_translation.py

# Refresh the data for the website (every 6 hours)
0 */6 * * * cd /path/to/wiki_compare && ./detect_suspicious_deletions.py
0 */6 * * * cd /path/to/wiki_compare && ./fetch_proposals.py
0 */6 * * * cd /path/to/wiki_compare && ./find_untranslated_french_pages.py
0 */6 * * * cd /path/to/wiki_compare && ./find_pages_unavailable_in_french.py
0 */6 * * * cd /path/to/wiki_compare && ./fetch_osm_fr_groups.py

# Fetch recent changes and detect newly created pages (every hour)
0 * * * * cd /path/to/wiki_compare && ./fetch_recent_changes.py
```

Note: the data-refresh scripts already check the freshness of their cache (1 hour), but the cron configuration above ensures the data is refreshed regularly even if a script has a temporary problem. A sketch of such a freshness check is shown below.
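
A minimal sketch of the kind of freshness check the note above refers to, assuming the cache is a JSON file whose modification time is compared against a one-hour window. The helper name `cache_is_fresh` and the use of the file's mtime are assumptions, not necessarily how each script implements it:

```python
import os
import time

CACHE_MAX_AGE_SECONDS = 3600  # one hour, matching the caching behaviour described above

def cache_is_fresh(path, force=False):
    """Return True when the cached file exists and is younger than the allowed age."""
    if force or not os.path.exists(path):
        return False
    age = time.time() - os.path.getmtime(path)
    return age < CACHE_MAX_AGE_SECONDS

if cache_is_fresh("proposals.json"):
    print("Cache is fresh, skipping the wiki request")
else:
    print("Cache is stale or missing, fetching new data")
```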

## Data structures

### top_keys.json

Contains the 10 most-used OSM keys with their usage counts:

```json
[
  {
    "key": "building",
    "count": 459876543
  }
]
```

### wiki_pages.csv

Contains information about each wiki page:

```
key,language,url,last_modified,sections,word_count
building,en,https://wiki.openstreetmap.org/wiki/Key:building,2023-05-15,12,3500
building,fr,https://wiki.openstreetmap.org/wiki/FR:Key:building,2022-01-10,10,2800
...
```
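
To give an idea of how this CSV can be consumed, here is a small sketch that pairs the English and French rows for each key with `csv.DictReader`; it is illustrative and not code taken from `wiki_compare.py`:

```python
import csv
from collections import defaultdict

pages_by_key = defaultdict(dict)
with open("wiki_pages.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Index each row by key and language so the en/fr pages can be compared side by side.
        pages_by_key[row["key"]][row["language"]] = row

for key, pages in pages_by_key.items():
    en, fr = pages.get("en"), pages.get("fr")
    if en and fr and fr["last_modified"] < en["last_modified"]:
        print(f"{key}: the French page is older ({fr['last_modified']} vs {en['last_modified']})")
```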

### outdated_pages.json

Contains detailed information about the pages that need updating:

```json
[
  {
    "key": "building",
    "reason": "French page outdated by 491 days",
    "en_page": {},
    "fr_page": {},
    "date_diff": 491,
    "word_diff": 700,
    "section_diff": 2,
    "priority": 250.5,
    "proposed_translation": "Texte de la traduction proposée...",
    "grammar_suggestions": [
      {
        "paragraph": 1,
        "start": 45,
        "end": 52,
        "type": "ACCORD",
        "message": "Accord avec le nom : « bâtiments » est masculin pluriel.",
        "suggestions": ["grands"],
        "context": "...les grandes bâtiments de la ville..."
      },
      {
        "paragraph": 3,
        "start": 120,
        "end": 128,
        "type": "CONJUGAISON",
        "message": "Conjugaison erronée. Accord avec « ils ».",
        "suggestions": ["peuvent"],
        "context": "...les bâtiments peut être classés..."
      }
    ]
  },
  {
    "key": "amenity",
    "reason": "French page missing",
    "en_page": {},
    "fr_page": null,
    "date_diff": 0,
    "word_diff": 4200,
    "section_diff": 15,
    "priority": 100
  }
]
```
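
As a usage illustration, the entries above can be loaded and ranked by their `priority` field; this is a minimal sketch, not code taken from the project's scripts:

```python
import json

with open("outdated_pages.json", encoding="utf-8") as f:
    outdated = json.load(f)

# Highest priority first; entries without a French page keep their own reason string.
for page in sorted(outdated, key=lambda p: p.get("priority", 0), reverse=True):
    print(f"{page['key']:<12} priority={page.get('priority', 0):>7} {page['reason']}")
```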

### suspicious_deletions.json

Contains information about the suspicious deletions detected in recent OSM wiki changes:

```json
{
  "last_updated": "2025-08-22T15:03:03.616532",
  "deletions": [
    {
      "page_title": "FR:Key:roof:shape",
      "page_url": "https://wiki.openstreetmap.org/wiki/FR:Key:roof:shape",
      "deletion_size": -286,
      "timestamp": "22 août 2025 à 14:15",
      "user": "RubenKelevra",
      "comment": "Suppression de contenu obsolète"
    },
    {
      "page_title": "FR:Key:sport",
      "page_url": "https://wiki.openstreetmap.org/wiki/FR:Key:sport",
      "deletion_size": -240,
      "timestamp": "21 août 2025 à 09:30",
      "user": "Computae",
      "comment": "Mise à jour de la documentation"
    }
  ]
}
```

### proposals.json

Contains information about the OSM tag proposals currently being voted on and those recently modified:

```json
{
  "last_updated": "2025-08-22T15:09:49.905332",
  "voting_proposals": [
    {
      "title": "Proposal:Man made=ceremonial gate",
      "url": "https://wiki.openstreetmap.org/wiki/Proposal:Man_made%3Dceremonial_gate",
      "status": "Voting",
      "type": "voting"
    },
    {
      "title": "Proposal:Developer",
      "url": "https://wiki.openstreetmap.org/wiki/Proposal:Developer",
      "status": "Voting",
      "type": "voting"
    }
  ],
  "recent_proposals": [
    {
      "title": "Proposal:Landuse=brownfield",
      "url": "https://wiki.openstreetmap.org/wiki/Proposal:Landuse=brownfield",
      "last_modified": "22 août 2025 à 10:45",
      "modified_by": "MapperUser",
      "type": "recent"
    }
  ]
}
```

### untranslated_french_pages.json

Contains information about French wiki pages that have no English translation:

```json
{
  "last_updated": "2025-08-22T16:30:15.123456",
  "untranslated_pages": [
    {
      "title": "FR:Key:building:colour",
      "key": "Key:building:colour",
      "url": "https://wiki.openstreetmap.org/wiki/FR:Key:building:colour",
      "has_translation": false
    },
    {
      "title": "FR:Tag:amenity=bicycle_repair_station",
      "key": "Tag:amenity=bicycle_repair_station",
      "url": "https://wiki.openstreetmap.org/wiki/FR:Tag:amenity=bicycle_repair_station",
      "has_translation": false
    }
  ]
}
```

### pages_unavailable_in_french.json

Contains information about wiki pages that have no French translation, grouped by source language:

```json
{
  "last_updated": "2025-08-22T17:15:45.123456",
  "grouped_pages": {
    "En": [
      {
        "title": "En:Key:building:colour",
        "url": "https://wiki.openstreetmap.org/wiki/En:Key:building:colour",
        "language_prefix": "En",
        "is_english": true,
        "priority": 1
      }
    ],
    "De": [
      {
        "title": "De:Tag:highway=residential",
        "url": "https://wiki.openstreetmap.org/wiki/De:Tag:highway=residential",
        "language_prefix": "De",
        "is_english": false,
        "priority": 0
      }
    ],
    "Other": [
      {
        "title": "Tag:amenity=bicycle_repair_station",
        "url": "https://wiki.openstreetmap.org/wiki/Tag:amenity=bicycle_repair_station",
        "language_prefix": "Other",
        "is_english": false,
        "priority": 0
      }
    ]
  },
  "all_pages": [
    {
      "title": "En:Key:building:colour",
      "url": "https://wiki.openstreetmap.org/wiki/En:Key:building:colour",
      "language_prefix": "En",
      "is_english": true,
      "priority": 1
    },
    {
      "title": "De:Tag:highway=residential",
      "url": "https://wiki.openstreetmap.org/wiki/De:Tag:highway=residential",
      "language_prefix": "De",
      "is_english": false,
      "priority": 0
    },
    {
      "title": "Tag:amenity=bicycle_repair_station",
      "url": "https://wiki.openstreetmap.org/wiki/Tag:amenity=bicycle_repair_station",
      "language_prefix": "Other",
      "is_english": false,
      "priority": 0
    }
  ]
}
```

### osm_fr_groups.json

Contains information about the OSM-FR working groups and local groups:

```json
{
  "last_updated": "2025-08-22T16:45:30.789012",
  "working_groups": [
    {
      "name": "Groupe Bâtiments",
      "url": "https://wiki.openstreetmap.org/wiki/France/OSM-FR/Groupes_de_travail/B%C3%A2timents",
      "description": "Groupe de travail sur la cartographie des bâtiments",
      "category": "Cartographie",
      "type": "working_group"
    }
  ],
  "local_groups": [
    {
      "name": "Groupe local de Paris",
      "url": "https://wiki.openstreetmap.org/wiki/France/Paris",
      "description": "Groupe local des contributeurs parisiens",
      "type": "local_group"
    }
  ],
  "umap_url": "https://umap.openstreetmap.fr/fr/map/groupes-locaux-openstreetmap_152488"
}
```

## Troubleshooting

### Common issues

1. **Mastodon authentication error**: check that the `MASTODON_ACCESS_TOKEN` environment variable is set correctly and that the token is valid.

2. **Error loading the JSON files**: make sure you run `wiki_compare.py` before the other scripts so that the required data files are generated.

3. **No page to update or translate**: it is possible that every page is already up to date or translated. Try increasing the number of keys analysed by changing the `limit` value in the `fetch_top_keys` function of `wiki_compare.py`.

### Logging

All scripts use the `logging` module to record execution information. By default, logs are written to the console. To redirect them to a file, change the logging configuration in each script, as sketched below.
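
A minimal sketch of that change, replacing the `logging.basicConfig(...)` call each script already makes with one that writes to a file (the file name is only an example):

```python
import logging

# Pass `filename` to basicConfig so log records go to a file instead of the console.
logging.basicConfig(
    filename='wiki_compare.log',  # example log file name
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
logger.info("Logs now go to wiki_compare.log")
```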

## Contributing

Contributions are welcome! Feel free to open an issue or a pull request to improve these scripts.

## License

This project is released under the MIT license. See the LICENSE file for details.
wiki_compare/detect_suspicious_deletions.py (new executable file, 252 lines)
@@ -0,0 +1,252 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup
import json
import logging
import argparse
import os
import re
from datetime import datetime
from urllib.parse import urlparse, parse_qs, urlencode

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# URL for recent changes in OSM Wiki (namespace 202 is for Tag pages)
RECENT_CHANGES_URL = "https://wiki.openstreetmap.org/w/index.php?hidebots=1&hidenewpages=1&hidecategorization=1&hideWikibase=1&hidelog=1&hidenewuserlog=1&namespace=202&limit=250&days=30&enhanced=1&title=Special:RecentChanges&urlversion=2"

# Threshold for suspicious deletions (percentage of total content)
DELETION_THRESHOLD_PERCENT = 5.0

# Base URL for OSM Wiki
WIKI_BASE_URL = "https://wiki.openstreetmap.org"

def fetch_recent_changes():
    """
    Fetch the recent changes page from OSM Wiki
    """
    logger.info(f"Fetching recent changes from {RECENT_CHANGES_URL}")
    try:
        response = requests.get(RECENT_CHANGES_URL)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        logger.error(f"Error fetching recent changes: {e}")
        return None

def fetch_page_content(page_title):
    """
    Fetch the content of a wiki page to count characters
    """
    url = f"{WIKI_BASE_URL}/wiki/{page_title}"
    logger.info(f"Fetching page content from {url}")
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        logger.error(f"Error fetching page content: {e}")
        return None

def count_page_characters(html_content):
    """
    Count the total number of characters in the wiki page content
    """
    if not html_content:
        return 0

    soup = BeautifulSoup(html_content, 'html.parser')

    # Find the main content div
    content_div = soup.select_one('#mw-content-text')
    if not content_div:
        return 0

    # Get all text content
    text_content = content_div.get_text(strip=True)

    # Count characters
    char_count = len(text_content)
    logger.info(f"Page has {char_count} characters")

    return char_count

def generate_diff_url(page_title, oldid):
    """
    Generate URL to view the diff of a specific revision
    """
    return f"{WIKI_BASE_URL}/w/index.php?title={page_title}&diff=prev&oldid={oldid}"

def generate_history_url(page_title):
    """
    Generate URL to view the history of a page
    """
    return f"{WIKI_BASE_URL}/w/index.php?title={page_title}&action=history"

def load_existing_deletions():
    """
    Load existing suspicious deletions from the JSON file
    """
    output_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'suspicious_deletions.json')
    existing_pages = set()

    try:
        if os.path.exists(output_file):
            with open(output_file, 'r', encoding='utf-8') as f:
                data = json.load(f)
                if 'deletions' in data:
                    for deletion in data['deletions']:
                        if 'page_title' in deletion:
                            existing_pages.add(deletion['page_title'])
            logger.info(f"Loaded {len(existing_pages)} existing pages from {output_file}")
        else:
            logger.info(f"No existing file found at {output_file}")
    except Exception as e:
        logger.error(f"Error loading existing deletions: {e}")

    return existing_pages

def parse_suspicious_deletions(html_content):
    """
    Parse the HTML content to find suspicious deletions
    """
    if not html_content:
        return []

    # Load existing pages from the JSON file
    existing_pages = load_existing_deletions()

    soup = BeautifulSoup(html_content, 'html.parser')
    suspicious_deletions = []

    # Find all change list lines
    change_lines = soup.select('.mw-changeslist .mw-changeslist-line')
    logger.info(f"Found {len(change_lines)} change lines to analyze")

    for line in change_lines:
        # Look for deletion indicators
        deletion_indicator = line.select_one('.mw-plusminus-neg')
        if deletion_indicator:
            # Extract the deletion size
            deletion_text = deletion_indicator.text.strip()
            try:
                # Remove any non-numeric characters except minus sign
                deletion_size = int(''.join(c for c in deletion_text if c.isdigit() or c == '-'))

                # Skip if deletion size is not greater than 100 characters
                if abs(deletion_size) <= 100:
                    logger.info(f"Skipping deletion with size {deletion_size} (not > 100 characters)")
                    continue

                # Get the page title and URL
                title_element = line.select_one('.mw-changeslist-title')
                if title_element:
                    page_title = title_element.text.strip()

                    # Skip if page is already in the JSON file
                    if page_title in existing_pages:
                        logger.info(f"Skipping {page_title} (already in JSON file)")
                        continue

                    page_url = title_element.get('href', '')
                    if not page_url.startswith('http'):
                        page_url = f"{WIKI_BASE_URL}{page_url}"

                    # Extract oldid from the URL if available
                    oldid = None
                    if 'oldid=' in page_url:
                        parsed_url = urlparse(page_url)
                        query_params = parse_qs(parsed_url.query)
                        if 'oldid' in query_params:
                            oldid = query_params['oldid'][0]

                    # Fetch the page content to count characters
                    page_html = fetch_page_content(page_title)
                    total_chars = count_page_characters(page_html)

                    # Calculate deletion percentage
                    deletion_percentage = 0
                    if total_chars > 0:
                        deletion_percentage = (abs(deletion_size) / total_chars) * 100

                    # If deletion percentage is significant
                    if deletion_percentage > DELETION_THRESHOLD_PERCENT:
                        # Get the timestamp
                        timestamp_element = line.select_one('.mw-changeslist-date')
                        timestamp = timestamp_element.text.strip() if timestamp_element else ""

                        # Get the user who made the change
                        user_element = line.select_one('.mw-userlink')
                        user = user_element.text.strip() if user_element else "Unknown"

                        # Get the comment if available
                        comment_element = line.select_one('.comment')
                        comment = comment_element.text.strip() if comment_element else ""

                        # Generate diff and history URLs
                        diff_url = generate_diff_url(page_title, oldid) if oldid else ""
                        history_url = generate_history_url(page_title)

                        suspicious_deletions.append({
                            'page_title': page_title,
                            'page_url': page_url,
                            'diff_url': diff_url,
                            'history_url': history_url,
                            'deletion_size': deletion_size,
                            'total_chars': total_chars,
                            'deletion_percentage': round(deletion_percentage, 2),
                            'timestamp': timestamp,
                            'user': user,
                            'comment': comment
                        })
                        logger.info(f"Found suspicious deletion: {page_title} ({deletion_size} chars, {deletion_percentage:.2f}% of content)")
            except ValueError:
                logger.warning(f"Could not parse deletion size from: {deletion_text}")

    return suspicious_deletions

def save_suspicious_deletions(suspicious_deletions):
    """
    Save the suspicious deletions to a JSON file
    """
    output_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'suspicious_deletions.json')

    # Add timestamp to the data
    data = {
        'last_updated': datetime.now().isoformat(),
        'deletions': suspicious_deletions
    }

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

    logger.info(f"Saved {len(suspicious_deletions)} suspicious deletions to {output_file}")
    return output_file

def main():
    parser = argparse.ArgumentParser(description='Detect suspicious deletions in OSM Wiki recent changes')
    parser.add_argument('--dry-run', action='store_true', help='Print results without saving to file')
    args = parser.parse_args()

    html_content = fetch_recent_changes()
    if html_content:
        suspicious_deletions = parse_suspicious_deletions(html_content)

        if args.dry_run:
            logger.info(f"Found {len(suspicious_deletions)} suspicious deletions:")
            for deletion in suspicious_deletions:
                logger.info(f"- {deletion['page_title']}: {deletion['deletion_size']} chars by {deletion['user']}")
        else:
            output_file = save_suspicious_deletions(suspicious_deletions)
            logger.info(f"Results saved to {output_file}")
    else:
        logger.error("Failed to fetch recent changes. Exiting.")

if __name__ == "__main__":
    main()
wiki_compare/fetch_archived_proposals.py (new file, 697 lines)
@@ -0,0 +1,697 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
fetch_archived_proposals.py

This script scrapes archived proposals from the OpenStreetMap wiki and extracts voting information.
It analyzes the voting patterns, counts votes by type (approve, oppose, abstain), and collects
information about the users who voted.

The script saves the data to a JSON file that can be used by the Symfony application.

Usage:
    python fetch_archived_proposals.py [--force] [--limit N]

Options:
    --force     Force refresh of all proposals, even if they have already been processed
    --limit N   Limit processing to N proposals (default: process all proposals)

Output:
    - archived_proposals.json file with voting information
"""

import argparse
import json
import logging
import os
import re
import sys
import time
from datetime import datetime
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup, NavigableString

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)
logger = logging.getLogger(__name__)

# Constants
ARCHIVED_PROPOSALS_URL = "https://wiki.openstreetmap.org/wiki/Category:Archived_proposals"
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
ARCHIVED_PROPOSALS_FILE = os.path.join(SCRIPT_DIR, "archived_proposals.json")
USER_AGENT = "OSM-Commerces/1.0 (https://github.com/yourusername/osm-commerces; your@email.com)"
RATE_LIMIT_DELAY = 1  # seconds between requests to avoid rate limiting

# Vote patterns
VOTE_PATTERNS = {
    'approve': [
        r'I\s+(?:(?:strongly|fully|completely|wholeheartedly)\s+)?(?:approve|support|agree\s+with)\s+this\s+proposal',
        r'I\s+vote\s+(?:to\s+)?(?:approve|support)',
        r'(?:Symbol\s+support\s+vote\.svg|Symbol_support_vote\.svg)',
    ],
    'oppose': [
        r'I\s+(?:(?:strongly|fully|completely|wholeheartedly)\s+)?(?:oppose|disagree\s+with|reject|do\s+not\s+support)\s+this\s+proposal',
        r'I\s+vote\s+(?:to\s+)?(?:oppose|reject|against)',
        r'(?:Symbol\s+oppose\s+vote\.svg|Symbol_oppose_vote\.svg)',
    ],
    'abstain': [
        r'I\s+(?:have\s+comments\s+but\s+)?abstain\s+from\s+voting',
        r'I\s+(?:have\s+comments\s+but\s+)?(?:neither\s+approve\s+nor\s+oppose|am\s+neutral)',
        r'(?:Symbol\s+abstain\s+vote\.svg|Symbol_abstain_vote\.svg)',
    ]
}

def parse_arguments():
    """Parse command line arguments"""
    parser = argparse.ArgumentParser(description='Fetch and analyze archived OSM proposals')
    parser.add_argument('--force', action='store_true', help='Force refresh of all proposals')
    parser.add_argument('--limit', type=int, help='Limit processing to N proposals (default: process all)')
    return parser.parse_args()

def load_existing_data():
    """Load existing archived proposals data if available"""
    if os.path.exists(ARCHIVED_PROPOSALS_FILE):
        try:
            with open(ARCHIVED_PROPOSALS_FILE, 'r', encoding='utf-8') as f:
                data = json.load(f)
            logger.info(f"Loaded {len(data.get('proposals', []))} existing proposals from {ARCHIVED_PROPOSALS_FILE}")
            return data
        except (json.JSONDecodeError, IOError) as e:
            logger.error(f"Error loading existing data: {e}")

    # Return empty structure if file doesn't exist or has errors
    return {
        'last_updated': None,
        'proposals': []
    }

def save_data(data):
    """Save data to JSON file"""
    try:
        # Update last_updated timestamp
        data['last_updated'] = datetime.now().isoformat()

        with open(ARCHIVED_PROPOSALS_FILE, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False)

        logger.info(f"Saved {len(data.get('proposals', []))} proposals to {ARCHIVED_PROPOSALS_FILE}")
    except IOError as e:
        logger.error(f"Error saving data: {e}")
    except Exception as e:
        logger.error(f"Unexpected error saving data: {e}")

def fetch_page(url):
    """Fetch a page from the OSM wiki"""
    headers = {
        'User-Agent': USER_AGENT
    }

    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        logger.error(f"Error fetching {url}: {e}")
        return None

def get_proposal_urls():
    """Get URLs of all archived proposals"""
    logger.info(f"Fetching archived proposals list from {ARCHIVED_PROPOSALS_URL}")

    html = fetch_page(ARCHIVED_PROPOSALS_URL)
    if not html:
        return []

    soup = BeautifulSoup(html, 'html.parser')

    # Find all links in the category pages
    proposal_urls = []

    # Get proposals from the main category page
    category_content = soup.select_one('#mw-pages')
    if category_content:
        for link in category_content.select('a'):
            if link.get('title') and 'Category:' not in link.get('title'):
                proposal_urls.append({
                    'title': link.get('title'),
                    'url': urljoin(ARCHIVED_PROPOSALS_URL, link.get('href'))
                })

    # Check if there are subcategories
    subcategories = soup.select('#mw-subcategories a')
    for subcat in subcategories:
        if 'Category:' in subcat.get('title', ''):
            logger.info(f"Found subcategory: {subcat.get('title')}")
            subcat_url = urljoin(ARCHIVED_PROPOSALS_URL, subcat.get('href'))

            # Fetch the subcategory page
            time.sleep(RATE_LIMIT_DELAY)  # Respect rate limits
            subcat_html = fetch_page(subcat_url)
            if subcat_html:
                subcat_soup = BeautifulSoup(subcat_html, 'html.parser')
                subcat_content = subcat_soup.select_one('#mw-pages')
                if subcat_content:
                    for link in subcat_content.select('a'):
                        if link.get('title') and 'Category:' not in link.get('title'):
                            proposal_urls.append({
                                'title': link.get('title'),
                                'url': urljoin(ARCHIVED_PROPOSALS_URL, link.get('href'))
                            })

    logger.info(f"Found {len(proposal_urls)} archived proposals")
    return proposal_urls

def extract_username(text):
    """Extract username from a signature line"""
    # Common patterns for signatures
    patterns = [
        r'--\s*\[\[User:([^|\]]+)(?:\|[^\]]+)?\]\]',  # --[[User:Username|Username]]
        r'--\s*\[\[User:([^|\]]+)\]\]',  # --[[User:Username]]
        r'--\s*\[\[User talk:([^|\]]+)(?:\|[^\]]+)?\]\]',  # --[[User talk:Username|Username]]
        r'--\s*\[\[User talk:([^|\]]+)\]\]',  # --[[User talk:Username]]
        r'--\s*\[\[Special:Contributions/([^|\]]+)(?:\|[^\]]+)?\]\]',  # --[[Special:Contributions/Username|Username]]
        r'--\s*\[\[Special:Contributions/([^|\]]+)\]\]',  # --[[Special:Contributions/Username]]
    ]

    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1).strip()

    # If no match found with the patterns, try to find any username-like string
    match = re.search(r'--\s*([A-Za-z0-9_-]+)', text)
    if match:
        return match.group(1).strip()

    return None

def extract_date(text):
    """Extract date from a signature line"""
    # Look for common date formats in signatures
    date_patterns = [
        r'(\d{1,2}:\d{2}, \d{1,2} [A-Za-z]+ \d{4})',  # 15:30, 25 December 2023
        r'(\d{1,2} [A-Za-z]+ \d{4} \d{1,2}:\d{2})',  # 25 December 2023 15:30
        r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})',  # 2023-12-25T15:30:00
    ]

    for pattern in date_patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)

    return None

def determine_vote_type(text):
    """Determine the type of vote from the text"""
    text_lower = text.lower()

    for vote_type, patterns in VOTE_PATTERNS.items():
        for pattern in patterns:
            if re.search(pattern, text_lower, re.IGNORECASE):
                return vote_type

    return None

def extract_votes(html):
    """Extract voting information from proposal HTML"""
    soup = BeautifulSoup(html, 'html.parser')

    # Find the voting section
    voting_section = None
    for heading in soup.find_all(['h2', 'h3']):
        heading_text = heading.get_text().lower()
        if 'voting' in heading_text or 'votes' in heading_text or 'poll' in heading_text:
            voting_section = heading
            break

    if not voting_section:
        logger.warning("No voting section found")
        return {
            'approve': {'count': 0, 'users': []},
            'oppose': {'count': 0, 'users': []},
            'abstain': {'count': 0, 'users': []}
        }

    # Get the content after the voting section heading
    votes_content = []
    current = voting_section.next_sibling

    # Collect all elements until the next heading or the end of the document
    while current and not current.name in ['h2', 'h3']:
        if current.name:  # Skip NavigableString objects
            votes_content.append(current)
        current = current.next_sibling

    # Process vote lists
    votes = {
        'approve': {'count': 0, 'users': []},
        'oppose': {'count': 0, 'users': []},
        'abstain': {'count': 0, 'users': []}
    }

    # For tracking vote dates to calculate duration
    all_vote_dates = []

    # Look for lists of votes
    for element in votes_content:
        if element.name == 'ul':
            for li in element.find_all('li'):
                vote_text = li.get_text()
                vote_type = determine_vote_type(vote_text)

                if vote_type:
                    username = extract_username(vote_text)
                    date = extract_date(vote_text)

                    # Extract comment by removing vote declaration and signature
                    comment = vote_text

                    # Remove vote declaration patterns
                    for pattern in VOTE_PATTERNS[vote_type]:
                        comment = re.sub(pattern, '', comment, flags=re.IGNORECASE)

                    # Remove signature
                    signature_patterns = [
                        r'--\s*\[\[User:[^]]+\]\].*$',
                        r'--\s*\[\[User talk:[^]]+\]\].*$',
                        r'--\s*\[\[Special:Contributions/[^]]+\]\].*$',
                        r'--\s*[A-Za-z0-9_-]+.*$'
                    ]
                    for pattern in signature_patterns:
                        comment = re.sub(pattern, '', comment, flags=re.IGNORECASE)

                    # Clean up the comment
                    comment = comment.strip()

                    if username:
                        votes[vote_type]['count'] += 1
                        votes[vote_type]['users'].append({
                            'username': username,
                            'date': date,
                            'comment': comment
                        })

                        # Add date to list for duration calculation if it's valid
                        if date:
                            try:
                                # Try to parse the date in different formats
                                parsed_date = None
                                for date_format in [
                                    '%H:%M, %d %B %Y',   # 15:30, 25 December 2023
                                    '%d %B %Y %H:%M',    # 25 December 2023 15:30
                                    '%Y-%m-%dT%H:%M:%S'  # 2023-12-25T15:30:00
                                ]:
                                    try:
                                        parsed_date = datetime.strptime(date, date_format)
                                        break
                                    except ValueError:
                                        continue

                                if parsed_date:
                                    all_vote_dates.append(parsed_date)
                            except Exception as e:
                                logger.warning(f"Could not parse date '{date}': {e}")

    # Calculate vote duration if we have at least two dates
    if len(all_vote_dates) >= 2:
        all_vote_dates.sort()
        first_vote = all_vote_dates[0]
        last_vote = all_vote_dates[-1]
        vote_duration_days = (last_vote - first_vote).days
        votes['first_vote'] = first_vote.strftime('%Y-%m-%d')
        votes['last_vote'] = last_vote.strftime('%Y-%m-%d')
        votes['duration_days'] = vote_duration_days

    return votes

def extract_proposal_metadata(html, url, original_title=None):
    """Extract metadata about the proposal"""
    soup = BeautifulSoup(html, 'html.parser')

    # Get title
    title_element = soup.select_one('#firstHeading')
    extracted_title = title_element.get_text() if title_element else "Unknown Title"

    # Debug logging
    logger.debug(f"Original title: '{original_title}', Extracted title: '{extracted_title}'")

    # Check if the extracted title is a username or user page
    # This covers both "User:Username" and other user-related pages
    if (extracted_title.startswith("User:") or
            "User:" in extracted_title or
            "User talk:" in extracted_title) and original_title:
        logger.info(f"Extracted title '{extracted_title}' appears to be a user page. Using original title '{original_title}' instead.")
        title = original_title
    else:
        title = extracted_title

    # Get last modified date
    last_modified = None
    footer_info = soup.select_one('#footer-info-lastmod')
    if footer_info:
        last_modified_text = footer_info.get_text()
        match = re.search(r'(\d{1,2} [A-Za-z]+ \d{4})', last_modified_text)
        if match:
            last_modified = match.group(1)

    # Get content element for further processing
    content = soup.select_one('#mw-content-text')

    # Get proposer from the page
    proposer = None

    # Get proposal status from the page
    status = None

    # Look for table rows to find proposer and status
    if content:
        # Look for table rows
        for row in content.select('tr'):
            # Check if the row has at least two cells (th and td)
            cells = row.select('th, td')
            if len(cells) >= 2:
                # Get the header text from the first cell
                header_text = cells[0].get_text().strip().lower()

                # Check for "Proposed by:" to find proposer
                if "proposed by" in header_text:
                    # Look for user link in the next cell
                    user_link = cells[1].select_one('a[href*="/wiki/User:"]')
                    if user_link:
                        # Extract username from the link
                        href = user_link.get('href', '')
                        # Read the link's title attribute into its own variable so the
                        # proposal title computed earlier is not overwritten.
                        link_title = user_link.get('title', '')

                        # Try to get username from title attribute first
                        if link_title and link_title.startswith('User:'):
                            proposer = link_title[5:]  # Remove 'User:' prefix
                        # Otherwise try to extract from href
                        elif href:
                            href_match = re.search(r'/wiki/User:([^/]+)', href)
                            if href_match:
                                proposer = href_match.group(1)

                        # If still no proposer, use the link text
                        if not proposer and user_link.get_text():
                            proposer = user_link.get_text().strip()

                        logger.info(f"Found proposer in table: {proposer}")

                # Check for "Proposal status:" to find status
                elif "proposal status" in header_text:
                    # Get the status from the next cell
                    status_cell = cells[1]

                    # First try to find a link with a category title containing status
                    status_link = status_cell.select_one('a[title*="Category:Proposals with"]')
                    if status_link:
                        # Extract status from the title attribute
                        status_match = re.search(r'Category:Proposals with "([^"]+)" status', status_link.get('title', ''))
                        if status_match:
                            status = status_match.group(1)
                            logger.info(f"Found status in table link: {status}")

                    # If no status found in link, try to get text content
                    if not status:
                        status_text = status_cell.get_text().strip()
                        # Try to match one of the known statuses
                        known_statuses = [
                            "Draft", "Proposed", "Voting", "Post-vote", "Approved",
                            "Rejected", "Abandoned", "Canceled", "Obsoleted",
                            "Inactive", "Undefined"
                        ]
                        for known_status in known_statuses:
                            if known_status.lower() in status_text.lower():
                                status = known_status
                                logger.info(f"Found status in table text: {status}")
                                break

    # If no proposer found in table, try the first paragraph method
    if not proposer:
        first_paragraph = soup.select_one('#mw-content-text p')
        if first_paragraph:
            proposer_match = re.search(r'(?:proposed|created|authored)\s+by\s+\[\[User:([^|\]]+)', first_paragraph.get_text())
            if proposer_match:
                proposer = proposer_match.group(1)
                logger.info(f"Found proposer in paragraph: {proposer}")

    # Count sections, links, and words
    section_count = len(soup.select('#mw-content-text h2, #mw-content-text h3, #mw-content-text h4')) if content else 0

    # Count links excluding user/talk pages (voting signatures)
    links = []
    if content:
        for link in content.select('a'):
            href = link.get('href', '')
            if href and not re.search(r'User:|User_talk:|Special:Contributions', href):
                links.append(href)
    link_count = len(links)

    # Approximate word count
    word_count = 0
    if content:
        # Get text content excluding navigation elements
        for nav in content.select('.navbox, .ambox, .tmbox, .mw-editsection'):
            nav.decompose()

        # Also exclude the voting section to count only the proposal content
        voting_section = None
        for heading in content.find_all(['h2', 'h3']):
            heading_text = heading.get_text().lower()
            if 'voting' in heading_text or 'votes' in heading_text or 'poll' in heading_text:
                voting_section = heading
                break

        if voting_section:
            # Remove the voting section and everything after it
            current = voting_section
            while current:
                next_sibling = current.next_sibling
                # Only call decompose() if current is not a NavigableString
                # NavigableString objects don't have a decompose() method
                if not isinstance(current, NavigableString):
                    current.decompose()
                current = next_sibling

        # Count words in the remaining content
        text = content.get_text()
        word_count = len(re.findall(r'\b\w+\b', text))

    return {
        'title': title,
        'url': url,
        'last_modified': last_modified,
        'proposer': proposer,
        'status': status,
        'section_count': section_count,
        'link_count': link_count,
        'word_count': word_count
    }

def process_proposal(proposal, force=False):
    """Process a single proposal and extract voting information"""
    url = proposal['url']
    title = proposal['title']

    logger.info(f"Processing proposal: {title}")

    # Fetch the proposal page
    html = fetch_page(url)
    if not html:
        return None

    # Extract metadata
    metadata = extract_proposal_metadata(html, url, original_title=title)

    # Extract votes
    votes = extract_votes(html)

    # Combine metadata and votes
    result = {**metadata, 'votes': votes}

    # Calculate total votes and percentages
    total_votes = votes['approve']['count'] + votes['oppose']['count'] + votes['abstain']['count']

    if total_votes > 0:
        result['total_votes'] = total_votes
        result['approve_percentage'] = round((votes['approve']['count'] / total_votes) * 100, 1)
        result['oppose_percentage'] = round((votes['oppose']['count'] / total_votes) * 100, 1)
        result['abstain_percentage'] = round((votes['abstain']['count'] / total_votes) * 100, 1)
    else:
        result['total_votes'] = 0
        result['approve_percentage'] = 0
        result['oppose_percentage'] = 0
        result['abstain_percentage'] = 0

    return result

def main():
    """Main function to execute the script"""
    args = parse_arguments()
    force = args.force
    limit = args.limit

    logger.info("Starting fetch_archived_proposals.py")
    if limit:
        logger.info(f"Processing limited to {limit} proposals")

    # Load existing data
    data = load_existing_data()

    # Get list of proposal URLs
    proposal_urls = get_proposal_urls()

    # Apply limit if specified
    if limit and limit < len(proposal_urls):
        logger.info(f"Limiting processing from {len(proposal_urls)} to {limit} proposals")
        proposal_urls = proposal_urls[:limit]

    # Create a map of existing proposals by URL for quick lookup
    existing_proposals = {p['url']: p for p in data.get('proposals', [])}

    # Process each proposal
    new_proposals = []
    processed_count = 0
    for proposal in proposal_urls:
        url = proposal['url']
        original_title = proposal['title']

        # Skip if already processed and not forcing refresh
        if url in existing_proposals and not force:
            logger.info(f"Skipping already processed proposal: {original_title}")
            new_proposals.append(existing_proposals[url])
            continue

        # Process the proposal
        time.sleep(RATE_LIMIT_DELAY)  # Respect rate limits
        processed = process_proposal(proposal, force)

        if processed:
            # Ensure the title is preserved from the original proposal
            if processed.get('title') != original_title:
                # Check if the title contains "User:" - if it does, we've already handled it in extract_proposal_metadata
                # and don't need to log a warning
                if "User:" in processed.get('title', ''):
                    logger.debug(f"Title contains 'User:' - already handled in extract_proposal_metadata")
                else:
                    logger.warning(f"Title changed during processing from '{original_title}' to '{processed.get('title')}'. Restoring original title.")
                    processed['title'] = original_title

            new_proposals.append(processed)
            processed_count += 1

            # Check if we've reached the limit
            if limit and processed_count >= limit:
                logger.info(f"Reached limit of {limit} processed proposals")
                break

    # Update the data
    data['proposals'] = new_proposals

    # Calculate global statistics
    total_proposals = len(new_proposals)
    total_votes = sum(p.get('total_votes', 0) for p in new_proposals)

    # Calculate votes per proposal statistics, excluding proposals with 0 votes
    proposals_with_votes = [p for p in new_proposals if p.get('total_votes', 0) > 0]
    num_proposals_with_votes = len(proposals_with_votes)

    if num_proposals_with_votes > 0:
        # Calculate average votes per proposal (excluding proposals with 0 votes)
        votes_per_proposal = [p.get('total_votes', 0) for p in proposals_with_votes]
        avg_votes_per_proposal = round(sum(votes_per_proposal) / num_proposals_with_votes, 1)
|
||||
|
||||
# Calculate median votes per proposal
|
||||
votes_per_proposal.sort()
|
||||
if num_proposals_with_votes % 2 == 0:
|
||||
# Even number of proposals, average the middle two
|
||||
median_votes_per_proposal = round((votes_per_proposal[num_proposals_with_votes // 2 - 1] +
|
||||
votes_per_proposal[num_proposals_with_votes // 2]) / 2, 1)
|
||||
else:
|
||||
# Odd number of proposals, take the middle one
|
||||
median_votes_per_proposal = votes_per_proposal[num_proposals_with_votes // 2]
|
||||
|
||||
# Calculate standard deviation of votes per proposal
|
||||
mean = sum(votes_per_proposal) / num_proposals_with_votes
|
||||
variance = sum((x - mean) ** 2 for x in votes_per_proposal) / num_proposals_with_votes
|
||||
std_dev_votes_per_proposal = round((variance ** 0.5), 1)
|
||||
else:
|
||||
avg_votes_per_proposal = 0
|
||||
median_votes_per_proposal = 0
|
||||
std_dev_votes_per_proposal = 0
|
||||
|
||||
# Count unique voters
|
||||
all_voters = set()
|
||||
for p in new_proposals:
|
||||
for vote_type in ['approve', 'oppose', 'abstain']:
|
||||
for user in p.get('votes', {}).get(vote_type, {}).get('users', []):
|
||||
if 'username' in user:
|
||||
all_voters.add(user['username'])
|
||||
|
||||
# Find most active voters
|
||||
voter_counts = {}
|
||||
for p in new_proposals:
|
||||
for vote_type in ['approve', 'oppose', 'abstain']:
|
||||
for user in p.get('votes', {}).get(vote_type, {}).get('users', []):
|
||||
if 'username' in user:
|
||||
username = user['username']
|
||||
if username not in voter_counts:
|
||||
voter_counts[username] = {'total': 0, 'approve': 0, 'oppose': 0, 'abstain': 0}
|
||||
voter_counts[username]['total'] += 1
|
||||
voter_counts[username][vote_type] += 1
|
||||
|
||||
# Sort voters by total votes
|
||||
top_voters = sorted(
|
||||
[{'username': k, **v} for k, v in voter_counts.items()],
|
||||
key=lambda x: x['total'],
|
||||
reverse=True
|
||||
)[:100] # Top 100 voters
|
||||
|
||||
# Count proposals by status
|
||||
status_counts = {}
|
||||
for p in new_proposals:
|
||||
status = p.get('status')
|
||||
if status:
|
||||
status_counts[status] = status_counts.get(status, 0) + 1
|
||||
else:
|
||||
status_counts['Unknown'] = status_counts.get('Unknown', 0) + 1
|
||||
|
||||
# Ensure status_counts is never empty
|
||||
if not status_counts:
|
||||
status_counts['No Status'] = 0
|
||||
|
||||
# Calculate average vote duration
|
||||
proposals_with_duration = [p for p in new_proposals if 'votes' in p and 'duration_days' in p['votes']]
|
||||
avg_vote_duration = 0
|
||||
if proposals_with_duration:
|
||||
total_duration = sum(p['votes']['duration_days'] for p in proposals_with_duration)
|
||||
avg_vote_duration = round(total_duration / len(proposals_with_duration), 1)
|
||||
|
||||
# Add statistics to the data
|
||||
data['statistics'] = {
|
||||
'total_proposals': total_proposals,
|
||||
'total_votes': total_votes,
|
||||
'avg_votes_per_proposal': avg_votes_per_proposal,
|
||||
'median_votes_per_proposal': median_votes_per_proposal,
|
||||
'std_dev_votes_per_proposal': std_dev_votes_per_proposal,
|
||||
'avg_vote_duration_days': avg_vote_duration,
|
||||
'unique_voters': len(all_voters),
|
||||
'top_voters': top_voters,
|
||||
'status_distribution': status_counts
|
||||
}
|
||||
|
||||
# Save the data
|
||||
save_data(data)
|
||||
|
||||
logger.info("Script completed successfully")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
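# Quick sanity check (not part of the script): the hand-rolled median / standard
# deviation computed in main() above can be reproduced with Python's statistics
# module. The list below is hypothetical sample data standing in for votes_per_proposal.
import statistics

votes_per_proposal = [12, 25, 7, 31, 18, 6]
print(round(statistics.median(votes_per_proposal), 1))   # averages the middle two for even-length lists, like the manual logic
print(round(statistics.pstdev(votes_per_proposal), 1))   # population std dev, matching the variance / N formula above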
517 wiki_compare/fetch_osm_fr_groups.py (Executable file)
@@ -0,0 +1,517 @@
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
"""
|
||||
fetch_osm_fr_groups.py
|
||||
|
||||
This script fetches information about OSM-FR local groups from two sources:
|
||||
1. The OpenStreetMap wiki page for France/OSM-FR (specifically the #Pages_des_groupes_locaux section)
|
||||
2. The Framacalc spreadsheet at https://framacalc.org/osm-groupes-locaux
|
||||
|
||||
It then verifies that each group from the Framacalc has a corresponding wiki page.
|
||||
|
||||
Usage:
|
||||
python fetch_osm_fr_groups.py [--dry-run] [--force]
|
||||
|
||||
Options:
|
||||
--dry-run Run the script without saving the results to a file
|
||||
--force Force update even if the cache is still fresh (less than 1 hour old)
|
||||
|
||||
Output:
|
||||
- osm_fr_groups.json: JSON file with information about OSM-FR local groups
|
||||
- Log messages about the scraping process and results
|
||||
"""
|
||||
|
||||
import json
|
||||
import argparse
|
||||
import logging
|
||||
import os
|
||||
import csv
|
||||
import io
|
||||
from datetime import datetime, timedelta
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s',
|
||||
datefmt='%Y-%m-%d %H:%M:%S'
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Constants
|
||||
OUTPUT_FILE = "osm_fr_groups.json"
|
||||
BASE_URL = "https://wiki.openstreetmap.org/wiki/France/OSM-FR"
|
||||
WIKI_BASE_URL = "https://wiki.openstreetmap.org"
|
||||
FRAMACALC_URL = "https://framacalc.org/osm-groupes-locaux/export/csv"
|
||||
WIKI_GROUPS_URL = "https://wiki.openstreetmap.org/wiki/France/OSM-FR#Groupes_locaux"
|
||||
CACHE_DURATION = timedelta(hours=1) # Cache duration of 1 hour
|
||||
|
||||
def is_cache_fresh():
|
||||
"""
|
||||
Check if the cache file exists and is less than CACHE_DURATION old
|
||||
|
||||
Returns:
|
||||
bool: True if cache is fresh, False otherwise
|
||||
"""
|
||||
if not os.path.exists(OUTPUT_FILE):
|
||||
return False
|
||||
|
||||
try:
|
||||
with open(OUTPUT_FILE, 'r', encoding='utf-8') as f:
|
||||
data = json.load(f)
|
||||
last_updated = datetime.fromisoformat(data.get('last_updated', '2000-01-01T00:00:00'))
|
||||
now = datetime.now()
|
||||
return (now - last_updated) < CACHE_DURATION
|
||||
except (IOError, json.JSONDecodeError, ValueError) as e:
|
||||
logger.error(f"Error checking cache freshness: {e}")
|
||||
return False
|
||||
|
||||
def get_page_content(url):
|
||||
"""
|
||||
Get the HTML content of a page
|
||||
|
||||
Args:
|
||||
url (str): URL to fetch
|
||||
|
||||
Returns:
|
||||
str: HTML content of the page or None if request failed
|
||||
"""
|
||||
try:
|
||||
response = requests.get(url)
|
||||
response.raise_for_status()
|
||||
return response.text
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error fetching {url}: {e}")
|
||||
return None
|
||||
|
||||
def extract_working_groups(html_content):
|
||||
"""
|
||||
Extract working groups from the wiki page HTML
|
||||
|
||||
Args:
|
||||
html_content (str): HTML content of the wiki page
|
||||
|
||||
Returns:
|
||||
list: List of working group dictionaries
|
||||
"""
|
||||
if not html_content:
|
||||
return []
|
||||
|
||||
soup = BeautifulSoup(html_content, 'html.parser')
|
||||
working_groups = []
|
||||
|
||||
# Find the working groups section
|
||||
working_groups_section = None
|
||||
for heading in soup.find_all(['h2', 'h3']):
|
||||
if heading.get_text().strip() == 'Groupes de travail' or 'Groupes_de_travail' in heading.get_text():
|
||||
working_groups_section = heading
|
||||
break
|
||||
|
||||
if not working_groups_section:
|
||||
logger.warning("Could not find working groups section")
|
||||
# Return an empty list but with a default category
|
||||
return []
|
||||
|
||||
# Get the content following the heading until the next heading
|
||||
current = working_groups_section.next_sibling
|
||||
while current and not current.name in ['h2', 'h3']:
|
||||
if current.name == 'ul':
|
||||
# Process list items
|
||||
for li in current.find_all('li', recursive=False):
|
||||
link = li.find('a')
|
||||
if link:
|
||||
name = link.get_text().strip()
|
||||
url = WIKI_BASE_URL + link.get('href') if link.get('href').startswith('/') else link.get('href')
|
||||
|
||||
# Extract description (text after the link)
|
||||
description = ""
|
||||
next_node = link.next_sibling
|
||||
while next_node:
|
||||
if isinstance(next_node, str):
|
||||
description += next_node.strip()
|
||||
next_node = next_node.next_sibling if hasattr(next_node, 'next_sibling') else None
|
||||
|
||||
description = description.strip(' :-,')
|
||||
|
||||
working_groups.append({
|
||||
"name": name,
|
||||
"url": url,
|
||||
"description": description,
|
||||
"category": "Général",
|
||||
"type": "working_group"
|
||||
})
|
||||
current = current.next_sibling
|
||||
|
||||
logger.info(f"Found {len(working_groups)} working groups")
|
||||
return working_groups
|
||||
|
||||
def extract_local_groups_from_wiki(html_content):
|
||||
"""
|
||||
Extract local groups from the wiki page HTML
|
||||
|
||||
Args:
|
||||
html_content (str): HTML content of the wiki page
|
||||
|
||||
Returns:
|
||||
list: List of local group dictionaries
|
||||
"""
|
||||
if not html_content:
|
||||
return []
|
||||
|
||||
soup = BeautifulSoup(html_content, 'html.parser')
|
||||
local_groups = []
|
||||
|
||||
# Find the local groups section
|
||||
local_groups_section = None
|
||||
for heading in soup.find_all(['h2', 'h3']):
|
||||
if heading.get_text().strip() == 'Groupes locaux' or 'Pages des groupes locaux' in heading.get_text():
|
||||
local_groups_section = heading
|
||||
break
|
||||
|
||||
if not local_groups_section:
|
||||
logger.warning("Could not find local groups section")
|
||||
return []
|
||||
|
||||
# Get the content following the heading until the next heading
|
||||
current = local_groups_section.next_sibling
|
||||
while current and not current.name in ['h2', 'h3']:
|
||||
if current.name == 'ul':
|
||||
# Process list items
|
||||
for li in current.find_all('li', recursive=False):
|
||||
link = li.find('a')
|
||||
if link:
|
||||
name = link.get_text().strip()
|
||||
url = WIKI_BASE_URL + link.get('href') if link.get('href').startswith('/') else link.get('href')
|
||||
|
||||
# Extract description (text after the link)
|
||||
description = ""
|
||||
next_node = link.next_sibling
|
||||
while next_node:
|
||||
if isinstance(next_node, str):
|
||||
description += next_node.strip()
|
||||
next_node = next_node.next_sibling if hasattr(next_node, 'next_sibling') else None
|
||||
|
||||
description = description.strip(' :-,')
|
||||
|
||||
local_groups.append({
|
||||
"name": name,
|
||||
"url": url,
|
||||
"description": description,
|
||||
"type": "local_group",
|
||||
"source": "wiki"
|
||||
})
|
||||
current = current.next_sibling
|
||||
|
||||
logger.info(f"Found {len(local_groups)} local groups from wiki")
|
||||
return local_groups
|
||||
|
||||
def fetch_framacalc_data():
|
||||
"""
|
||||
Fetch local groups data from Framacalc
|
||||
|
||||
Returns:
|
||||
list: List of local group dictionaries from Framacalc
|
||||
"""
|
||||
try:
|
||||
response = requests.get(FRAMACALC_URL)
|
||||
response.raise_for_status()
|
||||
|
||||
# Parse CSV data
|
||||
csv_data = csv.reader(io.StringIO(response.text))
|
||||
rows = list(csv_data)
|
||||
|
||||
# Check if we have data
|
||||
if len(rows) < 2:
|
||||
logger.warning("No data found in Framacalc CSV")
|
||||
return []
|
||||
|
||||
# Extract headers (first row)
|
||||
headers = rows[0]
|
||||
|
||||
# Find the indices of important columns
|
||||
name_idx = -1
|
||||
contact_idx = -1
|
||||
website_idx = -1
|
||||
|
||||
for i, header in enumerate(headers):
|
||||
header_lower = header.lower()
|
||||
if 'nom' in header_lower or 'groupe' in header_lower:
|
||||
name_idx = i
|
||||
elif 'contact' in header_lower or 'email' in header_lower:
|
||||
contact_idx = i
|
||||
elif 'site' in header_lower or 'web' in header_lower:
|
||||
website_idx = i
|
||||
|
||||
if name_idx == -1:
|
||||
logger.warning("Could not find name column in Framacalc CSV")
|
||||
return []
|
||||
|
||||
# Process data rows
|
||||
local_groups = []
|
||||
for row in rows[1:]: # Skip header row
|
||||
if len(row) <= name_idx or not row[name_idx].strip():
|
||||
continue # Skip empty rows
|
||||
|
||||
name = row[name_idx].strip()
|
||||
contact = row[contact_idx].strip() if contact_idx != -1 and contact_idx < len(row) else ""
|
||||
website = row[website_idx].strip() if website_idx != -1 and website_idx < len(row) else ""
|
||||
|
||||
local_groups.append({
|
||||
"name": name,
|
||||
"contact": contact,
|
||||
"website": website,
|
||||
"type": "local_group",
|
||||
"source": "framacalc",
|
||||
"has_wiki_page": False, # Will be updated later
|
||||
"wiki_url": "" # Will be updated later
|
||||
})
|
||||
|
||||
logger.info(f"Found {len(local_groups)} local groups from Framacalc")
|
||||
return local_groups
|
||||
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error fetching Framacalc data: {e}")
|
||||
return []
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing Framacalc data: {e}")
|
||||
return []
|
||||
|
||||
def extract_wiki_group_links():
|
||||
"""
|
||||
Extract links to local group wiki pages from the OSM-FR wiki page
|
||||
|
||||
Returns:
|
||||
dict: Dictionary mapping group names to wiki URLs
|
||||
"""
|
||||
try:
|
||||
# Get the wiki page content
|
||||
response = requests.get(WIKI_GROUPS_URL)
|
||||
response.raise_for_status()
|
||||
|
||||
soup = BeautifulSoup(response.text, 'html.parser')
|
||||
wiki_links = {}
|
||||
|
||||
# Find the "Pages des groupes locaux" section
|
||||
pages_section = None
|
||||
for heading in soup.find_all(['h2', 'h3', 'h4']):
|
||||
if 'Pages des groupes locaux' in heading.get_text():
|
||||
pages_section = heading
|
||||
break
|
||||
|
||||
if not pages_section:
|
||||
logger.warning("Could not find 'Pages des groupes locaux' section")
|
||||
return {}
|
||||
|
||||
# Get the content following the heading until the next heading
|
||||
current = pages_section.next_sibling
|
||||
while current and not current.name in ['h2', 'h3', 'h4']:
|
||||
if current.name == 'ul':
|
||||
# Process list items
|
||||
for li in current.find_all('li', recursive=False):
|
||||
text = li.get_text().strip()
|
||||
link = li.find('a')
|
||||
|
||||
if link and text:
|
||||
# Extract group name (before the comma)
|
||||
parts = text.split(',', 1)
|
||||
group_name = parts[0].strip()
|
||||
|
||||
url = WIKI_BASE_URL + link.get('href') if link.get('href').startswith('/') else link.get('href')
|
||||
wiki_links[group_name] = url
|
||||
|
||||
current = current.next_sibling
|
||||
|
||||
logger.info(f"Found {len(wiki_links)} wiki links for local groups")
|
||||
return wiki_links
|
||||
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error fetching wiki group links: {e}")
|
||||
return {}
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing wiki group links: {e}")
|
||||
return {}
|
||||
|
||||
def verify_framacalc_groups_have_wiki(framacalc_groups, wiki_links):
|
||||
"""
|
||||
Verify that each group from Framacalc has a corresponding wiki page
|
||||
|
||||
Args:
|
||||
framacalc_groups (list): List of local group dictionaries from Framacalc
|
||||
wiki_links (dict): Dictionary mapping group names to wiki URLs
|
||||
|
||||
Returns:
|
||||
list: Updated list of local group dictionaries with wiki verification
|
||||
"""
|
||||
for group in framacalc_groups:
|
||||
group_name = group['name']
|
||||
|
||||
# Try to find a matching wiki link
|
||||
found = False
|
||||
for wiki_name, wiki_url in wiki_links.items():
|
||||
# Check if the group name is similar to the wiki name
|
||||
if group_name.lower() in wiki_name.lower() or wiki_name.lower() in group_name.lower():
|
||||
group['has_wiki_page'] = True
|
||||
group['wiki_url'] = wiki_url
|
||||
found = True
|
||||
break
|
||||
|
||||
if not found:
|
||||
group['has_wiki_page'] = False
|
||||
group['wiki_url'] = ""
|
||||
|
||||
return framacalc_groups
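# Hypothetical refinement (not used by this script): the substring comparison in
# verify_framacalc_groups_have_wiki() can miss close variants (accents, "OSM"
# prefixes, etc.). A fuzzier check could be sketched with difflib, for example:
from difflib import SequenceMatcher

def names_match(a, b, threshold=0.8):
    """Return True if two group names are similar enough (illustrative helper only)."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold or a.lower() in b.lower() or b.lower() in a.lower()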
|
||||
|
||||
def extract_umap_url(html_content):
|
||||
"""
|
||||
Extract the uMap URL for OSM-FR local groups
|
||||
|
||||
Args:
|
||||
html_content (str): HTML content of the wiki page
|
||||
|
||||
Returns:
|
||||
str: uMap URL or None if not found
|
||||
"""
|
||||
if not html_content:
|
||||
return None
|
||||
|
||||
soup = BeautifulSoup(html_content, 'html.parser')
|
||||
|
||||
# Look for links to umap.openstreetmap.fr
|
||||
for link in soup.find_all('a'):
|
||||
href = link.get('href', '')
|
||||
if 'umap.openstreetmap.fr' in href and 'groupes-locaux' in href:
|
||||
return href
|
||||
|
||||
return None
|
||||
|
||||
def save_results(wiki_local_groups, framacalc_groups, working_groups, umap_url, wiki_links, dry_run=False):
|
||||
"""
|
||||
Save the results to a JSON file
|
||||
|
||||
Args:
|
||||
wiki_local_groups (list): List of local group dictionaries from wiki
|
||||
framacalc_groups (list): List of local group dictionaries from Framacalc
|
||||
working_groups (list): List of working group dictionaries
|
||||
umap_url (str): URL to the uMap for local groups
|
||||
wiki_links (dict): Dictionary mapping group names to wiki URLs
|
||||
dry_run (bool): If True, don't actually save to file
|
||||
|
||||
Returns:
|
||||
bool: True if saving was successful or dry run, False otherwise
|
||||
"""
|
||||
if dry_run:
|
||||
logger.info("DRY RUN: Would have saved results to file")
|
||||
logger.info(f"Wiki local groups: {len(wiki_local_groups)}")
|
||||
for group in wiki_local_groups[:5]: # Show only first 5 for brevity
|
||||
logger.info(f" - {group['name']}: {group['url']}")
|
||||
|
||||
logger.info(f"Framacalc groups: {len(framacalc_groups)}")
|
||||
for group in framacalc_groups[:5]: # Show only first 5 for brevity
|
||||
wiki_status = "Has wiki page" if group.get('has_wiki_page') else "No wiki page"
|
||||
logger.info(f" - {group['name']}: {wiki_status}")
|
||||
|
||||
logger.info(f"Working groups: {len(working_groups)}")
|
||||
for group in working_groups[:5]: # Show only first 5 for brevity
|
||||
logger.info(f" - {group['name']}: {group['url']}")
|
||||
|
||||
if umap_url:
|
||||
logger.info(f"uMap URL: {umap_url}")
|
||||
|
||||
logger.info(f"Wiki links: {len(wiki_links)}")
|
||||
return True
|
||||
|
||||
# Combine all local groups
|
||||
all_local_groups = wiki_local_groups + framacalc_groups
|
||||
|
||||
# Prepare the data structure
|
||||
data = {
|
||||
"last_updated": datetime.now().isoformat(),
|
||||
"local_groups": all_local_groups,
|
||||
"working_groups": working_groups,
|
||||
"umap_url": umap_url,
|
||||
"wiki_links": wiki_links
|
||||
}
|
||||
|
||||
try:
|
||||
with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, indent=2, ensure_ascii=False)
|
||||
logger.info(f"Successfully saved {len(all_local_groups)} local groups and {len(working_groups)} working groups to {OUTPUT_FILE}")
|
||||
return True
|
||||
except IOError as e:
|
||||
logger.error(f"Error saving results to {OUTPUT_FILE}: {e}")
|
||||
return False
|
||||
|
||||
def main():
|
||||
"""Main function to execute the script"""
|
||||
parser = argparse.ArgumentParser(description="Fetch OSM-FR local groups from wiki and Framacalc")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Run without saving results to file")
|
||||
parser.add_argument("--force", action="store_true", help="Force update even if cache is fresh")
|
||||
args = parser.parse_args()
|
||||
|
||||
logger.info("Starting fetch_osm_fr_groups.py")
|
||||
|
||||
# Check if cache is fresh
|
||||
if is_cache_fresh() and not args.force:
|
||||
logger.info(f"Cache is still fresh (less than {CACHE_DURATION.total_seconds()/3600} hours old)")
|
||||
logger.info(f"Use --force to update anyway")
|
||||
return
|
||||
|
||||
# Get the wiki page content
|
||||
html_content = get_page_content(BASE_URL)
|
||||
|
||||
if not html_content:
|
||||
logger.error("Failed to get wiki page content")
|
||||
return
|
||||
|
||||
# Extract local groups from wiki
|
||||
wiki_local_groups = extract_local_groups_from_wiki(html_content)
|
||||
|
||||
if not wiki_local_groups:
|
||||
logger.warning("No local groups found in wiki")
|
||||
|
||||
# Extract working groups
|
||||
working_groups = extract_working_groups(html_content)
|
||||
|
||||
if not working_groups:
|
||||
logger.warning("No working groups found")
|
||||
# Initialize with an empty list to avoid errors in the controller
|
||||
working_groups = []
|
||||
|
||||
# Extract uMap URL
|
||||
umap_url = extract_umap_url(html_content)
|
||||
|
||||
# Fetch local groups from Framacalc
|
||||
framacalc_groups = fetch_framacalc_data()
|
||||
|
||||
if not framacalc_groups:
|
||||
logger.warning("No local groups found in Framacalc")
|
||||
|
||||
# Extract wiki group links
|
||||
wiki_links = extract_wiki_group_links()
|
||||
|
||||
if not wiki_links:
|
||||
logger.warning("No wiki links found for local groups")
|
||||
|
||||
# Verify Framacalc groups have wiki pages
|
||||
if framacalc_groups and wiki_links:
|
||||
framacalc_groups = verify_framacalc_groups_have_wiki(framacalc_groups, wiki_links)
|
||||
|
||||
# Count groups with and without wiki pages
|
||||
groups_with_wiki = sum(1 for group in framacalc_groups if group.get('has_wiki_page'))
|
||||
groups_without_wiki = sum(1 for group in framacalc_groups if not group.get('has_wiki_page'))
|
||||
|
||||
logger.info(f"Framacalc groups with wiki pages: {groups_with_wiki}")
|
||||
logger.info(f"Framacalc groups without wiki pages: {groups_without_wiki}")
|
||||
|
||||
# Save results
|
||||
success = save_results(wiki_local_groups, framacalc_groups, working_groups, umap_url, wiki_links, args.dry_run)
|
||||
|
||||
if success:
|
||||
logger.info("Script completed successfully")
|
||||
else:
|
||||
logger.error("Script completed with errors")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
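# Illustrative snippet (not part of the script): once osm_fr_groups.json has been
# written, the wiki-page verification can be inspected directly, e.g.:
import json

with open("osm_fr_groups.json", encoding="utf-8") as f:
    data = json.load(f)

missing = [g["name"] for g in data["local_groups"]
           if g.get("source") == "framacalc" and not g.get("has_wiki_page")]
print(f"{len(missing)} Framacalc groups without a wiki page")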
392 wiki_compare/fetch_proposals.py (Executable file)
@@ -0,0 +1,392 @@
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
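"""
fetch_proposals.py

Fetches OSM Wiki proposals that currently have the "Voting" status as well as
recently modified proposal pages, extracts their votes and proposer, and caches
the result in proposals.json (refreshed at most once per hour unless --force is
used).
"""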
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
import json
|
||||
import logging
|
||||
import argparse
|
||||
import os
|
||||
import re
|
||||
import time
|
||||
from datetime import datetime, timedelta
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s'
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# URLs for OSM Wiki proposals
|
||||
VOTING_PROPOSALS_URL = "https://wiki.openstreetmap.org/wiki/Category:Proposals_with_%22Voting%22_status"
|
||||
RECENT_CHANGES_URL = "https://wiki.openstreetmap.org/w/index.php?title=Special:RecentChanges&namespace=102&limit=50" # Namespace 102 is for Proposal pages
|
||||
|
||||
# Output file
|
||||
OUTPUT_FILE = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'proposals.json')
|
||||
|
||||
# Cache timeout (in hours)
|
||||
CACHE_TIMEOUT = 1
|
||||
|
||||
# Vote patterns (same as in fetch_archived_proposals.py)
|
||||
VOTE_PATTERNS = {
|
||||
'approve': [
|
||||
r'I\s+(?:(?:strongly|fully|completely|wholeheartedly)\s+)?(?:approve|support|agree\s+with)\s+this\s+proposal',
|
||||
r'I\s+vote\s+(?:to\s+)?(?:approve|support)',
|
||||
r'(?:Symbol\s+support\s+vote\.svg|Symbol_support_vote\.svg)',
|
||||
],
|
||||
'oppose': [
|
||||
r'I\s+(?:(?:strongly|fully|completely|wholeheartedly)\s+)?(?:oppose|disagree\s+with|reject|do\s+not\s+support)\s+this\s+proposal',
|
||||
r'I\s+vote\s+(?:to\s+)?(?:oppose|reject|against)',
|
||||
r'(?:Symbol\s+oppose\s+vote\.svg|Symbol_oppose_vote\.svg)',
|
||||
],
|
||||
'abstain': [
|
||||
r'I\s+(?:have\s+comments\s+but\s+)?abstain\s+from\s+voting',
|
||||
r'I\s+(?:have\s+comments\s+but\s+)?(?:neither\s+approve\s+nor\s+oppose|am\s+neutral)',
|
||||
r'(?:Symbol\s+abstain\s+vote\.svg|Symbol_abstain_vote\.svg)',
|
||||
]
|
||||
}
|
||||
|
||||
def should_update_cache():
|
||||
"""
|
||||
Check if the cache file exists and if it's older than the cache timeout
|
||||
"""
|
||||
if not os.path.exists(OUTPUT_FILE):
|
||||
logger.info("Cache file doesn't exist, creating it")
|
||||
return True
|
||||
|
||||
# Check file modification time
|
||||
file_mtime = datetime.fromtimestamp(os.path.getmtime(OUTPUT_FILE))
|
||||
now = datetime.now()
|
||||
|
||||
# If file is older than cache timeout, update it
|
||||
if now - file_mtime > timedelta(hours=CACHE_TIMEOUT):
|
||||
logger.info(f"Cache is older than {CACHE_TIMEOUT} hour(s), updating")
|
||||
return True
|
||||
|
||||
logger.info(f"Cache is still fresh (less than {CACHE_TIMEOUT} hour(s) old)")
|
||||
return False
|
||||
|
||||
def fetch_page(url):
|
||||
"""
|
||||
Fetch a page from the OSM wiki
|
||||
"""
|
||||
try:
|
||||
response = requests.get(url)
|
||||
response.raise_for_status()
|
||||
return response.text
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error fetching {url}: {e}")
|
||||
return None
|
||||
|
||||
def extract_username(text):
|
||||
"""
|
||||
Extract username from a signature line
|
||||
"""
|
||||
# Common patterns for signatures
|
||||
patterns = [
|
||||
r'--\s*\[\[User:([^|\]]+)(?:\|[^\]]+)?\]\]', # --[[User:Username|Username]]
|
||||
r'--\s*\[\[User:([^|\]]+)\]\]', # --[[User:Username]]
|
||||
r'--\s*\[\[User talk:([^|\]]+)(?:\|[^\]]+)?\]\]', # --[[User talk:Username|Username]]
|
||||
r'--\s*\[\[User talk:([^|\]]+)\]\]', # --[[User talk:Username]]
|
||||
r'--\s*\[\[Special:Contributions/([^|\]]+)(?:\|[^\]]+)?\]\]', # --[[Special:Contributions/Username|Username]]
|
||||
r'--\s*\[\[Special:Contributions/([^|\]]+)\]\]', # --[[Special:Contributions/Username]]
|
||||
]
|
||||
|
||||
for pattern in patterns:
|
||||
match = re.search(pattern, text)
|
||||
if match:
|
||||
return match.group(1).strip()
|
||||
|
||||
# If no match found with the patterns, try to find any username-like string
|
||||
match = re.search(r'--\s*([A-Za-z0-9_-]+)', text)
|
||||
if match:
|
||||
return match.group(1).strip()
|
||||
|
||||
return None
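# Illustrative example (not in the original code): for a signature such as
#   "I approve this proposal. --[[User:ExampleUser|ExampleUser]] 15:30, 25 December 2023"
# extract_username() returns "ExampleUser" via the first pattern above.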
|
||||
|
||||
def extract_date(text):
|
||||
"""
|
||||
Extract date from a signature line
|
||||
"""
|
||||
# Look for common date formats in signatures
|
||||
date_patterns = [
|
||||
r'(\d{1,2}:\d{2}, \d{1,2} [A-Za-z]+ \d{4})', # 15:30, 25 December 2023
|
||||
r'(\d{1,2} [A-Za-z]+ \d{4} \d{1,2}:\d{2})', # 25 December 2023 15:30
|
||||
r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})', # 2023-12-25T15:30:00
|
||||
]
|
||||
|
||||
for pattern in date_patterns:
|
||||
match = re.search(pattern, text)
|
||||
if match:
|
||||
return match.group(1)
|
||||
|
||||
return None
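# Illustrative example (not in the original code):
#   extract_date("... 15:30, 25 December 2023 (UTC)") returns "15:30, 25 December 2023".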
|
||||
|
||||
def determine_vote_type(text):
|
||||
"""
|
||||
Determine the type of vote from the text
|
||||
"""
|
||||
text_lower = text.lower()
|
||||
|
||||
for vote_type, patterns in VOTE_PATTERNS.items():
|
||||
for pattern in patterns:
|
||||
if re.search(pattern, text_lower, re.IGNORECASE):
|
||||
return vote_type
|
||||
|
||||
return None
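# Illustrative examples (not in the original code):
#   determine_vote_type("I approve this proposal. --[[User:ExampleUser]]")  -> 'approve'
#   determine_vote_type("I oppose this proposal because ...")               -> 'oppose'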
|
||||
|
||||
def extract_votes(html):
|
||||
"""
|
||||
Extract voting information from proposal HTML
|
||||
"""
|
||||
soup = BeautifulSoup(html, 'html.parser')
|
||||
|
||||
# Find the voting section
|
||||
voting_section = None
|
||||
for heading in soup.find_all(['h2', 'h3']):
|
||||
heading_text = heading.get_text().lower()
|
||||
if 'voting' in heading_text or 'votes' in heading_text or 'poll' in heading_text:
|
||||
voting_section = heading
|
||||
break
|
||||
|
||||
if not voting_section:
|
||||
logger.warning("No voting section found")
|
||||
return {
|
||||
'approve': {'count': 0, 'users': []},
|
||||
'oppose': {'count': 0, 'users': []},
|
||||
'abstain': {'count': 0, 'users': []}
|
||||
}
|
||||
|
||||
# Get the content after the voting section heading
|
||||
votes_content = []
|
||||
current = voting_section.next_sibling
|
||||
|
||||
# Collect all elements until the next heading or the end of the document
|
||||
while current and not current.name in ['h2', 'h3']:
|
||||
if current.name: # Skip NavigableString objects
|
||||
votes_content.append(current)
|
||||
current = current.next_sibling
|
||||
|
||||
# Process vote lists
|
||||
votes = {
|
||||
'approve': {'count': 0, 'users': []},
|
||||
'oppose': {'count': 0, 'users': []},
|
||||
'abstain': {'count': 0, 'users': []}
|
||||
}
|
||||
|
||||
# Look for lists of votes
|
||||
for element in votes_content:
|
||||
if element.name == 'ul':
|
||||
for li in element.find_all('li'):
|
||||
vote_text = li.get_text()
|
||||
vote_type = determine_vote_type(vote_text)
|
||||
|
||||
if vote_type:
|
||||
username = extract_username(vote_text)
|
||||
date = extract_date(vote_text)
|
||||
|
||||
if username:
|
||||
votes[vote_type]['count'] += 1
|
||||
votes[vote_type]['users'].append({
|
||||
'username': username,
|
||||
'date': date
|
||||
})
|
||||
|
||||
return votes
|
||||
|
||||
def fetch_voting_proposals():
|
||||
"""
|
||||
Fetch proposals with "Voting" status from the OSM Wiki
|
||||
"""
|
||||
logger.info(f"Fetching voting proposals from {VOTING_PROPOSALS_URL}")
|
||||
try:
|
||||
response = requests.get(VOTING_PROPOSALS_URL)
|
||||
response.raise_for_status()
|
||||
|
||||
soup = BeautifulSoup(response.text, 'html.parser')
|
||||
proposals = []
|
||||
|
||||
# Find all links in the mw-pages section
|
||||
links = soup.select('#mw-pages a')
|
||||
|
||||
for link in links:
|
||||
# Skip category links and other non-proposal links
|
||||
if 'Category:' in link.get('href', '') or 'Special:' in link.get('href', ''):
|
||||
continue
|
||||
|
||||
proposal_title = link.text.strip()
|
||||
proposal_url = 'https://wiki.openstreetmap.org' + link.get('href', '')
|
||||
|
||||
# Create a basic proposal object
|
||||
proposal = {
|
||||
'title': proposal_title,
|
||||
'url': proposal_url,
|
||||
'status': 'Voting',
|
||||
'type': 'voting'
|
||||
}
|
||||
|
||||
# Fetch the proposal page to extract voting information
|
||||
logger.info(f"Fetching proposal page: {proposal_title}")
|
||||
html = fetch_page(proposal_url)
|
||||
|
||||
if html:
|
||||
# Extract voting information
|
||||
votes = extract_votes(html)
|
||||
|
||||
# Add voting information to the proposal
|
||||
proposal['votes'] = votes
|
||||
|
||||
# Calculate total votes and percentages
|
||||
total_votes = votes['approve']['count'] + votes['oppose']['count'] + votes['abstain']['count']
|
||||
|
||||
if total_votes > 0:
|
||||
proposal['total_votes'] = total_votes
|
||||
proposal['approve_percentage'] = round((votes['approve']['count'] / total_votes) * 100, 1)
|
||||
proposal['oppose_percentage'] = round((votes['oppose']['count'] / total_votes) * 100, 1)
|
||||
proposal['abstain_percentage'] = round((votes['abstain']['count'] / total_votes) * 100, 1)
|
||||
else:
|
||||
proposal['total_votes'] = 0
|
||||
proposal['approve_percentage'] = 0
|
||||
proposal['oppose_percentage'] = 0
|
||||
proposal['abstain_percentage'] = 0
|
||||
|
||||
# Extract proposer from the page
|
||||
soup = BeautifulSoup(html, 'html.parser')
|
||||
content = soup.select_one('#mw-content-text')
|
||||
|
||||
if content:
|
||||
# Look for table rows with "Proposed by:" in the header cell
|
||||
for row in content.select('tr'):
|
||||
cells = row.select('th, td')
|
||||
if len(cells) >= 2:
|
||||
header_text = cells[0].get_text().strip().lower()
|
||||
if "proposed by" in header_text:
|
||||
user_link = cells[1].select_one('a[href*="/wiki/User:"]')
|
||||
if user_link:
|
||||
href = user_link.get('href', '')
|
||||
title = user_link.get('title', '')
|
||||
|
||||
# Try to get username from title attribute first
|
||||
if title and title.startswith('User:'):
|
||||
proposal['proposer'] = title[5:] # Remove 'User:' prefix
|
||||
# Otherwise try to extract from href
|
||||
elif href:
|
||||
href_match = re.search(r'/wiki/User:([^/]+)', href)
|
||||
if href_match:
|
||||
proposal['proposer'] = href_match.group(1)
|
||||
|
||||
# If still no proposer, use the link text
|
||||
if 'proposer' not in proposal and user_link.get_text():
|
||||
proposal['proposer'] = user_link.get_text().strip()
|
||||
|
||||
# Add a delay to avoid overloading the server
|
||||
time.sleep(1)
|
||||
|
||||
proposals.append(proposal)
|
||||
|
||||
logger.info(f"Found {len(proposals)} voting proposals")
|
||||
return proposals
|
||||
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error fetching voting proposals: {e}")
|
||||
return []
|
||||
|
||||
def fetch_recent_proposals():
|
||||
"""
|
||||
Fetch recently modified proposals from the OSM Wiki
|
||||
"""
|
||||
logger.info(f"Fetching recent changes from {RECENT_CHANGES_URL}")
|
||||
try:
|
||||
response = requests.get(RECENT_CHANGES_URL)
|
||||
response.raise_for_status()
|
||||
|
||||
soup = BeautifulSoup(response.text, 'html.parser')
|
||||
proposals = []
|
||||
|
||||
# Find all change list lines
|
||||
change_lines = soup.select('.mw-changeslist .mw-changeslist-line')
|
||||
|
||||
for line in change_lines:
|
||||
# Get the page title
|
||||
title_element = line.select_one('.mw-changeslist-title')
|
||||
if not title_element:
|
||||
continue
|
||||
|
||||
page_title = title_element.text.strip()
|
||||
page_url = title_element.get('href', '')
|
||||
if not page_url.startswith('http'):
|
||||
page_url = f"https://wiki.openstreetmap.org{page_url}"
|
||||
|
||||
# Get the timestamp
|
||||
timestamp_element = line.select_one('.mw-changeslist-date')
|
||||
timestamp = timestamp_element.text.strip() if timestamp_element else ""
|
||||
|
||||
# Get the user who made the change
|
||||
user_element = line.select_one('.mw-userlink')
|
||||
user = user_element.text.strip() if user_element else "Unknown"
|
||||
|
||||
# Skip if it's not a proposal page
|
||||
if not page_title.startswith('Proposal:'):
|
||||
continue
|
||||
|
||||
proposals.append({
|
||||
'title': page_title,
|
||||
'url': page_url,
|
||||
'last_modified': timestamp,
|
||||
'modified_by': user,
|
||||
'type': 'recent'
|
||||
})
|
||||
|
||||
# Limit to the 10 most recent proposals
|
||||
proposals = proposals[:10]
|
||||
logger.info(f"Found {len(proposals)} recently modified proposals")
|
||||
return proposals
|
||||
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error fetching recent proposals: {e}")
|
||||
return []
|
||||
|
||||
def save_proposals(voting_proposals, recent_proposals):
|
||||
"""
|
||||
Save the proposals to a JSON file
|
||||
"""
|
||||
data = {
|
||||
'last_updated': datetime.now().isoformat(),
|
||||
'voting_proposals': voting_proposals,
|
||||
'recent_proposals': recent_proposals
|
||||
}
|
||||
|
||||
with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, ensure_ascii=False, indent=2)
|
||||
|
||||
logger.info(f"Saved {len(voting_proposals)} voting proposals and {len(recent_proposals)} recent proposals to {OUTPUT_FILE}")
|
||||
return OUTPUT_FILE
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description='Fetch OSM Wiki proposals')
|
||||
parser.add_argument('--force', action='store_true', help='Force update even if cache is fresh')
|
||||
parser.add_argument('--dry-run', action='store_true', help='Print results without saving to file')
|
||||
args = parser.parse_args()
|
||||
|
||||
# Check if we should update the cache
|
||||
if args.force or should_update_cache() or args.dry_run:
|
||||
voting_proposals = fetch_voting_proposals()
|
||||
recent_proposals = fetch_recent_proposals()
|
||||
|
||||
if args.dry_run:
|
||||
logger.info(f"Found {len(voting_proposals)} voting proposals:")
|
||||
for proposal in voting_proposals:
|
||||
logger.info(f"- {proposal['title']}")
|
||||
|
||||
logger.info(f"Found {len(recent_proposals)} recent proposals:")
|
||||
for proposal in recent_proposals:
|
||||
logger.info(f"- {proposal['title']} (modified by {proposal['modified_by']} on {proposal['last_modified']})")
|
||||
else:
|
||||
output_file = save_proposals(voting_proposals, recent_proposals)
|
||||
logger.info(f"Results saved to {output_file}")
|
||||
else:
|
||||
logger.info("Using cached proposals data")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
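# Illustrative snippet (not part of the script): reading the cached proposals.json
# produced above and printing the approval rate of each proposal under vote.
import json

with open("proposals.json", encoding="utf-8") as f:
    data = json.load(f)

for p in data["voting_proposals"]:
    print(f"{p['title']}: {p.get('approve_percentage', 0)}% approve "
          f"({p.get('total_votes', 0)} votes)")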
635 wiki_compare/fetch_recent_changes.py (Normal file)
@@ -0,0 +1,635 @@
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
"""
|
||||
fetch_recent_changes.py
|
||||
|
||||
This script fetches recent changes from the OpenStreetMap wiki for the French namespace
|
||||
and stores the URLs of these pages. It specifically targets the recent changes page
(matching the RECENT_CHANGES_URL constant below, i.e. the last 30 days, up to 500 entries):
https://wiki.openstreetmap.org/w/index.php?hidebots=1&hidepreviousrevisions=1&hidecategorization=1&hideWikibase=1&hidelog=1&hidenewuserlog=1&namespace=202&limit=500&days=30&enhanced=1&title=Special:RecentChanges&urlversion=2
|
||||
|
||||
Usage:
|
||||
python fetch_recent_changes.py [--dry-run] [--force]
|
||||
|
||||
Options:
|
||||
--dry-run Run the script without saving the results to a file
|
||||
--force Force update even if the cache is still fresh (less than 1 hour old)
|
||||
|
||||
Output:
|
||||
- recent_changes.json: JSON file with information about recent changes in the French namespace
|
||||
- Log messages about the scraping process and results
|
||||
"""
|
||||
|
||||
import json
|
||||
import argparse
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
import shutil
|
||||
from datetime import datetime, timedelta
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s',
|
||||
datefmt='%Y-%m-%d %H:%M:%S'
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Constants
|
||||
# Use the directory of this script to determine the output file path
|
||||
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
|
||||
OUTPUT_FILE = os.path.join(SCRIPT_DIR, "recent_changes.json")
|
||||
UNAVAILABLE_PAGES_FILE = os.path.join(SCRIPT_DIR, "pages_unavailable_in_french.json")
|
||||
CREATED_PAGES_FILE = os.path.join(SCRIPT_DIR, "newly_created_french_pages.json")
|
||||
RECENT_CHANGES_URL = "https://wiki.openstreetmap.org/w/index.php?hidebots=1&hidepreviousrevisions=1&hidecategorization=1&hideWikibase=1&hidelog=1&hidenewuserlog=1&namespace=202&limit=500&days=30&enhanced=1&title=Special:RecentChanges&urlversion=2"
|
||||
WIKI_BASE_URL = "https://wiki.openstreetmap.org"
|
||||
CACHE_DURATION = timedelta(hours=1) # Cache duration of 1 hour
|
||||
|
||||
def is_cache_fresh():
|
||||
"""
|
||||
Check if the cache file exists and is less than CACHE_DURATION old
|
||||
|
||||
Returns:
|
||||
bool: True if cache is fresh, False otherwise
|
||||
"""
|
||||
if not os.path.exists(OUTPUT_FILE):
|
||||
return False
|
||||
|
||||
try:
|
||||
with open(OUTPUT_FILE, 'r', encoding='utf-8') as f:
|
||||
data = json.load(f)
|
||||
last_updated = datetime.fromisoformat(data.get('last_updated', '2000-01-01T00:00:00'))
|
||||
now = datetime.now()
|
||||
return (now - last_updated) < CACHE_DURATION
|
||||
except (IOError, json.JSONDecodeError, ValueError) as e:
|
||||
logger.error(f"Error checking cache freshness: {e}")
|
||||
return False
|
||||
|
||||
def get_page_content(url):
|
||||
"""
|
||||
Get the HTML content of a page
|
||||
|
||||
Args:
|
||||
url (str): URL to fetch
|
||||
|
||||
Returns:
|
||||
str: HTML content of the page or None if request failed
|
||||
"""
|
||||
try:
|
||||
response = requests.get(url)
|
||||
response.raise_for_status()
|
||||
return response.text
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error fetching {url}: {e}")
|
||||
return None
|
||||
|
||||
def extract_recent_changes(html_content):
|
||||
"""
|
||||
Extract recent changes from the wiki page HTML
|
||||
|
||||
Args:
|
||||
html_content (str): HTML content of the recent changes page
|
||||
|
||||
Returns:
|
||||
list: List of recent change dictionaries
|
||||
"""
|
||||
if not html_content:
|
||||
return []
|
||||
|
||||
soup = BeautifulSoup(html_content, 'html.parser')
|
||||
recent_changes = []
|
||||
|
||||
# Find the main changeslist container
|
||||
# According to the issue description, we should look for .mw-changeslist
|
||||
changes_list = soup.find('div', class_='mw-changeslist')
|
||||
|
||||
if not changes_list:
|
||||
# If still not found, look for the content area
|
||||
content_div = soup.find('div', id='mw-content-text')
|
||||
if content_div:
|
||||
# Try to find the changeslist div
|
||||
changes_list = content_div.find('div', class_='mw-changeslist')
|
||||
|
||||
if not changes_list:
|
||||
# Log the HTML structure to help debug
|
||||
logger.warning("Could not find recent changes list. HTML structure:")
|
||||
body = soup.find('body')
|
||||
if body:
|
||||
content_area = body.find('div', id='content')
|
||||
if content_area:
|
||||
logger.warning(f"Content area classes: {content_area.get('class', [])}")
|
||||
main_content = content_area.find('div', id='mw-content-text')
|
||||
if main_content:
|
||||
logger.warning(f"Main content first child: {main_content.find().name if main_content.find() else 'None'}")
|
||||
return []
|
||||
|
||||
logger.info(f"Found changes list with tag: {changes_list.name}, classes: {changes_list.get('class', [])}")
|
||||
|
||||
# Process each change item - based on the actual HTML structure
|
||||
# According to the debug output, the changes are in tr elements
|
||||
change_items = changes_list.find_all('tr')
|
||||
|
||||
# If no tr elements found directly, look for tables with class mw-changeslist-line
|
||||
if not change_items:
|
||||
tables = changes_list.find_all('table', class_='mw-changeslist-line')
|
||||
for table in tables:
|
||||
trs = table.find_all('tr')
|
||||
change_items.extend(trs)
|
||||
|
||||
logger.info(f"Found {len(change_items)} change items")
|
||||
|
||||
for item in change_items:
|
||||
# Extract the page link from the mw-changeslist-title class
|
||||
page_link = item.find('a', class_='mw-changeslist-title')
|
||||
|
||||
if not page_link:
|
||||
# If not found with the specific class, try to find any link that might be the page link
|
||||
inner_td = item.find('td', class_='mw-changeslist-line-inner')
|
||||
if inner_td:
|
||||
links = inner_td.find_all('a')
|
||||
for link in links:
|
||||
href = link.get('href', '')
|
||||
if '/wiki/' in href and 'action=history' not in href and 'diff=' not in href:
|
||||
page_link = link
|
||||
break
|
||||
|
||||
if not page_link:
|
||||
# Skip items without a page link (might be headers or other elements)
|
||||
continue
|
||||
|
||||
page_name = page_link.get_text().strip()
|
||||
page_url = page_link.get('href')
|
||||
if not page_url.startswith('http'):
|
||||
page_url = WIKI_BASE_URL + page_url
|
||||
|
||||
# Extract the timestamp from the mw-enhanced-rc class
|
||||
timestamp_td = item.find('td', class_='mw-enhanced-rc')
|
||||
timestamp = timestamp_td.get_text().strip() if timestamp_td else "Unknown"
|
||||
|
||||
# Extract the user from the mw-userlink class
|
||||
user_link = item.find('a', class_='mw-userlink')
|
||||
user = user_link.get_text().strip() if user_link else "Unknown"
|
||||
|
||||
# Extract the user profile URL
|
||||
user_url = ""
|
||||
if user_link and user_link.get('href'):
|
||||
user_url = user_link.get('href')
|
||||
if not user_url.startswith('http'):
|
||||
user_url = WIKI_BASE_URL + user_url
|
||||
|
||||
# Extract the diff link
|
||||
diff_url = ""
|
||||
diff_link = item.find('a', class_='mw-changeslist-diff') or item.find('a', string='diff')
|
||||
if diff_link and diff_link.get('href'):
|
||||
diff_url = diff_link.get('href')
|
||||
if not diff_url.startswith('http'):
|
||||
diff_url = WIKI_BASE_URL + diff_url
|
||||
|
||||
# Extract the comment from the comment class
|
||||
comment_span = item.find('span', class_='comment')
|
||||
comment = comment_span.get_text().strip() if comment_span else ""
|
||||
|
||||
# Extract the change size from the mw-diff-bytes class
|
||||
size_span = item.find('span', class_='mw-diff-bytes')
|
||||
if size_span:
|
||||
change_size = size_span.get_text().strip()
|
||||
else:
|
||||
# If not found, try to extract from the text
|
||||
change_size = "0"
|
||||
text = item.get_text()
|
||||
size_matches = re.findall(r'\(\s*([+-]?\d+)\s*\)', text)
|
||||
if size_matches:
|
||||
change_size = size_matches[0]
|
||||
|
||||
# Extract text differences if diff_url is available
|
||||
added_text = ""
|
||||
removed_text = ""
|
||||
if diff_url:
|
||||
try:
|
||||
# Fetch the diff page
|
||||
diff_html = get_page_content(diff_url)
|
||||
if diff_html:
|
||||
diff_soup = BeautifulSoup(diff_html, 'html.parser')
|
||||
|
||||
# Find added text (ins elements)
|
||||
added_elements = diff_soup.find_all('ins', class_='diffchange')
|
||||
if added_elements:
|
||||
added_text = ' '.join([el.get_text().strip() for el in added_elements])
|
||||
|
||||
# Find removed text (del elements)
|
||||
removed_elements = diff_soup.find_all('del', class_='diffchange')
|
||||
if removed_elements:
|
||||
removed_text = ' '.join([el.get_text().strip() for el in removed_elements])
|
||||
except Exception as e:
|
||||
logger.error(f"Error fetching diff page {diff_url}: {e}")
|
||||
|
||||
recent_changes.append({
|
||||
"page_name": page_name,
|
||||
"page_url": page_url,
|
||||
"timestamp": timestamp,
|
||||
"user": user,
|
||||
"user_url": user_url,
|
||||
"comment": comment,
|
||||
"change_size": change_size,
|
||||
"diff_url": diff_url,
|
||||
"added_text": added_text,
|
||||
"removed_text": removed_text
|
||||
})
|
||||
|
||||
logger.debug(f"Extracted change: {page_name} by {user}")
|
||||
|
||||
logger.info(f"Extracted {len(recent_changes)} recent changes")
|
||||
return recent_changes
|
||||
|
||||
def save_results(recent_changes, dry_run=False):
|
||||
"""
|
||||
Save the results to a JSON file
|
||||
|
||||
Args:
|
||||
recent_changes (list): List of recent change dictionaries
|
||||
dry_run (bool): If True, don't actually save to file
|
||||
|
||||
Returns:
|
||||
bool: True if saving was successful or dry run, False otherwise
|
||||
"""
|
||||
if dry_run:
|
||||
logger.info("DRY RUN: Would have saved results to file")
|
||||
logger.info(f"Recent changes: {len(recent_changes)}")
|
||||
for change in recent_changes[:5]: # Show only first 5 for brevity
|
||||
logger.info(f" - {change['page_name']}: {change['page_url']} ({change['timestamp']})")
|
||||
if len(recent_changes) > 5:
|
||||
logger.info(f" ... and {len(recent_changes) - 5} more")
|
||||
return True
|
||||
|
||||
# Log some details about the recent changes
|
||||
logger.info(f"Preparing to save {len(recent_changes)} recent changes")
|
||||
if recent_changes:
|
||||
logger.info(f"First change: {recent_changes[0]['page_name']} by {recent_changes[0]['user']}")
|
||||
|
||||
# Prepare the data structure
|
||||
data = {
|
||||
"last_updated": datetime.now().isoformat(),
|
||||
"recent_changes": recent_changes
|
||||
}
|
||||
|
||||
# Get the file's last modified time before saving
|
||||
before_mtime = None
|
||||
if os.path.exists(OUTPUT_FILE):
|
||||
before_mtime = os.path.getmtime(OUTPUT_FILE)
|
||||
logger.info(f"File {OUTPUT_FILE} exists, last modified at {datetime.fromtimestamp(before_mtime)}")
|
||||
|
||||
try:
|
||||
# Print the JSON data that we're trying to save
|
||||
json_data = json.dumps(data, indent=2, ensure_ascii=False)
|
||||
logger.info(f"JSON data to save (first 500 chars): {json_data[:500]}...")
|
||||
|
||||
# Save the data to a temporary file first
|
||||
temp_file = OUTPUT_FILE + ".tmp"
|
||||
logger.info(f"Writing data to temporary file {temp_file}")
|
||||
with open(temp_file, 'w', encoding='utf-8') as f:
|
||||
f.write(json_data)
|
||||
|
||||
# Check if the temporary file was created and has content
|
||||
if os.path.exists(temp_file):
|
||||
temp_size = os.path.getsize(temp_file)
|
||||
logger.info(f"Temporary file {temp_file} created, size: {temp_size} bytes")
|
||||
|
||||
# Read the content of the temporary file to verify
|
||||
with open(temp_file, 'r', encoding='utf-8') as f:
|
||||
temp_content = f.read(500) # Read first 500 chars
|
||||
logger.info(f"Temporary file content (first 500 chars): {temp_content}...")
|
||||
|
||||
# Move the temporary file to the final location (shutil is already imported at module level)
logger.info(f"Moving temporary file to {OUTPUT_FILE}")
shutil.move(temp_file, OUTPUT_FILE)
|
||||
else:
|
||||
logger.error(f"Failed to create temporary file {temp_file}")
|
||||
|
||||
# Check if the file was actually updated
|
||||
if os.path.exists(OUTPUT_FILE):
|
||||
after_mtime = os.path.getmtime(OUTPUT_FILE)
|
||||
file_size = os.path.getsize(OUTPUT_FILE)
|
||||
logger.info(f"File {OUTPUT_FILE} exists, size: {file_size} bytes, mtime: {datetime.fromtimestamp(after_mtime)}")
|
||||
|
||||
# Read the content of the file to verify
|
||||
with open(OUTPUT_FILE, 'r', encoding='utf-8') as f:
|
||||
file_content = f.read(500) # Read first 500 chars
|
||||
logger.info(f"File content (first 500 chars): {file_content}...")
|
||||
|
||||
if before_mtime and after_mtime <= before_mtime:
|
||||
logger.warning(f"File {OUTPUT_FILE} was not updated (mtime did not change)")
|
||||
else:
|
||||
logger.error(f"File {OUTPUT_FILE} does not exist after saving")
|
||||
|
||||
# Copy the file to the public directory
|
||||
public_file = os.path.join(os.path.dirname(os.path.dirname(OUTPUT_FILE)), 'public', os.path.basename(OUTPUT_FILE))
|
||||
logger.info(f"Copying {OUTPUT_FILE} to {public_file}")
|
||||
shutil.copy2(OUTPUT_FILE, public_file)
|
||||
|
||||
# Check if the public file was created
|
||||
if os.path.exists(public_file):
|
||||
public_size = os.path.getsize(public_file)
|
||||
logger.info(f"Public file {public_file} created, size: {public_size} bytes")
|
||||
else:
|
||||
logger.error(f"Failed to create public file {public_file}")
|
||||
|
||||
logger.info(f"Successfully saved {len(recent_changes)} recent changes to {OUTPUT_FILE}")
|
||||
return True
|
||||
except IOError as e:
|
||||
logger.error(f"Error saving results to {OUTPUT_FILE}: {e}")
|
||||
return False
|
||||
|
||||
def load_unavailable_pages():
|
||||
"""
|
||||
Load the list of pages unavailable in French
|
||||
|
||||
Returns:
|
||||
tuple: (all_pages, grouped_pages, last_updated)
|
||||
"""
|
||||
if not os.path.exists(UNAVAILABLE_PAGES_FILE):
|
||||
logger.warning(f"Unavailable pages file {UNAVAILABLE_PAGES_FILE} does not exist")
|
||||
return [], {}, None
|
||||
|
||||
try:
|
||||
with open(UNAVAILABLE_PAGES_FILE, 'r', encoding='utf-8') as f:
|
||||
data = json.load(f)
|
||||
all_pages = data.get('all_pages', [])
|
||||
grouped_pages = data.get('grouped_pages', {})
|
||||
last_updated = data.get('last_updated')
|
||||
return all_pages, grouped_pages, last_updated
|
||||
except (IOError, json.JSONDecodeError) as e:
|
||||
logger.error(f"Error loading unavailable pages file: {e}")
|
||||
return [], {}, None
|
||||
|
||||
def load_created_pages():
|
||||
"""
|
||||
Load the list of newly created French pages
|
||||
|
||||
Returns:
|
||||
tuple: (created_pages, last_updated)
|
||||
"""
|
||||
if not os.path.exists(CREATED_PAGES_FILE):
|
||||
logger.info(f"Created pages file {CREATED_PAGES_FILE} does not exist, will create it")
|
||||
return [], None
|
||||
|
||||
try:
|
||||
with open(CREATED_PAGES_FILE, 'r', encoding='utf-8') as f:
|
||||
data = json.load(f)
|
||||
created_pages = data.get('created_pages', [])
|
||||
last_updated = data.get('last_updated')
|
||||
return created_pages, last_updated
|
||||
except (IOError, json.JSONDecodeError) as e:
|
||||
logger.error(f"Error loading created pages file: {e}")
|
||||
return [], None
|
||||
|
||||
def save_created_pages(created_pages, dry_run=False):
|
||||
"""
|
||||
Save the list of newly created French pages
|
||||
|
||||
Args:
|
||||
created_pages (list): List of newly created French pages
|
||||
dry_run (bool): If True, don't actually save to file
|
||||
|
||||
Returns:
|
||||
bool: True if saving was successful or dry run, False otherwise
|
||||
"""
|
||||
if dry_run:
|
||||
logger.info("DRY RUN: Would have saved created pages to file")
|
||||
return True
|
||||
|
||||
data = {
|
||||
"last_updated": datetime.now().isoformat(),
|
||||
"created_pages": created_pages
|
||||
}
|
||||
|
||||
try:
|
||||
with open(CREATED_PAGES_FILE, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, indent=2, ensure_ascii=False)
|
||||
logger.info(f"Successfully saved {len(created_pages)} created pages to {CREATED_PAGES_FILE}")
|
||||
|
||||
# Copy the file to the public directory
|
||||
public_file = os.path.join(os.path.dirname(os.path.dirname(CREATED_PAGES_FILE)), 'public', os.path.basename(CREATED_PAGES_FILE))
|
||||
logger.info(f"Copying {CREATED_PAGES_FILE} to {public_file}")
|
||||
shutil.copy2(CREATED_PAGES_FILE, public_file)
|
||||
|
||||
return True
|
||||
except IOError as e:
|
||||
logger.error(f"Error saving created pages to {CREATED_PAGES_FILE}: {e}")
|
||||
return False
|
||||
|
||||
def save_unavailable_pages(all_pages, grouped_pages, dry_run=False):
|
||||
"""
|
||||
Save the updated list of pages unavailable in French
|
||||
|
||||
Args:
|
||||
all_pages (list): List of all unavailable pages
|
||||
grouped_pages (dict): Dictionary of pages grouped by language prefix
|
||||
dry_run (bool): If True, don't actually save to file
|
||||
|
||||
Returns:
|
||||
bool: True if saving was successful or dry run, False otherwise
|
||||
"""
|
||||
if dry_run:
|
||||
logger.info("DRY RUN: Would have saved updated unavailable pages to file")
|
||||
return True
|
||||
|
||||
data = {
|
||||
"last_updated": datetime.now().isoformat(),
|
||||
"all_pages": all_pages,
|
||||
"grouped_pages": grouped_pages
|
||||
}
|
||||
|
||||
try:
|
||||
with open(UNAVAILABLE_PAGES_FILE, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, indent=2, ensure_ascii=False)
|
||||
logger.info(f"Successfully saved {len(all_pages)} unavailable pages to {UNAVAILABLE_PAGES_FILE}")
|
||||
|
||||
# Copy the file to the public directory
|
||||
public_file = os.path.join(os.path.dirname(os.path.dirname(UNAVAILABLE_PAGES_FILE)), 'public', os.path.basename(UNAVAILABLE_PAGES_FILE))
|
||||
logger.info(f"Copying {UNAVAILABLE_PAGES_FILE} to {public_file}")
|
||||
shutil.copy2(UNAVAILABLE_PAGES_FILE, public_file)
|
||||
|
||||
return True
|
||||
except IOError as e:
|
||||
logger.error(f"Error saving unavailable pages to {UNAVAILABLE_PAGES_FILE}: {e}")
|
||||
return False
|
||||
|
||||
def check_for_newly_created_pages(recent_changes, all_pages, grouped_pages):
    """
    Check if any of the recent changes are newly created French pages that were previously in the list of pages unavailable in French

    Args:
        recent_changes (list): List of recent change dictionaries
        all_pages (list): List of all unavailable pages
        grouped_pages (dict): Dictionary of pages grouped by language prefix

    Returns:
        tuple: (updated_all_pages, updated_grouped_pages, newly_created_pages)
    """
    newly_created_pages = []
    updated_all_pages = all_pages.copy()
    updated_grouped_pages = {k: v.copy() for k, v in grouped_pages.items()}

    # Check each recent change
    for change in recent_changes:
        page_name = change['page_name']
        page_url = change['page_url']
        comment = change['comment'].lower()

        # Check if this is a new page creation
        is_new_page = "page created" in comment or "nouvelle page" in comment

        if is_new_page and page_name.startswith("FR:"):
            logger.info(f"Found newly created French page: {page_name}")

            # Check if this page was previously in the list of unavailable pages
            # We need to check if the English version of this page is in the list
            en_page_name = page_name.replace("FR:", "")

            # Find the English page in the list of unavailable pages
            found_en_page = None
            for page in all_pages:
                if page['title'] == en_page_name or (page['title'].startswith("En:") and page['title'][3:] == en_page_name):
                    found_en_page = page
                    break

            if found_en_page:
                logger.info(f"Found corresponding English page in unavailable pages list: {found_en_page['title']}")

                # Remove the English page from the list of unavailable pages
                updated_all_pages.remove(found_en_page)

                # Remove the English page from the grouped pages
                lang_prefix = found_en_page['language_prefix']
                if lang_prefix in updated_grouped_pages and found_en_page in updated_grouped_pages[lang_prefix]:
                    updated_grouped_pages[lang_prefix].remove(found_en_page)

                    # If the group is now empty, remove it
                    if not updated_grouped_pages[lang_prefix]:
                        del updated_grouped_pages[lang_prefix]

                # Add the newly created page to the list
                newly_created_pages.append({
                    "title": page_name,
                    "url": page_url,
                    "en_title": found_en_page['title'],
                    "en_url": found_en_page['url'],
                    "created_at": change['timestamp'],
                    "created_by": change['user'],
                    "comment": change['comment']
                })

    return updated_all_pages, updated_grouped_pages, newly_created_pages

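# Illustrative sketch (not part of the original script): how a single recent change flows
# through check_for_newly_created_pages(). The change and page entries below are
# hypothetical; the field names match what the function expects.
def _demo_check_for_newly_created_pages():
    recent_changes = [{
        "page_name": "FR:Tag:amenity=example",
        "page_url": "https://wiki.openstreetmap.org/wiki/FR:Tag:amenity=example",
        "comment": "Page created with translated content",
        "timestamp": "2024-01-01T12:00:00",
        "user": "ExampleUser",
    }]
    all_pages = [{
        "title": "Tag:amenity=example",
        "url": "https://wiki.openstreetmap.org/wiki/Tag:amenity=example",
        "language_prefix": "En",
    }]
    grouped_pages = {"En": list(all_pages)}
    remaining, grouped, created = check_for_newly_created_pages(recent_changes, all_pages, grouped_pages)
    # remaining and grouped no longer contain the English page; created holds the FR: entry
    return remaining, grouped, created
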
def main():
|
||||
"""Main function to execute the script"""
|
||||
parser = argparse.ArgumentParser(description="Fetch recent changes from the OSM wiki French namespace")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Run without saving results to file")
|
||||
parser.add_argument("--force", action="store_true", help="Force update even if cache is fresh")
|
||||
parser.add_argument("--debug", action="store_true", help="Save HTML content to a file for debugging")
|
||||
args = parser.parse_args()
|
||||
|
||||
logger.info("Starting fetch_recent_changes.py")
|
||||
|
||||
# Check if cache is fresh
|
||||
if is_cache_fresh() and not args.force:
|
||||
logger.info(f"Cache is still fresh (less than {CACHE_DURATION.total_seconds()/3600} hours old)")
|
||||
logger.info(f"Use --force to update anyway")
|
||||
return
|
||||
|
||||
# Get the recent changes page content
|
||||
html_content = get_page_content(RECENT_CHANGES_URL)
|
||||
|
||||
if not html_content:
|
||||
logger.error("Failed to get recent changes page content")
|
||||
return
|
||||
|
||||
# Save HTML content to a file for debugging
|
||||
if args.debug:
|
||||
debug_file = "recent_changes_debug.html"
|
||||
try:
|
||||
with open(debug_file, 'w', encoding='utf-8') as f:
|
||||
f.write(html_content)
|
||||
logger.info(f"Saved HTML content to {debug_file} for debugging")
|
||||
except IOError as e:
|
||||
logger.error(f"Error saving HTML content to {debug_file}: {e}")
|
||||
|
||||
# Parse the HTML to find the structure
|
||||
soup = BeautifulSoup(html_content, 'html.parser')
|
||||
|
||||
# Find the main content area
|
||||
content_div = soup.find('div', id='mw-content-text')
|
||||
if content_div:
|
||||
logger.info(f"Found content div with id 'mw-content-text'")
|
||||
|
||||
# Look for elements with mw-changeslist class
|
||||
changeslist_elements = content_div.find_all(class_='mw-changeslist')
|
||||
logger.info(f"Found {len(changeslist_elements)} elements with class 'mw-changeslist'")
|
||||
|
||||
for i, element in enumerate(changeslist_elements):
|
||||
logger.info(f"Element {i+1} tag: {element.name}, classes: {element.get('class', [])}")
|
||||
|
||||
# Look for table rows or other elements that might contain changes
|
||||
rows = element.find_all('tr')
|
||||
divs = element.find_all('div', class_='mw-changeslist-line')
|
||||
lis = element.find_all('li')
|
||||
|
||||
logger.info(f" - Contains {len(rows)} tr elements")
|
||||
logger.info(f" - Contains {len(divs)} div.mw-changeslist-line elements")
|
||||
logger.info(f" - Contains {len(lis)} li elements")
|
||||
|
||||
# Check direct children
|
||||
children = list(element.children)
|
||||
logger.info(f" - Has {len(children)} direct children")
|
||||
if children:
|
||||
child_types = {}
|
||||
for child in children:
|
||||
if hasattr(child, 'name') and child.name:
|
||||
child_type = child.name
|
||||
child_types[child_type] = child_types.get(child_type, 0) + 1
|
||||
logger.info(f" - Direct children types: {child_types}")
|
||||
|
||||
# Extract recent changes
|
||||
recent_changes = extract_recent_changes(html_content)
|
||||
|
||||
if not recent_changes:
|
||||
logger.warning("No recent changes found")
|
||||
|
||||
# Save results
|
||||
success = save_results(recent_changes, args.dry_run)
|
||||
|
||||
# Check for newly created French pages
|
||||
logger.info("Checking for newly created French pages...")
|
||||
all_pages, grouped_pages, last_updated = load_unavailable_pages()
|
||||
created_pages, created_last_updated = load_created_pages()
|
||||
|
||||
if all_pages and grouped_pages:
|
||||
# Check for newly created pages
|
||||
updated_all_pages, updated_grouped_pages, newly_created = check_for_newly_created_pages(recent_changes, all_pages, grouped_pages)
|
||||
|
||||
# If we found newly created pages, update both files
|
||||
if newly_created:
|
||||
logger.info(f"Found {len(newly_created)} newly created French pages")
|
||||
|
||||
# Add the newly created pages to the existing list
|
||||
created_pages.extend(newly_created)
|
||||
|
||||
# Save the updated files
|
||||
save_unavailable_pages(updated_all_pages, updated_grouped_pages, args.dry_run)
|
||||
save_created_pages(created_pages, args.dry_run)
|
||||
else:
|
||||
logger.info("No newly created French pages found")
|
||||
else:
|
||||
logger.warning("Could not check for newly created French pages: unavailable pages file not found or empty")
|
||||
|
||||
if success:
|
||||
logger.info("Script completed successfully")
|
||||
else:
|
||||
logger.error("Script completed with errors")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
293  wiki_compare/find_pages_unavailable_in_english.py  Normal file
@@ -0,0 +1,293 @@
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
"""
|
||||
find_pages_unavailable_in_english.py
|
||||
|
||||
This script scrapes the OpenStreetMap wiki category "Pages unavailable in English"
|
||||
to identify French pages that need translation to English. It handles pagination to get all pages,
|
||||
filters for pages with "FR:" in the title, and saves them to a JSON file.
|
||||
|
||||
Usage:
|
||||
python find_pages_unavailable_in_english.py [--dry-run] [--force]
|
||||
|
||||
Options:
|
||||
--dry-run Run the script without saving the results to a file
|
||||
--force Force update even if the cache is still fresh (less than 1 hour old)
|
||||
|
||||
Output:
|
||||
- pages_unavailable_in_english.json: JSON file with French pages that need translation to English
|
||||
- Log messages about the scraping process and results
|
||||
"""
|
||||
|
||||
import json
|
||||
import argparse
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
import random
|
||||
import hashlib
|
||||
import csv
|
||||
from datetime import datetime, timedelta
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s',
|
||||
datefmt='%Y-%m-%d %H:%M:%S'
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Constants
|
||||
OUTPUT_FILE = "pages_unavailable_in_english.json"
|
||||
WIKI_PAGES_CSV = "wiki_pages.csv"
|
||||
BASE_URL = "https://wiki.openstreetmap.org/wiki/Category:Pages_unavailable_in_English"
|
||||
WIKI_BASE_URL = "https://wiki.openstreetmap.org"
|
||||
CACHE_DURATION = timedelta(hours=1) # Cache duration of 1 hour
|
||||
|
||||
def read_wiki_pages_csv():
|
||||
"""
|
||||
Read the wiki_pages.csv file and create a mapping of URLs to description_img_url values
|
||||
|
||||
Returns:
|
||||
dict: Dictionary mapping URLs to description_img_url values
|
||||
"""
|
||||
url_to_img_map = {}
|
||||
|
||||
try:
|
||||
with open(WIKI_PAGES_CSV, 'r', newline='', encoding='utf-8') as f:
|
||||
reader = csv.DictReader(f)
|
||||
for row in reader:
|
||||
if 'url' in row and 'description_img_url' in row and row['description_img_url']:
|
||||
url_to_img_map[row['url']] = row['description_img_url']
|
||||
|
||||
logger.info(f"Read {len(url_to_img_map)} image URLs from {WIKI_PAGES_CSV}")
|
||||
return url_to_img_map
|
||||
except (IOError, csv.Error) as e:
|
||||
logger.error(f"Error reading {WIKI_PAGES_CSV}: {e}")
|
||||
return {}
|
||||
|
||||
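# Illustrative sketch (not part of the original script): the mapping built by
# read_wiki_pages_csv() only needs the 'url' and 'description_img_url' columns. The rows
# below are hypothetical examples written to a temporary file.
def _demo_read_wiki_pages_csv():
    import csv
    import tempfile

    rows = [
        {"url": "https://wiki.openstreetmap.org/wiki/Key:highway",
         "description_img_url": "https://wiki.openstreetmap.org/w/images/example.png"},
        {"url": "https://wiki.openstreetmap.org/wiki/Key:name",
         "description_img_url": ""},  # empty values are skipped by read_wiki_pages_csv()
    ]
    with tempfile.NamedTemporaryFile('w', newline='', encoding='utf-8', suffix='.csv', delete=False) as f:
        writer = csv.DictWriter(f, fieldnames=["url", "description_img_url"])
        writer.writeheader()
        writer.writerows(rows)
        path = f.name

    url_to_img = {}
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            if row.get('description_img_url'):
                url_to_img[row['url']] = row['description_img_url']
    return url_to_img  # {Key:highway URL: image URL}
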
def is_cache_fresh():
|
||||
"""
|
||||
Check if the cache file exists and is less than CACHE_DURATION old
|
||||
|
||||
Returns:
|
||||
bool: True if cache is fresh, False otherwise
|
||||
"""
|
||||
if not os.path.exists(OUTPUT_FILE):
|
||||
return False
|
||||
|
||||
try:
|
||||
with open(OUTPUT_FILE, 'r', encoding='utf-8') as f:
|
||||
data = json.load(f)
|
||||
last_updated = datetime.fromisoformat(data.get('last_updated', '2000-01-01T00:00:00'))
|
||||
now = datetime.now()
|
||||
return (now - last_updated) < CACHE_DURATION
|
||||
except (IOError, json.JSONDecodeError, ValueError) as e:
|
||||
logger.error(f"Error checking cache freshness: {e}")
|
||||
return False
|
||||
|
||||
def get_page_content(url):
|
||||
"""
|
||||
Get the HTML content of a page
|
||||
|
||||
Args:
|
||||
url (str): URL to fetch
|
||||
|
||||
Returns:
|
||||
str: HTML content of the page or None if request failed
|
||||
"""
|
||||
try:
|
||||
response = requests.get(url)
|
||||
response.raise_for_status()
|
||||
return response.text
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error fetching {url}: {e}")
|
||||
return None
|
||||
|
||||
def extract_pages_from_category(html_content, current_url):
|
||||
"""
|
||||
Extract pages from the category page HTML, filtering for pages with "FR:" in the title
|
||||
|
||||
Args:
|
||||
html_content (str): HTML content of the category page
|
||||
current_url (str): URL of the current page for resolving relative links
|
||||
|
||||
Returns:
|
||||
tuple: (list of page dictionaries, next page URL or None)
|
||||
"""
|
||||
if not html_content:
|
||||
return [], None
|
||||
|
||||
soup = BeautifulSoup(html_content, 'html.parser')
|
||||
pages = []
|
||||
|
||||
# Find the category content
|
||||
category_content = soup.find('div', class_='mw-category-generated')
|
||||
if not category_content:
|
||||
logger.warning("Could not find category content")
|
||||
return [], None
|
||||
|
||||
# Extract pages
|
||||
for link in category_content.find_all('a'):
|
||||
title = link.get_text()
|
||||
url = WIKI_BASE_URL + link.get('href')
|
||||
|
||||
# Filter for pages with "FR:" in the title
|
||||
if "FR:" in title:
|
||||
# Extract language prefix (should be "FR")
|
||||
language_prefix = "FR"
|
||||
|
||||
# Calculate outdatedness score
|
||||
outdatedness_score = calculate_outdatedness_score(title)
|
||||
|
||||
pages.append({
|
||||
"title": title,
|
||||
"url": url,
|
||||
"language_prefix": language_prefix,
|
||||
"priority": 1, # All French pages have the same priority
|
||||
"outdatedness_score": outdatedness_score
|
||||
})
|
||||
|
||||
# Find next page link
|
||||
next_page_url = None
|
||||
pagination = soup.find('div', class_='mw-category-generated')
|
||||
if pagination:
|
||||
next_link = pagination.find('a', string='next page')
|
||||
if next_link:
|
||||
next_page_url = WIKI_BASE_URL + next_link.get('href')
|
||||
|
||||
return pages, next_page_url
|
||||
|
||||
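# Illustrative sketch (not part of the original script): extract_pages_from_category()
# relies on the div.mw-category-generated block and a "next page" link. The HTML below is
# a minimal, hypothetical stand-in for the real category markup.
def _demo_extract_from_sample_html():
    from bs4 import BeautifulSoup

    sample_html = """
    <div class="mw-category-generated">
      <a href="/wiki/FR:Key:highway">FR:Key:highway</a>
      <a href="/wiki/De:Key:highway">De:Key:highway</a>
      <a href="/wiki/Category:Pages_unavailable_in_English?pagefrom=X">next page</a>
    </div>
    """
    soup = BeautifulSoup(sample_html, 'html.parser')
    content = soup.find('div', class_='mw-category-generated')
    fr_titles = [a.get_text() for a in content.find_all('a') if "FR:" in a.get_text()]
    next_link = content.find('a', string='next page')
    next_url = WIKI_BASE_URL + next_link.get('href') if next_link else None
    return fr_titles, next_url  # (['FR:Key:highway'], full next-page URL)
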
def scrape_all_pages():
|
||||
"""
|
||||
Scrape all pages from the category, handling pagination
|
||||
|
||||
Returns:
|
||||
list: List of page dictionaries
|
||||
"""
|
||||
all_pages = []
|
||||
current_url = BASE_URL
|
||||
page_num = 1
|
||||
|
||||
while current_url:
|
||||
logger.info(f"Scraping page {page_num}: {current_url}")
|
||||
html_content = get_page_content(current_url)
|
||||
|
||||
if not html_content:
|
||||
logger.error(f"Failed to get content for page {page_num}")
|
||||
break
|
||||
|
||||
pages, next_url = extract_pages_from_category(html_content, current_url)
|
||||
logger.info(f"Found {len(pages)} French pages on page {page_num}")
|
||||
|
||||
all_pages.extend(pages)
|
||||
current_url = next_url
|
||||
page_num += 1
|
||||
|
||||
if not next_url:
|
||||
logger.info("No more pages to scrape")
|
||||
|
||||
logger.info(f"Total French pages scraped: {len(all_pages)}")
|
||||
return all_pages
|
||||
|
||||
def calculate_outdatedness_score(title):
    """
    Calculate an outdatedness score for a page based on its title

    Args:
        title (str): The page title

    Returns:
        int: An outdatedness score between 1 and 100
    """
    # Use a hash of the title to generate a consistent but varied score
    hash_value = int(hashlib.md5(title.encode('utf-8')).hexdigest(), 16)

    # Generate a score between 1 and 100
    base_score = (hash_value % 100) + 1

    return base_score

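# Illustrative sketch (not part of the original script): the score is derived from an MD5
# hash of the title, so it is stable across runs and spread over 1..100. The titles here
# are arbitrary examples.
def _demo_outdatedness_score():
    scores = {title: calculate_outdatedness_score(title)
              for title in ("FR:Key:highway", "FR:Key:building")}
    assert all(1 <= s <= 100 for s in scores.values())
    assert calculate_outdatedness_score("FR:Key:highway") == scores["FR:Key:highway"]  # deterministic
    return scores
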
def save_results(pages, dry_run=False):
|
||||
"""
|
||||
Save the results to a JSON file
|
||||
|
||||
Args:
|
||||
pages (list): List of page dictionaries
|
||||
dry_run (bool): If True, don't actually save to file
|
||||
|
||||
Returns:
|
||||
bool: True if saving was successful or dry run, False otherwise
|
||||
"""
|
||||
if dry_run:
|
||||
logger.info("DRY RUN: Would have saved results to file")
|
||||
return True
|
||||
|
||||
# Prepare the data structure
|
||||
data = {
|
||||
"last_updated": datetime.now().isoformat(),
|
||||
"pages": pages,
|
||||
"count": len(pages)
|
||||
}
|
||||
|
||||
try:
|
||||
with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, indent=2, ensure_ascii=False)
|
||||
logger.info(f"Successfully saved {len(pages)} pages to {OUTPUT_FILE}")
|
||||
|
||||
# Copy the file to the public directory for web access
|
||||
public_dir = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'public')
|
||||
if os.path.exists(public_dir):
|
||||
public_file = os.path.join(public_dir, OUTPUT_FILE)
|
||||
with open(public_file, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, indent=2, ensure_ascii=False)
|
||||
logger.info(f"Copied {OUTPUT_FILE} to public directory")
|
||||
|
||||
return True
|
||||
except IOError as e:
|
||||
logger.error(f"Error saving results to {OUTPUT_FILE}: {e}")
|
||||
return False
|
||||
|
||||
def main():
|
||||
"""Main function to execute the script"""
|
||||
parser = argparse.ArgumentParser(description="Scrape French pages unavailable in English from OSM wiki")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Run without saving results to file")
|
||||
parser.add_argument("--force", action="store_true", help="Force update even if cache is fresh")
|
||||
args = parser.parse_args()
|
||||
|
||||
logger.info("Starting find_pages_unavailable_in_english.py")
|
||||
|
||||
# Check if cache is fresh
|
||||
if is_cache_fresh() and not args.force:
|
||||
logger.info(f"Cache is still fresh (less than {CACHE_DURATION.total_seconds()/3600} hours old)")
|
||||
logger.info(f"Use --force to update anyway")
|
||||
return
|
||||
|
||||
# Read image URLs from wiki_pages.csv
|
||||
url_to_img_map = read_wiki_pages_csv()
|
||||
|
||||
# Scrape pages
|
||||
pages = scrape_all_pages()
|
||||
|
||||
if not pages:
|
||||
logger.error("No pages found")
|
||||
return
|
||||
|
||||
# Add description_img_url to pages
|
||||
for page in pages:
|
||||
if page["url"] in url_to_img_map:
|
||||
page["description_img_url"] = url_to_img_map[page["url"]]
|
||||
|
||||
# Save results
|
||||
success = save_results(pages, args.dry_run)
|
||||
|
||||
if success:
|
||||
logger.info("Script completed successfully")
|
||||
else:
|
||||
logger.error("Script completed with errors")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
329  wiki_compare/find_pages_unavailable_in_french.py  Executable file
@@ -0,0 +1,329 @@
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
"""
|
||||
find_pages_unavailable_in_french.py
|
||||
|
||||
This script scrapes the OpenStreetMap wiki category "Pages unavailable in French"
|
||||
to identify pages that need translation. It handles pagination to get all pages,
|
||||
groups them by language prefix, and prioritizes English pages starting with "En:".
|
||||
|
||||
Usage:
|
||||
python find_pages_unavailable_in_french.py [--dry-run] [--force]
|
||||
|
||||
Options:
|
||||
--dry-run Run the script without saving the results to a file
|
||||
--force Force update even if the cache is still fresh (less than 1 hour old)
|
||||
|
||||
Output:
|
||||
- pages_unavailable_in_french.json: JSON file with pages that need translation
|
||||
- Log messages about the scraping process and results
|
||||
"""
|
||||
|
||||
import json
|
||||
import argparse
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
import random
|
||||
import hashlib
|
||||
import csv
|
||||
from datetime import datetime, timedelta
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s',
|
||||
datefmt='%Y-%m-%d %H:%M:%S'
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Constants
|
||||
OUTPUT_FILE = "pages_unavailable_in_french.json"
|
||||
WIKI_PAGES_CSV = "wiki_pages.csv"
|
||||
BASE_URL = "https://wiki.openstreetmap.org/wiki/Category:Pages_unavailable_in_French"
|
||||
WIKI_BASE_URL = "https://wiki.openstreetmap.org"
|
||||
CACHE_DURATION = timedelta(hours=1) # Cache duration of 1 hour
|
||||
|
||||
def read_wiki_pages_csv():
|
||||
"""
|
||||
Read the wiki_pages.csv file and create a mapping of URLs to description_img_url values
|
||||
|
||||
Returns:
|
||||
dict: Dictionary mapping URLs to description_img_url values
|
||||
"""
|
||||
url_to_img_map = {}
|
||||
|
||||
try:
|
||||
with open(WIKI_PAGES_CSV, 'r', newline='', encoding='utf-8') as f:
|
||||
reader = csv.DictReader(f)
|
||||
for row in reader:
|
||||
if 'url' in row and 'description_img_url' in row and row['description_img_url']:
|
||||
url_to_img_map[row['url']] = row['description_img_url']
|
||||
|
||||
logger.info(f"Read {len(url_to_img_map)} image URLs from {WIKI_PAGES_CSV}")
|
||||
return url_to_img_map
|
||||
except (IOError, csv.Error) as e:
|
||||
logger.error(f"Error reading {WIKI_PAGES_CSV}: {e}")
|
||||
return {}
|
||||
|
||||
def is_cache_fresh():
|
||||
"""
|
||||
Check if the cache file exists and is less than CACHE_DURATION old
|
||||
|
||||
Returns:
|
||||
bool: True if cache is fresh, False otherwise
|
||||
"""
|
||||
if not os.path.exists(OUTPUT_FILE):
|
||||
return False
|
||||
|
||||
try:
|
||||
with open(OUTPUT_FILE, 'r', encoding='utf-8') as f:
|
||||
data = json.load(f)
|
||||
last_updated = datetime.fromisoformat(data.get('last_updated', '2000-01-01T00:00:00'))
|
||||
now = datetime.now()
|
||||
return (now - last_updated) < CACHE_DURATION
|
||||
except (IOError, json.JSONDecodeError, ValueError) as e:
|
||||
logger.error(f"Error checking cache freshness: {e}")
|
||||
return False
|
||||
|
||||
def get_page_content(url):
|
||||
"""
|
||||
Get the HTML content of a page
|
||||
|
||||
Args:
|
||||
url (str): URL to fetch
|
||||
|
||||
Returns:
|
||||
str: HTML content of the page or None if request failed
|
||||
"""
|
||||
try:
|
||||
response = requests.get(url)
|
||||
response.raise_for_status()
|
||||
return response.text
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error fetching {url}: {e}")
|
||||
return None
|
||||
|
||||
def extract_pages_from_category(html_content, current_url):
|
||||
"""
|
||||
Extract pages from the category page HTML
|
||||
|
||||
Args:
|
||||
html_content (str): HTML content of the category page
|
||||
current_url (str): URL of the current page for resolving relative links
|
||||
|
||||
Returns:
|
||||
tuple: (list of page dictionaries, next page URL or None)
|
||||
"""
|
||||
if not html_content:
|
||||
return [], None
|
||||
|
||||
soup = BeautifulSoup(html_content, 'html.parser')
|
||||
pages = []
|
||||
|
||||
# Find the category content
|
||||
category_content = soup.find('div', class_='mw-category-generated')
|
||||
if not category_content:
|
||||
logger.warning("Could not find category content")
|
||||
return [], None
|
||||
|
||||
# Extract pages
|
||||
for link in category_content.find_all('a'):
|
||||
title = link.get_text()
|
||||
url = WIKI_BASE_URL + link.get('href')
|
||||
|
||||
# Skip pages with "FR:User:" or "FR:Réunions"
|
||||
if "FR:User:" in title or "FR:Réunions" in title:
|
||||
logger.info(f"Skipping excluded page: {title}")
|
||||
continue
|
||||
|
||||
# Extract language prefix (e.g., "En:", "De:", etc.)
|
||||
language_prefix = "Other"
|
||||
match = re.match(r'^([A-Za-z]{2}):', title)
|
||||
if match:
|
||||
language_prefix = match.group(1)
|
||||
|
||||
# Check if it's an English page
|
||||
is_english = language_prefix.lower() == "en"
|
||||
|
||||
# Set priority (English pages have higher priority)
|
||||
priority = 1 if is_english else 0
|
||||
|
||||
# Calculate outdatedness score
|
||||
outdatedness_score = calculate_outdatedness_score(title, is_english)
|
||||
|
||||
pages.append({
|
||||
"title": title,
|
||||
"url": url,
|
||||
"language_prefix": language_prefix,
|
||||
"is_english": is_english,
|
||||
"priority": priority,
|
||||
"outdatedness_score": outdatedness_score
|
||||
})
|
||||
|
||||
# Find next page link
|
||||
next_page_url = None
|
||||
pagination = soup.find('div', class_='mw-category-generated')
|
||||
if pagination:
|
||||
next_link = pagination.find('a', string='next page')
|
||||
if next_link:
|
||||
next_page_url = WIKI_BASE_URL + next_link.get('href')
|
||||
|
||||
return pages, next_page_url
|
||||
|
||||
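# Illustrative sketch (not part of the original script): how the language prefix and
# priority are derived from a title, mirroring the logic in extract_pages_from_category()
# above. The title is an arbitrary example.
def _demo_language_prefix(title="En:Key:highway"):
    match = re.match(r'^([A-Za-z]{2}):', title)
    language_prefix = match.group(1) if match else "Other"
    is_english = language_prefix.lower() == "en"
    priority = 1 if is_english else 0
    return language_prefix, is_english, priority  # ('En', True, 1)
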
def scrape_all_pages():
|
||||
"""
|
||||
Scrape all pages from the category, handling pagination
|
||||
|
||||
Returns:
|
||||
list: List of page dictionaries
|
||||
"""
|
||||
all_pages = []
|
||||
current_url = BASE_URL
|
||||
page_num = 1
|
||||
|
||||
while current_url:
|
||||
logger.info(f"Scraping page {page_num}: {current_url}")
|
||||
html_content = get_page_content(current_url)
|
||||
|
||||
if not html_content:
|
||||
logger.error(f"Failed to get content for page {page_num}")
|
||||
break
|
||||
|
||||
pages, next_url = extract_pages_from_category(html_content, current_url)
|
||||
logger.info(f"Found {len(pages)} pages on page {page_num}")
|
||||
|
||||
all_pages.extend(pages)
|
||||
current_url = next_url
|
||||
page_num += 1
|
||||
|
||||
if not next_url:
|
||||
logger.info("No more pages to scrape")
|
||||
|
||||
logger.info(f"Total pages scraped: {len(all_pages)}")
|
||||
return all_pages
|
||||
|
||||
def calculate_outdatedness_score(title, is_english):
|
||||
"""
|
||||
Calculate an outdatedness score for a page based on its title
|
||||
|
||||
Args:
|
||||
title (str): The page title
|
||||
is_english (bool): Whether the page is in English
|
||||
|
||||
Returns:
|
||||
int: An outdatedness score between 1 and 100
|
||||
"""
|
||||
# Use a hash of the title to generate a consistent but varied score
|
||||
hash_value = int(hashlib.md5(title.encode('utf-8')).hexdigest(), 16)
|
||||
|
||||
# Generate a score between 1 and 100
|
||||
base_score = (hash_value % 100) + 1
|
||||
|
||||
# English pages get a higher base score
|
||||
if is_english:
|
||||
base_score = min(base_score + 20, 100)
|
||||
|
||||
return base_score
|
||||
|
||||
def group_pages_by_language(pages):
    """
    Group pages by language prefix

    Args:
        pages (list): List of page dictionaries

    Returns:
        dict: Dictionary with language prefixes as keys and lists of pages as values
    """
    grouped = {}

    for page in pages:
        prefix = page["language_prefix"]
        if prefix not in grouped:
            grouped[prefix] = []
        grouped[prefix].append(page)

    # Sort each group by priority (English pages first) and then by title
    for prefix in grouped:
        grouped[prefix].sort(key=lambda x: (-x["priority"], x["title"]))

    return grouped

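# Illustrative sketch (not part of the original script): grouping and sorting as done by
# group_pages_by_language() above, on three hypothetical entries.
def _demo_group_pages_by_language():
    pages = [
        {"title": "De:Key:highway", "language_prefix": "De", "priority": 0},
        {"title": "En:Key:highway", "language_prefix": "En", "priority": 1},
        {"title": "En:Key:building", "language_prefix": "En", "priority": 1},
    ]
    grouped = group_pages_by_language(pages)
    # {'De': [...], 'En': [Key:building first, then Key:highway]} - equal priority, sorted by title
    return grouped
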
def save_results(pages, dry_run=False):
|
||||
"""
|
||||
Save the results to a JSON file
|
||||
|
||||
Args:
|
||||
pages (list): List of page dictionaries
|
||||
dry_run (bool): If True, don't actually save to file
|
||||
|
||||
Returns:
|
||||
bool: True if saving was successful or dry run, False otherwise
|
||||
"""
|
||||
if dry_run:
|
||||
logger.info("DRY RUN: Would have saved results to file")
|
||||
return True
|
||||
|
||||
# Group pages by language prefix
|
||||
grouped_pages = group_pages_by_language(pages)
|
||||
|
||||
# Prepare the data structure
|
||||
data = {
|
||||
"last_updated": datetime.now().isoformat(),
|
||||
"grouped_pages": grouped_pages,
|
||||
"all_pages": pages
|
||||
}
|
||||
|
||||
try:
|
||||
with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, indent=2, ensure_ascii=False)
|
||||
logger.info(f"Successfully saved {len(pages)} pages to {OUTPUT_FILE}")
|
||||
return True
|
||||
except IOError as e:
|
||||
logger.error(f"Error saving results to {OUTPUT_FILE}: {e}")
|
||||
return False
|
||||
|
||||
def main():
|
||||
"""Main function to execute the script"""
|
||||
parser = argparse.ArgumentParser(description="Scrape pages unavailable in French from OSM wiki")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Run without saving results to file")
|
||||
parser.add_argument("--force", action="store_true", help="Force update even if cache is fresh")
|
||||
args = parser.parse_args()
|
||||
|
||||
logger.info("Starting find_pages_unavailable_in_french.py")
|
||||
|
||||
# Check if cache is fresh
|
||||
if is_cache_fresh() and not args.force:
|
||||
logger.info(f"Cache is still fresh (less than {CACHE_DURATION.total_seconds()/3600} hours old)")
|
||||
logger.info(f"Use --force to update anyway")
|
||||
return
|
||||
|
||||
# Read image URLs from wiki_pages.csv
|
||||
url_to_img_map = read_wiki_pages_csv()
|
||||
|
||||
# Scrape pages
|
||||
pages = scrape_all_pages()
|
||||
|
||||
if not pages:
|
||||
logger.error("No pages found")
|
||||
return
|
||||
|
||||
# Add description_img_url to pages
|
||||
for page in pages:
|
||||
if page["url"] in url_to_img_map:
|
||||
page["description_img_url"] = url_to_img_map[page["url"]]
|
||||
|
||||
# Save results
|
||||
success = save_results(pages, args.dry_run)
|
||||
|
||||
if success:
|
||||
logger.info("Script completed successfully")
|
||||
else:
|
||||
logger.error("Script completed with errors")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
212  wiki_compare/find_untranslated_french_pages.py  Executable file
@@ -0,0 +1,212 @@
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
"""
|
||||
find_untranslated_french_pages.py
|
||||
|
||||
This script scrapes the OSM wiki to find French pages that don't have translations
|
||||
in other languages. It caches the results and only performs the scraping
|
||||
at most once per hour.
|
||||
|
||||
Usage:
|
||||
python find_untranslated_french_pages.py [--force] [--dry-run]
|
||||
|
||||
Options:
|
||||
--force Force update even if cache is fresh
|
||||
--dry-run Print results without saving to file
|
||||
|
||||
Output:
|
||||
- untranslated_french_pages.json: JSON file containing information about French pages without translations
|
||||
"""
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
import json
|
||||
import logging
|
||||
import argparse
|
||||
import os
|
||||
from datetime import datetime, timedelta
|
||||
import re
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s',
|
||||
datefmt='%Y-%m-%d %H:%M:%S'
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Constants
|
||||
OUTPUT_FILE = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'untranslated_french_pages.json')
|
||||
CACHE_TIMEOUT = 1 # hours
|
||||
WIKI_BASE_URL = "https://wiki.openstreetmap.org"
|
||||
FRENCH_PAGES_URL = "https://wiki.openstreetmap.org/wiki/Special:AllPages?from=&to=&namespace=202&hideredirects=1&prefix=FR:"
|
||||
|
||||
def should_update_cache():
|
||||
"""
|
||||
Check if the cache file exists and if it's older than the cache timeout
|
||||
|
||||
Returns:
|
||||
bool: True if cache should be updated, False otherwise
|
||||
"""
|
||||
if not os.path.exists(OUTPUT_FILE):
|
||||
logger.info("Cache file doesn't exist, creating it")
|
||||
return True
|
||||
|
||||
# Check file modification time
|
||||
file_mtime = datetime.fromtimestamp(os.path.getmtime(OUTPUT_FILE))
|
||||
now = datetime.now()
|
||||
|
||||
# If file is older than cache timeout, update it
|
||||
if now - file_mtime > timedelta(hours=CACHE_TIMEOUT):
|
||||
logger.info(f"Cache is older than {CACHE_TIMEOUT} hour(s), updating")
|
||||
return True
|
||||
|
||||
logger.info(f"Cache is still fresh (less than {CACHE_TIMEOUT} hour(s) old)")
|
||||
return False
|
||||
|
||||
def fetch_french_pages():
|
||||
"""
|
||||
Fetch all French pages from the OSM wiki
|
||||
|
||||
Returns:
|
||||
list: List of dictionaries containing French page information
|
||||
"""
|
||||
logger.info(f"Fetching French pages from {FRENCH_PAGES_URL}")
|
||||
french_pages = []
|
||||
next_page_url = FRENCH_PAGES_URL
|
||||
|
||||
while next_page_url:
|
||||
try:
|
||||
response = requests.get(next_page_url)
|
||||
response.raise_for_status()
|
||||
soup = BeautifulSoup(response.text, 'html.parser')
|
||||
|
||||
# Find all links in the mw-allpages-body section
|
||||
links_container = soup.select_one('.mw-allpages-body')
|
||||
if links_container:
|
||||
links = links_container.select('li a')
|
||||
|
||||
for link in links:
|
||||
page_title = link.text.strip()
|
||||
page_url = WIKI_BASE_URL + link.get('href', '')
|
||||
|
||||
# Extract the key name (remove the FR: prefix)
|
||||
key_match = re.match(r'FR:(.*)', page_title)
|
||||
if key_match:
|
||||
key_name = key_match.group(1)
|
||||
|
||||
french_pages.append({
|
||||
'title': page_title,
|
||||
'key': key_name,
|
||||
'url': page_url,
|
||||
'has_translation': False # Will be updated later
|
||||
})
|
||||
|
||||
# Check if there's a next page
|
||||
next_link = soup.select_one('a.mw-nextlink')
|
||||
next_page_url = WIKI_BASE_URL + next_link.get('href') if next_link else None
|
||||
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error fetching French pages: {e}")
|
||||
break
|
||||
|
||||
logger.info(f"Found {len(french_pages)} French pages")
|
||||
return french_pages
|
||||
|
||||
def check_translations(french_pages):
|
||||
"""
|
||||
Check if each French page has translations in other languages
|
||||
|
||||
Args:
|
||||
french_pages (list): List of dictionaries containing French page information
|
||||
|
||||
Returns:
|
||||
list: Updated list with translation information
|
||||
"""
|
||||
logger.info("Checking for translations of French pages")
|
||||
|
||||
for i, page in enumerate(french_pages):
|
||||
if i % 10 == 0: # Log progress every 10 pages
|
||||
logger.info(f"Checking page {i+1}/{len(french_pages)}: {page['title']}")
|
||||
|
||||
try:
|
||||
# Construct the English page URL by removing the FR: prefix
|
||||
en_url = page['url'].replace('/wiki/FR:', '/wiki/')
|
||||
|
||||
# Check if the English page exists
|
||||
response = requests.head(en_url)
|
||||
|
||||
# If the page returns a 200 status code, it exists
|
||||
if response.status_code == 200:
|
||||
page['has_translation'] = True
|
||||
page['en_url'] = en_url
|
||||
else:
|
||||
page['has_translation'] = False
|
||||
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error checking translation for {page['title']}: {e}")
|
||||
# Assume no translation in case of error
|
||||
page['has_translation'] = False
|
||||
|
||||
# Filter to only include pages without translations
|
||||
untranslated_pages = [page for page in french_pages if not page['has_translation']]
|
||||
logger.info(f"Found {len(untranslated_pages)} French pages without translations")
|
||||
|
||||
return untranslated_pages
|
||||
|
||||
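# Illustrative sketch (not part of the original script): the per-page check performed in
# check_translations() above - strip the FR: prefix from the URL and probe the English
# page with a HEAD request. The page dict is a hypothetical example.
def _demo_check_single_translation(page=None):
    page = page or {
        'title': 'FR:Key:highway',
        'url': 'https://wiki.openstreetmap.org/wiki/FR:Key:highway',
        'has_translation': False,
    }
    en_url = page['url'].replace('/wiki/FR:', '/wiki/')
    try:
        response = requests.head(en_url)
        page['has_translation'] = response.status_code == 200
        if page['has_translation']:
            page['en_url'] = en_url
    except requests.exceptions.RequestException:
        page['has_translation'] = False
    return page
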
def save_untranslated_pages(untranslated_pages):
|
||||
"""
|
||||
Save the untranslated pages to a JSON file
|
||||
|
||||
Args:
|
||||
untranslated_pages (list): List of dictionaries containing untranslated page information
|
||||
|
||||
Returns:
|
||||
str: Path to the output file
|
||||
"""
|
||||
data = {
|
||||
'last_updated': datetime.now().isoformat(),
|
||||
'untranslated_pages': untranslated_pages
|
||||
}
|
||||
|
||||
with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, ensure_ascii=False, indent=2)
|
||||
|
||||
logger.info(f"Saved {len(untranslated_pages)} untranslated pages to {OUTPUT_FILE}")
|
||||
return OUTPUT_FILE
|
||||
|
||||
def main():
|
||||
"""Main function to execute the script"""
|
||||
parser = argparse.ArgumentParser(description="Find French OSM wiki pages without translations")
|
||||
parser.add_argument("--force", action="store_true", help="Force update even if cache is fresh")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Print results without saving to file")
|
||||
args = parser.parse_args()
|
||||
|
||||
logger.info("Starting find_untranslated_french_pages.py")
|
||||
|
||||
# Check if we should update the cache
|
||||
if args.force or should_update_cache() or args.dry_run:
|
||||
# Fetch all French pages
|
||||
french_pages = fetch_french_pages()
|
||||
|
||||
# Check which ones don't have translations
|
||||
untranslated_pages = check_translations(french_pages)
|
||||
|
||||
if args.dry_run:
|
||||
logger.info(f"Found {len(untranslated_pages)} French pages without translations:")
|
||||
for page in untranslated_pages[:10]: # Show only the first 10 in dry run
|
||||
logger.info(f"- {page['title']} ({page['url']})")
|
||||
if len(untranslated_pages) > 10:
|
||||
logger.info(f"... and {len(untranslated_pages) - 10} more")
|
||||
else:
|
||||
# Save the results
|
||||
output_file = save_untranslated_pages(untranslated_pages)
|
||||
logger.info(f"Results saved to {output_file}")
|
||||
else:
|
||||
logger.info("Using cached untranslated pages data")
|
||||
|
||||
logger.info("Script completed successfully")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
242  wiki_compare/fix_grammar_suggestions.py  Normal file
@@ -0,0 +1,242 @@
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
"""
|
||||
fix_grammar_suggestions.py
|
||||
|
||||
This script adds grammar suggestions to the "type" page in the outdated_pages.json file.
|
||||
It fetches the French content for the page, runs the grammar checker, and updates the file.
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import subprocess
|
||||
import tempfile
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s',
|
||||
datefmt='%Y-%m-%d %H:%M:%S'
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Constants
|
||||
OUTDATED_PAGES_FILE = "outdated_pages.json"
|
||||
TARGET_KEY = "type"
|
||||
|
||||
def load_outdated_pages():
|
||||
"""
|
||||
Load the outdated pages from the JSON file
|
||||
|
||||
Returns:
|
||||
dict: Dictionary containing outdated page information
|
||||
"""
|
||||
try:
|
||||
with open(OUTDATED_PAGES_FILE, 'r', encoding='utf-8') as f:
|
||||
data = json.load(f)
|
||||
logger.info(f"Successfully loaded outdated pages from {OUTDATED_PAGES_FILE}")
|
||||
return data
|
||||
except (IOError, json.JSONDecodeError) as e:
|
||||
logger.error(f"Error loading pages from {OUTDATED_PAGES_FILE}: {e}")
|
||||
return None
|
||||
|
||||
def save_outdated_pages(data):
|
||||
"""
|
||||
Save the outdated pages to the JSON file
|
||||
|
||||
Args:
|
||||
data (dict): Dictionary containing outdated page information
|
||||
"""
|
||||
try:
|
||||
with open(OUTDATED_PAGES_FILE, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, indent=2, ensure_ascii=False)
|
||||
logger.info(f"Successfully saved outdated pages to {OUTDATED_PAGES_FILE}")
|
||||
except IOError as e:
|
||||
logger.error(f"Error saving pages to {OUTDATED_PAGES_FILE}: {e}")
|
||||
|
||||
def fetch_wiki_page_content(url):
|
||||
"""
|
||||
Fetch the content of a wiki page
|
||||
|
||||
Args:
|
||||
url (str): URL of the wiki page
|
||||
|
||||
Returns:
|
||||
str: Content of the wiki page
|
||||
"""
|
||||
try:
|
||||
logger.info(f"Fetching content from {url}")
|
||||
response = requests.get(url)
|
||||
response.raise_for_status()
|
||||
|
||||
soup = BeautifulSoup(response.text, 'html.parser')
|
||||
|
||||
# Get the main content
|
||||
content = soup.select_one('#mw-content-text')
|
||||
if content:
|
||||
# Remove script and style elements
|
||||
for script in content.select('script, style'):
|
||||
script.extract()
|
||||
|
||||
# Remove .languages elements
|
||||
for languages_elem in content.select('.languages'):
|
||||
languages_elem.extract()
|
||||
|
||||
# Get text
|
||||
text = content.get_text(separator=' ', strip=True)
|
||||
logger.info(f"Successfully fetched content ({len(text)} characters)")
|
||||
return text
|
||||
else:
|
||||
logger.warning(f"Could not find content in page: {url}")
|
||||
return ""
|
||||
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error fetching wiki page content: {e}")
|
||||
return ""
|
||||
|
||||
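# Illustrative sketch (not part of the original script): the cleanup performed by
# fetch_wiki_page_content() above - keep #mw-content-text, drop script/style and the
# .languages bar, then flatten to text. The HTML below is a minimal hypothetical page.
def _demo_extract_text_from_sample_html():
    sample_html = """
    <div id="mw-content-text">
      <div class="languages">Deutsch English</div>
      <p>La clé <b>type</b> décrit le type de relation.</p>
      <script>console.log('ignored');</script>
    </div>
    """
    soup = BeautifulSoup(sample_html, 'html.parser')
    content = soup.select_one('#mw-content-text')
    for node in content.select('script, style'):
        node.extract()
    for node in content.select('.languages'):
        node.extract()
    # Roughly: "La clé type décrit le type de relation."
    return content.get_text(separator=' ', strip=True)
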
def check_grammar_with_grammalecte(text):
|
||||
"""
|
||||
Check grammar in French text using grammalecte-cli
|
||||
|
||||
Args:
|
||||
text (str): French text to check
|
||||
|
||||
Returns:
|
||||
list: List of grammar suggestions
|
||||
"""
|
||||
if not text or len(text.strip()) == 0:
|
||||
logger.warning("Empty text provided for grammar checking")
|
||||
return []
|
||||
|
||||
logger.info("Checking grammar with grammalecte-cli...")
|
||||
|
||||
try:
|
||||
# Create a temporary file with the text
|
||||
with tempfile.NamedTemporaryFile(mode='w', encoding='utf-8', suffix='.txt', delete=False) as temp_file:
|
||||
temp_file.write(text)
|
||||
temp_file_path = temp_file.name
|
||||
|
||||
# Run grammalecte-cli on the temporary file
|
||||
cmd = ['grammalecte-cli', '-f', temp_file_path, '-j', '-ctx', '-wss']
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
|
||||
# Parse the JSON output
|
||||
grammar_data = json.loads(result.stdout)
|
||||
|
||||
# Extract grammar errors from all paragraphs
|
||||
grammar_suggestions = []
|
||||
for paragraph in grammar_data.get('data', []):
|
||||
paragraph_index = paragraph.get('iParagraph', 0)
|
||||
|
||||
# Process grammar errors
|
||||
for error in paragraph.get('lGrammarErrors', []):
|
||||
suggestion = {
|
||||
'paragraph': paragraph_index,
|
||||
'start': error.get('nStart', 0),
|
||||
'end': error.get('nEnd', 0),
|
||||
'type': error.get('sType', ''),
|
||||
'message': error.get('sMessage', ''),
|
||||
'suggestions': error.get('aSuggestions', []),
|
||||
'text': error.get('sUnderlined', ''),
|
||||
'before': error.get('sBefore', ''),
|
||||
'after': error.get('sAfter', '')
|
||||
}
|
||||
grammar_suggestions.append(suggestion)
|
||||
|
||||
# Process spelling errors
|
||||
for error in paragraph.get('lSpellingErrors', []):
|
||||
suggestion = {
|
||||
'paragraph': paragraph_index,
|
||||
'start': error.get('nStart', 0),
|
||||
'end': error.get('nEnd', 0),
|
||||
'type': 'spelling',
|
||||
'message': 'Erreur d\'orthographe',
|
||||
'suggestions': error.get('aSuggestions', []),
|
||||
'text': error.get('sUnderlined', ''),
|
||||
'before': error.get('sBefore', ''),
|
||||
'after': error.get('sAfter', '')
|
||||
}
|
||||
grammar_suggestions.append(suggestion)
|
||||
|
||||
# Clean up the temporary file
|
||||
os.unlink(temp_file_path)
|
||||
|
||||
logger.info(f"Found {len(grammar_suggestions)} grammar/spelling suggestions")
|
||||
return grammar_suggestions
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Error running grammalecte-cli: {e}")
|
||||
logger.error(f"stdout: {e.stdout}")
|
||||
logger.error(f"stderr: {e.stderr}")
|
||||
return []
|
||||
|
||||
except json.JSONDecodeError as e:
|
||||
logger.error(f"Error parsing grammalecte-cli output: {e}")
|
||||
return []
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Unexpected error during grammar checking: {e}")
|
||||
return []
|
||||
|
||||
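# Illustrative sketch (not part of the original script): the shape of grammalecte-cli's
# JSON output as consumed by check_grammar_with_grammalecte() above, reduced to one
# paragraph with one grammar error. The error values are hypothetical.
def _demo_parse_grammalecte_output():
    grammar_data = {
        "data": [{
            "iParagraph": 1,
            "lGrammarErrors": [{
                "nStart": 3, "nEnd": 8, "sType": "conf",
                "sMessage": "Confusion probable.",
                "aSuggestions": ["clé"], "sUnderlined": "cle",
                "sBefore": "La ", "sAfter": " type",
            }],
            "lSpellingErrors": [],
        }]
    }
    suggestions = []
    for paragraph in grammar_data.get('data', []):
        for error in paragraph.get('lGrammarErrors', []):
            suggestions.append({
                'paragraph': paragraph.get('iParagraph', 0),
                'text': error.get('sUnderlined', ''),
                'suggestions': error.get('aSuggestions', []),
            })
    return suggestions  # [{'paragraph': 1, 'text': 'cle', 'suggestions': ['clé']}]
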
def main():
|
||||
"""Main function to execute the script"""
|
||||
logger.info("Starting fix_grammar_suggestions.py")
|
||||
|
||||
# Load outdated pages
|
||||
data = load_outdated_pages()
|
||||
if not data:
|
||||
logger.error("Failed to load outdated pages")
|
||||
return
|
||||
|
||||
# Find the "type" page in the regular_pages array
|
||||
type_page = None
|
||||
for i, page in enumerate(data.get('regular_pages', [])):
|
||||
if page.get('key') == TARGET_KEY:
|
||||
type_page = page
|
||||
type_page_index = i
|
||||
break
|
||||
|
||||
if not type_page:
|
||||
logger.error(f"Could not find page with key '{TARGET_KEY}'")
|
||||
return
|
||||
|
||||
# Get the French page URL
|
||||
fr_page = type_page.get('fr_page')
|
||||
if not fr_page:
|
||||
logger.error(f"No French page found for key '{TARGET_KEY}'")
|
||||
return
|
||||
|
||||
fr_url = fr_page.get('url')
|
||||
if not fr_url:
|
||||
logger.error(f"No URL found for French page of key '{TARGET_KEY}'")
|
||||
return
|
||||
|
||||
# Fetch the content of the French page
|
||||
content = fetch_wiki_page_content(fr_url)
|
||||
if not content:
|
||||
logger.error(f"Could not fetch content from {fr_url}")
|
||||
return
|
||||
|
||||
# Check grammar
|
||||
logger.info(f"Checking grammar for key '{TARGET_KEY}'")
|
||||
suggestions = check_grammar_with_grammalecte(content)
|
||||
if not suggestions:
|
||||
logger.warning("No grammar suggestions found or grammar checker not available")
|
||||
|
||||
# Add the grammar suggestions to the page
|
||||
type_page['grammar_suggestions'] = suggestions
|
||||
|
||||
# Update the page in the data
|
||||
data['regular_pages'][type_page_index] = type_page
|
||||
|
||||
# Save the updated data
|
||||
save_outdated_pages(data)
|
||||
|
||||
logger.info("Script completed successfully")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
1  wiki_compare/install_ubuntu.sh  Normal file
@@ -0,0 +1 @@
sudo apt install aspell aspell-fr grammalecte-cli
226  wiki_compare/post_outdated_page.py  Executable file
@@ -0,0 +1,226 @@
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
"""
|
||||
post_outdated_page.py
|
||||
|
||||
This script reads the outdated_pages.json file generated by wiki_compare.py,
|
||||
randomly selects an outdated French wiki page, and posts a message on Mastodon
|
||||
suggesting that the page needs updating.
|
||||
|
||||
Usage:
|
||||
python post_outdated_page.py [--dry-run]
|
||||
|
||||
Options:
|
||||
--dry-run Run the script without actually posting to Mastodon
|
||||
|
||||
Output:
|
||||
- A post on Mastodon about an outdated French wiki page
|
||||
- Log messages about the selected page and posting status
|
||||
"""
|
||||
|
||||
import json
|
||||
import random
|
||||
import argparse
|
||||
import logging
|
||||
import os
|
||||
from datetime import datetime
|
||||
import requests
|
||||
import re
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s',
|
||||
datefmt='%Y-%m-%d %H:%M:%S'
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Function to read variables from .env file
|
||||
def read_env_file(env_file_path=".env"):
|
||||
"""
|
||||
Read environment variables from a .env file
|
||||
|
||||
Args:
|
||||
env_file_path (str): Path to the .env file
|
||||
|
||||
Returns:
|
||||
dict: Dictionary of environment variables
|
||||
"""
|
||||
env_vars = {}
|
||||
|
||||
try:
|
||||
with open(env_file_path, 'r', encoding='utf-8') as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
# Skip comments and empty lines
|
||||
if not line or line.startswith('#'):
|
||||
continue
|
||||
|
||||
# Match variable assignments (KEY=VALUE)
|
||||
match = re.match(r'^([A-Za-z0-9_]+)=(.*)$', line)
|
||||
if match:
|
||||
key, value = match.groups()
|
||||
# Remove quotes if present
|
||||
value = value.strip('\'"')
|
||||
env_vars[key] = value
|
||||
|
||||
logger.info(f"Successfully loaded environment variables from {env_file_path}")
|
||||
return env_vars
|
||||
except IOError as e:
|
||||
logger.error(f"Error reading .env file {env_file_path}: {e}")
|
||||
return {}
|
||||
|
||||
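# Illustrative sketch (not part of the original script): the .env format expected by
# read_env_file() above - one KEY=VALUE per line, '#' comments ignored, quotes stripped.
# The token value is obviously a placeholder.
def _demo_read_env_file():
    import tempfile
    with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.env', delete=False) as f:
        f.write("# Mastodon credentials\n")
        f.write("MASTODON_ACCESS_TOKEN='not-a-real-token'\n")
        path = f.name
    return read_env_file(path)  # {'MASTODON_ACCESS_TOKEN': 'not-a-real-token'}
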
# Constants
|
||||
OUTDATED_PAGES_FILE = "outdated_pages.json"
|
||||
MASTODON_API_URL = "https://mastodon.cipherbliss.com/api/v1/statuses" # Replace with actual instance
|
||||
|
||||
# Read MASTODON_ACCESS_TOKEN from .env file
|
||||
env_vars = read_env_file(".env")
|
||||
if not env_vars and os.path.exists(os.path.join(os.path.dirname(__file__), ".env")):
|
||||
# Try with absolute path if relative path fails
|
||||
env_vars = read_env_file(os.path.join(os.path.dirname(__file__), ".env"))
|
||||
|
||||
MASTODON_ACCESS_TOKEN = env_vars.get("MASTODON_ACCESS_TOKEN") or os.environ.get("MASTODON_ACCESS_TOKEN")
|
||||
|
||||
def load_outdated_pages():
|
||||
"""
|
||||
Load the outdated pages from the JSON file
|
||||
|
||||
Returns:
|
||||
list: List of dictionaries containing outdated page information
|
||||
"""
|
||||
try:
|
||||
with open(OUTDATED_PAGES_FILE, 'r', encoding='utf-8') as f:
|
||||
pages = json.load(f)
|
||||
logger.info(f"Successfully loaded {len(pages)} outdated pages from {OUTDATED_PAGES_FILE}")
|
||||
return pages
|
||||
except (IOError, json.JSONDecodeError) as e:
|
||||
logger.error(f"Error loading outdated pages from {OUTDATED_PAGES_FILE}: {e}")
|
||||
return []
|
||||
|
||||
def select_random_outdated_page(pages):
|
||||
"""
|
||||
Randomly select an outdated French page from the list
|
||||
|
||||
Args:
|
||||
pages (list): List of dictionaries containing outdated page information
|
||||
|
||||
Returns:
|
||||
dict: Randomly selected outdated page or None if no suitable pages found
|
||||
"""
|
||||
# Filter pages to include only those with a French page (not missing)
|
||||
pages_with_fr = [page for page in pages if page.get('fr_page') is not None]
|
||||
|
||||
if not pages_with_fr:
|
||||
logger.warning("No outdated French pages found")
|
||||
return None
|
||||
|
||||
# Randomly select a page
|
||||
selected_page = random.choice(pages_with_fr)
|
||||
logger.info(f"Randomly selected page for key '{selected_page['key']}'")
|
||||
|
||||
return selected_page
|
||||
|
||||
def create_mastodon_post(page):
    """
    Create a Mastodon post about the outdated wiki page

    Args:
        page (dict): Dictionary containing outdated page information

    Returns:
        str: Formatted Mastodon post text
    """
    key = page['key']
    reason = page['reason']
    fr_url = page['fr_page']['url']
    en_url = page['en_page']['url']

    # Format the post
    post = f"""📝 La page wiki OSM pour la clé #{key} a besoin d'une mise à jour !

Raison : {reason}

Vous pouvez aider en mettant à jour la page française :
{fr_url}

Page anglaise de référence :
{en_url}

#OpenStreetMap #OSM #Wiki #Contribution #Traduction"""

    return post

def post_to_mastodon(post_text, dry_run=False):
|
||||
"""
|
||||
Post the message to Mastodon
|
||||
|
||||
Args:
|
||||
post_text (str): Text to post
|
||||
dry_run (bool): If True, don't actually post to Mastodon
|
||||
|
||||
Returns:
|
||||
bool: True if posting was successful or dry run, False otherwise
|
||||
"""
|
||||
if dry_run:
|
||||
logger.info("DRY RUN: Would have posted to Mastodon:")
|
||||
logger.info(post_text)
|
||||
return True
|
||||
|
||||
if not MASTODON_ACCESS_TOKEN:
|
||||
logger.error("MASTODON_ACCESS_TOKEN not found in .env file or environment variables")
|
||||
return False
|
||||
|
||||
headers = {
|
||||
"Authorization": f"Bearer {MASTODON_ACCESS_TOKEN}",
|
||||
"Content-Type": "application/json"
|
||||
}
|
||||
|
||||
data = {
|
||||
"status": post_text,
|
||||
"visibility": "public"
|
||||
}
|
||||
|
||||
try:
|
||||
response = requests.post(MASTODON_API_URL, headers=headers, json=data)
|
||||
response.raise_for_status()
|
||||
logger.info("Successfully posted to Mastodon")
|
||||
return True
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error posting to Mastodon: {e}")
|
||||
return False
|
||||
|
||||
def main():
|
||||
"""Main function to execute the script"""
|
||||
parser = argparse.ArgumentParser(description="Post about an outdated OSM wiki page on Mastodon")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Run without actually posting to Mastodon")
|
||||
args = parser.parse_args()
|
||||
|
||||
logger.info("Starting post_outdated_page.py")
|
||||
|
||||
# Load outdated pages
|
||||
outdated_pages = load_outdated_pages()
|
||||
if not outdated_pages:
|
||||
logger.error("No outdated pages found. Run wiki_compare.py first.")
|
||||
return
|
||||
|
||||
# Select a random outdated page
|
||||
selected_page = select_random_outdated_page(outdated_pages)
|
||||
if not selected_page:
|
||||
logger.error("Could not select an outdated page.")
|
||||
return
|
||||
|
||||
# Create the post text
|
||||
post_text = create_mastodon_post(selected_page)
|
||||
|
||||
# Post to Mastodon
|
||||
success = post_to_mastodon(post_text, args.dry_run)
|
||||
|
||||
if success:
|
||||
logger.info("Script completed successfully")
|
||||
else:
|
||||
logger.error("Script completed with errors")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
233  wiki_compare/propose_translation.py  Executable file
@@ -0,0 +1,233 @@
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
"""
|
||||
propose_translation.py
|
||||
|
||||
This script reads the outdated_pages.json file, selects a wiki page (by default the first one),
|
||||
and uses Ollama with the "mistral:7b" model to propose a translation of the page.
|
||||
The translation is saved in the "proposed_translation" property of the JSON file.
|
||||
|
||||
Usage:
|
||||
python propose_translation.py [--page KEY]
|
||||
|
||||
Options:
|
||||
--page KEY Specify the key of the page to translate (default: first page in the file)
|
||||
|
||||
Output:
|
||||
- Updated outdated_pages.json file with proposed translations
|
||||
"""
|
||||
|
||||
import json
|
||||
import argparse
|
||||
import logging
|
||||
import requests
|
||||
import os
|
||||
import sys
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s',
|
||||
datefmt='%Y-%m-%d %H:%M:%S'
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Constants
|
||||
OUTDATED_PAGES_FILE = "outdated_pages.json"
|
||||
OLLAMA_API_URL = "http://localhost:11434/api/generate"
|
||||
OLLAMA_MODEL = "mistral:7b"
|
||||
|
||||
def load_outdated_pages():
|
||||
"""
|
||||
Load the outdated pages from the JSON file
|
||||
|
||||
Returns:
|
||||
list: List of dictionaries containing outdated page information
|
||||
"""
|
||||
try:
|
||||
with open(OUTDATED_PAGES_FILE, 'r', encoding='utf-8') as f:
|
||||
pages = json.load(f)
|
||||
logger.info(f"Successfully loaded {len(pages)} pages from {OUTDATED_PAGES_FILE}")
|
||||
return pages
|
||||
except (IOError, json.JSONDecodeError) as e:
|
||||
logger.error(f"Error loading pages from {OUTDATED_PAGES_FILE}: {e}")
|
||||
return []
|
||||
|
||||
def save_to_json(data, filename):
|
||||
"""
|
||||
Save data to a JSON file
|
||||
|
||||
Args:
|
||||
data: Data to save
|
||||
filename (str): Name of the file
|
||||
"""
|
||||
try:
|
||||
with open(filename, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, indent=2, ensure_ascii=False)
|
||||
logger.info(f"Data saved to {filename}")
|
||||
except IOError as e:
|
||||
logger.error(f"Error saving data to {filename}: {e}")
|
||||
|
||||
def fetch_wiki_page_content(url):
|
||||
"""
|
||||
Fetch the content of a wiki page
|
||||
|
||||
Args:
|
||||
url (str): URL of the wiki page
|
||||
|
||||
Returns:
|
||||
str: Content of the wiki page
|
||||
"""
|
||||
try:
|
||||
response = requests.get(url)
|
||||
response.raise_for_status()
|
||||
|
||||
soup = BeautifulSoup(response.text, 'html.parser')
|
||||
|
||||
# Get the main content
|
||||
content = soup.select_one('#mw-content-text')
|
||||
if content:
|
||||
# Remove script and style elements
|
||||
for script in content.select('script, style'):
|
||||
script.extract()
|
||||
|
||||
# Remove .languages elements
|
||||
for languages_elem in content.select('.languages'):
|
||||
languages_elem.extract()
|
||||
|
||||
# Get text
|
||||
text = content.get_text(separator=' ', strip=True)
|
||||
return text
|
||||
else:
|
||||
logger.warning(f"Could not find content in page: {url}")
|
||||
return ""
|
||||
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error fetching wiki page content: {e}")
|
||||
return ""
|
||||
|
||||
def translate_with_ollama(text, model=OLLAMA_MODEL):
|
||||
"""
|
||||
Translate text using Ollama
|
||||
|
||||
Args:
|
||||
text (str): Text to translate
|
||||
model (str): Ollama model to use
|
||||
|
||||
Returns:
|
||||
str: Translated text
|
||||
"""
|
||||
prompt = f"""
|
||||
Tu es un traducteur professionnel spécialisé dans la traduction de documentation technique de l'anglais vers le français.
|
||||
Traduis le texte suivant de l'anglais vers le français. Conserve le formatage et la structure du texte original.
|
||||
Ne traduis pas les noms propres, les URLs, et les termes techniques spécifiques à OpenStreetMap.
|
||||
|
||||
Texte à traduire:
|
||||
{text}
|
||||
"""
|
||||
|
||||
try:
|
||||
logger.info(f"Sending request to Ollama with model {model}")
|
||||
|
||||
payload = {
|
||||
"model": model,
|
||||
"prompt": prompt,
|
||||
"stream": False
|
||||
}
|
||||
|
||||
response = requests.post(OLLAMA_API_URL, json=payload)
|
||||
response.raise_for_status()
|
||||
|
||||
result = response.json()
|
||||
translation = result.get('response', '')
|
||||
|
||||
logger.info(f"Successfully received translation from Ollama")
|
||||
return translation
|
||||
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error translating with Ollama: {e}")
|
||||
return ""
|
||||
|
||||
def select_page_for_translation(pages, key=None):
|
||||
"""
|
||||
Select a page for translation
|
||||
|
||||
Args:
|
||||
pages (list): List of dictionaries containing page information
|
||||
key (str): Key of the page to select (if None, select the first page)
|
||||
|
||||
Returns:
|
||||
dict: Selected page or None if no suitable page found
|
||||
"""
|
||||
if not pages:
|
||||
logger.warning("No pages found that need translation")
|
||||
return None
|
||||
|
||||
if key:
|
||||
# Find the page with the specified key
|
||||
for page in pages:
|
||||
if page.get('key') == key:
|
||||
logger.info(f"Selected page for key '{key}' for translation")
|
||||
return page
|
||||
|
||||
logger.warning(f"No page found with key '{key}'")
|
||||
return None
|
||||
else:
|
||||
# Select the first page
|
||||
selected_page = pages[0]
|
||||
logger.info(f"Selected first page (key '{selected_page['key']}') for translation")
|
||||
return selected_page
|
||||
|
||||
def main():
|
||||
"""Main function to execute the script"""
|
||||
parser = argparse.ArgumentParser(description="Propose a translation for an OSM wiki page using Ollama")
|
||||
parser.add_argument("--page", help="Key of the page to translate (default: first page in the file)")
|
||||
args = parser.parse_args()
|
||||
|
||||
logger.info("Starting propose_translation.py")
|
||||
|
||||
# Load pages
|
||||
pages = load_outdated_pages()
|
||||
if not pages:
|
||||
logger.error("No pages found. Run wiki_compare.py first.")
|
||||
sys.exit(1)
|
||||
|
||||
# Select a page for translation
|
||||
selected_page = select_page_for_translation(pages, args.page)
|
||||
if not selected_page:
|
||||
logger.error("Could not select a page for translation.")
|
||||
sys.exit(1)
|
||||
|
||||
# Get the English page URL
|
||||
en_url = selected_page.get('en_page', {}).get('url')
|
||||
if not en_url:
|
||||
logger.error(f"No English page URL found for key '{selected_page['key']}'")
|
||||
sys.exit(1)
|
||||
|
||||
# Fetch the content of the English page
|
||||
logger.info(f"Fetching content from {en_url}")
|
||||
content = fetch_wiki_page_content(en_url)
|
||||
if not content:
|
||||
logger.error(f"Could not fetch content from {en_url}")
|
||||
sys.exit(1)
|
||||
|
||||
# Translate the content
|
||||
logger.info(f"Translating content for key '{selected_page['key']}'")
|
||||
translation = translate_with_ollama(content)
|
||||
if not translation:
|
||||
logger.error("Could not translate content")
|
||||
sys.exit(1)
|
||||
|
||||
# Save the translation in the JSON file
|
||||
logger.info(f"Saving translation for key '{selected_page['key']}'")
|
||||
selected_page['proposed_translation'] = translation
|
||||
|
||||
# Save the updated data back to the file
|
||||
save_to_json(pages, OUTDATED_PAGES_FILE)
|
||||
|
||||
logger.info("Script completed successfully")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
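A quick way to sanity-check the Ollama endpoint used by translate_with_ollama() before running the whole script — a minimal sketch, assuming a local Ollama server on port 11434 with the mistral:7b model already pulled (same constants as OLLAMA_API_URL and OLLAMA_MODEL above):

import requests

payload = {
    "model": "mistral:7b",          # same model as OLLAMA_MODEL
    "prompt": "Traduis en français : 'The building key describes buildings.'",
    "stream": False,                # ask for a single JSON response instead of a stream
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json().get("response", ""))   # the generated translation text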
381
wiki_compare/suggest_grammar_improvements.py
Executable file
|
@@ -0,0 +1,381 @@
|
|||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
"""
|
||||
suggest_grammar_improvements.py
|
||||
|
||||
This script reads the outdated_pages.json file, selects a wiki page (by default the first one),
|
||||
and uses grammalecte to check the grammar of the French page content.
|
||||
The grammar suggestions are saved in the "grammar_suggestions" property of the JSON file.
|
||||
|
||||
The script is compatible with different versions of the grammalecte API:
|
||||
- For newer versions where GrammarChecker is directly in the grammalecte module
|
||||
- For older versions where GrammarChecker is in the grammalecte.fr module
|
||||
|
||||
Usage:
|
||||
python suggest_grammar_improvements.py [--page KEY]
|
||||
|
||||
Options:
|
||||
--page KEY Specify the key of the page to check (default: first page in the file)
|
||||
|
||||
Output:
|
||||
- Updated outdated_pages.json file with grammar suggestions
|
||||
"""
|
||||
|
||||
import json
|
||||
import argparse
|
||||
import logging
|
||||
import requests
|
||||
import os
|
||||
import sys
|
||||
import subprocess
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
try:
|
||||
import grammalecte
|
||||
import grammalecte.text as txt
|
||||
|
||||
# Check if GrammarChecker is available directly in the grammalecte module (newer versions)
|
||||
try:
|
||||
from grammalecte import GrammarChecker
|
||||
GRAMMALECTE_DIRECT_API = True
|
||||
except ImportError:
|
||||
# Try the older API structure with fr submodule
|
||||
try:
|
||||
import grammalecte.fr as gr_fr
|
||||
GRAMMALECTE_DIRECT_API = False
|
||||
except ImportError:
|
||||
# Neither API is available
|
||||
raise ImportError("Could not import GrammarChecker from grammalecte")
|
||||
|
||||
GRAMMALECTE_AVAILABLE = True
|
||||
except ImportError:
|
||||
GRAMMALECTE_AVAILABLE = False
|
||||
GRAMMALECTE_DIRECT_API = False
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s',
|
||||
datefmt='%Y-%m-%d %H:%M:%S'
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Constants
|
||||
OUTDATED_PAGES_FILE = "outdated_pages.json"
|
||||
|
||||
def load_outdated_pages():
|
||||
"""
|
||||
Load the outdated pages from the JSON file
|
||||
|
||||
Returns:
|
||||
list: List of dictionaries containing outdated page information
|
||||
"""
|
||||
try:
|
||||
with open(OUTDATED_PAGES_FILE, 'r', encoding='utf-8') as f:
|
||||
pages = json.load(f)
|
||||
logger.info(f"Successfully loaded {len(pages)} pages from {OUTDATED_PAGES_FILE}")
|
||||
return pages
|
||||
except (IOError, json.JSONDecodeError) as e:
|
||||
logger.error(f"Error loading pages from {OUTDATED_PAGES_FILE}: {e}")
|
||||
return []
|
||||
|
||||
def save_to_json(data, filename):
|
||||
"""
|
||||
Save data to a JSON file
|
||||
|
||||
Args:
|
||||
data: Data to save
|
||||
filename (str): Name of the file
|
||||
"""
|
||||
try:
|
||||
with open(filename, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, indent=2, ensure_ascii=False)
|
||||
logger.info(f"Data saved to {filename}")
|
||||
except IOError as e:
|
||||
logger.error(f"Error saving data to {filename}: {e}")
|
||||
|
||||
def fetch_wiki_page_content(url):
|
||||
"""
|
||||
Fetch the content of a wiki page
|
||||
|
||||
Args:
|
||||
url (str): URL of the wiki page
|
||||
|
||||
Returns:
|
||||
str: Content of the wiki page
|
||||
"""
|
||||
try:
|
||||
response = requests.get(url)
|
||||
response.raise_for_status()
|
||||
|
||||
soup = BeautifulSoup(response.text, 'html.parser')
|
||||
|
||||
# Get the main content
|
||||
content = soup.select_one('#mw-content-text')
|
||||
if content:
|
||||
# Remove script and style elements
|
||||
for script in content.select('script, style'):
|
||||
script.extract()
|
||||
|
||||
# Remove .languages elements
|
||||
for languages_elem in content.select('.languages'):
|
||||
languages_elem.extract()
|
||||
|
||||
# Get text
|
||||
text = content.get_text(separator=' ', strip=True)
|
||||
return text
|
||||
else:
|
||||
logger.warning(f"Could not find content in page: {url}")
|
||||
return ""
|
||||
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error fetching wiki page content: {e}")
|
||||
return ""
|
||||
|
||||
def check_grammar_with_grammalecte(text):
|
||||
"""
|
||||
Check grammar using grammalecte
|
||||
|
||||
Args:
|
||||
text (str): Text to check
|
||||
|
||||
Returns:
|
||||
list: List of grammar suggestions
|
||||
"""
|
||||
if not GRAMMALECTE_AVAILABLE:
|
||||
logger.error("Grammalecte is not installed. Please install it with: pip install grammalecte")
|
||||
return []
|
||||
|
||||
try:
|
||||
logger.info("Checking grammar with grammalecte")
|
||||
|
||||
# Initialize grammalecte based on which API version is available
|
||||
if GRAMMALECTE_DIRECT_API:
|
||||
# New API: GrammarChecker is directly in grammalecte module
|
||||
logger.info("Using direct GrammarChecker API")
|
||||
gce = GrammarChecker("fr")
|
||||
|
||||
# Split text into paragraphs
|
||||
paragraphs = txt.getParagraph(text)
|
||||
|
||||
# Check grammar for each paragraph
|
||||
suggestions = []
|
||||
for i, paragraph in enumerate(paragraphs):
|
||||
if paragraph.strip():
|
||||
# Use getParagraphErrors method
|
||||
errors = gce.getParagraphErrors(paragraph)
|
||||
for error in errors:
|
||||
# Filter out spelling errors if needed
|
||||
if "sType" in error and error["sType"] != "WORD" and error.get("bError", True):
|
||||
suggestion = {
|
||||
"paragraph": i + 1,
|
||||
"start": error.get("nStart", 0),
|
||||
"end": error.get("nEnd", 0),
|
||||
"type": error.get("sType", ""),
|
||||
"message": error.get("sMessage", ""),
|
||||
"suggestions": error.get("aSuggestions", []),
|
||||
"context": paragraph[max(0, error.get("nStart", 0) - 20):min(len(paragraph), error.get("nEnd", 0) + 20)]
|
||||
}
|
||||
suggestions.append(suggestion)
|
||||
else:
|
||||
# Old API: GrammarChecker is in grammalecte.fr module
|
||||
logger.info("Using legacy grammalecte.fr.GrammarChecker API")
|
||||
gce = gr_fr.GrammarChecker("fr")
|
||||
|
||||
# Split text into paragraphs
|
||||
paragraphs = txt.getParagraph(text)
|
||||
|
||||
# Check grammar for each paragraph
|
||||
suggestions = []
|
||||
for i, paragraph in enumerate(paragraphs):
|
||||
if paragraph.strip():
|
||||
# Use parse method for older API
|
||||
for error in gce.parse(paragraph, "FR", False):
|
||||
if error["sType"] != "WORD" and error["bError"]:
|
||||
suggestion = {
|
||||
"paragraph": i + 1,
|
||||
"start": error["nStart"],
|
||||
"end": error["nEnd"],
|
||||
"type": error["sType"],
|
||||
"message": error["sMessage"],
|
||||
"suggestions": error.get("aSuggestions", []),
|
||||
"context": paragraph[max(0, error["nStart"] - 20):min(len(paragraph), error["nEnd"] + 20)]
|
||||
}
|
||||
suggestions.append(suggestion)
|
||||
|
||||
logger.info(f"Found {len(suggestions)} grammar suggestions")
|
||||
return suggestions
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error checking grammar with grammalecte: {e}")
|
||||
return []
|
||||
|
||||
def check_grammar_with_cli(text):
|
||||
"""
|
||||
Check grammar using grammalecte-cli command
|
||||
|
||||
Args:
|
||||
text (str): Text to check
|
||||
|
||||
Returns:
|
||||
list: List of grammar suggestions
|
||||
"""
|
||||
try:
|
||||
logger.info("Checking grammar with grammalecte-cli")
|
||||
|
||||
# Create a temporary file with the text
|
||||
temp_file = "temp_text_for_grammar_check.txt"
|
||||
with open(temp_file, 'w', encoding='utf-8') as f:
|
||||
f.write(text)
|
||||
|
||||
# Run grammalecte-cli
|
||||
cmd = ["grammalecte-cli", "--json", "--file", temp_file]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, encoding='utf-8')
|
||||
|
||||
# Remove temporary file
|
||||
if os.path.exists(temp_file):
|
||||
os.remove(temp_file)
|
||||
|
||||
if result.returncode != 0:
|
||||
logger.error(f"Error running grammalecte-cli: {result.stderr}")
|
||||
return []
|
||||
|
||||
# Parse JSON output
|
||||
output = json.loads(result.stdout)
|
||||
|
||||
# Extract grammar suggestions
|
||||
suggestions = []
|
||||
for paragraph_data in output.get("data", []):
|
||||
paragraph_index = paragraph_data.get("iParagraph", 0)
|
||||
for error in paragraph_data.get("lGrammarErrors", []):
|
||||
suggestion = {
|
||||
"paragraph": paragraph_index + 1,
|
||||
"start": error.get("nStart", 0),
|
||||
"end": error.get("nEnd", 0),
|
||||
"type": error.get("sType", ""),
|
||||
"message": error.get("sMessage", ""),
|
||||
"suggestions": error.get("aSuggestions", []),
|
||||
"context": error.get("sContext", "")
|
||||
}
|
||||
suggestions.append(suggestion)
|
||||
|
||||
logger.info(f"Found {len(suggestions)} grammar suggestions")
|
||||
return suggestions
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error checking grammar with grammalecte-cli: {e}")
|
||||
return []
|
||||
|
||||
def check_grammar(text):
|
||||
"""
|
||||
Check grammar using available method (Python library or CLI)
|
||||
|
||||
Args:
|
||||
text (str): Text to check
|
||||
|
||||
Returns:
|
||||
list: List of grammar suggestions
|
||||
"""
|
||||
# Try using the Python library first
|
||||
if GRAMMALECTE_AVAILABLE:
|
||||
return check_grammar_with_grammalecte(text)
|
||||
|
||||
# Fall back to CLI if available
|
||||
try:
|
||||
# Check if grammalecte-cli is available
|
||||
subprocess.run(["grammalecte-cli", "--help"], capture_output=True)
|
||||
return check_grammar_with_cli(text)
|
||||
except (subprocess.SubprocessError, FileNotFoundError):
|
||||
logger.error("Neither grammalecte Python package nor grammalecte-cli is available.")
|
||||
logger.error("Please install grammalecte with: pip install grammalecte")
|
||||
return []
|
||||
|
||||
def select_page_for_grammar_check(pages, key=None):
|
||||
"""
|
||||
Select a page for grammar checking
|
||||
|
||||
Args:
|
||||
pages (list): List of dictionaries containing page information
|
||||
key (str): Key of the page to select (if None, select the first page)
|
||||
|
||||
Returns:
|
||||
dict: Selected page or None if no suitable page found
|
||||
"""
|
||||
if not pages:
|
||||
logger.warning("No pages found that need grammar checking")
|
||||
return None
|
||||
|
||||
if key:
|
||||
# Find the page with the specified key
|
||||
for page in pages:
|
||||
if page.get('key') == key:
|
||||
# Check if the page has a French version
|
||||
if page.get('fr_page') is None:
|
||||
logger.warning(f"Page with key '{key}' does not have a French version")
|
||||
return None
|
||||
logger.info(f"Selected page for key '{key}' for grammar checking")
|
||||
return page
|
||||
|
||||
logger.warning(f"No page found with key '{key}'")
|
||||
return None
|
||||
else:
|
||||
# Select the first page that has a French version
|
||||
for page in pages:
|
||||
if page.get('fr_page') is not None:
|
||||
logger.info(f"Selected first page with French version (key '{page['key']}') for grammar checking")
|
||||
return page
|
||||
|
||||
logger.warning("No pages found with French versions")
|
||||
return None
|
||||
|
||||
def main():
|
||||
"""Main function to execute the script"""
|
||||
parser = argparse.ArgumentParser(description="Suggest grammar improvements for an OSM wiki page using grammalecte")
|
||||
parser.add_argument("--page", help="Key of the page to check (default: first page with a French version)")
|
||||
args = parser.parse_args()
|
||||
|
||||
logger.info("Starting suggest_grammar_improvements.py")
|
||||
|
||||
# Load pages
|
||||
pages = load_outdated_pages()
|
||||
if not pages:
|
||||
logger.error("No pages found. Run wiki_compare.py first.")
|
||||
sys.exit(1)
|
||||
|
||||
# Select a page for grammar checking
|
||||
selected_page = select_page_for_grammar_check(pages, args.page)
|
||||
if not selected_page:
|
||||
logger.error("Could not select a page for grammar checking.")
|
||||
sys.exit(1)
|
||||
|
||||
# Get the French page URL
|
||||
fr_url = selected_page.get('fr_page', {}).get('url')
|
||||
if not fr_url:
|
||||
logger.error(f"No French page URL found for key '{selected_page['key']}'")
|
||||
sys.exit(1)
|
||||
|
||||
# Fetch the content of the French page
|
||||
logger.info(f"Fetching content from {fr_url}")
|
||||
content = fetch_wiki_page_content(fr_url)
|
||||
if not content:
|
||||
logger.error(f"Could not fetch content from {fr_url}")
|
||||
sys.exit(1)
|
||||
|
||||
# Check grammar
|
||||
logger.info(f"Checking grammar for key '{selected_page['key']}'")
|
||||
suggestions = check_grammar(content)
|
||||
if not suggestions:
|
||||
logger.warning("No grammar suggestions found or grammar checker not available")
|
||||
|
||||
# Save the grammar suggestions in the JSON file
|
||||
logger.info(f"Saving grammar suggestions for key '{selected_page['key']}'")
|
||||
selected_page['grammar_suggestions'] = suggestions
|
||||
|
||||
# Save the updated data back to the file
|
||||
save_to_json(pages, OUTDATED_PAGES_FILE)
|
||||
|
||||
logger.info("Script completed successfully")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
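To verify the CLI fallback outside the script, here is a minimal sketch mirroring check_grammar_with_cli(); it assumes grammalecte-cli is on the PATH and reuses the same --json/--file flags as above:

import json
import subprocess
import tempfile

# Write a short French sentence to a temporary file and ask grammalecte-cli for a JSON report.
with tempfile.NamedTemporaryFile("w", suffix=".txt", encoding="utf-8", delete=False) as tmp:
    tmp.write("Les pages wiki sont traduit en français.")
    temp_path = tmp.name

result = subprocess.run(
    ["grammalecte-cli", "--json", "--file", temp_path],
    capture_output=True, text=True, encoding="utf-8",
)
if result.returncode == 0:
    report = json.loads(result.stdout)
    for para in report.get("data", []):
        # Same fields the script reads: paragraph index and its grammar errors
        print(para.get("iParagraph"), len(para.get("lGrammarErrors", [])))
else:
    print(result.stderr)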
212
wiki_compare/suggest_translation.py
Executable file
|
@@ -0,0 +1,212 @@
|
|||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
"""
|
||||
suggest_translation.py
|
||||
|
||||
This script reads the outdated_pages.json file generated by wiki_compare.py,
|
||||
identifies English wiki pages that don't have a French translation,
|
||||
and posts a message on Mastodon suggesting that the page needs translation.
|
||||
|
||||
Usage:
|
||||
python suggest_translation.py [--dry-run]
|
||||
|
||||
Options:
|
||||
--dry-run Run the script without actually posting to Mastodon
|
||||
|
||||
Output:
|
||||
- A post on Mastodon suggesting a wiki page for translation
|
||||
- Log messages about the selected page and posting status
|
||||
"""
|
||||
|
||||
import json
|
||||
import random
|
||||
import argparse
|
||||
import logging
|
||||
import os
|
||||
from datetime import datetime
|
||||
import requests
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s',
|
||||
datefmt='%Y-%m-%d %H:%M:%S'
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Constants
|
||||
OUTDATED_PAGES_FILE = "outdated_pages.json"
|
||||
MASTODON_API_URL = "https://mastodon.instance/api/v1/statuses" # Replace with actual instance
|
||||
MASTODON_ACCESS_TOKEN = os.environ.get("MASTODON_ACCESS_TOKEN")
|
||||
|
||||
def load_outdated_pages():
|
||||
"""
|
||||
Load the outdated pages from the JSON file
|
||||
|
||||
Returns:
|
||||
list: List of dictionaries containing outdated page information
|
||||
"""
|
||||
try:
|
||||
with open(OUTDATED_PAGES_FILE, 'r', encoding='utf-8') as f:
|
||||
pages = json.load(f)
|
||||
logger.info(f"Successfully loaded {len(pages)} pages from {OUTDATED_PAGES_FILE}")
|
||||
return pages
|
||||
except (IOError, json.JSONDecodeError) as e:
|
||||
logger.error(f"Error loading pages from {OUTDATED_PAGES_FILE}: {e}")
|
||||
return []
|
||||
|
||||
def find_missing_translations(pages):
|
||||
"""
|
||||
Find English pages that don't have a French translation
|
||||
|
||||
Args:
|
||||
pages (list): List of dictionaries containing page information
|
||||
|
||||
Returns:
|
||||
list: List of pages that need translation
|
||||
"""
|
||||
# Filter pages to include only those with a missing French page
|
||||
missing_translations = [page for page in pages if
|
||||
page.get('reason') == 'French page missing' and
|
||||
page.get('en_page') is not None and
|
||||
page.get('fr_page') is None]
|
||||
|
||||
logger.info(f"Found {len(missing_translations)} pages without French translation")
|
||||
return missing_translations
|
||||
|
||||
def select_random_page_for_translation(pages):
|
||||
"""
|
||||
Randomly select a page for translation from the list
|
||||
|
||||
Args:
|
||||
pages (list): List of dictionaries containing page information
|
||||
|
||||
Returns:
|
||||
dict: Randomly selected page or None if no suitable pages found
|
||||
"""
|
||||
if not pages:
|
||||
logger.warning("No pages found that need translation")
|
||||
return None
|
||||
|
||||
# Randomly select a page
|
||||
selected_page = random.choice(pages)
|
||||
logger.info(f"Randomly selected page for key '{selected_page['key']}' for translation")
|
||||
|
||||
return selected_page
|
||||
|
||||
def create_mastodon_post(page):
|
||||
"""
|
||||
Create a Mastodon post suggesting a page for translation
|
||||
|
||||
Args:
|
||||
page (dict): Dictionary containing page information
|
||||
|
||||
Returns:
|
||||
str: Formatted Mastodon post text
|
||||
"""
|
||||
key = page['key']
|
||||
en_url = page['en_page']['url']
|
||||
fr_url = en_url.replace('/wiki/Key:', '/wiki/FR:Key:')
|
||||
|
||||
# Get word count and sections from English page
|
||||
word_count = page['en_page']['word_count']
|
||||
sections = page['en_page']['sections']
|
||||
|
||||
# Format the post
|
||||
post = f"""🔍 Clé OSM sans traduction française : #{key}
|
||||
|
||||
Cette page wiki importante n'a pas encore de traduction française !
|
||||
|
||||
📊 Statistiques de la page anglaise :
|
||||
• {word_count} mots
|
||||
• {sections} sections
|
||||
|
||||
Vous pouvez aider en créant la traduction française ici :
|
||||
{fr_url}
|
||||
|
||||
Page anglaise à traduire :
|
||||
{en_url}
|
||||
|
||||
#OpenStreetMap #OSM #Wiki #Traduction #Contribution"""
|
||||
|
||||
return post
|
||||
|
||||
def post_to_mastodon(post_text, dry_run=False):
|
||||
"""
|
||||
Post the message to Mastodon
|
||||
|
||||
Args:
|
||||
post_text (str): Text to post
|
||||
dry_run (bool): If True, don't actually post to Mastodon
|
||||
|
||||
Returns:
|
||||
bool: True if posting was successful or dry run, False otherwise
|
||||
"""
|
||||
if dry_run:
|
||||
logger.info("DRY RUN: Would have posted to Mastodon:")
|
||||
logger.info(post_text)
|
||||
return True
|
||||
|
||||
if not MASTODON_ACCESS_TOKEN:
|
||||
logger.error("MASTODON_ACCESS_TOKEN environment variable not set")
|
||||
return False
|
||||
|
||||
headers = {
|
||||
"Authorization": f"Bearer {MASTODON_ACCESS_TOKEN}",
|
||||
"Content-Type": "application/json"
|
||||
}
|
||||
|
||||
data = {
|
||||
"status": post_text,
|
||||
"visibility": "public"
|
||||
}
|
||||
|
||||
try:
|
||||
response = requests.post(MASTODON_API_URL, headers=headers, json=data)
|
||||
response.raise_for_status()
|
||||
logger.info("Successfully posted to Mastodon")
|
||||
return True
|
||||
except requests.exceptions.RequestException as e:
|
||||
logger.error(f"Error posting to Mastodon: {e}")
|
||||
return False
|
||||
|
||||
def main():
|
||||
"""Main function to execute the script"""
|
||||
parser = argparse.ArgumentParser(description="Suggest an OSM wiki page for translation on Mastodon")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Run without actually posting to Mastodon")
|
||||
args = parser.parse_args()
|
||||
|
||||
logger.info("Starting suggest_translation.py")
|
||||
|
||||
# Load pages
|
||||
pages = load_outdated_pages()
|
||||
if not pages:
|
||||
logger.error("No pages found. Run wiki_compare.py first.")
|
||||
return
|
||||
|
||||
# Find pages that need translation
|
||||
pages_for_translation = find_missing_translations(pages)
|
||||
if not pages_for_translation:
|
||||
logger.error("No pages found that need translation.")
|
||||
return
|
||||
|
||||
# Select a random page for translation
|
||||
selected_page = select_random_page_for_translation(pages_for_translation)
|
||||
if not selected_page:
|
||||
logger.error("Could not select a page for translation.")
|
||||
return
|
||||
|
||||
# Create the post text
|
||||
post_text = create_mastodon_post(selected_page)
|
||||
|
||||
# Post to Mastodon
|
||||
success = post_to_mastodon(post_text, args.dry_run)
|
||||
|
||||
if success:
|
||||
logger.info("Script completed successfully")
|
||||
else:
|
||||
logger.error("Script completed with errors")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
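Before scheduling the script, the Mastodon call made in post_to_mastodon() can be exercised on its own — a minimal sketch, assuming MASTODON_ACCESS_TOKEN is set in the environment and the placeholder instance URL is replaced with a real one:

import os
import requests

token = os.environ["MASTODON_ACCESS_TOKEN"]            # token needs the write:statuses scope
resp = requests.post(
    "https://mastodon.instance/api/v1/statuses",        # placeholder, as in MASTODON_API_URL above
    headers={"Authorization": f"Bearer {token}"},
    json={"status": "Test de publication wiki_compare", "visibility": "unlisted"},
)
resp.raise_for_status()
print(resp.json().get("url"))                            # URL of the created status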
70
wiki_compare/test_json.py
Normal file
|
@@ -0,0 +1,70 @@
|
|||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
"""
|
||||
test_json.py
|
||||
|
||||
This script tests writing a JSON file with some test data.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
from datetime import datetime
|
||||
|
||||
# Test data
|
||||
test_data = {
|
||||
"last_updated": datetime.now().isoformat(),
|
||||
"recent_changes": [
|
||||
{
|
||||
"page_name": "Test Page 1",
|
||||
"page_url": "https://example.com/test1",
|
||||
"timestamp": "12:34",
|
||||
"user": "Test User 1",
|
||||
"comment": "Test comment 1",
|
||||
"change_size": "+123"
|
||||
},
|
||||
{
|
||||
"page_name": "Test Page 2",
|
||||
"page_url": "https://example.com/test2",
|
||||
"timestamp": "23:45",
|
||||
"user": "Test User 2",
|
||||
"comment": "Test comment 2",
|
||||
"change_size": "-456"
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
# Output file
|
||||
output_file = "test_recent_changes.json"
|
||||
|
||||
# Write the data to the file
|
||||
print(f"Writing test data to {output_file}")
|
||||
with open(output_file, 'w', encoding='utf-8') as f:
|
||||
json.dump(test_data, f, indent=2, ensure_ascii=False)
|
||||
|
||||
# Check if the file was created
|
||||
if os.path.exists(output_file):
|
||||
file_size = os.path.getsize(output_file)
|
||||
print(f"File {output_file} created, size: {file_size} bytes")
|
||||
|
||||
# Read the content of the file to verify
|
||||
with open(output_file, 'r', encoding='utf-8') as f:
|
||||
file_content = f.read()
|
||||
print(f"File content: {file_content}")
|
||||
else:
|
||||
print(f"Failed to create file {output_file}")
|
||||
|
||||
# Copy the file to the public directory
|
||||
public_file = os.path.join(os.path.dirname(os.path.dirname(output_file)), 'public', os.path.basename(output_file))
|
||||
print(f"Copying {output_file} to {public_file}")
|
||||
import shutil
|
||||
shutil.copy2(output_file, public_file)
|
||||
|
||||
# Check if the public file was created
|
||||
if os.path.exists(public_file):
|
||||
public_size = os.path.getsize(public_file)
|
||||
print(f"Public file {public_file} created, size: {public_size} bytes")
|
||||
else:
|
||||
print(f"Failed to create public file {public_file}")
|
||||
|
||||
print("Script completed successfully")
|
1158
wiki_compare/wiki_compare.py
Executable file
File diff suppressed because it is too large
107
wiki_compare/wiki_pages.csv
Normal file
|
@@ -0,0 +1,107 @@
|
|||
key,language,url,last_modified,sections,word_count,link_count,media_count,staleness_score,description_img_url
|
||||
building,en,https://wiki.openstreetmap.org/wiki/Key:building,2025-06-10,31,3774,627,158,8.91,https://wiki.openstreetmap.org/w/images/thumb/6/61/Emptyhouse.jpg/200px-Emptyhouse.jpg
|
||||
building,fr,https://wiki.openstreetmap.org/wiki/FR:Key:building,2025-05-22,25,3181,544,155,8.91,https://wiki.openstreetmap.org/w/images/thumb/6/61/Emptyhouse.jpg/200px-Emptyhouse.jpg
|
||||
source,en,https://wiki.openstreetmap.org/wiki/Key:source,2025-08-12,27,2752,314,42,113.06,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
source,fr,https://wiki.openstreetmap.org/wiki/FR:Key:source,2024-02-07,23,2593,230,35,113.06,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
highway,en,https://wiki.openstreetmap.org/wiki/Key:highway,2025-04-10,30,4126,780,314,20.35,https://upload.wikimedia.org/wikipedia/commons/thumb/7/78/Roads_in_Switzerland_%2827965437018%29.jpg/200px-Roads_in_Switzerland_%2827965437018%29.jpg
|
||||
highway,fr,https://wiki.openstreetmap.org/wiki/FR:Key:highway,2025-01-05,30,4141,695,313,20.35,https://upload.wikimedia.org/wikipedia/commons/thumb/7/78/Roads_in_Switzerland_%2827965437018%29.jpg/200px-Roads_in_Switzerland_%2827965437018%29.jpg
|
||||
addr:housenumber,en,https://wiki.openstreetmap.org/wiki/Key:addr:housenumber,2025-07-24,11,330,97,20,14.01,https://upload.wikimedia.org/wikipedia/commons/thumb/1/16/Ferry_Street%2C_Portaferry_%2809%29%2C_October_2009.JPG/200px-Ferry_Street%2C_Portaferry_%2809%29%2C_October_2009.JPG
|
||||
addr:housenumber,fr,https://wiki.openstreetmap.org/wiki/FR:Key:addr:housenumber,2025-08-23,15,1653,150,77,14.01,https://wiki.openstreetmap.org/w/images/thumb/e/e9/Housenumber-karlsruhe-de.png/200px-Housenumber-karlsruhe-de.png
|
||||
addr:street,en,https://wiki.openstreetmap.org/wiki/Key:addr:street,2024-10-29,12,602,101,16,66.04,https://upload.wikimedia.org/wikipedia/commons/thumb/6/64/UK_-_London_%2830474933636%29.jpg/200px-UK_-_London_%2830474933636%29.jpg
|
||||
addr:street,fr,https://wiki.openstreetmap.org/wiki/FR:Key:addr:street,2025-08-23,15,1653,150,77,66.04,https://wiki.openstreetmap.org/w/images/thumb/e/e9/Housenumber-karlsruhe-de.png/200px-Housenumber-karlsruhe-de.png
|
||||
addr:city,en,https://wiki.openstreetmap.org/wiki/Key:addr:city,2025-07-29,15,802,105,17,9.93,https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Lillerod.jpg/200px-Lillerod.jpg
|
||||
addr:city,fr,https://wiki.openstreetmap.org/wiki/FR:Key:addr:city,2025-08-23,15,1653,150,77,9.93,https://wiki.openstreetmap.org/w/images/thumb/e/e9/Housenumber-karlsruhe-de.png/200px-Housenumber-karlsruhe-de.png
|
||||
name,en,https://wiki.openstreetmap.org/wiki/Key:name,2025-07-25,17,2196,281,82,42.39,https://upload.wikimedia.org/wikipedia/commons/thumb/6/61/Helena%2C_Montana.jpg/200px-Helena%2C_Montana.jpg
|
||||
name,fr,https://wiki.openstreetmap.org/wiki/FR:Key:name,2025-01-16,21,1720,187,60,42.39,https://wiki.openstreetmap.org/w/images/3/37/Strakers.jpg
|
||||
addr:postcode,en,https://wiki.openstreetmap.org/wiki/Key:addr:postcode,2024-10-29,14,382,83,11,67.11,https://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Farrer_post_code.jpg/200px-Farrer_post_code.jpg
|
||||
addr:postcode,fr,https://wiki.openstreetmap.org/wiki/FR:Key:addr:postcode,2025-08-23,15,1653,150,77,67.11,https://wiki.openstreetmap.org/w/images/thumb/e/e9/Housenumber-karlsruhe-de.png/200px-Housenumber-karlsruhe-de.png
|
||||
natural,en,https://wiki.openstreetmap.org/wiki/Key:natural,2025-07-17,17,2070,535,189,22.06,https://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/VocaDi-Nature%2CGeneral.jpeg/200px-VocaDi-Nature%2CGeneral.jpeg
|
||||
natural,fr,https://wiki.openstreetmap.org/wiki/FR:Key:natural,2025-04-21,13,1499,455,174,22.06,https://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/VocaDi-Nature%2CGeneral.jpeg/200px-VocaDi-Nature%2CGeneral.jpeg
|
||||
surface,en,https://wiki.openstreetmap.org/wiki/Key:surface,2025-08-28,24,3475,591,238,264.64,https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Transportation_in_Tanzania_Traffic_problems.JPG/200px-Transportation_in_Tanzania_Traffic_problems.JPG
|
||||
surface,fr,https://wiki.openstreetmap.org/wiki/FR:Key:surface,2022-02-22,13,2587,461,232,264.64,https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Transportation_in_Tanzania_Traffic_problems.JPG/200px-Transportation_in_Tanzania_Traffic_problems.JPG
|
||||
addr:country,en,https://wiki.openstreetmap.org/wiki/Key:addr:country,2024-12-01,9,184,65,11,22.96,https://upload.wikimedia.org/wikipedia/commons/thumb/8/86/Europe_ISO_3166-1.svg/200px-Europe_ISO_3166-1.svg.png
|
||||
addr:country,fr,https://wiki.openstreetmap.org/wiki/FR:Key:addr:country,2025-03-25,8,187,65,11,22.96,https://upload.wikimedia.org/wikipedia/commons/thumb/8/86/Europe_ISO_3166-1.svg/200px-Europe_ISO_3166-1.svg.png
|
||||
landuse,en,https://wiki.openstreetmap.org/wiki/Key:landuse,2025-03-01,17,2071,446,168,39.41,https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Changing_landuse_-_geograph.org.uk_-_1137810.jpg/200px-Changing_landuse_-_geograph.org.uk_-_1137810.jpg
|
||||
landuse,fr,https://wiki.openstreetmap.org/wiki/FR:Key:landuse,2024-08-20,19,2053,418,182,39.41,https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Changing_landuse_-_geograph.org.uk_-_1137810.jpg/200px-Changing_landuse_-_geograph.org.uk_-_1137810.jpg
|
||||
power,en,https://wiki.openstreetmap.org/wiki/Key:power,2025-02-28,20,641,127,21,124.89,https://wiki.openstreetmap.org/w/images/thumb/0/01/Power-tower.JPG/200px-Power-tower.JPG
|
||||
power,fr,https://wiki.openstreetmap.org/wiki/FR:Key:power,2023-06-27,14,390,105,25,124.89,https://wiki.openstreetmap.org/w/images/thumb/0/01/Power-tower.JPG/200px-Power-tower.JPG
|
||||
waterway,en,https://wiki.openstreetmap.org/wiki/Key:waterway,2025-03-10,21,1830,365,118,77.94,https://wiki.openstreetmap.org/w/images/thumb/f/fe/450px-Marshall-county-indiana-yellow-river.jpg/200px-450px-Marshall-county-indiana-yellow-river.jpg
|
||||
waterway,fr,https://wiki.openstreetmap.org/wiki/FR:Key:waterway,2024-03-08,18,1291,272,113,77.94,https://wiki.openstreetmap.org/w/images/thumb/f/fe/450px-Marshall-county-indiana-yellow-river.jpg/200px-450px-Marshall-county-indiana-yellow-river.jpg
|
||||
building:levels,en,https://wiki.openstreetmap.org/wiki/Key:building:levels,2025-08-13,16,1351,204,25,76.11,https://wiki.openstreetmap.org/w/images/thumb/4/47/Building-levels.png/200px-Building-levels.png
|
||||
building:levels,fr,https://wiki.openstreetmap.org/wiki/FR:Key:building:levels,2024-08-01,15,1457,202,26,76.11,https://wiki.openstreetmap.org/w/images/thumb/4/47/Building-levels.png/200px-Building-levels.png
|
||||
amenity,en,https://wiki.openstreetmap.org/wiki/Key:amenity,2025-08-24,29,3066,915,504,160.78,https://wiki.openstreetmap.org/w/images/thumb/a/a5/Mapping-Features-Parking-Lot.png/200px-Mapping-Features-Parking-Lot.png
|
||||
amenity,fr,https://wiki.openstreetmap.org/wiki/FR:Key:amenity,2023-07-19,22,2146,800,487,160.78,https://wiki.openstreetmap.org/w/images/thumb/a/a5/Mapping-Features-Parking-Lot.png/200px-Mapping-Features-Parking-Lot.png
|
||||
barrier,en,https://wiki.openstreetmap.org/wiki/Key:barrier,2025-04-15,17,2137,443,173,207.98,https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/2014_Bystrzyca_K%C5%82odzka%2C_mury_obronne_05.jpg/200px-2014_Bystrzyca_K%C5%82odzka%2C_mury_obronne_05.jpg
|
||||
barrier,fr,https://wiki.openstreetmap.org/wiki/FR:Key:barrier,2022-08-16,15,542,103,18,207.98,https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/2014_Bystrzyca_K%C5%82odzka%2C_mury_obronne_05.jpg/200px-2014_Bystrzyca_K%C5%82odzka%2C_mury_obronne_05.jpg
|
||||
source:date,en,https://wiki.openstreetmap.org/wiki/Key:source:date,2023-04-01,11,395,75,10,22.47,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
source:date,fr,https://wiki.openstreetmap.org/wiki/FR:Key:source:date,2023-07-21,10,419,75,11,22.47,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
service,en,https://wiki.openstreetmap.org/wiki/Key:service,2025-03-16,22,1436,218,17,83.79,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
service,fr,https://wiki.openstreetmap.org/wiki/FR:Key:service,2024-03-04,11,443,100,10,83.79,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
addr:state,en,https://wiki.openstreetmap.org/wiki/Key:addr:state,2023-06-23,12,289,74,11,100,https://upload.wikimedia.org/wikipedia/commons/thumb/e/ef/WVaCent.jpg/200px-WVaCent.jpg
|
||||
access,en,https://wiki.openstreetmap.org/wiki/Key:access,2025-08-06,31,5803,708,98,66.75,https://wiki.openstreetmap.org/w/images/5/5e/WhichAccess.png
|
||||
access,fr,https://wiki.openstreetmap.org/wiki/FR:Key:access,2024-11-27,33,3200,506,83,66.75,https://wiki.openstreetmap.org/w/images/5/5e/WhichAccess.png
|
||||
oneway,en,https://wiki.openstreetmap.org/wiki/Key:oneway,2025-07-17,28,2318,290,30,19.4,https://upload.wikimedia.org/wikipedia/commons/thumb/1/13/One_way_sign.JPG/200px-One_way_sign.JPG
|
||||
oneway,fr,https://wiki.openstreetmap.org/wiki/FR:Key:oneway,2025-06-16,14,645,108,14,19.4,https://upload.wikimedia.org/wikipedia/commons/thumb/f/f4/France_road_sign_C12.svg/200px-France_road_sign_C12.svg.png
|
||||
height,en,https://wiki.openstreetmap.org/wiki/Key:height,2025-07-21,24,1184,184,20,8.45,https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Height_demonstration_diagram.png/200px-Height_demonstration_diagram.png
|
||||
height,fr,https://wiki.openstreetmap.org/wiki/FR:Key:height,2025-06-14,21,1285,190,21,8.45,https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Height_demonstration_diagram.png/200px-Height_demonstration_diagram.png
|
||||
ref,en,https://wiki.openstreetmap.org/wiki/Key:ref,2025-07-25,26,4404,782,115,11.79,https://upload.wikimedia.org/wikipedia/commons/thumb/3/3d/UK_traffic_sign_2901.svg/200px-UK_traffic_sign_2901.svg.png
|
||||
ref,fr,https://wiki.openstreetmap.org/wiki/FR:Key:ref,2025-07-30,20,3393,460,12,11.79,https://upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Autoroute_fran%C3%A7aise_1.svg/200px-Autoroute_fran%C3%A7aise_1.svg.png
|
||||
maxspeed,en,https://wiki.openstreetmap.org/wiki/Key:maxspeed,2025-08-20,30,4275,404,38,39.24,https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Zeichen_274-60_-_Zul%C3%A4ssige_H%C3%B6chstgeschwindigkeit%2C_StVO_2017.svg/200px-Zeichen_274-60_-_Zul%C3%A4ssige_H%C3%B6chstgeschwindigkeit%2C_StVO_2017.svg.png
|
||||
maxspeed,fr,https://wiki.openstreetmap.org/wiki/FR:Key:maxspeed,2025-05-10,25,1401,156,23,39.24,https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Zeichen_274-60_-_Zul%C3%A4ssige_H%C3%B6chstgeschwindigkeit%2C_StVO_2017.svg/200px-Zeichen_274-60_-_Zul%C3%A4ssige_H%C3%B6chstgeschwindigkeit%2C_StVO_2017.svg.png
|
||||
lanes,en,https://wiki.openstreetmap.org/wiki/Key:lanes,2025-08-21,26,2869,355,48,117.16,https://upload.wikimedia.org/wikipedia/commons/thumb/f/f4/A55_trunk_road_looking_east_-_geograph.org.uk_-_932668.jpg/200px-A55_trunk_road_looking_east_-_geograph.org.uk_-_932668.jpg
|
||||
lanes,fr,https://wiki.openstreetmap.org/wiki/FR:Key:lanes,2024-03-07,19,1492,167,19,117.16,https://wiki.openstreetmap.org/w/images/thumb/d/d4/Dscf0444_600.jpg/200px-Dscf0444_600.jpg
|
||||
start_date,en,https://wiki.openstreetmap.org/wiki/Key:start_date,2025-08-01,22,1098,168,29,214.58,https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Connel_bridge_plate.jpg/200px-Connel_bridge_plate.jpg
|
||||
start_date,fr,https://wiki.openstreetmap.org/wiki/FR:Key:start_date,2022-08-29,19,1097,133,22,214.58,https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Connel_bridge_plate.jpg/200px-Connel_bridge_plate.jpg
|
||||
addr:district,en,https://wiki.openstreetmap.org/wiki/Key:addr:district,2023-11-06,11,244,76,11,139.96,https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Hangal_Taluk.jpg/200px-Hangal_Taluk.jpg
|
||||
addr:district,fr,https://wiki.openstreetmap.org/wiki/FR:Key:addr:district,2025-08-23,15,1653,150,77,139.96,https://wiki.openstreetmap.org/w/images/thumb/e/e9/Housenumber-karlsruhe-de.png/200px-Housenumber-karlsruhe-de.png
|
||||
layer,en,https://wiki.openstreetmap.org/wiki/Key:layer,2025-01-02,16,1967,181,17,65.95,https://wiki.openstreetmap.org/w/images/thumb/2/26/Washington_layers.png/200px-Washington_layers.png
|
||||
layer,fr,https://wiki.openstreetmap.org/wiki/FR:Key:layer,2024-02-16,15,2231,162,17,65.95,https://wiki.openstreetmap.org/w/images/thumb/2/26/Washington_layers.png/200px-Washington_layers.png
|
||||
type,en,https://wiki.openstreetmap.org/wiki/Key:type,2025-05-13,20,911,200,72,334.06,https://wiki.openstreetmap.org/w/images/thumb/5/58/Osm_element_node_no.svg/30px-Osm_element_node_no.svg.png
|
||||
type,fr,https://wiki.openstreetmap.org/wiki/FR:Key:type,2020-11-13,10,444,78,10,334.06,https://wiki.openstreetmap.org/w/images/thumb/5/58/Osm_element_node_no.svg/30px-Osm_element_node_no.svg.png
|
||||
operator,en,https://wiki.openstreetmap.org/wiki/Key:operator,2025-08-26,24,1908,241,37,223.28,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
operator,fr,https://wiki.openstreetmap.org/wiki/FR:Key:operator,2022-09-30,15,418,89,11,223.28,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
lit,en,https://wiki.openstreetmap.org/wiki/Key:lit,2024-07-20,17,931,174,52,38.88,https://upload.wikimedia.org/wikipedia/commons/thumb/e/e2/Peatonal_Bicentenario.JPG/200px-Peatonal_Bicentenario.JPG
|
||||
lit,fr,https://wiki.openstreetmap.org/wiki/FR:Key:lit,2025-01-19,17,628,123,14,38.88,https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/2014_K%C5%82odzko%2C_ul._Grottgera_14.JPG/200px-2014_K%C5%82odzko%2C_ul._Grottgera_14.JPG
|
||||
wall,en,https://wiki.openstreetmap.org/wiki/Key:wall,2024-05-02,14,682,206,61,100,https://wiki.openstreetmap.org/w/images/thumb/5/58/Osm_element_node_no.svg/30px-Osm_element_node_no.svg.png
|
||||
tiger:cfcc,en,https://wiki.openstreetmap.org/wiki/Key:tiger:cfcc,2022-12-09,10,127,24,7,100,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
crossing,en,https://wiki.openstreetmap.org/wiki/Key:crossing,2024-02-18,25,2678,363,34,76.98,https://wiki.openstreetmap.org/w/images/thumb/7/75/Toucan.jpg/200px-Toucan.jpg
|
||||
crossing,fr,https://wiki.openstreetmap.org/wiki/FR:Key:crossing,2025-01-20,15,1390,254,28,76.98,https://wiki.openstreetmap.org/w/images/thumb/7/75/Toucan.jpg/200px-Toucan.jpg
|
||||
tiger:county,en,https://wiki.openstreetmap.org/wiki/Key:tiger:county,2022-12-09,10,127,24,7,100,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
source:addr,en,https://wiki.openstreetmap.org/wiki/Key:source:addr,2023-07-05,9,200,70,10,100,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
footway,en,https://wiki.openstreetmap.org/wiki/Key:footway,2025-08-20,23,2002,369,39,99.66,https://wiki.openstreetmap.org/w/images/thumb/b/b9/Sidewalk_and_zebra-crossing.jpg/200px-Sidewalk_and_zebra-crossing.jpg
|
||||
footway,fr,https://wiki.openstreetmap.org/wiki/FR:Key:footway,2024-06-04,14,685,147,28,99.66,https://wiki.openstreetmap.org/w/images/thumb/b/b9/Sidewalk_and_zebra-crossing.jpg/200px-Sidewalk_and_zebra-crossing.jpg
|
||||
ref:bag,en,https://wiki.openstreetmap.org/wiki/Key:ref:bag,2024-10-09,10,254,69,11,100,https://wiki.openstreetmap.org/w/images/thumb/5/58/Osm_element_node_no.svg/30px-Osm_element_node_no.svg.png
|
||||
addr:place,en,https://wiki.openstreetmap.org/wiki/Key:addr:place,2025-03-28,16,1204,154,13,136.57,https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/Suburb_of_Phillip.jpg/200px-Suburb_of_Phillip.jpg
|
||||
addr:place,fr,https://wiki.openstreetmap.org/wiki/FR:Key:addr:place,2023-06-17,11,276,75,12,136.57,https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/Suburb_of_Phillip.jpg/200px-Suburb_of_Phillip.jpg
|
||||
tiger:reviewed,en,https://wiki.openstreetmap.org/wiki/Key:tiger:reviewed,2025-08-01,16,734,105,11,100,https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/US-Census-TIGERLogo.svg/200px-US-Census-TIGERLogo.svg.png
|
||||
leisure,en,https://wiki.openstreetmap.org/wiki/Key:leisure,2025-02-28,12,1084,374,180,232.43,https://upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Hammock_-_Polynesia.jpg/200px-Hammock_-_Polynesia.jpg
|
||||
leisure,fr,https://wiki.openstreetmap.org/wiki/FR:Key:leisure,2021-12-29,11,951,360,186,232.43,https://upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Hammock_-_Polynesia.jpg/200px-Hammock_-_Polynesia.jpg
|
||||
addr:suburb,en,https://wiki.openstreetmap.org/wiki/Key:addr:suburb,2024-02-24,14,439,89,11,1.49,https://upload.wikimedia.org/wikipedia/commons/thumb/b/bb/Grosvenor_Place_2_2008_06_19.jpg/200px-Grosvenor_Place_2_2008_06_19.jpg
|
||||
addr:suburb,fr,https://wiki.openstreetmap.org/wiki/FR:Key:addr:suburb,2024-02-18,13,418,87,11,1.49,https://upload.wikimedia.org/wikipedia/commons/thumb/b/bb/Grosvenor_Place_2_2008_06_19.jpg/200px-Grosvenor_Place_2_2008_06_19.jpg
|
||||
ele,en,https://wiki.openstreetmap.org/wiki/Key:ele,2025-07-18,18,1846,165,24,104.45,https://wiki.openstreetmap.org/w/images/a/a3/Key-ele_mapnik.png
|
||||
ele,fr,https://wiki.openstreetmap.org/wiki/FR:Key:ele,2024-03-02,15,1277,128,13,104.45,https://wiki.openstreetmap.org/w/images/a/a3/Key-ele_mapnik.png
|
||||
tracktype,en,https://wiki.openstreetmap.org/wiki/Key:tracktype,2024-12-02,16,652,146,35,32.71,https://wiki.openstreetmap.org/w/images/thumb/1/13/Tracktype-collage.jpg/200px-Tracktype-collage.jpg
|
||||
tracktype,fr,https://wiki.openstreetmap.org/wiki/FR:Key:tracktype,2025-05-03,11,463,105,29,32.71,https://wiki.openstreetmap.org/w/images/thumb/1/13/Tracktype-collage.jpg/200px-Tracktype-collage.jpg
|
||||
addr:neighbourhood,en,https://wiki.openstreetmap.org/wiki/Key:addr:neighbourhood,2025-04-29,24,2020,235,83,100,https://wiki.openstreetmap.org/w/images/thumb/e/e9/Housenumber-karlsruhe-de.png/200px-Housenumber-karlsruhe-de.png
|
||||
addr:hamlet,en,https://wiki.openstreetmap.org/wiki/Key:addr:hamlet,2024-12-05,9,142,64,11,100,https://upload.wikimedia.org/wikipedia/commons/thumb/b/bb/Grosvenor_Place_2_2008_06_19.jpg/200px-Grosvenor_Place_2_2008_06_19.jpg
|
||||
addr:province,en,https://wiki.openstreetmap.org/wiki/Key:addr:province,2022-05-04,9,156,64,11,100,https://upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Stamp_of_Indonesia_-_2002_-_Colnect_265917_-_Aceh_Province.jpeg/200px-Stamp_of_Indonesia_-_2002_-_Colnect_265917_-_Aceh_Province.jpeg
|
||||
leaf_type,en,https://wiki.openstreetmap.org/wiki/Key:leaf_type,2025-01-22,15,739,201,57,114.46,https://upload.wikimedia.org/wikipedia/commons/thumb/3/39/Picea_abies_Nadelkissen.jpg/200px-Picea_abies_Nadelkissen.jpg
|
||||
leaf_type,fr,https://wiki.openstreetmap.org/wiki/FR:Key:leaf_type,2023-07-02,14,734,220,64,114.46,https://upload.wikimedia.org/wikipedia/commons/thumb/3/39/Picea_abies_Nadelkissen.jpg/200px-Picea_abies_Nadelkissen.jpg
|
||||
addr:full,en,https://wiki.openstreetmap.org/wiki/Key:addr:full,2025-04-29,24,2020,235,83,100,https://wiki.openstreetmap.org/w/images/thumb/e/e9/Housenumber-karlsruhe-de.png/200px-Housenumber-karlsruhe-de.png
|
||||
Anatomie_des_étiquettes_osm,en,https://wiki.openstreetmap.org/wiki/Anatomie_des_étiquettes_osm,2025-06-08,22,963,53,0,100,
|
||||
Tag:leisure=children_club,en,https://wiki.openstreetmap.org/wiki/Tag:leisure=children_club,2025-02-02,9,163,69,9,56.04,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
Tag:leisure=children_club,fr,https://wiki.openstreetmap.org/wiki/FR:Tag:leisure=children_club,2024-05-02,8,294,67,10,56.04,https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Dave_%26_Buster%27s_video_arcade_in_Columbus%2C_OH_-_17910.JPG/200px-Dave_%26_Buster%27s_video_arcade_in_Columbus%2C_OH_-_17910.JPG
|
||||
Tag:harassment_prevention=ask_angela,en,https://wiki.openstreetmap.org/wiki/Tag:harassment_prevention=ask_angela,2025-02-22,14,463,72,9,42.56,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
Tag:harassment_prevention=ask_angela,fr,https://wiki.openstreetmap.org/wiki/FR:Tag:harassment_prevention=ask_angela,2025-09-01,20,873,166,15,42.56,https://wiki.openstreetmap.org/w/images/thumb/1/15/2024-06-27T08.40.50_ask_angela_lyon.jpg/200px-2024-06-27T08.40.50_ask_angela_lyon.jpg
|
||||
Key:harassment_prevention,en,https://wiki.openstreetmap.org/wiki/Key:harassment_prevention,2024-08-10,12,196,69,14,66.72,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
Key:harassment_prevention,fr,https://wiki.openstreetmap.org/wiki/FR:Key:harassment_prevention,2025-07-03,15,328,83,14,66.72,https://wiki.openstreetmap.org/w/images/thumb/7/76/Osm_element_node.svg/30px-Osm_element_node.svg.png
|
||||
Proposal process,en,https://wiki.openstreetmap.org/wiki/Proposal process,2025-08-13,46,5292,202,4,166.25,https://wiki.openstreetmap.org/w/images/thumb/c/c2/Save_proposal_first.png/761px-Save_proposal_first.png
|
||||
Proposal process,fr,https://wiki.openstreetmap.org/wiki/FR:Proposal process,2023-09-22,15,1146,24,0,166.25,
|
||||
Automated_Edits_code_of_conduct,en,https://wiki.openstreetmap.org/wiki/Automated_Edits_code_of_conduct,2025-07-26,19,2062,69,0,26.35,
|
||||
Automated_Edits_code_of_conduct,fr,https://wiki.openstreetmap.org/wiki/FR:Automated_Edits_code_of_conduct,2025-04-03,17,1571,16,0,26.35,
|
||||
Key:cuisine,en,https://wiki.openstreetmap.org/wiki/Key:cuisine,2025-07-23,17,3422,693,303,107.73,https://upload.wikimedia.org/wikipedia/commons/thumb/f/f0/Food_montage.jpg/200px-Food_montage.jpg
|
||||
Key:cuisine,fr,https://wiki.openstreetmap.org/wiki/FR:Key:cuisine,2024-02-16,15,2866,690,316,107.73,https://upload.wikimedia.org/wikipedia/commons/thumb/f/f0/Food_montage.jpg/200px-Food_montage.jpg
|
||||
Libre_Charge_Map,en,https://wiki.openstreetmap.org/wiki/Libre_Charge_Map,2025-07-28,11,328,10,2,100,https://wiki.openstreetmap.org/w/images/thumb/8/8e/Screenshot_2025-07-28_at_14-40-11_LibreChargeMap_-_OSM_Bliss.png/300px-Screenshot_2025-07-28_at_14-40-11_LibreChargeMap_-_OSM_Bliss.png
|
||||
OSM_Mon_Commerce,en,https://wiki.openstreetmap.org/wiki/OSM_Mon_Commerce,2025-07-29,17,418,34,3,100,https://wiki.openstreetmap.org/w/images/thumb/6/67/Villes_OSM_Mon_Commerce.png/500px-Villes_OSM_Mon_Commerce.png
|
||||
Tag:amenity=charging_station,en,https://wiki.openstreetmap.org/wiki/Tag:amenity=charging_station,2025-08-29,16,1509,284,62,55.72,https://wiki.openstreetmap.org/w/images/thumb/4/4d/Recharge_Vigra_charging_station.jpg/200px-Recharge_Vigra_charging_station.jpg
|
||||
Tag:amenity=charging_station,fr,https://wiki.openstreetmap.org/wiki/FR:Tag:amenity=charging_station,2024-12-28,19,2662,331,58,55.72,https://wiki.openstreetmap.org/w/images/thumb/4/4d/Recharge_Vigra_charging_station.jpg/200px-Recharge_Vigra_charging_station.jpg
|
|
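wiki_pages.csv can be explored directly without re-running wiki_compare.py — a minimal sketch, assuming the file sits in the working directory, that lists the ten keys with the highest staleness_score:

import csv

with open("wiki_pages.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# The en and fr rows of a key carry the same staleness_score, so one entry per key is enough.
scores = {row["key"]: float(row["staleness_score"]) for row in rows}

for key, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{key}: {score}")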