Course TC4 (Probabilistic Part)

By Alexandre Allauzen

This page contains the information for the probabilistic part of this course (probability refresher and Markov models). The other part is taught by J-C Janodet.

1 Lecture 1: Probability refresher

This lecture will also be useful for the TC1 module (Machine Learning).

1.1 ToDo before the lecture: the lecture notes

Before the first lecture, you must carefully read the following two sets of lecture notes (on the shared drive):

  • Probability: a refresher on probability as we will manipulate it in AIC. Depending on your background in the area, a more basic course may be useful; in that case, read the course by my colleague F. Rossi, which is very comprehensive.
  • Bayesian classification

Many of you may already know these notions (all or some of them); nevertheless, your goal is to understand everything and, if some points remain unclear, to come to the lecture with questions.

1.2 ToDo before the lecture: exercises

The "paper" exercises are available on this page.

2 Lecture 2: Hidden Markov Models (HMM), introduction

  • Lecture: introduction to HMMs.
  • Lab: no lab session, but …

3 ToDo for 16/10

  • Do TP1, which gets you familiar with the data we will use
  • Prepare the Viterbi algorithm

For the Viterbi algorithm, the question is how to compute the best tag sequence for a given sentence, knowing the parameters of the HMM. By best, we mean the most probable sequence of tags (or states) given the observation sequence.

Propose and implement an algorithm that answers this question. To help you get started: this algorithm is called Viterbi, and you can watch this video https://www.youtube.com/watch?v=RwwfUICZLsA to understand how it operates.
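To fix ideas, here is a minimal sketch of Viterbi in log space for a first-order HMM. The dictionary-based parameter representation (init, trans, emit) is an assumption for illustration, not the format used in the lab, and all probabilities are assumed to be non-zero (smoothed).

  import math

  def viterbi(obs, states, init, trans, emit):
      """Most probable state sequence for obs, given the HMM parameters.

      init[s]       : P(s at t=0)
      trans[s][s2]  : P(s2 | s)
      emit[s][o]    : P(o | s)
      Computations are done in log space to avoid numerical underflow.
      """
      V = [{s: math.log(init[s]) + math.log(emit[s][obs[0]]) for s in states}]
      back = [{}]
      for t in range(1, len(obs)):
          V.append({})
          back.append({})
          for s in states:
              # Best previous state leading to s at time t
              prev, score = max(
                  ((p, V[t - 1][p] + math.log(trans[p][s])) for p in states),
                  key=lambda x: x[1],
              )
              V[t][s] = score + math.log(emit[s][obs[t]])
              back[t][s] = prev
      # Backtrack from the best final state
      last = max(states, key=lambda s: V[-1][s])
      path = [last]
      for t in range(len(obs) - 1, 0, -1):
          path.insert(0, back[t][path[0]])
      return path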

4 Lecture 3: Inference / Viterbi

  • Lecture on Viterbi / forward / backward
  • Lab: Markov models, supervised learning

5 Lecture 4: Forward/Backward

  • Lecture on Viterbi / forward / backward, continued and concluded.
  • Lab: Markov models, supervised learning, continued and concluded

6 Project: Second-Order HMM for Typo Correction

The goal is to design a model that corrects typos in text without using a dictionary.

In this problem, a state refers to the correct letter that should have been typed, and an observation refers to the actual letter that is typed. Given a sequence of outputs/observations (i.e., actually typed letters), the problem is to reconstruct the hidden state sequence (i.e., the intended sequence of letters). Thus, data for this problem looks like:

  [('t', 't'), ('h', 'h'), ('w', 'e'), ('k', 'm')]
  [('f', 'f'), ('o', 'o'), ('r', 'r'), ('m', 'm')]

The first example is misspelled: the observation is thwk while the correct word is them. The second example is correctly typed.

Data for this problem was generated as follows. Starting with a text document (in this case the Unabomber's Manifesto, chosen not for political reasons but for convenience: it is available online and of about the right length), all numbers and punctuation were converted to white space and all letters were converted to lower case. The remaining text is a sequence over the lower-case letters and the space character only, the latter represented in the data files by an underscore. Next, typos were artificially added as follows: with 90% probability the correct letter is transcribed, but with 10% probability a randomly chosen neighbor of the letter (on an ordinary physical keyboard) is transcribed instead. Space characters are always transcribed correctly. In a harder variant of the problem, the error rate is increased to 20%.
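For intuition, here is a small sketch of this corruption process. The keyboard-neighbor map below is a made-up fragment for illustration only; the actual dataset was built with the full layout of an ordinary physical keyboard.

  import random

  # Hypothetical fragment of a keyboard-neighbor map (illustration only).
  NEIGHBORS = {
      'e': ['w', 'r', 'd'],
      'h': ['g', 'j', 'n'],
      'm': ['n', 'k'],
      't': ['r', 'y', 'g'],
  }

  def corrupt(text, error_rate=0.10):
      """Keep spaces ('_') untouched; otherwise replace each letter with a
      random keyboard neighbor with probability error_rate."""
      out = []
      for ch in text:
          if ch != '_' and ch in NEIGHBORS and random.random() < error_rate:
              out.append(random.choice(NEIGHBORS[ch]))
          else:
              out.append(ch)
      return ''.join(out)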

The dataset is provided as an archive; see the shared drive to download it. The archive contains 4 pickles: train10 and test10 constitute the dataset with 10% spelling errors, while train20 and test20 form the one with 20% errors.
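As a sketch, loading one of the pickles and separating observations from hidden states could look like the snippet below. The file name and the exact structure of the pickled object are assumptions based on the example above (a list of sequences, each a list of (typed letter, intended letter) pairs).

  import pickle

  # Assumed file name and structure: a list of sequences, each a list of
  # (observed letter, correct letter) pairs, as in the example above.
  with open('train10.pkl', 'rb') as f:
      train = pickle.load(f)

  sequences = []
  for seq in train:
      observed = [typed for typed, intended in seq]    # what was typed
      hidden = [intended for typed, intended in seq]   # what should have been typed
      sequences.append((observed, hidden))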

6.1 TODO:

  • (3pts) Dry run: train a first-order HMM on the training data. This is basically what we did in the lab sessions for POS tagging. Compute the error rate (at the character level) and compare this result with the simplest possible baseline, which does nothing (it keeps the typed letters unchanged). You can also count the number of errors your model corrects and the number of errors it introduces. See the evaluation sketch after this list.
  • (7pts) Second-order HMM: to improve performance, we can increase the order of the HMM. Implement a second-order model for this task (meaning that the probability of the next state depends on both the current state and the previous one). A convenient way to implement a second-order HMM is to think of it as a change of variable, as sketched after this list.
  • (5pts) Criticism and evolution: this model is somewhat limited. For instance, it only handles substitution errors. Could you describe a way to extend it to also handle noisy insertions of characters?
  • (5pts) Same question for deletions (omitted characters)?
  • (5pts) Unsupervised training: propose and discuss some ideas for unsupervised training on this task. For this question you should give details on: what kind of data, which parameters will be involved, …
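A minimal sketch of the evaluation mentioned in the first item, assuming the decoder output, the typed letters and the correct letters are three aligned lists of characters; the function name is illustrative, not part of the provided material.

  def error_rate(predicted, observed, reference):
      """Character error rate of the model and of the do-nothing baseline.

      predicted : letters output by the HMM decoder
      observed  : letters actually typed (the do-nothing baseline)
      reference : letters that should have been typed
      """
      n = len(reference)
      model_errors = sum(p != r for p, r in zip(predicted, reference))
      baseline_errors = sum(o != r for o, r in zip(observed, reference))
      # Errors the baseline made that the model fixed, and errors it introduced
      corrected = sum(o != r and p == r
                      for p, o, r in zip(predicted, observed, reference))
      created = sum(o == r and p != r
                    for p, o, r in zip(predicted, observed, reference))
      return model_errors / n, baseline_errors / n, corrected, created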
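For the second item, one way to read the change of variable, sketched under the assumption that the training sequences are the (observed, hidden) pairs of lists built above: a second-order HMM over states s_t behaves like a first-order HMM whose states are the pairs (s_{t-1}, s_t), so the first-order machinery (Viterbi included) can be reused almost as-is. Only the transition counting changes:

  from collections import defaultdict

  def count_second_order(sequences):
      """Second-order counts via a change of variable: the pair
      (previous state, current state) plays the role of a single state."""
      trans = defaultdict(lambda: defaultdict(int))  # (s_prev, s_cur) -> s_next -> count
      emit = defaultdict(lambda: defaultdict(int))   # s -> observed letter -> count
      for observed, hidden in sequences:
          for o, s in zip(observed, hidden):
              emit[s][o] += 1
          for prev, cur, nxt in zip(hidden, hidden[1:], hidden[2:]):
              trans[(prev, cur)][nxt] += 1
      return trans, emit

The counts would then be normalized (and smoothed) into probabilities, exactly as in the first-order case.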

For the last three questions, be as precise as you can. An implementation is not mandatory.

6.2 Submission Checklist

You can rely on existing libraries such as nltk or sklearn, and you can use other programming languages. You must send an archive with:

  • the source code (without external libraries, of course) necessary to reproduce your results

  • a report that presents, explains, analyses and discusses your results.

You can of course use a notebook.

6.3 Grading Scheme:

Just follow the TODO list. Extra points will be given for an implementation of the last three questions (5 pts each).