|Published (Last):||21 March 2010|
This study aimed to evaluate the use of hidden Markov models (HMM) for the segmentation of person names and its influence on record linkage.
A sample of patients from each database was segmented via HMM, and the results were compared to those from segmentation by the authors. Conformity of segmentation via HMM varied from [...]. The different segmentation strategies yielded similar results in the record linkage process.
This study suggests that segmentation of Brazilian names via HMM is no more effective than traditional segmentation approaches in the linkage process. A major challenge faced by organizations is the integration of their information systems. However, the need eventually arises to integrate these systems in order to improve processes or generate strategic decision-making information.
It is thus very difficult to link data from one system to the others. However, this integration is limited by the difficulty in using deterministic means to determine which records belong to the same entity in the respective databases. Various approaches are used to perform database integration in such scenarios, and this is an active field of research (2). Some preliminary stages are necessary in the record linkage process: data cleaning and standardization, and blocking.
The cleaning and standardization stage involves preparation of data fields, seeking to minimize errors during the blocking and record matching process. Because the original data are often poorly completed, this stage is extremely important and contributes greatly to the efficiency of the process. Another important component of standardization is segmentation (separation of the name into its constituent parts).
The objective is to maximize the likelihood that a given individual will be correctly identified. There are various probabilistic record linkage software programs that include a segmentation stage. In Brazil, Reclink, by Camargo Jr. & Coeli, has made an important contribution to the use of record linkage in the health field.
However, this does not rule out the study of other alternatives for name segmentation that could make the linkage process more efficient than that proposed by Reclink.
The latter authors applied HMM to English-language names (7). The objective of the current study was to apply HMM to the segmentation of Brazilian names and verify whether use of the parts of the name thus obtained in the record linkage process is more efficient than the traditional name segmentation methods. The basic assumption is that the use of initials from middle names leads to loss of information and that use of all complete parts of the name would result in greater efficiency in the linkage process.
This section is divided into four subsections for greater clarity in the presentation: databases used, name segmentation process, evaluation of segmentation, and evaluation of the influence of segmentation on the record linkage process.
The AIH database was used to apply one of the previously obtained HMM without the need to generate new ancillary tables or alter the existing tables. The methodology used for name segmentation consists of eight phases: data cleaning, standardization of the form, name standardization, name segmentation, creation of the initial HMM, training, and refinement.
The data cleaning phase identified records that were invalid for linkage and performed corrections in the name field, preparing it for the subsequent standardization phases.
The name segmentation phase was subdivided into two stages. In the first stage, names were separated into five distinct fields. Reclink adopts a similar hypothesis, since it only uses the initials from middle names.
Thus, the following adjustment was performed for names with more than 5 parts: (a) names with six parts: the 4th part of the name was eliminated; (b) names with seven parts: the 4th and 5th parts of the name were eliminated; (c) names with eight parts: the 4th, 5th, and 6th parts of the name were eliminated; and (d) names with nine or more parts: the first three and the last two parts of the name were maintained.
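The four rules above all amount to keeping the first three and the last two parts of the name. A minimal sketch of that adjustment follows, assuming names are split on whitespace; the function name and the sample name are hypothetical and are not taken from the study.

```python
def reduce_to_five_parts(full_name: str) -> list[str]:
    """Keep at most five name parts: the first three and the last two."""
    parts = full_name.split()
    if len(parts) <= 5:
        # Names with five parts or fewer are left unchanged.
        return parts
    # Six parts drops the 4th; seven drops the 4th and 5th; and so on.
    return parts[:3] + parts[-2:]


# Hypothetical seven-part name: the 4th and 5th parts are eliminated.
print(reduce_to_five_parts("ANA MARIA DA SILVA DOS SANTOS OLIVEIRA"))
# ['ANA', 'MARIA', 'DA', 'SANTOS', 'OLIVEIRA']
```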
The qualifiers may thus become incorrect. The problem of selecting the most likely sequence was solved with a probabilistic model called the HMM (7). The main underlying idea in the model is that there are various phenomena whose outputs depend on factors that are not directly observable.
In the above-mentioned example, one can assume that a hidden Markov model for the name field would have the following states: first name, second name, first surname, second surname, and third surname. These would be the hidden states of the above-mentioned set S. Each identification symbol is assumed to be emitted by a hidden state.
Thus, the sequences of states could be the following: [...]. Intuitively, the first sequence would be more probable than the second, indicating that this sequence of hidden states is more consistent with the sequence of symbols. This probability is calculated using the Viterbi algorithm (6). The HMM was defined as follows. The hidden states are: given name 1, given name 2, surname 1, surname 2, and surname 3. The symbols are: [...]. In the next phase, creation of the initial hidden Markov model, a thousand random records were selected from the APAC and SIM databases and the respective sequences of identification symbols were generated.
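To illustrate how the most probable sequence of hidden states can be selected, the sketch below implements the Viterbi algorithm over the five hidden states named above. The observation symbols ("GF" for a token found in a given-name table, "SN" for a token found in a surname table, "UN" for unknown) and every probability are hypothetical placeholders chosen for the example; they are not the symbols or parameters of the study.

```python
import math

STATES = ["GN1", "GN2", "LN1", "LN2", "LN3"]

# Hypothetical parameters (assumptions, not values from the study).
START = {"GN1": 1.0}
TRANS = {
    "GN1": {"GN2": 0.5, "LN1": 0.5},
    "GN2": {"LN1": 1.0},
    "LN1": {"LN2": 1.0},
    "LN2": {"LN3": 1.0},
    "LN3": {"LN3": 1.0},
}
EMIT = {
    "GN1": {"GF": 0.9, "SN": 0.05, "UN": 0.05},
    "GN2": {"GF": 0.8, "SN": 0.1, "UN": 0.1},
    "LN1": {"GF": 0.1, "SN": 0.8, "UN": 0.1},
    "LN2": {"GF": 0.1, "SN": 0.8, "UN": 0.1},
    "LN3": {"GF": 0.1, "SN": 0.8, "UN": 0.1},
}


def _log(p):
    # Log-probabilities avoid numeric underflow; log(0) is treated as -inf.
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log-probability)."""
    # best[t][s]: log-probability of the best path ending in state s at step t.
    best = [{s: _log(start_p.get(s, 0)) + _log(emit_p[s].get(observations[0], 0))
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + _log(trans_p[r].get(s, 0))) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + _log(emit_p[s].get(observations[t], 0))
            back[t][s] = prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


path, log_p = viterbi(["GF", "GF", "SN", "SN"], STATES, START, TRANS, EMIT)
print(path)  # ['GN1', 'GN2', 'LN1', 'LN2']
```

With these toy parameters, the alternative path GN1, LN1, LN2, LN3 has probability 0.0288 against 0.2304 for the selected path, which mirrors the intuition that one sequence of hidden states fits the symbols better than another.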
The training and refinement phases aim to achieve the best fit of the initial model to the real data. For each phase, another random sequence of a thousand records was selected from APAC and SIM, generating the corresponding identification symbols. The Baum-Welch algorithm (13) is a method of iterative re-estimation in which each new model assigns the observed sequences a probability at least as high as the previous model did. Iterations of the model were done, and the Kullback-Leibler divergence (14) [...]. All the tables created in these stages, as well as the algorithms used in the process, can be obtained by consulting the authors.
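The passage describing how the Kullback-Leibler divergence was applied is incomplete in this copy, so the sketch below only shows the definition for two discrete distributions, for example one emission row of a model before and after a re-estimation step. The function name and the sample probabilities are assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions.

    p and q are sequences of probabilities over the same outcomes.
    Terms with p_i == 0 contribute nothing; q_i must be positive
    wherever p_i is positive, otherwise a ZeroDivisionError is raised.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical emission rows for one state in two successive models.
before = [0.5, 0.5]
after = [0.9, 0.1]
print(round(kl_divergence(before, after), 4))  # 0.5108
```

A divergence close to zero between successive models would suggest that further iterations change the model very little.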
The Viterbi algorithm (12) was applied: with the estimated model and the sequences of observations for each name, the JAHMM library was used to determine the sequence of hidden states for each name. One of the authors was in charge of creating the tables and training the hidden Markov model. The authors evaluated the sequences of states generated for the sequences of observations in order to assess the consistency between two independent reviewers, measured by the kappa coefficient (16).
The cells in the 2 x 2 table for estimating the kappa coefficient indicate the number of times in which the sequences of hidden states generated by the hidden Markov model were classified, respectively, as: correct by both reviewers; correct by reviewer A, but incorrect by reviewer B; correct by reviewer B, but incorrect by reviewer A; and incorrect by both reviewers. The confidence intervals for the measurement of conformity were calculated with the OpenEpi software version 3.
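As an illustration of how the kappa coefficient follows from the four cells of the 2 x 2 table described above, here is a minimal sketch. The cell counts are invented for the example and do not come from the study.

```python
def cohen_kappa_2x2(both_correct, only_a_correct, only_b_correct, both_incorrect):
    """Cohen's kappa for two reviewers and a binary (correct/incorrect) rating."""
    n = both_correct + only_a_correct + only_b_correct + both_incorrect
    observed = (both_correct + both_incorrect) / n
    # Agreement expected by chance, from each reviewer's marginal totals.
    a_correct = both_correct + only_a_correct
    b_correct = both_correct + only_b_correct
    expected = (a_correct * b_correct + (n - a_correct) * (n - b_correct)) / n**2
    # Undefined (division by zero) when chance agreement is exactly 1.
    return (observed - expected) / (1 - expected)


# Hypothetical counts: 40 correct by both, 5 + 5 discordant, 50 incorrect by both.
print(round(cohen_kappa_2x2(40, 5, 5, 50), 3))  # 0.798
```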
Three linkage processes were performed, each with a different name segmentation strategy. The first was that used by the Reclink software (4). The second segmentation consisted of separating the name into a maximum of five parts before applying the HMM. The third segmentation was the name segmentation resulting from the application of the hidden Markov model to the parts of the name obtained from the second strategy, identifying whether each part of the name was given name 1 (GN1), given name 2 (GN2), surname 1 (LN1), surname 2 (LN2), and so on.
In the second alternative, the last part of the name was always placed in P5, regardless of the number of parts in the name. In a previous study (1) [...]. Taking this previous study as the basis for estimating the parameters m_i (probability of agreement between values for variable i, assuming that the pair of compared records is a true match), the following steps were performed: 1) pairs of records sampled from the SIM and APAC tables were identified that were considered true in the previous study (1) [...].
To estimate the parameters u_i (probability of agreement between values for variable i, assuming that the pair of compared records is a false match), the following steps were performed: 1) random records from the sampled APAC table were paired with random records from the sampled SIM table, for a total of 10, pairs of records; 2) for each variable, u_i was estimated as the number of these pairs for which the values of the variable agreed in the two records, divided by the total number of pairs. The final stage in the record linkage process was to define the cutoff point.
For each strategy, cutoff points were established through manual inspection by two reviewers. The pairs with scores above the cutoff points were classified by consensus by the authors as false or true.
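The estimation of m_i and u_i described above reduces to counting agreements among pairs. The sketch below shows that counting step, followed by the agreement and disagreement weights of the standard Fellegi-Sunter formulation; the text here does not spell out the weight formula, so its use is an assumption. The field name "GN1" and all records are hypothetical.

```python
import math


def agreement_rate(pairs, field):
    """Proportion of record pairs whose values agree on the given field."""
    agree = sum(1 for rec_a, rec_b in pairs if rec_a[field] == rec_b[field])
    return agree / len(pairs)


def field_weights(m, u):
    """Agreement and disagreement weights (log base 2), assuming 0 < m, u < 1."""
    return math.log2(m / u), math.log2((1 - m) / (1 - u))


# Hypothetical pairs: true matches (for m) and randomly paired records (for u).
true_pairs = [
    ({"GN1": "MARIA"}, {"GN1": "MARIA"}),
    ({"GN1": "JOSE"}, {"GN1": "JOSE"}),
    ({"GN1": "ANA"}, {"GN1": "ANA"}),
    ({"GN1": "JOAO"}, {"GN1": "JOSE"}),
]
random_pairs = [
    ({"GN1": "MARIA"}, {"GN1": "JOSE"}),
    ({"GN1": "MARIA"}, {"GN1": "MARIA"}),
    ({"GN1": "ANA"}, {"GN1": "JOAO"}),
    ({"GN1": "JOSE"}, {"GN1": "ANA"}),
]

m = agreement_rate(true_pairs, "GN1")    # 0.75
u = agreement_rate(random_pairs, "GN1")  # 0.25
agree_w, disagree_w = field_weights(m, u)
```

A pair's score is then the sum, over the compared variables, of the agreement weight where the values agree and the disagreement weight where they do not; pairs scoring above the cutoff are the ones sent for manual classification.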
Taking the pairs classified as false or true as a gold standard, it was possible to evaluate the efficiency of the linkage according to the following metrics (20): [...]. The model shows the hidden states and probabilities of transition between these states.
Hidden Markov model
Optimal state selection and tuning parameters for a degradation model in bearings using Mel-Frequency Cepstral Coefficients and Hidden Markov Chains. Complejo Educativo La Julita. Pereira, Colombia. Preventive maintenance is an asset management philosophy that aims to maximize operation through routine inspections, with increasing frequency when no abnormalities are exhibited. This leads to an increase in the probability of failure due to the repetitive intervention and the inherent human error.