Wunderlist is a discontinued cloud-based task management application. It allowed users to create lists to manage their tasks from a smartphone, tablet, computer and smartwatch. Wunderlist was free; additional collaboration features were available in a paid version known as Wunderlist Pro, released April 2013. Wunderlist was created in 2011 by Berlin-based startup 6Wunderkinder (Engl.: 6Prodigies). The company was acquired by Microsoft in June 2015, at which time the app had over 13 million users. In April 2017, Microsoft announced that Wunderlist would eventually be discontinued in favor of Microsoft To Do, a new multi-platform app developed by the Wunderlist team that has direct integration with the company's Office 365 service. On December 6, 2019, Microsoft announced that it would shut down Wunderlist on May 6, 2020. After this date, the application would no longer sync but users could still import their content into Microsoft To Do. == History == In 2009, Wunderlist's CEO Christian Reber called on the social network platform XING for business partners to create a new to-do app. Frank Thelen responded and together Reber and Thelen developed first concepts for Wunderlist. The necessary seed funding was granted by High-Tech Gründerfonds and e42 GmbH. The first version of Wunderlist was launched on November 9, 2010. Initially, the program was created for desktop PCs and platforms such as Windows, Linux and Mac OS X. In December 2011, the app received approval for the iPhone. Subsequently, the developers released a version prepared for the iPad with the name Wunderlist HD. In September 2012, the developers announced a shutdown of their service Wunderkit. Instead they wanted to focus on creating a new version of Wunderlist, which was later on released in December 2012 under the name Wunderlist 2. In September 2013, the company announced it had over 5 million users. In July 2014, a new major update was released under the name of Wunderlist 3, with a new real-time sync architecture. Wunderlist reached 10 million users in December 2014. On June 1, 2015, it was announced that Microsoft had acquired 6Wunderkinder, makers of Wunderlist, for between US$100 million and US$200 million (~$258 million in 2024). Following its acquisition of the app, Microsoft announced in April 2017 a preview of To-Do, a multi-platform task management app developed by the Wunderlist team that was intended to eventually replace Wunderlist and incorporate most of its features. As of January 2019, To-Do had not yet reached feature parity with Wunderlist, with its team citing that the service had to be completely re-written to use Microsoft Azure instead of Amazon Web Services. Frustrated by the perceived lack of roadmap, in September 2019, Reber began to publicly ask Microsoft-related accounts on Twitter whether he could buy Wunderlist back. Shortly afterward, however, Microsoft unveiled updates to To-Do that make it more closely resemble Wunderlist. In December 2019, Microsoft announced that it would fully shut down Wunderlist as of May 6, 2020. The team responsible for creating Wunderlist, led by co-founder Christian Reber, created that Superlist app in early 2024. == Finances == In its initial round of funding, 100,000 euro was invested in 6Wunderkinder by Frank Thelen and others. In December 2010, High-Tech Gründerfonds invested 500,000 euro (approximately US$660,000) in the company. T-Venture also invested an undisclosed amount in the startup. In its Series A round of funding in November 2011, Atomico invested $4.2 million (~$5.76 million in 2024) while High-Tech Gründerfonds invested an undisclosed additional amount. In May 2012, High-Tech Gründerfonds sold off its stake in 6Wunderkinder to Earlybird Venture Capital. In November 2013, $19 million (~$25.2 million in 2024) was raised in a Series B round led by Sequoia Capital with participation from Earlybird and Atomico. == Awards == In 2013, Wunderlist for Mac was named App of the Year. Wunderlist was selected as a Google Play Top Developer in 2013. In 2014, Wunderlist won the "Golden Mi" award from Xiaomi, and also named as one of its Best Apps of 2014 was given a "Google Play Editor's Choice" award, and was named in Google Play's Best Apps of 2014 as well as Apple's Best of 2014.
Framebuffer
A framebuffer (frame buffer, or sometimes framestore) is a portion of random-access memory (RAM) containing a bitmap that drives a video display. It is a memory buffer containing data representing all the pixels in a complete video frame. Modern video cards contain framebuffer circuitry in their cores. This circuitry converts an in-memory bitmap into a video signal that can be displayed on a computer monitor. In computing, a screen buffer is a part of computer memory used by a computer application for the representation of the content to be shown on the computer display. The screen buffer may also be called the video buffer, the regeneration buffer, or regen buffer for short. The phrase "screen buffer” refers to a logical function, while video memory refers to a hardware storage location. In particular, the screen buffer may be placed in the main RAM, the video memory, or some other hardware location. To reduce latency and avoid screen tearing, multiple frames can be buffered, and this technique is called multiple buffering. When this is so, at any time, only one frame would be visible, and the others would not be. The currently invisible frames are located in the off-screen buffer. The information in the buffer typically consists of color values for every pixel to be shown on the display. Color values are commonly stored in 1-bit binary (monochrome), 4-bit palettized, 8-bit palettized, 16-bit high color and 24-bit true color formats. An additional alpha channel is sometimes used to retain information about pixel transparency. The total amount of memory required for the framebuffer depends on the resolution of the output signal, and on the color depth or palette size. == History == Computer researchers had long discussed the theoretical advantages of a framebuffer but were unable to produce a machine with sufficient memory at an economically practicable cost. In 1947, the Manchester Baby computer used a Williams tube, later the Williams-Kilburn tube, to store 1024 bits on a cathode-ray tube (CRT) memory and displayed on a second CRT. Other research labs were exploring these techniques with MIT Lincoln Laboratory achieving a 4096 display in 1950. A color-scanned display was implemented in the late 1960s, called the Brookhaven RAster Display (BRAD), which used a drum memory and a television monitor. In 1969, A. Michael Noll of Bell Telephone Laboratories, Inc. implemented a scanned display with a frame buffer, using magnetic-core memory. A year or so later, the Bell Labs system was expanded to display an image with a color depth of three bits on a standard color TV monitor. The vector graphics used in the computer had to be converted for the scanned graphics of a TV display. In the early 1970s, the development of MOS memory (metal–oxide–semiconductor memory) integrated-circuit chips, particularly high-density DRAM (dynamic random-access memory) chips with at least 1 kb memory, made it practical to create, for the first time, a digital memory system with framebuffers capable of holding a standard video image. This led to the development of the SuperPaint system by Richard Shoup at Xerox PARC in 1972. Shoup was able to use the SuperPaint framebuffer to create an early digital video-capture system. By synchronizing the output signal to the input signal, Shoup was able to overwrite each pixel of data as it shifted in. Shoup also experimented with modifying the output signal using color tables. These color tables allowed the SuperPaint system to produce a wide variety of colors outside the range of the limited 8-bit data it contained. This scheme would later become commonplace in computer framebuffers. In 1974, Evans & Sutherland released the first commercial framebuffer, the Picture System, costing about $15,000. It was capable of producing resolutions of up to 512 by 512 pixels in 8-bit grayscale, and became a boon for graphics researchers who did not have the resources to build their own framebuffer. The New York Institute of Technology would later create the first 24-bit color system using three of the Evans & Sutherland framebuffers. Each framebuffer was connected to an RGB color output (one for red, one for green and one for blue), with a Digital Equipment Corporation PDP 11/04 minicomputer controlling the three devices as one. In 1975, the UK company Quantel produced the first commercial full-color broadcast framebuffer, the Quantel DFS 3000. It was first used in TV coverage of the 1976 Montreal Olympics to generate a picture-in-picture inset of the Olympic flaming torch while the rest of the picture featured the runner entering the stadium. The rapid improvement of integrated-circuit technology made it possible for many of the home computers of the late 1970s to contain low-color-depth framebuffers. Today, nearly all computers with graphical capabilities utilize a framebuffer for generating the video signal. Amiga computers, created in the 1980s, featured special design attention to graphics performance and included a unique Hold-And-Modify framebuffer capable of displaying 4096 colors. Framebuffers also became popular in high-end workstations and arcade system boards throughout the 1980s. SGI, Sun Microsystems, HP, DEC and IBM all released framebuffers for their workstation computers in this period. These framebuffers were usually of a much higher quality than could be found in most home computers, and were regularly used in television, printing, computer modeling and 3D graphics. Framebuffers were also used by Sega for its high-end arcade boards, which were also of a higher quality than on home computers. == Display modes == Framebuffers used in personal and home computing often had sets of defined modes under which the framebuffer can operate. These modes reconfigure the hardware to output different resolutions, color depths, memory layouts and refresh rate timings. In the world of Unix machines and operating systems, such conveniences were usually eschewed in favor of directly manipulating the hardware settings. This manipulation was far more flexible in that any resolution, color depth and refresh rate was attainable – limited only by the memory available to the framebuffer. An unfortunate side-effect of this method was that the display device could be driven beyond its capabilities. In some cases, this resulted in hardware damage to the display. More commonly, it simply produced garbled and unusable output. Modern CRT monitors fix this problem through the introduction of protection circuitry. When the display mode is changed, the monitor attempts to obtain a signal lock on the new refresh frequency. If the monitor is unable to obtain a signal lock or if the signal is outside the range of its design limitations, the monitor will ignore the framebuffer signal and possibly present the user with an error message. LCD monitors tend to contain similar protection circuitry, but for different reasons. Since the LCD must digitally sample the display signal (thereby emulating an electron beam), any signal that is out of range cannot be physically displayed on the monitor. == Color palette == Framebuffers have traditionally supported a wide variety of color modes. Due to the expense of memory, most early framebuffers used 1-bit (2 colors per pixel), 2-bit (4 colors), 4-bit (16 colors) or 8-bit (256 colors) color depths. The problem with such small color depths is that a full range of colors cannot be produced. The solution to this problem was indexed color, which adds a lookup table to the framebuffer. Each color stored in framebuffer memory acts as a color index. The lookup table serves as a palette with a limited number of different colors, while the rest is used as an index table. Here is a typical indexed 256-color image and its own palette (shown as a rectangle of swatches): In some designs, it was also possible to write data to the lookup table (or switch between existing palettes) on the fly, allowing dividing the picture into horizontal bars with their own palette and thus rendering an image that had a far wider palette. For example, viewing an outdoor shot photograph, the picture could be divided into four bars: the top one with emphasis on sky tones, the next with foliage tones, the next with skin and clothing tones, and the bottom one with ground colors. This required each palette to have overlapping colors, but, carefully done, allowed great flexibility. == Memory access == While framebuffers are commonly accessed via a memory mapping directly to the CPU memory space, this is not the only method by which they may be accessed. Framebuffers have varied widely in the methods used to access memory. Some of the most common are: Mapping the entire framebuffer to a given memory range. Port commands to set each pixel, range of pixels or palette entry. Mapping a memory range smaller than the framebuffer memory, then bank switching as necessary. The framebuffer organization may be packed pixel or planar. The framebuffer may be all
Baum–Welch algorithm
In electrical engineering, statistical computing and bioinformatics, the Baum–Welch algorithm is a special case of the expectation–maximization algorithm used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm to compute the statistics for the expectation step. The Baum–Welch algorithm, the primary method for inference in hidden Markov models, is numerically unstable due to its recursive calculation of joint probabilities. As the number of variables grows, these joint probabilities become increasingly small, leading to the forward recursions rapidly approaching values below machine precision. == History == The Baum–Welch algorithm was named after its inventors Leonard E. Baum and Lloyd R. Welch. The algorithm and the Hidden Markov models were first described in a series of articles by Baum and his peers at the IDA Center for Communications Research, Princeton in the late 1960s and early 1970s. One of the first major applications of HMMs was to the field of speech processing. In the 1980s, HMMs were emerging as a useful tool in the analysis of biological systems and information, and in particular genetic information. They have since become an important tool in the probabilistic modeling of genomic sequences. == Description == A hidden Markov model describes the joint probability of a collection of "hidden" and observed discrete random variables. It relies on the assumption that the i-th hidden variable given the (i − 1)-th hidden variable is independent of previous hidden variables, and the current observation variables depend only on the current hidden state. The Baum–Welch algorithm uses the well known EM algorithm to find the maximum likelihood estimate of the parameters of a hidden Markov model given a set of observed feature vectors. Let X t {\displaystyle X_{t}} be a discrete hidden random variable with N {\displaystyle N} possible values (i.e. We assume there are N {\displaystyle N} states in total). We assume the P ( X t ∣ X t − 1 ) {\displaystyle P(X_{t}\mid X_{t-1})} is independent of time t {\displaystyle t} , which leads to the definition of the time-independent stochastic transition matrix A = { a i j } = P ( X t = j ∣ X t − 1 = i ) . {\displaystyle A=\{a_{ij}\}=P(X_{t}=j\mid X_{t-1}=i).} The initial state distribution (i.e. when t = 1 {\displaystyle t=1} ) is given by π i = P ( X 1 = i ) . {\displaystyle \pi _{i}=P(X_{1}=i).} The observation variables Y t {\displaystyle Y_{t}} can take one of K {\displaystyle K} possible values. We also assume the observation given the "hidden" state is time independent. The probability of a certain observation y i {\displaystyle y_{i}} at time t {\displaystyle t} for state X t = j {\displaystyle X_{t}=j} is given by b j ( y i ) = P ( Y t = y i ∣ X t = j ) . {\displaystyle b_{j}(y_{i})=P(Y_{t}=y_{i}\mid X_{t}=j).} Taking into account all the possible values of Y t {\displaystyle Y_{t}} and X t {\displaystyle X_{t}} , we obtain the N × K {\displaystyle N\times K} matrix B = { b j ( y i ) } {\displaystyle B=\{b_{j}(y_{i})\}} where b j {\displaystyle b_{j}} belongs to all the possible states and y i {\displaystyle y_{i}} belongs to all the observations. An observation sequence is given by Y = ( Y 1 = y 1 , Y 2 = y 2 , … , Y T = y T ) {\displaystyle Y=(Y_{1}=y_{1},Y_{2}=y_{2},\ldots ,Y_{T}=y_{T})} . Thus we can describe a hidden Markov chain by θ = ( A , B , π ) {\displaystyle \theta =(A,B,\pi )} . The Baum–Welch algorithm finds a local maximum for θ ∗ = a r g m a x θ P ( Y ∣ θ ) {\displaystyle \theta ^{}=\operatorname {arg\,max} _{\theta }P(Y\mid \theta )} (i.e. the HMM parameters θ {\displaystyle \theta } that maximize the probability of the observation). === Algorithm === Set θ = ( A , B , π ) {\displaystyle \theta =(A,B,\pi )} with random initial conditions. They can also be set using prior information about the parameters if it is available; this can speed up the algorithm and also steer it toward the desired local maximum. ==== Forward procedure ==== Let α i ( t ) = P ( Y 1 = y 1 , … , Y t = y t , X t = i ∣ θ ) {\displaystyle \alpha _{i}(t)=P(Y_{1}=y_{1},\ldots ,Y_{t}=y_{t},X_{t}=i\mid \theta )} , the probability of seeing the observations y 1 , y 2 , … , y t {\displaystyle y_{1},y_{2},\ldots ,y_{t}} and being in state i {\displaystyle i} at time t {\displaystyle t} . This is found recursively: α i ( 1 ) = π i b i ( y 1 ) , {\displaystyle \alpha _{i}(1)=\pi _{i}b_{i}(y_{1}),} α i ( t + 1 ) = b i ( y t + 1 ) ∑ j = 1 N α j ( t ) a j i . {\displaystyle \alpha _{i}(t+1)=b_{i}(y_{t+1})\sum _{j=1}^{N}\alpha _{j}(t)a_{ji}.} Since this series converges exponentially to zero, the algorithm will numerically underflow for longer sequences. However, this can be avoided in a slightly modified algorithm by scaling α {\displaystyle \alpha } in the forward and β {\displaystyle \beta } in the backward procedure below. ==== Backward procedure ==== Let β i ( t ) = P ( Y t + 1 = y t + 1 , … , Y T = y T ∣ X t = i , θ ) {\displaystyle \beta _{i}(t)=P(Y_{t+1}=y_{t+1},\ldots ,Y_{T}=y_{T}\mid X_{t}=i,\theta )} that is the probability of the ending partial sequence y t + 1 , … , y T {\displaystyle y_{t+1},\ldots ,y_{T}} given starting state i {\displaystyle i} at time t {\displaystyle t} . We calculate β i ( t ) {\displaystyle \beta _{i}(t)} as, β i ( T ) = 1 , {\displaystyle \beta _{i}(T)=1,} β i ( t ) = ∑ j = 1 N β j ( t + 1 ) a i j b j ( y t + 1 ) . {\displaystyle \beta _{i}(t)=\sum _{j=1}^{N}\beta _{j}(t+1)a_{ij}b_{j}(y_{t+1}).} ==== Update ==== We can now calculate the temporary variables, according to Bayes' theorem: γ i ( t ) = P ( X t = i ∣ Y , θ ) = P ( X t = i , Y ∣ θ ) P ( Y ∣ θ ) = α i ( t ) β i ( t ) ∑ j = 1 N α j ( t ) β j ( t ) , {\displaystyle \gamma _{i}(t)=P(X_{t}=i\mid Y,\theta )={\frac {P(X_{t}=i,Y\mid \theta )}{P(Y\mid \theta )}}={\frac {\alpha _{i}(t)\beta _{i}(t)}{\sum _{j=1}^{N}\alpha _{j}(t)\beta _{j}(t)}},} which is the probability of being in state i {\displaystyle i} at time t {\displaystyle t} given the observed sequence Y {\displaystyle Y} and the parameters θ {\displaystyle \theta } ξ i j ( t ) = P ( X t = i , X t + 1 = j ∣ Y , θ ) = P ( X t = i , X t + 1 = j , Y ∣ θ ) P ( Y ∣ θ ) = α i ( t ) a i j β j ( t + 1 ) b j ( y t + 1 ) ∑ k = 1 N ∑ w = 1 N α k ( t ) a k w β w ( t + 1 ) b w ( y t + 1 ) , {\displaystyle \xi _{ij}(t)=P(X_{t}=i,X_{t+1}=j\mid Y,\theta )={\frac {P(X_{t}=i,X_{t+1}=j,Y\mid \theta )}{P(Y\mid \theta )}}={\frac {\alpha _{i}(t)a_{ij}\beta _{j}(t+1)b_{j}(y_{t+1})}{\sum _{k=1}^{N}\sum _{w=1}^{N}\alpha _{k}(t)a_{kw}\beta _{w}(t+1)b_{w}(y_{t+1})}},} which is the probability of being in state i {\displaystyle i} and j {\displaystyle j} at times t {\displaystyle t} and t + 1 {\displaystyle t+1} respectively given the observed sequence Y {\displaystyle Y} and parameters θ {\displaystyle \theta } . The denominators of γ i ( t ) {\displaystyle \gamma _{i}(t)} and ξ i j ( t ) {\displaystyle \xi _{ij}(t)} are the same ; they represent the probability of making the observation Y {\displaystyle Y} given the parameters θ {\displaystyle \theta } . The parameters of the hidden Markov model θ {\displaystyle \theta } can now be updated: π i ∗ = γ i ( 1 ) , {\displaystyle \pi _{i}^{}=\gamma _{i}(1),} which is the expected frequency spent in state i {\displaystyle i} at time 1 {\displaystyle 1} . a i j ∗ = ∑ t = 1 T − 1 ξ i j ( t ) ∑ t = 1 T − 1 γ i ( t ) , {\displaystyle a_{ij}^{}={\frac {\sum _{t=1}^{T-1}\xi _{ij}(t)}{\sum _{t=1}^{T-1}\gamma _{i}(t)}},} which is the expected number of transitions from state i to state j compared to the expected total number of transitions starting in state i, including from state i to itself. The number of transitions starting in state i is equivalent to the number of times state i is observed in the sequence from t = 1 to t = T − 1. b i ∗ ( v k ) = ∑ t = 1 T 1 y t = v k γ i ( t ) ∑ t = 1 T γ i ( t ) , {\displaystyle b_{i}^{}(v_{k})={\frac {\sum _{t=1}^{T}1_{y_{t}=v_{k}}\gamma _{i}(t)}{\sum _{t=1}^{T}\gamma _{i}(t)}},} where 1 y t = v k = { 1 if y t = v k , 0 otherwise {\displaystyle 1_{y_{t}=v_{k}}={\begin{cases}1&{\text{if }}y_{t}=v_{k},\\0&{\text{otherwise}}\end{cases}}} is an indicator function, and b i ∗ ( v k ) {\displaystyle b_{i}^{}(v_{k})} is the expected number of times the output observations have been equal to v k {\displaystyle v_{k}} while in state i {\displaystyle i} over the expected total number of times in state i {\displaystyle i} . These steps are now repeated iteratively until a desired level of convergence. Note: It is possible to over-fit a particular data set. That is, P ( Y ∣ θ final ) > P ( Y ∣ θ true ) {\displaystyle P(Y\mid \theta _{\text{final}})>P(Y\mid \theta _{\text{true}})} . The algorithm also does not guarantee a global maximum. ==== Multiple sequences ==== The algorithm described thus far assumes a single observed sequence Y = y 1 , … , y T {\displaystyle Y=y_{1},\ldots ,y_{T}} . However, in many situations, there are several sequences observed: Y 1 ,
Baum–Welch algorithm
In electrical engineering, statistical computing and bioinformatics, the Baum–Welch algorithm is a special case of the expectation–maximization algorithm used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm to compute the statistics for the expectation step. The Baum–Welch algorithm, the primary method for inference in hidden Markov models, is numerically unstable due to its recursive calculation of joint probabilities. As the number of variables grows, these joint probabilities become increasingly small, leading to the forward recursions rapidly approaching values below machine precision. == History == The Baum–Welch algorithm was named after its inventors Leonard E. Baum and Lloyd R. Welch. The algorithm and the Hidden Markov models were first described in a series of articles by Baum and his peers at the IDA Center for Communications Research, Princeton in the late 1960s and early 1970s. One of the first major applications of HMMs was to the field of speech processing. In the 1980s, HMMs were emerging as a useful tool in the analysis of biological systems and information, and in particular genetic information. They have since become an important tool in the probabilistic modeling of genomic sequences. == Description == A hidden Markov model describes the joint probability of a collection of "hidden" and observed discrete random variables. It relies on the assumption that the i-th hidden variable given the (i − 1)-th hidden variable is independent of previous hidden variables, and the current observation variables depend only on the current hidden state. The Baum–Welch algorithm uses the well known EM algorithm to find the maximum likelihood estimate of the parameters of a hidden Markov model given a set of observed feature vectors. Let X t {\displaystyle X_{t}} be a discrete hidden random variable with N {\displaystyle N} possible values (i.e. We assume there are N {\displaystyle N} states in total). We assume the P ( X t ∣ X t − 1 ) {\displaystyle P(X_{t}\mid X_{t-1})} is independent of time t {\displaystyle t} , which leads to the definition of the time-independent stochastic transition matrix A = { a i j } = P ( X t = j ∣ X t − 1 = i ) . {\displaystyle A=\{a_{ij}\}=P(X_{t}=j\mid X_{t-1}=i).} The initial state distribution (i.e. when t = 1 {\displaystyle t=1} ) is given by π i = P ( X 1 = i ) . {\displaystyle \pi _{i}=P(X_{1}=i).} The observation variables Y t {\displaystyle Y_{t}} can take one of K {\displaystyle K} possible values. We also assume the observation given the "hidden" state is time independent. The probability of a certain observation y i {\displaystyle y_{i}} at time t {\displaystyle t} for state X t = j {\displaystyle X_{t}=j} is given by b j ( y i ) = P ( Y t = y i ∣ X t = j ) . {\displaystyle b_{j}(y_{i})=P(Y_{t}=y_{i}\mid X_{t}=j).} Taking into account all the possible values of Y t {\displaystyle Y_{t}} and X t {\displaystyle X_{t}} , we obtain the N × K {\displaystyle N\times K} matrix B = { b j ( y i ) } {\displaystyle B=\{b_{j}(y_{i})\}} where b j {\displaystyle b_{j}} belongs to all the possible states and y i {\displaystyle y_{i}} belongs to all the observations. An observation sequence is given by Y = ( Y 1 = y 1 , Y 2 = y 2 , … , Y T = y T ) {\displaystyle Y=(Y_{1}=y_{1},Y_{2}=y_{2},\ldots ,Y_{T}=y_{T})} . Thus we can describe a hidden Markov chain by θ = ( A , B , π ) {\displaystyle \theta =(A,B,\pi )} . The Baum–Welch algorithm finds a local maximum for θ ∗ = a r g m a x θ P ( Y ∣ θ ) {\displaystyle \theta ^{}=\operatorname {arg\,max} _{\theta }P(Y\mid \theta )} (i.e. the HMM parameters θ {\displaystyle \theta } that maximize the probability of the observation). === Algorithm === Set θ = ( A , B , π ) {\displaystyle \theta =(A,B,\pi )} with random initial conditions. They can also be set using prior information about the parameters if it is available; this can speed up the algorithm and also steer it toward the desired local maximum. ==== Forward procedure ==== Let α i ( t ) = P ( Y 1 = y 1 , … , Y t = y t , X t = i ∣ θ ) {\displaystyle \alpha _{i}(t)=P(Y_{1}=y_{1},\ldots ,Y_{t}=y_{t},X_{t}=i\mid \theta )} , the probability of seeing the observations y 1 , y 2 , … , y t {\displaystyle y_{1},y_{2},\ldots ,y_{t}} and being in state i {\displaystyle i} at time t {\displaystyle t} . This is found recursively: α i ( 1 ) = π i b i ( y 1 ) , {\displaystyle \alpha _{i}(1)=\pi _{i}b_{i}(y_{1}),} α i ( t + 1 ) = b i ( y t + 1 ) ∑ j = 1 N α j ( t ) a j i . {\displaystyle \alpha _{i}(t+1)=b_{i}(y_{t+1})\sum _{j=1}^{N}\alpha _{j}(t)a_{ji}.} Since this series converges exponentially to zero, the algorithm will numerically underflow for longer sequences. However, this can be avoided in a slightly modified algorithm by scaling α {\displaystyle \alpha } in the forward and β {\displaystyle \beta } in the backward procedure below. ==== Backward procedure ==== Let β i ( t ) = P ( Y t + 1 = y t + 1 , … , Y T = y T ∣ X t = i , θ ) {\displaystyle \beta _{i}(t)=P(Y_{t+1}=y_{t+1},\ldots ,Y_{T}=y_{T}\mid X_{t}=i,\theta )} that is the probability of the ending partial sequence y t + 1 , … , y T {\displaystyle y_{t+1},\ldots ,y_{T}} given starting state i {\displaystyle i} at time t {\displaystyle t} . We calculate β i ( t ) {\displaystyle \beta _{i}(t)} as, β i ( T ) = 1 , {\displaystyle \beta _{i}(T)=1,} β i ( t ) = ∑ j = 1 N β j ( t + 1 ) a i j b j ( y t + 1 ) . {\displaystyle \beta _{i}(t)=\sum _{j=1}^{N}\beta _{j}(t+1)a_{ij}b_{j}(y_{t+1}).} ==== Update ==== We can now calculate the temporary variables, according to Bayes' theorem: γ i ( t ) = P ( X t = i ∣ Y , θ ) = P ( X t = i , Y ∣ θ ) P ( Y ∣ θ ) = α i ( t ) β i ( t ) ∑ j = 1 N α j ( t ) β j ( t ) , {\displaystyle \gamma _{i}(t)=P(X_{t}=i\mid Y,\theta )={\frac {P(X_{t}=i,Y\mid \theta )}{P(Y\mid \theta )}}={\frac {\alpha _{i}(t)\beta _{i}(t)}{\sum _{j=1}^{N}\alpha _{j}(t)\beta _{j}(t)}},} which is the probability of being in state i {\displaystyle i} at time t {\displaystyle t} given the observed sequence Y {\displaystyle Y} and the parameters θ {\displaystyle \theta } ξ i j ( t ) = P ( X t = i , X t + 1 = j ∣ Y , θ ) = P ( X t = i , X t + 1 = j , Y ∣ θ ) P ( Y ∣ θ ) = α i ( t ) a i j β j ( t + 1 ) b j ( y t + 1 ) ∑ k = 1 N ∑ w = 1 N α k ( t ) a k w β w ( t + 1 ) b w ( y t + 1 ) , {\displaystyle \xi _{ij}(t)=P(X_{t}=i,X_{t+1}=j\mid Y,\theta )={\frac {P(X_{t}=i,X_{t+1}=j,Y\mid \theta )}{P(Y\mid \theta )}}={\frac {\alpha _{i}(t)a_{ij}\beta _{j}(t+1)b_{j}(y_{t+1})}{\sum _{k=1}^{N}\sum _{w=1}^{N}\alpha _{k}(t)a_{kw}\beta _{w}(t+1)b_{w}(y_{t+1})}},} which is the probability of being in state i {\displaystyle i} and j {\displaystyle j} at times t {\displaystyle t} and t + 1 {\displaystyle t+1} respectively given the observed sequence Y {\displaystyle Y} and parameters θ {\displaystyle \theta } . The denominators of γ i ( t ) {\displaystyle \gamma _{i}(t)} and ξ i j ( t ) {\displaystyle \xi _{ij}(t)} are the same ; they represent the probability of making the observation Y {\displaystyle Y} given the parameters θ {\displaystyle \theta } . The parameters of the hidden Markov model θ {\displaystyle \theta } can now be updated: π i ∗ = γ i ( 1 ) , {\displaystyle \pi _{i}^{}=\gamma _{i}(1),} which is the expected frequency spent in state i {\displaystyle i} at time 1 {\displaystyle 1} . a i j ∗ = ∑ t = 1 T − 1 ξ i j ( t ) ∑ t = 1 T − 1 γ i ( t ) , {\displaystyle a_{ij}^{}={\frac {\sum _{t=1}^{T-1}\xi _{ij}(t)}{\sum _{t=1}^{T-1}\gamma _{i}(t)}},} which is the expected number of transitions from state i to state j compared to the expected total number of transitions starting in state i, including from state i to itself. The number of transitions starting in state i is equivalent to the number of times state i is observed in the sequence from t = 1 to t = T − 1. b i ∗ ( v k ) = ∑ t = 1 T 1 y t = v k γ i ( t ) ∑ t = 1 T γ i ( t ) , {\displaystyle b_{i}^{}(v_{k})={\frac {\sum _{t=1}^{T}1_{y_{t}=v_{k}}\gamma _{i}(t)}{\sum _{t=1}^{T}\gamma _{i}(t)}},} where 1 y t = v k = { 1 if y t = v k , 0 otherwise {\displaystyle 1_{y_{t}=v_{k}}={\begin{cases}1&{\text{if }}y_{t}=v_{k},\\0&{\text{otherwise}}\end{cases}}} is an indicator function, and b i ∗ ( v k ) {\displaystyle b_{i}^{}(v_{k})} is the expected number of times the output observations have been equal to v k {\displaystyle v_{k}} while in state i {\displaystyle i} over the expected total number of times in state i {\displaystyle i} . These steps are now repeated iteratively until a desired level of convergence. Note: It is possible to over-fit a particular data set. That is, P ( Y ∣ θ final ) > P ( Y ∣ θ true ) {\displaystyle P(Y\mid \theta _{\text{final}})>P(Y\mid \theta _{\text{true}})} . The algorithm also does not guarantee a global maximum. ==== Multiple sequences ==== The algorithm described thus far assumes a single observed sequence Y = y 1 , … , y T {\displaystyle Y=y_{1},\ldots ,y_{T}} . However, in many situations, there are several sequences observed: Y 1 ,
Baum–Welch algorithm
In electrical engineering, statistical computing and bioinformatics, the Baum–Welch algorithm is a special case of the expectation–maximization algorithm used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm to compute the statistics for the expectation step. The Baum–Welch algorithm, the primary method for inference in hidden Markov models, is numerically unstable due to its recursive calculation of joint probabilities. As the number of variables grows, these joint probabilities become increasingly small, leading to the forward recursions rapidly approaching values below machine precision. == History == The Baum–Welch algorithm was named after its inventors Leonard E. Baum and Lloyd R. Welch. The algorithm and the Hidden Markov models were first described in a series of articles by Baum and his peers at the IDA Center for Communications Research, Princeton in the late 1960s and early 1970s. One of the first major applications of HMMs was to the field of speech processing. In the 1980s, HMMs were emerging as a useful tool in the analysis of biological systems and information, and in particular genetic information. They have since become an important tool in the probabilistic modeling of genomic sequences. == Description == A hidden Markov model describes the joint probability of a collection of "hidden" and observed discrete random variables. It relies on the assumption that the i-th hidden variable given the (i − 1)-th hidden variable is independent of previous hidden variables, and the current observation variables depend only on the current hidden state. The Baum–Welch algorithm uses the well known EM algorithm to find the maximum likelihood estimate of the parameters of a hidden Markov model given a set of observed feature vectors. Let X t {\displaystyle X_{t}} be a discrete hidden random variable with N {\displaystyle N} possible values (i.e. We assume there are N {\displaystyle N} states in total). We assume the P ( X t ∣ X t − 1 ) {\displaystyle P(X_{t}\mid X_{t-1})} is independent of time t {\displaystyle t} , which leads to the definition of the time-independent stochastic transition matrix A = { a i j } = P ( X t = j ∣ X t − 1 = i ) . {\displaystyle A=\{a_{ij}\}=P(X_{t}=j\mid X_{t-1}=i).} The initial state distribution (i.e. when t = 1 {\displaystyle t=1} ) is given by π i = P ( X 1 = i ) . {\displaystyle \pi _{i}=P(X_{1}=i).} The observation variables Y t {\displaystyle Y_{t}} can take one of K {\displaystyle K} possible values. We also assume the observation given the "hidden" state is time independent. The probability of a certain observation y i {\displaystyle y_{i}} at time t {\displaystyle t} for state X t = j {\displaystyle X_{t}=j} is given by b j ( y i ) = P ( Y t = y i ∣ X t = j ) . {\displaystyle b_{j}(y_{i})=P(Y_{t}=y_{i}\mid X_{t}=j).} Taking into account all the possible values of Y t {\displaystyle Y_{t}} and X t {\displaystyle X_{t}} , we obtain the N × K {\displaystyle N\times K} matrix B = { b j ( y i ) } {\displaystyle B=\{b_{j}(y_{i})\}} where b j {\displaystyle b_{j}} belongs to all the possible states and y i {\displaystyle y_{i}} belongs to all the observations. An observation sequence is given by Y = ( Y 1 = y 1 , Y 2 = y 2 , … , Y T = y T ) {\displaystyle Y=(Y_{1}=y_{1},Y_{2}=y_{2},\ldots ,Y_{T}=y_{T})} . Thus we can describe a hidden Markov chain by θ = ( A , B , π ) {\displaystyle \theta =(A,B,\pi )} . The Baum–Welch algorithm finds a local maximum for θ ∗ = a r g m a x θ P ( Y ∣ θ ) {\displaystyle \theta ^{}=\operatorname {arg\,max} _{\theta }P(Y\mid \theta )} (i.e. the HMM parameters θ {\displaystyle \theta } that maximize the probability of the observation). === Algorithm === Set θ = ( A , B , π ) {\displaystyle \theta =(A,B,\pi )} with random initial conditions. They can also be set using prior information about the parameters if it is available; this can speed up the algorithm and also steer it toward the desired local maximum. ==== Forward procedure ==== Let α i ( t ) = P ( Y 1 = y 1 , … , Y t = y t , X t = i ∣ θ ) {\displaystyle \alpha _{i}(t)=P(Y_{1}=y_{1},\ldots ,Y_{t}=y_{t},X_{t}=i\mid \theta )} , the probability of seeing the observations y 1 , y 2 , … , y t {\displaystyle y_{1},y_{2},\ldots ,y_{t}} and being in state i {\displaystyle i} at time t {\displaystyle t} . This is found recursively: α i ( 1 ) = π i b i ( y 1 ) , {\displaystyle \alpha _{i}(1)=\pi _{i}b_{i}(y_{1}),} α i ( t + 1 ) = b i ( y t + 1 ) ∑ j = 1 N α j ( t ) a j i . {\displaystyle \alpha _{i}(t+1)=b_{i}(y_{t+1})\sum _{j=1}^{N}\alpha _{j}(t)a_{ji}.} Since this series converges exponentially to zero, the algorithm will numerically underflow for longer sequences. However, this can be avoided in a slightly modified algorithm by scaling α {\displaystyle \alpha } in the forward and β {\displaystyle \beta } in the backward procedure below. ==== Backward procedure ==== Let β i ( t ) = P ( Y t + 1 = y t + 1 , … , Y T = y T ∣ X t = i , θ ) {\displaystyle \beta _{i}(t)=P(Y_{t+1}=y_{t+1},\ldots ,Y_{T}=y_{T}\mid X_{t}=i,\theta )} that is the probability of the ending partial sequence y t + 1 , … , y T {\displaystyle y_{t+1},\ldots ,y_{T}} given starting state i {\displaystyle i} at time t {\displaystyle t} . We calculate β i ( t ) {\displaystyle \beta _{i}(t)} as, β i ( T ) = 1 , {\displaystyle \beta _{i}(T)=1,} β i ( t ) = ∑ j = 1 N β j ( t + 1 ) a i j b j ( y t + 1 ) . {\displaystyle \beta _{i}(t)=\sum _{j=1}^{N}\beta _{j}(t+1)a_{ij}b_{j}(y_{t+1}).} ==== Update ==== We can now calculate the temporary variables, according to Bayes' theorem: γ i ( t ) = P ( X t = i ∣ Y , θ ) = P ( X t = i , Y ∣ θ ) P ( Y ∣ θ ) = α i ( t ) β i ( t ) ∑ j = 1 N α j ( t ) β j ( t ) , {\displaystyle \gamma _{i}(t)=P(X_{t}=i\mid Y,\theta )={\frac {P(X_{t}=i,Y\mid \theta )}{P(Y\mid \theta )}}={\frac {\alpha _{i}(t)\beta _{i}(t)}{\sum _{j=1}^{N}\alpha _{j}(t)\beta _{j}(t)}},} which is the probability of being in state i {\displaystyle i} at time t {\displaystyle t} given the observed sequence Y {\displaystyle Y} and the parameters θ {\displaystyle \theta } ξ i j ( t ) = P ( X t = i , X t + 1 = j ∣ Y , θ ) = P ( X t = i , X t + 1 = j , Y ∣ θ ) P ( Y ∣ θ ) = α i ( t ) a i j β j ( t + 1 ) b j ( y t + 1 ) ∑ k = 1 N ∑ w = 1 N α k ( t ) a k w β w ( t + 1 ) b w ( y t + 1 ) , {\displaystyle \xi _{ij}(t)=P(X_{t}=i,X_{t+1}=j\mid Y,\theta )={\frac {P(X_{t}=i,X_{t+1}=j,Y\mid \theta )}{P(Y\mid \theta )}}={\frac {\alpha _{i}(t)a_{ij}\beta _{j}(t+1)b_{j}(y_{t+1})}{\sum _{k=1}^{N}\sum _{w=1}^{N}\alpha _{k}(t)a_{kw}\beta _{w}(t+1)b_{w}(y_{t+1})}},} which is the probability of being in state i {\displaystyle i} and j {\displaystyle j} at times t {\displaystyle t} and t + 1 {\displaystyle t+1} respectively given the observed sequence Y {\displaystyle Y} and parameters θ {\displaystyle \theta } . The denominators of γ i ( t ) {\displaystyle \gamma _{i}(t)} and ξ i j ( t ) {\displaystyle \xi _{ij}(t)} are the same ; they represent the probability of making the observation Y {\displaystyle Y} given the parameters θ {\displaystyle \theta } . The parameters of the hidden Markov model θ {\displaystyle \theta } can now be updated: π i ∗ = γ i ( 1 ) , {\displaystyle \pi _{i}^{}=\gamma _{i}(1),} which is the expected frequency spent in state i {\displaystyle i} at time 1 {\displaystyle 1} . a i j ∗ = ∑ t = 1 T − 1 ξ i j ( t ) ∑ t = 1 T − 1 γ i ( t ) , {\displaystyle a_{ij}^{}={\frac {\sum _{t=1}^{T-1}\xi _{ij}(t)}{\sum _{t=1}^{T-1}\gamma _{i}(t)}},} which is the expected number of transitions from state i to state j compared to the expected total number of transitions starting in state i, including from state i to itself. The number of transitions starting in state i is equivalent to the number of times state i is observed in the sequence from t = 1 to t = T − 1. b i ∗ ( v k ) = ∑ t = 1 T 1 y t = v k γ i ( t ) ∑ t = 1 T γ i ( t ) , {\displaystyle b_{i}^{}(v_{k})={\frac {\sum _{t=1}^{T}1_{y_{t}=v_{k}}\gamma _{i}(t)}{\sum _{t=1}^{T}\gamma _{i}(t)}},} where 1 y t = v k = { 1 if y t = v k , 0 otherwise {\displaystyle 1_{y_{t}=v_{k}}={\begin{cases}1&{\text{if }}y_{t}=v_{k},\\0&{\text{otherwise}}\end{cases}}} is an indicator function, and b i ∗ ( v k ) {\displaystyle b_{i}^{}(v_{k})} is the expected number of times the output observations have been equal to v k {\displaystyle v_{k}} while in state i {\displaystyle i} over the expected total number of times in state i {\displaystyle i} . These steps are now repeated iteratively until a desired level of convergence. Note: It is possible to over-fit a particular data set. That is, P ( Y ∣ θ final ) > P ( Y ∣ θ true ) {\displaystyle P(Y\mid \theta _{\text{final}})>P(Y\mid \theta _{\text{true}})} . The algorithm also does not guarantee a global maximum. ==== Multiple sequences ==== The algorithm described thus far assumes a single observed sequence Y = y 1 , … , y T {\displaystyle Y=y_{1},\ldots ,y_{T}} . However, in many situations, there are several sequences observed: Y 1 ,
List of artificial intelligence journals
This is a list of notable peer-reviewed academic journals that publish research in the field of artificial intelligence (AI), including areas such as machine learning, computer vision, natural language processing, robotics, and intelligent systems. == General artificial intelligence == Artificial Intelligence (journal) – Elsevier Journal of Artificial Intelligence Research (JAIR) – AI Access Foundation Knowledge-Based Systems – Elsevier == Machine learning == Data Mining and Knowledge Discovery – Springer Machine Learning (journal) – Springer Journal of Machine Learning Research – Microtome Pattern Recognition (journal) – Elsevier Neural Networks (journal) – Elsevier Neural Computation (journal) – MIT Press Neurocomputing (journal) - Elsevier == Deep learning and neural computation == IEEE Transactions on Evolutionary Computation – IEEE IEEE Transactions on Neural Networks and Learning Systems – IEEE Nature Machine Intelligence – Springer Nature == Computer vision == International Journal of Computer Vision – Springer IEEE Transactions on Pattern Analysis and Machine Intelligence – IEEE Machine Vision and Applications – Springer == Natural language processing == Computational Linguistics (journal) – MIT Press Natural Language Processing Transactions of the Association for Computational Linguistics – ACL == Robotics and intelligent systems == IEEE Transactions on Robotics – IEEE Autonomous Robots – Springer Journal of Intelligent & Robotic Systems – Springer == Interdisciplinary and ethics in AI == AI & Society – Springer Artificial Life – MIT Press Philosophy & Technology – Springer Minds and Machines – Springer
Baum–Welch algorithm
In electrical engineering, statistical computing and bioinformatics, the Baum–Welch algorithm is a special case of the expectation–maximization algorithm used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm to compute the statistics for the expectation step. The Baum–Welch algorithm, the primary method for inference in hidden Markov models, is numerically unstable due to its recursive calculation of joint probabilities. As the number of variables grows, these joint probabilities become increasingly small, leading to the forward recursions rapidly approaching values below machine precision. == History == The Baum–Welch algorithm was named after its inventors Leonard E. Baum and Lloyd R. Welch. The algorithm and the Hidden Markov models were first described in a series of articles by Baum and his peers at the IDA Center for Communications Research, Princeton in the late 1960s and early 1970s. One of the first major applications of HMMs was to the field of speech processing. In the 1980s, HMMs were emerging as a useful tool in the analysis of biological systems and information, and in particular genetic information. They have since become an important tool in the probabilistic modeling of genomic sequences. == Description == A hidden Markov model describes the joint probability of a collection of "hidden" and observed discrete random variables. It relies on the assumption that the i-th hidden variable given the (i − 1)-th hidden variable is independent of previous hidden variables, and the current observation variables depend only on the current hidden state. The Baum–Welch algorithm uses the well known EM algorithm to find the maximum likelihood estimate of the parameters of a hidden Markov model given a set of observed feature vectors. Let X t {\displaystyle X_{t}} be a discrete hidden random variable with N {\displaystyle N} possible values (i.e. We assume there are N {\displaystyle N} states in total). We assume the P ( X t ∣ X t − 1 ) {\displaystyle P(X_{t}\mid X_{t-1})} is independent of time t {\displaystyle t} , which leads to the definition of the time-independent stochastic transition matrix A = { a i j } = P ( X t = j ∣ X t − 1 = i ) . {\displaystyle A=\{a_{ij}\}=P(X_{t}=j\mid X_{t-1}=i).} The initial state distribution (i.e. when t = 1 {\displaystyle t=1} ) is given by π i = P ( X 1 = i ) . {\displaystyle \pi _{i}=P(X_{1}=i).} The observation variables Y t {\displaystyle Y_{t}} can take one of K {\displaystyle K} possible values. We also assume the observation given the "hidden" state is time independent. The probability of a certain observation y i {\displaystyle y_{i}} at time t {\displaystyle t} for state X t = j {\displaystyle X_{t}=j} is given by b j ( y i ) = P ( Y t = y i ∣ X t = j ) . {\displaystyle b_{j}(y_{i})=P(Y_{t}=y_{i}\mid X_{t}=j).} Taking into account all the possible values of Y t {\displaystyle Y_{t}} and X t {\displaystyle X_{t}} , we obtain the N × K {\displaystyle N\times K} matrix B = { b j ( y i ) } {\displaystyle B=\{b_{j}(y_{i})\}} where b j {\displaystyle b_{j}} belongs to all the possible states and y i {\displaystyle y_{i}} belongs to all the observations. An observation sequence is given by Y = ( Y 1 = y 1 , Y 2 = y 2 , … , Y T = y T ) {\displaystyle Y=(Y_{1}=y_{1},Y_{2}=y_{2},\ldots ,Y_{T}=y_{T})} . Thus we can describe a hidden Markov chain by θ = ( A , B , π ) {\displaystyle \theta =(A,B,\pi )} . The Baum–Welch algorithm finds a local maximum for θ ∗ = a r g m a x θ P ( Y ∣ θ ) {\displaystyle \theta ^{}=\operatorname {arg\,max} _{\theta }P(Y\mid \theta )} (i.e. the HMM parameters θ {\displaystyle \theta } that maximize the probability of the observation). === Algorithm === Set θ = ( A , B , π ) {\displaystyle \theta =(A,B,\pi )} with random initial conditions. They can also be set using prior information about the parameters if it is available; this can speed up the algorithm and also steer it toward the desired local maximum. ==== Forward procedure ==== Let α i ( t ) = P ( Y 1 = y 1 , … , Y t = y t , X t = i ∣ θ ) {\displaystyle \alpha _{i}(t)=P(Y_{1}=y_{1},\ldots ,Y_{t}=y_{t},X_{t}=i\mid \theta )} , the probability of seeing the observations y 1 , y 2 , … , y t {\displaystyle y_{1},y_{2},\ldots ,y_{t}} and being in state i {\displaystyle i} at time t {\displaystyle t} . This is found recursively: α i ( 1 ) = π i b i ( y 1 ) , {\displaystyle \alpha _{i}(1)=\pi _{i}b_{i}(y_{1}),} α i ( t + 1 ) = b i ( y t + 1 ) ∑ j = 1 N α j ( t ) a j i . {\displaystyle \alpha _{i}(t+1)=b_{i}(y_{t+1})\sum _{j=1}^{N}\alpha _{j}(t)a_{ji}.} Since this series converges exponentially to zero, the algorithm will numerically underflow for longer sequences. However, this can be avoided in a slightly modified algorithm by scaling α {\displaystyle \alpha } in the forward and β {\displaystyle \beta } in the backward procedure below. ==== Backward procedure ==== Let β i ( t ) = P ( Y t + 1 = y t + 1 , … , Y T = y T ∣ X t = i , θ ) {\displaystyle \beta _{i}(t)=P(Y_{t+1}=y_{t+1},\ldots ,Y_{T}=y_{T}\mid X_{t}=i,\theta )} that is the probability of the ending partial sequence y t + 1 , … , y T {\displaystyle y_{t+1},\ldots ,y_{T}} given starting state i {\displaystyle i} at time t {\displaystyle t} . We calculate β i ( t ) {\displaystyle \beta _{i}(t)} as, β i ( T ) = 1 , {\displaystyle \beta _{i}(T)=1,} β i ( t ) = ∑ j = 1 N β j ( t + 1 ) a i j b j ( y t + 1 ) . {\displaystyle \beta _{i}(t)=\sum _{j=1}^{N}\beta _{j}(t+1)a_{ij}b_{j}(y_{t+1}).} ==== Update ==== We can now calculate the temporary variables, according to Bayes' theorem: γ i ( t ) = P ( X t = i ∣ Y , θ ) = P ( X t = i , Y ∣ θ ) P ( Y ∣ θ ) = α i ( t ) β i ( t ) ∑ j = 1 N α j ( t ) β j ( t ) , {\displaystyle \gamma _{i}(t)=P(X_{t}=i\mid Y,\theta )={\frac {P(X_{t}=i,Y\mid \theta )}{P(Y\mid \theta )}}={\frac {\alpha _{i}(t)\beta _{i}(t)}{\sum _{j=1}^{N}\alpha _{j}(t)\beta _{j}(t)}},} which is the probability of being in state i {\displaystyle i} at time t {\displaystyle t} given the observed sequence Y {\displaystyle Y} and the parameters θ {\displaystyle \theta } ξ i j ( t ) = P ( X t = i , X t + 1 = j ∣ Y , θ ) = P ( X t = i , X t + 1 = j , Y ∣ θ ) P ( Y ∣ θ ) = α i ( t ) a i j β j ( t + 1 ) b j ( y t + 1 ) ∑ k = 1 N ∑ w = 1 N α k ( t ) a k w β w ( t + 1 ) b w ( y t + 1 ) , {\displaystyle \xi _{ij}(t)=P(X_{t}=i,X_{t+1}=j\mid Y,\theta )={\frac {P(X_{t}=i,X_{t+1}=j,Y\mid \theta )}{P(Y\mid \theta )}}={\frac {\alpha _{i}(t)a_{ij}\beta _{j}(t+1)b_{j}(y_{t+1})}{\sum _{k=1}^{N}\sum _{w=1}^{N}\alpha _{k}(t)a_{kw}\beta _{w}(t+1)b_{w}(y_{t+1})}},} which is the probability of being in state i {\displaystyle i} and j {\displaystyle j} at times t {\displaystyle t} and t + 1 {\displaystyle t+1} respectively given the observed sequence Y {\displaystyle Y} and parameters θ {\displaystyle \theta } . The denominators of γ i ( t ) {\displaystyle \gamma _{i}(t)} and ξ i j ( t ) {\displaystyle \xi _{ij}(t)} are the same ; they represent the probability of making the observation Y {\displaystyle Y} given the parameters θ {\displaystyle \theta } . The parameters of the hidden Markov model θ {\displaystyle \theta } can now be updated: π i ∗ = γ i ( 1 ) , {\displaystyle \pi _{i}^{}=\gamma _{i}(1),} which is the expected frequency spent in state i {\displaystyle i} at time 1 {\displaystyle 1} . a i j ∗ = ∑ t = 1 T − 1 ξ i j ( t ) ∑ t = 1 T − 1 γ i ( t ) , {\displaystyle a_{ij}^{}={\frac {\sum _{t=1}^{T-1}\xi _{ij}(t)}{\sum _{t=1}^{T-1}\gamma _{i}(t)}},} which is the expected number of transitions from state i to state j compared to the expected total number of transitions starting in state i, including from state i to itself. The number of transitions starting in state i is equivalent to the number of times state i is observed in the sequence from t = 1 to t = T − 1. b i ∗ ( v k ) = ∑ t = 1 T 1 y t = v k γ i ( t ) ∑ t = 1 T γ i ( t ) , {\displaystyle b_{i}^{}(v_{k})={\frac {\sum _{t=1}^{T}1_{y_{t}=v_{k}}\gamma _{i}(t)}{\sum _{t=1}^{T}\gamma _{i}(t)}},} where 1 y t = v k = { 1 if y t = v k , 0 otherwise {\displaystyle 1_{y_{t}=v_{k}}={\begin{cases}1&{\text{if }}y_{t}=v_{k},\\0&{\text{otherwise}}\end{cases}}} is an indicator function, and b i ∗ ( v k ) {\displaystyle b_{i}^{}(v_{k})} is the expected number of times the output observations have been equal to v k {\displaystyle v_{k}} while in state i {\displaystyle i} over the expected total number of times in state i {\displaystyle i} . These steps are now repeated iteratively until a desired level of convergence. Note: It is possible to over-fit a particular data set. That is, P ( Y ∣ θ final ) > P ( Y ∣ θ true ) {\displaystyle P(Y\mid \theta _{\text{final}})>P(Y\mid \theta _{\text{true}})} . The algorithm also does not guarantee a global maximum. ==== Multiple sequences ==== The algorithm described thus far assumes a single observed sequence Y = y 1 , … , y T {\displaystyle Y=y_{1},\ldots ,y_{T}} . However, in many situations, there are several sequences observed: Y 1 ,