Public Documentation
Module Markovify
The following is the documentation of symbols exported from the Markovify module. The module is used to construct a Markov chain from a given list of lists of tokens and to walk through it, generating a random sequence of tokens along the way. Please see the Examples section if you are looking for usage examples.
Markovify.Model — Type. The data structure of the Markov chain. It encodes all the different states, and the probabilities of moving from one to another, as a dictionary: the keys are the states, and the values are the respective TokenOccurences dictionaries, which record how many times each token was found immediately after the given state.
Fields
The dictionary mapping each state to its TokenOccurences dictionary.
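As an illustration (a hand-written sketch, not output of the library), an order-1 model trained on the token sequences [:a, :b] and [:a, :c] might store a state dictionary along these lines, assuming states are tuples of tokens and the :begin/:end markers described under walk:

```julia
# Hypothetical shape of the model's state dictionary (order = 1).
# Each state maps to a TokenOccurences dictionary: token => count.
nodes = Dict(
    (:begin,) => Dict(:a => 2),          # both sequences start with :a
    (:a,)     => Dict(:b => 1, :c => 1), # :a is followed by :b once and :c once
    (:b,)     => Dict(:end => 1),
    (:c,)     => Dict(:end => 1),
)
```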
Markovify.Model — Method. Model(nodes): Return a model constructed from nodes. Can be used to reconstruct a model object from its nodes, e.g. if the nodes were saved in a JSON file.
Markovify.Model — Method. Model(suptokens::Vector{<:Vector{T}}; order=2, weight=stdweight): Return a Model trained on an array of arrays of tokens (suptokens). Optionally, an order of the chain can be supplied; that is the number of tokens in one state. A weight function of general type func(::State{T}, ::Token{T}) -> Int can be supplied to bias the weights based on the state or token value.
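A minimal training sketch (assuming the Markovify package is installed; the symbol tokens are arbitrary values chosen for illustration):

```julia
using Markovify

# Two training sequences; each inner vector is one sequence of tokens.
suptokens = [[:the, :cat, :sat], [:the, :dog, :sat]]

# Train an order-1 chain; order=2 (the default) would use two-token states.
model = Model(suptokens; order=1)

# A custom weight function could bias transitions, e.g. a constant weight:
flat = Model(suptokens; order=1, weight=(state, token) -> 1)
```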
Markovify.combine — Method. combine(chain, others): Return a Model which is a combination of all of the models provided. All of the arguments should have the same order. The nodes of all the Models are merged using the merge function.
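A sketch of combining two models of the same order (assuming Markovify is installed; whether others is a vararg or a collection follows the signature above, and a vararg is assumed here):

```julia
using Markovify

# Two order-1 models trained on different corpora.
a = Model([[:x, :y]]; order=1)
b = Model([[:x, :z]]; order=1)

# Both models share the same order, so their nodes can be merged.
both = combine(a, b)
```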
Markovify.walk — Method. walk(model[, init_state]): Return an array of tokens obtained by a random walk through the Markov chain. The walk starts at init_state if supplied, and otherwise at the state [:begin, :begin...] (whose length depends on the order of the supplied model). The walk ends once the special token :end is reached.
See also: walk2.
Markovify.walk2 — Method. walk2(model[, init_state]): Return an array of tokens obtained by a random walk through the Markov chain. When there is only one state following the current one (i.e. there is a 100% chance that it will become the next state), the function shortens the current State so as to relax the matching requirements and obtain more randomness. The State is shortened until a state with at least two possible successors is found (or until the State is only one token long).
The walk starts at init_state if supplied, and otherwise at the state [:begin, :begin...] (whose length depends on the order of the supplied model). The walk ends once the special token :end is reached.
See also: walk.
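A generation sketch covering both walks (assuming Markovify is installed; the training data is arbitrary):

```julia
using Markovify

model = Model([[:the, :cat, :sat], [:the, :dog, :sat]]; order=1)

# Random walk from the default [:begin] state until :end is reached.
tokens = walk(model)

# walk2 behaves the same, but shortens the state whenever only one
# successor exists, trading determinism for more randomness.
tokens2 = walk2(model)
```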
Module Markovify.Tokenizer
The following symbols are exported from the Markovify.Tokenizer module. This module is used to tokenize text into a list of lists of tokens, which is a format better suited for model training.
Tokenizer.cleanup — Method. cleanup(suptokens::Vector{<:Vector{<:AbstractString}}; badchars="»«\n-_()[]{}<>–—$='\"„“\n"): Remove all characters that are in badchars from all tokens in suptokens. (The default badchars include the newline character, written here as the escape \n.)
Tokenizer.letters — Function. letters = cleanup ∘ to_letters ∘ to_sentences: Composite function which splits its input into sentences, then the sentences into letters, and then removes special characters.
Tokenizer.lines — Function. lines = cleanup ∘ to_letters ∘ to_lines: Composite function which splits its input into lines, then the lines into letters, and then removes special characters.
Tokenizer.to_letters — Method. to_letters(tokens::Vector{<:AbstractString}): Split all of the tokens in tokens into individual characters.
Tokenizer.to_lines — Method. to_lines(text::AbstractString): Return an array of lines in text.
Tokenizer.to_sentences — Method. to_sentences(text::AbstractString): Return an array of sentences in text. The text is split on dots; the dots remain in the strings, only the spaces after the dots are stripped.
The function tries to be as smart as possible. For example, the string "Channel No. 5 is a perfume." will be treated as one sentence, although it has two dots.
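For instance (a sketch, assuming the Tokenizer module is imported):

```julia
using Markovify.Tokenizer

text = "Channel No. 5 is a perfume. It is famous."

# Per the description above, this should yield two sentences, with the
# abbreviation dot in "No." not treated as a sentence boundary.
sentences = to_sentences(text)
```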
Tokenizer.tokenize — Method. tokenize(text[, on=letters]): Split text into SupTokens (an array of arrays of tokens). An optional function of general type func(::Any) -> Vector{Vector{Any}} can be provided to perform the tokenization.
For possible combinators which can be composed to obtain func, see: to_lines, to_sentences, to_letters, to_words, cleanup.
Tokenizer.words — Function. words = cleanup ∘ to_words ∘ to_sentences: Composite function which splits its input into sentences, then the sentences into words, and then removes special characters. Please note that dots and commas are not removed.
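Putting the two modules together, a possible end-to-end sketch (assuming Markovify is installed; the input string is arbitrary):

```julia
using Markovify
using Markovify.Tokenizer

text = "The cat sat. The dog sat. The cat ran."

# Tokenize into sentences of words, train a word-level chain, and generate.
suptokens = tokenize(text; on=words)
model = Model(suptokens; order=1)
generated = walk(model)
```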