Public Documentation
Module Markovify
The following is the documentation of the symbols exported from the Markovify module. The module is used to construct a Markov chain from a given list of lists of tokens and to walk through it, generating a random sequence of tokens along the way. Please see the Examples section if you are looking for some usage examples.
Markovify.Model
— Type. The data structure of the Markov chain. It encodes all the different states, and the probabilities of going from one to another, as a dictionary: the keys are the states, and the values are the respective TokenOccurences dictionaries, which record how many times each token was found immediately after the state.
Fields
The TokenOccurences dictionary described above.
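To make the shape of this dictionary concrete, here is a minimal self-contained sketch; it is an illustration only, and the actual key and value types used by Markovify may differ:

```julia
# Hypothetical shape of the Model's transition dictionary (illustration only;
# the real Markovify types may differ). States are tuples of tokens
# (order = 2 here); each value counts the tokens seen right after the state.
nodes = Dict(
    (:begin, :begin) => Dict("the" => 2, "a" => 1),
    (:begin, "the")  => Dict("cat" => 1, "dog" => 1),
    ("the", "cat")   => Dict(:end => 1),
)

# The probability of "cat" following the state (:begin, "the"):
occurrences = nodes[(:begin, "the")]
p = occurrences["cat"] / sum(values(occurrences))
```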
Markovify.Model
— Method. Model(nodes)
Return a model constructed from nodes. Can be used to reconstruct a model object from its nodes, e.g. if the nodes were saved in a JSON file.
Markovify.Model
— Method.Model(suptokens::Vector{<:Vector{T}}; order=2, weight=stdweight)
Return a Model trained on an array of arrays of tokens (suptokens). Optionally, an order of the chain can be supplied; that is the number of tokens in one state. A weight function of general type func(::State{T}, ::Token{T}) -> Int can be supplied to bias the weights based on the state or token value.
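As an illustration of what training amounts to, the following self-contained sketch counts transitions for an order-2 chain. It is not the actual Markovify implementation, and the :begin/:end padding shown here is an assumption based on the walk documentation:

```julia
# Minimal sketch of order-2 transition counting (illustration only,
# not the actual Markovify implementation).
function count_transitions(suptokens; order=2)
    nodes = Dict{Tuple, Dict{Any, Int}}()
    for tokens in suptokens
        # Pad with :begin states at the front and a single :end at the back.
        padded = vcat(fill(:begin, order), tokens, [:end])
        for i in 1:(length(padded) - order)
            state = Tuple(padded[i:i+order-1])
            nexttoken = padded[i+order]
            counts = get!(nodes, state, Dict{Any, Int}())
            counts[nexttoken] = get(counts, nexttoken, 0) + 1
        end
    end
    return nodes
end

nodes = count_transitions([["the", "cat"], ["the", "dog"]])
```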
Markovify.combine
— Method. combine(chain, others)
Return a Model which is a combination of all of the models provided. All of the arguments should have the same order. The nodes of all the models are merged using the function merge.
Markovify.walk
— Method. walk(model[, init_state])
Return an array of tokens obtained by a random walk through the Markov chain. The walk starts at state init_state if supplied, and at state [:begin, :begin...] (the length depends on the order of the supplied model) otherwise. The walk ends once the special token :end is reached.
See also: walk2.
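The weighted random choice behind such a walk can be sketched as follows. This is a toy illustration over a plain dictionary, not the real Model-based implementation:

```julia
# Toy weighted random walk (illustration only; Markovify's walk operates
# on a Model rather than a raw Dict).
function toy_walk(nodes; order=2)
    state = Tuple(fill(:begin, order))
    out = Any[]
    while true
        counts = nodes[state]
        tokens = collect(keys(counts))
        weights = collect(values(counts))
        # Pick the next token with probability proportional to its count.
        r = rand() * sum(weights)
        idx = findfirst(>=(r), cumsum(weights))
        token = tokens[idx]
        token == :end && return out
        push!(out, token)
        # Slide the window: drop the oldest token, append the new one.
        state = (state[2:end]..., token)
    end
end

chain = Dict(
    (:begin, :begin)   => Dict("hello" => 1),
    (:begin, "hello")  => Dict("world" => 1),
    ("hello", "world") => Dict(:end => 1),
)
```

On this deterministic toy chain the walk always produces ["hello", "world"]; with real data the counts make some continuations more likely than others.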
Markovify.walk2
— Method. walk2(model[, init_state])
Return an array of tokens obtained by a random walk through the Markov chain. When there is only one state following the current one (i.e. there is a 100% chance that it will become the next state), the function shortens the current State to lower the requirements and obtain more randomness. The State gets shortened until a state with at least two possible successors is found (or until the State is only one token long).
The walk starts at state init_state if supplied, and at state [:begin, :begin...] (the length depends on the order of the supplied model) otherwise. The walk ends once the special token :end is reached.
See also: walk.
Module Markovify.Tokenizer
The following symbols are exported from the Markovify.Tokenizer module. This module is used to tokenize text into a list of lists of tokens, which is a format better suited for model training.
Tokenizer.cleanup
— Method. cleanup(suptokens::Vector{<:Vector{<:AbstractString}}; badchars="»«\n-_()[]{}<>–—$='"„“")
Remove all characters that are in badchars from all tokens in suptokens.
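A simplified version of this operation might look like the following sketch, which uses a shortened badchars default; it is not the real implementation:

```julia
# Sketch of removing unwanted characters from every token
# (illustration only; the real default badchars set is longer).
strip_badchars(token; badchars="»«()") = filter(c -> !(c in badchars), token)

cleanup_sketch(suptokens; badchars="»«()") =
    [[strip_badchars(t; badchars=badchars) for t in tokens] for tokens in suptokens]

cleaned = cleanup_sketch([["«hello»", "(world)"]])
```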
Tokenizer.letters
— Function. letters = cleanup ∘ to_letters ∘ to_sentences
Composite function which splits its input into sentences, then the sentences into letters, and then removes special characters.
Tokenizer.lines
— Function. lines = cleanup ∘ to_letters ∘ to_lines
Composite function which splits its input into lines, then the lines into letters, and then removes special characters.
Tokenizer.to_letters
— Method. to_letters(tokens::Vector{<:AbstractString})
Split all of the tokens in tokens into individual characters.
Tokenizer.to_lines
— Method. to_lines(text::AbstractString)
Return an array of lines in text.
Tokenizer.to_sentences
— Method. to_sentences(text::AbstractString)
Return an array of sentences in text. The text is split along dots; the dots remain in the strings, only the spaces after the dots are stripped.
The function tries to be as smart as possible. For example, the string "Channel No. 5 is a perfume." will be treated as one sentence, although it contains two dots.
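A naive version of dot-based splitting can be sketched as follows; note that this toy version does not handle abbreviations like "No.", which the real function is described as handling:

```julia
# Naive sentence splitter: cut after every dot, keep the dot, and strip
# the spaces that follow it (illustration only; the real to_sentences
# also copes with abbreviations such as "No.").
function to_sentences_sketch(text::AbstractString)
    sentences = String[]
    buf = IOBuffer()
    for c in text
        print(buf, c)
        c == '.' && push!(sentences, String(take!(buf)))
    end
    rest = String(take!(buf))
    isempty(strip(rest)) || push!(sentences, rest)
    return [String(lstrip(s)) for s in sentences]
end

to_sentences_sketch("One sentence. Another one.")
```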
Tokenizer.tokenize
— Method. tokenize(text[, on=letters])
Split text into SupTokens (an array of arrays of tokens). An optional function of general type func(::Any) -> Vector{Vector{Any}} can be provided to be used for the tokenization.
For possible combinators which can be composed to obtain func, see: to_lines, to_sentences, to_letters, to_words, cleanup.
Tokenizer.words
— Function. words = cleanup ∘ to_words ∘ to_sentences
Composite function which splits its input into sentences, then the sentences into words, and then removes special characters. Please note that dots and commas are not removed.
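To show how such composed pipelines fit together, here is a self-contained sketch with toy stand-ins for the three stages; the real to_sentences, to_words, and cleanup are more thorough:

```julia
# Toy stand-ins for the three pipeline stages (illustration only).
to_sentences_s(text) =
    [String(strip(s)) * "." for s in split(text, '.') if !isempty(strip(s))]
to_words_s(sentences) = [split(s) for s in sentences]
cleanup_s(suptokens; badchars="»«()") =
    [[filter(c -> !(c in badchars), String(w)) for w in ws] for ws in suptokens]

# Composition mirrors the documented style: sentences -> words -> cleanup.
words_sketch = cleanup_s ∘ to_words_s ∘ to_sentences_s

result = words_sketch("Hello (world). Bye.")
```

Note that, as documented, the dots survive the pipeline while the parenthesis characters are stripped.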