
Public Documentation

Module Markovify

The following is the documentation of the symbols exported from the Markovify module. The module is used to construct a Markov chain from a given list of lists of tokens and to walk through it, generating a random sequence of tokens along the way. Please see Examples if you are looking for usage examples.

Markovify.Model (Type)

The data structure of the Markov chain. It encodes all the different states and the probabilities of moving from one to another as a dictionary: the keys are the states, the values are the respective TokenOccurences dictionaries, which record how many times each token was found immediately after the given state.

Fields

  • order is the number of tokens in a State.
  • nodes is a dictionary pairing each State with its respective TokenOccurences dictionary.
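
The concrete representation is internal to the package, but assuming states are tuples of tokens, the nodes of an order-2 model trained on the single sentence "the cat sat" might look roughly like this sketch:

    # Hypothetical layout of `nodes` (order = 2); the exact State and
    # TokenOccurences types are implementation details of Markovify.
    nodes = Dict(
        (:begin, :begin) => Dict("the" => 1),  # initial state
        (:begin, "the")  => Dict("cat" => 1),
        ("the", "cat")   => Dict("sat" => 1),
        ("cat", "sat")   => Dict(:end => 1),   # :end terminates a walk
    )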

Markovify.Model (Method)
Model(nodes)

Return a model constructed from nodes. Can be used to reconstruct a model object from its nodes, e.g. if the nodes were saved in a JSON file.
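
A sketch of such a round trip, assuming the JSON.jl package is used for persistence (the file name is illustrative):

    using JSON        # assumption: JSON.jl handles the (de)serialization
    using Markovify

    model = Model([["a", "b"], ["a", "c"]]; order=1)

    open("model.json", "w") do io       # illustrative file name
        JSON.print(io, model.nodes)     # persist only the nodes
    end

    # NOTE: JSON object keys are strings, so the parsed nodes may need
    # their keys converted back to the State type before reconstruction.
    nodes = JSON.parsefile("model.json")
    restored = Model(nodes)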

Markovify.Model (Method)
Model(suptokens::Vector{<:Vector{T}}; order=2, weight=stdweight)

Return a Model trained on an array of arrays of tokens (suptokens). Optionally, the order of the chain can be supplied; that is the number of tokens in one state. A weight function of the general type func(::State{T}, ::Token{T}) -> Int can be supplied to bias the weights based on the state or token value.
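
For example, a model might be trained with a custom weight function that favors longer tokens (the corpus and the function below are illustrative, not part of the API):

    using Markovify

    # Hypothetical weight function matching func(::State{T}, ::Token{T}) -> Int:
    # bias the chain toward longer string tokens.
    longer(state, token) = token isa AbstractString ? length(token) : 1

    suptokens = [["the", "cat", "sat"], ["the", "dog", "ran"]]
    model = Model(suptokens; order=2, weight=longer)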

Markovify.combine (Method)
combine(chain, others)

Return a Model which is a combination of all of the models provided. All of the arguments should have the same order. The nodes of all the Models are merged using the function merge.
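
A sketch of combining two models trained on different corpora (assuming others is a collection of further models; both models share order=2):

    using Markovify

    model_a = Model([["the", "cat", "sat"]]; order=2)
    model_b = Model([["a", "dog", "barked"]]; order=2)
    merged  = combine(model_a, [model_b])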

Markovify.walk (Method)
walk(model[, init_state])

Return an array of tokens obtained by a random walk through the Markov chain. The walk starts at state init_state if supplied, and at state [:begin, :begin...] (the length depends on the order of the supplied model) otherwise. The walk ends once a special token :end is reached.

See also: walk2.
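
A minimal sketch of a walk (the corpus is illustrative; the result of the walk is random):

    using Markovify

    suptokens = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    model = Model(suptokens; order=1)
    tokens = walk(model)          # e.g. ["the", "dog", "sat"]
    println(join(tokens, " "))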

Markovify.walk2 (Method)
walk2(model[, init_state])

Return an array of tokens obtained by a random walk through the Markov chain. When there is only one state following the current one (i.e. there is a 100% chance it will become the next one), the function shortens the current State so as to loosen the matching requirements and obtain more randomness. The State is shortened until a state with at least two possible successors is found (or until the State is only one token long).

The walk starts at state init_state if supplied, and at state [:begin, :begin...] (the length depends on the order of the supplied model) otherwise. The walk ends once a special token :end is reached.

See also: walk.
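
The call shape is the same as for walk; a sketch (the corpus is illustrative):

    using Markovify

    suptokens = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    model = Model(suptokens; order=2)
    tokens = walk2(model)   # falls back to shorter states whenever only
                            # one successor exists, for more randomness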

Module Markovify.Tokenizer

The following symbols are exported from the Markovify.Tokenizer module. This module is used to tokenize text into a list of lists of tokens, which is a format better suited for model training.

Tokenizer.cleanup (Method)
cleanup(suptokens::Vector{<:Vector{<:AbstractString}}; badchars="»«\n-_()[]{}<>–—$='\"„“\t")

Remove all characters that are in badchars from all tokens in suptokens.
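
For instance, stripping a custom character set (the expected output in the comment follows from the description above):

    using Markovify.Tokenizer

    suptokens = [["«hello»", "world!"], ["(foo)"]]
    cleaned = cleanup(suptokens; badchars="«»()")
    # expected: [["hello", "world!"], ["foo"]]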

Tokenizer.letters (Function)
letters = cleanup ∘ to_letters ∘ to_sentences

Composite function which splits its input into sentences, then the sentences into letters, and then removes special characters.
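
For instance (the expected shape assumes single-character tokens are returned as strings):

    using Markovify.Tokenizer

    letters("Hi. Go.")
    # expected: [["H", "i", "."], ["G", "o", "."]]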

Tokenizer.lines (Function)
lines = cleanup ∘ to_letters ∘ to_lines

Composite function which splits its input into lines, then the lines into letters, and then removes special characters.

Tokenizer.to_letters (Method)
to_letters(tokens::Vector{<:AbstractString})

Split all of the tokens in tokens into individual characters.

Tokenizer.to_lines (Method)
to_lines(text::AbstractString)

Return an array of lines in text.

Tokenizer.to_sentences (Method)
to_sentences(text::AbstractString)

Return an array of sentences in text. The text is split along dots; the dots remain in the strings, only the spaces after the dots are stripped.

The function tries to be as smart as possible. For example, the string "Chanel No. 5 is a perfume." will be treated as one sentence, although it contains two dots.
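
A sketch of the expected behaviour:

    using Markovify.Tokenizer

    to_sentences("Chanel No. 5 is a perfume. It is famous.")
    # expected: ["Chanel No. 5 is a perfume.", "It is famous."]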

Tokenizer.tokenize (Method)
tokenize(text[, on=letters])

Split text into SupTokens (an array of arrays of tokens). An optional function of the general type func(::Any) -> Vector{Vector{Any}} can be provided to perform the tokenization.

For possible combinators which can be composed to obtain func, see: to_lines, to_sentences, to_letters, to_words, cleanup.
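
An end-to-end sketch tying the two modules together (the text is illustrative, and on is assumed to be a keyword argument as shown in the signature above):

    using Markovify, Markovify.Tokenizer

    text = "The cat sat. The dog sat."
    suptokens = tokenize(text; on=words)   # sentences, then words
    model = Model(suptokens; order=1)
    println(join(walk(model), " "))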

Tokenizer.words (Function)
words = cleanup ∘ to_words ∘ to_sentences

Composite function which splits its input into sentences, then the sentences into words, and then removes special characters. Please note that dots and commas are not removed.
