Public Documentation
Module Markovify
The following is the documentation of the symbols exported from the Markovify module. The module is used to construct a Markov chain from a given list of lists of tokens and to walk through it, generating a random sequence of tokens along the way. Please see the Examples section if you are looking for some usage examples.
Markovify.Model
— Type. The data structure of the Markov chain. It encodes all the different states, and the probabilities of going from one to another, as a dictionary: the keys are the states, and the values are the respective TokenOccurences dictionaries, which record how many times each token was found immediately after the state.
Fields
The TokenOccurences dictionary described above.
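To make the shape of this dictionary concrete, here is a minimal self-contained sketch; it is an illustration only, and the actual key and value types used by Markovify may differ:

```julia
# Hypothetical shape of the Model's transition dictionary (illustration only;
# the real Markovify types may differ). States are tuples of tokens
# (order = 2 here); each value counts the tokens seen right after the state.
nodes = Dict(
    (:begin, :begin) => Dict("the" => 2, "a" => 1),
    (:begin, "the")  => Dict("cat" => 1, "dog" => 1),
    ("the", "cat")   => Dict(:end => 1),
)

# The probability of "cat" following the state (:begin, "the"):
occurrences = nodes[(:begin, "the")]
p = occurrences["cat"] / sum(values(occurrences))
```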
Markovify.Model
— Method. Model(nodes)
Return a model constructed from nodes. Can be used to reconstruct a model object from its nodes, e.g. if the nodes were saved in a JSON file.
Markovify.Model
— Method.Model(suptokens::Vector{<:Vector{T}}; order=2, weight=stdweight)
Return a Model trained on an array of arrays of tokens (suptokens). Optionally, an order of the chain can be supplied; that is the number of tokens in one state. A weight function of general type func(::State{T}, ::Token{T}) -> Int can be supplied to bias the weights based on the state or token value.
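As an illustration of what training amounts to, the following self-contained sketch counts transitions for an order-2 chain. It is not the actual Markovify implementation, and the :begin/:end padding shown here is an assumption based on the walk documentation:

```julia
# Minimal sketch of order-2 transition counting (illustration only,
# not the actual Markovify implementation).
function count_transitions(suptokens; order=2)
    nodes = Dict{Tuple, Dict{Any, Int}}()
    for tokens in suptokens
        # Pad with :begin states at the front and a single :end at the back.
        padded = vcat(fill(:begin, order), tokens, [:end])
        for i in 1:(length(padded) - order)
            state = Tuple(padded[i:i+order-1])
            nexttoken = padded[i+order]
            counts = get!(nodes, state, Dict{Any, Int}())
            counts[nexttoken] = get(counts, nexttoken, 0) + 1
        end
    end
    return nodes
end

nodes = count_transitions([["the", "cat"], ["the", "dog"]])
```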
Markovify.combine
— Method. combine(chain, others)
Return a Model which is a combination of all of the models provided. All of the arguments should have the same order. The nodes of all the models are merged using the function merge.
Markovify.walk
— Method. walk(model[, init_state])
Return an array of tokens obtained by a random walk through the Markov chain. The walk starts at state init_state if supplied, and at state [:begin, :begin...] (the length depends on the order of the supplied model) otherwise. The walk ends once the special token :end is reached.
See also: walk2.
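The weighted random choice behind such a walk can be sketched as follows. This is a toy illustration over a plain dictionary, not the real Model-based implementation:

```julia
# Toy weighted random walk (illustration only; Markovify's walk operates
# on a Model rather than a raw Dict).
function toy_walk(nodes; order=2)
    state = Tuple(fill(:begin, order))
    out = Any[]
    while true
        counts = nodes[state]
        tokens = collect(keys(counts))
        weights = collect(values(counts))
        # Pick the next token with probability proportional to its count.
        r = rand() * sum(weights)
        idx = findfirst(>=(r), cumsum(weights))
        token = tokens[idx]
        token == :end && return out
        push!(out, token)
        # Slide the window: drop the oldest token, append the new one.
        state = (state[2:end]..., token)
    end
end

chain = Dict(
    (:begin, :begin)   => Dict("hello" => 1),
    (:begin, "hello")  => Dict("world" => 1),
    ("hello", "world") => Dict(:end => 1),
)
```

On this deterministic toy chain the walk always produces ["hello", "world"]; with real data the counts make some continuations more likely than others.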
Markovify.walk2
— Method. walk2(model[, init_state])
Return an array of tokens obtained by a random walk through the Markov chain. When there is only one state following the current one (i.e. there is a 100% chance that it will become the next state), the function shortens the current State to lower the requirements and obtain more randomness. The State gets shortened until a state with at least two possible successors is found (or until the State is only one token long).
The walk starts at state init_state if supplied, and at state [:begin, :begin...] (the length depends on the order of the supplied model) otherwise. The walk ends once the special token :end is reached.
See also: walk.
Module Markovify.Tokenizer
The following symbols are exported from the Markovify.Tokenizer module. This module is used to tokenize text into a list of lists of tokens, which is a format better suited for model training.
Tokenizer.cleanup
— Method. cleanup(suptokens::Vector{<:Vector{<:AbstractString}}; badchars="»«\n-_()[]{}<>–—$='"„“")
Remove all characters that are in badchars from all tokens in suptokens.
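A simplified version of this operation might look like the following sketch, which uses a shortened badchars default; it is not the real implementation:

```julia
# Sketch of removing unwanted characters from every token
# (illustration only; the real default badchars set is longer).
strip_badchars(token; badchars="»«()") = filter(c -> !(c in badchars), token)

cleanup_sketch(suptokens; badchars="»«()") =
    [[strip_badchars(t; badchars=badchars) for t in tokens] for tokens in suptokens]

cleaned = cleanup_sketch([["«hello»", "(world)"]])
```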
Tokenizer.letters
— Function. letters = cleanup ∘ to_letters ∘ to_sentences
Composite function which splits its input into sentences, then the sentences into letters, and then removes special characters.
Tokenizer.lines
— Function. lines = cleanup ∘ to_letters ∘ to_lines
Composite function which splits its input into lines, then the lines into letters, and then removes special characters.
Tokenizer.to_letters
— Method. to_letters(tokens::Vector{<:AbstractString})
Split all of the tokens in tokens into individual characters.
Tokenizer.to_lines
— Method. to_lines(text::AbstractString)
Return an array of lines in text.
Tokenizer.to_sentences
— Method. to_sentences(text::AbstractString)
Return an array of sentences in text. The text is split along dots; the dots remain in the strings, only the spaces after the dots are stripped.
The function tries to be as smart as possible. For example, the string "Channel No. 5 is a perfume." will be treated as one sentence, although it contains two dots.
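A naive version of dot-based splitting can be sketched as follows; note that this toy version does not handle abbreviations like "No.", which the real function is described as handling:

```julia
# Naive sentence splitter: cut after every dot, keep the dot, and strip
# the spaces that follow it (illustration only; the real to_sentences
# also copes with abbreviations such as "No.").
function to_sentences_sketch(text::AbstractString)
    sentences = String[]
    buf = IOBuffer()
    for c in text
        print(buf, c)
        c == '.' && push!(sentences, String(take!(buf)))
    end
    rest = String(take!(buf))
    isempty(strip(rest)) || push!(sentences, rest)
    return [String(lstrip(s)) for s in sentences]
end

to_sentences_sketch("One sentence. Another one.")
```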
Tokenizer.tokenize
— Method. tokenize(text[, on=letters])
Split text into SupTokens (an array of arrays of tokens). An optional function of general type func(::Any) -> Vector{Vector{Any}} can be provided to be used for the tokenization.
For possible combinators which can be composed to obtain func, see: to_lines, to_sentences, to_letters, to_words, cleanup.
Tokenizer.words
— Function. words = cleanup ∘ to_words ∘ to_sentences
Composite function which splits its input into sentences, then the sentences into words, and then removes special characters. Please note that dots and commas are not removed.
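To show how such composed pipelines fit together, here is a self-contained sketch with toy stand-ins for the three stages; the real to_sentences, to_words, and cleanup are more thorough:

```julia
# Toy stand-ins for the three pipeline stages (illustration only).
to_sentences_s(text) =
    [String(strip(s)) * "." for s in split(text, '.') if !isempty(strip(s))]
to_words_s(sentences) = [split(s) for s in sentences]
cleanup_s(suptokens; badchars="»«()") =
    [[filter(c -> !(c in badchars), String(w)) for w in ws] for ws in suptokens]

# Composition mirrors the documented style: sentences -> words -> cleanup.
words_sketch = cleanup_s ∘ to_words_s ∘ to_sentences_s

result = words_sketch("Hello (world). Bye.")
```

Note that, as documented, the dots survive the pipeline while the parenthesis characters are stripped.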