cefrpy package

Submodules

cefrpy.CEFRAnalyzer module

class cefrpy.CEFRAnalyzer.CEFRAnalyzer(data_processor: ~cefrpy.CEFRDataProcessor.CEFRDataProcessor = <cefrpy.CEFRDataProcessor.CEFRDataProcessor object>)

Bases: object

A class to analyze CEFR (Common European Framework of Reference for Languages) data.

This class provides methods to analyze word part of speech levels and retrieve information about words’ average levels.

Attributes:

_data_processor (CEFRDataProcessor): The CEFR data processor object to use.

get_all_pos_for_word(word: str) list[POSTag]

Retrieves all part-of-speech tags associated with a given word as POSTag enums.

Args:

word (str): The word to retrieve part-of-speech tags for.

Returns:
list[POSTag]: A list of POSTag enums representing the part-of-speech tags associated with the word.

If the word is not found in the data, an empty list is returned.

get_all_pos_for_word_as_str(word: str) list[str]

Retrieves the names of all part-of-speech tags associated with a given word.

Args:

word (str): The word to retrieve part-of-speech tags for.

Returns:
list[str]: A list of strings representing the names of the part-of-speech tags associated with the word.

If the word is not found in the data, an empty list is returned.

get_average_word_level_CEFR(word: str) CEFRLevel | None

Get the average CEFR level of the word.

Args:

word (str): The word to query.

Returns:

Union[CEFRLevel, None]: The average level of the word, or None if not found.

get_average_word_level_float(word: str) float | None

Get the average level of the word as a float.

Args:

word (str): The word to query.

Returns:

Union[float, None]: The average level of the word, or None if not found.

get_max_word_len() int

Get the maximum word length available in the data.

Returns:

int: The maximum word length.

get_pos_level_dict_for_word(word: str, pos_tag_as_string: bool = False, word_level_as_float: bool = False) dict[str | POSTag, float | CEFRLevel]

Retrieves a dictionary mapping part-of-speech tags to their associated CEFR levels for a given word.

Args:

word (str): The word to retrieve part-of-speech tags and their associated levels for.
pos_tag_as_string (bool, optional): If True, part-of-speech tags are returned as strings; if False, as POSTag enums. Defaults to False.
word_level_as_float (bool, optional): If True, CEFR levels are returned as floats; if False, as CEFRLevel enums. Defaults to False.

Returns:
dict[Union[str, POSTag], Union[float, CEFRLevel]]: A dictionary mapping part-of-speech tags to their associated CEFR levels.

If pos_tag_as_string is True, part-of-speech tags are strings; otherwise, they are POSTag enums. If word_level_as_float is True, CEFR levels are floats; otherwise, they are CEFRLevel enums. If the word is not found in the data, an empty dictionary is returned.
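
The lookups documented above can be combined into a small helper. The following is a minimal sketch, not part of cefrpy: describe_word is a hypothetical helper that calls only the methods documented in this section, and StubAnalyzer is a made-up stand-in with invented data so that the snippet runs without the package's database. In real use, an actual CEFRAnalyzer instance would be passed instead.

```python
from typing import Any


def describe_word(analyzer: Any, word: str) -> dict:
    """Collect the per-POS levels and the average level for one word."""
    if not analyzer.is_word_in_database(word):
        return {"word": word, "found": False, "pos_levels": {}, "average": None}
    return {
        "word": word,
        "found": True,
        # String keys and float values are the easiest form to serialize.
        "pos_levels": analyzer.get_pos_level_dict_for_word(
            word, pos_tag_as_string=True, word_level_as_float=True
        ),
        "average": analyzer.get_average_word_level_float(word),
    }


class StubAnalyzer:
    """Hypothetical stand-in with invented data; not the real CEFRAnalyzer."""

    _levels = {"run": {"NN": 2.0, "VB": 1.0}}

    def is_word_in_database(self, word):
        return word in self._levels

    def get_pos_level_dict_for_word(
        self, word, pos_tag_as_string=False, word_level_as_float=False
    ):
        # The stub always returns string tags and float levels.
        return dict(self._levels.get(word, {}))

    def get_average_word_level_float(self, word):
        levels = self._levels.get(word)
        return sum(levels.values()) / len(levels) if levels else None


print(describe_word(StubAnalyzer(), "run"))
```

Checking is_word_in_database first keeps the "not found" case explicit instead of relying on the empty dictionary and None return values.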

static get_pos_tag_id(pos_tag: str | POSTag) int | None

Get the part of speech id.

Args:

pos_tag (Union[str, POSTag]): The part of speech tag.

Returns:

Union[int, None]: The part of speech id, or None if an exception occurs.

get_total_words() int

Get the total count of words in the data.

Returns:

int: Total count of words.

get_word_count_for_length(word_length: int) int

Count the number of words of a specific length in the data.

Args:

word_length (int): Length of the words to count.

Returns:

int: Number of words of the specified length.

get_word_pos_count() int

Get the total count of positions in the data where words start, across all word lengths.

Returns:

int: Total count of positions where words start.

get_word_pos_count_for_length(word_length: int) int

Count the number of positions in the data where words of a specific length start.

Args:

word_length (int): Length of the words to count positions for.

Returns:

int: Number of positions where words of the specified length start.

get_word_pos_level_CEFR(word: str, pos_tag: str | POSTag, avg_level_not_found_pos: bool = False) CEFRLevel | None

Get the CEFR level of a word’s part of speech.

Args:

word (str): The word to query.
pos_tag (Union[str, POSTag]): The part of speech tag.
avg_level_not_found_pos (bool, optional): If True, the word’s average level is returned when the given part of speech is not found for the word. Defaults to False.

Returns:

Union[CEFRLevel, None]: The level of the word’s part of speech, or None if not found.

get_word_pos_level_float(word: str, pos_tag: str | POSTag, avg_level_not_found_pos: bool = False) float | None

Get the level of a word’s part of speech as a float.

Args:

word (str): The word to query.
pos_tag (Union[str, POSTag]): The part of speech tag.
avg_level_not_found_pos (bool, optional): If True, the word’s average level is returned when the given part of speech is not found for the word. Defaults to False.

Returns:

Union[float, None]: The level of the word’s part of speech, or None if not found.

is_word_in_database(word: str) bool

Check if a word is in the DataReader database.

Args:

word (str): The word to check.

Returns:

bool: True if the word is in the database, False otherwise.

is_word_pos_id_database(word: str, pos_tag: str | POSTag) bool

Check if a word with the given part of speech is in the database.

Args:

word (str): The word to check.
pos_tag (Union[str, POSTag]): The part of speech tag.

Returns:

bool: True if the word is in the database with the given part of speech, False otherwise.

yield_word_pos(reverse_order: bool = False, pos_tag_as_string: bool = False, word_length_sort: bool = False)

Yield all words with their associated part-of-speech tags from the database.

Args:

reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.
pos_tag_as_string (bool, optional): If True, yield part-of-speech tags as strings; if False, yield them as POSTag enums. Defaults to False.
word_length_sort (bool, optional): If True, yield data sorted by word length. Defaults to False.

Yields:
tuple: A tuple containing the word and its associated part-of-speech tag.

If pos_tag_as_string is True, the tuple format is (str, str). If pos_tag_as_string is False, the tuple format is (str, POSTag).

yield_word_pos_level(reverse_order: bool = False, pos_tag_as_string: bool = False, word_level_as_float: bool = False, word_length_sort: bool = False)

Yield all words, their part-of-speech tags, and their CEFR levels from the database based on the specified criteria.

Args:

reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.
pos_tag_as_string (bool, optional): If True, yield part-of-speech tags as strings; if False, yield them as POSTag enums. Defaults to False.
word_level_as_float (bool, optional): If True, yield CEFR levels as floats instead of CEFRLevel enums. Defaults to False.
word_length_sort (bool, optional): If True, yield data sorted by word length. Defaults to False.

Yields:
tuple: A tuple containing the word, its part-of-speech tag, and its CEFR level. If pos_tag_as_string is True, the part-of-speech tag is a string; otherwise, it’s a POSTag enum. If word_level_as_float is True, the level is a float; otherwise, it’s a CEFRLevel enum.

yield_word_pos_level_with_length(word_length: int, reverse_order: bool = False, pos_tag_as_string: bool = False, word_level_as_float: bool = False)

Yield words of a specific length, their part-of-speech tags, and their CEFR levels from the database based on the specified criteria.

Args:

word_length (int): The length of the words to yield.
reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.
pos_tag_as_string (bool, optional): If True, yield part-of-speech tags as strings; if False, yield them as POSTag enums. Defaults to False.
word_level_as_float (bool, optional): If True, yield CEFR levels as floats instead of CEFRLevel enums. Defaults to False.

Yields:
tuple: A tuple containing the word, its part-of-speech tag, and its CEFR level. If pos_tag_as_string is True, the part-of-speech tag is a string; otherwise, it’s a POSTag enum. If word_level_as_float is True, the level is a float; otherwise, it’s a CEFRLevel enum.

yield_word_pos_with_length(word_length: int, reverse_order: bool = False, pos_tag_as_string: bool = False)

Yield words of a specific length with their associated part-of-speech tags from the database.

Args:

word_length (int): The length of the words to yield.
reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.
pos_tag_as_string (bool, optional): If True, yield part-of-speech tags as strings; if False, yield them as POSTag enums. Defaults to False.

Yields:
tuple: A tuple containing the word and its associated part-of-speech tag.

If pos_tag_as_string is True, the tuple format is (str, str). If pos_tag_as_string is False, the tuple format is (str, POSTag).

yield_words(reverse_order: bool = False, word_length_sort: bool = False)

Yield all words in the database.

Args:

reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.
word_length_sort (bool, optional): If True, yield words sorted by word length. Defaults to False.

Yields:

str: A word from the database.

yield_words_with_length(word_length: int, reverse_order: bool = False)

Yield words of a specific length from the database.

Args:

word_length (int): The length of the words to yield.
reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.

Yields:

str: A word from the database with the specified length.
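
Because the yield_* methods are generators, entries can be aggregated one at a time without building a full list first. The sketch below counts (word, part-of-speech) entries per level bucket. count_entries_per_level is a hypothetical helper, bucketing by the integer part of the float level is an arbitrary choice made for illustration, and StubAnalyzer is an invented stand-in so that the snippet runs without the package's database.

```python
from collections import Counter
from typing import Any


def count_entries_per_level(analyzer: Any) -> Counter:
    """Count (word, POS) entries per level, bucketed by the integer part."""
    counts = Counter()
    for _word, _pos_tag, level in analyzer.yield_word_pos_level(
        pos_tag_as_string=True, word_level_as_float=True
    ):
        counts[int(level)] += 1
    return counts


class StubAnalyzer:
    """Hypothetical stand-in with invented data; not the real CEFRAnalyzer."""

    def yield_word_pos_level(
        self,
        reverse_order=False,
        pos_tag_as_string=False,
        word_level_as_float=False,
        word_length_sort=False,
    ):
        yield ("cat", "NN", 1.0)
        yield ("run", "VB", 1.0)
        yield ("run", "NN", 2.4)
        yield ("ubiquitous", "JJ", 5.8)


histogram = count_entries_per_level(StubAnalyzer())
print(sorted(histogram.items()))
```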

cefrpy.CEFRDataProcessor module

class cefrpy.CEFRDataProcessor.CEFRDataProcessor(data_reader: ~cefrpy.CEFRDataReader.CEFRDataReader = <cefrpy.CEFRDataReader.CEFRDataReader object>)

Bases: object

A class to process CEFR (Common European Framework of Reference for Languages) data.

Attributes:

_data_reader (CEFRDataReader): An instance of CEFRDataReader to read CEFR data.

static byte_int_level_to_float(level: int) float

Convert a packed level to a float.

Args:

level (int): The packed level, in the range 0 <= level <= 250.

Returns:

float: The level as a float, in the range 1 <= level <= 6.

Raises:

ValueError: If level is not in the range 0 <= level <= 250.
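
The documented ranges (0 to 250 in, 1 to 6 out) are consistent with a simple linear scaling in steps of 0.02. The sketch below assumes exactly that; the linear formula is an assumption inferred from the two ranges and is not confirmed by the source, and byte_int_level_to_float_sketch is a hypothetical name.

```python
def byte_int_level_to_float_sketch(level: int) -> float:
    """Map a packed level in 0..250 onto 1.0..6.0 (assumed linear scaling)."""
    if not 0 <= level <= 250:
        raise ValueError(f"level must be in the range 0..250, got {level}")
    # 250 packed steps span 5 CEFR units, so one unit is 50 steps.
    return 1.0 + level / 50.0


print(byte_int_level_to_float_sketch(0))
print(byte_int_level_to_float_sketch(125))
print(byte_int_level_to_float_sketch(250))
```

Under this assumption, a single byte per level gives a resolution of 0.02 between A1 (1.0) and C2 (6.0).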

get_all_pos_for_word(word: str) list[int]

Retrieves the IDs of all part-of-speech tags associated with a given word.

Args:

word (str): The word to retrieve part-of-speech tags for.

Returns:
list[int]: A list of IDs representing the part-of-speech tags associated with the word.

If the word is not found in the data, an empty list is returned.

get_max_word_len() int

Get the maximum word length available in the data.

Returns:

int: The maximum word length.

get_pos_level_dict_for_word(word: str) dict[int, float]

Retrieves a dictionary mapping part-of-speech tag IDs to their associated CEFR levels for a given word.

Args:

word (str): The word to retrieve part-of-speech tags and their associated levels for.

Returns:
dict[int, float]: A dictionary mapping part-of-speech tag IDs to their associated CEFR levels (as floats).

If the word is not found in the data, an empty dictionary is returned.

get_total_words() int

Get the total count of words in the data.

Returns:

int: Total count of words.

get_word_count_for_length(word_length: int) int

Count the number of words of a specific length in the data.

Args:

word_length (int): Length of the words to count.

Returns:

int: Number of words of the specified length.

get_word_level_for_pos_id(word: str, pos_tag_id: int, avg_level_not_found_pos: bool = False) float | None

Get the level of a word’s part of speech.

Args:

word (str): The word to query.
pos_tag_id (int): The part of speech tag ID.
avg_level_not_found_pos (bool, optional): If True, the word’s average level is returned when the given part of speech is not found for the word. Defaults to False.

Returns:

Union[float, None]: The level of the word’s part of speech, or None if not found.

get_word_pos_count() int

Get the total count of positions in the data where words start, across all word lengths.

Returns:

int: Total count of positions where words start.

get_word_pos_count_for_length(word_length: int) int

Count the number of positions in the data where words of a specific length start.

Args:

word_length (int): Length of the words to count positions for.

Returns:

int: Number of positions where words of the specified length start.

is_word_in_database(word: str) bool

Check if a word is in the database.

Args:

word (str): The word to check.

Returns:

bool: True if the word is in the database, False otherwise.

is_word_len_valid(word_len: int) bool

Check if the word length is valid.

Args:

word_len (int): The length of the word.

Returns:

bool: True if the word length is valid, False otherwise.

is_word_pos_id_database(word: str, pos_tag_id: int) bool

Check if a word with the given part of speech is in the database.

Args:

word (str): The word to check.
pos_tag_id (int): The part of speech tag ID.

Returns:

bool: True if the word is in the database with the given part of speech, False otherwise.

static pack_word(word: str) bytes

Pack a word into bytes.

Args:

word (str): The word to pack.

Returns:

bytes: The packed representation of the word.

yield_word_pos_id(reverse_order: bool = False, word_lenght_sort: bool = False)

Yield words with their part-of-speech tag IDs from the database.

Args:

reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.
word_lenght_sort (bool, optional): If True, yield words sorted by word length. Defaults to False.

Yields:

tuple[str, int]: A tuple containing a word from the database and its associated part-of-speech tag ID.

yield_word_pos_id_with_length(word_length: int, reverse_order: bool = False)

Yield words of a specific length with their associated part-of-speech tag IDs from the database.

Args:

word_length (int): The length of the words to yield.
reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.

Yields:

tuple[str, int]: A tuple containing a word from the database with the specified length and its associated part-of-speech tag ID as an integer.

yield_word_pos_level(reverse_order: bool = False, word_lenght_sort: bool = False)

Yield words with their part-of-speech tag IDs and levels from the database.

Args:

reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.
word_lenght_sort (bool, optional): If True, yield words sorted by word length. Defaults to False.

Yields:
tuple[str, int, float]: A tuple containing a word from the database, its associated part-of-speech tag ID, and its level.

yield_word_pos_level_with_length(word_length: int, reverse_order: bool = False)

Yield words of a specific length with their part-of-speech tag IDs and levels from the database.

Args:

word_length (int): The length of the words to yield.
reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.

Yields:

tuple[str, int, float]: A tuple containing a word from the database with the specified length, its associated part-of-speech tag ID, and its level.

yield_words(reverse_order: bool = False, word_lenght_sort: bool = False)

Yield all words in the database.

Args:

reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.
word_lenght_sort (bool, optional): If True, yield words sorted by word length. Defaults to False.

Yields:

str: A word from the database.

yield_words_with_length(word_length: int, reverse_order: bool = False)

Yield words of a specific length from the database.

Args:

word_length (int): The length of the words to yield.
reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.

Yields:

str: A word from the database with the specified length.

class cefrpy.CEFRDataProcessor.HeapqReverseDataWrapper(data)

Bases: object

Wrapper class to reverse the ordering of data when using heapq.

This class is used to wrap data objects to reverse their ordering when they are stored in a heapq. By default, heapq stores items in ascending order. This wrapper class allows items to be stored in descending order.

Args:

data: The data object to be wrapped.

Attributes:

data: The wrapped data object.

Methods:

__lt__(self, other): Less-than comparison method used to determine the ordering of the wrapped data.
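
The idea behind such a wrapper can be illustrated in a few lines. This is a minimal sketch of the general technique, not the package's implementation: ReverseWrapper is a hypothetical class that inverts the less-than comparison so that Python's heapq, which always pops the smallest item, pops the largest wrapped value first.

```python
import heapq


class ReverseWrapper:
    """Hypothetical wrapper that inverts ordering for use with heapq."""

    def __init__(self, data):
        self.data = data

    def __lt__(self, other):
        # Report "less than" when the wrapped value is actually greater.
        return self.data > other.data


heap = []
for word in ["pear", "apple", "zebra"]:
    heapq.heappush(heap, ReverseWrapper(word))

# Popping now returns the wrapped values in descending order.
descending = [heapq.heappop(heap).data for _ in range(len(heap))]
print(descending)
```

Wrapping works for any comparable data, including strings, where negating the key (the usual trick for numbers) is not possible.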

cefrpy.CEFRDataReader module

class cefrpy.CEFRDataReader.CEFRDataReader(data_path: str | None = None)

Bases: object

A class to read CEFR (Common European Framework of Reference for Languages) data from a database file.

This class provides methods to access the word length positions in the database file, read values from the data array, and retrieve information about words’ part of speech levels.

Attributes:

data_path (str): The path to the binary data file.
_wlp (array.array): An array containing word length positions.
_data_array (bytearray): The data array from the database file.

get_data_array_len() int

Get the length of the data array.

Returns:

int: Length of the data array.

get_data_array_value_at(i: int) int

Get the value at index i in the data array.

Args:

i (int): Index in the array.

Returns:

int: Value at the specified index.

Raises:

IndexError: If the index is out of range.

get_wlp_len() int

Get the length of the word length positions array.

Returns:

int: Length of the word length positions array.

get_wlp_value_at(i: int) int

Get the value at index i in the word length positions array.

Args:

i (int): Index in the array.

Returns:

int: Value at the specified index.

Raises:

IndexError: If the index is out of range.

cefrpy.CEFRDataValidator module

cefrpy.CEFRDataValidator.is_data_valid(wlp_array: array, data: bytearray) bool

Check if the CEFR data is valid.

Args:

wlp_array (array): The Word Length Position (WLP) array.
data (bytearray): The CEFR data.

Returns:

bool: True if the data is valid, False otherwise.

cefrpy.CEFRDataValidator.is_wlp_array_valid(wlp_array: array) bool

Check if the Word Length Position (WLP) array is valid.

Args:

wlp_array (array): The WLP array.

Returns:

bool: True if the WLP array is valid, False otherwise.

cefrpy.CEFRDataValidator.is_wlp_length_valid(wlp_length: int) bool

Check if the length of the Word Length Position (WLP) array is valid.

Args:

wlp_length (int): The length of the WLP array.

Returns:

bool: True if the length is valid, False otherwise.

cefrpy.CEFRDataValidator.validate_data_block(data: bytearray, start_pos: int, block_length: int) bool

Validate a data block within the CEFR data.

Args:

data (bytearray): The CEFR data.
start_pos (int): The starting position of the data block.
block_length (int): The length of the data block.

Returns:

bool: True if the data block is valid, False otherwise.

cefrpy.CEFRLevel module

class cefrpy.CEFRLevel.CEFRLevel(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Represents CEFR (Common European Framework of Reference for Languages) levels.

A1 = 1
A2 = 2
B1 = 3
B2 = 4
C1 = 5
C2 = 6

classmethod from_str(level_str: str)

Creates a CEFRLevel instance from a string representation of the level.

Args:

level_str (str): The string representation of the CEFR level.

Returns:

CEFRLevel: The CEFRLevel instance corresponding to the input string.

Raises:

ValueError: If the provided string is invalid.
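
The contract described above (a level name in, an enum member out, ValueError on bad input) can be sketched with a plain Enum. Level is a hypothetical stand-in that reuses the member values listed above; the strict, case-sensitive name lookup is an assumption, since the source does not say how the real from_str normalizes its input.

```python
from enum import Enum


class Level(Enum):
    """Hypothetical stand-in mirroring the documented CEFRLevel values."""

    A1 = 1
    A2 = 2
    B1 = 3
    B2 = 4
    C1 = 5
    C2 = 6

    @classmethod
    def from_str(cls, level_str: str) -> "Level":
        try:
            # Assumption: the string must match a member name exactly.
            return cls[level_str]
        except KeyError:
            raise ValueError(f"Invalid CEFR level: {level_str!r}") from None


print(Level.from_str("B2"), Level.from_str("B2").value)
```

Because the members carry the integers 1 to 6, two levels can be ordered by comparing their value attributes.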

cefrpy.CEFRSpaCyAnalyzer module

class cefrpy.CEFRSpaCyAnalyzer.CEFRSpaCyAnalyzer(analyzer: ~cefrpy.CEFRAnalyzer.CEFRAnalyzer = <cefrpy.CEFRAnalyzer.CEFRAnalyzer object>, entity_types_to_skip: set[str] | list[str] | None = None, abbreviation_mapping: dict[str, str] | None = None)

Bases: object

Analyze text for CEFR levels, considering provided entity types to skip and abbreviation mapping.

Attributes:

_analyzer (CEFRAnalyzer): The CEFR analyzer instance.
entity_types_to_skip (set[str]): Set of entity types to skip.
abbreviation_mapping (dict[str, str]): Dictionary mapping abbreviations to their full forms.
tokens (list[tuple[str, str, bool, float, int, int]]): List of token tuples containing word, POS tag, skip status, CEFR level, start index, and end index.

analize_doc(doc) list[tuple[str, str, bool, float, int, int]]

Analyze the document for CEFR levels, considering skipped entities and abbreviation mapping.

Args:

doc: The spaCy tokens to analyze.

Returns:

list[tuple[str, str, bool, float, int, int]]: List of token tuples containing word, POS tag, skip status, CEFR level, start index, and end index.
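
The returned token tuples can be post-processed with ordinary Python. The sketch below averages the levels of the tokens that were not skipped. mean_level_of_kept_tokens is a hypothetical helper and the sample tuples are invented for illustration; the source does not say which level value a skipped token carries, so the helper filters on the skip flag only.

```python
from statistics import mean
from typing import Optional

# (word, POS tag, skip status, CEFR level, start index, end index)
Token = tuple[str, str, bool, float, int, int]


def mean_level_of_kept_tokens(tokens: list[Token]) -> Optional[float]:
    """Average the level of all tokens whose skip flag is False."""
    levels = [level for _word, _pos, skipped, level, _start, _end in tokens
              if not skipped]
    return mean(levels) if levels else None


# Invented sample data in the documented tuple layout.
sample = [
    ("The", "DT", False, 1.0, 0, 3),
    ("London", "NNP", True, 0.0, 4, 10),
    ("committee", "NN", False, 3.0, 11, 20),
]
print(mean_level_of_kept_tokens(sample))
```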

cefrpy.POSTag module

class cefrpy.POSTag.POSTag(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enumeration of Part-of-Speech (POS) tags with their corresponding IDs and descriptions.

CC = 0
CD = 1
DT = 2
IN = 3
JJ = 4
JJR = 5
JJS = 6
MD = 7
NN = 8
NNP = 10
NNPS = 11
NNS = 9
PRP = 12
RB = 13
RBR = 14
RBS = 15
RP = 16
TO = 17
UH = 18
VB = 19
VBD = 20
VBG = 21
VBN = 22
VBP = 23
VBZ = 24
WDT = 25
WP = 26
WRB = 27

classmethod from_tag_name(tag_name: str)

Initialize a POS tag using its name.

Args:

tag_name (str): The name of the POS tag.

Returns:

POSTag: The POS tag corresponding to the given name.

Raises:

ValueError: If the provided tag name is invalid.

static get_all_tags() list[str]

Get a list of all part-of-speech tag names.

Returns:

list[str]: A list containing all part-of-speech tag names.

get_description() str

Retrieve the description of a POS tag.

static get_description_by_tag_id(tag_id: int) str

Retrieve the description of a POS tag by its ID.

Args:

tag_id (int): The ID of the POS tag.

Returns:

str: The description corresponding to the given POS tag ID.

Raises:

ValueError: If the provided tag ID is invalid.

static get_description_by_tag_name(tag_name: str) str

Retrieve the description of a POS tag by its name.

Args:

tag_name (str): The name of the POS tag.

Returns:

str: The description corresponding to the given POS tag name.

Raises:

ValueError: If the provided tag name is invalid.

static get_id_by_tag_name(tag_name: str) int

Retrieve the ID of a POS tag by its name.

Args:

tag_name (str): The name of the POS tag.

Returns:

int: The ID corresponding to the given POS tag name.

Raises:

ValueError: If the provided tag name is invalid.

static get_tag_name_by_id(tag_id: int) str

Retrieve the name of a Part-of-Speech (POS) tag by its corresponding ID.

Args:

tag_id (int): The integer ID of the POS tag.

Returns:

str: The name of the POS tag corresponding to the provided ID.

Raises:

ValueError: If the provided tag_id is not within the valid range of tag IDs.
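
The name and ID lookups above behave like the two directions of an Enum lookup. The sketch below shows this with a small subset of the tags listed earlier. Tag, tag_name_by_id, and id_by_tag_name are hypothetical stand-ins, not the package's code; only the tag values and the ValueError contract are taken from this reference.

```python
from enum import Enum


class Tag(Enum):
    """Hypothetical subset of the documented POSTag members."""

    NN = 8
    NNS = 9
    NNP = 10
    VB = 19


def tag_name_by_id(tag_id: int) -> str:
    """Return the tag name for an ID, raising ValueError if unknown."""
    try:
        return Tag(tag_id).name
    except ValueError:
        raise ValueError(f"Invalid POS tag ID: {tag_id}") from None


def id_by_tag_name(tag_name: str) -> int:
    """Return the tag ID for a name, raising ValueError if unknown."""
    try:
        return Tag[tag_name].value
    except KeyError:
        raise ValueError(f"Invalid POS tag name: {tag_name!r}") from None


print(tag_name_by_id(9), id_by_tag_name("VB"))
```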

static get_total_tags() int

Retrieve the total number of POS tags.

Returns:

int: The total number of POS tags.

Module contents

class cefrpy.CEFRAnalyzer(data_processor: ~cefrpy.CEFRDataProcessor.CEFRDataProcessor = <cefrpy.CEFRDataProcessor.CEFRDataProcessor object>)

Bases: object

A class to analyze CEFR (Common European Framework of Reference for Languages) data.

This class provides methods to analyze word part of speech levels and retrieve information about words’ average levels.

Attributes:

_data_processor (CEFRDataProcessor): The CEFR data processor object to use.

class cefrpy.CEFRDataProcessor(data_reader: ~cefrpy.CEFRDataReader.CEFRDataReader = <cefrpy.CEFRDataReader.CEFRDataReader object>)

Bases: object

A class to process CEFR (Common European Framework of Reference for Languages) data.

Attributes:

_data_reader (CEFRDataReader): An instance of CEFRDataReader to read CEFR data.

static byte_int_level_to_float(level: int) float

Convert packed level to float.

Args:

level (int): The packed level, in the range 0 <= level <= 250.

Returns:

float: The level in range 1 <= level <= 6.

Raises:

ValueError: If the level is not in the range 0 <= level <= 250.
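
One way to picture the conversion is a linear rescaling of the packed range onto the float range. The sketch below assumes the mapping is linear, which this reference does not confirm; byte_int_level_to_float_sketch is a hypothetical stand-in, not the library's implementation.

```python
def byte_int_level_to_float_sketch(level: int) -> float:
    """Map a packed level in [0, 250] to a float in [1, 6].

    Assumption: the mapping is linear, so each packed step is worth 0.02.
    """
    if not 0 <= level <= 250:
        raise ValueError(f"level must satisfy 0 <= level <= 250, got {level}")
    return 1.0 + level / 50.0


print(byte_int_level_to_float_sketch(0))
print(byte_int_level_to_float_sketch(125))
print(byte_int_level_to_float_sketch(250))
```

Under this assumption, the endpoints 0 and 250 map to the documented bounds 1 and 6.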

get_all_pos_for_word(word: str) list[int]

Retrieves the IDs of all part-of-speech tags associated with a given word.

Args:

word (str): The word to retrieve part-of-speech tags for.

Returns:
list[int]: A list of IDs representing the part-of-speech tags associated with the word.

If the word is not found in the data, an empty list is returned.

get_max_word_len() int

Get the maximum word length available in the data.

Returns:

int: The maximum word length.

get_pos_level_dict_for_word(word: str) dict[int, float]

Retrieves a dictionary mapping part-of-speech tag IDs to their associated CEFR levels for a given word.

Args:

word (str): The word to retrieve part-of-speech tags and their associated levels for.

Returns:
dict[int, float]: A dictionary mapping part-of-speech tag IDs to their associated CEFR levels (as floats).

If the word is not found in the data, an empty dictionary is returned.
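
A dictionary of this shape is easy to post-process. The sketch below works on a hand-written example value rather than real database output: the levels are invented, TAG_NAMES copies two IDs from the POSTag listing in this reference, and summarize is a hypothetical helper. Taking the arithmetic mean as the "average level" is also an assumption.

```python
from __future__ import annotations

from statistics import mean

# Hypothetical return value for one word: POS tag ID -> level as float.
pos_levels: dict[int, float] = {8: 1.0, 19: 2.5}

# Two IDs copied from the POSTag listing (NN = 8, VB = 19), kept local
# so the sketch runs without the package installed.
TAG_NAMES = {8: "NN", 19: "VB"}


def summarize(levels: dict[int, float]) -> tuple[dict[str, float], float | None]:
    """Return the levels keyed by tag name plus their mean (None if empty)."""
    by_name = {TAG_NAMES[tag_id]: level for tag_id, level in levels.items()}
    average = mean(levels.values()) if levels else None
    return by_name, average


print(summarize(pos_levels))
print(summarize({}))
```

The empty-dictionary branch mirrors the documented behavior for words that are not found.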

get_total_words() int

Get the total count of words in the data.

Returns:

int: Total count of words.

get_word_count_for_length(word_length: int) int

Count the number of words of a specific length in the data.

Args:

word_length (int): Length of the words to count.

Returns:

int: Number of words of the specified length.

get_word_level_for_pos_id(word: str, pos_tag_id: int, avg_level_not_found_pos: bool = False) float | None

Get the level of a word’s part of speech.

Args:

word (str): The word to query.
pos_tag_id (int): The part of speech tag ID.
avg_level_not_found_pos (bool, optional): If True, returns the average level of the part of speech when not found. Defaults to False.

Returns:

Union[float, None]: The level of the word’s part of speech, or None if not found.
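
The fallback flag can be illustrated with a plain dictionary lookup. This sketch assumes that "average level" means the mean of the levels recorded for the word's other parts of speech, which is one reading of the description above; level_for_pos_sketch and the sample data are hypothetical.

```python
from __future__ import annotations

from statistics import mean


def level_for_pos_sketch(
    pos_levels: dict[int, float],
    pos_tag_id: int,
    avg_level_not_found_pos: bool = False,
) -> float | None:
    """Look up a level by POS tag ID, optionally falling back to the mean."""
    if pos_tag_id in pos_levels:
        return pos_levels[pos_tag_id]
    # Fallback (assumed behavior): average the levels that are known.
    if avg_level_not_found_pos and pos_levels:
        return mean(pos_levels.values())
    return None


sample = {8: 1.0, 19: 3.0}  # invented levels for POS tag IDs 8 and 19
print(level_for_pos_sketch(sample, 8))
print(level_for_pos_sketch(sample, 4))
print(level_for_pos_sketch(sample, 4, avg_level_not_found_pos=True))
```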

get_word_pos_count() int

Get the total count of positions in the data where words start, across all word lengths.

Returns:

int: Total count of positions where words start.

get_word_pos_count_for_length(word_length: int) int

Count the number of positions in the data where words of a specific length start.

Args:

word_length (int): Length of the words to count positions for.

Returns:

int: Number of positions where words of the specified length start.

is_word_in_database(word: str) bool

Check if a word is in the database.

Args:

word (str): The word to check.

Returns:

bool: True if the word is in the database, False otherwise.

is_word_len_valid(word_len: int) bool

Check if the word length is valid.

Args:

word_len (int): The length of the word.

Returns:

bool: True if the word length is valid, False otherwise.

is_word_pos_id_database(word: str, pos_tag_id: int) bool

Check if a word with the given part-of-speech tag is in the database.

Args:

word (str): The word to check.
pos_tag_id (int): The part of speech tag ID.

Returns:

bool: True if the word with the given part-of-speech tag is in the database, False otherwise.

static pack_word(word: str) bytes

Pack a word into bytes.

Args:

word (str): The word to pack.

Returns:

bytes: The packed representation of the word.

yield_word_pos_id(reverse_order: bool = False, word_lenght_sort: bool = False)

Yield words with their part-of-speech tag IDs from the database.

Args:

reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.
word_lenght_sort (bool, optional): If True, yield words sorted by word length. Defaults to False.

Yields:

tuple[str, int]: A tuple containing a word from the database and its associated part-of-speech tag ID.

yield_word_pos_id_with_length(word_length: int, reverse_order: bool = False)

Yield words of a specific length with their associated part-of-speech tag IDs from the database.

Args:

word_length (int): The length of the words to yield.
reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.

Yields:

tuple[str, int]: A tuple containing a word from the database with the specified length and its associated part-of-speech tag ID as an integer.

yield_word_pos_level(reverse_order: bool = False, word_lenght_sort: bool = False)

Yield words with their part-of-speech tag IDs and levels from the database.

Args:

reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.
word_lenght_sort (bool, optional): If True, yield words sorted by word length. Defaults to False.

Yields:
tuple[str, int, float]: A tuple containing a word from the database, its associated part-of-speech tag ID, and its level.

yield_word_pos_level_with_length(word_length: int, reverse_order: bool = False)

Yield words of a specific length with their part-of-speech tag IDs and levels from the database.

Args:

word_length (int): The length of the words to yield.
reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.

Yields:

tuple[str, int, float]: A tuple containing a word from the database with the specified length, its associated part-of-speech tag ID, and its level.

yield_words(reverse_order: bool = False, word_lenght_sort: bool = False)

Yield all words in the database.

Args:

reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.
word_lenght_sort (bool, optional): If True, yield words sorted by word length. Defaults to False.

Yields:

str: A word from the database.

yield_words_with_length(word_length: int, reverse_order: bool = False)

Yield words of a specific length from the database.

Args:

word_length (int): The length of the words to yield.
reverse_order (bool, optional): If True, yield words in reverse order. Defaults to False.

Yields:

str: A word from the database with the specified length.

class cefrpy.CEFRDataReader(data_path: str | None = None)

Bases: object

A class to read CEFR (Common European Framework of Reference for Languages) data from a database file.

This class provides methods to access word length positions in the database file, read data array values, and retrieve information about words’ part-of-speech levels.

Attributes:

data_path (str): The path to the binary data file.
_wlp (array.array): An array containing word length positions.
_data_array (bytearray): The data array from the database file.

get_data_array_len() int

Get the length of the data array.

Returns:

int: Length of the data array.

get_data_array_value_at(i: int) int

Get the value at index i in the data array.

Args:

i (int): Index in the array.

Returns:

int: Value at the specified index.

Raises:

IndexError: If the index is out of range.

get_wlp_len() int

Get the length of the word length positions array.

Returns:

int: Length of the word length positions array.

get_wlp_value_at(i: int) int

Get the value at index i in the word length positions array.

Args:

i (int): Index in the array.

Returns:

int: Value at the specified index.

Raises:

IndexError: If the index is out of range.
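
The four accessors follow the same pattern over two containers. The sketch below is a hypothetical miniature, not the library class: TinyReaderSketch holds an array.array and a bytearray (the attribute types listed above) filled with invented values, and it relies on Python's built-in indexing to raise IndexError. The "I" typecode is an assumption.

```python
from array import array


class TinyReaderSketch:
    """Hypothetical in-memory stand-in for a reader with two containers."""

    def __init__(self, wlp: list[int], data: bytes) -> None:
        self._wlp = array("I", wlp)        # word length positions (typecode assumed)
        self._data_array = bytearray(data)  # raw data bytes

    def get_wlp_len(self) -> int:
        return len(self._wlp)

    def get_wlp_value_at(self, i: int) -> int:
        return self._wlp[i]  # built-in indexing raises IndexError when out of range

    def get_data_array_len(self) -> int:
        return len(self._data_array)

    def get_data_array_value_at(self, i: int) -> int:
        return self._data_array[i]  # a bytearray element is an int in 0..255


reader = TinyReaderSketch([0, 3, 10], b"\x01\x02\xff")
print(reader.get_wlp_len(), reader.get_wlp_value_at(1))
print(reader.get_data_array_len(), reader.get_data_array_value_at(2))
```

In this sketch negative indices follow normal Python semantics and count from the end; whether the real reader accepts them is not stated in this reference.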

class cefrpy.CEFRLevel(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Represents CEFR (Common European Framework of Reference for Languages) levels.

A1 = 1
A2 = 2
B1 = 3
B2 = 4
C1 = 5
C2 = 6
classmethod from_str(level_str: str)

Creates a CEFRLevel instance from a string representation of the level.

Parameters:

level_str (str): The string representation of the CEFR level.

Returns:

CEFRLevel: The CEFRLevel instance corresponding to the input string.

Raises:

ValueError: If the provided string is invalid.
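
The enum and its string constructor can be mirrored in a few lines. CEFRLevelSketch below is a local copy of the six documented members so that the example runs without the package; the exact, case-sensitive name lookup in from_str is an assumption about how the real classmethod parses its input.

```python
from enum import Enum


class CEFRLevelSketch(Enum):
    """Local mirror of the documented CEFR members and values."""

    A1 = 1
    A2 = 2
    B1 = 3
    B2 = 4
    C1 = 5
    C2 = 6

    @classmethod
    def from_str(cls, level_str: str) -> "CEFRLevelSketch":
        # Assumed parsing rule: the string must match a member name exactly.
        try:
            return cls[level_str]
        except KeyError:
            raise ValueError(f"Invalid CEFR level: {level_str!r}") from None


level = CEFRLevelSketch.from_str("B2")
print(level, level.value)
```

Because the members carry integer values from 1 to 6, two levels can be compared through their value attribute.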

class cefrpy.CEFRSpaCyAnalyzer(analyzer: ~cefrpy.CEFRAnalyzer.CEFRAnalyzer = <cefrpy.CEFRAnalyzer.CEFRAnalyzer object>, entity_types_to_skip: set[str] | list[str] | None = None, abbreviation_mapping: dict[str, str] | None = None)

Bases: object

Analyze text for CEFR levels, considering provided entity types to skip and abbreviation mapping.

Attributes:

_analyzer (CEFRAnalyzer): The CEFR analyzer instance.
entity_types_to_skip (set[str]): Set of entity types to skip.
abbreviation_mapping (dict[str, str]): Dictionary mapping abbreviations to their full forms.
tokens (list[tuple[str, str, bool, float, int, int]]): List of token tuples containing word, POS tag, skip status, CEFR level, start index, and end index.

analize_doc(doc) list[tuple[str, str, bool, float, int, int]]

Analyze the document for CEFR levels, considering skipped entities and abbreviation mapping.

Args:

doc: The spaCy document (a sequence of tokens) to analyze.

Returns:

list[tuple[str, str, bool, float, int, int]]: List of token tuples containing word, POS tag, skip status, CEFR level, start index, and end index.
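
The returned tuples can be consumed with ordinary unpacking. The sketch below runs on hand-written tuples in the documented field order, not on real analyzer output: the words, levels, and index values are invented, the 0.0 level on the skipped token is a placeholder, and mean_level is a hypothetical helper.

```python
from __future__ import annotations

from statistics import mean

# Invented sample in the documented layout:
# (word, POS tag, skip status, CEFR level, start index, end index)
tokens = [
    ("the", "DT", False, 1.0, 0, 3),
    ("Paris", "NNP", True, 0.0, 4, 9),       # skipped token; level is a placeholder
    ("committee", "NN", False, 3.0, 10, 19),
]


def mean_level(token_tuples: list[tuple[str, str, bool, float, int, int]]) -> float | None:
    """Average the level of every token that is not marked as skipped."""
    levels = [level for _, _, skipped, level, _, _ in token_tuples if not skipped]
    return mean(levels) if levels else None


print(mean_level(tokens))
```

Filtering on the skip status keeps skipped entities from distorting the average.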

class cefrpy.POSTag(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enumeration of Part-of-Speech (POS) tags with their corresponding IDs and descriptions.

CC = 0
CD = 1
DT = 2
IN = 3
JJ = 4
JJR = 5
JJS = 6
MD = 7
NN = 8
NNP = 10
NNPS = 11
NNS = 9
PRP = 12
RB = 13
RBR = 14
RBS = 15
RP = 16
TO = 17
UH = 18
VB = 19
VBD = 20
VBG = 21
VBN = 22
VBP = 23
VBZ = 24
WDT = 25
WP = 26
WRB = 27
classmethod from_tag_name(tag_name: str)

Initialize a POS tag using its name.

Args:

tag_name (str): The name of the POS tag.

Returns:

POSTag: The POS tag corresponding to the given name.

Raises:

ValueError: If the provided tag name is invalid.

static get_all_tags() list[str]

Get a list of all part-of-speech tag names.

Returns:

list[str]: A list containing all part-of-speech tag names.

get_description() str

Retrieve the description of a POS tag.

static get_description_by_tag_id(tag_id: int) str

Retrieve the description of a POS tag by its ID.

Args:

tag_id (int): The ID of the POS tag.

Returns:

str: The description corresponding to the given POS tag ID.

Raises:

ValueError: If the provided tag ID is invalid.

static get_description_by_tag_name(tag_name: str) str

Retrieve the description of a POS tag by its name.

Args:

tag_name (str): The name of the POS tag.

Returns:

str: The description corresponding to the given POS tag name.

Raises:

ValueError: If the provided tag name is invalid.

static get_id_by_tag_name(tag_name: str) int

Retrieve the ID of a POS tag by its name.

Args:

tag_name (str): The name of the POS tag.

Returns:

int: The ID corresponding to the given POS tag name.

Raises:

ValueError: If the provided tag name is invalid.

static get_tag_name_by_id(tag_id: int) str

Retrieve the name of a Part-of-Speech (POS) tag by its corresponding ID.

Args:

tag_id (int): The integer ID of the POS tag.

Returns:

str: The name of the POS tag corresponding to the provided ID.

Raises:

ValueError: If the provided tag_id is not within the valid range of tag IDs.
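
The ID and name lookups map naturally onto standard Enum access. POSTagSketch below is a local mirror holding four of the documented members, so it runs without the package; the method bodies are an assumption about one straightforward way to obtain the documented behavior, including the ValueError on invalid input.

```python
from enum import Enum


class POSTagSketch(Enum):
    """Local mirror of four documented POS tags and their IDs."""

    CC = 0
    NN = 8
    NNS = 9
    VB = 19

    @staticmethod
    def get_tag_name_by_id(tag_id: int) -> str:
        # Enum lookup by value already raises ValueError for unknown IDs.
        return POSTagSketch(tag_id).name

    @staticmethod
    def get_id_by_tag_name(tag_name: str) -> int:
        # Lookup by name raises KeyError, so translate it to ValueError.
        try:
            return POSTagSketch[tag_name].value
        except KeyError:
            raise ValueError(f"Invalid POS tag name: {tag_name!r}") from None


print(POSTagSketch.get_tag_name_by_id(8))
print(POSTagSketch.get_id_by_tag_name("VB"))
```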

static get_total_tags() int

Retrieve the total number of POS tags.

Returns:

int: The total number of POS tags.