---------------------------------------------
Miscellaneous functions for manipulating text
---------------------------------------------
Collection of text functions that don't fit in another category.

guess_encoding(byte_string, disable_chardet=False)

Try to guess the encoding of a byte :class:`str`
:arg byte_string: byte :class:`str` to guess the encoding of
:kwarg disable_chardet: If this is True, we never attempt to use
:mod:`chardet` to guess the encoding. This is useful if you need to
have reproducibility whether :mod:`chardet` is installed or not.
Default: :data:`False`.
:raises TypeError: if :attr:`byte_string` is not a byte :class:`str` type
:returns: string containing a guess at the encoding of
:attr:`byte_string`. This is appropriate to pass as the encoding
argument when encoding and decoding unicode strings.
We start by attempting to decode the byte :class:`str` as :term:`UTF-8`.
If this succeeds we tell the world it's :term:`UTF-8` text. If it doesn't
and :mod:`chardet` is installed on the system and :attr:`disable_chardet`
is False this function will use it to try detecting the encoding of
:attr:`byte_string`. If it is not installed or :mod:`chardet` cannot
determine the encoding with a high enough confidence then we rather
arbitrarily claim that it is ``latin-1``. Since every byte value maps to
a character in ``latin-1``, decoding from ``latin-1`` to :class:`unicode`
will never raise a :exc:`UnicodeError`, although the output might be mangled.
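
The fallback chain described above can be sketched roughly as follows. This is a Python 3 illustration (so the input is ``bytes`` rather than a Python 2 byte ``str``), not the module's own code. The name ``guess_encoding_sketch`` and the ``CONFIDENCE_THRESHOLD`` value are assumptions made for the example. ``chardet`` is treated as optional, as in the description.

```python
try:
    import chardet  # optional third-party dependency
except ImportError:
    chardet = None

# Assumed cut-off for illustration only; the module keeps its own constant.
CONFIDENCE_THRESHOLD = 0.6


def guess_encoding_sketch(byte_string, disable_chardet=False):
    """Return a best-guess encoding name for ``byte_string``."""
    if not isinstance(byte_string, bytes):
        raise TypeError("byte_string must be a byte string")

    # Step 1: if the bytes decode cleanly as UTF-8, report UTF-8.
    try:
        byte_string.decode("utf-8", "strict")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Step 2: ask chardet, but only trust a sufficiently confident answer.
    if chardet is not None and not disable_chardet:
        info = chardet.detect(byte_string)
        if info["encoding"] and info["confidence"] >= CONFIDENCE_THRESHOLD:
            return info["encoding"]

    # Step 3: latin-1 can decode any byte sequence, so it is the last resort.
    return "latin-1"
```

Passing ``disable_chardet=True`` makes the result depend only on the input bytes, which is the reproducibility case the docstring mentions.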

str_eq(str1, str2, encoding='utf-8', errors='replace')

Compare two strings, converting to byte :class:`str` if one is
:class:`unicode`
:arg str1: First string to compare
:arg str2: Second string to compare
:kwarg encoding: If we need to convert one string into a byte :class:`str`
to compare, the encoding to use. Default is :term:`utf-8`.
:kwarg errors: What to do if we encounter errors when encoding the string.
See the :func:`kitchen.text.converters.to_bytes` documentation for
possible values. The default is ``replace``.
This function prevents :exc:`UnicodeError` (python-2.4 or less) and
:exc:`UnicodeWarning` (python 2.5 and higher) when we compare
a :class:`unicode` string to a byte :class:`str`. The errors normally
arise because the conversion is done to :term:`ASCII`. This function
lets you convert to :term:`utf-8` or another encoding instead.

.. note::

    When we need to convert one of the strings from :class:`unicode` in
    order to compare them we convert the :class:`unicode` string into
    a byte :class:`str`. That means that strings can compare differently
    if you use different encodings for each.

Note that ``str1 == str2`` is faster than this function if you can accept
the following limitations:

* Limited to python-2.5+ (otherwise a :exc:`UnicodeDecodeError` may be
  thrown)
* Will generate a :exc:`UnicodeWarning` if non-:term:`ASCII` byte
  :class:`str` is compared to :class:`unicode` string.
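
The conversion rule in the note can be illustrated with a small Python 3 sketch. It is an assumption-laden stand-in, not the module's implementation: the name ``str_eq_sketch`` is hypothetical, and Python 3 ``str``/``bytes`` take the place of Python 2 ``unicode``/byte ``str``.

```python
def str_eq_sketch(str1, str2, encoding="utf-8", errors="replace"):
    """Compare two strings, encoding the text one if the types differ."""
    # When exactly one side is text, encode it so both sides are bytes.
    if isinstance(str1, str) and isinstance(str2, bytes):
        str1 = str1.encode(encoding, errors)
    elif isinstance(str1, bytes) and isinstance(str2, str):
        str2 = str2.encode(encoding, errors)
    return str1 == str2


text = "caf\u00e9"
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# The same text compares differently depending on the encoding chosen.
same_as_utf8 = str_eq_sketch(text, utf8_bytes)
same_as_latin1 = str_eq_sketch(text, latin1_bytes)
same_with_hint = str_eq_sketch(text, latin1_bytes, encoding="latin-1")
```

Here ``same_as_utf8`` and ``same_with_hint`` are true while ``same_as_latin1`` is false, which is the "strings can compare differently" caveat in practice.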

process_control_chars(string, strategy='replace')

Look for and transform :term:`control characters` in a string
:arg string: string to search for and transform :term:`control characters`
within
:kwarg strategy: XML does not allow :term:`ASCII` :term:`control
characters`. When we encounter those we need to know what to do.
Valid options are:
:replace: (default) Replace the :term:`control characters`
with ``"?"``
:ignore: Remove the characters altogether from the output
:strict: Raise a :exc:`~kitchen.text.exceptions.ControlCharError` when
we encounter a control character
:raises TypeError: if :attr:`string` is not a unicode string.
:raises ValueError: if the strategy is not one of replace, ignore, or
strict.
:raises kitchen.text.exceptions.ControlCharError: if the strategy is
``strict`` and a :term:`control character` is present in the
:attr:`string`
:returns: :class:`unicode` string with no :term:`control characters` in
it.
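
A minimal Python 3 sketch of the three strategies, built on ``str.translate``. The set of control codes is an assumption for illustration: it takes the ASCII codes below 0x20 except tab, newline, and carriage return, which XML 1.0 permits. ``ControlCharError`` here is a local stand-in for the exception named above, and ``process_control_chars_sketch`` is a hypothetical name.

```python
# Assumed set: ASCII control codes that XML 1.0 does not allow.
CONTROL_CODES = frozenset(c for c in range(32) if c not in (9, 10, 13))


class ControlCharError(Exception):
    """Local stand-in for kitchen.text.exceptions.ControlCharError."""


def process_control_chars_sketch(string, strategy="replace"):
    if not isinstance(string, str):
        raise TypeError("string must be a text (unicode) string")

    if strategy == "ignore":
        # Mapping a code point to None deletes it during translate().
        table = dict.fromkeys(CONTROL_CODES, None)
    elif strategy == "replace":
        table = dict.fromkeys(CONTROL_CODES, "?")
    elif strategy == "strict":
        if any(ord(ch) in CONTROL_CODES for ch in string):
            raise ControlCharError("ASCII control code present in input")
        return string
    else:
        raise ValueError("strategy must be ignore, replace, or strict")

    return string.translate(table)
```

Building a translation table once and calling ``translate`` keeps the work to a single pass over the string for both ``ignore`` and ``replace``.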

html_entities_unescape(string)

Substitute unicode characters for HTML entities
:arg string: :class:`unicode` string to substitute out html entities
:raises TypeError: if something other than a :class:`unicode` string is
given
:rtype: :class:`unicode` string
:returns: The plain text without html entities
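
One way to approximate this behaviour in Python 3 is sketched below. It reuses the tag-or-entity pattern that appears among the module's constants, but delegates entity lookup to the standard library's ``html.unescape`` instead of the module's own lookup code. That substitution, and the name ``html_entities_unescape_sketch``, are assumptions for the example.

```python
import html
import re

# Matches either an HTML tag or a named/numeric entity such as &amp; or &#65;
ENTITY_RE = re.compile(r"(?s)<[^>]*>|&#?\w+;")


def html_entities_unescape_sketch(string):
    if not isinstance(string, str):
        raise TypeError("string must be a text (unicode) string")

    def fixup(match):
        token = match.group(0)
        if token.startswith("<"):
            # Tags are dropped entirely, leaving only the plain text.
            return ""
        # Entities are replaced by the character they stand for.
        return html.unescape(token)

    return ENTITY_RE.sub(fixup, string)


plain = html_entities_unescape_sketch("<b>Tom &amp; Jerry</b> &#65;&#x42;")
```

With this input ``plain`` is ``"Tom & Jerry AB"``: the tags disappear and the named, decimal, and hexadecimal entities are each decoded.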

byte_string_valid_xml(byte_string, encoding='utf-8')

Check that a byte :class:`str` would be valid in xml
:arg byte_string: Byte :class:`str` to check
:arg encoding: Encoding of the xml file. Default: :term:`UTF-8`
:returns: :data:`True` if the string is valid. :data:`False` if it would
be invalid in the xml file
In some cases you'll have a whole bunch of byte strings and rather than
transforming them to :class:`unicode` and back to byte :class:`str` for
output to xml, you will just want to make sure they work with the xml file
you're constructing. This function will help you do that. Example::

    ARRAY_OF_MOSTLY_UTF8_STRINGS = [...]
    processed_array = []
    for string in ARRAY_OF_MOSTLY_UTF8_STRINGS:
        if byte_string_valid_xml(string, 'utf-8'):
            processed_array.append(string)
        else:
            processed_array.append(guess_bytes_to_xml(string, encoding='utf-8'))
    output_xml(processed_array)
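
The check itself can be sketched in Python 3 as two conditions: the bytes must decode in the given encoding, and the decoded text must contain no disallowed control characters. The control-code set below is an assumption (ASCII codes under 0x20 other than tab, newline, and carriage return), and ``byte_string_valid_xml_sketch`` is a hypothetical name.

```python
# Assumed set: ASCII control codes that XML 1.0 does not allow.
CONTROL_CODES = frozenset(c for c in range(32) if c not in (9, 10, 13))


def byte_string_valid_xml_sketch(byte_string, encoding="utf-8"):
    # Only byte strings are considered; anything else is reported invalid.
    if not isinstance(byte_string, bytes):
        return False

    # Condition 1: the bytes must be decodable in the target encoding.
    try:
        text = byte_string.decode(encoding)
    except UnicodeDecodeError:
        return False

    # Condition 2: no control characters that XML forbids.
    return not any(ord(ch) in CONTROL_CODES for ch in text)
```

Note that the same bytes can pass under one encoding and fail under another, so the ``encoding`` argument should match the encoding declared by the xml file being built.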

byte_string_valid_encoding(byte_string, encoding='utf-8')

Detect if a byte :class:`str` is valid in a specific encoding
:arg byte_string: Byte :class:`str` to test for bytes not valid in this
encoding
:kwarg encoding: encoding to test against. Defaults to :term:`UTF-8`.
:returns: :data:`True` if :attr:`byte_string` contains no byte sequences
that are invalid in the specified encoding. :data:`False` if an invalid
sequence is detected.

.. note::

    This function checks whether the byte :class:`str` is valid in the
    specified encoding. It **does not** detect whether the byte
    :class:`str` actually was encoded in that encoding. If you want that
    sort of functionality, you probably want to use
    :func:`~kitchen.text.misc.guess_encoding` instead.
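
The distinction drawn in the note can be seen in a short Python 3 sketch. The function name ``byte_string_valid_encoding_sketch`` is hypothetical and the body is an illustration of the idea (attempt a decode, report whether it succeeded), not the module's code.

```python
def byte_string_valid_encoding_sketch(byte_string, encoding="utf-8"):
    """Return True if ``byte_string`` decodes cleanly in ``encoding``."""
    try:
        byte_string.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# These bytes were produced by encoding as UTF-8 ...
utf8_bytes = "caf\u00e9".encode("utf-8")

# ... yet they are also *valid* latin-1, because every byte is valid there.
valid_as_utf8 = byte_string_valid_encoding_sketch(utf8_bytes, "utf-8")
valid_as_latin1 = byte_string_valid_encoding_sketch(utf8_bytes, "latin-1")

# Decoding with the wrong (but valid) encoding silently yields mojibake.
mangled = utf8_bytes.decode("latin-1")
```

Both ``valid_as_utf8`` and ``valid_as_latin1`` are true, while ``mangled`` is two wrong characters in place of the accented one. Validity alone says nothing about which encoding was actually used.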