Tokenization Using Spacy library – GeeksforGeeks



    Before moving to the explanation of tokenization, let's first discuss what Spacy is. Spacy is a library that comes under NLP (Natural Language Processing). It is an object-oriented library that is used to handle pre-processing of text and sentences, and to extract information from the text using modules and functions.

    Tokenization is the process of splitting a text or a sentence into segments, which are called tokens. It is the first step of text preprocessing and is used as input for subsequent processes like text classification, lemmatization, etc.

    Process followed to convert text into tokens

    Creating a blank language object gives us a tokenizer and an empty pipeline, to which modules can later be added. The intermediate steps for tokenization are shown below:

    Intermediate steps for tokenization

    Below is the implementation:

    Python

    import spacy

    # Create a blank English language object (tokenizer only, empty pipeline)
    nlp = spacy.blank("en")

    doc = nlp("GeeksforGeeks is a one stop learning destination for geeks.")

    # Print each token produced by the tokenizer
    for token in doc:
        print(token)

    Output:

    GeeksforGeeks
    is
    a
    one
    stop
    learning
    destination
    for
    geeks
    .
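
    Since the blank language object starts with an empty pipeline, components can be added to it one at a time. Below is a minimal sketch (not from the original article) that adds spaCy's built-in rule-based sentencizer to the blank pipeline; the second sentence in the text is our own, added only for illustration.

    Python

    import spacy

    # A blank English object has a tokenizer but no pipeline components
    nlp = spacy.blank("en")
    print(nlp.pipe_names)   # []

    # Add spaCy's built-in rule-based sentence segmenter to the pipeline
    nlp.add_pipe("sentencizer")
    print(nlp.pipe_names)   # ['sentencizer']

    doc = nlp("GeeksforGeeks is a one stop learning destination for geeks. Practice daily.")
    for sent in doc.sents:
        print(sent.text)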

    We can also add functionality to tokens by adding other modules to the pipeline using spacy.load().

    Python3

    import spacy

    # Load a trained English pipeline with pre-built components
    nlp = spacy.load("en_core_web_sm")

    # Names of the components in the loaded pipeline
    print(nlp.pipe_names)

    Output:

    ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
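
    If only some of these components are needed, spacy.load() also accepts a disable argument that leaves the listed components out of the loaded pipeline. A minimal sketch, using component names taken from the list above:

    Python3

    import spacy

    # Load the same trained pipeline, but skip the parser and NER components
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

    print(nlp.pipe_names)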

    Here is an example to show what other functionalities can be gained by adding modules to the pipeline.

    Python

    import spacy

    # Load the small English pipeline (POS tagger, lemmatizer, etc.)
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("If you want to be an excellent programmer, be consistent to practice daily on GFG.")

    # Print each token with its part of speech and its lemma
    for token in doc:
        print(token, " | ",
              spacy.explain(token.pos_),
              " | ", token.lemma_)

    Output:

    If  |  subordinating conjunction  |  if
    you  |  pronoun  |  you
    want  |  verb  |  want
    to  |  particle  |  to
    be  |  auxiliary  |  be
    an  |  determiner  |  an
    excellent  |  adjective  |  excellent
    programmer  |  noun  |  programmer
    ,  |  punctuation  |  ,
    be  |  auxiliary  |  be
    consistent  |  adjective  |  consistent
    to  |  particle  |  to
    practice  |  verb  |  practice
    daily  |  adverb  |  daily
    on  |  adposition  |  on
    GFG  |  proper noun  |  GFG
    .  |  punctuation  |  .

    In the above example, we have used part-of-speech (POS) tagging and lemmatization via the NLP modules, which gave us the POS for every word along with its lemma (lemmatization is a process to reduce every token to its base form). We were not able to access this functionality before; it only became available once we loaded our NLP instance with ("en_core_web_sm").
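
    The pipe_names output above also lists a ner component, so the same loaded pipeline can be used for named entity recognition. Below is a small sketch with an example sentence of our own; which entities are detected depends on the trained model:

    Python

    import spacy

    nlp = spacy.load("en_core_web_sm")

    # The ner component added by en_core_web_sm exposes entities on doc.ents
    doc = nlp("GeeksforGeeks is based in Noida, India.")

    for ent in doc.ents:
        print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))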
