Python, BeautifulSoup: select the only text inside multi-level nested children

<div id="parrent">
<ul>
<li>
<span>
<span>hi, i want you to get me!!!!</span>
</span>
</li>
<li>
<span>me too please!!!!</span>
</li>
</ul>
</div>

Imagine that, in the example above, we want to get all the texts within the li elements. Note that there is only one text per li, but it can be nested inside several levels of tags.

So how do we get it, using BeautifulSoup of course? The .string property doesn't do the job. Here is a summary of how .string behaves:
  •  If a tag has only one child, and that child is a NavigableString, the child is made available as .string (a NavigableString is simply a piece of text inside a tag).
    For example:
    <span>me too please!!!!</span>
    Here, if we have a soup of the span and call .string, it gives us "me too please!!!!" (span.string   => "me too please!!!!").
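  • If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.
    For example:
    <span>
    <span>hi, i want you to get me!!!!</span>
    </span>
    Here, if we have the upper span, then upperSpanSoup.string gives the inner span's .string, and so returns "hi, i want you to get me!!!!". This holds only when the inner span really is the only child, that is, when no whitespace sits between the tags; the line breaks shown here are kept by the parser as extra text children.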
  • Finally, the important point, the None case: if a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.
    <li>
    <span>
    <span>hi, i want you to get me!!!!</span>
    </span>
    </li>
    In such an example, liSoup.string returns None: the line breaks around the nested span count as text nodes too, so the li contains more than one thing.
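The three rules above can be checked with a small sketch. It assumes the bs4 package is installed and uses the built-in "html.parser"; the variable names (compact, spaced, and so on) are only illustrative and do not come from the original post.

```python
from bs4 import BeautifulSoup

# Compact markup: no whitespace between tags, so every tag has exactly one child.
compact = BeautifulSoup(
    "<li><span><span>hi, i want you to get me!!!!</span></span></li>",
    "html.parser",
)

# Same markup with line breaks between the tags, as in the example page.
spaced = BeautifulSoup(
    "<li>\n<span>\n<span>hi, i want you to get me!!!!</span>\n</span>\n</li>",
    "html.parser",
)

# Rule 1: the only child is a NavigableString.
inner_string = compact.span.span.string
# Rule 2: the only child is a tag that itself has a .string.
outer_string = compact.span.string
# Rule 3: the li holds "\n", <span>, "\n", so .string is ambiguous.
spaced_string = spaced.li.string

print(repr(inner_string))
print(repr(outer_string))
print(repr(spaced_string))
print(len(spaced.li.contents))  # number of direct children of the li
```

The last line shows why the third case fails: the line breaks are children of the li in their own right.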
How do we achieve that in such a case (one and only one text, nested within multiple tags)?
==> We use the get_text() method, or its camelCase alias getText(); both do the same thing, and get_text() is the name documented for BS4. I wanted to mention both. Then we strip the result:
liSoup.getText().strip()
That gives the result we wanted: "hi, i want you to get me!!!!"
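Applied to the whole example page, the approach might look like the sketch below. It assumes bs4 is installed and parses with the built-in "html.parser"; the names html, soup, texts and texts_alt are illustrative only. The second variant uses get_text(strip=True), which strips each text fragment before joining; that is equivalent here only because every li holds a single text.

```python
from bs4 import BeautifulSoup

# The example markup from the top of the post (the id is spelled as in the source).
html = """<div id="parrent">
<ul>
<li>
<span>
<span>hi, i want you to get me!!!!</span>
</span>
</li>
<li>
<span>me too please!!!!</span>
</li>
</ul>
</div>"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find(id="parrent")

# Collect all the text under each li, then strip the surrounding whitespace.
texts = [li.get_text().strip() for li in parent.find_all("li")]

# Variant: let get_text strip each text fragment itself.
texts_alt = [li.get_text(strip=True) for li in parent.find_all("li")]

print(texts)  # ['hi, i want you to get me!!!!', 'me too please!!!!']
```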

Happy scraping!
