How To Extract A Bunch Of Text From An HTML File Using Python3

Here goes my first hashnode post!

The problem statement

I wanted to extract some text from a webpage. To be specific, I wanted to get anything that's between <h3> tags and save that output into another file.

The Idea

My idea was to view the page's HTML source, save it as a target.txt file, and then process that file with Python 3.

First Solution

Here's the code that I wrote to test if my idea would work:

sentences = []

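# 'U' = universal newlines; deprecated in Python 3 and rejected by 3.11+
with open('target.txt', 'U') as f:

    newText=f.read()

    while '<h3>' in newText:

        # skip the 4 characters of the opening '<h3>' tag (also when it sits at index 0)
        start_index = newText.index('<h3>') + 4
        # look for the closing tag only after the opening one
        end_index = newText.index('</h3>', start_index)

        sentence = newText[start_index:end_index]

        sentences.append(sentence)

        # '</h3>' is 5 characters long, so the unprocessed text starts at end_index+5
        newText=newText[end_index+5:]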

print(sentences)

I created a list called sentences. Then I opened the file, read its contents, and stored them in the newText variable.

Let's say newText contains the following code:

<h2> some imp header </h2>
<h3> some header </h3>
<p> some paragraph </p>
<h3> some header </h3>

Then the logic was to process the code as long as <h3> existed in newText.

Processing involved finding the start and end index of the string I wanted to extract, extracting that string, and then replacing newText with the unprocessed text (i.e. whatever remained after the end of the last extracted string).

After the first pass, the unprocessed text would therefore be:

<p> some paragraph </p>
<h3> some header </h3>

Each extracted string is appended to the sentences list.
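To make the loop easier to follow, here is a minimal sketch of the same logic wrapped in a function and run on the sample above. The function name extract_h3 is only an illustration, and the sketch assumes every <h3> has a matching </h3> and carries no attributes.

```python
def extract_h3(text):
    """Collect the text between each <h3> ... </h3> pair, in order."""
    found = []
    while '<h3>' in text:
        # start right after the opening tag
        start = text.index('<h3>') + len('<h3>')
        # search for the closing tag only after the opening one
        end = text.index('</h3>', start)
        found.append(text[start:end])
        # keep only the unprocessed remainder
        text = text[end + len('</h3>'):]
    return found


sample = (
    "<h2> some imp header </h2>\n"
    "<h3> some header </h3>\n"
    "<p> some paragraph </p>\n"
    "<h3> some header </h3>\n"
)

print(extract_h3(sample))
# prints [' some header ', ' some header ']
```

Note that the spaces around each header are kept, because the slice takes everything between the two tags.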

Better Solution

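I learned that I don't have to use the U mode anymore: universal newline handling is the default when reading text files in Python 3. The U flag is deprecated, and Python 3.11 and later reject it with an error.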

Also, I noticed the strings in the sentences list were a little messy, with encoding issues like \xe2\x80\x99s where an apostrophe should have been.
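Where those three escapes come from can be seen in a short sketch. It assumes the page was saved as UTF-8 and that the apostrophe is the curly one (U+2019). The Latin-1 decode is just one way to reproduce the symptom, since the actual default encoding depends on the platform.

```python
apostrophe = '\u2019'  # right single quotation mark, as in "What’s"

# In UTF-8 this single character is stored as three bytes.
raw = apostrophe.encode('utf-8')
print(raw)  # prints b'\xe2\x80\x99'

# Decoding with the matching encoding restores the character.
decoded_ok = raw.decode('utf-8')
print(decoded_ok == apostrophe)  # prints True

# Decoding the same bytes as Latin-1 yields three unrelated characters.
decoded_wrong = raw.decode('latin-1')
print(len(decoded_wrong))    # prints 3
print(ascii(decoded_wrong))  # prints '\xe2\x80\x99'
```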

The fix for that is to use the encoding parameter as follows:

sentences = []

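# read the saved page source as UTF-8 text
with open('target.txt', encoding='utf-8') as f:

    newText=f.read()

    while '<h3>' in newText:

        # skip the 4 characters of the opening '<h3>' tag (also when it sits at index 0)
        start_index = newText.index('<h3>') + 4
        # look for the closing tag only after the opening one
        end_index = newText.index('</h3>', start_index)

        sentence = newText[start_index:end_index]

        sentences.append(sentence)

        # '</h3>' is 5 characters long, so the unprocessed text starts at end_index+5
        newText=newText[end_index+5:]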

print(sentences)
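One limitation of searching for the literal string <h3> is that it misses tags with attributes, such as <h3 class="title">, and keeps any nested tags in the result. As an alternative (not what this post uses), here is a rough sketch built on the standard library's html.parser module. The class name H3Collector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class H3Collector(HTMLParser):
    """Collects the text found inside every <h3> element."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.parts = []      # text pieces of the <h3> currently open
        self.headings = []   # finished headings

    def handle_starttag(self, tag, attrs):
        # tag names arrive lowercased; attributes are ignored here
        if tag == 'h3':
            self.in_h3 = True
            self.parts = []

    def handle_data(self, data):
        if self.in_h3:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'h3' and self.in_h3:
            self.headings.append(''.join(self.parts).strip())
            self.in_h3 = False


collector = H3Collector()
collector.feed('<h3 class="title">Up in <em>arms</em></h3><p>text</p><h3>Head over heels</h3>')
collector.close()
print(collector.headings)
# prints ['Up in arms', 'Head over heels']
```

For a one-off job on a page with plain <h3> tags, the string-slicing loop is shorter. The parser is mainly worth it when the markup is less regular.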

Output File

Now that we have a list of sentences, we can move them to an output file as follows:

# the with block closes the file, so everything is flushed to disk
with open('output.txt', 'w', encoding='utf-8') as f:

    for sentence in sentences:
        f.write(sentence + '\n')
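The extracted strings can carry stray whitespace, like ' some header ' in the earlier sample. A small helper could strip each entry and skip empty ones before writing. The name save_lines and the temporary demo path below are assumptions for illustration, not part of the original script.

```python
import os
import tempfile


def save_lines(path, lines):
    """Write each stripped, non-empty entry to path as UTF-8, one per line."""
    cleaned = [line.strip() for line in lines if line.strip()]
    with open(path, 'w', encoding='utf-8') as out:
        for line in cleaned:
            out.write(line + '\n')
    return len(cleaned)


# Demo in a temporary folder so no real output.txt is overwritten
demo_path = os.path.join(tempfile.mkdtemp(), 'output.txt')
count = save_lines(demo_path, [' 1. Stir up a hornets’ nest ', '', '2. Back against the wall'])
print(count)  # prints 2
```

In the script from this post, the call would be save_lines('output.txt', sentences).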

The Result

I wanted to extract all the idioms from this website to a text file.

Here's the end result:

What’s an idiom? How is it different from a proverb?
1. Stir up a hornets’ nest
2. Back against the wall
3. Bite off more than you can chew
4. Head over heels
5. Upset someone’s applecart
6. Spoil someone’s plans
7. Keep someone at arm’s length
8. Up in arms
9. Drive a hard bargain
10. Barking up the wrong tree
11. Scrape the barrel
12. Bend over backwards
13. A chip off the old block
14. Blow your own trumpet
15. Once in a blue moon
16. Burn your boats/ bridges
17. Make no bones about something
18. Break fresh/ new ground
19. In the same breath
20. Take away your breath
21. Sell like hot cakes
22. Burn the candle at both ends
23. Separate the wheat from the chaff
24. Change tune
25. Run around in circles
26. Turn the clock back
27. Against the clock
28. Close the door on someone
29. Burn the midnight oil
...
-- truncated --

That's all for this post. Let me know if you found it valuable. Thanks.
