How To Extract A Bunch Of Text From An HTML File Using Python3
Here goes my first hashnode post!
The problem statement
I wanted to extract some text from a webpage. To be specific, I wanted to get anything that's between <h3> tags and save that output into another file.
My idea was to view the source code of that HTML page, save it as a target.txt file, and then use Python3 to process it further.
Here's the code that I wrote to test if my idea would work:
sentences = []
with open('target.txt', 'U') as f:
    newText = f.read()

while '<h3>' in newText:
    start_index = newText.index('<h3>')
    end_index = newText.index('</h3>')
    # Move past the 4 characters of '<h3>' (also when the tag sits at index 0)
    start_index += 4
    sentence = newText[start_index:end_index]
    sentences.append(sentence)
    # Skip '</h3>' (5 characters) plus the one character that follows it
    newText = newText[end_index+6:]
print(sentences)
I created a list called sentences. Then I opened the file, read its contents, and stored them in the newText variable.
Let's say newText contains the following code:
<h2> some imp header </h2> <h3> some header </h3> <p> some paragraph </p> <h3> some header </h3>
Then the logic was to process the code as long as <h3> existed in newText.
Processing involved finding the start and end index of the string I wanted to extract, extracting that string, and replacing newText with the unprocessed text (i.e. the text that remained after the end of the last extracted string).
Therefore, after the first pass, the unprocessed text would be:
<p> some paragraph </p> <h3> some header </h3>
The extracted string is appended to the sentences list.
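To make the loop easier to follow, here is a minimal trace of the same logic on the sample string, with no file involved. The remainders list is a hypothetical helper added only to show what is left in newText after each pass, and the '<h3>' offset is added unconditionally here.

```python
# Sample HTML from the example above
newText = ('<h2> some imp header </h2> <h3> some header </h3> '
           '<p> some paragraph </p> <h3> some header </h3>')

sentences = []
remainders = []  # hypothetical helper: the unprocessed text after each pass

while '<h3>' in newText:
    start_index = newText.index('<h3>') + 4  # first character after '<h3>'
    end_index = newText.index('</h3>')       # first character of '</h3>'
    sentences.append(newText[start_index:end_index])
    newText = newText[end_index + 6:]        # skip '</h3>' and the character after it
    remainders.append(newText)

print(sentences)   # [' some header ', ' some header ']
print(remainders)  # ['<p> some paragraph </p> <h3> some header </h3>', '']
```

Note that the extracted strings keep their surrounding spaces; calling strip() on each one would remove them.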
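I learned that I don't have to use the U mode anymore, as universal newlines are the default in Python3. In fact, the U mode was removed in Python 3.11, so passing it there raises a ValueError.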
Also, I noticed the strings in the sentences list were a little messy, with encoding issues.
The fix for that is to use the encoding parameter as follows:
sentences = []
# Read the file as UTF-8 instead of relying on the platform default
with open('target.txt', encoding='utf-8') as f:
    newText = f.read()

while '<h3>' in newText:
    start_index = newText.index('<h3>')
    end_index = newText.index('</h3>')
    start_index += 4
    sentence = newText[start_index:end_index]
    sentences.append(sentence)
    newText = newText[end_index+6:]
print(sentences)
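Without an encoding argument, open() falls back to the platform's preferred encoding, which may not match how the page was saved. The sketch below shows the kind of damage that can cause. It assumes the file was saved as UTF-8 and the default was a Windows-1252-style code page; the post does not say which default was actually in play.

```python
original = 'Stir up a hornets’ nest'  # contains U+2019, a curly apostrophe

utf8_bytes = original.encode('utf-8')  # how the text is stored on disk

# Assumption: the platform default was cp1252 (a common Windows code page)
wrong = utf8_bytes.decode('cp1252')
right = utf8_bytes.decode('utf-8')

print(wrong)  # Stir up a hornetsâ€™ nest
print(right)  # Stir up a hornets’ nest
```

The three bytes that encode the apostrophe in UTF-8 are each read as a separate character under cp1252, which is why one character turns into three.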
Now that we have a list of sentences, we can move them to an output file as follows:
# The with block closes the file, so every line is flushed to disk
with open('output.txt', 'w', encoding='utf-8') as f:
    for sentence in sentences:
        f.write(sentence + '\n')
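One limitation of the index-based loop is that it only matches a bare <h3> and would miss a tag such as <h3 class="title">. As an alternative sketch (not the approach used in this post), the standard library's html.parser module can collect the same text. The class name H3Collector and the class attribute in the sample are made up for illustration.

```python
from html.parser import HTMLParser


class H3Collector(HTMLParser):  # hypothetical helper name
    """Collect the text found inside every <h3> element."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.sentences = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self.in_h3 = True
            self.sentences.append('')  # start a new entry for this heading

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_h3 = False

    def handle_data(self, data):
        if self.in_h3:
            self.sentences[-1] += data  # text may arrive in several pieces


# Sample based on the example above, with an assumed attribute on the first <h3>
sample = ('<h2> some imp header </h2> <h3 class="title"> some header </h3> '
          '<p> some paragraph </p> <h3> some header </h3>')

collector = H3Collector()
collector.feed(sample)
collector.close()

sentences = [s.strip() for s in collector.sentences]
print(sentences)  # ['some header', 'some header']
```

The resulting sentences list can be written to output.txt in the same way as before.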
I wanted to extract all the idioms from this website to a text file.
Here's the end result:
What’s an idiom? How is it different from a proverb?
1. Stir up a hornets’ nest
2. Back against the wall
3. Bite off more than you can chew
4. Head over heels
5. Upset someone’s applecart
6. Spoil someone’s plans
7. Keep someone at arm’s length
8. Up in arms
9. Drive a hard bargain
10. Barking up the wrong tree
11. Scrape the barrel
12. Bend over backwards
13. A chip off the old block
14. Blow your own trumpet
15. Once in a blue moon
16. Burn your boats/ bridges
17. Make no bones about something
18. Break fresh/ new ground
19. In the same breath
20. Take away your breath
21. Sell like hot cakes
22. Burn the candle at both ends
23. Separate the wheat from the chaff
24. Change tune
25. Run around in circles
26. Turn the clock back
27. Against the clock
28. Close the door on someone
29. Burn the midnight oil
... -- truncated --
That's all in this post. Let me know if you found this valuable. Thanks.