Applying data science to real-world tasks often requires mining data from the internet. The course Using Python to Access Web Data, offered by the University of Michigan on Coursera and taught by Prof. Charles Severance, is a very helpful resource for web scraping. This blog post attempts to give a brief overview of the tools taught in the course in a way that a beginner in data science can grasp.
Data on the internet comes in many formats, most commonly JSON, XML and plain old HTML. The Python library urllib lets one access data directly from the URL of a website. If the data thus obtained is an HTML file, the relevant information can be extracted using regular expressions or a library designed for parsing, such as BeautifulSoup4. The Python libraries json and xml.etree.ElementTree are used for JSON and XML respectively.
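As a minimal sketch of the urllib and regular-expression approach described above, the snippet below defines one helper that downloads a page and another that pulls links out of HTML. The helper names fetch_text and extract_links, the example.com addresses, and the sample HTML are all illustrative assumptions, not code from the course.

```python
import re
import urllib.request


def fetch_text(url):
    """Download the resource at `url` and return it as a string."""
    with urllib.request.urlopen(url) as response:
        # Use the charset declared in the response headers, else fall back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html):
    """Return every absolute http(s) URL found in a double-quoted href attribute."""
    return re.findall(r'href="(https?://[^"]+)"', html)


# Made-up page content, so the example runs without a network connection.
sample_html = """
<html><body>
  <a href="https://www.example.com/data.json">JSON feed</a>
  <a href="http://www.example.com/data.xml">XML feed</a>
  <a href="/about">About</a>
</body></html>
"""

links = extract_links(sample_html)
print(links)
```

On a live site the two helpers would be chained, for example extract_links(fetch_text("https://www.example.com")). Note that the pattern deliberately skips relative links such as /about; regular expressions work for simple cases like this, but a real parser is usually more robust on messy HTML.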
JSON and XML are the two most common data-interchange formats used to encode data transported over the internet.
XML (Extensible Markup Language)
XML is a markup language, like HTML, designed to store and exchange data in a format that is both human-readable and machine-readable. Like HTML, it has tags and a tree-like structure.
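To make the tags and tree-like structure concrete, here is a small sketch that parses an XML snippet with xml.etree.ElementTree from the standard library and walks its tree. The document contents (the person, name and phone elements) and the describe helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up XML document: tags nest inside one another to form a tree.
xml_text = """
<person>
  <name>Ada</name>
  <phone type="work">555-0100</phone>
</person>
"""

root = ET.fromstring(xml_text)


def describe(element, depth=0):
    """Return one indented line per element, mirroring the tree structure."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


tree_lines = describe(root)
print("\n".join(tree_lines))

# Text between tags and attributes on tags are read separately.
name = root.find("name").text
phone_type = root.find("phone").get("type")
print(name, phone_type)
```

Here person is the root of the tree, and name and phone are its children; the value "work" lives in an attribute rather than in the text of the element.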
The following Python libraries are very useful for web scraping:
BeautifulSoup4 is one of the most widely used Python libraries for web scraping, i.e. searching through websites on the internet to extract only the relevant data. It is used to parse data from HTML or XML files. It takes a file and makes a "soup" out of its content so that the target information can be extracted easily and efficiently from an enormous amount of irrelevant data.
The json library in Python is designed specifically to work with JSON data.
The xml.etree.ElementTree library in Python is designed specifically to work with XML data.
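The two standard-library modules in this list can be sketched side by side. The example below encodes the same made-up records once as JSON and once as XML and reads the names back out of each; the data and variable names are assumptions chosen purely for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same made-up records, encoded once as JSON and once as XML.
json_text = '[{"name": "Ada", "id": 1}, {"name": "Linus", "id": 2}]'
xml_text = """
<users>
  <user id="1"><name>Ada</name></user>
  <user id="2"><name>Linus</name></user>
</users>
"""

# json.loads turns JSON text into ordinary Python lists and dictionaries.
json_users = json.loads(json_text)
json_names = [user["name"] for user in json_users]

# ElementTree returns Element objects; findall selects matching child tags.
root = ET.fromstring(xml_text)
xml_names = [user.find("name").text for user in root.findall("user")]
# XML attribute values are always strings, so convert them explicitly.
xml_ids = [int(user.get("id")) for user in root.findall("user")]

print(json_names)
print(xml_names)
print(xml_ids)
```

One practical difference shows up even in this small case: JSON maps directly onto Python's built-in types, while XML values arrive as strings inside Element objects and need explicit lookups and conversions.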
Work in progress…
Any questions, comments and/or suggestions are welcome!