Yes! Anything that you see on any website can be scraped. From searching for a house to rent to analyzing stocks, web scraping makes things easy. You can extract data as and when you wish and use it for your needs.
In this blog I will walk you through the six steps that I followed to scrape the website of the National Stock Exchange and analyze the data using candlestick charts. Let's get started with a definition.
What is web scraping?
Web scraping or web harvesting is a technique used to extract data, such as weather reports, market pricing or any other data you wish, from different websites across the web.
I find using APIs simpler than web scraping. If you can find an API, well and good; there is no need for web scraping. The important point here is that not all websites provide APIs, and that's where web scraping comes into the picture.
Prerequisites:
A. Prerequisites for web scraping:
Since we will be scraping a website, it is important to have a basic knowledge of HTML, JavaScript and, of course, Python.
B. Choosing the IDE:
While implementing an idea, you might want to install various modules and packages, so choosing a good IDE is actually a very important step.
I used the Anaconda distribution for my implementation. I chose the Spyder IDE because it comes with a lot of packages built in, and installing new packages is also pretty easy.
If you prefer to implement your idea using an online IDE, I would recommend Repl.
C. Installations:
If you are using Anaconda (click here to download), you can use the following commands to set up the mpl_finance module.
First of all, check your pip version using pip --version. If your pip needs to be updated, use python -m pip install --upgrade pip.
To install the mpl_finance module, use the following command.
pip install https://github.com/matplotlib/mpl_finance/archive/master.zip
If you choose to run the code on Repl, then no worries! All the packages get installed at runtime.
Implementation:
Go through the HTML structure of the webpage that you wish to scrape.
Step 1. Understanding the URL:
As we all know, a URL acts as an identifier for a particular resource on the internet. It has various components, such as the scheme name, hostname, port number, path, query string and fragment identifier. To brush up your knowledge on this, you may refer to this link.
The URL for the web page that we will be scraping is
Looking at the URL, we see a parameter called "symbol". The key here is that if we pass the symbol of the security of our choice, we get the corresponding page for that security.
Step 2. Making an HTTP request:
To access the web page, we first need to send an HTTP request. This can be done with the help of a Python library called requests. The library provides a lot of features, like keep-alive and connection pooling, sessions with cookie persistence, streaming downloads, connection timeouts, etc. There are also a number of errors and exceptions that the requests library can handle:
- ConnectionError exception: raised when there is a problem with the network, for example a refused connection or a DNS failure.
- HTTPError exception: raised when the HTTP response is invalid.
- Timeout exception: raised when the request times out.
- TooManyRedirects exception: raised if the request exceeds the preconfigured maximum number of redirections.
More information about the library and its methods can be found here.
The requests library can be imported using import requests. You can use this library to send an HTTP request by passing the URL to it, and then get the contents of the page.
headers = {"Referer": "https://www.nseindia.com", "Host": "www.nseindia.com", "DNT": "1"}
page = requests.get(URL, headers=headers)
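As a rough sketch, here is how the request might be wrapped with the exceptions described above (the URL value is just a placeholder, not the actual NSE quote-page address):

import requests

URL = "https://www.example.com/get-quote?symbol=HDFCBANK"  # placeholder URL
headers = {"Referer": "https://www.nseindia.com", "Host": "www.nseindia.com", "DNT": "1"}

try:
    page = requests.get(URL, headers=headers, timeout=10)  # fail after 10 seconds instead of hanging
    page.raise_for_status()  # raises HTTPError for 4xx/5xx responses
except requests.exceptions.ConnectionError:
    print("Network problem, e.g. refused connection or DNS failure")
except requests.exceptions.HTTPError as err:
    print("Invalid HTTP response:", err)
except requests.exceptions.Timeout:
    print("The request timed out")
except requests.exceptions.TooManyRedirects:
    print("Exceeded the maximum number of redirects")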
If you have this done, then it's time to extract the soup.
Step 3. The role of BeautifulSoup:
BeautifulSoup is a Python library designed for quick turnaround projects like screen scraping. You can import BeautifulSoup by using
from bs4 import BeautifulSoup
When you pass a page through BeautifulSoup, it gives us a BeautifulSoup object, which represents the document as a nested data structure.
cont = page.content
soup = BeautifulSoup(cont, "html.parser")
The structure of the extracted soup looks somewhat like this,
Step 4. Data extraction:
Since we have passed the desired symbol name in the URL, from which the website serves the corresponding page, the soup we obtained in the previous step gives us the required data in a structured format.
Looking at the structure of the soup (a part of it is shown above), we see that we have to find all the table data, or td, in each table row, or tr. After finding it, we can store it in a Python dictionary with the table header as the key. So essentially,
- We first extract the soup.
- We find all the th elements, or table headers, and append their contents to header_array.
- Next, we find all the table rows using soup.findAll("tr") and then the data in each row using .findAll("td").
- We then store the extracted data in JSON format, as sketched below.
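Putting those steps together, a minimal sketch of the extraction might look like this (soup comes from the previous step; the variable names here are just illustrative):

import json

header_array = []
for th in soup.findAll("th"):  # table headers become the dictionary keys
    header_array.append(th.text.strip())

records = []
for tr in soup.findAll("tr"):  # one dictionary per table row
    cells = [td.text.strip() for td in tr.findAll("td")]
    if cells:  # skip rows that contain only headers
        records.append(dict(zip(header_array, cells)))

json_data = json.dumps(records)  # the extracted data in JSON format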
You must be feeling somewhat like this after you extract your data :P
Step 5. Storing the data in CSV:
Storing the data is very important, as it can be used for reference in the future. Although the data can be stored in JSON format, stock data is generally stored in CSV format. So this is how it can be done,
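As one possible way to do it with the standard csv module (assuming the records list and header_array built in the previous sketch):

import csv

with open("stock_data.csv", "w", newline="") as f:  # change the path as you wish
    writer = csv.DictWriter(f, fieldnames=header_array)
    writer.writeheader()
    writer.writerows(records)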
Step 6. Understanding & generating candlestick charts:
What are candlestick charts and why are they important?
A candlestick chart is a kind of financial analysis chart that carries information about the open, high, low and close prices of a security, derivative or currency. Each candlestick carries information pertaining to a particular trading day.
The candlestick_ohlc function:
A candlestick chart can be generated using the candlestick_ohlc function. A basic call looks somewhat like this,
candlestick_ohlc(ax, opens, closes, highs, lows, width=4, colorup='k', colordown='r', alpha=0.75)
For more information regarding candlestick_ohlc, you can refer to this link.
Instead of passing the open, high, low and close prices separately, we can pass a list of tuples containing the opening price, high price, low price and closing price for each day to the candlestick_ohlc function. Along with that, we pass the other parameters required to plot the chart.
Date issues:
The x-axis of our plot will be the date, and we will be passing the date to the candlestick_ohlc function. So it is important to pass this information with the appropriate data type and format.
First, import datetime using from datetime import datetime
You might encounter an error saying that a non-datetime object was passed to a datetime axis. This can be tackled by using strptime.
datetime_object = datetime.strptime("2018-10-09", "%Y-%m-%d")
In addition to converting a string into a datetime object, you might also want to convert it to a format of your choice. That can be done using strftime.
datetime.strptime("1996-10-25", "%Y-%m-%d").strftime("%m/%d/%y")
Therefore, to sum up,
strptime converts a string to a datetime object.
strftime creates a formatted string for a given datetime object according to the specified format. This link contains the details about the various formats that can be used with strftime.
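Tying the pieces together, a minimal sketch of the chart generation could look like this (the records list and the column names "Date", "Open", "High", "Low" and "Close" are assumptions carried over from the earlier sketches; your table may use different labels and date formats):

from datetime import datetime
import matplotlib.pyplot as plt
from matplotlib.dates import date2num, DateFormatter
from mpl_finance import candlestick_ohlc

ohlc = []
for row in records:
    day = date2num(datetime.strptime(row["Date"], "%d-%m-%Y"))  # convert the date string to a number matplotlib understands
    o, h, l, c = (float(row[k].replace(",", "")) for k in ("Open", "High", "Low", "Close"))  # strip thousands separators if present
    ohlc.append((day, o, h, l, c))

fig, ax = plt.subplots()
candlestick_ohlc(ax, ohlc, width=0.6, colorup="k", colordown="r", alpha=0.75)
ax.xaxis.set_major_formatter(DateFormatter("%m/%d/%y"))  # format the dates on the x-axis
fig.autofmt_xdate()
plt.savefig("candlestick.png")  # or plt.show()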
"Thou shalt not overlook the data type and format of the date that thou art passing"
The complete code!
This link takes you to my GitHub page, where you can view the entire code!
If you run this program successfully, voila! You should be seeing something like this for the security of your choice.
Candlestick chart for HDFC BANK
Data abstraction:
In this program, as I said, I have followed a modular approach. I have coded it in such a way that calling the last function backtraces the path by calling the previous functions, one after another. The main reason for adopting this approach is that, although there are about five functions, most of them act as utilities; from an end user's perspective, they never need to be called directly.
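As a rough illustration of that idea (the function names below are hypothetical, not the ones in my code), each function calls the one before it, so the end user only ever calls the last one:

def get_soup(symbol):  # Steps 2-3: request the page and parse it
    ...

def extract_data(symbol):  # Step 4: calls get_soup internally
    soup = get_soup(symbol)
    ...

def plot_chart(symbol):  # Step 6: calls extract_data internally
    records = extract_data(symbol)
    ...

plot_chart("HDFCBANK")  # the only call the end user needs to make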
Abstracting code into functions enhances readability and usability!
Conclusions and Extensions
- This program is capable of generating a candlestick chart for the company that you enter and storing it in .png format. Apart from that, it can store the data in a CSV file. Just edit the paths as you wish and enjoy the results!
- A lot of exciting web scraping tools are available here! Depending on your requirements and availability, you can choose a tool.
Hi, I'm Sruthi. This is my first shot at writing. I'm overjoyed to know that you made it till here! If you enjoyed the post, press 👏 to support.
I would love to hear all of your views and suggestions. Leave them in the comments section below!