Build An NLP Project From Zero To Hero (2): Data Collection
Dec 14, 2021
We continue the series, and in this article we are going to talk about Data Collection. As stated in the previous article, our objective is to train a Named Entity Recognition (NER) model and use it to extract meaningful information from stock tweets, hopefully deriving investment decisions from it. To this end, we need high-quality annotated data ready to be fed to our model, and that is the goal of this article.
I also want to mention that we will explore multiple methods of data collection and then choose one of them for this article. This is because I want every article to contain a roadmap showcasing the overall steps needed to execute each phase of the project.
Let’s dive in!
Data Collection
Roughly speaking, Data Collection is exactly what the name implies: gathering all kinds of information (data) needed for the project. More rigorously, it is the process of collecting reliable and rich data for the project at hand. Certain techniques are needed to measure and analyze the collected data and to make sure it serves our objective well.
We can collect data in various ways:
1. Open Datasets:
Some companies, organizations, and institutions share their data publicly. Anyone can access, use, or share it (a minimal loading example follows a bit further below).
2. Public APIs:
A service or website offers a programmatic way to acquire data from it, subject to certain rules. (For example, an API is usually set up with a rate limit for incoming requests, or it may cost you money.) In most cases you will need an API key, which lets your application or script communicate with the API.
3. Web Scraping:
This is another programmatic way to acquire data from the internet: you program a script to harvest whatever data is accessible on a website. However, this approach can cause legal problems, as many websites do not like to be crawled by every bot in the world. So before you decide to scrape a website, check its robots.txt file to see which routes are allowed (a quick programmatic check is sketched right below). Web scraping can be done manually, but automating the process is significantly more efficient.
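As a quick illustration of that robots.txt check, here is a minimal sketch using Python's standard urllib.robotparser; the URL and route are placeholders, not a real target.

import urllib.robotparser

# Placeholder URL: replace with the site you actually intend to scrape
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check whether a generic bot is allowed to fetch a given route
print(rp.can_fetch("*", "https://example.com/some/route"))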
There also exist other types of data sources, such as surveys, interviews, and libraries. They usually require a certain level of domain expertise, unlike the previous sources, which mostly need programming skills, and they do not scale well.
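Circling back to the first option, open datasets are often as simple to use as downloading a file. Below is a minimal sketch with pandas; the CSV URL is purely hypothetical, and you would point it at whichever public dataset you pick.

import pandas as pd

# Hypothetical open dataset of financial tweets published as a CSV file
url = "https://example.org/open-data/financial_tweets.csv"

df = pd.read_csv(url)

# Quick sanity checks on what was downloaded
print(df.shape)
print(df.head())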
Practically, since we want documents discussing the stock market, we decided to collect financial tweets as our data. With the exponential growth of social media, big data has become one of the hottest means for researchers and experts to analyze stock market tendencies. This has been influencing the financial and economic domains, suggesting that social mood can help shape future investment and business decisions.
Yet, it was not a simple decision.
At first, we wanted to scrape a few websites (we really wanted to demonstrate some web scraping fundamentals), but we were not lucky, as those websites are restrictive. For your reference, this is a non-exhaustive list of tools you can use:
- Requests: An elegant and simple HTTP library for Python, and your first friend to meet when going for web scraping.
- BeautifulSoup: A Python library that parses data out of HTML and XML files. A typical routine is to get the page source with the Requests library and then parse it with BeautifulSoup.
- Scrapy: An entire open source framework for web scraping. You can use it to build crawlers that can ‘navigate’ entire websites efficiently.
- Selenium: An automation tool used primarily for testing web applications. It can also be used for web scraping by simulating the behavior of a human user through a webdriver. This helps greatly with dynamic websites, which rely on JavaScript requests to bring in the data they need; the previous tools will certainly struggle with such sites. However, Selenium does not really scale.
There are certainly other tools, but these are really all you need if you are starting out with web scraping.
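To give a flavor of the first two tools, here is a minimal sketch that fetches a page with Requests and extracts headlines with BeautifulSoup. The URL, the User-Agent string, and the h2/class selector are all assumptions; adapt them to a site that actually allows scraping.

import requests
from bs4 import BeautifulSoup

# Placeholder URL: replace with a page you are allowed to scrape
url = "https://example.com/markets/news"

# Identify your script politely via the User-Agent header
response = requests.get(url, headers={"User-Agent": "my-nlp-project/0.1"})
response.raise_for_status()

# Parse the HTML and extract headline texts (the tag and class are assumptions)
soup = BeautifulSoup(response.text, "html.parser")
headlines = [h.get_text(strip=True) for h in soup.find_all("h2", class_="headline")]

for headline in headlines:
    print(headline)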
We decided to go for open datasets and public APIs.
A Public API Example: PRAW
- First, log in to Reddit and create a new application at this link: https://ssl.reddit.com/prefs/apps/
- Install the praw library in your Python environment.
!pip install praw
- Let your script access the Reddit API:
import praw

reddit_read_only = praw.Reddit(
    client_id="your client id, you will find it near the application name",
    client_secret="you will find it after creating your app",
    user_agent="scraper by u/username, some text to identify your app, mention your username",
)

subreddit = reddit_read_only.subreddit("stocks")

# Display the name of the Subreddit
print("Display Name:", subreddit.display_name)

# Display the title of the Subreddit
print("Title:", subreddit.title)

# Display the description of the Subreddit
print("Description:", subreddit.description)
- You can get the top five hottest posts in the subreddit, or you can filter last month's posts by their link flair text (a tag that labels a post), such as Industry News!
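Here is a rough sketch of both ideas with PRAW; the flair text and the limit are assumptions, so adjust them to your needs.

# Five hottest posts in the subreddit
for post in subreddit.hot(limit=5):
    print(post.title)

# Last month's posts whose link flair is, say, "Industry News"
# (uses Reddit's search syntax for flair filtering)
for post in subreddit.search('flair:"Industry News"', time_filter="month"):
    print(post.title, "|", post.link_flair_text)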