Web Scraping, Two Approaches: Requests with BeautifulSoup, and Requests-HTML

I completed the following tutorials on how to manually scrape websites using two approaches (a minimal code sketch of each approach follows this list):

1) Using the Requests and BeautifulSoup libraries.
• The first tutorial, “How to Scrape Stock Prices from Yahoo Finance with Python” by John Watson Rooney, shows how to scrape financial data from the Yahoo Finance website using the Requests and BeautifulSoup libraries and how to save the JSON data to a file. Video length: 0:20. You can find this tutorial at https://www.youtube.com/watch?v=7sFCOunKL_Y&ab_channel=JohnWatsonRooney
• The second tutorial, “Python Tutorial: Web Scraping with BeautifulSoup and Requests” by Corey Schafer, shows how to scrape data from his website. Video length: 45:47. You can find this tutorial at https://www.youtube.com/watch?v=ng2o98k983k&ab_channel=CoreySchafer
• The third tutorial, “Web Scrape Websites with a LOGIN - Python Basic Auth” by John Watson Rooney, shows how to scrape a website that requires a login. Video length: 0:14. You can find this tutorial at https://www.youtube.com/watch?v=cV21EOf5bbA&ab_channel=JohnWatsonRooney
• The fourth tutorial, “Web Scraping with Python - Beautiful Soup Crash Course” by freeCodeCamp.org, is where I simulated a search for job advertisements. Video length: 1:08. You can find this tutorial at https://www.youtube.com/watch?v=XVv6mJpFOb0&ab_channel=freeCodeCamp.org

2) Using the Requests-HTML library.
• The first tutorial, “Python Tutorial: Web Scraping with Requests-HTML” by Corey Schafer, shows how to scrape data from his website. He demonstrates that the Requests-HTML library offers a number of advantages over the first approach, among other functionality:
  * Async support.
  * JavaScript support, such as grabbing text that is dynamically generated by JavaScript.
Video length: 56:26. You can find this tutorial at https://www.youtube.com/watch?v=a6fIbtFB46g&t=9s&ab_channel=CoreySchafer
• The second tutorial, “Slow Web Scraper? Try this with ASYNC and Requests-html” by John Watson Rooney, is where I learned how to use async/await to speed up requests made to the server(s). Video length: 0:09. You can find this tutorial at https://www.youtube.com/watch?v=8drEB06QjLs&ab_channel=JohnWatsonRooney
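To keep the two approaches straight for myself, here is a minimal sketch of each, assuming a hypothetical page whose article titles sit in h2 tags; the URL and selectors are placeholders, not the ones used in the tutorials above.

```python
# Approach 1: Requests + BeautifulSoup (synchronous).
# Placeholder URL and selector -- adjust to the page being scraped.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/articles")
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
for heading in soup.find_all("h2"):
    print(heading.get_text(strip=True))

# Approach 2: Requests-HTML.
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com/articles")
for heading in r.html.find("h2"):
    print(heading.text)
```

For pages that build their content with JavaScript, Requests-HTML can additionally call r.html.render() before querying, which is the JavaScript support mentioned above (it downloads a headless Chromium the first time it runs).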

Control Flow: async and await - optimize performance

Async/await is a popular way to speed up requests made to a server; it is used on both the client and server side. I completed five mini-tutorials to learn about Control Flow: async and await - optimize performance. A minimal aiohttp sketch follows this list.

o First, I completed “Demystifying Python's Async and Await Keywords”, an intermediate-level tutorial by Michael Kennedy offered by JetBrainsTV on YouTube. Michael Kennedy is the host of Python Bytes and Talk Python to Me. In this tutorial, he introduces the entire spectrum of Python's parallel APIs and then focuses on the most promising, most useful, and most modern of Python's async capabilities: the async and await keywords. The duration of the video is 1:19 hours. This tutorial gave excellent examples of how to use async and await:
• A producer/consumer pattern
• The unsync library
• Using aiohttp with async/await and the BeautifulSoup parser for web scraping
You can find this tutorial at https://www.youtube.com/watch?v=F19R_M4Nay4&ab_channel=JetBrainsTV

o Second, I completed the following three tutorials, where I learned two approaches to asynchronous HTTP requests for web scraping:
1) The first approach uses the requests_html library with async/await. I completed the tutorial “Slow Web Scraper? Try this with ASYNC and Requests-html” by John Watson Rooney on YouTube. The duration of the video is 0:09 hours. You can find this tutorial at https://www.youtube.com/watch?v=8drEB06QjLs&t=11s&ab_channel=JohnWatsonRooney
2) The second approach uses the aiohttp library and asyncio with async/await and the BeautifulSoup parser. I completed two tutorials on this approach.
2.1) “Web Scraping with AIOHTTP and Python” by John Watson Rooney on YouTube. The duration of the video is 0:14 hours. Project: use http://books.toscrape.com to get the title, price, and article range, and time the request. You can find this tutorial at https://www.youtube.com/watch?app=desktop&v=lUwZ9rS0SeM&t=1s&ab_channel=JohnWatsonRooney
2.2) “How To Do Asynchronous Web Scraping In Python” by James Phoenix, a data scientist, on YouTube. The duration of the video is 0:11 hours. Project: scrape data from multiple URLs (such as https//undestandingdata.com and https//twitter.com) and time the requests. You can find this tutorial at https://www.youtube.com/watch?v=vnzFN5FXqRI&t=4s&ab_channel=UnderstandingData

o Third, I completed a tutorial on how to make asynchronous HTTP requests in Python with aiohttp and asyncio when calling an API. The tutorial is “Asynchronous HTTP Requests in Python with aiohttp and asyncio” by Sam Agnew on the Twilio blog. Project: make requests to the Pokemon API to get the data and time the requests. You can find this tutorial at https://www.twilio.com/blog/asynchronous-http-requests-in-python-with-aiohttp
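To remind myself what the aiohttp approach looks like in practice, here is a minimal sketch assuming a few placeholder URLs (not the pages used in the tutorials): several pages are fetched concurrently with aiohttp and asyncio, each is parsed with BeautifulSoup, and the whole run is timed.

```python
import asyncio
import time

import aiohttp
from bs4 import BeautifulSoup

# Placeholder URLs -- not the ones from the tutorials.
URLS = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]


async def fetch_title(session: aiohttp.ClientSession, url: str) -> str:
    """Download one page and return its <title> text."""
    async with session.get(url) as response:
        html = await response.text()
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else "(no title)"


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # gather() schedules all requests concurrently instead of one by one.
        titles = await asyncio.gather(*(fetch_title(session, url) for url in URLS))
    for url, title in zip(URLS, titles):
        print(f"{url}: {title}")


if __name__ == "__main__":
    start = time.perf_counter()
    asyncio.run(main())
    print(f"Done in {time.perf_counter() - start:.2f} s")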

Context Managers and Python's with Statement

In the Deep Dive course Part 2 by Fred Baptiste I learned about context managers in Python. However, I wanted to get a deeper understanding of the subject, so I took an additional tutorial: the intermediate-level “Context Managers and Python's with Statement” by Leodanis Pozo Ramos at Real Python. In this tutorial I learned about (a short sketch of a class-based and a function-based context manager follows this list):
• The try … finally Approach
• The with Statement Approach
• Working With Files
• Traversing Directories
• Performing High-Precision Calculations
• Using the Python with Statement: Handling Locks in Multithreaded Programs
• Using the Python with Statement: Testing for Exceptions With pytest
• Using the async with Statement
• How to Use the Context Manager Protocol
• Creating Custom Context Managers
• Coding Class-Based Context Managers
• Writing a Sample Class-Based Context Manager
• Handling Exceptions in a Context Manager
• Opening Files for Writing: First Version
• Redirecting the Standard Output
• Measuring Execution Time
• Creating Function-Based Context Managers
• Opening Files for Writing: Second Version
• Mocking the Time
• Writing Good APIs With Context Managers
• Creating an Asynchronous Context Manager
You can find this tutorial at https://realpython.com/python-with-statement/
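As a quick self-check on the two ways of writing custom context managers covered in the tutorial, here is a minimal sketch of my own; the Timer/timer names and the timing use case are mine, not the tutorial's.

```python
import time
from contextlib import contextmanager


# Class-based context manager: implements __enter__ and __exit__.
class Timer:
    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self.start
        print(f"Block took {self.elapsed:.4f} s")
        return False  # do not suppress exceptions


# Function-based context manager: a generator wrapped in @contextmanager.
@contextmanager
def timer():
    start = time.perf_counter()
    try:
        yield  # the body of the with block runs here
    finally:
        print(f"Block took {time.perf_counter() - start:.4f} s")


with Timer():
    sum(range(1_000_000))

with timer():
    sum(range(1_000_000))
```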

Certificate of Completion - Python 3: Deep Dive (Part 2 - Iteration, Generators)

I earned a Certificate of Completion that verifies that I successfully completed the intermediate-to-advanced “Python 3: Deep Dive (Part 2 - Iteration, Generators)” course on 26/01/2022, taught by instructor Fred Baptiste at Udemy. Fred Baptiste is a professional developer and mathematician. The certificate indicates the entire course was completed as validated by the student. The course duration represents the total video hours of the course at the time of most recent completion. Length: 34.5 total hours. In this course, I learned about:
• Sequence types and the sequence protocol
• Iterables and the iterable protocol
• Iterators and the iterator protocol
• Sequence slicing and how slicing relates to ranges
• List comprehensions and their relation to closures
• The reason why subtle bugs sometimes creep into list comprehensions
• All the functions in the itertools module
• Generator functions
• Generator expressions
• Context managers, and the relationship between context managers and generator functions. When working with files it is more Pythonic to call pathlib.Path.open()
• Creating context managers using generator functions
• Using generators as coroutines

o Five capstone projects (an illustrative sketch of the lazy-CSV-iterator idea follows this list):
1) First, create a ‘Polygon’ class. The initializer takes the number of edges/vertices for the largest polygon in the sequence and a common circumradius for all polygons. Properties: edges, vertices, interior angle, edge length, apothem, area, perimeter, and return the polygon with the highest area-to-perimeter ratio. Functions: a proper representation (__repr__), equality (==) based on the number of vertices and the circumradius (__eq__), > based on the number of vertices only (__gt__), sequence-type behavior (__getitem__), and support for len() (__len__). Second, create a ‘Polygons’ class, which is a sequence of polygon objects, by implementing the sequence protocol. Identify the polygon that has the highest area-to-perimeter ratio by sorting the polygons. Goal one: refactor the `Polygon` class so that all the calculated properties are lazy (cached) properties, i.e. they should still be calculated properties, but they should not have to be recalculated more than once (since our `Polygon` class is "immutable"). Goal two: refactor the `Polygons` (sequence) type into an **iterable**. Make sure also that the elements in the iterator are computed lazily - i.e. you can no longer use a list as an underlying storage mechanism for your polygons. You'll need to implement both an iterable and an iterator.
2) Use New York parking violation data to calculate the number of violations by car make. Create a lazy iterator to extract the data from a CSV file; this keeps memory overhead to a minimum. Convert the data to the correct types. Split the data and strip leading/trailing spaces. Prioritize the values in each row: if a critical value in the row is empty, toss the row away; if a non-critical value is empty, keep the row. All the parsing functions should be generators unless a row needs to be iterated over more than once. Transform rows into a structured form such as a named tuple.
3) I am given four CSV data files: personal_info.csv, vehicles.csv, employement.csv, and update_status.csv. Each file contains different information on different people, but all files contain the same Social Security Number (SSN) per person, which is a uniquely identifying key. Every SSN appears only once in each file. Goal one: use the csv module to create four independent lazy iterators, one for each of the four files. Return named tuples. Data types should be appropriate (string, date, int, etc.). The four iterators are independent of each other (for now). Goal two: create a single iterable that combines all the data from all four files, re-using the iterators created in goal one. By combining I mean one row per SSN containing the data from all four files in a single named tuple. Make sure that the SSN value is not repeated four times - once per row is enough. Goal three: identify any stale records, where stale simply means the record has not been updated since 3/1/2017 (i.e. last update date < 3/1/2017). Create an iterator that only contains current (i.e. not stale) records based on the `last_updated` field from the `update_status` file. Goal four: find the largest group of car makes for each gender.
4) I am given two data files: cars.csv and personal_info.csv. The basic goal is to create a context manager that only requires the file name and provides us an iterator we can use to iterate over the data in those files. The iterator should yield a named tuple with field names based on the header row in the CSV file. Goal one: create a single class that implements both the context manager protocol and the iterator protocol. Goal two: re-implement what I did in goal one, but use a generator function instead, with @contextmanager from the contextlib module.
5) The goal of this project is to rewrite the pull pipeline we created in the **Application - Pipelines - Pulling** video in the **Generators as Coroutines** section, applying the techniques used in the **Application - Pipelines - Broadcasting** video. The pipeline should take data from the source file, `cars.csv`, and push it through some filters and a save coroutine to ultimately save the results as a CSV file.
You can find this course at https://www.udemy.com/course/python-3-deep-dive-part-2/
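As a rough illustration of the lazy-CSV-iterator idea running through capstone projects 2-4, here is a minimal sketch of my own (the file name, helper name, and field handling are placeholders, not the course's solution): a generator-based context manager built with @contextmanager that opens a CSV file and yields a lazy iterator of named tuples whose field names come from the header row.

```python
import csv
from collections import namedtuple
from contextlib import contextmanager


@contextmanager
def csv_rows(file_name):
    """Context manager yielding a lazy iterator of named tuples from a CSV file."""
    f = open(file_name, newline="")
    try:
        reader = csv.reader(f)
        headers = next(reader)  # field names come from the header row
        Row = namedtuple("Row", headers, rename=True)  # rename=True guards against bad header names

        def rows():
            # Generator: rows are parsed one at a time, never loaded all at once.
            for row in reader:
                yield Row(*(value.strip() for value in row))

        yield rows()
    finally:
        f.close()


# Usage (assuming a file like cars.csv with a header row):
# with csv_rows("cars.csv") as rows:
#     for row in rows:
#         print(row)
```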
