Learn how to maintain the original number formatting in your DataFrame while using `pd.read_html` in Pandas, and avoid unwanted conversions!
---
This video is based on the question https://stackoverflow.com/q/68264711/ asked by the user 'Mary' ( https://stackoverflow.com/u/9846358/ ) and on the answer https://stackoverflow.com/a/68264773/ provided by the user 'Nk03' ( https://stackoverflow.com/u/15438033/ ) on the 'Stack Overflow' website. Thanks to these great users and the Stack Exchange community for their contributions.
Visit these links for original content and any more details, such as alternate solutions, latest updates/developments on topic, comments, revision history etc. For example, the original title of the Question was: pd.read_html changed number formatting
Also, Content (except music) licensed under CC BY-SA https://meta.stackexchange.com/help/l...
The original Question post is licensed under the 'CC BY-SA 4.0' ( https://creativecommons.org/licenses/... ) license, and the original Answer post is licensed under the 'CC BY-SA 4.0' ( https://creativecommons.org/licenses/... ) license.
If anything seems off to you, please feel free to write me at vlogize [AT] gmail [DOT] com.
---
Introduction
Are you facing issues with unwanted number formatting changes when using `pd.read_html` in Pandas? You're not alone! Many users run into this, especially when they expect an HTML table's formatted values to come through unchanged but instead get plain integers with the separators stripped out.
In this guide, we'll explore a common scenario where the expected format of a number changes from 1,2,3,4,5,6 to 123456 after reading HTML content into a DataFrame. We will then discuss a solution that preserves the number formatting you desire.
The Problem
When you extract tabular data from HTML using Pandas' `read_html`, cells containing comma-formatted values (like 1,2,3,4,5,6) can be converted into plain integers (e.g., 123456). This happens because `read_html` treats the comma as a thousands separator by default and strips it from anything that otherwise looks like a number. If the commas carried meaning, that information is silently lost.
Example of the Issue
Imagine you have the following HTML structure with a table:
[[See Video to Reveal this Text or Code Snippet]]
When using `pd.read_html`, you might expect the CCCCCCC column to keep its original value, 1,2,3,4,5,6, but instead it appears as 123456.
Execution and Result
The original code might look like this:
[[See Video to Reveal this Text or Code Snippet]]
The output looks like this:
[[See Video to Reveal this Text or Code Snippet]]
The expected result is:
[[See Video to Reveal this Text or Code Snippet]]
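A minimal sketch that reproduces the behaviour described above. The table contents and the column names AAAAAAA and BBBBBBB are assumptions made for illustration (only CCCCCCC is named in this guide), and an HTML parser backend such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table: only the CCCCCCC column name comes from the guide.
html = """
<table>
  <tr><th>AAAAAAA</th><th>BBBBBBB</th><th>CCCCCCC</th></tr>
  <tr><td>alpha</td><td>beta</td><td>1,2,3,4,5,6</td></tr>
</table>
"""

# Default call: thousands="," so commas are stripped from numeric-looking cells.
df_default = pd.read_html(StringIO(html))[0]

print(df_default)
print(df_default["CCCCCCC"].dtype)  # an integer dtype, not a string
```

Wrapping the markup in `StringIO` avoids the deprecation warning that recent Pandas versions raise when a literal HTML string is passed directly.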
The Solution
To resolve this issue, use the `thousands` parameter of the `pd.read_html` function. Setting it to None disables thousands-separator handling, so a value such as 1,2,3,4,5,6 is no longer parsed as a number and is kept as the original string.
Updated Code
Here's the updated version of your code:
[[See Video to Reveal this Text or Code Snippet]]
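As a rough sketch of the fix, assuming a small hypothetical table (the cell values and the column names AAAAAAA and BBBBBBB are invented for illustration) and an installed parser backend such as lxml, the call might look like this:

```python
from io import StringIO

import pandas as pd

# Hypothetical table: only the CCCCCCC column name comes from the guide.
html = """
<table>
  <tr><th>AAAAAAA</th><th>BBBBBBB</th><th>CCCCCCC</th></tr>
  <tr><td>alpha</td><td>beta</td><td>1,2,3,4,5,6</td></tr>
</table>
"""

# thousands=None turns off comma stripping, so the cell stays a string.
df_raw = pd.read_html(StringIO(html), thousands=None)[0]

value = df_raw["CCCCCCC"].iloc[0]
print(df_raw)
print(repr(value))
```

The preserved value is a Python string, so any later arithmetic on that column needs an explicit conversion.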
Output Verification
Upon running the updated code, you should obtain the desired output:
[[See Video to Reveal this Text or Code Snippet]]
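One caveat worth keeping in mind: `thousands=None` applies to every column in every table that is read, so genuine thousands-separated numbers also arrive as strings. A possible way to handle that, sketched here with hypothetical column names (`code`, `amount`) and made-up values, is to convert only the columns that really are numeric after reading:

```python
from io import StringIO

import pandas as pd

# Hypothetical table mixing a comma-formatted code with real amounts.
html = """
<table>
  <tr><th>code</th><th>amount</th></tr>
  <tr><td>1,2,3,4,5,6</td><td>1,234</td></tr>
  <tr><td>7,8,9</td><td>56,789</td></tr>
</table>
"""

# Read everything without separator handling; both columns are strings.
df = pd.read_html(StringIO(html), thousands=None)[0]

# Convert only the column that holds real numbers.
df["amount"] = pd.to_numeric(df["amount"].str.replace(",", "", regex=False))

print(df)
print(df.dtypes)
```

This keeps the decision about which columns are numeric explicit instead of leaving it to the parser.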
Conclusion
In summary, when reading HTML tables with Pandas, watch for values whose separators are silently stripped. Passing thousands=None to the `pd.read_html` function keeps such values exactly as they appear in the table, as strings.
Next time you read an HTML table, remember this simple adjustment, and you'll prevent issues with number formatting!
Now go ahead and enhance your data processing workflow with this valuable tip! Happy analyzing!