Substitute superbooga Beautiful Soup parser (#2996)

* Add lxml to requirements

Add lxml to requirements

* Change Beautiful Soup parser

Switch to the "lxml" parser, which may be more tolerant of certain kinds of parsing errors than "html.parser" and is also faster.
Juliano Henriquez 2023-07-11 18:02:49 -04:00 committed by GitHub
parent ab044a5a44
commit 1fc0b5041e
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 2 additions and 1 deletions

@@ -2,3 +2,4 @@ beautifulsoup4==4.12.2
 chromadb==0.3.18
 posthog==2.4.2
 sentence_transformers==2.2.2
+lxml

@@ -69,7 +69,7 @@ def feed_url_into_collector(urls, chunk_len, chunk_sep, strong_cleanup, threads)
     cumulative += 'Processing the HTML sources...'
     yield cumulative
     for content in contents:
-        soup = BeautifulSoup(content, features="html.parser")
+        soup = BeautifulSoup(content, features="lxml")
         for script in soup(["script", "style"]):
             script.extract()
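
For reviewers who want to try the parser swap outside the extension, here is a minimal sketch and not part of this commit. It parses with "lxml" and falls back to "html.parser" when lxml is not installed, then strips script and style tags the same way the changed loop does. The helper names `make_soup` and `visible_text` are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html: str) -> BeautifulSoup:
    """Hypothetical helper: prefer lxml, fall back to the stdlib parser."""
    try:
        return BeautifulSoup(html, features="lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return BeautifulSoup(html, features="html.parser")


def visible_text(html: str) -> str:
    """Hypothetical helper: drop <script>/<style> and return the remaining text."""
    soup = make_soup(html)
    for tag in soup(["script", "style"]):
        tag.extract()
    # Join text fragments with single spaces, trimming each fragment.
    return soup.get_text(separator=" ", strip=True)


if __name__ == "__main__":
    page = (
        "<html><head><style>p{}</style></head>"
        "<body><p>Hello <b>world</b></p><script>var x=1;</script></body></html>"
    )
    print(visible_text(page))
```

The fallback is a design choice for the sketch only. The commit itself adds lxml to the requirements, so the extension can assume the parser is present.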